
(Translation) Text layout is a loose hierarchy of segmentation


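This translation is licensed and distributed under the same license as the original.

Apart from formatting adjustments and spelling corrections, this translation makes no changes to the original, so that the original content stays complete and the facts and ideas it sets out are not distorted.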

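I love text layout, and have been working with it in one form or another for 35 years. Yet knowledge about it is quite arcane. I don’t believe there is a single place where it is all properly written down, and I have an explanation for that: while basic text layout is very important for UI, games, and other contexts, many of the “professional” needs around text layout are embedded in much more complicated systems such as Microsoft Word or a modern Web browser.

A complete account of text layout would fill at least a small book. Since there is no way I can write that book now, this blog post is a small step towards it; in particular, it is an attempt to describe the “big picture” using the conceptual framework of a “loose hierarchy.” Essentially, a text layout engine breaks the input into finer and finer grains, then reassembles the results into a text layout object suitable for drawing, measurement, and hit testing.

Translator’s note: for the concept of “hit testing,” see Hit-Testing in iOS (original at https://smnh.me/hit-testing-in-ios).

The main hierarchy is concerned with laying out the entire paragraph as a single line of text. Line breaking is also important, but has a separate, parallel hierarchy.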

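The main text layout hierarchy

The hierarchy is: paragraph segmentation as the coarsest granularity, followed by rich text style and BiDi analysis, then itemization (font coverage), then Unicode script, and shaping clusters as the finest.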

diagram of layout hierarchy

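Paragraph segmentation

The coarsest, and also the simplest, segmentation task is paragraph segmentation. Most of the time, paragraphs are simply separated by newline (U+000A) characters, though Unicode in its infinite wisdom specifies a number of code point sequences that function as paragraph separators in plain text: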

  • U+000A LINE FEED
  • U+000B VERTICAL TAB
  • U+000C FORM FEED
  • U+000D CARRIAGE RETURN
  • U+000D U+000A (CR + LF)
  • U+0085 NEXT LINE
  • U+2028 LINE SEPARATOR
  • U+2029 PARAGRAPH SEPARATOR
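
As a rough illustration of the list above, plain-text paragraph segmentation can be sketched with one regular expression. This is a minimal sketch rather than any particular engine's implementation: the name `split_paragraphs` is hypothetical, and the code assumes the Unicode code points for LINE SEPARATOR (U+2028) and PARAGRAPH SEPARATOR (U+2029).

```python
import re

# CR LF is listed first so the pair is consumed as one separator
# rather than as two consecutive paragraph breaks.
_PARAGRAPH_BREAK = re.compile("\r\n|[\n\v\f\r\u0085\u2028\u2029]")


def split_paragraphs(text):
    """Split plain text into paragraphs at any of the separators listed above."""
    return _PARAGRAPH_BREAK.split(text)
```

A trailing separator yields a final empty paragraph, which is usually what an editor wants (the caret can sit on the empty last line).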

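In rich text, paragraphs are usually indicated through markup rather than special characters, for example <p> or <br> in HTML. But in this post, as in most text layout APIs, we’ll treat rich text as plain text + attribute spans.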

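Rich text style

A paragraph of rich text may contain spans that affect formatting. In particular, choice of font, font weight, italic or not, and a number of other attributes can affect text layout. Thus, each paragraph is typically broken into some number of style runs, so that within a run the style is consistent.

Note that some style changes don’t necessarily affect text layout. A classic example is color. Firefox, rather famously, does not define segmentation boundaries for color changes. If a color boundary cuts a ligature, it uses fancy graphics techniques to render the parts of the ligature in different colors. But this is a subtle refinement, and I think it is not required for basic text rendering. For more details, see Text Rendering Hates You.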

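Bidirectional text analysis

Completely separate from the style spans, a paragraph may in general contain both left-to-right and right-to-left text. The need for bidirectional (BiDi) text is certainly one of the things that makes text layout more complicated.

Fortunately, this part of the hierarchy is defined by a standard (UAX #9), and there are a number of good implementations. The interested reader is referred to Unicode Bidirectional Algorithm basics. The key takeaway here is that BiDi analysis is done on the plain text of the entire paragraph, and the result is a sequence of level runs, where the level of each run defines whether it is LTR or RTL.

The level runs and the style runs are then merged, so that in subsequent stages each run has a consistent style and direction. As such, for the purpose of defining the hierarchy, the result of BiDi analysis can be considered an implicit or derived rich text span.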
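
The merge of level runs and style runs can be sketched as an interval intersection. This is a minimal sketch under stated assumptions: both inputs are hypothetical lists of `(start, end, value)` tuples that cover the same text range contiguously, and the BiDi levels are taken as already computed by a UAX #9 implementation (an even level means LTR, an odd level means RTL).

```python
def merge_runs(style_runs, level_runs):
    """Intersect style runs with BiDi level runs.

    Returns (start, end, style, level) tuples such that every output run
    has exactly one style and one level.
    """
    merged = []
    i = j = 0
    while i < len(style_runs) and j < len(level_runs):
        s_start, s_end, style = style_runs[i]
        l_start, l_end, level = level_runs[j]
        start = max(s_start, l_start)
        end = min(s_end, l_end)
        if start < end:
            merged.append((start, end, style, level))
        # Advance whichever run ends first (both, if they end together).
        if s_end <= l_end:
            i += 1
        if l_end <= s_end:
            j += 1
    return merged
```

For example, a bold span starting at offset 5 combined with an RTL run starting at offset 8 produces three runs: regular LTR, bold LTR, and bold RTL.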

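In addition to BiDi, which I consider a basic requirement, a more sophisticated text layout engine will also be able to handle vertical writing modes, including mixed cases where short strings are horizontal within the vertical primary direction. Extremely sophisticated layout engines will also be able to handle ruby text and other ways of annotating the main text flow with intercalated strings. See Requirements for Japanese Text Layout for many examples of sophisticated layout requirements; the scope of this post is really the basic text layout needed for user interfaces.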

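Itemization (font coverage)

Itemization is the trickiest and least well specified part of the hierarchy. There is no standard for it, and no common implementation. Rather, each text layout engine deals with it in its own special way.

Essentially, the result of itemization is to choose a single concrete font for a run from a font collection. Generally, a font collection consists of a main font (selected by font name from the system fonts, or loaded as a custom asset), backed by a fallback stack, which usually consists of system fonts. Thanks to Noto, it is also possible to bundle a fallback font stack with an application, if you don’t mind spending a few hundred megabytes on the assets.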
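
To make the idea concrete, here is a deliberately naive sketch of itemization that walks a font stack and picks the first font whose coverage set contains each code point. Everything here is an assumption for illustration: `font_stack` is a hypothetical list of `(name, coverage_set)` pairs, and real engines consult far more than per-code-point coverage.

```python
def itemize(text, font_stack):
    """Split text into runs, each assigned the first font that covers it.

    font_stack: list of (font_name, set_of_code_points), main font first.
    Code points covered by no font are assigned the font name None.
    """
    runs = []
    for ch in text:
        font = next(
            (name for name, coverage in font_stack if ord(ch) in coverage),
            None,
        )
        if runs and runs[-1][0] == font:
            runs[-1][1] += ch  # extend the current run
        else:
            runs.append([font, ch])  # start a new run
    return [(font, run_text) for font, run_text in runs]
```

Per-code-point lookup like this is easy to make fast, but it cannot see normalization or presentation issues, which is one reason itemization is harder than it first looks.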

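Why is it so tricky? There are a few reasons, which I’ll touch on.

First, it’s not so easy to determine whether a font can render a particular string of text. One reason is Unicode normalization. For example, the string “é” can be encoded as U+00E9 (in NFC) or as U+0065 U+0301 (in NFD). Due to the principle of Unicode equivalence, these should be rendered identically, but a font may cover only one or the other in its Character to Glyph Index Mapping (cmap) table. The shaping engine has all the Unicode logic needed to handle these cases.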
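
The “é” case can be reproduced with Python's standard `unicodedata` module. The sketch below is a heuristic only: `covered_by_cmap` is a hypothetical helper, and the cmap is modeled as a plain set of code points, which is an assumption made for illustration.

```python
import unicodedata


def covered_by_cmap(text, cmap_code_points):
    """Return True if the text, as given or in NFC or NFD form, is fully
    covered by the given set of code points."""
    candidates = (
        text,
        unicodedata.normalize("NFC", text),
        unicodedata.normalize("NFD", text),
    )
    return any(
        all(ord(ch) in cmap_code_points for ch in candidate)
        for candidate in candidates
    )
```

A font whose cmap holds only U+00E9 still counts as covering the decomposed input `"e\u0301"`, and a font with only U+0065 and U+0301 still covers the precomposed form. Checking the normalization forms catches this simple case but not everything a shaping engine handles.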

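Of course, realistic fonts with Latin coverage will have both of these particular sequences covered in the cmap table, but edge cases certainly do happen, both in extended Latin ranges and in other scripts such as Hangul, which has complex normalization rules (thanks in part to a Korean standard for normalization which is somewhat at odds with Unicode). It’s worth noting that DirectWrite gets Hangul normalization quite wrong.

I believe a similar situation exists with Arabic presentation forms; see Developing Arabic fonts for more detail.

Because of these tricky normalization and presentation issues, the most robust way to determine whether a font can render a string is to try it. This is how LibreOffice has worked for a while, and in 2015 Chromium followed suit. See Eliminating Simple Text for more background on the Chromium text layout changes.

Another whole class of complexity is emoji. A lot of emoji can be rendered with either text or emoji presentation, and there are no hard and fast rules for picking one or the other. Generally, the text presentation is in a symbol font, and the emoji presentation is in a separate color font. A particularly tough example is the smiling face, which began its encoded life as 0x01 in Code page 437, the standard 8-bit character encoding of the original IBM PC, and is now U+263A in Unicode. However, the suggested default presentation is text, which won’t do in a world that expects color. Apple on iOS unilaterally chose an emoji presentation, so many text stacks follow Apple’s lead. (Incidentally, the most robust way to encode such an emoji is to append a variation selector to pin down the presentation.)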
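
The variation selector mentioned above can be shown in a few lines. This sketch relies on the two standard selectors, U+FE0E (text presentation) and U+FE0F (emoji presentation); the helper name `with_presentation` is hypothetical.

```python
TEXT_PRESENTATION = "\uFE0E"   # VARIATION SELECTOR-15
EMOJI_PRESENTATION = "\uFE0F"  # VARIATION SELECTOR-16


def with_presentation(base, emoji=True):
    """Pin the presentation of a single base character by appending a
    variation selector, replacing any selector already present."""
    base = base.rstrip(TEXT_PRESENTATION + EMOJI_PRESENTATION)
    return base + (EMOJI_PRESENTATION if emoji else TEXT_PRESENTATION)
```

With the selector present, the choice between the symbol font and the color font no longer depends on each platform's default for U+263A.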

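Another source of complexity when trying to write a cross-platform text layout engine is querying the system fonts. See Font fallback deep dive for more information.

I should note one thing, which might help people doing archaeology on legacy text stacks: it used to be quite common for text layout to resolve “compatibility” forms such as NFKC and NFKD, and this can cause various problems. Today it is more common to solve that particular problem by providing a font stack with massive Unicode coverage, including all the code points in the relevant compatibility ranges.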

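Script

The shaping of text, that is, the transformation of a sequence of code points into a sequence of positioned glyphs, depends on the script. Some scripts, such as Arabic and Devanagari, have extremely elaborate shaping rules, while others, such as Chinese, are a fairly straightforward mapping from code point to glyph. Latin is somewhere in the middle: it starts with a straightforward mapping, but ligatures and kerning are also required for high quality text layout.

Determining script runs is reasonably straightforward: many characters have a Unicode script property which uniquely identifies the script they belong to. However, some characters, such as space, are “common,” so their assigned script simply continues the previous run.

A simple example is “hello мир”. This string is broken into two script runs: “hello ” (including the trailing space) is Latn, and “мир” is Cyrl.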
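
The “hello мир” example can be reproduced with a toy script segmenter. Python's standard library does not expose the Unicode Script property, so this sketch substitutes a crude stand-in: `toy_script` guesses the script from the character's Unicode name and only knows Latin and Cyrillic. Both function names are hypothetical; a real engine would use the actual Script property data.

```python
import unicodedata


def toy_script(ch):
    """Very rough stand-in for the Unicode Script property."""
    if not ch.isalpha():
        return "Common"
    name = unicodedata.name(ch, "")
    if name.startswith("LATIN"):
        return "Latn"
    if name.startswith("CYRILLIC"):
        return "Cyrl"
    return "Unknown"


def script_runs(text):
    """Group characters into script runs; Common characters continue the
    previous run, and leading Common characters join the first real run."""
    runs = []
    for ch in text:
        script = toy_script(ch)
        if runs and (script == "Common" or script == runs[-1][0]):
            runs[-1][1] += ch
        elif runs and runs[-1][0] == "Common":
            runs[-1][0] = script
            runs[-1][1] += ch
        else:
            runs.append([script, ch])
    return [(script, run_text) for script, run_text in runs]
```

Because the space is “common,” it attaches to the preceding Latin run rather than starting a run of its own.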

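Shaping (cluster)

At this point, we have a run of constant style, font, direction, and script. It is ready for shaping. Shaping is a complicated process that converts a string (a sequence of Unicode code points) into positioned glyphs. For the purpose of this blog post, we can generally treat it as a black box. Fortunately, a very high quality open source implementation exists, in the form of HarfBuzz.

We’re not quite done with segmentation, though, as shaping assigns substrings of the input to clusters of glyphs. The correspondence depends a lot on the font. In Latin, the string “fi” is often shaped to a single glyph (a ligature). For complex scripts such as Devanagari, a cluster is most often a syllable in the source text, and complex reordering can happen within the cluster.

Clusters are important for hit testing, that is, for determining the correspondence between a physical cursor position in the text layout and the offset within the text. Generally, they can be ignored if the text will only be rendered, not edited (or selected).

Note that these shaping clusters are distinct from grapheme clusters. The “fi” example has two grapheme clusters but a single shaping cluster, so a grapheme cluster boundary can cut a shaping cluster. Since it’s possible to move the cursor between the “f” and the “i”, one tricky problem is determining the cursor location in that case. Fonts do have a caret table, but implementation is spotty. A more robust solution is to apportion the width of the shaping cluster equally among the grapheme clusters within it. See Let’s Stop Ascribing Meaning to Code Points for a detailed dive into grapheme clusters.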
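
The equal-split fallback for caret placement inside a ligature can be sketched as follows. The function name and parameters are hypothetical; the sketch assumes a left-to-right cluster whose origin and advance width come from the shaper, and a grapheme cluster count computed elsewhere.

```python
def caret_positions(cluster_x, cluster_width, grapheme_count):
    """Return candidate caret x positions for a shaping cluster by
    dividing its width equally among its grapheme clusters.

    The result has grapheme_count + 1 entries: the leading edge, one
    position between each pair of graphemes, and the trailing edge.
    """
    step = cluster_width / grapheme_count
    return [cluster_x + i * step for i in range(grapheme_count + 1)]
```

For an “fi” ligature 12 units wide starting at x = 10, the caret between “f” and “i” lands at x = 16, regardless of where the font actually draws the join.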

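Line breaking

While short strings can be treated as a single line, longer strings need to be broken into lines. Doing this properly is quite a tricky problem. In this post, we treat it as a separate (small) hierarchy, parallel to the main text layout hierarchy above.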

The problem can be factored into identifying line break candidates, then choosing a subset of those candidates as line breaks that satisfy the layout constraints. The main constraint is that lines should fit within the specified maximum width. It’s common to use a greedy algorithm, but high end typography tends to use an algorithm that minimizes a raggedness score for the paragraph. Knuth and Plass have a famous paper, Breaking Paragraphs into Lines, that describes the algorithm used in TeX in detail. But we’ll focus on the problems of determining candidates and measuring the widths, as these are tricky enough.
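
The greedy strategy can be sketched in a few lines. This is a minimal sketch, not the algorithm of any particular engine: it assumes that breaks happen only between words, that each word's width is already measured, and that line width is the plain sum of word and space widths, an approximation that can fail when shaping crosses word boundaries.

```python
def greedy_break(word_widths, space_width, max_width):
    """Assign words to lines greedily.

    Returns a list of lines, each a list of word indices. A single word
    wider than max_width still gets a line of its own (an overfull line).
    """
    lines = []
    current = []
    current_width = 0.0
    for index, width in enumerate(word_widths):
        extra = width if not current else space_width + width
        if current and current_width + extra > max_width:
            lines.append(current)       # close the line that is full
            current = [index]
            current_width = width
        else:
            current.append(index)
            current_width += extra
    if current:
        lines.append(current)
    return lines
```

Greedy breaking never revisits an earlier decision, which is why it can leave a ragged right edge that a paragraph-level optimizer such as Knuth-Plass would smooth out.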


In theory, the Unicode Line Breaking Algorithm (UAX #14) identifies positions in a string that are candidate line breaks. In practice, there are some additional subtleties. For one, some languages (Thai is the most common) don’t use spaces to divide words, so need some kind of natural language processing (based on a dictionary) to identify word boundaries. For two, automatic hyphenation is often desirable, as it fills lines more efficiently and makes the right edge less ragged. Liang’s algorithm is most common for automatically inferring “soft hyphens”, and there are many good implementations of it.

Android’s line breaking implementation (in the Minikin library) applies an additional refinement: since email addresses and URLs are common in strings displayed on mobile devices, and since the UAX #14 rules give poor choices for those, it has an additional parser to detect those cases and apply different rules.

Finally, if words are very long or the maximum width is very narrow, it’s possible for a word to exceed that width. In some cases, the line can be “overfull”, but it’s more common to break the word at the last grapheme cluster boundary that still fits inside the line. In Android, these are known as “desperate breaks”.

So, to recap, after the paragraph segmentation (also known as “hard breaks”), there is a loose hierarchy of 3 line break candidates: word breaks as determined by UAX #14 (with possible “tailoring”), soft hyphens, and finally grapheme cluster boundaries. The first is preferred, but the other two may be used in order to satisfy the layout constraints.
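
One simple way to express this preference order in code is a strict priority scan over the three kinds of candidates. This is only one possible policy, stated here as an assumption: real engines often weigh hyphenation against line fill rather than treating the tiers as strictly ordered. The names `choose_break` and `fits` are hypothetical.

```python
def choose_break(candidates, fits):
    """Pick the end of the current line from tiered break candidates.

    candidates: list of (offset, kind) with kind in
                {"word", "hyphen", "grapheme"}.
    fits:       callable taking an offset and returning True if the line
                ending there fits within the maximum width.

    Returns (offset, kind) for the furthest fitting candidate of the most
    preferred kind, or None if nothing fits.
    """
    for kind in ("word", "hyphen", "grapheme"):
        fitting = [off for off, k in candidates if k == kind and fits(off)]
        if fitting:
            return max(fitting), kind
    return None
```

The grapheme tier corresponds to the “desperate breaks” described above: it is only reached when neither a word break nor a soft hyphen fits.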

This leaves another problem, which is surprisingly tricky to get fully right: how to measure the width of a line between two candidate breaks, in order to validate that it fits within the maximum width (or, in the more general case, to help compute a global raggedness score). For Latin text in a normal font, this seems almost ridiculously easy: just measure the width of each word, and add them up. But in the general case, things are nowhere near so simple.

First, while in Latin, most line break candidates are at space characters, in the fully general case they can cut anywhere in the text layout hierarchy, even in the middle of a cluster. An additional complication is that hyphenation can add a hyphen character.

Even without hyphenation, because shaping is Turing Complete, the width of a line (a substring between two line break candidates) can be any function. Of course, such extreme cases are rare; it’s most common for the widths to be exactly equal to the sum of the widths of the words, and even in the other cases this tends to be a good approximation.

So getting this exactly right in the general case is conceptually not difficult, but is horribly inefficient: for each candidate for the end of the line, perform text layout (mostly shaping) on the substring from the beginning of the line (possibly inserting a hyphen), and measure the width of that layout.
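
The brute-force approach described here can be sketched directly. The `measure` parameter is a hypothetical stand-in for “shape this substring and return its width”; in a real engine that call is the expensive part. The sketch deliberately does not stop at the first candidate that fails to fit, since width need not grow monotonically with length.

```python
def last_fitting_break(text, line_start, candidates, max_width, measure,
                       hyphen_at=frozenset()):
    """Return the furthest candidate offset whose line fits, or None.

    candidates: offsets into text where the line may end.
    measure:    callable(str) -> width, standing in for full shaping.
    hyphen_at:  offsets where breaking inserts a visible hyphen.
    """
    best = None
    for end in sorted(candidates):
        if end <= line_start:
            continue
        line = text[line_start:end]
        if end in hyphen_at:
            line += "-"  # the inserted hyphen takes up width too
        if measure(line) <= max_width:
            best = end
    return best
```

Each candidate triggers a fresh layout of the whole line so far, so the cost is quadratic in line length, which is the inefficiency the paragraph above refers to.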

Very few text layout engines even try to handle this general case, using various heuristics and approximations which work well most of the time, but break down when presented with a font with shaping rules that change widths aggressively. DirectWrite does, however, using very clever techniques that took several years of iteration. The full story is in harfbuzz/harfbuzz#1463 (comment). Further analysis, towards a goal of getting this implemented in an open source text layout engine, is in yeslogic/allsorts#29. If and when either HarfBuzz or Allsorts implements the lower-level logic, I’ll probably want to write another blog post explaining in more detail how a higher level text layout engine can take advantage of it.

A great example of how line breaking can go wrong is Firefox bug 479829, in which an “f + soft hyphen + f” sequence in the text is shaped as the “ff” ligature, then the line is broken at the soft hyphen. Because Firefox reuses the existing shaping rather than reshaping the line, it actually renders with the ligature glyph split across lines:

Example of layout bug in Firefox

Implementations to study

While I still feel a need for a solid, high-level, cross-platform text layout engine, there are good implementations to study. In open source, one of my favorites (though I am biased) is the Android text stack, based on Minikin for its lower levels. It is fairly capable and efficient, and also makes a concerted effort to get “all of Unicode” right, including emoji. It is also reasonably simple and the code is accessible.

While not open source, DirectWrite is also well worth study, as it is without question one of the most capable engines, supporting Word and the previous iteration of Edge before it was abandoned in favor of Chromium. Note that there is a proposal for a cross-platform implementation and also potentially to take it open-source. If that were to happen, it would be something of a game changer.

Chromium and Firefox are a rich source as well, especially as they’ve driven a lot of the improvements in HarfBuzz. However, their text layout stacks are quite complex and do not have a clean, documented API boundary with the rest of the application, so they are not as suitable for study as the others I’ve chosen here.

Android

Paragraph and style segmentation (with BiDi) is done at higher levels, in Layout.java and StaticLayout.java. At that point, runs are handed to Minikin for lower-level processing. Most of the rest of the hierarchy is in Layout.cpp, and ultimately shaping is done by HarfBuzz.

Minikin also contains a sophisticated line breaking implementation, including Knuth-Plass style optimized breaking.

Android deals with shaping boundaries by using heuristics to further segment the text to implied word boundaries (which are also used as the grain for the layout cache). If a font does shaping across these boundaries, the shaping context is simply lost. This is a reasonable compromise, especially on mobile, as results are always consistent, i.e., the width for measurement never mismatches the width for layout. And none of the fonts in the system stack have exotic behavior such as shaping across spaces.

Android does base its itemization on cmap coverage, and builds sophisticated bitmap structures for fast queries. As such, it can get normalization issues wrong, but overall this seems like a reasonable compromise. In particular, normalization issues mostly arise with Latin and the combining diacritical marks, both of which are supplied by Roboto, which in turn has massive Unicode coverage (and thus less need to rely on normalization logic). But with custom fonts, handling may be less than ideal, resulting in more fallback to Roboto than might actually be needed.

Note that Minikin was also the starting point for libTxt, the text layout library used in Flutter.

DirectWrite

Some notes on things I’ve found while studying the API; these observations are quite a bit in the weeds, but might be useful to people wanting to deeply understand or engage the API.

Hit testing in DirectWrite is based on leading/trailing positions, while in Android it’s based on primary and secondary. The latter is more useful for text editing, but leading/trailing is a more well-defined concept (for one, it doesn’t rely on paragraph direction). For more information on this topic, see linebender/piet#323. My take is that proper hit testing requires iterating through the text layout to access lower level structures.
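
The leading/trailing idea can be illustrated with a small sketch. This is not DirectWrite's implementation or API; it assumes a hypothetical list of left-to-right clusters in visual order, each given as `(text_offset, x_start, x_end)`, and reports a hit as trailing when the point falls in the second half of the cluster.

```python
def hit_test_ltr(x, clusters):
    """Map an x coordinate to (text_offset, is_trailing) for LTR clusters.

    clusters: list of (text_offset, x_start, x_end) in visual order.
    Returns None when x lies outside every cluster.
    """
    for offset, x_start, x_end in clusters:
        if x_start <= x < x_end:
            midpoint = (x_start + x_end) / 2
            return offset, x >= midpoint
    return None
```

The caller can turn a trailing hit into a caret offset by moving past the cluster, which is why the result does not depend on paragraph direction.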

While Core Text (see below) exposes a hierarchy of objects, DirectWrite uses the TextLayout as the primary interface, and exposes internal structure (even including lines) by iterating over a callback per run in the confusingly named Draw method. The granularity of this callback is a glyph run, which corresponds to “script” in the hierarchy above. Cluster information is provided in an associated glyph run description structure.

There are other ways to access lower level text layout capabilities, including TextAnalyzer, which computes BiDi and line break opportunities, script runs, and shaping. In fact, the various methods on that interface represent much of the internal structure of the text layout engine. Itemization, however, is done in the FontFallback interface, which was added later.

Core Text

Another high quality implementation is Core Text. I don’t personally find it as well designed as DirectWrite, but it does get the job done. In general, though, Core Text is considered a lower level interface, and applications are recommended to use a higher level mechanism (Cocoa text on macOS, Text Kit on iOS).

When doing text layout on macOS, it’s probably better to use the platform-provided itemization method (CTFontCreateForString), rather than getting the font list and doing itemization in the client. See linebender/skribo#14 for more information on this tradeoff.

Druid/Piet

At this point, the Druid GUI toolkit does not have its own native text layout engine, but rather provides a cross-platform API which delegates to platform text layout engines, DirectWrite and Core Text in particular.

The situation on Linux is currently unsatisfactory, as it’s based on the Cairo toy text API. There is work ongoing to improve this, but no promises when.

While the Piet text API is currently fairly basic, I do think it’s a good starting point for text layout, especially in the Rust community. While the complexity of Web text basically forces browsers to do all their text layout from scratch, for UI text there are serious advantages to using the platform text layout capabilities, including more consistency with native UI, and less code to compile and ship.

Pango

I should at least mention Pango, which provides text layout capabilities for Gtk and other software. It is open source and has a long history, but is more focused on the needs of Linux and in my opinion is less suitable as a cross-platform engine, though there is porting work for both Windows and macOS. As evidence it hasn’t been keeping quite up to date, the Windows integration is all based on GDI+ rather than the more recent Direct2D and DirectWrite, so capabilities are quite limited by modern standards.

The question of level

A consistent theme in the design of text level APIs is: what level? Ideally the text layout engine provides a high level API, meaning that rich text (in some concrete representation) comes in, along with the fonts, and a text layout object comes out. However, this is not always adequate.

In particular, word processors and web browsers have vastly more complex layout requirements than can be expressed in a reasonable “attributed string” representation of rich text. For these applications, it makes sense to break apart the task of text layout, and provide unbundled access to these lower levels. Often, that corresponds to lower levels in the hierarchy I’ve presented. A good choice of boundary is style runs (including BiDi), as it simplifies the question of rich text representation; expressing the style of a single run is simpler than a data structure which can represent all formatting requirements for the rich text.

Until more recently, web browsers tended to use platform text capabilities for the lower levels, but ultimately they needed more control, so for the most part, they do all the layout themselves, deferring to the platform only when absolutely necessary, for example to enumerate the system fonts for fallback.

The desire to accommodate both UI and browser needs motivated the design of the skribo API, and explains why it only handles single style runs. Unfortunately, the lack of a complementary high level driver proved to be quite a mistake, as there was no easy way for applications to use the library. We will be rethinking some of these decisions in coming months.

Other resources

A book in progress on text layout is Fonts and Layout for Global Scripts by Simon Cozens. It puts more emphasis on complex script shaping and fonts, but touches on some of the same concepts as here.

Another useful resource is Modern text rendering with Linux: Overview, which has a Linux focus and explains Pango in more detail. It also links the SIGGRAPH 2018 - Digital typography slide deck, which is quite informative.

Thanks to Chris Morgan for review and examples.
