Unicode Technology Workshop 2025

Demystifying Unicode Text Display: From Unicode Code Points to Positioned Glyphs

 Conference Session Notes - Unicode Technology Workshop 2025
 Presenters: Microsoft & Apple Text Layout Teams

 Overview 

 This session explores the complex journey from Unicode characters in a file to the positioned glyphs you see on screen. Understanding this process is crucial for developers working on internationalization, text analysis, localization testing, and anyone dealing with multi-script content. 

 Why Learn About Text Display? 

 - Software Development: Working on text display software or browsers
 - Writing Systems: Understanding how different scripts are implemented
 - Unicode Encoding: Planning to propose new writing system encodings
 - Localization: Testing scenarios with different languages and scripts
 - Text Analysis: Understanding the relationship between encoded characters and visual output

 Text Display & Fonts 

 Core Concepts & Terminology 

 Key Terms 

 - Characters: Abstract units stored in data files
 - Code Points: Numeric values representing characters in Unicode
 - String: Sequence of characters
 - Glyphs: Actual visual shapes rendered on screen
 - Glyph Run: Sequence of positioned glyphs
 - Font Family: Set of fonts sharing design traits (e.g., Arial)
 - Font Style: Specific variant within a family (e.g., Arial Bold)

 Critical Distinction 

 - Character: Capital letter "A" (abstract concept)
 - Glyph: The specific visual shape of "A" from a particular font
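
 As a small illustration of the character and code point side of this distinction, the sketch below uses Python's standard `unicodedata` module to list the code points in a string. The helper name `describe` is made up for this example. Glyphs do not appear here at all, because they only exist once a font is involved.

```python
import unicodedata

def describe(text: str) -> list[tuple[str, str]]:
    """Return (code point, Unicode character name) for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]

# The same visible letter can be stored as one or two code points.
print(describe("\u00E9"))    # precomposed e with acute
print(describe("e\u0301"))   # letter e followed by a combining acute accent
```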

 Basic Text Layout Process 

 Simple Case: Single Line, Latin Characters 

 The most basic form of text layout involves: 

 - Sequence of glyphs arranged on a baseline
 - Each glyph positioned adjacent to the previous one
 - Left-to-right progression

 Font Data Structure 

 Font files are organized as databases containing: 

 - Name Table: Strings describing font metadata
 - Glyph Table: Actual glyph outline data
 - Metrics: Measurements for the font and individual glyphs
 - CMap Table: Character-to-glyph mapping

 All data is organized into tables with 4-character mnemonic names. 

 Character-to-Glyph Mapping 

 - The CMap table provides the initial character→glyph mapping
 - Glyph IDs are arbitrary numbers assigned by the font designer
 - Not all characters may be supported by a given font
 - This is called the "nominal mapping" or "default glyph mapping"
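
 The nominal mapping can be pictured as a dictionary lookup. The sketch below is a toy model only: `TOY_CMAP` and its glyph IDs are invented for illustration, and a real engine reads the mapping from the font's cmap table. The one grounded detail is that glyph ID 0 is the conventional missing-glyph slot in OpenType fonts.

```python
# Hypothetical character-to-glyph map: code point -> glyph ID (values are made up).
TOY_CMAP = {0x0041: 36, 0x0042: 37, 0x0043: 38}

NOTDEF = 0  # glyph ID 0 is the "missing glyph" in OpenType fonts

def nominal_glyphs(text: str, cmap: dict[int, int] = TOY_CMAP) -> list[int]:
    """Map each character to its default glyph ID, or to NOTDEF if unsupported."""
    return [cmap.get(ord(ch), NOTDEF) for ch in text]

print(nominal_glyphs("AB\u20AC"))  # the euro sign is not covered by the toy font
```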

 Glyph Positioning Basics 

 Each glyph has: 

 - Origin Point: Where x = 0 meets the baseline (y = 0)
 - Left Side Bearing: Distance from the origin to the left edge of the outline
 - Advance Width: Distance to move to reach the next glyph position
 - Outline Data: Control points defining the shape

 Layout Process: 

 1. Align the glyph origin with the current drawing position
 2. Render the glyph
 3. Move the drawing position by the advance width
 4. Repeat for the next character
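
 The loop above can be sketched in a few lines. This is a minimal model under stated assumptions: the `ADVANCES` table and glyph IDs are hypothetical, units are font units, and "rendering" is reduced to recording the position at which each glyph origin is placed.

```python
from dataclasses import dataclass

@dataclass
class PositionedGlyph:
    glyph_id: int
    x: float
    y: float

# Hypothetical advance widths in font units, keyed by glyph ID.
ADVANCES = {36: 600, 37: 650, 38: 700}

def layout_simple(glyph_ids, advances=ADVANCES, start_x=0.0, baseline_y=0.0):
    """Place each glyph origin at the pen position, then move the pen by the advance."""
    pen_x = start_x
    run = []
    for gid in glyph_ids:
        run.append(PositionedGlyph(gid, pen_x, baseline_y))  # steps 1 and 2
        pen_x += advances[gid]                               # step 3
    return run, pen_x

run, end_x = layout_simple([36, 37, 38])
print([g.x for g in run], end_x)
```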

 Advanced Layout Requirements 

 Simple character-by-character layout is insufficient for: 

 1. Kerning 

 Adjusting spacing between specific letter pairs for better visual balance 

 Example: "VA" or "To" - reducing space for optical balance 
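
 A rough sketch of pair kerning follows. The advance and kerning values are invented for illustration. In a real font the pair adjustments come from font data (GPOS in OpenType), not from a hand-written dictionary.

```python
# Hypothetical advance widths and pair adjustments, in font units.
ADVANCE = {"V": 650, "A": 680, "T": 600, "o": 550}
KERN = {("V", "A"): -80, ("T", "o"): -60}

def positions_with_kerning(glyphs):
    """Return the x position of each glyph after applying pair kerning."""
    xs, pen = [], 0
    for i, g in enumerate(glyphs):
        if i > 0:
            # Tighten (or loosen) the gap for this specific pair, if listed.
            pen += KERN.get((glyphs[i - 1], g), 0)
        xs.append(pen)
        pen += ADVANCE[g]
    return xs

print(positions_with_kerning("VA"))  # the A moves left relative to unkerned layout
print(positions_with_kerning("AV"))  # no entry for this pair, so no adjustment
```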

 2. Contextual Positioning 

 Arabic Script Example: 

 - Letters change shape based on position in the word
 - Connecting scripts require precise glyph alignment
 - Marks above letters must adjust to letter height

 3. Combining Marks 

 Diacritical Marks: 

 - Must position accurately relative to base letters
 - Avoid collisions with other marks
 - Handle complex combinations (multiple accents)

 4. Glyph Substitution 

 Contextual Forms: 

 - Same character may need different glyphs based on context
 - Arabic: initial, medial, final, isolated forms
 - Complex scripts require cluster analysis

 5. Ligature Substitution 

 Typographic Ligatures: 

 - Replace character sequences with single composed glyphs
 - Example: "ffi" → single ligature glyph
 - Improves readability and aesthetics
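
 Ligature substitution can be sketched as a longest-match replacement over a glyph sequence. The table below and the glyph names (`f_f_i` and so on) are hypothetical. Real fonts express these rules in GSUB lookups, and the shaping engine decides when they apply.

```python
# Hypothetical ligature table: sequence of glyph names -> ligature glyph name.
LIGATURES = {
    ("f", "f", "i"): "f_f_i",
    ("f", "i"): "f_i",
    ("f", "f"): "f_f",
}

def apply_ligatures(glyphs, ligatures=LIGATURES):
    """Replace the longest matching glyph sequence at each position."""
    max_len = max(len(seq) for seq in ligatures)
    out, i = [], 0
    while i < len(glyphs):
        for n in range(min(max_len, len(glyphs) - i), 1, -1):
            seq = tuple(glyphs[i:i + n])
            if seq in ligatures:
                out.append(ligatures[seq])
                i += n
                break
        else:
            # No ligature starts here; keep the glyph unchanged.
            out.append(glyphs[i])
            i += 1
    return out

print(apply_ligatures(list("office")))
```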

 6. Language-Specific Variants 

 Same character may have different appearances in different languages 

 Example: Cyrillic letters in Russian vs. Bulgarian 

 7. Bidirectional Text (BIDI) 

 Some writing systems require right-to-left text processing 

 - Hebrew and Arabic are written right-to-left
 - Mixed-direction content requires the Unicode Bidirectional Algorithm
 - Glyph reordering within clusters may be needed

 Implications 

 - Advanced line layout is required for high-quality typography and for many scripts
 - Character-to-glyph associations are complex: the mapping is no longer one-to-one
 - Default glyph metrics alone don't determine final positions
 - Additional software logic is required beyond basic font data
 - Font-specific details drive advanced layout behavior

 OpenType Layout System 

 Advanced Layout Engine Requirements 

 - General Advanced Layout Logic
 - Script-Specific Behavior Logic: Based on Unicode character properties
 - Font-Specific Data: Substitution and positioning rules

 OpenType Tables 

 - GDEF (Glyph Definition Table): Classifies glyphs by type (base, mark, ligature, component)
 - GSUB (Glyph Substitution Table): Defines glyph substitution rules; handles contextual forms, ligatures, alternates
 - GPOS (Glyph Positioning Table): Defines positioning adjustments; handles kerning, mark positioning, cursive attachment

 Text Shaping Engines 

 Platform-specific implementations: 

 - CoreText (macOS)
 - DirectWrite (Windows)
 - HarfBuzz (Linux/cross-platform)

 Text Processing Pipeline 

 1. Run Segmentation 

 Script Itemization: 

 - Segment text by Unicode script properties
 - Group characters requiring similar processing

 BiDi Level Analysis: 

 - Apply the Unicode Bidirectional Algorithm
 - Determine text direction runs
 - Handle mixed left-to-right and right-to-left content

 2. Shaping Process 

 For each text run: 

 Canonical Decomposition (UAX #15):

 - Normalize character sequences
 - Handle composed vs. decomposed forms
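
 The composed versus decomposed distinction can be seen directly with Python's standard `unicodedata.normalize`. This only shows the normalization forms themselves. Whether and when a shaping engine normalizes is an implementation detail not covered here.

```python
import unicodedata

composed = "\u00E9"      # e with acute as a single code point
decomposed = "e\u0301"   # e followed by a combining acute accent

# The two strings look the same when displayed but compare unequal as stored.
print(composed == decomposed)

# NFD splits into base + mark; NFC recombines where a precomposed form exists.
to_nfd = unicodedata.normalize("NFD", composed)
to_nfc = unicodedata.normalize("NFC", decomposed)
print([f"U+{ord(c):04X}" for c in to_nfd])
print([f"U+{ord(c):04X}" for c in to_nfc])
```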

 Cluster Analysis (UAX #29):

 - Identify character clusters that must be processed together
 - Critical for complex scripts like Devanagari and Arabic
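
 The sketch below is a deliberately simplified stand-in for cluster analysis: it attaches every mark (General_Category M*) to the preceding character. It is not the UAX #29 algorithm, and it does not keep Devanagari conjuncts (consonant + virama + consonant) together, which real shaping engines must do. The function name `simple_clusters` is made up for this example.

```python
import unicodedata

def simple_clusters(text: str) -> list[str]:
    """Group each base character with the marks that follow it (simplified)."""
    clusters: list[str] = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        if is_mark and clusters:
            clusters[-1] += ch      # attach the mark to the current cluster
        else:
            clusters.append(ch)     # start a new cluster
    return clusters

# Latin base + combining accent, then a Devanagari consonant + vowel sign.
print(simple_clusters("e\u0301a"))
print(simple_clusters("\u0915\u093F"))
```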

 Glyph Substitution: 

 - Apply contextual forms
 - Process ligatures
 - Handle language-specific variants

 3. Positioning 

 - Apply kerning adjustments
 - Position combining marks using anchor points
 - Handle cursive attachment
 - Calculate final glyph positions

 Bidirectional Text Processing 

 Unicode Bidirectional Algorithm 

 Every character has a Bidi_Class property: 

 - Strong LTR: Latin letters (L)
 - Strong RTL: Hebrew letters (R), Arabic letters (AL)
 - Neutral: Punctuation, symbols (no inherent direction)
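
 Python's standard `unicodedata.bidirectional` returns the Bidi_Class of a character, which makes the categories above easy to inspect. The helper name `bidi_classes` is made up for this example.

```python
import unicodedata

def bidi_classes(text: str) -> list[str]:
    """Return the Bidi_Class value of each character."""
    return [unicodedata.bidirectional(ch) for ch in text]

# Latin a, Hebrew alef, Arabic beh, exclamation mark
print(bidi_classes("a\u05D0\u0628!"))
```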

 Processing Steps: 

 1. Assign embedding levels based on character properties
 2. Create level runs of the same directionality
 3. Resolve neutral characters based on their context
 4. Reorder glyphs within and between runs
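
 The reordering step can be sketched on its own, assuming the embedding levels have already been resolved. The function below follows the idea of rule L2 in UAX #9: from the highest level down to the lowest odd level, reverse every contiguous run of items at that level or higher. Level assignment, mirroring, and the other rules are out of scope. In the examples, uppercase Latin letters stand in for right-to-left letters.

```python
def reorder_by_levels(chars, levels):
    """Reorder items from logical to visual order, given resolved embedding levels."""
    seq = list(zip(levels, chars))
    odd = [lvl for lvl in levels if lvl % 2 == 1]
    if not odd:
        return [ch for _, ch in seq]  # purely left-to-right: nothing to reverse
    for level in range(max(levels), min(odd) - 1, -1):
        i = 0
        while i < len(seq):
            if seq[i][0] >= level:
                j = i
                while j < len(seq) and seq[j][0] >= level:
                    j += 1
                seq[i:j] = seq[i:j][::-1]  # reverse this run in place
                i = j
            else:
                i += 1
    return [ch for _, ch in seq]

# LTR paragraph with an RTL word at the end (levels 0 and 1).
print("".join(reorder_by_levels("abc DEF", [0, 0, 0, 0, 1, 1, 1])))
# RTL paragraph containing a number (levels 1 and 2): the digits keep their order.
print("".join(reorder_by_levels("AB 12", [1, 1, 1, 2, 2])))
```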

 Result: Text displays correctly regardless of storage order 

 Font Fallback 

 When Fallback Occurs 

 When primary font lacks required glyphs: 

 - Individual characters missing
 - Entire clusters unsupported
 - Language-specific glyph variants needed
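
 A minimal sketch of coverage-based fallback follows, assuming a hypothetical fallback chain and made-up coverage sets. It checks a whole cluster at a time, so that a base letter and its marks are not split across fonts. Real platforms also weigh language, style, and user preferences.

```python
# Hypothetical coverage data: font name -> set of supported code points.
COVERAGE = {
    "PrimarySans": set(range(0x20, 0x7F)),
    "FallbackArabic": set(range(0x0600, 0x0700)),
    "FallbackCJK": set(range(0x4E00, 0xA000)),
}

def pick_font(cluster, chain, coverage=COVERAGE):
    """Return the first font in the chain that covers every character of the cluster."""
    for name in chain:
        if all(ord(ch) in coverage[name] for ch in cluster):
            return name
    return None  # nothing covers it; a renderer would show a missing-glyph box

chain = ["PrimarySans", "FallbackArabic", "FallbackCJK"]
for cluster in ["A", "\u0628", "\u65E5", "\u0915"]:
    print(f"U+{ord(cluster[0]):04X}", pick_font(cluster, chain))
```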

 Context Considerations 

 User Preferences: 

 - Language settings
 - Input method indicators
 - Markup language tags

 Font Matching Criteria: 

 - Classification: serif, sans-serif, cursive, monospace
   - Some classifications are specified by fonts themselves
   - Some are determined by other means (e.g., 仿宋, FangSong)
 - Attributes: weight, width, italic/oblique
   - Fallback font may not exactly match all attributes
   - Variable fonts can be responsive to some attributes

 Available Font Selection: 

 - Platform-dependent font sets
 - Application-specific font lists
 - Privacy considerations (web fonts)

 Display Emojis 

 Emoji processing requires: 

 - Character property analysis
 - Known sequence recognition
 - Variation selector handling
 - Color font format support
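
 Two of these steps can be illustrated at the code point level. The sketch below looks for the variation selectors U+FE0E (text presentation) and U+FE0F (emoji presentation) and splits a sequence at ZERO WIDTH JOINER (U+200D). The helper names are made up, and recognizing which sequences are actually valid requires the Unicode emoji data files, which this sketch does not consult.

```python
ZWJ = "\u200D"    # ZERO WIDTH JOINER
VS15 = "\uFE0E"   # variation selector-15: request text presentation
VS16 = "\uFE0F"   # variation selector-16: request emoji presentation

def presentation_request(seq: str) -> str:
    """Report which presentation style, if any, the sequence explicitly requests."""
    if VS16 in seq:
        return "emoji"
    if VS15 in seq:
        return "text"
    return "default"

def zwj_components(seq: str) -> list[str]:
    """Split a ZWJ sequence into the pieces a font may or may not merge into one glyph."""
    return seq.split(ZWJ)

heart = "\u2764\uFE0F"                       # heavy black heart + VS16
technologist = "\U0001F469\u200D\U0001F4BB"  # woman + ZWJ + personal computer
print(presentation_request(heart))
print([f"U+{ord(p):04X}" for p in zwj_components(technologist)])
```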

 Color Font Formats 

 - Bitmapped: sbix, CBDT/CBLC tables
 - Vector: COLRv0/v1 tables
 - SVG: SVG table

 Not necessarily one glyph per emoji - complex emoji may use multiple glyphs with positioning 

 Multi-Line Layout 

 Line Breaking 

 Uses accumulated glyph width information to: 

 - Determine the text that fits in the available width
 - Find appropriate break points
 - Handle bidirectional content wrapping
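
 A greedy "first fit" breaker is the simplest way to picture this. The sketch assumes that words are already measured (the `width_of` callback is a placeholder for summed glyph advances) and that breaks are only allowed at spaces. Real line breaking uses UAX #14 break opportunities and shaped widths.

```python
def break_lines(words, width_of, max_width, space_width):
    """Greedily pack words into lines that fit within max_width."""
    lines, current, current_w = [], [], 0
    for word in words:
        w = width_of(word)
        extra = w if not current else space_width + w
        if current and current_w + extra > max_width:
            lines.append(current)           # close the full line
            current, current_w = [word], w  # start a new line with this word
        else:
            current.append(word)
            current_w += extra
    if current:
        lines.append(current)
    return lines

# Toy metric: every character is 10 units wide.
toy_width = lambda word: 10 * len(word)
print(break_lines("the quick brown fox".split(), toy_width, 110, 10))
```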

 Vertical Spacing 

 Font Metrics: 

 - Ascent: Distance above the baseline
 - Descent: Distance below the baseline
 - Line Gap: Additional spacing between lines

 Applications may apply additional line spacing adjustments. 
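
 As a worked example, a default line height can be derived from these three metrics by scaling font units to the requested size. The numbers below are invented, and descent is treated as a positive distance, as in the definition above. Font files often store it as a negative value, so real code has to check the sign convention.

```python
def default_line_height(ascent, descent, line_gap, font_size, units_per_em):
    """Scale (ascent + descent + line gap) from font units to the requested font size."""
    return (ascent + descent + line_gap) * font_size / units_per_em

# Hypothetical font: 2048 units per em, rendered at 16 px.
print(default_line_height(ascent=1900, descent=500, line_gap=0,
                          font_size=16, units_per_em=2048))
```

 A script whose glyphs need a larger ascent or descent than the primary font provides will produce a taller line than a control sized for Latin text expects, which is one source of the vertical truncation problems that arise in practice.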

 Common Display Problems 

 1. Invalid Clusters 

 Causes: 

 - Incorrect character sequences for the script
 - Components in the wrong order
 - Unicode normalization issues

 Symptoms: 

 - Dotted circles indicating invalid combinations
 - Missing or misplaced diacritical marks

 2. Copy/Paste from PDF Issues 

 Problem: PDFs store glyph positions, not original text 

 - Advanced layout information is lost
 - Character-to-glyph mapping may be irreversible
 - Copy/paste produces garbled text

 Solution: Ensure PDFs embed proper text extraction data 

 3. Font Style Mismatches 

 Causes: 

 - Fallback font doesn't match the original style
 - Limited font selection available
 - Font classification mismatches

 Note: Fallback prioritizes legibility over style matching 

 4. Text Truncated Vertically 

 Causes: 

 - Text controls sized for specific scripts
 - Different writing systems have different vertical requirements
 - Font metrics not properly accounted for

 5. Encoding Errors 

 Not Text Layout Issues - Upstream Problems: 

 Transcoding Failures: 

 - UTF-8 interpreted as a legacy encoding
 - Double-encoding artifacts
 - Replacement characters (�) indicating conversion failure
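
 These failures are easy to reproduce with Python's built-in codecs. The sketch uses Latin-1 as the stand-in legacy encoding, which is an assumption. In practice the wrong encoding is often a Windows code page.

```python
original = "\u00E9"  # e with acute

# UTF-8 bytes misread as Latin-1 produce classic two-character mojibake.
mojibake = original.encode("utf-8").decode("latin-1")

# If the damage is exactly one wrong decode, it can sometimes be reversed.
repaired = mojibake.encode("latin-1").decode("utf-8")

# Latin-1 bytes misread as UTF-8 are invalid; a lenient decoder emits U+FFFD.
replaced = b"caf\xe9".decode("utf-8", errors="replace")

print([f"U+{ord(c):04X}" for c in mojibake])
print(repaired == original)
print([f"U+{ord(c):04X}" for c in replaced])
```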

 Legacy Software: 

 - Applications that are not Unicode-capable
 - Question marks for unsupported characters
 - Incomplete UTF-16 surrogate handling
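
 Both symptoms can be shown with standard Python. ASCII with `errors="replace"` stands in for a non-Unicode legacy path (an assumption made for illustration), and the UTF-16 helper shows why a character outside the Basic Multilingual Plane occupies two code units that must never be split.

```python
def to_legacy_ascii(text: str) -> str:
    """Simulate a legacy path that can only store ASCII."""
    return text.encode("ascii", errors="replace").decode("ascii")

def utf16_units(text: str) -> list[int]:
    """Return the UTF-16 code units used to store the text."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]

print(to_legacy_ascii("\u00E9\u20AC"))                   # unsupported characters become ?
print([hex(u) for u in utf16_units("\U0001F600")])       # one character, two code units
```

 Software that truncates, reverses, or indexes text by UTF-16 code unit can separate the two halves of such a pair, which then cannot be displayed.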

 6. Incorrect Parsing of UTF-8 or UTF-16 Sequences 

 Software incorrectly assumes encoding format, leading to garbage text display 

 Implementation Implications 

 Performance Considerations 

 - Text display is an extremely common operation
 - Software is optimized for efficient processing
 - Incremental updates for document editing
 - Constraint analysis to minimize re-layout

 Complexity Management 

 - Simple Scripts: May use optimized basic layout paths
 - Complex Scripts: Require the full advanced layout pipeline
 - Modern Approach: Apply advanced layout universally for consistent typography

 Development Guidelines 

 - Don't rely on font fallback for proper localization
 - Test with target languages early in development
 - Understand platform differences in text processing
 - Plan for complex script requirements from the beginning

 Debugging Text Display Issues 

 Diagnostic Approach 

 1. Identify the problem type:
    - Layout/positioning issue
    - Font fallback problem
    - Encoding/conversion error
    - Platform/software limitation
 2. Gather information:
    - What font was actually used?
    - What text processing occurred?
    - What are the original character codes?
    - What platform/software environment?
 3. Consult experts:
    - Text layout engineers
    - Language/script experts
    - Platform documentation

 Tools and Resources 

 - Unicode Character Database
 - Script-specific documentation
 - Platform text layout APIs
 - Font inspection tools
 - Text encoding validators

 Key Takeaways 

 - Text display is complex: What you see is the result of sophisticated processing
 - Character ≠ Glyph: One-to-many relationships are common
 - Context matters: Same characters may render differently based on surrounding text
 - Scripts vary widely: Solutions must accommodate diverse writing systems
 - Font data drives behavior: Advanced layout depends on font-provided rules
 - Testing is crucial: Problems often surface only with real-world multilingual content

 Further Reading 

 - Unicode Standard: unicode.org
 - UAX #15: Unicode Normalization Forms
 - UAX #29: Unicode Text Segmentation
 - UAX #9: Unicode Bidirectional Algorithm
 - OpenType Specification: Microsoft Typography documentation
 - Platform APIs: CoreText (Apple), DirectWrite (Microsoft), HarfBuzz documentation

 Session recorded at Unicode Technology Workshop 2025. Notes compiled from presentation materials and transcript.

Getting started with ICU4X
Needs 

 - Low latency requirements
 - Data-heavy algorithms
 - Privacy implications
 - Rich UX
 - Resilience to network degradation

 ICU4X

Grammatical Agreement with Unicode Inflection
AI 

 high computational cost and network latency 

 top-tier languages are covered, but torso and tail languages lack data

 bias based language 

 

 Where AI can help 

 - Offline processing where latency or resources are not critical, such as grammar fixing and lexicon generation/expansion
 - Smaller, older, less costly models, such as LSTMs, could be used for higher coverage or better accuracy
 - Coverage for languages where we don't have experts to write rules, or where grammar complexity is high
 - Client-side support is slowly improving with nano models

 Concept of Lemmaless inflection

Segmenting Complex Scripts with Machine Learning
Line and word breaks 

 Word breaks 

 Dictionary based segmentation 

 Where does it fall short?

 - Dictionary size is too large
 - New or specialized words are not easily recognized (xx-ing)
 - Longest match can fail by missing the correct shorter words
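
 The longest-match failure is easy to reproduce. The sketch below runs a greedy longest-match segmenter over unspaced English text as a stand-in for a script without spaces. The tiny dictionary is invented for this example and is not taken from ICU.

```python
def longest_match(text, dictionary):
    """Greedy left-to-right segmentation: always take the longest dictionary word."""
    max_len = max(len(w) for w in dictionary)
    out, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + n] in dictionary:
                out.append(text[i:i + n])
                i += n
                break
        else:
            out.append(text[i])  # unknown character: emit it alone
            i += 1
    return out

WORDS = {"the", "they", "you", "youth", "out", "he", "event", "vent"}
# Intended reading: "the youth event"
print(longest_match("theyouthevent", WORDS))
```

 Every piece of the output is a valid dictionary word, but taking "they" first makes the correct shorter word "the" unreachable, and the rest of the segmentation goes wrong with it.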

 Two broad cases need different solutions:

 - Southeast Asian (SEA)
 - East Asian (CJK)

 CJK: 

 AdaBoost: many tiny rules each vote on whether a break is good; the combined votes decide word boundaries.
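
 The voting idea can be sketched as a weighted sum of weak rules, where each rule votes +1 (break) or -1 (no break) and a positive total means a break. Everything specific below is an assumption made for illustration: the three rules, their hand-picked weights, and the Japanese sample are not the features or trained weights of any real model.

```python
def is_kanji(ch: str) -> bool:
    return 0x4E00 <= ord(ch) <= 0x9FFF

# Hypothetical particle list: ha, wo, ga, ni (hiragana).
PARTICLES = {"\u306F", "\u3092", "\u304C", "\u306B"}

# (rule, weight) pairs; each rule looks at the boundary before text[i].
RULES = [
    (lambda t, i: t[i - 1] in PARTICLES, 1.2),                        # after a particle
    (lambda t, i: not (is_kanji(t[i - 1]) and is_kanji(t[i])), 0.8),  # not inside a kanji pair
    (lambda t, i: is_kanji(t[i - 1]) != is_kanji(t[i]), 0.5),         # script changes here
]

def vote_break(text, i, rules=RULES):
    """Weighted vote: each rule contributes +weight for break, -weight against."""
    score = sum(w * (1 if rule(text, i) else -1) for rule, w in rules)
    return score > 0

def segment(text, rules=RULES):
    if not text:
        return []
    words = [text[0]]
    for i in range(1, len(text)):
        if vote_break(text, i, rules):
            words.append(text[i])
        else:
            words[-1] += text[i]
    return words

# "I am a student" in Japanese, written without spaces.
print(segment("\u79C1\u306F\u5B66\u751F\u3067\u3059"))
```

 In a trained model the weights come from the boosting procedure rather than from a person, and there are many more rules, but the decision at each boundary has this shape.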

 RAdaBoost: Radicals are the components of Han characters. Certain radicals frequently appear together, which provides useful cues for word segmentation.

 BudouX/RAdaBoost

 AdaBoost learners 

 - ICU dictionary: 2.0 MB
 - BudouX zh-hant: 64 KB; zh-hans: 63 KB
 - Radical (all zh variants): 60 KB

Links with Non-ASCII: Unicode Detection and Display
 Links containing non-ASCII characters are basically unreadable when shown as percent-encoded codes.

 Draft UTS #58: Link Detection and Formatting
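
 The readability problem is visible with Python's standard `urllib.parse`. This only demonstrates percent-encoding and decoding of a path. It does not implement the link detection or formatting rules of the draft, and the sample path is made up.

```python
from urllib.parse import quote, unquote

path = "/wiki/\u65E5\u672C\u8A9E"   # three Han characters after /wiki/

encoded = quote(path)     # UTF-8 bytes as %XX escapes; "/" is kept by default
decoded = unquote(encoded)

print(encoded)
print(decoded == path)
```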

Automated I18n Quality for Enterprise Platforms
Globalization Readiness 

 - Linguistic quality
 - Extensibility
 - Maintainability
 - Time to market
 - Portability (standards-based)

 Reactive vs Proactive 

 Reactive: fix bugs, correct translations, troubleshoot; but customers will find issues before you do.

 Proactive: prevent bugs, establish best practices that are global-ready.

 Using AI out of the box 

 goose: Agentic vibe coding, but it does not use ICU and does not produce data that is ready for i18n.

 LLM -> returns the statistically most common answer, which is often wrong for i18n.

 - Not using standard region codes
 - Assumes only one language per region
 - Assumes only two forms for plurals
 - Sloppy "plural(s)" construct in some languages
 - No gender handling
 - Embeds formatting and layout with content
 - Content for all locales in a single file (not shown)
 - Poor phone number structure, stored as raw text
 - No attempt to find or use libraries for phone or address handling, or to use ICU or CLDR
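
 The two-form plural assumption is worth a concrete illustration. The sketch below hand-codes the integer cases of the CLDR plural categories for English and Russian. It ignores fractions and is not a substitute for ICU or CLDR data; the message table is a made-up example.

```python
def plural_category_en(n: int) -> str:
    """English, integers only: 'one' or 'other'."""
    return "one" if n == 1 else "other"

def plural_category_ru(n: int) -> str:
    """Russian, integers only: 'one', 'few', or 'many'."""
    if n % 10 == 1 and n % 100 != 11:
        return "one"
    if 2 <= n % 10 <= 4 and not 12 <= n % 100 <= 14:
        return "few"
    return "many"

# Hypothetical message table for "N files" in Russian.
FILES_RU = {"one": "{n} файл", "few": "{n} файла", "many": "{n} файлов"}

for n in (1, 2, 5, 11, 21, 22):
    print(FILES_RU[plural_category_ru(n)].format(n=n))
```

 Code generated with an "if n == 1 ... else ..." pattern has no slot for the third Russian form, which is why the selection logic belongs in a library backed by CLDR rather than in application code.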

 Detect Issues in source content 

 - Before entering the translation pipeline
 - Within Atlas, a platform for managing localization workflows
 - Rule-based linting
 - Using a third-party library: ilib-lint

 GitHub -> CI (automated builds) -> AWS -> Management platform -> GitHub/CI/Translation vendor

 Detect issues in source code 

 - Independent of translatable content
 - Much larger dataset
 - Build a custom scanner
 - Static analysis + AI
 - Many programming languages
 - Custom integrations

 i18n using AI + Self-Healing 

 Study: source-code i18n self-healing using AI

 - Scan, train, refine
 - Known and discovered

 Going forward with AI 

 - i18n anti-pattern development
 - Scanning tool development
 - Fine-tuning results
 - AI training
 - Self-healing training
 - CI/CD integration

End-to-end i18n system by TikTok
 - Part of the TikTok Design System
 - SDK supporting TikTok locales and 200+ CLDR locales
 - Real business needs embedded