Unicode Technology Workshop 2025
- Demystifying Unicode Text Display: From Unicode Code Points to Positioned Glyphs
- Getting started with ICU4X
- Grammatical Agreement with Unicode Inflection
- Segmenting Complex Scripts with Machine Learning
- Links with Non-ASCII: Unicode Detection and Display
- Automated I18n Quality for Enterprise Platforms
- End-to-end i18n system by TikTok
Demystifying Unicode Text Display: From Unicode Code Points to Positioned Glyphs
Conference Session Notes - Unicode Technology Workshop 2025
Presenters: Microsoft & Apple Text Layout Teams
Overview
This session explores the complex journey from Unicode characters in a file to the positioned glyphs you see on screen. Understanding this process is crucial for developers working on internationalization, text analysis, localization testing, and anyone dealing with multi-script content.
Why Learn About Text Display?
- Software Development: Working on text display software or browsers
- Writing Systems: Understanding how different scripts are implemented
- Unicode Encoding: Planning to propose new writing system encodings
- Localization: Testing scenarios with different languages and scripts
- Text Analysis: Understanding the relationship between encoded characters and visual output
Text Display & Fonts
Core Concepts & Terminology
Key Terms
- Characters - Abstract units stored in data files
- Code Points - Numeric values representing characters in Unicode
- String - Sequence of characters
- Glyphs - Actual visual shapes rendered on screen
- Glyph Run - Sequence of positioned glyphs
- Font Family - Set of fonts sharing design traits (e.g., Arial)
- Font Style - Specific variant within family (e.g., Arial Bold)
Critical Distinction
- Character: Capital letter "A" (abstract concept)
- Glyph: The specific visual shape of "A" from a particular font
Basic Text Layout Process
Simple Case: Single Line, Latin Characters
The most basic form of text layout involves:
- Sequence of glyphs arranged on a baseline
- Each glyph positioned adjacent to the previous one
- Left-to-right progression
Font Data Structure
Font files are organized as databases containing:
- Name Table: Strings describing font metadata
- Glyph Table: Actual glyph outline data
- Metrics: Measurements for font and individual glyphs
- CMap Table: Character-to-glyph mapping
All data is organized into tables with 4-character mnemonic names.
Character-to-Glyph Mapping
- CMap Table provides initial character→glyph mapping
- Glyph IDs are arbitrary numbers assigned by font designer
- Not all characters may be supported by a given font
- This is called the "nominal mapping" or "default glyph mapping"
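The nominal mapping can be pictured as a plain lookup table. The sketch below is only an illustration: `TOY_CMAP` and its glyph IDs are made up and do not come from any real font. The one real convention it relies on is that glyph ID 0 is the "missing glyph" (.notdef) in OpenType fonts.

```python
# Toy cmap: code point -> glyph ID. The IDs are invented for illustration.
TOY_CMAP = {ord("A"): 36, ord("B"): 37, ord("a"): 68}
NOTDEF = 0  # glyph ID 0 is the "missing glyph" (.notdef)

def nominal_glyphs(text):
    """Map each code point to its default glyph ID, or .notdef if unsupported."""
    return [TOY_CMAP.get(ord(ch), NOTDEF) for ch in text]

# The Cyrillic letter is not in the toy cmap, so it maps to .notdef.
print(nominal_glyphs("ABa\u042F"))  # [36, 37, 68, 0]
```

A glyph ID of 0 in the result is the usual trigger for font fallback.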
Glyph Positioning Basics
Each glyph has:
- Origin Point: Where X=0, Y=baseline intersection
- Left Side Bearing: Distance from origin to left edge
- Advance Width: Distance to move for next glyph position
- Outline Data: Control points defining the shape
Layout Process:
- Align glyph origin with current drawing position
- Render the glyph
- Move drawing position by advance width
- Repeat for next character
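The four steps above can be sketched as a simple pen-advance loop. The advance widths in `TOY_ADVANCES` are hypothetical values in font units; a real implementation would read them from the font's metrics and would draw the glyph where this sketch only records its position.

```python
# Hypothetical advance widths in font units (not from a real font).
TOY_ADVANCES = {"T": 600, "o": 550, "p": 560}

def layout_line(text, start_x=0):
    """Return ([(glyph, x), ...], final_pen_x) for a simple left-to-right line."""
    pen_x = start_x
    placed = []
    for ch in text:
        placed.append((ch, pen_x))   # align the glyph origin with the pen
        pen_x += TOY_ADVANCES[ch]    # move the pen by the advance width
    return placed, pen_x

print(layout_line("Top"))  # ([('T', 0), ('o', 600), ('p', 1150)], 1710)
```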
Advanced Layout Requirements
Simple character-by-character layout is insufficient for:
1. Kerning
Adjusting spacing between specific letter pairs for better visual balance
- Example: "VA" or "To" - reducing space for optical balance
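A minimal sketch of pair kerning, assuming a lookup table keyed by adjacent glyph pairs. Both `TOY_ADVANCES` and `TOY_KERNING` hold invented values; in a real font the pair adjustments would typically come from the GPOS table.

```python
# Hypothetical metrics in font units (not from a real font).
TOY_ADVANCES = {"V": 650, "A": 680, "T": 600, "o": 550}
TOY_KERNING = {("V", "A"): -80, ("T", "o"): -60}  # negative = tighten the pair

def layout_with_kerning(text):
    """Return the x position of each glyph after applying pair kerning."""
    pen_x = 0
    positions = []
    prev = None
    for ch in text:
        pen_x += TOY_KERNING.get((prev, ch), 0)  # adjust for the (prev, ch) pair
        positions.append(pen_x)
        pen_x += TOY_ADVANCES[ch]
        prev = ch
    return positions

print(layout_with_kerning("VA"))  # [0, 570] instead of [0, 650]
```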
2. Contextual Positioning
Arabic Script Example:
- Letters change shape based on position in word
- Connecting scripts require precise glyph alignment
- Marks above letters must adjust to letter height
3. Combining Marks
Diacritical Marks:
- Must position accurately relative to base letters
- Avoid collisions with other marks
- Handle complex combinations (multiple accents)
4. Glyph Substitution
Contextual Forms:
- Same character may need different glyphs based on context
- Arabic: initial, medial, final, isolated forms
- Complex scripts require cluster analysis
5. Ligature Substitution
Typographic Ligatures:
- Replace character sequences with single composed glyphs
- Example: "ffi" → single ligature glyph
- Improves readability and aesthetics
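Ligature substitution can be sketched as a rewrite over the glyph sequence. The rules in `TOY_LIGATURES` and the glyph names are hypothetical; a real shaping engine would take such rules from the font's GSUB table instead of a hard-coded dictionary.

```python
# Hypothetical ligature rules: glyph-name sequence -> single ligature glyph.
TOY_LIGATURES = {("f", "f", "i"): "f_f_i", ("f", "i"): "f_i", ("f", "f"): "f_f"}

def apply_ligatures(glyphs):
    """Greedy left-to-right substitution, trying the longest rule first."""
    rules = sorted(TOY_LIGATURES.items(), key=lambda kv: -len(kv[0]))
    out, i = [], 0
    while i < len(glyphs):
        for seq, lig in rules:
            if tuple(glyphs[i:i + len(seq)]) == seq:
                out.append(lig)       # several input glyphs become one
                i += len(seq)
                break
        else:
            out.append(glyphs[i])     # no rule matched at this position
            i += 1
    return out

print(apply_ligatures(list("office")))  # ['o', 'f_f_i', 'c', 'e']
```

The output has fewer glyphs than the input has characters, which is one reason the character-to-glyph relationship is not one-to-one.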
6. Language-Specific Variants
Same character may have different appearances in different languages
- Example: Cyrillic letters in Russian vs. Bulgarian
7. Bidirectional Text (BIDI)
Some writing systems require right-to-left text processing
- Hebrew and Arabic written right-to-left
- Mixed direction content requires Unicode Bidirectional Algorithm
- Glyph reordering within clusters may be needed
Implications
- Advanced line layout is required for high quality typography & many scripts
- Complex character-to-glyph associations - no longer one-to-one mapping
- Default glyph metrics alone don't determine final positions
- Additional software logic is required beyond basic font data
- Font-specific details drive advanced layout behavior
OpenType Layout System
Advanced Layout Engine Requirements
- General Advanced Layout Logic
- Script-Specific Behavior Logic: Based on Unicode character properties
- Font-Specific Data: Substitution and positioning rules
OpenType Tables
- GDEF (Glyph Definition Table): Classifies glyphs by type (base, mark, ligature, component)
- GSUB (Glyph Substitution Table): Defines glyph substitution rules, handles contextual forms, ligatures, alternates
- GPOS (Glyph Positioning Table): Defines positioning adjustments, handles kerning, mark positioning, cursive attachment
Text Shaping Engines
Platform-specific implementations:
- CoreText (macOS)
- DirectWrite (Windows)
- HarfBuzz (Linux/Cross-platform)
Text Processing Pipeline
1. Run Segmentation
Script Itemization:
- Segment text by Unicode script properties
- Group characters requiring similar processing
BiDi Level Analysis:
- Apply Unicode Bidirectional Algorithm
- Determine text direction runs
- Handle mixed left-to-right and right-to-left content
2. Shaping Process
For each text run:
Canonical Decomposition (UAX #15):
- Normalize character sequences
- Handle composed vs. decomposed forms
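The composed/decomposed distinction can be seen directly with Python's standard `unicodedata` module. This is only an illustration of UAX #15 normalization forms, not a claim about how any particular shaping engine applies them.

```python
import unicodedata

composed = "\u00E9"      # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# The two strings look identical on screen but compare unequal.
print(composed == decomposed)  # False

# Normalizing to a common form makes them comparable.
nfd = unicodedata.normalize("NFD", composed)    # canonical decomposition
nfc = unicodedata.normalize("NFC", decomposed)  # canonical composition
print(nfd == decomposed, nfc == composed)       # True True
```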
Cluster Analysis (UAX #29):
- Identify character clusters that must be processed together
- Critical for complex scripts like Devanagari, Arabic
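As a rough illustration of clustering, the sketch below attaches every mark character (general category M*) to the preceding base. This is a deliberate simplification and not the UAX #29 algorithm: it ignores Hangul syllables, emoji ZWJ sequences, conjuncts, and other rules. The function name `simple_clusters` is made up for this example.

```python
import unicodedata

def simple_clusters(text):
    """Group each base character with the marks that follow it (simplified)."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch      # mark: attach to the current cluster
        else:
            clusters.append(ch)     # anything else starts a new cluster
    return clusters

# Devanagari KA + VOWEL SIGN I: two code points, one cluster.
print(len(simple_clusters("\u0915\u093F")))  # 1
```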
Glyph Substitution:
- Apply contextual forms
- Process ligatures
- Handle language-specific variants
3. Positioning
- Apply kerning adjustments
- Position combining marks using anchor points
- Handle cursive attachment
- Calculate final glyph positions
Bidirectional Text Processing
Unicode Bidirectional Algorithm
Every character has a Bidi_Class property:
- Strong LTR: Latin letters (L)
- Strong RTL: Arabic, Hebrew letters (R, AL)
- Neutral: Punctuation, symbols (neutrally directional)
Processing Steps:
- Assign embedding levels based on character properties
- Create level runs of same directionality
- Reorder glyphs within and between runs
- Handle neutral characters based on context
Result: Text displays correctly regardless of storage order
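The Bidi_Class property can be inspected with Python's standard `unicodedata.bidirectional`. The helper `first_strong_direction` is a hypothetical name and a heavy simplification of how the algorithm picks a paragraph direction (it ignores isolates and explicit overrides); it is not a Bidirectional Algorithm implementation.

```python
import unicodedata

def bidi_classes(text):
    """Return the Bidi_Class of each character in the string."""
    return [unicodedata.bidirectional(ch) for ch in text]

def first_strong_direction(text):
    """Guess the base direction from the first strong character (simplified)."""
    for ch in text:
        bc = unicodedata.bidirectional(ch)
        if bc == "L":
            return "ltr"
        if bc in ("R", "AL"):
            return "rtl"
    return "ltr"  # default when no strong character is present

# Latin a, Hebrew alef, Arabic beh, exclamation mark
print(bidi_classes("a\u05D0\u0628!"))  # ['L', 'R', 'AL', 'ON']
```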
Font Fallback
When Fallback Occurs
When primary font lacks required glyphs:
- Individual characters missing
- Entire clusters unsupported
- Language-specific glyph variants needed
Context Considerations
User Preferences:
- Language settings
- Input method indicators
- Markup language tags
Font Matching Criteria:
- Classification: serif, sans-serif, cursive, monospace
- Some classifications are specified by fonts themselves
- Some are determined by other means (e.g., FangSong, 仿宋)
- Attributes: weight, width, italic/oblique
- Fallback font may not exactly match all attributes
- Variable fonts can be responsive to some attributes
Available Font Selection:
- Platform-dependent font sets
- Application-specific font lists
- Privacy considerations (web fonts)
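A minimal sketch of per-character fallback, assuming an ordered font list where the first font covering a character wins. The font names and coverage sets in `TOY_FONTS` are invented; real fallback also weighs language, style attributes, and whole clusters, which this sketch ignores.

```python
# Hypothetical fonts with made-up coverage sets, in priority order.
TOY_FONTS = [
    ("PrimarySans", set(range(0x20, 0x7F))),         # Basic Latin only
    ("FallbackHebrew", set(range(0x05D0, 0x05EB))),  # Hebrew letters
]

def pick_fonts(text):
    """Split text into runs, each tagged with the first font that covers it."""
    runs = []
    for ch in text:
        font = next((name for name, cov in TOY_FONTS if ord(ch) in cov), None)
        if runs and runs[-1][0] == font:
            runs[-1] = (font, runs[-1][1] + ch)  # extend the current run
        else:
            runs.append((font, ch))              # font changed: start a new run
    return runs

for font, run in pick_fonts("ab\u05D0\u05D1!"):
    print(font, [f"U+{ord(c):04X}" for c in run])
```

A font of `None` in the result means no font in the list covers the character, which a real system would typically show as a missing-glyph box.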
Display Emojis
Emoji processing requires:
- Character property analysis
- Known sequence recognition
- Variation selector handling
- Color font format support
Color Font Formats
- Bitmapped: sbix, CBDT/CBLC tables
- Vector: COLRv0/v1 tables
- SVG: SVG table
Not necessarily one glyph per emoji - complex emoji may use multiple glyphs with positioning
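The following sketch shows why emoji need sequence recognition: what a user perceives as one emoji can be several code points. It only lists the code points; how a font maps them to glyphs is font-specific and not shown.

```python
def code_points(s):
    """Format each code point of a string as U+XXXX."""
    return [f"U+{ord(c):04X}" for c in s]

# WOMAN + ZERO WIDTH JOINER + ROCKET = "woman astronaut" ZWJ sequence
astronaut = "\U0001F469\u200D\U0001F680"
# HEAVY BLACK HEART + VARIATION SELECTOR-16 requests emoji presentation
red_heart = "\u2764\uFE0F"

print(code_points(astronaut))  # ['U+1F469', 'U+200D', 'U+1F680']
print(code_points(red_heart))  # ['U+2764', 'U+FE0F']
```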
Multi-Line Layout
Line Breaking
Uses accumulated glyph width information to:
- Determine text that fits in available width
- Find appropriate break points
- Handle bidirectional content wrapping
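A minimal greedy line breaker, assuming the text has already been split at allowed break points and each piece has a measured width. The `widths` values are hypothetical. Real engines find break opportunities with the Unicode Line Breaking Algorithm (UAX #14) and measure shaped glyph runs, neither of which is shown here.

```python
def break_lines(words, widths, space_width, max_width):
    """Greedily pack words into lines no wider than max_width."""
    lines, current, current_w = [], [], 0
    for word in words:
        w = widths[word]
        needed = w if not current else current_w + space_width + w
        if current and needed > max_width:
            lines.append(current)            # word does not fit: close the line
            current, current_w = [word], w
        else:
            current.append(word)
            current_w = needed
    if current:
        lines.append(current)
    return lines

# Hypothetical measured widths, in pixels.
widths = {"aa": 20, "bb": 20, "cc": 20}
print(break_lines(["aa", "bb", "cc"], widths, space_width=5, max_width=50))
# [['aa', 'bb'], ['cc']]
```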
Vertical Spacing
Font Metrics:
- Ascent: Distance above baseline
- Descent: Distance below baseline
- Line Gap: Additional spacing between lines
Applications may apply additional line spacing adjustments.
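As a sketch of how these three metrics combine, the default line height is often taken as ascent + descent + line gap, scaled from font units to the rendered size. The metric values and the `units_per_em` of 1000 below are assumptions for illustration, and descent is treated as a positive distance below the baseline.

```python
def scaled_line_height(ascent, descent, line_gap, units_per_em, font_size):
    """Default line height at font_size, from metrics given in font units."""
    return (ascent + descent + line_gap) * font_size / units_per_em

# Hypothetical metrics: 800 up, 200 down, 90 extra, on a 1000-unit em.
height = scaled_line_height(800, 200, 90, units_per_em=1000, font_size=16)
print(round(height, 2))  # 17.44
```

A fallback font with a larger ascent or descent than the primary font yields a taller line, which is one way text ends up clipped in fixed-height controls.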
Common Display Problems
1. Invalid Clusters
Causes:
- Incorrect character sequences for script
- Components in wrong order
- Unicode normalization issues
Symptoms:
- Dotted circles indicating invalid combinations
- Missing or misplaced diacritical marks
2. Copy/Paste from PDF Issues
Problem: PDFs store glyph positions, not original text
- Advanced layout information lost
- Character-to-glyph mapping may be irreversible
- Copy/paste produces garbled text
Solution: Ensure PDFs embed proper text extraction data
3. Font Style Mismatches
Causes:
- Fallback font doesn't match original style
- Limited font selection available
- Font classification mismatches
Note: Fallback prioritizes legibility over style matching
4. Text Truncated Vertically
Causes:
- Text controls sized for specific scripts
- Different writing systems have different vertical requirements
- Font metrics not properly accounted for
5. Encoding Errors
Not Text Layout Issues - Upstream Problems:
Transcoding Failures:
- UTF-8 interpreted as legacy encoding
- Double-encoding artifacts
- Replacement characters (�) indicating conversion failure
Legacy Software:
- Non-Unicode capable applications
- Question marks for unsupported characters
- Incomplete UTF-16 surrogate handling
6. Incorrect Parsing of UTF-8 or UTF-16 Sequences
Software incorrectly assumes encoding format, leading to garbage text display
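The encoding failures listed above can be reproduced with Python's built-in codecs. This is only a demonstration of the symptoms; the variable names are arbitrary.

```python
text = "\u00E9"  # é

# UTF-8 bytes misread as Latin-1: one character becomes two.
mojibake = text.encode("utf-8").decode("latin-1")
print(mojibake)  # Ã©

# Re-encoding the mojibake as UTF-8 produces the double-encoding artifact.
double_encoded = mojibake.encode("utf-8")
print(double_encoded)  # b'\xc3\x83\xc2\xa9'

# Latin-1 bytes misread as UTF-8: the invalid byte becomes U+FFFD.
replaced = b"caf\xe9".decode("utf-8", errors="replace")
print(replaced == "caf\uFFFD")  # True

# A character outside the BMP needs two UTF-16 code units (a surrogate pair).
utf16_units = len("\U0001F600".encode("utf-16-le")) // 2
print(utf16_units)  # 2
```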
Implementation Implications
Performance Considerations
- Text display is an extremely common operation
- Software optimized for efficient processing
- Incremental updates for document editing
- Constraint analysis to minimize re-layout
Complexity Management
- Simple Scripts: May use optimized basic layout paths
- Complex Scripts: Require full advanced layout pipeline
- Modern Approach: Apply advanced layout universally for consistent typography
Development Guidelines
- Don't rely on font fallback for proper localization
- Test with target languages early in development
- Understand platform differences in text processing
- Plan for complex script requirements from the beginning
Debugging Text Display Issues
Diagnostic Approach
- Identify the problem type:
- Layout/positioning issue
- Font fallback problem
- Encoding/conversion error
- Platform/software limitation
- Gather information:
- What font was actually used?
- What text processing occurred?
- What are the original character codes?
- What platform/software environment?
- Consult experts:
- Text layout engineers
- Language/script experts
- Platform documentation
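For the "original character codes" question, a quick dump of code points, general categories, and character names often narrows the problem down. The helper name `describe` is made up; it uses only Python's standard `unicodedata` module.

```python
import unicodedata

def describe(text):
    """List each code point with its general category and Unicode name."""
    return [
        f"U+{ord(ch):04X} {unicodedata.category(ch)} "
        f"{unicodedata.name(ch, '<unnamed>')}"
        for ch in text
    ]

for line in describe("e\u0301"):
    print(line)
# U+0065 Ll LATIN SMALL LETTER E
# U+0301 Mn COMBINING ACUTE ACCENT
```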
Tools and Resources
- Unicode Character Database
- Script-specific documentation
- Platform text layout APIs
- Font inspection tools
- Text encoding validators
Key Takeaways
- Text display is complex - What you see is the result of sophisticated processing
- Character ≠ Glyph - One-to-many relationships are common
- Context matters - Same characters may render differently based on surrounding text
- Scripts vary widely - Solutions must accommodate diverse writing systems
- Font data drives behavior - Advanced layout depends on font-provided rules
- Testing is crucial - Problems often surface only with real-world multilingual content
Further Reading
- Unicode Standard: unicode.org
- UAX #15: Unicode Normalization Forms
- UAX #29: Unicode Text Segmentation
- UAX #9: Unicode Bidirectional Algorithm
- OpenType Specification: Microsoft Typography documentation
- Platform APIs: CoreText (Apple), DirectWrite (Microsoft), HarfBuzz documentation
Session recorded at Unicode Technology Workshop 2025
Notes compiled from presentation materials and transcript
Getting started with ICU4X
Needs
- Low latency requirements
- Data heavy algorithms
- Privacy implications
- Rich UX
- Network degradation resilience
ICU4X
Grammatical Agreement with Unicode Inflection
AI
- High computational cost and network latency
- Top-tier languages are covered, but torso and tail languages lack data
- Language bias
Where AI can help
- Offline processing where latency or resources are not critical, like grammar fixing, lexicon generation/expansion
- smaller, older, less costly models could be used for higher coverage or better accuracy, like LSTMs
- Coverage for language we don't have experts to generate rules or with high grammar complexity
- Client side support is slowly improving with nano models
Concept of Lemmaless inflection
Segmenting Complex Scripts with Machine Learning
Line and word breaks
Word breaks
Dictionary based segmentation
Where does it fall short?
- size is too large
- new or specialized words are not easily recognized (xx-ing)
- longest match can fail by missing correct shorter words
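The longest-match failure can be shown with a toy example. English without spaces stands in for an unspaced script here, and `TOY_DICT` is a made-up dictionary; this is a sketch of forward maximum matching, not ICU's dictionary segmenter.

```python
# Made-up dictionary for illustration.
TOY_DICT = {"the", "theme", "men", "dine", "here"}

def longest_match(text, dictionary):
    """Forward maximum matching: always take the longest dictionary word."""
    max_len = max(map(len, dictionary))
    out, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + size] in dictionary:
                out.append(text[i:i + size])
                i += size
                break
        else:
            out.append(text[i])  # unknown character: emit it on its own
            i += 1
    return out

# Intended reading: the / men / dine / here
print(longest_match("themendinehere", TOY_DICT))
# ['theme', 'n', 'dine', 'here']
```

Taking "theme" first consumes the "me" that "men" needed, so the correct shorter word "the" is never considered.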
Two broad cases need different solutions
- Southeast Asian (SEA)
- East Asian (CJK)
CJK:
AdaBoost: many tiny rules each vote on whether a break is good; the combined votes decide word boundaries
RAdaBoost: Radicals are the components of Han characters. Certain radicals frequently appear together, which provides useful cues for word segmentation.
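The "many tiny rules each vote" idea can be sketched as a weighted sum over simple features around a candidate boundary. Everything here is hypothetical: the feature names, the weights in `TOY_WEIGHTS`, and the bias are invented for illustration and are not taken from any trained model or from the presenters' implementation.

```python
# Hypothetical learned weights: positive votes for a break, negative against.
TOY_WEIGHTS = {
    ("prev", "我"): 1.0,
    ("prev", "在"): 1.0,
    ("pair", "北京"): -2.0,  # these two characters usually stay together
}
BIAS = -0.1  # slight default preference for not breaking

def break_score(text, i):
    """Sum the votes for a boundary between text[i-1] and text[i]."""
    feats = [("prev", text[i - 1]), ("next", text[i]), ("pair", text[i - 1:i + 1])]
    return BIAS + sum(TOY_WEIGHTS.get(f, 0.0) for f in feats)

def segment(text):
    """Break wherever the combined vote is positive."""
    words, start = [], 0
    for i in range(1, len(text)):
        if break_score(text, i) > 0:
            words.append(text[start:i])
            start = i
    words.append(text[start:])
    return words

print(segment("我在北京"))  # ['我', '在', '北京']
```

Because the model is just a table of weights, it can be far smaller than a full dictionary.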
BudouX/RAdaBoost
AdaBoost learners
ICU dictionary: 2.0 MB
BudouX: zh-Hant 64 KB, zh-Hans 63 KB; Radical (all zh variants): 60 KB
Links with Non-ASCII: Unicode Detection and Display
Basically Unreadable with Percentage Codes
Draft UTS#58 Link Detection and Formatting
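The readability problem with percent codes is easy to reproduce using Python's standard `urllib.parse`. The path below is a made-up example; this only shows percent-encoding itself, not the detection or formatting rules of the draft UTS #58.

```python
from urllib.parse import quote, unquote

path = "/wiki/日本語"          # hypothetical path with non-ASCII characters
encoded = quote(path)          # percent-encode the UTF-8 bytes, keeping "/"
print(encoded)                 # /wiki/%E6%97%A5%E6%9C%AC%E8%AA%9E
print(unquote(encoded) == path)  # True
```

Three characters become 27 characters of percent codes, which is why displaying the decoded form matters for readers.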
Automated I18n Quality for Enterprise Platforms
Globalization Readiness
- Linguistic Quality
- Extensibility
- Maintainability
- Time to market
- Portability (standard based)
Reactive vs Proactive
Reactive: fix bugs, correct translations, troubleshoot, but customers will find issues before you do
Proactive: prevent bugs, establish best practices that are global-ready
Using AI out of the box
goose: Agentic vibe coding, but it does not use ICU and does not produce data that is ready for i18n.
LLM output: the statistically most common patterns, but wrong:
- Not using standard region codes.
- Assumes only one language per region
- Assumes only two forms for plural
- Sloppy "plural(s)" constructs in some languages
- No gender handling
- Embeds formatting and layout with content
- Content for all locales in a single file
- (not shown)
- Poorly structured phone numbers stored as raw text
- No attempt to find or use libraries for phone or address handling, or to use ICU or CLDR
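The "only two plural forms" assumption breaks quickly outside English. The sketch below hand-codes plural categories for non-negative integers in English and Russian, following the CLDR integer rules as I understand them; fractions and other languages are not handled, and production code should use ICU or CLDR data instead of hand-written rules like these.

```python
def plural_category_en(n):
    """English integers: 'one' for 1, 'other' for everything else."""
    return "one" if n == 1 else "other"

def plural_category_ru(n):
    """Russian integers: three categories, chosen by the last digits."""
    if n % 10 == 1 and n % 100 != 11:
        return "one"
    if 2 <= n % 10 <= 4 and not 12 <= n % 100 <= 14:
        return "few"
    return "many"

# Hypothetical message table for the Russian word for "day".
RU_DAYS = {"one": "день", "few": "дня", "many": "дней"}

for n in (1, 2, 5, 11, 21, 22):
    print(n, RU_DAYS[plural_category_ru(n)])
```

A two-form `n == 1 ? singular : plural` check would get 2, 21, and 22 wrong in Russian.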
Detect Issues in source content
- Before entering the translation pipeline
- Within Atlas, a platform for managing localization workflows
- Rule-based linting
- Using 3rd party lib: ilib-lint
GitHub -> CI (automated build) -> AWS -> Management platform -> GitHub/CI/Translator vendor
Detect issues in source code
- Independent of translatable content
- Much larger dataset
- Build a custom scanner
- Static Analysis + AI
- Many programming languages
- Custom integrations
i18n using AI + Self-Healing
Study: source code i18n self-healing using AI
- Scan-train-refine
- Known and discovered
Going forward with AI
- i18n anti-pattern development
- Scanning tool development
- Fine tuning results
- AI Training
- Self-healing training
- CI/CD Integration
End-to-end i18n system by TikTok
- Part of TikTok Design System
- SDK supporting TikTok locales and 200+ CLDR Locales
- Real business needs embedded