Demystifying Unicode Text Display: From Unicode Code Points to Positioned Glyphs
Conference Session Notes - Unicode Technical Workshop 2025
Presenters: Microsoft & Apple Text Layout Teams
Overview
This session explores the complex journey from Unicode characters in a file to the positioned glyphs you see on screen. Understanding this process is crucial for developers working on internationalization, text analysis, localization testing, and anyone dealing with multi-script content.
Why Learn About Text Display?
- Software Development: Working on text display software or browsers
- Writing Systems: Understanding how different scripts are implemented
- Unicode Encoding: Planning to propose new writing system encodings
- Localization: Testing scenarios with different languages and scripts
- Text Analysis: Understanding the relationship between encoded characters and visual output
Text Display & Fonts
Core Concepts & Terminology
Key Terms
- Characters - Abstract units stored in data files
- Code Points - Numeric values representing characters in Unicode
- String - Sequence of characters
- Glyphs - Actual visual shapes rendered on screen
- Glyph Run - Sequence of positioned glyphs
- Font Family - Set of fonts sharing design traits (e.g., Arial)
- Font Style - Specific variant within family (e.g., Arial Bold)
Critical Distinction
- Character: Capital letter "A" (abstract concept)
- Glyph: The specific visual shape of "A" from a particular font
Basic Text Layout Process
Simple Case: Single Line, Latin Characters
The most basic form of text layout involves:
- Sequence of glyphs arranged on a baseline
- Each glyph positioned adjacent to the previous one
- Left-to-right progression
Font Data Structure
Font files are organized as databases containing:
- Name Table: Strings describing font metadata
- Glyph Table: Actual glyph outline data
- Metrics: Measurements for font and individual glyphs
- CMap Table: Character-to-glyph mapping
All data is organized into tables with 4-character mnemonic names.
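As a rough illustration of the table organization described above, the sketch below reads the table directory of an sfnt-style (TrueType/OpenType) font and lists the 4-character tags. The function name `list_table_tags` and the synthetic bytes are assumptions made for this example; the bytes form only a directory, not a usable font.

```python
import struct

def list_table_tags(font_bytes: bytes) -> list[str]:
    """Return the 4-character table tags from an sfnt table directory."""
    # Header: sfntVersion (4 bytes), then numTables, searchRange,
    # entrySelector, rangeShift as big-endian uint16 values (12 bytes total).
    _version, num_tables, _sr, _es, _rs = struct.unpack_from(">4sHHHH", font_bytes, 0)
    tags = []
    for i in range(num_tables):
        # Each 16-byte record holds: tag, checksum, offset, length.
        tag, _checksum, _offset, _length = struct.unpack_from(
            ">4sIII", font_bytes, 12 + 16 * i
        )
        tags.append(tag.decode("ascii"))
    return tags

# Synthetic two-table directory, built only for illustration.
demo = struct.pack(">4sHHHH", b"\x00\x01\x00\x00", 2, 32, 1, 0)
demo += struct.pack(">4sIII", b"cmap", 0, 44, 0)
demo += struct.pack(">4sIII", b"glyf", 0, 44, 0)

print(list_table_tags(demo))
```

With a real font file, the same function could be fed the file's bytes; font inspection tools present this directory in a friendlier form.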
Character-to-Glyph Mapping
- CMap Table provides initial character→glyph mapping
- Glyph IDs are arbitrary numbers assigned by font designer
- Not all characters may be supported by a given font
- This is called the "nominal mapping" or "default glyph mapping"
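A minimal sketch of the nominal mapping, assuming a toy cmap held in a dictionary. The glyph ID values and the name `nominal_glyphs` are invented for illustration; the one fixed convention used is that glyph ID 0 is the "missing glyph" (.notdef) in OpenType fonts.

```python
# Hypothetical cmap for an imaginary font: code point -> glyph ID.
# The IDs are arbitrary, just as they are in real fonts.
CMAP = {0x0041: 36, 0x0042: 37, 0x0061: 68}
NOTDEF = 0  # glyph ID 0 is the "missing glyph" (.notdef)

def nominal_glyphs(text: str) -> list[int]:
    """Map each code point to its default glyph, or .notdef if unsupported."""
    return [CMAP.get(ord(ch), NOTDEF) for ch in text]

# The euro sign is not covered by this toy cmap, so it maps to .notdef.
print(nominal_glyphs("ABa\u20AC"))
```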
Glyph Positioning Basics
Each glyph has:
- Origin Point: Where X=0, Y=baseline intersection
- Left Side Bearing: Distance from origin to left edge
- Advance Width: Distance to move for next glyph position
- Outline Data: Control points defining the shape
Layout Process:
- Align glyph origin with current drawing position
- Render the glyph
- Move drawing position by advance width
- Repeat for next character
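The layout loop above can be sketched as follows. The advance widths, glyph IDs, and the names `PositionedGlyph` and `layout_line` are assumptions for this example; actual rendering is left out, so only positions are computed.

```python
from dataclasses import dataclass

@dataclass
class PositionedGlyph:
    glyph_id: int
    x: float  # origin position along the baseline
    y: float  # baseline position

# Hypothetical advance widths in font units, keyed by glyph ID.
ADVANCES = {36: 600, 37: 650, 68: 500}

def layout_line(glyph_ids, start_x=0.0, baseline_y=0.0):
    """Place each glyph origin at the pen position, then advance the pen."""
    pen_x = start_x
    run = []
    for gid in glyph_ids:
        run.append(PositionedGlyph(gid, pen_x, baseline_y))
        pen_x += ADVANCES[gid]
    # The total advance is the width of the glyph run.
    return run, pen_x - start_x

run, width = layout_line([36, 37, 68])
print([g.x for g in run], width)
```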
Advanced Layout Requirements
Simple character-by-character layout is insufficient for:
1. Kerning
Adjusting spacing between specific letter pairs for better visual balance
- Example: "VA" or "To" - reducing space for optical balance
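A small sketch of pair kerning, assuming invented advance widths and pair adjustments. For readability it is keyed by characters; a real engine works on glyph IDs and reads the adjustments from font data.

```python
# Hypothetical advances and kerning pairs, in font units.
ADVANCE = {"V": 650, "A": 680, "T": 600, "o": 550}
KERN = {("V", "A"): -80, ("T", "o"): -60}

def pen_positions(text):
    """Return the x position of each glyph after applying pair kerning."""
    x, xs = 0, []
    for i, ch in enumerate(text):
        if i > 0:
            # Adjust the pen by the kerning value for (previous, current).
            x += KERN.get((text[i - 1], ch), 0)
        xs.append(x)
        x += ADVANCE[ch]
    return xs

print(pen_positions("VA"))  # kerned pair
print(pen_positions("AV"))  # no adjustment defined for this order
```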
2. Contextual Positioning
Arabic Script Example:
- Letters change shape based on position in word
- Connecting scripts require precise glyph alignment
- Marks above letters must adjust to letter height
3. Combining Marks
Diacritical Marks:
- Must position accurately relative to base letters
- Avoid collisions with other marks
- Handle complex combinations (multiple accents)
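One way to picture anchor-based mark positioning: the mark is shifted so that its anchor point lands on the matching anchor of the base glyph. The anchor coordinates and the name `mark_offset` below are hypothetical values chosen for illustration.

```python
# Hypothetical anchor points in font units, relative to each glyph's origin.
BASE_TOP_ANCHOR = {"a": (250, 520), "A": (340, 720)}
MARK_ANCHOR = {"acute": (0, 500)}

def mark_offset(base, mark):
    """Offset that moves the mark so its anchor coincides with the base anchor."""
    bx, by = BASE_TOP_ANCHOR[base]
    mx, my = MARK_ANCHOR[mark]
    return (bx - mx, by - my)

# The same accent sits higher over the taller capital letter.
print(mark_offset("a", "acute"))
print(mark_offset("A", "acute"))
```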
4. Glyph Substitution
Contextual Forms:
- Same character may need different glyphs based on context
- Arabic: initial, medial, final, isolated forms
- Complex scripts require cluster analysis
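The choice among Arabic contextual forms can be sketched with a simplified joining model. The assumptions here are significant: only four letters are listed, combining marks are ignored, and the name `contextual_forms` is invented. A real engine takes joining types from Unicode data and the actual glyphs from the font.

```python
# Simplified joining types for a few letters (a small hand-picked subset).
DUAL, RIGHT = "D", "R"  # dual-joining, right-joining
JOINING = {
    "\u0628": DUAL,   # BEH
    "\u0633": DUAL,   # SEEN
    "\u0627": RIGHT,  # ALEF
    "\u062F": RIGHT,  # DAL
}

def contextual_forms(word):
    """Pick isolated/initial/medial/final for each letter of a word."""
    forms = []
    for i, ch in enumerate(word):
        cur = JOINING.get(ch)
        prev = JOINING.get(word[i - 1]) if i > 0 else None
        nxt = JOINING.get(word[i + 1]) if i + 1 < len(word) else None
        # A letter connects backward only if the previous letter is dual-joining.
        joins_prev = cur is not None and prev == DUAL
        # Only dual-joining letters connect forward to a following joining letter.
        joins_next = cur == DUAL and nxt is not None
        if joins_prev and joins_next:
            forms.append("medial")
        elif joins_prev:
            forms.append("final")
        elif joins_next:
            forms.append("initial")
        else:
            forms.append("isolated")
    return forms

print(contextual_forms("\u0628\u0627\u0628"))  # BEH ALEF BEH
```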
5. Ligature Substitution
Typographic Ligatures:
- Replace character sequences with single composed glyphs
- Example: "ffi" → single ligature glyph
- Improves readability and aesthetics
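A minimal sketch of ligature substitution, assuming a toy rule table and a longest-match-first strategy. The glyph names and `apply_ligatures` are hypothetical; in a real font these rules live in the font's substitution data.

```python
# Hypothetical ligature rules: component sequence -> ligature glyph name.
LIGATURES = {
    ("f", "f", "i"): "f_f_i",
    ("f", "i"): "f_i",
    ("f", "f"): "f_f",
}
MAX_LEN = max(len(key) for key in LIGATURES)

def apply_ligatures(glyphs):
    """Replace component sequences with ligature glyphs, longest match first."""
    out, i = [], 0
    while i < len(glyphs):
        for n in range(min(MAX_LEN, len(glyphs) - i), 1, -1):
            lig = LIGATURES.get(tuple(glyphs[i:i + n]))
            if lig:
                out.append(lig)
                i += n
                break
        else:
            # No rule matched at this position: keep the glyph unchanged.
            out.append(glyphs[i])
            i += 1
    return out

print(apply_ligatures(list("office")))
```

Note that the output has fewer glyphs than the input has characters, which is one reason the character-to-glyph relationship is no longer one-to-one.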
6. Language-Specific Variants
Same character may have different appearances in different languages
- Example: Cyrillic letters in Russian vs. Bulgarian
7. Bidirectional Text (BIDI)
Some writing systems require right-to-left text processing
- Hebrew and Arabic written right-to-left
- Mixed direction content requires Unicode Bidirectional Algorithm
- Glyph reordering within clusters may be needed
Implications
- Advanced line layout is required for high quality typography & many scripts
- Complex character-to-glyph associations - no longer one-to-one mapping
- Default glyph metrics alone don't determine final positions
- Additional software logic is required beyond basic font data
- Font-specific details drive advanced layout behavior
OpenType Layout System
Advanced Layout Engine Requirements
- General Advanced Layout Logic
- Script-Specific Behavior Logic: Based on Unicode character properties
- Font-Specific Data: Substitution and positioning rules
OpenType Layout Tables
- GDEF (Glyph Definition Table): Classifies glyphs by type (base, mark, ligature, component)
- GSUB (Glyph Substitution Table): Defines character/glyph substitution rules, handles contextual forms, ligatures, alternates
- GPOS (Glyph Positioning Table): Defines positioning adjustments, handles kerning, mark positioning, cursive attachment
Text Shaping Engines
Platform-specific implementations:
- CoreText (macOS)
- DirectWrite (Windows)
- HarfBuzz (Linux/Cross-platform)
Text Processing Pipeline
1. Run Segmentation
Script Itemization:
- Segment text by Unicode script properties
- Group characters requiring similar processing
BiDi Level Analysis:
- Apply Unicode Bidirectional Algorithm
- Determine text direction runs
- Handle mixed left-to-right and right-to-left content
2. Shaping Process
For each text run:
Canonical Decomposition (UAX #15):
- Normalize character sequences
- Handle composed vs. decomposed forms
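To make the composed vs. decomposed distinction concrete, the sketch below uses Python's standard `unicodedata.normalize`. It only illustrates the normalization forms themselves; how a particular shaping engine uses them internally is not shown here.

```python
import unicodedata

composed = "\u00E9"  # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = unicodedata.normalize("NFD", composed)  # 'e' + COMBINING ACUTE ACCENT

print([f"U+{ord(c):04X}" for c in composed])
print([f"U+{ord(c):04X}" for c in decomposed])

# Both spellings are canonically equivalent and should display identically.
assert unicodedata.normalize("NFC", decomposed) == composed
```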
Cluster Analysis (UAX #29):
- Identify character clusters that must be processed together
- Critical for complex scripts like Devanagari, Arabic
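A deliberately simplified sketch of cluster formation: each combining mark is attached to the character before it. This is an assumption-heavy approximation, not the UAX #29 algorithm, which also covers cases such as Hangul syllables, emoji sequences, and conjuncts. The name `simple_clusters` is invented.

```python
import unicodedata

def simple_clusters(text):
    """Group each base character with the combining marks that follow it."""
    clusters = []
    for ch in text:
        # General categories Mn, Mc, Me are the combining mark categories.
        is_mark = unicodedata.category(ch) in ("Mn", "Mc", "Me")
        if is_mark and clusters:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

# 'e' + combining acute stays together; so does Devanagari KA + vowel sign I.
print(simple_clusters("e\u0301a"))
print(simple_clusters("\u0915\u093F"))
```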
Glyph Substitution:
- Apply contextual forms
- Process ligatures
- Handle language-specific variants
3. Positioning
- Apply kerning adjustments
- Position combining marks using anchor points
- Handle cursive attachment
- Calculate final glyph positions
Bidirectional Text Processing
Unicode Bidirectional Algorithm
Every character has a Bidi_Class property:
- Strong LTR: Latin letters (L)
- Strong RTL: Arabic, Hebrew letters (R, AL)
- Neutral: Punctuation, symbols (neutrally directional)
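The Bidi_Class values listed above can be inspected with Python's standard `unicodedata.bidirectional`. The helper name `bidi_classes` is just for this example.

```python
import unicodedata

def bidi_classes(text):
    """Return the Bidi_Class abbreviation for each character."""
    return [unicodedata.bidirectional(ch) for ch in text]

# Latin 'a', Hebrew ALEF, Arabic ALEF, and an exclamation mark.
print(bidi_classes("a\u05D0\u0627!"))
```

The exclamation mark reports ON (Other Neutral), one of the neutral classes whose direction is resolved from context.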
Processing Steps:
- Assign embedding levels based on character properties
- Create level runs of same directionality
- Reorder glyphs within and between runs
- Handle neutral characters based on context
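The reordering step can be sketched on its own, assuming the embedding levels have already been resolved. The function below reverses runs from the highest level down to the lowest odd level, which is the reversal rule in UAX #9; the earlier steps of the algorithm are not implemented. Uppercase Latin letters stand in for right-to-left letters so the effect is visible, and `reorder_visual` is an invented name.

```python
def reorder_visual(chars, levels):
    """Reorder logical-order characters into visual order from resolved levels."""
    if not levels:
        return ""
    order = list(range(len(chars)))
    highest = max(levels)
    odd_levels = [lvl for lvl in levels if lvl % 2]
    lowest_odd = min(odd_levels) if odd_levels else highest + 1
    for level in range(highest, lowest_odd - 1, -1):
        i = 0
        while i < len(order):
            if levels[order[i]] >= level:
                # Find the maximal run at this level or higher and reverse it.
                j = i
                while j < len(order) and levels[order[j]] >= level:
                    j += 1
                order[i:j] = reversed(order[i:j])
                i = j
            else:
                i += 1
    return "".join(chars[k] for k in order)

# "DEF" plays the role of right-to-left text embedded in a left-to-right line.
print(reorder_visual("abc DEF", [0, 0, 0, 0, 1, 1, 1]))
```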
Result: Text displays correctly regardless of storage order
Font Fallback
When Fallback Occurs
When primary font lacks required glyphs:
- Individual characters missing
- Entire clusters unsupported
- Language-specific glyph variants needed
Context Considerations
User Preferences:
- Language settings
- Input method indicators
- Markup language tags
Font Matching Criteria:
- Classification: serif, sans-serif, cursive, monospace
  - Some classifications are specified by fonts themselves
  - Some are determined by other means (仿宋)
- Attributes: weight, width, italic/oblique
  - Fallback font may not exactly match all attributes
  - Variable fonts can be responsive to some attributes
Available Font Selection:
- Platform-dependent font sets
- Application-specific font lists
- Privacy considerations (web fonts)
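A rough sketch of fallback selection, assuming each font's coverage is known as a set of code points and the fallback order is a simple list. The font names, coverage ranges, and `assign_fonts` are hypothetical; the sketch also works per character, whereas a real engine would decide per cluster and weigh the matching criteria above.

```python
# Hypothetical fonts and the code points each one covers.
FONT_COVERAGE = {
    "PrimarySans": set(range(0x20, 0x7F)),
    "FallbackArabic": set(range(0x0600, 0x0700)),
    "FallbackCJK": set(range(0x4E00, 0xA000)),
}

def assign_fonts(text, font_chain):
    """Split text into runs, each tagged with the first font that covers it."""
    runs = []
    for ch in text:
        font = next((f for f in font_chain if ord(ch) in FONT_COVERAGE[f]), None)
        if runs and runs[-1][0] == font:
            runs[-1] = (font, runs[-1][1] + ch)
        else:
            runs.append((font, ch))
    return runs

chain = ["PrimarySans", "FallbackArabic", "FallbackCJK"]
print(assign_fonts("Hi \u0645\u4E2D", chain))
```

A run tagged `None` means no font in the chain covers the character, which is where a missing-glyph box would appear.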
Display Emojis
Emoji processing requires:
- Character property analysis
- Known sequence recognition
- Variation selector handling
- Color font format support
Color Font Formats
- Bitmapped: sbix, CBDT/CBLC tables
- Vector: COLRv0/v1 tables
- SVG: SVG table
Not necessarily one glyph per emoji - complex emoji may use multiple glyphs with positioning
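To show why sequence recognition and variation selector handling are needed, the sketch below lists the code points behind two emoji. The helper name `describe` is invented; the sequences themselves are standard.

```python
ZWJ = "\u200D"   # ZERO WIDTH JOINER, glues emoji into one sequence
VS16 = "\uFE0F"  # VARIATION SELECTOR-16, requests emoji presentation

def describe(seq):
    """List the code points that make up a string."""
    return [f"U+{ord(c):04X}" for c in seq]

technologist = "\U0001F469" + ZWJ + "\U0001F4BB"  # woman + ZWJ + laptop
heart = "\u2764" + VS16                            # heavy black heart + VS16

print(describe(technologist))
print(describe(heart))
```

Each string is shown to the user as a single emoji even though it is stored as several code points.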
Multi-Line Layout
Line Breaking
Uses accumulated glyph width information to:
- Determine text that fits in available width
- Find appropriate break points
- Handle bidirectional content wrapping
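A minimal greedy line breaker, assuming word widths have already been measured from glyph advances and that breaks are allowed only at spaces. Both assumptions are simplifications: break opportunities in general follow the Unicode line breaking rules, and bidirectional wrapping is not handled. The names and widths are invented.

```python
def break_lines(words, widths, space_width, max_width):
    """Greedily pack words into lines that fit within max_width."""
    lines, current, current_w = [], [], 0
    for word in words:
        w = widths[word]
        extra = w if not current else space_width + w
        if current and current_w + extra > max_width:
            # The word does not fit: close the line and start a new one.
            lines.append(" ".join(current))
            current, current_w = [word], w
        else:
            current.append(word)
            current_w += extra
    if current:
        lines.append(" ".join(current))
    return lines

# Hypothetical measured widths, in pixels.
widths = {"text": 40, "layout": 60, "is": 20, "hard": 40}
print(break_lines(["text", "layout", "is", "hard"], widths, 10, 110))
```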
Vertical Spacing
Font Metrics:
- Ascent: Distance above baseline
- Descent: Distance below baseline
- Line Gap: Additional spacing between lines
Applications may apply additional line spacing adjustments.
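The three metrics combine into a default line height, which can be sketched as below. The metric values and the name `baseline_positions` are assumptions, and any application-level spacing adjustment is left out.

```python
def baseline_positions(line_count, ascent, descent, line_gap, top=0.0):
    """Return the y coordinate of each baseline, measured downward from top."""
    line_height = ascent + descent + line_gap
    # The first baseline sits one ascent below the top edge.
    return [top + ascent + i * line_height for i in range(line_count)]

# Hypothetical metrics in font units.
print(baseline_positions(3, ascent=800, descent=200, line_gap=100))
```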
Common Display Problems
1. Invalid Clusters
Causes:
- Incorrect character sequences for script
- Components in wrong order
- Unicode normalization issues
Symptoms:
- Dotted circles indicating invalid combinations
- Missing or misplaced diacritical marks
2. Copy/Paste from PDF Issues
Problem: PDFs store glyph positions, not original text
- Advanced layout information lost
- Character-to-glyph mapping may be irreversible
- Copy/paste produces garbled text
Solution: Ensure PDFs embed proper text extraction data
3. Font Style Mismatches
Causes:
- Fallback font doesn't match original style
- Limited font selection available
- Font classification mismatches
Note: Fallback prioritizes legibility over style matching
4. Text Truncated Vertically
Causes:
- Text controls sized for specific scripts
- Different writing systems have different vertical requirements
- Font metrics not properly accounted for
5. Encoding Errors
Not Text Layout Issues - Upstream Problems:
Transcoding Failures:
- UTF-8 interpreted as legacy encoding
- Double-encoding artifacts
- Replacement characters (�) indicating conversion failure
Legacy Software:
- Non-Unicode capable applications
- Question marks for unsupported characters
- Incomplete UTF-16 surrogate handling
6. Incorrect Parsing of UTF-8 or UTF-16 Sequences
Software incorrectly assumes encoding format, leading to garbage text display
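These encoding failures are easy to reproduce with Python's standard codecs, which can help confirm that a problem is upstream of text layout. The variable names are just for this example.

```python
original = "\u00E9"  # 'é'

# UTF-8 bytes misread as Latin-1: the classic two-character artifact.
mojibake = original.encode("utf-8").decode("latin-1")

# Repeating the mistake produces double-encoding artifacts.
double = mojibake.encode("utf-8").decode("latin-1")

# An invalid byte is replaced with U+FFFD during a lenient decode.
replaced = b"caf\xff".decode("utf-8", errors="replace")

# A non-Unicode target encoding turns unsupported characters into '?'.
question_marks = "\u65E5\u672C".encode("ascii", errors="replace")

# A character outside the BMP needs two UTF-16 code units (a surrogate pair).
utf16_units = len("\U0001F600".encode("utf-16-be")) // 2

print(mojibake, len(double), replaced, question_marks, utf16_units)
```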
Implementation Implications
Performance Considerations
- Text display is an extremely common operation
- Software is optimized for efficient processing
- Incremental updates for document editing
- Constraint analysis to minimize re-layout
Complexity Management
- Simple Scripts: May use optimized basic layout paths
- Complex Scripts: Require full advanced layout pipeline
- Modern Approach: Apply advanced layout universally for consistent typography
Development Guidelines
- Don't rely on font fallback for proper localization
- Test with target languages early in development
- Understand platform differences in text processing
- Plan for complex script requirements from the beginning
Debugging Text Display Issues
Diagnostic Approach
- Identify the problem type:
  - Layout/positioning issue
  - Font fallback problem
  - Encoding/conversion error
  - Platform/software limitation
- Gather information:
  - What font was actually used?
  - What text processing occurred?
  - What are the original character codes?
  - What platform/software environment?
- Consult experts:
  - Text layout engineers
  - Language/script experts
  - Platform documentation
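For the "original character codes" question, a small dump helper built on Python's standard `unicodedata` module is often enough to get started. The name `dump` and the output shape are choices made for this sketch.

```python
import unicodedata

def dump(text):
    """Return (code point, general category, name) for each character."""
    return [
        (
            f"U+{ord(ch):04X}",
            unicodedata.category(ch),
            unicodedata.name(ch, "<unnamed>"),
        )
        for ch in text
    ]

# Reveals that this visible "é" is actually stored as two code points.
for row in dump("e\u0301"):
    print(row)
```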
Tools and Resources
- Unicode Character Database
- Script-specific documentation
- Platform text layout APIs
- Font inspection tools
- Text encoding validators
Key Takeaways
- Text display is complex - What you see is the result of sophisticated processing
- Character ≠ Glyph - One-to-many relationships are common
- Context matters - Same characters may render differently based on surrounding text
- Scripts vary widely - Solutions must accommodate diverse writing systems
- Font data drives behavior - Advanced layout depends on font-provided rules
- Testing is crucial - Problems often surface only with real-world multilingual content
Further Reading
- Unicode Standard: unicode.org
- UAX #15: Unicode Normalization Forms
- UAX #29: Unicode Text Segmentation
- UAX #9: Unicode Bidirectional Algorithm
- OpenType Specification: Microsoft Typography documentation
- Platform APIs: CoreText (Apple), DirectWrite (Microsoft), HarfBuzz documentation
Session recorded at Unicode Technical Workshop 2025
Notes compiled from presentation materials and transcript