Demystifying Unicode Text Display: From Unicode Code Points to Positioned Glyphs

Conference Session Notes - Unicode Technical Workshop 2025
Presenters: Microsoft & Apple Text Layout Teams

Overview

This session explores the complex journey from Unicode characters in a file to the positioned glyphs you see on screen. Understanding this process is crucial for developers working on internationalization, text analysis, localization testing, and anyone dealing with multi-script content.

Why Learn About Text Display?

Software Development: Working on text display software or browsers
Writing Systems: Understanding how different scripts are implemented
Unicode Encoding: Planning to propose new writing system encodings
Localization: Testing scenarios with different languages and scripts
Text Analysis: Understanding the relationship between encoded characters and visual output

Text Display & Fonts

Core Concepts & Terminology

Key Terms

Characters - Abstract units stored in data files
Code Points - Numeric values representing characters in Unicode
String - Sequence of characters
Glyphs - Actual visual shapes rendered on screen
Glyph Run - Sequence of positioned glyphs
Font Family - Set of fonts sharing design traits (e.g., Arial)
Font Style - Specific variant within a family (e.g., Arial Bold)

Critical Distinction

Character: Capital letter "A" (abstract concept)
Glyph: The specific visual shape of "A" from a particular font
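
As a small illustration of the character side of this distinction, the sketch below lists the code points stored in a string. It uses only the Python standard library, and the helper name `code_points` is invented for this example. The glyph side cannot be shown without a font, so it is left out here.

```python
def code_points(text: str) -> list[str]:
    """Return the code points of a string in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


# The abstract character "A" is stored as code point U+0041,
# regardless of which font later supplies its glyph.
print(code_points("A"))
print(code_points("A\u00F1"))
```

The same code point can be drawn with very different glyphs depending on the font, which is why the notes keep the two terms apart.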

Basic Text Layout Process

Simple Case: Single Line, Latin Characters

The most basic form of text layout involves:

Sequence of glyphs arranged on a baseline
Each glyph positioned adjacent to the previous one
Left-to-right progression

Font Data Structure

Font files are organized as databases containing:

Name Table: Strings describing font metadata
Glyph Table: Actual glyph outline data
Metrics: Measurements for font and individual glyphs
CMap Table: Character-to-glyph mapping

All data is organized into tables with 4-character mnemonic names.

Character-to-Glyph Mapping

CMap Table provides the initial character→glyph mapping
Glyph IDs are arbitrary numbers assigned by font designer
Not all characters may be supported by a given font
This is called the "nominal mapping" or "default glyph mapping"
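
The nominal mapping can be pictured as a plain lookup table. The sketch below is a toy model, not a font parser: `TOY_CMAP` and its glyph IDs are made up for illustration, and the only real-world convention it relies on is that glyph ID 0 is the "missing glyph" (.notdef) in OpenType fonts.

```python
NOTDEF = 0  # glyph ID 0 is the conventional "missing glyph" (.notdef)

# Hypothetical cmap for a tiny font: code point -> glyph ID.
# Real glyph IDs are arbitrary numbers chosen by the font designer.
TOY_CMAP = {0x0041: 36, 0x0042: 37, 0x0066: 73, 0x0069: 76}


def nominal_glyphs(text: str, cmap: dict[int, int]) -> list[int]:
    """Map each character to its default glyph, or .notdef if unsupported."""
    return [cmap.get(ord(ch), NOTDEF) for ch in text]


print(nominal_glyphs("AB\u4E2D", TOY_CMAP))  # [36, 37, 0]
```

The trailing 0 shows a character this toy font does not support, which is the situation that later triggers font fallback.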

Glyph Positioning Basics

Each glyph has:

Origin Point: Where X=0, Y=baseline intersection
Left Side Bearing: Distance from origin to left edge
Advance Width: Distance to move for the next glyph position
Outline Data: Control points defining the shape

Layout Process:

Align glyph origin with current drawing position
Render the glyph
Move drawing position by advance width
Repeat for next character
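
The four steps above can be sketched as a short loop. This is a minimal model under stated assumptions: glyph IDs and advance widths in `TOY_ADVANCES` are invented, rendering is reduced to recording a position, and `layout_simple` is a hypothetical helper name.

```python
from dataclasses import dataclass


@dataclass
class PositionedGlyph:
    glyph_id: int
    x: float
    y: float


# Hypothetical advance widths in font units, keyed by glyph ID.
TOY_ADVANCES = {36: 1366, 37: 1366, 73: 569}


def layout_simple(glyph_ids, advances, start_x=0.0, baseline_y=0.0):
    """Place glyph origins along the baseline; return the run and its width."""
    pen_x = start_x
    run = []
    for gid in glyph_ids:
        # Align the glyph origin with the current drawing position.
        run.append(PositionedGlyph(gid, pen_x, baseline_y))
        # Move the drawing position by the glyph's advance width.
        pen_x += advances[gid]
    return run, pen_x - start_x


run, width = layout_simple([36, 37, 73], TOY_ADVANCES)
print([g.x for g in run], width)
```

The returned width is the accumulated advance of the whole glyph run, which is the figure line breaking needs later on.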

Advanced Layout Requirements

Simple character-by-character layout is insufficient for:

1. Kerning

Adjusting spacing between specific letter pairs for better visual balance

Example: "VA" or "To" - reducing space for optical balance
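
A rough sketch of pair kerning layered on top of plain advance widths follows. The advance and kerning values are invented, glyphs are identified by name for readability, and the function name is hypothetical. Real fonts store this data in binary tables and may adjust the first glyph's advance rather than the pen position, which has the same visible effect.

```python
# Hypothetical metrics in font units.
TOY_ADVANCES = {"V": 1366, "A": 1366, "T": 1251, "o": 1139}
TOY_KERNING = {("V", "A"): -152, ("T", "o"): -113}


def positions_with_kerning(glyphs, advances, kerning):
    """Return the x position of each glyph after pair-kerning adjustments."""
    xs, pen_x = [], 0
    for i, glyph in enumerate(glyphs):
        if i > 0:
            # Tighten (or loosen) the gap for this specific pair, if listed.
            pen_x += kerning.get((glyphs[i - 1], glyph), 0)
        xs.append(pen_x)
        pen_x += advances[glyph]
    return xs


print(positions_with_kerning(["V", "A"], TOY_ADVANCES, TOY_KERNING))  # [0, 1214]
print(positions_with_kerning(["V", "A"], TOY_ADVANCES, {}))           # [0, 1366]
```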

2. Contextual Positioning

Arabic Script Example:

Letters change shape based on position in word
Connecting scripts require precise glyph alignment
Marks above letters must adjust to letter height

3. Combining Marks

Diacritical Marks:

Must position accurately relative to base letters
Avoid collisions with other marks
Handle complex combinations (multiple accents)

4. Glyph Substitution

Contextual Forms:

Same character may need different glyphs based on context
Arabic: initial, medial, final, isolated forms
Complex scripts require cluster analysis
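
To make the four Arabic forms concrete, here is a heavily simplified sketch of how a form could be chosen from a letter's neighbors. The joining-type sets cover only a handful of letters and are an assumption for illustration (the full data lives in the Unicode Character Database); combining marks, joiner controls, and non-Arabic characters are ignored.

```python
# Reduced joining-type tables for illustration only.
DUAL_JOINING = set("\u0628\u062A\u0633\u0645\u0646\u064A")  # beh teh seen meem noon yeh
RIGHT_JOINING = set("\u0627\u062F\u0631\u0648")             # alef dal reh waw


def joins_to_previous(ch):
    return ch in DUAL_JOINING or ch in RIGHT_JOINING


def joins_to_next(ch):
    return ch in DUAL_JOINING


def contextual_forms(word):
    """Return 'isolated', 'initial', 'medial' or 'final' for each letter."""
    forms = []
    for i, ch in enumerate(word):
        prev_ok = i > 0 and joins_to_next(word[i - 1]) and joins_to_previous(ch)
        next_ok = (i + 1 < len(word) and joins_to_next(ch)
                   and joins_to_previous(word[i + 1]))
        if prev_ok and next_ok:
            forms.append("medial")
        elif prev_ok:
            forms.append("final")
        elif next_ok:
            forms.append("initial")
        else:
            forms.append("isolated")
    return forms


# beh + alef + beh: alef does not connect forward, so the last beh is isolated.
print(contextual_forms("\u0628\u0627\u0628"))
```

A shaping engine would then use the chosen form to select the matching glyph through the font's substitution rules.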

5. Ligature Substitution

Typographic Ligatures:

Replace character sequences with single composed glyphs
Example: "ffi" → single ligature glyph
Improves readability and aesthetics
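
The ligature idea can be sketched as a longest-match replacement over a glyph sequence. The table `TOY_LIGATURES` and the glyph names are hypothetical, and longest-match is a simplification: in a real font the order of substitution rules is decided by the font's own lookup data.

```python
# Hypothetical ligature rules: glyph sequence -> single ligature glyph.
TOY_LIGATURES = {
    ("f", "f", "i"): "f_f_i",
    ("f", "f"): "f_f",
    ("f", "i"): "f_i",
}


def apply_ligatures(glyphs, ligatures):
    """Replace the longest matching glyph sequences with ligature glyphs."""
    max_len = max((len(seq) for seq in ligatures), default=0)
    out, i = [], 0
    while i < len(glyphs):
        # Try the longest candidate first, down to sequences of two glyphs.
        for n in range(min(max_len, len(glyphs) - i), 1, -1):
            key = tuple(glyphs[i:i + n])
            if key in ligatures:
                out.append(ligatures[key])
                i += n
                break
        else:
            out.append(glyphs[i])
            i += 1
    return out


print(apply_ligatures(list("office"), TOY_LIGATURES))  # ['o', 'f_f_i', 'c', 'e']
```

Note that three characters now map to one glyph, so character indices and glyph indices no longer line up.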

6. Language-Specific Variants

Same character may have different appearances in different languages

Example: Cyrillic letters in Russian vs. Bulgarian

7. Bidirectional Text (BIDI)

Some writing systems require right-to-left text processing

Hebrew and Arabic written right-to-left
Mixed direction content requires Unicode Bidirectional Algorithm
Glyph reordering within clusters may be needed

Implications

Advanced line layout is required for high-quality typography and for many scripts
Complex character-to-glyph associations - no longer one-to-one mapping
Default glyph metrics alone don't determine final positions
Additional software logic is required beyond basic font data
Font-specific details drive advanced layout behavior

OpenType Layout System

Advanced Layout Engine Requirements

General Advanced Layout Logic
Script-Specific Behavior Logic: Based on Unicode character properties
Font-Specific Data: Substitution and positioning rules

OpenType Layout Tables

GDEF (Glyph Definition Table): Classifies glyphs by type (base, mark, ligature, component)
GSUB (Glyph Substitution Table): Defines glyph substitution rules; handles contextual forms, ligatures, alternates
GPOS (Glyph Positioning Table): Defines positioning adjustments; handles kerning, mark positioning, cursive attachment

Text Shaping Engines

Platform-specific implementations:

CoreText (macOS)
DirectWrite (Windows)
HarfBuzz (Linux/Cross-platform)

Text Processing Pipeline

1. Run Segmentation

Script Itemization:

Segment text by Unicode script properties
Group characters requiring similar processing
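
Script itemization can be approximated as grouping adjacent characters by script. Python's standard library does not expose the Unicode Script property, so the sketch below substitutes a small hand-written range table (`TOY_SCRIPT_RANGES`, an assumption that follows block boundaries rather than true script assignments). Characters outside the table are treated as "Common" and attached to the neighboring run.

```python
# Block-based stand-in for the Script property; illustration only.
TOY_SCRIPT_RANGES = [
    (0x0041, 0x005A, "Latin"), (0x0061, 0x007A, "Latin"),
    (0x0590, 0x05FF, "Hebrew"), (0x0600, 0x06FF, "Arabic"),
    (0x0900, 0x097F, "Devanagari"),
]


def script_of(ch):
    cp = ord(ch)
    for lo, hi, name in TOY_SCRIPT_RANGES:
        if lo <= cp <= hi:
            return name
    return "Common"


def itemize(text):
    """Split text into (script, substring) runs."""
    runs = []  # each entry is [script, text]
    for ch in text:
        script = script_of(ch)
        if runs and (script == "Common" or script == runs[-1][0]):
            runs[-1][1] += ch              # extend the current run
        elif runs and runs[-1][0] == "Common":
            runs[-1][0] = script           # leading spaces adopt the first script
            runs[-1][1] += ch
        else:
            runs.append([script, ch])      # start a new run
    return [(script, run) for script, run in runs]


print(itemize("abc \u05D0\u05D1 def"))
```

Each resulting run can then be handed to the shaping stage with the script-specific logic it needs.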

BiDi Level Analysis:

Apply Unicode Bidirectional Algorithm
Determine text direction runs
Handle mixed left-to-right and right-to-left content

2. Shaping Process

For each text run:

Canonical Decomposition (UAX #15):

Normalize character sequences
Handle composed vs. decomposed forms
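
The composed versus decomposed distinction is easy to see with Python's `unicodedata` module, which implements the normalization forms of UAX #15. The variable names are only for this example.

```python
import unicodedata

composed = "\u00E9"     # e-acute as a single code point
decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT

# The two strings look identical on screen but differ in storage.
print(composed == decomposed)                                # False

# Canonical decomposition (NFD) and composition (NFC) convert between them.
print(unicodedata.normalize("NFD", composed) == decomposed)  # True
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```

Normalizing up front lets the rest of the shaping process treat both spellings the same way.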

Cluster Analysis (UAX #29):

Identify character clusters that must be processed together
Critical for complex scripts like Devanagari, Arabic
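
The following is a deliberately reduced stand-in for cluster analysis: it attaches every mark (general category starting with "M") to the preceding character. This is an assumption-laden simplification and not the UAX #29 algorithm, which also covers conjuncts, Hangul syllables, emoji sequences, and more.

```python
import unicodedata


def simple_clusters(text):
    """Group each base character with the marks that follow it."""
    clusters = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        if is_mark and clusters:
            clusters[-1] += ch   # the mark belongs to the previous base
        else:
            clusters.append(ch)  # a new cluster starts here
    return clusters


# Latin base + combining accent, then Devanagari KA + vowel sign I.
print(len(simple_clusters("e\u0301a")))       # 2
print(len(simple_clusters("\u0915\u093F")))   # 1
```

Even this crude grouping shows why substitution and positioning must work on clusters rather than on individual code points.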

Glyph Substitution:

Apply contextual forms
Process ligatures
Handle language-specific variants

3. Positioning

Apply kerning adjustments
Position combining marks using anchor points
Handle cursive attachment
Calculate final glyph positions
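
Anchor-based mark positioning can be sketched as simple vector arithmetic: the mark is moved so that its own attachment point lands on the anchor of the base glyph. The anchor coordinates below are invented, and the dictionaries stand in for data a font would normally supply.

```python
# Hypothetical anchors in font units, relative to each glyph's origin.
BASE_ANCHORS = {"a": (512, 1100), "A": (683, 1500)}  # "top" anchor of the base
MARK_ANCHORS = {"acute": (0, 1050)}                  # attachment point of the mark


def place_mark(base, base_origin, mark):
    """Return the origin at which to draw the mark over the given base."""
    base_x, base_y = BASE_ANCHORS[base]
    mark_x, mark_y = MARK_ANCHORS[mark]
    origin_x, origin_y = base_origin
    # Make the mark's anchor coincide with the base's anchor.
    return (origin_x + base_x - mark_x, origin_y + base_y - mark_y)


print(place_mark("a", (0, 0), "acute"))     # (512, 50)
print(place_mark("A", (1000, 0), "acute"))  # (1683, 450)
```

The same accent sits higher over the taller capital letter, which is the height adjustment described earlier in these notes.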

Bidirectional Text Processing

Unicode Bidirectional Algorithm

Every character has a Bidi_Class property:

Strong LTR: Latin letters (L)
Strong RTL: Arabic, Hebrew letters (R, AL)
Neutral: Punctuation, symbols (neutrally directional)
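
The Bidi_Class values can be inspected with `unicodedata.bidirectional` from the Python standard library. The sample characters are chosen for this example; note that digits report EN, one of the weak classes that UAX #9 defines in addition to the strong and neutral ones listed above.

```python
import unicodedata

samples = {
    "a": "Latin letter",
    "\u05D0": "Hebrew alef",
    "\u0627": "Arabic alef",
    "1": "European digit",
    "!": "punctuation",
    " ": "space",
}

for ch, label in samples.items():
    # Prints L, R, AL, EN, ON and WS respectively.
    print(f"U+{ord(ch):04X} {label}: {unicodedata.bidirectional(ch)}")
```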

Processing Steps:

1. Assign embedding levels based on character properties
2. Create level runs of same directionality
3. Reorder glyphs within and between runs
4. Handle neutral characters based on context

Result: Text displays correctly regardless of storage order
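
The reordering step can be sketched in isolation. The function below assumes the embedding levels have already been resolved (the hard part of UAX #9, not shown) and applies only the reversal rule, L2: from the highest level down to the lowest odd level, reverse every contiguous sequence at that level or higher. Uppercase letters stand in for right-to-left characters, following the convention of the UAX #9 examples; the function name is hypothetical.

```python
def reorder_line(chars, levels):
    """Reorder one line from logical to visual order, given resolved levels."""
    chars = list(chars)
    odd_levels = [lvl for lvl in levels if lvl % 2 == 1]
    if not odd_levels:
        return chars  # purely left-to-right: nothing to reverse
    for level in range(max(levels), min(odd_levels) - 1, -1):
        i = 0
        while i < len(chars):
            if levels[i] >= level:
                j = i
                while j < len(chars) and levels[j] >= level:
                    j += 1
                chars[i:j] = reversed(chars[i:j])  # reverse this sequence
                i = j
            else:
                i += 1
    return chars


# Stored (logical) order "abc DEF" with DEF at an RTL level.
print("".join(reorder_line("abc DEF", [0, 0, 0, 0, 1, 1, 1])))  # abc FED
# RTL text followed by a number: the digits keep their own order.
print("".join(reorder_line("AB 12", [1, 1, 1, 2, 2])))          # 12 BA
```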

Font Fallback

When Fallback Occurs

When primary font lacks required glyphs:
- Individual characters missing
- Entire clusters unsupported
- Language-specific glyph variants needed

Context Considerations

User Preferences:
- Language settings
- Input method indicators
- Markup language tags

Font Matching Criteria:
- Classification: serif, sans-serif, cursive, monospace
  - Some classifications are specified by fonts themselves
  - Some are determined by other means (e.g., 仿宋, FangSong)
- Attributes: weight, width, italic/oblique
  - Fallback font may not exactly match all attributes
  - Variable fonts can be responsive to some attributes

Available Font Selection:
- Platform-dependent font sets
- Application-specific font lists
- Privacy considerations (web fonts)
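
A minimal sketch of a fallback chain is shown below. Font names and coverage sets are invented, and matching on classification, weight, or language is left out entirely. The point it illustrates is that the whole cluster has to be supported by a single font, not just its first character.

```python
# Hypothetical fonts in priority order, each with the code points it covers.
TOY_FONTS = [
    ("PrimarySans", set(range(0x0020, 0x007F))),
    ("FallbackArabic", set(range(0x0600, 0x0700)) | {0x0020}),
    ("FallbackCJK", set(range(0x4E00, 0xA000))),
]


def pick_font(cluster, fonts):
    """Return the first font covering every character of the cluster."""
    needed = {ord(ch) for ch in cluster}
    for name, coverage in fonts:
        if needed <= coverage:
            return name
    return None  # nothing suitable: the caller would show a missing glyph


print(pick_font("A", TOY_FONTS))        # PrimarySans
print(pick_font("\u0628", TOY_FONTS))   # FallbackArabic
print(pick_font("e\u0301", TOY_FONTS))  # None: no toy font has the accent
```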

Displaying Emoji

Emoji processing requires:
- Character property analysis
- Known sequence recognition
- Variation selector handling
- Color font format support

Color Font Formats
- Bitmapped: sbix, CBDT/CBLC tables
- Vector: COLR table (v0/v1), used with CPAL palettes
- SVG: SVG table

An emoji is not necessarily a single glyph: complex emoji may be drawn from multiple positioned glyphs.
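
Sequence recognition and variation selector handling can be hinted at with a few lines of string inspection. The `describe` helper is hypothetical and checks only for the zero width joiner (U+200D) and the two presentation selectors (U+FE0E, U+FE0F); a real implementation would consult the Unicode emoji data files.

```python
ZWJ = "\u200D"   # ZERO WIDTH JOINER, glues emoji into one sequence
VS15 = "\uFE0E"  # VARIATION SELECTOR-15, requests text presentation
VS16 = "\uFE0F"  # VARIATION SELECTOR-16, requests emoji presentation


def describe(seq):
    """Summarize the code points and presentation hints of a sequence."""
    if VS16 in seq:
        presentation = "emoji"
    elif VS15 in seq:
        presentation = "text"
    else:
        presentation = "default"
    return {
        "code_points": [f"U+{ord(ch):04X}" for ch in seq],
        "is_zwj_sequence": ZWJ in seq,
        "presentation": presentation,
    }


family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"  # man + ZWJ + woman + ZWJ + girl
heart = "\u2764\uFE0F"                                 # heavy black heart + VS16

print(describe(family))
print(describe(heart))
```

The family emoji is five code points in storage but is normally displayed as one image when the font supports the sequence.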

Multi-Line Layout

Line Breaking

Uses accumulated glyph width information to:
- Determine text that fits in available width
- Find appropriate break points
- Handle bidirectional content wrapping
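
A greedy line breaker driven by accumulated widths might look like the sketch below. It assumes break opportunities only at spaces, takes a caller-supplied `width_of` measuring function (hypothetical; in practice the width comes from shaped glyph runs), and ignores hyphenation and bidirectional wrapping. Production code would find break opportunities with the Unicode line breaking rules (UAX #14).

```python
def break_lines(words, width_of, space_width, max_width):
    """Greedily pack words into lines no wider than max_width."""
    lines, current, current_width = [], [], 0
    for word in words:
        word_width = width_of(word)
        extra = word_width if not current else space_width + word_width
        if current and current_width + extra > max_width:
            lines.append(current)  # the line is full: start a new one
            current, current_width = [word], word_width
        else:
            current.append(word)
            current_width += extra
    if current:
        lines.append(current)
    return lines


# Toy metric: every character is 10 units wide, a space is 5 units.
lines = break_lines("the quick brown fox".split(), lambda w: 10 * len(w), 5, 100)
print(lines)  # [['the', 'quick'], ['brown', 'fox']]
```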

Vertical Spacing

Font Metrics:
- Ascent: Distance above baseline
- Descent: Distance below baseline
- Line Gap: Additional spacing between lines

Applications may apply additional line spacing adjustments.
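
As a worked example of how these three metrics combine, the sketch below computes baseline positions for a block of lines. It assumes descent is given as a positive distance (some font tables store it as a negative number) and that no application-level adjustment is applied; the function name and values are illustrative.

```python
def baseline_positions(n_lines, ascent, descent, line_gap, top=0.0):
    """Return the y offset of each baseline, measured downward from `top`."""
    line_height = ascent + descent + line_gap
    # The first baseline sits one ascent below the top edge.
    return [top + ascent + i * line_height for i in range(n_lines)]


print(baseline_positions(3, ascent=800, descent=200, line_gap=100))
# [800, 1900, 3000]
```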

Common Display Problems

1. Invalid Clusters

Causes:
- Incorrect character sequences for script
- Components in wrong order
- Unicode normalization issues

Symptoms:
- Dotted circles indicating invalid combinations
- Missing or misplaced diacritical marks

2. Copy/Paste from PDF Issues

Problem: PDFs store glyph positions, not original text
- Advanced layout information lost
- Character-to-glyph mapping may be irreversible
- Copy/paste produces garbled text
Solution: Ensure PDFs embed proper text extraction data (for example, ToUnicode mappings or ActualText)

3. Font Style Mismatches

Causes:
- Fallback font doesn't match original style
- Limited font selection available
- Font classification mismatches
Note: Fallback prioritizes legibility over style matching

4. Text Truncated Vertically

Causes:
- Text controls sized for specific scripts
- Different writing systems have different vertical requirements
- Font metrics not properly accounted for

5. Encoding Errors

Not Text Layout Issues - Upstream Problems:

Transcoding Failures:
- UTF-8 interpreted as legacy encoding
- Double-encoding artifacts
- Replacement characters (�) indicating conversion failure

Legacy Software:
- Non-Unicode capable applications
- Question marks for unsupported characters
- Incomplete UTF-16 surrogate handling

6. Incorrect Parsing of UTF-8 or UTF-16 Sequences

Software that assumes the wrong encoding form misparses the stored bytes and displays garbage text
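
These upstream failures are easy to reproduce with Python's built-in codecs. The sample word is arbitrary; the snippet shows UTF-8 read as a legacy encoding, double encoding, a replacement character from invalid input, and a question mark from a non-Unicode target.

```python
original = "caf\u00E9"  # "café"
utf8_bytes = original.encode("utf-8")

# UTF-8 bytes interpreted as Latin-1: the accented letter becomes two characters.
mojibake = utf8_bytes.decode("latin-1")

# Repeating the mistake compounds the damage (double encoding).
double = mojibake.encode("utf-8").decode("latin-1")

# Latin-1 bytes read as UTF-8: the invalid byte becomes U+FFFD.
replaced = b"caf\xe9".decode("utf-8", errors="replace")

# Converting to an encoding that lacks the character yields "?".
legacy = original.encode("ascii", errors="replace")

print(mojibake, len(double), replaced, legacy)
```

None of these symptoms can be repaired by the layout engine, since the original characters are already lost or altered before shaping begins.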

Implementation Implications

Performance Considerations
- Text display is an extremely common operation
- Software is optimized for efficient processing
- Incremental updates for document editing
- Constraint analysis to minimize re-layout

Complexity Management
- Simple Scripts: May use optimized basic layout paths
- Complex Scripts: Require full advanced layout pipeline
- Modern Approach: Apply advanced layout universally for consistent typography

Development Guidelines
1. Don't rely on font fallback for proper localization
2. Test with target languages early in development
3. Understand platform differences in text processing
4. Plan for complex script requirements from the beginning

Debugging Text Display Issues

Diagnostic Approach
1. Identify the problem type:
  - Layout/positioning issue
  - Font fallback problem
  - Encoding/conversion error
  - Platform/software limitation
2. Gather information:
  - What font was actually used?
  - What text processing occurred?
  - What are the original character codes?
  - What platform/software environment?
3. Consult experts:
  - Text layout engineers
  - Language/script experts
  - Platform documentation
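
For the "original character codes" question in step 2, a small dump helper is often enough. The sketch below uses Python's `unicodedata` module; the `dump` function name and output layout are choices made for this example.

```python
import unicodedata


def dump(text):
    """List (code point, general category, character name) for each character."""
    return [
        (f"U+{ord(ch):04X}",
         unicodedata.category(ch),
         unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


for row in dump("e\u0301"):
    print(*row)
# U+0065 Ll LATIN SMALL LETTER E
# U+0301 Mn COMBINING ACUTE ACCENT
```

Seeing the stored sequence makes it much easier to tell an encoding problem from a layout or font fallback problem.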

Tools and Resources
- Unicode Character Database
- Script-specific documentation
- Platform text layout APIs
- Font inspection tools
- Text encoding validators

Key Takeaways
1. Text display is complex - What you see is the result of sophisticated processing
2. Character ≠ Glyph - One-to-many and many-to-one relationships are common
3. Context matters - Same characters may render differently based on surrounding text
4. Scripts vary widely - Solutions must accommodate diverse writing systems
5. Font data drives behavior - Advanced layout depends on font-provided rules
6. Testing is crucial - Problems often surface only with real-world multilingual content

Further Reading
- Unicode Standard: unicode.org
- UAX #15: Unicode Normalization Forms
- UAX #29: Unicode Text Segmentation
- UAX #9: Unicode Bidirectional Algorithm
- OpenType Specification: Microsoft Typography documentation
- Platform APIs: CoreText (Apple), DirectWrite (Microsoft), HarfBuzz documentation

Session recorded at Unicode Technical Workshop 2025
Notes compiled from presentation materials and transcript