Unicode Technology Workshop 2025

Demystifying Unicode Text Display: From Unicode Code Points to Positioned Glyphs
Getting started with ICU4X
Grammatical Agreement with Unicode Inflection
Segmenting Complex Scripts with Machine Learning
Links with Non-ASCII: Unicode Detection and Display
Automated I18n Quality for Enterprise Platforms
End-to-end i18n system by TikTok

Demystifying Unicode Text Display: From Unicode Code Points to Positioned Glyphs

Conference Session Notes - Unicode Technology Workshop 2025
Presenters: Microsoft & Apple Text Layout Teams

Overview

This session explores the complex journey from Unicode characters in a file to the positioned glyphs you see on screen. Understanding this process is crucial for developers working on internationalization, text analysis, localization testing, and anyone dealing with multi-script content.

Why Learn About Text Display?

Software Development: Working on text display software or browsers
Writing Systems: Understanding how different scripts are implemented
Unicode Encoding: Planning to propose new writing system encodings
Localization: Testing scenarios with different languages and scripts
Text Analysis: Understanding the relationship between encoded characters and visual output

Text Display & Fonts

Core Concepts & Terminology

Key Terms

Characters - Abstract units stored in data files
Code Points - Numeric values representing characters in Unicode
String - Sequence of characters
Glyphs - Actual visual shapes rendered on screen
Glyph Run - Sequence of positioned glyphs
Font Family - Set of fonts sharing design traits (e.g., Arial)
Font Style - Specific variant within family (e.g., Arial Bold)

Critical Distinction

Character: Capital letter "A" (abstract concept)
Glyph: The specific visual shape of "A" from a particular font

Basic Text Layout Process

Simple Case: Single Line, Latin Characters

The most basic form of text layout involves:

Sequence of glyphs arranged on a baseline
Each glyph positioned adjacent to the previous one
Left-to-right progression

Font Data Structure

Font files are organized as databases containing:

Name Table: Strings describing font metadata
Glyph Table: Actual glyph outline data
Metrics: Measurements for font and individual glyphs
CMap Table: Character-to-glyph mapping

All data is organized into tables with 4-character mnemonic names.
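
To make the "database of tables" idea concrete, here is a minimal Python sketch that reads the table directory of a single-font sfnt file (TrueType/OpenType). The function name list_tables and the demo bytes are illustrative assumptions; the demo directory is hand-built and is not a real font. Font collections and WOFF containers use different headers and are not handled.

```python
import struct

def list_tables(font_bytes: bytes) -> dict:
    """Return {tag: (offset, length)} from an sfnt table directory."""
    # Header (12 bytes, big-endian): sfntVersion (uint32), then numTables,
    # searchRange, entrySelector, rangeShift (uint16 each).
    _version, num_tables, _sr, _es, _rs = struct.unpack_from(">IHHHH", font_bytes, 0)
    tables = {}
    for i in range(num_tables):
        # Each 16-byte record: 4-character tag, checksum, offset, length.
        tag, _checksum, offset, length = struct.unpack_from(
            ">4sIII", font_bytes, 12 + 16 * i
        )
        tables[tag.decode("latin-1")] = (offset, length)
    return tables

# Synthetic two-table directory, built by hand for illustration only.
demo = struct.pack(">IHHHH", 0x00010000, 2, 32, 1, 0)
demo += struct.pack(">4sIII", b"cmap", 0, 44, 100)
demo += struct.pack(">4sIII", b"glyf", 0, 144, 2000)

print(list_tables(demo))
```

With a real font, the same function could be applied to the bytes read from the file, and the tags printed would include names such as cmap, name, and GSUB.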

Character-to-Glyph Mapping

CMap Table provides initial character→glyph mapping
Glyph IDs are arbitrary numbers assigned by font designer
Not all characters may be supported by a given font
This is called the "nominal mapping" or "default glyph mapping"
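
As a rough illustration of the nominal mapping, the sketch below uses a made-up cmap for a tiny font. The glyph IDs 36 to 38 are arbitrary assumptions; the only fixed convention used is that glyph ID 0 is the .notdef ("missing glyph") entry.

```python
# Hypothetical cmap for a tiny font: Unicode code point -> glyph ID.
CMAP = {0x0041: 36, 0x0042: 37, 0x0043: 38}

# Glyph ID 0 is reserved for .notdef, shown when a character is unsupported.
NOTDEF = 0

def nominal_glyphs(text: str) -> list:
    """Map each character to its default glyph ID, or .notdef if missing."""
    return [CMAP.get(ord(ch), NOTDEF) for ch in text]

print(nominal_glyphs("ABC"))       # every character is covered
print(nominal_glyphs("A\u03A9"))   # GREEK CAPITAL LETTER OMEGA is not in this cmap
```

A .notdef result is typically what triggers font fallback in a real text stack.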

Glyph Positioning Basics

Each glyph has:

Origin Point: Where X=0, Y=baseline intersection
Left Side Bearing: Distance from origin to left edge
Advance Width: Distance to move for next glyph position
Outline Data: Control points defining the shape

Layout Process:

Align glyph origin with current drawing position
Render the glyph
Move drawing position by advance width
Repeat for next character
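
The four steps above can be sketched as a simple pen-advance loop. The Glyph class and the advance widths are hypothetical values in font units, not taken from any real font, and rendering is reduced to recording the x position.

```python
from dataclasses import dataclass

@dataclass
class Glyph:
    name: str
    advance: int  # advance width in font units (hypothetical values)

def layout(glyphs, start_x=0):
    """Place each glyph origin at the pen position, then advance the pen."""
    pen_x = start_x
    positions = []
    for glyph in glyphs:
        positions.append((glyph.name, pen_x))  # align origin with pen
        pen_x += glyph.advance                 # move by the advance width
    return positions, pen_x

run = [Glyph("T", 611), Glyph("o", 556), Glyph("p", 556)]
positions, total_width = layout(run)
print(positions, total_width)
```

The final pen position doubles as the run width, which is the value later used for line breaking.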

Advanced Layout Requirements

Simple character-by-character layout is insufficient for:

1. Kerning

Adjusting spacing between specific letter pairs for better visual balance

Example: "VA" or "To" - reducing space for optical balance
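
A minimal sketch of pair kerning, assuming made-up advance widths and kerning values. For brevity it keys on characters; a real font stores kerning per glyph (or glyph class), usually in the GPOS table.

```python
# Hypothetical metrics in font units.
ADVANCES = {"V": 667, "A": 667, "T": 611, "o": 556}
KERNING = {("V", "A"): -74, ("T", "o"): -111}

def layout_with_kerning(text):
    """Pen-advance layout with a per-pair adjustment before each glyph."""
    pen_x = 0
    positions = []
    for i, ch in enumerate(text):
        if i > 0:
            # Pairs without an entry get no adjustment.
            pen_x += KERNING.get((text[i - 1], ch), 0)
        positions.append((ch, pen_x))
        pen_x += ADVANCES[ch]
    return positions, pen_x

print(layout_with_kerning("VA"))
print(layout_with_kerning("To"))
```

Negative values pull the second glyph toward the first, which is the optical tightening described above.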

2. Contextual Positioning

Arabic Script Example:

Letters change shape based on position in word
Connecting scripts require precise glyph alignment
Marks above letters must adjust to letter height

3. Combining Marks

Diacritical Marks:

Must position accurately relative to base letters
Avoid collisions with other marks
Handle complex combinations (multiple accents)

4. Glyph Substitution

Contextual Forms:

Same character may need different glyphs based on context
Arabic: initial, medial, final, isolated forms
Complex scripts require cluster analysis

5. Ligature Substitution

Typographic Ligatures:

Replace character sequences with single composed glyphs
Example: "ffi" → single ligature glyph
Improves readability and aesthetics
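
Ligature substitution can be sketched as a sequence replacement over glyph names. The rule table and the glyph names such as f_f_i are illustrative assumptions; trying longer sequences first is a simplification of how a font orders its own GSUB rules.

```python
# Hypothetical ligature rules: glyph-name sequence -> single ligature glyph.
LIGATURES = {
    ("f", "f", "i"): "f_f_i",
    ("f", "f"): "f_f",
    ("f", "i"): "f_i",
}

def apply_ligatures(glyphs):
    """Replace matching sequences, preferring the longest rule at each position."""
    out, i = [], 0
    longest = max(len(seq) for seq in LIGATURES)
    while i < len(glyphs):
        for size in range(min(longest, len(glyphs) - i), 1, -1):
            seq = tuple(glyphs[i:i + size])
            if seq in LIGATURES:
                out.append(LIGATURES[seq])
                i += size
                break
        else:
            # No rule matched here: keep the glyph unchanged.
            out.append(glyphs[i])
            i += 1
    return out

print(apply_ligatures(list("office")))
```

Note that the output has fewer glyphs than the input has characters, which is one reason character-to-glyph mapping stops being one-to-one.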

6. Language-Specific Variants

Same character may have different appearances in different languages

Example: Cyrillic letters in Russian vs. Bulgarian

7. Bidirectional Text (BIDI)

Some writing systems require right-to-left text processing

Hebrew and Arabic written right-to-left
Mixed direction content requires Unicode Bidirectional Algorithm
Glyph reordering within clusters may be needed

Implications

Advanced line layout is required for high quality typography & many scripts
Complex character-to-glyph associations - no longer one-to-one mapping
Default glyph metrics alone don't determine final positions
Additional software logic is required beyond basic font data
Font-specific details drive advanced layout behavior

OpenType Layout System

Advanced Layout Engine Requirements

General Advanced Layout Logic
Script-Specific Behavior Logic: Based on Unicode character properties
Font-Specific Data: Substitution and positioning rules

OpenType Tables

GDEF (Glyph Definition Table): Classifies glyphs by type (base, mark, ligature, component)
GSUB (Glyph Substitution Table): Defines glyph substitution rules, handles contextual forms, ligatures, alternates
GPOS (Glyph Positioning Table): Defines positioning adjustments, handles kerning, mark positioning, cursive attachment

Text Shaping Engines

Platform-specific implementations:

CoreText (macOS)
DirectWrite (Windows)
HarfBuzz (Linux/Cross-platform)

Text Processing Pipeline

1. Run Segmentation

Script Itemization:

Segment text by Unicode script properties
Group characters requiring similar processing

BiDi Level Analysis:

Apply Unicode Bidirectional Algorithm
Determine text direction runs
Handle mixed left-to-right and right-to-left content

2. Shaping Process

For each text run:

Canonical Decomposition (UAX #15):

Normalize character sequences
Handle composed vs. decomposed forms
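
The composed versus decomposed distinction can be checked with Python's standard unicodedata module. This only illustrates the normalization forms themselves; how a given shaping engine uses them internally is not shown.

```python
import unicodedata

composed = "\u00E9"      # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# The two strings look identical on screen but are different sequences.
print(composed == decomposed)

nfd = unicodedata.normalize("NFD", composed)    # canonical decomposition
nfc = unicodedata.normalize("NFC", decomposed)  # canonical composition

print([f"U+{ord(c):04X}" for c in nfd])
print([f"U+{ord(c):04X}" for c in nfc])
```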

Cluster Analysis (UAX #29):

Identify character clusters that must be processed together
Critical for complex scripts like Devanagari, Arabic
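
As a deliberately simplified sketch of cluster analysis, the function below attaches every mark character (general category M*) to the preceding character. This is an assumption-laden approximation, not the UAX #29 algorithm: it ignores Hangul syllables, ZWJ sequences, emoji, and script-specific conjunct rules.

```python
import unicodedata

def simple_clusters(text):
    """Group each base character with the marks that follow it (approximation)."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch   # combining or spacing mark joins the cluster
        else:
            clusters.append(ch)  # anything else starts a new cluster
    return clusters

# "e" + combining acute, then "a": two clusters, three code points.
print(len(simple_clusters("e\u0301a")))

# Devanagari KA + VOWEL SIGN I: one cluster, two code points.
print(len(simple_clusters("\u0915\u093F")))
```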

Glyph Substitution:

Apply contextual forms
Process ligatures
Handle language-specific variants

3. Positioning

Apply kerning adjustments
Position combining marks using anchor points
Handle cursive attachment
Calculate final glyph positions
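
Anchor-based mark positioning reduces to simple vector arithmetic: the mark is moved so that its anchor lands on the base glyph's anchor. The function name and all coordinates below are hypothetical font-unit values chosen for illustration.

```python
def place_mark(base_origin, base_anchor, mark_anchor):
    """Return the mark's origin so that its anchor coincides with the base anchor.

    base_origin: (x, y) of the base glyph origin on the line
    base_anchor: anchor point relative to the base glyph origin
    mark_anchor: anchor point relative to the mark glyph origin
    """
    bx, by = base_origin
    return (bx + base_anchor[0] - mark_anchor[0],
            by + base_anchor[1] - mark_anchor[1])

# A base glyph at x=1000 with a top anchor at (250, 700),
# and an accent whose own anchor sits at (0, 650).
print(place_mark((1000, 0), (250, 700), (0, 650)))
```

A taller base glyph would simply carry a higher anchor, which is how marks adjust to letter height without any per-pair data.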

Bidirectional Text Processing

Unicode Bidirectional Algorithm

Every character has a Bidi_Class property:

Strong LTR: Latin letters (L)
Strong RTL: Arabic, Hebrew letters (R, AL)
Neutral: Punctuation, symbols (neutrally directional)

Processing Steps:

Assign embedding levels based on character properties
Create level runs of same directionality
Reorder glyphs within and between runs
Handle neutral characters based on context

Result: Text displays correctly regardless of storage order
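
The Bidi_Class property can be inspected with Python's standard unicodedata module. This sketch only looks up the property; it does not implement the embedding-level or reordering steps of the algorithm. The helper name bidi_classes is an assumption for illustration.

```python
import unicodedata

def bidi_classes(text):
    """Return the Bidi_Class short name of each character."""
    return [unicodedata.bidirectional(ch) for ch in text]

# Latin A, Hebrew ALEF, Arabic ALEF, exclamation mark, space.
sample = "A\u05D0\u0627! "
for ch, cls in zip(sample, bidi_classes(sample)):
    print(f"U+{ord(ch):04X} {cls}")
```

The output shows the strong classes L, R, and AL, followed by the neutral classes ON (other neutral) and WS (whitespace).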

Font Fallback

When Fallback Occurs

When primary font lacks required glyphs:

Individual characters missing
Entire clusters unsupported
Language-specific glyph variants needed
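
A minimal sketch of per-character fallback, assuming a made-up fallback chain. The font names and coverage ranges are hypothetical; real engines generally fall back per cluster and also weigh language, style, and user preferences.

```python
# Hypothetical fallback chain: (font name, code point coverage).
FALLBACK_CHAIN = [
    ("PrimarySans", range(0x0020, 0x007F)),
    ("HebrewSans", range(0x05D0, 0x05EB)),
    ("CJKSans", range(0x4E00, 0xA000)),
]

def pick_font(ch):
    """Return the first font in the chain that covers the character."""
    for name, coverage in FALLBACK_CHAIN:
        if ord(ch) in coverage:
            return name
    return None  # no font covers it: the missing-glyph box would be shown

def font_runs(text):
    """Split text into consecutive runs that share the same font."""
    runs = []
    for ch in text:
        font = pick_font(ch)
        if runs and runs[-1][0] == font:
            runs[-1][1] += ch
        else:
            runs.append([font, ch])
    return [(font, chunk) for font, chunk in runs]

print(font_runs("Hi \u05E9\u05DC\u05D5\u05DD \u4E2D\u6587"))
```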

Context Considerations

User Preferences:

Language settings
Input method indicators
Markup language tags

Font Matching Criteria:

Classification: serif, sans-serif, cursive, monospace
- Some classifications are specified by fonts themselves
- Some are determined by other means (e.g., 仿宋, FangSong)
Attributes: weight, width, italic/oblique
- Fallback font may not exactly match all attributes
- Variable fonts can be responsive to some attributes

Available Font Selection:

Platform-dependent font sets
Application-specific font lists
Privacy considerations (web fonts)

Display Emojis

Emoji processing requires:

Character property analysis
Known sequence recognition
Variation selector handling
Color font format support

Color Font Formats

Bitmapped: sbix, CBDT/CBLC tables
Vector: COLRv0/v1 tables
SVG: SVG table

Not necessarily one glyph per emoji - complex emoji may use multiple glyphs with positioning
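
To see why emoji need sequence recognition, the sketch below lists the code points behind three emoji that each display as a single image. It uses only the standard unicodedata module; the helper name describe is an assumption.

```python
import unicodedata

def describe(text):
    """List each code point with its Unicode name."""
    return [f"U+{ord(c):04X} {unicodedata.name(c, '<no name>')}" for c in text]

thumbs = "\U0001F44D\U0001F3FD"                         # base + skin tone modifier
heart = "\u2764\uFE0F"                                  # base + variation selector
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"   # ZWJ sequence

for label, emoji in [("thumbs", thumbs), ("heart", heart), ("family", family)]:
    print(label, len(emoji), "code points")
    for line in describe(emoji):
        print("  ", line)
```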

Multi-Line Layout

Line Breaking

Uses accumulated glyph width information to:

Determine text that fits in available width
Find appropriate break points
Handle bidirectional content wrapping
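
A greedy line breaker is one simple way to use accumulated widths. This sketch assumes break opportunities only at spaces and uses hypothetical word widths; real engines take break opportunities from the Unicode Line Breaking Algorithm (UAX #14) and measure shaped glyph runs.

```python
def break_lines(words, widths, space_width, max_width):
    """Greedily pack words into lines no wider than max_width."""
    lines, current, current_width = [], [], 0
    for word in words:
        w = widths[word]
        needed = w if not current else current_width + space_width + w
        if current and needed > max_width:
            lines.append(current)            # close the full line
            current, current_width = [word], w
        else:
            current.append(word)
            current_width = needed
    if current:
        lines.append(current)
    return lines

# Hypothetical measured widths in pixels.
widths = {"from": 40, "code": 40, "points": 60, "to": 20, "glyphs": 60}
words = ["from", "code", "points", "to", "glyphs"]
print(break_lines(words, widths, space_width=10, max_width=100))
```

A word wider than the available width is placed on its own line here; handling that case properly (hyphenation, emergency breaks) is outside this sketch.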

Vertical Spacing

Font Metrics:

Ascent: Distance above baseline
Descent: Distance below baseline
Line Gap: Additional spacing between lines

Applications may apply additional line spacing adjustments.

Common Display Problems

1. Invalid Clusters

Causes:

Incorrect character sequences for script
Components in wrong order
Unicode normalization issues

Symptoms:

Dotted circles indicating invalid combinations
Missing or misplaced diacritical marks

2. Copy/Paste from PDF Issues

Problem: PDFs store glyph positions, not original text

Advanced layout information lost
Character-to-glyph mapping may be irreversible
Copy/paste produces garbled text

Solution: Ensure PDFs embed proper text extraction data

3. Font Style Mismatches

Causes:

Fallback font doesn't match original style
Limited font selection available
Font classification mismatches

Note: Fallback prioritizes legibility over style matching

4. Text Truncated Vertically

Causes:

Text controls sized for specific scripts
Different writing systems have different vertical requirements
Font metrics not properly accounted for

5. Encoding Errors

Not Text Layout Issues - Upstream Problems:

Transcoding Failures:

UTF-8 interpreted as legacy encoding
Double-encoding artifacts
Replacement characters (�) indicating conversion failure

Legacy Software:

Non-Unicode capable applications
Question marks for unsupported characters
Incomplete UTF-16 surrogate handling

6. Incorrect Parsing of UTF-8 or UTF-16 Sequences

Software incorrectly assumes encoding format, leading to garbage text display
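
Several of these upstream failures are easy to reproduce with Python's built-in codecs. The sketch below is only a demonstration of the symptoms; the variable names are illustrative.

```python
text = "caf\u00E9"                 # "café"
utf8 = text.encode("utf-8")        # the accented letter becomes two bytes

# UTF-8 bytes misread as a legacy encoding (Windows-1252).
mojibake = utf8.decode("cp1252")

# The same mistake applied twice gives double-encoding artifacts.
double = mojibake.encode("utf-8").decode("cp1252")

# Legacy bytes misread as UTF-8: the invalid byte becomes U+FFFD.
replaced = b"caf\xe9".decode("utf-8", errors="replace")

# A single wrong decode is often reversible if no bytes were lost.
repaired = mojibake.encode("cp1252").decode("utf-8")

# Characters outside the BMP take two UTF-16 code units (a surrogate pair).
utf16_units = len("\U0001F600".encode("utf-16-be")) // 2

print(mojibake, double, replaced, repaired, utf16_units)
```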

Implementation Implications

Performance Considerations

Text display is an extremely common operation
Software optimized for efficient processing
Incremental updates for document editing
Constraint analysis to minimize re-layout

Complexity Management

Simple Scripts: May use optimized basic layout paths
Complex Scripts: Require full advanced layout pipeline
Modern Approach: Apply advanced layout universally for consistent typography

Development Guidelines

Don't rely on font fallback for proper localization
Test with target languages early in development
Understand platform differences in text processing
Plan for complex script requirements from the beginning

Debugging Text Display Issues

Diagnostic Approach

Identify the problem type:
- Layout/positioning issue
- Font fallback problem
- Encoding/conversion error
- Platform/software limitation
Gather information:
- What font was actually used?
- What text processing occurred?
- What are the original character codes?
- What platform/software environment?
Consult experts:
- Text layout engineers
- Language/script experts
- Platform documentation
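
For the "original character codes" question, a small dump helper is often the quickest first step. This sketch uses the standard unicodedata module; the helper name dump_code_points is an assumption.

```python
import unicodedata

def dump_code_points(text):
    """Return one line per code point: value, general category, and name."""
    return [
        f"U+{ord(ch):04X} {unicodedata.category(ch)} "
        f"{unicodedata.name(ch, '<unnamed>')}"
        for ch in text
    ]

# A string that renders as one accented letter but holds two code points.
for line in dump_code_points("e\u0301"):
    print(line)
```

Comparing such a dump for a working string and a broken one usually shows whether the problem is in the data (encoding, ordering, normalization) or further down in fonts and layout.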

Tools and Resources

Unicode Character Database
Script-specific documentation
Platform text layout APIs
Font inspection tools
Text encoding validators

Key Takeaways

Text display is complex - What you see is the result of sophisticated processing
Character ≠ Glyph - One-to-many relationships are common
Context matters - Same characters may render differently based on surrounding text
Scripts vary widely - Solutions must accommodate diverse writing systems
Font data drives behavior - Advanced layout depends on font-provided rules
Testing is crucial - Problems often surface only with real-world multilingual content

Getting started with ICU4X

Needs

Low latency requirements
Data heavy algorithms
Privacy implications
Rich UX
Network degradation resilience

ICU4X

Grammatical Agreement with Unicode Inflection

High computational cost and network latency

Top-tier languages are covered, but torso and tail languages lack data

Bias-based language

Where AI can help

Offline processing where latency or resources are not critical, such as grammar fixing and lexicon generation/expansion
Smaller, older, less costly models, such as LSTMs, could be used for higher coverage or better accuracy
Coverage for languages where we lack experts to write rules, or whose grammar is highly complex
Client-side support is slowly improving with nano models

Concept of Lemmaless inflection

Segmenting Complex Scripts with Machine Learning

Line and word breaks

Word breaks

Dictionary based segmentation

Where does it fall short?

Dictionary size is too large
New or specialized words are not easily recognized (xx-ing)
Longest match can fail by missing correct shorter words
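
The longest-match failure can be reproduced with a toy example. The dictionary and the unspaced English input are made up for illustration; the function is a generic greedy matcher, not ICU's implementation.

```python
def longest_match(text, dictionary):
    """Greedy segmentation: always take the longest dictionary word available."""
    words, i = [], 0
    max_len = max(len(w) for w in dictionary)
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character when nothing matches.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words

dictionary = {"the", "them", "man"}
print(longest_match("theman", dictionary))
```

The greedy choice of "them" leaves "an" unmatched, so the result is "them", "a", "n" even though "the" + "man" is a complete segmentation using shorter words.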

Two broad cases need different solutions

Southeast Asian (SEA)
East Asian (CJK)

CJK:

AdaBoost: many tiny rules each vote on whether a break is good; the combined votes decide word boundaries.

RAdaBoost: radicals are the components of Han characters. Certain radicals frequently appear together, which provides useful cues for word segmentation.
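
The voting idea can be sketched as a weighted sum over simple features around each candidate break. Everything here is an assumption for illustration: the feature names, the weights, and the bias are invented, and this is not the BudouX API or model.

```python
def score_break(text, i, weights, bias):
    """Sum the votes for a break between text[i-1] and text[i]."""
    features = [
        f"L1:{text[i - 1]}",            # character to the left
        f"R1:{text[i]}",                # character to the right
        f"B:{text[i - 1]}{text[i]}",    # the pair spanning the boundary
    ]
    return bias + sum(weights.get(f, 0) for f in features)

def segment(text, weights, bias=0):
    """Break wherever the combined vote is positive."""
    words, start = [], 0
    for i in range(1, len(text)):
        if score_break(text, i, weights, bias) > 0:
            words.append(text[start:i])
            start = i
    words.append(text[start:])
    return words

# Invented weights: pairs seen inside words vote against a break.
weights = {"B:今天": -5, "B:天气": -5, "B:天天": 3, "L1:天": 1, "R1:天": 1}
print(segment("今天天气", weights, bias=-1))
```

In a trained model the weights come from boosting over a labeled corpus; a radical-based variant would add features derived from each character's radical rather than the character itself.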

BudouX/RAdaBoost

AdaBoost learners

ICU dictionary: 2.0 MB

BudouX: zh-Hant 64 KB, zh-Hans 63 KB; Radical (all zh variants) 60 KB

Links with Non-ASCII: Unicode Detection and Display

Basically Unreadable with Percentage Codes

Draft UTS #58: Link Detection and Formatting
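
The readability problem with percent codes is easy to see with Python's standard urllib.parse module. The path below is an invented example; the sketch shows only encoding and decoding, not the link detection rules of the draft specification.

```python
from urllib.parse import quote, unquote

path = "/wiki/\u65E5\u672C\u8A9E"   # "/wiki/日本語"

# Non-ASCII characters are converted to UTF-8 bytes, then percent-encoded.
encoded = quote(path)
print(encoded)

# Decoding restores the readable form for display.
print(unquote(encoded))
```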

Automated I18n Quality for Enterprise Platforms

Globalization Readiness

Linguistic Quality
Extensibility
Maintainability
Time to market
Portability (standard based)

Reactive vs Proactive

Reactive: fix bugs, correct translations, troubleshoot; but customers will find issues before you do

Proactive: prevent bugs, establish best practices that are global-ready

Using AI out of the box

goose: Agentic vibe coding, but it does not use ICU and does not produce data that is ready for i18n.

LLM -> the most common answer, but statistically wrong.

Not using standard region codes.
Assumes only one language per region
Assumes only two forms for plural
Sloppy plural(s) construct in some languages
No gender handling
Embeds formatting and layout with content
Content for all locales in a single file
(not shown)
Poor phone structure as raw text
No attempt to find or use libraries for phone or address handling, or to use ICU or CLDR

Detect Issues in source content

Before entering the translation pipeline
Within Atlas, a platform for managing localization workflows
Rule-based linting
Using 3rd party lib: ilib-lint

GitHub -> CI (automated build) -> AWS -> Management platform -> GitHub/CI/Translator vendor

Detect issues in source code

Independent of translatable content
Much larger dataset
Build a custom scanner
Static Analysis + AI
Many programming languages
Custom integrations

i18n using AI + Self-Healing

Study: source code i18n self-healing using AI

Scan-train-refine
Known and discovered

Going forward with AI

i18n anti-pattern development
Scanning tool development
Fine tuning results
AI Training
Self-healing training
CI/CD Integration

End-to-end i18n system by TikTok

Part of TikTok Design System
SDK supporting TikTok locales and 200+ CLDR locales
Real business needs embedded
