Unicode Technology Workshop 2025

Demystifying Unicode Text Display: From Unicode Code Points to Positioned Glyphs

Conference Session Notes - Unicode Technology Workshop 2025
Presenters: Microsoft & Apple Text Layout Teams

Overview

This session traces the journey from Unicode characters in a file to the positioned glyphs you see on screen. Understanding this process is crucial for developers working on internationalization, text analysis, and localization testing, and for anyone dealing with multi-script content.

Why Learn About Text Display?

Text Display & Fonts

Core Concepts & Terminology

Key Terms
Critical Distinction

Basic Text Layout Process

Simple Case: Single Line, Latin Characters

The most basic form of text layout involves:

Font Data Structure

Font files are organized as databases containing:

All data is organized into tables with 4-character mnemonic names.
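
As a minimal sketch of what "tables with 4-character names" means in practice, the following reads the table directory at the start of an OpenType/TrueType (sfnt) file. The layout assumed here (a 12-byte header followed by 16-byte table records) follows the OpenType specification. The function name `list_tables` and the synthetic two-table header are illustrative assumptions, not material from the session.

```python
import struct

def list_tables(font_bytes: bytes) -> dict:
    """Return {tag: (offset, length)} read from an sfnt table directory."""
    # Header: sfntVersion (uint32), then numTables, searchRange,
    # entrySelector, rangeShift (uint16 each), all big-endian.
    _version, num_tables, _sr, _es, _rs = struct.unpack_from(">IHHHH", font_bytes, 0)
    tables = {}
    for i in range(num_tables):
        # Each record: 4-byte tag, then checksum, offset, length (uint32 each).
        tag, _checksum, offset, length = struct.unpack_from(
            ">4sIII", font_bytes, 12 + 16 * i
        )
        tables[tag.decode("ascii")] = (offset, length)
    return tables

# Synthetic header with two empty table records, for demonstration only
# (this is not a usable font).
demo = struct.pack(">IHHHH", 0x00010000, 2, 32, 1, 0)
demo += struct.pack(">4sIII", b"cmap", 0, 44, 0)
demo += struct.pack(">4sIII", b"glyf", 0, 44, 0)
print(list_tables(demo))
```

With a real font, the same function could be called on the bytes read from the file, for example `list_tables(open(path, "rb").read())`, where `path` is whatever font file is at hand.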

Character-to-Glyph Mapping
Glyph Positioning Basics

Each glyph has:

Layout Process:

  1. Align glyph origin with current drawing position
  2. Render the glyph
  3. Move drawing position by advance width
  4. Repeat for next character
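
The four steps above can be sketched as a simple pen-advance loop. The character map, glyph IDs, and advance widths below are hypothetical values in font units. A real engine would read them from the font and would actually rasterize each glyph instead of just recording its position.

```python
def layout_line(text, cmap, advances, start_x=0):
    """Return (placements, final_pen_x), where placements is [(glyph_id, x)]."""
    pen_x = start_x
    placed = []
    for ch in text:
        glyph = cmap.get(ch, 0)        # glyph 0 is conventionally ".notdef"
        placed.append((glyph, pen_x))  # steps 1-2: glyph origin at the pen, "render"
        pen_x += advances[glyph]       # step 3: move the pen by the advance width
    return placed, pen_x               # step 4 is the loop itself

# Hypothetical font data (font units).
cmap = {"A": 1, "V": 2}
advances = {0: 500, 1: 600, 2: 580}
print(layout_line("AVA", cmap, advances))
# ([(1, 0), (2, 600), (1, 1180)], 1780)
```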

Advanced Layout Requirements

Simple character-by-character layout is insufficient for:

1. Kerning

Adjusting spacing between specific letter pairs for better visual balance
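
A minimal sketch of pair kerning, assuming a lookup table keyed by (left glyph, right glyph). The glyph names, advances, and the -80 unit adjustment are invented for illustration. Real fonts store this data in the `kern` or `GPOS` tables, often by glyph class rather than by individual pair.

```python
def apply_kerning(glyphs, advances, kern_pairs):
    """Return x positions, adding a pair adjustment between neighboring glyphs."""
    xs, pen = [], 0
    for i, g in enumerate(glyphs):
        xs.append(pen)
        pen += advances[g]
        if i + 1 < len(glyphs):
            # Adjust the gap to the next glyph if this pair has a kerning value.
            pen += kern_pairs.get((g, glyphs[i + 1]), 0)
    return xs

# Hypothetical data (font units).
advances = {"A": 600, "V": 580}
kern_pairs = {("A", "V"): -80, ("V", "A"): -80}

print(apply_kerning(["A", "V", "A"], advances, {}))          # [0, 600, 1180]
print(apply_kerning(["A", "V", "A"], advances, kern_pairs))  # [0, 520, 1020]
```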

2. Contextual Positioning

Arabic Script Example:

3. Combining Marks

Diacritical Marks:

4. Glyph Substitution

Contextual Forms:

5. Ligature Substitution

Typographic Ligatures:
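
As a rough illustration of ligature substitution, the sketch below replaces glyph sequences with a single ligature glyph, preferring the longest match at each position. The glyph names (`f_i`, `f_f_i`) and the longest-match policy are assumptions made for the example. In a real font the rules live in the `GSUB` table and are applied in the order the font defines.

```python
def apply_ligatures(glyphs, ligatures):
    """Replace the longest matching glyph sequence at each position."""
    out, i = [], 0
    max_len = max((len(seq) for seq in ligatures), default=1)
    while i < len(glyphs):
        # Try the longest candidate first, down to sequences of length 2.
        for n in range(min(max_len, len(glyphs) - i), 1, -1):
            seq = tuple(glyphs[i:i + n])
            if seq in ligatures:
                out.append(ligatures[seq])
                i += n
                break
        else:
            # No ligature starts here: keep the glyph unchanged.
            out.append(glyphs[i])
            i += 1
    return out

# Hypothetical ligature rules.
ligatures = {("f", "i"): "f_i", ("f", "f", "i"): "f_f_i"}
print(apply_ligatures(list("office"), ligatures))  # ['o', 'f_f_i', 'c', 'e']
print(apply_ligatures(list("fit"), ligatures))     # ['f_i', 't']
```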

6. Language-Specific Variants

The same character may have different appearances in different languages

7. Bidirectional Text (BIDI)

Some writing systems require right-to-left text processing

Implications

OpenType Layout System

Advanced Layout Engine Requirements
OpenType Tables
Text Shaping Engines

Platform-specific implementations:

Text Processing Pipeline

1. Run Segmentation

Script Itemization:

BiDi Level Analysis:

2. Shaping Process

For each text run:

Canonical Decomposition (UAX #15):
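
A small example of canonical decomposition using Python's standard `unicodedata` module. It only shows what the normalization forms do to a string. Whether and when a particular shaping engine decomposes or recomposes text is engine-specific and is not claimed here.

```python
import unicodedata

composed = "\u00E9"  # "é" as a single precomposed code point
decomposed = unicodedata.normalize("NFD", composed)

print([f"U+{ord(c):04X}" for c in decomposed])  # ['U+0065', 'U+0301']

# Recomposing gives back the original precomposed form.
recomposed = unicodedata.normalize("NFC", decomposed)
print(recomposed == composed)  # True
```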

Cluster Analysis (UAX #29):
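
The sketch below is a deliberately simplified approximation of cluster analysis, not an implementation of UAX #29. It attaches combining marks (general category M*) to the preceding character and keeps characters joined by ZERO WIDTH JOINER together. The function name `rough_clusters` is hypothetical, and the real grapheme cluster rules cover many more cases (Hangul syllables, regional indicators, Indic conjuncts, and others).

```python
import unicodedata

ZWJ = "\u200D"

def rough_clusters(text):
    """Group characters into approximate clusters (simplified, not full UAX #29)."""
    clusters = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        joins_previous = bool(clusters) and (
            is_mark or ch == ZWJ or clusters[-1].endswith(ZWJ)
        )
        if joins_previous:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

# "e" + COMBINING ACUTE ACCENT forms one cluster; "x" forms another.
print(rough_clusters("e\u0301x") == ["e\u0301", "x"])  # True
```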

Glyph Substitution:

3. Positioning

Bidirectional Text Processing

Unicode Bidirectional Algorithm

Every character has a Bidi_Class property:

Processing Steps:

  1. Assign embedding levels based on character properties
  2. Create level runs of same directionality
  3. Handle neutral characters based on context
  4. Reorder glyphs within and between runs

Result: Text stored in logical order is displayed in the correct visual order
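
The following sketch illustrates two pieces of this process: looking up each character's Bidi_Class with the standard `unicodedata` module, and the final reordering step (rule L2 of UAX #9), which reverses runs from the highest level down to the lowest odd level. The embedding levels are assigned by hand here as an assumption. A full implementation derives them from the earlier rules of the algorithm.

```python
import unicodedata

def reorder_by_levels(chars, levels):
    """UAX #9 rule L2: from the highest level down to the lowest odd level,
    reverse every contiguous run of characters at that level or higher."""
    chars = list(chars)
    odd_levels = [lv for lv in levels if lv % 2]
    if not odd_levels:
        return chars  # purely left-to-right: nothing to reverse
    for level in range(max(levels), min(odd_levels) - 1, -1):
        i = 0
        while i < len(chars):
            if levels[i] >= level:
                j = i
                while j < len(chars) and levels[j] >= level:
                    j += 1
                chars[i:j] = reversed(chars[i:j])
                i = j
            else:
                i += 1
    return chars

logical = "ab \u05D0\u05D1\u05D2"  # "ab " followed by Hebrew alef, bet, gimel
print([unicodedata.bidirectional(c) for c in logical])
# ['L', 'L', 'WS', 'R', 'R', 'R']

levels = [0, 0, 0, 1, 1, 1]  # hand-assigned for an LTR paragraph (assumption)
visual = "".join(reorder_by_levels(logical, levels))
print(visual == "ab \u05D2\u05D1\u05D0")  # True: the Hebrew run is reversed
```

Nested levels work the same way: with levels `[1, 1, 2, 2, 1]`, the level-2 run (for example, digits inside right-to-left text) is reversed first and then the whole level-1 run is reversed, so the digits end up in their original left-to-right order.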

Font Fallback

When Fallback Occurs

When primary font lacks required glyphs:
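
A minimal sketch of a fallback chain, assuming each font's coverage is known as a set of characters. The font names and coverage sets are hypothetical. Real systems typically decide per cluster or run rather than per character, and also weigh the user preferences and matching criteria listed in the following sections.

```python
def pick_fonts(text, font_chain):
    """For each character, pick the first font in the chain that covers it."""
    result = []
    for ch in text:
        chosen = next(
            (name for name, coverage in font_chain if ch in coverage), None
        )
        result.append((ch, chosen))  # None means no font in the chain has a glyph
    return result

# Hypothetical fonts and coverage.
font_chain = [
    ("PrimaryLatin", set("abcdefghijklmnopqrstuvwxyz ")),
    ("FallbackGreek", set("\u03B1\u03B2\u03B3\u03B4")),  # alpha, beta, gamma, delta
]
print(pick_fonts("a \u03B2?", font_chain))
# [('a', 'PrimaryLatin'), (' ', 'PrimaryLatin'), ('β', 'FallbackGreek'), ('?', None)]
```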

Context Considerations

User Preferences:

Font Matching Criteria:

Available Font Selection:

Displaying Emoji

Emoji processing requires:

Color Font Formats

Not necessarily one glyph per emoji - complex emoji may use multiple glyphs with positioning
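
The reverse mismatch also exists on the character side: one visible emoji is often several code points. The short example below, using only the standard library, lists the code points of a ZWJ sequence and of a flag. How many glyphs a font uses to draw either one depends on the font and is not shown here.

```python
import unicodedata

technologist = "\U0001F469\u200D\U0001F4BB"  # WOMAN + ZWJ + PERSONAL COMPUTER
flag = "\U0001F1EF\U0001F1F5"                # two regional indicator symbols (J, P)

for ch in technologist:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

print(len(technologist))  # 3 code points, intended to display as one emoji
print(len(flag))          # 2 code points, intended to display as one flag
```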

Multi-Line Layout

Line Breaking

Uses accumulated glyph width information to:
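
One common way to use accumulated widths is greedy line filling, sketched below. The sketch assumes break opportunities only at spaces and a hypothetical fixed-width metric. Real engines take break opportunities from the Unicode Line Breaking Algorithm (UAX #14) and measure shaped glyph advances.

```python
def break_lines(words, width_of, max_width, space_width):
    """Greedy line filling: keep adding words while the accumulated width fits."""
    lines, current, used = [], [], 0
    for word in words:
        w = width_of(word)
        needed = w if not current else used + space_width + w
        if current and needed > max_width:
            lines.append(current)        # the word does not fit: start a new line
            current, used = [word], w
        else:
            current.append(word)         # a word alone on a line is always accepted
            used = needed
    if current:
        lines.append(current)
    return lines

# Hypothetical metric: every glyph is 10 units wide.
def width_of(word):
    return 10 * len(word)

print(break_lines("the quick brown fox".split(), width_of, max_width=110, space_width=10))
# [['the', 'quick'], ['brown', 'fox']]
```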

Vertical Spacing

Font Metrics:

Applications may apply additional line spacing adjustments.

Common Display Problems

1. Invalid Clusters

Causes:

Symptoms:

2. Copy/Paste from PDF Issues

Problem: PDFs store glyph positions, not original text

Solution: Ensure PDFs embed proper text extraction data

3. Font Style Mismatches

Causes:

Note: Fallback prioritizes legibility over style matching

4. Text Truncated Vertically

Causes:

5. Encoding Errors

Not Text Layout Issues - Upstream Problems:

Transcoding Failures:

Legacy Software:

6. Incorrect Parsing of UTF-8 or UTF-16 Sequences

Software assumes the wrong encoding format, which leads to garbled text on display
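
A short reproduction of this failure mode, using only built-in codecs. The sample string is arbitrary. The point is that the bytes are valid and only the assumed encoding is wrong.

```python
original = "caf\u00E9"  # "café"

# UTF-8 bytes misread as Latin-1: each byte of the 2-byte sequence
# for "é" becomes its own character.
utf8_bytes = original.encode("utf-8")
mojibake = utf8_bytes.decode("latin-1")
print(utf8_bytes)  # b'caf\xc3\xa9'
print(mojibake)    # cafÃ©

# UTF-16 bytes misread as UTF-8: NUL bytes appear between letters, and
# the invalid sequence is replaced with U+FFFD.
utf16_bytes = original.encode("utf-16-le")
misread_utf16 = utf16_bytes.decode("utf-8", errors="replace")
print(repr(misread_utf16))
```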

Implementation Implications

Performance Considerations
Complexity Management
Development Guidelines
  1. Don't rely on font fallback for proper localization
  2. Test with target languages early in development
  3. Understand platform differences in text processing
  4. Plan for complex script requirements from the beginning

Debugging Text Display Issues

Diagnostic Approach
  1. Identify the problem type:
    • Layout/positioning issue
    • Font fallback problem
    • Encoding/conversion error
    • Platform/software limitation
  2. Gather information:
    • What font was actually used?
    • What text processing occurred?
    • What are the original character codes?
    • What platform/software environment?
  3. Consult experts:
    • Text layout engineers
    • Language/script experts
    • Platform documentation
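
For the "original character codes" question in step 2, a small helper like the one below can make invisible differences visible. It is a sketch using the standard `unicodedata` module, and `dump_code_points` is a hypothetical name rather than a tool mentioned in the session.

```python
import unicodedata

def dump_code_points(text):
    """List each code point with its Unicode name, in stored (logical) order."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text
    ]

# The same visible "é" can be stored in two different ways.
print(dump_code_points("\u00E9"))
# ['U+00E9 LATIN SMALL LETTER E WITH ACUTE']
print(dump_code_points("e\u0301"))
# ['U+0065 LATIN SMALL LETTER E', 'U+0301 COMBINING ACUTE ACCENT']
```
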
Tools and Resources

Key Takeaways

  1. Text display is complex - What you see is the result of sophisticated processing
  2. Character ≠ Glyph - One-to-many and many-to-one relationships are common
  3. Context matters - Same characters may render differently based on surrounding text
  4. Scripts vary widely - Solutions must accommodate diverse writing systems
  5. Font data drives behavior - Advanced layout depends on font-provided rules
  6. Testing is crucial - Problems often surface only with real-world multilingual content

Further Reading



Session recorded at Unicode Technology Workshop 2025
Notes compiled from presentation materials and transcript

Getting started with ICU4X

Needs

ICU4X



Grammatical Agreement with Unicode Inflection

AI

High computational cost and network latency

Top-tier languages are covered, but torso and tail languages lack data

Bias-based language


Where AI can help
Concept of Lemmaless inflection


Segmenting Complex Scripts with Machine Learning

Line and word breaks

Word breaks

Dictionary based segmentation

Where does it fall short?

Two broad cases need different solutions

CJK:

AdaBoost: many tiny rules each vote on whether a break is good; the combined votes decide word boundaries
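
The voting idea can be sketched as a weighted sum: each rule that fires at a candidate boundary contributes its weight, and a positive total means "break here". The feature templates, the weights, and the function names below are all invented for illustration. They are not the session's model or any library's actual features, and a trained AdaBoost model would learn the weights from segmented text.

```python
def features_at(text, i):
    """Hypothetical features for the boundary between text[i-1] and text[i]."""
    return [f"prev:{text[i-1]}", f"next:{text[i]}", f"pair:{text[i-1]}{text[i]}"]

def should_break(features, weights, bias=0.0):
    """Sum the votes of the rules that fire; break if the total is positive."""
    return bias + sum(weights.get(f, 0.0) for f in features) > 0

def segment(text, weights):
    """Split text at every boundary where the combined vote is positive."""
    if not text:
        return []
    words, start = [], 0
    for i in range(1, len(text)):
        if should_break(features_at(text, i), weights):
            words.append(text[start:i])
            start = i
    words.append(text[start:])
    return words

# Invented weights: positive votes for a break, negative votes against.
weights = {"prev:我": 1.0, "prev:爱": 1.5, "next:京": -0.5, "pair:北京": -2.0}
print(segment("我爱北京", weights))  # ['我', '爱', '北京']
```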

RAdaBoost: Radicals are the components of Han characters. Certain radicals frequently appear together, which provides useful cues for word segmentation.

BudouX/RAdaBoost

AdaBoost learners

ICU dictionary: 2.0 MB

BudouX: zh-Hant 64 KB, zh-Hans 63 KB; Radical (all zh variants) 60 KB




Links with Non-ASCII: Unicode Detection and Display

Basically Unreadable When Percent-Encoded
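
To make the readability problem concrete, the example below percent-encodes and decodes a path with Python's standard `urllib.parse`. The path itself is a made-up example.

```python
from urllib.parse import quote, unquote

path = "/wiki/日本語"      # hypothetical path containing non-ASCII text
encoded = quote(path)      # UTF-8 bytes, each written as %XX ("/" is kept as-is)
print(encoded)             # /wiki/%E6%97%A5%E6%9C%AC%E8%AA%9E
print(unquote(encoded))    # /wiki/日本語
```

Three characters become 27 characters of percent codes, which is why displaying the decoded form matters for users.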

Automated I18n Quality for Enterprise Platforms

Globalization Readiness
Reactive vs Proactive

Reactive: fix bugs, correct translations, and troubleshoot, but customers will find issues before you do

Proactive: prevent bugs and establish best practices that are global-ready

Using AI out of the box

goose: agentic vibe coding, but it does not use ICU and does not deal with data that is ready for i18n.

LLM -> gives the statistically most common answer, which can still be wrong.

Detect Issues in source content

GitHub -> CI (automated build) -> AWS -> Management platform -> GitHub/CI/Translation vendor

Detect issues in source code
i18n using AI + Self-Healing

Study: source code i18n self-healing using AI

Going forward with AI

End-to-end i18n system by TikTok