# Segmenting Complex Scripts with Machine Learning

Line and word breaks

Word breaks

#### Dictionary based segmentation

where it fall short?

- size is too large
- new or specialized words are not easily recognized (xx-ing)
- longest match can fail by missing correct shorter words

#### 2 Board cases needed difference solutions

- south east asian SEA
- East Asian CJK

#### CJK:

Adaboost: many tiny rules each vote on whether a break is good, combined votes decide word boundaries

RadaBoost: Radicals are the components of Han characters. Certain radicals frequently appear together, provides useful cues for word segmentation.

##### **BudoX/RAdaBoost**

AdaBoost learners

ICU dic 2.0M

BudoX zh-hant 64kb. zh-hans 63kb, Radical (all zh variants) 60kb