Segmenting Complex Scripts with Machine Learning
Line and word breaks
Word breaks
Dictionary-based segmentation
Where does it fall short?
- the dictionary is too large
- new or specialized words are not easily recognized (xx-ing)
- longest match can fail by skipping over the correct, shorter words
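The longest-match failure can be sketched with a toy example. This is a minimal illustration, not ICU's implementation: the function name, the tiny dictionary, and the unspaced English input are all assumptions chosen for readability.

```python
def longest_match(text, dictionary):
    """Greedy forward longest match: at each position take the longest
    dictionary word; fall back to a single character if nothing matches."""
    max_len = max(map(len, dictionary))
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy dictionary; the intended reading of "themend" is "the" + "mend".
toy_dictionary = {"the", "them", "theme", "mend", "end"}

print(longest_match("themend", toy_dictionary))
```

Because "theme" is the longest entry that matches at the start, the greedy pass commits to it and leaves "n" and "d" stranded as single characters, even though "the" + "mend" is fully covered by the dictionary.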
Two broad cases need different solutions
- Southeast Asian (SEA)
- East Asian (CJK)
CJK:
AdaBoost: many tiny rules each vote on whether a break is good; the combined, weighted votes decide word boundaries
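The voting idea can be sketched as follows. This is a toy under stated assumptions: the feature names, the rule format, and the weights are hypothetical and hand-picked here, whereas in AdaBoost the weights would be learned from training data. It is not the API of any particular library.

```python
def features_at(text, i):
    """Features describing the candidate break between text[i-1] and text[i]."""
    return {
        "L1:" + text[i - 1],        # character left of the boundary
        "R1:" + text[i],            # character right of the boundary
        "B:" + text[i - 1:i + 1],   # bigram straddling the boundary
    }


# Hypothetical weak rules: (feature, polarity, weight).
# If the feature is present the rule votes `polarity` (+1 = break, -1 = no break);
# if it is absent the rule votes the opposite way.
RULES = [
    ("B:喜欢", -1, 2.0),
    ("B:北京", -1, 2.0),
    ("L1:我", +1, 1.0),
    ("R1:北", +1, 1.5),
]


def vote(text, i, rules):
    """Weighted sum of all rule votes for the boundary before text[i]."""
    feats = features_at(text, i)
    return sum(
        weight * (polarity if feature in feats else -polarity)
        for feature, polarity, weight in rules
    )


def segment(text, rules):
    """Break wherever the combined vote is positive."""
    words, start = [], 0
    for i in range(1, len(text)):
        if vote(text, i, rules) > 0:
            words.append(text[start:i])
            start = i
    words.append(text[start:])
    return words


print(segment("我喜欢北京", RULES))
```

No single rule is reliable on its own; the decision at each boundary comes from the sign of the weighted total, so the model is just a list of features and weights rather than a word list.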
RAdaBoost: radicals are the components of Han characters. Certain radicals frequently appear together, which provides useful cues for word segmentation.
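One way radical cues could be turned into features is sketched below. The feature names and the hand-written character-to-radical table are assumptions for illustration only; a real system would take radicals from a full data source, and the notes do not specify how the features are actually built.

```python
# Hypothetical, hand-written table: character -> Kangxi radical.
RADICAL_OF = {
    "河": "水", "海": "水", "湖": "水",   # water radical
    "林": "木", "樹": "木", "森": "木",   # tree radical
}


def radical_features(text, i):
    """Radical-level features for the candidate break between text[i-1] and text[i]."""
    left = RADICAL_OF.get(text[i - 1], "?")    # "?" marks an unknown radical
    right = RADICAL_OF.get(text[i], "?")
    return {
        "RL:" + left,           # radical left of the boundary
        "RR:" + right,          # radical right of the boundary
        "RB:" + left + right,   # radical pair straddling the boundary
    }


# Different character pairs collapse to the same radical-level features.
print(radical_features("森林", 1) == radical_features("樹林", 1))
```

Since many characters share a radical, one radical-level rule can apply to character pairs that a character-level rule would have to list separately.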
BudouX/RAdaBoost
AdaBoost learners
ICU dictionary: 2.0 MB
BudouX: zh-Hant 64 KB, zh-Hans 63 KB; Radical (all zh variants): 60 KB