Segmenting Complex Scripts with Machine Learning

Line and word breaks

Word breaks

Dictionary-based segmentation

Where does it fall short?

  • The dictionary is too large
  • New or specialized words are not easily recognized (xx-ing)
  • Longest match can fail by missing the correct shorter words
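
The longest-match failure can be sketched with a toy example. This is a minimal illustration, not ICU's implementation: the function name `longest_match_segment`, the tiny dictionary, and the unspaced English input are all assumptions chosen to make the problem visible.

```python
def longest_match_segment(text, dictionary):
    """Greedy left-to-right segmentation: always take the longest dictionary hit."""
    max_len = max(len(word) for word in dictionary)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a dictionary hit.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary:
                break
        else:
            j = i + 1  # no hit at all: emit a single character
        words.append(text[i:j])
        i = j
    return words


# Hypothetical toy dictionary; the intended reading is "the men dine".
TOY_DICTIONARY = {"the", "them", "men", "end", "dine"}

print(longest_match_segment("themendine", TOY_DICTIONARY))
```

Because "them" is longer than "the", the greedy pass commits to it and can no longer recover "men" or "dine"; the rest of the input falls apart into "end" and stray single characters.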

Two broad cases need different solutions

  • Southeast Asian (SEA)
  • East Asian (CJK)

CJK:

AdaBoost: many tiny rules each vote on whether a break is good; the combined votes decide the word boundaries.
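
The voting idea can be sketched as a weighted sum over simple features at each candidate boundary. This is a simplified illustration, not the actual BudoX model format: the feature names (`L`, `R`, `LR`), the hand-picked weights, and the zero threshold are all assumptions, and a feature that is absent simply casts no vote.

```python
# Hypothetical weights: a positive vote favors a break, a negative vote opposes it.
# In a trained model these would come from boosting, not from hand-tuning.
WEIGHTS = {
    ("R", "的"): 1.0,     # a break before 的 is usually good
    ("L", "的"): 1.0,     # a break after 的 is usually good
    ("L", "猫"): 0.5,
    ("L", "喜"): 0.3,     # weak vote for a break after 喜 ...
    ("LR", "喜欢"): -2.0,  # ... outvoted: 喜欢 should stay together
    ("R", "鱼"): 0.8,
}


def boundary_features(text, i):
    """Features for the candidate break between text[i-1] and text[i]."""
    return [("L", text[i - 1]), ("R", text[i]), ("LR", text[i - 1:i + 1])]


def segment(text, weights, threshold=0.0):
    """Break wherever the combined vote exceeds the threshold."""
    if not text:
        return []
    chunks = [text[0]]
    for i in range(1, len(text)):
        score = sum(weights.get(f, 0.0) for f in boundary_features(text, i))
        if score > threshold:
            chunks.append(text[i])   # votes favor a break: start a new chunk
        else:
            chunks[-1] += text[i]    # votes oppose a break: extend the chunk
    return chunks


print(segment("我的猫喜欢鱼", WEIGHTS))
```

No single rule decides the outcome: between 喜 and 欢 one rule votes for a break and another votes against it, and only the sum matters.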

RAdaBoost: radicals are the components of Han characters. Certain radicals frequently appear together, which provides useful cues for word segmentation.
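
One way to picture radical-based cues is to key the features on radicals instead of on the characters themselves. The sketch below is an assumption about how such features might look, not the actual RAdaBoost feature set: the `RADICALS` table is a tiny hand-written sample, and the feature names `RL`, `RR`, and `RLR` are hypothetical.

```python
# Tiny hand-written sample; a real system would need a full character-to-radical table.
RADICALS = {"河": "水", "海": "水", "湖": "水", "松": "木", "林": "木"}


def radical_features(text, i):
    """Radical-level features for the candidate break between text[i-1] and text[i]."""
    left = RADICALS.get(text[i - 1])
    right = RADICALS.get(text[i])
    features = []
    if left:
        features.append(("RL", left))
    if right:
        features.append(("RR", right))
    if left and right:
        # The co-occurrence of two radicals is the cue the notes describe.
        features.append(("RLR", left + right))
    return features


print(radical_features("河海", 1))
print(radical_features("湖海", 1))
```

Both pairs produce the same features because all three characters share the water radical, so a single learned weight can cover many character pairs.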

BudouX/RAdaBoost

AdaBoost learners

ICU dictionary: 2.0 MB

BudouX: zh-hant 64 KB, zh-hans 63 KB, Radical (all zh variants) 60 KB