OpenJDK's Thai Tokenizer explained

11 Jul 2014

This week is Twitter's Hackweek. Basically, we are allowed to build whatever we like, so I decided to learn more about Thai tokenization (also known as Thai word segmentation) in OpenJDK.

Here is how you can tokenize a Thai sentence:

private static final BreakIterator proto = BreakIterator.getWordInstance(new Locale("th")); // And just use proto normally

It works out of the box, and the result is pretty decent. OpenJDK ships with an embedded Thai dictionary and a set of rules for tokenizing Thai sentences. (Yeah, the Thai language is that special: it is written without spaces between words.)
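To make "use it normally" a bit more concrete, here is a minimal sketch of walking the boundaries that BreakIterator reports. The class name ThaiTokenizeSketch and the helper tokenize are names I made up for illustration. The sample text is ภาษาไทย ("Thai language"), written with \u escapes so the snippet compiles regardless of the source file encoding. I use Locale.forLanguageTag("th") here, which should select the same locale as new Locale("th"). I am not showing an expected output, because the exact segmentation depends on the dictionary data bundled with your JDK.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class ThaiTokenizeSketch {
    // Hypothetical helper: cuts the text at every word boundary reported by
    // BreakIterator and drops segments that contain only whitespace.
    static List<String> tokenize(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        List<String> tokens = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            if (!segment.trim().isEmpty()) {
                tokens.add(segment);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The Thai sample text, spelled with Unicode escapes.
        String thai = "\u0E20\u0E32\u0E29\u0E32\u0E44\u0E17\u0E22";
        System.out.println(tokenize(thai, Locale.forLanguageTag("th")));
    }
}
```

Note that punctuation comes back as its own segment, so a real tokenizer would probably want a stricter filter than "not blank".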

Here is the high-level overview of how it works:

  1. The BreakIterator for the Thai language asks BreakIteratorInfo_th how to instantiate a BreakIterator.
  2. DictionaryBasedBreakIterator is used, with thai_dict as the dictionary file and BreakIteratorRules_th as its rules. DictionaryBasedBreakIterator inherits from RuleBasedBreakIterator. It applies the rules to tokenize first. Then it uses the dictionary to further split each run of consecutive characters.
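The two-phase idea (rules first, dictionary second) can be illustrated with a toy sketch. To be clear, this is not OpenJDK's code: the class TwoPhaseSegmenterSketch, the two-word DICT, and the character-class "rules" are all hypothetical, and I swapped in a simple greedy longest-match lookup as a stand-in for whatever DictionaryBasedBreakIterator really does with thai_dict. It only shows the shape of the process: coarse runs come out of the rules, and only the Thai runs get split further by the dictionary.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class TwoPhaseSegmenterSketch {
    // Hypothetical toy dictionary with two Thai words (Unicode escapes).
    static final Set<String> DICT = new HashSet<>(Arrays.asList(
            "\u0E20\u0E32\u0E29\u0E32",  // "language"
            "\u0E44\u0E17\u0E22"));      // "Thai"

    // Character classes for the toy rules: 0 = whitespace, 1 = Thai, 2 = other.
    static int kind(char c) {
        if (Character.isWhitespace(c)) return 0;
        return Character.UnicodeBlock.of(c) == Character.UnicodeBlock.THAI ? 1 : 2;
    }

    // Phase 1 (stand-in for the rules): maximal runs of one character class,
    // with whitespace runs dropped.
    static List<String> splitByRules(String text) {
        List<String> runs = new ArrayList<>();
        int start = 0;
        for (int i = 1; i <= text.length(); i++) {
            if (i == text.length() || kind(text.charAt(i)) != kind(text.charAt(start))) {
                if (kind(text.charAt(start)) != 0) runs.add(text.substring(start, i));
                start = i;
            }
        }
        return runs;
    }

    // Phase 2 (stand-in for the dictionary pass): greedy longest match.
    // A character that starts no known word becomes a one-character token.
    static List<String> splitByDictionary(String run) {
        List<String> words = new ArrayList<>();
        int pos = 0;
        while (pos < run.length()) {
            int end = run.length();
            while (end > pos + 1 && !DICT.contains(run.substring(pos, end))) end--;
            words.add(run.substring(pos, end));
            pos = end;
        }
        return words;
    }

    // Runs the rules first, then applies the dictionary to Thai runs only.
    static List<String> segment(String text) {
        List<String> out = new ArrayList<>();
        for (String run : splitByRules(text)) {
            if (kind(run.charAt(0)) == 1) out.addAll(splitByDictionary(run));
            else out.add(run);
        }
        return out;
    }

    public static void main(String[] args) {
        // The Thai sample text followed by " 2014".
        System.out.println(segment("\u0E20\u0E32\u0E29\u0E32\u0E44\u0E17\u0E22 2014"));
    }
}
```

With this toy dictionary, the input ภาษาไทย 2014 comes back as three tokens: ภาษา, ไทย, and 2014. Greedy matching can pick the wrong split on ambiguous text, which is one reason a real implementation needs something smarter.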

Please note:

  • BreakIteratorRules_th is used to generate the file WordBreakIteratorData_th, which is what RuleBasedBreakIterator actually uses.
  • I cannot tell how thai_dict is generated. My guess is that it comes from thaidict.txt in ICU.
  • For more detail on how RuleBasedBreakIterator and DictionaryBasedBreakIterator work: here

I am still tracking down how the thai_dict file is generated in OpenJDK…