OpenJDK's Thai Tokenizer explained
11 Jul 2014
This week is Twitter's Hackweek. Basically, we are allowed to build whatever we like. So I decided to learn more about Thai tokenization (or Thai word segmentation) in OpenJDK.
Here is how you can tokenize a Thai sentence:
private static final BreakIterator proto = BreakIterator.getWordInstance(new Locale("th"));
// Then use proto as usual: clone() it, call setText(...), and walk the boundaries with first()/next()
It works out of the box, and the result is pretty decent. OpenJDK itself embeds a Thai dictionary, together with rules, to tokenize Thai sentences. (Yeah, the Thai language is that special: it is written without spaces between words.)
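To make "use proto as usual" concrete, here is a minimal sketch of walking the boundaries. The class name ThaiSegments and the method segments are my own names for illustration, and I use Locale.forLanguageTag("th") instead of new Locale("th") only to avoid a deprecation warning on newer JDKs. The exact segments you get depend on the dictionary and rules shipped with your JDK, so no expected output is shown.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class ThaiSegments {
    // Shared prototype. BreakIterator is stateful, so every call works on its own clone.
    private static final BreakIterator PROTO =
            BreakIterator.getWordInstance(Locale.forLanguageTag("th"));

    /** Returns every segment between consecutive word boundaries, in order. */
    public static List<String> segments(String text) {
        BreakIterator it = (BreakIterator) PROTO.clone();
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        // next() returns BreakIterator.DONE once the end of the text is passed.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            out.add(text.substring(start, end));
        }
        return out;
    }

    public static void main(String[] args) {
        // Print each segment in brackets so the boundaries are visible.
        for (String s : segments("สวัสดีครับ hello")) {
            System.out.println("[" + s + "]");
        }
    }
}
```

Note that the segments include whitespace and punctuation too; filter those out if you only want words.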
Here is the high-level overview of how it works:
- BreakIterator for the Thai language asks BreakIteratorInfo_th how to instantiate a BreakIterator.
- DictionaryBasedBreakIterator is used, with thai_dict as the dictionary file and BreakIteratorRules_th as its rules. DictionaryBasedBreakIterator inherits from RuleBasedBreakIterator. It uses the rules to tokenize first. Then it uses the dictionary to further split the runs of consecutive Thai characters that the rules alone cannot break up.
- BreakIteratorRules_th is used to generate the file WordBreakIteratorData_th, which is used by RuleBasedBreakIterator.
- I cannot tell how thai_dict is generated. My guess is that it comes from thaidict.txt in ICU.
- For more detail on how RuleBasedBreakIterator and DictionaryBasedBreakIterator work: here
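To illustrate the "rules first, dictionary second" idea from the list above, here is a toy sketch. It is not OpenJDK's implementation: the class TwoStageSketch, the three-word dictionary, the "is this a Thai character" check standing in for the rules, and the greedy longest-match strategy are all assumptions I picked to keep the example small.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class TwoStageSketch {
    // Hypothetical three-word dictionary standing in for thai_dict.
    private static final Set<String> DICT =
            new HashSet<>(Arrays.asList("ภาษา", "ไทย", "ง่าย"));

    // Crude stand-in for the rules: only tells Thai characters from everything else.
    private static boolean isThai(char c) {
        return Character.UnicodeBlock.of(c) == Character.UnicodeBlock.THAI;
    }

    // Stage 2: greedy longest match against the dictionary (an assumed strategy).
    private static void splitRun(String run, List<String> out) {
        int i = 0;
        while (i < run.length()) {
            int end = i + 1; // fall back to a single character if nothing matches
            for (int j = run.length(); j > i; j--) {
                if (DICT.contains(run.substring(i, j))) {
                    end = j;
                    break;
                }
            }
            out.add(run.substring(i, end));
            i = end;
        }
    }

    // Stage 1: cut the text into Thai and non-Thai runs, then refine the Thai runs.
    public static List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            boolean thai = isThai(text.charAt(i));
            int j = i;
            while (j < text.length() && isThai(text.charAt(j)) == thai) {
                j++;
            }
            String run = text.substring(i, j);
            if (thai) {
                splitRun(run, out);
            } else {
                out.add(run); // non-Thai runs are left untouched in this sketch
            }
            i = j;
        }
        return out;
    }
}
```

With this toy dictionary, tokenize("ภาษาไทยง่าย") splits the single run of Thai characters into ภาษา, ไทย, and ง่าย. Greedy matching can pick the wrong split on ambiguous input, which is one reason a real implementation needs something smarter.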
I am still tracking down how the thai_dict file is generated in OpenJDK…