Quick-win on OpenJDK's Thai Tokenization

13 Jul 2014

OpenJDK has pretty decent Thai tokenization library. Nevertheless, there are some quick wins that can be implemented.

For example, there are some patterns that are not sensible whatsoever. See this result from OpenJDK's BreakIterator:

ประเทศ|ไทย|มี|บริการ|เท|เลเท็ก|ซ์โดย|ไม่|คิด|มูลค่า|ทาง|ช่อง| |5| |มา|นาน|กว่า| |4| |ปี|แล้ว

ซ์ cannot possibly be the start of a word. Or this result:

ยาง|มัด|ผม|ห่อ|นี้| |30| |บาท| |คือ|แพง|แล้ว|นะ|สำหรับ|สำ|เพ็ง| |ถ้า|เป็น|ชลบุรี|หรอ| |สี|ละ| |20| |อ่ะ| |แพง|จุ|ง

ง (a single character) cannot possibly be a word.

OpenJDK uses a recursive dictionary matching in the sentence to tokenize it. I'm pretty sure it's as fast as recursion, since it can back up the word boundary and consider another possibility. The algorithm is explained here

With an elastic matching approach, we can simply incorporate these forbidden rules it. It should not be slower than the current algorithm.

Here is the high-level overview:

Let a1...an be a string d(a1…an) = n if it matches the dictionary // using n to prefer longer word 0 if it doesn’t match dictionary d(a1…an) = -infinity if it is forbidden s[i] = max { d(a0..ai) and { for all k < i, s[k] + d(ak+1..ai }) } }

And that's it.