A UTF-8 encoded file can contain a Byte Order Mark

18 Oct 2012

But a UTF-8 encoded string doesn't contain it.

Today I loaded a file containing a UTF-8-encoded string, and tried to parse it with CSV.

It didn't work, even though the string looked perfectly fine when printed on the screen.

It turns out that a file containing a UTF-8 string can start with a Byte Order Mark, which takes up the first 3 bytes (EF BB BF).
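To see those 3 bytes for yourself, here is a small sketch in Python (my choice for illustration; this post doesn't depend on any particular language). It assumes Python's built-in `utf-8-sig` codec, which writes the BOM in front of the encoded text, and the sample string is made up.

```python
import codecs

text = "name,age"  # made-up sample content

# "utf-8-sig" prepends the BOM; plain "utf-8" does not.
with_bom = text.encode("utf-8-sig")
without_bom = text.encode("utf-8")

bom = with_bom[:3]
print(bom == codecs.BOM_UTF8)        # the first 3 bytes are the BOM
print(with_bom[3:] == without_bom)   # the rest is the plain UTF-8 string
```

The BOM is invisible when printed, which is why the string can look fine on the screen and still trip up a parser.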

Thanks to Cesar for pointing it out. He also explained why the Byte Order Mark (BOM) matters: without it, there is no reliable way to tell a file containing a UTF-8 string from a file containing a Latin-1 string. They could look the same.

So, before processing the string with the CSV library, just cut out the first 3 bytes…
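A minimal sketch of that idea in Python, assuming the data arrives as raw bytes. The helper name `strip_utf8_bom` and the sample data are hypothetical, made up for illustration. It only cuts the 3 bytes when they really are the BOM, so a file saved without one is left untouched.

```python
import codecs
import csv
import io


def strip_utf8_bom(raw: bytes) -> bytes:
    """Return raw without a leading UTF-8 BOM, if one is present."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):]
    return raw


# Made-up sample: the bytes of a small CSV file saved with a BOM.
raw = codecs.BOM_UTF8 + "name,age\nAna,30\n".encode("utf-8")

text = strip_utf8_bom(raw).decode("utf-8")
rows = list(csv.reader(io.StringIO(text)))
print(rows)
```

In Python specifically, opening the file with `encoding="utf-8-sig"` should have the same effect, since that codec skips a leading BOM when decoding.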