That's certainly one important source of errors. An obvious example would be treating UTF-32 as a fixed-width encoding of characters. It is fixed-width only per code point, so you might end up cutting grapheme clusters in half, and you can easily forget about normalization if you think about it that way.
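To make that concrete, here is a minimal Python sketch (the variable names are mine, purely for illustration). It shows that a single user-perceived "é" can be one or two code points, that slicing by code point silently drops the accent, and that the two spellings only compare equal after normalization:

```python
import unicodedata

decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT
composed = "\u00e9"     # precomposed LATIN SMALL LETTER E WITH ACUTE

# UTF-32 uses 4 bytes per code point, not per user-perceived character.
decomposed_bytes = len(decomposed.encode("utf-32-le"))  # 2 code points
composed_bytes = len(composed.encode("utf-32-le"))      # 1 code point

# Slicing by code point cuts the grapheme cluster in half.
truncated = decomposed[:1]  # just 'e', the accent is gone

# The two spellings differ until both are normalized to the same form.
equal_raw = decomposed == composed
equal_nfc = unicodedata.normalize("NFC", decomposed) == composed
```

Segmenting by actual grapheme clusters needs the rules from UAX #29, which Python's standard library does not implement, so that part is left out here.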

Then, it's possible to make mistakes when converting between representations, e.g. getting endianness wrong.
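A small Python sketch of the endianness problem (again, the names are just for illustration): decoding little-endian UTF-16 bytes as big-endian does not raise an error, it quietly produces a different character. A byte order mark lets the decoder pick the right order.

```python
text = "A"  # U+0041

little = text.encode("utf-16-le")  # bytes 41 00
big = text.encode("utf-16-be")     # bytes 00 41

# Wrong byte order: the same two bytes are read as U+4100, not U+0041.
misread = little.decode("utf-16-be")

# With a BOM in front, the generic "utf-16" codec detects the byte order.
from_le_bom = (b"\xff\xfe" + little).decode("utf-16")
from_be_bom = (b"\xfe\xff" + big).decode("utf-16")
```

Here `misread` is a valid but unrelated CJK code point, which is what makes this class of bug easy to miss.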

Some issues are more subtle: in principle, the decision of what should be considered a single character may depend on the language, never mind the debate about Han unification. But as far as I'm concerned, that's a WONTFIX.