Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

In-memory string representation rarely corresponds to on-disk representation.

Various programming languages (Java, C#, Objective-C, JavaScript, ...) as well as some well-known libraries (ICU, Windows API, Qt) use UTF-16 internally. How much data do you have lying around that's UTF-16?

Sure, more recently, Go and Rust have decided to go with UTF-8, but that's far from common, and it does have some drawbacks compared to the Perl6 (NFG) or Python3 (latin-1, UCS-2, UCS-4 as appropriate) model if you have to do actual processing instead of just passing opaque strings around.

Also note that you have to go through a normalization step anyway if you don't want to be tripped up by having multiple ways to represent a single grapheme.



Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: