What's your storage requirement that's not adequately solved by the existing enc...

lmm · on May 28, 2015

What are you suggesting: store strings in UTF-8 and then "normalize" them into this bizarre format whenever you load/save them, purely so that offsets correspond to grapheme clusters? That doesn't seem worth the overhead to me.

cygx · on May 28, 2015

In-memory string representation rarely corresponds to on-disk representation.

Various programming languages (Java, C#, Objective-C, JavaScript, ...) as well as some well-known libraries (ICU, Windows API, Qt) use UTF-16 internally. How much data do you have lying around that's UTF-16?

Sure, more recently, Go and Rust have decided to go with UTF-8, but that's far from common, and it does have some drawbacks compared to the Perl 6 (NFG) or Python 3 (Latin-1, UCS-2, UCS-4 as appropriate) model if you have to do actual processing instead of just passing opaque strings around.
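To make the Python 3 model concrete, here is a rough sketch that assumes CPython 3.3 or later, where the interpreter picks the narrowest fixed-width storage that fits the widest code point in the string. The exact byte counts are an implementation detail, so the snippet only compares them rather than relying on specific values.

```python
import sys

# Three strings of equal length whose widest code point needs 1, 2, and 4 bytes.
narrow = "a" * 1000            # fits in 1 byte per code point
medium = "\u0100" * 1000       # needs 2 bytes per code point
wide = "\U00010000" * 1000     # needs 4 bytes per code point

# All three have the same length in code points...
lengths = (len(narrow), len(medium), len(wide))

# ...but CPython stores them with different widths (sizes are implementation-specific).
sizes = (sys.getsizeof(narrow), sys.getsizeof(medium), sys.getsizeof(wide))
print(lengths, sizes)
```

Because each string uses a single fixed width internally, indexing by code point stays constant time whichever width was chosen.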

Also note that you have to go through a normalization step anyway if you don't want to be tripped up by having multiple ways to represent a single grapheme.
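A small illustration of that last point, using Python's standard `unicodedata` module (the choice of NFC here is just for the example, any of the normalization forms would show the same thing):

```python
import unicodedata

composed = "\u00e9"      # "é" as one precomposed code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# Both render as the same grapheme, but they compare unequal as raw strings.
raw_equal = composed == decomposed

# After normalizing both to the same form, they compare equal.
nfc_composed = unicodedata.normalize("NFC", composed)
nfc_decomposed = unicodedata.normalize("NFC", decomposed)
normalized_equal = nfc_composed == nfc_decomposed

print(raw_equal, normalized_equal)
print(len(decomposed), len(nfc_decomposed))
```

Without the normalization step, comparisons, lookups, and length checks on these two strings disagree even though a reader sees the same text.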

raiph · on May 28, 2015

NFG enables O(N) algorithms for character-level operations.

The overhead is entirely wasted on code that does no character-level operations.

For code that does do some character-level operations, avoiding quadratic behavior may pay off handsomely.
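A rough sketch of where the quadratic behavior comes from, assuming a variable-width encoding like UTF-8. The helper `utf8_char_at` is hypothetical and written for this example; it works on code points rather than full grapheme clusters, but the same argument applies to graphemes when they are not stored at a fixed width.

```python
def utf8_char_at(data: bytes, index: int) -> str:
    """Return the index-th code point of UTF-8 data by scanning from the start."""
    if index < 0:
        raise IndexError("negative index not supported in this sketch")
    count = -1
    start = None
    for pos, byte in enumerate(data):
        if byte & 0xC0 != 0x80:  # not a continuation byte, so a new code point starts here
            count += 1
            if count == index:
                start = pos
            elif count == index + 1:
                return data[start:pos].decode("utf-8")
    if start is None:
        raise IndexError("index out of range")
    return data[start:].decode("utf-8")


text = "naïve café"
data = text.encode("utf-8")

# Each lookup rescans from the beginning: O(N) per index, O(N^2) for the loop.
scanned = [utf8_char_at(data, i) for i in range(len(text))]

# With one fixed-width slot per character, each lookup is O(1) and the loop is O(N).
fixed_width = list(text)
indexed = [fixed_width[i] for i in range(len(fixed_width))]

print(scanned == indexed)
```

Code that only iterates front to back never pays the rescanning cost, which is why the fixed-width representation only helps when there is real random access by character.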