
Let me see if I have this straight. My understanding is that WTF-8 is identical to UTF-8 for all valid UTF-16 input, but it can also round-trip invalid UTF-16. That is the ultimate goal.

Below is all the background I had to learn about to understand the motivation/details.

UCS-2 was designed as a 16-bit fixed-width encoding. When it became clear that 64k code points weren’t enough for Unicode, UTF-16 was invented to deal with the fact that UCS-2 had been assumed to be fixed-width, but could no longer be.

The solution they settled on is weird, but has some useful properties. Basically they took a couple code point ranges that hadn’t been assigned yet and allocated them to a “Unicode within Unicode” coding scheme. This scheme encodes (1 big code point) -> (2 small code points). The small code points will fit in UTF-16 “code units” (this is our name for each two-byte unit in UTF-16). And for some more terminology, “big code points” are called “supplementary code points”, and “small code points” are called “BMP code points.”
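
To make the surrogate arithmetic concrete, here is a rough sketch in Python (the helper names are mine, but the 0xD800/0xDC00/0x10000 constants come straight from the UTF-16 definition) of how one supplementary code point is split into two BMP code points and reassembled:

    # Sketch of the UTF-16 surrogate-pair arithmetic (helper names are my own).
    def to_surrogate_pair(cp):
        """Split a supplementary code point (U+10000..U+10FFFF) into two surrogates."""
        assert 0x10000 <= cp <= 0x10FFFF
        cp -= 0x10000                      # now a 20-bit number
        high = 0xD800 + (cp >> 10)         # "high"/"lead" surrogate carries the top 10 bits
        low = 0xDC00 + (cp & 0x3FF)        # "low"/"trail" surrogate carries the bottom 10 bits
        return high, low

    def from_surrogate_pair(high, low):
        """Reassemble the original supplementary code point."""
        assert 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

    print([hex(u) for u in to_surrogate_pair(0x10400)])  # ['0xd801', '0xdc00']
    print(hex(from_surrogate_pair(0xD801, 0xDC00)))      # 0x10400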

The weird thing about this scheme is that we bothered to make the “2 small code points” (known as a “surrogate” pair) into real Unicode code points. A more normal thing would be to say that UTF-16 code units are totally separate from Unicode code points, and that UTF-16 code units have no meaning outside of UTF-16. A number like 0xd801 could have a code unit meaning as part of a UTF-16 surrogate pair, and also be a totally unrelated Unicode code point.

But the one nice property of the way they did this is that they didn’t break existing software. Existing software assumed that every UCS-2 character was also a code point. These systems could be updated to UTF-16 while preserving this assumption.

Unfortunately it made everything else more complicated. Because now:

- UTF-16 can be ill-formed if it has any surrogate code units that don’t pair properly (a rough well-formedness check is sketched just after this list).

- we have to figure out what to do when these surrogate code points — code points whose only purpose is to help UTF-16 break out of its 64k limit — occur outside of UTF-16.
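
To illustrate the first point, here is a rough sketch (the function name is mine) of what “pair properly” means at the level of 16-bit code units:

    # Sketch: is a sequence of 16-bit code units well-formed UTF-16?
    # Every high surrogate must be immediately followed by a low surrogate,
    # and low surrogates must never appear on their own.
    def is_well_formed_utf16(units):
        i = 0
        while i < len(units):
            u = units[i]
            if 0xD800 <= u <= 0xDBFF:      # high surrogate
                if i + 1 >= len(units) or not (0xDC00 <= units[i + 1] <= 0xDFFF):
                    return False
                i += 2
            elif 0xDC00 <= u <= 0xDFFF:    # lone low surrogate
                return False
            else:
                i += 1
        return True

    print(is_well_formed_utf16([0x0041, 0xD801, 0xDC00]))  # True: a proper pair
    print(is_well_formed_utf16([0x0041, 0xD801]))          # False: unpaired high surrogate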

This becomes particularly complicated when converting UTF-16 -> UTF-8. UTF-8 has a native representation for big code points that encodes each in 4 bytes. But since surrogate code points are real code points, you could imagine an alternative UTF-8 encoding for big code points: make a UTF-16 surrogate pair, then UTF-8 encode the two code points of the surrogate pair (hey, they are real code points!) into UTF-8. But UTF-8 disallows this and only allows the canonical, 4-byte encoding.
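
For example, with Python’s strict UTF-8 codec: U+10400 gets the canonical 4-byte encoding, and the 6 bytes you’d get by UTF-8-encoding its surrogate pair are rejected when decoding.

    # Strict UTF-8: one canonical 4-byte sequence per big code point,
    # and byte sequences that encode surrogate code points are rejected.
    print('\U00010400'.encode('utf-8'))    # b'\xf0\x90\x90\x80' (canonical 4-byte form)

    try:
        # These 6 bytes are the surrogate pair U+D801 U+DC00 encoded "as if" by UTF-8.
        b'\xed\xa0\x81\xed\xb0\x80'.decode('utf-8')
    except UnicodeDecodeError as err:
        print('rejected:', err)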

If you feel this is unjust and UTF-8 should be allowed to encode surrogate code points if it feels like it, then you might like Generalized UTF-8, which is exactly like UTF-8 except this is allowed. It’s easier to convert from UTF-16, because you don’t need any specialized logic to recognize and handle surrogate pairs. You still need this logic to go in the other direction though (GUTF-8 -> UTF-16), since GUTF-8 can have big code points that you’d need to encode into surrogate pairs for UTF-16.

If you like Generalized UTF-8, except that you always want to use surrogate pairs for big code points, and you want to totally disallow the UTF-8-native 4-byte sequence for them, you might like CESU-8, which does this. This makes both directions of CESU-8 <-> UTF-16 easy, because neither conversion requires special handling of surrogate pairs.
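
To see the two encodings side by side, here is a sketch of U+10400 both ways. Python has no CESU-8 codec, so the 'surrogatepass' error handler is used below purely as a stand-in for “encode the surrogates as if they were ordinary code points”:

    # U+10400 two ways: UTF-8's native 4-byte form vs. CESU-8's surrogate-pair form.
    native = '\U00010400'.encode('utf-8')                      # canonical UTF-8
    cesu8 = '\ud801\udc00'.encode('utf-8', 'surrogatepass')    # surrogate pair, encoded unit by unit

    print(native.hex())  # f0909080
    print(cesu8.hex())   # eda081edb080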

A nice property of GUTF-8 is that it can round-trip any UTF-16 sequence, even if it’s ill-formed (has unpaired surrogate code points). It’s pretty easy to get ill-formed UTF-16, because many UTF-16-based APIs don’t enforce well-formedness.
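
A sketch of that round-trip, again using Python’s 'surrogatepass' handler as a stand-in for a generalized-UTF-8 codec (a Python str containing a lone surrogate is roughly what a lax UTF-16 API hands you):

    # Round-trip an unpaired surrogate, which strict UTF-8 would refuse to touch.
    lone = '\ud801'                                    # unpaired high surrogate

    encoded = lone.encode('utf-8', 'surrogatepass')    # b'\xed\xa0\x81'
    decoded = encoded.decode('utf-8', 'surrogatepass')

    print(encoded)           # b'\xed\xa0\x81'
    print(decoded == lone)   # True: the ill-formed data survived the trip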

But both GUTF-8 and CESU-8 have the drawback that they are not UTF-8 compatible. UTF-8-based software isn’t generally expected to decode surrogate pairs — surrogates are supposed to be a UTF-16-only peculiarity. Most UTF-8-based software expects that once it performs UTF-8 decoding, the resulting code points are real code points (“Unicode scalar values”, which make up “Unicode text”), not surrogate code points.

So basically what WTF-8 says is: encode all code points as their real code point, never as a surrogate pair (like UTF-8, unlike GUTF-8 and CESU-8). However, if the input UTF-16 was ill-formed and contained an unpaired surrogate code point, then you may encode that code point directly with UTF-8 (like GUTF-8, not allowed in UTF-8).
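
Putting that rule into code, here is a minimal sketch of the UTF-16 -> WTF-8 direction (the function name is mine, and it is only meant to illustrate the rule above, not to mirror the spec): paired surrogates get combined and encoded as one 4-byte sequence, while an unpaired surrogate falls through and is encoded directly.

    # Sketch: UTF-16 code units -> WTF-8 bytes. Proper surrogate pairs are combined
    # (as in UTF-8); unpaired surrogates are encoded directly (as in GUTF-8).
    # 'surrogatepass' is only needed for the unpaired-surrogate case.
    def wtf8_from_utf16_units(units):
        out = bytearray()
        i = 0
        while i < len(units):
            u = units[i]
            if (0xD800 <= u <= 0xDBFF and i + 1 < len(units)
                    and 0xDC00 <= units[i + 1] <= 0xDFFF):
                # A proper pair: reassemble the big code point, use the canonical form.
                cp = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
                out += chr(cp).encode('utf-8')
                i += 2
            else:
                # A BMP code point, or an unpaired surrogate that WTF-8 lets through.
                out += chr(u).encode('utf-8', 'surrogatepass')
                i += 1
        return bytes(out)

    print(wtf8_from_utf16_units([0x0041, 0xD801, 0xDC00]).hex())  # 41f0909080 (same as UTF-8)
    print(wtf8_from_utf16_units([0x0041, 0xD801]).hex())          # 41eda081 (lone surrogate preserved)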

So WTF-8 is identical to UTF-8 for all valid UTF-16 input, but it can also round-trip invalid UTF-16. That is the ultimate goal.



By the way, one thing that was slightly unclear to me in the doc. In section 4.2 (https://simonsapin.github.io/wtf-8/#encoding-ill-formed-utf-...):

> If, on the other hand, the input contains a surrogate code point pair, the conversion will be incorrect and the resulting sequence will not represent the original code points.

It might be clearer to say: "the resulting sequence will not represent the surrogate code points." It could be that, by some fluke, the user actually intends the resulting UTF-16 to interpret the surrogate sequence that was in the input. And this isn't really lossy, since (AFAIK) the surrogate code points exist for the sole purpose of representing surrogate pairs.

The more interesting case here, which isn't mentioned at all, is when the input contains unpaired surrogate code points. That is the case where the resulting UTF-16 will actually end up being ill-formed.


The encoding that was designed to be fixed-width is called UCS-2. UTF-16 is its variable-length successor.


Thanks for the correction! I updated the post.



