
Let me see if I have this straight. My understanding is that WTF-8 is identical to UTF-8 for all valid UTF-16 input, but it can also round-trip invalid UTF-16. That is the ultimate goal.

Below is all the background I had to learn about to understand the motivation/details.

UCS-2 was designed as a 16-bit fixed-width encoding. When it became clear that 64k code points weren’t enough for Unicode, UTF-16 was invented to deal with the fact that UCS-2 had been assumed to be fixed-width, but could no longer be.

The solution they settled on is weird, but has some useful properties. Basically they took a couple code point ranges that hadn’t been assigned yet and allocated them to a “Unicode within Unicode” coding scheme. This scheme encodes (1 big code point) -> (2 small code points). The small code points will fit in UTF-16 “code units” (this is our name for each two-byte unit in UTF-16). And for some more terminology, “big code points” are called “supplementary code points”, and “small code points” are called “BMP code points.”
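
To make the surrogate arithmetic concrete, here is a rough sketch in Python (the helper names are mine, but the 0xD800/0xDC00/0x10000 constants come straight from the UTF-16 definition) of how one supplementary code point is split into two BMP code points and reassembled:

    # Sketch of the UTF-16 surrogate-pair arithmetic (helper names are my own).
    def to_surrogate_pair(cp):
        """Split a supplementary code point (U+10000..U+10FFFF) into two surrogates."""
        assert 0x10000 <= cp <= 0x10FFFF
        cp -= 0x10000                      # now a 20-bit number
        high = 0xD800 + (cp >> 10)         # "high"/"lead" surrogate carries the top 10 bits
        low = 0xDC00 + (cp & 0x3FF)        # "low"/"trail" surrogate carries the bottom 10 bits
        return high, low

    def from_surrogate_pair(high, low):
        """Reassemble the original supplementary code point."""
        assert 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

    print([hex(u) for u in to_surrogate_pair(0x10400)])  # ['0xd801', '0xdc00']
    print(hex(from_surrogate_pair(0xD801, 0xDC00)))      # 0x10400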

The weird thing about this scheme is that we bothered to make the “2 small code points” (known as a “surrogate” pair) into real Unicode code points. A more normal thing would be to say that UTF-16 code units are totally separate from Unicode code points, and that UTF-16 code units have no meaning outside of UTF-16. A number like 0xd801 could have a code unit meaning as part of a UTF-16 surrogate pair, and also be a totally unrelated Unicode code point.

But the one nice property of the way they did this is that they didn’t break existing software. Existing software assumed that every UCS-2 character was also a code point. These systems could be updated to UTF-16 while preserving this assumption.

Unfortunately it made everything else more complicated. Because now:

- UTF-16 can be ill-formed if it has any surrogate code units that don’t pair properly (a rough well-formedness check is sketched just after this list).

- we have to figure out what to do when these surrogate code points — code points whose only purpose is to help UTF-16 break out of its 64k limit — occur outside of UTF-16.
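
To illustrate the first point, here is a rough sketch (the function name is mine) of what “pair properly” means at the level of 16-bit code units:

    # Sketch: is a sequence of 16-bit code units well-formed UTF-16?
    # Every high surrogate must be immediately followed by a low surrogate,
    # and low surrogates must never appear on their own.
    def is_well_formed_utf16(units):
        i = 0
        while i < len(units):
            u = units[i]
            if 0xD800 <= u <= 0xDBFF:      # high surrogate
                if i + 1 >= len(units) or not (0xDC00 <= units[i + 1] <= 0xDFFF):
                    return False
                i += 2
            elif 0xDC00 <= u <= 0xDFFF:    # lone low surrogate
                return False
            else:
                i += 1
        return True

    print(is_well_formed_utf16([0x0041, 0xD801, 0xDC00]))  # True: a proper pair
    print(is_well_formed_utf16([0x0041, 0xD801]))          # False: unpaired high surrogate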

This becomes particularly complicated when converting UTF-16 -> UTF-8. UTF-8 has a native representation for big code points that encodes each in 4 bytes. But since surrogate code points are real code points, you could imagine an alternative UTF-8 encoding for big code points: make a UTF-16 surrogate pair, then UTF-8 encode the two code points of the surrogate pair (hey, they are real code points!) into UTF-8. But UTF-8 disallows this and only allows the canonical, 4-byte encoding.
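
For example, with Python’s strict UTF-8 codec: U+10400 gets the canonical 4-byte encoding, and the 6 bytes you’d get by UTF-8-encoding its surrogate pair are rejected when decoding.

    # Strict UTF-8: one canonical 4-byte sequence per big code point,
    # and byte sequences that encode surrogate code points are rejected.
    print('\U00010400'.encode('utf-8'))    # b'\xf0\x90\x90\x80' (canonical 4-byte form)

    try:
        # These 6 bytes are the surrogate pair U+D801 U+DC00 encoded "as if" by UTF-8.
        b'\xed\xa0\x81\xed\xb0\x80'.decode('utf-8')
    except UnicodeDecodeError as err:
        print('rejected:', err)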

If you feel this is unjust and UTF-8 should be allowed to encode surrogate code points if it feels like it, then you might like Generalized UTF-8, which is exactly like UTF-8 except this is allowed. It’s easier to convert from UTF-16, because you don’t need any specialized logic to recognize and handle surrogate pairs. You still need this logic to go in the other direction though (GUTF-8 -> UTF-16), since GUTF-8 can have big code points that you’d need to encode into surrogate pairs for UTF-16.

If you like Generalized UTF-8, except that you always want to use surrogate pairs for big code points, and you want to totally disallow the UTF-8-native 4-byte sequence for them, you might like CESU-8, which does this. This makes both directions of CESU-8 <-> UTF-16 easy, because neither conversion requires special handling of surrogate pairs.
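
To see the two encodings side by side, here is a sketch of U+10400 both ways. Python has no CESU-8 codec, so the 'surrogatepass' error handler is used below purely as a stand-in for “encode the surrogates as if they were ordinary code points”:

    # U+10400 two ways: UTF-8's native 4-byte form vs. CESU-8's surrogate-pair form.
    native = '\U00010400'.encode('utf-8')                      # canonical UTF-8
    cesu8 = '\ud801\udc00'.encode('utf-8', 'surrogatepass')    # surrogate pair, encoded unit by unit

    print(native.hex())  # f0909080
    print(cesu8.hex())   # eda081edb080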

A nice property of GUTF-8 is that it can round-trip any UTF-16 sequence, even if it’s ill-formed (has unpaired surrogate code points). It’s pretty easy to get ill-formed UTF-16, because many UTF-16-based APIs don’t enforce well-formedness.
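
A sketch of that round-trip, again using Python’s 'surrogatepass' handler as a stand-in for a generalized-UTF-8 codec (a Python str containing a lone surrogate is roughly what a lax UTF-16 API hands you):

    # Round-trip an unpaired surrogate, which strict UTF-8 would refuse to touch.
    lone = '\ud801'                                    # unpaired high surrogate

    encoded = lone.encode('utf-8', 'surrogatepass')    # b'\xed\xa0\x81'
    decoded = encoded.decode('utf-8', 'surrogatepass')

    print(encoded)           # b'\xed\xa0\x81'
    print(decoded == lone)   # True: the ill-formed data survived the trip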

But both GUTF-8 and CESU-8 have the drawback that they are not UTF-8 compatible. UTF-8-based software isn’t generally expected to decode surrogate pairs — surrogates are supposed to be a UTF-16-only peculiarity. Most UTF-8-based software expects that once it performs UTF-8 decoding, the resulting code points are real code points (“Unicode scalar values”, which make up “Unicode text”), not surrogate code points.

So basically what WTF-8 says is: encode all code points as their real code point, never as a surrogate pair (like UTF-8, unlike GUTF-8 and CESU-8). However, if the input UTF-16 was ill-formed and contained an unpaired surrogate code point, then you may encode that code point directly with UTF-8 (like GUTF-8, not allowed in UTF-8).
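
Putting that rule into code, here is a minimal sketch of the UTF-16 -> WTF-8 direction (the function name is mine, and it is only meant to illustrate the rule above, not to mirror the spec): paired surrogates get combined and encoded as one 4-byte sequence, while an unpaired surrogate falls through and is encoded directly.

    # Sketch: UTF-16 code units -> WTF-8 bytes. Proper surrogate pairs are combined
    # (as in UTF-8); unpaired surrogates are encoded directly (as in GUTF-8).
    # 'surrogatepass' is only needed for the unpaired-surrogate case.
    def wtf8_from_utf16_units(units):
        out = bytearray()
        i = 0
        while i < len(units):
            u = units[i]
            if (0xD800 <= u <= 0xDBFF and i + 1 < len(units)
                    and 0xDC00 <= units[i + 1] <= 0xDFFF):
                # A proper pair: reassemble the big code point, use the canonical form.
                cp = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
                out += chr(cp).encode('utf-8')
                i += 2
            else:
                # A BMP code point, or an unpaired surrogate that WTF-8 lets through.
                out += chr(u).encode('utf-8', 'surrogatepass')
                i += 1
        return bytes(out)

    print(wtf8_from_utf16_units([0x0041, 0xD801, 0xDC00]).hex())  # 41f0909080 (same as UTF-8)
    print(wtf8_from_utf16_units([0x0041, 0xD801]).hex())          # 41eda081 (lone surrogate preserved)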

So WTF-8 is identical to UTF-8 for all valid UTF-16 input, but it can also round-trip invalid UTF-16. That is the ultimate goal.



By the way, one thing that was slightly unclear to me in the doc. In section 4.2 (https://simonsapin.github.io/wtf-8/#encoding-ill-formed-utf-...):

> If, on the other hand, the input contains a surrogate code point pair, the conversion will be incorrect and the resulting sequence will not represent the original code points.

It might be clearer to say: "the resulting sequence will not represent the surrogate code points." It could be that, by some fluke, the user actually intends the resulting UTF-16 to interpret the surrogate sequence that was in the input. And this isn't really lossy, since (AFAIK) the surrogate code points exist for the sole purpose of representing surrogate pairs.

The more interesting case here, which isn't mentioned at all, is when the input contains unpaired surrogate code points. That is the case where the resulting UTF-16 will actually end up being ill-formed.


The encoding that was designed to be fixed-width is called UCS-2. UTF-16 is its variable-length successor.


Thanks for the correction! I updated the post.



