This article is pretty great; however, I would like to caution against using the word "character" in discussions about Unicode. The post does say that it's a fuzzy concept, but it also says that it's "Basically a thing in the Unicode table somewhere.", which is the definition of a code point. Unicode itself gives multiple incompatible meanings for the term (http://unicode.org/glossary/#character) and only seems to use it in non-normative text -- so the post isn't wrong, but it has the potential to cause confusion.
Often when you want to say character you really mean "grapheme cluster"[1]. A Devanagari consonant cluster (of one or more consonants) with optional vowel (and misc diacritics) is a grapheme cluster. A Latin letter with accent mark(s) is a grapheme cluster. A Hangul syllable (precomposed or built from jamo) is a grapheme cluster. A flag or multicultural family emoji is a grapheme cluster.
When talking about characters, the question arises whether a Hangul syllable made of decomposed jamo is one character or three. Or whether the combination of a [Devanagari consonant + virama] "character" and a [Devanagari consonant + vowel] "character" is one character or two. This is specified in the case of grapheme clusters, and this usually maps to where the notion of a character is important -- text selection and offsets in editing, etc. "Code point" does not universally map to any tangible concept -- it's a concept made up for the sake of specifying Unicode. You only care about code points when dealing with UTF-32 strings or when implementing operations on Unicode text. "Glyph" is also sometimes what we mean when we say "character", though that's more useful on the rendering end of things.
[1]: to be pedantic, "extended grapheme cluster", because Unicode gave a rigorous definition of grapheme cluster and later decided to change it.
I find the following hierarchy helpful:
bytes -> code units -> code points -> extended grapheme clusters -> user-perceived characters
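A small Python sketch of those layers (the byte and code-unit counts depend on which encoding you pick; Python strings themselves expose the code-point layer):

    s = "e\u0301"                      # "é" built from 'e' + COMBINING ACUTE ACCENT
    list(s.encode("utf-8"))            # bytes / UTF-8 code units: [101, 204, 129]
    len(s.encode("utf-16-le")) // 2    # UTF-16 code units: 2
    [hex(ord(c)) for c in s]           # code points: ['0x65', '0x301']
    # extended grapheme clusters: 1 -- the single user-perceived character "é"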
In Python, a Unicode string is an immutable sequence of Unicode code points. It has nothing to do with UTF-32 (Python uses a flexible internal representation). To get "user-perceived characters" (approximated by eXtended grapheme clusters), you need something beyond plain iteration:
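One way to do it (just a sketch, using the third-party regex module, whose \X pattern matches an extended grapheme cluster per UAX #29):

    # pip install regex   (the third-party module, not the stdlib re)
    import regex

    s = "ni\u0301ce"                        # "níce" with a decomposed í
    print(len(s))                           # 5 code points
    print(regex.findall(r"\X", s))          # ['n', 'í', 'c', 'e']
    print(len(regex.findall(r"\X", s)))     # 4 extended grapheme clusters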
In practice, whatever is produced by the default method of iterating over a string is often called a character in that programming language (a code point, a UTF-16 code unit, or even a byte).
The .chars method should be the fastest, because Perl 6 internally uses strings of fully composed characters (normalized form grapheme). It's much better than having to do regex hacks like in Python.
Given that one can make graphemes that contain many Unicode code points (213 in this extreme example: https://www.reddit.com/r/Unicode/comments/4yie0a/tallest_lon...), I wondered whether that can be correct for any reasonable definition of "fully composed character".
As far as I can tell, it turned out to be correct, though. Perl 6 has its own Unicode normalization variant called NFG (https://design.perl6.org/S15.html#NFG):
"Formally Perl 6 graphemes are defined exactly according to Unicode Grapheme Cluster Boundaries at level "extended" (in contrast to "tailored" or "legacy"), see Unicode Standard Annex #29 UNICODE TEXT SEGMENTATION 3 Grapheme Cluster Boundaries>. This is the same as the Perl 5 character class
\X Match Unicode "eXtended grapheme cluster"
With NFG, strings start by being run through the normal NFC process, compressing any given character sequences into precomposed characters.
Any graphemes remaining without precomposed characters, such as ậ or नि, are given their own internal designation to refer to them, at least 32 bits in length, in such a way that they avoid clashing with any potential future changes to Unicode. The mapping between these internal designations and graphemes in this form is not guaranteed constant, even between strings in the same process."
From that, I guess Perl 6 extends its mapping between NFG code points (an extension of Unicode code points) and Unicode grapheme clusters whenever it encounters a grapheme cluster it hasn't seen before. Ignoring performance concerns (it might not be bad, but I'm not sure about that), that seems like a nice approach.
wcswidth computes the cell width (or column count) of a Unicode string, which is unrelated to the count of graphemes, EGCs or code points. For example, Latin characters are one cell/column wide, while many CJK characters occupy two cells/columns even though they are still a single EGC.
A typical application is printing CJK text in a terminal mask (form), a progress display or similar.
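For illustration, a short sketch with the third-party wcwidth package (a Python port of the C function named above):

    # pip install wcwidth
    from wcwidth import wcswidth

    print(wcswidth("hello"))   # 5 -- Latin characters take one column each
    print(wcswidth("漢字"))    # 4 -- each CJK character takes two columns
    print(len("漢字"))         # 2 -- yet they are still only two code points / two EGCs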
Right, my point is that the concept of a code point is seldom useful unless doing storage stuff with UTF-32 or implementing Unicode algorithms. Python may expose an API of code points, but that doesn't mean it's meaningful.
Performance arguments can be made as to why the API should use code points instead of grapheme clusters, so there are legitimate reasons for Python (and Rust, and many other languages) to do so. Sometimes you just need some comparable notion of length and "number of code points" is acceptable.
However, you should be careful when writing code that confers meaning on the concept of a code point. A lot of code does this (using code points when it really means glyphs or grapheme clusters).
> Right, my point is that the concept of a code point is seldom useful unless doing storage stuff with UTF-32 or implementing Unicode algorithms.
In XML land, where strings are almost always UTF-8, XPath offers a string-to-codepoints() function that returns a sequence of integers, and a corresponding codepoints-to-string(). These two have been invaluable to me on many occasions when doing string-manipulation gymnastics.
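For readers outside XML land, a rough Python analogue of those two XPath functions (just a sketch of the same idea; the function names mirror the XPath ones):

    def string_to_codepoints(s):
        # like XPath string-to-codepoints(): string -> sequence of integers
        return [ord(c) for c in s]

    def codepoints_to_string(cps):
        # like XPath codepoints-to-string(): sequence of integers -> string
        return "".join(chr(cp) for cp in cps)

    string_to_codepoints("héllo")    # [104, 233, 108, 108, 111]
    codepoints_to_string([72, 73])   # 'HI'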
What kind of string manipulation gymnastics? I'd be wary of using code points for string manipulation for anything other than algorithms where you are explicitly asked to (e.g. algorithms that implement operations from the Unicode spec).
On the other hand, there are algorithms embedded in widely-deployed standards which are defined in terms of code points.
For example, one I know quite well from having implemented it in Python: the HTML5 color parsing algorithm (the one that turns even incredible junk strings like "chucknorris" into color values) requires, in step 7 of the parsing process, replacing any code point higher than U+FFFF with the sequence '00' (that's two instances of U+0030 DIGIT ZERO).
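A minimal sketch of just that step, not the whole algorithm (the helper name is mine):

    def replace_non_bmp(s):
        # HTML5 legacy color parsing, step 7: every code point above U+FFFF
        # is replaced by the two-character sequence "00".
        return "".join("00" if ord(c) > 0xFFFF else c for c in s)

    replace_non_bmp("ab\U0001F34Ecd")   # 'ab00cd'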
And personally I think code points, as the basic atomic units of Unicode, do make sense as the things strings are made up of; I wish Python had better support for identifying graphemes without third-party libraries, but since Unicode encodings all map back to code points it makes sense to me that a Unicode string is a sequence of those rather than a sequence of some more-complex concept.
> there are algorithms embedded in widely-deployed standards which are defined in terms of code points.
From my original comment:
> You only care about code points when dealing with UTF-32 strings or when implementing operations on Unicode text.
These operations fall into the latter category. It's still pretty niche. If an algorithm is defined explicitly in terms of code points, this makes sense. Stuff starts falling apart when people assign meaning to code points and use them as a placeholder for other concepts, like glyphs or columns or grapheme clusters.
Lots of intelligent comments from many contributors in that high-quality thread.
For me, the takeaway was that truly understanding Unicode requires holding 5 different layers of concepts in your head simultaneously, and the oft-cited Joel on Software essay "The Absolute Minimum You Must Know About Unicode" only covers 2 of them. (To be fair to Joel, his title emphasizes "minimum to know" and not "everything to know".) Also, at the highest level of abstraction, intelligent people can have philosophical disagreements about _what_ to encode in Unicode.
The problem with "character" is the same as the problem with "string." There is a formal computer science definition that comes from the discipline of automata, is mathematical, and useful for building compilers and state machines and why characters and strings in C are bytes and streams of bytes.
Because ASCII abstracts onto bytes well but text does not, layering an abstraction for 'text' that decomposes into characters and strings is where the problem lies. That's where, for example, Python 2 ran into trouble. Perl's set of compromises made its text abstraction useful, but creating those compromises required making them a principal focus of the language.
A stream of bytes is a stream of bytes; short of magic, there's no way to generate a correct interpretation except via a predesignated protocol. And the only way to get a predesignated protocol for text is to make a deep study of human language and to choose to live with some compromises and not with others.
MySQL claims to support utf8, but in reality it doesn't: its legacy utf8 type stores at most three bytes per character, so you need utf8mb4 to support anything outside the BMP, including certain common Kanji characters.
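The root cause is easy to demonstrate from Python: code points outside the Basic Multilingual Plane take four bytes in UTF-8, one more than MySQL's legacy utf8 type will store per character.

    # U+20BB7 is a kanji used in Japanese names that sits outside the BMP.
    ideograph = "\U00020BB7"
    print(ideograph.encode("utf-8"))        # b'\xf0\xa0\xae\xb7'
    print(len(ideograph.encode("utf-8")))   # 4 bytes -- too many for MySQL "utf8"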
This company had spent untold thousands (possibly millions) trying to convert gigantic databases (and I don't use the term gigantic loosely...) from utf8 to utf8mb4 because some of their Japan-based clients were using Kanji.
Sounds easy, right? Wrong. utf8mb4 comes with some technical "gotchas" (google it) that had delayed the attempt to change to it by almost a year.
Anyway, I found this pretty amusing, and got a huge paycheck to explain to them just how screwed they were.
A while back I found a minor bug in Thunderbird, where astral plane code points were considered by the line wrapping algorithm to have a width of 2 rather than 1. So I filed a bug report.
Oh, boy.
Turns out that trying to include astral plane code points in whichever version of Bugzilla that Mozilla uses causes comments to be silently truncated! Because MySQL.
I filed that one in 2010; it got deduped against a bug originally filed in 2007; it is now 2016, and the bug is RESOLVED FIXED, and Mozilla's bugzilla still has the same problem.
When I worked at Mozilla I was on the MDN (developer.mozilla.org) team, and we had this inexplicable bug: articles can be categorized with tags, and both articles and tags are localizable for all the languages MDN supports. So, for example, English reference articles on CSS properties were tagged "CSS Reference", while French reference articles on CSS properties were tagged "CSS Référence".
And... sometimes an English article's page would show it as having the French ("Référence") tag, and sometimes the French article's page would show it as having the English article's tag.
Turns out, MySQL's case-insensitive UTF-8 collation treated "e" and "é" as the same character. We didn't know about that, and hadn't noticed because the tagging library we used worked around it. Until one day a new version of it didn't, and tags from one language would start showing on another language's articles (if the words were the same, aside from diacritics/accents on certain characters). Which led to this:
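To get a feel for how such a collation conflates the two tags, here is a rough Python approximation (an assumption on my part; MySQL's utf8_general_ci uses its own weight tables, but the effect on accented Latin letters is similar):

    import unicodedata

    def general_ci_key(s):
        # Approximation only: decompose, drop combining marks, case-fold,
        # so "Référence" and "Reference" compare equal.
        decomposed = unicodedata.normalize("NFD", s)
        stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
        return stripped.casefold()

    general_ci_key("CSS Référence") == general_ci_key("CSS Reference")   # True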
> utf8mb4 comes with some technical "gotchas" (google it)
I know InnoDB limits index key prefixes to 767 bytes, meaning a VARCHAR(255) using utf8 can have all 255 characters indexed, but a VARCHAR(255) using utf8mb4 can only index 191 characters (floor(767/4) == 191).
After a quick Google search, that seems to be the most common gotcha. What other gotchas did you have in mind?
This was definitely the first thing that came up, as you found.
To be honest, I just don't remember. There was something about something that made something scary to the PM who was in charge of it all? That is about the best I can come up with.
I want to say they needed to index more than 191 chars, but that seems like a stupid thing to say. Who needs to index that many chars?
If I remember, I'll edit :)
edit: I guess I should say I was consulted to do some unrelated things, then helped them with some MySQL stuff that came up towards the end of the contract, then the utf8mb4 stuff came up, and I spent some time going through it with them. It was not the main focus of the contract, which is part of why I don't remember it very well. Just something that came up in the day to day...
My takeaway is that the Turkish "I" and the English "I", as well as the English "æ" and the Icelandic "æ", are crammed into the same character even though that causes all sorts of problems, but there are about 20 different characters that represent "x"...
The solution to the problem, by the way, is to determine what the user expects and give it to them. That is, you can't just sort words without defining what sort you want to do (according to German phone book sort? Portuguese dictionary sort? etc.).
Where do you sort the Dutch "ij"? The obvious answer is "between 'ii' and 'ik'" but it's actually just the print representation of "ÿ", a letter that is essentially only used in freehand nowadays. So, do you sort it in place of "ÿ", when in the Dutch locale? What if you have a borrowed word that happens to have "ij" in it, like "hijack" (which really is used in Dutch)? In practice, you sort it as if it's English (i.e. between "ii" and "ik"), but that leads to confusion because when capitalising you treat it as a single letter. The titlecase form of "ijzer" (iron) is "IJzer". I guess the correct way would probably be whatever they do in dictionaries and phone books, but I'm an expat so I have no idea what ordering those use.
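Collation libraries make this a locale-level decision; a sketch with the third-party PyICU bindings (the exact placement of "ij" depends on the CLDR tailoring your ICU build ships with):

    # pip install PyICU
    import icu

    words = ["ik", "ijzer", "iets", "hijack", "yoghurt"]
    collator = icu.Collator.createInstance(icu.Locale("nl_NL"))
    print(sorted(words, key=collator.getSortKey))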
"ij" is usually sorted between "ij" and "ik" by Dutch dictionaries. The odd exception (even for Dutch speakers it seams odd) are phonebooks, because for historic reasons the surname of family members is sometimes written with "y" and sometimes with "ij".
Titlecase (IJzer) is not really a problem, because there are no loanwords that start with "ij" (loanwords like "hijack" only contain it in the middle).
Very, very rarely "ij" is treated as a ligature, and counted as one letter.
So, for the most part you should follow the English sorting rules, and you need one exception for handling title case.
Unicode contains a ligature "ij", which should basically not be used.
English has simply abandoned the letters thorn and eth (and others), which was largely the result of the rise of printing (IIRC).
It's interesting to me that this happened far enough back that the only time we tend to consider 'th' to be letterlike is when we need to explain pronunciation -- e.g. when teaching a kid to read.
Based on other cases, I guess the "Unicode-style" solution would be to add a new code point, DUTCH CHARACTER IJ OR ÿ, which is cased and sorted accordingly.
Of course then you'd have to teach people to use that character when writing instead of just typing "ij". And one day, there will be someone who, for stylistic reasons, needs tight control over when it's rendered as "ij" and when as "ÿ"...
Wikipedia claims that Dutch dictionaries sort 'ij' as if it were two characters, but telephone directories sort as if it were 'y'.
Based on precedent, I'd expect the Unicode Consortium to declare it the radical of a Chinese character.
(it does exist in Unicode, but as a ligature -- U+0132 for majuscule, U+0133 for minuscule -- and decomposes to the two code points for 'i' and 'j', with use of the ligature discouraged)
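That mapping is a compatibility decomposition, so it kicks in under NFKC/NFKD but not NFC, which is easy to check in Python:

    import unicodedata

    print(unicodedata.name("\u0132"))              # LATIN CAPITAL LIGATURE IJ
    print(unicodedata.normalize("NFC", "\u0132"))  # 'Ĳ' -- canonical forms keep the ligature
    print(unicodedata.normalize("NFKC", "\u0132")) # 'IJ' -- compatibility forms split it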
One of my favourite dark corners of Unicode is IDS, or Ideographic Description Sequences. It allows you to describe characters that are not encoded in Unicode, but can be described by a combination of existing ones.
For example, the Chinese character for the word Biang[1] can be described with:
The very first input scheme the article talks about is Pinyin, with which you'd input biang by typing... "b" "i" "a" "n" "g" then selecting from a list (most likely a list of length 1). Not insane or even difficult.
(Except that biang is not encoded in Unicode yet so you can't type it anyway.)
Think about it, though. You type phonetically in one script, then select a character in another script that might be pronounced in a similar way. That's like entering Hangul syllables that sound similar to what you want, then choosing the right English character sequences.
What's there to think about? How else would you input a script with more than 50k characters?
> You type phonetically in one script, then select a character in another script that might be pronounced in a similar way.
Sure. Japanese works the same way, you input in kana or romaji, then select the suitable kanji (or kanji sequence).
Of course it only works when you have a regular phonology; that would be completely impossible for English, since by and large orthography and pronunciation have no relation.
I've thought about it. I input Japanese every day, Chinese frequently, and Korean on occasion. The Latin alphabet is a first-class citizen in these languages; I don't see the issue.
In fact English is becoming the same way: when you input "apple" and choose the [U+1F34E RED APPLE; stripped from input on HN] suggestion you've done exactly the same thing.
As someone who wrote the LTR/RTL text shaping for Uber Maps, I know the pains of this way too well. Off-by-one Unicode errors were causing Chinese characters to get appended to the end of Arabic words! Great read.
From the comments on that page, some registrars are now registering domains with emoji in them. RFC 3490 didn't contemplate that, so it is not, apparently, disallowed. No idea if IDNA does normalization for the color modifiers.
RFC 3490 has been replaced by RFC 5890. Sadly, emojis are disallowed. But some registries (for example, .ws, according to rumors) may allow them to be registered anyway.
It's worse than that. None of this weirdness is part of natural human language. It is all part of the weirdness of the writing system, which is entirely invented. What we are facing now is difficulties in interfacing different implementations of the same technology (writing).
The double-width emoji bug has at least been fixed as of 2016 and the Unicode 9 release. That string works fine in my VTE-based gnome-terminal today.
(I was the one who finally prodded the relevant Unicode committee into fixing this bug. They did all the heavy lifting of writing proposals and steering the change through, though: thanks, Ken et al!)
Yes, but it describes the Unicode situation in 2015. Unicode has probably already gained a few more dark corners since then, what with the proliferation of emoji, but that is not reflected in the article.
(I was fine with unrealistic, inhuman, Simpsons-style yellow...) I imagine fine gradations of locale-dependent zero-width gender identity modifiers will be added at some point. Unicode is a horror-show that will be producing bugs for decades to come. Every time you see a bug caused by "\r\n" vs. "\n", double-encoded HTML entities, or "smart" quotes, remember that Unicode is orders of magnitude more complex.