Most programs claim to support Unicode, but they actually don't. They miscount string lengths (type in a CJK character or an emoji and the string looks shorter to you than the program thinks it is), split strings improperly, or get any number of other things wrong. It doesn't help that most programming languages also handle Unicode poorly by default, with the standard APIs producing wrong results.
I'd take "we don't do unicode at all" or "we only support BMP" or "we don't support composite characters" any day over pretend-support (but then inevitably breaking when the program wasn't tested with anything non-ASCII)
(ninjaedit: to see how prevalent this is, even gigantic messaging apps such as Discord make this mistake. There are users on Discord whom you can't add as friends because the friend input field is limited to 32... something - probably bytes - yet elsewhere the app allows such a name to be taken. This is easy to trigger with combining characters.)
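To make the mismatch concrete, here's a minimal Rust sketch (the name and the 32-byte limit are made up for illustration; I have no insight into Discord's actual validation):

    fn main() {
        // Three visible letters, each stacked with five combining marks.
        let name = "Z\u{300}\u{301}\u{302}\u{303}\u{304}\
                    o\u{300}\u{301}\u{302}\u{303}\u{304}\
                    e\u{300}\u{301}\u{302}\u{303}\u{304}";
        println!("code points: {}", name.chars().count()); // 18
        println!("UTF-8 bytes: {}", name.len());            // 33 -- already over a 32-byte limit
    }

Count bytes on one side and you reject a name the user sees as three characters; count something else on the other side and the two checks disagree.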
Why are people so interested in ‘lengths’ of strings?
The only thing anyone should actually care about is either the number of bytes it takes to store the string, or the number of pixels wide it is when rendered. Both are only loosely related to how many ‘characters’ or ‘grapheme clusters’ the string contains, and the two are themselves only vaguely correlated with each other.
𒈙 (CUNEIFORM SIGN LUGAL OPPOSING LUGAL) is four bytes of UTF-8 and as wide as about 9 Latin characters. 🏴󠁧󠁢󠁥󠁮󠁧󠁿 (the England flag emoji) is about one character wide, but takes 28 bytes to store in UTF-8.
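For instance, a rough sketch with the unicode-segmentation crate (any UAX #29 implementation should agree): both of these are a single grapheme cluster, their byte counts differ by a factor of seven, and neither number says anything about rendered width.

    use unicode_segmentation::UnicodeSegmentation; // Cargo.toml: unicode-segmentation = "1"

    fn main() {
        let lugal = "𒈙"; // CUNEIFORM SIGN LUGAL OPPOSING LUGAL
        let england = "\u{1F3F4}\u{E0067}\u{E0062}\u{E0065}\u{E006E}\u{E0067}\u{E007F}"; // England flag tag sequence

        assert_eq!(lugal.graphemes(true).count(), 1);   // one grapheme cluster
        assert_eq!(england.graphemes(true).count(), 1); // also one grapheme cluster
        assert_eq!(lugal.len(), 4);    // UTF-8 bytes
        assert_eq!(england.len(), 28); // UTF-8 bytes
    }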
Saying it's the ONLY thing anyone should care about is bound to over-generalize, but I do find myself agreeing that the vast majority of cases are really trying to guarantee something about byte or pixel dimensions. I think the reason we often end up defaulting to string length is that, for most Western text, it was much easier to deal with. Emoji has been, somewhat intentionally, a key driver in breaking that assumption, even for those who don't care about supporting cuneiform and had no idea of, or plans for, supporting other languages from the start.
The number of bytes is still hard to do right. For example, if I have a payload that can only be 4000 bytes, how do I take an arbitrary UTF-8 string and get one that is <= 4000 bytes and doesn't cut off a grapheme cluster?
There's no way to take an arbitrary utf-8 string and shorten it to <= 4000 bytes without losing something. Heck, even ASCII has \r\n which can leave you in a bad place if you cut it off in the middle.
Even if you do the Unicode stuff right and make sure you don't break a grapheme cluster, you're still cutting words in half, which, if you're doing something like pulling AP headlines and truncating them to fit on a screen, can have embarrassing consequences for a headline like 'FDA Chief Exposes Butter Industry Corruption'. Even if you make sure to break on word boundaries, you're still at risk of turning 'US Navy Fires Nuclear Weapons Program Chief' into 'US Navy Fires Nuclear Weapons' on your news ticker.
The rules for grapheme clusters are well-defined and documented. I’ve only implemented a forward iterator over graphemes in the finl_unicode Rust crate, but backwards iteration would not be that difficult either. If you’re looking at a byte stream, you first have to make sure you’re not in the middle of a multi-byte sequence (easily enough accomplished), then do some forwards and backwards iteration to determine where the grapheme boundaries are. You’ll likely want a library to do this for you, but it’s something that should take a reasonably skilled programmer about a day or two.
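For example, here's a minimal sketch of byte-limited truncation using the widely used unicode-segmentation crate (a different crate than the one above; the function name is just for illustration):

    use unicode_segmentation::UnicodeSegmentation; // Cargo.toml: unicode-segmentation = "1"

    /// Return the longest prefix of `s` that fits in `max_bytes` bytes of
    /// UTF-8 without splitting a grapheme cluster.
    fn truncate_to_bytes(s: &str, max_bytes: usize) -> &str {
        let mut end = 0;
        // grapheme_indices yields (byte offset, grapheme) pairs.
        for (idx, g) in s.grapheme_indices(true) {
            if idx + g.len() > max_bytes {
                break; // the next cluster would not fit whole
            }
            end = idx + g.len();
        }
        &s[..end]
    }

    fn main() {
        let s = "a\u{301}bc"; // 'a' + combining acute accent + "bc"
        assert_eq!(truncate_to_bytes(s, 2), "");         // won't split the cluster
        assert_eq!(truncate_to_bytes(s, 3), "a\u{301}"); // whole cluster fits
    }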
Well, don't ask me, ask the service operators who limit their fields to X bytes! :D Whatever their reasoning, if they want to limit strings to a certain length, they should at least make it consistent, so users don't run into problems where a string is accepted in one place and then causes an error, or is refused outright, in another part of the service.
Your app, unless it's a text editor or UI toolkit, should not care.
The easiest way to support Unicode is to avoid supporting it. Leave text editing to UI widgets. Leave truncation to web browsers. Avoid fancy marquee effects on the terminal. Your program may look less fancy, but it will automatically support Unicode, even future extensions.
Unless you deal with very old data, just use UTF-8 [0]. Mac terminal uses UTF-8 by default, most Linuxes use UTF-8 by default, and if you are stuck with Windows, there are libraries which let you use UTF-8 everywhere.
And if you do deal with old data, still prefer UTF-8, and make sure you only translate character sets if the user asks or they are explicitly specified. And for the love of god, _please_ don't just hardcode two common encodings (like 8859-1 and 8859-15)... I used to deal with encodings a lot, and the only times I actually had data loss from encoding problems were from misguided apps which assumed 8859-1 (and tried to convert to ASCII by stripping accents). Please either assume UTF-8, or assume nothing and keep the data as-is.
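In code terms, that policy can be as simple as this sketch (the processing step is just a placeholder):

    /// Treat input as UTF-8 if it validates; otherwise leave the bytes alone.
    fn process(raw: &[u8]) -> Vec<u8> {
        match std::str::from_utf8(raw) {
            // Valid UTF-8: safe to do text-level work (here, just trimming).
            Ok(text) => text.trim().as_bytes().to_vec(),
            // Not UTF-8: don't guess a legacy charset, don't strip accents,
            // keep the data exactly as received.
            Err(_) => raw.to_vec(),
        }
    }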
And that's it, _most_ software does not need to know more. Let's look at some examples:
- For a file archiver, assume UTF-8 everywhere and don't worry about encoding at all (unless you're planning to deal with very old archives, in which case you'd need encoding support. But for something made from scratch, don't bother.) _Especially_ don't add any encoding support for an "extract to console" function -- if I need a different encoding I can pipe into iconv myself, thank you very much.
- For an API client, you don't need to worry about encoding either. A very old HTTP server might return data in something other than UTF-8, but (1) your HTTP library likely handles it already and (2) how many such servers are left, anyway?
- For an XML analyzer, use a proper XML library; it will handle encodings for you.
- For a web dashboard or a database, keep everything in UTF-8 and things will just work.
- For ETL-like applications, many modern data sources are already in UTF-8, and if they are not, convert everything to UTF-8 ASAP.
In other words: while it's useful to know that encodings exist, unless you are working with very old data or legacy systems, utf-8 is the only thing you need.
Do you have any specific examples, or are you just making general statements?
Back in the day, I worked quite a bit with various encodings (my language had two primary ones and two secondary ones, and it was a guess which one a given text was in), and the data loss usually came from programs that tried to support encodings.
When a program wouldn't touch encodings, there might be some mojibake and unreadable text, but you could generally fix things. I had to write some scripts to change the encoding of filenames and fix up the odd database or five, but there was no data loss.
It's the programs that were "able to tell the difference between, e.g., ISO 8859-1, ISO 8859-15, and UTF-8" which caused data loss. There were so many cases where I ended up with a directory full of "?????????" or "aoeoaoao" files and that was it, there was no way to recover.
So please don't add Unicode support unless you have to; this is not ideology but rather the result of hard practical experience.
Hmm, I could imagine that you have a precomposed character with two diacritical marks; if there were no precomposed character with the remaining diacritical you'd need to replace the precomposed character with the base character and diacritical. However I would imagine the better UX would be for backspace to remove the entire character (whether diacriticals were composed or not) because that's how people think of their characters.
Is there a real world case? I'd be excited to learn it.
What happens when the user presses backspace when the caret is positioned just past the graphical representation of U+AC01? It should by default result in U+AC01 being replaced by U+1100 and U+1161. (In some IME implementations the user might have to configure it, e.g., "Delete by jaso unit" in Windows 7.)
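A quick sketch of that decomposition with the unicode-normalization crate (the jamo-trimming "backspace" here only illustrates the expected result, not how any particular IME implements it):

    use unicode_normalization::UnicodeNormalization; // Cargo.toml: unicode-normalization = "0.1"

    fn main() {
        // U+AC01 decomposes under NFD into the jamo U+1100, U+1161, U+11A8.
        let jamo: Vec<char> = "\u{AC01}".nfd().collect();
        assert_eq!(jamo, ['\u{1100}', '\u{1161}', '\u{11A8}']);

        // Deleting only the trailing jamo leaves U+1100 + U+1161,
        // which is what the backspace described above should produce.
        let after_backspace: String = jamo[..jamo.len() - 1].iter().collect();
        assert_eq!(after_backspace, "\u{1100}\u{1161}");
    }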
There was probably an existing character set that had precomposed characters and Unicode always includes existing character sets for round trip idempotency.
Exactly this. When Unicode 1.0 came out, there was no standardized way to indicate that f i should compose into fi, let alone that ᄋ ㅡ ᄀ should compose into 윽. IIRC, composing East Asian scripts into Han or Hangul happened at the firmware level, not at the app or OS level. If you didn’t have the hardware built into your keyboard, you simply could not type anything in East Asian scripts. A lot of the inconsistencies in Unicode (like the difference in how vowels are handled in Thai vs Indic scripts) come down to how the original 8-bit encodings worked pre-Unicode.
Not especially. I was around for a lot of it (I shared an office with one of the contributors to Unicode 1.0 back in the late 80s and used to have a massive binder with every ECMA code page in it, which included a lot of non-European codes). The other thing about Han unification that people don’t realize is that there were five different East Asian 16-bit encodings (the original draft of ISO-10646 had a 32-bit encoding which included all of them). I worked on a project in the early 90s where we needed multiple-script capabilities (Roman, Cyrillic and Japanese) and ended up using JIS because it was a well-defined encoding for which we could buy fonts (although we still had to contract with Bitstream to get a version of their font split into 8-bit segments for our software). There was JIS (Japanese), the Korean code (which included both jamo and precomposed Hangul syllables) and not one, not two, but three, count ’em, three different Chinese encodings: mainland China, Hong Kong and Taiwan. Since the original goal of Unicode was to be a 16-bit fixed-width encoding (as opposed to the competing 32-bit fixed-width encoding of ISO-10646), there was really no choice but to do Han unification. Even so, the five 16-bit encodings still linger.
Yeah, Chinese character radicals are not an alphabet per se, while Hangul has separate, standalone letters. It's literally a matter of composing them in a fixed way, compared to Chinese characters, which tend to go bananas with how many thousands of ways they can be composed.
I was tempted to include Unicode when I enhanced the URL/domain-generating strategies in Hypothesis to correctly generate from the set of all valid TLDs a few years ago! But I decided against it, since technically the canonical storage form doesn't contain the Unicode characters: Punycode (https://en.wikipedia.org/wiki/Punycode) conversion should already have been done by the "input" system, be it the URL bar of your browser or whatever input handling is in play. Assuming all domain processors are able to Punycode correctly was a bit too much, so I kept it to the set of basic valid characters that need no Punycode conversion, and if someone wanted to fuzz test fully they could just add a layer that generates from full Unicode.
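For reference, this is roughly the conversion in question; a small sketch assuming the idna crate from the rust-url project (the domain is just an example):

    // Cargo.toml: idna = "0.5"
    fn main() {
        // The "input" side is expected to hand storage/processing the ASCII
        // (Punycode) form of an internationalized domain name.
        let ascii = idna::domain_to_ascii("bücher.example").expect("valid IDN");
        assert_eq!(ascii, "xn--bcher-kva.example");
    }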
I'd take "we don't do unicode at all" or "we only support BMP" or "we don't support composite characters" any day over pretend-support (but then inevitably breaking when the program wasn't tested with anything non-ASCII)
(ninjaedit: to see how prevalent it is, even gigantic message apps such as discord make this mistake. There are users on discord who you can't add as friends because the friend input field is limited to 32.... something - probably bytes, yet elsewhere the program allows the name to be taken. This is easy to do with combining characters)