Most programs claim to support Unicode, but they actually don't. They miscount string lengths (type in a CJK character or an emoji and the string looks shorter to you than the program thinks it is), split strings improperly, or get any number of other things wrong. It doesn't help that most programming languages also handle Unicode poorly by default, with the standard APIs producing wrong results.
I'd take "we don't do unicode at all" or "we only support BMP" or "we don't support composite characters" any day over pretend-support (but then inevitably breaking when the program wasn't tested with anything non-ASCII)
(ninjaedit: to see how prevalent this is, even gigantic messaging apps such as Discord make this mistake. There are users on Discord whom you can't add as friends because the friend input field is limited to 32... something - probably bytes - yet elsewhere the app allows such a name to be taken. This is easy to trigger with combining characters.)
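To make the mismatch concrete, here's a minimal Rust sketch (the name and the 32-byte limit are made up for illustration; I have no insight into Discord's actual validation):

    fn main() {
        // Three visible letters, each stacked with five combining marks.
        let name = "Z\u{300}\u{301}\u{302}\u{303}\u{304}\
                    o\u{300}\u{301}\u{302}\u{303}\u{304}\
                    e\u{300}\u{301}\u{302}\u{303}\u{304}";
        println!("code points: {}", name.chars().count()); // 18
        println!("UTF-8 bytes: {}", name.len());            // 33 -- already over a 32-byte limit
    }

Count bytes on one side and you reject a name the user sees as three characters; count something else on the other side and the two checks disagree.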
Why are people so interested in ‘lengths’ of strings?
The only thing anyone should actually care about is either the number of bytes it takes to store the string, or the number of pixels wide it is when rendered. Both are only loosely related to how many ‘characters’ or ‘grapheme clusters’ the string contains, and the two are themselves only vaguely correlated with each other.
𒈙 (CUNEIFORM SIGN LUGAL OPPOSING LUGAL) is four bytes of UTF-8 and as wide as about 9 Latin characters. 🏴󠁧󠁢󠁥󠁮󠁧󠁿 (the England flag emoji) is about one character wide, but takes 28 bytes to store in UTF-8.
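For instance, a rough sketch with the unicode-segmentation crate (any UAX #29 implementation should agree): both of these are a single grapheme cluster, their byte counts differ by a factor of seven, and neither number says anything about rendered width.

    use unicode_segmentation::UnicodeSegmentation; // Cargo.toml: unicode-segmentation = "1"

    fn main() {
        let lugal = "𒈙"; // CUNEIFORM SIGN LUGAL OPPOSING LUGAL
        let england = "\u{1F3F4}\u{E0067}\u{E0062}\u{E0065}\u{E006E}\u{E0067}\u{E007F}"; // England flag tag sequence

        assert_eq!(lugal.graphemes(true).count(), 1);   // one grapheme cluster
        assert_eq!(england.graphemes(true).count(), 1); // also one grapheme cluster
        assert_eq!(lugal.len(), 4);    // UTF-8 bytes
        assert_eq!(england.len(), 28); // UTF-8 bytes
    }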
Saying it's the ONLY thing anyone should care about is bound to over-generalize, but I do find myself agreeing that the vast majority of cases are really trying to guarantee something about byte or pixel dimensions. I think the reason we often end up defaulting to string length is that, for most Western text, it was much easier to deal with. Emoji has been, somewhat intentionally, a key driver in breaking that assumption, even for those who don't care about supporting cuneiform and had no idea of, or plans for, supporting other languages from the start.
The number of bytes is still hard to do right. For example, if I have a payload that can only be 4000 bytes, how do I take an arbitrary UTF-8 string and get one that is <= 4000 bytes and doesn't cut off a grapheme cluster?
There's no way to take an arbitrary utf-8 string and shorten it to <= 4000 bytes without losing something. Heck, even ASCII has \r\n which can leave you in a bad place if you cut it off in the middle.
Even if you do the Unicode stuff right and make sure you don't break a grapheme cluster, you're still cutting words in half, which, if you're doing something like pulling AP headlines and truncating them to fit on a screen, can have embarrassing consequences for a headline like 'FDA Chief Exposes Butter Industry Corruption'. Even if you make sure to break on word boundaries, you're still at risk of turning 'US Navy Fires Nuclear Weapons Program Chief' into 'US Navy Fires Nuclear Weapons' on your news ticker.
The rules for grapheme clusters are well-defined and documented. I’ve only implemented a forward iterator over graphemes in the finl_unicode Rust crate, but backwards iteration would not be that difficult either. If you’re looking at a byte stream, you first have to make sure you’re not in the middle of a multi-byte sequence (easily enough accomplished), then do some forwards and backwards iteration to determine where the grapheme boundaries are. You’ll likely want a library to do this for you, but it’s something that should take a reasonably skilled programmer about a day or two.
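For example, here's a minimal sketch of byte-limited truncation using the widely used unicode-segmentation crate (a different crate than the one above; the function name is just for illustration):

    use unicode_segmentation::UnicodeSegmentation; // Cargo.toml: unicode-segmentation = "1"

    /// Return the longest prefix of `s` that fits in `max_bytes` bytes of
    /// UTF-8 without splitting a grapheme cluster.
    fn truncate_to_bytes(s: &str, max_bytes: usize) -> &str {
        let mut end = 0;
        // grapheme_indices yields (byte offset, grapheme) pairs.
        for (idx, g) in s.grapheme_indices(true) {
            if idx + g.len() > max_bytes {
                break; // the next cluster would not fit whole
            }
            end = idx + g.len();
        }
        &s[..end]
    }

    fn main() {
        let s = "a\u{301}bc"; // 'a' + combining acute accent + "bc"
        assert_eq!(truncate_to_bytes(s, 2), "");         // won't split the cluster
        assert_eq!(truncate_to_bytes(s, 3), "a\u{301}"); // whole cluster fits
    }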
Well, don't ask me, ask the service operators who limit their fields to X bytes! :D Whatever their reasoning, if they want to limit strings to a certain length, they should at least make it consistent, so users don't run into problems where a string is accepted in one place and then causes an error, or is refused outright, in another part of the service.
Your app, unless it's a text editor or UI toolkit, should not care.
The easiest way to support Unicode is to avoid supporting it. Leave text editing to UI widgets. Leave truncation to web browsers. Avoid fancy marquee effects on the terminal. Your program may look less fancy, but it will automatically support Unicode, even future extensions.
Unless you deal with very old data, just use UTF-8 [0]. Mac terminal uses UTF-8 by default, most Linuxes use UTF-8 by default, and if you are stuck with Windows, there are libraries which let you use UTF-8 everywhere.
And if you do deal with old data, still prefer UTF-8, and make sure you only translate character sets if the user asks or they are explicitly specified. And for the love of god, _please_ don't just hardcode two common encodings (like 8859-1 and 8859-15)... I used to deal with encodings a lot, and the only times I actually had data loss from encoding problems were from misguided apps which assumed 8859-1 (and tried to convert to ASCII by stripping accents). Please either assume UTF-8, or assume nothing and keep the data as-is.
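In code terms, that policy can be as simple as this sketch (the processing step is just a placeholder):

    /// Treat input as UTF-8 if it validates; otherwise leave the bytes alone.
    fn process(raw: &[u8]) -> Vec<u8> {
        match std::str::from_utf8(raw) {
            // Valid UTF-8: safe to do text-level work (here, just trimming).
            Ok(text) => text.trim().as_bytes().to_vec(),
            // Not UTF-8: don't guess a legacy charset, don't strip accents,
            // keep the data exactly as received.
            Err(_) => raw.to_vec(),
        }
    }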
And that's it, _most_ software does not need to know more. Let's look at some examples:
- For a file archiver, assume UTF-8 everywhere and don't worry about encoding at all (unless you're planning to deal with very old archives, in which case you'd need encoding support. But for something made from scratch, don't bother.) _Especially_ don't add any encoding support for an "extract to console" function -- if I need a different encoding I can pipe into iconv myself, thank you very much.
- For an API client, you don't need to worry about encoding either. A very old HTTP server might return data in something other than UTF-8, but (1) your HTTP library likely handles it already and (2) how many such servers are left, anyway?
- For an XML analyzer, use a proper XML library; it will handle encodings for you.
- For a web dashboard or a database, keep everything in UTF-8 and things will just work.
- For ETL-like applications, many modern data sources are already in UTF-8, and if they are not, convert everything to UTF-8 ASAP.
In other words: while it's useful to know that encodings exist, unless you are working with very old data or legacy systems, utf-8 is the only thing you need.
Do you have any specific examples, or are you just making general statements?
Back in the day, I worked quite a bit with various encodings (my language had two primary ones and two secondary ones, and it was a guess which one a given text was in), and the data loss usually came from programs that tried to support encodings.
When a program wouldn't touch encodings, there might be some mojibake and unreadable text, but you could generally fix things. I had to write some scripts to change the encoding of filenames and fix up the odd database or five, but there was no data loss.
It's the programs that were "able to tell the difference between, e.g., ISO 8859-1, ISO 8859-15, and UTF-8" which caused data loss. There were so many cases where I ended up with a directory full of "?????????" or "aoeoaoao" files and that was it, there was no way to recover.
So please don't add Unicode support unless you have to; this is not ideology but rather the result of hard practical experience.
Hmm, I could imagine that you have a precomposed character with two diacritical marks; if there were no precomposed character with the remaining diacritical you'd need to replace the precomposed character with the base character and diacritical. However I would imagine the better UX would be for backspace to remove the entire character (whether diacriticals were composed or not) because that's how people think of their characters.
Is there a real world case? I'd be excited to learn it.
What happens when the user presses backspace when the caret is positioned just past the graphical representation of U+AC01? It should by default result in U+AC01 being replaced by U+1100 and U+1161. (In some IME implementations the user might have to configure it, e.g., "Delete by jaso unit" in Windows 7.)
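A quick sketch of that decomposition with the unicode-normalization crate (the jamo-trimming "backspace" here only illustrates the expected result, not how any particular IME implements it):

    use unicode_normalization::UnicodeNormalization; // Cargo.toml: unicode-normalization = "0.1"

    fn main() {
        // U+AC01 decomposes under NFD into the jamo U+1100, U+1161, U+11A8.
        let jamo: Vec<char> = "\u{AC01}".nfd().collect();
        assert_eq!(jamo, ['\u{1100}', '\u{1161}', '\u{11A8}']);

        // Deleting only the trailing jamo leaves U+1100 + U+1161,
        // which is what the backspace described above should produce.
        let after_backspace: String = jamo[..jamo.len() - 1].iter().collect();
        assert_eq!(after_backspace, "\u{1100}\u{1161}");
    }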
There was probably an existing character set that had precomposed characters and Unicode always includes existing character sets for round trip idempotency.
Exactly this. When Unicode 1.0 came out, there was no standardized way to indicate that f i should compose into fi, let alone that ᄋ ㅡ ᄀ should compose into 윽. IIRC, composing East Asian scripts into Han or Hangul happened at the firmware level, not at the app or OS level. If you didn’t have the hardware built into your keyboard, you simply could not type anything in East Asian scripts. A lot of the inconsistencies in Unicode (like the difference in how vowels are handled in Thai vs Indic scripts) come down to how the original 8-bit encodings worked pre-Unicode.
Not especially. I was around for a lot of it (I shared an office with one of the contributors to Unicode 1.0 back in the late 80s and used to have a massive binder with every ECMA code page in it, which included a lot of non-European codes). The other thing about Han unification that people don’t realize is that there were five different East Asian 16-bit encodings (the original draft of ISO-10646 had a 32-bit encoding which included all of them). I worked on a project in the early 90s where we needed multiple-script capabilities (Roman, Cyrillic and Japanese) and ended up using JIS because it was a well-defined encoding for which we could buy fonts (although we still had to contract with Bitstream to get a version of their font split into 8-bit segments for our software). There was JIS (Japanese), the Korean code (which included both jamo and precomposed Hangul syllables) and not one, not two, but three, count ’em, three different Chinese encodings: mainland China, Hong Kong and Taiwan. Since the original goal of Unicode was to be a 16-bit fixed-width encoding (as opposed to the competing 32-bit fixed-width encoding of ISO-10646), there was really no choice but to do Han unification. Even so, the five 16-bit encodings still linger.
Yeah, Chinese character radicals are not an alphabet per se, while Hangul has separate, standalone letters. It's literally a matter of composing them in a fixed way, compared to Chinese characters, which tend to go bananas with how many thousands of ways they can be composed.
I was tempted to include Unicode when I enhanced the URL/domain-generating strategies in Hypothesis to correctly generate from the set of all valid TLDs a few years ago! But I decided against it, since technically the canonical storage form doesn't contain the Unicode characters: Punycode (https://en.wikipedia.org/wiki/Punycode) conversion should already have been done by the "input" system, be it the URL bar of your browser or whatever input handling is in play. Assuming all domain processors are able to Punycode correctly was a bit too much, so I kept it to the set of basic valid characters that need no Punycode conversion, and if someone wanted to fuzz test fully they could just add a layer that generates from full Unicode.
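For reference, this is roughly the conversion in question; a small sketch assuming the idna crate from the rust-url project (the domain is just an example):

    // Cargo.toml: idna = "0.5"
    fn main() {
        // The "input" side is expected to hand storage/processing the ASCII
        // (Punycode) form of an internationalized domain name.
        let ascii = idna::domain_to_ascii("bücher.example").expect("valid IDN");
        assert_eq!(ascii, "xn--bcher-kva.example");
    }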
I'd take "we don't do unicode at all" or "we only support BMP" or "we don't support composite characters" any day over pretend-support (but then inevitably breaking when the program wasn't tested with anything non-ASCII)
(ninjaedit: to see how prevalent it is, even gigantic message apps such as discord make this mistake. There are users on discord who you can't add as friends because the friend input field is limited to 32.... something - probably bytes, yet elsewhere the program allows the name to be taken. This is easy to do with combining characters)