
I seriously applaud the writer for diving into Unicode in such detail and comparing multiple implementations in different languages. That must have taken a while!

He works for Mozilla, and I guess he actually needs to know all those nitty-gritty details. I can't imagine mustering the patience to analyze it myself.

In some ways Unicode really scares me. I don't worry when the programming language simply has to allow input and output of Unicode in text fields and configuration files. Where it gets tough is when you actually need to know the rendered size of a string, or have to convert encodings because the output format requires it. There are so many places where things can go awry, and only a short passage in the blog post mentions fonts. Dragons abound there: font encodings, kerning, and whatnot.

Imagine writing a game. Everything works nice and dandy with your ASCII format, and now your boss approaches you and wants to distribute the game for the Asian market! Oh, dear... My nightmares are made of this!

How do you handle this? Do you use a programming language which does everything you need? (Which one?) Do you use some special libraries? What about font rendering? Any recommendations?



> Imagine writing a game. Everything works nice and dandy with your ASCII format and now your boss approaches you and wants to distribute the game for the Asian market!

You just make sure all translated text is in UTF-8, and use Google Noto fonts for those languages. All game engines I know render UTF-8 text without problems if you supply a font that has the needed glyphs.

Source: I'm an indie game developer and have recently localized my game to Chinese. The game is a mix of RPG and roguelike, so it has a lot of text (over 10000 words). I used SDL_TTF to render text, specifically the TTF_RenderUTF8_Blended() function. The only issue I had was with multiline/wrapped text. SDL_TTF doesn't break lines on Chinese punctuation characters (.,;:!?), so I would search and replace strings at runtime to add a regular space character after those.


> SDL_TTF doesn't break lines on Chinese punctuation characters (.,;:!?)

Those aren't Chinese punctuation characters. Chinese punctuation characters are full-width, including the spacing that should follow them (or, in the case of "(", precede them) within the glyph itself: (。,;:!?). (You may also notice that the period is radically different.) Chinese text should almost never include space characters.

Chinese applications seem happy to break lines anywhere including in the middle of a word, but punctuation seems like an especially good place for a line break, so I'm confused why SDL_TTF would go out of its way to avoid breaking there.


It sounds more like a bug in SDL_TTF than a deliberate attempt to not break the line on Chinese punctuation marks.

I wonder if SDL_TTF works with Unicode Zero-Width Space (U+200B). If so, that would probably be the right choice.


> Those aren't Chinese punctuation characters.

I know, I meant the actual ones you wrote above.

> I'm confused why SDL_TTF would go out of its way to avoid breaking there.

SDL_TTF doesn't break at all. If you have a long Chinese text which uses proper punctuation characters, it would never break, because it only breaks on ASCII whitespace.

I wanted to avoid breaking lines in the middle of a word, so I added extra "regular" space characters to force breaking the line.
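For anyone curious, that workaround can be sketched roughly like this in Python (the game itself is not written in Python; the function name and the exact punctuation set are my assumptions for illustration only):

```python
# Hypothetical helper; the name and the punctuation set are illustrative.
# Characters: ideographic full stop, full-width comma, semicolon, colon,
# exclamation mark, question mark, and the ideographic comma.
FULLWIDTH_PUNCTUATION = "\u3002\uff0c\uff1b\uff1a\uff01\uff1f\u3001"


def add_break_opportunities(text: str, breaker: str = " ") -> str:
    """Insert `breaker` after each full-width punctuation mark.

    This gives a wrapper that only breaks on ASCII whitespace a place
    to break. Nothing is added at the end of the string or where
    whitespace already follows the punctuation mark.
    """
    out = []
    for i, ch in enumerate(text):
        out.append(ch)
        has_next = i + 1 < len(text)
        if ch in FULLWIDTH_PUNCTUATION and has_next and not text[i + 1].isspace():
            out.append(breaker)
    return "".join(out)
```

Passing `"\u200b"` as `breaker` would try the zero-width space idea mentioned above, but whether SDL_TTF treats that character as a break opportunity is not something this sketch can tell you.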


You don't really need to break only on punctuation. There is no convention to do so, and so long as you do not break any logograms in half, the resulting text reads perfectly fine. In fact, the convention is to have left- and right-justified text with equal numbers of monospaced logograms, including punctuation, on each line (or the equivalent for vertical text). Classical Chinese before the 20th century was seldom punctuated.
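A minimal sketch of that fixed-count wrapping, in Python (the function name is made up for illustration). It relies on Python strings being indexed by code point, so a slice never cuts a logogram's encoding in half. It does not try to keep combining sequences together, which is an assumption that holds for plain Chinese text but not for every script:

```python
def wrap_fixed_width(text: str, chars_per_line: int) -> list[str]:
    """Split text into lines of exactly chars_per_line characters.

    The last line may be shorter. Slicing a Python str works on code
    points, not bytes, so no character is split in the middle.
    """
    if chars_per_line <= 0:
        raise ValueError("chars_per_line must be positive")
    return [
        text[i:i + chars_per_line]
        for i in range(0, len(text), chars_per_line)
    ]
```

Doing the same thing on raw UTF-8 bytes would need extra care, since each of these characters takes three bytes and a break at an arbitrary byte offset can land inside one.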


I wasn't aware of this. Thanks.


>You just make sure all translated text is in UTF-8, and use Google Noto fonts for those languages. All game engines I know render UTF-8 text without problems if you supply a font that has the needed glyphs.

The game could have used a custom engine (like tons of games do), or the requirement could include e.g. Arabic or some such RTL text, further messing up the display...


First rule: don't panic :)

Also, a disclaimer: I've been working on games which were translated to more than 20 languages, including some exotic ones, but I'm in no way a Unicode expert.

- most important: consider using UTF-8 as text encoding everywhere, and only encode and decode from and to other text encodings when needed (for instance when talking to APIs which don't understand UTF-8, like Windows)

- be very careful in all places where users can enter strings, and with filesystem paths; this is where most bugs happen. One of the most popular bugs is when a user has Unicode characters in their login name and the game can't access the user's "home directory". It happens to the big ones too: https://what.thedailywtf.com/topic/15579/grand-theft-auto-v-...

- get familiar with how UTF-8 encoding works and how it is "backward compatible" with 7-bit ASCII; chances are good that you don't need to change much of your old string processing code

- rendering is where it gets interesting, and here it makes sense to only do what's needed:

(1) The easiest case is American and European languages, these are all left-to-right, have fairly small alphabets and don't have complicated 'text transformation' rules

(2) East-Asian languages with huge alphabets can be a problem if you need to pre-render all font textures.

(3) The next step is languages which render from right-to-left, the interesting point is that substrings may still need to be rendered left-to-right (for instance numbers, or "foreign" strings)

(4) And finally there are languages like Arabic which rely heavily on modifying the shape of 'characters' based on where they are positioned in words or in relation to other characters, you need some sort of language-specific preprocessing of strings before you forward them to the renderer. HarfBuzz is a general solution for this problem, but it's also a lot of code to integrate (we created a specialized transformation only for Arabic).

(5) For actual rendering, text rendering engines which can use TTF fonts are usually ready for rendering Unicode text
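To illustrate point (3), here is a small Python sketch that groups a mixed string into runs by Unicode bidirectional class, using the standard library's `unicodedata.bidirectional()`. The function name `bidi_runs` is made up for this example. It only classifies characters; it is not an implementation of the full Unicode Bidirectional Algorithm, which also handles reordering and nesting:

```python
import unicodedata
from itertools import groupby


def bidi_runs(text: str) -> list[tuple[str, str]]:
    """Group consecutive characters that share a bidirectional class.

    Returns (class, substring) pairs, e.g. "R" for Hebrew letters,
    "AL" for Arabic letters, "EN" for European digits, "L" for Latin
    letters and "WS" for whitespace.
    """
    return [
        (bidi_class, "".join(chars))
        for bidi_class, chars in groupby(text, key=unicodedata.bidirectional)
    ]
```

For a Hebrew word followed by " 123 abc", this yields an "R" run, then "WS", "EN", "WS" and "L" runs, which shows why a renderer cannot simply reverse the whole string.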

So basically, the problem becomes a lot easier if you only need to support a specific set of languages.
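On the "backward compatible" point above, a rough Python sketch of why old byte-oriented code often keeps working. The record format and the name `parse_record` are invented for illustration. The idea is that every byte of a multi-byte UTF-8 sequence has its high bit set, so it can never be mistaken for an ASCII delimiter such as ';' or '=':

```python
def parse_record(raw: bytes) -> dict[str, str]:
    """Byte-oriented parser that splits on ASCII ';' and '='.

    It never looks inside the values, so it works unchanged on UTF-8
    input. Decoding happens only at the end, at the boundary.
    """
    result = {}
    for field in raw.split(b";"):
        key, _, value = field.partition(b"=")
        result[key.decode("utf-8")] = value.decode("utf-8")
    return result


# Pure ASCII text has identical bytes in ASCII and UTF-8.
assert "name=Player1".encode("utf-8") == "name=Player1".encode("ascii")

# Non-ASCII characters only produce bytes >= 0x80.
assert all(byte >= 0x80 for byte in "玩家".encode("utf-8"))
```

What does change is anything that assumes one byte per character: `len("玩家".encode("utf-8"))` is 6, not 2, so length limits and truncation code need a second look.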


The last part, about supporting a limited number of languages, is bigger than you probably expect. Generally, unless software has actually been adapted to and tested with a specific language, it shouldn't claim to support it, even if it is just processing UTF-8 encoded text in that language.



