
Inventing your own pseudo-normalization of Unicode is a worse idea than using the actual normalization forms Unicode defines.

Also, if you think you can decompose without allocating memory... well, try a code point like U+FDFA.

For reference, its decomposition is:

U+0635 U+0644 U+0649 U+0020 U+0627 U+0644 U+0644 U+0647 U+0020 U+0639 U+0644 U+064A U+0647 U+0020 U+0648 U+0633 U+0644 U+0645

(and that doesn't begin to touch any of the potential issues with variant forms, homoglyph attacks, etc.)
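That expansion is easy to check with Python's standard-library unicodedata module. Note that the 18-code-point expansion is a *compatibility* decomposition: canonical normalization (NFD) leaves U+FDFA alone, and only NFKD/NFKC expand it.

```python
import unicodedata

# NFD (canonical decomposition) does not touch U+FDFA at all.
assert unicodedata.normalize("NFD", "\uFDFA") == "\uFDFA"

# NFKD (compatibility decomposition) expands it to 18 code points,
# the largest single-character decomposition in Unicode.
expanded = unicodedata.normalize("NFKD", "\uFDFA")
print(len(expanded))  # 18
```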



There's nothing pseudo about it. Normalizing both inputs first and then comparing is equivalent to normalizing one character at a time and comparing as you go. There is a maximum number of code points in a canonical decomposition (or at least there used to be), so the per-character buffer is bounded.

This is actually implemented in ZFS (along with character-at-a-time normalization for hashing).
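A minimal Python sketch of the idea (the names `norm_chunks` and `normalized_equal` are mine, not from ZFS): buffer input until the next starter (combining class 0), so canonical reordering of combining marks never crosses a chunk boundary, then normalize and compare chunk by chunk. Memory stays bounded by the longest run of combining marks rather than the string length.

```python
import unicodedata

def norm_chunks(s, form="NFD"):
    """Yield normalized chunks of s, cutting before each starter
    (combining class 0). Canonical reordering only moves combining
    marks, which cannot cross a starter, so each chunk can be
    normalized independently with a small bounded buffer."""
    buf = ""
    for ch in s:
        if buf and unicodedata.combining(ch) == 0:
            yield unicodedata.normalize(form, buf)
            buf = ""
        buf += ch
    if buf:
        yield unicodedata.normalize(form, buf)

def normalized_equal(a, b, form="NFD"):
    # Sketch only: a production version (like ZFS's form-insensitive
    # lookups) would compare the chunk streams incrementally and
    # handle cases such as Hangul jamo, where chunk boundaries of
    # canonically equivalent strings need not line up.
    return "".join(norm_chunks(a, form)) == "".join(norm_chunks(b, form))
```

For example, precomposed U+00E9 and the sequence "e" + U+0301 compare equal without normalizing either full string up front.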

I don't see how homoglyphs enter the picture. Can you explain?



