> Unicode characters that are visually identical This was actually a further bug...

userbinator · on Dec 19, 2014

http://en.m.wikipedia.org/wiki/Unicode_equivalence

http://en.m.wikipedia.org/wiki/IDN_homograph_attack

What seems really scary about this is that even Unicode has several different ways of comparing strings, and the correct one depends on the exact situation, so the common response of "just use a library" doesn't work; for example, if a user were searching for a filename it might make sense for full-width characters to compare equal to half-width ones, but not if opening a file where you wouldn't want e.g. the full width version of /etc/passwd to be equivalent to the half-width one.

raydev · on Dec 19, 2014

I don't understand. Why can't a library just compare strings at the code-point level, ignoring "canonical equivalence"?

medgno · on Dec 19, 2014

Then you run into problems with how characters are represented. For instance, é (lowercase latin e with an acute accent) can be represented either by one unicode codepoint (U+00E9, 'LATIN SMALL LETTER E WITH ACUTE'), or by two unicode codepoints (U+0065 U+0301 -- LATIN SMALL LETTER E, COMBINING ACUTE ACCENT). There are normalization forms that will convert these two representations into the same representation for easier comparison.

If you don't perform canonical equivalence checking, you could search for "café" and not find a file named "café.txt" if it uses the other representation.

oneeyedpigeon · on Dec 19, 2014

"for example, if a user were searching for a filename"

It's useful when I search for "café", if I also get results for "cafe" - Chrome's search does this. Not to mention searching for "don't" and getting hits including "don’t". But that should definitely be restricted to text data operations, rather than lower-level ones, including filenames.

userbinator · on Dec 19, 2014

But that should definitely be restricted to text data operations, rather than lower-level ones, including filenames.

The issue arises when this "text data" includes filenames. Having café.txt and cafe.txt be equivalent when searching is useful, but the real problem is if a filesystem decides that two "equivalent" filenames are essentially identical - to contrive an example, suppose it thought /étc/passwd was referring to the same file as /etc/passwd . It makes checking for and filtering out "sensitive" filenames far more difficult. For example, just take a look at all the ways Unicode homoglyphs and "special" characters can be used to bypass forum wordfilters, and you'll see how difficult that problem is.

(I know permissions, ACLs, etc. can help here with access control, but the problem of distinguishing between filenames still stands.)