
> Unicode characters that are visually identical

This was actually a further bug, reported as part of the same CVE: you could also overwrite .git/config by inserting any of several zero-width Unicode characters into the path. Some filesystems ignore those characters when checking filenames for equality, but a plain string comparison, of course, does not.
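A rough Python sketch of that mismatch. The set of ignorable characters and the fs_like_key helper are made up for illustration; which code points a real filesystem ignores is an assumption here, not taken from any filesystem's spec.

```python
# Illustrative subset of zero-width code points. Which ones a real
# filesystem actually ignores is an assumption in this sketch.
IGNORABLE = {"\u200c", "\u200d", "\ufeff"}


def fs_like_key(name: str) -> str:
    """Hypothetical comparison key: drop ignorable code points, fold case."""
    return "".join(ch for ch in name if ch not in IGNORABLE).casefold()


def looks_like_dot_git(name: str) -> bool:
    """Compare the way such a filesystem might, not the way == does."""
    return fs_like_key(name) == ".git"


sneaky = ".g\u200cit"  # ".git" with a zero-width non-joiner inside

print(sneaky == ".git")            # plain string comparison
print(looks_like_dot_git(sneaky))  # filesystem-style comparison
```

A check written with == lets the sneaky name through, while a filesystem that compares like fs_like_key would still resolve it to the real directory.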



http://en.m.wikipedia.org/wiki/Unicode_equivalence

http://en.m.wikipedia.org/wiki/IDN_homograph_attack

What seems really scary about this is that Unicode itself defines several different ways of comparing strings, and the correct one depends on the exact situation, so the common advice to "just use a library" doesn't settle it. For example, when a user searches for a filename it might make sense for full-width characters to compare equal to half-width ones, but not when opening a file: you wouldn't want the full-width version of /etc/passwd to be treated as equivalent to the half-width one.
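To make the full-width case concrete, here is a small Python sketch using the standard unicodedata module. The function names (to_fullwidth, search_equal, path_equal) are hypothetical; the point is only that the same library offers both comparisons and the caller has to pick.

```python
import unicodedata


def to_fullwidth(s: str) -> str:
    """Map printable ASCII (U+0021..U+007E) to its full-width form."""
    return "".join(chr(ord(c) + 0xFEE0) if "!" <= c <= "~" else c for c in s)


def search_equal(a: str, b: str) -> bool:
    """Loose match for search: compatibility normalization (NFKC)."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)


def path_equal(a: str, b: str) -> bool:
    """Strict match for opening files: exact code points only."""
    return a == b


wide = to_fullwidth("/etc/passwd")

print(search_equal(wide, "/etc/passwd"))
print(path_equal(wide, "/etc/passwd"))
```

NFKC folds the full-width forms onto ASCII, which is reasonable for search and dangerous anywhere the result decides which file gets opened.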


I don't understand. Why can't a library just compare strings at the code-point level, ignoring "canonical equivalence"?


Then you run into problems with how characters are represented. For instance, é (lowercase Latin e with an acute accent) can be represented either by one Unicode code point (U+00E9, 'LATIN SMALL LETTER E WITH ACUTE') or by two (U+0065 U+0301: LATIN SMALL LETTER E, COMBINING ACUTE ACCENT). The Unicode normalization forms, such as NFC and NFD, convert these two representations into the same one for easier comparison.

If you don't perform canonical equivalence checking, you could search for "café" and not find a file named "café.txt" if it uses the other representation.
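In Python this looks roughly like the following, using unicodedata.normalize from the standard library. The canonical_equal helper and the choice of NFC are just one way to do it; NFD would work equally well as long as both sides use the same form.

```python
import unicodedata

composed = "caf\u00e9.txt"     # é as one code point (U+00E9)
decomposed = "cafe\u0301.txt"  # e followed by COMBINING ACUTE ACCENT


def canonical_equal(a: str, b: str) -> bool:
    """Compare after normalizing both sides to the same form (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(len(composed), len(decomposed))         # different code point counts
print(composed == decomposed)                 # code-point comparison
print(canonical_equal(composed, decomposed))  # canonical equivalence
```

Both strings render identically, yet the code-point comparison treats them as different names.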


"for example, if a user were searching for a filename"

When I search for "café", it's useful to also get results for "cafe"; Chrome's search does this. The same goes for searching for "don't" and getting hits that include "don’t". But that should definitely be restricted to text data operations, rather than lower-level ones, including filenames.
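One way to get that kind of loose matching, sketched in Python. This is not how Chrome does it (I don't know its implementation); search_key and the small punctuation table are assumptions for illustration.

```python
import unicodedata

# Typographic quotes have no Unicode decomposition to the ASCII
# apostrophe, so this sketch maps them by hand.
PUNCT = str.maketrans({"\u2019": "'", "\u2018": "'"})


def search_key(text: str) -> str:
    """Hypothetical search key: fold quotes, strip accents, fold case."""
    decomposed = unicodedata.normalize("NFD", text.translate(PUNCT))
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


print(search_key("caf\u00e9") == search_key("cafe"))
print(search_key("don\u2019t") == search_key("don't"))
```

A key like this belongs in a search index only. Using it to decide which file a name refers to would merge names that should stay distinct.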


> But that should definitely be restricted to text data operations, rather than lower-level ones, including filenames.

The issue arises when this "text data" includes filenames. Having café.txt and cafe.txt be equivalent when searching is useful, but the real problem is if a filesystem decides that two "equivalent" filenames are essentially identical - to contrive an example, suppose it thought /étc/passwd was referring to the same file as /etc/passwd . It makes checking for and filtering out "sensitive" filenames far more difficult. For example, just take a look at all the ways Unicode homoglyphs and "special" characters can be used to bypass forum wordfilters, and you'll see how difficult that problem is.

(I know permissions, ACLs, etc. can help here with access control, but the problem of distinguishing between filenames still stands.)
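The contrived /étc/passwd case can be sketched in Python like this. The accent-folding filesystem is entirely hypothetical (hypothetical_fs_resolve stands in for it), and BLOCKED is a toy blocklist; the sketch only shows why a filter has to compare names the same way the filesystem does.

```python
import unicodedata

BLOCKED = {"/etc/passwd"}  # toy blocklist for illustration


def naive_filter_allows(path: str) -> bool:
    """Filter that compares raw code points."""
    return path not in BLOCKED


def hypothetical_fs_resolve(path: str) -> str:
    """Stand-in for a filesystem that treats accented names as equal."""
    decomposed = unicodedata.normalize("NFD", path)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def safer_filter_allows(path: str) -> bool:
    """Filter that applies the filesystem's own folding before checking."""
    return hypothetical_fs_resolve(path) not in BLOCKED


tricky = "/\u00e9tc/passwd"

print(naive_filter_allows(tricky))      # the naive filter lets it through
print(hypothetical_fs_resolve(tricky))  # the name the filesystem would open
print(safer_filter_allows(tricky))
```

The catch is that the filter must reproduce the filesystem's equivalence rules exactly, and every rule it misses is another bypass.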



