
> normalized to Unicode normalization form NFKC

I'm wondering why they chose NFKC (compatibility composed form) instead of NFC (canonical composed form).

`ª` would become `a`, losing its superscript property; `ᵤ` becomes `u`, losing its subscript property; `Ⓐ` becomes `A`, losing its circled property. As for multi-codepoint mappings, `¼` would become the three codepoints `1⁄4`, where `⁄` (U+2044) doesn't map to `/` (U+002F).
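You can inspect these mappings with Python's standard `unicodedata` module (a quick sketch to illustrate the comparison above, not anything to do with Rust's lexer itself):

```python
import unicodedata

# Compare NFC (canonical) and NFKC (compatibility) for the characters above.
for ch in "ª\u1d64Ⓐ¼":  # U+1D64 is the subscript small letter u
    nfc = unicodedata.normalize("NFC", ch)
    nfkc = unicodedata.normalize("NFKC", ch)
    print(f"{ch!r}: NFC={nfc!r}  NFKC={nfkc!r}")

# NFC leaves all four characters unchanged, while NFKC folds them:
#   'ª' -> 'a', 'ᵤ' -> 'u', 'Ⓐ' -> 'A',
#   '¼' -> '1\u20444' (three codepoints: '1', U+2044 FRACTION SLASH, '4')
```

Note that NFKC's expansion of `¼` really does use U+2044, not the ASCII `/`, so the result is still not a valid ASCII expression.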



Maybe you are referring to the reference [1], which indeed mentions NFKC. As far as I know there is no consensus on the normalization form [2], and the current implementation is not guaranteed to stay, which is why Unicode identifiers are gated behind a `#[feature]` flag.

[1] http://static.rust-lang.org/doc/0.9/rust.html#input-format

[2] https://github.com/mozilla/rust/issues/2253


Yes, I started reading the reference. The normalization form issue is different from the #[non_ascii_idents] feature, though.

Issue 2253 does address it, but the comments there discuss NFC/NFKC normalization for filesystem lookup and for program identifiers, not for the lexing stage. That issue is obviously the best place to continue any conversation about it.


(Also, as that bug suggests, we don't actually do any normalisation at all yet.)


I don't see any mention of normalisation in the release notes.



