I'm wondering why they chose NFKC (compatibility composed form) instead of NFC (canonically composed form).
`ª` would become `a`, losing its superscript form. `ᵤ` becomes `u`, losing its subscript form. `Ⓐ` becomes `A`, losing its enclosing circle. As for multi-codepoint mappings, `¼` would become the three codepoints `1⁄4`, where `⁄` (U+2044) doesn't map to `/` (U+002F).
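For illustration, here's a minimal sketch comparing the two forms on those characters, assuming the third-party `unicode-normalization` crate (its `UnicodeNormalization` trait provides the `nfc()`/`nfkc()` adapters used below):

```rust
// Requires the third-party crate: unicode-normalization = "0.1"
use unicode_normalization::UnicodeNormalization;

fn main() {
    for s in ["ª", "ᵤ", "Ⓐ", "¼"] {
        // NFC applies only canonical mappings, so these compatibility
        // characters pass through unchanged; NFKC folds them away.
        let nfc: String = s.nfc().collect();
        let nfkc: String = s.nfkc().collect();
        println!("{s}: NFC = {nfc:?}, NFKC = {nfkc:?}");
    }
}
```

Expected output: NFC leaves all four strings unchanged, while NFKC produces `a`, `u`, `A`, and `1⁄4` respectively.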
Maybe you are referring to reference [1], which indeed mentions NFKC. As far as I know there is no consensus on the normalization form [2], and the current implementation is not guaranteed to stay, which is why Unicode identifiers are gated behind a `#![feature]` flag.
Yes, I started reading the reference. The normalization form issue is separate from the `non_ascii_idents` feature, though.
Issue 2253 does address it, but the comments there discuss NFC/NFKC normalization specifically for filesystem lookup and for program identifiers, not for the lexing stage. That issue is obviously the best place to continue any conversation about this.