The issue is that this is an unusual requirement. I have seen real-world data sets with all manner of exotic whitespace, like non-breaking spaces and vertical tabs, peppered throughout, so I am sympathetic. But that situation isn't common.
That said, for a CLI program like this, approximate results are usually good enough anyway. And realistically, pretty much all text uses plain ASCII whitespace.
It is a context-specific requirement, but "fully general lowercasing" would be impossible anyway.
For Japanese text, for example, I think you'd only have to add one or two characters to the set of whitespace characters. But you'd also have to solve Japanese text segmentation, which is hard to impossible (a sketch of what that involves follows below). If you want to canonicalize the words by transforming half-width katakana to full-width, transforming full-width romaji to ASCII, and so on, that's a lot of work, and which of those transformations are desired will be specific to the actual use of the program. And suppose you want to canonicalize the text such that:

- the same word written using kanji or using only hiragana ends up in the same bucket,
- words that are written the same way in hiragana but differently in kanji end up in different buckets,
- names that are written the same way in kanji but differently in hiragana end up in different buckets,
- loanwords incorrectly written using hiragana are bucketed with the katakana loanword,
- words written using katakana for emphasis are bucketed with the hiragana word (but katakana loanwords are not converted to hiragana and bucketed with the non-loanword made up of the same moras).

Well, that all sounds even more challenging than the hard-to-impossible problem you already had to solve to decide where words begin and end :)
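To make the segmentation point concrete, here is a minimal sketch using the MeCab morphological analyzer via the mecab-python3 bindings. This assumes MeCab and a dictionary are installed; the sentence is illustrative, and the exact token boundaries depend on which dictionary you use.

    # Minimal sketch: Japanese has no spaces between words, so "splitting
    # on whitespace" doesn't work; you need a morphological analyzer.
    # Assumes the mecab-python3 package plus a dictionary (e.g. unidic-lite).
    import MeCab

    tagger = MeCab.Tagger("-Owakati")  # -Owakati: output space-separated surface forms

    sentence = "私は東京で寿司を食べました"   # "I ate sushi in Tokyo" (illustrative)
    words = tagger.parse(sentence).split()

    print(words)  # e.g. ['私', 'は', '東京', 'で', '寿司', 'を', '食べ', 'まし', 'た']
    # Token boundaries vary by dictionary, and none of this addresses the
    # canonicalization questions above (kanji vs. hiragana spellings, etc.).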
Edit: One of the first concerns I mentioned, about full-width romaji and half-width katakana, along with concerns about diacritics, can be addressed with Unicode normalization, so those parts are pretty easy[0]. An issue you may still face after normalizing is that you may receive input that has incorrectly substituted tsu ツ for sokuon ッ (these are pronounced differently), because, for example, Japanese banking software commonly transmits people's names using a character set that does not include sokuon.
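For reference, a minimal sketch of what NFKC normalization does and doesn't fix, using Python's standard unicodedata module (the sample strings are illustrative):

    # NFKC folds half-width katakana and full-width romaji into their
    # canonical forms, but it cannot repair a tsu that was wrongly
    # substituted for sokuon upstream.
    import unicodedata

    print(unicodedata.normalize("NFKC", "ﾊﾟｿｺﾝ"))      # -> パソコン (half-width katakana to full-width)
    print(unicodedata.normalize("NFKC", "Ｈｅｌｌｏ"))  # -> Hello   (full-width romaji to ASCII)

    # A full-size tsu that should have been sokuon stays as-is:
    print(unicodedata.normalize("NFKC", "マツチ"))      # -> マツチ, not マッチ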
My point is that this is not just one hard problem but many different, incompatible problems, many of which are hard; because of the incompatibilities you have to pick one and give up on the others. An English-speaking end user may not want their word count to perform full-width romaji -> ASCII conversion.
Great write-up! However, I should've clarified: I wasn't talking about word segmentation in general, only about expanding the universe of valid "whitespace" grapheme clusters (or even just codepoints) to include the various uncommon, exotic, and non-Western items.
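A minimal sketch of that difference in Python, using a toy input string (not the actual program): an ASCII-only notion of whitespace misses non-breaking spaces and em spaces, while a Unicode-aware split does not.

    # Compare an ASCII-only whitespace split with Python's Unicode-aware one.
    # The sample string and variable names are illustrative.
    import re

    text = "foo\u00a0bar\u2003baz\x0bqux"  # NBSP, em space, vertical tab

    ascii_words = re.split(r"\s+", text, flags=re.ASCII)  # \s limited to [ \t\n\r\f\v]
    unicode_words = text.split()  # str.split() uses the Unicode whitespace definition

    print(ascii_words)    # ['foo\xa0bar\u2003baz', 'qux'] -- NBSP and em space not treated as separators
    print(unicode_words)  # ['foo', 'bar', 'baz', 'qux']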