I want to process text strings character by character (or, later, by grapheme cluster) and treat characters differently according to their classification. I want to detect properties like these:
purpose (letter, numeral, space, punctuation…)
script (Latin, Cyrillic, Hebrew…)
case (upper, lower, none)
direction (left, right, none)
How can I do that effectively?
Some properties can be read directly off the Unicode code point, but similar characters are scattered all over Unicode, so that is not viable in general. Regex has character classes like [[:lower:]] for lowercase characters, but they match only ASCII, so they are not useful for general text.
Typst uses Rust’s regex, which supports Unicode properties (this list). That means it can test for Unicode Lowercase-ness, for example, or categorize alphabetic letters by script. That’s not to say that it’s easy to work with Unicode, but it can do a lot.
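As a rough sketch of how those property classes could be used per character (the helper names `first-match`, `purpose-of`, `script-of` and `case-of` are made up for this example, and only a handful of classes are covered):

```typst
// Hypothetical helper: return the label of the first pattern that matches `ch`.
// `table` is an array of (label, pattern) pairs, checked in order.
#let first-match(ch, table) = {
  for (label, pattern) in table {
    if ch.contains(regex(pattern)) { return label }
  }
  "other"
}

// The `^` anchor means only the first codepoint of a grapheme cluster is
// classified; any combining marks that follow are ignored.
#let purpose-of(ch) = first-match(ch, (
  ("letter", "^\p{L}"),
  ("numeral", "^\p{N}"),
  ("space", "^\p{White_Space}"),
  ("punctuation", "^\p{P}"),
))

#let script-of(ch) = first-match(ch, (
  ("Latin", "^\p{Latin}"),
  ("Cyrillic", "^\p{Cyrillic}"),
  ("Hebrew", "^\p{Hebrew}"),
))

#let case-of(ch) = first-match(ch, (
  ("upper", "^\p{Uppercase}"),
  ("lower", "^\p{Lowercase}"),
))

// Usage: one output line per grapheme cluster.
#for ch in "Aж5 ש!".clusters() {
  [#repr(ch): #purpose-of(ch), #script-of(ch), #case-of(ch) \ ]
}
```

As far as I know, the regex engine has no property for direction (Bidi_Class), so that one would have to be approximated from the script or looked up some other way.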
I also know that the unichar package can help you look up other Unicode properties of codepoints. Doing that letter by letter is probably not very efficient.
Thanks. These regex properties are useful for occasional checks. One apparent disadvantage is that regex seems to be much slower than direct checks.
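If the per-character regex calls turn out to be the bottleneck, one thing that might help is to let a single pattern scan the whole string and return runs of same-class characters. This is only a sketch: I have not measured whether it is actually faster, the names `labels`, `pattern` and `script-runs` are made up, and letters from scripts not listed in the pattern are simply skipped.

```typst
// One capture group per class, in the same order as `labels`.
#let labels = ("Latin", "Cyrillic", "Hebrew", "non-letter")
#let pattern = regex("(\p{Latin}+)|(\p{Cyrillic}+)|(\p{Hebrew}+)|(\P{L}+)")

// Returns an array of (label, text) pairs, one per run.
#let script-runs(s) = s.matches(pattern).map(m => {
  // Exactly one group takes part in each match; the others are `none`.
  let i = m.captures.position(c => c != none)
  (labels.at(i), m.text)
})

#script-runs("Typst и тайпст")
```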
unichar works the way I imagine this should be implemented, but it is not part of the standard library. So I am going to use it, and I will also write a feature request.