How can I detect properties of Unicode characters?

matj1 · December 5, 2025, 4:12pm

I want to process text strings character by character (or, later, by grapheme cluster) and treat characters differently according to their classification. I want to detect properties like these:

purpose (letter, numeral, space, punctuation…)
script (Latin, Cyrillic, Hebrew…)
case (upper, lower, none)
direction (left, right, none)

How can I do that effectively?

Some properties can be easily read off the Unicode code point, but many similar characters are scattered across Unicode, so that is not viable in general. Regex has character classes like [[:lower:]] for lowercase characters, but they work only for ASCII, so they are not useful for general text.

bluss · December 5, 2025, 5:07pm

Typst uses Rust’s regex crate, which supports Unicode properties (this list). That means it can test for Unicode Lowercase-ness, for example, or categorize alphabetic letters by script. That’s not to say that working with Unicode is easy, but it can do a lot.
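As a rough illustration of the above, here is a minimal sketch of single-character checks with Unicode classes. The helper name `has-prop` is made up for this example; `str.match` and `regex` are built-in Typst functions, and the `\p{…}` class names are the ones the Rust regex crate documents. As far as I know, that crate does not expose bidirectional class, so direction is not covered here.

```typ
// Hypothetical helper: true if the whole string `c` matches the
// given Unicode class (anchored so that it must match all of `c`).
#let has-prop(c, class) = c.match(regex("^" + class + "$")) != none

// Each call returns a boolean.
#has-prop("ž", "\p{Lowercase}")  // case (binary property)
#has-prop("Ж", "\p{Cyrillic}")   // script
#has-prop("7", "\p{Nd}")         // general category: decimal digit
#has-prop("!", "\p{P}")          // general category: any punctuation
```

To apply this per character or per grapheme cluster, the string could first be split with `str.codepoints()` or `str.clusters()`.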

There is also the unichar package, which can help you look up other Unicode properties of codepoints. Doing that letter by letter is probably not very efficient, though.

matj1 · December 5, 2025, 5:27pm

Thanks. These regex properties are useful for occasional checks. One apparent disadvantage is that regex seems to be much slower than direct checks.
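If the per-character cost is the concern, one thing that might help (I have not benchmarked it) is to run a single regex over the whole string and collect runs, instead of testing each character separately. A minimal sketch, where the variable names and the sample text are only for illustration:

```typ
#let sample = "Hello Привет שלום"

// One pass over the whole string: each match is a maximal run of
// Cyrillic characters, returned with its `start`, `end` and `text`.
#let runs = sample.matches(regex("\p{Cyrillic}+"))

// Keep only the matched text of each run.
#runs.map(m => m.text)
```

This only helps when the processing can work on runs rather than on individual characters.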

unichar works the way I imagine this should be implemented, but it is not part of the standard library. So I am going to use it, but I will also write a feature request.