How to parse invalid UTF-8 gracefully?

Let’s say I have a bunch of bytes objects, some of which contain valid UTF-8 and some do not. I want to convert each to a string, either outputting none or some other sentinel value if the bytes are not valid UTF-8.

However, the only conversion I found so far seems to be str(bytes), which panics on invalid UTF-8. Am I missing something, do I have to implement parsing manually, or is there a package that can help me with that?

Hello @blume, could you provide a reproducible example? Does the conversion panic or give you an error: bytes are not valid utf-8?

If it is the latter, you would need to validate the UTF-8 yourself, preferably before the data reaches the Typst code.

Of course. Let’s use the following code example:

#let input = (
    (0x20, 0x61, 0x6e, 0x64),
    (0x2c, 0x0a, 0xff, 0x7d), // invalid due to 0xff byte
    (0x72, 0x65, 0x2c, 0x0a)
).map(bytes)
#let strings = input.map(str) // error: bytes are not valid utf-8

Of course, this could be simplified to just one bytes instance, but then you could just change the code to not parse it as a string. I need either lossy conversion or returning a result-like value because currently there is no way to even know programmatically which values are faulty without consulting external programs.

UTF-8 validation can be implemented in Typst if this is what you really need, see The Algorithm to Valide an UTF-8 String. You can use a regex, or a state machine (these are the suggested implementations).
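For reference, here is a minimal sketch of the state-machine approach written directly in Typst. The names is-valid-utf8 and decode-or-none are hypothetical helpers made up for this example, not built-in functions. The byte ranges are the well-formed UTF-8 sequences listed in the Unicode Standard. I have not benchmarked it, and a byte-by-byte loop may be slow on large inputs.

```typst
// Hypothetical helper (not part of Typst): returns true if `data`,
// a bytes value, is well-formed UTF-8.
#let is-valid-utf8(data) = {
  let n = data.len()
  let i = 0
  while i < n {
    let b0 = data.at(i)
    // extra: number of continuation bytes expected after the lead byte.
    // lo/hi: allowed range for the first continuation byte.
    let extra = 0
    let lo = 0x80
    let hi = 0xbf
    if b0 < 0x80 {
      extra = 0 // ASCII
    } else if b0 >= 0xc2 and b0 <= 0xdf {
      extra = 1
    } else if b0 >= 0xe0 and b0 <= 0xef {
      extra = 2
      if b0 == 0xe0 { lo = 0xa0 } // rejects overlong encodings
      if b0 == 0xed { hi = 0x9f } // rejects surrogates U+D800..U+DFFF
    } else if b0 >= 0xf0 and b0 <= 0xf4 {
      extra = 3
      if b0 == 0xf0 { lo = 0x90 } // rejects overlong encodings
      if b0 == 0xf4 { hi = 0x8f } // rejects code points above U+10FFFF
    } else {
      // 0x80..0xc1 and 0xf5..0xff can never start a sequence.
      return false
    }
    // The sequence must not run past the end of the data.
    if i + extra >= n { return false }
    let k = 1
    while k <= extra {
      let b = data.at(i + k)
      if b < lo or b > hi { return false }
      // Every later continuation byte uses the default range.
      lo = 0x80
      hi = 0xbf
      k += 1
    }
    i += extra + 1
  }
  true
}

// Hypothetical wrapper: decode if valid, otherwise return none as a sentinel.
#let decode-or-none(data) = if is-valid-utf8(data) { str(data) } else { none }
```

With the input array from the example above, input.map(decode-or-none) should yield none for the second entry (the one containing 0xff) and ordinary strings for the other two.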

If you think the error: bytes are not valid utf-8 is not graceful enough, you can always discuss it on GitHub or on other channels.

Yes, implementing UTF-8 (or rather the specific subset I need) would be the last resort. I figured somebody might have done that before, either in Typst or very accessibly via WASM, but so far I have not been successful in finding it. Or perhaps I don’t know how to easily integrate, e.g., the Rust verifier in Typst via WASM.

The error itself is fine and very consistent with other Typst behavior. I would rather suggest adding a method like bytes.is-valid-utf8, so that it can be checked in advance whether a given byte sequence can be successfully interpreted this way, much like you can check whether a dict d contains a key k by branching on k in d.

Minor nit: the page you linked to seems to be a draft, and a fairly early one at that, considering the number of typos in the title and body. However, I also could not find a readable document on this that is as normative as your reference, so I might as well try it.

I did not notice that, thanks for nitpicking!

As for WASM, I think you would have to implement the bindings yourself, but there are many Rust crates implementing algorithms for UTF-8 validation. Take a look at astrale-sharp/wasm-minimal-protocol on GitHub for examples.
