`hy-dro-gen`: towards custom hyphenation patterns

This post is about the new capability of hy-dro-gen:0.1.2 (repo: Vanille-N/hy-dro-gen) to load custom hyphenation patterns for languages not natively supported by Typst.
If you sometimes write documents in languages that have non-builtin hyphenation rules, this is for you.

Disclaimer: I am not an expert on hyphenation.
Basically everything I know comes from this blog post about hypher.

Background

Currently Typst can handle hyphenation for 34 (perhaps soon-to-be 35) languages. Hyphenation points are computed by the crate typst/hypher, which compiles TeX patterns into a finite automaton.

As has been noted multiple times,

  1. The hyphenation patterns are not always licensed in a way that would permit their distribution embedded in the Typst compiler.
  2. Even if they are permissively licensed, every additional language increases the size of the compiler, so it is impractical to hope that eventually every possible language/variant will be supported.

As far as I can tell, the main obstacles to beginning to resolve issue #5223 – requesting the ability to dynamically load user-specified hyphenation patterns – are simply that (1) it takes work to add to typst/hypher the capability to load user-specified patterns, and (2) it takes time to decide on a stable API for doing so. Since a package has much less inertia than the entire compiler when it comes to settling on an API, a package seems to be the more effective approach in the short term.

I happen to maintain the package hy-dro-gen because it is a dependency of meander, and it occurred to me that, having (1) working knowledge of Rust, (2) a hand on the already existing hyphenation package for Typst, and (3) a bit of free time, I was in a good position to be the person to implement dynamically loaded hyphenation patterns.

About hy-dro-gen

hy-dro-gen is a thin wrapper around a WASM module on top of typst/hypher. It provides bindings to the same library that Typst uses for native hyphenation, and thus guarantees consistent results.

For hy-dro-gen:0.1.2 specifically, I forked hypher and added the ability to dynamically load precompiled patterns and to compile new patterns on the fly.

Setting up hyphenation for a new language

As an example, the Romanian language (ISO code “ro”) is not supported by Typst. Because its patterns are not licensed, it is unclear whether it will ever be supported natively. Fortunately, hyphenating Romanian is feasible right now using hy-dro-gen.

1. Procure pattern files

From www.hyphenation.org you can get a list of languages for which patterns exist.

The patterns themselves can be downloaded from hyphenation/tex-hyphen. In this example, I grabbed hyph-ro.tex and saved it to a local folder patterns/.

2. Load the patterns into hy-dro-gen

#import "@preview/hy-dro-gen:0.1.2" as hy

#let trie_ro = hy.trie(
  // Downloaded from github:hyphenation/tex-hyphen
  tex: read("patterns/hyph-ro.tex"),
  // See column '(left,right)-hyphenmin' on hyphenation.org
  bounds: (2, 3),
)

// The patterns are compiled on the fly exactly once, and stored in
// hy-dro-gen's global registry of languages. Hypher is pretty efficient,
// so this one-time cost at startup is barely noticeable.
// It is technically possible -- though not convenient yet -- to precompile
// the patterns for even less loading time.
#hy.load-patterns(
  ro: trie_ro,
)

3. Apply patterns

Below is a comparison of how the same excerpt gets hyphenated with different settings.

Left:

// This would be the right solution if Romanian was natively supported.
// Because it is not, this doesn't work at all.
#set par(justify: true)
#set text(hyphenate: true, lang: "ro")
#excerpt

Middle:

// If we lie to Typst and pretend it's English, we get some hyphenation.
// However this is neither correct (hyphenation may occur where forbidden)
// nor pleasant (some lines have unnatural spacing).
// We could find another supported language that is more similar to Romanian,
// but that would just be another hack, and other features that depend on
// the language (e.g. selecting the right quotation marks) may be wrong.
#set par(justify: true)
#set text(hyphenate: true, lang: "en")
#excerpt

Right:

// This time we get the right hyphenation points.
// Fewer excessive spaces, and slightly more compact overall.
// Semantically correct in that it accurately states the text's language.
#set par(justify: true)
#set text(hyphenate: true, lang: "ro")
#show: hy.apply-patterns("ro")
#excerpt

You can view this example on typst.app.

Further reading

For more details, you can consult the full documentation of hy-dro-gen.
If you encounter an issue, please file a bug report.

What I would like to see too is a way for the Typst compiler to notify me (as a warning) about words with no hyphenation rules.

This would be very useful for writing fiction, where uncommon or made-up words (for a specific world or universe) are skipped during hyphenation.

I don’t know whether that could be done via such extensions, or whether it would need a patch to the compiler itself.

Oh, very interesting application. I think that should be doable.

For example if you write

#import "@preview/hy-dro-gen:0.1.2" as hy

#show regex("\w{5,}"): ww => {
  if hy.syllables(ww.text).len() == 1 {
    highlight(ww)
  } else {
    ww
  }
}

It’ll highlight every word of 5 or more letters that cannot be hyphenated.

That will include a lot of false positives, but by refining the heuristic a bit (maybe you’re only looking for fantasy words that are capitalized) and having a list of known false positives, I think it’s possible to achieve that.
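A minimal sketch of that refinement, assuming that the invented words are capitalized and that a hand-maintained list of false positives is acceptable. The list `known-ok` and its entries are hypothetical; `hy.syllables` is used exactly as in the snippet above.

```typst
#import "@preview/hy-dro-gen:0.1.2" as hy

// Hypothetical allowlist: capitalized words already checked by hand
// that should never be flagged.
#let known-ok = ("Earth", "March")

// Only look at capitalized words of 5 or more letters.
#show regex("\b\p{Lu}\p{L}{4,}\b"): ww => {
  if ww.text in known-ok {
    // Known false positive: leave it alone.
    ww
  } else if hy.syllables(ww.text).len() == 1 {
    // No hyphenation point was found: flag the word.
    highlight(ww)
  } else {
    ww
  }
}
```

Words at the start of a sentence are capitalized too, so the allowlist will probably need a few ordinary words in addition to names.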

Then the harder part will probably be to update the pattern file with how you want those words to be hyphenated.
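As a stopgap that does not touch the pattern file at all, plain Typst soft hyphens can mark the allowed break points of an individual word. This is a different technique from custom patterns, and the word below is a made-up placeholder.

```typst
// `-?` is Typst's markup shorthand for a soft hyphen: the word may break
// there, and a hyphen is only printed if it actually does.
// "Khazalduin" is a hypothetical invented word.
#show "Khazalduin": [Kha-?zal-?duin]

The road to Khazalduin was long.
```

This only scales to a handful of words, but it needs nothing beyond built-in Typst markup.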

Alternatively, an even lower-tech solution:

#show regex("\w+"): ww => hy.syllables(ww.text).join("-")

This will hyphenate every word between every syllable, and then you can see if your invented words have any hyphenation rules that apply.

If I understand the implementation correctly, every call to syllables for a custom language currently copies the compiled bytes for that language from Typst to Rust. I’m not sure how much performance overhead this actually has in practice, but if it turns out to be a bottleneck, I just wanted to mention that it would be possible to persist these bytes on the Rust side via plugin.transition.

It doesn’t feel like that much copying is really being done.

A document with about 10000 calls to syllables using a custom language compiles in around 2 seconds, which I feel wouldn’t be possible if all the data was copied given how big the tries are.

The Rust side receives a &[u8], so I expect that the bytes would indeed not be copied. It even sounds like it’s a bug in wasm_minimal_protocol if copying occurs.

It’s definitely copying. That’s just how WASM works: a plugin has its own linear memory that it can read from, and it can’t read from host memory.

I guess computers are just fast. :smile: If some of the words are the same, then automatic memoization will also kick in.

That’s just convenience on the side of wasm_minimal_protocol. It might as well be Vec<u8>. It’s similar for wasm-bindgen.

Ok, I don’t really know all the intricacies of WASM.
I might try to implement the transition and I’ll see if it makes a noticeable difference.

Yeah ok, after some benchmarking there’s definitely some overhead.

Calling str.rev on the first 1,000,000 words of lorem takes 12 seconds. syllables with a static language takes 23 seconds, and syllables with a dynamic language takes 33 seconds.

Still, some napkin math tells me there’s probably a lot of optimization going on because “only” 10 seconds of overhead feels like not that much for copying 3kb of data one million times…

The first 1,000,000 words of lorem will have a lot of repetitions, right? Each call to syllables with the same arguments will be automatically cached by Typst. So to truly benchmark it, you would need to hyphenate unique words every time.
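A rough sketch of such a benchmark, assuming that generated pseudo-words are an acceptable stand-in for real text. The helper `pseudo-word` and the word count are hypothetical. `hy.syllables` is called with its default language here; for the dynamic case, the language would have to be selected as described in the hy-dro-gen documentation.

```typst
#import "@preview/hy-dro-gen:0.1.2" as hy

#let alphabet = "abcdefghijklmnopqrstuvwxyz".clusters()

// Hypothetical helper: spell the index `i` in base 26 using letters only,
// so that every index yields a different string.
#let pseudo-word(i) = {
  let word = ""
  let n = i
  while true {
    word += alphabet.at(calc.rem(n, 26))
    n = calc.quo(n, 26)
    if n == 0 { break }
  }
  // A fixed stem keeps every string long enough to be hyphenated.
  "hyphenation" + word
}

// Every call receives a different argument, so no result can be reused
// from an earlier call.
#let n = 10000
#let counts = range(n).map(i => hy.syllables(pseudo-word(i)).len())

Total syllables over #n distinct words: #counts.sum()
```

Compiling this once with a static language and once with a dynamic one, then comparing the two compile times, should isolate the per-call overhead better than repeated lorem text does.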