Exporting to DOCX (combining pdf2docx and Pandoc)

Typst is wonderful. I can’t wait to apply it to my actual papers. However, in my field, MS Word is still the most widely used format for both casual review among colleagues and final submission. To adopt Typst into my actual workflow, there must be decent support for DOCX output.

Luckily, the good thing about DOCX is that the complexity of typesetting is completely offloaded to the press. There’s NO need for outlines, colors, or cross-referencing. In that regard, Acrobat’s (and many other tools’) PDF-to-DOCX conversion already does a very good job (it even preserves all the cross-references).

The only part that’s missing is math equations. So far, only Pandoc correctly renders them as Unicode math in Word. But Pandoc’s support for Typst is very limited: many basic functions are not supported (e.g. table stroke), and they cause the conversion to fail. (I tried to contribute to Pandoc directly, but I’m very unfamiliar with Haskell, and the existing code base is still very basic.)

There is a solution that I’m considering:

  1. Use some regex to extract math equations and some surrounding text from a document, and put all of them into a separate Typst file.
  2. Render this file to DOCX with Pandoc.
  3. Compile the original Typst source file to PDF with Typst.
  4. Convert this PDF into DOCX (I use Acrobat).
  5. Use Office Word MCP (Model Context Protocol) to have an LLM merge these two DOCX files.
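
Step 1 could look roughly like the Python sketch below. It is only an illustration under assumptions: the regex treats every unescaped `$ ... $` pair as math and does not know about raw blocks or code, and the `EQ<n>:` marker prefix and the function names are made up here, not part of any tool.

```python
import re

# Unescaped "$" ... unescaped "$"; DOTALL lets block equations span lines.
# Assumption: "$" inside raw/code blocks is not handled by this sketch.
MATH_RE = re.compile(r"(?<!\\)\$(.+?)(?<!\\)\$", re.DOTALL)


def extract_math(source: str, context: int = 30) -> list[dict]:
    """Collect each equation plus a little surrounding text."""
    found = []
    for index, match in enumerate(MATH_RE.finditer(source)):
        found.append({
            "index": index,
            "math": match.group(0),
            "before": source[max(0, match.start() - context):match.start()],
            "after": source[match.end():match.end() + context],
        })
    return found


def build_math_file(equations: list[dict]) -> str:
    """One paragraph per equation, prefixed with a hypothetical EQ<n> marker."""
    return "\n\n".join(f"EQ{eq['index']}: {eq['math']}" for eq in equations) + "\n"
```

The returned string would be written to a separate `.typ` file and handed to Pandoc in step 2; the marker prefix is there so each equation can be located again when merging.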

I’d like to hear some thoughts on converting Typst to Word, especially for academic paper submission. If this would help many researchers adopt Typst, maybe the community should put more effort into it.

EDIT: LLMs will attempt to read the whole document as XML, which easily exceeds the token limit. Perhaps I can save the XML to the file system and use tools like grep to extract only the portion of interest.
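
A minimal sketch of that idea in Python, assuming the main body lives at `word/document.xml` inside the DOCX archive (the usual location, though the package format does not strictly require it). The function names are hypothetical.

```python
import re
import zipfile
from pathlib import Path


def dump_document_xml(docx_path: str, out_path: str) -> str:
    """Extract the main document part from a DOCX (a zip archive) to disk."""
    with zipfile.ZipFile(docx_path) as archive:
        # Assumption: the body is stored under the conventional part name.
        xml = archive.read("word/document.xml").decode("utf-8")
    Path(out_path).write_text(xml, encoding="utf-8")
    return xml


def grep_xml(xml: str, pattern: str, context: int = 80) -> list[str]:
    """Return each regex match with `context` characters on either side."""
    return [
        xml[max(0, m.start() - context):m.end() + context]
        for m in re.finditer(pattern, xml)
    ]
```

Only the short snippets returned by `grep_xml` would then need to be shown to the model, rather than the whole file.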

EDIT: The reason I wanted to use an LLM is that I don’t want to deal with Word XML at all. However, I realized that it’s quite possible to operate on it directly. So here’s a new pipeline I’m proposing:

  1. Use a show rule on math to add markers around all math content.
  2. Use the typst-syntax Rust crate to extract all math, and put it into a separate file.
  3. Render the original file to PDF, then to DOCX; render the math reference file with Pandoc.
  4. Use XSLT (XSL Transformations) to swap the proper math in.
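
To make step 4 concrete, here is a rough sketch of the swap. Note that it uses Python’s `xml.etree.ElementTree` instead of XSLT, purely for illustration. It assumes a hypothetical marker format `EQ<n>`, that each marker survives the PDF-to-DOCX conversion as a single run containing exactly the marker text, and that the Pandoc output has one marker and one equation per paragraph. Real converter output may split markers across runs, so treat this as a starting point only.

```python
import copy
import re
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
M = "http://schemas.openxmlformats.org/officeDocument/2006/math"
NS = {"w": W, "m": M}
MARKER = re.compile(r"EQ\d+")  # hypothetical marker format


def collect_math(math_root: ET.Element) -> dict:
    """Map each marker to the first <m:oMath> in its paragraph (Pandoc output)."""
    table = {}
    for para in math_root.iter(f"{{{W}}}p"):
        text = "".join(t.text or "" for t in para.iter(f"{{{W}}}t"))
        marker = MARKER.search(text)
        omath = para.find(".//m:oMath", NS)
        if marker and omath is not None:
            table[marker.group(0)] = omath
    return table


def swap_math(layout_root: ET.Element, table: dict) -> int:
    """Replace runs whose text is exactly a marker with the matching equation."""
    # ElementTree has no parent pointers, so build a child -> parent map first.
    parents = {child: parent for parent in layout_root.iter() for child in parent}
    swapped = 0
    for run in list(layout_root.iter(f"{{{W}}}r")):
        text = "".join(t.text or "" for t in run.iter(f"{{{W}}}t")).strip()
        if text in table:
            parent = parents[run]
            position = list(parent).index(run)
            parent.remove(run)
            parent.insert(position, copy.deepcopy(table[text]))
            swapped += 1
    return swapped
```

Both roots would come from parsing `word/document.xml` of the respective DOCX files. For writing real files back, a library that preserves the original namespace declarations (lxml, for example) is probably a safer choice than ElementTree, which renames prefixes on output.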

Interesting idea. It’s also interesting to see that MS Word is still a review standard (the university I work at uses PDFs for reviewing, and I recently started introducing Typst to my colleagues).

And at what point does it become easier to introduce Typst to your colleagues than to go through the whole hassle of converting Typst docs? ;)

Just a side note for others attempting this. If you’re under NDA or similar restrictions, where you’re not allowed to publish your thesis, use local LLMs or a service that is subject to strict data protection laws (unlike the EU, the US does not have these). LLM services such as the big five (or more) (ChatGPT, Gemini, Claude, …) may use your uploaded files as training material for their models, depending on the plan and settings!


Typlite is capable of exporting to DOCX through HTML.

It converts formulae into images embedded in the DOCX, but there is a big problem with scaling… Nevertheless, it is a feasible technical path.

You can download the typlite executable from tinymist’s repo, for example from the Release v0.13.30 page of Myriad-Dreamin/tinymist on GitHub.

Thanks for the reminder about privacy!

I am also confused about MS Word being a review standard. As far as I know, some people use the track changes feature to edit the document directly when making suggestions.

Even if PDF can be used for some reviewing, the final submission still has to be in DOCX, so this conversion is crucial.

Thanks for the reply! I came across Typlite already. The problem is that the HTML export itself is very limited. Any page setting will cause it to fail immediately (there is a discussion on GitHub about making it a warning instead of an error).

And yes, the equations are completely messed up. So far only Pandoc does it well.
