How to extract the document structure programmatically?

Raul_Durand · June 13, 2025, 9:44pm

I’m looking to convert a Typst document containing headings, equations, figures, citations, and references into a LaTeX file using a programmatic approach. The layout is relatively simple, as it’s meant for research manuscripts.

Instead of manually parsing the Typst file, I’d like to know if there’s a way to extract these components—ideally into a structured format like JSON or plain text—so I can process them and generate a .tex file more easily.

I’ve tried exporting to HTML, but it skips some elements.
Is there any method or tool available to help with this kind of extraction?
Maybe by using filtering and #show: doc => { } calls?

Andrew · June 14, 2025, 2:36am

Pandoc? I haven’t seen any other solutions. That said, a converter written specifically for these two formats could be more precise. I don’t know how much Pandoc’s Typst support has improved.

There is also typlite (tinymist/crates/typlite in the Myriad-Dreamin/tinymist repository on GitHub), which I haven’t tried.

bluss · June 14, 2025, 6:30am

In an active sibling topic here, “The good, bad, and complicated of writing a PhD thesis in typst”, TheZoq2 links to their blog, where they use a custom conversion tool (Typst to LaTeX). It might be worth checking what it can do, too.

quachpas · June 14, 2025, 6:39am

I agree. I think your best bet is to use the typst crate directly (or use the tool mentioned, although with caveats). That will give you the “truth” about the document layout.

If you want a bastardized version (as long as your document doesn’t use context), you can always print the repr of the document.

Repr

sequence(
  parbreak(),
  context(),
  parbreak(),
  styled(child: sequence(
    parbreak(),
    [],
    [],
    equation(
      block: true,
      body: sequence(
        [x],
        [ ],
        [+],
        [ ],
        [y],
        [ ],
        [+],
        [ ],
        accent(base: [y], accent: "\u{303}"),
      ),
    ),
    parbreak(),
    ...

TheZoq2 · June 14, 2025, 7:48am

Yep, my tool ttt (Frans Skarman / ttt on GitLab) sounds like it would do what you want, either to use or as inspiration for your own tool.

The idea is to use the typst compiler as a library and extract the expanded, but not yet laid out, document from the middle of the compilation process.

I have submitted a paper written using this tool and have not heard any complaints yet.

There are some limitations, which you can see in the readme.

sijo · June 16, 2025, 10:33am

Elaborating on @quachpas’ answer: To get a repr-like output as structured data instead of a string, you can put a label on the whole document and retrieve it with typst query, for example in JSON format:

#show: it => [#metadata(it)<all>] + it

#set math.equation(numbering: "(1)")

= A

$ x $<x>

See @x.

Running typst query file.typ '<all>' --pretty gives a JSON representation of the document:

Output

[
  {
    "func": "metadata",
    "value": {
      "func": "sequence",
      "children": [
        {
          "func": "parbreak"
        },
        {
          "func": "styled",
          "child": {
            "func": "sequence",
            "children": [
              {
                "func": "parbreak"
              },
              {
                "func": "heading",
                "depth": 1,
                "body": {
                  "func": "text",
                  "text": "A"
                }
              },
              {
                "func": "parbreak"
              },
              {
                "func": "equation",
                "block": true,
                "body": {
                  "func": "symbol",
                  "text": "x"
                },
                "label": ""
              },
              {
                "func": "parbreak"
              },
              {
                "func": "text",
                "text": "See"
              },
              {
                "func": "space"
              },
              {
                "func": "ref",
                "target": ""
              },
              {
                "func": "text",
                "text": "."
              },
              {
                "func": "space"
              }
            ]
          },
          "styles": ".."
        }
      ]
    },
    "label": ""
  }
]

Note that the content is in a “styled” container, which corresponds to the set-rule for equation numbering. In the JSON we cannot see what this rule does. If you have meaningful styling rules in your document this might be a problem. And if you have opaque elements (anything that uses context) you also won’t see the content in the JSON.
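
To illustrate how such a JSON tree could be turned into LaTeX, here is a minimal Python sketch. It is not part of any existing tool: the names to_latex, label_name and HEADING_COMMANDS are made up for this example, and it only handles the element kinds that appear in the output above. It also assumes that labels and reference targets, when present, are serialized as strings like "<x>", and it treats equation bodies as plain text or symbols, so real math content would need its own mapping.

```python
import json

# Hypothetical mapping from Typst heading depth to LaTeX sectioning commands.
HEADING_COMMANDS = ["section", "subsection", "subsubsection"]


def label_name(value):
    # Assumption: labels and ref targets look like "<x>" when present.
    return (value or "").strip("<>")


def to_latex(node):
    """Recursively convert a node of the queried JSON tree to LaTeX text."""
    if node is None:
        return ""
    if isinstance(node, str):
        return node
    if isinstance(node, list):
        return "".join(to_latex(child) for child in node)
    if not isinstance(node, dict):
        return ""
    func = node.get("func")
    if func == "metadata":
        return to_latex(node.get("value"))
    if func == "sequence":
        return to_latex(node.get("children", []))
    if func == "styled":
        # The styles themselves are not visible in the JSON; keep the child.
        return to_latex(node.get("child"))
    if func == "parbreak":
        return "\n\n"
    if func == "space":
        return " "
    if func in ("text", "symbol"):
        return node.get("text", "")
    if func == "heading":
        depth = node.get("depth", 1)
        index = max(1, min(depth, len(HEADING_COMMANDS))) - 1
        return "\\" + HEADING_COMMANDS[index] + "{" + to_latex(node.get("body")) + "}"
    if func == "equation":
        body = to_latex(node.get("body"))
        if not node.get("block"):
            return "$" + body + "$"
        name = label_name(node.get("label"))
        label = "\\label{" + name + "}" if name else ""
        return "\\begin{equation}" + label + "\n" + body + "\n\\end{equation}"
    if func == "ref":
        return "\\ref{" + label_name(node.get("target")) + "}"
    # Anything else (context, figures, ...) is left as a LaTeX comment.
    return "% TODO: unsupported element '" + str(func) + "'\n"


# A trimmed-down tree in the same shape as the query output, with labels
# filled in by hand for the sake of the example.
sample = json.loads("""
[{"func": "metadata", "value": {"func": "sequence", "children": [
  {"func": "heading", "depth": 1, "body": {"func": "text", "text": "A"}},
  {"func": "parbreak"},
  {"func": "equation", "block": true,
   "body": {"func": "symbol", "text": "x"}, "label": "<x>"},
  {"func": "parbreak"},
  {"func": "text", "text": "See"}, {"func": "space"},
  {"func": "ref", "target": "<x>"}, {"func": "text", "text": "."}
]}}]
""")
print(to_latex(sample))
```

Unsupported elements are emitted as LaTeX comments rather than dropped, so gaps in the conversion stay visible in the generated .tex file.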

Raul_Durand · June 24, 2025, 8:33pm

This is quite good—having a JSON file makes it easy to extract the content and generate non-stylized versions of the document in other formats.
Thanks for sharing your approach.

I ran some tests, and here’s what I’ve observed so far:

Styles disappear: That’s fine, since they’re not needed.
Comments disappear: Acceptable, though preserving them would be preferable.
Custom functions are expanded: Great.
Equations using package-defined functions become “context”: Not ideal, but depending on complexity, it’s possible to define custom function equivalents, so it’s manageable.
Labels are retained: In your example they were removed, but in my tests they remained—which is great.
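
To see where content was lost to “context”, something like the following Python sketch could list the affected positions in the JSON tree. The function name find_elements and the path notation are invented for this example, and it assumes opaque elements show up as objects whose "func" is "context".

```python
def find_elements(node, func_name, path="root"):
    """Yield the path of every element whose "func" equals func_name."""
    if isinstance(node, list):
        for index, child in enumerate(node):
            yield from find_elements(child, func_name, f"{path}[{index}]")
    elif isinstance(node, dict):
        if node.get("func") == func_name:
            yield path
        # Descend into every field, since children can sit under
        # "children", "child", "body", "value", and so on.
        for key, value in node.items():
            yield from find_elements(value, func_name, f"{path}.{key}")


# Small hand-written tree in the shape of the query output.
tree = [{"func": "metadata", "value": {"func": "sequence", "children": [
    {"func": "parbreak"},
    {"func": "context"},
    {"func": "equation", "block": True, "body": {"func": "context"}},
]}}]

for location in find_elements(tree, "context"):
    print(location)
```

Each reported path points at a spot where a custom function equivalent may be needed before the document can be converted.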

I intend to start using this method to generate LaTeX versions of moderately simple manuscripts.

sijo · June 25, 2025, 6:58am

Another option that has been recently released is typlite, see this post.