Benchmarking LLMs on Typst

I started working on an open-source evaluation suite to test how well different LLMs understand and generate Typst code.

Early findings:

| Model | Accuracy |
| --- | --- |
| Gemini 2.5 Pro | 65.22% |
| Claude 3.7 Sonnet | 60.87% |
| Claude 4.5 Haiku | 56.52% |
| Gemini 2.5 Flash | 56.52% |
| GPT-4.1 | 21.74% |
| GPT-4.1-Mini | 8.70% |

The dataset currently contains only 23 basic tasks; a more representative suite would probably need more than 400. For reference, the Typst docs span more than 150 pages.

To make the benchmark more robust, contributions from the community are very much welcome. Check out the GitHub repo: rkstgr/TypstBench (Benchmarking LLMs on Typst).


Are you passing any context (e.g. the Typst docs) to the LLMs, or is your goal to measure how well the LLMs handle Typst without being given (much) material about it? Or are you sure that Typst already falls within the knowledge cut-off of the various models?

The evaluation suite does not provide any additional context or Typst documentation. Its main purpose is to benchmark the models' baseline knowledge and to provide a reference point for further improvements. Claude 3.7 and GPT-4.1 are recent enough to have potentially encountered some Typst markup, but they are not great at it, which suggests that Typst makes up only a tiny fraction of their training data, or that the training samples are of poor quality.

This checks out. Many people say that Claude works pretty well for the most part, while GPT is… well, as the findings show, pretty bad. It uses Markdown and LaTeX syntax, or just made-up Typst-like syntax.

I would be interested to see how Google's new Gemini models handle it, and maybe DeepSeek too.


I added Gemini 2.5, and it does perform better. However, be cautious when interpreting the current results, as I have only evaluated the models on 23 rather basic tasks. I will be adding more in the coming days.

@rkstgr I think there is an issue with your evaluation method. The MD5 hash of a PDF file is not a reliable way to check whether the outputs differ: a correct generation can be marked as wrong just because the compiled file differs by a few bytes.
e.g.:

```
❯ md5sum 007.md_model.pdf 007.md_target.pdf
c17ad228ad61a090dade4eed0ae67093  007.md_model.pdf
2472258d377e6dff2022d26bf33daf4c  007.md_target.pdf
❯ magick compare -metric phash 007.md_model.pdf 007.md_target.pdf -compose
0 (0)⏎
```

Visually checking both the model output and the compiled PDF shows absolutely no difference. You could just invoke `magick compare` instead: it returns 0 on similar images and 1 otherwise. You could also parse the output if you prefer.
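
For illustration, here is a minimal sketch of what that check could look like, assuming the harness is in Python and can shell out to ImageMagick (the function name is just a placeholder, not anything TypstBench currently has):

```python
import subprocess

def pdfs_visually_match(model_pdf: str, target_pdf: str) -> bool:
    """Return True if the two PDFs are perceptually similar.

    `magick compare` exits with 0 for similar images, 1 for dissimilar
    ones, and 2 on error. The phash distance is written to stderr if you
    would rather apply your own threshold.
    """
    result = subprocess.run(
        ["magick", "compare", "-metric", "phash",
         model_pdf, target_pdf, "null:"],  # null: discards the diff image
        capture_output=True,
        text=True,
    )
    if result.returncode == 2:
        raise RuntimeError(f"magick compare failed: {result.stderr.strip()}")
    return result.returncode == 0
```

Relying on the exit code keeps it simple; parsing the phash distance from stderr would let you tune the similarity threshold yourself.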


I agree. I have also thought about switching to image-based comparison. I will look into magick, thanks for the suggestion.
