Benchmarking LLMs on Typst

I started working on an open-source evaluation suite to test how well different LLMs understand and generate Typst code.

Early findings:

  • Claude 3.7 Sonnet: 60.87% accuracy
  • Claude 4.5 Haiku: 56.52%
  • GPT-4.1: 21.74%
  • GPT-4.1-Mini: 8.70%

The dataset currently contains only 23 basic tasks; a more representative benchmark would probably need over 400. For reference, the Typst docs span more than 150 pages.
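To give a rough idea of the task format, here is a hypothetical example in the spirit of the benchmark (not taken from the repo): the model gets a natural-language prompt and its Typst output is scored against a reference solution.

```typst
// Hypothetical prompt: "Create a level-1 heading 'Report', a bulleted list
// with two items, and the fraction 1/2 as display math."
// A reference solution the model's output could be compared against:

= Report

- First item
- Second item

$ 1/2 $
```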

To make the benchmark more robust, contributions from the community are very much welcome. Check out the GitHub repo: https://github.com/rkstgr/TypstBench


Are you passing any context (e.g. the Typst docs) to the LLMs, or is your goal to measure how well the LLMs handle Typst without being given any reference material? Or are you confident that Typst already falls within the knowledge cutoff of the various models?

The evaluation suite does not provide any additional context or Typst documentation. Its main purpose is to benchmark the models' built-in knowledge and to provide a reference point for further improvements. Claude 3.7 and GPT-4.1 are recent enough to have potentially encountered some Typst markup, but their results are not great, which suggests Typst makes up only a tiny fraction of their training data, or that the training samples are of poor quality.

This checks out. Many people say that Claude works pretty well for the most part, while GPT is…well, as the findings show, it’s pretty bad. It uses Markdown and LaTeX syntax or just made-up Typst-like syntax.
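For anyone unfamiliar with where this goes wrong, the typical confusions look roughly like this (an illustrative sketch based on the description above, not actual benchmark output):

```typst
// Correct Typst markup, with the syntax models often substitute noted in comments:
= Section heading       // not LaTeX's \section{...} or Markdown's # ...
*bold* and _italic_     // not Markdown's **bold** and *italic*
$ x^2 / 2 $             // fractions are plain division in math mode, not \frac{x^2}{2}
#text(red)[important]   // functions are called with #name(...), not \name{...}
```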

I would be interested to see how Google’s new Gemini models handle it, and maybe DeepSeek too.
