Are there any best practices for documents with large tables?

I have been experimenting with Typst as a replacement report-generation engine and have noticed extremely large memory usage, on the order of gigabytes, when large tables are rendered. I am not the first to hit this problem, and the tips I have found and tried are:

  • Use one of the data loading functions. I am personally using #json to parse an array that contains all the row data (4 MiB), and using the scripting language to build the cells for the table.
  • Use fixed column widths.

I would like to know if there is anything else worth trying; loading from JSON helps a little, and fixed column widths didn’t show a noticeable improvement.

I may try using the API to build the cells in Rust instead of scripting their creation, but I presume the problem is related to the amount of data the compiler retains while the table is being laid out.

Note: I am currently using a Java library called iText that has the same problem if you programmatically build a single table. Its way of reducing memory consumption is to let the application add the table in batches, simulating multiple tables and flushing state, while still rendering as a single one (headers, footers, etc.). This is probably easier to do in an API designed for programmatic use than in one that is more document-based, but it is a hint of how other PDF generation tools try to solve the problem.


Hi, welcome to the forum! Could you provide further info?

  • Is the content of this table mainly numbers or texts, or both?

    You used json rather than a binary format like cbor, so I assume they are mostly text?

  • Do you want to print the table on a single page (with set page(height: auto)) or split it into several pages? (Or do you not care?)

    The Typst compiler may be doing a lot of useless work determining page breaks, I guess.

  • How many tables are there to be generated? In other words, what is the frequency of generating tables? Is it 100+ tables per single run, or a single table every day?

    The GB-level memory usage might be caused by the cache for incremental compilation. For low-frequency demands, perhaps it could be turned off.

  • Did you use the typst executable (CLI + sys api like exec(["typst", "compile", …], stdin=…)) or the typst library (rust/python/… package)?

Greetings! It is example data: names, cities and ages. I tried using text for the age instead of a number, in case some alignment algorithm was kicking in; same gigabytes of memory usage. The data source doesn’t seem to be the cause, since just loading it and transforming it into a cell array, without generating the table, uses a reasonable amount of memory.

I am using the default behavior, with the table spanning standard-sized pages.

It is a single table (1440 pages in my test). I don’t see an option for disabling incremental compilation. If that ends up being the source of the problem, maybe such an option is needed for the final compilation of documents. I would be fine with it even if it were only available in the Typst Rust crate.

For testing I am using the CLI. I plan to experiment with the Rust crate later, probably not loading from JSON but building the cells from the Rust-based model.

#let data = json("data.json")

#{
  // Get all row data
  let rows = data.map(element => {
    (element.at("Name"), [#element.at("Age")], element.at("City"))
  }).flatten()
  
  // Alternative, maybe slower
  // let rows = ()
  // for element in data {
  //   rows.push(element.at("Name"))
  //   rows.push([#element.at("Age")])
  //   rows.push(element.at("City"))
  // }

  table(
    columns: (4cm, 4cm, 4cm),
    align: center,
    table.header("Name", "Age", "City"),
    ..rows
  )
}

And the data is just a large JSON array of:

  {
    "Name": "Alice",
    "Age": 30,
    "City": "New York"
  }

As a baseline, could you forgo the use of tables, given that you’re not really using the automatic table layout? Something like this:

#set par(spacing: 0pt, leading: 0pt)

#set box(
  width: 4cm,
  inset: 1em,
  stroke: black,
)

#let data = json("data.json")

#for row in data [
  #box(row.at("Name"))#box(str(row.at("Age")))#box(row.at("City"))#linebreak()
]

You can always fake a table header with a page header if your document doesn’t contain anything other than this one table.

I’m really curious about what’s going to happen with these thousand page plus documents. Is it just for archiving purposes? Will they be printed?


Have you tried something like splitting the table into 10 or 100 smaller tables? What’s the effect on memory usage and document compile time in that case?
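For example, something along these lines (a rough sketch; it assumes a recent Typst version with the array chunks method, the same data.json layout as above, and repeats the header on each sub-table):

```typst
#let data = json("data.json")
#let rows = data.map(e => (e.at("Name"), [#e.at("Age")], e.at("City"))).flatten()

// 100 rows of 3 cells per sub-table; each chunk gets its own
// header, so headers still repeat even though the big table is gone.
#for chunk in rows.chunks(300) {
  table(
    columns: (4cm, 4cm, 4cm),
    align: center,
    table.header("Name", "Age", "City"),
    ..chunk,
  )
}
```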

Using #box really does free about 1 GiB of peak RAM in this example, from 1.7 GiB down to around 0.7 GiB. A great improvement, and it has lower CPU usage too.

This could be an alternative. I would have to replicate a lot of the formatting that the table and cell functions give you directly with boxes, but that is doable; I will be translating a custom XML schema that defines the report format (cells, bold, fields, etc.) to the Typst model anyway. However, it will be very hard to replicate the repetition of headers unless all cells are the same height, probably by cutting content.

About the usage of a report like this: it is not very frequent, but there are sometimes legal requirements to have, for example, a “book” of inventory movements; it used to be a literal book a long time ago. It is usually requested by fiscal authorities. It must have certain formatting and content, and in a Healthcare Information System those movements can be pretty large when even the delivery of a single pill is accounted for.


I tried splitting the table and it didn’t help, and there is a problem predicting where to split it if even one cell’s content can be large and span more than one line.

I’m happy to hear that. I have no knowledge of the innards of Typst, but I wonder whether the extra memory use of table is caused by its design; table receives each cell as a separate argument. Is Typst building the entire argument list before looking at the individual cells?

I do not know what your formatting requirements are, but regarding the headers, what I proposed was something like this:

#set page(
  header-ascent: 0pt,
  header: [#box("Name")#box("Age")#box("City")],
)

(to be combined with my previous box code.)

Anyway, thanks for your explanation, and I’m pretty sure that people here will be able to help you with your formatting requirements.

Did you try creating single row tables?
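Something like this, I mean (a minimal sketch with the same data.json layout as above):

```typst
#let data = json("data.json")

// One tiny single-row table per record, so the layouter never
// has to hold the whole table at once.
#for e in data {
  table(
    columns: (4cm, 4cm, 4cm),
    align: center,
    e.at("Name"), [#e.at("Age")], e.at("City"),
  )
}
```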

Hey, did you make any progress on this issue? I have a similar use case and have implemented a PoC. We have ~25k rows, where each page can hold ~12 rows. The solution I reached to render fast with low memory usage was to create sub-tables with ~30 rows each, but in doing so I could not have proper table headers on every page, sadly. The low memory usage and rendering time were worth it, though.

Yes. I did not find a meaningful memory usage reduction with single-row tables versus tables that fit exactly one page. Boxes use nearly half the memory, but they have many issues; if your cell content will always fit on one line, they are fine.

I have been playing with layout and measure; it uses more RAM than a single table, but these are only experiments:

#let use_boxes = true

#let rows = ()
#let n = 0
#while n < 10000 {
  rows.push(([#n] , lorem(2), lorem(2)))
  n += 1;
}

#let header = ("Name", "Age", "City")

#let draw_row(cells) = {
    if use_boxes {
        let i = 0
        while i < cells.len() {
            box(width: 4cm, stroke: 1pt, inset: 0.5em, cells.at(i))
            i += 1
        }
    } else {
        table(
            columns: (4cm, 4cm, 4cm),
            ..cells
        )
    }
}

#let rowCounter = counter("rowCounter")

#let draw_page() = context {
    if (rowCounter.get().at(0) < rows.len()) {
        draw_row(header)
        v(0pt, weak: true)

        block(height: 1fr, fill: aqua, layout(size => {
            let remaining = size.height
            let rowIndex = rowCounter.get().at(0)

            while rowIndex < rows.len() {
                let row = rows.at(rowIndex);
                let size = measure(draw_row(row))
                remaining -= size.height

                if remaining >= 0pt {
                    draw_row(row)
                    v(0pt, weak: true)
                } else {
                    break
                }

                rowIndex += 1
                rowCounter.step()
            }
        }))

        pagebreak()
    }
}


#let i = 0
#while i < 1000 {
    draw_page()
    i += 1
}

By default this uses boxes, but you can change the value of use_boxes to use single-row tables. Both use more RAM than a single native table, but it shows clearly how boxes use about half as much.

The #while i < 1000 hack is there because I don’t know yet how to use the counter or a state to know whether all rows have been processed, so for now I just try to add more pages and do nothing when there are no more rows.

I think the best alternative with the current version of Typst, while hoping for a future one with better memory usage, is to split the table into smaller ones and generate one PDF for each.

With my latest example code I managed to add rows exactly up to the point where the page breaks, so each chunk would be a medium table with headers plus one without headers and enough rows to fill the page, and then PDF generation stops. With some #metadata added to the document, the next PDF can be generated from the exact same row and the exact page number, and these PDFs can then be joined with an external tool.

Sadly, the CLI has no way to print metadata values after compilation, only a separate query command, so with the CLI alone this means two passes per PDF. The Rust API can probably do both at the same time. This will be my next experiment.
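For reference, the query-based variant I have in mind looks roughly like this (the <resume> label and the field names are made up for illustration):

```typst
// At the end of each chunk, record where the next chunk should resume.
// After compilation, read it back with the CLI, e.g.:
//   typst query report.typ "<resume>" --field value --one
#metadata((last-row: 1234, next-page: 501)) <resume>
```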

A downside of joining multiple PDFs is embedded images: if you embed an image multiple times, for example a company logo in the page header, joining the PDFs means the image data is repeated in each PDF (that is, assuming Typst already reuses image pixmap data between multiple insertions within a single document).