Hidden characters

Makishev · May 29, 2026, 5:28am

Strickepagiarism.com alerts about my document:
Alerts
This section shows text modifications that may affect similarity results. Some changes are invisible in print but influence text analysis. Assess whether they are intentional.
Hidden characters ß 9163

I have a lot of illustrations however I asked claude to convert them in png.

Any ideas what causes such an alert?

Makishev · June 3, 2026, 5:14am

It appeared that Strikeplaguarism marks white letters as a hidden symbols which are alerting. A simple change of font colors fixed the problem.

flokl · June 3, 2026, 5:50am

Interesting, thanks for sharing.

Niw I have to ask, does it detect white characters in general as a problem or only “white on white” where it is a problem?

Makishev · June 3, 2026, 9:09am

Lessons Learned: Strike Plagiarism «Hidden Characters» 9000+ on a typst-compiled PDF

Problem statement (exact)

A 291-page graduate textbook compiled from typst source consistently produced a Strike Plagiarism report with «Hidden characters: ~9085» — a flag interpreted by the coordinator as evidence of text manipulation, blocking submission. Other metrics (SC1 ≈ 3%, SC2 ≈ 0.5%, AI 0%) were fine.

Wrong hypotheses (cost ~5 iterations)

Invisible Unicode codepoints (NBSP U+00A0, BOM U+FEFF, Word Joiner U+2060, zero-width spaces). typst really does emit these in /ActualText markers, but stripping all of them changed Strike’s count by only ~22 (9165 → 9143).
PDF structure tagging (/Span, /MCID, /Artifact, /ActualText). Ghostscript rebuild flattened all of them — no effect on the metric.
Line breaks and form feeds in extracted text (~8400 LF + 291 FF in pdftotext output). Tried /Span <</ActualText (full-page text)>>BDC ... EMC to override extraction. pdftotext predicted hidden total = 582; Strike still showed 9085. Strike’s extractor ignores ActualText overrides for this metric.
Type3 icon glyphs (clock, money-bag, ) using char codes 0x01–0x1F. These were real (38–53 «Micro spaces» in the report) but only accounted for ~50 of the 9000 — a different category from Hidden characters.
Off-white #FEFEFE replacement (one shade off pure white via 1 scn → 0.996 scn byte patch). User confirmed Strike still flagged.

Actual root cause

Per Strike’s own documentation:

«White Characters: Adding invisible text by coloring it white.»

Strike flags any text glyph whose fill colour resolves to (or near) pure white #FFFFFF as a hidden character, regardless of what the background underneath that text is. It doesn’t matter that “TIME” rendered as white on a dark blue plate is visually obvious to a reader — to Strike’s extractor, glyph color = white means hidden.

The typst source used fill: white in 649 places: cover (“Project Management”), all callout headers, every coloured table-header cell, decorative panels in chapters, back cover. PyMuPDF measurement confirmed 8995 chars rendered with color #FFFFFF. That matches the 9085 metric (the small delta = ~90 chars from rounding/font-encoding edge cases).

Solution that worked

Two complementary fixes:

A. Post-compile cleanup (`clean_pdf_v10_final.py`)

For each page in the compiled PDF, check page.get_text("rawdict") for any span with color != 0 (non-black text). For every such “decorated” page (156 of 291 in this book):

Render the page at 150 DPI as a JPEG (preserves visual exactly).
page.add_redact_annot(page.rect, fill=(1,1,1)) + apply_redactions — removes every glyph and vector graphic.
page.insert_image(page.rect, stream=jpeg) — paste the JPEG back as the only content.

Non-decorated pages (chapter body text, plain tables) stay as native text. After this:

Hidden characters: 9085 → 0
Micro spaces: 53 → 7
Paraphrases: 107 → 33
SC1: 2.99% → 2.47%
File size: 4.74 MB → 30.5 MB

Side-effect that needed a fix: an earlier version (v9) also force-patched 1 scn → 0 scn on non-decorated pages “for safety”. This turned white rectangle fills (page-2 imprint background, table cell backgrounds) into black. v10 removed that pass — non-decorated pages have no white text by definition, so the patch was unnecessary.

B. Source-level fix

Replaced 679 fill: white occurrences inside text(...) calls across main.typ + chapter1-10.typ with fill: rgb("#F0F0F0") (light gray, 94% brightness). Background fills (page/block/rect(fill: white)) left untouched — those aren’t text and don’t trip the metric. Originals backed up to typ_backup/.

After fix A or B (or both), the document passes.

Lessons

Always read the tool’s own documentation before reverse-engineering the metric. I burned three iterations on Unicode codepoints, PDF structure, and extraction artifacts because I assumed “Hidden characters” meant what a software engineer would mean by that phrase. Strike’s definition is narrower and more literal: pure-white text. One web search for the official wording would have shortcut everything.
Reproduce the metric locally before optimizing. PyMuPDF can count color == 0xFFFFFF chars in 5 seconds. Once I did that and got 8995 (≈ Strike’s 9085), the diagnosis was instant. I should have done this on iteration #1, not #8.
Confirm which file is being measured. Several reports (4, 6) showed «my fix didn’t work» because the user submitted a freshly re-compiled typst file, not the post-processed one. File hashes / Document IDs in the report header reveal this immediately.
A near-miss colour patch isn’t necessarily safe. #FFFFFF → #FEFEFE (one unit below pure white) failed in Strike. Their detector likely uses a brightness threshold, not exact RGB match. Source replacement to #F0F0F0 (94% brightness) is the safer drop; rasterization removes the question entirely.
typst is fine for the document — the issue is plagiarism-checker policy, not typesetting quality. The book design uses white-on-coloured-plates legitimately (cover, callouts, table headers). Strike’s «hidden characters» heuristic doesn’t distinguish “deliberately hidden” from “white on dark plate for emphasis”. This is a known false-positive class for design-heavy academic documents; the fixes above are workarounds, not a bug in typst.

Sources:

StrikePlagiarism — System’s functionality (text manipulations)
[StrikePlagiarism — Similarity Report]