ⲥ ⲕ ⲁ ⲙ

SCAM: A Text Recognition Dataset from
Sahidic Coptic Ancient Manuscripts

  ICDAR 2026

¹University of Modena and Reggio Emilia   ²Fondazione per le Scienze Religiose (FSCIRE)

Paper, arXiv and dataset links will be added upon publication.

Sample text lines from the SCAM manuscript

Sample line images used to build SCAM. The top rows belong to the SCAM-A subset (professional scans), the bottom rows to SCAM-B (camera photographs). The manuscript is written in scriptio continua — without spaces between words.

Abstract

We target Handwritten Text Recognition (HTR) in low-resource scenarios, which arise from underrepresented languages, rare scripts, and the degraded visual conditions typical of historical documents. We introduce SCAM (Sahidic Coptic Ancient Manuscripts), a new line-level dataset built from digitized ancient manuscripts written in the extinct Sahidic Coptic dialect. The dataset reflects a realistic and challenging setting, as it combines heterogeneous acquisition conditions across libraries with typical manuscript degradations such as ink fading, bleed-through, and material deterioration. Beyond visual complexity, SCAM poses significant linguistic challenges due to the scarcity of resources for Sahidic Coptic, its uncommon alphabet, and dialect-specific diacritics. To support research in low-resource HTR, we benchmark several state-of-the-art approaches based on different paradigms, highlighting their limitations and strengths, and underlining the gap between current HTR performance on well-resourced modern scripts and historically grounded, low-resource scenarios.

Highlights

The SCAM Dataset

SCAM contains lines from 27 leaves (53 pages) of a single-author manuscript known as Coptic Literary Manuscript (CLM) 359, dated to 1002–1003 C.E. and originating from the White Monastery in Upper Egypt. Only 49 leaves of the original codex are known to have survived, now spread across libraries and institutions worldwide and digitized under highly different conditions. Lines are isolated with polygon annotations (rather than bounding boxes) to handle slanted lines and the ornate paragraph-initial letters typical of the script.
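
To illustrate why polygons suit slanted lines better than bounding boxes, the sketch below compares the area of a line polygon with the area of its axis-aligned bounding box. The coordinates are hypothetical and are not taken from the SCAM annotations; the function names are illustrative only.

```python
def polygon_area(points):
    """Area of a simple polygon via the shoelace formula; points are (x, y) vertices in order."""
    twice_area = 0.0
    for (x1, y1), (x2, y2) in zip(points, points[1:] + points[:1]):
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0


def bounding_box_area(points):
    """Area of the smallest axis-aligned rectangle that contains all vertices."""
    xs, ys = zip(*points)
    return (max(xs) - min(xs)) * (max(ys) - min(ys))


# Hypothetical slanted text line: 400 px wide, 40 px tall, rising 60 px from left to right.
slanted_line = [(0, 60), (400, 0), (400, 40), (0, 100)]

# Fraction of the bounding box that the line polygon actually occupies.
coverage = polygon_area(slanted_line) / bounding_box_area(slanted_line)
print(f"polygon covers {coverage:.0%} of its bounding box")  # polygon covers 40% of its bounding box
```

In this toy case, more than half of the bounding box would be background or pixels from neighbouring lines, which a polygon crop excludes.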

  • 3,240 text lines
  • 99 graphemes
  • 53 pages / 27 leaves
  • 1 author · scriptio continua

SCAM-A

Professional scanners · controlled lighting
  • 1,445 lines · 24 pages · 12 leaves
  • 87 graphemes · 11.52 ± 2.30 chars/line
  • Splits: 943 train / 242 val / 260 test
  • Cleaner images, broader glyph coverage

SCAM-B

Camera photographs · uncontrolled conditions
  • 1,795 lines · 29 pages · 15 leaves
  • 67 graphemes · 11.29 ± 1.81 chars/line
  • Splits: 1,057 train / 368 val / 370 test
  • More degradation, narrower glyph set

Visual & Linguistic Analysis

Although both subsets come from the same single-author manuscript, differences in preservation and digitization induce a measurable domain gap. We quantify the visual gap with the Fréchet Inception Distance (FID) and Kernel Inception Distance (KID), the handwriting-style gap with the Handwriting Distance (HWD), and the linguistic gap with the Jensen–Shannon divergence (JSD) of character n-grams. Grayscale conversion narrows the visual gap but does not close it.

SCAM-A ↔ SCAM-B      FID ↓     KID ↓    HWD ↓    JSD ↓
RGB images           113.62    0.13     1.06     0.49
Grayscale images      65.06    0.07     1.04     -

Lower is more similar. HWD values in 0–1.5 are typical of same-author sets, consistent with SCAM being a single-author manuscript.
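
As a rough illustration of the linguistic-gap measure, the sketch below computes a Jensen–Shannon divergence between character n-gram distributions of two sets of transcriptions. The n-gram order (`n=2`), the base-2 logarithm, and the function names are assumptions made for this example; the page does not state the exact configuration behind the reported value.

```python
from collections import Counter
from math import log2


def char_ngram_distribution(lines, n=2):
    """Relative frequency of each character n-gram across a list of transcriptions."""
    counts = Counter()
    for line in lines:
        # Slide a window of width n over the line; lines shorter than n contribute nothing.
        counts.update(line[i:i + n] for i in range(len(line) - n + 1))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


def jensen_shannon_divergence(p, q):
    """Base-2 JSD between two discrete distributions given as dicts; lies in [0, 1]."""
    def kl_to_mixture(a, m):
        # KL divergence from distribution a to the mixture m, skipping zero-probability terms.
        return sum(pa * log2(pa / m[g]) for g, pa in a.items() if pa > 0)

    support = set(p) | set(q)
    mixture = {g: 0.5 * (p.get(g, 0.0) + q.get(g, 0.0)) for g in support}
    return 0.5 * kl_to_mixture(p, mixture) + 0.5 * kl_to_mixture(q, mixture)


# Toy transcriptions (Latin letters for readability, not SCAM data).
subset_a = ["abab", "abba"]
subset_b = ["abab", "bbbb"]
jsd = jensen_shannon_divergence(
    char_ngram_distribution(subset_a), char_ngram_distribution(subset_b)
)
print(f"JSD = {jsd:.3f}")
```

With this definition, identical distributions give 0 and distributions with no shared n-grams give 1.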

Benchmark

We benchmark 11 HTR models across CTC, sequence-to-sequence attention, and Transformer paradigms, reporting Character Error Rate (CER) and Sequence Error Rate (SER). The table below reports models trained on the full SCAM training set and tested on the whole test set and on each subset.

Model             SCAM               SCAM-A             SCAM-B
                  CER ↓    SER ↓     CER ↓    SER ↓     CER ↓    SER ↓
CRNN               9.47    49.79      9.00    51.54      9.85    48.38
C-SAN             88.06   100.00     87.52   100.00     88.50   100.00
VAN ★              5.25    31.99      5.96    36.15      4.68    28.65
HTR-VT            10.27    59.62     10.98    66.54      9.70    54.05
Kang et al.       11.43    54.18     11.44    57.69     11.42    51.35
Michael et al.    33.35    91.16     36.19    95.39     31.07    87.76
LT                22.59    78.74     21.86    78.85     23.18    78.65
VLT               15.03    62.81     14.89    67.31     15.15    59.19
TrOCR-S           13.12    65.99     17.04    86.87      9.96    49.20
TrOCR-B           12.42    74.39     14.42    77.78     10.81    71.66
TrOCR-L           11.33    72.63     12.82    74.50     10.14    71.12

Trained on the full SCAM training set. ★ marks the best model (VAN). Lower is better.
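
For readers unfamiliar with the two metrics, here is a minimal sketch of how CER and SER can be computed from paired reference and predicted lines. It assumes CER is the total edit distance divided by the total number of reference characters, and it treats each Unicode code point as one character; the benchmark's exact protocol (for example, how diacritics or graphemes are tokenized) is not specified on this page and may differ.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # delete r from the reference
                curr[j - 1] + 1,         # insert h into the reference
                prev[j - 1] + (r != h),  # substitute, or match at zero cost
            ))
        prev = curr
    return prev[-1]


def cer_and_ser(references, hypotheses):
    """Return (CER, SER) in percent over paired reference and predicted lines."""
    pairs = list(zip(references, hypotheses))
    total_edits = sum(edit_distance(r, h) for r, h in pairs)
    total_chars = sum(len(r) for r in references)
    wrong_lines = sum(r != h for r, h in pairs)
    return 100.0 * total_edits / total_chars, 100.0 * wrong_lines / len(references)


# Toy example (not SCAM data): one wrong character in the second of two 5-character lines.
cer, ser = cer_and_ser(["abcde", "fghij"], ["abcde", "fghiX"])
print(cer, ser)  # 10.0 50.0
```

Because a line counts as wrong as soon as it contains one character error, SER can be far higher than CER: if every 10-character line had exactly one error, CER would be 10% while SER would be 100%.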

Key findings


Qualitative results of the best model (VAN) across Train → Test setups. Errors are highlighted in red; per-line CER is reported. SER stays high even at low CER — a direct consequence of scriptio continua, where a single character error fails the whole line.

BibTeX

@inproceedings{quattrini2026scam,
  title     = {A Text Recognition Dataset from Sahidic Coptic Ancient Manuscripts},
  author    = {Quattrini, Fabio and Zaccagnino, Carmine and Bianchi, Costanza
               and Cascianelli, Silvia and Cucchiara, Rita},
  booktitle = {International Conference on Document Analysis and Recognition},
  year      = {2026}
}