TextCopies Dataset

This page presents the TextCopies dataset, which was used to evaluate the method described in the following publication:

S. Eskenazi, P. Gomez-Krämer, and J. Ogier, “When document security brings new challenges to document analysis,” in Proc. of 6th International Workshop on Computational Forensics (IWCF), pages 104–116, 2014.

[Images: four sample pages from the dataset (Examples 1 to 4)]

This dataset contains clean, text-only, typewritten documents, in a mix of single- and double-column layouts. We used all of the characters, and only those characters, that Tesseract can recognize with its default English training data: [^ I'veJoin\|-Sz:#6%50@parmFusB»fdchCtL?TMyRl~<®Nbk[«1,.”gH$(+DwV£49Q&AP¢]32©8/>Xéj;7€O¥Ux}E§=!’G)Zq{“—YK*W\"°ﬁ‘_ﬂ ]

The dataset is built from 22 pages of text, distributed as follows:

1 page of a scientific article with a single column header and a double column body
3 pages of scientific articles with a double column layout
2 pages of programming code with a single column layout
4 pages of a novel with a single column layout
2 pages of legal texts with a single column layout
4 pages of invoices with a single column layout
4 pages of payslips with a single column layout
2 pages of birth certificate extracts with a single column layout

Several variants of these 22 text pages were created by combining:

6 fonts: Arial, Calibri, Courier, Times New Roman, Trebuchet and Verdana
3 font sizes: 8, 10 and 12 points
4 styles: normal, bold, italic and the combination of bold and italic

This yields 1,584 documents (22 pages × 6 fonts × 3 font sizes × 4 styles). We printed these documents on three printers (a Konica Minolta Bizhub 223, a Sharp MX M904 and a Sharp MX M850) and scanned them on three scanners at several resolutions between 150 dpi and 600 dpi. Each document was scanned the same number of times at each resolution, so that there are enough images and no resolution is over-represented in the statistics. The resulting dataset contains 42,768 document images.
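The counts above can be checked with a short Python sketch. The variable names are illustrative and not part of the dataset; only the page counts, fonts, sizes, styles and the total number of images are taken from this page. The number of images per document is derived by division, since the exact list of scan resolutions is not given here.

```python
from itertools import product

# Page counts per document type, as listed on this page.
PAGES = {
    "scientific article (single column header, double column body)": 1,
    "scientific articles (double column)": 3,
    "programming code": 2,
    "novel": 4,
    "legal texts": 2,
    "invoices": 4,
    "payslips": 4,
    "birth certificate extracts": 2,
}

FONTS = ["Arial", "Calibri", "Courier", "Times New Roman", "Trebuchet", "Verdana"]
SIZES_PT = [8, 10, 12]
STYLES = ["normal", "bold", "italic", "bold italic"]

# Every (font, size, style) combination is one variant of a page.
variants = list(product(FONTS, SIZES_PT, STYLES))

n_pages = sum(PAGES.values())
n_documents = n_pages * len(variants)

# Total number of scanned images, as stated on this page.
TOTAL_IMAGES = 42768
images_per_document, remainder = divmod(TOTAL_IMAGES, n_documents)

print(n_pages, len(variants), n_documents, images_per_document, remainder)
# prints: 22 72 1584 27 0
```

Each document therefore appears in 27 images. This would be consistent with 3 printers × 3 scanners × 3 resolutions, but that last factor is an inference from the totals and is not stated explicitly here.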

The dataset is 36 GB in size and is hosted on an FTP server at the University of La Rochelle. To request access, please contact "l3i-pn(at)univ-lr(dot)fr".