This webpage presents the TextCopies dataset, which was used to evaluate the method described in the following publication:
- S. Eskenazi, P. Gomez-Krämer, and J. Ogier, “When document security brings new challenges to document analysis,” in Proc. of 6th International Workshop on Computational Forensics (IWCF), pages 104–116, 2014.
[Images: Sample 1, Sample 2, Sample 3, Sample 4]
This dataset contains clean, text-only, typewritten documents in a mix of single-column and double-column layouts. The text uses every character, and only those characters, that Tesseract can recognize with its default English training: [^ I'veJoin\|-Sz:#6%50@parmFusB»fdchCtL?TMyRl~<®Nbk[«1,.”gH$(+DwV£49Q&AP¢]32©8/>Xéj;7€O¥Ux}E§=!’G)Zq{“—YK*W\"°fi‘_fl ]
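As an illustration of the character constraint, the following minimal sketch checks whether a piece of text stays within an allowed character set. The function name `unsupported_characters` and the constant `ALLOWED_EXCERPT` are hypothetical and not part of the dataset or of Tesseract. The excerpt holds only the letters, digits and a few punctuation marks taken from the character list above, so the full list would have to be substituted for a real check.

```python
from __future__ import annotations


def unsupported_characters(text: str, allowed: str) -> set[str]:
    """Return the characters of `text` that do not appear in `allowed`.

    Line breaks are ignored because they describe layout, not glyphs.
    """
    return {ch for ch in text if ch not in allowed and ch not in "\r\n"}


# Assumption: an abbreviated excerpt of the character list, for illustration only.
ALLOWED_EXCERPT = (
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789"
    " .,;:!?'-"
)

print(unsupported_characters("Invoice 42: total due!", ALLOWED_EXCERPT))  # set()
print(unsupported_characters("Total: 10 µm", ALLOWED_EXCERPT))  # {'µ'}
```

An empty result means every character of the text is covered by the set that was passed in.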
The dataset consists of 22 pages of text with the following characteristics:
- 1 page of a scientific article with a single column header and a double column body
- 3 pages of scientific articles with a double column layout
- 2 pages of programming code with a single column layout
- 4 pages of a novel with a single column layout
- 2 pages of legal texts with a single column layout
- 4 pages of invoices with a single column layout
- 4 pages of payslips with a single column layout
- 2 pages of birth certificate extracts with a single column layout
- 6 fonts: Arial, Calibri, Courier, Times New Roman, Trebuchet and Verdana
- 3 font sizes: 8, 10 and 12 points
- 4 styles: normal, bold, italic and the combination of bold and italic
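The counts in the list above can be cross-checked with a short sketch. The variable names are illustrative. The final product assumes that every page is rendered in every combination of font, size and style, which this page does not state explicitly.

```python
from itertools import product

# Page counts per document type, as listed above.
PAGES_PER_TYPE = {
    "scientific article, single column header and double column body": 1,
    "scientific article, double column": 3,
    "programming code": 2,
    "novel": 4,
    "legal text": 2,
    "invoice": 4,
    "payslip": 4,
    "birth extract": 2,
}

FONTS = ["Arial", "Calibri", "Courier", "Times New Roman", "Trebuchet", "Verdana"]
SIZES_PT = [8, 10, 12]
STYLES = ["normal", "bold", "italic", "bold italic"]

total_pages = sum(PAGES_PER_TYPE.values())
variants = list(product(FONTS, SIZES_PT, STYLES))

print(total_pages)    # 22
print(len(variants))  # 72

# Assumption: each page exists in every typographic variant.
print(total_pages * len(variants))  # 1584
```

The page counts per type sum to the 22 pages stated above, and the typographic options yield 6 × 3 × 4 = 72 combinations.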
The dataset is 36 GB in size and is hosted on an FTP server at the University of La Rochelle. To request access, please contact "l3i-pn(at)univ-lr(dot)fr".