This webpage presents the L3i LayoutCopies Dataset used to assess the method published in the following publication:

Sample 1

Sample 2

Sample 3

Sample 4

This dataset is intended to provide clean segmentation results representative of those that could be produced from a document that has gone through a print-and-scan process. It contains 15 layouts similar to the samples above. These layouts are the results obtained by three segmentation algorithms, PAL [1], JSEG [2] and Voronoi [3], on the documents of the PRiMA dataset [4]. For PAL, we used only the block information. Two of the 15 layouts are identical but were obtained with different segmentation algorithms: they have the same number of regions, with approximately the same sizes and positions. The layouts contain between 6 and 28 regions.
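As an illustration of what "identical" means here, the following minimal sketch checks whether two layouts have the same number of regions with approximately the same sizes and positions. It is not the procedure used to build the dataset: the bounding-box representation `(x, y, width, height)`, the function names and the pixel tolerance `tol` are all assumptions made for this example.

```python
from typing import List, Tuple

# Hypothetical region representation: (x, y, width, height) in pixels.
Region = Tuple[int, int, int, int]


def regions_match(a: Region, b: Region, tol: int) -> bool:
    """True if position and size of the two boxes each differ by at most tol pixels."""
    return all(abs(p - q) <= tol for p, q in zip(a, b))


def layouts_equivalent(first: List[Region], second: List[Region], tol: int = 10) -> bool:
    """Same region count, and every region paired with a distinct similar region.

    The pairing is greedy (first acceptable candidate wins), which is enough
    for a sketch but is not an optimal assignment. The default tolerance is
    an assumed value, not one taken from the dataset.
    """
    if len(first) != len(second):
        return False
    unmatched = list(second)
    for region in first:
        match = next((c for c in unmatched if regions_match(region, c, tol)), None)
        if match is None:
            return False
        unmatched.remove(match)
    return True
```

The order of the regions does not matter in this sketch, since two segmentation algorithms have no reason to list their regions in the same order.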

There are 64 prints, copies and double copies of each layout, so the dataset contains 15 × 64 = 960 images in total. The scanners added salt-and-pepper noise, which created many regions of only one or two pixels. Such noise would not be produced by a segmentation algorithm, so we removed it from the dataset. The dataset contains scale variations, because the printers add margins around the layout images and hence change their scale. We also used batch scanners, which introduced a surprisingly large amount of skew (about 5–10°).
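The removal of one- and two-pixel regions can be sketched as a connected-component filter. This is only one possible approach under stated assumptions, not necessarily the procedure applied to the dataset: the layout is assumed to be a label map stored as a list of rows (0 for background, positive integers for regions), components are 4-connected, and the names `remove_tiny_regions` and `max_noise_pixels` are hypothetical.

```python
from collections import deque
from typing import List

Grid = List[List[int]]  # assumed label map: 0 = background, >0 = region label


def remove_tiny_regions(labels: Grid, max_noise_pixels: int = 2) -> Grid:
    """Return a copy in which 4-connected components of at most
    max_noise_pixels pixels are reset to background (0)."""
    height = len(labels)
    width = len(labels[0]) if height else 0
    cleaned = [row[:] for row in labels]
    seen = [[False] * width for _ in range(height)]

    for y in range(height):
        for x in range(width):
            if seen[y][x] or labels[y][x] == 0:
                continue
            # Breadth-first search collects one component of equal labels.
            label = labels[y][x]
            component = [(y, x)]
            seen[y][x] = True
            queue = deque(component)
            while queue:
                cy, cx = queue.popleft()
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and not seen[ny][nx] and labels[ny][nx] == label):
                        seen[ny][nx] = True
                        component.append((ny, nx))
                        queue.append((ny, nx))
            # Components this small are treated as scanner noise.
            if len(component) <= max_noise_pixels:
                for py, px in component:
                    cleaned[py][px] = 0
    return cleaned
```

The input is left unmodified, so the noisy and cleaned versions of a scan can be compared side by side.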

[1] K. Chen, F. Yin, and C.-L. Liu. Hybrid page segmentation with efficient whitespace rectangles extraction and grouping. In Proc. of 12th International Conference on Document Analysis and Recognition (ICDAR), pages 958–962. IEEE, Aug. 2013.
[2] Y. Deng and B. S. Manjunath. Unsupervised segmentation of color-texture regions in images and video. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 23(8):800–810, 2001.
[3] K. Kise, A. Sato, and M. Iwata. Segmentation of page images using the area Voronoi diagram. Computer Vision and Image Understanding, 70(3):370–382, June 1998.
[4] A. Antonacopoulos, D. Bridson, C. Papadopoulos, and S. Pletschacher. A realistic dataset for performance evaluation of document layout analysis. In Proc. of 10th International Conference on Document Analysis and Recognition (ICDAR), pages 296–300. IEEE, 2009.

To download the dataset, click here