Conception of a Dataset for the Evaluation of Administrative Document Segmentation into Color Layers

This webpage presents the design of a dataset used to evaluate the segmentation of color administrative documents into color layers.

More information about the segmentation project is available on the website of the L3i laboratory (in French only): here.

The dataset is composed of:

An original dataset of 2000 synthetic images
The ground truth associated with the original images
A dataset of the synthetic images after noise application

You can download it here.

This ground truth is intended for evaluating color segmentation of color administrative document images. The synthetic images were generated automatically, so the color value of every pixel is known. Several templates have been defined with drawing areas (e.g. headers, footers ...) to produce semi-structured images that look like documents. The number of colors varies from image to image. In this work, we assume that each image has one main background color and one main foreground color.

Two models have been defined so far: half of the documents in the dataset are letter documents and the other half are leaflet documents containing only text on a background. Several documents of the same family have been generated. They contain a part with fixed content and a part with random content. The randomly initialized elements are the colors, the fonts, the text, the size, and the shape of the elements.
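
The split between fixed and random content within a document family can be illustrated with a minimal Python sketch. It is not the generator used to build the dataset: the area names, the font list, the value ranges, the choice of which area is fixed, and the function name make_document_spec are all assumptions made for the example. The idea shown is only that one seed drives the part shared by a family, and another seed drives the part that changes from document to document.

```python
import random
import string

# Hypothetical templates: area names per document model (not taken from the dataset).
TEMPLATES = {
    "letter": ["header", "body", "footer"],
    "leaflet": ["title", "body"],
}
FONTS = ["serif", "sans-serif", "monospace"]  # placeholder font names


def random_color(rng):
    """Draw a random RGB triple with channels in [0, 255]."""
    return tuple(rng.randrange(256) for _ in range(3))


def random_area(rng):
    """Randomly initialize the font, size, color, and text of one drawing area."""
    return {
        "font": rng.choice(FONTS),
        "size": rng.randint(8, 24),  # assumed range of font sizes
        "color": random_color(rng),
        "text": "".join(rng.choice(string.ascii_lowercase) for _ in range(20)),
    }


def make_document_spec(model, family_seed, doc_seed):
    """Describe one synthetic document of a given family.

    Assumption: the first area of the template carries the fixed content
    (drawn from the family seed); all other areas are drawn from the
    document seed.
    """
    family_rng = random.Random(family_seed)
    doc_rng = random.Random(doc_seed)
    areas = TEMPLATES[model]
    spec = {
        "model": model,
        "background": random_color(doc_rng),  # main background color
        "foreground": random_color(doc_rng),  # main foreground color
        "areas": {},
    }
    for index, area in enumerate(areas):
        rng = family_rng if index == 0 else doc_rng
        spec["areas"][area] = random_area(rng)
    return spec
```

Two calls with the same family seed and different document seeds give specifications that share the first area and differ elsewhere, which mirrors the fixed and random parts described above.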

For each synthetic image from the original dataset, the generated ground truth is a set of binary layer images. In a given layer, all pixels belonging to the same color appear black on a white background. Each layer is associated with a color corresponding to the mean value [L*a*b*] of all pixels belonging to the layer (in the L*a*b* color space). The name of each layer follows the rule: 'L_a_b.png'.
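
The layer construction described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the tool that produced the dataset: the function name build_layers, the input representation (a label map giving the color class of each pixel, plus the L*a*b* value of each pixel), and the rounding of the mean to integers in the file name are all hypothetical.

```python
def build_layers(labels, lab_pixels):
    """Build one binary layer per color class.

    labels     : 2D list of color-class identifiers, one per pixel (assumed input)
    lab_pixels : 2D list of (L, a, b) tuples with the same shape
    Returns a dict mapping a layer name 'L_a_b.png' to a 2D mask in which
    the pixels of the layer are black (0) and all others are white (255).
    """
    # Accumulate the per-class sums of L, a, b and the pixel count.
    sums = {}
    for label_row, pixel_row in zip(labels, lab_pixels):
        for label, (L, a, b) in zip(label_row, pixel_row):
            acc = sums.setdefault(label, [0.0, 0.0, 0.0, 0])
            acc[0] += L
            acc[1] += a
            acc[2] += b
            acc[3] += 1

    layers = {}
    for label, (sum_l, sum_a, sum_b, count) in sums.items():
        mean = (sum_l / count, sum_a / count, sum_b / count)
        # Assumption: the mean is rounded to integers for the file name.
        name = "_".join(str(int(round(c))) for c in mean) + ".png"
        mask = [[0 if v == label else 255 for v in row] for row in labels]
        layers[name] = mask
    return layers
```

In this sketch, two classes whose rounded means coincide would map to the same name; a real tool would need a rule for that case, which the description above does not specify.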

Finally, we have created a dataset of noisy images. To simulate real noise, Gaussian noise has been applied to the images. They have then been saved with high JPEG compression to match the constraints our industrial partner faces in its digitization workflow.
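
The noise step can be sketched as follows, using only the Python standard library. The standard deviation, the choice to add the noise independently to each RGB channel, and the function name add_gaussian_noise are assumptions: the parameters actually used for the dataset are not given here. The JPEG compression step is not covered by the sketch, since it requires an image library.

```python
import random


def add_gaussian_noise(image, sigma=10.0, seed=None):
    """Add zero-mean Gaussian noise to every channel of every pixel.

    image : 2D list of (R, G, B) tuples with integer channels in [0, 255]
    sigma : standard deviation of the noise (hypothetical default)
    seed  : optional seed, so that the noisy image can be reproduced
    Returns a new image of the same shape, with channels clamped to [0, 255].
    """
    rng = random.Random(seed)
    noisy = []
    for row in image:
        noisy_row = []
        for pixel in row:
            noisy_pixel = tuple(
                min(255, max(0, int(round(c + rng.gauss(0.0, sigma)))))
                for c in pixel
            )
            noisy_row.append(noisy_pixel)
        noisy.append(noisy_row)
    return noisy
```

Clamping keeps the result a valid 8-bit image, at the cost of slightly truncating the noise distribution near pure black and pure white.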