We provide here a labeled dataset for document understanding research (download the dataset). We used this dataset, in whole or in part, to assess the performance of several document understanding and classification systems that we built:
Please cite our work if you use the dataset.
The dataset is composed as follows. It contains two groups of documents: 110 data-sheets of electronic components and 136 patents. Each group is further divided into classes: documents in a data-sheet class share the component type and producer; documents in a patent class share the patent source.
For each document the dataset contains:
Three sample images, each showing the first page of a document of the dataset, are presented here. The first two belong to the data-sheets and patents groups; the third belongs to a further portion of the dataset (invoices) which we could not publish due to privacy concerns.
Here follows a snippet of the dataset file structure:
ghega-dataset
    datasheets
        central-zener-1
        central-zener-2
        diodes-zener
            document-000-123542.blocks.csv
            document-000-123542.groundtruth.csv
            document-000-123542.in.000.png
            document-000-123542.out.000.png
            document-001-123663.blocks.csv
            document-001-123663.groundtruth.csv
            document-001-123663.in.000.png
            document-001-123663.out.000.png
            ...
        mcc-zener
        ...
    patents
        ...
The filenames follow these conventions:
document-#documentInClass-id.in.#page.png
document-#documentInClass-id.out.#page.png
document-#documentInClass-id.blocks.csv
document-#documentInClass-id.groundtruth.csv
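The four patterns above can be recognized with a single regular expression. The following is a minimal Python sketch, not part of the dataset tooling: the names `FILENAME_RE`, `DatasetFile` and `parse_filename` are hypothetical, and the sketch assumes that the in-class document number, the id and the page number are always written as decimal digits, as in the file structure snippet.

```python
import re
from typing import NamedTuple, Optional

# Built from the four filename conventions listed above.
# Assumption: number, id and page are all runs of decimal digits.
FILENAME_RE = re.compile(
    r"^document-(?P<number>\d+)-(?P<doc_id>\d+)\."
    r"(?:(?P<image>in|out)\.(?P<page>\d+)\.png"
    r"|(?P<table>blocks|groundtruth)\.csv)$"
)


class DatasetFile(NamedTuple):
    number: int           # position of the document within its class
    doc_id: str           # document id, kept as a string
    kind: str             # "in", "out", "blocks" or "groundtruth"
    page: Optional[int]   # page number for images, None for csv files


def parse_filename(name: str) -> Optional[DatasetFile]:
    """Return the parts of a dataset filename, or None if it does not match."""
    m = FILENAME_RE.match(name)
    if m is None:
        return None
    kind = m.group("image") or m.group("table")
    page = int(m.group("page")) if m.group("page") is not None else None
    return DatasetFile(int(m.group("number")), m.group("doc_id"), kind, page)
```

For example, `parse_filename("document-000-123542.in.000.png")` yields document number 0, id `"123542"`, kind `"in"` and page 0, so the image and csv files of the same document can be grouped by their id.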
A block consists of a rectangular portion of the processed image where the OCR software found a single-line piece of text. We ran the OCR with its standard configuration for line segmentation. A block includes:
An example of two lines of a blocks csv file follows:
TextLineBlockCommon,0,5.67,0.31,0.11,0.059,value,[B@3c649bb0
TextLineBlockCommon,0,1.28,0.51,0.74,0.11,"some, "" point", [B@71295ec9
The text part is double-quoted when needed, for example when it contains a comma; double-quotes inside the text are escaped by doubling them.
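Because the quoting follows the usual csv convention, the blocks file can be read with a standard csv parser. The sketch below uses Python's `csv` module on the two example lines. The field names in `Block` are assumptions inferred from the example (block type, page index, x, y, width, height, text); the trailing column (`[B@...`) is ignored.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Block:
    # Field names are assumptions inferred from the example lines.
    block_type: str
    page: int
    x: float
    y: float
    width: float
    height: float
    text: str


def read_blocks(lines):
    """Parse an iterable of blocks csv lines into Block objects."""
    blocks = []
    for row in csv.reader(lines):
        if len(row) < 7:
            continue  # skip blank or malformed lines
        # Columns 2-5 are the four numeric geometry values; the eighth
        # column, if present, is not used.
        blocks.append(Block(row[0], int(row[1]), *map(float, row[2:6]), row[6]))
    return blocks


sample = io.StringIO(
    'TextLineBlockCommon,0,5.67,0.31,0.11,0.059,value,[B@3c649bb0\n'
    'TextLineBlockCommon,0,1.28,0.51,0.74,0.11,"some, "" point", [B@71295ec9\n'
)
blocks = read_blocks(sample)
print(blocks[1].text)  # prints: some, " point
```

With a real file, the same function can be called on an open file handle, for example `read_blocks(open(path, newline=""))`.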
We manually built the groundtruth by visually inspecting each document and selecting the blocks of interest by means of a UI. We looked for the following information (elements):
Each document could contain 0, 1 or more values for each element.
For each document, for each element which has a value in the document, we inserted in the groundtruth one or two blocks, as follows:
Where possible, we kept the same behaviour in selecting one or two blocks for the same element in documents of the same class. The text of the value block could contain text other than the value itself.
The actual groundtruth csv file contains one line for each value: the line always contains both blocks. There can be more than one line for the same element (for example, when the document shows the patent number in two places).
A groundtruth csv line contains:
An example of two lines of a groundtruth csv file follows (the second line is a single line in the actual file, even where it is displayed split):
Case,-1,0.0,0.0,0.0,0.0,,0,1.28,2.78,0.79,0.10,MELF CASE
StorageTemperature,0,0.35,3.40,2.03,0.11,Operating and Storage Temperature,0,4.13,3.41,0.63,0.09,-65 to +200
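The two example lines can be parsed along the same lines as the blocks file. The following Python sketch rests on assumptions inferred from the example rather than on a stated specification: each line holds the element name followed by two six-field blocks (page, x, y, width, height, text), the first being the label block and the second the value block, and a page of -1 marks a block that is absent. The names `GtBlock`, `GtEntry` and `read_groundtruth` are hypothetical.

```python
import csv
from collections import defaultdict
from typing import NamedTuple, Optional


class GtBlock(NamedTuple):
    page: int
    x: float
    y: float
    width: float
    height: float
    text: str


class GtEntry(NamedTuple):
    element: str
    label: Optional[GtBlock]   # None when the label block is absent
    value: Optional[GtBlock]


def _block(fields) -> Optional[GtBlock]:
    # Assumption: a negative page index means "no block".
    page = int(fields[0])
    if page < 0:
        return None
    return GtBlock(page, *map(float, fields[1:5]), fields[5])


def read_groundtruth(lines):
    """Parse groundtruth csv lines into a dict: element -> list of entries."""
    by_element = defaultdict(list)
    for row in csv.reader(lines):
        if len(row) != 13:
            continue  # skip blank or malformed lines
        entry = GtEntry(row[0], _block(row[1:7]), _block(row[7:13]))
        by_element[entry.element].append(entry)
    return dict(by_element)


sample = [
    "Case,-1,0.0,0.0,0.0,0.0,,0,1.28,2.78,0.79,0.10,MELF CASE",
    "StorageTemperature,0,0.35,3.40,2.03,0.11,Operating and Storage Temperature,"
    "0,4.13,3.41,0.63,0.09,-65 to +200",
]
groundtruth = read_groundtruth(sample)
```

Grouping by element keeps together the lines that refer to the same element, which matters when a document shows a value in more than one place.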
The dataset name is inspired by Carl Ritter von Ghega, who designed a railway from our city, Trieste, to our past capital, Vienna.