Ghega dataset
Ghega-dataset: a dataset for document understanding and classification
We provide here a labeled dataset which can be useful for document understanding research experiments (download the dataset). We used this dataset (or part of it) for assessing the performance of several systems for document understanding and classification we built:
- A Probabilistic Approach to Printed Document Understanding
- A Domain Knowledge-based Approach for Automatic Correction of Printed Invoices
- Improving Features Extraction for Supervised Invoice Classification
- Open World Classification of Printed Invoices
Please cite our work if you use the dataset.
Dataset composition
The dataset contains two groups of documents: 110 data-sheets of electronic components and 136 patents. Each group is further divided into classes: data-sheet classes share the component type and producer; patent classes share the patent source.
For each document the dataset contains:
- one or more png images (b/w, 300dpi) of the document pages (original);
- one or more png images (b/w, 300dpi) of the document pages after deskew and binarization (processed);
- one csv file containing all text blocks found by the OCR, for which we used OCRopus 0.2 (blocks; see below for details);
- one csv file containing the groundtruth (groundtruth; see below for details).
Sample documents
Three sample images, each showing the first page of a document of the dataset, are presented here. The first two belong to the data-sheets and patents groups; the third belongs to a third portion of the dataset (invoices) which we could not publish due to privacy concerns.
File and directory structure
Here follows a snippet of the dataset file structure:
ghega-dataset
  datasheets
    central-zener-1
    central-zener-2
    diodes-zener
      document-000-123542.blocks.csv
      document-000-123542.groundtruth.csv
      document-000-123542.in.000.png
      document-000-123542.out.000.png
      document-001-123663.blocks.csv
      document-001-123663.groundtruth.csv
      document-001-123663.in.000.png
      document-001-123663.out.000.png
      ...
    mcc-zener
    ...
  patents
    ...
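As an illustration of the layout above, the following Python sketch lists the class directories found in each group. It assumes the archive has been extracted locally with the group/class/files nesting shown in the snippet; the function name `list_classes` and the extraction path are hypothetical and not part of the dataset.

```python
from pathlib import Path


def list_classes(root):
    """Map each group directory name to the sorted names of its class directories.

    Assumes the layout root/group/class/files shown in the snippet above.
    """
    root = Path(root)
    classes = {}
    for group_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        # Every sub-directory of a group is treated as one class.
        classes[group_dir.name] = sorted(
            c.name for c in group_dir.iterdir() if c.is_dir()
        )
    return classes
```

Calling `list_classes("ghega-dataset")` on an extracted copy (the path is an assumption) would return a dictionary keyed by group name, such as `datasheets` and `patents`.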
The filenames follow these conventions:
- original: document-#documentInClass-id.in.#page.png
- processed: document-#documentInClass-id.out.#page.png
- blocks: document-#documentInClass-id.blocks.csv
- groundtruth: document-#documentInClass-id.groundtruth.csv
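The naming conventions above can be checked mechanically. The Python sketch below parses a filename into its parts; the regular expression assumes that the document index, the id and the page number are all digit-only, which is inferred from the sample names (for example document-000-123542.in.000.png) rather than stated explicitly. The names `FILENAME_RE`, `KINDS` and `parse_filename` are hypothetical.

```python
import re

# Pattern derived from the conventions above. Digit-only fields are an
# assumption based on the sample filenames in the directory snippet.
FILENAME_RE = re.compile(
    r"^document-(?P<index>\d+)-(?P<doc_id>\d+)\."
    r"(?:(?P<image>in|out)\.(?P<page>\d+)\.png"
    r"|(?P<table>blocks|groundtruth)\.csv)$"
)

# Maps the filename marker to the file kind named in the list above.
KINDS = {
    "in": "original",
    "out": "processed",
    "blocks": "blocks",
    "groundtruth": "groundtruth",
}


def parse_filename(name):
    """Return a dict describing a dataset filename, or None if it does not match."""
    m = FILENAME_RE.match(name)
    if m is None:
        return None
    page = m.group("page")
    return {
        "index_in_class": int(m.group("index")),
        "doc_id": m.group("doc_id"),
        "kind": KINDS[m.group("image") or m.group("table")],
        # Only image files carry a page number; csv files get None.
        "page": int(page) if page is not None else None,
    }
```

The id is kept as a string so that any leading zeros survive, while the index and page are converted to integers for easier sorting.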
Blocks file composition
A block consists of a rectangular portion of the processed image where the OCR software found a single-line piece of text. We ran the OCR with its standard configuration for line segmentation. A block includes:
- type of block (just one fixed value)
- page (0 is the first page)
- x position from upper-left corner of the page, in inches
- y position from upper-left corner of the page, in inches
- width in inches
- height in inches
- found text
- serialized data (not useful; it can be ignored)
An example of two lines of a blocks csv file follows:
TextLineBlockCommon,0,5.67,0.31,0.11,0.059,value,[B@3c649bb0
TextLineBlockCommon,0,1.28,0.51,0.74,0.11,"some, "" point", [B@71295ec9
The text part is double-quoted when needed (any double-quotes inside it are escaped by doubling them).
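A blocks file in this format can be read with Python's standard `csv` module, which handles the quoting shown in the example. The sketch below is a minimal reader, not the tooling we used: the names `Block`, `read_blocks` and `to_pixels` are hypothetical, and the pixel conversion assumes that block coordinates map directly onto the 300 dpi processed image.

```python
import csv
from typing import NamedTuple

DPI = 300  # resolution stated for the page images


class Block(NamedTuple):
    block_type: str
    page: int
    x: float       # inches from the upper-left corner
    y: float       # inches from the upper-left corner
    width: float   # inches
    height: float  # inches
    text: str


def read_blocks(lines):
    """Parse blocks csv rows from an iterable of lines (e.g. an open file)."""
    blocks = []
    for row in csv.reader(lines):
        if len(row) < 7:
            continue  # skip blank or malformed rows
        # The trailing serialized-data field, if present, is ignored.
        blocks.append(Block(row[0], int(row[1]), float(row[2]), float(row[3]),
                            float(row[4]), float(row[5]), row[6]))
    return blocks


def to_pixels(block, dpi=DPI):
    """Convert a block's box from inches to (left, top, right, bottom) pixels.

    Assumes the coordinates refer to the processed image at the given dpi.
    """
    return (
        round(block.x * dpi),
        round(block.y * dpi),
        round((block.x + block.width) * dpi),
        round((block.y + block.height) * dpi),
    )
```

When reading from disk, opening the file with `newline=""` lets the `csv` module handle line endings itself.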
Groundtruth file composition
We manually built the groundtruth by visually inspecting each document and, by means of a UI, selecting the blocks of interest. We looked for the following information (elements):
- data-sheets: Model, Type, Case, Power Dissipation, Storage Temperature, Voltage, Weight, Thermal Resistance
- patents: Title, Applicant, Inventor, Representative, Filing Date, Publication Date, Application Number, Publication Number, Priority, Classification, Abstract 1st line
Each document could contain 0, 1 or more values for each element.
For each document and each element that has a value in the document, we inserted one or two blocks in the groundtruth, in one of the following ways:
- one block which contains the value (value block);
- one block which contains the value (value block) and one block which contains a label for the value (label block).
Where possible, we kept the same choice between one or two blocks for the same element across the documents of the same class. The value block text could contain text other than the value itself.
The actual groundtruth csv file contains one line for each value: the line always has fields for both blocks. There can be more than one line for the same element (for example, when the document shows the patent number in two places).
A groundtruth csv line contains:
- element type
- page of the label block (-1 if absent)
- x of the label block
- y of the label block
- w of the label block
- h of the label block
- text of the label block
- page of the value block (never absent!)
- x of the value block
- y of the value block
- w of the value block
- h of the value block
- text of the value block
An example of two lines of a groundtruth csv file follows (each is a single line in the actual file):
Case,-1,0.0,0.0,0.0,0.0,,0,1.28,2.78,0.79,0.10,MELF CASE
StorageTemperature,0,0.35,3.40,2.03,0.11,Operating and Storage Temperature,0,4.13,3.41,0.63,0.09,-65 to +200
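The field list above translates directly into a small reader. The Python sketch below parses groundtruth lines into an element type, an optional label block and a value block, treating a label page of -1 as an absent label. The names `Box`, `GroundtruthEntry` and `read_groundtruth` are hypothetical, and the reader assumes exactly 13 fields per line, as in the example.

```python
import csv
from typing import NamedTuple, Optional


class Box(NamedTuple):
    page: int
    x: float
    y: float
    w: float
    h: float
    text: str


class GroundtruthEntry(NamedTuple):
    element: str
    label: Optional[Box]  # None when the label page is -1 (label absent)
    value: Box


def _box(fields):
    """Build a Box from six consecutive csv fields."""
    page, x, y, w, h, text = fields
    return Box(int(page), float(x), float(y), float(w), float(h), text)


def read_groundtruth(lines):
    """Parse groundtruth csv rows from an iterable of lines (e.g. an open file)."""
    entries = []
    for row in csv.reader(lines):
        if len(row) != 13:
            continue  # skip blank rows or rows with an unexpected field count
        label = _box(row[1:7])
        value = _box(row[7:13])
        entries.append(
            GroundtruthEntry(row[0], label if label.page >= 0 else None, value)
        )
    return entries
```

Representing an absent label as `None` avoids mistaking the placeholder zero coordinates for a real position on the page.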
Dataset name
The dataset name is inspired by Carl Ritter von Ghega, who designed a railway from our city, Trieste, to our past capital, Vienna.