Ghega dataset

Ghega-dataset: a dataset for document understanding and classification

We provide here a labeled dataset that can be useful for research on document understanding (download the dataset). We used this dataset, or part of it, to assess the performance of several document understanding and classification systems that we built:

    1. A Probabilistic Approach to Printed Document Understanding
    2. A Domain Knowledge-based Approach for Automatic Correction of Printed Invoices
    3. Improving Features Extraction for Supervised Invoice Classification
    4. Open World Classification of Printed Invoices

Please cite our work if you use the dataset.

Dataset composition

The dataset contains two groups of documents: 110 data-sheets of electronic components and 136 patents. Each group is further divided into classes: data-sheet classes share the component type and producer; patent classes share the patent source.

For each document the dataset contains:

    1. one or more png images (b/w, 300dpi) of the document pages (original);
    2. one or more png images (b/w, 300dpi) of the document pages after deskew and binarization (processed);
    3. one csv file containing all text blocks found by the OCR (we used OCRopus 0.2; see below for details) (blocks);
    4. one csv file containing the groundtruth (see below for details) (groundtruth).

Sample documents

Three sample images, each showing the first page of a document of the dataset, are presented here. The first two belong to the data-sheets and patents groups; the third belongs to a third portion of the dataset (invoices) which we could not publish due to privacy concerns.


File and directory structure

Here follows a snippet of the dataset file structure:

ghega-dataset
    datasheets
        central-zener-1
        central-zener-2
        diodes-zener
            document-000-123542.blocks.csv
            document-000-123542.groundtruth.csv
            document-000-123542.in.000.png
            document-000-123542.out.000.png
            document-001-123663.blocks.csv
            document-001-123663.groundtruth.csv
            document-001-123663.in.000.png
            document-001-123663.out.000.png
            ...
        mcc-zener
        ...
    patents
        ...
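
The layout above can be traversed with a few lines of Python. The following is a minimal sketch, assuming that every document file sits exactly two directory levels below the dataset root (group, then class), as in the snippet; the function name index_documents is hypothetical and not part of the dataset.

```python
from collections import defaultdict
from pathlib import Path


def index_documents(root):
    """Map (group, class, document stem) to the sorted names of its files.

    Assumes the layout root/<group>/<class>/document-*, as in the snippet.
    """
    index = defaultdict(list)
    for path in sorted(Path(root).glob("*/*/document-*")):
        group, cls = path.parts[-3], path.parts[-2]
        # Everything before the first dot, e.g. "document-000-123542".
        stem = path.name.split(".", 1)[0]
        index[(group, cls, stem)].append(path.name)
    return dict(index)
```

Calling index_documents("ghega-dataset") would then return one entry per document, with its images and csv files listed together.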

The filenames follow these conventions:

    • original: document-#documentInClass-id.in.#page.png
    • processed: document-#documentInClass-id.out.#page.png
    • blocks: document-#documentInClass-id.blocks.csv
    • groundtruth: document-#documentInClass-id.groundtruth.csv
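
The conventions above can be checked mechanically. The sketch below is one possible parser, inferred from the conventions and the sample listing; the names DatasetFile and parse_filename are hypothetical, and the pattern assumes that the id contains no dot, which the source does not state explicitly.

```python
import re
from typing import NamedTuple, Optional

# Pattern inferred from the conventions above. Digit widths are not enforced,
# and the id is assumed to contain no dot.
FILENAME_RE = re.compile(
    r"document-(?P<index>\d+)-(?P<doc_id>[^.]+)\."
    r"(?:(?P<image>in|out)\.(?P<page>\d+)\.png|(?P<table>blocks|groundtruth)\.csv)"
)


class DatasetFile(NamedTuple):
    index: int            # position of the document within its class
    doc_id: str           # document id, kept as text
    kind: str             # "original", "processed", "blocks" or "groundtruth"
    page: Optional[int]   # page number for images, None for csv files


def parse_filename(name: str) -> Optional[DatasetFile]:
    """Return the parsed parts of a dataset filename, or None if it does not match."""
    m = FILENAME_RE.fullmatch(name)
    if m is None:
        return None
    if m.group("image"):
        kind = "original" if m.group("image") == "in" else "processed"
        page = int(m.group("page"))
    else:
        kind = m.group("table")
        page = None
    return DatasetFile(int(m.group("index")), m.group("doc_id"), kind, page)
```

For example, parse_filename("document-000-123542.out.000.png") yields a processed image of page 0 for document 0 of its class.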

Blocks file composition

A block consists of a rectangular portion of the processed image where the OCR software found a single-line piece of text. We ran the OCR with its standard configuration for line segmentation. A block includes the following fields:

    1. type of block (just one fixed value)
    2. page (0 is the first page)
    3. x position from upper-left corner of the page, in inches
    4. y position from upper-left corner of the page, in inches
    5. width in inches
    6. height in inches
    7. found text
    8. useless serialized data
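
Since positions and sizes are given in inches and the images are 300 dpi, a block can be mapped to a pixel box on the processed image. The sketch below assumes that the coordinates refer to the 300 dpi processed image and that rounding to the nearest pixel is acceptable; the function name block_to_pixel_box is hypothetical.

```python
DPI = 300  # resolution stated for the dataset images


def block_to_pixel_box(x, y, width, height, dpi=DPI):
    """Convert a block given in inches to (left, top, right, bottom) in pixels.

    Assumes the origin is the upper-left corner of the page, as stated for
    the blocks file, and rounds each edge to the nearest pixel.
    """
    left = round(x * dpi)
    top = round(y * dpi)
    right = round((x + width) * dpi)
    bottom = round((y + height) * dpi)
    return left, top, right, bottom
```

The resulting tuple follows the (left, top, right, bottom) order that many image libraries use for cropping.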

An example of two lines of a blocks csv file follows:

TextLineBlockCommon,0,5.67,0.31,0.11,0.059,value,[B@3c649bb0
TextLineBlockCommon,0,1.28,0.51,0.74,0.11,"some, "" point",    [B@71295ec9

The text field is double-quoted when needed (for example, when it contains a comma); double-quotes inside the text are escaped by doubling them.
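
This quoting style matches the default dialect of the Python csv module, so a blocks file can be read as sketched below. The names Block and read_blocks are hypothetical; the eighth field (serialized data) is ignored, and the file encoding is not stated in the source, so it is left to the caller.

```python
import csv
from typing import Iterable, Iterator, NamedTuple


class Block(NamedTuple):
    block_type: str
    page: int       # 0 is the first page
    x: float        # inches from the upper-left corner of the page
    y: float        # inches from the upper-left corner of the page
    width: float    # inches
    height: float   # inches
    text: str


def read_blocks(lines: Iterable[str]) -> Iterator[Block]:
    """Yield one Block per non-empty csv line, dropping the trailing serialized data."""
    for row in csv.reader(lines):
        if not row:
            continue
        block_type, page, x, y, w, h, text = row[:7]
        yield Block(block_type, int(page), float(x), float(y), float(w), float(h), text)
```

The argument can be any iterable of lines, for instance a file opened with open(path, newline="").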

Groundtruth file composition

We manually built the groundtruth by visually inspecting each document and selecting the blocks of interest by means of a UI. We looked for the following information (elements):

    1. data-sheets: Model, Type, Case, Power Dissipation, Storage Temperature, Voltage, Weight, Thermal Resistance
    2. patents: Title, Applicant, Inventor, Representative, Filing Date, Publication Date, Application Number, Publication Number, Priority, Classification, Abstract 1st line

Each document could contain 0, 1 or more values for each element.

For each document, for each element which has a value in the document, we inserted in the groundtruth one or two blocks, as follows:

    1. one block only: the block which contains the value (value block)
    2. two blocks: the value block and one block which contains a label for the value (label block)

Where possible, we made the same choice between one and two blocks for the same element across the documents of the same class. The value block text may contain text other than the value itself.

The actual groundtruth csv file contains one line for each value: the line always contains the fields of both blocks, even when the label block is absent. There can be more than one line for the same element (for example, when the document shows the patent number in two places).

A groundtruth csv line contains:

    1. element type
    2. page of the label block (-1 if absent)
    3. x of the label block
    4. y of the label block
    5. w of the label block
    6. h of the label block
    7. text of the label block
    8. page of the value block (never absent!)
    9. x of the value block
    10. y of the value block
    11. w of the value block
    12. h of the value block
    13. text of the value block

An example of two lines of a groundtruth csv file follows (the actual second line is split):

Case,-1,0.0,0.0,0.0,0.0,,0,1.28,2.78,0.79,0.10,MELF CASE
StorageTemperature,0,0.35,3.40,2.03,0.11,Operating and Storage     Temperature,0,4.13,3.41,0.63,0.09,-65 to +200
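
A groundtruth line can be split into its label and value blocks as sketched below. This is a minimal illustration, assuming exactly 13 fields per line and treating a label page of -1 as an absent label block; the names Box, GroundtruthEntry and read_groundtruth are hypothetical.

```python
import csv
from typing import Iterable, Iterator, NamedTuple, Optional


class Box(NamedTuple):
    page: int
    x: float
    y: float
    width: float
    height: float
    text: str


class GroundtruthEntry(NamedTuple):
    element: str
    label: Optional[Box]   # None when the label block is absent (page -1)
    value: Box


def _box(fields) -> Box:
    page, x, y, w, h, text = fields
    return Box(int(page), float(x), float(y), float(w), float(h), text)


def read_groundtruth(lines: Iterable[str]) -> Iterator[GroundtruthEntry]:
    """Yield one entry per non-empty csv line of a groundtruth file."""
    for row in csv.reader(lines):
        if not row:
            continue
        if len(row) != 13:
            raise ValueError(f"expected 13 fields, found {len(row)}")
        label = _box(row[1:7])
        value = _box(row[7:13])
        yield GroundtruthEntry(row[0], None if label.page == -1 else label, value)
```

With the first sample line above, the entry has element "Case", no label block, and a value block whose text is "MELF CASE".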

Dataset name

The dataset name is inspired by Carl Ritter von Ghega, who designed a railway from our city, Trieste, to our past capital, Vienna.