Data and tools

Tools

Automatic generator of regular expressions from examples

A web application that automatically generates regular expressions (regex) from examples, where each example is a pair of strings. The generation itself is performed using Genetic Programming.
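
As a rough illustration of how example pairs can drive such a search, the sketch below scores candidate regexes against a few examples. It assumes that each pair consists of an input text and the substring that should be extracted from it; that interpretation, the function name `score_regex`, and the sample data are all hypothetical and are not taken from the application. A Genetic Programming search would evolve its candidates rather than pick from a fixed list, so this shows only the scoring step.

```python
import re

def score_regex(pattern, examples):
    """Return the fraction of examples whose first match equals the wanted substring.

    `examples` is a list of (text, wanted) pairs. This pair layout is an
    assumption made for illustration, not the application's documented format.
    """
    compiled = re.compile(pattern)
    hits = 0
    for text, wanted in examples:
        match = compiled.search(text)
        if match is not None and match.group(0) == wanted:
            hits += 1
    return hits / len(examples) if examples else 0.0

# Hypothetical examples: (input text, substring to extract).
examples = [
    ("Call 555-1234 now", "555-1234"),
    ("Fax: 555-9876", "555-9876"),
    ("Order 42 shipped", "42"),
]

# A fixed list of candidates stands in for the population a GP search would evolve.
candidates = [r"\d+", r"\d{3}-\d{4}", r"\d+(?:-\d+)?"]
best = max(candidates, key=lambda p: score_regex(p, examples))
print(best, score_regex(best, examples))
```

A score like this could serve as one ingredient of a fitness function; a real system would probably also account for partial matches and regex length.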

Data

Annotated strings for learning text extractors

We provide here a set of datasets of annotated strings that we used to experimentally evaluate a method for automatic inference of text extractors using Genetic Programming (GP).

Ghega-dataset: a dataset for document understanding and classification

A labeled dataset of several digitized paper documents, processed by OCR.
We used this dataset (or parts of it) to assess the performance of several document understanding and classification systems that we built.

Paper citations for important Computer Science venues

The charts presented here are obtained from citation data for the papers published between 2000 and 2009 (inclusive) in 8 important Computer Science venues. The data show that, every year, a significant percentage of papers that should be considered "high quality" under any metric or human judgement either are never cited at all or receive only a handful of citations.

Hidden fraudulent URLs dataset

We provide here a dataset for evaluating the performance of classifiers that detect hidden fraudulent URLs. The dataset contains 185180 labeled URLs and some related features.

XML data for automatic schema generation

We provide here a dataset of XML files that we used to experimentally evaluate a method for automatic schema generation using Genetic Programming (GP).

Supplementary material