Data and tools

Tools

Automatic generator of regular expressions from examples:

A web application that automatically generates regular expressions (regex) from examples, where each example is a pair of strings. The generation itself is performed using Genetic Programming.

This tool is a demo of our work, which was awarded the Silver Medal at the 13th HUMIES 2016 (Awards for Human-Competitive Results produced by Genetic and Evolutionary Computation).

Data

Annotated strings for learning text extractors:

We provide here a set of datasets of annotated strings that we used to experimentally evaluate a method for the automatic inference of text extractors using Genetic Programming (GP).

Ghega-dataset: a dataset for document understanding and classification:

A labeled dataset of several digitized paper documents, processed by OCR. We used this dataset (or part of it) to assess the performance of several systems that we built for document understanding and classification.

Paper citations for important Computer Science venues:

The charts presented here are obtained using citation data for the papers published between 2000 and 2009 (inclusive) in 8 important Computer Science venues. The data show that, every year, a significant percentage of papers that should be considered "high quality" under any metric or human judgement either are never cited at all or receive only a handful of citations.

Hidden fraudulent URLs dataset:

We provide here a dataset that can be used to evaluate the performance of a classifier for detecting hidden fraudulent URLs. The dataset contains 185180 labeled URLs and some related features.

XML data for automatic schema generation:

We provide here a dataset of XML files that we used to experimentally evaluate a method for automatic schema generation using Genetic Programming (GP).

Automatic search-and-replace:

We provide here a set of datasets that we used to experimentally evaluate a method for the automatic inference of search-and-replace expressions.