
Annotated strings for learning text extractors

We provide here a set of datasets of annotated strings, which we used to experimentally evaluate a method for the automatic inference of text extractors using Genetic Programming (GP).
  1. Web-HTML/Heading The corpus is composed of all the lines of the HTML source of several pages taken from the following web sites: Wikipedia, W3C, Hacker News, StackOverflow, Repubblica.it and Libero.it. The task consists in extracting the HTML heading elements (tags and content).

  2. Web-HTML/Heading-Content* The corpus is composed of all the lines of the HTML source of several pages taken from the following web sites: Wikipedia, W3C, Hacker News, StackOverflow, Repubblica.it and Libero.it. The task consists in extracting the content of the HTML heading elements (without the enclosing tags).

  3. CongressBills/Date The corpus is composed of 600 bills of the United States Congress, obtained from the THOMAS online database. To vary the format of the dates present in the bills, we rewrote all the dates so that they appear in 9 different formats, including 3 formats in which the month is written by name rather than as a number. The task consists in extracting all the dates.

  4. BibTeX/Author* The corpus is composed of a collection of bibliographic references in the form of BibTeX elements which we obtained by querying Google Scholar with the following keywords: computer, rotavirus, divina commedia, fuel cells, neurosurgery, seismic, atmosphere, nuclear bomb, astrology, lovecraft. For each keyword we downloaded the citations of the first 20 search results. The task consists in extracting the full name of all the authors of a scientific publication.

  5. BibTeX/Title* The corpus is composed of a collection of bibliographic references in the form of BibTeX elements which we obtained by querying Google Scholar with the following keywords: computer, rotavirus, divina commedia, fuel cells, neurosurgery, seismic, atmosphere, nuclear bomb, astrology, lovecraft. For each keyword we downloaded the citations of the first 20 search results. The task consists in extracting the title of a scientific publication.

  6. Reference/First-Author* The corpus is composed of 198 bibliographic references formatted according to the Springer LNCS format. The task consists in extracting the names of all the authors of a scientific publication cited in a document.
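
To make the difference between tasks 1 and 2 concrete, the following is a minimal hand-written sketch of what an extractor for each task might look like. The regular expression, the function names and the sample line are all hypothetical illustrations: they are not taken from the datasets and are not the extractors inferred by the GP method. The sketch also assumes that a heading opens and closes on the same line.

```python
import re

# Hypothetical pattern (an assumption, not part of the dataset):
# matches an <h1>..<h6> element that opens and closes on the same line.
HEADING = re.compile(r"<(h[1-6])\b[^>]*>(.*?)</\1\s*>", re.IGNORECASE)


def extract_headings(line):
    """Web-HTML/Heading style: return the whole elements (tags and content)."""
    return [m.group(0) for m in HEADING.finditer(line)]


def extract_heading_contents(line):
    """Web-HTML/Heading-Content style: return only the text between the tags."""
    return [m.group(2) for m in HEADING.finditer(line)]


# Made-up sample line, for illustration only.
line = '<div><h2 class="title">Data and tools</h2></div>'
print(extract_headings(line))
print(extract_heading_contents(line))
```

The two functions share one pattern and differ only in which part of the match they return, which mirrors how the two tasks share one corpus and differ only in the annotated substring.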
Annotated Strings for Learning Text Extractors.zip
(7297k)
Andrea De Lorenzo,
Feb 12, 2015, 2:37 AM