Learning Text Patterns using Separate-and-Conquer Genetic Programming

posted Jan 9, 2015, 1:19 AM by Eric Medvet   [ updated Mar 23, 2015, 2:47 AM ]
  • 18th European Conference on Genetic Programming (EuroGP), 2015, Copenhagen (Denmark)
  • Alberto Bartoli, Andrea De Lorenzo, Eric Medvet, Fabiano Tarlao
The problem of extracting knowledge from large volumes of unstructured textual information has become increasingly important. We consider the problem of extracting text slices that adhere to a syntactic pattern and propose an approach capable of generating the desired pattern automatically from a few annotated examples. Our approach is based on Genetic Programming and generates extraction patterns in the form of regular expressions that can be fed to existing regular expression engines without any post-processing. A key feature of our proposal is its ability to discover automatically whether the extraction task can be solved by a single pattern or requires a set of multiple patterns. We obtain this property by means of a separate-and-conquer strategy: once a candidate pattern provides adequate performance on a subset of the examples, the pattern is inserted into the set of final solutions and the evolutionary search continues on a smaller set of examples that includes only those not yet solved adequately. Our proposal outperforms an earlier state-of-the-art approach on three challenging datasets.
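The separate-and-conquer loop described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the evolutionary search is replaced by a greedy pick from a fixed, hypothetical pool of candidate regular expressions, and "adequate performance" is assumed to mean that a pattern extracts nothing outside the annotated slices. The names `extractions`, `separate_and_conquer`, `TEXT`, `DESIRED` and `CANDIDATES` are invented for the illustration.

```python
import re


def extractions(pattern, text):
    """Return the set of (start, end) spans that `pattern` extracts from `text`."""
    return {m.span() for m in re.finditer(pattern, text) if m.end() > m.start()}


def separate_and_conquer(candidates, text, desired):
    """Build a set of patterns that together cover the annotated spans.

    Assumption: a greedy choice among `candidates` stands in for the
    Genetic Programming search, and a pattern is "adequate" only if every
    span it extracts is an annotated one (perfect precision).
    """
    remaining = set(desired)  # examples not yet solved
    solutions = []            # set of final patterns, in discovery order
    while remaining:
        best, covered = None, set()
        for pattern in candidates:
            found = extractions(pattern, text)
            if not found <= desired:
                continue  # extracts something unwanted: not adequate
            newly = found & remaining
            if len(newly) > len(covered):
                best, covered = pattern, newly
        if best is None:
            break  # no candidate solves any remaining example
        solutions.append(best)  # "separate": keep the adequate pattern
        remaining -= covered    # "conquer": continue on what is left
    return solutions, remaining


# Hypothetical toy task: extract phone numbers written in two formats.
TEXT = "Call 555-1234 or 555.9876 before 2015-01-09."
DESIRED = {
    (TEXT.index(s), TEXT.index(s) + len(s)) for s in ("555-1234", "555.9876")
}
CANDIDATES = [r"\d{3}-\d{4}", r"\d{3}\.\d{4}", r"\d+"]

patterns, unsolved = separate_and_conquer(CANDIDATES, TEXT, DESIRED)
print(patterns, unsolved)
```

In this toy task no single candidate covers both annotated slices without also extracting unwanted text, so the loop ends up with two patterns, which mirrors the case where a set of multiple patterns is required.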