News‎ > ‎

Learning multiple patterns for text extraction

posted Jan 8, 2015, 11:40 PM by Alberto Bartoli   [ updated Jan 9, 2015, 1:20 AM by Eric Medvet ]
This new year has started very well. A paper describing one of the key improvements to our new regex generator tool has been just accepted at the prestigious 18-th European Conference on Genetic Programming (EuroGP)—despite its European qualification, it is "the" reference forum for GP.

The paper is titled "Learning Text Patterns using Separate-and-Conquer Genetic Programming". Essentially, it provides the "ability of discovering automatically whether the text extraction task may be solved by a single pattern, or rather a set of multiple patterns is required"—including, of course, the generation of all such patterns.

We obtain this property by implementing a separate-and-conquer approach. Once a candidate pattern provides adequate performance on a subset of the examples, the pattern is inserted into the set of final solutions and the evolutionary search continues on a smaller set of examples including only those not yet solved adequately. As observed in the very deep and insightful observations that we received from the anonymous reviewers—thanks a lot!—this approach can be seen from a different point of view: we actually implemented an ensemble of classifiers, that is, the GP search generates several slightly different classifiers, each tuned on slightly different portions of the training set.

Of course, turning this idea into practice is difficult for a number of reasons, including the identification of suitable criteria for identifying the "adequate" level of performance that triggers the shrinking of the training set, the termination of the current search and the starting of the next one.

The full abstract follows.

The problem of extracting knowledge from large volumes of unstructured textual information has become increasingly important. We consider the problem of extracting text slices that adhere to a syntactic pattern and propose an approach capable of generating the desired pattern automatically, from a few annotated examplesOur approach is based on Genetic Programming and generates extraction patterns in the form of regular expressions that may be input to existing engines without any post-processing. Key feature of our proposal is its ability of discovering automatically whether the extraction task may be solved by a single pattern, or rather a set of multiple patterns is required. We obtain this property by means of a separate-and-conquer strategy: once a candidate pattern provides adequate performance on a subset of the examples, the pattern is inserted into the set of final solutions and the evolutionary search continues on a smaller set of examples including only those not yet solved adequately. Our proposal outperforms an earlier state-of-the-art approach on three challenging datasets.

BTW, the "state-of-the-art" approach was our previous tool as described in the December 2014 issue of IEEE Computer (IEEE Xplore link, pdf on our publications page).