
Learning of Syntax Patterns: GECCO!

posted Mar 21, 2015, 7:48 AM by Alberto Bartoli   [ updated Mar 23, 2015, 2:53 AM by Eric Medvet ]
We have just received the acceptance notification for a paper submitted to the most prestigious conference in evolutionary computation: the ACM Genetic and Evolutionary Computation Conference (GECCO). We are proud to be there for the fourth year in a row (after 2012, 2013, and 2014).

Our paper is titled "Evolutionary Learning of Syntax Patterns for Genic Interaction Extraction". We describe a method for generating a classifier of sentences of "biomedical interest", i.e., of sentences that describe gene/protein interactions. We generate the classifier automatically, based solely on a dictionary of relevant terms and on examples of interesting sentences. The resulting classifier extracts from any scientific paper only those sentences that contain protein/gene interactions.

There are two key points in our work. First, it is one of the very few successful applications of evolutionary computing to real-world problems of natural language processing. Our experimental evaluation shows that our performance is much better than that of general classification techniques and comparable to that of techniques carefully tailored to this specific classification problem. Our method, in contrast, does not exploit any problem-specific knowledge, so it is, at least in principle, suitable for any other kind of classification problem. Second, our classifier detects the occurrence of common syntax patterns, i.e., it automatically learns a syntactic model of the relevant sentences.

Needless to say, this work builds on our strong experience in the automatic generation of regular expressions from examples: we represent the text in terms of part-of-speech annotations, then we learn syntax patterns as regular expressions over those annotations. Full details are in the paper.
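To give a rough idea of what a regular expression over part-of-speech annotations can look like, here is a minimal sketch. Everything in it is an assumption made for illustration: the sentences are made up, the tags are assigned by hand in Penn Treebank style, the GENE annotation for dictionary terms is hypothetical, and the pattern is hand-written rather than one learned by our method.

```python
import re

# Made-up sentences with hand-assigned annotations (all assumptions):
# Penn Treebank-style tags, plus a hypothetical GENE annotation for
# tokens that appear in the dictionary of genes and proteins.
POSITIVE = [("GeneA", "GENE"), ("activates", "VBZ"), ("the", "DT"),
            ("expression", "NN"), ("of", "IN"), ("GeneB", "GENE")]
NEGATIVE = [("The", "DT"), ("sample", "NN"), ("was", "VBD"), ("purified", "VBN")]

def annotation_string(tokens):
    """Keep only the annotations, space-delimited, e.g. ' GENE VBZ DT '."""
    return " " + " ".join(tag for _, tag in tokens) + " "

# Hand-written pattern, not one produced by the method: a gene, a verb,
# at most four non-gene annotations, then a second gene.
PATTERN = re.compile(r" GENE VB[A-Z]? (?:(?!GENE )[A-Z$]+ ){0,4}GENE ")

def is_interesting(tokens):
    """True if the annotation sequence of the sentence matches the pattern."""
    return PATTERN.search(annotation_string(tokens)) is not None

print(is_interesting(POSITIVE), is_interesting(NEGATIVE))  # True False
```

Note that the pattern never looks at the words themselves, only at their annotations, which is what makes it a syntax pattern.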


There is an increasing interest in the development of techniques for automatic relation extraction from unstructured text. The biomedical domain, in particular, is a sector that may greatly benefit from those techniques due to the huge and ever increasing amount of scientific publications describing observed phenomena of potential clinical interest.
In this paper, we consider the problem of automatically identifying sentences that contain interactions between genes and proteins, based solely on a dictionary of genes and proteins and a small set of sample sentences in natural language. We propose an evolutionary technique for learning a classifier that is capable of detecting the desired sentences within scientific publications with high accuracy. The key feature of our proposal, which is internally based on Genetic Programming, is the construction of a model of the relevant syntax patterns in terms of standard part-of-speech annotations. The model consists of a set of regular expressions that are learned automatically despite the large alphabet size involved.
We assess our approach on two realistic datasets and obtain 77% accuracy, a value sufficiently high to be of practical interest and that is in line with significant baseline methods.
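As a sketch of how a model made of a set of regular expressions over an alphabet of annotations could be applied, the snippet below maps every annotation to a single symbol and flags a sentence when any pattern matches. This is only one possible encoding, not necessarily the one used in the paper: the tag inventory, the GENE annotation, the `<TAG>` template syntax, and both patterns are hypothetical and hand-written.

```python
import re

# Hypothetical tag inventory; a real part-of-speech tagset has dozens of symbols.
TAGS = ["GENE", "NN", "NNS", "VBZ", "VBD", "VBN", "IN", "DT", "CC", "JJ"]
# One private-use Unicode character per annotation, so each token is one symbol.
SYMBOL = {tag: chr(0xE000 + i) for i, tag in enumerate(TAGS)}

def encode(tags):
    """Turn a sequence of annotations into a string with one symbol per token."""
    return "".join(SYMBOL[t] for t in tags)

def compile_pattern(template):
    """Replace each <TAG> placeholder in the template with that tag's symbol."""
    return re.compile(re.sub(r"<(\w+)>", lambda m: SYMBOL[m.group(1)], template))

# Hand-written stand-ins for learned patterns (assumptions, not actual results).
MODEL = [
    compile_pattern(r"<GENE>[<VBZ><VBD>].{0,4}<GENE>"),  # gene, verb, ..., gene
    compile_pattern(r"<NN><IN><GENE><IN><GENE>"),        # noun of gene by gene
]

def classify(tags):
    """A sentence is flagged when at least one pattern of the model matches."""
    encoded = encode(tags)
    return any(pattern.search(encoded) for pattern in MODEL)

print(classify(["GENE", "VBZ", "DT", "NN", "IN", "GENE"]))  # True
print(classify(["DT", "NN", "VBD", "VBN"]))                 # False
```

With one symbol per annotation, the regular expression engine effectively works on an alphabet as large as the tagset, which hints at why the alphabet size matters when patterns have to be learned automatically.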