Our work on automatic generation of regular expressions from examples (see this, this, this and this) has won the Silver Medal at the 2016 Human-Competitive Awards (Humies).
We are really very proud of this prestigious result.
Humies is an international competition, now in its 13th edition, that showcases the state of the art in genetic and evolutionary computation. It is open to "human-competitive results that were produced by any form of genetic and evolutionary computation" and defines precisely what "human-competitive" means, for example: "The result is equal to or better than the most recent human-created solution to a long-standing problem for which there has been a succession of increasingly better human-created solutions".
We were also finalists in the 11th edition, but our submission then was not as strong as this year's. The presentation given by Andrea De Lorenzo can be found here.
Genetic programming is one of our main research activities at the Machine Learning Lab. We are grateful to the many people who have collaborated with us over the years. We are most grateful to Cyril Fillon, who introduced us to the mysteries of genetic programming back in 2005 (and who managed to survive in Italy despite the 2006 Football World Cup final). We are also grateful to Giorgio Davanzo, Marco Mauri and Enrico Sorio, who participated in the early phases of our multi-year research on regex construction (Giorgio also deserves credit for suggesting this research topic).
Thanks to all.
Alberto, also on behalf of Eric, Andrea, Fabiano
Our work on automatic generation of regular expressions from examples (see this, this, this and this) has been selected for inclusion among the 8 finalists at the annual "HUMIES Awards for Human-Competitive Results produced by Genetic and Evolutionary Computation"!
The rules for this prestigious competition are clearly stated on their website (which includes the full list of 22 participants in this year's competition). Participants submit "human-competitive results that were produced by any form of genetic and evolutionary computation and that were published during previous year... The criterion for human-competitiveness is that an automatically created result is considered human-competitive if it satisfies at least one of the following eight criteria". We believe our work satisfies five of these criteria:
The final will be on Friday, July 22nd, in Denver, Colorado. Andrea De Lorenzo will be there.
This will be our second HUMIES final (see the 2014 edition). As always, the level is very high and the other participants are very strong: being in the final is already a great result. But we believe we have a very strong submission... fingers crossed!
Very good news for the lab: three papers just accepted at prestigious conferences!
These papers address quite different application domains and propose approaches based on different machine learning paradigms.
Syntactical Similarity Learning by means of Grammatical Evolution
(14th Parallel Problem Solving from Nature, PPSN 2016, a biennial conference)
A Language and an Inference Engine for Twitter Filtering Rules
"Best Dinner Ever!!!": Automatic Generation of Restaurant Reviews with LSTM-RNN
Our last post was about acceptance on IEEE TKDE of our work on automatic regex construction and ended as follows:
We are crossing our fingers as we hope to make another announcement soon...
And now, yes, we have just received another acceptance notification from a very prestigious journal: IEEE Intelligent Systems. So prestigious that their automated response to our submission was, in a sense, discouraging us from even trying: "please bear in mind that our current acceptance rate for unsolicited manuscripts is under 10%, and that we have a backlog that makes it impossible for us to accept everything we might want to, even if it is of high quality."
This work provides a very convincing demonstration of the power and novelty of our regex-construction tool:
Building a regular expression involves a considerable amount of skill, expertise and creativity. In this work we investigate whether a machine may surrogate these qualities and automatically construct regular expressions for tasks of realistic complexity.
We discuss a large-scale experiment involving more than 1,700 users on 10 challenging tasks. We compared the solutions constructed by these users to those constructed by a tool based on Genetic Programming that we recently developed and made publicly available.
The quality of automatically-constructed solutions turned out to be similar to the quality of those constructed by the most skilled user group; and, the time for automatic construction was similar to the time required by human users.
Our "big paper" on automatic generation of regular expressions from examples will appear soon on IEEE Transactions on Knowledge and Data Engineering!
We are really very proud of this result. According to the "State of the Journal Editorial" published by the Editor in Chief in January 2016, "TKDE remains a very competitive venue for publishing the best research results. Among the 552 articles submitted in the first 10 months of 2015, 17 were invited for minor revision (3%) and an additional 117 (21%) were invited for major revision". Needless to say, the remaining 418 submissions were rejected.
Our paper was one of the 17 that were asked for only a minor revision.
This paper is the result of a multi-year effort (a very short summary written in December 2014 can be found here; a few months later we published this paper; but we kept working even after that...).
We are crossing our fingers as we hope to make another announcement soon...
Inference of Regular Expressions for Text Extraction from Examples
A large class of entity extraction tasks from text that is either semistructured or fully unstructured may be addressed by regular expressions, because in many practical cases the relevant entities follow an underlying syntactical pattern and this pattern may be described by a regular expression. In this work we consider the long-standing problem of synthesizing such expressions automatically, based solely on examples of the desired behavior.
We present the design and implementation of a system capable of addressing extraction tasks of realistic complexity. Our system is based on an evolutionary procedure carefully tailored to the specific needs of regular expression generation by examples. The procedure executes a search driven by a multiobjective optimization strategy aimed at simultaneously improving multiple performance indexes of candidate solutions while at the same time ensuring an adequate exploration of the huge solution space. We assess our proposal experimentally in great depth, on a number of challenging datasets. The accuracy of the obtained solutions seems to be adequate for practical usage and improves over earlier proposals significantly. Most importantly, our results are highly competitive even with respect to human operators. A prototype is available as a web application at http://regex.inginf.units.it.
We have been working for several months on a significant improvement of our regex generator tool (source code): the ability to work in an active learning fashion. Rather than requiring the user to identify a set of examples describing the desired extraction task, it is the system that submits extraction queries to the user.
A paper based on this work has just been accepted at the 31st ACM Symposium on Applied Computing, Evolutionary track, a good conference with a one-in-four acceptance rate:
Active Learning Approaches for Learning Regular Expressions with Genetic Programming
We consider the long-standing problem of the automatic generation of regular expressions for text extraction, based solely on examples of the desired behavior.
We investigate several active learning approaches in which the user annotates only one desired extraction and then merely answers extraction queries generated by the system.
The resulting framework is attractive because it is the system, not the user, which digs out the data in search of the samples most suitable to the specific learning task.
We tailor our proposals to a state-of-the-art learner based on Genetic Programming and we assess them experimentally on a number of challenging tasks of realistic complexity.
The results indicate that active learning is indeed a viable framework in this application domain and may thus significantly decrease the amount of costly annotation effort required.
This paper is actually a snapshot of our status several months ago. Since then, we have improved our active learning tool substantially and have carried out a much broader and deeper experimental evaluation, in particular taking into account the time the user spends annotating examples. We have carefully modelled this cost by distinguishing between: (i) annotation of examples that the user has to find; (ii) queries submitted by the system that require a mere yes/no click; (iii) queries submitted by the system that require a more elaborate answer, describing portions of text to be extracted and portions of text not to be extracted.
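To make the three-way cost distinction concrete, here is a minimal sketch of how such a cost model can be expressed. The interaction names and the per-item time values below are illustrative assumptions for this post, not the figures used in our actual evaluation.

```python
# Hypothetical annotation-cost model: three interaction types with
# different time costs, as described above. All values are invented
# for illustration only.
COST = {
    "manual_annotation": 20.0,  # (i) user finds and annotates an example
    "yes_no_query": 2.0,        # (ii) user answers a system query with yes/no
    "edit_query": 10.0,         # (iii) user marks text to extract / not extract
}

def total_annotation_cost(interactions):
    """Sum the time cost of a sequence of (kind, count) interactions."""
    return sum(COST[kind] * count for kind, count in interactions)

# One manually found example, 15 yes/no clicks, 3 elaborate answers.
session = [("manual_annotation", 1), ("yes_no_query", 15), ("edit_query", 3)]
print(total_annotation_cost(session))  # → 80.0
```

The point of the model is that many cheap yes/no queries may cost the user less overall than a few manually found examples.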
Currently we do not have any plan for making our active learning tool publicly available, though. We might reconsider this choice in the next months.
...not tightly related to our regex research.
A short paper of ours has appeared today in an ACM journal (a prestigious venue, as any ACM venue is): Data Quality Challenge: Toward a Tool for String Processing by Examples. The title should be quite self-explanatory: we attempt to illustrate the need for practical tools capable of inferring what the user intends to do from just a few examples and then acting accordingly. Of course, not "intends to do" in general, but restricted to a very specific domain, i.e., string processing. We place this topic in a broader perspective by discussing the relevant literature and then summarize key requirements and challenges.
In September we participated in the PAN 2015 competition (13th Evaluation Lab on Uncovering Plagiarism, Authorship and Social Software Misuse). In particular, we ranked 1st in the Authorship identification task for the Spanish language and also obtained very good results for the three other languages involved (http://www.tira.io/task/authorship-verification/; this year the organizers chose not to generate a single global ranking). We also participated in the Author profiling task with, say, moderately good results, much better for Dutch than for the other languages (http://www.tira.io/task/author-profiling/). The papers describing our method are open access (http://ceur-ws.org/Vol-1391/ then search for "bartoli" or similar).
As we wrote in the PAN papers, "During the competition we discovered several opportunities for fraudulently boosting the accuracy of our method during the evaluation phase... We notified the organizers which promptly acknowledged the high relevance of our concerns and took measures to mitigate the corresponding vulnerabilities. The organizers acknowledged our contribution publicly. We submitted for evaluation an honestly developed method—the one described in this document—that did not exploit such unethical procedures in any way".
We have received several requests to make the source code for our regex generator by examples tool public.
We were reluctant to do so because we cannot offer any support, but eventually we decided to make it public: https://github.com/MaLeLabTs/RegexGenerator
The engine is a development release that implements the algorithms published in our articles:
More details about the project can be found on http://machinelearning.inginf.units.it/news/newregexgeneratortoolonline.
We hope that you find this code instructive and useful for your research or study.
If you use our code in your research, please cite our work, and please share back your enhancements, fixes and modifications.
We have just received the acceptance notification of a work submitted to the most prestigious conference in evolutionary computation: the ACM Genetic and Evolutionary Computation Conference (GECCO). We are proud to be there for the fourth year in a row (2012, 2013, 2014).
Our paper is titled Evolutionary Learning of Syntax Patterns for Genic Interaction Extraction. We describe a method for generating a classifier of sentences of "biomedical interest", i.e., of sentences that describe gene/protein interactions. We generate the classifier automatically based solely on a dictionary of relevant terms and on examples of interesting sentences. The resulting classifier will extract from any scientific paper only those sentences that contain protein/gene interactions.
There are two key points in our work. First, it is one of the very few successful applications of evolutionary computation to real-world natural language processing problems. Our experimental evaluation shows performance much better than that of general classification techniques and comparable to that of techniques carefully tailored to this specific classification problem, whereas our method does not exploit any problem-specific knowledge and is thus, at least in principle, suitable for any other kind of classification problem. Second, our classifier detects the occurrence of common syntax patterns, i.e., it automatically learns a syntactic model of the relevant sentences.
Needless to say, this work leverages our strong experience in the automatic generation of regular expressions from examples: we represent the text in terms of part-of-speech annotations, then learn syntax patterns in the form of regular expressions over those annotations... full details in the paper.
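The idea of matching patterns over annotations rather than over raw words can be sketched in a few lines. The tags, the dictionary-based DICT label and the pattern below are toy examples invented for this post, not patterns actually learned by our system.

```python
import re

def to_pos_string(tagged_sentence):
    """Map a tagged sentence to a string of its POS tags only."""
    return " ".join(tag for _word, tag in tagged_sentence)

# "ProteinX activates GeneY", where gene/protein names found in the
# dictionary receive an illustrative DICT tag.
tagged = [("ProteinX", "DICT"), ("activates", "VBZ"), ("GeneY", "DICT")]

# A toy syntax pattern over tags: dictionary term, verb, dictionary term.
pattern = re.compile(r"DICT VBZ DICT")
print(bool(pattern.search(to_pos_string(tagged))))  # → True
```

Matching over the tag alphabet instead of the word alphabet is what lets a single pattern generalize across sentences that share a syntactic shape but no vocabulary.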
There is an increasing interest in the development of techniques for automatic relation extraction from unstructured text. The biomedical domain, in particular, is a sector that may greatly benefit from those techniques due to the huge and ever increasing amount of scientific publications describing observed phenomena of potential clinical interest.
In this paper, we consider the problem of automatically identifying sentences that contain interactions between genes and proteins, based solely on a dictionary of genes and proteins and a small set of sample sentences in natural language. We propose an evolutionary technique for learning a classifier that is capable of detecting the desired sentences within scientific publications with high accuracy. The key feature of our proposal, that is internally based on Genetic Programming, is the construction of a model of the relevant syntax patterns in terms of standard part-of-speech annotations. The model consists of a set of regular expressions that are learned automatically despite the large alphabet size involved.
We assess our approach on two realistic datasets and obtain 77% accuracy, a value sufficiently high to be of practical interest and that is in line with significant baseline methods.
This new year has started very well. A paper describing one of the key improvements to our new regex generator tool has just been accepted at the prestigious 18th European Conference on Genetic Programming (EuroGP). Despite its European qualification, it is "the" reference forum for GP.
The paper is titled "Learning Text Patterns using Separate-and-Conquer Genetic Programming". Essentially, it provides the "ability of discovering automatically whether the text extraction task may be solved by a single pattern, or rather a set of multiple patterns is required"—including, of course, the generation of all such patterns.
We obtain this property by implementing a separate-and-conquer approach. Once a candidate pattern provides adequate performance on a subset of the examples, the pattern is inserted into the set of final solutions and the evolutionary search continues on a smaller set of examples including only those not yet solved adequately. As noted in the deep and insightful comments we received from the anonymous reviewers (thanks a lot!), this approach can be seen from a different point of view: we have actually implemented an ensemble of classifiers; that is, the GP search generates several slightly different classifiers, each tuned on a slightly different portion of the training set.
Of course, turning this idea into practice is difficult for a number of reasons, including the choice of suitable criteria for deciding when performance is "adequate" enough to trigger the shrinking of the training set, the termination of the current search, and the start of the next one.
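The overall loop can be sketched as follows. The "learner" here is a deliberately trivial stand-in that generalizes one example into a character-class pattern; in the real system this step is a full GP search, and the "adequate performance" test is far more refined than the all-or-nothing match used below.

```python
import re

def toy_learn_pattern(example):
    """Stand-in for the GP search: generalize one example's shape."""
    return "".join(
        r"\d" if c.isdigit() else "[A-Za-z]" if c.isalpha() else re.escape(c)
        for c in example
    )

def separate_and_conquer(examples):
    """Learn patterns one at a time, shrinking the example set each round."""
    patterns, remaining = [], list(examples)
    while remaining:
        pattern = toy_learn_pattern(remaining[0])
        solved = [e for e in remaining if re.fullmatch(pattern, e)]
        patterns.append(pattern)
        # Continue the search only on the examples not yet solved.
        remaining = [e for e in remaining if e not in solved]
    return patterns

# Four-digit years and a word need two distinct patterns; the loop finds both.
print(separate_and_conquer(["2015", "1999", "abc"]))
# → ['\\d\\d\\d\\d', '[A-Za-z][A-Za-z][A-Za-z]']
```

Even this toy version shows the key behavior: the number of patterns is not fixed in advance but emerges from how many rounds are needed to cover all the examples.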
The full abstract follows.
The problem of extracting knowledge from large volumes of unstructured textual information has become increasingly important. We consider the problem of extracting text slices that adhere to a syntactic pattern and propose an approach capable of generating the desired pattern automatically, from a few annotated examples. Our approach is based on Genetic Programming and generates extraction patterns in the form of regular expressions that may be input to existing engines without any post-processing. Key feature of our proposal is its ability of discovering automatically whether the extraction task may be solved by a single pattern, or rather a set of multiple patterns is required. We obtain this property by means of a separate-and-conquer strategy: once a candidate pattern provides adequate performance on a subset of the examples, the pattern is inserted into the set of final solutions and the evolutionary search continues on a smaller set of examples including only those not yet solved adequately. Our proposal outperforms an earlier state-of-the-art approach on three challenging datasets.