Our "big paper" on automatic generation of regular expressions from examples will appear soon on IEEE Transactions on Knowledge and Data Engineering!
We are really very proud of this result. According to the "State of the Journal Editorial" published by the Editor in Chief in January 2016, "TKDE remains a very competitive venue for publishing the best research results. Among the 552 articles submitted in the first 10 month of 2015, 17 were invited for minor revision (3%) and an additional 117 (21%) were invited for major revision". Needless to say, the remaining 418 submissions were rejected.
Our paper was one of those 17 which were asked only a minor revision.
This paper is the result of a multi-year effort (a very short summary written in December 2014 can be found here; a few months later we published this paper; but we kept working even after then...).
We are crossing our fingers as we hope to make another announcement soon...
Inference of Regular Expressions for Text Extraction from Examples
A large class of entity extraction tasks from text that is either semistructured or fully unstructured may be addressed by regular expressions, because in many practical cases the relevant entities follow an underlying syntactical pattern and this pattern may be described by a regular expression. In this work we consider the long-standing problem of synthesizing such expressions automatically, based solely on examples of the desired behavior.
We present the design and implementation of a system capable of addressing extraction tasks of realistic complexity. Our system is based on an evolutionary procedure carefully tailored to the specific needs of regular expression generation by examples. The procedure executes a search driven by a multiobjective optimization strategy aimed at simultaneously improving multiple performance indexes of candidate solutions while at the same time ensuring an adequate exploration of the huge solution space. We assess our proposal experimentally in great depth, on a number of challenging datasets. The accuracy of the obtained solutions seems to be adequate for practical usage and improves over earlier proposals significantly. Most importantly, our results are highly competitive even with respect to human operators. A prototype is available as a web application at http://regex.inginf.units.it.