Journal Paper on Active Learning of Regexes from Examples

posted Mar 4, 2017, 4:04 AM by Alberto Bartoli   [ updated Mar 6, 2017, 1:08 AM by Eric Medvet ]

Yet another great result for our multi-year activity on the automatic construction of regular expressions from examples of the desired behavior: our work Active Learning of Regular Expressions for Entity Extraction has been accepted for publication on IEEE Transactions on Cybernetics.

The key point of this work is that it is based on the "active learning approach aimed at minimizing the user annotation effort: the user annotates only one desired extraction and then merely answers extraction queries generated by the system".

Full abstract below.

We consider the automatic synthesis of an entity extractor, in the form of a regular expression, from examples of the desired extractions in an unstructured text stream.  This is a long-standing problem for which many different approaches have been proposed, which all require the preliminary construction of a large dataset fully annotated by the user. In this work we propose an active learning approach aimed at minimizing the user annotation effort: the user annotates only one desired extraction and then merely answers extraction queries generated by the system.

During the learning process, the system digs into the input text for selecting the most appropriate extraction query to be submitted to the user in order to improve the current extractor. We construct candidate solutions with Genetic Programming and select queries with a form of querying-by-committee, i.e., based on a measure of disagreement within the best candidate solutions. All the components of our system are carefully tailored to the peculiarities of active learning with Genetic Programming and of entity extraction from unstructured text. We evaluate our proposal in depth, on a number of challenging datasets and based on a realistic estimate of the user effort involved in answering each single query. The results demonstrate high accuracy with significant savings in terms of computational effort, annotated characters and execution time over a state-of-the-art baseline.

Open PostDoc position

posted Mar 1, 2017, 1:08 AM by Eric Medvet   [ updated Mar 1, 2017, 1:09 AM ]

The lab has 1 one-year Postdoctoral fellowship available. The fellow will work on the "Design and development of a Machine Learning-based software system for detecting and predicting failures and wear in internal combustion engines", as part of a research project based on an industry collaboration whose goal is to design and develop a software system for the predictive maintenance of internal combustion engines (Predictive Security Maintenance Software - PSMS) based on Machine Learning Techniques which are used to detect or predict failures and wear by analyzing relevant signals collected on the engine itself or with proper monitoring tools.
The application deadline is March, 27th.
The salary is 21.3k€ pre-taxes.
More info, including the application forms and instructions, is available here.

Silver medal at the 2016 "Human competitive" awards!!!

posted Jul 28, 2016, 5:57 AM by Alberto Bartoli   [ updated Aug 3, 2016, 2:37 AM by Eric Medvet ]

Our work on automatic generation of regular expressions from examples (see thisthisthis and this) has won the Silver Medal at the 2016 "Human competitive" awards (Humies). 

We are really very proud of this prestigious result.

Humies is an international competition, now at its 13-th edition, that establishes the state of the art in genetic and evolutionary computation. It is open to "human-competitive results that were produced by any form of genetic and evolutionary computation" and establishes precisely what "human-competitive" means, for example, "The result is equal to or better than the most recent human-created solution to a long-standing problem for which there has been a succession of increasingly better human-created solutions".

We were selected as finalists also in the 11-th edition, but that time our submission was not as strong as this year.  The presentation made by Andrea De Lorenzo can be found here.

Genetic programming is one of our main research activities at the Machine Learning Lab. We are grateful to the many people that have collaborated with us along the years. We are most grateful to Cyril Fillon, which introduced us to the mysteries of genetic programming back in 2005 (and which managed to survive in Italy despite the 2006 Football World Cup final). We are also grateful to Giorgio Davanzo, Marco Mauri and Enrico Sorio, which participated in the early phases of our multi-year research on regex construction (Giorgio has also the merit of suggesting this research topic).

Thanks to all.

Alberto, also on behalf of Eric, Andrea, Fabiano

We are finalist (again) at the "Human-Competitive" Awards!

posted Jun 24, 2016, 1:56 AM by Alberto Bartoli   [ updated Jun 24, 2016, 2:07 AM by Eric Medvet ]

Our work on automatic generation of regular expressions from examples (see this, this, this and this) has been selected for inclusion among the 8 finalists at the annual "HUMIES Awards for Human-Competitive Results produced by Genetic and Evolutionary Computation"!

The rules for this prestigious competition are clearly stated on their website (which includes the full list of 22 participants to this year competition). Participants submit "human-competitive results that were produced by any form of genetic and evolutionary computation and that were published during previous year... The criterion for human-competitiveness is that an automatically created result is considered human-competitive if it satisfies at least one of the following eight criteria". We believe our work satisfies five criteria:
  • B "The result is equal to or better than a result that was accepted as a new scientific result at the time when it was published in a peer-reviewed scientific journal."
  • D "The result is publishable in its own right as a new scientific result independent of the fact that the result was mechanically created."
  • E "The result is equal to or better than the most recent human-created solution to a long-standing problem for which there has been a succession of increasingly better human-created solutions."
  • H “The result holds its own or wins a regulated competition involving human contestants (in the form of either live human players or human-written computer programs).
  • G "The result solves a problem of indisputable difficulty in its field."

The final will be on Friday, July 22-nd in Denver, Colorado. Andrea De Lorenzo will be there. 

This will be our second HUMIES final (see 2014 edition). As always, the level is very high and the other participants are very strong... being in the final is a great result already... but we believe we have a very strong submission... let's cross the fingers...!

Our recent papers...

posted Jun 23, 2016, 12:03 AM by Alberto Bartoli   [ updated Jun 24, 2016, 2:07 AM by Eric Medvet ]

Very good news for the lab: three papers just accepted at prestigious conferences!

These papers focus on very different application domains and propose approaches based on very different machine learning paradigms.

Syntactical Similarity Learning by means of Grammatical Evolution
  • Several research efforts have shown that a similarity function synthesized from examples may capture an application-specific similarity criterion in a way that fits the application needs more effectively than a generic distance definition.
  • We propose a similarity learning algorithm tailored to problems of syntax-based entity extraction from unstructured text streams.
  • The algorithm takes in input pairs of strings along with a binary indication of whether they adhere or not adhere to the same syntactic pattern.
  • Our approach is based on Grammatical Evolution and learns a function expressed with a specialized, simple language that we have defined for this purpose.
  • (abstract)
A Language and an Inference Engine for Twitter Filtering Rules
  • We consider the problem of the filtering of Twitter posts, that is, the hiding of those posts which the user prefers not to visualize on his/her timeline.
  • We define a language for specifying filtering policies suitable for Twitter posts. The language allows each user to decide which posts to filter out based on his/her sensibility and preferences.
  • We also propose a method for inferring a policy automatically, based solely on examples of the desired filtering behavior.
  • The method is based on an evolutionary approach driven by a multi-objective optimization scheme.
  • (abstract)
"Best Dinner Ever!!!": Automatic Generation of Restaurant Reviews with LSTM-RNN
  • Consumer reviews are an important information resource for people and a fundamental part of everyday decision-making.
  • We investigate the possibility of generating hundreds of false restaurant reviews automatically and very quickly.
  • We propose and evaluate a method for automatic generation of restaurant reviews tailored to the desired rating and restaurant category.
  • Our method is based on a "long-short term memory" recurrent neural network.
  • As it turns out, it is feasible to automatically generate realistic reviews which can manipulate the opinion of the user.
  • (abstract)

...and then IEEE Intelligent Systems !

posted Mar 23, 2016, 5:10 AM by Alberto Bartoli

Our last post was about acceptance on IEEE TKDE of our work on automatic regex construction and ended as follows:

We are crossing our fingers as we hope to make another announcement soon...

And now, yes, we have just received another acceptance notification from a very prestigious journal: IEEE Intelligent Systems. So prestigious that their automated response to our submission was, in a sense, discouraging us of even trying: "please bear in mind that our current acceptance rate for unsolicited manuscripts is under 10%, and that we have a backlog that makes it impossible for us to accept everything we might want to, even if it is of high quality.".

This work provides a very convincing demonstration of the power and novelty of our regex-construction tool:

Building a regular expression involves a considerable amount of skill, expertise and creativity. In this work we investigate whether a machine may surrogate these qualities and construct automatically regular expressions for tasks of realistic complexity.

We discuss a large scale experiment involving more than 1700 users on 10 challenging tasks. We compared the solutions constructed by these users to those constructed by a tool based on Genetic Programming that we have recently developed and made publicly available. 

The quality of automatically-constructed solutions turned out to be similar to the quality of those constructed by the most skilled user group; and, the time for automatic construction was similar to the time required by human users.

Key findings:
  • 1,764 users: 44% novice, 38% intermediate, and 18% experienced.
  • 12 tasks.
  • On average, our tool delivered solutions with accuracy (F-measure) almost always greater than or equal to the one obtained by each category of  users.
  • On average, the time required by our tool was almost always smaller than the time required by human operators.
  • We believe these results are remarkable and highly significant. Indeed, we are not aware of any similar tool exhibiting such human-competitive performance indexes. 
  • We believe that this result is a further demonstration of the practical capabilities of Genetic Programming techniques even on commodity hardware.

IEEE Transactions (TKDE) paper!

posted Mar 21, 2016, 4:17 AM by Alberto Bartoli   [ updated Mar 21, 2016, 4:52 AM by Eric Medvet ]

Our "big paper" on automatic generation of regular expressions from examples will appear soon on IEEE Transactions on Knowledge and Data Engineering! 

We are really very proud of this result. According to the "State of the Journal Editorial" published by the Editor in Chief in January 2016, "TKDE remains a very competitive venue for publishing the best research results. Among the 552 articles submitted in the first 10 month of 2015, 17 were invited for minor revision (3%) and an additional 117 (21%) were invited for major revision". Needless to say, the remaining 418 submissions were rejected.

Our paper was one of those 17 which were asked only a minor revision.

This paper is the result of a multi-year effort (a very short summary written in December 2014 can be found here; a few months later we published this paper; but we kept working even after then...).

We are crossing our fingers as we hope to make another announcement soon...

Inference of Regular Expressions for Text Extraction from Examples

A large class of entity extraction tasks from text that is either semistructured or fully unstructured may be addressed by regular expressions, because in many practical cases the relevant entities follow an underlying syntactical pattern and this pattern may be described by a regular expression. In this work we consider the long-standing problem of synthesizing such expressions automatically, based solely on examples of the desired behavior.

We present the design and implementation of a system capable of addressing extraction tasks of realistic complexity. Our system is based on an evolutionary procedure carefully tailored to the specific needs of regular expression generation by examples. The procedure executes a search driven by a multiobjective optimization strategy aimed at simultaneously improving multiple performance indexes of candidate solutions while at the same time ensuring an adequate exploration of the huge solution space. We assess our proposal experimentally in great depth, on a number of challenging datasets. The accuracy of the obtained solutions seems to be adequate for practical usage and improves over earlier proposals significantly. Most importantly, our results are highly competitive even with respect to human operators. A prototype is available as a web application at

Active Learning of Regular Expressions

posted Dec 1, 2015, 8:13 AM by Alberto Bartoli   [ updated Dec 10, 2015, 4:21 AM by Eric Medvet ]

We have been working for several months on a significant improvement of our regex generator tool (source code): the ability to work in an active learning fashion. Rather than requiring the user to identify a set of examples describing the desired extraction task, it is the system which submits extraction queries to the user.

A paper based on this work has been just accepted at the 31-th ACM Symposium on Applied Computing, Evolutionary track, a good conference with only one-in-four acceptance rate:

Active Learning Approaches for Learning Regular Expressions with Genetic Programming
We consider the long-standing problem of the automatic generation of regular expressions for text extraction, based solely on examples of the desired behavior.
We investigate several active learning approaches in which the user annotates only one desired extraction and then merely answers extraction queries generated by the system.
The resulting framework is attractive because it is the system, not the user, which digs out the data in search of the samples most suitable to the specific learning task.
We tailor our proposals to a state-of-the-art learner based on Genetic Programming and we assess them experimentally on a number of challenging tasks of realistic complexity.
The results indicate that active learning is indeed a viable framework in this application domain and may thus significantly decrease the amount of costly annotation effort required.

This paper is actually a snapshot of our status several months ago. Since then, we have improved our active learning tool substantially and we have carried out a much broader and much deeper experimental evaluation, in particular, taking into account the time cost spent by the user for annotating examples. We have carefully modelled such cost, by distinguishing between: (i) annotation of examples that have to be found by the user; (ii) queries submitted by the system that require a mere yes/no click; (iii) queries submitted by the system which require a more elaborate answer, describing portions of text to be extracted and portions of text not to be extracted.

Currently we do not have any plan for making our active learning tool publicly available, though. We might reconsider this choice in the next months.

Recent publications...

posted Oct 20, 2015, 4:58 AM by Alberto Bartoli   [ updated Oct 20, 2015, 5:43 AM by Eric Medvet ]

...not tightly related to our regex research.

A short paper of ours has appeared today on an ACM Journal (a prestigious venue, as any other ACM venue): Data Quality Challenge: Toward a Tool for String Processing by Examples. The title should be quite self-explanatory: we attempt to illustrate the need for practical tools capable of inferring what the user intend to do by means of just a few examples and then act accordingly. Of course, not "intend to do" in general but restricted to a very specific domain, i.e., string processing. We place this topic in a broader perspective discussing the relevant literature and then summarize key requirements and challenges.

In September we participated in the PAN 2015 competition (13-th Evaluation Lab on Uncovering Plagiarism, Authorship and Social Software Misuse). In particular, we ranked 1st on the Authorship identification task for the Spanish language obtaining also very good results for the three other languages involved ( year the organizers chose to not generate a single global ranking). We participated also in the Author profiling task with, say, moderately good results, much better for Dutch than for other languages ( The papers describing our method are open access ( then search "bartoli" or something alike).

As we wrote in the PAN papers, "During the competition we discovered several opportunities for fraudulently boosting the accuracy of our method during the evaluation phase... We notified the organizers which promptly acknowledged the high relevance of our concerns and took measures to mitigate the corresponding vulnerabilities. The organizers acknowledged our contribution publicly. We submitted for evaluation an honestly developed method—the one described in this document—that did not exploit such unethical procedures in any way".

Source code for regex generator is now public

posted Apr 28, 2015, 8:18 AM by Alberto Bartoli

We have received several requests to make the source code for our regex generator by examples tool public.

We were reluctant to do so because we cannot offer any support, but eventually we decided to make it public:

Enjoy it!

The engine is a developement release that implements the algorithms published in our articles:
  • Bartoli, Davanzo, De Lorenzo, Medvet, Automatic Synthesis of Regular Expressions from Examples, IEEE Computer, 2014
  • Bartoli, De Lorenzo, Medvet, Tarlao, Learning Text Patterns using Separate-and-Conquer Genetic Programming, 18th European Conference on Genetic Programming (EuroGP), 2015, Copenhagen (Denmark)
More details about the project can be found on

We hope that you find this code instructive and useful for your research or study activity.

If you use our code in your reasearch please cite our work and please share back your enhancements, fixes and modifications.

1-10 of 39