We have been working for several months on a significant improvement of our regex generator tool (source code): the ability to work in an active learning fashion. Rather than requiring the user to identify a set of examples describing the desired extraction task, it is the system which submits extraction queries to the user.
A paper based on this work has been just accepted at the 31-th ACM Symposium on Applied Computing, Evolutionary track, a good conference with only one-in-four acceptance rate:
Active Learning Approaches for Learning Regular Expressions with Genetic Programming
We consider the long-standing problem of the automatic generation of regular expressions for text extraction, based solely on examples of the desired behavior.
We investigate several active learning approaches in which the user annotates only one desired extraction and then merely answers extraction queries generated by the system.
The resulting framework is attractive because it is the system, not the user, which digs out the data in search of the samples most suitable to the specific learning task.
We tailor our proposals to a state-of-the-art learner based on Genetic Programming and we assess them experimentally on a number of challenging tasks of realistic complexity.
The results indicate that active learning is indeed a viable framework in this application domain and may thus significantly decrease the amount of costly annotation effort required.
This paper is actually a snapshot of our status several months ago. Since then, we have improved our active learning tool substantially and we have carried out a much broader and much deeper experimental evaluation, in particular, taking into account the time cost spent by the user for annotating examples. We have carefully modelled such cost, by distinguishing between: (i) annotation of examples that have to be found by the user; (ii) queries submitted by the system that require a mere yes/no click; (iii) queries submitted by the system which require a more elaborate answer, describing portions of text to be extracted and portions of text not to be extracted.
Currently we do not have any plan for making our active learning tool publicly available, though. We might reconsider this choice in the next months.