We have slightly improved our automatic generator of regular expressions: a few internal optimizations, a few more computing resources, and better load balancing. Below we describe, very informally and very briefly, how the system works. Then we list some of the open questions.
The webapp generates regular expressions automatically, only by means of text extraction examples. An example is a string coupled with the substring to be extracted:
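To make the notion of an example concrete, here is a minimal Python sketch. It is not taken from our backend: the `Example` class, the `extracts_target` helper, the sample string, and the rule "the first match must equal the desired substring" are all assumptions made for illustration only.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    """Hypothetical representation of one text extraction example."""
    text: str    # the full input string
    target: str  # the substring that should be extracted from it


def extracts_target(pattern: str, example: Example) -> bool:
    """Return True if the first match of `pattern` in the text is exactly the target.

    Assumption: only the first match is compared against the target.
    """
    match = re.search(pattern, example.text)
    return match is not None and match.group(0) == example.target


# Made-up example: extract the date from a short sentence.
example = Example(text="Order shipped on 2014-03-07 to Trieste", target="2014-03-07")

print(extracts_target(r"\d{4}-\d{2}-\d{2}", example))  # the date pattern extracts the target
print(extracts_target(r"\d+", example))                # matches only "2014", so it does not
```

A regular expression that behaves as desired on every example provided by the user is what the system is trying to find.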
It is a research prototype developed by our lab and we believe it is the only existing tool where users provide only examples of the desired behavior. Being a prototype, though, it is far from being "perfect" and we will greatly appreciate any comments or criticism.
The system internally runs an evolutionary search based on genetic programming (GP). With a fair amount of oversimplification, it works as follows.
It partitions the examples into two subsets, a training set and a validation set. Then it executes a GP search:
The system repeats the GP search described above 32 times, thereby obtaining 32 regular expressions, one for each GP search. At this point the system ranks these 32 regular expressions according to their fitness on the validation set. The one with the best fitness is taken as the final result and presented to the user.
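The outer procedure (partition the examples, repeat the search 32 times, pick the winner on the validation set) can be sketched in Python as follows. This is only a sketch under stated assumptions and not our Java implementation: `placeholder_search` is a stand-in that samples from a small fixed pool instead of running GP, the fitness is simply the fraction of examples extracted correctly, and the half-and-half split is arbitrary.

```python
import random
import re
from typing import List, Tuple

Example = Tuple[str, str]  # (text, substring to extract)

# Hypothetical candidate pool used by the stand-in search below.
CANDIDATE_POOL = [r"\d+", r"\w+", r"\d{4}-\d{2}-\d{2}", r"[a-z]+", r"\d{2}"]


def fitness(pattern: str, examples: List[Example]) -> float:
    """Assumed fitness: fraction of examples whose first match equals the target."""
    if not examples:
        return 0.0
    hits = 0
    for text, target in examples:
        match = re.search(pattern, text)
        if match is not None and match.group(0) == target:
            hits += 1
    return hits / len(examples)


def placeholder_search(training: List[Example], rng: random.Random) -> str:
    """Stand-in for one GP search: best of three randomly sampled candidates
    on the training set. A real GP search would evolve the expressions."""
    sampled = rng.sample(CANDIDATE_POOL, k=3)
    return max(sampled, key=lambda p: fitness(p, training))


def generate_regex(examples: List[Example], runs: int = 32, seed: int = 0) -> str:
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2  # assumed 50/50 split
    training, validation = shuffled[:half], shuffled[half:]
    # One regular expression per independent search.
    candidates = [placeholder_search(training, rng) for _ in range(runs)]
    # Rank by fitness on the validation set and keep the best one.
    return max(candidates, key=lambda p: fitness(p, validation))


examples = [
    ("Shipped on 2014-03-07 to Trieste", "2014-03-07"),
    ("Invoice dated 2013-11-30, paid", "2013-11-30"),
    ("Meeting moved to 2015-01-12", "2015-01-12"),
    ("Released 2012-06-01 as beta", "2012-06-01"),
]
best = generate_regex(examples)
print(best, fitness(best, examples))
```

Selecting the final expression on examples that the searches never saw is what guards against picking one that merely memorizes the training set.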
That's it. Please keep in mind that this description is grossly oversimplified, though. In order to obtain useful results, we had to carefully design and analyze a number of issues, including the fitness definition, the handling of multiple objectives, and the choice of the operators that can be used for constructing a regular expression.
The backend is implemented in Java and makes use of a GP API that we developed. We are not planning to make this code publicly available; one of the reasons is that it is not documented and we are not able to provide any serious support.
There are a number of open research questions:
Please be patient: regex generation is computationally expensive; we are a very small research group with very few resources, so the app might occasionally be overloaded.