...not tightly related to our regex research.
A short paper of ours appeared today in an ACM journal (a prestigious venue, like any other ACM venue): Data Quality Challenge: Toward a Tool for String Processing by Examples. The title should be quite self-explanatory: we attempt to illustrate the need for practical tools capable of inferring what the user intends to do from just a few examples and then acting accordingly. Of course, not "intends to do" in general, but restricted to a very specific domain, i.e., string processing. We place this topic in a broader perspective by discussing the relevant literature, and then summarize the key requirements and challenges.
In September we participated in the PAN 2015 competition (13th Evaluation Lab on Uncovering Plagiarism, Authorship and Social Software Misuse). In particular, we ranked 1st in the Authorship identification task for the Spanish language and also obtained very good results for the three other languages involved (http://www.tira.io/task/authorship-verification/; this year the organizers chose not to generate a single global ranking). We also participated in the Author profiling task with, say, moderately good results, much better for Dutch than for the other languages (http://www.tira.io/task/author-profiling/). The papers describing our method are open access (http://ceur-ws.org/Vol-1391/ then search "bartoli" or something similar).
As we wrote in the PAN papers, "During the competition we discovered several opportunities for fraudulently boosting the accuracy of our method during the evaluation phase... We notified the organizers which promptly acknowledged the high relevance of our concerns and took measures to mitigate the corresponding vulnerabilities. The organizers acknowledged our contribution publicly. We submitted for evaluation an honestly developed method—the one described in this document—that did not exploit such unethical procedures in any way".