Best paper at EuroGP 2018!

posted Apr 9, 2018, 1:01 AM by Alberto Bartoli   [ updated Apr 9, 2018, 1:59 AM by Eric Medvet ]

Our work "On the Automatic Design of a Representation for Grammar-based Genetic Programming" has been chosen as best paper at the 21st European Conference on Genetic Programming (part of Evostar 2018).

This work is somewhat theoretical, at least for our standards. The basic idea is this: machine learning methods, including evolutionary algorithms, manipulate candidate solutions; solutions must be represented in a form suitable for the algorithm; the representation of a candidate solution must be defined in advance by a human operator; different representations may lead to different results, even for the same problem and the same evolutionary algorithm; selecting a "good representation" and motivating this choice is very difficult, it is basically an unsolved problem. 

In this work we investigated the possibility of designing a good representation automatically. That is, not only the computer constructs a solution automatically, it also constructs automatically the method for representing  solutions. Full details can be found in the paper, available from our lab page; the slides of the presentation are available from the same page.

Evil Twins and WPA2 Enterprise: A Coming Security Disaster?

posted Jan 11, 2018, 7:53 AM by Alberto Bartoli   [ updated Jan 12, 2018, 1:01 AM ]

This post is an introduction for the research paper “Evil twins and WPA2 Enterprise: A coming security disaster?”, to appear on Computers & Security, Elsevier ( Download link at the end of the post.

Would you trust a security technology that makes it possible (i.e., quite likely) to steal the single sign on enterprise credentials of any specific person in your enterprise by merely walking within 30 meters from that person? The attacker does not need to do any visible activity that might raise suspicions: a 50-euros device in a bag and a few seconds of physical proximity is all that is needed. The attack has to be done outside of the enterprise, Internet connectivity is not required and active cooperation of the target is not required. Thus, the attack may occur anywhere and the target would not notice anything.

And, what if that security technology made it possible (i.e. quite likely) to steal the single sign on credentials of a large fraction of people of your enterprise that happen to pass within 30 meters from an attacker? Perhaps at the canteen, or near the bus station, or anywhere outside of the enterprise?

Of course, you would not trust such a security technology. Interestingly, though, a technology of this kind is nearly ubiquitous and implicitly trusted by a lot of people and enterprises: it is WPA2 Enterprise, the suite of protocols for secure communication in wireless networks.

Scenarios like the ones outlined above are not a theoretical possibility. They are a reality whenever people connect to enterprise wireless networks with devices that are not configured correctly. While these issues are well known to practitioners (the technical details are discussed in the paper), the corresponding risks in the current technological landscape are not widely understood. The world today is very different from when WPA2 Enterprise was designed: first, everyone is now equipped with a personal wifi device and assuming that each device is configured correctly is just unrealistic; second, convergence toward single sign on architectures implies that network credentials unlock access to all services of the enterprise. It follows that attacks aimed at stealing network credentials from personal wifi enabled devices have become very attractive, because they can be carried out quickly, simply, cheaply, the probability of success is high, if successful they deliver enterprise credentials and, last but not least, they are hard to detect.

We show in the paper that by just wandering around for a few hours in regions not covered by a wireless network we collected 200 enterprise credentials. And, that by remaining for a few seconds at less than 35 meters from a specific (voluntary) target, you may steal his/her enterprise credentials; even when he/she is sitting in a car with close windows. Our experience is based on eduroam, the wireless infrastructure that federates thousands of universities and research institutes across the world and that performs billions of authentications each year, but our findings are more general as they are intrinsic to WPA2 Enterprise.

A security technology that in many, if not most, of its practical deployments puts enterprise credentials at risk is no longer acceptable. In particular, because those risks are not only underestimated, but also because they result from a design that is not secure by default.

An off the shelf device cannot connect to an enterprise wireless network securely. It has to be configured correctly by the user. The problem is, a device may connect even if the configuration is not correct, with the result that the connection will not be secure. This is no longer acceptable. Consider an off the shelf device. We are able to connect it securely to an HTTPS server whose name is known, but we are not able to connect it securely to a WPA2 Enterprise network whose name (SSID) is known.

It is not a matter of which cryptographic protocols or primitives are used; it is a matter of usability of the technology. In this respect, the recent news about the next generation of wireless protocols (WPA3) do not seem very encouraging, as there is no mention to usability and secure by default connection.

In the research paper we elaborate on the above issues (that we also compare to the recently discovered KRACK vulnerability) and suggest a direction for investigating practical solutions able to offer stronger security without requiring any overhaul of existing protocols.

The download link will be active for 50 days from January 11-th. After then, please drop an email to Comments and criticisms most welcome.

Please feel free to share this post and the download link:

Journal Paper on Active Learning of Regexes from Examples

posted Mar 4, 2017, 4:04 AM by Alberto Bartoli   [ updated Mar 6, 2017, 1:08 AM by Eric Medvet ]

Yet another great result for our multi-year activity on the automatic construction of regular expressions from examples of the desired behavior: our work Active Learning of Regular Expressions for Entity Extraction has been accepted for publication on IEEE Transactions on Cybernetics.

The key point of this work is that it is based on the "active learning approach aimed at minimizing the user annotation effort: the user annotates only one desired extraction and then merely answers extraction queries generated by the system".

Full abstract below.

We consider the automatic synthesis of an entity extractor, in the form of a regular expression, from examples of the desired extractions in an unstructured text stream.  This is a long-standing problem for which many different approaches have been proposed, which all require the preliminary construction of a large dataset fully annotated by the user. In this work we propose an active learning approach aimed at minimizing the user annotation effort: the user annotates only one desired extraction and then merely answers extraction queries generated by the system.

During the learning process, the system digs into the input text for selecting the most appropriate extraction query to be submitted to the user in order to improve the current extractor. We construct candidate solutions with Genetic Programming and select queries with a form of querying-by-committee, i.e., based on a measure of disagreement within the best candidate solutions. All the components of our system are carefully tailored to the peculiarities of active learning with Genetic Programming and of entity extraction from unstructured text. We evaluate our proposal in depth, on a number of challenging datasets and based on a realistic estimate of the user effort involved in answering each single query. The results demonstrate high accuracy with significant savings in terms of computational effort, annotated characters and execution time over a state-of-the-art baseline.

Open PostDoc position

posted Mar 1, 2017, 1:08 AM by Eric Medvet   [ updated Mar 1, 2017, 1:09 AM ]

The lab has 1 one-year Postdoctoral fellowship available. The fellow will work on the "Design and development of a Machine Learning-based software system for detecting and predicting failures and wear in internal combustion engines", as part of a research project based on an industry collaboration whose goal is to design and develop a software system for the predictive maintenance of internal combustion engines (Predictive Security Maintenance Software - PSMS) based on Machine Learning Techniques which are used to detect or predict failures and wear by analyzing relevant signals collected on the engine itself or with proper monitoring tools.
The application deadline is March, 27th.
The salary is 21.3k€ pre-taxes.
More info, including the application forms and instructions, is available here.

Silver medal at the 2016 "Human competitive" awards!!!

posted Jul 28, 2016, 5:57 AM by Alberto Bartoli   [ updated Aug 3, 2016, 2:37 AM by Eric Medvet ]

Our work on automatic generation of regular expressions from examples (see thisthisthis and this) has won the Silver Medal at the 2016 "Human competitive" awards (Humies). 

We are really very proud of this prestigious result.

Humies is an international competition, now at its 13-th edition, that establishes the state of the art in genetic and evolutionary computation. It is open to "human-competitive results that were produced by any form of genetic and evolutionary computation" and establishes precisely what "human-competitive" means, for example, "The result is equal to or better than the most recent human-created solution to a long-standing problem for which there has been a succession of increasingly better human-created solutions".

We were selected as finalists also in the 11-th edition, but that time our submission was not as strong as this year.  The presentation made by Andrea De Lorenzo can be found here.

Genetic programming is one of our main research activities at the Machine Learning Lab. We are grateful to the many people that have collaborated with us along the years. We are most grateful to Cyril Fillon, which introduced us to the mysteries of genetic programming back in 2005 (and which managed to survive in Italy despite the 2006 Football World Cup final). We are also grateful to Giorgio Davanzo, Marco Mauri and Enrico Sorio, which participated in the early phases of our multi-year research on regex construction (Giorgio has also the merit of suggesting this research topic).

Thanks to all.

Alberto, also on behalf of Eric, Andrea, Fabiano

We are finalist (again) at the "Human-Competitive" Awards!

posted Jun 24, 2016, 1:56 AM by Alberto Bartoli   [ updated Jun 24, 2016, 2:07 AM by Eric Medvet ]

Our work on automatic generation of regular expressions from examples (see this, this, this and this) has been selected for inclusion among the 8 finalists at the annual "HUMIES Awards for Human-Competitive Results produced by Genetic and Evolutionary Computation"!

The rules for this prestigious competition are clearly stated on their website (which includes the full list of 22 participants to this year competition). Participants submit "human-competitive results that were produced by any form of genetic and evolutionary computation and that were published during previous year... The criterion for human-competitiveness is that an automatically created result is considered human-competitive if it satisfies at least one of the following eight criteria". We believe our work satisfies five criteria:
  • B "The result is equal to or better than a result that was accepted as a new scientific result at the time when it was published in a peer-reviewed scientific journal."
  • D "The result is publishable in its own right as a new scientific result independent of the fact that the result was mechanically created."
  • E "The result is equal to or better than the most recent human-created solution to a long-standing problem for which there has been a succession of increasingly better human-created solutions."
  • H “The result holds its own or wins a regulated competition involving human contestants (in the form of either live human players or human-written computer programs).
  • G "The result solves a problem of indisputable difficulty in its field."

The final will be on Friday, July 22-nd in Denver, Colorado. Andrea De Lorenzo will be there. 

This will be our second HUMIES final (see 2014 edition). As always, the level is very high and the other participants are very strong... being in the final is a great result already... but we believe we have a very strong submission... let's cross the fingers...!

Our recent papers...

posted Jun 23, 2016, 12:03 AM by Alberto Bartoli   [ updated Jun 24, 2016, 2:07 AM by Eric Medvet ]

Very good news for the lab: three papers just accepted at prestigious conferences!

These papers focus on very different application domains and propose approaches based on very different machine learning paradigms.

Syntactical Similarity Learning by means of Grammatical Evolution
  • Several research efforts have shown that a similarity function synthesized from examples may capture an application-specific similarity criterion in a way that fits the application needs more effectively than a generic distance definition.
  • We propose a similarity learning algorithm tailored to problems of syntax-based entity extraction from unstructured text streams.
  • The algorithm takes in input pairs of strings along with a binary indication of whether they adhere or not adhere to the same syntactic pattern.
  • Our approach is based on Grammatical Evolution and learns a function expressed with a specialized, simple language that we have defined for this purpose.
  • (abstract)
A Language and an Inference Engine for Twitter Filtering Rules
  • We consider the problem of the filtering of Twitter posts, that is, the hiding of those posts which the user prefers not to visualize on his/her timeline.
  • We define a language for specifying filtering policies suitable for Twitter posts. The language allows each user to decide which posts to filter out based on his/her sensibility and preferences.
  • We also propose a method for inferring a policy automatically, based solely on examples of the desired filtering behavior.
  • The method is based on an evolutionary approach driven by a multi-objective optimization scheme.
  • (abstract)
"Best Dinner Ever!!!": Automatic Generation of Restaurant Reviews with LSTM-RNN
  • Consumer reviews are an important information resource for people and a fundamental part of everyday decision-making.
  • We investigate the possibility of generating hundreds of false restaurant reviews automatically and very quickly.
  • We propose and evaluate a method for automatic generation of restaurant reviews tailored to the desired rating and restaurant category.
  • Our method is based on a "long-short term memory" recurrent neural network.
  • As it turns out, it is feasible to automatically generate realistic reviews which can manipulate the opinion of the user.
  • (abstract)

...and then IEEE Intelligent Systems !

posted Mar 23, 2016, 5:10 AM by Alberto Bartoli

Our last post was about acceptance on IEEE TKDE of our work on automatic regex construction and ended as follows:

We are crossing our fingers as we hope to make another announcement soon...

And now, yes, we have just received another acceptance notification from a very prestigious journal: IEEE Intelligent Systems. So prestigious that their automated response to our submission was, in a sense, discouraging us of even trying: "please bear in mind that our current acceptance rate for unsolicited manuscripts is under 10%, and that we have a backlog that makes it impossible for us to accept everything we might want to, even if it is of high quality.".

This work provides a very convincing demonstration of the power and novelty of our regex-construction tool:

Building a regular expression involves a considerable amount of skill, expertise and creativity. In this work we investigate whether a machine may surrogate these qualities and construct automatically regular expressions for tasks of realistic complexity.

We discuss a large scale experiment involving more than 1700 users on 10 challenging tasks. We compared the solutions constructed by these users to those constructed by a tool based on Genetic Programming that we have recently developed and made publicly available. 

The quality of automatically-constructed solutions turned out to be similar to the quality of those constructed by the most skilled user group; and, the time for automatic construction was similar to the time required by human users.

Key findings:
  • 1,764 users: 44% novice, 38% intermediate, and 18% experienced.
  • 12 tasks.
  • On average, our tool delivered solutions with accuracy (F-measure) almost always greater than or equal to the one obtained by each category of  users.
  • On average, the time required by our tool was almost always smaller than the time required by human operators.
  • We believe these results are remarkable and highly significant. Indeed, we are not aware of any similar tool exhibiting such human-competitive performance indexes. 
  • We believe that this result is a further demonstration of the practical capabilities of Genetic Programming techniques even on commodity hardware.

IEEE Transactions (TKDE) paper!

posted Mar 21, 2016, 4:17 AM by Alberto Bartoli   [ updated Mar 21, 2016, 4:52 AM by Eric Medvet ]

Our "big paper" on automatic generation of regular expressions from examples will appear soon on IEEE Transactions on Knowledge and Data Engineering! 

We are really very proud of this result. According to the "State of the Journal Editorial" published by the Editor in Chief in January 2016, "TKDE remains a very competitive venue for publishing the best research results. Among the 552 articles submitted in the first 10 month of 2015, 17 were invited for minor revision (3%) and an additional 117 (21%) were invited for major revision". Needless to say, the remaining 418 submissions were rejected.

Our paper was one of those 17 which were asked only a minor revision.

This paper is the result of a multi-year effort (a very short summary written in December 2014 can be found here; a few months later we published this paper; but we kept working even after then...).

We are crossing our fingers as we hope to make another announcement soon...

Inference of Regular Expressions for Text Extraction from Examples

A large class of entity extraction tasks from text that is either semistructured or fully unstructured may be addressed by regular expressions, because in many practical cases the relevant entities follow an underlying syntactical pattern and this pattern may be described by a regular expression. In this work we consider the long-standing problem of synthesizing such expressions automatically, based solely on examples of the desired behavior.

We present the design and implementation of a system capable of addressing extraction tasks of realistic complexity. Our system is based on an evolutionary procedure carefully tailored to the specific needs of regular expression generation by examples. The procedure executes a search driven by a multiobjective optimization strategy aimed at simultaneously improving multiple performance indexes of candidate solutions while at the same time ensuring an adequate exploration of the huge solution space. We assess our proposal experimentally in great depth, on a number of challenging datasets. The accuracy of the obtained solutions seems to be adequate for practical usage and improves over earlier proposals significantly. Most importantly, our results are highly competitive even with respect to human operators. A prototype is available as a web application at

Active Learning of Regular Expressions

posted Dec 1, 2015, 8:13 AM by Alberto Bartoli   [ updated Dec 10, 2015, 4:21 AM by Eric Medvet ]

We have been working for several months on a significant improvement of our regex generator tool (source code): the ability to work in an active learning fashion. Rather than requiring the user to identify a set of examples describing the desired extraction task, it is the system which submits extraction queries to the user.

A paper based on this work has been just accepted at the 31-th ACM Symposium on Applied Computing, Evolutionary track, a good conference with only one-in-four acceptance rate:

Active Learning Approaches for Learning Regular Expressions with Genetic Programming
We consider the long-standing problem of the automatic generation of regular expressions for text extraction, based solely on examples of the desired behavior.
We investigate several active learning approaches in which the user annotates only one desired extraction and then merely answers extraction queries generated by the system.
The resulting framework is attractive because it is the system, not the user, which digs out the data in search of the samples most suitable to the specific learning task.
We tailor our proposals to a state-of-the-art learner based on Genetic Programming and we assess them experimentally on a number of challenging tasks of realistic complexity.
The results indicate that active learning is indeed a viable framework in this application domain and may thus significantly decrease the amount of costly annotation effort required.

This paper is actually a snapshot of our status several months ago. Since then, we have improved our active learning tool substantially and we have carried out a much broader and much deeper experimental evaluation, in particular, taking into account the time cost spent by the user for annotating examples. We have carefully modelled such cost, by distinguishing between: (i) annotation of examples that have to be found by the user; (ii) queries submitted by the system that require a mere yes/no click; (iii) queries submitted by the system which require a more elaborate answer, describing portions of text to be extracted and portions of text not to be extracted.

Currently we do not have any plan for making our active learning tool publicly available, though. We might reconsider this choice in the next months.

1-10 of 41