International Journal Publications

Active Learning of Regular Expressions for Entity Extraction

posted Mar 6, 2017, 1:02 AM by Eric Medvet   [ updated Mar 6, 2017, 1:06 AM ]

  • IEEE Transactions on Cybernetics (TCyb), 2017, to appear
  • Alberto Bartoli, Andrea De Lorenzo, Eric Medvet, Fabiano Tarlao
We consider the automatic synthesis of an entity extractor, in the form of a regular expression, from examples of the desired extractions in an unstructured text stream. This is a long-standing problem for which many different approaches have been proposed, all of which require the preliminary construction of a large dataset fully annotated by the user. In this work we propose an active learning approach aimed at minimizing the user annotation effort: the user annotates only one desired extraction and then merely answers extraction queries generated by the system. During the learning process, the system searches the input text to select the most appropriate extraction query to submit to the user in order to improve the current extractor. We construct candidate solutions with Genetic Programming and select queries with a form of querying-by-committee, i.e., based on a measure of disagreement within the best candidate solutions. All the components of our system are carefully tailored to the peculiarities of active learning with Genetic Programming and of entity extraction from unstructured text. We evaluate our proposal in depth, on a number of challenging datasets and based on a realistic estimate of the user effort involved in answering each single query. The results demonstrate high accuracy with significant savings in terms of computational effort, annotated characters and execution time over a state-of-the-art baseline.
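
As a rough illustration of the querying-by-committee idea mentioned in the abstract, the following Python sketch picks the candidate substring on which a committee of regular expressions disagrees most. It is only a sketch under stated assumptions: the function names, the committee patterns, the candidate strings and the disagreement score (1 on an even split, 0 when unanimous) are all hypothetical and are not taken from the paper.

```python
import re
from typing import Sequence


def disagreement(candidate: str, committee: Sequence[re.Pattern]) -> float:
    """Score in [0, 1]: 0 if the committee is unanimous, 1 on an even split.

    Hypothetical measure chosen for illustration only.
    """
    votes = sum(1 for rx in committee if rx.fullmatch(candidate))
    frac = votes / len(committee)
    return 1.0 - abs(2.0 * frac - 1.0)


def select_query(candidates: Sequence[str], committee: Sequence[re.Pattern]) -> str:
    """Return the candidate extraction the committee disagrees on most."""
    return max(candidates, key=lambda c: disagreement(c, committee))


# Hypothetical committee: stand-ins for the best candidate solutions so far.
committee = [
    re.compile(p)
    for p in (r"\d{4}-\d{2}-\d{2}", r"\d+-\d+-\d+", r"\d{4}-\d+-\d+", r"[\d-]+")
]
# Hypothetical candidate extractions taken from the input text.
candidates = ["2017-03-06", "17-3-6", "hello"]

query = select_query(candidates, committee)
print(query)
```

In this toy setting the committee is unanimous on the first and last candidates, so the remaining one would be the query submitted to the user.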

An architecture for anonymous mobile coupons in a large network

posted Nov 17, 2016, 6:32 AM by Eric Medvet   [ updated Dec 19, 2016, 2:10 PM ]

A mobile coupon (m-coupon) can be presented with a smartphone for obtaining a financial discount when purchasing a product or service. M-coupons are a powerful marketing tool that has enjoyed a huge growth and diffusion, involving tens of millions of people each year.
We propose an architecture which may enable significant improvements over current m-coupon technology, in terms of acceptance of potential customers and of marketing actions that become feasible: the customer does not need to install any dedicated app; an m-coupon is not bound to any specific device or customer; an m-coupon may be redeemed at any store in a set of potentially many thousands of stores, without any prior arrangement between customer and store. We are not aware of any proposal with these properties.

Regex-based Entity Extraction with Active Learning and Genetic Programming

posted Aug 26, 2016, 12:49 AM by Eric Medvet   [ updated Oct 3, 2016, 2:01 AM ]

We consider the long-standing problem of the automatic generation of regular expressions for text extraction, based solely on examples of the desired behavior. We investigate several active learning approaches in which the user annotates only one desired extraction and then merely answers extraction queries generated by the system.
The resulting framework is attractive because it is the system, not the user, that searches the data for the samples most suitable to the specific learning task. We tailor our proposals to a state-of-the-art learner based on Genetic Programming and we assess them experimentally on a number of challenging tasks of realistic complexity. The results indicate that active learning is indeed a viable framework in this application domain and may thus significantly decrease the amount of costly annotation effort required.

Predicting the Effectiveness of Pattern-based Entity Extractor Inference

posted May 16, 2016, 2:21 AM by Eric Medvet   [ updated May 27, 2016, 3:31 AM ]

An essential component of any workflow leveraging digital data consists in the identification and extraction of relevant patterns from a data stream. We consider a scenario in which an extraction inference engine generates an entity extractor automatically from examples of the desired behavior, which take the form of user-provided annotations of the entities to be extracted from a dataset. We propose a methodology for predicting the accuracy of the extractor that may be inferred from the available examples. We propose several prediction techniques and experimentally analyze our proposals in great depth, with reference to extractors consisting of regular expressions. The results suggest that reliable predictions for tasks of practical complexity may indeed be obtained quickly and without actually generating the entity extractor.

Can A Machine Replace Humans In Building Regular Expressions? A Case Study

posted Mar 23, 2016, 5:05 AM by Eric Medvet   [ updated Jun 7, 2016, 2:36 AM ]

Regular expressions are routinely used in a variety of different application domains. Building a regular expression involves a considerable amount of skill, expertise and creativity. In this work we investigate whether a machine may act as a surrogate for these qualities and automatically construct regular expressions for tasks of realistic complexity. We discuss a large-scale experiment involving more than 1700 users on 10 challenging tasks. We compared the solutions constructed by these users to those constructed by a tool based on Genetic Programming that we have recently developed and made publicly available. The quality of automatically constructed solutions turned out to be similar to the quality of those constructed by the most skilled user group, and the time for automatic construction was similar to the time required by human users.

Inference of Regular Expressions for Text Extraction from Examples

posted Mar 21, 2016, 4:48 AM by Eric Medvet   [ updated May 26, 2016, 2:57 AM ]

A large class of entity extraction tasks from text that is either semistructured or fully unstructured may be addressed by regular expressions, because in many practical cases the relevant entities follow an underlying syntactical pattern and this pattern may be described by a regular expression. In this work we consider the long-standing problem of synthesizing such expressions automatically, based solely on examples of the desired behavior.
We present the design and implementation of a system capable of addressing extraction tasks of realistic complexity. Our system is based on an evolutionary procedure carefully tailored to the specific needs of regular expression generation by examples. The procedure executes a search driven by a multiobjective optimization strategy aimed at simultaneously improving multiple performance indexes of candidate solutions while at the same time ensuring an adequate exploration of the huge solution space. We assess our proposal experimentally in great depth, on a number of challenging datasets. The accuracy of the obtained solutions seems to be adequate for practical usage and improves over earlier proposals significantly. Most importantly, our results are highly competitive even with respect to human operators. A prototype is available as a web application at

Data Quality Challenge: Toward a tool for string processing by examples

posted Jun 9, 2015, 4:25 AM by Eric Medvet   [ updated May 26, 2016, 3:47 AM ]

Many data-related activities at organizations of all sizes are concerned with low-level string processing, such as format transformation and validation, data cleaning, substring extraction and classification, and so on. Problems of this sort occur routinely in a one-off fashion as part of specific processes or activities that cannot be integrated in long-lived workflows, such as analysis of data gathered from the web or from other enterprise sources. The input stream may range from structured or semistructured data, such as database tables or spreadsheets, to unstructured data, such as text in natural language, as well as data whose structure is not available, such as a web page or a pdf invoice. These tasks are a vital ingredient of virtually every organization but are difficult to address efficiently: they are usually too simple to justify the cost and latency of a full-blown IT project, yet they are not simple enough to be solved by non-IT specialists.

Bibliometric Evaluation of Researchers in the Internet Age

posted Jun 26, 2014, 4:12 AM by Eric Medvet   [ updated May 26, 2016, 3:43 AM ]

Research evaluation, which is an increasingly pressing issue, invariably relies on citation counts. In this contribution we highlight two concerns that the research community needs to pay attention to. One, in the world of search-engine-facilitated research, factors such as ease of web discovery, ease of access, and content relevance rather than quality influence what gets read and cited. Two, research evaluation based on citation counts works against many types of high-quality works. We will also elaborate on the implications of these points by examining a recent nation-wide evaluation of researchers performed in Italy. We focus on our discipline (computer science), but we believe that our observations have relevance for a broad audience.

Automatic Synthesis of Regular Expressions from Examples

posted Feb 25, 2014, 8:35 AM by Eric Medvet   [ updated May 26, 2016, 3:41 AM ]

We propose a system for the automatic generation of regular expressions for text-extraction tasks. The user describes the desired task only by means of a set of labeled examples. The generated regexes may be used with common engines such as those that are part of Java, PHP, Perl and so on. Usage of the system does not require any familiarity with regular expressions syntax. We performed an extensive experimental evaluation on 12 different extraction tasks applied to real-world datasets. We obtained very good results in terms of precision and recall, even in comparison to earlier state-of-the-art proposals. Our results are highly promising toward the achievement of a practical surrogate for the specific skills required for generating regular expressions, and significant as a demonstration of what can be achieved with GP-based approaches on modern IT technology.
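
To make the notions of precision and recall for a text-extraction regex concrete, here is a minimal Python sketch. It assumes that an extraction counts as correct only when its character span matches a labeled span exactly; this criterion, the function name, the sample text and the patterns are all hypothetical and may differ from the evaluation actually used in the paper.

```python
import re
from typing import Iterable, Tuple


def extraction_scores(
    pattern: str, text: str, expected_spans: Iterable[Tuple[int, int]]
) -> Tuple[float, float]:
    """Return (precision, recall) of a regex against labeled (start, end) spans.

    Assumption: an extraction is correct only on an exact span match.
    """
    found = {m.span() for m in re.finditer(pattern, text)}
    expected = set(expected_spans)
    correct = len(found & expected)
    precision = correct / len(found) if found else 0.0
    recall = correct / len(expected) if expected else 0.0
    return precision, recall


# Hypothetical labeled example: the two phone numbers are the desired extractions.
text = "Call 555-1234 or 555-9876, ext. 42."
expected = [(5, 13), (17, 25)]

# A tight pattern extracts exactly the labeled spans.
print(extraction_scores(r"\d{3}-\d{4}", text, expected))
# A looser pattern also extracts "42", which lowers precision but not recall.
print(extraction_scores(r"\d+(?:-\d+)?", text, expected))
```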

Semisupervised Wrapper Choice and Generation for Print-Oriented Documents

posted Dec 10, 2012, 6:33 AM by Eric Medvet   [ updated May 26, 2016, 3:34 AM ]

Information extraction from printed documents is still a crucial problem in many interorganizational workflows. Solutions for other application domains, e.g., the web, do not fit this peculiar scenario well, as printed documents do not carry any explicit structural or syntactical description. Moreover, printed documents usually lack any explicit indication about their source. We present a system, which we call PATO, for extracting predefined items from printed documents in a dynamic multi-source scenario. PATO selects the source-specific wrapper required by each document, determines whether no suitable wrapper exists and generates one when necessary. PATO assumes that the need for new source-specific wrappers is part of normal system operation: new wrappers are generated on-line based on a few point-and-click operations performed by a human operator on a GUI. The role of operators is an integral part of the design and PATO may be configured to accommodate a broad range of automation levels. We show that PATO exhibits very good performance on a challenging dataset composed of more than 600 printed documents drawn from three different application domains: invoices, datasheets of electronic components, patents. We also perform an extensive analysis of the crucial trade-off between accuracy and automation level.
