Automatic Search-and-Replace

Text modification examples for learning search-and-replace expressions

We provide here the datasets that we used to experimentally evaluate a method for automatically inferring search-and-replace expressions with Multi-Objective Cooperative Coevolutionary Genetic Programming (MOCCGP).

Each dataset represents a different search-and-replace task and provides textual examples of that task. Each example is a pair of strings: the original string and the same string after the text transformation has been applied.
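
To make the notion of an example pair concrete, the following minimal Python sketch scores a candidate search-and-replace expression against a list of pairs. The `Example` type, the `accuracy` function, and the second example string are assumptions made for illustration; they are not the file format of the datasets and not the fitness measure used by the MOCCGP method.

```python
import re
from typing import NamedTuple, Sequence


class Example(NamedTuple):
    """One example pair: the original text and the expected transformed text."""
    before: str
    after: str


def accuracy(search: str, replace: str, examples: Sequence[Example]) -> float:
    """Fraction of examples that the candidate expression transforms exactly."""
    hits = sum(re.sub(search, replace, ex.before) == ex.after for ex in examples)
    return hits / len(examples)


examples = [
    Example("31/Dec/2012", "2012-Dec-31"),
    # Made-up negative example: text without a date must stay unchanged.
    Example("no date on this line", "no date on this line"),
]

# Two hand-written candidates: a naive one and one that reorders the fields.
naive = ("/", "-")
swap = (r"(\d{2})/([A-Za-z]{3})/(\d{4})", r"\3-\2-\1")

print(accuracy(*naive, examples))
print(accuracy(*swap, examples))
```

The naive candidate only replaces the separators and leaves the field order untouched, so it fails on the date example, while the second candidate transforms both pairs correctly.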

    1. Date. Change the format of each date found in the web server log of the IP task from the Gregorian little-endian slash-separated format to the Gregorian big-endian dash-separated format, e.g., 31/Dec/2012 becomes 2012-Dec-31.
    2. Phone. Change the format of each phone number found in an email collection by removing the parentheses around the area code and adding a dash, e.g., (555) 555-5555 becomes 555-555-5555.
    3. Ebook. Fix paragraph terminations in ePub ebook files (an XHTML-based format). Each wrong paragraph termination (i.e., an occurrence of <p><\p> not preceded by one of the characters ., !, ?, :) is replaced with a space character. The corpus consists of portions of three publicly available ebooks: I Promessi Sposi (Alessandro Manzoni), Pride and Prejudice (Jane Austen), Don Quijote de la Mancha (Miguel de Cervantes).
    4. Salary. Remove the thousands-separator commas from the salaries of NBA players, e.g., $120,350,000 becomes $120350000.
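
As a hand-written point of comparison for the Phone and Salary tasks, the sketch below expresses each one as a Python regular expression with a replacement string. These patterns are assumptions written for illustration from the task descriptions above; they are not the expressions inferred by the method, and they have not been checked against the full datasets.

```python
import re

# Each task is a (compiled search pattern, replacement string) pair.
# Phone: drop the parentheses around the area code and join with a dash.
PHONE = (re.compile(r"\((\d{3})\) (\d{3}-\d{4})"), r"\1-\2")

# Salary: delete a comma only when it sits between a digit and a 3-digit group.
SALARY = (re.compile(r"(?<=\d),(?=\d{3})"), "")


def apply(task, text: str) -> str:
    """Apply one search-and-replace task to every match in the text."""
    pattern, replacement = task
    return pattern.sub(replacement, text)


print(apply(PHONE, "(555) 555-5555"))   # 555-555-5555
print(apply(SALARY, "$120,350,000"))    # $120350000
```

The Salary pattern uses lookarounds so that commas in ordinary prose are left alone; it would still rewrite any comma-grouped number, with or without a dollar sign, which may or may not match the intent of the dataset.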

search-and-replace-datasets.zip