Can A Machine Replace Humans In Building Regular Expressions? A Case Study

Supplemental material

This web page contains additional data about the experiment described in "Can A Machine Replace Humans In Building Regular Expressions? A Case Study": box-and-whiskers diagrams covering all the completed tasks; a statistical significance analysis (p-values); and histograms of average values computed over different portions of the data. The measured quantities are: F-measure on the learning set, F-measure on the testing set, and time for constructing a regular expression. Data are provided at the granularity of each extraction task.
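As a reminder of what the F-measure captures, the following is a minimal sketch that scores a regular expression against a set of expected snippets. It assumes that a snippet counts as correct only when its character span matches an expected span exactly, and that F-measure is the harmonic mean of precision and recall; the exact matching criterion used in the experiment may differ. The function name `f_measure`, the sample text and the pattern are hypothetical.

```python
import re

def f_measure(pattern, text, expected_spans):
    """Return (precision, recall, F-measure) of a regex on one text.

    Assumption: a snippet is correct only if its (start, end) span
    is identical to one of the expected spans.
    """
    extracted = {m.span() for m in re.finditer(pattern, text)}
    expected = set(expected_spans)
    correct = len(extracted & expected)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(expected) if expected else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall.
    return precision, recall, 2 * precision * recall / (precision + recall)

# Hypothetical toy example: two IP-like snippets are expected.
text = "src=10.0.0.1 dst=192.168.1.20 ttl=64"
expected_spans = [(4, 12), (17, 29)]

# A deliberately loose pattern: it also extracts the trailing "64",
# so recall is perfect but precision is not.
precision, recall, f = f_measure(r"[\d.]+", text, expected_spans)
print(precision, recall, f)
```

Computing the same score on a learning set and on a separate testing set gives the two F-measure quantities reported on this page.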

The box-and-whiskers diagrams show that there is ample variability in the results associated with humans, both in F-measure and in time, while the results obtained with our tool are much more repeatable. The only cases in which our tool shows a relatively wide variability are F-measure for References-Lead-Author and time for Web-HTML-Heading-Content.

The F-measure diagram shows that, for each category of humans, there is always a fraction of humans who obtain better results than our tool. Not surprisingly, then, the improvement of our tool over each of the three categories is statistically significant only for some tasks (see the tables of p-values). In other words, although our tool is not systematically better than humans in terms of F-measure on all tasks, it delivers an F-measure that is comparable to that of humans and, on average, even better.

The time diagram, on the other hand, indicates that our tool tends to be systematically faster than humans, and the table of p-values confirms that this difference is statistically significant for most tasks.

Box and whiskers


Statistical significance analysis

The following tables show the p-values obtained with a Wilcoxon rank-sum test. The hypothesis H1 is: the F-measure obtained by our tool (GP) is greater than the one obtained by the human.

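| Learning | Log-MAC | Cetinkaya-Text-All-URL | Log-IP | Cetinkaya-HTML-HREF | ReLIE-HTML-All-URL | Web-HTML-Heading | ReLIE-Email-Phone-Number | Bibtex-Author | Web-HTML-Heading-Content | References-Lead-Author |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Novice | 0.127 | 0.213 | 0.197 | 0.093 | 0.006 | 0.138 | 0.093 | 0.005 | 0.000 | 0.000 |
| Intermediate | 0.212 | 0.314 | 0.279 | 0.171 | 0.013 | 0.222 | 0.180 | 0.020 | 0.001 | 0.001 |
| Experienced | 0.218 | 0.305 | 0.285 | 0.209 | 0.030 | 0.274 | 0.206 | 0.064 | 0.002 | 0.002 |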

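| Testing | Log-MAC | Cetinkaya-Text-All-URL | Log-IP | Cetinkaya-HTML-HREF | ReLIE-HTML-All-URL | Web-HTML-Heading | ReLIE-Email-Phone-Number | Bibtex-Author | Web-HTML-Heading-Content | References-Lead-Author |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Novice | 0.391 | 0.016 | 0.990 | 0.931 | 0.010 | 0.958 | 0.983 | 0.282 | 0.000 | 0.001 |
| Intermediate | 0.670 | 0.091 | 1.000 | 0.980 | 0.028 | 0.994 | 0.998 | 0.647 | 0.001 | 0.003 |
| Experienced | 0.687 | 0.190 | 1.000 | 0.984 | 0.038 | 0.998 | 0.999 | 0.913 | 0.005 | 0.004 |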

The following table shows the p-values obtained with a Wilcoxon rank-sum test. The hypothesis H1 is: the time taken by our tool (GP) to construct a regular expression is lower than the time taken by the human.

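| Time | Log-MAC | Cetinkaya-Text-All-URL | Log-IP | Cetinkaya-HTML-HREF | ReLIE-HTML-All-URL | Web-HTML-Heading | ReLIE-Email-Phone-Number | Bibtex-Author | Web-HTML-Heading-Content | References-Lead-Author |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Novice | 0.001 | 0.003 | 0.001 | 0.007 | 0.008 | 0.099 | 0.029 | 0.015 | 0.942 | 0.286 |
| Intermediate | 0.000 | 0.004 | 0.001 | 0.008 | 0.003 | 0.071 | 0.041 | 0.002 | 0.891 | 0.057 |
| Experienced | 0.000 | 0.010 | 0.001 | 0.034 | 0.002 | 0.116 | 0.086 | 0.002 | 0.817 | 0.033 |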
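
For readers who want to run this kind of one-sided test on their own measurements, the following is a minimal sketch of a Wilcoxon rank-sum test. It is not the code used to produce the tables above: it assumes a normal approximation with tie correction and no continuity correction, so its p-values can differ slightly from those of an exact test, especially on small samples. The function name `rank_sum_p_value` and the sample values are hypothetical.

```python
import math

def rank_sum_p_value(x, y, alternative="greater"):
    """One-sided Wilcoxon rank-sum p-value (normal approximation).

    alternative="greater": H1 is that values in x tend to exceed those in y.
    alternative="less":    H1 is that values in x tend to be below those in y.
    """
    n, m = len(x), len(y)
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < len(pooled):
        # Find the group of tied values starting at position i.
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # average of the 1-based ranks i+1..j
        t = j - i
        tie_term += t**3 - t
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u = rank_sum_x - n * (n + 1) / 2  # Mann-Whitney U statistic for x
    total = n + m
    mean_u = n * m / 2
    var_u = n * m / 12 * ((total + 1) - tie_term / (total * (total - 1)))
    if var_u == 0:
        return 1.0  # all values identical: no evidence for H1
    z = (u - mean_u) / math.sqrt(var_u)
    if alternative == "less":
        z = -z
    # Upper-tail probability of the standard normal distribution.
    return 0.5 * math.erfc(z / math.sqrt(2))

# Made-up values for illustration only, not data from the experiment.
tool_f = [0.90, 0.95, 0.85, 0.92, 0.97]
human_f = [0.50, 0.60, 0.55, 0.70, 0.65]
print(rank_sum_p_value(tool_f, human_f, alternative="greater"))
```

With `alternative="greater"` the call mirrors the H1 used for the F-measure tables; for the time table, the tool's times would be passed as `x` with `alternative="less"`.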


Average values with all data


Average values without <1st and >99th percentiles on time


Average values without <5th and >95th percentiles on time
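
The two trimmed views can be reproduced along the following lines. This is a minimal sketch that reads the headings as: for each task, discard the results whose construction time lies below the lower percentile or above the upper percentile, then average what remains. The percentile definition (linear interpolation between closest ranks), the data layout and the names `percentile` and `mean_within_percentiles` are assumptions, not details taken from the experiment.

```python
def percentile(sorted_values, q):
    """q-th percentile (0..100), linear interpolation between closest ranks."""
    pos = (len(sorted_values) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (pos - lo)

def mean_within_percentiles(results, low_q, high_q):
    """Average F-measure and time over results whose time is not an outlier.

    results: list of (f_measure, time) pairs for one task (hypothetical layout).
    Results with time below the low_q-th or above the high_q-th percentile
    of time are discarded before averaging.
    """
    times = sorted(t for _, t in results)
    low, high = percentile(times, low_q), percentile(times, high_q)
    kept = [(f, t) for f, t in results if low <= t <= high]
    if not kept:
        raise ValueError("no results left after trimming")
    mean_f = sum(f for f, _ in kept) / len(kept)
    mean_t = sum(t for _, t in kept) / len(kept)
    return mean_f, mean_t

# Made-up results for illustration only: constant F-measure, times 1..101.
results = [(0.5, float(t)) for t in range(1, 102)]
print(mean_within_percentiles(results, 0, 100))  # all data
print(mean_within_percentiles(results, 1, 99))   # without <1st and >99th
print(mean_within_percentiles(results, 5, 95))   # without <5th and >95th
```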