We provide a dataset (the file is attached to this page, see below) for evaluating the performance of classifiers that detect hidden fraudulent URLs. The dataset contains 185180 labeled URLs and some related features.
A hidden fraudulent URL is a URL of an illegitimate web page added to a trusted web site. A trusted web site is a site whose administrators make their best effort to host only content that is genuine and not harmful.
A compromised web site is a site that hosts both fraudulent and legitimate (i.e., not fraudulent) web pages.
A hidden URL is a URL of a page that is not reached by crawling the corresponding web site up to the third level of depth.
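The definition of a hidden URL can be illustrated with a depth-limited breadth-first traversal. The following is only a minimal sketch, not the procedure used to build the dataset: it works on an in-memory link graph instead of fetching pages, the names `reachable_within_depth` and `is_hidden` are hypothetical, the `example.org` URLs are made up, and it assumes that the site's home page counts as level 0.

```python
from collections import deque


def reachable_within_depth(links, root, max_depth=3):
    """Return the URLs reachable from root by following at most max_depth links.

    links maps each URL to the list of URLs its page links to.
    Assumption: the root page is level 0.
    """
    seen = {root}
    queue = deque([(root, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for target in links.get(url, ()):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return seen


def is_hidden(url, links, root, max_depth=3):
    """A URL is hidden if the depth-limited crawl from root never reaches it."""
    return url not in reachable_within_depth(links, root, max_depth)


# Made-up link graph: a chain of pages plus one page nobody links to.
root = "http://example.org/"
links = {
    "http://example.org/": ["http://example.org/a"],
    "http://example.org/a": ["http://example.org/b"],
    "http://example.org/b": ["http://example.org/c"],
    "http://example.org/c": ["http://example.org/d"],
}
orphan = "http://example.org/x/login.php"

hidden_orphan = is_hidden(orphan, links, root)
hidden_level_3 = is_hidden("http://example.org/c", links, root)
hidden_level_4 = is_hidden("http://example.org/d", links, root)
```

Because the traversal is breadth-first, each page is first seen at its shortest distance from the root, so a page counts as reached whenever at least one path of three or fewer links leads to it.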
This dataset includes three kinds of URLs:
This dataset considers two categories of fraudulent web pages: phishing pages and defacements.
For phishing, we used the data provided by Phishtank, from which we composed a list of about 7500 valid, online URLs.
For defacements, we used the data provided by Zone-H, from which we composed a list of about 2500 URLs.
We then augmented the lists by adding the URLs of all pages reached by crawling the compromised sites up to the third level of depth, and dropped the following items from the lists:
For URLs of legitimate pages belonging to trusted and uncompromised web sites, we selected a set of 20 web sites from the ranking of the top 500 web sites provided by Alexa. We excluded from this selection:
For each of these 20 sites, we crawled the site up to the 10th level of depth and saved all the URLs obtained.
The data is a CSV file with the following columns:
The file is attached to this page, see below.
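As a starting point for working with the file, the class balance can be checked with Python's standard `csv` module. This is a hedged sketch only: the function name `count_labels`, the column names `url` and `label`, and the label values in the sample are assumptions made for illustration and must be replaced with the actual header names of the attached file.

```python
import csv
import io
from collections import Counter


def count_labels(csv_file, label_column):
    """Count how many rows carry each value of label_column.

    csv_file is any open text file object whose first line is the header.
    """
    reader = csv.DictReader(csv_file)
    if label_column not in (reader.fieldnames or []):
        raise KeyError(f"column {label_column!r} not found in the header")
    return Counter(row[label_column] for row in reader)


# In-memory sample with hypothetical column names and made-up rows.
sample = io.StringIO(
    "url,label\n"
    "http://example.org/index.html,legitimate\n"
    "http://example.org/x/login.php,fraudulent\n"
    "http://example.org/about.html,legitimate\n"
)
counts = count_labels(sample, "label")
```

For the real file, the same function can be called on a handle opened with `open(path, newline="")`, where `path` points to the downloaded attachment.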