Hidden fraudulent URLs dataset
We provide a dataset (the file is attached to this page, see below) for evaluating the performance of classifiers that detect hidden fraudulent URLs. The dataset contains 185180 labeled URLs and some related features.
A hidden fraudulent URL is a URL of an illegitimate web page added to a trusted web site. A trusted web site is a site whose administrators make their best effort to host content that is genuine and not harmful. A compromised web site is a site which hosts both fraudulent and legitimate (i.e., not fraudulent) web pages. A hidden URL is a URL of a page which is not reached by crawling the corresponding web site up to the third level of depth.
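The definition of a hidden URL can be illustrated with a depth-limited breadth-first traversal. The following is a minimal sketch, not the crawler used to build the dataset: it works on a hypothetical in-memory link graph (`links`), and it assumes that the home page is at depth 0 and that "third level" means at most three link hops.

```python
from collections import deque


def reachable_within(links, start, max_depth=3):
    """Return the pages reachable from `start` in at most `max_depth` hops."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for target in links.get(page, ()):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return seen


def is_hidden(links, home, url, max_depth=3):
    """A URL is hidden if the depth-limited crawl from `home` never reaches it."""
    return url not in reachable_within(links, home, max_depth)


# Hypothetical site: a chain of links starting from the home page,
# plus one page that nothing links to.
links = {"/": ["/a"], "/a": ["/b"], "/b": ["/c"], "/c": ["/d"]}
hidden_deep = is_hidden(links, "/", "/d")        # four hops away
hidden_unlinked = is_hidden(links, "/", "/x")    # never linked
visible = not is_hidden(links, "/", "/c")        # exactly three hops away
```

A real crawl would replace `links` with pages fetched over HTTP and parsed for anchors; the depth bookkeeping stays the same.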
This dataset includes three kinds of URLs:
- hidden fraudulent URLs;
- URLs of legitimate pages belonging to trusted, yet compromised web sites;
- URLs of legitimate pages belonging to trusted and uncompromised web sites.
This dataset considers two categories of fraudulent web pages:
- phishing pages;
- web defacements.
How we collected the data
Concerning phishing, we used the data provided by Phishtank. We composed a list of about 7500 valid and online URLs extracted from Phishtank.
Concerning defacements, we used data provided by Zone-H. We composed a list of about 2500 URLs extracted from Zone-H.
We then augmented the lists by adding all URLs of pages reached by crawling the compromised sites up to the third level of depth, and dropped the following items from the lists:
- URLs whose domain is an IP address;
- URLs whose path is empty or equal to
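The filtering step can be sketched as follows. This is an illustrative reading of the two rules, not the original filtering code: the function names are hypothetical, and since the second rule is cut off in the text above, only the empty-path case is implemented here.

```python
import ipaddress
from urllib.parse import urlsplit


def host_is_ip(url):
    """Return True if the host part of `url` is an IPv4 or IPv6 address."""
    host = urlsplit(url).hostname
    if host is None:
        return False
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def keep_url(url):
    """Apply the drop rules: IP-address domains and empty paths are removed."""
    if host_is_ip(url):
        return False
    if urlsplit(url).path == "":
        return False
    return True


# Illustrative inputs only, not taken from the dataset.
candidates = [
    "http://example.com/news/item.html",
    "http://192.0.2.10/login.php",
    "http://example.com",
]
kept = [url for url in candidates if keep_url(url)]
```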
Concerning URLs of legitimate pages belonging to trusted and uncompromised web sites, we selected a set of 20 web sites from the top 500 web sites ranking provided by Alexa. We excluded from this selection:
- web sites providing different content depending on whether the user is authenticated,
- social network web sites,
- search engines.
For each of these 20 sites, we crawled the site up to the 10th level of depth and saved all the URLs obtained.
The data is a CSV file with the following columns:
url, the actual URL
compromissionType, a categorical value among "phishing", "defacement" or the empty string, if, respectively, the URL belongs to a web site compromised by phishing, defacement or not compromised
isHiddenFraudulent, a boolean value which is true if the URL is hidden and fraudulent
contentLength, an integer value corresponding to the Content-Length header value obtained by sending an HTTP HEAD request to the URL
serverType, a string corresponding to the X-Server-Type header value, obtained as above
poweredBy, a string corresponding to the X-Powered-By header value, obtained as above
contentType, a string corresponding to the Content-Type header value, obtained as above
lastModified, a date in the format "Thu, 17 Jan 2013 18:49:04 GMT" corresponding to the Last-Modified header value, obtained as above
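The header-derived columns could be collected along these lines. This is a minimal sketch using only the Python standard library, not the tool used to build the dataset: the function names, the timeout, and the choice of an empty string for a missing header are assumptions.

```python
import urllib.request

# Header name -> dataset column, as listed above.
HEADER_TO_COLUMN = {
    "Content-Length": "contentLength",
    "X-Server-Type": "serverType",
    "X-Powered-By": "poweredBy",
    "Content-Type": "contentType",
    "Last-Modified": "lastModified",
}


def features_from_headers(headers):
    """Map a dict of response headers to the header-derived columns."""
    # Header names are case-insensitive, so compare them in lower case.
    lowered = {name.lower(): value for name, value in headers.items()}
    row = {
        column: lowered.get(header.lower(), "")
        for header, column in HEADER_TO_COLUMN.items()
    }
    length = row["contentLength"]
    row["contentLength"] = int(length) if length.isdigit() else None
    return row


def head_features(url, timeout=10):
    """Send an HTTP HEAD request and extract the columns from the response.

    urlopen raises urllib.error.HTTPError for 4xx/5xx responses and
    urllib.error.URLError for network failures; callers should handle both.
    """
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return features_from_headers(dict(response.headers.items()))


# Offline example with hand-written headers (no network access needed).
example = features_from_headers(
    {"content-length": "120", "Content-Type": "text/html"}
)
```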
poweredBy values have been preprocessed so as to keep only the framework name and the major and minor version number (e.g., Apache/2.2.22-12 becomes Apache/2.2).
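A preprocessing step of this kind can be approximated with a regular expression. The sketch below is one possible implementation, not the original one; the function name is hypothetical, and leaving values without a `name/major.minor` prefix unchanged is an assumption.

```python
import re

# Name, a slash, then major and minor version numbers; the rest is discarded.
_NAME_AND_VERSION = re.compile(r"^([^/\s]+)/(\d+)\.(\d+)")


def normalize_powered_by(value):
    """Keep only the name and the major.minor version of a header value."""
    value = value.strip()
    match = _NAME_AND_VERSION.match(value)
    if match is None:
        return value  # assumption: values without a version are kept as-is
    name, major, minor = match.groups()
    return f"{name}/{major}.{minor}"


normalized = normalize_powered_by("Apache/2.2.22-12")
```

With the example from the text, `normalized` is `"Apache/2.2"`.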
The file is attached to this page, see below.
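Once downloaded, the file can be read with the standard `csv` module. The sketch below makes several assumptions that should be checked against the actual file: that the first row holds the column names listed above, that booleans are written as "true"/"false", and that empty cells denote missing header values. The sample rows are invented for illustration and are not taken from the dataset.

```python
import csv
import io
from collections import Counter
from email.utils import parsedate_to_datetime


def read_rows(handle):
    """Yield one dict per CSV row, with the boolean and date columns parsed."""
    for raw in csv.DictReader(handle):
        row = dict(raw)
        # Assumption: booleans are stored as the strings "true" / "false".
        row["isHiddenFraudulent"] = (
            raw["isHiddenFraudulent"].strip().lower() == "true"
        )
        stamp = raw.get("lastModified") or ""
        # Dates use the HTTP format, e.g. "Thu, 17 Jan 2013 18:49:04 GMT".
        row["lastModified"] = parsedate_to_datetime(stamp) if stamp else None
        yield row


def class_counts(rows):
    """Count rows per (compromissionType, isHiddenFraudulent) pair."""
    return Counter(
        (row["compromissionType"], row["isHiddenFraudulent"]) for row in rows
    )


# Invented sample standing in for the real file.
SAMPLE = (
    "url,compromissionType,isHiddenFraudulent,contentLength,"
    "serverType,poweredBy,contentType,lastModified\n"
    "http://example.com/a,phishing,true,120,,PHP/5.3,text/html,"
    '"Thu, 17 Jan 2013 18:49:04 GMT"\n'
    "http://example.org/b,,false,,,,text/html,\n"
)
rows = list(read_rows(io.StringIO(SAMPLE)))
counts = class_counts(rows)
```

To process the real file, pass a handle opened with `open(path, newline="", encoding="utf-8")` to `read_rows`, where `path` is wherever the attachment was saved.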