We provide here a dataset (the file is attached to this page, see below) which can be useful for evaluating the performance of a classifier for discriminating hidden fraudulent URLs. The dataset contains 185180 labeled URLs and some related features.
A hidden fraudulent URL is a URL to an illegitimate web page added to a trusted web site. A trusted web site is a site whose administrators make their best effort to host content that is genuine and not harmful. A compromised web site is a site which hosts both fraudulent and legitimate (i.e., not fraudulent) web pages. A hidden URL is a URL of a page which is not reached by crawling the corresponding web site up to the third level of depth.
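The definition of a hidden URL can be illustrated with a depth-limited breadth-first traversal. The following is only a minimal sketch, not the crawler used to build the dataset: it works on an in-memory link graph (the hypothetical `links` dictionary) instead of fetching pages, and it assumes that the home page is at depth 0 and that "third level" means pages at most three links away from it.

```python
from collections import deque


def reachable_within(links, start, max_depth=3):
    """Return the URLs reached from `start` by following at most `max_depth` links.

    `links` maps each URL to the URLs it links to (a hypothetical,
    pre-collected link graph standing in for a real crawl).
    """
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth == max_depth:
            continue  # pages at the depth limit are kept but not expanded
        for target in links.get(url, ()):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return seen


def is_hidden(url, links, home, max_depth=3):
    """A URL is hidden if the depth-limited crawl from `home` never reaches it."""
    return url not in reachable_within(links, home, max_depth)


# Toy site: "/" -> "/a" -> "/b" -> "/c" -> "/d"; "/phish" is linked from nowhere.
toy_links = {"/": ["/a"], "/a": ["/b"], "/b": ["/c"], "/c": ["/d"]}
print(is_hidden("/c", toy_links, "/"))      # reached at depth 3
print(is_hidden("/phish", toy_links, "/"))  # never reached
```

Under these assumptions, a page linked only from deep within the site is hidden in the same sense as a page that is not linked at all.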
This dataset includes three kinds of URLs:
This dataset considers two categories of fraudulent web page:
Concerning phishing, we used the data provided by Phishtank. We composed a list of about 7500 valid and online URLs extracted from Phishtank.
Concerning defacements, we used data provided by Zone-H. We composed a list of about 2500 URLs extracted from Zone-H.
We then augmented the lists by adding all URLs of pages reached by crawling the compromised sites up to the third level and dropped from the lists the following items:
index.html
Concerning URLs of legitimate pages belonging to trusted and uncompromised web sites, we selected a set of 20 web sites extracted from the top 500 web sites ranking provided by Alexa. We excluded from this selection:
For each of these 20 sites, we crawled the site up to the 10th level of depth and saved all the URLs obtained.
The data is a CSV file with the following columns:
url, the actual URL
compromissionType, a categorical value among "phishing", "defacement" or the empty string if the URL belongs to a web site compromised by phishing, compromised by defacement, or not compromised, respectively
isHiddenFraudulent, a boolean value which is true if the URL is hidden and fraudulent
contentLength, an integer value corresponding to the Content-Length header value obtained by sending an HTTP HEAD request to the URL
serverType, a string corresponding to the X-Server-Type header value, obtained as above
poweredBy, a string corresponding to the X-Powered-By header value, obtained as above
contentType, a string corresponding to the Content-Type header value, obtained as above
lastModified, a date in the format "Thu, 17 Jan 2013 18:49:04 GMT" corresponding to the Last-Modified header value, obtained as above
The serverType and poweredBy values have been preprocessed so as to keep only the framework name and the major and minor version number (e.g., Apache/2.2.22-12 becomes Apache/2.2).
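The preprocessing of the serverType and poweredBy values could look roughly like the sketch below. It is not the code used to build the dataset: the function name `normalize_server` and the regular expression are assumptions, as is the choice to leave values without a recognizable `name/major.minor` prefix unchanged.

```python
import re

# Assumed pattern: a framework name, a slash, then major and minor version numbers.
_NAME_AND_VERSION = re.compile(r"^([^/\s]+)/(\d+)\.(\d+)")


def normalize_server(value):
    """Keep only 'Name/major.minor' from a value such as 'Apache/2.2.22-12'."""
    value = value.strip()
    match = _NAME_AND_VERSION.match(value)
    if match is None:
        return value  # assumption: values without a version are kept as they are
    name, major, minor = match.groups()
    return f"{name}/{major}.{minor}"


print(normalize_server("Apache/2.2.22-12"))  # the example given in the text
```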
The file is attached to this page, see below.
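Once downloaded, the file could be read along the following lines. This is a sketch under several assumptions that are not stated on this page: that the file has a header row whose names match the columns listed above, that fields are comma-separated with the lastModified value quoted, and that the boolean is written as "true"/"false". The helper names `parse_row` and `read_rows` are hypothetical.

```python
import csv
import io
from email.utils import parsedate_to_datetime


def parse_row(row):
    """Convert the string fields of one CSV row to Python types."""
    length = row["contentLength"].strip()
    modified = row["lastModified"].strip()
    return {
        "url": row["url"],
        # The empty string marks a URL from a site that is not compromised.
        "compromissionType": row["compromissionType"] or None,
        # Assumption: booleans are serialized as "true" / "false".
        "isHiddenFraudulent": row["isHiddenFraudulent"].strip().lower() == "true",
        "contentLength": int(length) if length else None,
        "serverType": row["serverType"],
        "poweredBy": row["poweredBy"],
        "contentType": row["contentType"],
        # Parses HTTP-style dates such as "Thu, 17 Jan 2013 18:49:04 GMT".
        "lastModified": parsedate_to_datetime(modified) if modified else None,
    }


def read_rows(handle):
    """Read every row from an open text file (or any file-like object)."""
    return [parse_row(row) for row in csv.DictReader(handle)]


# Made-up sample rows, for illustration only; they are not taken from the dataset.
sample = io.StringIO(
    "url,compromissionType,isHiddenFraudulent,contentLength,"
    "serverType,poweredBy,contentType,lastModified\n"
    "http://example.com/a,phishing,true,1234,Apache/2.2,PHP/5.3,text/html,"
    '"Thu, 17 Jan 2013 18:49:04 GMT"\n'
    "http://example.org/b,,false,,,,,\n"
)
rows = read_rows(sample)
print(sum(r["isHiddenFraudulent"] for r in rows), "hidden fraudulent URL(s)")
```

For the real file, `read_rows(open(path, newline=""))` would replace the in-memory sample, where `path` is wherever the attachment was saved.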