Hidden fraudulent URLs dataset

We provide a dataset (the file is attached to this page, see below) for evaluating the performance of classifiers that detect hidden fraudulent URLs. The dataset contains 185180 labeled URLs and some related features.

Definitions

A hidden fraudulent URL is a URL to an illegitimate web page added to a trusted web site. A trusted web site is a site whose administrators make their best effort to host content that is genuine and not harmful. A compromised web site is a site which hosts both fraudulent and legitimate (i.e., not fraudulent) web pages. A hidden URL is the URL of a page which is not reached by crawling the corresponding web site up to the third level of depth.
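
The definition of a hidden URL can be illustrated with a depth-limited breadth-first traversal. The following is only a minimal sketch, not the crawler used to build the dataset: it assumes the site's link structure is already available as an in-memory dictionary (the hypothetical `links` argument), and it assumes the home page counts as depth 0, which the definition above does not specify.

```python
from collections import deque

def reachable_within(links, start, max_depth=3):
    """Return the set of pages reachable from `start` in at most `max_depth` link hops.

    `links` maps a page to the pages it links to (a hypothetical, pre-crawled graph).
    """
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for nxt in links.get(url, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return seen

def is_hidden(links, home, url, max_depth=3):
    """A page is hidden if the depth-limited crawl from `home` never reaches it."""
    return url not in reachable_within(links, home, max_depth)
```

With this sketch, a page that no other page links to, or that is only linked from pages deeper than the limit, is reported as hidden.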

This dataset includes three kinds of URLs:

hidden fraudulent URLs;
URLs of legitimate pages belonging to trusted, yet compromised web sites;
URLs of legitimate pages belonging to trusted and uncompromised web sites.

This dataset considers two categories of fraudulent web page:

web defacements;
phishing.

How we collected the data

Concerning phishing, we used the data provided by Phishtank. We composed a list of about 7500 valid and online URLs extracted from Phishtank.

Concerning defacements, we used data provided by Zone-H. We composed a list of about 2500 URLs extracted from Zone-H.

We then augmented the lists by adding all URLs of pages reached by crawling the compromised sites up to the third level of depth, and dropped the following items from the lists:

URLs whose domain is an IP address;
URLs whose path is empty or equal to index.html.
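
The two exclusion rules above can be sketched as a single predicate. This is a hedged illustration rather than the original filtering code: the function name `should_drop` is hypothetical, the input is assumed to be an absolute URL with a scheme, and treating a bare "/" path as empty is our assumption.

```python
import ipaddress
from urllib.parse import urlsplit

def should_drop(url):
    """Return True if `url` matches one of the two exclusion rules."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    try:
        ipaddress.ip_address(host)  # succeeds only for IPv4/IPv6 literals
        return True                 # rule 1: the domain is an IP address
    except ValueError:
        pass
    # Rule 2: empty path or index.html ("/" treated as empty is an assumption).
    return parts.path in ("", "/", "/index.html")
```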

Concerning URLs of legitimate pages belonging to trusted and uncompromised web sites, we selected a set of 20 web sites extracted from the top 500 web sites ranking provided by Alexa. We excluded from this selection:

web sites providing different content depending on whether the user is authenticated,
social network web sites,
search engines.

For each of these 20 sites, we crawled the site up to the 10th level of depth and saved all the URLs obtained.

File structure

The data is a CSV file with the following columns:

url, the actual URL
compromissionType, a categorical value: "phishing" or "defacement" if the URL belongs to a web site compromised by phishing or defacement, respectively, or the empty string if the web site is not compromised
isHiddenFraudulent, a boolean value which is true if the URL is hidden and fraudulent
contentLength, an integer value corresponding to the Content-Length header value obtained by sending an HTTP HEAD request to the URL
serverType, a string corresponding to the X-Server-Type header value, obtained as above
poweredBy, a string corresponding to the X-Powered-By header value, obtained as above
contentType, a string corresponding to the Content-Type header value, obtained as above
lastModified, a date in the format "Thu, 17 Jan 2013 18:49:04 GMT" corresponding to the Last-Modified header value, obtained as above
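
As an illustration of how these columns might be read, here is a minimal sketch using Python's standard library. Several details are assumptions, not facts about the file: that it has a header row with the column names listed above, that booleans are written as "true"/"false", and that missing header values are stored as empty strings. The sample row is made up for the example and is not taken from the dataset.

```python
import csv
import io
from email.utils import parsedate_to_datetime

def parse_date(text):
    """Parse an HTTP-style date; return None if it is empty or malformed."""
    try:
        return parsedate_to_datetime(text) if text.strip() else None
    except (TypeError, ValueError):
        return None

def parse_row(row):
    """Convert one CSV row (a dict of strings) into typed values."""
    length = row.get("contentLength", "").strip()
    return {
        "url": row["url"],
        # Empty string means "not compromised"; mapped to None here.
        "compromissionType": row.get("compromissionType") or None,
        # Assumed encoding: the literal string "true" (case-insensitive).
        "isHiddenFraudulent": row.get("isHiddenFraudulent", "").strip().lower() == "true",
        "contentLength": int(length) if length.isdigit() else None,
        "lastModified": parse_date(row.get("lastModified", "")),
    }

# Hypothetical sample standing in for the real file.
sample = io.StringIO(
    "url,compromissionType,isHiddenFraudulent,contentLength,"
    "serverType,poweredBy,contentType,lastModified\n"
    "http://example.com/a.html,phishing,true,1024,Apache/2.2,PHP/5.3,"
    'text/html,"Thu, 17 Jan 2013 18:49:04 GMT"\n'
)
rows = [parse_row(r) for r in csv.DictReader(sample)]
```

To read the actual file, the `io.StringIO` object would be replaced by an opened file handle; the column names and encodings should be checked against the file first.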

The serverType and poweredBy values have been preprocessed so as to keep only the framework name and the major and minor version number (e.g., Apache/2.2.22-12 becomes Apache/2.2).
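
The preprocessing described above could be approximated with a regular expression. This is a sketch of one possible implementation, not the code actually used for the dataset; the function name `normalize_version` is hypothetical, and leaving values without a "name/major.minor" prefix unchanged is our assumption.

```python
import re

# Framework name, a slash, then major and minor version numbers.
_VERSION = re.compile(r"^([^/\s]+)/(\d+)\.(\d+)")

def normalize_version(value):
    """Keep only 'name/major.minor'; return the value unchanged if it does not match."""
    match = _VERSION.match(value.strip())
    if not match:
        return value
    name, major, minor = match.groups()
    return f"{name}/{major}.{minor}"

print(normalize_version("Apache/2.2.22-12"))  # prints Apache/2.2
```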

download