TrackSign-labeled web tracking dataset

Bibliographic Details
Published in: Computer networks (Amsterdam, Netherlands : 1999), 2023-05, Vol.226, p.109687, Article 109687
Main Authors: Castell-Uroz, Ismael; Barlet-Ros, Pere
Format: Article
Language: English
Description
Summary: Recent studies [8] show that more than 95% of the websites available on the Internet contain at least one so-called web tracking system. These systems specialize in identifying their users through a wide variety of methods. Some of them (e.g., cookies) are well known to most Internet users. However, the percentage of websites including more "obscure" and privacy-threatening systems, such as fingerprinting methods that identify a user's computer, is constantly increasing. Detecting those methods on today's Internet is very difficult, as almost every website modifies its content dynamically and minifies its code in order to speed up loading times. This minification and dynamic behavior render the website code unreadable by humans. Thus, the research community is constantly looking for new ways to discover unknown web tracking systems running under the hood. In this paper, we present a new dataset containing tracking information for more than 76 million URLs and 45 million online resources, extracted from 1.5 million popular websites. The tracking labels were produced using a state-of-the-art web tracking discovery algorithm called TrackSign [8]. The dataset also contains information about online security and the relations between the domains, the loaded URLs, and the online resource behind each URL. This information can be useful for different kinds of experiments, such as locating privacy-threatening resources, identifying security threats, or determining characteristics of the URL network graph.
ISSN: 1389-1286, 1872-7069
DOI: 10.1016/j.comnet.2023.109687