A dataset of labelled device Wi-Fi probe requests for MAC address de-randomization - 2021
Description
A Wi-Fi client device can perform a passive scan to detect wireless networks within its radio range, looking for Beacon Frames, i.e., packets issued by the Access Points (APs) to signal their presence. Alternatively, the client can speed up this process by actively searching for a network connection; to do so, it periodically transmits Probe Request messages, which are management frames defined by the IEEE 802.11 standard. The process by which these messages are captured is called sniffing. Sniffing can be performed via a Wi-Fi interface set in monitor mode and tuned to the same channel (or an adjacent channel) where the transmission occurred. Management messages are not encrypted, so they can be used to implement device counting algorithms based on MAC address analysis. However, to prevent device owners from being tracked, major operating system producers have developed functionalities for MAC address randomization. Devices that change their physical address periodically and randomly challenge counting algorithms, which must then perform an additional address de-randomization step, i.e., cluster the probe requests according to the source device by analysing appropriate message features. To the best of our knowledge, our dataset is the only one available with labelled Wi-Fi probe requests, i.e., with an indication of the emitting source. To obtain the labels, the data has been collected either in an isolated environment (the anechoic chamber of our department) or in a "noisy" environment (a room without particular shielding, but with no other sources of probe requests within a radius of two meters). The first type of data is published after removing only the packets originating from the MAC address of the Raspberry Pi embedded interface; the second type of data has been additionally filtered to simulate the anechoic chamber environment. Each capture file has a duration of 20 minutes and covers three non-overlapping channels (channels 1, 6, and 11) simultaneously.
The dataset contains Probe Requests from 22 different devices, each observed separately in 6 different modes, including settings based on display status, Wi-Fi connection, and power saving. We collected 315 non-empty files in total; captures that were empty after filtering were removed. The device used for the capture is a Raspberry Pi with three Wi-Fi dongle interfaces installed, each used to collect data from one channel. The main characteristic of the dataset is its subdivision by device, which enables a more accurate analysis of the behavior of individual devices in different modes. Moreover, the labelled data can be used to train Machine Learning algorithms or to verify the correct functioning of algorithms that aim to count devices through probe request analysis in the presence of random MAC addresses. Note: In version 2, all device directories have been moved inside the folder "Individual devices" and renamed. Moreover, we added the link to a new database published in 2024.
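As a small illustration of the randomization issue described above, the sketch below checks whether a source MAC address has the locally administered bit set (the second least significant bit of the first octet). Randomized addresses are normally locally administered, but a set bit alone does not prove that an address was randomized, so this is only a first heuristic and not part of the published tools. The example addresses are hypothetical and are not taken from the dataset.

```python
def is_locally_administered(mac: str) -> bool:
    """Return True if the locally administered (U/L) bit of the MAC is set."""
    # Accept both "aa:bb:..." and "aa-bb-..." notations.
    first_octet = int(mac.replace("-", ":").split(":")[0], 16)
    # Bit 0x02 of the first octet marks a locally administered address.
    return bool(first_octet & 0x02)


# Hypothetical example addresses, not taken from the dataset.
for mac in ("da:a1:19:00:00:01", "b8:27:eb:00:00:01"):
    kind = "locally administered" if is_locally_administered(mac) else "globally unique"
    print(mac, "->", kind)
```

Because all probe requests in a capture file come from one known device, such a check can be used to see how often that device switches between its global address and locally administered ones in each mode.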
Steps to reproduce
Data was acquired via a Raspberry Pi 3 (Model B+) running a Raspbian Operating System (OS) downloaded from the official repository. The accessories connected to the Raspberry via the USB ports are three Wi-Fi dongles and a keyboard. A 3.5-inch touch display was connected to the Raspberry Pi pins. This screen was selected for its portability, but any other screen and mouse can be used instead. The Raspberry's built-in Wi-Fi interface does not support monitor mode, so its configuration has not been changed, and it can still connect to a network with internet access to synchronize the calendar and the clock of the OS. To reproduce such a dataset, one must either use a completely interference-free environment (an anechoic chamber) or properly prepare a noisy environment by removing all sources of Wi-Fi messages in the vicinity. To verify the significance of interference, we collected data in noisy environments at various distances from Probe Request sources, empirically assessing that at distances of two metres or more, the power detected in the packets of these sources was below -60 dBm, allowing us to remove them through our filtering algorithm with a -40 dBm threshold. We developed and published on GitHub two algorithms specifically designed for the generation of this dataset: a sniffing algorithm (https://github.com/luciapintor/WiFi-Sniffer) and a filtering one (https://github.com/luciapintor/SnifferFiltering). Our sniffing algorithm has been used to configure monitor mode on all interfaces, to set each one to listen on a specific Wi-Fi channel, and to perform the data acquisition. After setting up the environment (e.g., removing all Wi-Fi devices except the Raspberry and the device to be analyzed), we started the capture script and waited for its completion. This script starts sniffing simultaneously on three different interfaces, each set to a different channel, and saves the output in pcap files.
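The interface setup described above can be sketched as follows. This is a minimal illustration and not the code of the WiFi-Sniffer repository: it only builds the standard Linux `ip` and `iw` commands that put an interface in monitor mode and tune it to a channel. The interface names (`wlan1`, `wlan2`, `wlan3`) are assumptions and depend on the system; actually running the commands requires root privileges.

```python
from typing import Dict, List

# Hypothetical interface names for the three dongles; check yours with `iw dev`.
CHANNEL_BY_INTERFACE: Dict[str, int] = {"wlan1": 1, "wlan2": 6, "wlan3": 11}


def monitor_mode_commands(interface: str, channel: int) -> List[List[str]]:
    """Build the commands that set one interface to monitor mode on a channel."""
    return [
        ["ip", "link", "set", interface, "down"],            # interface must be down
        ["iw", "dev", interface, "set", "type", "monitor"],  # switch to monitor mode
        ["ip", "link", "set", interface, "up"],
        ["iw", "dev", interface, "set", "channel", str(channel)],
    ]


# Print the commands instead of executing them, so the sketch has no side effects.
for iface, channel in CHANNEL_BY_INTERFACE.items():
    for command in monitor_mode_commands(iface, channel):
        print(" ".join(command))
```

Each command list could be passed to `subprocess.run(command, check=True)` on the capture device; it is printed here so the configuration can be reviewed first.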
After the capture, we ran the filtering algorithm on the capture files. We removed the packets emitted by the Raspberry (easily identifiable because they use its factory MAC address) and the packets emitted by Access Points (whose MAC addresses we detected because they transmit probe requests, beacon frames and probe responses), and then performed additional steps to simulate anechoic chamber capture conditions via power thresholds that exploit the burst structure of the probe requests.
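The filtering logic can be illustrated with the simplified sketch below. It is one possible reading of "power thresholds that exploit the burst structure", not the SnifferFiltering implementation: packets from excluded MAC addresses are dropped, the remaining packets of each source are grouped into bursts, and a burst is kept only if its strongest packet reaches the threshold. The `Probe` record, the 1-second burst gap, and the rule for keeping a whole burst are assumptions made for illustration; only the -40 dBm threshold comes from the description above.

```python
from collections import defaultdict
from typing import Iterable, List, NamedTuple


class Probe(NamedTuple):
    timestamp: float  # capture time in seconds
    src_mac: str      # source MAC address
    rssi_dbm: int     # received power in dBm


def split_bursts(probes: Iterable[Probe], max_gap: float) -> List[List[Probe]]:
    """Group the probes of one source into bursts separated by gaps > max_gap."""
    bursts: List[List[Probe]] = []
    for probe in sorted(probes, key=lambda p: p.timestamp):
        if bursts and probe.timestamp - bursts[-1][-1].timestamp <= max_gap:
            bursts[-1].append(probe)
        else:
            bursts.append([probe])
    return bursts


def filter_capture(probes, excluded_macs, threshold_dbm=-40, max_gap=1.0):
    """Drop excluded sources, then keep bursts whose peak power >= threshold."""
    excluded = {mac.lower() for mac in excluded_macs}
    by_mac = defaultdict(list)
    for probe in probes:
        if probe.src_mac.lower() not in excluded:
            by_mac[probe.src_mac.lower()].append(probe)
    kept: List[Probe] = []
    for mac_probes in by_mac.values():
        for burst in split_bursts(mac_probes, max_gap):
            # Assumption: one strong packet is enough to keep the whole burst.
            if max(p.rssi_dbm for p in burst) >= threshold_dbm:
                kept.extend(burst)
    return sorted(kept, key=lambda p: p.timestamp)


# Hypothetical records, not taken from the dataset.
probes = [
    Probe(0.00, "da:a1:19:00:00:01", -35),  # device under test
    Probe(0.02, "da:a1:19:00:00:01", -45),  # same burst, weaker packet
    Probe(5.00, "3e:11:22:00:00:02", -65),  # distant source
    Probe(6.00, "b8:27:eb:00:00:01", -20),  # capture device itself
]
kept = filter_capture(probes, excluded_macs={"b8:27:eb:00:00:01"})
print([(p.src_mac, p.rssi_dbm) for p in kept])
```

Deciding per burst rather than per packet keeps weaker packets of a nearby device while still discarding distant sources whose bursts never reach the threshold.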