A dataset of labelled device Wi-Fi probe requests for MAC address de-randomization - 2024

Published: 1 July 2024 | Version 1 | DOI: 10.17632/5tvnwhsj2p.1


A Wi-Fi client device can perform an active scan to speed up the connection process by transmitting Probe Request messages periodically. These are management frames of the IEEE 802.11 standard. The process of capturing these messages is called sniffing and can be performed using a Wi-Fi interface set in monitor mode and tuned to the same channel (or an adjacent channel) where the transmission occurred. Since these messages are not encrypted, they can be used to implement device counting algorithms based on MAC address analysis. However, to prevent tracking of device owners, major operating system producers have developed MAC address randomization functionalities. Devices that periodically and randomly change their physical address pose a challenge to counting algorithms, which must then perform additional steps to cluster probe requests according to the source device through analysis of appropriate message features.

Our dataset is divided into two parts:

- Anechoic Chamber Data Collection: Data was collected from 22 devices simultaneously in a controlled environment (anechoic chamber) to ensure the absence of external interference. All devices kept the Wi-Fi interface active and the display switched off. Data was collected only on channel 6 for 30 minutes. This data is stored in the "Anechoic chamber" folder and the "Anechoic chamber - info.xlsx" file contains device information.
- Individual Device Data Collection: Data was collected from 18 individual devices on three channels simultaneously and in six different modes, including settings based on display status, Wi-Fi connection, and power saving. Collecting data from individual devices allows for labelling them and associating them with their emitting source. The data was collected in "noisy" environments (a chamber without particular shielding but devoid of other probe request sources within a two-meter radius). Data is filtered to simulate the anechoic chamber environment. Capture files last 30 minutes and cover three non-overlapping channels (1, 6, and 11) simultaneously. This data is stored in the "Individual devices" folder and the "Individual devices - info.xlsx" file contains device information.

We collected a total of 215 non-empty files, removing captures that were empty after filtering. The capture device used is a Raspberry Pi with three Wi-Fi dongle interfaces, each assigned to collect data from a specific channel. The main characteristic of this dataset is the subdivision by device, allowing for a more accurate analysis of individual device behaviour in different modes. Additionally, the labelled data can be used to train Machine Learning algorithms or to verify the correct functioning of algorithms aimed at counting devices through probe request analysis in the presence of random MAC addresses.
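As a small illustration of why MAC-based counting breaks down, the sketch below separates globally unique (factory) addresses from locally administered ones by testing the second least significant bit of the first octet, which randomized addresses normally have set. The function names and the sample addresses are hypothetical and are not part of the dataset or its tooling. A set bit only indicates a locally administered address; it does not by itself prove that the address was randomized.

```python
def is_locally_administered(mac: str) -> bool:
    """Return True if the locally administered (U/L) bit of the address is set."""
    first_octet = int(mac.replace("-", ":").split(":")[0], 16)
    return bool(first_octet & 0b10)


def count_addresses(macs):
    """Count unique source addresses, split by the U/L bit (case-insensitive)."""
    unique = {m.lower() for m in macs}
    local = {m for m in unique if is_locally_administered(m)}
    return {"global": len(unique - local), "locally_administered": len(local)}


# Hypothetical source addresses, as they might be read from probe requests.
sources = ["DA:A1:19:00:11:22", "da:a1:19:00:11:22", "3C:5A:B4:01:02:03"]
print(count_addresses(sources))  # {'global': 1, 'locally_administered': 1}
```

A naive counter would treat every locally administered address as a separate device, which is the overestimate that labelled captures such as these can help quantify.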


Steps to reproduce

Data was acquired via a Raspberry Pi 3 (Model B+) running the Raspbian operating system (OS) downloaded from the official repository. The Raspberry Pi was connected to three Wi-Fi dongles and a keyboard via its USB ports and to a monitor via its HDMI interface. Since the built-in Wi-Fi interface of the Raspberry Pi does not support monitor mode, its configuration was left unchanged. To replicate this dataset, it is essential to either select an interference-free environment (anechoic chamber) or properly prepare a noisy environment by removing all sources of Wi-Fi messages nearby, as described in the procedure to replicate our previous dataset: https://data.mendeley.com/datasets/j64btzdsdy/1

We developed and published on GitHub two algorithms specifically designed for the generation of this dataset: a sniffing algorithm ( https://github.com/luciapintor/WiFi-Sniffer ) and a filtering one ( https://github.com/luciapintor/SnifferFiltering ). The sniffing algorithm was used to configure monitor mode on all interfaces, assign each to a specific Wi-Fi channel, and perform data acquisition.

After setting up the environment (e.g., removing all Wi-Fi devices except the Raspberry Pi and the device to be analysed), we initiated the capture script and waited for its completion. This script starts sniffing simultaneously on three different interfaces, each set to a different channel, and saves the output in pcap files. Subsequently, we ran the filtering algorithm on the capture files. It removes packets emitted by the Raspberry Pi (easily identifiable by its factory MAC address) and packets from Access Points (identified by their beacon frames), then performs additional steps to simulate anechoic chamber conditions using power thresholds that exploit the burst structure of probe requests.
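The filtering steps described above can be sketched roughly as follows. This is not the published SnifferFiltering code, which remains the reference implementation. The `Frame` record, the function names, the -60 dBm threshold, the 1-second burst gap, and the rule "keep a burst if its strongest frame reaches the threshold" are all assumptions made for illustration. Addresses are assumed to be already lower-cased.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    timestamp: float  # seconds since capture start
    src: str          # source MAC address (lower case)
    kind: str         # "probe_req", "beacon", ...
    rssi: int         # received signal strength in dBm


def split_bursts(frames, max_gap=1.0):
    """Group frames of one source into bursts separated by more than max_gap seconds."""
    bursts = []
    for frame in sorted(frames, key=lambda f: f.timestamp):
        if bursts and frame.timestamp - bursts[-1][-1].timestamp <= max_gap:
            bursts[-1].append(frame)
        else:
            bursts.append([frame])
    return bursts


def filter_capture(frames, sniffer_mac, min_rssi=-60, max_gap=1.0):
    """Keep probe requests that are likely to come from the nearby device under test."""
    # Any source that sent a beacon is treated as an Access Point.
    ap_macs = {f.src for f in frames if f.kind == "beacon"}
    by_src = {}
    for f in frames:
        if f.kind != "probe_req" or f.src == sniffer_mac or f.src in ap_macs:
            continue
        by_src.setdefault(f.src, []).append(f)
    kept = []
    for src_frames in by_src.values():
        for burst in split_bursts(src_frames, max_gap):
            # Assumed rule: one strong frame is enough to keep the whole burst.
            if max(f.rssi for f in burst) >= min_rssi:
                kept.extend(burst)
    return sorted(kept, key=lambda f: f.timestamp)


SNIFFER = "b8:27:eb:00:00:01"  # hypothetical factory MAC of the capture device
frames = [
    Frame(0.00, "aa:bb:cc:00:00:01", "probe_req", -40),
    Frame(0.02, "aa:bb:cc:00:00:01", "probe_req", -65),  # weak, but in a strong burst
    Frame(5.00, "aa:bb:cc:00:00:02", "probe_req", -80),  # distant source
    Frame(6.00, "aa:bb:cc:00:00:03", "beacon", -50),
    Frame(6.50, "aa:bb:cc:00:00:03", "probe_req", -45),  # sent by an Access Point
    Frame(7.00, SNIFFER, "probe_req", -30),              # sent by the sniffer itself
]
kept = filter_capture(frames, SNIFFER)
print(len(kept))  # 2
```

Deciding per burst rather than per frame is the assumed design choice here: frames of one burst come from the same device within a short interval, so a single strong frame suggests the whole burst originated nearby even if some of its frames were received at lower power.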


Università degli Studi di Cagliari


Telecommunication Engineering, Internet, Machine Learning, Wireless Network, Media Access Control, Crowd Analysis


Ministero dello Sviluppo Economico

Cagliari Digital Lab - G27F22000040008

Ministero dell'Università e della Ricerca

Sustainable Mobility Center - 00000023