A dataset of labelled device Wi-Fi probe requests for MAC address de-randomization - 2021
A Wi-Fi client device can perform a passive scan to detect wireless networks within its radio range by listening for Beacon Frames, i.e., packets issued by Access Points (APs) to signal their presence. Alternatively, the client can speed up this process by actively searching for a network: to do this, it periodically transmits Probe Request messages, which are management frames defined by the IEEE 802.11 standard. The process by which these messages are captured is called sniffing and can be performed via a Wi-Fi interface set in monitor mode and tuned to the same channel (or an adjacent channel) on which the transmission happened. Neither kind of message is encrypted, so both can be used to implement device counting algorithms based on MAC address analysis. However, to prevent the tracking of device owners, the major operating system vendors have introduced MAC address randomisation. Devices that change their physical address periodically and randomly challenge counting algorithms, which must then perform an additional address de-randomization step, i.e., cluster the probe requests by source device by analysing appropriate message features. This problem is not straightforward, and further research is needed to de-randomize traces reliably. To the best of our knowledge, our dataset is the only one available with labelled Wi-Fi probe requests, i.e., with an indication of the emitting source. To obtain the labels, the data was collected either in an isolated environment (the anechoic chamber of our department) or in a "noisy" environment (a chamber without particular shielding, but in any case without other sources of probe requests within a radius of two metres). The first type of data is published after removing the packets carrying the MAC address of the Raspberry's embedded interface; the second type of data has been filtered in order to simulate the anechoic chamber environment.
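As a small illustration of what counting algorithms have to deal with, the sketch below checks whether a MAC address has the locally administered bit set (bit 0x02 of the first octet). Randomized addresses normally set this bit, although a set bit alone does not prove that an address is random. This check is not described in the dataset documentation; the function name and the sample addresses are made up for the example.

```python
def is_locally_administered(mac: str) -> bool:
    """Return True if the locally administered bit of the first octet is set.

    Assumption for this sketch: the address is written as six
    colon-separated hexadecimal octets, e.g. "da:a1:19:00:00:01".
    """
    first_octet = int(mac.split(":")[0], 16)
    return bool(first_octet & 0x02)


# Hypothetical sample addresses, not taken from the dataset.
samples = ["da:a1:19:00:00:01", "b8:27:eb:12:34:56"]
flags = {mac: is_locally_administered(mac) for mac in samples}
```

Addresses flagged this way are the ones a de-randomization algorithm would need to cluster by source device; globally administered addresses can be counted directly.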
Each capture file has a duration of 20 minutes and covers three non-overlapping channels (1, 6 and 11) simultaneously. The present dataset contains Probe Requests from 22 different devices, each observed separately in 6 different modes, which include settings based on display status, Wi-Fi connection and power saving. We collected 315 non-empty files in total; captures that were completely empty after filtering were removed. The device used for the capture is a Raspberry Pi with three Wi-Fi dongle interfaces installed, each used to collect data from one channel. The main characteristic of the dataset is its subdivision by device, which allows a more accurate analysis of the behavior of individual devices in different modes. Moreover, the labelled data can be used to train Machine Learning algorithms or to verify the correct functioning of algorithms that aim to count devices through probe request analysis in the presence of random MAC addresses.
Steps to reproduce
Data was acquired via a Raspberry Pi 3 (Model B+) running the Raspbian operating system (OS) downloaded from the official repository. The accessories connected to the Raspberry via the USB ports are three Wi-Fi dongles and a keyboard. A 3.5-inch touch display was connected to the Raspberry pins. This screen was selected for its portability, but any other screen and mouse can be used as an alternative. The Raspberry's built-in Wi-Fi interface does not support monitor mode, so its configuration was left unchanged and it can connect to a network with internet access to synchronise the calendar and the clock of the OS. To reproduce such a dataset, either select a completely interference-free environment (an anechoic chamber) or properly prepare a noisy environment by removing all sources of Wi-Fi messages in the vicinity. To assess the significance of interference, we collected data in noisy environments at various distances from Probe Request sources and found empirically that, at distances of two metres or more, the power detected in the packets of these sources was below -60 dBm, which allowed us to remove them through our filtering algorithm with a -40 dBm threshold. We developed and published on GitHub two algorithms specifically designed for the generation of this dataset: a sniffing algorithm (https://github.com/luciapintor/WiFi-Sniffer) and a filtering one (https://github.com/luciapintor/SnifferFiltering). Our sniffing algorithm was used to configure monitor mode on all interfaces, to set each of them to listen on a specific Wi-Fi channel and to perform the data acquisition. After setting up the environment (e.g., removing all Wi-Fi devices except the Raspberry and the device to be analyzed), we started the capture script and waited for its completion. This script starts the sniffing simultaneously on three different interfaces, each set to a different channel, and saves the output in pcap files.
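The capture setup described above can be sketched as follows. This is not the published WiFi-Sniffer code: it is a minimal illustration that assumes the standard Linux tools ip, iw, timeout and tcpdump are installed, and the interface names (wlan1 to wlan3) and output file names are hypothetical. The functions only build the commands; run_capture, which needs root privileges, is defined but not called.

```python
import subprocess

# Hypothetical dongle names; the channels and duration come from the description.
CHANNEL_BY_INTERFACE = {"wlan1": 1, "wlan2": 6, "wlan3": 11}
CAPTURE_SECONDS = 20 * 60


def monitor_mode_commands(iface: str, channel: int) -> list[list[str]]:
    """Commands that put one interface in monitor mode on a fixed channel."""
    return [
        ["ip", "link", "set", iface, "down"],
        ["iw", "dev", iface, "set", "type", "monitor"],
        ["ip", "link", "set", iface, "up"],
        ["iw", "dev", iface, "set", "channel", str(channel)],
    ]


def capture_command(iface: str, channel: int,
                    seconds: int = CAPTURE_SECONDS) -> list[str]:
    """Command that records all frames seen on one interface to a pcap file."""
    output = f"capture_ch{channel}.pcap"  # assumed naming scheme
    return ["timeout", str(seconds), "tcpdump", "-i", iface, "-w", output]


def run_capture() -> None:
    """Configure every interface, then capture on all of them in parallel."""
    for iface, channel in CHANNEL_BY_INTERFACE.items():
        for cmd in monitor_mode_commands(iface, channel):
            subprocess.run(cmd, check=True)
    procs = [subprocess.Popen(capture_command(iface, channel))
             for iface, channel in CHANNEL_BY_INTERFACE.items()]
    for proc in procs:
        proc.wait()  # timeout ends each tcpdump after CAPTURE_SECONDS
```

Starting one process per interface is what keeps the three channel captures aligned in time; capturing the channels one after another would not observe the same probe bursts.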
After that, we ran the filtering algorithm on the capture files to remove the packets emitted by the Raspberry (which are easily identifiable because they use its factory MAC address) and the packets emitted by Access Points (whose MAC addresses we detected because they transmit probe requests, beacon frames and probe responses), and we performed an additional step to simulate anechoic chamber capture conditions via power thresholds that exploit the burst structure of the probe requests.
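The filtering steps above can be illustrated with a short sketch. It is not the published SnifferFiltering code, and the description does not say exactly how the bursts are used. The sketch therefore assumes that packets from the same source MAC separated by at most one second form a burst, and that a burst is kept when its strongest packet reaches the -40 dBm threshold. The Packet class, the one-second gap and the sample values are hypothetical; parsing the pcap files is left out.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    timestamp: float  # seconds since the start of the capture
    src_mac: str
    rssi_dbm: int


POWER_THRESHOLD_DBM = -40  # threshold quoted in the description
BURST_GAP_SECONDS = 1.0    # assumption: maximum gap inside one burst


def split_into_bursts(packets, max_gap=BURST_GAP_SECONDS):
    """Group packets from the same source MAC that are close in time."""
    bursts = []
    last_burst_by_mac = {}
    for pkt in sorted(packets, key=lambda p: p.timestamp):
        burst = last_burst_by_mac.get(pkt.src_mac)
        if burst is not None and pkt.timestamp - burst[-1].timestamp <= max_gap:
            burst.append(pkt)
        else:
            burst = [pkt]
            bursts.append(burst)
            last_burst_by_mac[pkt.src_mac] = burst
    return bursts


def filter_capture(packets, excluded_macs, threshold=POWER_THRESHOLD_DBM):
    """Drop excluded sources, then keep only bursts with a strong packet."""
    excluded = {mac.lower() for mac in excluded_macs}
    candidates = [p for p in packets if p.src_mac.lower() not in excluded]
    kept = []
    for burst in split_into_bursts(candidates):
        if max(p.rssi_dbm for p in burst) >= threshold:
            kept.extend(burst)
    return sorted(kept, key=lambda p: p.timestamp)


# Hypothetical packets: a nearby device, the sniffer itself, a distant device.
packets = [
    Packet(0.00, "da:a1:19:00:00:01", -35),
    Packet(0.02, "da:a1:19:00:00:01", -45),
    Packet(0.50, "b8:27:eb:12:34:56", -20),
    Packet(3.00, "5e:10:22:00:00:02", -65),
]
filtered = filter_capture(packets, excluded_macs={"b8:27:eb:12:34:56"})
```

Deciding per burst rather than per packet keeps the weaker frames of a nearby device, such as the -45 dBm packet in the example, which a plain per-packet threshold would discard.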