Exploring internal correlations in timing features of keystroke dynamics at word boundaries and their usage for authentication and identification
Dataset used in the article "Exploring internal correlations in timing features of keystroke dynamics at word boundaries and their usage for authentication and identification". It contains CSV files with the timing features (hold times and flight times) for the most frequent word of each length appearing in three free-text datasets used in previous studies by the author (LSIA) and by two other, unrelated groups (KM and PROSODY, the latter subdivided into GAY, GUN, and REVIEW). Two languages are represented: Spanish in LSIA, and English in KM and PROSODY. The original dataset KM was used to compare anomaly-detection algorithms for keystroke dynamics in the article "Comparing anomaly-detection algorithms for keystroke dynamics" by Killourhy, K.S. and Maxion, R.A. The original dataset PROSODY was used to find cues of deceptive intent by analyzing variations in typing patterns in the article "Keystroke patterns as prosody in digital writings: A case study with deceptive reviews and essays" by Banerjee, R., Feng, S., Kang, J.S., and Choi, Y.
Steps to reproduce
To reduce some sources of noise for this exploratory study, short typing sessions were filtered out, keeping only those exceeding 150 keystrokes. Any timing parameter, whether hold time or flight time, exceeding 1500 milliseconds was considered a natural or artificial pause and used to split the sessions. A second filtering stage was then applied: values exceeding three times the moving average of the latest values, even if below 1500 ms, were also treated as pauses and used to split the sessions before extracting individual words.

The resulting filtered sessions were split at word boundaries, using spaces and punctuation marks. Only purely alphabetic sequences were kept, discarding alphanumeric or mixed key sequences such as "abc123" or "#blue5ky". For each length, the most frequent word was found, and every instance of it in each user session was extracted together with its timing parameters, hold times and flight times. The result is a set of tabular CSV files with a fixed number of columns containing, for each observation of the most frequent word of length $n$ in each dataset, the user and $2n-1$ timing parameters: $n$ hold times and $n-1$ flight times.

Training sets for authentication tasks were generated using, for the most frequent word of each length in each dataset, the instances of the user with the most instances, flagged as legitimate, and a random sample (without replacement) of the same size drawn from the instances of all other users, flagged as impostors. In this way, the binary classification problem remained balanced by design. To avoid the biases introduced by imbalanced classes in identification tasks, the number of instances per user for each most frequent word was cut down so that every user had the same number of instances, equal to that of the user with the fewest.
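The preprocessing steps above can be sketched as follows. This is a minimal illustration, not the authors' code: the thresholds (150 keystrokes, 1500 ms, a factor of 3 over the moving average) come from the text, while the moving-average window size, function names, and data layout are assumptions made here for clarity.

```python
import random
from statistics import mean

# Thresholds taken from the text; MA_WINDOW is an assumption.
MIN_SESSION_LEN = 150
PAUSE_MS = 1500
MA_FACTOR = 3
MA_WINDOW = 10

def split_at_pauses(timings):
    """Split a sequence of timing values (ms) wherever a pause is detected:
    an absolute value above PAUSE_MS, or a value exceeding MA_FACTOR times
    the moving average of the latest MA_WINDOW values."""
    segments, current = [], []
    for t in timings:
        window = current[-MA_WINDOW:]
        if t > PAUSE_MS or (window and t > MA_FACTOR * mean(window)):
            if current:
                segments.append(current)
            current = []
        else:
            current.append(t)
    if current:
        segments.append(current)
    return segments

def balanced_training_set(instances_by_user, rng=random):
    """Balanced authentication set: all instances of the user with the most
    instances (label 1, legitimate) plus an equally sized random sample
    without replacement from all other users (label 0, impostors)."""
    legit_user = max(instances_by_user, key=lambda u: len(instances_by_user[u]))
    legit = [(x, 1) for x in instances_by_user[legit_user]]
    others = [x for u, xs in instances_by_user.items() if u != legit_user for x in xs]
    impostors = [(x, 0) for x in rng.sample(others, len(legit))]
    return legit + impostors

def equalize_for_identification(instances_by_user):
    """Cut every user down to the same number of instances: the count of
    the user with the fewest."""
    k = min(len(xs) for xs in instances_by_user.values())
    return {u: xs[:k] for u, xs in instances_by_user.items()}
```

Sessions shorter than `MIN_SESSION_LEN` keystrokes would be discarded before `split_at_pauses` is applied; word extraction and CSV assembly (one row per observation: user plus the $2n-1$ timing columns) are omitted here for brevity.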