Dataset of Human-written and Synthesized Samples of Free-Text Keystroke Dynamics to Evaluate Liveness Detection Methods
Description
This dataset comprises human-written samples of free-text keystroke dynamics, in the form of sentences in natural language, together with their same-text counterparts synthesized using a variety of methods and degrees of partial knowledge of the legitimate user’s behavior. The human-written samples originate from three publicly available datasets that have been used in several keystroke dynamics studies, while the corresponding synthesized samples, which share the same keystroke sequences, have been forged using the methods presented in Nahuel González, Enrique P. Calot, Jorge S. Ierache, Waldo Hasperué, Towards liveness detection in keystroke dynamics: Revealing synthetic forgeries, Systems and Soft Computing, Volume 4, 2022, 200037, ISSN 2772-9419, https://doi.org/10.1016/j.sasc.2022.200037.
The source datasets for human-written samples are those of Killourhy and Maxion [2], González and Calot [3], and Banerjee et al. [4]. The first was used to determine whether composition and transcription tasks produce equivalent results when verifying the identity of the user. The second was used to evaluate a free-text keystroke dynamics authentication method. The third was used to find clues of deceptive intent by analyzing variations in typing patterns.
For each human-written sample of each source dataset, synthetic samples sharing the same keystroke sequence were created with five different methods and included in the dataset presented here. The objective was to evaluate a liveness detection method that could tell apart the legitimate human from a synthetic forgery of their behavior [1]. For each method, five user profiles were used to create the forgeries, representing the amount of partial knowledge of the legitimate user’s keystroke dynamics an attacker might have. These were a between-subject profile, in which only samples from users other than the target were available to the attacker, and four within-subject profiles, ranging in size from only 100 keystrokes to all the past samples of the legitimate user.
NOTE FOR VERSION 2: The dataset is the same as version 1, but the compression and archiving format has been changed at the request of the editors of Data in Brief. The original archive for the dataset was RAR, but it was reuploaded as a ZIP file because the former is not an open format.
Steps to reproduce
An effort has been made to normalize key codes across the three datasets to fit the Microsoft Windows Virtual Key Codes technical specification. However, the reader must be warned that inaccuracies might remain. Most verification methods are oblivious to the key code and should be unaffected, but some might treat key groups, such as alphabetic, numeric, or special keys, in a way that impacts the performance of the method. Caution should be exercised in this latter case.
When the source datasets recorded a list of keystroke events instead of keystroke timings, the events have been converted to hold times (interval between the key press and key release events) and flight times (interval between successive key press events), rounded to the nearest millisecond, to fit the format described in Section 1. Once again, the capture tool used during the collection of the source datasets might have influenced the precision of the timing values. For example, most Windows keyloggers return key event timing values with a granularity of 16 ms.
During the collection of the source datasets, pauses while writing were recorded together with the natural flow of typing. All hold times and flight times exceeding 1500 ms have been removed and replaced with the negative value -1 in the CSV files that are included in the dataset.
Typing sessions were split into sentences using punctuation, and sentences shorter than 20 characters, or with fewer than 50% of their keystrokes corresponding to alphanumeric characters, were not included in this dataset. The objective of this filtering was to remove keystroke sequences mostly comprising special keys, which do not represent the natural typing rhythm of the user and could introduce a negative performance bias when this dataset is used to evaluate keystroke dynamics verification methods.
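The conversion and filtering steps described above can be illustrated with a minimal Python sketch. It is not the tool used to build the dataset. The event layout (key code, press time, release time), the function names, the use of None for the first keystroke's flight time, the treatment of sentence length as a keystroke count, and the choice of which virtual key codes count as alphanumeric (0x30 to 0x39 and 0x41 to 0x5A) are all assumptions made for illustration. Only the 1500 ms threshold, the -1 marker, the 20-character minimum, and the 50% alphanumeric ratio come from the description.

```python
from typing import List, Optional, Tuple

MAX_INTERVAL_MS = 1500  # threshold stated in the description
REMOVED = -1            # marker stated in the description

# Assumed event layout: (virtual_key_code, press_time_ms, release_time_ms)
Event = Tuple[int, float, float]
Timing = Tuple[int, int, Optional[int]]


def to_timings(events: List[Event]) -> List[Timing]:
    """Convert key events to (key_code, hold_ms, flight_ms) rows."""
    rows: List[Timing] = []
    prev_press: Optional[float] = None
    for vk, press, release in events:
        # Hold time: interval between key press and key release.
        hold = round(release - press)
        # Flight time: interval between successive key press events.
        # The first keystroke has no predecessor; None is an assumption here.
        flight = None if prev_press is None else round(press - prev_press)
        if hold > MAX_INTERVAL_MS:
            hold = REMOVED
        if flight is not None and flight > MAX_INTERVAL_MS:
            flight = REMOVED
        rows.append((vk, hold, flight))
        prev_press = press
    return rows


def is_alphanumeric(vk: int) -> bool:
    """Assumed grouping: virtual key codes for digits 0-9 and letters A-Z."""
    return 0x30 <= vk <= 0x39 or 0x41 <= vk <= 0x5A


def keep_sentence(key_codes: List[int]) -> bool:
    """Apply the length and alphanumeric-ratio filter to one sentence."""
    if len(key_codes) < 20:
        return False
    alnum = sum(1 for vk in key_codes if is_alphanumeric(vk))
    return alnum / len(key_codes) >= 0.5
```

For instance, a key pressed 1850 ms after the previous key press would receive a flight time of -1, while its hold time is kept if it does not exceed the threshold.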
This dataset contains human-written samples, compiled from the three source datasets, together with their synthetic counterparts, which share the same keystroke sequences and are meant to impersonate the legitimate user. The synthetic samples were created using a variety of methods and profiles, the latter representing how much knowledge the attacker had of the legitimate user’s behavior. Detailing the synthesis process and the various methods is outside the scope of this article; for this purpose, the reader is referred to Nahuel González, Enrique P. Calot, Jorge S. Ierache, Waldo Hasperué, Towards liveness detection in keystroke dynamics: Revealing synthetic forgeries, Systems and Soft Computing, Volume 4, 2022, 200037, ISSN 2772-9419, https://doi.org/10.1016/j.sasc.2022.200037.