Enron Authorship Verification Corpus

Published: 10-09-2018 | Version 2 | DOI: 10.17632/n77w7mygwg.2
Oren Halvani


===========================================
Type of corpus:
===========================================
The "Enron Authorship Verification Corpus" is a derivative of the well-known "Enron Email Dataset" [1], which has been used across different research domains beyond Authorship Verification (AV). The intention behind this corpus is to give other researchers in the field of AV the opportunity to compare their results with each other.

===========================================
Language:
===========================================
All texts are written in English.

===========================================
Format of the corpus:
===========================================
The corpus was transformed to match the standardized format of the "PAN Authorship Identification corpora" [2]. It consists of 80 AV cases, evenly distributed between true (Y) and false (N) authorships, together with the ground truth (Y/N) for all AV cases. Each AV case comprises up to 5 documents (plain text files), where 2-4 documents stem from a known author, while the remaining document is of unknown authorship and is thus the subject of verification. Each document was written by a single author X and is mostly aggregated from several mails of X, in order to provide a sufficient length that captures X's writing style.

===========================================
Preprocessing steps:
===========================================
All texts in the corpus were preprocessed by hand, which resulted in an overall processing time of more than 30 hours. The preprocessing includes de-duplication, normalization of UTF-8 symbols, as well as the removal of URLs, e-mail headers, signatures and other metadata.
Beyond these, the texts themselves have undergone a variety of cleaning procedures, including the removal of greetings/closing formulas, (telephone) numbers, named entities (names of people, companies, locations, etc.), quotes, and repetitions of identical characters/symbols and words. As a last preprocessing step, multiple successive blanks, newlines and tabs were substituted with a single blank.

===========================================
Basic statistics:
===========================================
The length of each preprocessed text ranges from 2,200 to 5,000 characters. More precisely, the average length of the known documents is 3,976 characters, while the average length of the unknown documents is 3,899 characters.

===========================================
Paper + Citation:
===========================================
https://link.springer.com/chapter/10.1007/978-3-319-98932-7_4

===========================================
References:
===========================================
[1] https://www.cs.cmu.edu/~enron
[2] http://pan.webis.de
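
===========================================
Usage sketch (illustrative):
===========================================
The corpus itself was preprocessed by hand, so the following Python sketch is not the procedure used to build it. It only illustrates, under stated assumptions, how the last preprocessing step (collapsing successive blanks, newlines and tabs into a single blank) and the basic length statistics described above could be reproduced when working with the files. The file names "known*.txt" and "unknown.txt", the ground-truth line layout "<case-id> <Y|N>" and all function names are assumptions made for illustration; they should be checked against the actual corpus layout.

```python
import re
from pathlib import Path
from statistics import mean


def collapse_whitespace(text: str) -> str:
    """Replace each run of blanks, tabs and newlines with a single blank."""
    return re.sub(r"[ \t\r\n]+", " ", text)


def parse_truth(lines):
    """Parse ground-truth lines of the ASSUMED form '<case-id> <Y|N>'.

    Returns a dict mapping each case id to True (Y) or False (N).
    Lines that do not match the assumed form are skipped.
    """
    truth = {}
    for line in lines:
        parts = line.split()
        if len(parts) == 2 and parts[1] in ("Y", "N"):
            truth[parts[0]] = parts[1] == "Y"
    return truth


def load_case(case_dir):
    """Read one AV case directory.

    ASSUMPTION: known documents are named 'known*.txt' and the document
    under verification is named 'unknown.txt'.
    """
    case_dir = Path(case_dir)
    known = [
        p.read_text(encoding="utf-8")
        for p in sorted(case_dir.glob("known*.txt"))
    ]
    unknown = (case_dir / "unknown.txt").read_text(encoding="utf-8")
    return known, unknown


def average_length(documents):
    """Average document length in characters."""
    return mean(len(doc) for doc in documents)


def within_reported_range(text, low=2200, high=5000):
    """Check a text against the reported 2,200 to 5,000 character range."""
    return low <= len(text) <= high
```

Given a corpus root directory, one could call load_case on each case folder, pass the collected known and unknown documents to average_length, and compare the results with the averages reported under "Basic statistics". The hypothetical helper within_reported_range can flag files whose length falls outside the stated range, for example after an encoding problem.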