Enron Authorship Verification Corpus
Description
The "Enron Authorship Verification Corpus" is a derivative of the well-known "Enron Email Dataset", transformed to match the standardized format of the "PAN Authorship Identification corpora" (http://pan.webis.de). The corpus consists of 80 authorship verification cases, evenly distributed between true and false authorships. Each authorship verification case comprises exactly 5 documents (plain text files): 4 documents are samples from the known (true) author, while the remaining document is the text of the unknown author (the subject of verification). The corpus is balanced, not only in the number of known documents per case, but also in the length of the texts, which is near-equal (3-4 kilobytes per text). It can be assumed that each document is aggregated from (short) mails of the same author, so that it is long enough to capture the author's writing style. All texts in the corpus have undergone the same preprocessing procedure: de-duplication, removal of URLs and newlines/tabs, normalization of UTF-8 symbols, and substitution of multiple successive blanks with a single blank. All e-mail headers and other metadata (including signatures) have been removed from each document, so that it contains only pure natural language text from a single author. The intention behind this corpus is to give researchers in the field of authorship verification the opportunity to compare their results with each other.
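For readers who want to apply comparable cleaning to their own texts, the preprocessing steps listed above could be approximated with the following Python sketch. It is not the original pipeline used to build the corpus. The function names (normalize_text, deduplicate), the URL pattern, the use of Unicode NFKC normalization for the "normalization of UTF-8 symbols" step, and the final trimming of leading/trailing blanks are all assumptions made for illustration.

```python
import re
import unicodedata

# Assumption: a URL is anything starting with http://, https:// or www.
# and running up to the next whitespace character.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)


def normalize_text(text: str) -> str:
    """Approximate the cleaning steps named in the corpus description."""
    # Assumption: NFKC stands in for "normalization of UTF-8 symbols".
    text = unicodedata.normalize("NFKC", text)
    # Remove URLs, leaving a blank so neighbouring words stay separated.
    text = URL_RE.sub(" ", text)
    # Replace newlines and tabs with a blank.
    text = re.sub(r"[\r\n\t]+", " ", text)
    # Substitute multiple successive blanks with a single blank.
    text = re.sub(r" {2,}", " ", text)
    # Assumption: leading/trailing blanks are trimmed as well.
    return text.strip()


def deduplicate(texts):
    """Drop exact duplicate texts while keeping first-seen order."""
    seen = set()
    unique = []
    for t in texts:
        if t not in seen:
            seen.add(t)
            unique.append(t)
    return unique


if __name__ == "__main__":
    mails = [
        "Hello\tworld\n\nsee http://example.com  now",
        "Hello\tworld\n\nsee http://example.com  now",
    ]
    for cleaned in deduplicate([normalize_text(m) for m in mails]):
        print(cleaned)
```

Exact-match de-duplication is the simplest reading of "de-duplication"; near-duplicate detection (for example, quoted replies) would need a different approach and is not covered by this sketch.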