Turkish Authorship Attribution

Published: 04-10-2018 | Version 1 | DOI: 10.17632/xcb8r9d554.1
Contributors:
Hayri Volkan Agun,
Ozgur Yilmazel

Description

Our text dataset consists of XML documents that contain author, genre, topic, and text fields for blogs, newspaper articles, and tweets in Turkish. The dataset was collected from the web between 2015 and 2018. The XML documents may contain invalid characters, such as unescaped HTML characters or Unicode characters that are not allowed in XML; these can be handled by first replacing them with appropriate valid XML characters. For this purpose, XML parser source code in Scala is provided to show how the parsing is done.
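The replace-then-parse approach described above can be sketched in a few lines. The following is a minimal Python illustration, not the Scala parser provided with the dataset. The element names (`document`, `author`, `genre`, `topic`, `text`) and the function names are assumptions made for this example; the actual tag names should be checked against the files.

```python
import re
import xml.etree.ElementTree as ET

# Code points that fall outside the XML 1.0 "Char" production.
_INVALID_XML_CHARS = re.compile(
    r"[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)
# An ampersand that does not start a predefined or numeric XML entity
# (this also catches HTML-only entities such as &nbsp;).
_BARE_AMPERSAND = re.compile(r"&(?!(?:amp|lt|gt|quot|apos|#\d+|#x[0-9A-Fa-f]+);)")


def sanitize_xml(raw):
    """Replace invalid characters with a space and escape stray ampersands."""
    cleaned = _INVALID_XML_CHARS.sub(" ", raw)
    return _BARE_AMPERSAND.sub("&amp;", cleaned)


def parse_documents(raw, doc_tag="document"):
    """Return one dict per document element (tag names are hypothetical)."""
    root = ET.fromstring(sanitize_xml(raw))
    fields = ("author", "genre", "topic", "text")
    # iter() also yields the root itself when it matches doc_tag, so files
    # holding a single document and files holding many are both covered.
    return [
        {field: (node.findtext(field) or "").strip() for field in fields}
        for node in root.iter(doc_tag)
    ]
```

Replacing an invalid character with a space, rather than deleting it, keeps neighbouring words from being joined. Stray `<` or `>` characters inside text fields are not handled by this sketch and would need separate treatment.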

Steps to reproduce

To reproduce the dataset, one needs to parse the XML documents. Each XML file contains either a single document or, for Twitter data, multiple documents, with all the necessary information per document. To prevent copyright issues, author names need to be replaced with appropriate codes.
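The author-replacement step can be sketched as follows. This is a minimal Python illustration under stated assumptions: the source does not specify a coding scheme, so the sequential codes (`A0001`, `A0002`, ...), the function name `pseudonymize_authors`, and the representation of each parsed document as a dict with an `author` key are all hypothetical.

```python
def pseudonymize_authors(documents, prefix="A"):
    """Replace each distinct author name with a stable sequential code.

    Returns the rewritten documents and the name-to-code mapping.
    The input list and its dicts are left unmodified.
    """
    codes = {}
    result = []
    for doc in documents:
        name = doc["author"]
        if name not in codes:
            # Assumed scheme: prefix plus a zero-padded running number.
            codes[name] = f"{prefix}{len(codes) + 1:04d}"
        result.append({**doc, "author": codes[name]})
    return result, codes
```

The same author always receives the same code, so attribution labels stay consistent across documents. The returned mapping would need to be stored separately from any released data if the original names must remain hidden.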