1. INTRODUCTION

TWNERTC and EWNERTC are collections of automatically categorized and annotated sentences obtained from Turkish and English Wikipedia for named-entity recognition and text categorization.

First, we construct large-scale gazetteers by using a graph crawler algorithm to extract relevant entity and domain information from a semantic knowledge base, Freebase. The final gazetteers have 77 domains (categories) and more than 1,000 fine-grained entity types for both languages. The Turkish gazetteer contains approximately 300K named entities, while the English gazetteer contains approximately 23M.

By leveraging these large-scale gazetteers and the linked Wikipedia articles, we construct TWNERTC and EWNERTC. Since the categorization and annotation processes are automated, the raw collections are prone to ambiguity. Hence, we introduce two noise reduction methodologies, (a) domain-dependent and (b) domain-independent, and produce two additional versions by post-processing the raw collections. As a result, we publish three versions of TWNERTC and EWNERTC: (a) raw, (b) domain-dependent post-processed, and (c) domain-independent post-processed. The Turkish collections have approximately 700K sentences per version (the exact count varies between versions), while the English collections contain more than 7M sentences each.

We also introduce "Coarse-Grained NER" versions of the same datasets, in which fine-grained types are reduced to "organization", "person", "location", and "misc" by mapping each fine-grained type to its most similar coarse-grained counterpart. Note that this process eliminates many domains and fine-grained annotations for which no suitable coarse-grained type exists. Hence, the "Coarse-Grained NER" datasets contain only 25 domains, and their sentence counts are lower than those of the "Fine-Grained NER" versions.

The published collections are licensed under a Creative Commons Attribution 4.0 International license.

2. DATA FORMAT

We publish three versions of both TWNERTC and EWNERTC: (a) no noise reduction, (b) domain-dependent noise reduction, and (c) domain-independent noise reduction. The data format is the same for all versions of both languages: each line contains a domain (category), an annotation, and a sentence, separated by TAB characters, in the following order:

Domain\tAnnotation\tSentence
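For illustration, here is a minimal Python sketch of reading this format. The file name "twnertc_raw.tsv" and the UTF-8 encoding are our assumptions, not part of the release; adjust them to the file you actually downloaded.

    # Minimal sketch: read one version of the collection and split each
    # line into its three TAB-separated fields.
    with open("twnertc_raw.tsv", encoding="utf-8") as f:
        for line in f:
            domain, annotation, sentence = line.rstrip("\n").split("\t")
            tokens = sentence.split()    # whitespace-tokenized (see Section 3)
            tags = annotation.split()    # one IOB tag per token
            assert len(tokens) == len(tags)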
3. DATA DETAILS

Note that since several domains contain only a small number of sentences, we merge them into more comprehensive domains; for instance, sentences categorized under "Ice Hockey" are moved into the "Sports" domain. After this merging process, the total number of domains is reduced from 77 to 49. The entity types of the datasets defined as "Fine-Grained" are not merged in any way; for the "Coarse-Grained" datasets, fine-grained entity types are transformed into "organization", "person", "location", and "misc". This process and its related statistics for the Turkish dataset can be found in the white paper.

Each sentence has exactly one domain (category) and one annotation in a given dataset version; however, the domain and annotation of the same sentence may differ between versions of the dataset. An annotation has the same number of tokens as its sentence (counting both words and punctuation). An example annotation is as follows:

Sentence (in Turkish): Callisto , Jüpiter'in doğal uydularından biridir .
(English: "Callisto is one of Jupiter's natural moons.")
Annotation (Fine-grained): B-moon_name O B-planet_name O B-celestial_object_category O O

As the example shows, the raw sentence is tokenized so that every word and punctuation mark can carry an entity type; tokens are separated by whitespace. We use the IOB (Inside, Outside, Begin) annotation scheme. Note that every entity in an annotation starts with a "B" tag; if an entity spans more than one token, "I" tags follow the "B" tag. For instance, the annotation corresponding to "Real Madrid" must be "B-football_team_name I-football_team_name".

4. PUBLISHED PAPER AND CITATION

A white paper has been published for the Turkish version of this collection: http://arxiv.org/abs/1702.02363

The construction process of the English versions is the same as that of the Turkish versions; only the sentence detection tools differ between the languages. To cite the white paper, use the link above. To cite only the datasets, use the following DOI: http://dx.doi.org/10.17632/cdcztymf4k.1
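As a supplementary illustration of the IOB scheme described in Section 3, the following minimal Python sketch groups the tags of an annotation into (entity type, entity text) spans. The helper name iob_to_spans is ours, not part of any released tooling, and it assumes well-formed annotations in which every entity opens with a "B" tag.

    def iob_to_spans(tokens, tags):
        # Group IOB tags into (entity_type, entity_text) spans.
        spans, etype, parts = [], None, []
        for token, tag in zip(tokens, tags):
            if tag.startswith("B-"):          # a new entity begins
                if etype is not None:
                    spans.append((etype, " ".join(parts)))
                etype, parts = tag[2:], [token]
            elif tag.startswith("I-") and etype is not None:
                parts.append(token)           # current entity continues
            else:                             # "O" tag: outside any entity
                if etype is not None:
                    spans.append((etype, " ".join(parts)))
                etype, parts = None, []
        if etype is not None:                 # flush a trailing entity
            spans.append((etype, " ".join(parts)))
        return spans

    # The example from Section 3:
    tokens = "Callisto , Jüpiter'in doğal uydularından biridir .".split()
    tags = ("B-moon_name O B-planet_name O "
            "B-celestial_object_category O O").split()
    print(iob_to_spans(tokens, tags))
    # [('moon_name', 'Callisto'), ('planet_name', "Jüpiter'in"),
    #  ('celestial_object_category', 'uydularından')]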