Extended data set for training the Stanford coreference component

Published: 27 June 2022 | Version 1 | DOI: 10.17632/y8vnr5n7mk.1
Julius Ruseckas


The extended data set for training the Stanford coreference component is merged from the following open source data sets:

CoNLL-2012 https://cemantix.org/conll/2012/data.html
GUM https://github.com/amir-zeldes/gum/tree/master/dep
WikiCoref http://rali.iro.umontreal.ca/rali/?q=en/wikicoref
Phrase Detectives Corpus 2.1.4 https://github.com/dali-ambiguity/Phrase-Detectives-Corpus-2.1.4
Emailcoref https://github.com/paragdakle/emailcoref
NP4E http://clg.wlv.ac.uk/projects/NP4E/

The layout of the archive is as follows. The top level is divided into the directories development, test and train, which contain the development, test and training sets for the coreference component. Each of these directories is divided into data (containing the English branch of CoNLL), detectCorp (for Phrase Detectives Corpus 2.1.4), email (for Emailcoref), gum, np4e and wikipedia (for WikiCoref).

The whole training process is documented at https://github.com/clarkkev/deep-coref

While being merged, these data sets were reviewed, analysed and prepared for this research in accordance with the GDPR (and in accordance with Lithuanian and German law related to the GDPR and the ethics requirements of the EU).
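
The archive layout described above can be checked after download with a short script. The following is a minimal sketch, not part of the data set: it assumes the archive has been extracted to a local directory, uses only the directory names given in the description, and makes no assumption about file names or formats inside the leaf directories. The function name summarize_archive is hypothetical.

```python
from pathlib import Path

# Directory names are taken from the archive description; the extraction
# location and everything inside the leaf directories are assumptions.
SPLITS = ("train", "development", "test")
SOURCES = ("data", "detectCorp", "email", "gum", "np4e", "wikipedia")


def summarize_archive(root):
    """Count regular files under each split/source directory.

    Returns {split: {source: file_count}}; a directory that does not
    exist is reported as None rather than 0, so gaps are visible.
    """
    root = Path(root)
    summary = {}
    for split in SPLITS:
        summary[split] = {}
        for source in SOURCES:
            directory = root / split / source
            if directory.is_dir():
                # Recurse, since the depth below each source is unknown.
                summary[split][source] = sum(
                    1 for path in directory.rglob("*") if path.is_file()
                )
            else:
                summary[split][source] = None
    return summary
```

Calling summarize_archive on the extraction directory and printing the result gives a quick view of how many files each source corpus contributes to each split, and shows whether any expected directory is missing.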



Law Enforcement, Text Mining, Deep Learning