Coreferent Clusters (dataset and a pre-trained model)

Published: 11 February 2021 | Version 1 | DOI: 10.17632/mb77zhxynt.1
Artem Kramov


A MySQL database (a single table `word`) containing a set of Ukrainian texts annotated with coreferent groups and additional information (Lemma, Cases, etc.) for each token. The reference below points to a Docker repository that provides an HTTP web service for detecting coreferent pairs in a Ukrainian corpus.


Steps to reproduce

To use the dataset:
1. Download the MySQL file to your local machine.
2. Import the database using either a graphical tool (e.g. phpMyAdmin) or the command line (mysql ...).
3. The table 'word' contains all tokens. Group the tokens by the field 'DocumentID' into a set of documents.
4. For sentence-level analysis, split documents into sentences using the 'RawTagString' field: the value './SENT_END' indicates the end of a sentence.
5. Group words into mentions using the 'EntityID' attribute.
6. Group mentions into coreferent clusters for further learning.

To use the pre-trained model:
1. Install Docker on your local machine.
2. Pull the image: docker pull artemkramov/ukrainian-pack-coreference-coherence
3. Start the web service: sudo docker run -p 5000:5000 artemkramov/ukrainian-pack-coreference-coherence:2.0
4. Send HTTP JSON queries {"text": "<text>"} to http://<local-address>:5000/api/get_coreferent_clusters
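The steps above can be sketched in Python as follows. This is a minimal illustration, not the author's code, and it rests on several assumptions: each row of the 'word' table has already been loaded as a dict in document order; tokens outside any mention have an empty (None) 'EntityID'; the column linking a mention to its cluster is called 'ClusterID' (a hypothetical name, since the description does not name it); and the web service accepts a POST request with a JSON body. Only 'DocumentID', 'RawTagString', 'EntityID', the './SENT_END' marker, and the endpoint path come from the description.

```python
import json
import urllib.request
from collections import defaultdict

# Value of 'RawTagString' that closes a sentence (from the dataset description).
SENT_END = "./SENT_END"


def group_documents(rows):
    """Group token rows by 'DocumentID', keeping the incoming row order."""
    documents = defaultdict(list)
    for row in rows:
        documents[row["DocumentID"]].append(row)
    return dict(documents)


def split_sentences(tokens):
    """Split one document's tokens into sentences at the './SENT_END' marker."""
    sentences, current = [], []
    for token in tokens:
        current.append(token)
        if token["RawTagString"] == SENT_END:
            sentences.append(current)
            current = []
    if current:  # trailing tokens without a closing marker
        sentences.append(current)
    return sentences


def group_mentions(tokens):
    """Group tokens into mentions by 'EntityID'.

    ASSUMPTION: tokens that belong to no mention have EntityID set to None.
    """
    mentions = defaultdict(list)
    for token in tokens:
        if token.get("EntityID") is not None:
            mentions[token["EntityID"]].append(token)
    return dict(mentions)


def group_clusters(mentions, cluster_key="ClusterID"):
    """Group mention ids into coreferent clusters.

    ASSUMPTION: 'ClusterID' is a hypothetical column name; replace it with
    the actual field found in the 'word' table.
    """
    clusters = defaultdict(list)
    for entity_id, tokens in mentions.items():
        cluster_id = tokens[0].get(cluster_key)
        if cluster_id is not None:
            clusters[cluster_id].append(entity_id)
    return dict(clusters)


def build_request(text, host="localhost", port=5000):
    """Build (but do not send) a JSON request for the web service.

    ASSUMPTION: the service expects POST with a JSON body; the description
    only gives the payload shape and the endpoint path.
    """
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        f"http://{host}:{port}/api/get_coreferent_clusters",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

With the container running, the request could be sent with urllib.request.urlopen(build_request("<text>")). The response format is not described here, so it is best inspected before parsing.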


Kyyivs'kij nacional'nij universytet imeni Tarasa Shevchenka


Natural Language Processing, Deep Neural Network