Synthetic dataset for cross-lingual text reuse detection evaluation

Name: Synthetic dataset for cross-lingual text reuse detection evaluation
Creator: Oleg Bakhteev
Published: 2022-02-16T12:19:01.850Z
Keywords: Plagiarism

Bakhteev, Oleg; Chekhovich, Yury; Grabovoy, Andrey; Gorbachev, Georgy; Gorlenko, Tatiana; Grashchenkov, Kirill; Ivakhnenko, Andrey; Kildyakov, Aleksandr; Khazov, Andrey; Komarnitsky, Vladislav; Nikitov, Artemiy; Ogaltsov, Aleksandr; Sakharova, Aleksandra

doi:10.17632/53j6hyn99g.1

Published: 16 February 2022 | Version 1 | DOI: 10.17632/53j6hyn99g.1

Description

An evaluation dataset for the cross-lingual text reuse detection task, prepared for the article "Cross-language plagiarism detection: a case study of European languages academic works". The dataset contains collections for reuse search (articles from Wikipedia) and documents with translated text passages for 3 language pairs. Each synthetic document is represented by its text and by markup in XML format. For evaluation, please use the PAN evaluation tool from https://pan.webis.de/clef11/pan11-web/external-plagiarism-detection.html.
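
The description does not spell out the XML schema, so the following Python sketch is only an illustration of how such markup might be read. It assumes the markup follows the PAN-PC-11 annotation style (a `document` element holding `feature` elements with `this_offset`, `this_length`, `source_reference`, `source_offset` and `source_length` attributes). The sample markup, the file names in it, and the helper `read_reuse_spans` are hypothetical and should be checked against the actual files.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the PAN-PC-11 annotation style. The element names,
# attribute names and file names are assumptions, not taken from the dataset.
SAMPLE_MARKUP = """<document reference="suspicious-document00001.txt">
  <feature name="plagiarism" this_offset="120" this_length="340"
           source_reference="source-document00042.txt"
           source_offset="980" source_length="365" />
  <feature name="plagiarism" this_offset="2050" this_length="200"
           source_reference="source-document00007.txt"
           source_offset="15" source_length="210" />
</document>"""


def read_reuse_spans(xml_text):
    """Return one dict per annotated reuse span found in the markup."""
    root = ET.fromstring(xml_text)
    spans = []
    for feature in root.iter("feature"):
        # Skip any feature that does not annotate a reused passage.
        if feature.get("name") != "plagiarism":
            continue
        spans.append({
            "this_offset": int(feature.get("this_offset")),
            "this_length": int(feature.get("this_length")),
            "source_reference": feature.get("source_reference"),
            "source_offset": int(feature.get("source_offset")),
            "source_length": int(feature.get("source_length")),
        })
    return spans


spans = read_reuse_spans(SAMPLE_MARKUP)
for span in spans:
    print(span["source_reference"], span["this_offset"], span["this_length"])
```

If the dataset's attribute names differ, only the dictionary keys and the `feature.get(...)` calls would need to change. The offsets are treated here as character positions in the document text, which is also an assumption.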
