Synthetic dataset for cross-lingual text reuse detection evaluation

Published: 16 February 2022 | Version 1 | DOI: 10.17632/53j6hyn99g.1
Contributors:
Oleg Bakhteev, Yury Chekhovich, Andrey Grabovoy, Georgy Gorbachev, Tatiana Gorlenko, Kirill Grashchenkov, Andrey Ivakhnenko, Aleksandr Kildyakov, Andrey Khazov, Vladislav Komarnitsky, Artemiy Nikitov, Aleksandr Ogaltsov, Aleksandra Sakharova

Description

This is an evaluation dataset for the cross-lingual text reuse detection task. It was prepared for the article "Cross-language plagiarism detection: a case study of European languages academic works". The dataset contains collections for reuse search (articles from Wikipedia) and documents with translated text passages for 3 language pairs. Each synthetic document is represented by its text and by markup in XML format. For evaluation, please use the PAN evaluation tool from https://pan.webis.de/clef11/pan11-web/external-plagiarism-detection.html.
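As a rough illustration of how the XML markup of a synthetic document might be read, the sketch below parses reuse annotations with Python's standard library. It assumes the markup follows the annotation layout used in the PAN external plagiarism detection corpora (a `document` element containing `feature` elements with `this_offset`, `this_length`, `source_reference`, `source_offset` and `source_length` attributes). That layout, the sample file names, and the helper name `read_reuse_spans` are assumptions made for illustration and are not confirmed by this dataset description, so check the actual files before relying on it.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup in the PAN-style layout (an assumption, not taken
# from the dataset itself). File names and offsets are invented.
SAMPLE_MARKUP = """<document reference="suspicious-document00001.txt">
  <feature name="plagiarism" this_offset="120" this_length="300"
           source_reference="source-document00042.txt"
           source_offset="1500" source_length="280"/>
  <feature name="plagiarism" this_offset="900" this_length="150"
           source_reference="source-document00007.txt"
           source_offset="40" source_length="160"/>
</document>"""


def read_reuse_spans(xml_text):
    """Return the annotated reuse spans of one document as a list of dicts."""
    root = ET.fromstring(xml_text)
    spans = []
    for feature in root.iter("feature"):
        # Skip non-reuse features (e.g. metadata); the feature name used
        # here is an assumption about the markup.
        if feature.get("name") != "plagiarism":
            continue
        start = int(feature.get("this_offset"))
        length = int(feature.get("this_length"))
        spans.append({
            "start": start,                      # character offset in the document
            "end": start + length,               # exclusive end offset
            "source": feature.get("source_reference"),
            "source_start": int(feature.get("source_offset")),
            "source_length": int(feature.get("source_length")),
        })
    return spans


spans = read_reuse_spans(SAMPLE_MARKUP)
for span in spans:
    print(span["start"], span["end"], span["source"])
```

Spans read this way can be compared with a detector's output, although the official scores should still be computed with the PAN evaluation tool mentioned above.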

Categories

Plagiarism

Licence