Synthetic dataset for cross-lingual text reuse detection evaluation
Published: 16 February 2022 | Version 1 | DOI: 10.17632/53j6hyn99g.1
Contributors:
Oleg Bakhteev, Yury Chekhovich, Andrey Grabovoy, Georgy Gorbachev, Tatiana Gorlenko, Kirill Grashchenkov, Andrey Ivakhnenko, Aleksandr Kildyakov, Andrey Khazov, Vladislav Komarnitsky, Artemiy Nikitov, Aleksandr Ogaltsov, Aleksandra Sakharova
Description
An evaluation dataset for the cross-lingual text reuse detection task, prepared for the article "Cross-language plagiarism detection: a case study of European languages academic works". The dataset contains source collections for reuse search (articles from Wikipedia) and synthetic documents with translated text passages for three language pairs. Each synthetic document is represented by its text and by markup in XML format. For evaluation, please use the PAN evaluation tool available at https://pan.webis.de/clef11/pan11-web/external-plagiarism-detection.html.
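The XML markup can be read with a standard XML parser. The sketch below is illustrative only: it assumes the markup follows the annotation style of the PAN plagiarism corpora (a `document` element containing `feature` elements with offset and length attributes). The element names, attribute names, and file names used here are assumptions and are not confirmed by this description, so check them against the actual files before use.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation in the PAN plagiarism-corpus style. The element and
# attribute names are an assumption, not taken from the dataset itself.
SAMPLE_XML = """<document reference="suspicious-document00001.txt">
  <feature name="plagiarism" this_offset="120" this_length="340"
           source_reference="source-document00042.txt"
           source_offset="15" source_length="310" />
  <feature name="plagiarism" this_offset="900" this_length="200"
           source_reference="source-document00007.txt"
           source_offset="0" source_length="180" />
</document>"""


def read_reuse_spans(xml_text):
    """Return one dict per annotated reuse passage in the markup."""
    root = ET.fromstring(xml_text)
    spans = []
    for feature in root.iter("feature"):
        # Skip any feature elements that do not mark reused text.
        if feature.get("name") != "plagiarism":
            continue
        spans.append({
            "offset": int(feature.get("this_offset")),
            "length": int(feature.get("this_length")),
            "source": feature.get("source_reference"),
            "source_offset": int(feature.get("source_offset")),
            "source_length": int(feature.get("source_length")),
        })
    return spans


spans = read_reuse_spans(SAMPLE_XML)
for span in spans:
    end = span["offset"] + span["length"]
    print(f'{span["offset"]}-{end} <- {span["source"]}')
```

Character offsets of this kind are what a span-level evaluation would compare against detector output, so it may be worth confirming whether the offsets in the real files count characters or bytes.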
Categories
Plagiarism