4lang: Open Access Dataset for Cross-Lingual Plagiarism Detection

Published: 5 April 2023 | Version 1 | DOI: 10.17632/vndpn2wsf9.1
Contributors:
German Gritsay, Karen Avetisyan, Andrey Grabovoy

Description

A dataset for evaluating cross-lingual plagiarism detection. 4collection.zip: a subset of Wikipedia articles in four languages (ru, hy, es, en). 4query.zip: Wikipedia documents in each of the four languages, containing sentences taken from the collection and translated with the Google Translate API. The archive contains text documents and the XML markup for them. For a description of the markup and the evaluation, see http://pan.webis.de/clef13/pan13-web/plagiarism-detection.html
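As a rough illustration of how the XML markup might be read, the sketch below parses one annotation file with Python's standard library. It assumes the markup follows the PAN 2013 convention: a `document` root element with `feature` children carrying `name="plagiarism"`, `this_offset`, `this_length`, `source_reference`, `source_offset` and `source_length` attributes, with offsets counted in characters. Those attribute names, the `Span` class, the `parse_markup` function and the sample file names are assumptions made for illustration and should be checked against the files in the archive.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List


@dataclass
class Span:
    """One annotated reuse case linking a query passage to a source passage."""
    this_offset: int        # character offset in the query document
    this_length: int        # length of the passage in the query document
    source_reference: str   # file name of the source document
    source_offset: int      # character offset in the source document
    source_length: int      # length of the passage in the source document


def parse_markup(xml_text: str) -> List[Span]:
    """Collect plagiarism features from PAN-style markup (assumed format)."""
    root = ET.fromstring(xml_text)
    spans = []
    for feature in root.iter("feature"):
        # Skip features that do not describe a plagiarism case.
        if feature.get("name") != "plagiarism":
            continue
        spans.append(
            Span(
                this_offset=int(feature.get("this_offset")),
                this_length=int(feature.get("this_length")),
                source_reference=feature.get("source_reference"),
                source_offset=int(feature.get("source_offset")),
                source_length=int(feature.get("source_length")),
            )
        )
    return spans


# Hypothetical sample; file names and numbers are invented, not from the dataset.
SAMPLE_XML = """<document reference="query-0001.txt">
  <feature name="plagiarism" this_offset="10" this_length="25"
           source_reference="source-0042.txt"
           source_offset="100" source_length="30" />
  <feature name="language" value="en" />
</document>"""

spans = parse_markup(SAMPLE_XML)
for span in spans:
    print(span)
```

With the spans in hand, the annotated passage of a query document could be recovered by slicing its text as `text[span.this_offset:span.this_offset + span.this_length]`, again assuming character-based offsets.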

Files

Categories

Text Processing

Licence