Multilingual Corpus of Poems with Human Reference Translations for Literary Machine Translation Research
Description
This dataset contains a multilingual corpus composed of 300 poems and 300 corresponding human reference translations per language pair, totaling 1,800 original poems and 1,800 translated poems. The corpus covers six language pairs: French–Portuguese, French–English, Portuguese–French, Portuguese–English, English–French, and English–Portuguese.

The poems were collected from multiple publicly available online sources, digital libraries, and literary websites. All texts are aligned at the poem level to ensure consistency between original and translated versions.

Each record in the dataset includes:
- The original poem (original_poem)
- The human reference translation (translated_poem)
- The source language code (src_lang)
- The target language code (tgt_lang)

This dataset was originally intended to support research in automatic literary translation, serving as ground truth for automatic evaluation metrics such as BLEU, METEOR, BERTScore, and related measures. It enables systematic comparison between machine-generated translations and human references.
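As a rough sketch of the record layout and corpus size described above, the snippet below models one record and enumerates the six directed language pairs. The field names come from the description. The language codes ("fr", "pt", "en") and the class name PoemRecord are assumptions made for illustration, and the values actually stored in src_lang and tgt_lang may differ.

```python
from dataclasses import dataclass
from itertools import permutations

# Assumed ISO 639-1 codes; the codes used in the dataset files may differ.
LANGS = ("fr", "pt", "en")
POEMS_PER_PAIR = 300  # stated in the description


@dataclass(frozen=True)
class PoemRecord:
    """One poem-level aligned record (hypothetical helper type)."""
    original_poem: str
    translated_poem: str
    src_lang: str
    tgt_lang: str


# Every ordered pair of distinct languages: three languages give six directions.
LANGUAGE_PAIRS = list(permutations(LANGS, 2))

# Six pairs of 300 poems each give the 1,800 records stated in the description.
TOTAL_RECORDS = len(LANGUAGE_PAIRS) * POEMS_PER_PAIR
```

Treating each direction as its own pair means that French–Portuguese and Portuguese–French are separate subsets, each with its own 300 original poems.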
Files
Steps to reproduce
Data extraction combined manual curation with scripted workflows for text scraping and cleaning. File handling and data structuring were done in Python. All texts were stored in UTF-8 encoding and organized into a structured tabular format. No aggressive text normalization was applied, in order to preserve poetic structure, punctuation, and line breaks.

To use the data:
- Download the dataset files in CSV format from this repository.
- Load the data using a scripting environment.
- Use the original_poem field as the source text and the translated_poem field as the reference text.
- Apply tokenization appropriate to each language while preserving original formatting, punctuation, casing, line breaks, and poetic structure.
- Avoid aggressive text normalization and preprocessing that would alter stylistic or rhythmic features.

Machine Learning Application:
- Generate system translations using machine translation systems or large language models.
- Use the corpus to fine-tune machine translation systems.
- Compute automatic evaluation metrics by comparing system outputs against the human reference translations.
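The loading step can be sketched with the Python standard library alone. This is a minimal illustration, not the authors' pipeline: the function names are hypothetical, the CSV is assumed to have a header row with the four field names, and the in-memory sample (including its "en" and "pt" codes) is made up to show that line breaks inside a poem survive loading.

```python
import csv
import io
from collections import defaultdict

FIELDS = ("original_poem", "translated_poem", "src_lang", "tgt_lang")


def read_records(handle):
    """Read poem records from an open CSV handle without altering the text."""
    reader = csv.DictReader(handle)
    missing = set(FIELDS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return [{name: row[name] for name in FIELDS} for row in reader]


def load_corpus(path):
    """Open a CSV file as UTF-8 and read its records."""
    # newline="" leaves line breaks inside quoted fields to the csv module.
    with open(path, encoding="utf-8", newline="") as handle:
        return read_records(handle)


def group_by_pair(records):
    """Map (src_lang, tgt_lang) to parallel lists of sources and references."""
    grouped = defaultdict(lambda: ([], []))
    for rec in records:
        sources, references = grouped[(rec["src_lang"], rec["tgt_lang"])]
        sources.append(rec["original_poem"])
        references.append(rec["translated_poem"])
    return dict(grouped)


# Made-up two-line sample; the real files are expected to hold full poems.
SAMPLE = (
    "original_poem,translated_poem,src_lang,tgt_lang\r\n"
    '"Line one\nLine two","Linha um\nLinha dois",en,pt\r\n'
)

records = read_records(io.StringIO(SAMPLE))
pairs = group_by_pair(records)
sources, references = pairs[("en", "pt")]
```

Grouping by language pair keeps the sources and references in the same order, so system outputs generated from the sources can be scored against the references with whichever metric implementation is chosen.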
Institutions
- Universidade Federal de Uberlandia