ArzEn-MultiGenre: An aligned parallel dataset of Egyptian Arabic song lyrics, novels, and subtitles, with English translations

Published: 29 January 2024 | Version 4 | DOI: 10.17632/6k97jty9xg.4
Contributor:
Rania Al-Sabbagh

Description

ArzEn-MultiGenre is a manually translated and aligned parallel dataset of Egyptian Arabic (Arz) and English (En). It contains 25,557 sentence pairs across three genres: novels (5,226 sentence pairs), subtitles (17,265 sentence pairs), and songs (3,066 sentence pairs). It serves as a benchmark for machine translation models, aids in fine-tuning large language models, and facilitates research in translation studies, cross-linguistic analysis, and lexical semantics.

Three features set ArzEn-MultiGenre apart from existing Arz-En parallel datasets:
1. It includes three genres previously unrepresented in Arz-En parallel datasets.
2. It is manually translated and aligned, unlike crowdsourced Arz-En parallel datasets.
3. It is larger than some existing Arz-En parallel datasets.

The dataset comprises 154,658 Arabic word tokens and 210,068 English word tokens. The vocabulary contains 29,179 Arabic word types and 18,131 English word types, giving a type-token ratio of 19% for Arabic and 9% for English.

Segment lengths vary across genres. Novels contain 54 one-word segments, 1,269 segments of 2-5 words, and 3,903 segments of 6 or more words. Subtitles contain 2,689 one-word segments, 9,252 segments of 2-5 words, and 5,324 segments of 6 or more words. Songs have the fewest segments of the three genres (3,066 in total).
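The reported figures can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: it hard-codes the counts quoted in this description rather than reading the dataset files, and every variable and function name in it is an assumption introduced for the example. The type-token ratio is taken here to be the number of word types divided by the number of word tokens, rounded to the nearest percent.

```python
# Counts copied from the description above; nothing is read from the dataset files.
pairs_by_genre = {"novels": 5_226, "subtitles": 17_265, "songs": 3_066}

tokens = {"Arabic": 154_658, "English": 210_068}
types = {"Arabic": 29_179, "English": 18_131}

# Segment-length buckets are reported for novels and subtitles only.
segment_lengths = {
    "novels": {"1 word": 54, "2-5 words": 1_269, "6+ words": 3_903},
    "subtitles": {"1 word": 2_689, "2-5 words": 9_252, "6+ words": 5_324},
}


def type_token_ratio(n_types: int, n_tokens: int) -> float:
    """Return distinct word types divided by running word tokens."""
    return n_types / n_tokens


# Sum of the three genres; should match the reported 25,557 sentence pairs.
total_pairs = sum(pairs_by_genre.values())

# Type-token ratio per language, rounded to a whole percent.
ttr_percent = {
    lang: round(100 * type_token_ratio(types[lang], tokens[lang]))
    for lang in tokens
}

# Each genre's length buckets should add up to that genre's sentence-pair count.
length_totals = {
    genre: sum(buckets.values()) for genre, buckets in segment_lengths.items()
}

print(total_pairs)    # 25557
print(ttr_percent)    # {'Arabic': 19, 'English': 9}
print(length_totals)  # {'novels': 5226, 'subtitles': 17265}
```

The totals agree with the figures quoted above: the genre counts sum to 25,557, and the length buckets for novels and subtitles sum to 5,226 and 17,265 respectively.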

Institutions

University of Sharjah

Categories

Natural Language Processing, Machine Translation, Parallel Database, Corpus Linguistics, Arabic Language, Egypt, Corpus-Based Translation Studies, Language Modeling

Licence