PEACH: A Sentence-Aligned Parallel English-Arabic Corpus for Healthcare
Description
This paper introduces PEACH, a sentence-aligned parallel English–Arabic corpus of healthcare texts comprising patient information leaflets and educational materials. The corpus contains 51,671 parallel sentences, totalling approximately 590,517 English and 567,707 Arabic word tokens. Average sentence length ranges from 9.52 to 11.83 words. Because it is manually aligned, PEACH is a gold-standard resource for researchers in contrastive linguistics, translation studies, and natural language processing. It can be used to derive bilingual lexicons, adapt large language models for domain-specific machine translation, evaluate user perceptions of machine translation in healthcare, assess the readability and lay-friendliness of patient information leaflets and educational materials, and serve as an educational resource in translation studies. PEACH is publicly accessible. Full corpus information is available in: Al-Sabbagh, R. (2024). PEACH: A Sentence-Aligned Parallel English-Arabic Corpus for Healthcare. Corpora, 19(3), 395-410. https://doi.org/10.3366/cor.2024.0320
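As a quick sanity check on the figures quoted in the description, the corpus-wide mean sentence length for each language can be derived from the token and sentence totals. This is only an illustrative sketch: the constant and function names are hypothetical, the token counts are the approximate values stated in the description, and the result is an overall mean rather than the specific averages reported in the paper.

```python
# Figures quoted in the description (token counts are approximate).
PARALLEL_SENTENCES = 51_671
ENGLISH_TOKENS = 590_517
ARABIC_TOKENS = 567_707


def mean_tokens_per_sentence(tokens: int, sentences: int) -> float:
    """Corpus-wide mean sentence length, in word tokens."""
    return tokens / sentences


english_mean = mean_tokens_per_sentence(ENGLISH_TOKENS, PARALLEL_SENTENCES)
arabic_mean = mean_tokens_per_sentence(ARABIC_TOKENS, PARALLEL_SENTENCES)

print(f"English: {english_mean:.2f} tokens per sentence")  # 11.43
print(f"Arabic:  {arabic_mean:.2f} tokens per sentence")   # 10.99
```

Both overall means fall inside the 9.52 to 11.83 range of average sentence lengths given in the description.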
Funding
University of Sharjah
This research was supported by Seed Research Grant No. 2203020129 from the University of Sharjah, United Arab Emirates.