Holy Quran Kurdish Sorani Translation Dataset (HQKSTD)

Published: 18 February 2025 | Version 1 | DOI: 10.17632/byyjd7kmvd.1
Contributors:
Shakhawan Hares Wady, Mhamad Bamoki, Soran Badawi

Description

The Kurdish Quranic Corpus is a parallel dataset containing Kurdish (Sorani) translations of the Holy Quran alongside the original Arabic text and English translations. It was compiled jointly by experts in Quranic studies and Kurdish language scholars to ensure both religious accuracy and linguistic authenticity.

The corpus is distributed as two Excel files: a raw file containing the original translations and a cleaned version prepared for computational processing. Each entry is annotated with the name of the Surah in both Arabic and Kurdish, the Surah number, its classification as either Makki or Madani, and the verse (Ayah) number. The corpus covers all 114 Surahs of the Holy Quran (86 Makki and 28 Madani), comprising 6,236 verses. This organization makes individual verses easy to locate and reference.

The dataset is designed to support Natural Language Processing (NLP) applications. These include machine translation, particularly between Arabic and Kurdish religious texts, as well as sentiment analysis, text classification, and the pre-training of Kurdish language models on religious texts. The English translations extend its use to multilingual NLP and comparative linguistic studies.

All content is encoded in UTF-8 so that the character sets of both Arabic and Kurdish are handled correctly, and the data structure accommodates right-to-left (RTL) text directionality.
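The structure and totals stated above (114 Surahs, 6,236 verses, 86 Makki and 28 Madani Surahs) can be checked once the rows are loaded. The following is a minimal sketch, not the authors' tooling. The key names `surah_no`, `revelation`, and `ayah_no`, and the labels `"Makki"` and `"Madani"`, are hypothetical, since this page does not list the actual column headers, and would need to be adapted to the files.

```python
from collections import Counter

# Totals as stated in the dataset description.
EXPECTED = {"surahs": 114, "verses": 6236, "makki": 86, "madani": 28}


def summarize(rows):
    """Count Surahs, verses, and Makki/Madani Surahs.

    `rows` is an iterable of dicts, one per verse. The keys
    'surah_no' and 'revelation' are assumed names, not taken
    from the dataset itself.
    """
    rows = list(rows)
    surah_type = {}
    for row in rows:
        # Record the classification the first time a Surah is seen.
        surah_type.setdefault(row["surah_no"], row["revelation"])
    by_type = Counter(surah_type.values())
    return {
        "surahs": len(surah_type),
        "verses": len(rows),
        "makki": by_type.get("Makki", 0),
        "madani": by_type.get("Madani", 0),
    }


def mismatches(summary, expected=EXPECTED):
    """Return {field: (found, expected)} for every total that differs."""
    return {
        key: (summary.get(key), value)
        for key, value in expected.items()
        if summary.get(key) != value
    }
```

If the Excel files are read with pandas, `pandas.read_excel(path).to_dict("records")` yields rows in the shape this sketch expects. An empty result from `mismatches(summarize(rows))` would indicate that the file matches the stated totals.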
The cleaned version offers normalized text with standardized character encodings and consistent formatting, so it can be used for computational processing without further preparation. The corpus contributes to both computational linguistics and religious studies by connecting traditional Islamic texts with modern NLP applications. It supports research in Kurdish language processing and provides a foundation for developing specialized language models for religious text analysis. Its attention to both theological accuracy and linguistic precision makes it particularly relevant to researchers working at the intersection of religious studies and computational linguistics.
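The kind of normalization described for the cleaned version can be illustrated with a short Python sketch. The exact cleaning steps used for this dataset are not documented on this page, so the character mapping below is an assumption based on common practice for Sorani text written in Arabic script, and `normalize_sorani` is a hypothetical helper name.

```python
import re
import unicodedata

# Assumed mapping (not taken from the dataset): replace Arabic code
# points with the forms commonly used in Sorani, and drop tatweel.
CHAR_MAP = str.maketrans({
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
    "\u0640": None,      # ARABIC TATWEEL (elongation mark) is removed
})


def normalize_sorani(text):
    """Apply NFC, the assumed character mapping, and whitespace cleanup."""
    text = unicodedata.normalize("NFC", text)  # canonical composition
    text = text.translate(CHAR_MAP)            # unify code points
    return re.sub(r"\s+", " ", text).strip()   # collapse whitespace runs
```

A mapping like this would apply only to the Kurdish column. The Arabic source text uses the original code points legitimately and should be left unchanged.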

Categories

Natural Language Processing, Machine Learning