Hindi-Kangri Parallel Corpus
Description
The Hindi-Kangri Parallel Corpus is designed to bridge the linguistic gap between Hindi and Kangri, a low-resource Pahari language of Himachal Pradesh. The dataset comprises 25,000 Kangri sentences and 25,000 Hindi sentences spanning cultural and regional domains, including Himachali agriculture, native plants and flowers, home remedies, local food, fairs, festivals, temples, and everyday conversations about the natural beauty and traditions of Himachal. It aims to support NLP tasks such as machine translation, language modeling, and cross-lingual transfer for under-resourced Himalayan languages.
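For readers who want to use the corpus for machine translation, the following is a minimal sketch of loading it as sentence pairs. It assumes the two sides are stored as line-aligned UTF-8 text files with one sentence per line; the description does not state the file format, and the file names `hindi.txt` and `kangri.txt` and both helper functions are hypothetical.

```python
def read_sentences(path):
    """Read one sentence per line from a UTF-8 text file (assumed layout)."""
    with open(path, encoding="utf-8") as f:
        # Keep empty lines so that line positions stay aligned across files.
        return [line.rstrip("\n") for line in f]


def pair_sentences(hindi, kangri):
    """Zip two line-aligned sentence lists into (hindi, kangri) pairs."""
    if len(hindi) != len(kangri):
        # A parallel corpus needs the same number of sentences on each side.
        raise ValueError(
            f"sentence counts differ: {len(hindi)} Hindi vs {len(kangri)} Kangri"
        )
    return list(zip(hindi, kangri))


# Hypothetical usage; replace the paths with the actual corpus files:
# pairs = pair_sentences(read_sentences("hindi.txt"), read_sentences("kangri.txt"))
# print(len(pairs))
```

The length check fails early if the two sides are misaligned, which is worth ruling out before training a translation model on the pairs.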
Steps to reproduce
The dataset is compiled from two complementary sources. Culturally grounded knowledge, including home remedies, agricultural practices, local traditions, and folk conversations, is collected directly from elderly native speakers to ensure authentic grassroots linguistic representation. The remaining data is sourced from online Hindi resources and then translated into Kangri by trained native annotators.