Hindi-Kangri Parallel Corpus

Name: Hindi-Kangri Parallel Corpus
Creator: Vandana Chaudhary
Published: 2026-06-01T15:56:11.311Z
Keywords: Natural Language Processing, Machine Translation, Parallel Database

Chaudhary, Vandana; Yadav, Arun

doi:10.17632/3chz6dg9d9.1

Published: 1 June 2026 | Version 1 | DOI: 10.17632/3chz6dg9d9.1

Contributors:

Vandana Chaudhary

Description

The Hindi-Kangri Parallel Corpus is designed to bridge the linguistic gap between Hindi and Kangri, a low-resource Pahari language of Himachal Pradesh. The dataset comprises 25,000 Kangri sentences and 25,000 Hindi sentences spanning cultural and regional domains: Himachali agriculture, native plants and flowers, home remedies, local food, fairs, festivals, temples, and everyday conversations about the natural beauty and traditions of Himachal. It aims to support NLP tasks such as machine translation, language modeling, and cross-lingual transfer for under-resourced Himalayan languages.
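For readers who want to use the corpus for machine translation, the following is a minimal loading sketch. It assumes the two sides are stored as line-aligned UTF-8 text files, one sentence per line; this record does not state the actual file layout, so the function name `load_parallel` and the file names in the usage note are hypothetical and would need adapting to the released files.

```python
from pathlib import Path
from typing import List, Tuple


def load_parallel(src_path: str, tgt_path: str) -> List[Tuple[str, str]]:
    """Read two line-aligned UTF-8 files and return (source, target) pairs.

    Assumption: line i of the source file translates line i of the target
    file. The real corpus may use a different layout (e.g. a spreadsheet).
    """
    src = Path(src_path).read_text(encoding="utf-8").splitlines()
    tgt = Path(tgt_path).read_text(encoding="utf-8").splitlines()

    # A parallel corpus must have the same number of lines on both sides.
    if len(src) != len(tgt):
        raise ValueError(f"line counts differ: {len(src)} vs {len(tgt)}")

    # Trim surrounding whitespace and drop pairs where either side is empty.
    pairs = [(s.strip(), t.strip()) for s, t in zip(src, tgt)]
    return [(s, t) for s, t in pairs if s and t]
```

With hypothetical files named `hindi.txt` and `kangri.txt`, `load_parallel("hindi.txt", "kangri.txt")` would return a list of (Hindi, Kangri) sentence pairs. Raising on a line-count mismatch is a deliberate choice: silently truncating to the shorter file would hide alignment errors.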

Steps to reproduce

The dataset is compiled from two complementary sources. Culturally grounded knowledge, including home remedies, agricultural practices, local traditions, and folk conversations, is collected directly from elderly native speakers to ensure authentic grassroots linguistic representation. The remaining data is sourced from online Hindi resources and then translated into Kangri by trained native annotators.
