From_Shloka_to_Shobda_Corpus

Published: 29 May 2026 | Version 1 | DOI: 10.17632/w6nsyggr64.1
Contributor:
KOUSHIK BISWAS

Description

The Shloka to Shobda dataset is a collection of source-specific Sanskrit and Bengali text files, with each language stored separately in plain text format. The content was collected from religious scriptures, classical literature, educational materials, tutorials, and translated documents. Each Sanskrit file has a corresponding Bengali translation file, and the two are aligned at the sentence level, line by line: the sentence on a given line of the Sanskrit file corresponds to the Bengali translation on the same line number of the paired file. This structure makes the dataset suitable for neural machine translation, bilingual corpus construction, tokenizer training, and other low-resource NLP research involving Sanskrit and Bengali.
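The line-by-line alignment described above can be consumed with a short loader. The following is a minimal sketch, not part of the dataset itself: the function names and any file names passed to them are hypothetical, and UTF-8 encoding is an assumption about the text files.

```python
def read_lines(path):
    """Read a UTF-8 text file and return its lines without trailing newlines."""
    # Iterating over the file handle splits only on real line endings,
    # so the line numbers match what a text editor would show.
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def load_parallel_pairs(sanskrit_path, bengali_path):
    """Return (sanskrit, bengali) sentence pairs, matched by line number."""
    sanskrit_lines = read_lines(sanskrit_path)
    bengali_lines = read_lines(bengali_path)
    # The alignment only holds if both files have the same number of lines.
    if len(sanskrit_lines) != len(bengali_lines):
        raise ValueError(
            f"line count mismatch: {len(sanskrit_lines)} Sanskrit lines "
            f"vs {len(bengali_lines)} Bengali lines"
        )
    return list(zip(sanskrit_lines, bengali_lines))
```

Calling `load_parallel_pairs` on one Sanskrit file and its paired Bengali file yields a list of tuples that can be passed to a translation or tokenizer training pipeline. Failing on a line-count mismatch is a deliberate choice here, because a silently truncated pairing would shift every later sentence pair.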

Files

Steps to reproduce

The dataset was created by collecting Sanskrit and Bengali parallel texts from religious scriptures, classical literature, educational materials, tutorials, and translated documents. Printed Sanskrit books were digitized using OCR, then manually proofread and normalized to reduce recognition errors. Sentence-level alignment was then performed to construct parallel sentence pairs. Each Sanskrit text file and its Bengali counterpart are organized with line-by-line alignment, so that matching line numbers identify a translated sentence pair. The dataset can be reproduced by following the data collection, OCR correction, preprocessing, and alignment procedures described in the associated publication.
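The normalization and alignment steps above can be spot-checked with a small script. This is a sketch under stated assumptions rather than the authors' procedure: the description does not say which normalization was applied, so Unicode NFC plus whitespace collapsing is assumed here, and the `max_ratio` threshold of 3.0 is a hypothetical value chosen only for illustration.

```python
import re
import unicodedata

_WHITESPACE = re.compile(r"\s+")


def normalize_line(text):
    """Apply Unicode NFC and collapse runs of whitespace (assumed normalization)."""
    text = unicodedata.normalize("NFC", text)
    # Zero-width joiners are not whitespace, so they are left untouched.
    return _WHITESPACE.sub(" ", text).strip()


def find_suspect_pairs(pairs, max_ratio=3.0):
    """Flag pairs that may be misaligned; returns (line_number, reason) tuples."""
    suspects = []
    for line_no, (source, target) in enumerate(pairs, start=1):
        source, target = normalize_line(source), normalize_line(target)
        if not source or not target:
            suspects.append((line_no, "empty side"))
            continue
        # A very uneven character-length ratio often signals a split or merged sentence.
        ratio = max(len(source), len(target)) / min(len(source), len(target))
        if ratio > max_ratio:
            suspects.append((line_no, "length ratio"))
    return suspects
```

Flagged line numbers are candidates for manual review only. A long Bengali gloss of a terse Sanskrit verse can legitimately exceed the ratio, so the check is a heuristic and not a replacement for proofreading.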

Institutions

Categories

Natural Language Processing, Machine Translation

Licence