KannadaLit4NLP: A Large-Scale Kannada Literary Corpus for Natural Language Processing
Description
To the best of our knowledge, KannadaLit4NLP is the first large-scale, machine-readable corpus of classical Kannada literature curated specifically for Natural Language Processing (NLP) research. The dataset comprises 24,746 literary verse records drawn from three canonical Kannada literary traditions (Vachanas, Sarvajna’s Tripadis, and Mankutimmana Kagga), spanning nearly a millennium of Kannada literary production from the eleventh to the twentieth century.
The corpus is enriched with 22,369 parallel scholarly interpretations, collected from 56 authoritative printed commentaries and specialist digital repositories and linked to 9,597 verses. The remaining 15,149 verses (61.2%) have no published interpretation, which makes the corpus a naturally incomplete benchmark for Generative AI–based interpretation generation, semantic inference, and retrieval-augmented explanation tasks.
The three constituent sub-corpora are:
• 21,701 Vachanas composed by 248 Śaranas, sourced from Samagra Vachana Samputa (Vols. 1–15);
• 2,099 Tripadis attributed to the sixteenth-century poet-philosopher Sarvajna;
• 946 verses of Mankutimmana Kagga by D. V. Gundappa.
In total, the dataset contains 2,414,716 Kannada word tokens distributed across 466,356 unique lexical types, making it one of the most substantial openly available structured corpora for classical Kannada language technology.
For computational efficiency and scalable experimentation, each of the 24,746 records is stored as an independent JSONL file (row_N.jsonl) inside the full_dataset directory. Every JSON object contains the metadata fields id, volume number, verse number, author, type, and verse, followed by numbered interpretation–source pairs (interpretation1/source1, interpretation2/source2, … up to interpretation15/source15) wherever available. All files are encoded in UTF-8 to represent the Kannada script.
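The file layout described above can be read with a few lines of Python. The following is a minimal sketch, not the loader shipped in README.md. It assumes that each row_N.jsonl file holds a single JSON object, that the keys `verse`, `interpretationK`, and `sourceK` are spelled exactly as in the description, and that a missing pair is either absent or empty. The function names are hypothetical.

```python
import json
from pathlib import Path

# The description states that pairs run up to interpretation15/source15.
MAX_INTERPRETATIONS = 15


def load_record(path):
    """Return the first non-empty JSON object found in one row_N.jsonl file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                return json.loads(line)
    raise ValueError(f"no JSON object found in {path}")


def interpretation_pairs(record):
    """Collect (interpretation, source) pairs that are present and non-empty."""
    pairs = []
    for i in range(1, MAX_INTERPRETATIONS + 1):
        text = record.get(f"interpretation{i}")
        if text:  # assumption: absent, null, or empty all mean "no interpretation"
            pairs.append((text, record.get(f"source{i}")))
    return pairs


def iter_records(dataset_dir="full_dataset"):
    """Yield every record from files matching row_*.jsonl, in lexicographic file order."""
    for path in sorted(Path(dataset_dir).glob("row_*.jsonl")):
        yield load_record(path)
```

Reading one file at a time keeps memory use flat across all 24,746 records. Callers that need numeric order should sort on the id field, since a lexicographic sort places row_10.jsonl before row_2.jsonl.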
A detailed README.md file is included, covering the dataset schema, field descriptions, corpus statistics, and Python loading examples.
Kannada, despite being one of India’s classical languages, remains significantly under-resourced for NLP. KannadaLit4NLP addresses this gap by supporting a broad range of downstream applications, including semantic retrieval, semantic textual similarity, natural language inference, machine translation, generative interpretation modelling, and adaptation of language models to low-resource Kannada. The corpus was compiled over nearly six years through systematic source identification, OCR-assisted digitisation, and rigorous manual verification against authoritative printed editions, yielding a high-quality benchmark resource for AI research and the digital preservation of Kannada literary heritage.
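As one illustration of the generative interpretation use case, the sketch below separates already-loaded records into verse–interpretation training examples and verses that still lack a published interpretation. It is an assumption-laden example rather than the authors' pipeline: the function name and output layout are hypothetical, records are assumed to be plain dictionaries keyed as in the description, and an empty or missing interpretationK field is treated as "no interpretation".

```python
def build_interpretation_task(records, max_pairs=15):
    """Split records into training examples and verses without any interpretation.

    Returns (examples, uninterpreted): examples is a list of dicts with the keys
    "verse", "interpretation", and "source"; uninterpreted is a list of verse texts.
    """
    examples = []
    uninterpreted = []
    for record in records:
        verse = record.get("verse", "")
        found = False
        for i in range(1, max_pairs + 1):
            text = record.get(f"interpretation{i}")
            if text:  # assumption: empty or missing means no interpretation
                examples.append(
                    {
                        "verse": verse,
                        "interpretation": text,
                        "source": record.get(f"source{i}"),
                    }
                )
                found = True
        if not found:
            uninterpreted.append(verse)
    return examples, uninterpreted
```

A verse with several interpretations contributes one example per interpretation, so the number of examples can exceed the number of interpreted verses. When evaluating a model, it is safer to split train and test sets by verse rather than by example, so the same verse does not appear on both sides.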
Files
Institutions
- University of Mysore, Mysuru, Karnataka