KannadaLit4NLP: A Large-Scale Kannada Literary Corpus for Natural Language Processing

Name: KannadaLit4NLP: A Large-Scale Kannada Literary Corpus for Natural Language Processing
Creator: Basavanna C
Published: 2026-05-05T14:23:43.521Z
Keywords: Artificial Intelligence, Information Retrieval, Natural Language Processing, Text Comprehension, Text Processing, Indian Language, Generative Artificial Intelligence

C, Basavanna; Manjunath, S.; Guru, D. S.

doi:10.17632/nvjydxpxjr.1

Published: 5 May 2026 | Version 1 | DOI: 10.17632/nvjydxpxjr.1


Description

To the best of our knowledge, KannadaLit4NLP is the first large-scale, machine-readable corpus of classical Kannada literature curated specifically for Natural Language Processing (NLP) research. The dataset comprises 24,746 literary verse records drawn from three canonical Kannada literary traditions (Vachanas, Sarvajna’s Tripadis, and Mankutimmana Kagga), spanning nearly a millennium of Kannada literary production from the eleventh to the twentieth century.

The corpus is enriched with 22,369 parallel scholarly interpretations collected from 56 authoritative printed commentaries and specialist digital repositories, linked to 9,597 verses. The remaining 15,149 verses (61.2%) have no published interpretation, which makes the corpus a naturally incomplete benchmark for Generative AI–based interpretation generation, semantic inference, and retrieval-augmented explanation tasks.

The three constituent sub-corpora are:
• 21,701 Vachanas composed by 248 Śaranas, sourced from Samagra Vachana Samputa (Vols. 1–15);
• 2,099 Tripadis attributed to the sixteenth-century poet-philosopher Sarvajna;
• 946 verses of Mankutimmana Kagga by D. V. Gundappa.

In total, the dataset contains 2,414,716 Kannada word tokens distributed across 466,356 unique lexical types, making it one of the most substantial openly available structured corpora for classical Kannada language technology.

For computational efficiency and scalable experimentation, each of the 24,746 records is stored as an independent JSONL file (row_N.jsonl) inside the full_dataset directory. Every JSON object contains the metadata fields id, volume number, verse number, author, type, and verse, followed by numbered interpretation–source pairs (interpretation1/source1, interpretation2/source2, … up to interpretation15/source15) wherever available. All files are UTF-8 encoded.
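
The storage layout described above suggests a simple loading pattern. The following is a minimal sketch, not the README's own loading example. It assumes that each row_N.jsonl file holds one JSON object per non-empty line and that the interpretation and source keys are spelled exactly as in the description (interpretation1/source1 and so on). The function names read_records, interpretation_pairs, and coverage are hypothetical helpers introduced here for illustration.

```python
import json
from pathlib import Path

# The description states that pairs run up to interpretation15/source15.
MAX_PAIRS = 15


def read_records(path):
    """Yield every JSON object found in one row_N.jsonl file.

    Assumption: one JSON object per non-empty line, UTF-8 encoded.
    """
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def interpretation_pairs(record):
    """Return the (interpretation, source) pairs present in a record.

    Assumption: keys are named interpretationN / sourceN, N = 1..15.
    Absent or empty interpretation slots are skipped.
    """
    pairs = []
    for n in range(1, MAX_PAIRS + 1):
        text = record.get(f"interpretation{n}")
        if text:
            pairs.append((text, record.get(f"source{n}")))
    return pairs


def coverage(dataset_dir):
    """Count records with and without at least one interpretation."""
    with_interp = 0
    without_interp = 0
    for path in sorted(Path(dataset_dir).glob("row_*.jsonl")):
        for record in read_records(path):
            if interpretation_pairs(record):
                with_interp += 1
            else:
                without_interp += 1
    return with_interp, without_interp
```

Calling coverage("full_dataset") on a local copy would give the number of interpreted and uninterpreted verses, which can be compared against the 9,597 and 15,149 figures reported in the description. The uninterpreted records are the natural input set for interpretation-generation experiments.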
A detailed README.md file is included, covering the dataset schema, field descriptions, corpus statistics, and Python loading examples.

Kannada, despite being one of India’s classical languages, remains significantly under-resourced for NLP. KannadaLit4NLP addresses this gap by supporting a broad range of downstream applications, including semantic retrieval, semantic textual similarity, natural language inference, machine translation, generative interpretation modelling, and adaptation of low-resource Kannada language models. The corpus was compiled over nearly six years through systematic source identification, OCR-assisted digitisation, and rigorous manual verification against authoritative printed editions, yielding a high-quality benchmark resource for AI research and the digital preservation of Kannada literary heritage.

Institutions

University of Mysore
Mysuru, Karnataka
