Kannada Abstractive Text Summarization

Published: 15 May 2026 | Version 1 | DOI: 10.17632/pfx79p84cj.1
Contributors:
Dakshayani Ijeri, Pushpa Patil

Description

KannadaSum-10K (A 10,000-Sample Dataset for Kannada Abstractive Text Summarization) is a Kannada-language dataset developed for research on abstractive text summarization. It contains 10,000 article–summary pairs, each consisting of a Kannada article and its corresponding reference summary, and is intended to support training, fine-tuning, and evaluation of machine learning and deep learning models for Kannada summarization.
The dataset is organized into two fields: article and Reference Summary. The article field contains the source Kannada text, typically written as news-style or informative prose. The Reference Summary field contains a concise Kannada summary that captures the central idea of the article. This structure suits supervised abstractive summarization, where a model learns to generate a summary in its own words rather than extracting sentences from the original text.
The dataset can be used for Kannada abstractive summarization, low-resource language modeling, Indic NLP research, text generation, sequence-to-sequence learning, fine-tuning of transformer-based models, and comparative evaluation of multilingual summarization models. It may be particularly useful for training models such as mT5, IndicBART, mBART, ByT5, MuRIL-based encoder-decoder systems, and other transformer architectures adapted for Indian languages.
The primary objective of KannadaSum-10K is to contribute a reusable Kannada summarization resource to the NLP research community. By providing article and reference-summary pairs in Kannada, the dataset aims to support better summarization systems for regional-language digital content, news articles, educational material, and information-access applications. It may also help researchers study the challenges of Kannada text generation, including morphology, sentence structure, semantic compression, and content selection.
Before using the dataset for benchmarking, users should apply appropriate preprocessing and quality checks, remove any remaining duplicates if required, and create train–validation–test splits. Please cite the dataset when it is used in academic publications, experiments, or software systems.
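The duplicate removal and train–validation–test splitting mentioned above could be sketched as follows. This is a minimal illustration using only the Python standard library, not a procedure supplied with the dataset: the function name `dedupe_and_split`, the default key `"Article"`, the 80/10/10 proportions, and the seed are all assumptions, and rows are assumed to have been loaded already as a list of dictionaries.

```python
import random


def dedupe_and_split(pairs, article_key="Article",
                     val_frac=0.1, test_frac=0.1, seed=42):
    """Drop rows with repeated articles, then split into train/val/test.

    `pairs` is a list of dicts; `article_key` is an assumed column name
    and should be changed to match the actual file.
    """
    seen = set()
    unique = []
    for row in pairs:
        # Compare articles after collapsing whitespace, so copies that
        # differ only in spacing count as duplicates.
        key = " ".join(row[article_key].split())
        if key in seen:
            continue
        seen.add(key)
        unique.append(row)

    # A fixed seed keeps the split reproducible across runs.
    rng = random.Random(seed)
    rng.shuffle(unique)

    n_test = int(len(unique) * test_frac)
    n_val = int(len(unique) * val_frac)
    test = unique[:n_test]
    val = unique[n_test:n_test + n_val]
    train = unique[n_test + n_val:]
    return train, val, test
```

Removing duplicates before splitting, as done here, avoids the same article appearing in both the training and test portions, which would inflate benchmark scores.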

Files

Steps to reproduce

The KannadaSum-10K dataset was developed for Kannada abstractive text summarization research and contains 10,000 Kannada article–summary pairs collected from publicly available news sources such as Prajavani and Kannada Prabha. The dataset was manually curated: Kannada articles were gathered, and corresponding abstractive summaries were written with the help of Kannada scholars. Text preprocessing included removal of unwanted symbols, duplicate entries, and HTML tags, correction of encoding inconsistencies, Unicode normalization, and Kannada text formatting checks. The dataset consists of two fields: Article and Reference Summary. It was used to train and evaluate deep learning and transformer-based models, including an FFN Transformer Encoder–Decoder, an LSTM Seq2Seq model, a CNN, and IndicBART, using Python libraries such as TensorFlow, PyTorch, Hugging Face Transformers, Pandas, NumPy, and Scikit-learn.
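The text-cleaning steps named above (HTML tag removal, encoding cleanup, Unicode normalization) might look roughly like the sketch below. It is an illustration only, not the authors' actual pipeline: the function name `clean_text`, the choice of NFC as the normalization form, and the specific characters removed are assumptions. Zero-width joiner and non-joiner characters are deliberately left in place, since they can be meaningful in Kannada script.

```python
import html
import re
import unicodedata

TAG_RE = re.compile(r"<[^>]+>")   # simple HTML/XML tag pattern
WS_RE = re.compile(r"\s+")        # any run of whitespace


def clean_text(text):
    """Return `text` with tags stripped, Unicode normalized, spacing tidied."""
    # Strip tags first, then decode entities such as &nbsp; or &amp;,
    # so that escaped angle brackets are not mistaken for tags.
    text = TAG_RE.sub(" ", text)
    text = html.unescape(text)
    # NFC is an assumed choice; the source does not name the form used.
    text = unicodedata.normalize("NFC", text)
    # Remove zero-width space and byte-order mark left by bad encoding.
    text = text.replace("\u200b", "").replace("\ufeff", "")
    # Collapse whitespace (including non-breaking spaces) to single spaces.
    return WS_RE.sub(" ", text).strip()
```

Applying the same function to both the Article and Reference Summary fields keeps the two sides of each pair consistently normalized.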

Categories

Artificial Intelligence, Natural Language Processing, Text Processing

Licence