IndicDialogue Dataset

Published: 11 June 2024 | Version 2 | DOI: 10.17632/wcb4bxbyxx.2
Contributors:
Noor Mairukh Khan Arnob

Description

The IndicDialogue dataset contains raw subtitle files in SRT format and the dialogues extracted from them. The subtitles cover 10 Indic languages: Hindi, Bengali, Marathi, Telugu, Tamil, Urdu, Odia, Sindhi, Nepali, and Assamese. The dataset provides a corpus for NLP tasks in low-resource languages using Small Language Models (SLMs) and Large Language Models (LLMs).
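
For readers who want to see how dialogue text relates to a raw SRT file, the following is a minimal sketch of one possible extraction step. It is not the procedure the dataset authors used; the function name `extract_dialogues` and the cleaning choices (dropping cue numbers, timing lines, and HTML-style tags, and joining multi-line cues with a space) are assumptions made for illustration.

```python
import re

# Matches an SRT timing line such as "00:00:01,000 --> 00:00:03,500".
TIMING = re.compile(
    r"\d{2}:\d{2}:\d{2}[,.]\d{3}\s*-->\s*\d{2}:\d{2}:\d{2}[,.]\d{3}"
)
# Matches simple inline markup such as <i>...</i> or <font color="...">.
TAG = re.compile(r"<[^>]+>")


def extract_dialogues(srt_text):
    """Return the spoken text of each subtitle cue as a list of strings.

    Hypothetical helper: cue numbers and timing lines are discarded,
    inline tags are stripped, and multi-line cues are joined with a space.
    """
    # Remove a leading byte-order mark and normalise Windows line endings.
    text = srt_text.lstrip("\ufeff").replace("\r\n", "\n")
    dialogues = []
    # Cues in an SRT file are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [ln.strip() for ln in block.split("\n") if ln.strip()]
        for i, ln in enumerate(lines):
            if TIMING.search(ln):
                # Everything after the timing line is the cue's text.
                spoken = " ".join(TAG.sub("", part) for part in lines[i + 1:])
                spoken = spoken.strip()
                if spoken:
                    dialogues.append(spoken)
                break
    return dialogues
```

To apply it to a file, read the file as UTF-8 text and pass the contents to `extract_dialogues`; cues that contain no text after cleaning are skipped.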

Institutions

University of Asia Pacific

Categories

Natural Language Processing, Dialogue

Licence