IndicDialogue Dataset
Published: 11 June 2024 | Version 2 | DOI: 10.17632/wcb4bxbyxx.2
Contributors:
Noor Mairukh Khan Arnob
Description
The IndicDialogue dataset contains raw subtitle (SRT) files and the dialogues extracted from them. The subtitles cover 10 Indic languages: Hindi, Bengali, Marathi, Telugu, Tamil, Urdu, Odia, Sindhi, Nepali, and Assamese. The dataset provides a corpus for NLP tasks in low-resource languages using small language models (SLMs) and large language models (LLMs).
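For readers who want to work from the raw SRT files, the following is a minimal sketch of how dialogue text might be pulled out of a subtitle file. It is not the extraction procedure used to build this dataset, which is not described here. It assumes the common SRT layout (a cue number, a timing line, one or more text lines, then a blank line), and the names `extract_dialogues`, `TIMING`, and `TAG` are hypothetical, introduced only for illustration.

```python
import re

# Timing line of an SRT cue, e.g. "00:00:01,000 --> 00:00:03,500".
TIMING = re.compile(
    r"\d{2}:\d{2}:\d{2}[,.]\d{3}\s*-->\s*\d{2}:\d{2}:\d{2}[,.]\d{3}"
)
# Simple inline markup such as <i>...</i> that some subtitle files carry.
TAG = re.compile(r"<[^>]+>")


def extract_dialogues(srt_text):
    """Return one dialogue string per subtitle cue found in srt_text."""
    dialogues = []
    # Normalize line endings and drop a leading byte-order mark if present.
    normalized = srt_text.replace("\r\n", "\n").lstrip("\ufeff").strip()
    # Cues are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", normalized):
        lines = [ln.strip() for ln in block.split("\n") if ln.strip()]
        for i, ln in enumerate(lines):
            if TIMING.search(ln):
                # Everything after the timing line is the cue's text.
                text = " ".join(TAG.sub("", t) for t in lines[i + 1:]).strip()
                if text:
                    dialogues.append(text)
                break
    return dialogues
```

Given the text of one SRT file read as UTF-8, `extract_dialogues` returns a list with one string per cue, with multi-line cues joined by a space. Real subtitle files may contain malformed cues or other markup that this sketch does not handle.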
Institutions
University of Asia Pacific
Categories
Natural Language Processing, Dialogue