WoLLaI: Word Level Language Identification of Malayalam-English Code-Mixed text

Published: 29 November 2023 | Version 1 | DOI: 10.17632/pn6xjx4zkd.1


WoLLaI Malayalam-English Mixed-Text is a curated and annotated dataset for word-level language identification in Malayalam-English code-mixed text. The dataset consists of 12,502 tokenized sentences. The dataset file is organized into three columns: sentence number, word, and language. Each word is annotated with one of four language classes: Mal, Eng, Mix, or Othr. Words that belong to the Malayalam language and are recognized by Malayalam speakers are annotated as Mal. Words that belong to the English language and are easily recognized by English speakers are annotated as Eng. Words formed by combining Malayalam and English, where Malayalam suffixes are attached to English words or parts of English words to make them easier for Malayalam speakers to understand, are annotated as Mix. Other elements, such as numbers, abbreviations, and named entities, are annotated as Othr.
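
The three-column layout described above can be read with a short script. The following Python sketch is illustrative only: the header names (`sentence_no`, `word`, `language`), the comma delimiter, and the sample rows are assumptions made for this example and are not taken from the dataset file, so adjust them to match the actual file.

```python
import csv
import io
from collections import Counter, defaultdict

# Invented sample rows in the three-column layout (sentence number, word,
# language). Header names, delimiter, and tokens are assumptions.
SAMPLE = """sentence_no,word,language
1,ente,Mal
1,phone,Eng
1,2023,Othr
2,classil,Mix
2,poyi,Mal
"""

# The four annotation classes named in the dataset description.
VALID_LABELS = {"Mal", "Eng", "Mix", "Othr"}


def read_tokens(handle):
    """Group (word, label) pairs by sentence number and count the labels."""
    sentences = defaultdict(list)
    label_counts = Counter()
    for row in csv.DictReader(handle):
        label = row["language"].strip()
        if label not in VALID_LABELS:
            raise ValueError(f"unexpected label: {label!r}")
        sentences[row["sentence_no"]].append((row["word"], label))
        label_counts[label] += 1
    return dict(sentences), label_counts


sentences, label_counts = read_tokens(io.StringIO(SAMPLE))
for sentence_no, tokens in sentences.items():
    print(sentence_no, tokens)
print(dict(label_counts))
```

To use the sketch on the real file, replace `io.StringIO(SAMPLE)` with an opened file handle, for example `open(path, newline="", encoding="utf-8")`, where `path` is wherever the downloaded file is stored.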



Pondicherry University


Natural Language Processing, Machine Learning, Ambient Intelligence, Information Extraction, Deep Learning