SeLLaI Mal-Eng: Sentence Level Language Identification of Malayalam-English Code-Mixed Text
Description
SeLLaI Mal-Eng is a curated and annotated dataset for sentence-level language identification in Malayalam-English code-mixed text. It comprises 22,400 sentences, all written in the English (Latin) alphabet. The dataset file has two columns: sentence and language. The language column holds one of three classes: Manglish, Code-Mixed, or English. Sentences in the Malayalam language that are written in the English alphabet are annotated as Manglish; they may consist solely of Malayalam words or combine Malayalam and English words. Sentences containing words formed by combining Malayalam and English, where Malayalam suffixes are attached to English words or parts of English words to aid comprehension for Malayalam speakers, are annotated as Code-Mixed. Sentences that are in the English language and easily recognizable by English speakers are annotated as English.
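As a minimal sketch of how the two-column layout described above might be read and checked, the following Python snippet parses rows with a sentence and a language column and counts the three classes. It assumes the file is comma-separated with a header row and that the labels are spelled exactly "Manglish", "Code-Mixed", and "English"; neither is confirmed by the description. The sample rows and the helper names `read_rows` and `label_counts` are hypothetical and invented for illustration; the rows are not taken from the dataset.

```python
import csv
import io
from collections import Counter

# Class names as given in the description; their exact spelling
# inside the dataset file is an assumption.
LABELS = {"Manglish", "Code-Mixed", "English"}

# Hypothetical rows invented for illustration only (not from the dataset).
# Assumes a comma-separated file with a header row.
SAMPLE_CSV = """sentence,language
njan veetil poyi,Manglish
avan caril vannu,Code-Mixed
I am going home now,English
"""


def read_rows(handle):
    """Yield (sentence, language) pairs and reject unknown labels."""
    reader = csv.DictReader(handle)
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        label = row["language"].strip()
        if label not in LABELS:
            raise ValueError(f"line {line_no}: unexpected label {label!r}")
        yield row["sentence"].strip(), label


def label_counts(handle):
    """Count how many sentences carry each language label."""
    return Counter(label for _, label in read_rows(handle))


counts = label_counts(io.StringIO(SAMPLE_CSV))
print(counts)
```

To run the same check on the actual dataset file, pass a file object opened with `newline=""` and `encoding="utf-8"` in place of the in-memory sample.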