SeLLaI Mal-Eng: Sentence Level Language Identification of Malayalam-English Code-Mixed Text

Name: SeLLaI Mal-Eng: Sentence Level Language Identification of Malayalam-English Code-Mixed Text
Creator: AFSAL CP
Published: 2024-01-15T16:33:49.018Z
Keywords: Artificial Intelligence, Natural Language Processing, Text Extraction, Information Classification, Text Processing, Deep Learning, Transformer-Based Deep Learning

CP, AFSAL; KS, Kuppusamy

doi:10.17632/5p4zbpy8wz.1

Published: 15 January 2024 | Version 1 | DOI: 10.17632/5p4zbpy8wz.1

Contributors:

AFSAL CP, Kuppusamy KS

Description

SeLLaI Mal-Eng is a curated and annotated dataset for sentence-level language identification in Malayalam-English code-mixed text. It comprises 22,400 sentences, all written in the English (Latin) alphabet. The dataset file is organized into two columns: sentence and language. The language column takes one of three classes: Manglish, Code-Mixed, or English. Malayalam sentences written in the English alphabet are annotated as Manglish; such a sentence may consist entirely of Malayalam words or of a combination of Malayalam and English words. Sentences containing hybrid words, in which Malayalam suffixes are attached to English words or word fragments to make them easier for Malayalam speakers to understand, are annotated as Code-Mixed. Sentences that are in English and readily recognizable as such by English speakers are annotated as English.
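
The two-column layout described above can be read and sanity-checked with a few lines of Python. The following is a minimal sketch only: it assumes the file is a CSV with a header row whose column names are exactly `sentence` and `language`, which the description does not confirm. The function names and the in-memory sample are hypothetical, and the sample sentences are invented placeholders, not rows from the dataset.

```python
import csv
import io
from collections import Counter

# The three annotation classes named in the dataset description.
EXPECTED_LABELS = {"Manglish", "Code-Mixed", "English"}


def load_rows(handle):
    """Read (sentence, language) pairs from an open CSV handle.

    Assumption: the header row uses the exact names "sentence" and "language".
    """
    rows = []
    for record in csv.DictReader(handle):
        sentence = record["sentence"].strip()
        label = record["language"].strip()
        if label not in EXPECTED_LABELS:
            raise ValueError(f"unexpected label: {label!r}")
        rows.append((sentence, label))
    return rows


def label_counts(rows):
    """Count how many sentences carry each language label."""
    return Counter(label for _, label in rows)


# Invented placeholder rows standing in for the real file (NOT dataset content).
SAMPLE = io.StringIO(
    "sentence,language\n"
    "placeholder sentence one,Manglish\n"
    "placeholder sentence two,Manglish\n"
    "placeholder sentence three,Code-Mixed\n"
    "placeholder sentence four,English\n"
)

sample_rows = load_rows(SAMPLE)
counts = label_counts(sample_rows)

# For the real file, pass an open handle instead (the path is hypothetical):
# with open("path/to/dataset.csv", newline="", encoding="utf-8") as handle:
#     counts = label_counts(load_rows(handle))
```

Run against the full file, the counts should sum to 22,400 if the layout assumptions hold; a `ValueError` would indicate a label outside the three documented classes.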

Institutions

Pondicherry University
