MCodeScript

Published: 27 December 2024 | Version 1 | DOI: 10.17632/zcgbnn8zbm.1
Contributor:
Madhuri Kumbhar

Description

MCodeScript is a suite of unsupervised and supervised Marathi-English code-mixed and script-mixed datasets built from comments collected from various sources, such as social media sites, community groups, and news websites. McodeScript-un is an unsupervised gold dataset with 1.3 lakh (130,000) code-mixed and script-mixed comments. McodeScript-un-annot is a version of McodeScript-un transformed by an automated procedure. MCodeScript-MeSent is the supervised Romanized Marathi dataset MeSent, transformed into a script-mixed format through our automated process.

Files

Steps to reproduce

1. Set up the environment: Install Python 3.x and the necessary libraries and dependencies: beautifulsoup4, requests, pandas, nltk, regex, tensorflow, torch, etc.
2. Data collection:
   Source selection: Choose the online platforms or data sources (e.g., social media, forums) where users are likely to post code-mixed content.
   Web scraping/collection: Use a scraping tool or API to collect posts or comments containing code-mixed Marathi-English content.
3. Preprocessing:
   Text normalization: Clean the data by removing unnecessary characters (e.g., HTML tags, special symbols, or excessive punctuation). Normalize the text (e.g., lowercasing, handling Unicode). Optionally, remove or convert emojis and other non-textual elements.
   Tokenization: Tokenize the text into words and identify instances of code-mixing and script-mixing.
4. Script identification: For each word, detect whether it is written in Devanagari (Marathi) or Latin script.
5. Filtering and storage: Check each comment for code-mixing and script-mixing, and store the comments that qualify in the dataset.
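The preprocessing and script-identification steps above can be sketched as follows. This is a minimal illustration, not the procedure used to build MCodeScript: the function names (normalize, tag_scripts, is_script_mixed), the regular expressions, and the rule that a comment counts as script-mixed when it contains at least one Devanagari word and at least one Latin word are all assumptions made for the example. It covers script-mixing only; detecting code-mixing inside Romanized text would need a word-level language identifier, which the steps do not specify.

```python
import re
import unicodedata

# Assumed patterns for this sketch (not taken from the dataset authors).
HTML_TAG = re.compile(r"<[^>]+>")
# Devanagari letters and vowel signs; danda (U+0964/U+0965) and digits are excluded.
DEVANAGARI_WORD = r"[\u0900-\u0963\u0971-\u097F]+"
LATIN_WORD = r"[A-Za-z]+"
TOKEN = re.compile(
    f"(?P<devanagari>{DEVANAGARI_WORD})|(?P<latin>{LATIN_WORD})"
)


def normalize(text):
    """Apply Unicode NFC, strip HTML tags, collapse whitespace, lowercase."""
    text = unicodedata.normalize("NFC", text)
    text = HTML_TAG.sub(" ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text.lower()  # lowercasing only affects the Latin-script words


def tag_scripts(text):
    """Return (word, script) pairs, where script is 'devanagari' or 'latin'."""
    return [(m.group(), m.lastgroup) for m in TOKEN.finditer(normalize(text))]


def is_script_mixed(text):
    """Assumed rule: the comment has words in both scripts."""
    scripts = {script for _, script in tag_scripts(text)}
    return {"devanagari", "latin"} <= scripts


if __name__ == "__main__":
    comments = [
        "Movie खूप छान होती",      # both scripts
        "khup chan movie hoti",    # Latin only (Romanized Marathi + English)
        "खूप छान",                 # Devanagari only
    ]
    kept = [c for c in comments if is_script_mixed(c)]
    print(kept)
```

Digits, emojis, and punctuation are simply skipped by the tokenizer here; a real pipeline might convert or preserve them, as the optional step in the preprocessing stage suggests.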

Categories

Natural Language Processing, Language, Bidirectional Encoder Representations From Transformers

Licence