Hinglish Language Corpus: A Blend of Synthetically Generated and Manually Written Sentences for NLP Research

Published: 11 June 2024 | Version 2 | DOI: 10.17632/vdtcp2yt9n.2
Puneet Arora


This dataset is a collection of Hinglish (a mix of Hindi and English) sentences. It combines text generated synthetically with several Large Language Models (LLMs), such as ChatGPT, Gemini AI, Claude, Groq, and DeepSeek, with manually written sentences. The text covers a range of source types, including meeting minutes, debates, articles, short essays, emails, letters, tweets, communications, and quotes, all composed in Hinglish.

The primary objective is to support research in Natural Language Processing (NLP), particularly on code-mixed languages like Hinglish. By providing Hinglish text from various domains and sources, the dataset aims to enable researchers to develop and test NLP techniques, models, and applications that handle the challenges posed by code-mixed languages. The synthetic portion offers a large volume of diverse Hinglish text that can be used for training and fine-tuning NLP models. The manually written sentences provide a benchmark for evaluating these models on human-written Hinglish text.


Steps to reproduce

The dataset was created as follows.

0. Brainstorming Ideas:

Sentence Types:
* Declarative: These sentences state facts or opinions (e.g., "Chai bahut achha hai" - The tea is very good).
* Interrogative: These sentences ask questions (e.g., "Aap kab aa rahe hain?" - When are you coming?).
* Imperative: These sentences give commands (e.g., "Please book a cab jaldi se" - Book a cab quickly, please).
* Exclamatory: These sentences express strong emotions (e.g., "Wow! Kya baat hai!" - Wow! That's amazing!).

Hinglish Characteristics:
* Code-mixing: English words are integrated into Hindi sentences (e.g., "Mujhe bahut miss kar rahi hoon" - I miss you a lot).
* Romanized Hindi: Hindi words are written in the Roman script (e.g., "Kal milte hain" - Let's meet tomorrow).
* Informal tone: Hinglish is often used in casual communication.

1. Sentence List and Manual Writing:

Create a list of sentence types, with examples covering declarative, interrogative, imperative, exclamatory, and other structures. Write sample sentences for each type. Some examples with Hinglish characteristics:
* Declarative: "Weekend plans cancel ho gaye" (Weekend plans got cancelled).
* Interrogative: "Kya aap chai peeyenge?" (Would you like some tea?).
* Imperative: "Abhi mujhe phone mat karo" (Don't call me right now).
* Exclamatory: "Yeh movie bohat boring hai!" (This movie is so boring!).

Learn from sentence characteristics: Analyze the written sentences, identifying common word order, verb conjugations, and Hinglish-specific features like code-mixing and Romanization.

2. Writing Prompts based on Learnings:

Use the identified characteristics to write prompts for synthetic sentence generation.
Here is one of the templates:

```
Generate a Hinglish sentence that:
* Type: [Declarative/Interrogative/Imperative/Exclamatory]
* Subject: [Optional: Specify a subject]
* Verb: [Optional: Specify a verb or verb tense]
* Object: [Optional: Specify an object]
* Hinglish features: [Include code-mixing, Romanization, etc.]
* Additional details: [Optional: Add specific details or context]
```

Example Prompt:

Generate a Hinglish sentence that:
* Type: Interrogative
* Subject: You
* Verb: be (present tense)
* Object: hungry?
* Hinglish features: Code-mixing
* Additional details: Informal tone

This prompt could generate a sentence like: "Hungry ho kya?"

3. Execution and Curation:

Use a large language model (LLM) to generate synthetic sentences based on such prompts. Curate the generated data: review the synthetic sentences and discard nonsensical or grammatically incorrect ones. The prompts can also be refined based on the generated output.

Repeat steps 1-3 to generate a diverse dataset covering various Hinglish structures and scenarios.
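The template and curation steps above can be sketched in Python. This is a minimal illustration, not the author's actual tooling: the names `PromptSpec`, `build_prompt`, and `prefilter` are hypothetical, and the LLM call itself is left out because the description does not say which API or settings were used. The pre-filter only removes empty and duplicate outputs; deciding whether a sentence is nonsensical or ungrammatical is still assumed to be a manual review step.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional

# The four sentence types named in the template.
SENTENCE_TYPES = ("Declarative", "Interrogative", "Imperative", "Exclamatory")


@dataclass
class PromptSpec:
    """Hypothetical container for the fields of the prompt template."""
    sentence_type: str
    hinglish_features: str
    subject: Optional[str] = None
    verb: Optional[str] = None
    obj: Optional[str] = None
    details: Optional[str] = None


def build_prompt(spec: PromptSpec) -> str:
    """Render the template, skipping optional fields that were not set."""
    if spec.sentence_type not in SENTENCE_TYPES:
        raise ValueError(f"unknown sentence type: {spec.sentence_type!r}")
    fields = [
        ("Type", spec.sentence_type),
        ("Subject", spec.subject),
        ("Verb", spec.verb),
        ("Object", spec.obj),
        ("Hinglish features", spec.hinglish_features),
        ("Additional details", spec.details),
    ]
    lines = ["Generate a Hinglish sentence that:"]
    lines += [f"* {label}: {value}" for label, value in fields if value]
    return "\n".join(lines)


def prefilter(sentences: Iterable[str]) -> List[str]:
    """Drop empty strings and case-insensitive duplicates, keeping first-seen order.

    Assumption: this is a mechanical pass only. Quality judgments
    (nonsensical or ungrammatical output) are left to a human reviewer.
    """
    seen = set()
    kept = []
    for raw in sentences:
        text = " ".join(raw.split())  # collapse runs of whitespace
        key = text.casefold()
        if text and key not in seen:
            seen.add(key)
            kept.append(text)
    return kept


if __name__ == "__main__":
    spec = PromptSpec(
        sentence_type="Interrogative",
        hinglish_features="Code-mixing",
        subject="You",
        verb="be (present tense)",
        obj="hungry?",
        details="Informal tone",
    )
    print(build_prompt(spec))
    print(prefilter(["Hungry ho kya?", "hungry  ho kya?", "", "Kal milte hain"]))
```

Running the script prints the example prompt from step 2 and a de-duplicated list of sentences. The rendered prompt string would then be sent to whichever LLM is in use, and the filtered output passed on for manual curation.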


Natural Language Generation, Sentence Comprehension, Sentence Parsing, Sentence Processing, Public Sentiment, Language Modeling