Hinglish Language Corpus: A Blend of Synthetically Generated and Manually Written Sentences for NLP Research
Description
This dataset is a collection of Hinglish (mixed Hindi and English) sentences. It combines manually written sentences with text generated synthetically using Large Language Models (LLMs) and LLM services such as ChatGPT, Gemini AI, Claude, Groq, and DeepSeek. The sentences cover a range of text types, all composed in Hinglish: meeting minutes, debates, articles, short essays, emails, letters, tweets, communications, and quotes. The primary objective is to support Natural Language Processing (NLP) research on code-mixed languages such as Hinglish. By providing Hinglish text from several domains and sources, the dataset is intended to help researchers develop and test NLP techniques, models, and applications that handle the particular challenges of code-mixed text. The synthetic portion, generated with LLMs, supplies diverse Hinglish text that can be used for training and fine-tuning NLP models. The manually written sentences, in turn, can serve as a benchmark for evaluating how those models perform on human-written Hinglish.
Steps to reproduce
1) Write sentences manually and study their characteristics.
2) Write a prompt based on what was learned in step 1.
3) Execute the prompt and curate the resulting dataset.
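The three steps above might look roughly like the following in Python. This is a minimal sketch, not the authors' actual pipeline: the function names `build_prompt` and `curate`, the prompt wording, the example sentences, and the curation rules (whitespace cleanup, a minimum word count, and case-insensitive de-duplication) are all assumptions made for illustration. The call to an LLM is left out, since the source does not say how the models were invoked.

```python
import re


def build_prompt(seed_sentences, text_type, n_outputs=5):
    """Assemble a few-shot prompt from manually written seed sentences.

    The wording of the prompt is hypothetical; the source does not give it.
    """
    examples = "\n".join(f"- {s}" for s in seed_sentences)
    return (
        f"Write {n_outputs} new Hinglish (Hindi-English code-mixed) sentences "
        f"in the style of a {text_type}.\n"
        f"Match the tone and mixing pattern of these examples:\n{examples}"
    )


def curate(candidates, min_words=3):
    """Clean generated sentences and drop very short lines and duplicates.

    Duplicates are compared case-insensitively after collapsing whitespace.
    First-seen order is preserved. These rules are illustrative assumptions.
    """
    seen = set()
    kept = []
    for raw in candidates:
        cleaned = re.sub(r"\s+", " ", raw).strip()
        key = cleaned.lower()
        if len(cleaned.split()) < min_words or key in seen:
            continue
        seen.add(key)
        kept.append(cleaned)
    return kept


# Step 1: manually written seed sentences (made-up examples).
seeds = [
    "Kal ki meeting postpone ho gayi hai.",
    "Please report jaldi bhej dena.",
]

# Step 2: a prompt derived from those seeds.
prompt = build_prompt(seeds, "email")

# Step 3: curate whatever the model returned (stand-in list, not real output).
generated = [
    "Aaj ka agenda share kar do.",
    "aaj ka  agenda share kar do.",
    "Ok",
    "Main office late pahunchunga today.",
]
curated = curate(generated)
```

In practice the curation step would probably also need a manual review pass, since simple string rules cannot judge whether a sentence is natural Hinglish.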