Hinglish Language Corpus: A Blend of Synthetically Generated and Manually Written Sentences for NLP Research

Name: Hinglish Language Corpus: A Blend of Synthetically Generated and Manually Written Sentences for NLP Research
Creator: Puneet Arora
Published: 2024-05-23T19:29:02.663Z
Keywords: Natural Language Processing, Natural Language Semantics, Natural Language Generation, Sentence Comprehension, Sentence Parsing, Sentence Processing, Sentence Production, Public Sentiment

doi:10.17632/vdtcp2yt9n.1

Published: 23 May 2024 | Version 1 | DOI: 10.17632/vdtcp2yt9n.1

Contributor:

Puneet Arora

Description

This dataset is a collection of Hinglish (code-mixed Hindi and English) sentences. It combines synthetic text generated with several Large Language Models (LLMs) and LLM services, such as ChatGPT, Gemini AI, Claude, Groq, and DeepSeek, with manually written sentences. The text covers a range of formats, including meeting minutes, debates, articles, short essays, emails, letters, tweets, communications, and quotes, all composed in Hinglish.

The primary objective is to support Natural Language Processing (NLP) research on code-mixed languages like Hinglish. By providing a corpus of Hinglish text from various domains and sources, the dataset aims to help researchers develop and test NLP techniques, models, and applications that handle the challenges of code-mixed text. The synthetic portion offers a large volume of diverse Hinglish text for training and fine-tuning NLP models, while the manually written sentences provide a benchmark for evaluating those models on human-written Hinglish.
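One way to use the two portions as described is to train on the synthetic sentences and hold out the manual ones for evaluation. The sketch below assumes each record carries a `text` field and a `source` label with the values `synthetic` and `manual`; these field names and values are hypothetical, since the file layout is not documented here, and the sample sentences are made up for illustration rather than taken from the dataset.

```python
from collections import defaultdict

# Illustrative records only. The "text" and "source" field names are
# assumptions, not the dataset's documented schema.
records = [
    {"text": "Kal ki meeting postpone ho gayi hai.", "source": "manual"},
    {"text": "Mujhe aaj office late pahunchna hai.", "source": "synthetic"},
    {"text": "Client ne proposal approve kar diya.", "source": "synthetic"},
]


def split_by_source(rows):
    """Group sentence texts by their (assumed) source label."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["source"]].append(row["text"])
    # Synthetic text for training, manual text as the evaluation benchmark.
    return groups["synthetic"], groups["manual"]


train_sentences, eval_sentences = split_by_source(records)
```

Keeping the manual sentences out of training entirely preserves their role as a benchmark of human-written Hinglish.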

Steps to reproduce

1) Write sentences manually and note their characteristics.
2) Write a prompt based on those characteristics.
3) Run the prompt and curate the resulting dataset.
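The steps above can be sketched roughly as follows. This is a minimal illustration, not the creator's actual pipeline: the prompt wording, the `build_prompt` and `curate` helpers, the `min_words` threshold, and the sample sentences are all assumptions, and the model call is replaced by a hard-coded list standing in for LLM output.

```python
import re
import unicodedata


def normalize(sentence: str) -> str:
    """Lowercase, normalise Unicode, and collapse whitespace for duplicate detection."""
    text = unicodedata.normalize("NFKC", sentence).lower()
    return re.sub(r"\s+", " ", text).strip()


def build_prompt(seed_sentences: list[str], text_type: str, count: int) -> str:
    """Step 2: turn manually written examples into a few-shot prompt (hypothetical wording)."""
    examples = "\n".join(f"- {s}" for s in seed_sentences)
    return (
        f"Write {count} new Hinglish sentences for this text type: {text_type}.\n"
        "Mix Hindi and English naturally, as in these examples:\n"
        f"{examples}"
    )


def curate(candidates: list[str], min_words: int = 3) -> list[str]:
    """Step 3: drop very short lines and repeated sentences, keeping first occurrences."""
    seen: set[str] = set()
    kept: list[str] = []
    for sentence in candidates:
        key = normalize(sentence)
        if len(key.split()) < min_words or key in seen:
            continue
        seen.add(key)
        kept.append(sentence.strip())
    return kept


# Step 1: manually written seed sentences (made up for this sketch).
seeds = ["Kal ki meeting postpone ho gayi hai.", "Please report jaldi bhej dena."]
prompt = build_prompt(seeds, "email", 5)

# Stand-in for model output; a real run would send `prompt` to an LLM.
raw_output = [
    "Mujhe aaj office late pahunchna hai.",
    "mujhe aaj  office late pahunchna hai.",
    "Ok.",
    "Client ne proposal approve kar diya.",
]
cleaned = curate(raw_output)
```

Here curation only removes duplicates and very short lines; a real pass would likely also involve manual review of whether each sentence genuinely mixes Hindi and English.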
