Multimodal Hinglish Tweet Dataset for Deep Pragmatic Analysis

Published: 27 December 2023 | Version 7 | DOI: 10.17632/y63frd6pmf.7
pratibha verma preeti


This dataset was curated to capture tweets that express raw emotions, sentiments, feelings, and textual gestures related to situations of conflict. These situations include, but are not limited to, wars, crises, civil unrest, and World Wars I, II, and III. The fundamental objective is to enable the development of digital applications that can discern the genuine emotional pulse of digitally empowered citizens. Such tools aim to give both the general public and government authorities real-time insight into public sentiment, improving their understanding of, and responsiveness to, critical situations.


Steps to reproduce

The goal is to collect multimodal tweet data in Hinglish. Select tweets using keywords, hashtags, user accounts, or areas of interest such as conflicts, wars, and crises, then proceed as follows:

1. Set up Twitter API access: You will need access to Twitter's data. Apply for a Twitter Developer Account, create a project, and generate API keys.
2. Use a tweet scraper or scraper library: Use a Python library such as Tweepy, which wraps the Twitter API, or Twint, which scrapes public tweets without API keys. These libraries let you retrieve tweets according to your specified parameters.
3. Filter tweets: After retrieving the tweets, filter and process them. This involves removing retweets, filtering by language (if necessary), and cleaning the text (removing URLs, Twitter handles, special characters, etc.).
4. Extract relevant information: From the retrieved and filtered tweets, extract fields such as the tweet text, tweet creation time, user's name, user's location (if available), retweet count, favorite count, and other relevant data.
5. Perform sentiment analysis (optional): Depending on your research objectives, you may want to categorize the tweets as positive, neutral, or negative.
6. Store the data: Store the collected and processed data in a format suitable for future analysis, such as a CSV file, a database, or any other storage medium that fits your needs.
7. Update the dataset periodically: Depending on your research needs, you may want to keep the dataset up to date by periodically re-running your scraping and processing code.

When collecting data from social media platforms such as Twitter, respect user privacy and comply with the platform's terms of service. Given the sensitive context of this dataset (wars, crises, civil unrest, etc.), it is also crucial to handle the information ethically.
The dataset has been updated and now comprises 10,040 tweets in English, Hindi, and Hinglish.
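The filtering, extraction, and storage steps described above can be sketched in Python using only the standard library. This is a minimal illustration, not the authors' actual pipeline: the input is assumed to be a list of dictionaries, and the field names (`text`, `created_at`, `user`, `retweet_count`, `favorite_count`), the `RT @` retweet heuristic, and the helper names `clean_text`, `process_tweets`, and `write_csv` are all hypothetical choices made for this example.

```python
import csv
import re

# Patterns for the cleaning step (assumed rules, not the authors' exact ones).
URL_RE = re.compile(r"https?://\S+|www\.\S+")
HANDLE_RE = re.compile(r"@\w+")
# Keep word characters, whitespace, the Devanagari block (so Hindi vowel
# signs survive), and a little basic punctuation; drop everything else.
SPECIAL_RE = re.compile(r"[^\w\s\u0900-\u097F.,!?'-]")

# Hypothetical column layout for the stored dataset.
FIELDS = ["created_at", "user", "text", "retweet_count", "favorite_count"]


def is_retweet(text):
    """Heuristic: classic retweets start with 'RT @'."""
    return text.startswith("RT @")


def clean_text(text):
    """Remove URLs, handles, and special characters, then collapse spaces."""
    text = URL_RE.sub(" ", text)
    text = HANDLE_RE.sub(" ", text)
    text = SPECIAL_RE.sub(" ", text)
    return " ".join(text.split())


def process_tweets(tweets):
    """Drop retweets, clean the text, and keep only the fields in FIELDS."""
    rows = []
    for tweet in tweets:
        if is_retweet(tweet["text"]):
            continue
        row = {field: tweet.get(field, "") for field in FIELDS}
        row["text"] = clean_text(tweet["text"])
        rows.append(row)
    return rows


def write_csv(rows, fileobj):
    """Write the processed rows to an open text file object as CSV."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
```

To store the result, open a file with `open("tweets.csv", "w", newline="", encoding="utf-8")` and pass it to `write_csv` together with the output of `process_tweets`. Retrieval through the Twitter API and the optional sentiment analysis step are left out deliberately, since both depend on credentials and models that this description does not specify.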


Chitkara Institute of Engineering and Technology


Natural Language Processing, Pragmatic Processing, Sentiment Analysis