MarathiSarc

Published: 16 December 2024 | Version 1 | DOI: 10.17632/273n4sr2z3.1
Contributor:
Pravin Patil

Description

Sarcasm detection is the task of predicting whether a given text is sarcastic or not. Given the challenges of detecting sarcasm in sentiment-bearing text, sarcasm detection has become one of the hot research areas in Natural Language Processing. A considerable amount of research has been done in this area for foreign languages such as English, Czech, Italian, Dutch, Indonesian etc. A small amount of work is also available for Indian languages such as Hindi, Tamil, Bengali etc. However, Marathi, despite being the third most popular language in India, lags far behind in this area. One of the most crucial reasons for this is the absence of a proper dataset.

We present MarathiSarc, a dataset of labelled Marathi tweets for sarcasm detection. Considering the limitations of the Twitter API, we preferred to use the Twint library for collecting the tweets. Using this, we were able to collect 2361 tweets in the Marathi language. In the first stage, using the hashtag-based supervision technique, we collected Marathi tweets containing hashtags such as #sarcasm, #sarcastic, #sarcasmic, #irony, #ironic etc. The time period of the corpus is from December 2011.

We have manually labelled the entire dataset into three classes as follows:
• Tweets that contain hashtags such as #sarcasm, #sarcastic, #sarcasmic, #व्यंग, #irony, #ironic and are found to be actually sarcastic are labelled as sarcastic (1).
• Tweets that contain hashtags such as #sarcasm, #sarcastic, #sarcasmic, #व्यंग, #irony, #ironic but are found to be actually non-sarcastic are labelled as non-sarcastic (-1).
• Tweets that may be sarcastic depending on the conversational history and the context are marked as possibly sarcastic.
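The hashtag-based first stage and the stated label codes can be illustrated with a minimal Python sketch. It is not the contributor's collection code: the function names, the regular expression, and the sample strings are assumptions made for illustration, and the seed list contains only the hashtags named in the description (the "etc." there suggests the real list may be longer). The numeric code for the "possibly sarcastic" class is not given in the description, so it is left out.

```python
import re

# Seed hashtags named in the description; the actual list may be longer.
SEED_HASHTAGS = {"#sarcasm", "#sarcastic", "#sarcasmic", "#व्यंग", "#irony", "#ironic"}

# Label codes stated in the description. The code used for the
# "possibly sarcastic" class is not stated there, so it is omitted.
LABEL_NAMES = {1: "sarcastic", -1: "non-sarcastic"}

# Match '#' followed by anything up to whitespace, another '#', or common
# punctuation. r"#\w+" is avoided on purpose: in Python, \w does not match
# Devanagari combining signs, so it would cut "#व्यंग" short.
HASHTAG_RE = re.compile(r"#[^\s#,.!?;:]+")


def extract_hashtags(text):
    """Return the set of hashtags in a tweet, lower-cased for comparison."""
    return {tag.lower() for tag in HASHTAG_RE.findall(text)}


def is_candidate(text):
    """True if the tweet carries at least one seed hashtag."""
    return bool(extract_hashtags(text) & SEED_HASHTAGS)


if __name__ == "__main__":
    # Made-up sample strings, not taken from the dataset.
    samples = [
        "वा! काय छान रस्ते आहेत #Sarcasm",
        "आज पाऊस आहे",
        "मस्त #व्यंग,#irony",
    ]
    for s in samples:
        print(is_candidate(s), sorted(extract_hashtags(s)))
```

A filter like this only selects candidate tweets; as the description notes, the final labels were assigned manually, since a sarcasm hashtag alone does not guarantee that a tweet is sarcastic.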

Categories

Natural Language Processing, Machine Learning