Annotated The dUCk Tweets Dataset

Published: 13 August 2021 | Version 2 | DOI: 10.17632/876tc4dkts.2
Contributors:
Hui Ming Tham,
Keng Hoon Gan

Description

This dataset consists of unique annotated English-Malay code-switching, pure English, and pure Malay tweets derived from raw_tweets_012019_to_062020.csv on Kaggle (Carlson, 2020). The raw tweets file contains users’ tweets about a Malaysian brand called ‘The dUCk Group’, which was founded by Vivy Yusof and sells scarves, bags, cosmetics, stationery, and Home & Living products. When preparing this dataset, duplicated, invalid, and unusable data rows were removed. The tweets were then annotated with a language category: “ENG” for pure English tweets, “BM” for pure Malay tweets, and “ENG-BM” for code-switching tweets. In addition, each tweet was annotated with a sentiment value: 0 for neutral, 1 for positive, and -1 for negative. The sub-folders contained in this dataset are as follows:
1) Full Training Dataset: This sub-folder contains a full set of annotated pure English, pure Malay, and English-Malay code-switching tweets regarding ‘The dUCk Group’ brand, which can be used to train machine learning models. The tweets are provided in both CSV and XML formats, namely 'full_training_dataset.csv' and 'full_training_dataset.xml'.
2) Full Testing Dataset: This sub-folder contains a full set of annotated pure English, pure Malay, and English-Malay code-switching tweets regarding ‘The dUCk Group’ brand, which can be used to test the performance of the learning models. The tweets are provided in both CSV and XML formats, namely 'full_testing_dataset.csv' and 'full_testing_dataset.xml'.
3) Code-Switching Training Dataset: This sub-folder contains only annotated English-Malay code-switching tweets regarding ‘The dUCk Group’ brand, for training the learning models. The tweets are provided in both CSV and XML formats, namely 'eng_malay_training_dataset.csv' and 'eng_malay_training_dataset.xml'.
4) Code-Switching Testing Dataset: This sub-folder contains only annotated English-Malay code-switching tweets regarding ‘The dUCk Group’ brand, which can be used to evaluate the performance of the learning models. The tweets are provided in both CSV and XML formats, namely 'eng_malay_testing_dataset.csv' and 'eng_malay_testing_dataset.xml'.
*Note:
The 'Language' column gives the language category that the tweet belongs to.
The 'TweetText' column gives the full text of the tweet.
The 'TweetSentiment' column gives the sentiment value of the tweet (0, 1, or -1).
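
As an illustration of how the annotation scheme described above might be read, here is a minimal Python sketch using only the standard library. It assumes the CSV files have a header row with exactly the column names 'Language', 'TweetText', and 'TweetSentiment', and that sentiment is stored as the plain strings "0", "1", and "-1"; neither detail is confirmed by this description. The function names and the sample rows are invented for the example and are not taken from the dataset.

```python
import csv
import io
from collections import Counter

# Label scheme as stated in the description.
SENTIMENT_LABELS = {"-1": "negative", "0": "neutral", "1": "positive"}
LANGUAGES = {"ENG", "BM", "ENG-BM"}


def read_tweets(handle):
    """Yield (language, text, sentiment_label) tuples from an open CSV handle.

    Assumes a header row naming the three documented columns.
    """
    reader = csv.DictReader(handle)
    for row in reader:
        language = row["Language"].strip()
        sentiment = row["TweetSentiment"].strip()
        # Skip rows that do not match the documented scheme.
        if language not in LANGUAGES or sentiment not in SENTIMENT_LABELS:
            continue
        yield language, row["TweetText"], SENTIMENT_LABELS[sentiment]


def summarize(handle):
    """Count tweets per (language, sentiment_label) pair."""
    counts = Counter()
    for language, _text, label in read_tweets(handle):
        counts[(language, label)] += 1
    return counts


# Invented placeholder rows, used only so the sketch runs without the files.
SAMPLE = """Language,TweetText,TweetSentiment
ENG,placeholder english tweet,1
BM,placeholder malay tweet,0
ENG-BM,placeholder mixed tweet,-1
ENG-BM,placeholder mixed tweet two,-1
XX,row with unknown language,1
"""

sample_counts = summarize(io.StringIO(SAMPLE))
for key, count in sorted(sample_counts.items()):
    print(key, count)
```

To run the same summary on a downloaded file, open it with something like open("full_training_dataset.csv", newline="", encoding="utf-8") and pass the handle to summarize; the UTF-8 encoding is also an assumption and may need adjusting.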

Files