Natural Language Processing Journal


Gender Classification Using Twitter Data
Babatunde Onikoyi, Nonso Nnamoko, Ioannis Korkontzelos
Mendeley Data | Published 14 November 2022
This dataset is an expansion of the Twitter User Gender Classification dataset, which is freely available on Kaggle. It is intended for research on predicting user gender from textual data available on Twitter. The original dataset contained 12,894 distinct male and female Twitter users with one tweet each. It was expanded to 269,108 tweets by the same 12,894 users, so that each user has multiple tweets. The expansion was done using Tweepy to access the Twitter API.

The uploaded files contain the Train and Test split used for the experiment, with the following fields:

user_id - a unique ID for each user
gender - male or female
gender:confidence - a float representing confidence in the provided gender (1 for 100%)
created_at - date and time when the tweet was created
tweet_id - the unique ID of the text of a random tweet by the user

Also attached is a simple Jupyter Notebook script that uses Tweepy to retrieve a tweet’s complete information from its ID, a process known as hydrating a tweet ID. Some sample tweet IDs are already included in the script for testing purposes.
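As a rough illustration of what hydrating a tweet ID involves, the sketch below looks up one tweet by ID and maps the response onto the dataset's field names. It is not the attached notebook: it assumes Tweepy v4 with Twitter API v2 access and a bearer token, and the helper names make_client and hydrate_tweet are hypothetical. The notebook itself may use a different Tweepy interface.

```python
from typing import Any, Optional


def make_client(bearer_token: str) -> Any:
    """Build a Tweepy v4 client (assumption: API v2 access with a bearer token)."""
    # Imported lazily so hydrate_tweet can be tried with a stand-in client
    # even when Tweepy is not installed.
    import tweepy

    return tweepy.Client(bearer_token=bearer_token)


def hydrate_tweet(client: Any, tweet_id: int) -> Optional[dict]:
    """Return basic fields for one tweet ID, or None if no tweet data comes back."""
    # Request the extra fields that correspond to the dataset columns.
    response = client.get_tweet(tweet_id, tweet_fields=["created_at", "author_id"])
    tweet = response.data
    if tweet is None:
        # No data returned, e.g. the tweet was deleted or is not accessible.
        return None
    return {
        "tweet_id": tweet.id,
        "user_id": tweet.author_id,
        "created_at": tweet.created_at,
        "text": tweet.text,
    }
```

With valid credentials, a call would look like hydrate_tweet(make_client(token), tweet_id), where token and tweet_id are supplied by the user, for example one of the tweet_id values in the Train or Test files. Tweets that have since been deleted or made private yield None, so a hydrated copy of the dataset may be smaller than the 269,108 tweets listed.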