Amharic Social Media Dataset for Hate Speech Detection and Classification in Amharic Text with Deep Learning
Steps to reproduce
The model detects and classifies text into four categories: racial, religious, and gender hate speech, plus normal (free) speech. For each category, data were collected from three social media platforms: Facebook, Twitter, and YouTube. We selected these platforms for their popularity and day-to-day usage in Ethiopia, and in particular for the volume of hate speech-related posts and comments they carry. To collect data from Facebook and YouTube, we combined manual and automatic collection. For automatic collection, we used Facepager, a social media crawler that retrieves data through the Graph API and other APIs. For Twitter, we used an unpublished, unannotated dataset that had been collected through the Twitter API. The collector program runs daily as a background process, has been gathering tweets written in Fidel script since mid-August 2014, and stores each tweet with its date, time, user location, and tweet ID.
After collection, we ran a first round of data cleaning that automatically removed text not written in Fidel script, using pycld2, the Python bindings to the CLD2 language detector. Text written in Fidel script but in a language other than Amharic, such as Argobba, Harari, Inor, Tigre, Tigrinya, and other Ethiopian languages, was removed manually.
After cleaning, we consolidated the data and filtered racial, religious, and gender hate speech using our own list of hate speech keywords, which we compiled by analyzing sample hate speech. The list contains 14 gender keywords, 30 religious keywords, 168 hate-related keywords, 70 offensive keywords that can lead into hate speech, and 56 names of well-known Ethiopian ethnic groups. For the normal (free) speech category, we identified and collected normal speech during annotation.
This also yields normal speech examples within each hate speech category, so the model can learn the difference between normal speech and hate speech within the same category.
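
The script-cleaning and keyword-filtering steps described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code used to build the dataset. It swaps the CLD2-based detection for a plain Unicode range check on the Ethiopic blocks. The function names, the 0.5 threshold, and the two keywords are hypothetical placeholders and are not taken from the actual keyword lists. A script check like this cannot separate Amharic from other Fidel-script languages such as Tigrinya, which is why that step was done manually.

```python
# Unicode blocks that contain Ethiopic (Fidel) characters.
ETHIOPIC_RANGES = (
    (0x1200, 0x137F),  # Ethiopic
    (0x1380, 0x139F),  # Ethiopic Supplement
    (0x2D80, 0x2DDF),  # Ethiopic Extended
    (0xAB00, 0xAB2F),  # Ethiopic Extended-A
)

# Placeholder keyword lists (assumption): the real lists are much larger
# and these two neutral words only illustrate the data structure.
KEYWORDS = {
    "gender": ["ሴት"],
    "religion": ["ሃይማኖት"],
}


def is_fidel_char(ch):
    """Return True if the character lies in one of the Ethiopic blocks."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ETHIOPIC_RANGES)


def fidel_ratio(text):
    """Fraction of alphabetic characters in the text that are Ethiopic."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(is_fidel_char(ch) for ch in letters) / len(letters)


def candidate_categories(text, keywords=KEYWORDS):
    """Categories whose keywords occur in the text (plain substring match)."""
    return sorted(
        cat for cat, words in keywords.items() if any(w in text for w in words)
    )


def filter_posts(posts, min_ratio=0.5):
    """Drop posts that are mostly non-Fidel, then tag the rest by keyword.

    The min_ratio threshold is an assumed value for illustration.
    """
    kept = []
    for post in posts:
        if fidel_ratio(post) < min_ratio:
            continue  # not predominantly Fidel script
        kept.append((post, candidate_categories(post)))
    return kept


if __name__ == "__main__":
    sample = ["hello world", "ሴት ልጅ", "ሰላም"]
    for post, cats in filter_posts(sample):
        print(post, cats)
```

Posts that pass the script check but match no keyword come back with an empty category list. In a workflow like the one described, keyword matches only nominate candidates, and the final label, including normal speech, is assigned during annotation.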