Amharic Social Media Dataset for Hate Speech Detection and Classification in Amharic Text with Deep Learning

Published: 12 August 2022 | Version 1 | DOI: 10.17632/p74pfhz3yx.1
Contributor:
Samuel Minale

Description

This dataset is prepared for hate speech detection and classification into four categories of speech, namely Normal speech, Racial Hate speech, Religious Hate speech, Gender Hate speech, and Disability Hate speech. The dataset is collected from three social media sites: Facebook, Twitter, and YouTube. The collection is done automatically and the data is annotated by human annotators. The dataset covers the Amharic language only.

To make the annotation process clear, we developed an annotation guideline. The annotation is carried out in two rounds. The first round is done by 100 annotators with different demographic and sociocultural backgrounds. Before annotation starts, in addition to receiving the guideline, the annotators are given a brief introduction that covers:
● What hate speech is
● Social media and hate speech
● Impact of hate speech
● Types of hate speech
● How to control hate speech
● How to use the annotation website to annotate the hate speech dataset

To start the annotation process, the annotators have to sign up and log in to our custom-built annotation tool (https://annotate.shegerapps.com), called “Amharic Hate Speech Annotation Tool”. As the schema in Figure 5.2 shows, the annotation tool, which is a website, has a database with ten tables. Eight of the tables hold the annotated (labeled) data. Of the remaining two, one holds users (annotators, curators, and admin) for authentication purposes, and the tenth holds the raw data. The raw data table is where the data to be annotated is loaded. Annotators fetch data from this table, and once an item is annotated it is inserted into the corresponding one of the eight tables. We annotated texts in eight categories, but this research needs only four of them. We included the other four hate speech categories for future studies, so that any interested researcher, or we ourselves, can continue the research without a new annotation effort.

The annotation tool uses a MySQL database; the backend is developed in PHP and the frontend in HTML, JavaScript, and jQuery. The tool supports an efficient team-based annotation experience, keeps data preparation under control, helps manage annotators’ tasks and their progress, and makes exporting the annotated dataset easier.

After the annotation is finalized, the dataset is given to the respective model as input in CSV format. During training, the data is split into three parts with an 80:10:10 ratio for training, validation, and testing.
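
The 80:10:10 split mentioned in the description can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' training code: the function name split_80_10_10, the fixed seed, and the (text, label) row layout are assumptions made for the example.

```python
import random


def split_80_10_10(rows, seed=42):
    """Shuffle rows and split them into train/validation/test at 80:10:10.

    The seed and the function name are illustrative assumptions.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    # Integer arithmetic avoids floating-point rounding surprises.
    n_train = n * 80 // 100
    n_val = n * 10 // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# Hypothetical rows standing in for (text, label) pairs read from the CSV export.
rows = [(f"text_{i}", i % 4) for i in range(100)]
train, val, test = split_80_10_10(rows)
print(len(train), len(val), len(test))
```

A plain shuffle does not keep the class proportions equal across the three parts; if the categories are imbalanced, a stratified split may be preferable.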

Steps to reproduce

Detection and classification by the model is based on four categories. For each category, data is collected from three social media platforms, selected for their popularity and day-to-day usage in Ethiopia, especially for hate speech-related posts and comments. The selected platforms are Facebook, Twitter, and YouTube.

To collect data from Facebook and YouTube, we applied both manual and automatic methods. For automatic collection, we used the Facepager tool, a social media crawler that uses the Graph API and various other APIs. For Twitter specifically, we used an unpublished, unannotated dataset that was collected using the Twitter API. The collection covers tweets written in Fidel script, gathered on a daily basis starting from mid-August 2014. The collector program runs daily as a background process and fetches each tweet with its date, time, user location, and tweet ID.

After collecting the dataset, we carried out a first round of data cleaning by automatically removing data not written in Fidel script, using the “PYCLD2 Python Bindings to CLD2” tool. Text in non-Amharic languages that are also written in Fidel script, such as Argobba, Harari, Inor, Tigre, Tigrinya, and other Ethiopian languages, was cleaned manually.

After cleaning, we consolidated all the data and filtered racial, religious, and gender hate speech using our own list of hate speech keywords, which were collected by analyzing sample hate speech. The identified keywords include 14 gender keywords, 30 religious keywords, 168 hate-related keywords, 70 offensive keywords that can be a starting point for hate speech, and 56 names of well-known Ethiopian ethnic groups. For the normal free speech category, we identified and collected normal speech during annotation. This gives us normal free speech for each of the hate speech categories too, so the model learns the difference between normal and hate speech within the same category.
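
The two automatic steps above (dropping text that is not in Fidel script, then pre-filtering by keyword list) can be sketched roughly as below. This is an illustration under stated assumptions, not the authors' pipeline. In place of PYCLD2 it uses a simple check against the main Ethiopic Unicode block (U+1200 to U+137F), which, like any script-level check, cannot tell Amharic apart from other Fidel-script languages such as Tigrinya. The function names, the 0.5 threshold, and the keyword entries are hypothetical; the keywords shown are neutral stand-in words, not entries from the actual lists.

```python
def fidel_ratio(text):
    """Fraction of alphabetic characters in the main Ethiopic Unicode block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    ethiopic = sum(1 for ch in letters if "\u1200" <= ch <= "\u137f")
    return ethiopic / len(letters)


def is_fidel(text, threshold=0.5):
    """Keep text whose letters are mostly Ethiopic (threshold is an assumption)."""
    return fidel_ratio(text) >= threshold


# Hypothetical keyword lists: neutral stand-in words, NOT the real keywords.
KEYWORDS = {
    "gender": {"ቡና"},
    "religious": {"ውሃ"},
}


def candidate_categories(text, keywords=KEYWORDS):
    """Return the categories whose keyword set shares a token with the text."""
    tokens = set(text.split())
    return sorted(cat for cat, words in keywords.items() if tokens & words)


posts = ["ሰላም ቡና", "hello world", "ሰላም"]
kept = [p for p in posts if is_fidel(p)]
for post in kept:
    print(post, candidate_categories(post))
```

Whitespace tokenization with exact matching is the simplest possible choice here; a keyword match only marks a post as a candidate for a category, and the final label still comes from the human annotators.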

Institutions

Addis Ababa University College of Natural Sciences

Categories

Natural Language Processing, Machine Learning, Deep Learning, Long Short-Term Memory Network

Licence