Afaan Oromo Facebook posts and comments hate speech dataset

Published: 31 March 2022 | Version 1 | DOI: 10.17632/r2vwjw5rbx.1
Contributor:
Baharudin Sherif

Description

This dataset was collected from Facebook pages that mainly use Afaan Oromo for their posts. The pages belong to political activists, broadcasting media, religious organizations, politicians, and well-known personal blogs. These sources were chosen to obtain a large and representative sample, since posts on such pages attract many reactions in their comment sections.

We set the following rules for selecting Facebook pages:
1. The page mostly uses the Afaan Oromo language for its posts.
2. The page has more than 20,000 likes and followers.
3. The page belongs to religious media, a well-known vlogger, a politician, or broadcasting media, so that the data are more representative.

Based on these rules, we selected 18 different Facebook pages and collected 20,000 unique items from their posts and comments.

The items were annotated by 8 annotators following an annotation guideline provided by the researcher. The guideline was prepared from the definitions of hate speech in the Ethiopian hate speech proclamation, with reference to Ethiopian criminal law and to the social media hate speech guideline prepared by the Center for Advancement of Rights and Democracy. All annotators were given the same number of items, and 4 of them were given the same items so that inter-annotator agreement could be measured. Fleiss' kappa for this shared set was 0.64, which indicates a good level of agreement.

The dataset is annotated into two classes, labeled Hate and Free. Of the 20,000 collected items, which were obtained after cleaning and removing unnecessary characters, 9985 are annotated as Free and 1015 are annotated as Hate.
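The Fleiss' kappa figure reported above can be illustrated with a minimal Python sketch. It assumes 4 raters per item and the two labels "hate" and "free"; the function name `fleiss_kappa`, the variable `toy_ratings`, and the ratings themselves are hypothetical and are not taken from the dataset or from the authors' code.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Compute Fleiss' kappa.

    ratings: list of items, each a list with one label per rater.
    categories: all labels a rater may assign.
    Every item must be rated by the same number of raters (at least 2).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    # Per-item count of how many raters chose each category.
    counts = [Counter(item) for item in ratings]

    # Observed agreement for each item, then averaged over all items.
    per_item = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(per_item) / n_items

    # Agreement expected by chance, from the overall share of each category.
    shares = [
        sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories
    ]
    p_expected = sum(s ** 2 for s in shares)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when only one category is used")

    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical toy ratings: 5 items, each labeled by 4 raters.
toy_ratings = [
    ["hate", "hate", "hate", "hate"],
    ["free", "free", "free", "free"],
    ["free", "free", "free", "hate"],
    ["free", "free", "free", "free"],
    ["hate", "hate", "free", "free"],
]
kappa = fleiss_kappa(toy_ratings, ["hate", "free"])
print(f"Fleiss' kappa on the toy ratings: {kappa:.3f}")
```

Only the items labeled by all of the overlapping annotators would enter such a calculation; items labeled by a single annotator carry no agreement information.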

Institutions

Mettu University Faculty of Engineering and Technology

Categories

Natural Language Processing, Learning

License