Amharic Facebook Dataset for Hate Speech detection

Published: 02-11-2020 | Version 1 | DOI: 10.17632/ymtmxx385m.1
Contributor:
Surafel Getachew

Description

This dataset was collected from the Facebook pages of activists who write their posts in Geez script, together with the comments of their followers. It was collected manually by going through each post and comment. To select activists' pages and specific posts from those pages, we set the following rules:

1. Facebook pages should have more than 50,000 likes (followers).
2. The post should have more than 300 comments, because this shows that it has many active participants.
3. The pages and activists have to use Amharic as the main language of their posts.

Based on these rules, 30,000 posts and comments were collected from the selected Facebook pages. All of the unique posts were annotated by ten annotators from different cultural, religious, and ethnic backgrounds. Five of those ten annotators were given the same number of posts to measure inter-annotator agreement, which reached a Fleiss' kappa of 0.662. The labeling was performed according to a guideline provided by the researcher. To help the annotators label the posts and comments, the guideline was prepared based on the UN general definition of hate speech and offensive language and on the Ethiopian hate speech law, following the annotation style used by other researchers. The 30,000 items were labeled into two classes, hate and free, based on the guideline.

Finally, data cleaning and normalization of Amharic characters were performed, because in Amharic the characters in each of the following groups represent the same consonant with the same pronunciation and can be used interchangeably without any difference in meaning: ሀ, ኀ, ሐ, ኃ, ኻ, ሓ and ሃ; ሰ and ሠ; ጸ and ፀ; አ, ኣ, ዐ and ዓ.
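The character normalization described above can be illustrated with a minimal Python sketch. It is not the contributor's actual cleaning script. The description lists the interchangeable characters but does not say which form was kept, so using the first character of each group as the canonical form is an assumption. The names `HOMOPHONE_GROUPS`, `NORMALIZATION_TABLE`, and `normalize_amharic` are hypothetical. Only the characters named in the description are covered, and other vowel orders of the same consonants are left untouched.

```python
# Groups of interchangeable characters, as listed in the dataset description.
# ASSUMPTION: the first character of each group is taken as the canonical form;
# the description does not state which form the dataset actually keeps.
HOMOPHONE_GROUPS = [
    "ሀኀሐኃኻሓሃ",
    "ሰሠ",
    "ጸፀ",
    "አኣዐዓ",
]

# Map every non-canonical variant to the first character of its group.
NORMALIZATION_TABLE = str.maketrans(
    {variant: group[0] for group in HOMOPHONE_GROUPS for variant in group[1:]}
)


def normalize_amharic(text: str) -> str:
    """Replace each listed variant character with its group's canonical form."""
    return text.translate(NORMALIZATION_TABLE)
```

For example, under this mapping `normalize_amharic("ፀሐይ")` returns `"ጸሀይ"`, while characters outside the listed groups pass through unchanged.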
