Political Arabic Article Dataset

Published: 12 May 2020 | Version 2 | DOI: 10.17632/spvbf5bgjs.2

Description

PAAD (Political Arabic Article Dataset) is a collection of political Arabic texts covering the modern Arabic used in newspapers, blogs and social networks. PAAD can be used in various Arabic NLP tasks such as Text Classification, Target, Article Orientation and Word Embedding. The text contains alphabetic, numeric and symbolic tokens as well as English words.

The documents in the dataset are categorized into 3 classes: Reform "اصلاحي", Conservative "محافظ" and Revolutionary "ثوري". The number of documents in each class:

Reform = 80
Conservative = 58
Revolutionary = 68

PAAD contains a total of 206 articles. Articles were collected manually, with Python scripts used specifically for the Excel files.

There are two Excel files. The first is the original file, which holds the same raw data in a single Excel file. The second applies the following Arabic normalization:

1- إأٱآا = ا
2- ي = ى
3- ؤ ئ = ء
4- ة = ه
5- Remove diacritics such as (ُ,ْ,َ,ِ,ّ,~,ً,ٍ,ٌ)

How to use it:
___________
1. Unzip the compressed resources.
2. There are three main folders, each labelled with its category's code: Reform = S, Conservative = M and Revolutionary = T.
3. Each folder contains the set of article files belonging to its category.
4. There are 2 Excel files: the first holds the raw data in one file with the label for each article, and the second applies the Arabic normalization.

V1 = original corpus
V2 = preprocessed corpus
V3 = root stemming corpus (ISRI)
V4 = light stemming corpus (ISRI)
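
The normalization rules and the folder layout described above can be sketched in Python. This is a minimal illustration, not the authors' own script. It rests on several assumptions: each rule is read as "the forms on the left of = are replaced by the form on the right", the diacritics are taken to be the Unicode range U+064B to U+0652 plus the literal "~", the article files are assumed to be UTF-8 text, and the names normalize_arabic, load_corpus and FOLDER_LABELS are hypothetical.

```python
import re
from pathlib import Path

# Letter rules 1-4 from the description, read as "left side -> right side"
# (an assumption about the direction of each rule).
LETTER_MAP = str.maketrans({
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0671": "\u0627",  # alef wasla -> bare alef
    "\u0622": "\u0627",  # alef with madda -> bare alef
    "\u064A": "\u0649",  # yeh -> alef maksura
    "\u0624": "\u0621",  # waw with hamza -> hamza
    "\u0626": "\u0621",  # yeh with hamza -> hamza
    "\u0629": "\u0647",  # teh marbuta -> heh
})

# Rule 5: tanween, fatha, damma, kasra, shadda, sukun (U+064B..U+0652),
# plus the literal "~" that appears in the description's list.
DIACRITICS = re.compile("[\u064B-\u0652~]")

# Folder codes taken from the "How to use it" steps.
FOLDER_LABELS = {"S": "Reform", "M": "Conservative", "T": "Revolutionary"}


def normalize_arabic(text):
    """Strip diacritics, then apply the letter replacements."""
    return DIACRITICS.sub("", text).translate(LETTER_MAP)


def load_corpus(root, encoding="utf-8"):
    """Return (text, label) pairs from the S, M and T folders under root.

    The UTF-8 encoding is an assumption; adjust it if the files differ.
    """
    samples = []
    for folder, label in FOLDER_LABELS.items():
        for path in sorted(Path(root, folder).iterdir()):
            if path.is_file():
                samples.append((path.read_text(encoding=encoding), label))
    return samples
```

Calling load_corpus on the unzipped directory and passing each text through normalize_arabic should give something comparable to the normalized Excel file, although the result may differ from the published V2 corpus if the authors applied further preprocessing.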

Categories

Natural Language Processing, Machine Learning, Arabic Language, Categorization, Text Processing, Targeting, Sentiment Analysis