SANAD: Single-Label Arabic News Articles Dataset for Automatic Text Categorization

Name: SANAD: Single-Label Arabic News Articles Dataset for Automatic Text Categorization
Creator: Omar Einea
Published: 2019-03-21T11:18:28.047Z
Keywords: Natural Language Processing, Machine Learning, Classification System, Arabic Language, Categorization, Text Processing

Einea, Omar; Elnagar, Ashraf; Al-Debsi, Ridhwan

doi:10.17632/57zpx667y9.1

Published: 21 March 2019 | Version 1 | DOI: 10.17632/57zpx667y9.1

Contributors:

Omar Einea, Ashraf Elnagar, Ridhwan Al-Debsi

Description

The SANAD dataset is a large collection of Arabic news articles that can be used for Arabic NLP tasks such as text classification and word embedding. The articles were collected with Python scripts written specifically for three popular news websites: AlKhaleej, AlArabiya and Akhbarona. Each website's dataset has seven categories [Culture, Finance, Medical, Politics, Religion, Sports and Tech], except AlArabiya, which has no [Religion] category. SANAD contains 194,797 articles in total.

How to use it:

1. Unzip the compressed resources.
2. Each folder contains 6-7 sub-folders, each labeled with a category's name.
3. Each sub-folder contains the article files belonging to its category.
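The folder layout described above (one folder per news source, one sub-folder per category, one file per article) can be read with a few lines of Python. The following is a minimal sketch, not part of the dataset: the function names `iter_articles` and `count_by_category` are hypothetical, and UTF-8 encoding of the article files is an assumption that should be checked against the unzipped data.

```python
from collections import Counter
from pathlib import Path


def iter_articles(root, encoding="utf-8"):
    """Yield (source, category, text) for every article file under root.

    Assumes the layout root/<source>/<category>/<article file>.
    The UTF-8 default is an assumption, not stated by the dataset page.
    """
    root = Path(root)
    for source_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for category_dir in sorted(p for p in source_dir.iterdir() if p.is_dir()):
            for article in sorted(p for p in category_dir.iterdir() if p.is_file()):
                # errors="replace" keeps the loop going if a file is not valid UTF-8
                text = article.read_text(encoding=encoding, errors="replace")
                yield source_dir.name, category_dir.name, text


def count_by_category(root):
    """Count articles per (source, category) pair."""
    return Counter((src, cat) for src, cat, _ in iter_articles(root))
```

Calling `count_by_category` on the directory that holds the unzipped source folders gives a quick check that each source has the expected 6-7 categories. The sub-folder name serves directly as the single label of each article, so the tuples from `iter_articles` can be passed to a text classifier without a separate label file.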

Institutions

University of Sharjah
