Ultimate Arabic News Dataset

Published: 4 July 2022| Version 2 | DOI: 10.17632/jz56k5wxz7.2
Ahmed Hashim Al-Dulaimi


The Ultimate Arabic News Dataset is a collection of single-label modern Arabic texts that are used in news websites and press articles. Arabic news data was collected by web scraping techniques from many famous news sites such as Al-Arabiya, Al-Youm Al-Sabea (Youm7), the news published on the Google search engine and other various sources. - The data we collect consists of two Primary files: UltimateArabic: A file containing more than 193,000 original Arabic news texts, without pre-processing. The texts contain words, numbers, and symbols that can be removed using pre-processing to increase accuracy when using the dataset in various Arabic natural language processing tasks such as text classification. UltimateArabicPrePros: It is a file that contains the data mentioned in the first file, but after pre-processing, where the number of data became about 188,000 text documents, where stop words, non-Arabic words, symbols and numbers have been removed so that this file is ready for use directly in the various Arabic natural language processing tasks. Like text classification. - We have added two folders containing additional detailed datasets: 1- Sample: This folder contains samples of the results of web-scraping techniques for two popular Arab websites in two different news categories, Sports and Politics. this folder contain two datasets: Sample_Youm7_Politic: An example of news in the "Politic" category collected from the Youm7 website. Sample_alarabiya_Sport: An example of news in the "Sport" category collected from the Al-Arabiya website. 2- Dataset Versions: This volume contains four different versions of the original data set, from which the appropriate version can be selected for use in text classification techniques. The first data set (Original) contains the raw data without pre-processing the data in any way, so the number of tokens in the first data set is very high. In the second data set (Original_without_Stop) the data was cleaned, such as removing symbols, numbers, and non-Arabic words, as well as stop words, so the number of symbols is greatly reduced. In the third dataset (Original_with_Stem) the data was cleaned, and text stemming technique was used to remove all additions and suffixes that might affect the accuracy of the results and to obtain the words roots. In the 4th edition of the dataset (Original_Without_Stop_Stem) all preprocessing techniques such as data cleaning, stop word removal and text stemming technique were applied, so we note that the number of tokens in the 4th edition is the lowest among all releases. - The data is divided into 10 different categories: Culture, Diverse, Economy, Sport, Politic, Art, Society, Technology, Medical and Religion.



Yalova Universitesi, Yalova UniversitesiMuhendislik Fakultesi


Computer Science, Textual Database, Arabic Language, News Collection Service