Ultimate Arabic News Dataset

Published: 9 May 2022 | Version 1 | DOI: 10.17632/jz56k5wxz7.1
Ahmed Hashim Al-Dulaimi


The Ultimate Arabic News Dataset is a collection of single-label Modern Arabic texts of the kind used on news websites and in press articles. The news data was collected by web scraping from well-known news sites such as Al-Arabiya and Al-Youm Al-Sabea (Youm7), from news published on the Google search engine, and from various other sources.

- The data consists of two primary files:
  UltimateArabic: A file containing more than 193,000 original Arabic news texts, without pre-processing. The texts still contain words, numbers, and symbols that pre-processing can remove to improve accuracy when the dataset is used in Arabic natural language processing tasks such as text classification.
  UltimateArabicPrePros: A file containing the data from the first file after pre-processing, which reduced it to about 188,000 text documents. Stop words, non-Arabic words, symbols, and numbers have been removed, so this file can be used directly in Arabic natural language processing tasks such as text classification.
- Two samples of the data collected by web scraping are also included:
  Sample_Youm7_Politic: An example of news in the "Politic" category collected from the Youm7 website.
  Sample_alarabiya_Sport: An example of news in the "Sport" category collected from the Al-Arabiya website.
- The data is divided into 10 categories: Culture, Diverse, Economy, Sport, Politic, Art, Society, Technology, Medical and Religion.
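The kind of pre-processing described for UltimateArabicPrePros (removing stop words, non-Arabic words, symbols, and numbers) can be sketched roughly as follows. This is only an illustration and not the authors' actual pipeline: the function name `preprocess`, the tiny `STOP_WORDS` set, and the choice to keep only the Arabic letter range U+0621 to U+064A are all assumptions made for the example.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only;
# the list actually used to build the dataset is not specified.
STOP_WORDS = {"في", "من", "على", "إلى", "عن"}

# Matches runs of anything that is neither an Arabic letter
# (U+0621 to U+064A, an assumed range) nor whitespace. This covers
# Latin words, digits (including Arabic-Indic digits), and symbols.
NON_ARABIC = re.compile(r"[^\u0621-\u064A\s]+")


def preprocess(text, stop_words=STOP_WORDS):
    """Return text with non-Arabic tokens, numbers, symbols and stop words removed."""
    # Replace every non-Arabic run with a space so neighbouring words stay separate.
    cleaned = NON_ARABIC.sub(" ", text)
    # Split on whitespace and drop stop words.
    tokens = [tok for tok in cleaned.split() if tok not in stop_words]
    return " ".join(tokens)


if __name__ == "__main__":
    sample = "فاز الفريق في المباراة 3-1 اليوم! Final"
    print(preprocess(sample))
```

A real pipeline would likely use a much fuller stop-word list and might also normalize letter variants or strip diacritics; those steps are left out here because the dataset description does not mention them.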



Yalova Universitesi, Yalova Universitesi Muhendislik Fakultesi


Computer Science, Textual Database, Arabic Language, News Collection Service