DataSet for Arabic Classification

Published: 30-07-2018 | Version 2 | DOI: 10.17632/v524p5dhpj.2
mohamed BINIZ


The dataset is a collection of Arabic texts covering the modern Arabic language used in newspaper articles. The text contains alphabetic, numeric and symbolic words. The presence of numeric and symbolic words makes the dataset useful for assessing the efficiency and robustness of Arabic text classification and document indexing methods. The dataset consists of 111,728 documents (cf. Table 1) and 319,254,124 words (cf. Table 2) stored as text files, collected from 3 Arabic online newspapers: Assabah [9], Hespress [10] and Akhbarona [11], using a semi-automatic web crawling process. The documents are categorized into 5 classes: sport, politic, culture, economy and diverse. The number of documents and words varies from class to class (cf. Tables 1-2).


Steps to reproduce

The main steps of the web crawling process performed by our program are:

1. Initialize the queue of URLs with the starting URLs.
2. As long as there are unvisited URLs:
   - Take the next available URL from the queue.
   - Download the content and mark the URL as visited.
   - Extract the hyperlinks from the newly downloaded document and add them to the queue if they satisfy the necessary criteria.
   - Extract the relevant content and its class using selectors. For example, for the Hespress website, we use the content selector "#article_body p" and the selector "title" to retrieve the class. Finally, we put the contents of each visited link in a file and save it in the folder of the appropriate class.
   - Re-evaluate the conditions for continuing to visit sites (maximum depth, maximum number of documents retrieved, maximum execution time, empty URLs, etc.).
   - Pause before continuing execution, to avoid overloading the server.
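The crawl loop described above can be sketched in Python roughly as follows. This is a minimal illustration using only the standard library, not the program that produced the dataset. The names `PageParser`, `fetch_html`, `crawl` and `save_document` are hypothetical, as are the `allow` and `classify` callables, which stand in for the unspecified link criteria and for the rule that maps a page title to one of the 5 classes. The parser approximates the "#article_body p" and "title" selectors by hand and assumes reasonably well-formed HTML.

```python
import time
from collections import deque
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urljoin
from urllib.request import urlopen


class PageParser(HTMLParser):
    """Collects links, the <title> text, and <p> text inside #article_body."""

    def __init__(self):
        super().__init__()
        self.links, self.paragraphs, self.title = [], [], ""
        self._body_tag, self._body_depth = None, 0  # element carrying the id
        self._in_p = self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        if tag == "title":
            self._in_title = True
        if self._body_tag is None and attrs.get("id") == "article_body":
            self._body_tag, self._body_depth = tag, 1
        elif tag == self._body_tag:
            self._body_depth += 1  # nested element with the same tag name
        if tag == "p" and self._body_tag is not None:
            self._in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        if tag == "p":
            self._in_p = False
        if tag == self._body_tag:
            self._body_depth -= 1
            if self._body_depth == 0:  # left #article_body
                self._body_tag, self._in_p = None, False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        if self._in_p:
            self.paragraphs[-1] += data


def fetch_html(url):
    """Default downloader (assumption: pages are UTF-8 encoded)."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_urls, allow, classify, fetch=fetch_html,
          max_docs=100, max_depth=3, max_seconds=3600.0, delay=1.0):
    """Returns a list of (class_label, url, text) tuples."""
    queue = deque((url, 0) for url in start_urls)  # step 1: seed the queue
    visited, documents = set(), []
    deadline = time.monotonic() + max_seconds
    # Step 2: loop while URLs remain and no stop condition is met.
    while queue and len(documents) < max_docs and time.monotonic() < deadline:
        url, depth = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        parser = PageParser()
        parser.feed(fetch(url))
        parser.close()
        if depth < max_depth:  # enqueue links that satisfy the criteria
            for href in parser.links:
                link = urljoin(url, href)
                if link not in visited and allow(link):
                    queue.append((link, depth + 1))
        text = "\n".join(p.strip() for p in parser.paragraphs if p.strip())
        if text:
            documents.append((classify(parser.title), url, text))
        time.sleep(delay)  # pause so the server is not overloaded
    return documents


def save_document(out_dir, label, index, text):
    """Writes one document into the folder named after its class."""
    folder = Path(out_dir) / label
    folder.mkdir(parents=True, exist_ok=True)
    (folder / f"{index}.txt").write_text(text, encoding="utf-8")
```

A caller would supply the site-specific parts, for instance an `allow` function that only accepts URLs on the newspaper's own domain and a `classify` function that derives the class from the title text, then pass each returned tuple to `save_document`. Passing `fetch` as a parameter keeps the loop testable against in-memory pages without network access.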