DataSet for Arabic Classification

Published: 30 Jul 2018 | Version 2 | DOI: 10.17632/v524p5dhpj.2
Contributor(s):

Description of this data

The dataset is a collection of Arabic texts covering the modern Arabic language used in newspaper articles. The texts contain alphabetic, numeric and symbolic words. The presence of numeric and symbolic words makes the dataset useful for assessing the efficiency and robustness of Arabic text classification and indexing methods.
The dataset consists of 111,728 documents (cf. Table 1) and 319,254,124 words (cf. Table 2), stored as text files and collected from three Arabic online newspapers, Assabah [9], Hespress [10] and Akhbarona [11], using a semi-automatic web crawling process.
The documents in the dataset are categorized into 5 classes: sport, politic, culture, economy and diverse. The number of documents and words for each class varies from one class to another (cf. Tables 1-2).
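
As a rough illustration of how the per-class document and word counts could be recomputed from a local copy, here is a minimal sketch. It assumes one sub-folder per class containing `.txt` files encoded in UTF-8, and it counts words by splitting on whitespace; the folder layout, file extension, encoding and tokenization are assumptions, not details confirmed by this description, so the totals may differ from those reported in Tables 1-2. The function name `class_statistics` is hypothetical.

```python
from pathlib import Path


def class_statistics(root, encoding="utf-8"):
    """Count documents and whitespace-separated words per class folder.

    Assumes (hypothetically) a layout of root/<class>/<document>.txt.
    """
    stats = {}
    for class_dir in sorted(Path(root).iterdir()):
        if not class_dir.is_dir():
            continue  # ignore loose files next to the class folders
        docs = words = 0
        for path in class_dir.rglob("*.txt"):
            docs += 1
            text = path.read_text(encoding=encoding, errors="replace")
            words += len(text.split())  # naive whitespace tokenization
        stats[class_dir.name] = {"documents": docs, "words": words}
    return stats
```

Calling `class_statistics("path/to/dataset")` returns a dictionary keyed by folder name, which can be compared against the published per-class figures.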

Steps to reproduce

The main steps of the web crawling process carried out by our program are:

  1. Initialize the queue of URLs with the starting URLs.
  2. As long as there are unvisited URLs:
    • Take the next available URL from the queue.
    • Download the content and mark the URL as visited.
    • Extract the hyperlinks from the newly downloaded document and add them to the queue if they satisfy the necessary criteria.
    • Extract the relevant content and its class using selectors. For example, for the Hespress website we use the content selector "#article_body p" and the selector "title" to retrieve the class. Finally, we write the content of each visited link to a file and save it in the folder of the appropriate class.
    • Re-evaluate the conditions for continuing to visit sites (maximum depth, maximum number of documents retrieved, maximum execution time, empty URL queue, etc.).
    • Pause before continuing, to avoid overloading the server.

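The crawling loop described above can be sketched as follows. This is not the original program, which is not included in this description; it is a minimal standard-library illustration. The names `PageParser`, `crawl`, `fetch` and `accept_link` are hypothetical, the download function is passed in by the caller, and the hand-written parser only approximates the selectors "#article_body p" and "title". How the class is derived from the title, and how files are written to class folders, is left to the caller because the description does not specify it.

```python
import time
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageParser(HTMLParser):
    """Collects links, the <title> text, and <p> text inside id="article_body"."""

    def __init__(self):
        super().__init__()
        self.links, self.paragraphs, self.title = [], [], ""
        self._body_tag, self._body_depth = None, 0
        self._in_p = self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        if tag == "title":
            self._in_title = True
        if self._body_depth == 0 and attrs.get("id") == "article_body":
            self._body_tag, self._body_depth = tag, 1   # entering the article
        elif self._body_depth and tag == self._body_tag:
            self._body_depth += 1                       # nested tag of same kind
        elif self._body_depth and tag == "p":
            self._in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "p":
            self._in_p = False
        elif self._body_depth and tag == self._body_tag:
            self._body_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_p:
            self.paragraphs[-1] += data


def crawl(start_urls, fetch, accept_link, max_docs=100, max_depth=3, delay=1.0):
    """Breadth-first crawl; fetch(url) must return the page HTML as a string."""
    queue = deque((url, 0) for url in start_urls)      # step 1: seed the queue
    visited, documents = set(), []
    while queue and len(documents) < max_docs:         # step 2 + stop conditions
        url, depth = queue.popleft()
        if url in visited:
            continue
        visited.add(url)                               # mark as visited
        parser = PageParser()
        parser.feed(fetch(url))                        # download and parse
        if depth < max_depth:
            for href in parser.links:
                link = urljoin(url, href)
                if link not in visited and accept_link(link):
                    queue.append((link, depth + 1))
        text = "\n".join(p.strip() for p in parser.paragraphs if p.strip())
        if text:
            documents.append((url, parser.title.strip(), text))
        time.sleep(delay)                              # be polite to the server
    return documents
```

Passing `fetch` as a parameter keeps the sketch independent of any particular HTTP client; a caller could supply a function built on `urllib.request.urlopen`, and then map each returned title to a class folder before saving the text.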
Cite this dataset

mohamed, BINIZ (2018), “DataSet for Arabic Classification”, Mendeley Data, v2 http://dx.doi.org/10.17632/v524p5dhpj.2

Statistics

Views: 1377
Downloads: 371

Institutions

Universite Sultan Moulay Slimane de Beni-Mellal, Universite Chouaib Doukkali Faculte des Sciences

Categories

Word Processing, Classification System

Licence

CC BY 4.0

The files associated with this dataset are licensed under a Creative Commons Attribution 4.0 International licence. You can share, copy and modify this dataset so long as you give appropriate credit, provide a link to the CC BY licence, and indicate if changes were made, but you may not do so in a way that suggests the rights holder has endorsed you or your use of the dataset. Note that further permission may be required for any content within the dataset that is identified as belonging to a third party.
