Review of Text Clustering Methods and Suggested Solutions for Theme Based Clustering of the Quran

Published: 22 Jan 2020 | Version 1 | DOI: 10.17632/kb92kdjtcz.1

Description of this data

The datasets use modern, unedited, and unmarked Arabic texts, consisting of a sample of nearly 1,680 documents obtained from various online Arabic resources. The testing dataset comprises four categories: art, economics, politics, and sports articles. The other collection consists of 383,872 Arabic documents, primarily newswire dispatches released by Agence France-Presse (AFP) between 1994 and 2000. Standard TREC classes and ground truth were established for this collection, with 10 classes defined as part of TREC 2001. The last dataset is the Qur'an data, which was converted from a soft copy into a database so that the text is linked to its chapters and verses.

    Cite this dataset

Abdul Salam, Rosalina; Bsoul, Qusay; Atwan, Jaffar; Ahmad, Hishomudin (2020), “Review of Text Clustering Methods and Suggested Solutions for Theme Based Clustering of the Quran”, Mendeley Data, v1 http://dx.doi.org/10.17632/kb92kdjtcz.1

Statistics

Views: 11
Downloads: 3

Categories

Feature Selection, Information Classification, Arabic Language, Extraction Methods, Cluster Testing, Classifier Evaluation, Cross-Language Information Retrieval

Licence

CC BY 4.0

The files associated with this dataset are licensed under a Creative Commons Attribution 4.0 International licence.

What does this mean?
You can share, copy and modify this dataset so long as you give appropriate credit, provide a link to the CC BY license, and indicate if changes were made, but you may not do so in a way that suggests the rights holder has endorsed you or your use of the dataset. Note that further permission may be required for any content within the dataset that is identified as belonging to a third party.
