WiHArD: Wikipedia based Hierarchical Arabic Dataset

Name: WiHArD: Wikipedia based Hierarchical Arabic Dataset
Creator: Djelloul BOUCHIHA
Published: 2023-01-23T14:57:30.768Z
Keywords: Artificial Intelligence, Natural Language Processing, Machine Learning, Classification System, Arabic Language, Categorization, Text Processing, Deep Learning

BOUCHIHA, Djelloul; BOUZIANE, Abdelghani; DOUMI, Noureddine; BERBOUCHI, Farouk Omar; KEBIR, Aymen Abdelghani; MEBARKI, Nihad; BENAMEUR, Badiâ Achouak

doi:10.17632/kdkryh5rs2.2

Published: 23 January 2023 | Version 2 | DOI: 10.17632/kdkryh5rs2.2

Contributors:

Djelloul BOUCHIHA, Abdelghani BOUZIANE, Noureddine DOUMI, Farouk Omar BERBOUCHI, Aymen Abdelghani KEBIR, Nihad MEBARKI, Badiâ Achouak BENAMEUR

Description

WiHArD (Wikipedia based Hierarchical Arabic Dataset) is a hierarchical Arabic dataset of 6027 texts extracted from the Wikipedia website. WiHArD is structured into three "level 1" classes and nine "level 2" classes:

• "Level 1" classes are Culture (ثقافة), History (تاريخ) and Math (رياضيات). Texts at this level describe general notions related to these domains.
• "Level 2" classes are Clothes (ملابس), Food_drinks (طعام و شراب), Tourism (سياحة), Events (أحداث), Inventions (اختراعات), Monuments (أثار), Algebra (جبر), Analysis (تحليل) and Geometry (هندسة). Texts at this level describe specific notions related to these sub-domains.

Four files are shared for the benefit of the NLP and AI communities, especially researchers working on the Arabic language:

1. WiHArD_Directory_Hierarchy.zip contains the directory hierarchy.
2. WiHArD.csv, a CSV file with three columns: the "text" column contains the Arabic texts; the "category_path" and "category_code" columns contain the category path and the category code, respectively.
3. WiHArD_Level1.csv, a CSV file restricted to the texts of the first level, namely Culture (ثقافة), History (تاريخ) and Math (رياضيات).
4. WiHArD_Level2.csv, a CSV file restricted to the texts of the second level, namely Clothes (ملابس), Food_drinks (طعام و شراب), Tourism (سياحة), Events (أحداث), Inventions (اختراعات), Monuments (أثار), Algebra (جبر), Analysis (تحليل) and Geometry (هندسة).
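
As a quick sanity check after downloading, the number of texts per category in WiHArD.csv can be counted with a few lines of Python. This is only a minimal sketch: it assumes the file is UTF-8 encoded and starts with a header row naming the "text", "category_path" and "category_code" columns described above. The function name count_by_category and the sample category values in the comments are hypothetical, since the exact format of the paths and codes is not specified here.

```python
import csv
from collections import Counter


def count_by_category(lines, column="category_path"):
    """Count rows per value of `column` in WiHArD-style CSV data.

    `lines` is any iterable of CSV lines (an open file works).
    Assumption: the first line is a header containing `column`.
    """
    reader = csv.DictReader(lines)
    return Counter(row[column] for row in reader)


# Usage, assuming WiHArD.csv sits in the working directory
# and is UTF-8 encoded (both are assumptions):
#
# with open("WiHArD.csv", newline="", encoding="utf-8") as f:
#     for category, n in count_by_category(f).most_common():
#         print(category, n)
```

Passing column="category_code" gives the same counts keyed by category code, and the function can be applied unchanged to WiHArD_Level1.csv and WiHArD_Level2.csv if they share the same header.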
