WiHArD: Wikipedia based Hierarchical Arabic Dataset

Published: 23 January 2023 | Version 2 | DOI: 10.17632/kdkryh5rs2.2
Djelloul BOUCHIHA


WiHArD (Wikipedia based Hierarchical Arabic Dataset) is a hierarchical Arabic dataset of 6027 texts extracted from the Wikipedia website. WiHArD is structured into three "level 1" classes and nine "level 2" classes:

• "Level 1" classes are Culture (ثقافة), History (تاريخ) and Math (رياضيات). Texts at this level describe general notions related to these domains.
• "Level 2" classes are Clothes (ملابس), Food_drinks (طعام و شراب), Tourism (سياحة), Events (أحداث), Inventions (اختراعات), Monuments (أثار), Algebra (جبر), Analysis (تحليل) and Geometry (هندسة). Texts at this level describe specific notions related to these sub-domains.

Four files are shared for the benefit of the NLP and AI communities, especially researchers working on the Arabic language:

1. WiHArD_Directory_Hierarchy.zip contains the directory hierarchy.
2. WiHArD.csv is a CSV file with three columns: the "text" column contains the Arabic texts; the "category_path" and "category_code" columns contain the category path and the category code, respectively.
3. WiHArD_Level1.csv is a CSV file restricted to the texts of the first level, namely Culture (ثقافة), History (تاريخ) and Math (رياضيات).
4. WiHArD_Level2.csv is a CSV file restricted to the texts of the second level, namely Clothes (ملابس), Food_drinks (طعام و شراب), Tourism (سياحة), Events (أحداث), Inventions (اختراعات), Monuments (أثار), Algebra (جبر), Analysis (تحليل) and Geometry (هندسة).
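As an illustration of how the three-column layout of WiHArD.csv might be consumed, here is a minimal Python sketch that counts texts per class at each level. Several details are assumptions rather than facts from the dataset description: the grouping of the "level 2" classes under the "level 1" classes is inferred from the order in which they are listed, the "category_path" values are assumed to be written with English class names separated by "/", and the sample rows (including their "category_code" values) are invented for demonstration. The helper names split_category_path and count_by_level are hypothetical.

```python
import csv
import io
from collections import Counter

# ASSUMPTION: this parent/child grouping is inferred from the listing order
# in the description; it is not stated explicitly there.
HIERARCHY = {
    "Culture": ["Clothes", "Food_drinks", "Tourism"],
    "History": ["Events", "Inventions", "Monuments"],
    "Math": ["Algebra", "Analysis", "Geometry"],
}


def split_category_path(path, sep="/"):
    """Split a category path into (level1, level2).

    ASSUMPTION: paths look like "Culture" or "Culture/Clothes". The real
    separator and naming in WiHArD.csv may differ; adjust `sep` as needed.
    level2 is None when the text is filed directly under a level 1 class.
    """
    parts = [p for p in path.strip().split(sep) if p]
    if not parts:
        raise ValueError("empty category path")
    level1 = parts[0]
    level2 = parts[1] if len(parts) > 1 else None
    return level1, level2


def count_by_level(csv_file):
    """Count rows per level 1 class (including its sub-classes) and per level 2 class."""
    level1_counts = Counter()
    level2_counts = Counter()
    for row in csv.DictReader(csv_file):
        level1, level2 = split_category_path(row["category_path"])
        level1_counts[level1] += 1
        if level2 is not None:
            level2_counts[level2] += 1
    return level1_counts, level2_counts


# Invented sample rows that mimic the three documented columns.
# The category_code values here are placeholders, not the dataset's real codes.
SAMPLE = (
    "text,category_path,category_code\n"
    "نص عن الثقافة,Culture,1\n"
    "نص عن الملابس,Culture/Clothes,2\n"
    "نص عن الأحداث,History/Events,3\n"
    "نص عن الهندسة,Math/Geometry,4\n"
    "نص عن الرياضيات,Math,5\n"
)

if __name__ == "__main__":
    level1_counts, level2_counts = count_by_level(io.StringIO(SAMPLE))
    print(dict(level1_counts))
    print(dict(level2_counts))
```

To run the same counting on the real file, one would pass an opened file instead of the in-memory sample, for example open("WiHArD.csv", encoding="utf-8", newline=""), after checking the actual path format and encoding of the download.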



Artificial Intelligence, Natural Language Processing, Machine Learning, Classification System, Arabic Language, Categorization, Text Processing, Deep Learning
