MAAD : Multi-Label Arabic Articles Dataset

Published: 22 January 2025| Version 1 | DOI: 10.17632/hbfc9j8hj8.1
Contributors:
,
,
,

Description

The MAAD dataset represents a comprehensive collection of Arabic news articles that may be employed across a diverse array of Arabic Natural Language Processing (NLP) applications, including but not limited to classification, text generation, summarization, and various other tasks. The dataset was diligently assembled through the application of specifically designed Python scripts that targeted six prominent news platforms: Al Jazeera, BBC Arabic, Youm7, Russia Today, and Al Ummah, in conjunction with regional and local media outlets, ultimately resulting in a total of 602,792 articles. This dataset exhibits a total word count of 29,371,439, with the number of unique words totaling 296,518; the average word length has been determined to be 6.36 characters, while the mean article length is calculated at 736.09 characters. This extensive dataset is categorized into ten distinct classifications: Political, Economic, Cultural, Arts, Sports, Health, Technology, Community, Incidents, and Local. The data fields are categorized into five distinct types: Title, Article, Summary, Category, and Published_ Date. The MAAD dataset is structured into six files, each named after the corresponding news outlets from which the data was sourced; within each directory, text files are provided, containing the number of categories represented in a single file, formatted in txt to accommodate all news articles. This dataset serves as an expansive standard resource designed for utilization within the context of our research endeavors.

Files

Institutions

Ibb University

Categories

Natural Language Processing, Machine Learning, Text Extraction, Automated Data Collection, Arabic Language, Natural Language Generation, Text Processing, Deep Learning

Licence