NADCG: New Arabic dataset for text classification and generation

Published: 5 September 2024 | Version 1 | DOI: 10.17632/mrh6fy2dkj.1
Contributor:
Hezam Gawbah

Description

- NADCG is a new Arabic dataset for text classification and generation.
- Size: 2,136,311 rows.
- NADCG is a large collection of Arabic news headlines, categories, and articles that can be used in several NLP tasks.
- Tasks: text generation, text classification, summarization, and producing word embeddings.
- Fields: headline, summary, article, and category.
- NADCG is larger than other datasets, with 2,136,311 classified news items in UTF-8 encoding and CSV format.
- NADCG contains a vast number of Arabic news items in eight categories (Politics, Economics, Sports, Health, Technology, Culture, Arts, Accidents). In general, the corpus adopts the label each article carried on its source news portal. In summary, NADCG's large size and variety of fields set it apart, so it can be used for many tasks, including training large transformer models. It is also freely available.
- NADCG_SUBSET is a balanced benchmark dataset of 80K rows, drawn from NADCG and used in our research work. It is split into training (90%), validation (5%), and testing (5%) sets: 72,000 rows for training, 4,000 for validation, and 4,000 for testing.
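
A balanced 90/5/5 split like the one described for NADCG_SUBSET could be sketched as follows. This is a minimal illustration, not the procedure used to build the published subset: the column name `category` is an assumption (the description lists the fields but not the exact CSV headers), and the helpers `read_rows` and `balanced_split` are hypothetical names introduced here.

```python
import csv
import random
from collections import defaultdict

# Assumed column name; the exact CSV header is not given in the description.
CATEGORY_FIELD = "category"


def read_rows(path):
    """Read a UTF-8 CSV file into a list of dicts, one per news item."""
    with open(path, encoding="utf-8", newline="") as handle:
        return list(csv.DictReader(handle))


def balanced_split(rows, per_category, seed=0, fractions=(0.90, 0.05, 0.05)):
    """Sample `per_category` rows from each category, then split each
    category's sample into train/validation/test according to `fractions`."""
    rng = random.Random(seed)

    # Group rows by their category label.
    by_category = defaultdict(list)
    for row in rows:
        by_category[row[CATEGORY_FIELD]].append(row)

    train, validation, test = [], [], []
    for category in sorted(by_category):
        items = by_category[category]
        if len(items) < per_category:
            raise ValueError(f"category {category!r} has only {len(items)} rows")
        # Sampling without replacement keeps the three sets disjoint.
        sample = rng.sample(items, per_category)
        n_train = round(per_category * fractions[0])
        n_val = round(per_category * fractions[1])
        train.extend(sample[:n_train])
        validation.extend(sample[n_train:n_train + n_val])
        test.extend(sample[n_train + n_val:])
    return train, validation, test
```

If the 80K subset is spread evenly over the eight categories (an inference from "balanced", not something the description states), `per_category=10_000` would yield 72,000 training, 4,000 validation, and 4,000 testing rows. Note that `read_rows` loads the whole file into memory, which may be impractical for the full 2,136,311-row corpus.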

Institutions

Ibb University

Categories

Natural Language Processing, Classification System, Text Processing, Generational Difference, Summer, Word Embedding

Licence