Potrika: Raw and Balanced Newspaper Datasets in the Bangla Language with Eight Topics and Five Attributes

Published: 18 October 2021 | Version 1 | DOI: 10.17632/v362rp78dc.1

Description

The dataset is organized into eight directories, one per news category. Each directory contains CSV files with five columns: the news article, category, heading, publication date, and source (the newspaper). The data is kept in raw form: no cleaning, stemming, or other preprocessing was applied after scraping. In total, the dataset contains about 665K articles, 12.5M sentences, and 185.5M words. A separate folder, balanced_dataset, contains a balanced version in which each category has 40K articles, for a total of 320K articles.
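The layout described above (one directory per category, each holding CSV files) can be explored with a short script. The following is a minimal sketch, not the authors' tooling: the function names `count_articles` and `balanced_sample` are hypothetical, and it assumes that each CSV starts with a header row, is UTF-8 encoded, and sits directly inside its category directory. The description does not state the exact column names or file names, so the sketch relies only on directory names.

```python
import csv
import random
from collections import Counter
from pathlib import Path

# Raw news articles can be long; raise the csv module's per-field limit.
csv.field_size_limit(10**9)


def count_articles(root):
    """Count data rows per category directory under `root`.

    Assumptions: every CSV has one header row and lives at
    <root>/<category>/<file>.csv.
    """
    counts = Counter()
    for csv_path in sorted(Path(root).glob("*/*.csv")):
        with open(csv_path, newline="", encoding="utf-8") as f:
            reader = csv.reader(f)
            next(reader, None)  # skip the assumed header row
            counts[csv_path.parent.name] += sum(1 for _ in reader)
    return counts


def balanced_sample(rows_by_category, per_category, seed=0):
    """Draw the same number of rows from every category.

    Illustrates one way to build a balanced subset; it is not
    necessarily how balanced_dataset was produced. Raises ValueError
    if a category has fewer than `per_category` rows.
    """
    rng = random.Random(seed)
    return {
        category: rng.sample(rows, per_category)
        for category, rows in rows_by_category.items()
    }
```

Calling `count_articles` on the extracted dataset root would give a per-category article count, which can be compared against the totals quoted above. If balanced_dataset sits under the same root, it will be counted as if it were a category and should be excluded or pointed at separately.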

Institutions

King Abdulaziz University Faculty of Computing and Information Technology

Categories

Computer Science, Natural Language Processing, Machine Learning, Deep Learning, Textual Analysis
