Bangla News Dataset

Published: 9 December 2019 | Version 2 | DOI: 10.17632/xp92jxr8wn.2
Contributors:
Aisha Khatun

Description

A corpus of Bangla newspaper articles covering 12 topics, collected with a custom web crawler. The dataset contains more than 28.5 million word tokens. The number of unique words is around 3% of the total word count. The dataset is imbalanced: the topics are not equally represented. 20% of the dataset was set aside as a held-out set.
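
The statistics above (token count, share of unique words, and a 20% held-out split) can be reproduced on any copy of the corpus with a few lines of Python. The following is a minimal sketch, not the author's procedure: it assumes articles are available as (topic, text) pairs, tokenizes on whitespace, and splits per topic so that an imbalanced corpus keeps its topic proportions. The function names `corpus_stats` and `stratified_holdout`, the sample articles, and the stratified strategy are all illustrative assumptions; the description does not say how the held-out set was drawn.

```python
import random
from collections import Counter, defaultdict


def corpus_stats(articles):
    """Return (token_count, unique_count, unique_ratio) for (topic, text) pairs."""
    counts = Counter()
    for _topic, text in articles:
        # Whitespace tokenization is an assumption; the dataset's own
        # tokenization is not described.
        counts.update(text.split())
    total = sum(counts.values())
    unique = len(counts)
    return total, unique, (unique / total if total else 0.0)


def stratified_holdout(articles, fraction=0.2, seed=0):
    """Hold out `fraction` of each topic (hypothetical splitting strategy)."""
    by_topic = defaultdict(list)
    for item in articles:
        by_topic[item[0]].append(item)
    rng = random.Random(seed)
    train, held_out = [], []
    for topic in sorted(by_topic):
        items = by_topic[topic]
        rng.shuffle(items)
        cut = round(len(items) * fraction)
        held_out.extend(items[:cut])
        train.extend(items[cut:])
    return train, held_out


# Toy, made-up sample with two imbalanced topics; the real corpus has 12.
sample = [("sports", "দল আজ জয় পেয়েছে")] * 10 + [("economy", "বাজার আজ স্থিতিশীল")] * 5
total, unique, ratio = corpus_stats(sample)
train, held_out = stratified_holdout(sample)
print(total, unique, round(ratio, 3), len(train), len(held_out))
```

Splitting within each topic rather than over the whole corpus is one common way to keep rare topics represented in both parts of an imbalanced dataset.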

Institutions

Shahjalal University of Science and Technology

Categories

Natural Language Processing, Bengali Language

Licence