Urdu News Dataset 1M

Name: Urdu News Dataset 1M
Creator: Khalid Hussain
Published: 2021-01-27T16:44:11.356Z
Keywords: Natural Language Processing, Machine Learning, Urdu Language, Deep Learning

Hussain, Khalid; Mughal, Nimra; Ali, Irfan; Hassan, Saif; Daudpota, Sher Muhammad

doi:10.17632/834vsxnb99.3

Published: 27 January 2021 | Version 3 | DOI: 10.17632/834vsxnb99.3

Description

The dataset, A Large-Scale News Dataset for Urdu Text Processing, is, to the best of our knowledge, the only available large-scale Urdu-language dataset for many NLP and machine/deep learning tasks: text processing, classification, summarization, named entity recognition, topic modeling, and text generation. It offers a text corpus of more than 1 million Urdu news stories in four distinct categories: Business & Economics, Science & Technology, Entertainment, and Sports. These categories were chosen to be clearly distinct so that there is no ambiguity between them, which makes the dataset suitable for many Urdu NLP tasks.

Steps to reproduce

The major Urdu news sources publishing news in four distinct categories (Business & Economics, Science & Technology, Entertainment, and Sports) were identified. Where available, the web scraping policies of these news sources were carefully evaluated before any news stories were scraped; they indicate that the content can be used for non-commercial research purposes only, provided the news source is credited. A separate customised Python script, using the Beautiful Soup and Requests libraries, was used to extract data from each website. Preprocessing was done with customised functions and regular expressions in Python to keep only Urdu text and numbers in the dataset corpus.
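
The preprocessing step could look roughly like the following. This is a minimal sketch, not the authors' actual script. It assumes that "Urdu text" can be approximated by the Unicode Arabic block (U+0600 to U+06FF) and that ASCII digits count as numbers. The function name `keep_urdu_and_numbers` is hypothetical.

```python
import re

# Assumption: Urdu letters, Urdu punctuation and Arabic-Indic digits all fall
# in the Unicode Arabic block U+0600-U+06FF; ASCII digits 0-9 are kept as well.
# Anything else that is not whitespace (Latin letters, HTML tags, symbols) is dropped.
_NOT_URDU_OR_DIGIT = re.compile(r"[^\u0600-\u06FF0-9\s]+")
_WHITESPACE = re.compile(r"\s+")


def keep_urdu_and_numbers(text: str) -> str:
    """Return `text` with only Urdu-script characters, digits and single spaces."""
    # Replace each run of unwanted characters with a space so words do not merge.
    cleaned = _NOT_URDU_OR_DIGIT.sub(" ", text)
    # Collapse the whitespace left behind and trim both ends.
    return _WHITESPACE.sub(" ", cleaned).strip()


if __name__ == "__main__":
    raw = "<p>پاکستان نے 2021 میں Cricket میچ جیتا!</p>"
    print(keep_urdu_and_numbers(raw))
```

A character-range filter like this is deliberately coarse: it also removes characters outside the Arabic block that some Urdu text relies on, such as the zero-width non-joiner (U+200C), so the range may need widening for a given source.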

Institutions

Sukkur Institute of Business Administration
