Urdu News Dataset 1M
Description
This dataset, A Large-Scale News Dataset for Urdu Text Processing, is to the best of our knowledge the only available large-scale Urdu-language dataset for many NLP and machine/deep learning tasks: text processing, classification, summarization, named entity recognition, topic modeling, and text generation. It offers a text corpus of more than 1 million Urdu news stories in four distinct categories: Business & Economics, Science & Technology, Entertainment, and Sports. The categories were chosen to be clearly distinct so that labels are unambiguous, which makes the dataset suitable for many Urdu NLP tasks.
Files
Steps to reproduce
First, the major Urdu news sources publishing in four distinct categories (Business & Economics, Science & Technology, Entertainment, and Sports) were identified. The web scraping policies of these news sources, where available, were carefully evaluated before any stories were scraped; they indicate that the content can be used for non-commercial research purposes only, provided the news source is credited. A separate customised Python script using the Beautiful Soup and Requests libraries was used to extract data from each website. Preprocessing used customised functions and regular expressions in Python to keep only Urdu text and numbers in the dataset corpus.
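The preprocessing step could look roughly like the sketch below. It is not the authors' script. The function name keep_urdu_and_numbers is hypothetical, and the sketch assumes that "Urdu text" can be approximated by the Unicode Arabic block (U+0600 to U+06FF) and that "numbers" includes ASCII digits. Characters outside that range, such as the zero-width non-joiner, are dropped under this assumption.

```python
import re

# Assumption: Urdu text is approximated by the Unicode Arabic block
# (U+0600-U+06FF), which also contains Arabic-script punctuation and
# Extended Arabic-Indic digits. ASCII digits 0-9 are kept as "numbers".
_NOT_URDU_OR_DIGIT = re.compile(r"[^\u0600-\u06FF0-9\s]+")
_WHITESPACE_RUN = re.compile(r"\s+")


def keep_urdu_and_numbers(text: str) -> str:
    """Hypothetical cleaner: drop everything except Urdu script and digits."""
    # Replace each run of unwanted characters with a space so that
    # neighbouring words do not get glued together.
    cleaned = _NOT_URDU_OR_DIGIT.sub(" ", text)
    # Collapse the resulting whitespace runs and trim both ends.
    return _WHITESPACE_RUN.sub(" ", cleaned).strip()


if __name__ == "__main__":
    # The sample string is written with escapes; it spells the Urdu word
    # for "Urdu", followed by Latin text, a year, and punctuation.
    sample = "<p>\u0627\u0631\u062f\u0648 news 2021!</p>"
    print(keep_urdu_and_numbers(sample))
```

Replacing unwanted runs with a space rather than deleting them is a design choice: it keeps word boundaries intact when HTML tags or Latin words sit between two Urdu tokens.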