Roman Urdu word and reviews dataset.

Published: 5 April 2024 | Version 3 | DOI: 10.17632/v5jfhsvtmd.3
Mudasar Ahmed


We have created two Roman Urdu datasets. The first is dedicated to Roman Urdu words and is provided as an Excel file. Five columns, named "Var-1" to "Var-5", hold the spelling variations of each word, so each word has between one and five variant spellings. The last column, named "common", holds one selected spelling for each word: the spelling that is used most frequently. The dataset covers 5,244 Roman Urdu words; counting the spelling variations, it contains a total of 19,527 Roman Urdu word forms.

The second dataset contains Roman Urdu reviews, each of which has been assigned a sentiment. Because people on the web write Roman Urdu with their own spellings, we standardized the reviews so that each word uses a single, unique spelling. We believe this dataset is the first of its kind and the largest collection of Roman Urdu reviews, with 35,139 reviews categorized into five classes. It can be useful now that newspapers are written in Roman Urdu, and even more so for analyzing the reviews that people write in Roman Urdu after purchasing products online.
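As a rough illustration of how the "Var-1" to "Var-5" and "common" columns could be used to standardize spellings in a review, the following Python sketch builds a variant-to-common lookup and applies it to a sentence. It is only a sketch under stated assumptions: the sample rows are invented for illustration and are not copied from the dataset, the function names `build_lookup` and `normalize` are hypothetical, and the actual procedure used to standardize the reviews may differ.

```python
import re

# Illustrative rows only (NOT taken from the dataset). Each row mimics the
# described layout: columns "Var-1" to "Var-5" plus a "common" column.
SAMPLE_ROWS = [
    {"Var-1": "nahi", "Var-2": "nahin", "Var-3": "nhi",
     "Var-4": "", "Var-5": "", "common": "nahi"},
    {"Var-1": "acha", "Var-2": "achha", "Var-3": "accha",
     "Var-4": "", "Var-5": "", "common": "acha"},
]

VARIANT_COLUMNS = [f"Var-{i}" for i in range(1, 6)]


def build_lookup(rows):
    """Map every spelling variant (lowercased) to its 'common' spelling."""
    lookup = {}
    for row in rows:
        common = (row.get("common") or "").strip().lower()
        if not common:
            continue  # skip rows without a selected common spelling
        lookup.setdefault(common, common)
        for column in VARIANT_COLUMNS:
            variant = (row.get(column) or "").strip().lower()
            if variant:
                # Keep the first mapping if a variant appears in two rows.
                lookup.setdefault(variant, common)
    return lookup


def normalize(text, lookup):
    """Replace each known variant in text with its common spelling.

    Unknown words, punctuation, and spacing are left unchanged.
    """
    def replace(match):
        word = match.group(0)
        return lookup.get(word.lower(), word)

    return re.sub(r"[A-Za-z]+", replace, text)


lookup = build_lookup(SAMPLE_ROWS)
print(normalize("Yeh phone achha nhi hai.", lookup))
```

With the real Excel file, the rows could instead be loaded with a spreadsheet reader and passed to `build_lookup` in the same dictionary-per-row form; the exact sheet name and header spelling would need to be checked against the file.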



Quaid-e-Awam University of Engineering Science and Technology


Statistical Natural Language Processing, Text Mining, Sentiment Analysis