Roman Urdu Word Variations and Normalized Sentiment Review Dataset (RUWV-NSR)

Published: 7 October 2024 | Version 5 | DOI: 10.17632/v5jfhsvtmd.5
Contributor:
Mudasar Ahmed

Description

We have developed two Roman Urdu datasets, translated into English.

The first dataset covers Roman Urdu words and their spelling variations. It is structured as an Excel file with five columns labeled "Var-1" to "Var-5", which together hold up to five Roman Urdu spelling variations of each word, followed by a final column, "common", which contains the most frequently used spelling of that word. In total, this dataset includes 5,244 unique Roman Urdu words, which amount to 19,527 words when their variations are counted.

The second dataset contains Roman Urdu reviews, each labeled with a sentiment. Roman Urdu spelling on the web is highly variable because users often create their own spellings, so we have normalized the spelling of words across these reviews. This dataset is the first of its kind and contains the largest collection of Roman Urdu reviews, 28,090 in total, categorized into five sentiment classes. It is particularly valuable for analyzing Roman Urdu content such as online product reviews or Roman Urdu articles, which are becoming increasingly common, and it offers significant potential for sentiment analysis and language processing applications.
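As a rough illustration of how the variation table could be used to normalize review text, the sketch below builds a lookup from each spelling variation to its "common" form and applies it token by token. The column names "Var-1" to "Var-5" and "common" come from the description above; the sample rows, the function names, and the whitespace tokenization are assumptions made for this example and are not taken from the dataset or from the authors' own normalization procedure.

```python
# Hypothetical sample rows in the shape described above.
# The words are illustrative only and are NOT copied from the dataset.
ROWS = [
    {"Var-1": "achha", "Var-2": "acha", "Var-3": "achaa",
     "Var-4": None, "Var-5": None, "common": "acha"},
    {"Var-1": "nahin", "Var-2": "nhi", "Var-3": "nai",
     "Var-4": "nahi", "Var-5": None, "common": "nahi"},
]

# Column names as given in the dataset description.
VARIANT_COLUMNS = [f"Var-{i}" for i in range(1, 6)]


def build_variant_map(rows):
    """Map every spelling variation (lowercased) to its common spelling."""
    mapping = {}
    for row in rows:
        common = row["common"]
        for column in VARIANT_COLUMNS:
            variant = row.get(column)
            # Skip empty cells (None, or NaN if the rows came from a spreadsheet).
            if isinstance(variant, str) and variant.strip():
                mapping[variant.strip().lower()] = common
    return mapping


def normalize(text, mapping):
    """Replace each whitespace-separated token with its common spelling.

    Tokens without an entry in the mapping are kept unchanged (lowercased).
    """
    tokens = text.lower().split()
    return " ".join(mapping.get(token, token) for token in tokens)


if __name__ == "__main__":
    variant_map = build_variant_map(ROWS)
    print(normalize("Yeh phone achha nahin hai", variant_map))
    # prints: yeh phone acha nahi hai
```

With the real Excel file, the rows could presumably be loaded with a spreadsheet reader (for example `pandas.read_excel(...)` followed by `.to_dict("records")`) and passed to `build_variant_map` unchanged, provided the column headers match those in the description. Splitting on whitespace is the simplest possible tokenization; punctuation attached to a word would prevent a match and would need separate handling.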

Files

Institutions

Quaid-e-Awam University of Engineering Science and Technology

Categories

Statistical Natural Language Processing, Text Mining, Sentiment Analysis

Licence