A Dataset of TripAdvisor Guest Reviews for Major Hotels in Salalah, Oman
Description
This dataset contains TripAdvisor guest reviews for major hotels in Salalah, Oman, collected through web scraping. It provides insights into guest satisfaction, sentiment, and ratings, making it a valuable resource for hospitality and tourism research, sentiment analysis, and tourism marketing studies.

Hotels Included in the Dataset
The dataset features guest reviews from the following hotels in Salalah:
• Al Baleed Resort Salalah by Anantara
• Belad Bont Resort
• Crowne Plaza Resort Salalah
• Fanar Hotel and Residences
• Hilton Salalah Resort
• Juweira Boutique Hotel
• Millennium Resort Salalah
• Salalah Gardens Hotel
• Salalah Rotana Resort

Time Coverage
The dataset captures all available guest reviews from each hotel's first appearance on TripAdvisor through February 2025.

Relevance to Khareef Tourism and Oman Vision 2040
This dataset is particularly beneficial for the following government agencies:
• Ministry of Heritage and Tourism - Oman
• Oman Chamber of Commerce & Industry (OCCI)
• Dhofar Municipality and Dhofar Tourism Department
• National Centre for Statistics and Information (NCSI)
• Oman Vision 2040 Implementation Follow-up Unit
• Ministry of Commerce, Industry, and Investment Promotion
• Oman Tourism Development Company (OMRAN)
• Ministry of Transport, Communications, and Information Technology (MTCIT)
• Dhofar Governorate Office
• Ministry of Environment and Climate Affairs

It also serves as a valuable resource for researchers, policymakers, and marketing, hospitality, and tourism professionals seeking to enhance Salalah's tourism sector, improve guest satisfaction, and support Oman's long-term vision for a thriving and sustainable tourism industry. Salalah experiences a surge in visitors during the Khareef (monsoon) season, a critical period for the hospitality industry. This dataset can help analyze guest experiences, identify service gaps, and optimize offerings during this peak tourism period.

Oman Vision 2040 Goals
The dataset aligns with Oman's Vision 2040, which prioritizes tourism sector growth, economic diversification, and enhanced customer experiences. By leveraging sentiment analysis and guest insights, policymakers and hotel managers can develop data-driven strategies to improve hospitality services, attract more visitors, and enhance Salalah's reputation as a premier travel destination.

Potential Use Cases
• Sentiment Analysis: Understanding guest satisfaction trends over time
• Tourism & Hospitality Research: Evaluating service quality and hotel performance across different years
• Marketing Insights: Identifying key drivers of positive and negative reviews for strategic decision-making
• Machine Learning & NLP: Training models for text classification, sentiment prediction, and recommendation systems
Steps to reproduce
A structured data extraction and preprocessing workflow was used to collect, transform, and clean publicly available information from the front-end interface of the TripAdvisor website, preparing it for downstream data analytics, sentiment analysis, and predictive modeling.

The process began with the extraction of unstructured review data using the TripAdvisor Review Scraper by ExtensionsBox, which exports the information into a CSV file containing multiple attributes. The link (URL) of each hotel's TripAdvisor page was used to extract that hotel's reviews directly, yielding one structured dataset per hotel.

The dataset includes nine hotels: those recognized under TripAdvisor's "Best of the Best" and "Travelers' Choice" award criteria, along with other major hotels, so that the dataset focuses on top-rated accommodations in Salalah, Oman. To enable comparative analysis across all hotels, the nine individual datasets were merged into a single combined dataset.

Data processing and transformation were carried out in Python with the pandas and NumPy libraries. Because the raw data contains personally identifiable information (PII), the following sensitive attributes were deleted: Review ID, User ID, Display Name, Username, User Profile, User Avatar, Photos, and URLs. This step follows data privacy best practices while preserving the integrity and usability of the dataset.

The "Additional Ratings" column, which originally held multiple review aspects in a single unstructured field, was parsed and transformed into structured features. This transformation involved converting the column into a dictionary-like format and extracting numerical values for Value, Rooms, Location, Cleanliness, Service, and Sleep Quality.
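The parsing of the "Additional Ratings" field described above could look roughly like the sketch below. The raw cell format is not documented here, so the code assumes each cell is a string that resembles a Python dict literal (for example `"{'Value': 5, 'Rooms': 4}"`); the function names `parse_additional_ratings` and `_to_int` are hypothetical and only illustrate the idea.

```python
import ast

# Aspect names as listed in the text; the raw cell format is an assumption.
ASPECTS = ["Value", "Rooms", "Location", "Cleanliness", "Service", "Sleep Quality"]


def _to_int(value):
    """Convert a rating to int, or return None if it is absent or malformed."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def parse_additional_ratings(raw):
    """Parse one raw cell into a dict mapping each aspect to an int or None."""
    parsed = {}
    if isinstance(raw, str):
        try:
            # literal_eval only evaluates literals, so it is safe on scraped text
            parsed = ast.literal_eval(raw)
        except (ValueError, SyntaxError, TypeError):
            parsed = {}
    elif isinstance(raw, dict):
        parsed = raw
    if not isinstance(parsed, dict):
        parsed = {}
    # Aspects the reviewer did not rate come back as None
    return {aspect: _to_int(parsed.get(aspect)) for aspect in ASPECTS}


example = parse_additional_ratings("{'Value': 5, 'Sleep Quality': '3'}")
```

If the scraper exports a different layout (JSON, or a list of label and score pairs), only the parsing step inside `parse_additional_ratings` would need to change; the per-aspect extraction stays the same.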
To maintain data consistency, missing values (cases where users did not provide a rating) were replaced with 0, so that all ratings are stored as integers for uniformity in further analysis. Once the transformation was complete, the original "Additional Ratings" column was dropped, resulting in a structured dataset ready for advanced analytics, including sentiment analysis, machine learning models, and consumer behavior insights.
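The overall cleaning workflow (merge, PII removal, zero-filling, and dropping the original column) might be sketched in pandas as follows. This is an illustration under stated assumptions, not the authors' script: the column labels are taken from the attribute names in the text and may differ from the actual CSV headers, the `Hotel` and `Rating` columns and the two toy rows are invented for the example, and each "Additional Ratings" cell is assumed to already hold a dict (or nothing).

```python
import pandas as pd

# Column labels follow the attribute names in the text (assumed CSV headers).
PII_COLUMNS = ["Review ID", "User ID", "Display Name", "Username",
               "User Profile", "User Avatar", "Photos", "URLs"]
ASPECTS = ["Value", "Rooms", "Location", "Cleanliness", "Service", "Sleep Quality"]


def clean_reviews(frames):
    """Merge per-hotel frames, drop PII columns, and expand the ratings column."""
    combined = pd.concat(frames, ignore_index=True)
    # errors="ignore" skips any listed column that an export happens to lack
    combined = combined.drop(columns=PII_COLUMNS, errors="ignore")
    # Treat anything that is not a dict as "no sub-ratings given"
    cells = [c if isinstance(c, dict) else {} for c in combined["Additional Ratings"]]
    for aspect in ASPECTS:
        values = pd.Series([c.get(aspect) for c in cells], index=combined.index)
        # Missing ratings become 0 so every aspect column has integer dtype
        combined[aspect] = pd.to_numeric(values, errors="coerce").fillna(0).astype(int)
    return combined.drop(columns=["Additional Ratings"])


# Toy stand-ins for two per-hotel exports (illustrative values, not real reviews)
hotel_a = pd.DataFrame({
    "Hotel": ["Hotel A"], "Username": ["user1"], "Rating": [5],
    "Additional Ratings": [{"Value": 5, "Service": 4}],
})
hotel_b = pd.DataFrame({
    "Hotel": ["Hotel B"], "Username": ["user2"], "Rating": [3],
    "Additional Ratings": [None],
})
cleaned = clean_reviews([hotel_a, hotel_b])
```

Because 0 marks a missing rating here rather than a score, per-aspect averages should exclude zeros, for example by keeping only rows where the aspect value is greater than 0 before aggregating.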