soMLier Vivino Rating Data

Published: 2 September 2022 | Version 1 | DOI: 10.17632/dtbm7n6npz.1
Josh Redelinghuys


This dataset consists of 278,765 ratings made by 92,514 users for 1,640 wines. All ratings of these wines (made in the interval [1, 5] in increments of 0.5) were scraped from Vivino during August 2021. The data can be used, for example, to develop and compare recommender systems based on collaborative filtering or matrix factorisation. The ratings are already partitioned at random into training, validation and test sets in the proportions 70%, 20% and 10% respectively. Every user appears in the training set, but not every user appears in the validation or test sets.

The data can be loaded in R using load('VivinoRatingData.RData'). This provides a training ratings matrix 'set.train' with 92,514 user rows and 1,640 wine columns, where the column names correspond to the Vivino wine IDs. 'known.position' is a list of all the matrix indices which hold a known rating; some of these entries are NA in 'set.train' because they have been hidden in the validation or test sets. Likewise, 'test.position' and 'valid.position' contain the matrix indices of the ratings hidden in the test and validation sets, respectively, and 'set.test' and 'set.valid' contain the rating values hidden from 'set.train'. For example, set.train[valid.position[1:3]] = {NA, NA, NA} are the first 3 validation ratings hidden in the training set, and their corresponding rating values are set.valid[1:3] = {4, 3.5, 4}.
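
The layout described above can be illustrated with a small R sketch. It does not read the real file: it builds a toy 5 x 4 matrix with invented ratings and hypothetical wine IDs, and it assumes that the position objects are linear (column-major) indices into the matrix, as the set.train[valid.position[1:3]] example suggests. The global-mean baseline and the rmse() helper are illustrative choices, not part of the dataset.

```r
# Toy stand-in for the objects in VivinoRatingData.RData.
# All sizes, wine IDs and rating values here are invented for illustration.
full <- matrix(NA_real_, nrow = 5, ncol = 4,
               dimnames = list(NULL, c("101", "102", "103", "104")))

# Assumption: positions are linear (column-major) indices into the matrix.
known.position <- c(1, 2, 3, 4, 5, 7, 9, 13, 16, 20)
full[known.position] <- c(4, 3.5, 5, 3, 4.5, 4, 2.5, 3, 3.5, 4)

# Hide some known ratings as validation and test ratings.
valid.position <- c(7, 13)
test.position  <- 20
set.valid <- full[valid.position]
set.test  <- full[test.position]

set.train <- full
set.train[c(valid.position, test.position)] <- NA

# Baseline predictor: the mean of the ratings still visible in set.train.
global.mean <- mean(set.train, na.rm = TRUE)

# Hypothetical helper: root mean squared error of predictions vs. hidden ratings.
rmse <- function(pred, truth) sqrt(mean((pred - truth)^2))
valid.rmse <- rmse(rep(global.mean, length(set.valid)), set.valid)
test.rmse  <- rmse(rep(global.mean, length(set.test)), set.test)

# Recover the (user row, wine column) pair behind each validation index.
valid.rc    <- arrayInd(valid.position, dim(set.train))
valid.wines <- colnames(set.train)[valid.rc[, 2]]
```

With the real data, the toy construction would be replaced by load('VivinoRatingData.RData'), and the same indexing should apply provided the position objects are plain index vectors; if they are stored as R lists, unlist() may be needed first.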


Steps to reproduce

Using the R libraries rvest, stringr and jsonlite, a script scraped the ratings for each wine on Vivino using the Vivino wine ID and a call to their API (''). The script would search for a wine on Vivino and use the aforementioned URL to extract its corresponding ratings.


University of Cape Town


Wine, Recommendation System