Winograd Images Dataset

Published: 18 June 2025 | Version 1 | DOI: 10.17632/z6jb259pcd.1
Contributors:
Shravan Murlidaran

Description

Understanding the factors influencing eye movements during free viewing (no instructions from the experimenter) is challenging. On one side, an extensive body of work suggests that low-level saliency (Itti et al., 1998; Harel et al., 2006) predicts where people look, while more recent work by Henderson et al. (2018) on meaning maps has shown that local cropped image regions judged to be meaningful are a much better predictor. In this study, we hypothesize that people try to understand scenes by default while freely viewing, directing their fixations to areas that contribute to understanding the scene. In most natural scenes, low-level saliency and locally meaningful regions are correlated with the objects important to understanding a scene. To dissociate these factors, we developed the Winograd image pairs, inspired by the Winograd Schema Challenge for sentences (Levesque et al., 2012). The images in each pair look very similar, but when asked to describe them, people describe them entirely differently. This allows us to study scene understanding while preserving the low-level visual aspects. This dataset gives access to the 18 pairs of images used in the study.

We also introduce a new quantitative approach, Scene Understanding Maps (SUM), to measure the contribution of an object to scene understanding by assessing the impact of deleting each object from the image on the scene description relative to the gold-standard description. This dataset gives access to the images with each object deleted as well as the Scene Understanding Maps (SUM). As part of the study, we conducted four eye movement conditions (free viewing, scene description, object search, and counting objects; between-subject design, N=50 per condition) and compared the ability of our SUM model and other fixation prediction models (DeepGaze, GBVS, and meaning maps) to predict fixation frequency. We provide all the generated heat maps as part of this dataset. The eye movement data and the code to access it are provided in the GitHub repository (https://github.com/shravan1394/WinogradDataset), as are all descriptions collected as part of this study.

The dataset contains the following:

Winograd Image Pairs: the 18 image pairs used in the study. Each pair is split into two folders (Set_1 and Set_2).

SourceData: NumPy files used to plot the fixation distribution across object categories and the cumulative fixation line plots.

HeatMaps: measured fixation heatmaps (from eye movement data of 50 subjects per condition) and predicted fixation heatmaps (from models including our SUM model) for each experiment (free viewing, scene description, object search, counting objects). We also provide a summary Word document showing all the heat maps for each image in the dataset.

If using this dataset or images, please reference this paper: "The Curious Mind: Eye Movements to Maximize Scene Understanding."
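For reference, below is a minimal Python sketch of how a heat map stored as a NumPy file might be loaded and visualized. The file path shown is a placeholder, not an actual file name from this dataset; the real folder layout is documented in the GitHub repository linked above.

import numpy as np
import matplotlib.pyplot as plt

# Placeholder path: substitute the actual file name from the
# HeatMaps folder of the dataset.
heatmap = np.load("HeatMaps/free_viewing/image_01_measured.npy")

plt.imshow(heatmap, cmap="hot")
plt.colorbar(label="fixation density")
plt.title("Measured fixation heat map (example)")
plt.axis("off")
plt.show()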

Files

Steps to reproduce

The Winograd image pairs were carefully curated using a trial-and-error process to ensure that most people would describe the scenes consistently. The scenes were created so that each of them had at least one critical object without which people could not understand what was happening in the scene. Object erasure was done using the photo editor app of the Samsung Galaxy S21 (version 3.4.2.43); it is not necessary to use the same platform, as many other open-source tools are available to erase objects from scenes. The DeepGaze maps were generated using the code at https://github.com/matthias-k/DeepGaze. Meaning maps are a crowdsourced measure in which people rate the informativeness of local patches of images (Henderson et al., 2018; https://jov.arvojournals.org/article.aspx?articleid=2685927). GBVS maps were generated using the pysaliency package (https://github.com/matthias-k/pysaliency). The details of the SUM map are provided in the paper associated with this dataset (https://osf.io/preprints/psyarxiv/6c8gf), and the code implementation is given in the GitHub repository. The fixation maps were generated by convolving the fixation points at each (x, y) location with a Gaussian kernel of 0.5 dva standard deviation (details in the paper). A sketch of this smoothing step is given below.
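As an illustration, here is a minimal Python sketch of the Gaussian-smoothing step described above, assuming fixations are given in pixel coordinates and a known pixels-per-degree conversion. The function name, image size, and example values are placeholders, not taken from the dataset; the exact procedure is described in the paper.

import numpy as np
from scipy.ndimage import gaussian_filter

def fixation_map(fixations, image_shape, pixels_per_degree, sigma_dva=0.5):
    """Accumulate fixation points into a 2D map and smooth with a Gaussian
    whose standard deviation is sigma_dva degrees of visual angle."""
    h, w = image_shape
    counts = np.zeros((h, w), dtype=float)
    for x, y in fixations:               # (x, y) in pixel coordinates
        xi, yi = int(round(x)), int(round(y))
        if 0 <= yi < h and 0 <= xi < w:
            counts[yi, xi] += 1.0
    sigma_px = sigma_dva * pixels_per_degree   # convert dva to pixels
    smoothed = gaussian_filter(counts, sigma=sigma_px)
    return smoothed / smoothed.sum()     # normalize to a density

# Example with placeholder values: a 768x1024 image viewed at ~30 px/deg
example = fixation_map([(512.3, 384.7), (600.0, 200.5)], (768, 1024), 30.0)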

Institutions

University of California Santa Barbara

Categories

Scene Understanding

Licence