Data on regional, ethnicity, and minorities representation in movies

Published: 20 February 2025 | Version 1 | DOI: 10.17632/kzv2m4hsvw.1
Contributor:
FERNANDO TAMBERLINI ALVES

Description

The data come primarily from three public databases: MovieLens, IMDb, and the Brazilian National Cinema Agency (ANCINE). We also collected movie data and subtitle files through web scraping and public APIs from six public websites: imdb.com, letterboxd.com, metacritic.com, rottentomatoes.com, subdl.com, and subscene.co.in. In addition, we used an LLM tool (Claude.ai by Anthropic) to collect regional and ethnicity data on each movie's director, screenwriter, and main character.

Files

Steps to reproduce

The proposed dataset was developed through the following steps:

1. Download the files from MovieLens 25M, IMDb Non-Commercial Data, and ANCINE Open Data.
2. Map all available movie data across these files.
3. Map additional movie data from other Internet sources (e.g., IMDb, Letterboxd, Metacritic, and Rotten Tomatoes) and collect it through web scraping and public APIs.
4. Perform an inner join between the MovieLens dataset and the IMDb dataset. Then left join the merged result with the ANCINE dataset.
5. For movies present in all three datasets (MovieLens, IMDb, and ANCINE), retrieve the corresponding subtitle files from Internet sources (e.g., Subscene and Subdl).
6. For each movie in the final merged dataset, submit questions regarding Gender, Race, Religion, Nationality, Birthplace, Ethnicity, and Minority data to the LLM tool, using the retrieved subtitle files as input. Note that subtitles were captured for only a subset of the films.
7. Process, aggregate, and summarize the collected data.

Sources:
https://grouplens.org/datasets/movielens/
https://developer.imdb.com/non-commercial-datasets/
https://www.gov.br/ancine/pt-br/oca/dados-abertos
https://rapidapi.com/
https://imdb.com
https://www.metacritic.com/
https://www.rottentomatoes.com/
https://letterboxd.com
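The join order in step 4, and the "present in all three datasets" filter in step 5, can be illustrated with a minimal sketch. It is not the authors' code. It assumes that MovieLens rows are linked to IMDb through the numeric imdbId column of the MovieLens links file, which corresponds to the IMDb tconst identifier once prefixed with "tt" and zero-padded to seven digits. The ANCINE key and the in_ancine column are hypothetical, chosen only for illustration, and the rows are toy values rather than dataset contents.

```python
def to_tconst(imdb_id):
    """Convert a MovieLens numeric imdbId into an IMDb tconst such as 'tt0114709'."""
    return f"tt{int(imdb_id):07d}"


def inner_join(left, right, key):
    """Keep only left rows whose key also appears in right, merging the columns."""
    index = {row[key]: row for row in right}
    return [{**row, **index[row[key]]} for row in left if row[key] in index]


def left_join(left, right, key, fill):
    """Keep every left row; rows without a match in right receive the fill defaults."""
    index = {row[key]: row for row in right}
    return [{**row, **index.get(row[key], fill)} for row in left]


# Toy rows standing in for the three downloaded sources.
movielens = [
    {"movieId": 1, "imdbId": 114709},
    {"movieId": 2, "imdbId": 113497},
    {"movieId": 3, "imdbId": 9999999},  # no matching IMDb row in this toy example
]
imdb = [
    {"tconst": "tt0114709", "primaryTitle": "Toy Story"},
    {"tconst": "tt0113497", "primaryTitle": "Jumanji"},
]
# Hypothetical: assumes the ANCINE rows were already mapped to a tconst key.
ancine = [{"tconst": "tt0114709", "in_ancine": True}]

# Derive the shared key on the MovieLens side.
ml_rows = [{**row, "tconst": to_tconst(row["imdbId"])} for row in movielens]

# Step 4: inner join MovieLens with IMDb, then left join the result with ANCINE.
merged = inner_join(ml_rows, imdb, "tconst")
final = left_join(merged, ancine, "tconst", {"in_ancine": False})

# Step 5: only movies found in all three sources are candidates for subtitles.
in_all_three = [row for row in final if row["in_ancine"]]
```

The inner join drops MovieLens entries without IMDb metadata, while the left join keeps every remaining movie whether or not ANCINE lists it, so the ANCINE columns act as optional enrichment rather than a filter.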

Institutions

Universidade Federal Fluminense Instituto de Computacao

Categories

Cinema, Cultural Diversity, Recommendation System

Licence