Data on regional, ethnicity, and minorities representation in movies
Description
The data come primarily from three public databases: MovieLens, IMDb, and the Brazilian National Cinema Agency (ANCINE). We also collected movie data and subtitle files through web scraping and public APIs from six public websites: imdb.com, letterboxd.com, metacritic.com, rottentomatoes.com, subdl.com, and subscene.co.in. In addition, we used an LLM tool (Claude.ai by Anthropic) to collect regional and ethnicity data for each movie's director, screenwriter, and main character.
Steps to reproduce
The dataset was built through the following steps:
1. Download the files from MovieLens 25M, IMDb Non-Commercial Data, and ANCINE Open Data.
2. Map all available movie data across these files.
3. Map additional movie data from other Internet sources (e.g., IMDb, Letterboxd, Metacritic, and Rotten Tomatoes) and collect it through web scraping and public APIs.
4. Perform an inner join between the MovieLens dataset and the IMDb dataset. Then left join the resulting merged data with the ANCINE dataset.
5. For movies present in all three datasets (MovieLens, IMDb, and ANCINE), retrieve the corresponding subtitle files from Internet sources (e.g., Subscene and Subdl).
6. For each movie in the final merged dataset, submit questions regarding Gender, Race, Religion, Nationality, Birthplace, Ethnicity, and Minority data to the LLM tool, using the retrieved subtitle files as input. Note that subtitles were only captured for a subset of the films.
7. Process, aggregate, and summarize the collected data.
Sources:
https://grouplens.org/datasets/movielens/
https://developer.imdb.com/non-commercial-datasets/
https://www.gov.br/ancine/pt-br/oca/dados-abertos
https://rapidapi.com/
https://imdb.com
https://www.metacritic.com/
https://www.rottentomatoes.com/
https://letterboxd.com
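The join described in step 4 can be sketched with pandas as below. This is a minimal illustration under stated assumptions, not the authors' actual pipeline: the MovieLens table is assumed to carry the numeric imdbId column from its links file, the IMDb table its tconst key, and the ANCINE table is assumed to have already been mapped to a tconst column (the description does not say which key was used for that match). The helper names, the ancine_title column, and the toy rows are hypothetical.

```python
import pandas as pd


def to_tconst(imdb_id) -> str:
    """Convert a numeric MovieLens imdbId into an IMDb-style key ('tt' + zero-padded digits)."""
    return "tt" + str(int(imdb_id)).zfill(7)


def merge_sources(movielens: pd.DataFrame, imdb: pd.DataFrame, ancine: pd.DataFrame) -> pd.DataFrame:
    """Inner join MovieLens with IMDb, then left join the result with ANCINE."""
    # Build a shared key on the MovieLens side.
    ml = movielens.assign(tconst=movielens["imdbId"].map(to_tconst))

    # Inner join: keep only movies found in both MovieLens and IMDb.
    ml_imdb = ml.merge(imdb, on="tconst", how="inner")

    # Left join: keep every MovieLens/IMDb movie and attach ANCINE data where it exists.
    # ASSUMPTION: the ANCINE table already has a 'tconst' column (hypothetical).
    merged = ml_imdb.merge(ancine, on="tconst", how="left", indicator=True)

    # Flag movies present in all three sources (the subset targeted in step 5).
    merged["in_ancine"] = merged["_merge"] == "both"
    return merged.drop(columns="_merge")


# Toy rows for illustration only; they are not taken from the published dataset.
movielens = pd.DataFrame(
    {"movieId": [1, 2, 3], "imdbId": [114709, 317248, 9999999]}
)
imdb = pd.DataFrame(
    {"tconst": ["tt0114709", "tt0317248"], "primaryTitle": ["Movie A", "Movie B"]}
)
ancine = pd.DataFrame(
    {"tconst": ["tt0317248"], "ancine_title": ["Filme B"]}
)

merged = merge_sources(movielens, imdb, ancine)
print(merged[["movieId", "tconst", "primaryTitle", "in_ancine"]])
```

The inner join drops the MovieLens row with no IMDb match, while the left join keeps movies without ANCINE data and marks them with in_ancine set to False, so the subtitle retrieval in step 5 can filter on that flag.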