Synopsis, reviews, and keywords for model keyword extraction study in the movie domain
Description
The use of keywords is increasingly common across diverse domains, including the movie industry, whose main platforms are adopting advanced natural language processing techniques. Algorithms for automatic keyword extraction can provide relevant information in this domain. The data presented here were generated to perform a keyword extraction model study in the movie domain.
The Excel file “movie_inputs.xlsx” contains the movie synopses, the reviews, or the concatenation of both, from which the movie keywords are extracted. There are 4 columns: "id_title" and "title", which refer to the movie title; "type", which can be "both", "reviews", or "synopsis"; and "text", which contains the movie content. For the "both" and "reviews" types, movie ids go from 1 to 21, whereas for the "synopsis" type they go from 1 to 100.
The Excel file “movie_keywords.xlsx” contains the 20 golden keywords assigned to each movie for evaluation purposes. There are 5 columns: "id_title", "title", and "type" have the same meaning as in the previous file; "keyword_id" is the id of the keyword within each movie, from 1 to 20; and "keyword" contains the keyword itself.
All details regarding data collection and dataset construction are provided in the following paper. Please cite it in any scientific publication produced using this dataset: Carlos González-Santos, Miguel A. Vega-Rodríguez, Carlos J. Pérez, Iñaki Martínez-Sarriegui, and Joaquín M. López-Muñoz. A keyword extraction model study in the movie domain with synopsis and reviews, Knowledge and Information Systems, 2025, https://doi.org/10.1007/s10115-025-02350-4
This research has been supported by the Ministry of Science and Innovation - Spain and the State Research Agency - Spain (Projects PID2022-137275NA-I00 and PID2021-122209OB-C32 funded by MCIN/AEI/10.13039/501100011033), Junta de Extremadura - Spain (Projects IDA3-19-0001-3, GR21017, and GR21057), and the European Union (European Regional Development Fund).
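The column layout described for the two Excel files can be illustrated with a minimal Python sketch. It is not part of the dataset or of the paper's code. The rows below are invented placeholders that only mimic the documented columns, and the helper names texts_by_movie and golden_keywords are hypothetical. Assuming pandas and an Excel reader such as openpyxl are installed, pandas.read_excel("movie_inputs.xlsx").to_dict("records") would be one way to obtain rows of this shape from the real files.

```python
from collections import defaultdict

# Placeholder rows mimicking the documented columns of movie_inputs.xlsx.
# Titles and texts are invented for illustration, not taken from the dataset.
input_rows = [
    {"id_title": 1, "title": "Example Movie", "type": "synopsis",
     "text": "A placeholder synopsis."},
    {"id_title": 1, "title": "Example Movie", "type": "reviews",
     "text": "A placeholder review."},
    {"id_title": 2, "title": "Another Movie", "type": "synopsis",
     "text": "Another placeholder synopsis."},
]

# Placeholder rows mimicking the documented columns of movie_keywords.xlsx.
keyword_rows = [
    {"id_title": 1, "title": "Example Movie", "type": "synopsis",
     "keyword_id": 2, "keyword": "road trip"},
    {"id_title": 1, "title": "Example Movie", "type": "synopsis",
     "keyword_id": 1, "keyword": "friendship"},
    {"id_title": 2, "title": "Another Movie", "type": "synopsis",
     "keyword_id": 1, "keyword": "heist"},
]


def texts_by_movie(rows, text_type):
    """Map id_title -> text for the rows whose "type" equals text_type."""
    return {row["id_title"]: row["text"]
            for row in rows if row["type"] == text_type}


def golden_keywords(rows, text_type):
    """Map id_title -> golden keywords of that type, ordered by keyword_id."""
    grouped = defaultdict(list)
    for row in sorted(rows, key=lambda r: (r["id_title"], r["keyword_id"])):
        if row["type"] == text_type:
            grouped[row["id_title"]].append(row["keyword"])
    return dict(grouped)


synopses = texts_by_movie(input_rows, "synopsis")
golden = golden_keywords(keyword_rows, "synopsis")

# Pair each input text with its golden keywords, joined on id_title.
for movie_id, text in synopses.items():
    print(movie_id, text, golden.get(movie_id, []))
```

Joining the two files on "id_title" and "type" keeps each input text aligned with the golden keywords assigned for that same input type.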
Funding
Agencia Estatal de Investigación
PID2022-137275NA-I00
Agencia Estatal de Investigación
PID2021-122209OB-C32
Government of Extremadura
IDA3-19-0001-3, GR21017, and GR21057
European Commission
IDA3-19-0001-3, GR21017, and GR21057