Synopsis, reviews, and keywords for model keyword extraction study in the movie domain

Published: 7 February 2025 | Version 1 | DOI: 10.17632/322hmdsyrm.1
Contributors:
Carlos González-Santos

Description

Keywords are increasingly used across diverse domains, including the movie industry, whose main platforms are adopting advanced natural language processing techniques. Algorithms for automatic keyword extraction can provide relevant information in this domain. The data presented here were generated to perform a keyword extraction model study in the movie domain.

The Excel file “movie_inputs.xlsx” contains the movie synopses and reviews, or the concatenation of both, used to extract the movie keywords. It has 4 columns: "id_title" and "title", which identify the movie; "type", which can be "both", "reviews", or "synopsis"; and "text", which contains the movie content. For the "both" and "reviews" types, movie ids go from 1 to 21, whereas for the "synopsis" type they go from 1 to 100.

The Excel file “movie_keywords.xlsx” contains the 20 golden keywords assigned to each movie for evaluation. It has 5 columns: "id_title", "title", and "type" have the same meaning as in the previous file; "keyword_id" is the id of the keyword within each movie, from 1 to 20; and "keyword" contains the keyword itself.

All details regarding data collection and dataset construction are provided in the following paper. Please cite it in any scientific publication produced using this dataset: Carlos González-Santos, Miguel A. Vega-Rodríguez, Carlos J. Pérez, Iñaki Martínez-Sarriegui, and Joaquín M. López-Muñoz. A keyword extraction model study in the movie domain with synopsis and reviews, Knowledge and Information Systems, 2025, https://doi.org/10.1007/s10115-025-02350-4

This research has been supported by the Ministry of Science and Innovation - Spain and the State Research Agency - Spain (Projects PID2022-137275NA-I00 and PID2021-122209OB-C32 funded by MCIN/AEI/10.13039/501100011033), Junta de Extremadura - Spain (Projects IDA3-19-0001-3, GR21017, and GR21057), and the European Union (European Regional Development Fund).
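The file layout described above can be checked and used with a few small helpers. The following is a minimal Python sketch, not part of the dataset or the paper: the function names are hypothetical, loading assumes pandas and openpyxl are installed, and the exact-match scoring with lowercasing is an assumed matching rule, since the evaluation protocol is defined only in the cited paper.

```python
from collections import defaultdict

# Column names as listed in the dataset description.
INPUT_COLUMNS = ("id_title", "title", "type", "text")
KEYWORD_COLUMNS = ("id_title", "title", "type", "keyword_id", "keyword")


def read_sheet(path):
    """Read the first sheet of an Excel file into a list of row dicts.

    pandas (with openpyxl) is imported lazily, so the helpers below
    can be used on plain lists of dicts without it.
    """
    import pandas as pd
    return pd.read_excel(path).to_dict(orient="records")


def golden_keywords(keyword_rows):
    """Group golden keywords by (id_title, type), ordered by keyword_id."""
    grouped = defaultdict(dict)
    for row in keyword_rows:
        key = (row["id_title"], row["type"])
        grouped[key][row["keyword_id"]] = row["keyword"]
    return {key: [kws[i] for i in sorted(kws)] for key, kws in grouped.items()}


def check_keyword_ids(keyword_rows, expected=20):
    """Return the (id_title, type) groups whose keyword ids are not 1..expected."""
    grouped = defaultdict(list)
    for row in keyword_rows:
        grouped[(row["id_title"], row["type"])].append(row["keyword_id"])
    wanted = list(range(1, expected + 1))
    return [key for key, ids in grouped.items() if sorted(ids) != wanted]


def precision_recall_f1(predicted, golden):
    """Exact-match scores after stripping and lowercasing (assumed rule)."""
    pred = {k.strip().lower() for k in predicted}
    gold = {k.strip().lower() for k in golden}
    hits = len(pred & gold)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1
```

With the files downloaded locally, `check_keyword_ids(read_sheet("movie_keywords.xlsx"))` should return an empty list if every movie and type has keyword ids 1 to 20 as described, and `golden_keywords(...)` gives the reference list to compare against the output of any extraction model.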

Institutions

Universidad de Extremadura

Categories

Cinema, Natural Language Processing, Information Extraction

Funding

Agencia Estatal de Investigación

PID2022-137275NA-I00

Agencia Estatal de Investigación

PID2021-122209OB-C32

Government of Extremadura

IDA3-19-0001-3, GR21017, and GR21057

European Commission

IDA3-19-0001-3, GR21017, and GR21057
