Dataset of sentence pairs from renowned authors in North American literature

Published: 23 September 2021 | Version 1 | DOI: 10.17632/tg6pxsnxr5.1
Harã Heique dos Santos


This dataset contains pairs of sentences taken from 35 literary works by three renowned authors of North American literature: William Cuthbert Faulkner, Ernest Miller Hemingway and Philip Milton Roth. There are three versions of the dataset: 1 - Pairs of pre-processed sentences, with punctuation and non-alphanumeric characters removed and words normalized to lowercase; 2 - Pairs of pre-processed sentences with stopwords also removed; 3 - Pairs of pre-processed sentences with stopwords removed and words lemmatized. The datasets were inspired by a Kaggle challenge on identifying duplicate sentence pairs. Each dataset contains 72600 sentences by each author: 72000 reserved for training and validation of the Siamese LSTM neural network used, and 600 for testing/prediction.
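
The first two pre-processing variants described above can be sketched roughly as follows. This is a minimal illustration only, not the pipeline used to build the dataset: the function names and the tiny stopword list are hypothetical, and the sketch assumes that characters other than ASCII letters, digits and whitespace are the ones removed. Lemmatization (version 3) is omitted because it needs an external NLP library.

```python
import re

# Hypothetical, deliberately tiny stopword list for illustration only;
# the list actually used for the dataset is not specified.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "was", "he", "it"}


def normalize(sentence: str) -> str:
    """Version 1 style: lowercase, drop punctuation and other symbols."""
    lowered = sentence.lower()
    # Assumption: keep only ASCII letters, digits and whitespace.
    cleaned = re.sub(r"[^a-z0-9\s]", " ", lowered)
    # Collapse the runs of whitespace left behind by the substitution.
    return " ".join(cleaned.split())


def remove_stopwords(sentence: str) -> str:
    """Version 2 style: normalize, then drop words found in STOPWORDS."""
    words = normalize(sentence).split()
    return " ".join(w for w in words if w not in STOPWORDS)


version_1 = normalize("The sun also rises, he said.")
version_2 = remove_stopwords("The sun also rises, he said.")
```

With a real stopword list the second variant would remove considerably more words than this toy set does.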


Steps to reproduce

The downloaded archive contains two folders: one with the three datasets used to train the network and one with the three datasets used for testing. The .csv files for training and validation have seven columns. The first two columns hold the numerical identifiers of the sentences in each pair, the third and fourth columns hold the sentences themselves, and the fifth column holds a label that is either 0 or 1, where 0 indicates that the sentences are by different authors and 1 that they are by the same author. The last two columns contain the names of the authors of the sentences in columns 3 and 4, respectively. The .csv files for testing have five columns. The first two columns hold the sentences, and the third and fourth columns hold the authors of those sentences, respectively. The last column holds a 0/1 label with the same meaning as in the training and validation datasets.
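
As a rough sketch of how the two column layouts described above could be read, the following uses only Python's standard csv module and addresses columns by position. The field names, the function names and the sample row are hypothetical, and the presence of a header row is an assumption that should be checked against the actual files.

```python
import csv
import io
from typing import NamedTuple


class Pair(NamedTuple):
    # Field names are illustrative; the real column headers are not known.
    sentence_a: str
    sentence_b: str
    author_a: str
    author_b: str
    same_author: int  # 1 = same author, 0 = different authors


def read_training_rows(handle, has_header=True):
    """Seven columns: id, id, sentence, sentence, label, author, author."""
    reader = csv.reader(handle)
    if has_header:  # assumption: first row is a header
        next(reader, None)
    return [Pair(r[2], r[3], r[5], r[6], int(r[4])) for r in reader]


def read_test_rows(handle, has_header=True):
    """Five columns: sentence, sentence, author, author, label."""
    reader = csv.reader(handle)
    if has_header:  # assumption: first row is a header
        next(reader, None)
    return [Pair(r[0], r[1], r[2], r[3], int(r[4])) for r in reader]


# Made-up in-memory row standing in for one of the training files.
sample = io.StringIO(
    "id1,id2,s1,s2,label,a1,a2\n"
    "1,2,first sentence,second sentence,0,Faulkner,Hemingway\n"
)
pairs = read_training_rows(sample)
```

For a real file, the same functions could be given a handle from `open(path, newline="", encoding="utf-8")`, where the path and the encoding are again assumptions to verify.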


Instituto Federal de Educacao Ciencia e Tecnologia do Espirito Santo


Natural Language Processing, Similarity Measure, Recurrent Neural Network, Deep Learning, North American Literature