Semantic Coherence Dataset - SCD

Published: 16 September 2022 | Version 1 | DOI: 10.17632/s4dtmfmzxw.1


Textual data are central to assessing metrics built on top of language models. The dataset contains speech transcripts arranged into two main classes, intended for experiments on intra-subject semantic coherence and on inter-subject semantic coherence. The transcripts were extracted from talks totalling almost 13 hours (12:45:17) for the former class and almost 30 hours (29:47:34) for the latter. The data were used to investigate whether the perplexity metric provides reliable results in both the within-subject and the across-subject condition. Perplexity is a measure originally conceived to assess the probabilistic inference properties of language models; it has recently been shown to be an appropriate device for distinguishing speech transcripts of healthy subjects from those of subjects suffering from Alzheimer's disease. Beyond this reliability study, the data can be reused to analyze other measures that rely on probabilistic models and that aim to capture the linguistic features of text documents.
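For readers unfamiliar with the metric, the sketch below shows the standard textbook definition of perplexity: the exponential of the average negative log-probability that a language model assigns to each token. It is only an illustration. The description above does not specify which language model or tokenization was used with this dataset, and the function name and the toy probabilities are assumptions made for the example.

```python
import math


def perplexity(token_log_probs):
    """Perplexity of a token sequence.

    token_log_probs: natural-log probabilities that some language model
    assigned to each token of a transcript (hypothetical input; obtaining
    them from a real model is outside the scope of this sketch).
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    # Average negative log-likelihood per token.
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    # Perplexity is the exponential of that average.
    return math.exp(avg_nll)


# Toy inputs (assumed values, not taken from the dataset):
# a model that is uncertain about every token vs. one that is confident.
uncertain = [math.log(0.25)] * 8
confident = [math.log(0.9)] * 8

print(perplexity(uncertain))
print(perplexity(confident))
```

Lower perplexity means the model found the text more predictable. Under this definition, a model that assigns probability 1/k to every token has a perplexity of k.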



Università degli Studi di Torino


Computational Linguistics, Natural Language Processing, Language Modeling, Lexical Semantics