CORHOH: Text Corpus Of Holocaust Oral Histories

Published: 17 February 2025 | Version 3 | DOI: 10.17632/gz7v268252.3
Contributor:
Daban Q Jaff

Description

This paper outlines the compilation and annotation of CORHOH, the Text Corpus of Holocaust Oral Histories. The corpus consists of 500 oral histories from Holocaust survivors, each retrieved from the Let Them Speak Project (Toth 2021). The text has been processed and annotated with metadata describing both the testimony givers and the interviews themselves. All technical content has been removed, and a unique identifier has been assigned to each question (posed by the interviewer) and each answer (provided by the survivor). The corpus complies with the TEI Guidelines (TEI Consortium 2023). The dataset contains 106,519 questions and 107,125 answers, making it a resource for interdisciplinary research; researchers can retrieve and analyse questions and answers separately according to their research objectives. The corpus is particularly suited to studies of trauma expression and of psychological concepts embedded in survivors' narratives. It also lends itself to data mining for patterns (e.g., migration trends) and supports natural language processing techniques such as topic modelling, sentiment analysis, and named entity recognition. The CORHOH data is sourced from the United States Holocaust Memorial Museum (USHMM) and is publicly available under the CC BY-NC-SA 4.0 licence.
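
Because every question and answer carries its own identifier, the two sets can be pulled apart with a few lines of code. The sketch below is illustrative only: it assumes each turn is encoded as a TEI `<u>` element with an `xml:id` attribute and a `who` attribute naming the speaker, which may not match the actual CORHOH markup. The sample document, the `#interviewer` and `#survivor` values, and the `split_turns` helper are all hypothetical.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Hypothetical sample with placeholder text; the real corpus markup may differ.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <u xml:id="q1" who="#interviewer">Example question 1</u>
      <u xml:id="a1" who="#survivor">Example answer 1</u>
      <u xml:id="q2" who="#interviewer">Example question 2</u>
      <u xml:id="a2" who="#survivor">Example answer 2</u>
    </body>
  </text>
</TEI>"""


def split_turns(tei_xml):
    """Return two dicts mapping identifier to text: questions and answers."""
    root = ET.fromstring(tei_xml)
    questions, answers = {}, {}
    for u in root.iter(f"{{{TEI_NS}}}u"):
        uid = u.get(f"{{{XML_NS}}}id")
        text = "".join(u.itertext()).strip()
        # Assumed convention: the interviewer asks, the survivor answers.
        if u.get("who") == "#interviewer":
            questions[uid] = text
        else:
            answers[uid] = text
    return questions, answers


questions, answers = split_turns(SAMPLE)
```

With the real files, the element and attribute names should be checked against the corpus documentation before adapting the helper; the resulting dictionaries can then be passed to whichever topic modelling or named entity recognition tool is in use.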

Institutions

Universität Erfurt

Categories

Linguistics

Funding

German Academic Exchange Service

Licence

CC BY-NC-SA 4.0