Data for: Language Models, Surprisal and Fantasy in Slavic Intercomprehension

Published: 29-08-2018 | Version 1 | DOI: 10.17632/ygsyczp8vr.1
Klara Jagrova,
Andrea Fischer,
Tania Avgustinova,
Irina Stenger


The file webresults_cloze_publication.xlsx contains two types of data for the same stimuli: a) transcripts of think-aloud protocols and b) responses collected in a web-based intercomprehension experiment.

Part a) Three Polish stimulus sentences were presented to pairs of Czech native speakers in an experimental setting in which both participants saw the stimulus sentence on their computer screens. Seated in different rooms, they were asked to communicate over Skype and to work together to produce a good Czech translation of the sentence. The output of the experiment therefore consists of audio recordings of the two participants trying to decode the stimuli and the written translations they entered during the experiment. The transcripts are in sheets 1, 3, and 5 of the .xlsx file.

Part b) Czech readers (n=23) were asked to translate certain words or phrases within the Polish sentences (those that had turned out to be problematic in part a) into Czech in a web-based translation experiment in cloze task design over the website. The responses of part b) and the corresponding sociodemographic data are in sheets 2, 4, and 6 of the .xlsx file. The responses were checked manually for correctness. Responses with typos were counted as correct, since the main interest was whether respondents had understood the stimuli. The column "Total Time Spent (ms)" gives the time respondents spent entering their response into the gaps of the cloze test until they pressed Enter.

The file surprisal_scores_CS_LM.txt contains surprisal scores obtained from a statistical trigram language model with Kneser-Ney smoothing trained on a Czech corpus (the Czech part of InterCorp merged with the Czech part of the Russian National Corpus; size: 175,190 words).
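For readers unfamiliar with the measure, surprisal is the negative log probability of a word given its context; for a trigram model that is -log2 P(w3 | w1, w2). The following is a minimal sketch of that definition only. It uses unsmoothed maximum-likelihood estimates on a toy token list, not the Kneser-Ney smoothed model used for this dataset, and the function names `trigram_counts` and `surprisal_bits` are hypothetical and not part of the published files. The choice of bits (log base 2) as the unit is also an assumption.

```python
import math
from collections import Counter


def trigram_counts(tokens):
    """Count trigrams and the bigram contexts that precede a third word."""
    trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
    contexts = Counter()
    for (w1, w2, _), n in trigrams.items():
        contexts[(w1, w2)] += n
    return trigrams, contexts


def surprisal_bits(w1, w2, w3, trigrams, contexts):
    """Surprisal of w3 after (w1, w2) in bits, from unsmoothed MLE counts.

    Assumption: no smoothing. Unseen events therefore get infinite
    surprisal, whereas a Kneser-Ney smoothed model assigns them a
    finite value.
    """
    context_count = contexts[(w1, w2)]
    trigram_count = trigrams[(w1, w2, w3)]
    if context_count == 0 or trigram_count == 0:
        return math.inf
    return -math.log2(trigram_count / context_count)


# Toy token sequence (placeholder symbols, not data from the corpus).
toy_tokens = ["a", "b", "c", "a", "b", "d", "a", "b", "c"]
toy_trigrams, toy_contexts = trigram_counts(toy_tokens)

# "c" follows ("a", "b") in 2 of 3 cases, "d" in 1 of 3 cases,
# so "d" is the more surprising continuation.
s_c = surprisal_bits("a", "b", "c", toy_trigrams, toy_contexts)
s_d = surprisal_bits("a", "b", "d", toy_trigrams, toy_contexts)
print(round(s_c, 3), round(s_d, 3))
```

Higher values mean the word was less expected in its context. Smoothing matters in practice because, without it, any trigram absent from the training corpus would receive infinite surprisal.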

