Data for: Language Models, Surprisal and Fantasy in Slavic Intercomprehension

Name: Data for: Language Models, Surprisal and Fantasy in Slavic Intercomprehension
Creator: Klara Jagrova
Published: 2018-08-29T13:57:42.256Z
Keywords: Psycholinguistics, Language Modeling, Multilingualism, Contrastive Linguistics, Czech Language, Polish Language, Reading Comprehension

Jagrova, Klara; Fischer, Andrea; Avgustinova, Tania; Stenger, Irina

doi:10.17632/ygsyczp8vr.1


Published: 29 August 2018 | Version 1 | DOI: 10.17632/ygsyczp8vr.1

Contributors:

Klara Jagrova, Andrea Fischer, Tania Avgustinova, Irina Stenger

Description

The file webresults_cloze_publication.xlsx contains two types of data for the same stimuli: a) transcripts of think-aloud protocols and b) responses collected in a web-based intercomprehension experiment.

Part a) Three Polish stimulus sentences were presented to pairs of Czech native speakers in an experimental setting where both participants saw the stimulus sentence on their computer screens. Placed in different rooms, they were asked to communicate over Skype and work together to come up with a good Czech translation of the sentence. Hence, the experiment output consists of audio recordings of the two participants trying to decode the stimuli and the written translations they entered during the experiment. The transcripts are in sheets 1, 3, and 5 of the .xlsx file.

Part b) Czech readers (n=23) were asked to translate certain words or phrases within Polish sentences (those that turned out to be problematic in part a) into Czech in a web-based translation experiment with a cloze-task design on the website http://intercomprehension.coli.uni-saarland.de/en/. The responses of part b) and the corresponding sociodemographic data are in sheets 2, 4, and 6 of the .xlsx file. The responses were checked manually for correctness. Responses with typos were counted as correct, because the main interest was to find out whether respondents had understood the stimuli. The column "Total Time Spent (ms)" is the time respondents spent entering their response into the gaps in the cloze test until pressing enter.

The file surprisal_scores_CS_LM.txt contains surprisal scores obtained from a statistical trigram language model with Kneser-Ney smoothing trained on a Czech corpus (Czech part of InterCorp merged with the Czech part of the Russian National Corpus, size: 175,190 words).
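The surprisal scores mentioned in the description follow the usual definition: the negative logarithm of a word's probability given its two preceding words under a trigram model. The sketch below illustrates only that definition. It does not reproduce the authors' model or its Kneser-Ney smoothing, and the function names, the `<s>` padding symbol, the `cond_prob` callable, and the choice of base-2 logarithms (bits) are all assumptions made for illustration; the dataset description does not state the log base or the file layout of surprisal_scores_CS_LM.txt.

```python
import math
from typing import Callable, List, Sequence


def surprisal_bits(probability: float) -> float:
    """Surprisal of an event, -log2(p), in bits (log base is an assumption)."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(probability)


def trigram_surprisals(
    tokens: Sequence[str],
    cond_prob: Callable[[str, str, str], float],
) -> List[float]:
    """Per-token surprisal under a trigram model.

    cond_prob(w, u, v) is a hypothetical callable returning P(w | u, v),
    e.g. a lookup into a smoothed language model. The "<s>" padding
    symbol is likewise an assumption, not taken from the dataset.
    """
    padded = ["<s>", "<s>"] + list(tokens)
    return [
        surprisal_bits(cond_prob(padded[i], padded[i - 2], padded[i - 1]))
        for i in range(2, len(padded))
    ]
```

Under this definition a word with conditional probability 0.5 carries 1 bit of surprisal and a word with probability 0.25 carries 2 bits, so less predictable words receive higher scores.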
