
Computer Speech & Language

ISSN: 0885-2308


Datasets associated with articles published in Computer Speech & Language

7 results
  • Data for: Predicting emotion reactions to news articles in social networks
    Corpus of Spanish news articles annotated with the distribution of emotional reactions in tweet responses. 288 news articles, published from 01-01-2015 to 01-01-2017, were collected from three Mexican newspapers (El Universal, La Jornada and Excelsior). Four annotators carried out the annotation over a three-month period, tagging the emotions expressed in the tweet responses to each news article. The emotions expressed in the tweet responses were counted to determine the distribution of these emotions for each news article.
    • Dataset
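The counting step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name and the emotion labels are hypothetical, and normalizing the counts to proportions is an assumption about what "distribution" means here.

```python
from collections import Counter


def emotion_distribution(tweet_emotions):
    """Turn a list of per-tweet emotion tags into a proportion per emotion.

    `tweet_emotions` holds one tag per tweet response to a single article.
    The tag names used below are placeholders, not the corpus label set.
    """
    counts = Counter(tweet_emotions)
    total = sum(counts.values())
    if total == 0:
        # An article with no tagged responses has no distribution.
        return {}
    return {emotion: n / total for emotion, n in counts.items()}


# Hypothetical tags for the tweet responses to one article.
dist = emotion_distribution(["joy", "anger", "joy", "sadness"])
print(dist)
```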
  • Data for: Learning English-Chinese Bilingual Word Representations from Sentence-Aligned Parallel Corpus
    This file includes three datasets, one for each of our tasks: bilingual dictionary induction, cross-lingual analogy reasoning, and cross-lingual word semantic relatedness. We release them so that the NLP community can explore these problems.
    • Dataset
  • Data for: Exploiting social and local contexts propagation for inducing Chinese microblog-specific sentiment lexicons
    This data set includes the UCI data set (microblogPCU), the Weibo data set (my_weibo_data), and three general sentiment lexicons. The results of our framework include UCI and Weibo sentiment nouns, UCI sentiment features, and Weibo sentiment features.
    • Dataset
  • Data for: Language Models, Surprisal and Fantasy in Slavic Intercomprehension
    The file webresults_cloze_publication.xlsx contains two types of data for the same stimuli: a) transcripts of think-aloud protocols and b) responses collected in a web-based intercomprehension experiment.
    Part a) Three Polish stimulus sentences were presented to pairs of Czech native speakers in an experimental setting where both participants saw the stimulus sentence on their computer screens. Placed in different rooms, they were asked to communicate over Skype and work together to come up with a good Czech translation of the sentence. Hence, the experiment outputs are audio recordings of the two participants trying to decode the stimuli and the written translations they entered during the experiment. The transcripts are in sheets 1, 3, and 5 of the .xlsx file.
    Part b) Czech readers (n=23) were asked to translate certain words or phrases within Polish sentences (those that turned out problematic in part a) into Czech in a web-based translation experiment with a cloze-task design on the website http://intercomprehension.coli.uni-saarland.de/en/. The responses of part b) and the corresponding sociodemographic data are in sheets 2, 4, and 6 of the .xlsx file. The responses were checked manually for correctness. Responses with typos were counted as correct, because the main interest was to find out whether respondents had understood the stimuli. The column "Total Time Spent (ms)" is the time respondents spent entering their response into the gaps in the cloze test before pressing Enter.
    The file surprisal_scores_CS_LM.txt contains surprisal scores obtained from a statistical trigram language model with Kneser-Ney smoothing trained on a Czech corpus (Czech part of InterCorp merged with the Czech part of the Russian National Corpus, size: 175,190 words).
    • Dataset
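For readers unfamiliar with surprisal, the quantity is the negative log probability a language model assigns to a word given its context. The sketch below only shows that conversion. It does not reproduce the trigram model or the Kneser-Ney smoothing, and the use of base-2 logarithms (bits) is an assumption, since the description does not state the unit of the released scores.

```python
import math


def surprisal(probability):
    """Surprisal in bits of an event with the given probability.

    `probability` stands for P(word | two preceding words) as it would be
    returned by a trigram language model; the model itself is not shown.
    """
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(probability)


# A word the model considers likely is less surprising than an unlikely one.
print(surprisal(0.5))
print(surprisal(0.25))
```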
  • The test-retest speech feature data
    This study uses intra-class correlation coefficients to evaluate the test-retest reliability of speech features from commonly used speech tasks in a healthy population. The feature data file covers 40 participants whose native language is Mandarin, including 25 males and 15 females. The Excel file contains the raw speech feature data, where id is the participant's number, session indicates which of the participant's two test sessions the measurement comes from, and the subsequent columns are named after the speech features. From the extracted speech feature data, the intra-class correlation value of each feature is calculated to evaluate its test-retest reliability.
    • Dataset
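One way to compute an intra-class correlation from data laid out like this is sketched below. The description does not say which ICC form the study used, so the choice of ICC(2,1) (two-way random effects, absolute agreement, single measurement) is an assumption, and the feature values are invented for illustration.

```python
def icc_2_1(scores):
    """ICC(2,1) for a table with one row per participant and one column
    per session (two-way random effects, absolute agreement, single
    measurement)."""
    n = len(scores)       # number of participants
    k = len(scores[0])    # number of sessions
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Sums of squares for participants, sessions, and the total.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Invented values of one speech feature: rows are participants,
# columns are the test and retest sessions.
feature = [[4.1, 4.3], [5.0, 4.8], [3.2, 3.5], [6.1, 6.0]]
print(icc_2_1(feature))
```

A value close to 1 indicates that the feature is stable across the two sessions relative to the spread between participants.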
  • Research data supporting “Source Sentence Simplification for Statistical Machine Translation”
    This data set contains subsets of English-German test sets from the Workshop for Machine Translation (WMT) which have been annotated with manual text simplification information on the source side in the form of gap begin and gap end symbols (, ). The data was tokenized and truecased using the processing scripts distributed with the Moses SMT system. The source simplifications were produced by workers recruited on the crowdsourcing platform Crowdflower (https://www.crowdflower.com). We asked workers to simplify a sentence by deleting words and punctuation, while trying to retain the most important information in the shortened sentence. Their performance was controlled using test questions and a second Crowdflower task which asked workers to identify bad simplifications from the first task. The outcomes of the second task were aggregated by combining an agreement score and the average worker trust score for each simplification. We selected randomly from the remaining simplifications with a combined score of at least 0.5.
    • Dataset
  • Computer, Speech and Language - Experiment results for paper "Acoustic Adaptation to Dynamic Background Conditions with Asynchronous Transformations"
    The files in the dataset correspond to results generated for the Computer, Speech and Language article "Acoustic Adaptation to Dynamic Background Conditions with Asynchronous Transformations" (http://dx.doi.org/10.1016/j.csl.2016.06.008). The files in the zip file are of three types:
    - .ctm, which correspond to the output of the automatic speech recognition system; the columns include segment information as well as transcripts of the recognition.
    - .sys, which correspond to the scoring of the automatic speech recognition system and include the overall word error rate as well as the number of insertions, deletions and substitutions of the overall system.
    - .lur, which provide a more detailed decomposition of the word error rate across different tags.
    The naming convention of the files is as follows:
    - TableX-LineY: the recognition and scoring output corresponding to Line Y of Table X in the article.
    - Figure X-BarY: the recognition and scoring output corresponding to Bar Y (counting from the left-hand side) of Figure X in the article.
    All three file types are standard outputs that are recognised by the automatic speech recognition community and can be opened using any text editor.
    • Dataset
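The relation between the word error rate and the error counts reported in the .sys files can be sketched as below. This uses the standard definition of word error rate. It does not parse the .sys format, and the counts in the example are invented.

```python
def word_error_rate(substitutions, deletions, insertions, reference_words):
    """Word error rate: all errors divided by the number of reference words."""
    if reference_words <= 0:
        raise ValueError("reference_words must be positive")
    return (substitutions + deletions + insertions) / reference_words


# Invented counts for illustration; real values come from the scoring output.
wer = word_error_rate(substitutions=3, deletions=1, insertions=1,
                      reference_words=50)
print(f"{100 * wer:.1f}% WER")
```

Because insertions are counted against the reference length, the rate can exceed 100% when the recognizer outputs many extra words.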