Discourse Segment Type vs. Linguistic Features

Published: 24 February 2017 | Version 3 | DOI: 10.17632/4bh33fdx4v.3
Anita de Waard


1. Ten full-text papers in biology were annotated; see 170220_deWaard_Corpus for full references. The papers were selected according to three criteria:
   1.1. Papers related to the Voorhoeve paper (Voorhoeve). (*)
   1.2. Papers regarding neuropharmacology (Neuro). (**)
   1.3. Papers from the Genia corpus (Genia). (***)
2. The papers were obtained by downloading the HTML, converting it into text, and copying the text into an Excel spreadsheet.
3. Each paper was annotated as follows:
   3.1. The first letter of the first author's name was added (column 1).
   3.2. The papers were (manually) split into discourse segments, as described in [2].
   3.3. The section names were added.
   3.4. Segment types were identified, according to the categories defined in [2].
   3.5. Verb tense/modality/voice was annotated, according to the categories defined in [2].
   3.6. Verb class was added from a taxonomy described in [3].
   3.7. Modality features were added according to categories described in [4].
4. The final results with the text enclosed can be found in the file 170220_deWaard_DST_With_Text.
5. The final results with only numerical results, for ease of statistical processing, can be found in the files 170220_deWaard_DST_Codes.
6. The codebook describing the mapping of the numerical results to the values can be found in the file 170220_deWaard_Value_Labels.

[2] de Waard, A. and Pander Maat, H. (2009). Categorizing Epistemic Segment Types in Biology Research Articles. In Proceedings of the Workshop on Linguistic and Psycholinguistic Approaches to Text Structuring (LPTS 2009).
[3] de Waard, A. and Pander Maat, H. (2010). A classification of research verbs to facilitate discourse segment identification in biological texts. In Proceedings of the Interdisciplinary Workshop on Verbs: The identification and representation of verb features. Pisa, Italy.
[4] de Waard, A. and Pander Maat, H. (2012). Knowledge Attribution in Scientific Discourse: A Taxonomy of Types and Overview of Features. In Proceedings of the Workshop on Detecting Structure in Scholarly Discourse (DSSD), ACL 2012.
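Items 5 and 6 describe numeric codes and a codebook that maps them back to values. A minimal sketch of how such codes might be decoded and then cross-tabulated (segment type against a linguistic feature) is given here. The column names, code numbers, and sample rows are hypothetical placeholders and are not taken from 170220_deWaard_DST_Codes or 170220_deWaard_Value_Labels; the actual files should be consulted for the real layout.

```python
import csv
import io
from collections import Counter

# Hypothetical excerpt standing in for the numeric codes file.
# Real column names and code values may differ.
CODES_CSV = """paper,segment_type,verb_tense
V,1,2
V,2,1
V,1,2
L,3,1
"""

# Hypothetical codebook excerpt: column -> {numeric code -> label}.
# The code-to-label assignments here are invented for illustration.
VALUE_LABELS = {
    "segment_type": {"1": "Fact", "2": "Method", "3": "Result"},
    "verb_tense": {"1": "Past", "2": "Present"},
}


def decode_rows(csv_text, labels):
    """Replace numeric codes with labels; leave uncoded columns unchanged."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {col: labels.get(col, {}).get(val, val) for col, val in row.items()}
        for row in reader
    ]


def cross_tabulate(rows, row_key, col_key):
    """Count how often each (row_key, col_key) label pair occurs."""
    return Counter((r[row_key], r[col_key]) for r in rows)


rows = decode_rows(CODES_CSV, VALUE_LABELS)
table = cross_tabulate(rows, "segment_type", "verb_tense")

for (segment, tense), count in sorted(table.items()):
    print(f"{segment:8s} {tense:8s} {count}")
```

Keeping the codebook as a plain dictionary per column means that a code missing from the mapping passes through unchanged rather than raising an error, which makes gaps in the mapping easy to spot in the output.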


Steps to reproduce

NB: these are notes pertaining to the description, not steps to reproduce!

(*) Voorhoeve et al. (2006) was the first paper I annotated. I selected it in 2006 because (a) it was a paper in Cell, an Elsevier journal, and (b) I knew one of the authors (Alex Griekspoor) and thought it would be useful to be able to talk to the authors, if needed. Since I was interested in citations early on, I selected three other papers that either cited the Voorhoeve paper (Valastyan) or were cited by it (Li and Westbrook).

(**) In 2009, I discussed a possible collaboration with ISI in Los Angeles, and decided to annotate a paper that Gully Burns had annotated (Loiseau). I then wanted to annotate two other papers in the same field, to see if there were trends in this cluster vs. the cell biology cluster.

(***) As part of a collaboration with EBI in Cambridge and the National Centre for Text Mining in Manchester, the three of us annotated the same three papers, which led to a comparison and a publication [1].

[1] Liakata, M., Thompson, P., de Waard, A., Nawaz, R. and Ananiadou, S. (2012). A Three-Way Model of Scientific Discourse Annotation for Enhanced Knowledge Extraction. In Proceedings of the Workshop on Detecting Structure in Scholarly Discourse (DSSD), ACL 2012, Jeju Island, Korea, pp. 37-46.


Scholarly Communication, Discourse Processing