CaRS-50 Dataset: Annotated corpus of rhetorical Moves and Steps in 50 article introductions

Name: CaRS-50 Dataset: Annotated corpus of rhetorical Moves and Steps in 50 article introductions
Creator: Charles Lam
Published: 2025-07-07T17:58:26.622Z
Keywords: Linguistics, Artificial Intelligence, Computational Linguistics, Data Science, Natural Language Processing, Machine Learning, Few-Shot Learning

Lam, Charles; Nnamoko, Nonso

doi:10.17632/kwr9s5c4nk.1

Published: 7 July 2025 | Version 1 | DOI: 10.17632/kwr9s5c4nk.1

Contributors: Charles Lam, Nonso Nnamoko

Description

This dataset contains 50 XML files, each representing the introduction section of a research article annotated according to the CaRS (Create a Research Space) model by Swales (1990). The articles were sourced from BioRxiv and evenly sampled across five categories: Animal Behavior & Cognition, Biochemistry, Biophysics, Ecology, and Physiology (10 articles per category).

========== Annotation & Reliability ============

Each sentence in the introductions has been manually annotated with both a Move (coded 1–3) and a Step (coded a–d), capturing the rhetorical function of the sentence within the structure of the introduction. For instance, “Move 2 Step b” is coded as 2b.

The main annotator (CL) is a senior lecturer in English for Academic Purposes (EAP) with extensive experience in teaching academic writing and rhetorical structure. To ensure high-quality and consistent labeling aligned with the CaRS framework, we first carried out an inter-rater agreement analysis on a small sample of 38 sentences (approximately 3% of the sentences in the dataset). This involved CL and a second annotator (NN), a lecturer in computer science who was purposefully selected to bring an outsider perspective and to mitigate potential bias arising from prior linguistic knowledge. The overall observed agreement was moderate (kappa = 0.426, alpha = 0.424), indicating reasonable inter-rater reliability given annotator NN's limited expertise and supporting the trustworthiness of the dataset. The rest of the dataset was then annotated by CL.

The annotated dataset is stored in XML with a consistent hierarchical structure, using the following elements:

<biology_intro>: Root element encapsulating the entire annotated introduction.
<fulltextID>: Unique identifier.
<title>: Paper title.
<authors>: Author list.
<doi>: DOI link to original article.
<source>: Repository (e.g., biorxiv).
<category>: Disciplinary category of the paper.
<fulltext>: Container for the annotated introduction text.
<paragraph>: Paragraphs.
<sentence>: Sentences in a paragraph.
<sentenceID>: A unique identifier for the sentence (e.g., t001s0001).
<text>: Raw text of the sentence.
<step>: The annotation label, e.g., 1b, 2a, 3b, where the digit is the Move and the letter is the Step.

========== How to cite the dataset ============

Charles Lam and Nonso Nnamoko. 2024. Quantitative metrics to the CARS model in academic discourse in biology introductions. In Proceedings of the 5th Workshop on Computational Approaches to Discourse (CODI 2024), pages 71–77, St. Julians, Malta. Association for Computational Linguistics.

Omotola O, Nnamoko N et al. 2025. Automatic detection of the CaRS framework in scholarly writing using Natural Language Processing. Electronics.
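========== Example: reading an annotated file ============

The XML elements listed in the description can be read with Python's standard library. The following is a minimal sketch, not an official loader. It assumes that <sentenceID>, <text> and <step> are child elements of each <sentence>, which the description implies but does not state. The sample document, its values, and the helper names split_label and read_sentences are hypothetical and serve only as illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample built from the element names in the description.
# The nesting and every value below are assumptions for illustration only.
SAMPLE_XML = """<biology_intro>
  <fulltextID>t001</fulltextID>
  <title>Placeholder title</title>
  <authors>Placeholder Author</authors>
  <doi>placeholder-doi</doi>
  <source>biorxiv</source>
  <category>Ecology</category>
  <fulltext>
    <paragraph>
      <sentence>
        <sentenceID>t001s0001</sentenceID>
        <text>Placeholder sentence one.</text>
        <step>1b</step>
      </sentence>
      <sentence>
        <sentenceID>t001s0002</sentenceID>
        <text>Placeholder sentence two.</text>
        <step>2a</step>
      </sentence>
    </paragraph>
    <paragraph>
      <sentence>
        <sentenceID>t001s0003</sentenceID>
        <text>Placeholder sentence three.</text>
        <step>3b</step>
      </sentence>
    </paragraph>
  </fulltext>
</biology_intro>"""


def split_label(label):
    """Split a label such as '2b' into (2, 'b'); return (None, None) if malformed."""
    label = (label or "").strip()
    if not label or not label[0].isdigit():
        return None, None
    return int(label[0]), label[1:]


def read_sentences(root):
    """Return one dict per <sentence> element found anywhere under root."""
    rows = []
    for sent in root.iter("sentence"):
        move, step = split_label(sent.findtext("step", default=""))
        rows.append({
            "sentence_id": sent.findtext("sentenceID", default=""),
            "text": sent.findtext("text", default=""),
            "move": move,
            "step": step,
        })
    return rows


# For a downloaded file, use ET.parse(path).getroot() with your own local path.
root = ET.fromstring(SAMPLE_XML)
rows = read_sentences(root)

# Count how many sentences fall under each Move.
move_counts = Counter(row["move"] for row in rows)
print(move_counts)
```

Using root.iter("sentence") rather than a fixed path keeps the sketch tolerant of small differences in nesting, at the cost of not validating the hierarchy.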

Institutions

Edge Hill University
University of Leeds
