Chinese EFL Learners' Writing Evaluation by ChatGPT

Published: 18 April 2023 | Version 1 | DOI: 10.17632/8fbzsg82p9.1


The data mainly provide ChatGPT's ratings of 82 Chinese EFL learners' writings, with scores and comments, together with scores from reliable manual rating. With these data, researchers can conduct quantitative or qualitative research on the reliability of EFL writing evaluation with ChatGPT, taking the manual ratings as a reference. The data set includes two parts: 1) ChatGPT's ratings with scores and comments, and 2) statistics on the overall, average, and specific scores of the manual and ChatGPT ratings.

1. EFL Writings with ChatGPT's Rating

There are 270 EFL expository compositions in the Spoken and Written English Corpus of Chinese Learners (Version 2.0) (Wen et al., 2008), written by 270 Chinese EFL learners within a time limit of 30 minutes. Their IDs run from "WEXP0001" to "WEXP0270". Eighty-two compositions were randomly sampled from the 270. The sample size was determined with the power analysis software G*Power (Faul et al., 2009; Faul et al., 2007). A set of 82 random numbers out of 270 was generated with the Random Numbers Generator. ChatGPT's ratings were generated by asking ChatGPT to rate the 82 EFL writings one by one. The next day, ChatGPT rated the same 82 writings again with the same prompts to obtain a second set of scores.

2. Scores of Manual and ChatGPT's Rating

The spreadsheet provides not only ChatGPT's ratings of the EFL compositions, with overall and specific scores, but also the corresponding manual rating scores. For the manual rating, three experienced raters scored each composition on language (40%), content (30%), and organization (30%), and the total score was the sum of the three parts. The three raters' total scores and their scores on each aspect were then averaged. Inter-rater reliability was analyzed for every pair of raters. The results showed significant (p < 0.01) and high inter-rater reliability for every pair, with coefficients ranging from 0.710 to 0.785.
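The sampling and manual-rating steps described above could be reproduced along the following lines. This is a minimal sketch, not the procedure actually used for the data set: the function names, the fixed seed, and the choice of Pearson correlation as the reliability coefficient are all assumptions, since the description does not name the coefficient or the exact tool settings.

```python
import itertools
import math
import random


def sample_ids(population=270, k=82, seed=0):
    """Draw k distinct composition IDs (e.g. 'WEXP0001') without replacement."""
    rng = random.Random(seed)  # assumed seed, only to make the draw repeatable
    numbers = sorted(rng.sample(range(1, population + 1), k))
    return [f"WEXP{n:04d}" for n in numbers]


def total_score(language, content, organization):
    """Total score as the sum of the three aspect scores.

    Assumes each aspect score is already on its own scale
    (language 40%, content 30%, organization 30%).
    """
    return language + content + organization


def average_scores(scores_by_rater):
    """Average the raters' scores composition by composition."""
    return [sum(col) / len(col) for col in zip(*scores_by_rater.values())]


def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def pairwise_reliability(scores_by_rater):
    """Correlate every pair of raters; returns {(rater_a, rater_b): r}."""
    return {
        (a, b): pearson(scores_by_rater[a], scores_by_rater[b])
        for a, b in itertools.combinations(sorted(scores_by_rater), 2)
    }
```

With three raters, `pairwise_reliability` returns three coefficients, one per rater pair, which is the shape of the result reported above. The same `pearson` helper could also be used to compare the averaged manual scores with either set of ChatGPT scores.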
References

Faul, F., Erdfelder, E., Buchner, A., & Lang, A.-G. (2009). Statistical power analyses using G*Power 3.1: Tests for correlation and regression analyses. Behavior Research Methods, 41(4), 1149-1160.

Faul, F., Erdfelder, E., Lang, A.-G., & Buchner, A. (2007). G*Power 3: A flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behavior Research Methods, 39(2), 175-191.

Wen, Q., Wang, L., & Liang, M. (2008). Spoken and Written English Corpus of Chinese Learners (Version 2.0). Foreign Language Teaching and Research Press.



Universiti Putra Malaysia


Artificial Intelligence, Writing