School Holiday Essay Corpus

Name: School Holiday Essay Corpus
Creator: Muhamad Fadzllah Zaini
Published: 2026-03-12T11:13:44.880Z
Keywords: Linguistics, Statistics, Applied Linguistics, Corpus Linguistics, Malay Language, Vocabulary, Socioeconomic Studies

Zaini, Muhamad Fadzllah; Wan Halim, Wan Athirah Adilah; Muhammad, Mazura Mastura; Mohd Rusli, Nur Farahkhanna; Md. Tahir, Mohd Haniff; Ahmad Sani, Nurshafawati; Ismail, Habibah; Zakaria, Suriati; Md. Yusoff, Md. Zahril Nizam; Jamaluddin, Norliza; Janan, Dahlia

doi:10.17632/pybsmy8vfd.1

Published: 12 March 2026 | Version 1 | DOI: 10.17632/pybsmy8vfd.1

Contributors:

Muhamad Fadzllah Zaini

Description

The School Holiday Essay Corpus is a collection of linguistic data developed to study language use in students' writing about their personal experiences during school holidays. The school holiday theme was chosen because it is close to students' daily lives and lets them write narratively about activities they actually took part in. In language education, the topic is often used in writing exercises because it gives students room to express experiences, emotions, and social interactions in a more spontaneous and authentic way.

The corpus was developed as part of a student language data collection project and comprises 670 student essays totaling 177,027 word tokens (about 177K tokens). The data were collected from students in six geographical zones in Malaysia: North, South, East, West, Sabah, and Sarawak. This zoning is intended to give a more balanced geographical representation and to let researchers examine variation in language use among students from different socioeconomic backgrounds and educational environments. The corpus also covers three levels of education: Primary School (around 12 years old), Vocational College (around 16 years old), and Pre-University (around 18 years old). This design allows language development to be analyzed across age and educational levels. With this structure, the corpus not only gives an overview of students' language use at a given level but also supports broader comparative linguistic studies.

Within corpus linguistics, a corpus of student writing makes it possible to examine aspects of language such as lexical diversity, word frequency, spelling errors, sentence formation, and the use of metaphors and emotional expressions. Texts on the school holiday theme often contain narratives of family activities, travel, community activities, and personal experiences, so they provide a rich linguistic context for analyzing more natural language forms than formal texts offer.

The corpus also contributes to the development of authentic Malay language data sources for corpus linguistics and language education. Compared with corpora of formal texts such as newspapers or literary works, a student essay corpus shows more natural language use and reflects the reality of language literacy in an educational context. It therefore has the potential to be an important source for studies of language development, linguistic variation, and language errors, and for the development of Malay language teaching and learning materials.
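
As an illustration of the kinds of measures mentioned above (word frequency and lexical diversity), the following is a minimal Python sketch that computes token count, type count, type-token ratio, and the most frequent words for one text. It is not the method used by the corpus creators. The function names `tokenize` and `lexical_profile`, the simple letters-and-hyphens tokenization rule, and the sample sentence are all assumptions made for this example; the sample sentence is invented and does not come from the corpus.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumed rule: lowercase, keep runs of ASCII letters, and allow internal
    # hyphens so reduplicated forms such as "jalan-jalan" stay one token.
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())


def lexical_profile(text, top_n=3):
    # Summarize one text: token count, type count, type-token ratio (TTR),
    # and the top_n most frequent word forms.
    tokens = tokenize(text)
    counts = Counter(tokens)
    ttr = len(counts) / len(tokens) if tokens else 0.0
    return {
        "tokens": len(tokens),
        "types": len(counts),
        "ttr": ttr,
        "top": counts.most_common(top_n),
    }


# Invented sample sentence, not taken from the corpus.
sample = "Saya pergi ke pantai. Saya suka pantai itu."
profile = lexical_profile(sample)
print(profile)
```

Type-token ratio is sensitive to text length, so comparisons across education levels would likely need essays of similar length or a length-normalized measure.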

Steps to reproduce

Phase 1: Pre-data collection

Step 1: Applying for permission to conduct the study at the school
The researcher first submits a formal application for approval to conduct the research at the school. This is the main entry point before any interaction with the participants can be made.

Step 2: Negotiations between the school and Sultan Idris Education University (UPSI)
After the application is made, negotiations are held between the school and UPSI to coordinate administrative matters, participant access, instrument implementation and study logistics requirements.

Step 3: Consent form agreement
Before data is collected, formal consent is obtained through a consent form. This step is important because the study involves a vulnerable group, so compliance with educational ethics and human research ethics is the core of this protocol.

Phase 2: During data collection

Step 4: Briefing to teachers and distribution of instruments
Teachers are briefed on the study procedures, their roles, and how the instruments are used. At this stage, the study instruments are also distributed to ensure that the data collection is uniform.

Step 5: Implementation of written and oral data collection
Language data is then collected in two main forms, namely written data and spoken/oral data. This is the core step in building a corpus or database of student language.

Phase 3: Post-data collection

Step 6: Processing the raw data that has been collected
All materials obtained from the field are organized and processed as raw data. This includes initial review, file organization and preparation of data for further analysis.

Step 7: Data transcription
For oral data, recordings are transcribed into text form. This step is important so that the spoken data can be analyzed together with the written data in a more uniform format.

Step 8: Coding the file
After processing and transcription, the data is systematically coded. This coding usually involves labeling the data identity, participant categories or certain variables so that the file is easy to track and analyze.

Step 9: Data validation
The final step is data verification or validation to ensure the accuracy, consistency and reliability of the data that has been constructed before being used in subsequent analyses.
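
The steps above do not specify the coding scheme used for the files, so the following Python sketch only illustrates what a file code built from participant categories could look like. The zone and education-level names come from the description; the abbreviations, the `ZONE-LEVEL-NNNN` layout, and the functions `make_file_code` and `parse_file_code` are hypothetical and are not the dataset's actual convention.

```python
# Hypothetical abbreviations; the dataset's real coding scheme is not documented here.
ZONE_CODES = {
    "North": "N", "South": "S", "East": "E",
    "West": "W", "Sabah": "SB", "Sarawak": "SW",
}
LEVEL_CODES = {
    "Primary School": "PS",
    "Vocational College": "VC",
    "Pre-University": "PU",
}


def make_file_code(zone, level, seq):
    """Build an identifier from zone, education level and a running number."""
    if zone not in ZONE_CODES:
        raise ValueError(f"unknown zone: {zone!r}")
    if level not in LEVEL_CODES:
        raise ValueError(f"unknown level: {level!r}")
    if seq < 1:
        raise ValueError("seq must be a positive integer")
    return f"{ZONE_CODES[zone]}-{LEVEL_CODES[level]}-{seq:04d}"


def parse_file_code(code):
    """Recover (zone, level, seq) from an identifier; raise ValueError if malformed."""
    zone_by_code = {abbr: name for name, abbr in ZONE_CODES.items()}
    level_by_code = {abbr: name for name, abbr in LEVEL_CODES.items()}
    parts = code.split("-")
    if len(parts) != 3:
        raise ValueError(f"malformed code: {code!r}")
    zone_code, level_code, seq_text = parts
    if (zone_code not in zone_by_code
            or level_code not in level_by_code
            or not seq_text.isdigit()):
        raise ValueError(f"malformed code: {code!r}")
    return zone_by_code[zone_code], level_by_code[level_code], int(seq_text)


code = make_file_code("Sabah", "Primary School", 12)
print(code)                   # SB-PS-0012
print(parse_file_code(code))  # ('Sabah', 'Primary School', 12)
```

Parsing every file name back into its categories is one simple consistency check that could support the validation described in Step 9, since a malformed or unknown code raises an error instead of passing silently.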
