Veterinary Medicine Corpus (VMC)

Published: 27 December 2023 | Version 1 | DOI: 10.17632/jwvhbb7hsj.1
Contributors:
Mustafa Özer, Erdem Akbas

Description

The Veterinary Medicine Corpus (VMC; Özer and Akbaş, 2023) is a machine-generated collection of high-quality, open-access Research Articles (RAs) published between 2010 and 2022. It was compiled in March 2022 using AntCorGen, a tool that retrieves complete articles through the Application Programming Interface (API) of the PLOS website and organizes them. The articles in the VMC are structurally consistent, each featuring five primary sections: abstract, introduction, materials and methodology, results and discussion, and conclusion. The corpus contains no metadata.

Initially, the system produced a corpus of 1,488 veterinary medicine articles, totaling 8,000,000 words. Following the procedures outlined in Özer & Akbaş (2024), "Assembling a justified list of academic words in veterinary medicine: The veterinary medicine academic word list (VMAWL)", the corpus was refined to 1,449 articles. These articles were automatically labeled and sorted by their Digital Object Identifiers (DOIs).

The PLOS API uses a machine-aided term-mapping algorithm, based on the PLOS thesaurus (GitHub, 2022), to suggest categories for published papers. This default categorization was not optimal for our purposes: it distributed the articles unevenly across categories, and the suggested categories in Table 2 were machine-generated and based on lexis rather than on scientific criteria. To assess the statistical consequences, we conducted a mock analysis using LancsBox, which revealed high dispersion values (as indicated by the coefficient of variation, CV) for frequently occurring tokens. This raised concerns about the reliability of a discipline-specific word list developed from the corpus in that form. As a solution, we refined the categorization by seeking expert advice.
Consequently, we established 4 parent categories and 17 child categories to streamline the classification, as listed below:

1. Pre-clinic (n=352)
   - Disease control and prevention/epidemiology
   - Veterinary microbiology
   - Veterinary parasitology
   - Veterinary pathology
   - Veterinary pharmaceutics and pharmacology
   - Veterinary virology
2. Internal medicine (n=426)
   - Veterinarians
   - Veterinary biometry
   - Veterinary diagnostics
   - Veterinary hospitals
   - Veterinary medicine
   - Veterinary surgery
3. Zootechnics (n=121)
   - Livestock care and wildlife sciences
4. Veterinary diseases (n=550)
   - Bacterial diseases
   - Parasitic diseases
   - Viral diseases
   - Other (inherited/genetic diseases/disorders, genetical studies of species regarding diseases)

The final VMC comprises 1,449 articles with 7,962,021 tokens.
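
The dispersion check mentioned in the description relies on the coefficient of variation (CV), that is, the standard deviation of a token's relative frequency across corpus parts divided by its mean. The following is a minimal sketch of that calculation, not the LancsBox implementation: the function names, the whitespace tokenization, the per-10,000-token normalization, and the use of the population standard deviation are all assumptions made for illustration.

```python
from collections import Counter
from statistics import mean, pstdev


def relative_frequencies(token, documents, per=10_000):
    """Return the frequency of `token` per `per` tokens in each document.

    Tokenization is a naive lowercase whitespace split (an assumption;
    corpus tools typically use more careful tokenizers).
    """
    freqs = []
    for doc in documents:
        tokens = doc.lower().split()
        if not tokens:
            freqs.append(0.0)
            continue
        count = Counter(tokens)[token]
        freqs.append(count / len(tokens) * per)
    return freqs


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean.

    Returns NaN when the mean is zero, since the ratio is undefined.
    """
    m = mean(values)
    if m == 0:
        return float("nan")
    return pstdev(values) / m


if __name__ == "__main__":
    # Hypothetical toy "articles", not taken from the VMC.
    docs = [
        "the infection spread in the herd",
        "the vaccine reduced infection rates",
        "surgery outcomes in dogs",
    ]
    freqs = relative_frequencies("infection", docs)
    print(coefficient_of_variation(freqs))
```

A higher CV means the token is concentrated in a few parts of the corpus rather than spread evenly, which is the pattern that prompted the recategorization.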

Steps to reproduce

Please contact erdemakbas@erciyes.edu.tr or mustafa.ozer@agu.edu.tr with any questions about the dataset.

Institutions

Erciyes Universitesi

Categories

Veterinary Medicine, Corpus Linguistics, English Vocabulary, English for Special Purposes, English for Academic Purposes

Licence