Assessing the accuracy of ChatGPT references in head and neck and ENT disciplines

Published: 31 August 2023 | Version 1 | DOI: 10.17632/y6wbt9snv7.1
Andrea Frosolini


Purpose: ChatGPT has gained popularity as a web application since its release in 2022. While the potential of artificial intelligence (AI) systems in scientific writing is widely discussed, their reliability in reviewing literature and providing accurate references remains unexplored. This study examines the reliability of references generated by ChatGPT language models in the Head and Neck field.

Methods: Twenty clinical questions were generated across different Head and Neck disciplines and used to prompt ChatGPT versions 3.5 and 4.0 to produce texts on the assigned topics. The generated references were categorized as “true,” “erroneous,” or “inexistent” based on congruence with existing records in scientific databases.

Results: ChatGPT 4.0 outperformed version 3.5 in terms of reference reliability. However, both versions displayed a tendency to provide erroneous or non-existent references.

Conclusions: It is crucial to address this challenge to maintain the reliability of scientific literature. Journals and institutions should establish strategies and good-practice principles in the evolving landscape of AI-assisted scientific writing.


Steps to reproduce

A panel of researchers (A.F., G.G., S.B., L.A.V.) generated 20 clinical questions in five different Head and Neck disciplines (maxillofacial and oral surgery, head and neck oncology, facial trauma, otology and rare diseases), asking the AI to produce references on the assigned topic. Both ChatGPT versions 3.5 and 4.0 [7] were tested (test date: June 10, 2023) using the prompts reported in the supplementary material. Each article retrieved by the AI was checked for congruence against relevant scientific databases (PubMed, Web of Science, Scopus, OpenGrey and Google Scholar). If a matching citation was found in these databases, the article was categorized as “true”. Articles were categorized as “erroneous” if they only partially matched an existing reference (with incongruent authors, title, year of publication or journal), and as “inexistent” if there was no correspondence at all. Statistical analysis was performed using Jamovi 2.3 (The Jamovi Project 2022, Sydney, Australia). The Kruskal–Wallis and Fisher’s exact tests were used, as appropriate.
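
The categorization rule described above can be illustrated with a minimal Python sketch. It is not the procedure the panel used, since the database checks were performed by the researchers themselves. The `Reference` type, the field-by-field comparison, and the `PARTIAL_MIN` threshold (how many fields must agree before a reference counts as a partial match) are all assumptions introduced for illustration, and the records shown are placeholders rather than entries from the dataset.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Reference:
    """Hypothetical record holding the four fields named in the text."""
    authors: str
    title: str
    year: int
    journal: str


FIELDS = ("authors", "title", "year", "journal")
PARTIAL_MIN = 2  # assumed threshold: fields that must agree for a partial match


def _norm(value):
    # Compare values ignoring case and extra whitespace.
    return " ".join(str(value).lower().split())


def matching_fields(generated, record):
    """Count how many of the four fields agree between two references."""
    return sum(
        _norm(getattr(generated, f)) == _norm(getattr(record, f)) for f in FIELDS
    )


def categorize(generated, database_records):
    """Return 'true', 'erroneous' or 'inexistent' for one generated reference."""
    best = max(
        (matching_fields(generated, r) for r in database_records), default=0
    )
    if best == len(FIELDS):
        return "true"        # a fully matching citation exists
    if best >= PARTIAL_MIN:
        return "erroneous"   # only a partial match with an existing record
    return "inexistent"      # no meaningful correspondence


# Placeholder data, not taken from the study.
records = [Reference("Doe J", "Placeholder title A", 2020, "Journal X")]
generated = [
    Reference("doe j", "Placeholder Title A", 2020, "journal x"),  # all fields agree
    Reference("Roe K", "Placeholder title A", 2019, "Journal X"),  # title and journal agree
    Reference("Roe K", "Unrelated title", 2001, "Journal Y"),      # nothing agrees
]
counts = Counter(categorize(g, records) for g in generated)
print(dict(counts))
```

Per-version tallies of this kind are the sort of counts that a test such as Fisher’s exact test would then compare. In practice, exact string matching is too strict for author lists and journal abbreviations, so a manual check like the one described above remains necessary.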


Università degli Studi di Siena


Medicine, Artificial Intelligence, Otorhinolaryngology, Diagnostic Head and Neck Procedure, ChatGPT