Comparative Evaluation of ChatGPT and Gemini Responses to Vertigo-Related Questions: Accuracy, Information Quality, and Readability

Published: 25 March 2026 | Version 1 | DOI: 10.17632/vd45ckjtxh.1
Contributors:

Description

This study was designed as a cross-sectional methodological analysis to evaluate the accuracy, quality, and readability of responses generated by large language models (ChatGPT and Gemini) to frequently asked questions about vertigo. A total of 50 questions were initially generated from three sources: ChatGPT, Gemini, and Google’s “People also ask” section. After removing duplicates and irrelevant items, 20 representative questions were selected. Each question was entered into both models, and only the first response was recorded, without any follow-up prompts. The responses were evaluated by five blinded experts (two otolaryngologists, two audiologists, and one physiotherapist). Medical accuracy was assessed on a 4-point Likert scale, and information quality was evaluated with the DISCERN instrument; mean scores across the five experts were used for analysis. Readability was assessed using multiple standard indices: Flesch Reading Ease, Flesch–Kincaid Grade Level, Gunning Fog Index, SMOG, Coleman–Liau Index, and the Automated Readability Index. Additional textual features, including word count, sentence count, average sentence length, and percentage of complex words, were also analyzed to characterize linguistic complexity.
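
As an illustration of how the six readability indices above can be computed, the following is a minimal Python sketch assuming the open-source textstat package, which implements all of them; the input string is a placeholder, not an actual model response.

    import textstat

    # Placeholder text; in the study this would be a recorded model response.
    response = "Vertigo is a sensation of spinning or dizziness that can have many causes."

    # The six standard indices reported in the study.
    scores = {
        "Flesch Reading Ease": textstat.flesch_reading_ease(response),
        "Flesch-Kincaid Grade Level": textstat.flesch_kincaid_grade(response),
        "Gunning Fog Index": textstat.gunning_fog(response),
        "SMOG": textstat.smog_index(response),
        "Coleman-Liau Index": textstat.coleman_liau_index(response),
        "Automated Readability Index": textstat.automated_readability_index(response),
    }

    for name, value in scores.items():
        print(f"{name}: {value:.2f}")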

Files

Steps to reproduce

Data were obtained through a structured and reproducible workflow involving large language models (LLMs) and standardized evaluation procedures.

1. Question generation: Frequently asked questions related to vertigo were identified from three sources: ChatGPT (GPT-4), Gemini, and Google’s “People also ask” section. A predefined prompting strategy was applied to both LLMs to elicit patient-oriented questions, while Google queries used the keyword “vertigo” and related symptom-based terms.

2. Screening and selection: All retrieved questions were compiled and managed in Microsoft Excel, where duplicates and non-relevant items were systematically excluded according to predefined criteria, yielding a final set of 20 representative questions.

3. Response collection: Each question was entered into the ChatGPT and Gemini web interfaces under controlled conditions. Only the first response was recorded for each query, without additional prompts or iterative refinement, to ensure standardization. Responses were stored in text format.

4. Expert evaluation: Responses were scored independently by a blinded panel of five domain experts using two validated assessment protocols: (1) a 4-point Likert scale for medical accuracy and (2) the DISCERN instrument for information quality.

5. Readability and textual analysis: Established linguistic indices (e.g., Flesch Reading Ease, SMOG, Gunning Fog) were calculated using readability analysis tools or scripts, and additional textual metrics, including word count, sentence length, and proportion of complex words, were computed as part of the analysis pipeline (a minimal sketch follows below).
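
To illustrate steps 4 and 5, the sketch below shows how mean expert scores and the additional textual metrics could be computed in Python. It assumes the pandas and textstat packages; the file name expert_scores.csv and the column names (accuracy_1..accuracy_5, discern_1..discern_5, response_text) are hypothetical placeholders rather than the authors’ actual schema, and the “three or more syllables” rule for complex words is an assumption following the Gunning Fog convention.

    import re
    import pandas as pd
    import textstat

    # Hypothetical schema: one row per model response, one accuracy (4-point
    # Likert) and one DISCERN total per rater. Not the authors' actual file.
    df = pd.read_csv("expert_scores.csv")

    acc_cols = [f"accuracy_{i}" for i in range(1, 6)]
    dis_cols = [f"discern_{i}" for i in range(1, 6)]
    df["accuracy_mean"] = df[acc_cols].mean(axis=1)  # mean across 5 experts
    df["discern_mean"] = df[dis_cols].mean(axis=1)

    def textual_metrics(text: str) -> dict:
        """Word count, sentence count, average sentence length, % complex words."""
        words = re.findall(r"[A-Za-z']+", text)
        n_sentences = max(textstat.sentence_count(text), 1)
        # Assumption: a word is 'complex' at 3+ syllables (Gunning Fog rule).
        complex_words = [w for w in words if textstat.syllable_count(w) >= 3]
        return {
            "word_count": len(words),
            "sentence_count": n_sentences,
            "avg_sentence_length": len(words) / n_sentences,
            "pct_complex_words": 100 * len(complex_words) / max(len(words), 1),
        }

    metrics = df["response_text"].apply(textual_metrics).apply(pd.Series)
    df = pd.concat([df, metrics], axis=1)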

Institutions

Categories

Audiology, Artificial Intelligence, Peripheral Vertigo, Ear Nose Throat Infection, Positional Vertigo

Licence