Performance of Large Language Model Artificial Intelligence on Dermatology Board Exam Style Questions
Description
Google BARD performed better than ChatGPT in all question genres (General Dermatology, Dermatopathology, Surgery, Pediatric Dermatology). For both ChatGPT and Google BARD, differences in scores were statistically significant across 'Question Genre' (p<0.05) but not across question 'Type' (p>0.05). Compared to General Dermatology, performance in Dermatopathology was worse for both ChatGPT and Google BARD.
Steps to reproduce
We typed out each question in an Excel file, then copied and pasted each question into ChatGPT and Google BARD to prompt an answer. Each response was assigned a binary value according to its correctness: "1" for correct or "0" for incorrect. Question genre and taxonomy were also recorded for each question for analysis. Statistical analysis was carried out using Jamovi v2.3.21.0 software. Analysis of Variance (ANOVA) was used to determine differences in AI performance across sub-specialties. Omnibus likelihood ratio tests were used to detect differences between correct and incorrect responses with respect to both question genre and taxonomy class. If a difference was detected, binomial logistic regression was used to compare correct and incorrect answers with respect to both question genre and question type (taxonomy).
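The omnibus likelihood ratio step can be illustrated outside Jamovi with a minimal Python sketch. It assumes a binomial logistic regression with a single categorical predictor (question genre), in which case the fitted probabilities equal the per-genre proportions of correct answers, so the test compares that model against an intercept-only model. The function names `chi2_sf` and `omnibus_lr_test`, the record layout, and the example scores are hypothetical and are not taken from the dataset; Jamovi's own output for a model with several predictors may differ.

```python
import math
from collections import defaultdict


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    # Start from the closed form for df = 2 (even) or df = 1 (odd) ...
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))
    # ... then apply the incomplete-gamma recurrence Q(a+1, y) = Q(a, y) + y^a e^-y / Gamma(a+1).
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q


def _loglik(k, m, p):
    """Binomial log-likelihood kernel for k correct out of m at probability p."""
    ll = 0.0
    if k > 0:
        ll += k * math.log(p)
    if m - k > 0:
        ll += (m - k) * math.log(1.0 - p)
    return ll


def omnibus_lr_test(records):
    """records: iterable of (genre, score) pairs, with score 1 (correct) or 0 (incorrect).

    Returns (statistic, df, p_value) for genre as a single categorical predictor.
    """
    groups = defaultdict(lambda: [0, 0])  # genre -> [number correct, number asked]
    for genre, score in records:
        groups[genre][0] += score
        groups[genre][1] += 1
    correct = sum(k for k, _ in groups.values())
    total = sum(m for _, m in groups.values())
    ll_null = _loglik(correct, total, correct / total)            # intercept-only model
    ll_full = sum(_loglik(k, m, k / m) for k, m in groups.values())  # one proportion per genre
    stat = max(0.0, 2.0 * (ll_full - ll_null))
    df = len(groups) - 1
    return stat, df, chi2_sf(stat, df)


# Hypothetical scores for illustration only (not the study's data).
example = (
    [("General Dermatology", 1)] * 18 + [("General Dermatology", 0)] * 2
    + [("Dermatopathology", 1)] * 6 + [("Dermatopathology", 0)] * 14
)
stat, df, p = omnibus_lr_test(example)
print(f"LR statistic = {stat:.2f}, df = {df}, p = {p:.4g}")
```

A p-value below 0.05 here would correspond to the "difference detected" condition that precedes the follow-up logistic regression comparisons.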