Evaluating the Performance of ChatGPT on Dermatology Board-Style Exams: A Meta-Analysis of Text-Based and Image-Based Question Accuracy
Description
1. Supplemental Figure I: A PRISMA diagram illustrating the study selection process. This diagram shows how articles were retrieved from the PubMed and SCOPUS databases using search terms related to ChatGPT, dermatology, and exam-style questions.
2. Supplemental Figure II: This figure presents the overall accuracy of each ChatGPT model (ChatGPT-3, ChatGPT-3.5, and ChatGPT-4) across all questions tested. It reports the number of studies contributing data for each model and provides 95% confidence intervals (a worked sketch of this interval calculation appears after this list).
3. Supplemental Figure III: This figure breaks down the performance of each ChatGPT model by dermatology content category as defined by the American Board of Dermatology (ABD), including dermatopathology, general dermatology, pediatric dermatology, science research, and surgical dermatology, with 95% confidence intervals for each category.
4. Supplemental Figure IV: This figure compares the performance of each ChatGPT model on image-based versus text-based questions. Because ChatGPT-3 and ChatGPT-3.5 lack image recognition capabilities, this comparison primarily reflects the performance of ChatGPT-4 on image-based questions. The figure also provides 95% confidence intervals for each performance category.
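As a minimal illustration of how a 95% confidence interval can be attached to a pooled accuracy figure, the sketch below computes a Wilson score interval for a proportion of correctly answered questions. The pooled counts shown are hypothetical placeholders, not the study's actual data, and the Wilson interval is only one common choice; the analysis in the paper may have used a different interval method.

```python
from math import sqrt

def wilson_ci(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives ~95% coverage)."""
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / total + z**2 / (4 * total**2))
    return max(0.0, center - half), min(1.0, center + half)

# Hypothetical pooled question counts per model (illustrative only).
pooled = {
    "ChatGPT-3": (310, 650),
    "ChatGPT-3.5": (402, 700),
    "ChatGPT-4": (540, 720),
}

for model, (correct, total) in pooled.items():
    lo, hi = wilson_ci(correct, total)
    print(f"{model}: accuracy = {correct / total:.1%} (95% CI {lo:.1%}-{hi:.1%})")
```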