Supplementary Materials for “Fitzpatrick skin type prompting does not improve diagnostic accuracy in multimodal large language models: A within-image experimental study”
Description
This supplementary dataset supports the study “Fitzpatrick skin type prompting does not improve diagnostic accuracy in multimodal large language models: A within-image experimental study.” It includes the study protocol, prompt templates, diagnostic scoring ontology, master analytic dataset, model outputs with associated diagnostic accuracy scores, statistical analysis code, and supplementary analyses evaluating baseline Fitzpatrick skin type accuracy disparities and malignant lethal miss rates by prompt condition. These materials are provided to support reproducibility of the reported analyses and interpretation of model performance across Fitzpatrick skin type prompting conditions.
Steps to reproduce
This study used all 656 biopsy-confirmed clinical photographs from the publicly available Diverse Dermatology Images dataset, each with a dermatologist-assigned Fitzpatrick skin type (FST) group. Images were grouped as FST I–II, III–IV, or V–VI. Each image was evaluated under four prompt conditions: image-only, correct FST, and two incorrect FST prompts corresponding to the two non-ground-truth FST groups. Incorrect FST prompts were categorized as adjacent mismatch or extreme mismatch; FST III–IV images contributed only adjacent mismatch conditions because both other groups are adjacent to III–IV.
Each image-prompt combination was submitted independently via a web browser interface to ChatGPT 5.2 and Gemini 3.1 Pro using the prespecified prompt templates. A clean “Temporary Chat” session was used for each submission to prevent context carryover. Model outputs were recorded, trimmed of non-diagnostic preamble or caveats when needed, and scored against the biopsy-confirmed ground-truth diagnosis using the prespecified diagnostic scoring ontology. Diagnostic accuracy was scored ordinally from 0 to 3 according to where the ground-truth diagnosis appeared in the model’s top three ranked differential diagnoses: 3 = ranked first, 2 = ranked second, 1 = ranked third, and 0 = absent from the top three. Diagnoses were mapped to predefined canonical categories, with selected subtype families collapsed as specified in the scoring ontology.
Primary analyses used a cumulative link mixed model with image as a random intercept and model identity and prompt condition as fixed effects. Prespecified comparisons were controlled using the Benjamini–Hochberg procedure at a 10% false discovery rate. Supplementary analyses evaluated image-only diagnostic accuracy disparities across FST groups and malignant lethal miss rates by prompt condition.
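The condition labels and the 0 to 3 scoring rule described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the study's analysis code: the function names, the ASCII group labels, and the assumption that diagnoses have already been mapped to canonical category strings are all hypothetical; the authoritative rules are in the scoring ontology included in this dataset.

```python
# Illustrative sketch only; names and label spellings are assumptions,
# not taken from the study's code.

# FST groups in order, so that index distance reflects adjacency.
FST_GROUPS = ["I-II", "III-IV", "V-VI"]


def prompt_condition(true_group, prompted_group):
    """Label a prompt relative to the image's ground-truth FST group.

    prompted_group is None for the image-only condition.
    """
    if prompted_group is None:
        return "image_only"
    distance = abs(FST_GROUPS.index(true_group) - FST_GROUPS.index(prompted_group))
    # Distance 0 = correct, 1 = adjacent mismatch, 2 = extreme mismatch.
    return {0: "correct", 1: "adjacent_mismatch", 2: "extreme_mismatch"}[distance]


def ordinal_score(truth, ranked_differential):
    """Score 3/2/1 if truth is ranked first/second/third, else 0.

    Assumes truth and every entry of ranked_differential are already
    canonical category strings.
    """
    top_three = ranked_differential[:3]
    if truth in top_three:
        return 3 - top_three.index(truth)
    return 0
```

For example, under these assumptions `prompt_condition("III-IV", "V-VI")` is labeled an adjacent mismatch, while `prompt_condition("I-II", "V-VI")` is an extreme mismatch, which is why FST III–IV images never yield an extreme mismatch condition.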
Full prompt templates, scoring rules, the study/IRB protocol, model outputs, supplementary analysis tables, and the completed CLEAR Derm checklist for image-based AI algorithm development in dermatology are included in this uploaded dataset to document the workflow and support reproducibility.
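For readers re-deriving the multiplicity control named in the steps above, the Benjamini-Hochberg step-up rule at a 10% false discovery rate can be sketched as follows. This is a minimal, hedged illustration in plain Python: the function name is hypothetical, the p-values in the usage note are made up for demonstration and are not study results, and the study's own statistical analysis code in this dataset remains the reference implementation.

```python
# Minimal sketch of the Benjamini-Hochberg step-up procedure.
# Function name and defaults are assumptions for illustration.

def benjamini_hochberg(p_values, fdr=0.10):
    """Return a list of booleans: True where the hypothesis is rejected."""
    m = len(p_values)
    # Indices of p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k with p_(k) <= (k / m) * fdr.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            max_k = rank

    # Reject every hypothesis whose rank is at most max_k.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_k:
            rejected[idx] = True
    return rejected
```

With the made-up inputs `[0.01, 0.04, 0.03, 0.20]`, the three smallest p-values fall under their rank-specific thresholds (0.025, 0.05, 0.075) and are rejected, while 0.20 exceeds 0.10 and is not.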
Institutions
- Indiana University School of Medicine, Indianapolis, Indiana