neuroGPT-X: Towards an Accountable Expert Opinion Tool for Vestibular Schwannoma
Description
We hypothesize that a well-trained, context-enriched GPT will perform at or above the level of an expert surgeon in generating comprehensive answers to questions commonly posed in day-to-day practice regarding vestibular schwannoma. In this study, we make three key contributions to assessing the feasibility of large language models (LLMs) as a clinical decision-making adjunct.
1. We develop a framework to enrich GPT with context relevant to vestibular schwannoma.
2. We compare the performance of ChatGPT (Jan. 30, 2023 model) and a context-enriched GPT model against leading neurosurgical experts worldwide, evaluating the ability of LLMs to assist in clinical decision-making.
3. We introduce a proof-of-concept clinical decision-making tool, neuroGPT-X, which incorporates working memory, sources with each answer, and a web-based chat platform to address challenges in using LLMs in a clinical setting, including interpretability, reliability, accountability, and safety.
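The context-enrichment idea described above (retrieve relevant articles, then pass them to the model together with the question so that sources can be shown with each answer) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code shipped in this dataset: the toy three-dimensional embeddings, the passage texts, the source labels, and the helper names `cosine_similarity`, `top_k_passages`, and `build_prompt` are all hypothetical, and a real pipeline would obtain embeddings from an embedding model instead of hard-coding them.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_passages(question_vec, passages, k=2):
    """Return the k passages whose embeddings are closest to the question."""
    ranked = sorted(
        passages,
        key=lambda p: cosine_similarity(question_vec, p["embedding"]),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(question, retrieved):
    """Assemble a prompt that carries numbered context passages and their sources."""
    context = "\n".join(
        f"[{i + 1}] {p['text']} (source: {p['source']})"
        for i, p in enumerate(retrieved)
    )
    return (
        "Answer using only the context below and cite sources by number.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Hypothetical toy data: placeholder texts and hand-written embeddings.
passages = [
    {"source": "pubmed-A", "text": "Placeholder abstract A.", "embedding": [0.9, 0.1, 0.0]},
    {"source": "wikipedia-B", "text": "Placeholder article B.", "embedding": [0.1, 0.9, 0.0]},
    {"source": "pubmed-C", "text": "Placeholder abstract C.", "embedding": [0.7, 0.3, 0.1]},
]
question_vec = [1.0, 0.0, 0.0]  # stand-in for the embedded question

retrieved = top_k_passages(question_vec, passages, k=2)
prompt = build_prompt("Placeholder clinical question?", retrieved)
print(prompt)
```

Numbering the retrieved passages in the prompt is one simple way to let an answer point back to its sources; the actual tool may handle source attribution differently.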
Steps to reproduce
The dataset includes (1) data acquisition to build a database of relevant PubMed and Wikipedia articles, (2) data processing and embeddings for semantic searching of relevant articles, (3) survey responses and evaluations for questions posed to ChatGPT, context-enriched GPT, and 4 expert neurosurgeons, (4) data analysis from the survey results, and (5) code to build a chatbot interface (neuroGPT-X) using Flask, a Python micro web framework.

The directory structure of the data and a description of important files within the "Final Data" directory are as follows:

code_noapi
  - abstracts: code and data for PubMed abstracts and Wikipedia articles pulled using web scraping
  - flaskapp: code to create the neuroGPT-X website
  - processing: code and data for building a dataset (vs_scrape.ipynb, embedding_model_final_NEW.ipynb), dataset thematic analysis (clustering.ipynb), creating embeddings (embedding_model_final_NEW.ipynb), and answering questions (embedding_model_final_NEW.ipynb)
evaluation_analysis: evaluation results from 3 neurosurgeon judges and code for analysis
  - complete_imputed.csv: imputed values using the mode for judge 2
  - complete_noimpute.csv: raw data combining all 3 judge evaluations
  - impute.ipynb: Python notebook that computes imputed values using the mode for judge 2
  - agreement_analysis.ipynb: Python notebook that computes various metrics for inter-rater agreement
  - updated_agreement.ipynb: Python notebook that computes Krippendorff's alpha and Fleiss' kappa for inter-rater agreement
  - unblinded.csv: unblinded affective survey results
figures: image files for all figures in the paper "neuroGPT-X: Towards an Accountable Expert Opinion Tool for Vestibular Schwannoma"
neuro_website_output: downloaded website that shows an example of a Q&A conversation between neuroGPT-X and a human
neurosurgeon_responses: answers from 4 expert neurosurgeons to 15 questions curated by a neurosurgeon
timing_analysis: code and data for how quickly the expert neurosurgeons, ChatGPT, and context-enriched GPT take to answer questions
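
The imputation and agreement steps listed under evaluation_analysis can be illustrated with a small standard-library sketch. It is not the contents of impute.ipynb or updated_agreement.ipynb: the ratings below are made-up toy values, the function names `impute_with_mode` and `fleiss_kappa` are hypothetical, and it assumes that "the mode for judge 2" means the most frequent rating among judge 2's own observed ratings, which the description does not state explicitly.

```python
from collections import Counter

def impute_with_mode(ratings):
    """Replace missing ratings (None) with the most frequent observed rating."""
    observed = [r for r in ratings if r is not None]
    mode_value = Counter(observed).most_common(1)[0][0]
    return [mode_value if r is None else r for r in ratings]

def fleiss_kappa(rows, categories):
    """Fleiss' kappa for items rated by the same number of raters.

    rows: one tuple of ratings per item (one rating per rater).
    categories: all possible rating values.
    """
    n_items = len(rows)
    n_raters = len(rows[0])
    # counts[i][j] = number of raters who assigned category j to item i
    counts = [[row.count(c) for c in categories] for row in rows]
    # Overall proportion of assignments falling in each category
    p_j = [
        sum(c[j] for c in counts) / (n_items * n_raters)
        for j in range(len(categories))
    ]
    # Observed agreement for each item
    p_i = [
        (sum(x * x for x in c) - n_raters) / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_bar = sum(p_i) / n_items       # mean observed agreement
    p_e = sum(p * p for p in p_j)    # agreement expected by chance
    return (p_bar - p_e) / (1 - p_e)

# Toy ratings (not from the dataset): 3 judges, 4 items, categories 1 to 3.
judge1 = [1, 2, 3, 1]
judge2 = [1, None, 3, 1]  # one missing rating to impute
judge3 = [1, 2, 3, 2]

judge2_imputed = impute_with_mode(judge2)
rows = list(zip(judge1, judge2_imputed, judge3))
kappa = fleiss_kappa(rows, categories=[1, 2, 3])
print(judge2_imputed, round(kappa, 3))
```

Comparing the agreement statistic on the imputed and non-imputed tables (complete_imputed.csv and complete_noimpute.csv) is one way to check how sensitive the result is to the imputation choice.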