Large Language Models in Materials Science: Evaluating RAG Performance in Graphene Synthesis Using RAGAS
Description
Retrieval-Augmented Generation (RAG) systems show promise for specialized scientific domains, but their effectiveness depends on reliable evaluation frameworks. This study assesses the RAGAS framework for evaluating RAG-LLM performance in materials science, focusing on graphene synthesis. We curated a dataset of 100 graphene research papers and developed a test suite comparing a RAG-LLM against both a domain expert and baseline Large Language Models (LLMs) using three key evaluation metrics: Factual Correctness (FC), Context Relevance (CR), and Faithfulness (FF). These metrics, scaled from 0 to 1 with higher values indicating better performance, assess factual accuracy, contextual relevance to the query, and adherence to retrieved sources, respectively. The RAG-LLM achieved moderate factual accuracy (FC=0.43, σ=0.27) but demonstrated strong context relevance (CR=0.80, σ=0.40) and high faithfulness to sources (FF=0.86, σ=0.25). While the expert achieved higher factual accuracy (FC=0.51) and context relevance (CR=0.82), a lower faithfulness score (FF=0.62) revealed a tendency to draw on knowledge beyond the provided contexts. Baseline LLMs performed significantly worse (FC=0.13, σ=0.14), highlighting the value of retrieval augmentation. These results validate RAGAS as an effective evaluation framework for materials science RAG-LLM applications while revealing important limitations in current RAG systems' ability to process domain-specific knowledge. Our findings establish a foundation for evaluating RAG systems in scientific domains and identify key areas for improvement in both evaluation metrics and RAG implementation.
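The sketch below illustrates how scores of this kind can be computed with the RAGAS library; it is not the authors' exact pipeline. It assumes the ragas 0.1.x Python API, in which Faithfulness, Context Relevance, and Factual Correctness roughly correspond to the faithfulness, context_relevancy, and answer_correctness metrics (metric and dataset column names differ across releases), and the question, answer, contexts, and ground truth shown are illustrative placeholders rather than items from the study's dataset.

```python
# Minimal RAGAS scoring sketch (assumes ragas 0.1.x; names vary across releases).
# evaluate() calls an LLM judge under the hood, so an API key for the default
# backend (or a configured alternative) is required before running this.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, context_relevancy, answer_correctness

# One evaluation sample: the query, the RAG system's answer, the retrieved
# passages it was conditioned on, and a reference ("ground truth") answer.
# All strings below are hypothetical placeholders.
samples = {
    "question": ["What growth temperature is typical for CVD graphene on copper?"],
    "answer": [
        "CVD growth on Cu foil is usually carried out near 1000 °C under a "
        "methane/hydrogen flow."
    ],
    "contexts": [[
        "Monolayer graphene was grown on Cu foil at 1000 °C using CH4 as the carbon precursor.",
        "Hydrogen co-flow was used to control nucleation density during growth.",
    ]],
    "ground_truth": [
        "Typical CVD synthesis on copper uses temperatures around 1000 °C with a methane precursor."
    ],
}

dataset = Dataset.from_dict(samples)

# Each metric returns a score in [0, 1], higher is better, matching the
# FC/CR/FF scales described above.
result = evaluate(
    dataset,
    metrics=[faithfulness, context_relevancy, answer_correctness],
)
print(result)  # dict-like scores, e.g. faithfulness, context_relevancy, answer_correctness
```

In practice the same call is run over the full test suite (one row per question), and per-metric means and standard deviations are then taken over the rows to obtain summary figures comparable to those reported above.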