Large Language Models in Materials Science: Evaluating RAG Performance in Graphene Synthesis Using RAGAS

Published: 3 March 2025 | Version 1 | DOI: 10.17632/ry7phxn4js.1
Contributors:

Description

Retrieval-Augmented Generation (RAG) systems show promise for specialized scientific domains, but their effectiveness depends on reliable evaluation frameworks. This study assesses the RAGAS framework for evaluating RAG-LLM performance in materials science, focusing on graphene synthesis. We curated a dataset of 100 graphene research papers and developed a test suite comparing a RAG-LLM against both a domain expert and baseline Large Language Models (LLMs) using three key evaluation metrics: Factual Correctness (FC), Context Relevance (CR), and Faithfulness (FF). These metrics, scaled from 0 to 1 with higher values indicating better performance, assess factual accuracy, contextual relevance to the query, and adherence to retrieved sources, respectively. The RAG-LLM achieved moderate factual accuracy (FC=0.43, σ=0.27) but demonstrated strong context relevance (CR=0.80, σ=0.40) and high faithfulness to sources (FF=0.86, σ=0.25). While the expert demonstrated superior factual accuracy (FC=0.51) and context relevance (CR=0.82), a lower faithfulness score (FF=0.62) revealed a tendency to incorporate knowledge beyond the provided contexts. Baseline LLMs performed significantly worse (FC=0.13, σ=0.14), highlighting the value of retrieval augmentation. These results validate RAGAS as an effective evaluation framework for materials science RAG-LLM applications while revealing important limitations in current RAG systems' ability to process domain-specific knowledge. Our findings establish a foundation for evaluating RAG systems in scientific domains and identify key areas for improvement in both evaluation metrics and RAG implementation.
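
For orientation, the sketch below shows how a single question-answer record might be scored with the open-source ragas package. It is a minimal illustration under stated assumptions, not the study's actual pipeline: the question, answer, contexts, and reference answer are invented placeholders, and the metric and column names follow the ragas 0.1-era API (faithfulness, answer_correctness, context_relevancy), which only approximately maps onto the FC/CR/FF names used above and differs across ragas releases.

    # Minimal sketch (not this study's actual pipeline) of scoring one record
    # with the ragas package, using its 0.1-era API; metric and column names
    # vary across ragas releases, and the record below is an invented example.
    from datasets import Dataset
    from ragas import evaluate
    from ragas.metrics import answer_correctness, context_relevancy, faithfulness

    # One illustrative graphene-synthesis record: the question, the system's
    # answer, the retrieved contexts, and a reference answer for correctness.
    record = {
        "question": ["Which catalyst is most commonly used for CVD growth of monolayer graphene?"],
        "answer": ["Copper foil is the most common catalyst for CVD growth of monolayer graphene."],
        "contexts": [[
            "Chemical vapor deposition (CVD) on copper foil yields large-area "
            "monolayer graphene because of copper's low carbon solubility."
        ]],
        # Some ragas versions expect "ground_truths" (a list) instead.
        "ground_truth": ["Copper is the standard catalyst for monolayer CVD graphene."],
    }

    # evaluate() calls a judge LLM behind the scenes (OpenAI by default), so
    # the corresponding API key must be configured in the environment.
    scores = evaluate(
        Dataset.from_dict(record),
        metrics=[answer_correctness, context_relevancy, faithfulness],
    )

    # Each metric is reported on a 0-to-1 scale, higher is better, matching
    # the FC / CR / FF convention in the description above.
    print(scores)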

Files

Institutions

Nanyang Technological University

Categories

Materials Science Engineering, Large Language Model

Licence