Prioritizing Metamorphic Relations for Bias Detection
Description
This research hypothesizes that sentence diversity metrics can improve the prioritization of metamorphic relations (MRs) for fairness testing of Large Language Models (LLMs) such as GPT-4 and LLaMA 3. The goal is to increase fault detection rates and reduce time to first failure (TFF) compared to existing methods such as random, distance-based, and fault-based ordering.

To test this, 4,700 test cases were generated from templates containing placeholders for sensitive attributes, which were systematically modified using various MRs. The LLM responses were analyzed using six diversity metrics: cosine similarity (embedding-level similarity), lexical diversity (vocabulary variation), Named Entity Recognition (NER) diversity (changes in named entities), semantic similarity (using SentenceTransformer embeddings), sentiment similarity, and tone diversity (emotional-tone consistency).

The findings reveal that diversity-based prioritization significantly outperforms the existing methods: it achieved higher fault detection rates and a lower TFF, allowing quicker and more effective identification of fairness faults. Among the metrics, tone diversity detected the highest number of fairness bugs, highlighting its utility in uncovering biases related to emotional tone. NER diversity effectively identified biases linked to named entities, while semantic and sentiment similarity captured more nuanced fairness violations. Additionally, intersectional biases (those arising from combinations of sensitive attributes such as religion, political views, and economic status) frequently revealed fairness issues, emphasizing the need for targeted intersectional analysis. The results demonstrate that integrating sentence diversity metrics into MR prioritization provides a more efficient and comprehensive approach to fairness testing.
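The template-based test-case generation described above can be sketched as follows. This is a minimal illustration, not the dataset's actual generation script; the template text and attribute values are hypothetical placeholders.

```python
# Sketch of generating a source/follow-up pair for one MR: a template with a
# placeholder for a sensitive attribute, and an MR that swaps the attribute
# value. The template and attribute values below are illustrative only.

TEMPLATE = "Should {attribute} applicants be approved for this loan?"

def make_pair(template, source_value, follow_up_value):
    """Build (source, follow-up) prompts for one metamorphic relation."""
    source = template.format(attribute=source_value)
    follow_up = template.format(attribute=follow_up_value)
    return source, follow_up

src, fup = make_pair(TEMPLATE, "Christian", "Muslim")
print(src)  # source test case
print(fup)  # follow-up test case produced by the attribute-swap MR
```

A fairness fault is then flagged when the LLM's responses to the source and follow-up prompts diverge beyond what the MR permits.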
By reducing the time required to identify faults and improving test coverage, this methodology can enhance fairness evaluation in high-stakes applications such as healthcare, finance, and education. Furthermore, the scalability of this approach offers a generalized framework for testing other AI systems, contributing to the development of more equitable and robust AI technologies.
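Two of the diversity metrics described above can be sketched as follows. This is a simplified stand-in: bag-of-words vectors replace the SentenceTransformer embeddings used in the actual scripts, and the example sentences are hypothetical.

```python
# Sketch of two diversity metrics: cosine similarity between a source and
# follow-up response, and lexical diversity (type-token ratio). Bag-of-words
# vectors stand in for the SentenceTransformer embeddings used in the study.
import math
from collections import Counter

def bow_vector(text):
    """Simple bag-of-words count vector (a stand-in for embeddings)."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine similarity between the bag-of-words vectors of two texts."""
    va, vb = bow_vector(a), bow_vector(b)
    dot = sum(va[w] * vb[w] for w in set(va) & set(vb))
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def lexical_diversity(text):
    """Type-token ratio: unique words divided by total words."""
    words = text.lower().split()
    return len(set(words)) / len(words) if words else 0.0

# Illustrative source/follow-up response pair.
source = "The candidate is qualified for the loan"
follow_up = "The candidate is not qualified for the loan"
print(round(cosine_similarity(source, follow_up), 3))  # high similarity
print(round(lexical_diversity(follow_up), 3))
```

Low similarity (high diversity) between a source and follow-up response signals a potential fairness violation, which is why MRs producing more diverse responses are prioritized first.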
Files
Steps to reproduce
a) Proposed approach: Use the predefined test cases (source and follow-up test cases for each MR) to compute the diversity metrics with the Python script provided, then prioritize the MRs.
b) Fault-based approach: Use the fault information for each MR, provided as a CSV file, to run the fault-based prioritization.
c) Distance-based approach: Use the predefined test cases (source and follow-up test cases for each MR) to compute the distance-based metric with the Python script provided, then prioritize the MRs.
d) Comparison: Compare the performance of the proposed, distance-based, and fault-based approaches using the fault detection rate and time to first failure. Run the Python script provided to calculate these values.
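The prioritization and comparison steps can be sketched as follows. The MR names, diversity scores, and fault labels are illustrative placeholders, not values from the dataset's CSV files, and the fault detection rate here is a simple fraction-of-faults-found measure under that assumption.

```python
# Hypothetical sketch of steps (a) and (d): order MRs by diversity score,
# then evaluate the ordering with time to first failure (TFF) and a simple
# fault detection rate. All data below is illustrative.

def prioritize_by_diversity(mr_scores):
    """Order MRs by descending diversity score (the proposed approach)."""
    return sorted(mr_scores, key=mr_scores.get, reverse=True)

def time_to_first_failure(ordering, faulty):
    """1-based position of the first MR in the ordering that reveals a fault."""
    for pos, mr in enumerate(ordering, start=1):
        if mr in faulty:
            return pos
    return None

def fault_detection_rate(ordering, faulty, k):
    """Fraction of known faults revealed by the first k MRs."""
    found = sum(1 for mr in ordering[:k] if mr in faulty)
    return found / len(faulty) if faulty else 0.0

# Illustrative diversity scores and fault labels (placeholders for the CSVs).
diversity = {"MR_gender_swap": 0.82, "MR_religion_swap": 0.91, "MR_age_swap": 0.45}
faulty = {"MR_religion_swap", "MR_gender_swap"}

order = prioritize_by_diversity(diversity)
print(order)                                   # highest-diversity MRs first
print(time_to_first_failure(order, faulty))    # lower is better
print(fault_detection_rate(order, faulty, k=2))
```

The same two evaluation functions can score the random, distance-based, and fault-based orderings, so all approaches are compared on identical criteria.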