Replication Package: An Empirical Investigation of Large Language Models for Automated Software Compliance Testing: Evidence from Web Accessibility

Published: 3 June 2026 | Version 1 | DOI: 10.17632/bwsyv45b9m.1
Contributors:

Description

This replication package accompanies the paper "An Empirical Investigation of Large Language Models for Automated Software Compliance Testing: Evidence from Web Accessibility" published in Information and Software Technology.

Contents:

  • Test specifications for 384 W3C ACT test cases across 39 WCAG 2 AA success criteria
  • Evaluation scripts for seven Large Language Models (deepseek-reasoner, deepseek-chat, groq_qwen-qwq-32b, groq_deepseek-r1-distill-llama-70b, gemini-2.0-flash-thinking, groq_qwen-2.5-coder-32b, gemini-2.0-flash)
  • Three prompting strategies (Query 000, 110, 111) with five replications each
  • Raw LLM outputs for all 40,320 evaluations
  • Analysis code and derived datasets
  • Traditional tool evaluation results (axe-core, Lighthouse, Pa11y)
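
The figure of 40,320 evaluations is consistent with a full factorial design: 384 test cases × 7 models × 3 prompting strategies × 5 replications. The following Python sketch enumerates that grid to check the arithmetic. It is an illustration only: the tuple layout, the constant names, and the `evaluation_grid` helper are assumptions made here, not the package's actual scripts or file organization.

```python
from itertools import product

# Model identifiers and strategy labels as listed in the description above.
MODELS = [
    "deepseek-reasoner",
    "deepseek-chat",
    "groq_qwen-qwq-32b",
    "groq_deepseek-r1-distill-llama-70b",
    "gemini-2.0-flash-thinking",
    "groq_qwen-2.5-coder-32b",
    "gemini-2.0-flash",
]
STRATEGIES = ["000", "110", "111"]
N_TEST_CASES = 384    # W3C ACT test cases
N_REPLICATIONS = 5    # replications per (test case, model, strategy)


def evaluation_grid():
    """Yield one (test_case, model, strategy, replication) tuple per evaluation.

    Hypothetical helper: test cases are represented by their index only,
    since the real identifiers live in the package's test specifications.
    """
    return product(
        range(N_TEST_CASES),
        MODELS,
        STRATEGIES,
        range(1, N_REPLICATIONS + 1),
    )


total = sum(1 for _ in evaluation_grid())
per_model = total // len(MODELS)

print(total)      # 40320
print(per_model)  # 5760
```

Under this reading, each model accounts for 5,760 of the raw outputs, which may be a useful sanity check when loading the data.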

Categories

Computer Science, Software, Artificial Intelligence Applications

Funders

  • University of the Basque Country (UPV/EHU)
    Grant ID: US24/10

Licence