Evaluating Large Language Models in Legal Use Cases

Published: 14 May 2025 | Version 1 | DOI: 10.17632/jnztrkb4f2.1
Contributors:

Description

This dataset contains the full data for a rapid literature review of academic research on Large Language Models (LLMs) in legal use cases. Alongside the full data spreadsheet, additional sheets provide overviews of the metrics used to evaluate LLMs in the studies; the LLMs studied and the families they belong to; the legal domains and use cases in which LLMs are applied; and the activities or tasks that the LLMs perform in the studies.
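
For readers who want to explore the spreadsheet programmatically, the sheets can be loaded and inspected as in the minimal sketch below. The workbook file name used here (llm_legal_review.xlsx) is an assumption for illustration only, not the actual file name in the dataset.

    import pandas as pd

    # Hypothetical file name -- replace with the workbook downloaded from this record.
    WORKBOOK = "llm_legal_review.xlsx"

    # Load every sheet (the full data sheet plus the overview sheets) into a dict of DataFrames.
    sheets = pd.read_excel(WORKBOOK, sheet_name=None)

    for name, frame in sheets.items():
        # Print each sheet's name, row count, and column headers for a quick overview.
        print(f"{name}: {len(frame)} rows, columns = {list(frame.columns)}")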

Files

Steps to reproduce

The search strategy was developed following a rapid review two-step screening methodology, notably with a simplified search strategy over a shorter timeframe. The motivation for this approach is twofold. Firstly, because of the rapidly evolving nature of AI research, we wanted to provide a snapshot of studies published from 2023, the year GPT-4 was launched. Secondly, our aim in conducting this review is to support projects currently working to develop benchmarks and metrics for evaluating LLMs. By showing the limitations of current academic research, we contribute to the development of better benchmarks and metrics in the future.

Our search was conducted on Scopus and included only studies in English published from 2023 to the time of the search. Scopus was chosen as it is the largest database of peer-reviewed literature. It was searched on 13/05/2024 using the search terms “LLMs in law” OR “Large language models in legal use cases” OR “evaluating LLMs in law”. The search identified 101 records. After title and abstract screening, a total of 83 records were retrieved, further screened, and checked for agreement. We carried out an additional search on Scopus using the same terms on 11/02/2025 to account for papers published after our review began. This yielded an additional 150 studies. After removing duplicates and applying the exclusion criteria across both searches, we had 140 papers.

Papers were included so long as they provided an evaluation of the application of LLMs to legal use cases. We included empirical studies that identified specific use cases and tasks and then evaluated LLMs using a range of quantitative and qualitative metrics, as well as theoretical papers, provided they discussed LLMs in legal use cases and what it would mean for LLMs to be successful in those use cases. Reasons for exclusion included being outside the scope of LLMs, being a summary of a conference, or focusing on an irrelevant domain such as healthcare.

Information collected included the legal domain, use case, and tasks; the legal system of the country being studied; the LLMs studied in the paper; the evaluation methods and metrics; whether the LLMs performed well or poorly in the study; and the proposed target groups of the use case (clients, lawyers, judges, etc.). In what follows, we focus on the models used, how use cases are broken down into tasks, and the metrics used to evaluate those tasks.
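
The combined search, de-duplication, and data-extraction steps described above can be summarised in a short sketch. This is not the authors' actual pipeline (screening and agreement checks were carried out manually); the file names, column names, and exclusion keywords below are illustrative assumptions only.

    import pandas as pd

    # Hypothetical Scopus CSV exports for the 13/05/2024 and 11/02/2025 searches.
    first_search = pd.read_csv("scopus_2024-05-13.csv")
    second_search = pd.read_csv("scopus_2025-02-11.csv")

    # Combine both searches and drop duplicate records by normalised title
    # (column names such as "Title" and "Abstract" are assumed here).
    combined = pd.concat([first_search, second_search], ignore_index=True)
    combined["title_key"] = combined["Title"].str.strip().str.lower()
    combined = combined.drop_duplicates(subset=["title_key"])

    # Crude keyword pass mirroring the exclusion criteria (conference summaries,
    # irrelevant domains such as healthcare); inclusion decisions were made by human screeners.
    exclude_terms = ["conference summary", "healthcare"]
    pattern = "|".join(exclude_terms)
    screened = combined[~combined["Abstract"].str.lower().str.contains(pattern, na=False)]

    # Fields recorded for each included paper, matching the data items listed above.
    extraction_fields = [
        "legal_domain", "use_case", "tasks", "legal_system",
        "llms_studied", "evaluation_metrics", "performance", "target_groups",
    ]
    print(f"{len(screened)} candidate papers after the automated pass; fields recorded: {extraction_fields}")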

Institutions

University of Nottingham, The University of Sheffield, Queen Mary University of London (Queen Mary Intellectual Property Research Institute), University of Warwick

Categories

Law, Artificial Intelligence, AI Ethics, Large Language Model

Funding

UK Research and Innovation

Licence