What level of automation is “good enough”? A benchmark of large language models for meta-analysis data extraction

dc.citation.volume: Online first
dc.contributor.author: Li L
dc.contributor.author: Mathrani A
dc.contributor.author: Susnjak T
dc.date.accessioned: 2026-02-24T22:44:30Z
dc.date.issued: 2026-01-26
dc.description.abstract: Automating data extraction from full-text randomized controlled trials for meta-analysis remains a significant challenge. This study evaluates the practical performance of three large language models (LLMs), Gemini-2.0-flash, Grok-3, and GPT-4o-mini, across tasks involving statistical results, risk-of-bias assessments, and study-level characteristics in three medical domains: hypertension, diabetes, and orthopaedics. We tested four distinct prompting strategies (basic prompting, self-reflective prompting, model ensemble, and customized prompts) to determine how to improve extraction quality. All models demonstrated high precision but consistently suffered from poor recall, omitting key information. We found that customized prompts were the most effective, boosting recall by up to 15%. Based on this analysis, we propose a three-tiered set of guidelines for using LLMs in data extraction, matching data types to appropriate levels of automation based on task complexity and risk. Our study offers practical advice for automating data extraction in real-world meta-analyses, balancing LLM efficiency with expert oversight through targeted, task-specific automation.
dc.description.confidential: false
dc.identifier.citation: Li L, Mathrani A, Susnjak T. (2026). What level of automation is “good enough”? A benchmark of large language models for meta-analysis data extraction. Research Synthesis Methods. Online first.
dc.identifier.doi: 10.1017/rsm.2025.10066
dc.identifier.eissn: 1759-2887
dc.identifier.elements-type: journal-article
dc.identifier.issn: 1759-2879
dc.identifier.uri: https://mro.massey.ac.nz/handle/10179/74213
dc.language: English
dc.publisher: Cambridge University Press
dc.publisher.uri: http://cambridge.org/core/journals/research-synthesis-methods/article/what-level-of-automation-is-good-enough-a-benchmark-of-large-language-models-for-metaanalysis-data-extraction/2EA4DAFAAC11E76216DC0A512CA29D59
dc.relation.isPartOf: Research Synthesis Methods
dc.rights: (c) The author/s
dc.rights.license: CC BY 4.0
dc.rights.uri: https://creativecommons.org/licenses/by/4.0/
dc.subject: automated meta-analysis
dc.subject: data extraction
dc.subject: evidence synthesis
dc.subject: human-in-the-loop
dc.subject: large language models (LLMs)
dc.subject: prompt engineering
dc.title: What level of automation is “good enough”? A benchmark of large language models for meta-analysis data extraction
dc.type: Journal article
pubs.elements-id: 609707
pubs.organisational-group: Other

Files

Original bundle

Name:
609707 PDF.pdf
Size:
1.84 MB
Format:
Adobe Portable Document Format
Description:
Published version.pdf

License bundle

Name:
license.txt
Size:
9.22 KB
Format:
Plain Text