What level of automation is “good enough”? A benchmark of large language models for meta-analysis data extraction

dc.citation.volume: Online first
dc.contributor.author: Li L
dc.contributor.author: Mathrani A
dc.contributor.author: Susnjak T
dc.date.accessioned: 2026-02-24T22:44:30Z
dc.date.issued: 2026-01-26
dc.description.abstract: Automating data extraction from full-text randomized controlled trials for meta-analysis remains a significant challenge. This study evaluates the practical performance of three large language models (LLMs), Gemini-2.0-flash, Grok-3, and GPT-4o-mini, across tasks involving statistical results, risk-of-bias assessments, and study-level characteristics in three medical domains: hypertension, diabetes, and orthopaedics. We tested four distinct prompting strategies (basic prompting, self-reflective prompting, model ensemble, and customized prompts) to determine how to improve extraction quality. All models demonstrated high precision but consistently suffered from poor recall, omitting key information. We found that customized prompts were the most effective, boosting recall by up to 15%. Based on this analysis, we propose a three-tiered set of guidelines for using LLMs in data extraction, matching data types to appropriate levels of automation based on task complexity and risk. Our study offers practical advice for automating data extraction in real-world meta-analyses, balancing LLM efficiency with expert oversight through targeted, task-specific automation.
dc.description.confidential: false
dc.identifier.citation: Li L, Mathrani A, Susnjak T. (2026). What level of automation is “good enough”? A benchmark of large language models for meta-analysis data extraction. Research Synthesis Methods. Online first.
dc.identifier.doi: 10.1017/rsm.2025.10066
dc.identifier.eissn: 1759-2887
dc.identifier.elements-type: journal-article
dc.identifier.issn: 1759-2879
dc.identifier.uri: https://mro.massey.ac.nz/handle/10179/74213
dc.language: English
dc.publisher: Cambridge University Press
dc.publisher.uri: http://cambridge.org/core/journals/research-synthesis-methods/article/what-level-of-automation-is-good-enough-a-benchmark-of-large-language-models-for-metaanalysis-data-extraction/2EA4DAFAAC11E76216DC0A512CA29D59
dc.relation.isPartOf: Research Synthesis Methods
dc.rights: (c) The author/s
dc.rights.license: CC BY 4.0
dc.rights.uri: https://creativecommons.org/licenses/by/4.0/
dc.subject: automated meta-analysis
dc.subject: data extraction
dc.subject: evidence synthesis
dc.subject: human-in-the-loop
dc.subject: large language models (LLMs)
dc.subject: prompt engineering
dc.title: What level of automation is “good enough”? A benchmark of large language models for meta-analysis data extraction
dc.type: Journal article
pubs.elements-id: 609707
pubs.organisational-group: Other

Files

Original bundle

Name:
609707 PDF.pdf
Size:
1.84 MB
Format:
Adobe Portable Document Format
Description:
Published version.pdf

License bundle

Name:
license.txt
Size:
9.22 KB
Format:
Plain Text