AI-assisted quality assessment in HEOR: What works, what doesn’t, and how to apply GenAI responsibly

Maria Arregui, PhD
Erika Wissinger
Evelyn Gomez-Espinosa, PhD
Maria Koufopoulou, MSc

Quality appraisals and Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) compliance checks are essential for credible evidence synthesis in health economics and outcomes research (HEOR), but they are also time-intensive. Across three internal case studies, we explored where a closed-system generative artificial intelligence (GenAI) chatbot can genuinely add value, and where expert human judgment still makes the difference.

Why this question matters now

Systematic literature reviews (SLRs) sit at the heart of health technology assessment (HTA) and HEOR. They inform reimbursement decisions, shape clinical and policy guidance, and underpin value narratives across markets. Yet the credibility of an SLR depends on more than comprehensive searching—it relies on transparent reporting and rigorous appraisal of the quality of the underlying evidence.

This emphasis on evidence quality is increasingly reflected in regulatory HTA frameworks. Under the EU HTA Regulation, Joint Clinical Assessments (JCAs) are explicitly required to evaluate the validity, strengths, limitations, and uncertainty of the clinical evidence base. Recent JCA methodological guidance formalises expectations around assessment of internal and external validity across clinical study designs, underscoring that rigorous appraisal is central to comparative clinical assessment at the European level.

Quality assurance tasks such as PRISMA compliance checking and structured study quality assessment are therefore essential. At the same time, they are labour-intensive. As evidence packages grow larger and HTA timelines shorten, HEOR teams are increasingly challenged to do more, faster, without compromising rigour.

GenAI has entered this conversation with considerable momentum. Policy and methodological bodies have begun to acknowledge its potential to support evidence synthesis, while simultaneously emphasising the need for cautious, transparent, and well-governed adoption. The question is no longer whether GenAI could be used in HEOR, but how—and where—it should be used responsibly.

What we set out to learn

Rather than asking whether GenAI could replace expert reviewers, we focused on a more practical question: Can a closed-system GenAI chatbot reliably support structured, checklist-based, study-level quality appraisal (QA) tasks, and under what conditions does it fall short?

To answer this, we conducted three internal evaluations, each addressing a core QA activity commonly required in HEOR:

PRISMA 2020 compliance checking for published SLRs
Study quality assessment across multiple study designs within an SLR
Quality assessment of economic evaluations using the Drummond checklist

All three studies used the same foundational principles: a secure internal environment, standardised prompts aligned to established tools, and direct comparison with trained human reviewers.

A common, governed approach

Across the evaluations, the GenAI chatbot was deployed in a deliberately constrained way. Publications were uploaded individually, prompts were explicitly mapped to checklist items, and outputs were required to include both a categorical judgment and verbatim supporting text from the source document.
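The output requirement described above (a categorical judgment plus verbatim supporting text) lends itself to an automated check before any human review. The following is a minimal sketch, not the tooling used in these evaluations: the class name, the function names, and the set of allowed judgment categories are all hypothetical, chosen only to illustrate the idea.

```python
from dataclasses import dataclass

# Hypothetical judgment categories; the real set depends on the checklist in use.
ALLOWED_JUDGMENTS = {"Yes", "No", "Partial", "Not reported"}


@dataclass(frozen=True)
class ItemAssessment:
    item_id: str          # checklist item the prompt was mapped to
    judgment: str         # categorical judgment returned by the chatbot
    supporting_text: str  # quote the chatbot presents as verbatim


def normalise(text: str) -> str:
    """Collapse whitespace and case so line breaks in a PDF do not hide a real match."""
    return " ".join(text.split()).lower()


def validate(assessment: ItemAssessment, source_text: str) -> list[str]:
    """Return a list of problems; an empty list means the output passes both checks."""
    problems = []
    if assessment.judgment not in ALLOWED_JUDGMENTS:
        problems.append(
            f"{assessment.item_id}: unexpected judgment {assessment.judgment!r}"
        )
    quote = normalise(assessment.supporting_text)
    # An empty quote is also flagged, because supporting text is mandatory here.
    if not quote or quote not in normalise(source_text):
        problems.append(
            f"{assessment.item_id}: supporting text not found verbatim in source"
        )
    return problems
```

Calling `validate(ItemAssessment("7", "Yes", "searched MEDLINE and Embase"), publication_text)` returns an empty list only when the quote can be located in the uploaded publication, which gives reviewers a quick way to spot paraphrased or invented evidence.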

Human assessments served as the reference standard and were subject to second reviewer validation. This design allowed us to evaluate not only agreement rates, but also why disagreements occurred.
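To make the comparison concrete, agreement can be computed as the share of checklist items on which the chatbot and the human reference standard gave the same judgment, while keeping a list of the items that differed so the reasons can be examined. This is an illustrative sketch under that assumption; the function name and the dictionary layout (item identifier mapped to judgment) are hypothetical and not taken from the evaluations.

```python
def agreement(human: dict[str, str], ai: dict[str, str]) -> tuple[float, list[str]]:
    """Return percent agreement and the sorted list of items rated differently."""
    if not human:
        raise ValueError("no checklist items to compare")
    if human.keys() != ai.keys():
        raise ValueError("both reviewers must rate the same checklist items")
    # Items where the chatbot's judgment differs from the human reference standard.
    disagreements = sorted(item for item in human if human[item] != ai[item])
    pct = 100 * (len(human) - len(disagreements)) / len(human)
    return pct, disagreements
```

Returning the disagreeing items alongside the percentage keeps the analysis focused on why the two reviewers diverged, not only on how often.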

Case study 1: PRISMA 2020 checklist

In the first evaluation, six published SLRs were assessed against PRISMA 2020 using 42 checklist-derived questions. Overall, the GenAI chatbot achieved 93% full agreement with human reviewers.

As anticipated, performance was strongest in domains characterised by standardised, explicit reporting. Items related to the title, abstract, introduction, and other information showed complete alignment between the human and chatbot review. In contrast, discrepancies clustered in areas that routinely challenge human reviewers as well, particularly interpretive discussion items.

The findings suggest that GenAI is particularly effective at identifying and organising information that is explicitly reported, which makes it a valuable tool for initial PRISMA checks. However, when interpretation is needed, human oversight continues to play an important role.

Case study 2: Study quality assessment

The second evaluation extended this question to quality assessment across heterogeneous study designs. In total, 28 studies were evaluated: 6 randomised controlled trials (RCTs), 4 prospective cohort studies, and 18 retrospective cohort studies. Using design-appropriate tools (i.e., RoB 1, the Newcastle–Ottawa Scale, and the Motheral checklist, respectively), agreement between GenAI and human reviewers was consistent across study designs, ranging from 81% to 83%.

Disagreements followed recurring patterns. Allocation concealment in RCTs, data source reliability in retrospective studies, and cohort comparability in prospective observational studies all proved challenging, particularly when reporting was incomplete.

Notably, GenAI performance declined in lower-quality studies. While the chatbot reliably extracted information that was reported, it struggled when relevant information was missing, a gap that human reviewers can interpret using methodological context and experience.

Case study 3: Economic evaluations

The third evaluation examined GenAI-assisted quality assessment of economic evaluations using the Drummond checklist. Across eight studies, agreement ranged from 65.7% to 100%, with a median of 94.3%.

Where discrepancies arose, they were again linked to ambiguous or incomplete reporting. A recurring pattern showed a tendency toward optimistic judgments, with the GenAI chatbot occasionally assigning a “Yes” where human reviewers judged criteria as unmet.
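One way to quantify a tendency toward optimistic judgments is to classify each disagreement by its direction. The sketch below assumes that "Yes" is the favourable rating and that any other label is less favourable. The function name, the category labels, and the data layout are hypothetical and serve only as an illustration.

```python
from collections import Counter


def disagreement_direction(human: dict[str, str], ai: dict[str, str]) -> Counter:
    """Count disagreements by direction, treating 'Yes' as the favourable rating."""
    counts = Counter()
    for item, human_rating in human.items():
        ai_rating = ai[item]
        if ai_rating == human_rating:
            continue  # agreement, nothing to classify
        if ai_rating == "Yes":
            counts["optimistic"] += 1   # chatbot more favourable than the human
        elif human_rating == "Yes":
            counts["pessimistic"] += 1  # chatbot less favourable than the human
        else:
            counts["other"] += 1        # neither rating is 'Yes'
    return counts
```

If the "optimistic" count clearly outweighs the "pessimistic" count across studies, the disagreements are systematic and not random, which points to prompt refinement or targeted human review of "Yes" judgments.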

What these case studies tell us

Viewed together, these evaluations point to several lessons for HEOR practice.

First, high concordance is achievable when GenAI is applied to well-bounded, checklist-driven tasks using carefully designed prompts. Second, GenAI struggles most in the same places humans do—when reporting is unclear, incomplete, or requires contextual interpretation.

Third, reporting quality sets the ceiling for automation. Better reporting enables better QA, whether conducted by humans or AI.

These findings argue against viewing GenAI as a replacement for expert reviewers. Instead, its value lies in acting as a second reviewer or accelerator: standardising outputs, surfacing supporting text efficiently, and freeing human expertise for the judgments that truly require it.

Concluding thoughts

Across three distinct but related use cases, a closed-system GenAI chatbot demonstrated clear potential to accelerate quality assurance in HEOR without undermining rigour, provided it is deployed thoughtfully and within a governed, human-in-the-loop framework.

As guidance on responsible AI use in evidence synthesis continues to evolve, these case studies offer a pragmatic perspective: GenAI works best when its role is clearly defined, its outputs are transparent, and expert judgment remains in the loop. Used this way, GenAI can support the evidence standards that are fundamental to rigorous HTA and HEOR studies.

Note: Sources listed below

Disclaimer:
This article summarises Cencora’s understanding of the topic based on publicly available information at the time of writing (see listed sources) and the authors’ expertise in this area. Any recommendations provided in the article may not be applicable to all situations and do not constitute legal advice; readers should not rely on the article in making decisions related to the topics discussed.

Sources

Arregui M, Gomez Espinosa E, Wissinger E, Koufopoulou M. Evaluating the performance of an artificial intelligence–powered tool for assessing systematic literature reviews using the Preferred Reporting Items for Systematic Reviews and Meta-Analyses 2020 checklist. Value Health. In press. 2026.
Arregui M, Koufopoulou M, Cadarette S, Wissinger E. Leveraging artificial intelligence to streamline study quality assessment in systematic literature reviews. Value Health. 2025;28(6 suppl 1):S400. doi:10.1016/j.jval.2025.04.1783
Arregui M, Koufopoulou M. EE446: Evaluating the performance of an artificial intelligence–powered tool for assessing quality of published economic evaluations: a comparison with human reviewers using the Drummond checklist. Value Health. 2025;28(12 suppl 1):S194. doi:10.1016/j.jval.2025.09.829
Cochrane; Guidelines International Network; Campbell Collaboration. RAISE: Responsible AI use in evidence synthesis. Published 2025. https://www.cochrane.org/about-us/news/setting-standards-responsible-ai-use-evidence-synthesis
Drummond MF, Sculpher MJ, Torrance GW, O’Brien BJ, Stoddart GL. Methods for the Economic Evaluation of Health Care Programmes. Oxford University Press; 2005.
European Commission; Directorate-General for Health and Food Safety. Guidance on validity of clinical studies for joint clinical assessments. Adopted 4 July 2024. Accessed 18 May 2026. https://health.ec.europa.eu/publications/guidance-validity-clinical-studies-joint-clinical-assessments_en
Higgins JPT, Altman DG, Gøtzsche PC, et al. The Cochrane Collaboration’s tool for assessing risk of bias in randomised trials. BMJ. 2011;343:d5928.
Motheral B, Brooks J, Clark MA, et al. A checklist for retrospective database studies. Value Health. 2003;6(2):90-97.
National Institute for Health and Care Excellence. Use of AI in Evidence Generation: NICE position statement. Published 2024. Accessed 18 May 2026. https://www.nice.org.uk/corporate/ecd11
Page MJ, McKenzie JE, Bossuyt PM, et al. The PRISMA 2020 statement. BMJ. 2021;372:n71.
Wells GA, Shea B, O’Connell D, et al. The Newcastle-Ottawa Scale (NOS) for assessing the quality of nonrandomised studies in meta-analyses. Published 2014.
Wright C, Swanston L, Nicholson L, Marjenberg Z. HTA360: A comparative assessment of SLR requirements for HTA. Value Health. 2023;26(12):S389-S390.
