Quick Take

  • Commercial large language models (LLMs) showed variable agreement with JBI-trained human reviewers; Claude 3.7 Sonnet achieved the highest alignment (Cohen’s κ = 0.732) while Gemini 2.0 Flash performed worst (κ = 0.394).
  • Five of 11 JBI checklist items (Q1, Q4, Q5, Q7, Q11) reached full human–LLM agreement; disagreements clustered on high-judgment domains (inclusion criteria, search strategy, publication bias, independent appraisal).
  • Practical implication for pharmacy: LLMs can reliably extract simple, text-findable checklist items and may accelerate pre-screening, but they miss complex methodological judgments — so human expert re-review is mandatory for P&T, medication-safety, and guideline decisions.

Why it Matters

  • Critical appraisal of systematic reviews is a foundational, time-consuming step in evidence-based practice; many clinicians lack appraisal training, and methodologically weak reviews risk informing unsafe medication decisions.
  • Commercial LLMs deliver rapid, consistent text extraction but lack the contextual reasoning to detect complex methodological flaws (e.g., inappropriate inclusion criteria, incomplete search strategies, unassessed publication bias), limiting their reliability as standalone decision support for inpatient pharmacy workflows.

What They Did

  • Independently appraised four full-text systematic reviews on interventions to reduce medication administration errors using the 11-item JBI Critical Appraisal Checklist; two JBI-trained human reviewers (11 and 4 years' evidence-based practice experience) provided the reference standard.
  • Tested five paid commercial LLMs (Perplexity Sonar Pro, Claude 3.7 Sonnet, Gemini 2.0 Flash, GPT-4.5, Grok-2) with identical full-text PDFs and a standardized 11-item JBI prompt, each run in a fresh session.
  • Human reviewers rated independently and resolved disagreements by consensus while blinded to AI outputs; comparisons used percent agreement and Cohen’s kappa for each JBI item (a sketch of both metrics follows this list).
  • Single-timepoint internal inference study (March 2025) with n=4 reviews and no external validation — intended as a feasibility evaluation of AI-assisted critical appraisal, not model training.
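
A minimal sketch of the item-level agreement metrics named above (percent agreement and Cohen’s kappa), assuming each rater's verdicts are stored as a parallel list of strings; the ratings shown are illustrative placeholders, not the study's data.

```python
from collections import Counter

def percent_agreement(rater_a, rater_b):
    """Share of judgments on which the two raters give the same verdict."""
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    n = len(rater_a)
    p_o = percent_agreement(rater_a, rater_b)        # observed agreement
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Expected agreement if the two raters answered independently
    p_e = sum(freq_a[c] * freq_b[c] for c in set(rater_a) | set(rater_b)) / (n * n)
    return 1.0 if p_e == 1.0 else (p_o - p_e) / (1 - p_e)

# Hypothetical verdicts for one model vs. the human consensus across
# 4 reviews x 11 JBI items = 44 judgments (not the paper's actual data).
human = ["Yes"] * 40 + ["No"] * 4
model = ["Yes"] * 42 + ["No"] * 2

print(f"percent agreement = {percent_agreement(human, model):.3f}")
print(f"Cohen's kappa     = {cohens_kappa(human, model):.3f}")
```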

What They Found

  • Agreement varied by item and model: 5/11 JBI items reached complete human–LLM agreement (Q1, Q4, Q5, Q7, Q11).
  • Inter-rater reliability (human–LLM Cohen’s κ) ranged from 0.394 (Gemini 2.0 Flash) to 0.732 (Claude 3.7 Sonnet); GPT-4.5 κ = 0.621, Perplexity Sonar Pro κ ≈ 0.497, Grok-2 κ ≈ 0.480 (interpreted against standard benchmarks in the sketch after this list).
  • Disagreements clustered on high-judgment domains: publication bias (Q9) accounted for 6/44 discrepant responses (13.6%); inclusion criteria (Q2), search strategy (Q3), and independent appraisal (Q6) each had 5/44 discrepancies (11.4%).
  • All LLMs failed to identify a critical methodological flaw in the Koyama et al. review (human reviewers rated key items as 'No' while models answered 'Yes').
  • Model–model concordance exceeded model–human concordance (e.g., Claude–GPT-4.5 κ = 0.876). Against the human reference, Claude 3.7 Sonnet had only 3/44 discrepancies (6.8%) versus 7/44 (15.9%) for Perplexity Sonar Pro and Gemini 2.0 Flash, demonstrating substantial between-model variation.
  • Operational takeaway: LLMs reliably flag simple, text-findable checklist items (useful for rapid pre-screening) but cannot be relied upon for methodological judgments central to P&T, medication-safety, or guideline adoption — human re-review remains mandatory.
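
To put the reported κ values in context, the Landis and Koch benchmarks (a common convention, not cited in the source) label 0.61–0.80 as substantial agreement and 0.21–0.40 as only fair agreement; a small helper makes the mapping explicit.

```python
def landis_koch(kappa: float) -> str:
    """Map a Cohen's kappa value to the Landis & Koch (1977) agreement band."""
    if kappa < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"

# Human-LLM kappas as reported in the study
for name, k in {"Claude 3.7 Sonnet": 0.732, "GPT-4.5": 0.621,
                "Perplexity Sonar Pro": 0.497, "Grok-2": 0.480,
                "Gemini 2.0 Flash": 0.394}.items():
    print(f"{name}: kappa = {k} -> {landis_koch(k)}")
```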

Takeaways

  • Expect current commercial LLMs to be most useful for quick, structured checks on systematic reviews (for example, confirming whether the review question is stated or which databases are listed), not for full methodological appraisal.
  • Treat LLM outputs as a highlighter that locates information in the text, not as a substitute for expert judgment about methodological appropriateness.
  • Because models struggled with inclusion criteria, search strategy, independent appraisal, and publication bias, over-reliance risks allowing methodologically weak reviews to influence medication policy or practice.
  • Performance is model-dependent and inconsistent; pharmacist judgment and independent human re-review must remain the final authority when evidence will inform patient care, P&T, or safety protocols.

Strengths and Limitations

Strengths:

  • First head-to-head comparison of multiple paid commercial LLMs against JBI-trained human reviewers using identical full texts and a standardized JBI prompt.
  • Blinded human consensus, standardized prompting, and item-level kappa plus percent-agreement analyses provide transparent, reproducible benchmarking of model performance.

Limitations:

  • Small sample: only four English-language systematic reviews on a single clinical topic (interventions to reduce medication administration errors), limiting generalizability to broader or more complex evidence bases.
  • Subjective JBI items, session- and prompt-sensitive LLM outputs, single-timepoint inference testing, and evaluation limited to paid model versions constrain real-world applicability and external validity.

Bottom Line

Commercial LLMs can accelerate simple pre-screening of systematic reviews by extracting explicit, text-findable checklist items, but they are not reliable for the methodological judgments that determine whether evidence is safe to adopt in pharmacy practice; expert human re-review remains mandatory.