Quick Take

  • Large language models (LLMs) showed task-dependent performance on a clinician-curated DDI benchmark (AIChemist): GPT-4o-mini achieved 86.6% accuracy in constrained pairwise (3-drug) choices (≈93.8% on Category X) but fell to 71.3% on listwise (4–6 drug) identification; the best pointwise (2-drug) F1 was 59.7% (LLaMA3-70B).
  • Practical implication: LLMs are most useful for constrained A/B discrimination or as an adjunctive comparator. Retain rules-based clinical decision support (CDS) and pharmacist verification as the primary safety net because listwise accuracy (≈69–80%) and run-to-run stability reported here are insufficient for autonomous primary screening.

Why it Matters

  • DDI screening is a core medication-safety task in inpatient pharmacy; polypharmacy and complex profiles make isolating the single clinically actionable pair difficult during order verification.
  • Rules-based references (e.g., LexiDrug) produce deterministic outputs for interactions contained in their database, but LLMs are being considered as context-aware filters — the key question is whether they can reliably identify and characterize interactions across realistic, multi-drug profiles rather than only isolated pairs.
  • Given finite pharmacist capacity and frequent alerts, deploying models that are variable or inconsistent risks adding noise and reducing safety; severity-specific, listwise validation and repeatability checks are therefore essential before clinical integration.

What They Did

  • Created AIChemist: a clinician-curated benchmark of 750 unique DDI scenarios (3 × 250), reviewed by three board-certified pharmacists and mapped to LexiDrug severity categories A–X.
  • Evaluated three LLMs (GPT-4o-mini, MedGemma-27B, LLaMA3-70B) across three judgment formats reflecting increasing clinical complexity: pointwise (2-drug classification), pairwise (3-drug discrimination), and listwise (4–6 drug selection), using standardized zero-shot prompts and strict output formats (A/B tokens or JSON).
  • Quantified reliability by querying each unique prompt nine times (temperature=0.7), applied label shuffling and balanced negative sampling to reduce position and acquiescence bias, and reported case-level accuracy, pointwise precision/recall/F1, and self-consistency (the proportion of cases for which all nine runs were correct).
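The accuracy and self-consistency metrics described above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code; the data layout (tuples of case ID, prediction, gold label) is an assumption:

```python
from collections import defaultdict

def score_runs(runs):
    """Compute run-level accuracy and case-level self-consistency.

    runs: list of (case_id, predicted_label, gold_label) tuples,
    with multiple entries per case_id (e.g. 9 repeated queries).
    """
    by_case = defaultdict(list)
    for case_id, pred, gold in runs:
        by_case[case_id].append(pred == gold)

    # Accuracy: fraction of individual runs that are correct.
    all_flags = [f for flags in by_case.values() for f in flags]
    accuracy = sum(all_flags) / len(all_flags)

    # Self-consistency: fraction of cases where *every* repeated
    # run matched the gold label (the stricter criterion).
    consistent = sum(all(flags) for flags in by_case.values())
    self_consistency = consistent / len(by_case)
    return accuracy, self_consistency

# Toy example: 2 cases x 3 runs each.
runs = [
    ("case1", "A", "A"), ("case1", "A", "A"), ("case1", "A", "A"),
    ("case2", "B", "A"), ("case2", "A", "A"), ("case2", "A", "A"),
]
acc, sc = score_runs(runs)  # acc = 5/6, self-consistency = 0.5
```

The gap between the two numbers is the point: a model can look accurate on average while still flipping answers across identical prompts, which is exactly the instability the pointwise results exposed.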

What They Found

  • Pointwise (2-drug) performance was modest: LLaMA3-70B had the highest F1 (59.7%) with recall ≈61.7%; GPT-4o-mini showed higher precision (~65.7%) and MedGemma-27B had the lowest recall (~38.2%). Self-consistency across nine repeats was low (≈31–43%), indicating frequent instability across identical prompts in the open classification setting.
  • Pairwise (3-drug) discrimination improved substantially: GPT-4o-mini reached 86.6% accuracy and ≈93.8% accuracy on Category X cases. Pairwise self-consistency was much higher (roughly 81–87%), showing that constraining the answer space improves reliability.
  • Listwise (4–6 drug) accuracy declined: models ranged ≈68.6–80.0% overall (GPT-4o-mini 71.3%, MedGemma 80.0%, LLaMA3 68.6%), with the weakest detection for Category C interactions (≈58–74%). Depending on the model, roughly one in five to one in four Category X pairs were missed in the more complex lists (≈20–26% missed).
  • Net finding: LLMs show task-dependent strengths (strong when the answer space is constrained) but uneven accuracy and notable run-to-run instability as medication complexity increases, so they are not yet reliable as autonomous primary DDI screeners without additional engineering and local validation.
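The pointwise metrics above are linked by the standard F1 definition (harmonic mean of precision and recall), which allows a quick sanity check. The implied precision below is a back-of-envelope derivation from the reported F1 and recall, not a figure stated in the study:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def precision_from_f1(f1_score, recall):
    """Invert F1 = 2PR/(P+R) to recover precision given recall."""
    return f1_score * recall / (2 * recall - f1_score)

# LLaMA3-70B pointwise: F1 = 59.7% at recall ~61.7% implies
# precision of roughly 57.8% under the standard definition.
p = precision_from_f1(0.597, 0.617)
```

That implied precision sits below GPT-4o-mini's ~65.7%, consistent with the summary's picture of a precision/recall trade-off across models in the open classification setting.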

Takeaways

  • Treat LLM outputs as smart comparators or triage signals rather than authoritative alerts — keep rules-based CDS and pharmacist judgment as the primary safety layer.
  • Constrained-choice workflows (A vs B) are the most promising near-term use case because they yield higher accuracy and much greater run-to-run consistency than open classification or list scanning.
  • Expect the most noise around moderate-risk (Category C) interactions; in complex lists, use LLM outputs to prioritize review rather than to make definitive treatment changes.
  • Before piloting LLM-augmented workflows, require site-specific validation: measure severity-specific listwise accuracy, repeatability across runs, and handling of local formulary differences and outliers.
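The site-specific validation called for above could, for instance, report severity-stratified accuracy with percentile-bootstrap confidence intervals (the study itself used bootstrap CIs). A minimal sketch; the severity-stratified data here is hypothetical:

```python
import random

def bootstrap_accuracy_ci(correct_flags, n_boot=2000, alpha=0.05, seed=0):
    """Point estimate and percentile-bootstrap CI for accuracy
    over a list of 0/1 correctness flags."""
    rng = random.Random(seed)
    n = len(correct_flags)
    stats = sorted(
        sum(rng.choices(correct_flags, k=n)) / n for _ in range(n_boot)
    )
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return sum(correct_flags) / n, (lo, hi)

# Hypothetical local validation results, stratified by LexiDrug category:
by_severity = {
    "X": [1, 1, 0, 1, 1, 1, 0, 1, 1, 1],
    "D": [1, 0, 1, 1, 0, 1, 1, 1, 0, 1],
    "C": [1, 0, 0, 1, 1, 0, 1, 0, 1, 1],
}
for cat, flags in by_severity.items():
    acc, (lo, hi) = bootstrap_accuracy_ci(flags)
    print(f"Category {cat}: {acc:.0%} (95% CI {lo:.0%}-{hi:.0%})")
```

Stratifying by severity matters because aggregate accuracy can mask exactly the weak spots the study found (Category C in complex lists); wide CIs on a small local sample are themselves a signal that the pilot dataset is too thin to support a deployment decision.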

Strengths and Limitations

Strengths:

  • Clinician-curated AIChemist benchmark (750 scenarios) reviewed by three board-certified pharmacists and mapped to LexiDrug — increases clinical relevance and provides a pharmacist-reviewed ground truth for benchmarking.
  • Multi-format evaluation with strict outputs, label shuffling, nine repeated runs per prompt, balanced negative sampling, and bootstrap confidence intervals — enabling measurement of both accuracy and stability across judgment formats.

Limitations:

  • Dataset size is modest by LLM standards, and prompts contain medications only (no laboratory values, comorbidities, or other patient context), limiting direct EHR generalizability.
  • Evaluation used zero-shot prompting only (temperature=0.7) and did not test retrieval-augmented generation (RAG), fine-tuning, or advanced prompt engineering; the study also relied on LexiDrug as the sole severity reference, which may differ from other compendia.

Bottom Line

LLMs are a promising adjunct for constrained A/B discrimination and triage but are not yet reliable for autonomous DDI screening. Before any clinical pilot, require severity-specific listwise accuracy and repeatability (self-consistency) checks on local formulary data, and retain deterministic rules-based CDS plus pharmacist verification as the primary safety net.