Quick Take

  • Large language models (LLMs) showed task-dependent performance on a clinician-curated DDI benchmark (AIChemist): GPT-4o-mini achieved 86.6% accuracy in constrained pairwise (3-drug) choices (≈93.8% on Category X) but fell to 71.3% on listwise (4–6 drug) identification; the best pointwise (2-drug) F1 was 59.7% (LLaMA3-70B).
  • Practical implication: LLMs are most useful for constrained A/B discrimination or as an adjunctive comparator. Retain rules-based clinical decision support (CDS) and pharmacist verification as the primary safety net because listwise accuracy (≈69–80%) and run-to-run stability reported here are insufficient for autonomous primary screening.

Why it Matters

  • DDI screening is a core medication-safety task in inpatient pharmacy; polypharmacy and complex profiles make isolating the single clinically actionable pair difficult during order verification.
  • Rules-based references (e.g., LexiDrug) produce deterministic outputs for interactions contained in their database, but LLMs are being considered as context-aware filters — the key question is whether they can reliably identify and characterize interactions across realistic, multi-drug profiles rather than only isolated pairs.
  • Given finite pharmacist capacity and frequent alerts, deploying models that are variable or inconsistent risks adding noise and reducing safety; severity-specific, listwise validation and repeatability checks are therefore essential before clinical integration.

What They Did

  • Created AIChemist: a clinician-curated benchmark of 750 unique DDI scenarios (3 × 250), reviewed by three board-certified pharmacists and mapped to LexiDrug severity categories A–X.
  • Evaluated three LLMs (GPT-4o-mini, MedGemma-27B, LLaMA3-70B) across three judgment formats reflecting increasing clinical complexity: pointwise (2-drug classification), pairwise (3-drug discrimination), and listwise (4–6 drug selection), using standardized zero-shot prompts and strict output formats (A/B tokens or JSON).
  • Quantified reliability by querying each unique prompt nine times (temperature=0.7), applied label shuffling and balanced negative sampling to reduce position and acquiescence bias, and reported case-level accuracy, pointwise precision/recall/F1, and self-consistency (the proportion of cases for which all nine runs were correct).
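The accuracy and self-consistency metrics described above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code; the data layout (tuples of case ID, prediction, gold label) is an assumption:

```python
from collections import defaultdict

def score_runs(runs):
    """Compute run-level accuracy and case-level self-consistency.

    runs: list of (case_id, predicted_label, gold_label) tuples,
    with multiple entries per case_id (e.g. 9 repeated queries).
    """
    by_case = defaultdict(list)
    for case_id, pred, gold in runs:
        by_case[case_id].append(pred == gold)

    # Accuracy: fraction of individual runs that are correct.
    all_flags = [f for flags in by_case.values() for f in flags]
    accuracy = sum(all_flags) / len(all_flags)

    # Self-consistency: fraction of cases where *every* repeated
    # run matched the gold label (the stricter criterion).
    consistent = sum(all(flags) for flags in by_case.values())
    self_consistency = consistent / len(by_case)
    return accuracy, self_consistency

# Toy example: 2 cases x 3 runs each.
runs = [
    ("case1", "A", "A"), ("case1", "A", "A"), ("case1", "A", "A"),
    ("case2", "B", "A"), ("case2", "A", "A"), ("case2", "A", "A"),
]
acc, sc = score_runs(runs)  # acc = 5/6, self-consistency = 0.5
```

The gap between the two numbers is the point: a model can look accurate on average while still flipping answers across identical prompts, which is exactly the instability the pointwise results exposed.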

What They Found

  • Pointwise (2-drug) performance was modest: LLaMA3-70B had the highest F1 (59.7%) with recall ≈61.7%; GPT-4o-mini showed higher precision (~65.7%) and MedGemma-27B had the lowest recall (~38.2%). Self-consistency across nine repeats was low (≈31–43%), indicating frequent instability across identical prompts in the open classification setting.
  • Pairwise (3-drug) discrimination improved substantially: GPT-4o-mini reached 86.6% accuracy and ≈93.8% accuracy on Category X cases. Pairwise self-consistency was much higher (roughly 81–87%), showing that constraining the answer space improves reliability.
  • Listwise (4–6 drug) accuracy declined: models ranged ≈68.6–80.0% overall (GPT-4o-mini 71.3%, MedGemma 80.0%, LLaMA3 68.6%), with the weakest detection for Category C interactions (≈58–74%). Depending on the model, roughly one in five to one in four Category X pairs were missed in the more complex lists (≈20–26% missed).
  • Net finding: LLMs show task-dependent strengths (strong when the answer space is constrained) but uneven accuracy and notable run-to-run instability as medication complexity increases, so they are not yet reliable as autonomous primary DDI screeners without additional engineering and local validation.
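The pointwise metrics above are linked by the standard F1 definition (harmonic mean of precision and recall), which allows a quick sanity check. The implied precision below is a back-of-envelope derivation from the reported F1 and recall, not a figure stated in the study:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def precision_from_f1(f1_score, recall):
    """Invert F1 = 2PR/(P+R) to recover precision given recall."""
    return f1_score * recall / (2 * recall - f1_score)

# LLaMA3-70B pointwise: F1 = 59.7% at recall ~61.7% implies
# precision of roughly 57.8% under the standard definition.
p = precision_from_f1(0.597, 0.617)
```

That implied precision sits below GPT-4o-mini's ~65.7%, consistent with the summary's picture of a precision/recall trade-off across models in the open classification setting.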

Takeaways

  • Treat LLM outputs as smart comparators or triage signals rather than authoritative alerts — keep rules-based CDS and pharmacist judgment as the primary safety layer.
  • Constrained-choice workflows (A vs B) are the most promising near-term use case because they yield higher accuracy and much greater run-to-run consistency than open classification or list scanning.
  • Expect the most noise around moderate-risk (Category C) interactions; in complex lists, use LLM outputs to prioritize review rather than to make definitive treatment changes.
  • Before piloting LLM-augmented workflows, require site-specific validation: measure severity-specific listwise accuracy, repeatability across runs, and handling of local formulary differences and outliers.
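The site-specific validation called for above could, for instance, report severity-stratified accuracy with percentile-bootstrap confidence intervals (the study itself used bootstrap CIs). A minimal sketch; the severity-stratified data here is hypothetical:

```python
import random

def bootstrap_accuracy_ci(correct_flags, n_boot=2000, alpha=0.05, seed=0):
    """Point estimate and percentile-bootstrap CI for accuracy
    over a list of 0/1 correctness flags."""
    rng = random.Random(seed)
    n = len(correct_flags)
    stats = sorted(
        sum(rng.choices(correct_flags, k=n)) / n for _ in range(n_boot)
    )
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return sum(correct_flags) / n, (lo, hi)

# Hypothetical local validation results, stratified by LexiDrug category:
by_severity = {
    "X": [1, 1, 0, 1, 1, 1, 0, 1, 1, 1],
    "D": [1, 0, 1, 1, 0, 1, 1, 1, 0, 1],
    "C": [1, 0, 0, 1, 1, 0, 1, 0, 1, 1],
}
for cat, flags in by_severity.items():
    acc, (lo, hi) = bootstrap_accuracy_ci(flags)
    print(f"Category {cat}: {acc:.0%} (95% CI {lo:.0%}-{hi:.0%})")
```

Stratifying by severity matters because aggregate accuracy can mask exactly the weak spots the study found (Category C in complex lists); wide CIs on a small local sample are themselves a signal that the pilot dataset is too thin to support a deployment decision.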

Strengths and Limitations

Strengths:

  • Clinician-curated AIChemist benchmark (750 scenarios) reviewed by three board-certified pharmacists and mapped to LexiDrug — increases clinical relevance and provides a pharmacist-reviewed ground truth for benchmarking.
  • Multi-format evaluation with strict outputs, label shuffling, nine repeated runs per prompt, balanced negative sampling, and bootstrap confidence intervals — enabling measurement of both accuracy and stability across judgment formats.

Limitations:

  • Dataset size is modest by LLM standards, and prompts contain medications only (no laboratory values, comorbidities, or other patient context), limiting direct EHR generalizability.
  • Evaluation used zero-shot prompting only (temperature=0.7) and did not test retrieval-augmented generation (RAG), fine-tuning, or advanced prompt engineering; the study also relied on LexiDrug as the sole severity reference, which may differ from other compendia.

Bottom Line

LLMs are a promising adjunct for constrained A/B discrimination and triage but are not yet reliable for autonomous DDI screening. Before any clinical pilot, require severity-specific listwise accuracy and repeatability (self-consistency) checks on local formulary data, and retain deterministic rules-based CDS plus pharmacist verification as the primary safety net.