Quick Take

  • PharmacyGPT, which applies an iterative few-shot prompt optimization approach to GPT-4, predicted hospital mortality with accuracy 0.75 (precision 0.37; recall 0.70) and generated 24-hour medication regimens; lexical overlap with the actual 24-hour orders was low (ROUGE-1 0.07).
  • Findings indicate potential for PharmacyGPT to prioritize pharmacist review of higher-risk ICU patients and to draft structured order-set candidates, but outputs require human validation and medication-specific evaluation metrics before any workflow integration.

Why It Matters

  • ICU medication management is high-risk and rapidly changing: critically ill patients commonly receive 13–20 medications, doses can be titrated minute-to-minute, and organ dysfunction plus narrow therapeutic indices increase the risk of harm. Alphanumeric, nonstandard EHR medication strings make accurate machine parsing of drug, dose, route, and frequency challenging.
  • Adverse drug events (ADEs) are common, costly, and largely preventable; critical care pharmacists reduce medication-related harm, but staffing and budget constraints limit coverage. Progress in medication-focused AI is slowed by the scarcity of high-quality ICU datasets and the need for domain-specific engineering.
  • Medication-aware clinical decision support (CDS) that supports stewardship and risk triage could help stretched pharmacy teams target comprehensive medication management (CMM) where it is most needed.

What They Did

  • Retrospective cohort of 1,000 adult ICU patients (first ICU admission, ≥24 h) sampled from a single-center EHR (Carolina Data Warehouse).
  • Developed PharmacyGPT by applying GPT-4 with a dynamic, iterative few-shot prompt optimization loop: prompts were adjusted based on prior outputs; no model fine-tuning or retraining was performed.
  • Generated patient clusters using GPT-3.5 embeddings and hierarchical clustering, then used GPT-4 to produce 24-hour medication regimens and to predict hospital mortality and APACHE II score ranges.
  • Evaluated generated regimens against the actual 24-hour orders using ROUGE (n-gram overlap), and assessed prediction performance with accuracy, precision, recall, and F1.
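
To make the ROUGE comparison concrete, the sketch below computes ROUGE-N from clipped n-gram counts in plain Python. It is an illustration under stated assumptions, not the study's evaluation code: the function name `rouge_n`, the lowercase whitespace tokenization, and the two regimen strings are all hypothetical, and the summary does not say whether the reported scores are recall or F-measure variants.

```python
from collections import Counter


def rouge_n(candidate: str, reference: str, n: int = 1) -> dict:
    """Return ROUGE-N precision, recall, and F1 from clipped n-gram counts."""

    def ngrams(text: str) -> Counter:
        # Assumption: lowercase whitespace tokenization.
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    # Counter intersection keeps the minimum count per n-gram (clipping).
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical regimen strings, invented for illustration only.
generated = "norepinephrine propofol fentanyl heparin"
actual = "norepinephrine fentanyl insulin pantoprazole vancomycin"
scores = rouge_n(generated, actual, n=1)
```

Two of the four generated drug names appear among the five actual ones, so this toy pair gives precision 0.5 and recall 0.4. A score built this way rewards matching tokens only; it says nothing about whether a dose or route is clinically appropriate.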

What They Found

  • PharmacyGPT predicted hospital mortality with accuracy 0.75, precision 0.37, and recall 0.70; class imbalance (only 46 deceased cases in the test set) materially reduced precision and F1.
  • APACHE II range prediction performed poorly (accuracy ~0.22–0.27 across prompts and models; F1 ≈ 0.14–0.23), indicating limited illness-severity discrimination from the supplied inputs.
  • 24-hour medication-regimen generation showed low lexical overlap with clinical orders (PharmacyGPT ROUGE-1 0.07, ROUGE-2 0.01, ROUGE-L 0.05; GPT-4/ChatGPT zero-shot ROUGE-1 0.04) despite clinician reviewers judging syntax and route/dose formatting generally plausible.
  • Unsupervised grouping produced 11 interpretable patient clusters aligned with ICD-10 categories, suggesting feasible medication-focused cohorting to help prioritize pharmacist review and targeted CMM workflows.
  • The mechanisms driving the observed improvements over zero-shot prompting (for example, ROUGE-1 0.07 versus 0.04) were not fully delineated and warrant further analysis.
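
As a quick check on how the mortality metrics fit together, F1 is the harmonic mean of precision and recall. The short sketch below applies that standard formula to the rounded values reported above; the helper name `f1_score` is ours, and the result is the F1 implied by those rounded figures rather than a number quoted from the study.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded mortality-prediction values reported in the summary.
implied_f1 = f1_score(0.37, 0.70)
print(round(implied_f1, 2))  # 0.48
```

Because the harmonic mean is pulled toward the smaller of its two inputs, the low precision holds F1 below 0.5 even though recall is 0.70.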

Takeaways

  • Deploy advisory-only ICU triage worklists in the EHR that leverage the ICD-10–aligned clusters to prioritize pharmacist-led comprehensive medication management; tune alert thresholds to daily pharmacist capacity.
  • Use the regimen generator as a drafting aid: convert outputs into tabular drug/dose/route/frequency fields, compare them with actual 24-hour orders, and surface discrepancies for pharmacist review and sign-off; do not use it for autonomous ordering.
  • Operationalize prompt engineering: maintain a prompt library with few-shot exemplars from demographically and diagnostically similar cases, log high- and low-quality exemplars, and standardize medication strings (dose, units, frequency) in the model input to stabilize outputs.
  • Adopt a governance-first rollout: treat PharmacyGPT as a screening tool that highlights patients for pharmacist attention (a “metal detector” analogy), and implement clear use policies, audit logs, safety monitoring (including Medication Regimen Complexity–ICU [MRC-ICU] checks and correctness audits), and documented pharmacist override workflows.
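
The drafting-aid workflow (structure the output, compare it with actual orders, surface discrepancies) can be sketched as follows. Everything here is hypothetical: the one-line format "<drug> <dose> <unit> <route> <frequency>", the names `Order`, `parse_order`, and `find_discrepancies`, and the example orders, which are invented for illustration and are not clinical recommendations. Real EHR medication strings are far less regular, so lines that fail to parse are flagged for manual review instead of being guessed at.

```python
import re
from typing import NamedTuple, Optional


class Order(NamedTuple):
    drug: str
    dose: float
    unit: str
    route: str
    frequency: str


# Assumed line format: "<drug> <dose> <unit> <route> <frequency>".
ORDER_PATTERN = re.compile(
    r"^(?P<drug>[A-Za-z][A-Za-z \-]*?)\s+(?P<dose>\d+(?:\.\d+)?)\s*"
    r"(?P<unit>[A-Za-z/]+)\s+(?P<route>[A-Za-z]+)\s+(?P<frequency>\S+)$"
)


def parse_order(line: str) -> Optional[Order]:
    """Parse one medication string; return None if it does not fit the format."""
    match = ORDER_PATTERN.match(line.strip())
    if match is None:
        return None
    return Order(
        drug=match["drug"].lower(),
        dose=float(match["dose"]),
        unit=match["unit"].lower(),
        route=match["route"].upper(),
        frequency=match["frequency"].lower(),
    )


def find_discrepancies(draft: list[str], actual: list[str]) -> list[str]:
    """List differences between drafted and actual orders for pharmacist review."""
    findings = []
    parsed = {"draft": {}, "actual": {}}
    for label, lines in (("draft", draft), ("actual", actual)):
        for line in lines:
            order = parse_order(line)
            if order is None:
                findings.append(f"unparseable {label} line: {line!r}")
            else:
                parsed[label][order.drug] = order
    for drug in sorted(parsed["draft"].keys() | parsed["actual"].keys()):
        d, a = parsed["draft"].get(drug), parsed["actual"].get(drug)
        if a is None:
            findings.append(f"{drug}: in draft only")
        elif d is None:
            findings.append(f"{drug}: in actual orders only")
        else:
            # Compare every field after the drug name.
            for field in Order._fields[1:]:
                if getattr(d, field) != getattr(a, field):
                    findings.append(
                        f"{drug}: {field} differs "
                        f"(draft {getattr(d, field)}, actual {getattr(a, field)})"
                    )
    return findings


# Invented example orders; not clinical guidance.
draft = ["vancomycin 1000 mg IV q12h", "propofol 20 mcg/kg/min IV continuous"]
actual = ["Vancomycin 1250 mg IV q12h", "heparin 5000 units SC q8h"]
findings = find_discrepancies(draft, actual)
```

For these example lists, `findings` flags heparin as present only in the actual orders, propofol as present only in the draft, and a dose mismatch for vancomycin. Keeping the comparison field by field makes each discrepancy something a pharmacist can accept or override individually, which also suits the audit-log requirement.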

Strengths and Limitations

Strengths:

  • Clear iterative, dynamic few-shot prompting with feedback loops; produced interpretable patient clusters consistent with ICD-10 categories.
  • Used ICU EHR data containing granular medication elements (drug, dose, route, timing), and clinician review confirmed generally plausible formatting and syntax of generated regimens.

Limitations:

  • Single-center, retrospective cohort with limited inputs (no laboratory results, vital sign time series, or fine-grained temporal data) and pronounced mortality-class imbalance (46 deaths), which reduces generalizability and degrades precision.
  • Medication-regimen evaluation relied on ROUGE applied to drug names only; variable medication string formats and the absence of a prospective, expert-validated ground truth limit assessment of clinical appropriateness.
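
The link between class imbalance and low precision follows from Bayes' rule: for a fixed sensitivity and specificity, precision falls as the positive class becomes rarer. The sketch below illustrates this with the reported recall of 0.70 as sensitivity; the specificity of 0.80 and both prevalence values are assumptions chosen for illustration, since the summary does not report specificity or the test-set size.

```python
def precision_at_prevalence(sensitivity: float, specificity: float,
                            prevalence: float) -> float:
    """Positive predictive value for a given class prevalence (Bayes' rule)."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    if true_pos + false_pos == 0:
        return 0.0
    return true_pos / (true_pos + false_pos)


# Sensitivity 0.70 is the reported recall; specificity 0.80 is assumed.
balanced = precision_at_prevalence(0.70, 0.80, prevalence=0.50)
imbalanced = precision_at_prevalence(0.70, 0.80, prevalence=0.10)
print(round(balanced, 2), round(imbalanced, 2))  # 0.78 0.28
```

The same classifier goes from roughly 0.78 precision on a balanced set to 0.28 when only one patient in ten is positive, which is why precision from an imbalanced cohort should be read alongside the event rate.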

Bottom Line

PharmacyGPT is ready for supervised pilot use as an advisory triage and drafting aid to prioritize pharmacist review, but it is not appropriate for autonomous ordering without structured validation, prospective evaluation, and medication-specific performance metrics.