Bottom Line
Ong et al. (2025) demonstrated that a GPT-4-powered CDSS, when paired with a pharmacist, improved detection of prescribing errors, especially high-severity ones, although overall accuracy was only ~61%.
The LLM “co-pilot” is promising but not foolproof. It should be piloted cautiously, with robust oversight, validation, and integration into workflows.
Key Results
- Co-Pilot Best Performance: The pharmacist + LLM co-pilot achieved the highest accuracy (~61%) for error detection, outperforming both the LLM alone and the pharmacist alone.
- Serious-Harm Errors: The co-pilot caught ~1.5× more high-severity errors than the pharmacist alone.
- LLM Alone: Reasonable but inferior to the co-pilot; recall improved from 0.47 (LLM alone) to 0.61 with pharmacist oversight.
- Precision Trade-off: The co-pilot had lower precision (more false positives) than the pharmacist alone.
- Comparative LLMs: GPT-4 with RAG performed best among the models tested.
Methods Snapshot
Design: Prospective, cross-over simulation study using 91 prescribing-error scenarios derived from 40 clinical vignettes across 16 medical/surgical specialties. A multidisciplinary expert panel established the ground truth by classifying drug-related problems (DRPs) with the PCNE classification and rating error severity with the NCC MERP harm index.
Intervention: Five LLM-based models were developed using a retrieval-augmented generation (RAG) approach (a minimal sketch of the pattern appears below). The best-performing model (GPT-4 with RAG) was tested in three modes:
- LLM-CDSS alone
- Pharmacist + LLM-CDSS (co-pilot)
- Pharmacist alone
Junior hospital pharmacists (≤2 years' experience) participated in the co-pilot and solo arms.
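The paper's pipeline is not reproduced here, so the following is only a minimal sketch of the general RAG pattern it describes: retrieve relevant guideline text, then ask the model to screen the order against it. It assumes the official OpenAI Python client; the guideline snippets, function names, and prompt are hypothetical illustrations, not the authors' implementation.

```python
# Minimal RAG sketch (hypothetical; NOT the study's pipeline).
# Assumes `pip install openai` and OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

# Toy in-memory guideline store; a real system would query a curated
# drug-knowledge index with vector search.
GUIDELINES = [
    "Warfarin + NSAIDs: increased bleeding risk; avoid combination or monitor INR closely.",
    "Metformin: contraindicated when eGFR < 30 mL/min/1.73m2.",
    "Allopurinol: reduce dose in renal impairment; start at 100 mg daily.",
]

def retrieve(order_text: str, k: int = 2) -> list[str]:
    """Naive keyword-overlap retrieval; stands in for a real vector search."""
    terms = set(order_text.lower().split())
    ranked = sorted(GUIDELINES, key=lambda g: -len(terms & set(g.lower().split())))
    return ranked[:k]

def screen_order(order_text: str) -> str:
    """Ask the model to flag DRPs, grounded in the retrieved excerpts."""
    context = "\n".join(retrieve(order_text))
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "You are a medication-safety reviewer. Using ONLY the "
                        "guideline excerpts provided, list any drug-related "
                        "problems in the order, or reply 'No DRP identified'."},
            {"role": "user",
             "content": f"Guideline excerpts:\n{context}\n\nOrder:\n{order_text}"},
        ],
    )
    return resp.choices[0].message.content

print(screen_order("Start metformin 1000 mg BID; eGFR 24 mL/min/1.73m2."))
```

Grounding the prompt in retrieved text, rather than relying on the model's parametric memory, is what distinguishes the RAG approach from plain prompting and is the plausible reason it outperformed the other tested variants.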
Outcome Measures: The primary outcome was accuracy in identifying DRPs. Secondary metrics included precision, recall, and F1 score (illustrated in the sketch below). The researchers also recorded performance on serious-harm errors and differences among the underlying LLMs.
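For readers less familiar with the secondary metrics, the sketch below shows how precision, recall, and F1 follow from raw true-positive (TP), false-positive (FP), and false-negative (FN) counts. The counts are invented for illustration, not the study's data.

```python
# Precision/recall/F1 from raw counts; the numbers are illustrative only.
def prf1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    precision = tp / (tp + fp)   # of items flagged, how many were real DRPs
    recall = tp / (tp + fn)      # of real DRPs, how many were caught
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Hypothetical example: 91 true DRPs, 70 items flagged, 55 of them correct.
p, r, f = prf1(tp=55, fp=15, fn=36)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
# precision=0.79 recall=0.60 f1=0.68
```

The precision trade-off noted above falls directly out of these definitions: flagging more candidate errors raises TP and recall, but any extra FP pulls precision down.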
Background & Rationale
Medication errors are a major patient safety concern, often arising from complex regimens and cognitive overload. Traditional rule-based alerts (e.g., drug-interaction pop-ups in EHRs) have high false-alert rates and limited context awareness, leading to alert fatigue.
LLMs (e.g., GPT-4) can interpret free-text clinical scenarios and guidelines, raising interest in using them to catch errors that rigid rule systems might miss. An LLM-based “co-pilot” could review medication orders with contextual understanding, potentially reducing serious errors.
Investigators hypothesized that an LLM CDSS, especially when paired with a pharmacist, would augment error detection beyond what either could achieve alone. This stems from the idea that AI can rapidly cross-check vast medical knowledge, while humans provide judgment and oversight.
Strengths / Limitations
Strengths
- Prospective, cross-specialty design (16 specialties).
- Realistic co-pilot workflow tested.
- Comprehensive metrics with expert-validated gold standard.
Limitations
- Simulated cases only, not live patients.
- Only junior pharmacists were tested; generalizability to experienced clinicians is unclear.
- Moderate accuracy (~61% overall).
- False positives may cause alert fatigue.
- Single-center (Singapore); limited generalizability.
AI-Specific Considerations
- Hallucinations: Risk of plausible but incorrect AI recommendations.
- Prompt Sensitivity: Performance depends on how queries are structured.
- Knowledge Updates: Models risk becoming outdated; RAG helps but requires continuous validation (see the sketch after this list).
- Black-Box Reasoning: Lack of transparency raises trust and regulatory concerns.
- Bias: Training data may underrepresent populations; calibration needed.
- Data Privacy: Sending protected health information to cloud APIs raises HIPAA concerns.
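One way to make the continuous-validation point concrete is a regression harness that re-scores every new model, prompt, or retrieval-corpus version against the expert-labeled vignettes before it ships. A minimal sketch, assuming a hypothetical gold-standard JSON file and a caller-supplied screening function (e.g., the `screen_order` sketch above):

```python
# Regression harness for continuous validation (hypothetical file format).
import json
from typing import Callable

def regression_accuracy(screen_fn: Callable[[str], str],
                        gold_path: str,
                        min_accuracy: float = 0.60) -> float:
    """Score the current pipeline against labeled cases; fail loudly on regression."""
    with open(gold_path) as fh:
        cases = json.load(fh)  # e.g. [{"order": "...", "has_drp": true}, ...]
    hits = sum(
        ("no drp" not in screen_fn(case["order"]).lower()) == case["has_drp"]
        for case in cases
    )
    accuracy = hits / len(cases)
    assert accuracy >= min_accuracy, f"Pipeline regressed: accuracy={accuracy:.2f}"
    return accuracy
```

Running such a check on every model or prompt change turns "needs constant validation" from a caveat into an enforceable release gate.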
Related Literature Synthesis
- Pais et al., 2024 (Nat Med): LLM MEDIC reduced dispensing instruction errors by ~33%. Domain-specific fine-tuning + guardrails lowered false positives.
- Roosan et al., 2024 (J Am Pharm Assoc): GPT-4 solved 39/39 MTM cases but lacked precise dosing recommendations.
- Grossman et al., 2023 (ASHP Abstract): The free version of ChatGPT gave inaccurate or incomplete answers to 74% of drug queries and fabricated references.
Clinical/Operational Question
Key Question: Can a large language model (LLM)-powered clinical decision support system (CDSS) improve detection of prescribing errors (drug-related problems) compared to standard pharmacist review, thereby enhancing medication safety in a hospital setting?
Applicability to Pharmacy Informatics
- Epic Willow Integration: Could function at the order entry/verification steps via CDS Hooks (see the sketch after this list).
- Pyxis ES: Potential for AI alerts to propagate to dispensing checks.
- Validation Burden: Continuous validation required for every update or version change.
- Data Feeds: Needs comprehensive HL7/FHIR feeds for real-time integration.
- GPU/Cloud Requirements: Deployment decisions between on-premises vs. cloud APIs.
- Regulatory Fit: FDA guidance is uncertain; the lack of transparency may invite tighter regulatory scrutiny.
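To make the CDS Hooks point concrete, the sketch below builds an alert card for the order-sign hook. The card shape (summary, indicator, detail, source) follows the public CDS Hooks specification; the service wiring, helper names, and severity mapping are hypothetical, not from the study.

```python
# Sketch of how an LLM-flagged DRP could surface in the EHR via CDS Hooks.
def drp_card(summary: str, detail: str, severity: str) -> dict:
    """Build a CDS Hooks card; 'indicator' drives the alert styling in the EHR."""
    indicator = {"high": "critical", "moderate": "warning"}.get(severity, "info")
    return {
        "summary": summary[:140],   # the spec caps summary at 140 characters
        "indicator": indicator,
        "detail": detail,           # markdown shown when the card is expanded
        "source": {"label": "LLM-CDSS pilot (pharmacist co-pilot)"},
    }

def order_sign_response(llm_findings: list[dict]) -> dict:
    """JSON body a CDS service would return for an order-sign invocation."""
    return {"cards": [drp_card(f["summary"], f["detail"], f["severity"])
                      for f in llm_findings]}

print(order_sign_response([{
    "summary": "Metformin ordered with eGFR 24: contraindicated",
    "detail": "Guideline: metformin is contraindicated when eGFR < 30 mL/min/1.73m2.",
    "severity": "high",
}]))
```

In a real deployment the order-sign request would carry the draft MedicationRequest resources as FHIR, which is where the HL7/FHIR data-feed requirement above comes in.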
Evidence Quality & Recommendation (GRADE-style)
- Quality of Evidence: Low to Moderate (simulation only, modest sample, evolving technology).
- Recommendation: Conditional (weak). Consider pilot testing in controlled environments with strong safeguards; not recommended as a widespread replacement for pharmacist judgment.
Citation
Ong JCL, Jin L, Elangovan K, et al. Large language model as clinical decision support system augments medication safety in 16 clinical specialties. Cell Rep Med. 2025;6:102323. doi:10.1016/j.xcrm.2025.102323. [PMID: 40997804]