Quick Take
- A reasoning LLM (OpenAI o4-mini) matched expert pharmacist consensus on binary "medication-related or not" classification in 96.0% of validation cases (192/200; 95% CI 92.3–98.3). Medication-process subcategorization concordance was 76.5% (153/200).
- Suggests potential for automated first-pass triage of incident-report free text that could reduce manual sorting burden for medication-safety pharmacists, while still requiring pharmacist QA for ambiguous items and local validation before deployment.
Why it Matters
- Medication errors are a major patient-safety concern, but incident reporting systems capture only a small fraction of them, and useful signals are buried in messy, high-volume free-text narratives.
- Local classification practices are often subjective and inconsistent, making trending and cross-unit learning unreliable and forcing pharmacists into time-intensive reclassification work at scale.
- Scalable, consistent event tagging is a prerequisite to focus scarce medication-safety resources on investigation, root-cause analysis (RCA), and system fixes, and to align surveillance with stewardship and governance priorities.
What They Did
- Developed prompt-based adaptations of OpenAI’s reasoning model o4-mini using ~75,000 anonymized incident reports from the Västmanland region (2019–2024), with a 2,434-case pharmacist-reclassified subset used to encode tacit pharmacist heuristics and edge-case rules.
- Hosted o4-mini on Microsoft Azure and implemented a paired classifier + "critic" prompt workflow. Each report was classified six times to produce per-item chain-of-reasoning outputs and a multi-run reliability score, with prespecified thresholds to route ambiguous items to manual pharmacist review.
- Validated the approach against an expert reference: two hospital pharmacists independently labeled 200 new reports (Jan–Mar 2024) for binary medication-related status and assignment into eight medication-process subcategories; LLM outputs were compared to consensus.
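The multi-run workflow above can be sketched as follows. This is an illustrative reconstruction, not the paper's published code: `classify` stands in for the hosted o4-mini classifier+critic call, and the aggregation rule (majority label, agreement fraction as the reliability score, unanimity required to skip review) and the threshold value are assumptions for the sketch; the paper's exact thresholds are not restated here.

```python
from collections import Counter

N_RUNS = 6               # the paper's per-report repetition count
REVIEW_THRESHOLD = 1.0   # hypothetical: require unanimous runs to auto-accept

def triage(report_text, classify):
    """Run the classifier N_RUNS times, compute a reliability score,
    and route low-agreement items to pharmacist review."""
    labels = [classify(report_text) for _ in range(N_RUNS)]
    top_label, top_count = Counter(labels).most_common(1)[0]
    reliability = top_count / N_RUNS
    route = "auto-accept" if reliability >= REVIEW_THRESHOLD else "pharmacist-review"
    return top_label, reliability, route

# Usage with a stubbed classifier that disagrees on one of six runs:
runs = iter(["medication-related"] * 5 + ["not-medication-related"])
label, score, route = triage("wrong dose administered", lambda _: next(runs))
# label == "medication-related", score == 5/6, route == "pharmacist-review"
```

The design point is that disagreement across repeated runs is itself the signal: any item whose runs are not unanimous carries an interpretable reliability score and lands in the human-review queue rather than being silently auto-labeled.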
What They Found
- Binary medication-related classification matched expert pharmacists in 96.0% of cases (192/200; 95% CI 92.3–98.3) — operationally ~40 disagreements per 1,000 reports in a similar mix of narratives.
- Medication-process subcategorization concordance was 76.5% (153/200), indicating reasonable first-pass routing for queues but likely insufficient for unattended KPI/trend reporting without human QA.
- Disagreements (4.0%) were mainly due to linguistic ambiguity or context dependence; directionality showed 5 cases where the LLM flagged reports as medication-related when pharmacists did not, and 3 the reverse.
- Failure modes included misinterpreting whether an item counted as a medication (e.g., chlorhexidine), mistaking non-drug terms for drug names, and instances where the model strictly applied prompt rules while pharmacists used broader organizational/contextual judgment.
- During prompt development, simpler prompting strategies achieved roughly 50–75% concordance on development examples. The final pipeline — encoding pharmacist tacit heuristics, adding a critic prompt, and using six repeated runs to compute a reliability score — produced the 96% binary concordance observed in the validation set.
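The headline interval can be sanity-checked from the raw counts. The paper's upper bound (98.3%) suggests an exact (Clopper–Pearson-style) method; the Wilson score interval sketched below is a close stdlib-only approximation and reproduces the lower bound.

```python
from math import sqrt

def wilson_ci(successes: int, n: int, z: float = 1.96):
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

lo, hi = wilson_ci(192, 200)   # binary concordance: 192/200 = 96.0%
print(f"{lo:.1%} - {hi:.1%}")  # → 92.3% - 98.0%
```

The slight difference at the upper bound (98.0% vs. the reported 98.3%) is the expected gap between the Wilson approximation and an exact interval at this sample size.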
Takeaways
- Reasoning LLMs may standardize first-pass "medication-related or not" triage from messy incident narratives, enabling medication-safety pharmacists to prioritize investigation and RCA over large-scale manual sorting, provided local validation and governance are in place.
- Treat the LLM like a high-quality triage/spam filter: useful for sorting the inbox but not a final adjudicator — ambiguous or low-reliability items should be routed to pharmacist QA.
- Use subcategory tags as routing hints to steer reports into appropriate review queues, but spot-check and validate before relying on them for KPI-level reporting or automated trending.
- When evaluating or piloting similar tools, require clarity on edge-case handling (e.g., non-drug items that resemble medications), the low-confidence routing policy that triggers human review, and a plan for local revalidation and ongoing monitoring (versioning, sampling, drift detection).
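The ongoing-monitoring plan above implies a concrete audit loop. The rule below is one minimal way to operationalize it, not a method from the paper: periodically audit a random sample of LLM-labeled reports against pharmacist labels and alarm when concordance drops beyond a binomial control limit around the locally validated baseline.

```python
from math import sqrt

BASELINE = 0.96  # binary concordance from the local validation sample

def drift_alarm(agree: int, audited: int, baseline: float = BASELINE, z: float = 3.0) -> bool:
    """Flag a drop in audited concordance beyond a z-sigma binomial control limit.

    Illustrative monitoring rule (assumed, not from the paper): alarm when the
    audited agreement rate falls below baseline minus z standard errors.
    """
    se = sqrt(baseline * (1 - baseline) / audited)
    return agree / audited < baseline - z * se

# A weekly audit of 50 reports with 43 agreements (86%) trips the alarm:
print(drift_alarm(agree=43, audited=50))  # → True
```

Pairing a rule like this with prompt/model versioning lets a site detect silently degraded performance after a vendor-side model update rather than discovering it in a retrospective review.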
Strengths and Limitations
Strengths:
- Pharmacist-grounded prompt design: used a 2,434-case pharmacist-reclassified subset to encode tacit heuristics; the prompt specification is provided for reproducibility.
- Built-in reliability mechanism: paired classifier + critic prompts with multiple runs produce an interpretable reliability score and prespecified manual-review thresholds to support human-in-the-loop workflows.
Limitations:
- Single-region, single-language development and a modest validation sample (n=200) limit generalizability; local revalidation and prompt tuning will likely be required for other hospitals, languages, or reporting cultures.
- Evaluation reports overall concordance without full confusion matrices or sensitivity/positive predictive value stratified by severity.
- Reliance on a closed-source model and prompt-only adaptation raises transparency, reproducibility, and drift-management concerns for long-term governance.
Bottom Line
Consider a pharmacist-trained reasoning LLM as a triage layer for incident-report free text — high reported accuracy for binary "medication-related or not" classification in this validation, but the approach needs local validation, a reliability-based human-in-the-loop policy, and QA before relying on subtypes for unattended KPI reporting.