Quick Take

  • A reasoning LLM (OpenAI o4-mini) matched expert pharmacist consensus on binary "medication-related or not" classification in 96.0% of validation cases (192/200; 95% CI 92.3–98.3). Medication-process subcategorization concordance was 76.5% (153/200).
  • Suggests potential for automated first-pass triage of incident-report free text that could reduce manual sorting burden for medication-safety pharmacists, while still requiring pharmacist QA for ambiguous items and local validation before deployment.

Why it Matters

  • Medication errors are a major patient-safety concern, but incident-reporting systems capture only a small fraction of them, and the useful signals are buried in messy, high-volume free-text narratives.
  • Local classification practices are often subjective and inconsistent, making trending and cross-unit learning unreliable and forcing pharmacists into time-intensive reclassification work at scale.
  • Scalable, consistent event tagging is a prerequisite to focus scarce medication-safety resources on investigation, root-cause analysis (RCA), and system fixes, and to align surveillance with stewardship and governance priorities.

What They Did

  • Developed prompt-based adaptations of OpenAI’s reasoning model o4-mini using ~75,000 anonymized incident reports from the Västmanland region (2019–2024), with a 2,434-case pharmacist-reclassified subset used to encode tacit pharmacist heuristics and edge-case rules.
  • Hosted o4-mini on Microsoft Azure and implemented a paired classifier + "critic" prompt workflow. Each report was classified six times to produce per-item chain-of-reasoning outputs and a multi-run reliability score, with prespecified thresholds to route ambiguous items to manual pharmacist review.
  • Validated the approach against an expert reference: two hospital pharmacists independently labeled 200 new reports (Jan–Mar 2024) for binary medication-related status and assignment into eight medication-process subcategories; LLM outputs were compared to consensus.
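The multi-run reliability mechanism above can be sketched in a few lines. This is a minimal illustration, not the authors' code: `classify_report` is a deterministic stub standing in for the Azure-hosted classifier + critic prompt pair, and the routing threshold is an assumption (the paper prespecifies its own thresholds).

```python
from collections import Counter

N_RUNS = 6                      # each report is classified six times (per the study)
RELIABILITY_THRESHOLD = 5 / 6   # assumed cutoff; substitute the locally validated value

def classify_report(text: str) -> str:
    """Stand-in for the real classifier + critic prompt pair.

    In the actual pipeline this would call the hosted o4-mini model for each
    run (classify, then critique/confirm). Here it is a deterministic stub so
    the sketch is runnable.
    """
    return "medication-related" if "dose" in text.lower() else "not-medication-related"

def triage(text: str) -> dict:
    """Run the classifier N_RUNS times and derive a reliability score."""
    labels = [classify_report(text) for _ in range(N_RUNS)]
    majority_label, votes = Counter(labels).most_common(1)[0]
    reliability = votes / N_RUNS  # fraction of runs agreeing with the majority
    return {
        "label": majority_label,
        "reliability": reliability,
        "route": ("auto-accept" if reliability >= RELIABILITY_THRESHOLD
                  else "pharmacist-review"),
    }

result = triage("Patient received double dose of enoxaparin at handover.")
```

With a stochastic model the six runs occasionally disagree; the reliability score encodes how often, and anything below the prespecified threshold drops out of the automated path into pharmacist review.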

What They Found

  • Binary medication-related classification matched expert pharmacists in 96.0% of cases (192/200; 95% CI 92.3–98.3) — operationally ~40 disagreements per 1,000 reports in a similar mix of narratives.
  • Medication-process subcategorization concordance was 76.5% (153/200), indicating reasonable first-pass routing for queues but likely insufficient for unattended KPI/trend reporting without human QA.
  • Disagreements (8/200, 4.0%) were mainly due to linguistic ambiguity or context dependence; in 5 cases the LLM flagged a report as medication-related when the pharmacists did not, and in 3 cases the reverse.
  • Failure modes included misinterpreting whether an item counted as a medication (e.g., chlorhexidine), mistaking non-drug terms for drug names, and instances where the model strictly applied prompt rules while pharmacists used broader organizational/contextual judgment.
  • During prompt development, simpler prompting strategies achieved roughly 50–75% concordance on development examples. The final pipeline — encoding pharmacist tacit heuristics, adding a critic prompt, and using six repeated runs to compute a reliability score — produced the 96% binary concordance observed in the validation set.
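The headline numbers above can be reproduced from the reported counts. A quick check, using a Wilson score interval for illustration (the paper's 92.3–98.3 interval is presumably an exact Clopper–Pearson interval, so the upper bound differs slightly):

```python
import math

agree, n = 192, 200
p = agree / n                       # 0.96 binary concordance
per_1000 = round((1 - p) * 1000)    # ~40 disagreements per 1,000 reports

# Directionality of the 8 disagreements reported in the validation set
llm_yes_pharm_no, llm_no_pharm_yes = 5, 3
assert llm_yes_pharm_no + llm_no_pharm_yes == n - agree

# 95% Wilson score interval for 192/200
z = 1.959964                        # two-sided 95% normal quantile
denom = 1 + z**2 / n
center = (p + z**2 / (2 * n)) / denom
half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
lo, hi = center - half, center + half   # ≈ (0.923, 0.980)
```

The lower bound matches the reported 92.3%; the exact method widens the upper bound toward 98.3%.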

Takeaways

  • Reasoning LLMs may standardize first-pass "medication-related or not" triage from messy incident narratives, enabling medication-safety pharmacists to prioritize investigation and RCA over large-scale manual sorting, provided local validation and governance are in place.
  • Treat the LLM like a high-quality triage/spam filter: useful for sorting the inbox but not a final adjudicator — ambiguous or low-reliability items should be routed to pharmacist QA.
  • Use subcategory tags as routing hints to steer reports into appropriate review queues, but spot-check and validate before relying on them for KPI-level reporting or automated trending.
  • When evaluating or piloting similar tools, require clarity on edge-case handling (e.g., non-drug items that resemble medications), the low-confidence routing policy that triggers human review, and a plan for local revalidation and ongoing monitoring (versioning, sampling, drift detection).
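One way to operationalize these takeaways is a simple routing policy over the model's outputs. This is a hypothetical sketch: the queue names, reliability cutoff, and spot-check sampling rate are all assumptions for illustration, not values from the paper.

```python
import random

RELIABILITY_CUTOFF = 0.8   # assumed; use the locally validated threshold
SPOT_CHECK_RATE = 0.05     # assumed QA sampling rate for auto-routed items

def route(label: str, subcategory: str, reliability: float,
          rng: random.Random) -> str:
    """Triage-filter policy: the LLM sorts the inbox, pharmacists adjudicate."""
    if reliability < RELIABILITY_CUTOFF:
        return "queue:pharmacist-qa"          # low reliability -> human review
    if label != "medication-related":
        return "queue:other-incident-types"
    if rng.random() < SPOT_CHECK_RATE:
        return "queue:pharmacist-qa"          # sampled QA / drift monitoring
    return f"queue:{subcategory}"             # subcategory as a routing hint only

rng = random.Random(0)
dest = route("medication-related", "administration", 0.95, rng)
```

The spot-check branch is the ongoing-monitoring piece: even high-reliability items get sampled into pharmacist QA so subcategory tags can be revalidated before anyone trusts them for KPI-level trending.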

Strengths and Limitations

Strengths:

  • Pharmacist-grounded prompt design: used a 2,434-case pharmacist-reclassified subset to encode tacit heuristics; the prompt specification is provided for reproducibility.
  • Built-in reliability mechanism: paired classifier + critic prompts with multiple runs produce an interpretable reliability score and prespecified manual-review thresholds to support human-in-the-loop workflows.

Limitations:

  • Single-region, single-language development and a modest validation sample (n=200) limit generalizability; local revalidation and prompt tuning will likely be required for other hospitals, languages, or reporting cultures.
  • Evaluation reports overall concordance without full confusion matrices or sensitivity/positive predictive value stratified by severity; reliance on a closed-source model and prompt-only adaptation raises transparency, reproducibility, and drift-management concerns for long-term governance.

Bottom Line

Consider a pharmacist-trained reasoning LLM as a triage layer for incident-report free text — high reported accuracy for binary "medication-related or not" classification in this validation, but the approach needs local validation, a reliability-based human-in-the-loop policy, and QA before relying on subtypes for unattended KPI reporting.