Quick Take

  • Large language models (LLMs) reliably detect oncology adverse events from clinical notes, consistently outperforming ICD-10 diagnosis codes for surveillance.
  • High sensitivity (~95%) paired with low positive predictive value (~15%) favors a staffed, non-interruptive triage queue rather than immediate alerts.
  • Implementation success requires feeding the pharmacist verification workflow with Common Terminology Criteria for Adverse Events (CTCAE)-aligned, citation-backed summaries.
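The pairing of high sensitivity with low PPV follows directly from Bayes' rule at low event prevalence. A minimal sketch of the arithmetic; the ~1% prevalence is an illustrative assumption chosen to be consistent with the figures above, not a reported number:

```python
def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value via Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# Illustrative: 95% sensitivity and 94% specificity at ~1% event
# prevalence yield a PPV near 15%, so most flags are false positives.
print(round(ppv(0.95, 0.94, 0.011), 3))  # → 0.15
```

This is why the false-positive burden is a staffing question rather than a model-quality question: even excellent discrimination cannot rescue PPV when true events are rare.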

Why It Matters

  • Operational fit (surveillance, not automation): LLMs function best as high-recall engines feeding the pharmacist order-verification workspace. Expect dedicated triage FTEs, tuned thresholds, and non-interruptive interfaces to manage false positives and prevent alert fatigue.
  • Governance and auditability: Deployments must emit CTCAE-aligned outputs with specific citations, pinned model versions, and audit trails. Pharmacy informatics and medication safety teams must own validation and incident review to mitigate hallucination and liability risks.
  • Strategic tradeoffs and timing: Early gains favor centers with existing informatics capacity, while smaller sites may benefit from phased pilots. Start with assisted surveillance, measure grade-to-action correctness and time-per-case, and calculate total cost of ownership before scaling.

Bottom Line

Pilot LLMs as assisted surveillance tools feeding a staffed, CTCAE-aligned review queue that retains pharmacist final sign-off.

Key Details

  • Evidence set: Nineteen studies applied LLMs to EHR narratives using CTCAE terms. Models ranged from BERT-family classifiers to GPT-4 and Llama, with advanced pipelines utilizing retrieval-augmented generation (RAG) to ground outputs in guideline context.
  • Detection performance and scale: In multi-site cohorts (e.g., 7,555 and 1,270 admissions), encounter-level sensitivity reached 94.7–98.1% with specificity of 93.7–95.7%, significantly exceeding administrative coding. One pipeline processed ~9,000 patients in roughly 10 minutes, a task estimated to take 9 weeks manually.
  • Grading and actions: Severity extraction proved variable; coarse CTCAE categories achieved 82–86% accuracy, but exact grading often blurred adjacent levels. Direct grade-to-action correctness was rarely measured, with most studies relying on proxies like steroid initiation or CTCAE-structured JSON outputs.
  • Data and compute context: Pipelines typically ingested raw clinical notes via HIPAA-eligible clouds or local servers as batch jobs. No studies reported autonomous EHR write-back, emphasizing the need for strict PHI controls and integration planning.