Quick Take

  • Large language models (LLMs) reliably detect oncology adverse events from clinical notes, consistently outperforming ICD-10 diagnosis codes for surveillance.
  • High sensitivity (~95%) paired with low positive predictive value (~15%) favors a staffed, non-interruptive triage queue rather than immediate alerts.
  • Implementation success requires feeding the pharmacist verification workflow with Common Terminology Criteria for Adverse Events (CTCAE)-aligned, citation-backed summaries.
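The pairing of high sensitivity with low PPV follows directly from Bayes' rule at low event prevalence. A minimal sketch of the arithmetic; the ~1% prevalence is an illustrative assumption chosen to be consistent with the figures above, not a reported number:

```python
def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value via Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# Illustrative: 95% sensitivity and 94% specificity at ~1% event
# prevalence yield a PPV near 15%, so most flags are false positives.
print(round(ppv(0.95, 0.94, 0.011), 3))  # → 0.15
```

This is why the false-positive burden is a staffing question rather than a model-quality question: even excellent discrimination cannot rescue PPV when true events are rare.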

Why It Matters

  • Operational fit (surveillance, not automation): LLMs function best as high-recall engines feeding the pharmacist order-verification workspace. Expect dedicated triage FTEs, tuned thresholds, and non-interruptive interfaces to manage false positives and prevent alert fatigue.
  • Governance and auditability: Deployments must emit CTCAE-aligned outputs with specific citations, pinned model versions, and audit trails. Pharmacy informatics and medication safety teams must own validation and incident review to mitigate hallucination and liability risks.
  • Strategic tradeoffs and timing: Early gains favor centers with existing informatics capacity, while smaller sites may benefit from phased pilots. Start with assisted surveillance, measure grade-to-action correctness and time-per-case, and calculate total cost of ownership before scaling.

Bottom Line

Pilot LLMs as assisted surveillance tools feeding a staffed, CTCAE-aligned review queue that retains pharmacist final sign-off.

Key Details

  • Evidence set: Nineteen studies applied LLMs to EHR narratives using CTCAE terms. Models ranged from BERT-family classifiers to GPT-4 and Llama, with advanced pipelines utilizing retrieval-augmented generation (RAG) to ground outputs in guideline context.
  • Detection performance and scale: In multi-site cohorts (e.g., 7,555 and 1,270 admissions), encounter-level sensitivity reached 94.7–98.1% with specificity of 93.7–95.7%, significantly exceeding administrative coding. One pipeline processed ~9,000 patients in roughly 10 minutes, a task estimated to take 9 weeks manually.
  • Grading and actions: Severity extraction proved variable; coarse CTCAE categories achieved 82–86% accuracy, but exact grading often blurred adjacent levels. Direct grade-to-action correctness was rarely measured, with most studies relying on proxies like steroid initiation or CTCAE-structured JSON outputs.
  • Data and compute context: Pipelines typically ingested raw clinical notes via HIPAA-eligible clouds or local servers as batch jobs. No studies reported autonomous EHR write-back, emphasizing the need for strict PHI controls and integration planning.