Quick Take

  • Auditor-driven suppression of low-quality machine learning (ML) predictions improved simulated human–ML collaboration: in the VUMC ED discharge simulation, uncertainty-based suppression (Collaboration3) raised AUROC to 0.914 (95% CI 0.913-0.916) versus clinician-alone 0.905 (95% CI 0.905-0.906); the difference was statistically significant (authors report P < 8.2e-6).
  • Practical implication: clinical decision support (CDS) that can abstain may reduce harm — favor CDS designs that expose calibrated uncertainty outputs, enable auditor/abstention logic, and add fairness monitoring to limit automation bias and protect medication-safety and discharge workflows.

Why it Matters

  • Machine learning (ML) risk scores are increasingly embedded in emergency department (ED) triage and discharge clinical decision support, but model performance often varies across subpopulations and clinicians can over-rely on low-quality outputs; displaying every prediction risks propagating unsafe or inequitable recommendations at the point of care.
  • Errors propagate into operational workflows: misclassified acuity or 30-day readmission risk can misdirect scarce resources, increase cognitive load, and generate noisy alerts—wasting pharmacist and care-team effort on misprioritized discharge reviews and follow-ups.
  • Auditor-driven suppression reframes CDS around a 'safe-to-show' principle — selectively surfacing confident recommendations — supporting stewardship, CDS governance, and attention-sensitive workflows while enabling fairness monitoring to curb automation bias.

What They Did

  • Used Vanderbilt University Medical Center (VUMC) and MIMIC-IV-ED cohorts for two prediction tasks: ED triage (ICU admission or death within 24 hours) and ED discharge (30-day ED readmission); extracted demographics, triage vitals, prior utilization, and Emergency Severity Index (ESI).
  • Trained practical gradient-boosted tree models (CatBoost) and an artificially high-performing oracle model (95% lookup), and estimated prediction uncertainty using virtual ensembles.
  • Built auditor models (logistic regression and QUEST decision-tree auditors) that learn to flag likely ML errors or high-uncertainty predictions and suppress those recommendations before display.
  • Simulated clinician behavior with evidence-based rules (higher AI acceptance for older patients, polypharmacy, multimorbidity; clinicians only accepted AI when it predicted worse outcomes), ran 30 repeats of 75/25 train/test resampling, and reported AUROC, area under the precision-recall curve (AUPRC), and the absolute average odds difference (AAOD) fairness metric.
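
The suppression-and-acceptance logic described above can be sketched in a few lines. This is a hypothetical illustration, not the authors' code: the function name, the uncertainty threshold `tau`, and the exact form of the acceptance rule are assumptions based on the description (the auditor suppresses high-uncertainty predictions; the simulated clinician accepts the ML output only when it predicts a worse outcome than their own estimate).

```python
def collaborative_risk(ml_prob, ml_uncertainty, clinician_prob, tau=0.2):
    """Return the risk estimate the simulated clinician acts on.

    The auditor suppresses the ML prediction when its uncertainty
    exceeds tau; otherwise the clinician accepts the ML output only
    when it predicts a worse (higher-risk) outcome than their own.
    """
    if ml_uncertainty > tau:       # auditor flags: hide the ML output
        return clinician_prob
    if ml_prob > clinician_prob:   # ML predicts a worse outcome: accept it
        return ml_prob
    return clinician_prob          # otherwise keep the clinician's estimate

# Confident, pessimistic ML prediction is shown and accepted
print(collaborative_risk(0.90, 0.05, 0.40))  # -> 0.9
# Same prediction, but high uncertainty: auditor suppresses it
print(collaborative_risk(0.90, 0.50, 0.40))  # -> 0.4
```

In the paper's full pipeline, `ml_uncertainty` would come from virtual ensembles and the suppression decision from a trained auditor model rather than a fixed threshold; the fixed `tau` here is purely for illustration.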

What They Found

  • In these simulation experiments, auditor-driven suppression improved collaborative discrimination when the ML was stronger: e.g., VUMC ED discharge with uncertainty-based suppression (Collaboration3) achieved AUROC 0.914 (95% CI 0.913-0.916) versus clinician-alone 0.905 (95% CI 0.905-0.906).
  • Across tasks and datasets, collaborations that suppressed low-quality predictions outperformed the human alone in nearly all model/dataset combinations; the exception was the gradient-boosted tree (GBT) on the MIMIC discharge task, where the human remained superior (human > suppressed collaboration, P < 3.6e-5).
  • Uncertainty-informed suppression was often the fairest choice: Collaboration3 was the top fairness performer in 4 of 8 experiments (per the paper's counts), and suppression usually did not degrade AAOD fairness in settings where ML already outperformed humans.
  • Overall, suppression selectively hid low-quality or uncertain predictions in ways that raised aggregate AUROC in several dataset/model settings and often preserved or improved fairness across race/age/gender subgroup comparisons, thereby changing which cases were surfaced to clinicians for review.
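
The AAOD metric used in these fairness comparisons averages the absolute gaps in true-positive and false-positive rates between two subgroups. A minimal sketch of one common formulation follows; the paper's exact computation (e.g., choice of reference group, thresholding) may differ in detail.

```python
def _rates(y_true, y_pred):
    """True-positive and false-positive rates for one subgroup."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    return tp / (tp + fn), fp / (fp + tn)

def aaod(y_true_a, y_pred_a, y_true_b, y_pred_b):
    """Absolute average odds difference between subgroups a and b:
    0.5 * (|TPR_a - TPR_b| + |FPR_a - FPR_b|); 0 means equal error rates."""
    tpr_a, fpr_a = _rates(y_true_a, y_pred_a)
    tpr_b, fpr_b = _rates(y_true_b, y_pred_b)
    return 0.5 * (abs(tpr_a - tpr_b) + abs(fpr_a - fpr_b))

# Subgroup a: TPR 0.5, FPR 0.0; subgroup b: TPR 1.0, FPR 0.5
print(aaod([1, 1, 0, 0], [1, 0, 0, 0], [1, 1, 0, 0], [1, 1, 1, 0]))  # -> 0.5
```

A value near 0 indicates the classifier's error rates are balanced across the two subgroups; suppression changes which cases reach the clinician, so AAOD must be recomputed on the post-suppression decisions.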

Takeaways

  • Treat suppression as a 'mute' for shaky ML advice: in simulation, selectively hiding likely-wrong or high-uncertainty outputs improved collaboration when the model outperformed clinicians and commonly preserved subgroup fairness.
  • For pharmacists and pharmacy leaders: use suppressed-output signals as quiet consultants that speak when confident — route attention to higher-risk triage/discharge cases rather than automating final medication or disposition decisions.
  • When evaluating vendor tools, prefer ones that quantify uncertainty and support abstention, and insist on pre-deployment fairness assessment that measures the downstream impact of suppression across race, age, and gender.
  • Remember simulation limits: validate locally, monitor for dataset shift and auditor/model drift, and ensure UI and governance prevent clinicians from interpreting 'no alert' as 'all clear' — continuous oversight remains essential.

Strengths and Limitations

Strengths:

  • Used large, real-world cohorts (VUMC and MIMIC-IV-ED) across two ED tasks and evaluated both a realistic gradient-boosted tree model and an oracle model to probe dependence on ML quality.
  • Applied uncertainty-aware auditing (virtual ensembles, QUEST/logistic regression auditors), repeated 30x train/test resampling, the AAOD fairness metric, and formal statistical testing to strengthen methodological rigor.

Limitations:

  • Simulation of clinician behavior and use of retrospective data limit ecological validity; user-interface effects, real-world trust, and prospective clinician acceptance were not observed.
  • Auditor and uncertainty methods were evaluated on these datasets only; external validation, drift-detection safeguards, and subgroup impact testing are required before operational deployment.

Bottom Line

Early simulation-based evidence supports auditor-driven suppression: selectively hiding low-confidence or likely-incorrect ML predictions can boost human–ML collaboration and often preserve fairness when ML outperforms clinicians; pilot with vendors that expose uncertainty outputs, perform local validation, and implement ongoing monitoring and governance.