Quick Take

  • Auditor-driven suppression of low-quality machine learning (ML) predictions improved simulated human–ML collaboration: in the VUMC ED discharge simulation, uncertainty-based suppression (Collaboration3) raised AUROC to 0.914 (95% CI 0.913-0.916) versus clinician-alone 0.905 (95% CI 0.905-0.906); the difference was statistically significant (authors report P < 8.2e-6).
  • Practical implication: clinical decision support (CDS) that can abstain may reduce harm — favor CDS designs that expose calibrated uncertainty outputs, enable auditor/abstention logic, and add fairness monitoring to limit automation bias and protect medication-safety and discharge workflows.

Why it Matters

  • Machine learning (ML) risk scores are increasingly embedded in emergency department (ED) triage and discharge clinical decision support, but model performance often varies across subpopulations and clinicians can over-rely on low-quality outputs; displaying every prediction risks propagating unsafe or inequitable recommendations at the point of care.
  • Errors propagate into operational workflows: misclassified acuity or 30-day readmission risk can misdirect scarce resources, increase cognitive load, and generate noisy alerts—wasting pharmacist and care-team effort on misprioritized discharge reviews and follow-ups.
  • Auditor-driven suppression reframes CDS around a 'safe-to-show' principle — selectively surfacing confident recommendations — supporting stewardship, CDS governance, and attention-sensitive workflows while enabling fairness monitoring to curb automation bias.

What They Did

  • Used Vanderbilt University Medical Center (VUMC) and MIMIC-IV-ED cohorts for two prediction tasks: ED triage (ICU admission or death within 24 hours) and ED discharge (30-day ED readmission); extracted demographics, triage vitals, prior utilization, and Emergency Severity Index (ESI).
  • Trained practical gradient-boosted tree models (CatBoost) and an artificially high-performing oracle model (95% lookup), and estimated prediction uncertainty using virtual ensembles.
  • Built auditor models (logistic regression and QUEST decision-tree auditors) that learn to flag likely ML errors or high-uncertainty predictions and suppress those recommendations before display.
  • Simulated clinician behavior with evidence-based rules (higher AI acceptance for older patients, polypharmacy, multimorbidity; clinicians only accepted AI when it predicted worse outcomes), ran 30 repeats of 75/25 train/test resampling, and reported AUROC, area under the precision-recall curve (AUPRC), and the absolute average odds difference (AAOD) fairness metric.
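
The suppression-and-acceptance logic described above can be sketched in a few lines. This is a hypothetical illustration, not the authors' code: the function name, the uncertainty threshold `tau`, and the exact form of the acceptance rule are assumptions based on the description (the auditor suppresses high-uncertainty predictions; the simulated clinician accepts the ML output only when it predicts a worse outcome than their own estimate).

```python
def collaborative_risk(ml_prob, ml_uncertainty, clinician_prob, tau=0.2):
    """Return the risk estimate the simulated clinician acts on.

    The auditor suppresses the ML prediction when its uncertainty
    exceeds tau; otherwise the clinician accepts the ML output only
    when it predicts a worse (higher-risk) outcome than their own.
    """
    if ml_uncertainty > tau:       # auditor flags: hide the ML output
        return clinician_prob
    if ml_prob > clinician_prob:   # ML predicts a worse outcome: accept it
        return ml_prob
    return clinician_prob          # otherwise keep the clinician's estimate

# Confident, pessimistic ML prediction is shown and accepted
print(collaborative_risk(0.90, 0.05, 0.40))  # -> 0.9
# Same prediction, but high uncertainty: auditor suppresses it
print(collaborative_risk(0.90, 0.50, 0.40))  # -> 0.4
```

In the paper's full pipeline, `ml_uncertainty` would come from virtual ensembles and the suppression decision from a trained auditor model rather than a fixed threshold; the fixed `tau` here is purely for illustration.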

What They Found

  • In these simulation experiments, auditor-driven suppression improved collaborative discrimination when the ML was stronger: e.g., VUMC ED discharge with uncertainty-based suppression (Collaboration3) achieved AUROC 0.914 (95% CI 0.913-0.916) versus clinician-alone 0.905 (95% CI 0.905-0.906).
  • Across tasks and datasets, collaborations that suppressed low-quality predictions outperformed the human alone in nearly all model/dataset combinations; the exception was the gradient-boosted tree (GBT) on the MIMIC discharge task, where the human remained superior (human > suppressed collaboration, P < 3.6e-5).
  • Uncertainty-informed suppression was often the fairest choice: Collaboration3 was the top fairness performer in 4 of 8 experiments (per the paper's counts), and suppression usually did not degrade AAOD fairness in settings where ML already outperformed humans.
  • Overall, suppression selectively hid low-quality or uncertain predictions in ways that raised aggregate AUROC in several dataset/model settings and often preserved or improved fairness across race/age/gender subgroup comparisons, thereby changing which cases were surfaced to clinicians for review.
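
The AAOD metric used in these fairness comparisons averages the absolute gaps in true-positive and false-positive rates between two subgroups. A minimal sketch of one common formulation follows; the paper's exact computation (e.g., choice of reference group, thresholding) may differ in detail.

```python
def _rates(y_true, y_pred):
    """True-positive and false-positive rates for one subgroup."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    return tp / (tp + fn), fp / (fp + tn)

def aaod(y_true_a, y_pred_a, y_true_b, y_pred_b):
    """Absolute average odds difference between subgroups a and b:
    0.5 * (|TPR_a - TPR_b| + |FPR_a - FPR_b|); 0 means equal error rates."""
    tpr_a, fpr_a = _rates(y_true_a, y_pred_a)
    tpr_b, fpr_b = _rates(y_true_b, y_pred_b)
    return 0.5 * (abs(tpr_a - tpr_b) + abs(fpr_a - fpr_b))

# Subgroup a: TPR 0.5, FPR 0.0; subgroup b: TPR 1.0, FPR 0.5
print(aaod([1, 1, 0, 0], [1, 0, 0, 0], [1, 1, 0, 0], [1, 1, 1, 0]))  # -> 0.5
```

A value near 0 indicates the classifier's error rates are balanced across the two subgroups; suppression changes which cases reach the clinician, so AAOD must be recomputed on the post-suppression decisions.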

Takeaways

  • Treat suppression as a 'mute' for shaky ML advice: in simulation, selectively hiding likely-wrong or high-uncertainty outputs improved collaboration when the model outperformed clinicians and commonly preserved subgroup fairness.
  • For pharmacists and pharmacy leaders: use suppressed-output signals as quiet consultants that speak when confident — route attention to higher-risk triage/discharge cases rather than automating final medication or disposition decisions.
  • When evaluating vendor tools, prefer ones that quantify uncertainty and support abstention, and insist on pre-deployment fairness assessment that measures the downstream impact of suppression across race, age, and gender.
  • Remember simulation limits: validate locally, monitor for dataset shift and auditor/model drift, and ensure UI and governance prevent clinicians from interpreting 'no alert' as 'all clear' — continuous oversight remains essential.

Strengths and Limitations

Strengths:

  • Used large, real-world cohorts (VUMC and MIMIC-IV-ED) across two ED tasks and evaluated both a realistic gradient-boosted tree model and an oracle model to probe dependence on ML quality.
  • Applied uncertainty-aware auditing (virtual ensembles, QUEST/logistic regression auditors), repeated 30x train/test resampling, the AAOD fairness metric, and formal statistical testing to strengthen methodological rigor.

Limitations:

  • Simulation of clinician behavior and use of retrospective data limit ecological validity; user-interface effects, real-world trust, and prospective clinician acceptance were not observed.
  • Auditor and uncertainty methods were evaluated on these datasets only; external validation, drift-detection safeguards, and subgroup impact testing are required before operational deployment.

Bottom Line

Early simulation-based evidence supports auditor-driven suppression: selectively hiding low-confidence or likely-incorrect ML predictions can boost human–ML collaboration and often preserve fairness when ML outperforms clinicians; pilot with vendors that expose uncertainty outputs, perform local validation, and implement ongoing monitoring and governance.