Quick Take

  • In MIMIC-IV (422,534 admissions from 180,640 patients), inpatient mortality models trained on same-admission ICD diagnostic codes alone achieved near-perfect discrimination on the held-out test set (AUROC ≈ 0.971–0.976). A targeted screen found that 40.2% of same-admission MIMIC studies used same-admission ICD features.
  • For inpatient pharmacy risk scores and vendor models, treat encounter/billing diagnostic codes as potential post-discharge label leakage: require an explicit prediction time point, documentation of variable availability tied to EHR storage time, and local time-restricted validation (silent-mode) before routing pharmacist work or changing prioritization workflows.

Why It Matters

  • Same-admission prediction can appear highly accurate when models use diagnostic codes that are entered or finalized after the clinical events they describe — the model is learning hindsight rather than actionable early signals.
  • For hospital pharmacy, reliance on end-of-encounter codes can distort prioritization (queueing the wrong patients), increase avoidable workload, and erode trust when model outputs cannot be verified at prediction time.
  • This elevates the need for variable-availability documentation, timestamp governance (prefer EHR storage time for availability checks rather than assumed event times), and procurement requirements that enforce local, time-restricted validation in the intended operational environment.
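The availability check described above can be sketched as a simple gate: compare each feature's EHR storage timestamp against the stated prediction time point, and flag anything stored later as a potential hindsight feature. This is an illustrative sketch, not the study's code; the feature names and timestamps below are hypothetical.

```python
from datetime import datetime, timedelta

def available_at_prediction_time(feature_storage_times, prediction_time):
    """Split features into those whose EHR storage timestamp is at or before
    the stated prediction time point (usable) and those stored later
    (potential label-leakage/hindsight features)."""
    available, leaked = {}, {}
    for name, stored_at in feature_storage_times.items():
        (available if stored_at <= prediction_time else leaked)[name] = stored_at
    return available, leaked

# Hypothetical example: encounter ICD codes are derived after discharge,
# so they fail a "24 hours after admission" prediction time point.
admit = datetime(2019, 1, 1, 8, 0)
storage_times = {
    "heart_rate_first24h": admit + timedelta(hours=6),
    "creatinine_first24h": admit + timedelta(hours=10),
    "icd_cardiac_arrest": admit + timedelta(days=9),   # coded post-discharge
}
avail, leaked = available_at_prediction_time(storage_times, admit + timedelta(hours=24))
```

In a governance review, the `leaked` set is exactly what a variable-availability statement should surface before any model is allowed to route pharmacist work.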

What They Did

  • Used MIMIC-IV (ICU/ED admissions 2008–2019; 422,534 admissions from 180,640 patients), partitioned by admission date into train/validation/test (70/10/20) with no patient overlap across splits.
  • Mapped ICD-10-CM codes to ICD-9-CM, removed codes with variance below 0.0001 or pairwise correlation above 0.8, and trained logistic regression, random forest, and XGBoost classifiers using only same-admission ICD diagnostic codes (intentionally excluding vitals, labs, and medications to isolate diagnostic-code–driven leakage).
  • Tuned on the validation set, evaluated performance on the held-out test set (AUROC, balanced accuracy) and feature importance (with Benjamini–Hochberg correction), and performed a targeted literature screen of 100 MIMIC studies, identifying 37 of 92 same-admission prediction papers that used same-admission ICD codes.
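The split design above (date-ordered partition with no patient overlap) can be sketched as follows. This is a reconstruction of the idea, not the authors' code; field names like `patient_id` and `admit_date` are assumptions.

```python
def split_by_admission_date(admissions, train_frac=0.7, val_frac=0.1):
    """Date-based 70/10/20 split at the patient level: order patients by
    their earliest admission date, then cut so that no patient appears in
    more than one split (a sketch of the paper's design)."""
    # Earliest admission date per patient
    first_seen = {}
    for adm in admissions:
        pid, date = adm["patient_id"], adm["admit_date"]
        if pid not in first_seen or date < first_seen[pid]:
            first_seen[pid] = date

    ordered = sorted(first_seen, key=first_seen.get)
    n = len(ordered)
    n_train, n_val = int(n * train_frac), int(n * val_frac)
    train = set(ordered[:n_train])
    val = set(ordered[n_train:n_train + n_val])

    def bucket(adm):
        pid = adm["patient_id"]
        return "train" if pid in train else "val" if pid in val else "test"

    return {s: [a for a in admissions if bucket(a) == s]
            for s in ("train", "val", "test")}
```

Assigning every admission of a patient to a single split prevents a second, subtler leak: the same patient's outcome appearing on both sides of the train/test boundary.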

What They Found

  • Models using only same-admission ICD diagnostic codes achieved near-perfect discrimination on the held-out test set: AUROC 0.976 (logistic regression), 0.971 (random forest), and 0.973 (XGBoost); cohort in-hospital mortality was 2.0% (8,417/422,534 admissions).
  • Top logistic regression predictors were downstream events with very large effect sizes (examples: subdural hemorrhage/coma OR 389.99; cardiac arrest OR 219.58; brain death OR 112.78; encounter for palliative care OR 98.04), consistent with hindsight information driving performance.
  • Tree-based importance measures also highlighted late/documentation-linked features (do-not-resuscitate status, acute respiratory failure, encounter for palliative care), reinforcing that highly predictive variables would often not be available early enough for intervention.
  • Targeted literature screening found 40.2% (37 of 92) of same-admission MIMIC prediction studies used same-admission ICD diagnostic codes, despite MIMIC documentation that encounter ICD diagnoses are derived after discharge.
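The odds ratios reported above come directly from logistic regression coefficients (OR = e^β), so the hindsight signature is easy to reproduce on any fitted model. A minimal sketch, using illustrative coefficients rather than the study's fitted values:

```python
import math

def odds_ratios(feature_names, coefficients):
    """Convert logistic regression coefficients to odds ratios (OR = e^beta),
    ranked largest first. Very large ORs on late-arriving codes (cardiac
    arrest, palliative care) are the hindsight signature the study describes."""
    return sorted(
        ((name, math.exp(b)) for name, b in zip(feature_names, coefficients)),
        key=lambda pair: pair[1],
        reverse=True,
    )

# Illustrative coefficients only (not the study's fitted values)
ranked = odds_ratios(["icd_cardiac_arrest", "icd_hypertension"], [5.39, 0.10])
```

Ranking ORs this way and cross-checking the top features against their typical documentation time is a cheap first pass for spotting 'too-late' predictors.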

Takeaways

  • Near-perfect retrospective performance driven by post-discharge diagnostic codes is a classic label-leakage/hindsight signal — unreliable for real-time prioritization unless variable timing and provenance are demonstrated.
  • Treat diagnosis codes like a discharge summary: appropriate for retrospective characterization but risky for real-time pharmacist triage unless the team can show those codes were available at the stated prediction time point.
  • Require vendors and internal developers to provide a prediction-time statement (defined prediction time point, data classes used, and EHR storage-time availability for each class) and to support local silent-mode, time-restricted validation that emulates operational latency.
  • Keep pharmacist judgment central: insist on lead-time estimates, positive predictive value at proposed thresholds, calibration, and explicit tests that detect 'too-late' predictors before using models to influence pharmacist workflows.
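One concrete metric named above, positive predictive value at a proposed alerting threshold, can be computed during silent-mode review without any workflow change. A minimal sketch (the function name and inputs are illustrative):

```python
def ppv_at_threshold(scores, labels, threshold):
    """Positive predictive value at a proposed alerting threshold: of the
    admissions the model would flag (score >= threshold), what fraction
    actually had the outcome? Returns None if nothing is flagged."""
    flagged = [(s, y) for s, y in zip(scores, labels) if s >= threshold]
    if not flagged:
        return None
    return sum(y for _, y in flagged) / len(flagged)
```

Run on time-restricted predictions (features limited to those stored before the prediction time point), this gives the workload-relevant number a pharmacy team should demand before letting a score drive triage queues.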

Strengths and Limitations

Strengths:

  • Deliberate ICD-only experimental design with date-based train/validation/test splits and no patient overlap to clearly isolate temporal label leakage.
  • Transparent methods and reporting: large public MIMIC-IV cohort, adherence to TRIPOD+AI and STROBE guidance, supplemental tables, and publicly available code enabling reproducibility.

Limitations:

  • Single-dataset scope (MIMIC-IV) using final, untimestamped ICD codes limits generalizability and prevents empirical assessment of real-time variable availability in operational EHRs.
  • Models were a deliberate ICD-only demonstration without prospective or external validation; production-model behavior and patient-level harms in operational settings remain unmeasured.
  • The literature review was a targeted, non-systematic screen (Google Scholar, citation-sorted selection) of a convenience sample; the 40.2% prevalence applies to the screened sample and may not generalize to all MIMIC-based studies or to private/institutional datasets.

Bottom Line

Treat this as a procurement and governance warning: 'too-good' same-admission models can be driven by label leakage from post-discharge diagnostic codes. Require prediction-time feature provenance (EHR storage-time availability), a variable-availability statement, and local time-restricted validation (silent-mode) before operational use in pharmacist prioritization or CDS.