Quick Take
- Item Response Theory (IRT) matched or outperformed k-nearest neighbors (kNN) and multiple imputation by chained equations (MICE), and performed on par with DataWig, for nominal and binary categorical imputation (nominal F1 ≈ 0.51 on Housing; binary F1 ≈ 0.76 on Heart Disease), while ordinal imputation remained poor for every method (F1 ≈ 0.20).
- Imputing high‑stakes ordinal fields (e.g., allergy severity) is unsafe given ~80% misclassification; IRT may be useful for lower‑risk nominal reporting (e.g., intervention codes) but requires validation on real EHR data before any operational use.
Why it Matters
- Missing categorical fields (ordinal, nominal, binary) are common and hinder clinical score calculation, predictive model development, and automated decision support—often forcing case‑wise deletion or manual review that reduces power, increases workload, and raises safety risk.
- In hospital pharmacy, absent severity ratings or procedure/intervention codes can suppress safety alerts, obscure pharmacist interventions, and complicate operational reporting; imputation approaches that use the outcome variable as a predictor risk circular bias and inflated performance estimates.
- Hospitals need imputation methods that prevent outcome leakage, are validated on real EHRs, and balance accuracy against stewardship, CDS integrity, and limited informatics resources.
What They Did
- Selected three complete public datasets representing ordinal, nominal, and binary categorical targets: Diamonds (N=53,920, ordinal color), Housing (N=10,692, nominal city), and BRFSS/Heart Disease (N=47,786, binary high blood pressure).
- Simulated single‑variable missingness at 5–50% under Missing Completely at Random (MCAR) and Missing at Random (MAR); Little’s test was used to verify the MCAR vs MAR mechanism, and Missing Not at Random (MNAR) was excluded. (A minimal simulation sketch appears after this list.)
- Applied IRT imputation in IRTPRO, matching model family to variable type: the 2‑parameter logistic (2‑PL) model for binary, the Graded Response Model (GRM) for ordinal, and the Nominal Response Model (NRM) for nominal targets; continuous predictors were binned into ordinal categories for IRT. (The standard model forms are sketched after this list.)
- Compared IRT to kNN, MICE, and DataWig (an AWS deep‑learning imputation library); evaluated direct imputation accuracy with F1 and downstream impact with XGBoost models scored by RMSE (regression) or AUC (classification). (Evaluation sketches also follow this list.)
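As a rough illustration of the missingness simulation and F1 scoring described above (not the study's code), the sketch below blanks out one categorical column completely at random and scores an imputer against the known ground truth. The toy data, the 20% rate, the mode‑imputation placeholder, and macro averaging are all illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import f1_score

rng = np.random.default_rng(42)

def simulate_mcar(df: pd.DataFrame, column: str, rate: float) -> tuple[pd.DataFrame, np.ndarray]:
    """Blank out `rate` of one column completely at random; return masked frame and mask.
    For MAR, the blanking probability would instead depend on another observed column."""
    mask = rng.random(len(df)) < rate
    masked = df.copy()
    masked.loc[mask, column] = np.nan
    return masked, mask

def score_imputation(truth: pd.Series, imputed: pd.Series, mask: np.ndarray) -> float:
    """Macro F1 computed only over the cells that were actually blanked out."""
    return f1_score(truth[mask], imputed[mask], average="macro")

# Illustrative usage with a toy frame standing in for, e.g., the Housing dataset.
df = pd.DataFrame({
    "city": rng.choice(["A", "B", "C"], size=1_000),
    "sqft": rng.normal(1500, 300, size=1_000),
})
masked, mask = simulate_mcar(df, "city", rate=0.20)

# Placeholder imputer: most-frequent category. Swap in kNN/MICE/DataWig/IRT here.
masked["city"] = masked["city"].fillna(masked["city"].mode().iloc[0])
print(f"macro F1 on imputed cells: {score_imputation(df['city'], masked['city'], mask):.3f}")
```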
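For orientation, these are the standard textbook forms of the IRT models named above, in the usual notation (θ is the latent trait for a record, a a discrimination parameter, b a difficulty/threshold, c an intercept); the exact parameterization IRTPRO fits may differ.

```latex
% 2-parameter logistic (binary item j)
P(X_j = 1 \mid \theta) = \frac{1}{1 + e^{-a_j(\theta - b_j)}}

% Graded Response Model (ordinal item with categories k = 1, \dots, K)
P(X_j \ge k \mid \theta) = \frac{1}{1 + e^{-a_j(\theta - b_{jk})}}, \qquad
P(X_j = k \mid \theta) = P(X_j \ge k \mid \theta) - P(X_j \ge k+1 \mid \theta)

% Nominal Response Model (unordered categories k)
P(X_j = k \mid \theta) = \frac{e^{a_{jk}\theta + c_{jk}}}{\sum_{m} e^{a_{jm}\theta + c_{jm}}}
```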
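Finally, a minimal sketch of the downstream check: fit the same XGBoost classifier on the original versus the imputed data and compare held‑out AUC. The one‑hot encoding, hyperparameters, and the hypothetical high_bp target (assumed 0/1‑coded) are illustrative, not the study's configuration.

```python
import pandas as pd
import xgboost as xgb
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def downstream_auc(frame: pd.DataFrame, target: str) -> float:
    """Train an XGBoost classifier on one version of the data and return held-out AUC."""
    X = pd.get_dummies(frame.drop(columns=[target]))  # one-hot encode categorical predictors
    y = frame[target]                                  # target assumed already 0/1-coded
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
    model = xgb.XGBClassifier(n_estimators=200, eval_metric="logloss")
    model.fit(X_tr, y_tr)
    return roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

# Compare, e.g., downstream_auc(original_df, "high_bp") vs downstream_auc(imputed_df, "high_bp").
```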
What They Found
- Ordinal (Diamonds): All methods performed poorly for ordered categories (overall F1 ≈ 0.20). IRT was marginally better (≈0.22) but still implies ~80% misclassification—unacceptable for patient‑level use.
- Nominal (Housing): IRT and DataWig led (marginal‑mean F1: IRT 0.507 vs DataWig 0.501); kNN (0.286) and MICE (0.203) were substantially worse, but even the best methods reached only roughly coin‑flip accuracy for nominal imputation.
- Binary (Heart Disease): Task was easiest—IRT reached ~0.76 F1 under MAR and performed comparably to DataWig/kNN; downstream classifier AUC remained essentially unchanged (original AUC ≈ 0.831 vs 0.829–0.830 after imputation).
- Downstream regression impact: the Diamonds model’s original RMSE of 0.2241 rose to ≈0.23 (MAR) and ≈0.34 (MCAR) after imputation; the Housing model’s RMSE rose from 0.4380 to ≈0.61 (MAR) and ≈0.68 (MCAR).
- Practical interpretation: ordinal fields (e.g., allergy severity) are unsafe to impute given ≈80% error; nominal fields (e.g., intervention codes) at ≈50% F1 are unreliable for operational reporting. IRT’s gains on nominal/binary tasks were driven by avoiding outcome leakage and using probabilistic category modeling, but it did not overcome ordinal complexity.
Takeaways
- Do not impute high‑stakes ordinal fields for patient‑level use: across methods ordered categories had F1 ≈ 0.20 (≈80% wrong); critical ordinal fields (e.g., allergy severity) should remain missing or be manually clarified.
- Nominal and binary variables can be imputed with higher raw accuracy (nominal F1 ≈ 0.5, binary F1 ≈ 0.7), but downstream models showed little or no benefit and sometimes worse error; IRT’s primary value is methodological—preventing outcome leakage that can produce misleadingly optimistic results.
- Treat imputation as an aggregate reporting aid or a ‘best‑guess assistant,’ not a substitute for documentation. Any imputed value that could affect individual patient care requires pharmacist review and chart verification.
Strengths and Limitations
Strengths:
- Used complete public datasets with simulated MCAR and MAR missingness and known ground truth, enabling objective F1 and downstream performance comparisons.
- Benchmarked IRT against kNN, MICE, and DataWig across ordinal, nominal, and binary variables and assessed downstream effects using XGBoost models.
Limitations:
- Only one variable per dataset was made missing and all datasets were non‑clinical/public, limiting realism and generalizability to EHR settings.
- MNAR mechanisms were excluded; IRT required proprietary IRTPRO workflows, categorical binning of continuous features, and unidimensional IRT assumptions—constraints that limit scalability, robustness, and fit for complex, multi‑domain EHR data.
Bottom Line
IRT‑based imputation avoids outcome leakage and is conceptually attractive, but it is not ready for pharmacy deployment: unsafe for high‑stakes ordinal fields, unreliable for nominal fields without EHR validation, and constrained by software/workflow and modeling assumptions that require real‑world testing.