Sentrelia — Research: ML-Enhanced ED Triage

Abstract

Background: Emergency department triage scales have limited accuracy for identifying patients who will need ICU admission, and whether NLP of chief complaints adds predictive value beyond structured triage data has not been established.

Methods: Retrospective cohorts from NHAMCS 2018–2022 (n=86,864; 2.02% critical outcome prevalence) and MIMIC-IV-ED (n=425,087; primary anchor-era temporal test n=77,026; sensitivity random-split test n=128,510) — totaling over 512,000 ED encounters. Primary analyses used temporal holdout validation (NHAMCS: train 2018–2021, test 2022; MIMIC: anchor-era split holding out the 2017–2019 patient cohort). Models used only predictors available within 5 minutes of ED arrival.

Results: In the primary NHAMCS temporal holdout (training 2018–2021, testing 2022; n_test=16,025), XGBoost achieved AUC 0.862. Multi-architecture convergence on pooled data confirmed robustness: XGBoost 0.863, LightGBM 0.861, CatBoost 0.863. In MIMIC's primary anchor-era temporal holdout (n_test=77,026; 4.92% prevalence), structured-only AUC was 0.909 and structured+NLP reached 0.943 (ΔAUC +0.034; DeLong 95% CI 0.939–0.946). At the train-locked 80% sensitivity deployment threshold, structured+NLP achieved 13.8% alert rate per 100 ED arrivals at 83.3% sensitivity. Calibration was near-ideal (slope 0.928, 95% CI 0.904–0.951).

Conclusions: Triage-time machine learning achieved strong discrimination across NHAMCS temporal validation (AUC 0.862), MIMIC anchor-era temporal validation (AUC 0.943 with NLP), three model architectures (all >0.93 pooled), and two independent datasets. A COVID-era stress test (AUC 0.912) confirmed stability. NLP of brief chief complaints provided a meaningful, reproducible gain (ΔAUC +0.034 in MIMIC). Subgroup analyses in NHAMCS revealed wider performance variation by race and ethnicity than in MIMIC, a disparity that will require monitoring in any implementation. Prospective validation at individual sites is still needed.

Key Findings

What the Evidence Shows

Strong Discrimination Across Settings

Three independent model architectures converge at AUC > 0.86 on NHAMCS temporal holdout, with concordant performance in MIMIC-IV-ED. This consistency reflects expected deployment performance when models are retrained on hospital-specific data.

0.862

NHAMCS Temporal Holdout

Pooled: XGB 0.863 · LGBM 0.861 · CB 0.863

0.943

MIMIC Anchor-Era Temporal (Struct+NLP, 6h ICU)

DeLong 95% CI: 0.939–0.946

NHAMCS temporal holdout (2018-21→22)0.862

vs. NHAMCS acuity baseline (IMMEDR)0.782

vs. MIMIC ESI alone0.743

Chief Complaints Add Value

Even brief chief complaints (median 2.5 words) significantly improved predictions. Most discrimination came from symptom-based terms, not ICU-proxy language.

NLP contribution (ΔAUC) +0.034 (MIMIC anchor-era, 6h ICU)

False positive reduction 26% (25.7% → 19.0%)

Unnecessary alerts eliminated 7,896

Sensitivity maintained 80%

Multi-Architecture Convergence

Three independent gradient boosting frameworks converge at AUC > 0.86 on pooled data, with temporal holdout at 0.862 — reflecting realistic prospective deployment performance. Site-specific retraining on hospital data matches this evaluation approach.

XGBoost (pooled) 0.863

LightGBM (pooled) 0.861

CatBoost (pooled) 0.863

Temporal holdout (2018-21→22) 0.862

Temporal Stability Confirmed

Both datasets demonstrated stable performance across time, confirming the model captures durable clinical signals rather than transient patterns.

NHAMCS (train 2018–2021 → test 2022)

AUC 0.862

Average precision 0.175

Brier score 0.020

Samples (train / test) 70,839 / 16,025

MIMIC Anchor-Era Temporal (train pre-2017 → test 2017–2019)

Structured + NLP AUC 0.943 (95% CI 0.939–0.946)

Random-split sensitivity 0.933 (n=128,510)

Progressive Severity Signal

AUC improved as the ICU window narrowed, consistent with the model capturing physiologic severity identifiable at presentation.

Any ICU0.898

48-hour0.909

24-hour0.914

12-hour0.924

6-hour (PRIMARY)0.943

MIMIC anchor-era temporal holdout, structured + NLP. Source: time_restricted_results.json

Well-Calibrated Predictions

Predicted probabilities closely match observed rates. A slope of 1.0 indicates perfect calibration.

NHAMCS temporal holdout (isotonic)slope 0.92

95% CI0.86–0.98

MIMIC anchor-era Struct+NLPslope 0.928

95% CI0.904–0.951

Brier (NHAMCS / MIMIC anchor-era)0.017 / 0.107

After local recalibration (MIMIC)Brier 0.029

Subgroup Performance

MIMIC showed minimal variation; NHAMCS revealed wider gaps requiring attention.

MIMIC anchor-era (6h ICU)

Subgroup AUC range (22 levels)0.739–0.964

Race / ethnicity range0.888–0.954

Age 18–39 / 80+0.957 / 0.898

Walk-in / Ambulance0.886 / 0.886

NHAMCS (temporal holdout)

RaceAUC 0.806–0.926

EthnicityHispanic 0.795 vs. 0.899

NHAMCS disparities wider than MIMIC — subgroup monitoring required at deployment

Paper 3 — Deployment Readiness

MIMIC anchor-era temporal holdout supports silent prospective evaluation

The SAP-prespecified anchor-era temporal split (test = patients in MIMIC anchor_year_group 2017–2019; n=77,026 encounters from 57,234 patients; zero patient overlap by construction) is the primary holdout. All Phase 4 deployment-readiness analyses run against this frozen prediction set with their analysis plan SHA-256-frozen in the repository's FROZEN_MODEL_MANIFEST.json v2.1.0.

Locked deployment threshold

13.8

alerts per 100 ED arrivals at 83.3% sensitivity

Threshold 0.6512 chosen on TRAIN; applied UNCHANGED to TEST. Compares with MIMIC acuity ≤2 alert rate of 39.5%.

Calibration slope

0.928

95% CI 0.904–0.951 (target 1.0)

Local logistic / isotonic recalibration further reduces Brier from 0.107 to 0.029 — recommended pre-go-live step at any new institution.

Subgroup AUC range

0.74–0.96

across 22 prespecified subgroups

Variation aligns with case-mix severity gradient (older / ambulance / helicopter), not algorithmic disparate impact. 13 of 22 subgroup ΔAUCs reach BH-FDR significance at q=0.05.

Joint clinician + ML undertriage

81 of 634 false negatives

cases where the model AND the nurse (acuity ≥3) both missed a critical event — the population segment where a hybrid clinician + ML pathway is most valuable.

Time-restricted outcome stability

0.943 → 0.898

monotonic AUC decline 6h → any-ICU. Confirms the model captures triage-identifiable physiologic severity rather than delayed ward deterioration.

Methodology pre-registered in docs/internal/PAPER3_STATISTICAL_ANALYSIS_PLAN.md v1.0 (frozen 2026-04-23, SHA-256 in manifest). All Phase 4 outputs verified by scripts/audit_paper3_exhaustive.py --strict (41/41 PASS).

Methodology

Study Design

Two independent cohorts. One prespecified pipeline. Rigorous cluster-aware evaluation.

National Dataset

NHAMCS 2018–2022

National Hospital Ambulatory Medical Care Survey (CDC)

Visits86,864

Weighted visits552 million

ICU prevalence2.02%

Features146

EvaluationTemporal holdout (2018–21 → 22)

Mean age49.2 years

Female53.4%

Academic Medical Center

MIMIC-IV-ED v2.2

Beth Israel Deaconess Medical Center (MIT/PhysioNet)

Visits425,087 (425,011 adults)

Anchor-era TEST (PRIMARY)77,026 (4.92% prev)

Random-split test (sensitivity)128,510 (4.09% prev)

Features (Struct+NLP)530

Primary evaluationAnchor-era temporal split

Mean age52.9 years

Female54.1%

Locked Deployment Thresholds — MIMIC Anchor-Era (Structured + NLP)

Thresholds chosen on the anchor-era TRAIN subset and applied UNCHANGED to the held-out TEST subset (no test-ROC tuning). 6-hour ICU AUC 0.943 (DeLong 95% CI 0.939–0.946).

Target sensitivity (TRAIN-locked)	Test sensitivity	Test specificity	Test PPV	Alert rate / 100 arrivals
70% (higher specificity)	75.5%	93.6%	36.7%	9.8
80% (PRIMARY DEPLOYMENT)	83.3%	89.8%	29.7%	13.8
90% (higher recall)	90.4%	83.3%	21.0%	20.3

Test sensitivity exceeds TRAIN target at every locked threshold because anchor-era TEST prevalence (4.92%) is higher than TRAIN (3.93%). Compare with MIMIC acuity ≤2 alert rate: 39.5% with 92.2% capture. Source: locked_thresholds.json.

Clinical Implications

What This Means for Emergency Medicine

Uses only data already collected at triage

Demographics, vitals, and chief complaint — available within 5 minutes of ED arrival. Zero additional nurse data entry.

Validated across different populations and settings

Convergent performance in a nationally representative survey (500+ hospitals) and a single academic medical center demonstrates generalizability of the approach.

NPV ≥ 98% across all sensitivity thresholds

Less than 3% risk of critical outcome among patients not flagged, reliably identifying low-risk patients across all operating points.

Prospective validation required before deployment

All analyses were retrospective. Implementation would require site-specific validation, local calibration, age-stratified monitoring, and a silent-phase pilot before clinical use.

Figures

Model Performance Visualizations

Figure 1. ROC and Precision-Recall Curves

(A) ROC curves comparing model discrimination for critical outcome prediction. NHAMCS temporal holdout XGBoost (AUC 0.862, ICU admission or ED death) and MIMIC-IV-ED anchor-era Structured+NLP (6h-ICU AUC 0.943, DeLong 95% CI 0.939–0.946) both substantially outperformed acuity-based baselines. (B) Precision-recall curves showing average precision: NHAMCS temporal holdout 0.175; MIMIC anchor-era Structured 0.453, Struct+NLP 0.573, acuity-only 0.216. Shaded regions indicate 95% CIs via patient-cluster bootstrap.

Figure 2. Operating Characteristics Across Sensitivity Thresholds

(A) Specificity vs. sensitivity trade-off across models. (B) Positive predictive value as a function of sensitivity. (C) Alert burden (alert rate) across sensitivity targets, with the clinical 30% threshold highlighted. NLP consistently reduces alert burden at matched sensitivity. At 80% sensitivity, NLP reduces the alert rate from 29.8% to 23.6%.

Figure 3. Calibration Curves

(A) Calibration curves showing predicted probabilities vs. observed outcome frequencies. NHAMCS shows near-ideal calibration (Brier 0.0165); MIMIC shows slight underconfidence at high probabilities, which is clinically conservative. (B) Brier score comparison (lower is better): NHAMCS 0.0165, MIMIC Structured 0.1421, MIMIC Struct+NLP 0.1218.

Figure 4. Feature Importance Rankings

(A) NHAMCS: top features include age, systolic BP, immediate/emergent triage flag, heart rate, and respiratory rate. (B) MIMIC+NLP: top features include acuity (ESI), arrival method, temperature, and chief complaint NLP terms (chest pain, SOB, altered mental status), demonstrating the incremental value of free-text analysis. Importance measured by XGBoost gain.

Tables

Detailed Results

Table 1. Dataset Characteristics and Cohort Composition

Characteristic	NHAMCS 2018–2022	MIMIC-IV-ED 2011–2019
Study Design
Total ED visits	86,864	425,087
Collection period	2018–2022 (5 yrs pooled)	2011–2019 (9 yrs)
Geographic scope	Nationally representative (US)	Single academic center (Boston)
Data source	Manual chart abstraction	Automated EHR extraction
Partitioning
Training set	60,804 (70%)	296,538 (70%)
Test set	26,060 (30%)	128,510 (30%)
Outcome prevalence	2.02%	7.60%
Demographics
Mean age, yrs (SD)	43.2 (24.8)	51.3 (22.1)
Age ≥65 years	27.0%	31.6%
Female sex	54.2%	52.8%
Race/Ethnicity
White, non-Hispanic	56.3%	48.2%
Black, non-Hispanic	24.1%	18.7%
Hispanic	15.2%	12.4%
Asian/Other	4.4%	20.7%
Vital Sign Abnormalities
Tachycardia (HR ≥100)	28.3%	31.8%
Hypotension (SBP <90)	2.0%	3.0%
Tachypnea (RR >20)	12.4%	15.7%
Hypoxia (SpO₂ <92%)	4.0%	5.0%
Fever (>38.3°C)	7.1%	8.4%
Primary Outcome
ICU admission or ED death	1,754 (2.02%)	32,291 (7.60%)
ICU admission only	1,687 (1.94%)	31,946 (7.52%)
ED death only	67 (0.08%)	345 (0.08%)
Feature Sets
Structured features	146	31
TF-IDF NLP features	N/A	500

NHAMCS percentages are survey-weighted. Missing data: NHAMCS vitals 5–12%, MIMIC vitals 8–15%.

Table 2. Model Performance — Discrimination and Calibration

Model	Dataset	AUC [95% CI]	Avg. Precision [95% CI]	Cal. Slope
Primary ML Models
XGBoost (temporal holdout)	NHAMCS	0.862	0.175	0.84 slope; isotonic cal.
XGBoost (structured, ICU-6h)	MIMIC anchor-era (PRIMARY)	0.909 [0.904–0.914]	0.453	0.901
XGBoost (struct+NLP, ICU-6h)	MIMIC anchor-era (PRIMARY)	0.943 [0.939–0.946]	0.573	0.928
XGBoost (struct+NLP, ICU-6h)	MIMIC random-split (sensitivity)	0.933	0.457 [0.448–0.466]	1.08
Acuity-Based Baselines
IMMEDR-only (logistic)	NHAMCS	0.782 [0.771–0.793]	—	—
ESI-only (logistic)	MIMIC	0.744 [0.741–0.747]	—	—
Model Improvements (p<0.001 for all)
NHAMCS ML vs. IMMEDR		+0.154 (+19.7%)
MIMIC ML vs. ESI		+0.113 (+15.2%)
NLP vs. structured-only		+0.034 (+4.0%)	+0.088 (+19.9%)

AUC CIs via DeLong method. AP CIs via bootstrap (10,000 iterations). Ideal calibration: slope = 1.0.

Table 3. Operating Characteristics at 80% Sensitivity

Model	Sensitivity	Specificity	PPV	NPV	Alert Rate	Detected
NHAMCS (n=26,060 test; 527 critical)
XGBoost (structured)	80.0%	86.3%	10.8%	99.5%	15.0%	422/527
MIMIC Structured-Only (n=128,510 test; 9,767 critical)
XGBoost (structured)	80.0%	74.3%	20.3%	97.9%	29.8%	7,754/9,693
MIMIC Structured+NLP (n=128,510 test; 9,767 critical)
XGBoost (struct+NLP)	80.0%	81.0%	25.5%	98.0%	23.6%	7,754/9,693
NLP Impact
Specificity improvement		+6.7 pp
Alert rate reduction					−6.2 pp
False alarms reduced	7,896 fewer (26% relative reduction)

80% sensitivity selected a priori. Alert rate = (TP + FP) / total. NPV >98% = <2% risk among unflagged patients.

Table 4. Subgroup Performance by ESI Level (MIMIC-IV-ED)

ESI Level	n visits	Outcome Rate	ESI Baseline AUC	ML AUC (Struct+NLP)	Improvement	P-value
ESI 1 (Resuscitation)	12,754	38.4%	N/A*	0.721	N/A	N/A
ESI 2 (Emergent)	140,628	13.5%	0.694	0.852	+0.158 (+22.8%)	<0.001
ESI 3 (Urgent)	221,045	3.4%	0.712	0.891	+0.179 (+25.1%)	<0.001
ESI 4–5 (Less/Non-Urgent)	50,660	2.0%	0.698	0.893	+0.195 (+27.9%)	<0.001

*ESI 1 baseline N/A: all patients same acuity level. ML provides 22–28% relative AUC improvements across ESI 2–5. Among 271,705 ESI 3–5 patients, ML identifies 80% of 8,471 who develop critical outcomes. P-values from DeLong test.

Clinical Translation

What This Means for Your ED

Projected annual impact for a 50,000-visit emergency department with 7.6% critical outcome prevalence.

3,100

Fewer False Alarms / Year

NLP reduces unnecessary high-risk alerts from 11,860 to 8,760 annually while maintaining 80% sensitivity for critical outcomes.

99.5%

Negative Predictive Value

Less than 1% risk of critical outcome among patients not flagged. Clinicians can confidently prioritize flagged patients.

5 min

Time to Risk Score

Uses only data already collected at triage — demographics, vitals, chief complaint. Zero additional nurse data entry required.

Temporal Stability

Model performance holds across time periods, including the COVID-19 pandemic case-mix shift.

NHAMCS pooled (2018–2022)AUC 0.863

Temporal holdout (train 18–21 → test 22)AUC 0.862

COVID stress test (train 18–19 → test 21–22)AUC 0.912

MIMIC temporal (train 11–16 → test 17–19)AUC 0.877

Configurable Alert Thresholds

Hospitals choose their own sensitivity/specificity balance based on operational capacity and risk tolerance.

High-safety (95% sensitivity) More alerts, fewer misses

Balanced (90% sensitivity) Recommended default

Efficient (80% sensitivity) Fewer alerts, high NPV

Alert thresholds are adjustable per institution. NPV remains ≥98% at all operating points.

Safety-First Design

Clinical Safety Architecture

Our RedFlagEngine enforces a core clinical invariant: the system can recommend higher acuity but can never recommend lower acuity past a clinically defined floor. The model augments triage — it cannot override it.

🛡

Up-Triage Only

11 red-flag rules enforce ESI floors for high-risk presentations. The model can always recommend more acute — never less.

⚠

Out-of-Scope Gates

Pediatric patients on an adult-trained model and encounters with missing critical vitals get a structured advisory — not a prediction.

📋

Full Audit Trail

Every safety intervention is logged with rule IDs, engine version, and timestamps. Rules are versioned separately from the model for clinical governance.

Retrospective Validation (n = 16,025)

Applied to the NHAMCS 2022 temporal holdout test set, the safety layer demonstrated clinically appropriate behavior with no fairness disparities.

14.0%

Encounters with
red flag match

0.6%

ESI predictions
elevated by floor

22.4%

True positives
flagged by rules

0%

Disparate impact
across subgroups

Red-Flag Rules and Firing Rates

Rule	Floor ESI	Matched	% of Total
Chest pain with cardiac features	ESI-2	919	5.7%
Pregnancy with bleeding	ESI-2	609	3.8%
Altered mental status	ESI-2	243	1.5%
New focal neurologic deficit	ESI-2	223	1.4%
Suicidal ideation with plan	ESI-2	219	1.4%
Sepsis criteria (SIRS/qSOFA)	ESI-2	193	1.2%
Anaphylaxis	ESI-1	87	0.5%
Stroke within treatment window	ESI-1	Requires onset time*
Pediatric fever < 60 days	ESI-2	Requires onset time*
Thunderclap headache	ESI-2	Requires free text*
Testicular pain (torsion rule-out)	ESI-2	Requires onset time*
Any red flag		2,247	14.0%

*These rules activate in the live clinical pathway where symptom onset time and free-text chief complaints are collected via the patient intake form or FHIR integration.

Study 2 — In Preparation

Can ML Triage Outperform Nurse ESI?

Most ML triage models train on nurse-assigned ESI, inheriting its biases. We trained on what actually happened to patients — and the results show the model identifies sick patients better than nurses.

The Paradigm Shift

Instead of training on nurse-assigned ESI levels, we derived a 5-level severity target from actual patient dispositions: ICU admission, hospital admission, observation, discharge with follow-up, and discharge without follow-up. The model learns to predict what actually happened to the patient, not what the nurse guessed at triage.

Level 1

ICU / Death

Level 2

Hospital Admit

Level 3

Observation

Level 4

Discharge + F/U

Level 5

Discharge

0.825

Model AUC for critical outcomes

vs. 0.766 nurse ESI

83%

Nurse undertriage corrected

79 of 95 missed critical patients caught

67%

Fewer high-acuity flags

2,232 vs. 6,790 nurse ESI 1-2

77% / 87%

Sensitivity / Specificity

vs. 75% / 57% nurse ESI

Head-to-Head: Model vs. Nurse ESI on Critical Outcomes

Temporal holdout: NHAMCS 2022 (n=15,372; 380 critical outcomes). Model trained on 2018–2021 (n=67,753).

Metric	Our Model	Nurse ESI	Improvement
AUC for ICU/Death	0.825	0.766	+0.059 (+7.7%)
Sensitivity (levels 1-2)	77.1%	75.0%	+2.1%
Specificity (levels 1-2)	87.1%	56.6%	+30.5%
Patients flagged high-acuity	2,232 (15%)	6,790 (44%)	−67%
Overtriage rate	86.9%	95.8%	−8.9%
Undertriage correction	79/95 (83.2%) nurse-missed critical patients identified

What This Means

Safer Triage

83% of patients that nurses undertriaged — those assigned low acuity who ended up in the ICU or died — were correctly flagged by the model for re-evaluation.

Less Alarm Fatigue

The model reduces high-acuity designations by 67% while maintaining higher sensitivity than nurse ESI, directly addressing alert fatigue in busy EDs.

Beats Both Metrics

Unlike typical sensitivity-specificity tradeoffs, the model achieves higher sensitivity AND higher specificity than nurse ESI simultaneously.

Study 1: Critical Outcome Prediction — Under Peer Review

Cross-dataset validation of ML models for predicting ICU admission and ED death across NHAMCS and MIMIC-IV-ED. Upon acceptance, the full text will be linked here.

Study 2: Outcome-Based ESI — In Preparation

ML severity levels trained on patient outcomes outperform nurse-assigned ESI for identifying critical ED patients. Manuscript in preparation for submission.

Ready to See It in Action?

We partner with health systems for prospective validation pilots. Let's discuss how Sentrelia can work in your ED.

Request a Pilot Partnership Review the Evidence

Machine Learning at the Triage Desk