A Unit Test for LLM Medication Orders—and Weak Spots

Generative Model Development

Authors: Kaitlin Blotske; Xingmeng Zhao; Moriah Cargile, et al.

Journal: arXiv

DOI: 10.64898/2026.01.13.26343949

MedMatch turns messy medication-order language into six standardized JSON templates plus a 100-order clinician benchmark, giving pharmacy teams a practical “unit test” for LLM ordering pilots and regression checks. On strict exact-match scoring, LLMs reached ~64–84% on oral solids/IV intermittent but fell to 23–43% for oral liquids and 0–18% for titratable infusions—signaling governance value, not autonomous ordering readiness.

January 15, 2026

Quick Take

MedMatch defines six standardized JSON medication‑order templates and a 100‑item clinician‑annotated benchmark. Four LLMs were evaluated with one‑shot prompts, each run in triplicate, under strict exact‑match scoring of all JSON fields. On this dataset, category‑level exact‑match accuracy ranged from roughly 64.2–72.5% for oral solids, 72.5–84.3% for IV intermittent, and 62.7–74.5% for IV push, down to 23.3–43.3% for oral liquids and 0–18.2% for titratable continuous infusions. With route removed from the prompt, route‑inference accuracy was high for oral solids (≈98–100%) and IV continuous (100%) but highly variable for IV push (≈18–61%).

MedMatch is positioned as an automated benchmark harness for LLM pilots and regression checks — useful for rapid model/prompt comparison and governance testing, but dependent on explicit equivalence rules, local validation, and human oversight (especially for high‑risk IV workflows); it is not evidence that LLMs are ready to generate or autonomously place clinical medication orders.

Why it Matters

Medication orders are written in many colloquial, abbreviated, and EHR‑specific ways; generic text‑overlap metrics can penalize clinically equivalent phrasing (e.g., '2 g' vs '2000 mg'), so machine‑checkable, medication‑specific standards are needed to evaluate LLM correctness.
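The '2 g' vs '2000 mg' case can be made concrete with a minimal sketch. It assumes a small hand-written conversion table; `TO_MG`, `normalize_dose`, and `doses_equivalent` are hypothetical names for illustration and are not part of MedMatch.

```python
from fractions import Fraction

# Hypothetical conversion table to milligrams. A real deployment would need a
# pharmacist-reviewed list of units and conversions.
TO_MG = {"mcg": Fraction(1, 1000), "mg": Fraction(1), "g": Fraction(1000)}


def normalize_dose(text: str) -> tuple[Fraction, str]:
    """Parse a dose such as '2 g' and return it as (amount, 'mg')."""
    amount, unit = text.strip().split()
    unit = unit.lower()
    if unit not in TO_MG:
        raise ValueError(f"unsupported unit: {unit!r}")
    return Fraction(amount) * TO_MG[unit], "mg"


def doses_equivalent(a: str, b: str) -> bool:
    """True when two dose strings denote the same mass."""
    return normalize_dose(a) == normalize_dose(b)


print("2 g" == "2000 mg")                  # False: plain string comparison
print(doses_equivalent("2 g", "2000 mg"))  # True: comparison after normalization
```

Exact fractions are used instead of floats so that conversions such as 500 mcg to 0.5 mg compare cleanly.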

Small ambiguities in route, frequency/rate, diluent, infusion time, and titration instructions can cause clarification delays or harm; benchmarks therefore need to capture medication‑order semantics, not just lexical similarity.

Standardized, medication‑specific evaluation supports stewardship, clinical decision‑support governance, and scalable monitoring in resource‑constrained clinical settings — but must be paired with local validation and adjudication rules.

What They Did

Surveyed four board‑certified critical care pharmacists on 40 medication prompts presented in three styles (formal written order, verbal handoff, brief chat) to characterize omissions and lexical agreement; medication components (drug/dose/unit/route/frequency) were extracted using an LLM‑assisted parser (GPT‑4o‑mini) and agreement quantified with word‑level Jaccard similarity.
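For readers unfamiliar with the metric, word‑level Jaccard similarity can be sketched as follows. The whitespace tokenization and lowercasing are assumptions, since the summary does not state how the study tokenized responses, and the two example orders are invented for illustration.

```python
def word_jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the sets of lowercased words in a and b."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0  # convention chosen here: two empty strings count as identical
    return len(set_a & set_b) / len(set_a | set_b)


# Invented examples, not taken from the study data.
formal = "cefepime 2 g IV every 8 hours"
brief = "cefepime 2 g q8h"
print(word_jaccard(formal, brief))  # 0.375 (3 shared words out of 8 distinct)
```

The example shows why lexical overlap alone is a weak proxy: "every 8 hours" and "q8h" share no words even though they describe the same frequency.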

A five‑pharmacist Delphi‑style panel used survey results to define six MedMatch JSON templates (oral solid, oral liquid, IV intermittent, IV push, IV continuous titratable, and IV continuous non‑titratable). Two clinicians curated a 100‑medication annotated ground‑truth set, plus a second version with route removed for route‑categorization testing.
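A machine‑checkable template can be pictured as a list of required fields per order type. In the sketch below only the six template names come from the summary; every field list is an illustrative guess and should not be read as the published MedMatch schema.

```python
import json

# Hypothetical required fields per template. The field lists are illustrative
# guesses, not the published MedMatch schema.
REQUIRED_FIELDS = {
    "oral_solid": ["drug", "dose", "unit", "route", "frequency"],
    "oral_liquid": ["drug", "dose", "unit", "concentration", "route", "frequency"],
    "iv_intermittent": ["drug", "dose", "unit", "diluent", "infusion_time",
                        "route", "frequency"],
    "iv_push": ["drug", "dose", "unit", "route", "frequency"],
    "iv_continuous_titratable": ["drug", "starting_rate", "rate_unit",
                                 "titration", "route"],
    "iv_continuous_non_titratable": ["drug", "rate", "rate_unit", "route"],
}


def missing_fields(template: str, order_json: str) -> list[str]:
    """List required fields that are absent or empty in a JSON order."""
    order = json.loads(order_json)
    return [f for f in REQUIRED_FIELDS[template] if order.get(f) in (None, "")]


# Invented order with the route left out, as often happens in brief styles.
example = '{"drug": "metoprolol tartrate", "dose": 25, "unit": "mg", "frequency": "twice daily"}'
print(missing_fields("oral_solid", example))  # ['route']
```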

Tested four LLMs (GPT‑4o‑mini, Gemma‑3‑27B‑IT, Qwen3‑32B, LLaMA‑3.3‑70B‑Instruct) with one‑shot prompts run in triplicate; primary scoring required exact matches across all JSON fields, supplemented by overlap/Jaccard checks. The study reported IRB approval and TRIPOD‑LLM–aligned reporting elements.
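Strict exact‑match scoring over repeated runs might look like the sketch below. Pooling all (run, order) pairs into one rate is an assumption, as the summary does not say how the triplicate runs were aggregated; the function names and toy orders are hypothetical.

```python
def exact_match(predicted: dict, truth: dict) -> bool:
    """Strict scoring: every field and value must equal the ground truth."""
    return predicted == truth


def exact_match_rate(runs: list[list[dict]], truths: list[dict]) -> float:
    """Fraction of (run, order) pairs that match, pooled over repeated runs."""
    hits = sum(exact_match(p, t) for run in runs for p, t in zip(run, truths))
    return hits / (len(runs) * len(truths))


# Toy ground truth and three runs; values are invented for illustration.
truths = [
    {"drug": "cefepime", "dose": "2", "unit": "g"},
    {"drug": "heparin", "dose": "5000", "unit": "units"},
]
runs = [
    [truths[0], truths[1]],                                           # both match
    [{"drug": "cefepime", "dose": "2000", "unit": "mg"}, truths[1]],  # equivalent dose, still a miss
    [truths[0], {"drug": "heparin", "dose": "5000", "unit": "unit"}], # 'unit' vs 'units' is a miss
]
print(round(exact_match_rate(runs, truths), 3))  # 0.667
```

The second run illustrates the trade‑off: a clinically equivalent dose is scored as a failure unless equivalence rules are applied before comparison.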

What They Found

Formal written orders showed very low omission rates (~1–2% across core entities) and a stable element order (drug, dose, unit, route, frequency); verbal and brief styles frequently omitted administration details (route was omitted 94% of the time in verbal style and 78% of the time in brief style).

Word‑level Jaccard similarity was high for drug (0.880) and dose (0.897); overall similarity was highest for formal communication (0.847) versus verbal (0.771) and brief written (0.839).

Clinicians judged formal‑written model outputs acceptable in 78.3% of cases overall (92.5% for oral prompts, 64.1% for intravenous prompts); verbal‑ and brief‑style outputs were judged acceptable in only 6.9% and 10.3% of cases, respectively.

Using strict exact‑match scoring, LLM exact‑match accuracy by category was approximately: oral solids 64.2–72.5%; IV intermittent 72.5–84.3%; IV push 62.7–74.5%; oral liquids 23.3–43.3%; titratable continuous infusions 0–18.2%.

When route was removed from prompts, route‑classification accuracy was high for oral solids (≈98–100%) and IV continuous (100%), variable for oral liquids and IV intermittent, and inconsistent for IV push (≈18–61%).

Models showed reproducible patterns of strengths (drug/dose extraction, some route classes) and failure modes (formulation volumes/concentrations, titratable infusion logic, IV‑push classification) rather than a single clearly superior model.

Inputs in verbal or brief styles produced outputs that clinicians rarely judged acceptable as computer‑generated clinical output, highlighting the models' sensitivity to input style and the importance of input normalization for benchmarking.

Takeaways

Generic language metrics are insufficient for medication tasks; tools like MedMatch show the value of a consistent, machine‑checkable structure to benchmark LLMs at scale.

Use MedMatch as a unit test for med‑order text — helpful for model/prompt comparison and regression testing, but not a substitute for local validation or pharmacist oversight before any clinical use.
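As a regression check, per‑category accuracy from a candidate model or prompt can be compared against a stored baseline. This is a minimal sketch; the `regressions` helper, the tolerance, and all numbers are invented for illustration and are not results from the study.

```python
def regressions(baseline: dict[str, float], candidate: dict[str, float],
                tolerance: float = 0.02) -> dict[str, float]:
    """Return categories whose accuracy dropped by more than `tolerance`.

    Values are the signed change (candidate minus baseline).
    """
    return {
        cat: round(candidate.get(cat, 0.0) - base, 4)
        for cat, base in baseline.items()
        if base - candidate.get(cat, 0.0) > tolerance
    }


# Invented accuracies, not figures from the study.
baseline = {"oral_solid": 0.70, "iv_push": 0.70, "oral_liquid": 0.40}
candidate = {"oral_solid": 0.72, "iv_push": 0.63, "oral_liquid": 0.39}
print(regressions(baseline, candidate))  # {'iv_push': -0.07}
```

Checking each category separately matters here because an improvement in one workflow can hide a drop in a higher‑risk one if only the overall average is tracked.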

Expect uneven performance by workflow; prioritize scrutiny and manual review for IV push vs intermittent distinctions, oral liquids (volumes/concentrations), and titratable infusion details where errors clustered.

When evaluating or governing LLMs, require documented equivalence rules (unit/concentration conversions), explicit omission‑handling policies, and a local validation/adjudication plan that preserves pharmacist review for high‑risk cases.

Strengths and Limitations

Strengths:

Clinician‑grounded design and transparent reporting: MedMatch was developed through a Delphi‑style clinician panel, a clinician‑curated 100‑med benchmark, IRB approval, and alignment with TRIPOD‑LLM reporting elements.

Technical evaluation across multiple models and repeat runs: four LLMs tested with one‑shot prompts in triplicate, using strict exact‑match scoring plus Jaccard overlap to characterize consistency and failure modes.

Limitations:

Small, specialty‑skewed clinician sample (critical care pharmacists) and a modest benchmark (100 meds with limited per‑category counts) limit generalizability to broader inpatient practice and local formulary conventions.

Methodological dependencies may bias measurement: an LLM (GPT‑4o‑mini) was used for extraction/majority‑answer preprocessing, and strict exact‑match scoring can under‑ or over‑penalize clinically equivalent expressions (e.g., unit conversions, concentration equivalences) unless explicit equivalence rules are defined.

Bottom Line

MedMatch is an early, pharmacist‑grounded unit‑test harness for LLM medication outputs — useful for model/prompt regression testing and governance, but it requires explicit equivalence rules, local validation, and continued pharmacist oversight; it is not evidence that LLMs are ready for unsupervised clinical medication ordering.