Quick Take
- In a real-world stress test of 300 clinical queries, "raw" generative AI models (ChatGPT-4o and Gemini 1.0) achieved only 19% accuracy when strictly graded on both correct answers and reliable references.
- The failure modes pose distinct operational risks: ChatGPT-4o acted as a "confident fabricator" (15% of answers completely inaccurate, with fabricated or irrelevant citations), while Gemini 1.0 acted as a "vague avoider" (46% vague, partial, or deflecting responses), either of which forces pharmacists to redo the research themselves.
- While this study confirms that ungrounded chatbots are unsafe for standalone use, it benchmarks "pre-reasoning" models from mid-2024, highlighting why pharmacy leaders must pivot to retrieval-augmented generation (RAG) and agentic workflows in late 2025.
Why It Matters
- Pharmacy departments face a "volume crisis," with drug information requests rising nearly 70% in recent years; this study shows, however, that adopting off-the-shelf chatbots to meet that demand invites "confident misinformation" into the clinical workflow.
- The safety stakes are immediate: the study documented "hallucination traps," including one in which the AI incorrectly cleared a sodium phosphate-containing autoinjector for a patient sensitive to that ingredient, underscoring that "human-in-the-loop" verification remains non-negotiable.
- For leaders in late 2025, these findings signal the end of the "Chatbot Era" and the necessity of an "Agentic Era": moving from models that rely on fallible probabilistic memory to integrated systems that look up facts in verified compendia before answering.
What They Did
- Researchers sampled 300 authentic drug information questions submitted to the University of Michigan Health (UMH) Medication Use Policy team, excluding patient-specific items to focus on generalizable clinical knowledge.
- They tested two representative models (ChatGPT-4o and Gemini 1.0) using standardized prompts that identified the user as a pharmacist requesting references, a design that tests "closed-book" memorization rather than modern tool-use.
- A senior pharmacy intern, under the supervision of clinical specialists, adjudicated responses using a strict "zero-tolerance" framework where any inaccuracy or fabricated reference resulted in a failure.
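A minimal sketch of how that pass/fail grading rule composes is shown below; the field names are hypothetical, since the study describes its rubric in prose rather than code.

```python
from dataclasses import dataclass

@dataclass
class GradedResponse:
    answer_fully_accurate: bool   # clinical conclusion matches the specialists' reference answer
    references_all_valid: bool    # every citation exists and actually supports the claim

def passes_zero_tolerance(response: GradedResponse) -> bool:
    # Success only if the answer is completely accurate AND every reference is
    # real and relevant; a single inaccuracy or fabricated citation fails the
    # whole response.
    return response.answer_fully_accurate and response.references_all_valid

# A correct answer propped up by one fabricated citation still counts as a failure:
print(passes_zero_tolerance(GradedResponse(True, False)))  # False
```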
What They Found
- Both models hit a hard ceiling of 19% for completely accurate answers with valid references, demonstrating that ungrounded LLMs cannot reliably support complex clinical decision-making without external validation tools.
- ChatGPT-4o showed a high risk of Type I errors (false positives), providing completely inaccurate answers with fabricated or irrelevant references in 15% of cases, creating a dangerous verification burden.
- Gemini 1.0 showed a high risk of Type II errors (failures to answer), providing partially accurate, vague, or "consult your doctor"-style responses in 46% of cases, limiting its utility as a workflow accelerator.
- Performance degraded sharply with complexity: the models could retrieve simple package-insert data, but they failed far more often when asked to synthesize the literature or identify manufacturer-specific details such as latex content or excipients.
Takeaways
- Treat ungrounded chatbots as "drafting tools" only; an 81% failure rate for fully reliable answers means these tools should never function as autonomous providers of drug information.
- Recognize the hidden cost of the "verification tax": because models like ChatGPT can hallucinate plausible-looking but non-existent citations, pharmacists must fact-check every claim, which can take longer than doing the primary research themselves.
- Update your AI strategy to demand "grounding"; the failures in this study stem from the models' reliance on internal training data, confirming that enterprise deployment requires a retrieval-augmented generation (RAG) architecture that anchors answers to trusted medical databases (see the sketch after this list).
- Implement strict governance against "Shadow AI"; the study reveals that consumer-grade tools lack the "world model" to distinguish latex in a vial stopper from latex in the drug solution itself, a nuance that requires specialized oversight.
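As flagged in the grounding takeaway above, here is a minimal retrieval-augmented generation sketch, assuming a hypothetical `search_compendium` retriever over a licensed drug compendium and a hypothetical `call_llm` endpoint; it illustrates the pattern, not any specific vendor API.

```python
from typing import Dict, List

def search_compendium(question: str) -> List[Dict[str, str]]:
    """Hypothetical retriever over a verified drug compendium (e.g., a licensed
    database indexed for search); returns passages with stable IDs."""
    raise NotImplementedError

def call_llm(prompt: str) -> str:
    """Hypothetical enterprise LLM endpoint."""
    raise NotImplementedError

def answer_drug_question(question: str) -> str:
    passages = search_compendium(question)
    if not passages:
        # Fail closed: with no verified source text, escalate instead of generating.
        return "No verified source found; route to a pharmacist."
    context = "\n\n".join(f"[{p['id']}] {p['text']}" for p in passages)
    prompt = (
        "Answer the pharmacist's question using ONLY the numbered excerpts below. "
        "Cite excerpt IDs for every claim. If the excerpts do not answer the "
        "question, say so rather than guessing.\n\n"
        f"Excerpts:\n{context}\n\nQuestion: {question}"
    )
    return call_llm(prompt)
```

The design choice that matters here is the fail-closed branch: when retrieval finds nothing in the trusted source, the system escalates to a human rather than letting the model answer from memory.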
Strengths and Limitations
Strengths:
- The study leverages a "ground truth" dataset of complex, real-world queries rather than simplified board-exam questions, ensuring the results reflect actual clinical ambiguity.
- Adjudication criteria were deliberately strict, requiring both the clinical conclusion and the supporting references to be flawless, aligning with the safety culture of inpatient pharmacy.
Limitations:
- The evaluation benchmarks models (ChatGPT-4o, Gemini 1.0) that are effectively "last generation" in late 2025, pre-dating the reasoning and "deep think" capabilities of newer enterprise models.
- The "closed-book" testing methodology evaluated the models' ability to memorize facts rather than their ability to use tools (browsing, database access) to verify them, which is the standard for modern enterprise AI agents.
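To make that methodological gap concrete, a short contrast between the closed-book call the study graded and the tool-assisted call a modern agent would make; `llm` and `lookup_source` are hypothetical stand-ins, and the prompt wording is illustrative rather than the study's.

```python
def llm(prompt: str) -> str:
    raise NotImplementedError  # hypothetical chat-model endpoint

def lookup_source(question: str) -> str:
    raise NotImplementedError  # hypothetical query against a verified drug database

def closed_book_answer(question: str) -> str:
    # The study's setup: the model answers purely from its training-data memory.
    return llm(f"You are assisting a pharmacist. Answer and provide references.\n{question}")

def tool_assisted_answer(question: str) -> str:
    # The agentic setup the study did not test: retrieve authoritative text, then answer from it.
    source_text = lookup_source(question)
    return llm(f"Answer using only this source and cite it:\n{source_text}\n\nQuestion: {question}")
```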
Bottom Line
This study offers strong evidence that "raw" generative AI chatbots (circa 2024) are unsuitable for independent drug information duties, exhibiting dangerous hallucinations and delivering fully accurate, properly referenced answers for only 19% of queries. Pharmacy leaders should view these findings as a baseline that validates the need for strict human oversight and a strategic shift toward the validated, retrieval-augmented generation (RAG) systems becoming available in late 2025.