Quick Take

  • Restricting retrieval to authoritative neurology domains (whitelisting) increased the correctness of Perplexity's web retrieval-augmented generation (web-RAG) models by 8–18 percentage points (Sonar 60% → 78%; Sonar-Pro 80% → 88%; Sonar-Reasoning-Pro 81% → 89%), roughly halved output variability, and matched or outperformed a literature-only engine (OpenEvidence, 82.5%).
  • For pharmacy teams, applying domain-level whitelisting in web-RAG pipelines is a low-effort operational control that reduces reliance on nonprofessional sources and yields more reliable, guideline-aligned drug information, consult responses, and formulary recommendations.

Why It Matters

  • Open-web retrieval for large language models (LLMs) exposes answers to nonprofessional sources; this study links such sources to hallucinations and factual errors that can undermine evidence-based care and increase clinical risk when web-augmented assistants are used for specialty questions.
  • Marked output variability and weaker performance on case vignettes make answers unreliable across repeated queries, undermining the consistent, guideline-concordant recommendations pharmacists need at the point of care.
  • Domain whitelisting of authoritative guideline sites is a practical, low-friction safety lever to keep continuously updated web-RAG tools aligned with trusted evidence and to support responsible clinical decision support and organizational stewardship of AI-generated information.

What They Did

  • Benchmarked answer quality using a validated 130-item American Academy of Neurology (AAN) question set (65 factual items, 65 case vignettes).
  • Queried three Perplexity web-RAG tiers (Sonar, Sonar-Pro, Sonar-Reasoning-Pro) via the Perplexity API, first with default open-web retrieval and again with retrieval restricted to aan.com and neurology.org (whitelisting); each prompt was run four times per question to capture variability (a configuration sketch follows this list).
  • Compared Perplexity configurations to OpenEvidence (a literature-only RAG service); two blinded neurologists rated every response on a 0–2 scale (0=wrong, 1=inaccurate, 2=correct) with a third neurologist adjudicating disagreements, and an independent reviewer classified cited sources as professional versus nonprofessional.
  • All runs were executed in March 2025 using Perplexity’s native whitelisting feature, and all outputs, ratings, and code were published on GitHub to support reproducibility.
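
To make the retrieval setup concrete, here is a minimal sketch of such a benchmarking loop, assuming Perplexity's documented chat-completions endpoint and its search_domain_filter parameter; the model identifiers and response fields (eg, citations) may differ across API versions, and this is not the study's published code.

```python
import os
import requests

API_URL = "https://api.perplexity.ai/chat/completions"
HEADERS = {"Authorization": f"Bearer {os.environ['PPLX_API_KEY']}"}
MODELS = ["sonar", "sonar-pro", "sonar-reasoning-pro"]
WHITELIST = ["aan.com", "neurology.org"]  # authoritative neurology domains
RUNS_PER_QUESTION = 4  # repeated runs expose stochastic variability


def ask(model: str, question: str, whitelist: list[str] | None = None) -> dict:
    """Query one Perplexity web-RAG tier, optionally restricting retrieval."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": question}],
    }
    if whitelist:
        # Perplexity's native whitelisting: retrieval is limited to the
        # listed domains instead of the open web.
        payload["search_domain_filter"] = whitelist
    resp = requests.post(API_URL, headers=HEADERS, json=payload, timeout=120)
    resp.raise_for_status()
    data = resp.json()
    return {
        "answer": data["choices"][0]["message"]["content"],
        "citations": data.get("citations", []),  # cited URLs, if returned
    }


def benchmark(questions: list[str]) -> list[dict]:
    """Run each question 4x per model, open-web versus whitelisted."""
    rows = []
    for question in questions:
        for model in MODELS:
            for domains in (None, WHITELIST):
                for run in range(RUNS_PER_QUESTION):
                    out = ask(model, question, domains)
                    rows.append({"question": question, "model": model,
                                 "whitelisted": domains is not None,
                                 "run": run, **out})
    return rows
```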

What They Found

  • Whitelisting increased correct-answer rates from 60% to 78% for Sonar, 80% to 88% for Sonar-Pro, and 81% to 89% for Sonar-Reasoning-Pro; OpenEvidence scored 82.5% correct.
  • Mean paired improvements on the 0–2 rating scale were +0.23 (95% CI 0.12–0.34) for Sonar, +0.08 (95% CI 0.01–0.16) for Sonar-Pro, and +0.08 (95% CI 0.02–0.13) for Sonar-Reasoning-Pro; whitelisting also roughly halved output variability (eg, worst-case variability fell from 32.5% to about 13.3% for the stronger models with whitelisting).
  • Source quality strongly predicted accuracy: inclusion of ≥1 nonprofessional source halved the odds of a higher rating for Sonar (odds ratio [OR] 0.50), whereas citing an AAN or neurology document more than doubled the odds (OR 2.18); Sonar-Pro additionally benefited markedly from guideline citations (OR 6.85). An illustrative ordinal-logit sketch follows this list.
  • Operationally relevant outcome: simple domain whitelisting yielded 8–18 percentage-point gains and more consistent outputs, suggesting fewer downstream verifications and more reliable guideline-aligned drug or formulary consults.
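
The source-quality analysis could be reproduced along these lines; the sketch below fits a proportional-odds (ordinal-logit) model with statsmodels on toy, hand-made ratings and source indicators, so the column names and data are illustrative assumptions rather than the published dataset or model specification.

```python
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

# Hypothetical per-response rows: 0-2 rating plus cited-source indicators.
df = pd.DataFrame({
    "rating":               [2, 2, 1, 0, 2, 1, 2, 0, 1, 1, 2, 0],
    "nonprofessional_src":  [0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1],
    "aan_or_neurology_src": [1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0],
})

# Proportional-odds model of the ordinal rating on source type.
model = OrderedModel(
    df["rating"],
    df[["nonprofessional_src", "aan_or_neurology_src"]],
    distr="logit",
)
res = model.fit(method="bfgs", disp=False)

# exp(coef) gives the odds ratio for receiving a higher rating, the same
# quantity reported above (eg, OR 0.50 for >=1 nonprofessional source;
# OR 2.18 for citing an AAN/neurology document).
print(np.exp(res.params[["nonprofessional_src", "aan_or_neurology_src"]]))
```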

Takeaways

  • Adopt a "whitelist-first" RAG pattern for specialty consults: configure the Perplexity API (or equivalent web-RAG tool) to restrict retrieval to authoritative domains (for neurology, aan.com and neurology.org) and log cited URLs for audit and traceability (a minimal triage sketch follows this list).
  • Operationalize review rules: accept outputs that cite AAN or neurology sources for routine factual look-ups, require pharmacist sign-off for case-style or treatment-decision queries, and periodically benchmark accuracy and variability against a literature-only engine (eg, OpenEvidence).
  • Implement focused change management: provide brief training on reading citation trails, identifying nonprofessional sources, and escalation procedures; start pilots in neurology, then expand specialty-by-specialty with a feedback loop for governance to refine whitelists and usage policies.
  • Practical insight: whitelisting serves as a pragmatic mechanism to keep web-RAG assistants aligned with guideline repositories—enabling continuously updated retrieval while limiting exposure to unreliable web content.
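
One way to operationalize these review rules is to triage every response by its cited domains before release; in the sketch below the whitelist, helper names, and sign-off policy are all illustrative assumptions, not part of the study.

```python
from urllib.parse import urlparse

APPROVED_DOMAINS = {"aan.com", "neurology.org"}  # specialty whitelist


def domain_of(url: str) -> str:
    """Normalize a cited URL to its host, with any www. prefix stripped."""
    return urlparse(url).netloc.lower().removeprefix("www.")


def triage(citations: list[str], is_case_query: bool) -> dict:
    """Apply the review rules: auto-accept routine factual look-ups citing
    only approved domains; everything else needs pharmacist sign-off."""
    hosts = [domain_of(u) for u in citations]
    off_list = [h for h in hosts
                if not any(h == d or h.endswith("." + d)
                           for d in APPROVED_DOMAINS)]
    needs_signoff = is_case_query or bool(off_list) or not hosts
    return {"cited_domains": hosts,  # logged for audit and traceability
            "off_whitelist": off_list,
            "needs_pharmacist_signoff": needs_signoff}


# Example: a routine factual look-up citing only AAN content is accepted.
print(triage(["https://www.aan.com/some-guideline"], is_case_query=False))
```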

Strengths and Limitations

Strengths:

  • Blinded, adjudicated neurologist ratings with high interrater agreement (κ=0.86) and four runs per question allowed quantification of both accuracy and stochastic variability (a toy analysis sketch follows this list).
  • Open release of outputs, ratings, and code, plus prespecified statistical analyses (Friedman/Wilcoxon, ordinal-logit), improves reproducibility and clarifies how source quality affects answer quality.
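
The variability measure and the prespecified paired comparison could be computed along these lines; the sketch below uses synthetic ratings (four runs per question on the 0–2 scale), so the printed numbers are illustrative, not the study's results.

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
N_QUESTIONS = 130  # size of the AAN benchmark

# Synthetic 0-2 ratings: 4 runs per question, open-web vs whitelisted.
open_web = rng.choice([0, 1, 2], size=(N_QUESTIONS, 4), p=[0.2, 0.2, 0.6])
whitelisted = rng.choice([0, 1, 2], size=(N_QUESTIONS, 4), p=[0.05, 0.15, 0.8])


def variability(ratings: np.ndarray) -> float:
    """Share of questions whose four runs do not all agree."""
    return float(np.mean(ratings.min(axis=1) != ratings.max(axis=1)))


print(f"open-web variability:    {variability(open_web):.1%}")
print(f"whitelisted variability: {variability(whitelisted):.1%}")

# Paired Wilcoxon signed-rank test on per-question mean ratings,
# mirroring the prespecified paired analysis named above.
stat, p = wilcoxon(whitelisted.mean(axis=1), open_web.mean(axis=1))
print(f"Wilcoxon W={stat:.1f}, p={p:.2g}")
```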

Limitations:

  • Single-domain evaluation (neurology) combined with whitelisting to AAN/neurology sites risks domain–source coupling; effect sizes may differ across specialties or guideline ecosystems.
  • Dependence on proprietary, evolving web-RAG models (and limited technical detail for OpenEvidence) constrains transparency and reproducibility; the study also did not assess real-world clinical impact or include prospective clinical validation.

Bottom Line

Restricting web retrieval to authoritative guideline domains is a low-friction, high-yield operational control: it meaningfully improves accuracy and consistency of web-RAG LLM outputs and is suitable for controlled clinical pilot deployment to support pharmacist consults and formulary decisions.