Methods

  • Dataset: 3,456 cancer patient records (2017–2021) from a single radiation oncology institution; 762 unique diagnoses (326 ICD-10 codes, 436 free-text notes).
  • Categories: 14 oncology categories (e.g., Breast, Lung/Thoracic, CNS, Metastasis, Benign).
  • Models:
      • GPT-3.5 (OpenAI)
      • GPT-4o (OpenAI omni)
      • Llama 3.2 (Meta, open-source)
      • Gemini 1.5 (Google)
      • BioBERT (fine-tuned on the dataset, 3 epochs)
  • Evaluation: Zero-shot prompting for the LLMs; fine-tuning for BioBERT. Metrics: accuracy and weighted F1-score with 95% bootstrapped CIs (computation sketched below). Error analysis identified misclassification patterns.
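
To make the evaluation setup concrete, here is a minimal sketch of computing accuracy and weighted F1 with percentile-bootstrapped 95% CIs using scikit-learn. The labels and predictions are placeholders rather than the study's data, and the resampling details (1,000 replicates, percentile intervals) are assumptions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def bootstrap_ci(y_true, y_pred, metric, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap: resample cases with replacement and recompute the metric."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))  # sample case indices with replacement
        scores.append(metric(y_true[idx], y_pred[idx]))
    lo, hi = np.percentile(scores, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return metric(y_true, y_pred), lo, hi

# Placeholder labels/predictions; the study used 14 oncology categories.
y_true = ["Breast", "Lung/Thoracic", "Benign", "Metastasis", "Breast"]
y_pred = ["Breast", "Lung/Thoracic", "Metastasis", "Metastasis", "Breast"]

acc, acc_lo, acc_hi = bootstrap_ci(y_true, y_pred, accuracy_score)
f1, f1_lo, f1_hi = bootstrap_ci(
    y_true, y_pred, lambda t, p: f1_score(t, p, average="weighted", zero_division=0)
)
print(f"accuracy {acc:.3f} (95% CI {acc_lo:.3f}-{acc_hi:.3f})")
print(f"weighted F1 {f1:.3f} (95% CI {f1_lo:.3f}-{f1_hi:.3f})")
```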

Key Results

  • ICD-coded diagnoses:
      • BioBERT highest (accuracy 90.8%, F1 84.2%).
      • GPT-4o matched on accuracy (90.8%) but with lower F1 (81.2%).
      • GPT-3.5 lower F1 (61.6%) despite strong accuracy (88.9%).
      • Gemini and Llama trailed (~77–81% accuracy).
  • Free-text diagnoses:
      • GPT-4o best performer (accuracy 81.9%, F1 71.8%).
      • BioBERT similar accuracy (81.6%) but lower F1 (61.5%).
      • GPT-3.5 dropped to 75.9% accuracy, F1 52.5%.
      • Gemini and Llama struggled (accuracy ~64%, F1 44–52%).
  • Error patterns:
      • Benign conditions misclassified as malignant due to gaps in ICD wording.
      • Metastasis often mislabeled as a primary cancer.
      • Vague “unknown” cases over-assigned to specific cancer categories.
  • Operational aspects: Cloud APIs (GPT, Gemini) vs. local deployment (Llama, BioBERT). Integration into Python pipelines is feasible (see the sketch below), though resource and cost considerations are noted.
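
For illustration, below is a minimal sketch of the kind of zero-shot cloud-API call described above, using the OpenAI Python client. The prompt wording, the category subset, and the model name are assumptions, not the study's exact protocol.

```python
from openai import OpenAI  # pip install openai; expects OPENAI_API_KEY in the environment

# Illustrative subset of the 14 categories; the study's full list is not reproduced here.
CATEGORIES = ["Breast", "Lung/Thoracic", "CNS", "Metastasis", "Benign", "Unknown"]

client = OpenAI()

def classify_diagnosis(text: str, model: str = "gpt-4o") -> str:
    """Zero-shot classification: ask the model to return exactly one category name."""
    prompt = (
        "Classify the following cancer diagnosis into exactly one of these categories: "
        + ", ".join(CATEGORIES)
        + ".\nRespond with the category name only.\n\nDiagnosis: "
        + text
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep outputs as deterministic as possible for labeling
    )
    return response.choices[0].message.content.strip()

print(classify_diagnosis("infiltrating ductal carcinoma, left breast"))
```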

Comparative Context

Other recent studies support these findings:

  • Bürgisser 2024: Llama 2 achieved 95% accuracy detecting gout from French notes, outperforming regex.
  • Du 2025: Systematic review (196 studies) highlights widespread LLM use in clinical decision support (CDS) but notes the need for standardized evaluation.
  • Guevara 2024: Fine-tuned models outperformed GPT in extracting social determinants of health, capturing context missed by ICD codes.

Authors’ Conclusions

  • LLMs and BioBERT show strong potential for structuring EHR data.
  • BioBERT excels at standardized text; GPT-4o competitive on free-text.
  • Current performance (~81–91% accuracy) is promising but not safe for autonomous clinical decision support.
  • Recommended hybrid approach: automate routine categorization while flagging ambiguous or low-confidence cases for human review (see the sketch after this list).
  • Limitations: single institution, oncology-only dataset, custom categories, limited generalizability.
  • Urges validation on multi-center datasets and implementation of safeguards (confidence thresholds, human-in-the-loop review).
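
As a concrete illustration of the recommended hybrid workflow, the sketch below routes predictions by confidence, assuming a confidence score is available (e.g., a fine-tuned classifier's top softmax probability). The 0.85 threshold is an illustrative assumption, not a value from the paper.

```python
from dataclasses import dataclass

REVIEW_THRESHOLD = 0.85  # assumed cutoff; in practice tuned on a validation set

@dataclass
class Prediction:
    record_id: str
    category: str
    confidence: float  # e.g., max softmax probability from a fine-tuned classifier

def route(pred: Prediction) -> str:
    """Hybrid workflow: auto-accept confident predictions, queue the rest for human review."""
    if pred.confidence >= REVIEW_THRESHOLD:
        return "auto-accept"
    return "human-review"

batch = [
    Prediction("rec-001", "Breast", 0.97),
    Prediction("rec-002", "Metastasis", 0.62),  # ambiguous case -> flagged for review
]
for p in batch:
    print(p.record_id, p.category, route(p))
```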

Background and Objective

Electronic health records (EHRs) contain large amounts of unstructured text (e.g., clinician notes, free-text diagnoses) that are labor-intensive to preprocess for use in predictive models. This study evaluated the performance of four large language models (GPT-3.5, GPT-4o, Llama 3.2, Gemini 1.5) and a biomedical transformer (BioBERT) in classifying cancer diagnoses into 14 oncology-relevant categories. Models were tested on structured ICD-10 code descriptions and free-text diagnosis entries to assess feasibility for automating data structuring for clinical decision support.


Implications for Clinical Informatics

  • Unlocking Unstructured Data: LLMs can transform narrative notes into structured inputs for CDS, registries, and risk models.
  • Caution in CDS: Misclassifications (e.g., metastasis vs benign) could lead to harmful CDS alerts.
  • Vendor-Agnostic Integration: Feasible via Python pipelines and FHIR APIs, avoiding proprietary lock-in (see the sketch after this list).
  • Monitoring: Continuous oversight, bias audits, retraining needed to maintain accuracy and fairness.
  • Ethical Considerations: Liability, transparency, and bias risks highlight the need for human oversight.
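
A minimal sketch of what vendor-agnostic integration could look like, assuming a FHIR R4 endpoint exposing Condition resources. The base URL and patient ID are placeholders, and classify_diagnosis refers to the model-call sketch shown earlier.

```python
import requests

FHIR_BASE = "https://example.org/fhir"  # placeholder FHIR R4 endpoint

def fetch_diagnosis_texts(patient_id: str) -> list[str]:
    """Pull Condition resources for a patient and keep the free-text or coded diagnosis strings."""
    bundle = requests.get(
        f"{FHIR_BASE}/Condition",
        params={"patient": patient_id},
        headers={"Accept": "application/fhir+json"},
        timeout=30,
    ).json()
    texts = []
    for entry in bundle.get("entry", []):
        code = entry["resource"].get("code", {})
        # Prefer the free-text entry; fall back to the first coding's display (e.g., ICD-10 description).
        text = code.get("text") or next(
            (c.get("display") for c in code.get("coding", []) if c.get("display")), None
        )
        if text:
            texts.append(text)
    return texts

# Example usage (placeholder patient ID):
# for text in fetch_diagnosis_texts("example-patient-id"):
#     print(text, "->", classify_diagnosis(text))
```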

Citation

Hashtarkhani S, et al. Cancer diagnosis categorization in EHRs using large language models and BioBERT: model performance evaluation study. JMIR Cancer. 2025;11:e72005. doi:10.2196/72005