Background and Objective
Electronic health records (EHRs) contain large amounts of unstructured text (e.g., clinician notes, free-text diagnoses) that are labor-intensive to preprocess for use in predictive models. This study evaluated the performance of four large language models (GPT-3.5, GPT-4o, Llama 3.2, Gemini 1.5) and a biomedical transformer (BioBERT) in classifying cancer diagnoses into 14 oncology-relevant categories. Models were tested on structured ICD-10 code descriptions and free-text diagnosis entries to assess feasibility for automating data structuring for clinical decision support (CDS).
Methods
- Dataset: 3,456 cancer patient records (2017–2021) from a single radiation oncology institution; 762 unique diagnoses (326 ICD-10 codes, 436 free-text notes).
- Categories: 14 oncology categories (e.g., Breast, Lung/Thoracic, CNS, Metastasis, Benign).
- Models:
  - GPT-3.5 (OpenAI)
  - GPT-4o (OpenAI)
  - Llama 3.2 (Meta, open-source)
  - Gemini 1.5 (Google)
  - BioBERT (fine-tuned on the study dataset for 3 epochs)
- Evaluation: Zero-shot prompting for LLMs, fine-tuning for BioBERT. Metrics: accuracy and weighted F1-score with 95% bootstrapped CIs. Error analysis identified misclassification patterns.
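Below is a minimal sketch of this evaluation setup, assuming the OpenAI Python SDK and scikit-learn; the category subset, prompt wording, threshold-free decoding, and bootstrap parameters are illustrative assumptions, not the study's exact protocol.

```python
# Illustrative sketch, not the study's pipeline: zero-shot categorization
# via an LLM API, then accuracy / weighted F1 with 95% bootstrapped CIs.
import numpy as np
from openai import OpenAI
from sklearn.metrics import accuracy_score, f1_score

# Subset of the 14 study categories, for illustration only.
CATEGORIES = ["Breast", "Lung/Thoracic", "CNS", "Metastasis", "Benign"]

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def classify(diagnosis: str) -> str:
    """Zero-shot prompt: ask the model to return exactly one category."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[{
            "role": "user",
            "content": (
                "Classify the following cancer diagnosis into exactly one "
                f"of these categories: {', '.join(CATEGORIES)}. "
                "Reply with the category name only.\n\n"
                f"Diagnosis: {diagnosis}"
            ),
        }],
    )
    return resp.choices[0].message.content.strip()

def bootstrap_ci(y_true, y_pred, metric, n_boot=1000, seed=0):
    """95% percentile CI from resampled (label, prediction) pairs."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), size=len(y_true))
        scores.append(metric(y_true[idx], y_pred[idx]))
    return np.percentile(scores, [2.5, 97.5])

if __name__ == "__main__":
    # Tiny worked example with stand-in labels (not study data).
    y_true = ["Breast", "Benign", "Metastasis", "CNS"]
    y_pred = ["Breast", "Metastasis", "Metastasis", "CNS"]
    print("accuracy:", accuracy_score(y_true, y_pred))
    print("weighted F1:", f1_score(y_true, y_pred, average="weighted"))
    print("95% CI:", bootstrap_ci(y_true, y_pred,
          lambda t, p: f1_score(t, p, average="weighted")))
```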
Key Results
- ICD-coded diagnoses:
  - BioBERT highest overall (accuracy 90.8%, F1 84.2%).
  - GPT-4o matched that accuracy (90.8%) but with a lower F1 (81.2%).
  - GPT-3.5: markedly lower F1 (61.6%) despite comparable accuracy (88.9%).
  - Gemini and Llama trailed (~77–81% accuracy).
- Free-text diagnoses:
  - GPT-4o best performer (accuracy 81.9%, F1 71.8%).
  - BioBERT similar accuracy (81.6%) but lower F1 (61.5%).
  - GPT-3.5 dropped to 75.9% accuracy and 52.5% F1.
  - Gemini and Llama struggled (accuracy ~64%, F1 44–52%).
- Error patterns:
  - Benign conditions misclassified as malignant where ICD wording was ambiguous.
  - Metastatic disease often mislabeled as a primary cancer.
  - Vague "unknown" cases over-predicted as specific cancer types.
- Operational aspects: cloud-hosted APIs (GPT models, Gemini) versus local deployment (Llama, BioBERT). Integration into Python pipelines is feasible, though compute-resource and API-cost considerations apply; a local-inference sketch follows.
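As an illustration of the local-deployment path, here is a minimal inference sketch for a fine-tuned BioBERT classifier using Hugging Face transformers; the checkpoint path and example string are placeholders, since the study's fine-tuned weights are not assumed to be public.

```python
# Hypothetical sketch of local inference with a fine-tuned BioBERT
# sequence classifier; the checkpoint path is a placeholder, not a
# released artifact from the study.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

CKPT = "path/to/biobert-finetuned-oncology"  # placeholder checkpoint

tokenizer = AutoTokenizer.from_pretrained(CKPT)
model = AutoModelForSequenceClassification.from_pretrained(CKPT)
model.eval()

def categorize(diagnosis: str) -> str:
    """Map a diagnosis string to one of the trained category labels."""
    inputs = tokenizer(diagnosis, return_tensors="pt",
                       truncation=True, max_length=128)
    with torch.no_grad():
        logits = model(**inputs).logits
    return model.config.id2label[int(logits.argmax(dim=-1))]

print(categorize("Malignant neoplasm of upper-outer quadrant of left breast"))
```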
Comparative Context
Other recent studies support these findings:
- Bürgisser 2024: Llama 2 achieved 95% accuracy detecting gout in French-language notes, outperforming regex-based extraction.
- Du 2025: a systematic review of 196 studies highlights widespread LLM use in CDS but notes the need for standardized evaluation.
- Guevara 2024: fine-tuned models outperformed GPT at extracting social determinants of health, capturing context missed by ICD codes.
Authors’ Conclusions
- LLMs and BioBERT show strong potential for structuring EHR data.
- BioBERT excels at standardized text; GPT-4o competitive on free-text.
- Current performance (~81–91% accuracy) is promising but not safe enough for autonomous CDS.
- Recommended hybrid approach: automate routine categorization while flagging ambiguous or low-confidence cases for human review (see the routing sketch after this list).
- Limitations: single institution, oncology-only dataset, custom categories, limited generalizability.
- Urges validation on multi-center datasets and implementation of safeguards (confidence thresholds, human-in-the-loop review).
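A minimal sketch of that hybrid routing idea, assuming the classifier emits a confidence score; the 0.9 threshold is an illustrative assumption, not a value from the study.

```python
# Illustrative hybrid routing: accept high-confidence predictions,
# flag everything else for human review. Threshold is an assumption.
from dataclasses import dataclass

@dataclass
class Decision:
    category: str
    confidence: float
    needs_review: bool

def route(category: str, confidence: float, threshold: float = 0.9) -> Decision:
    """Auto-accept only when confidence clears the review threshold."""
    return Decision(category, confidence, needs_review=confidence < threshold)

print(route("Metastasis", 0.62))  # flagged for human review
print(route("Breast", 0.97))      # auto-accepted
```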
Implications for Clinical Informatics
- Unlocking Unstructured Data: LLMs can transform narrative notes into structured inputs for CDS, registries, and risk models.
- Caution in CDS: Misclassifications (e.g., metastasis labeled as a primary cancer, or benign as malignant) could trigger harmful CDS alerts.
- Vendor-Agnostic Integration: Feasible via Python pipelines and FHIR APIs, avoiding proprietary lock-in (see the sketch after this list).
- Monitoring: Continuous oversight, bias audits, retraining needed to maintain accuracy and fairness.
- Ethical Considerations: Liability, transparency, and bias risks highlight the need for human oversight.
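For the vendor-agnostic integration point above, a minimal sketch of pulling free-text diagnoses from a FHIR server's Condition resources with plain REST calls; the base URL is a placeholder and authentication is omitted.

```python
# Hypothetical sketch: fetch diagnosis text from FHIR Condition
# resources; the base URL is a placeholder and auth is omitted.
import requests

FHIR_BASE = "https://fhir.example.org"  # placeholder endpoint

def fetch_condition_texts(patient_id: str) -> list[str]:
    """Return display/free-text diagnosis strings for one patient."""
    resp = requests.get(
        f"{FHIR_BASE}/Condition",
        params={"patient": patient_id},
        headers={"Accept": "application/fhir+json"},
        timeout=30,
    )
    resp.raise_for_status()
    texts = []
    for entry in resp.json().get("entry", []):
        code = entry["resource"].get("code", {})
        texts.append(code.get("text")
                     or code.get("coding", [{}])[0].get("display", ""))
    return texts
```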
Citation
Hashtarkhani S, et al. Cancer diagnosis categorization in EHRs using large language models and BioBERT: model performance evaluation study. JMIR Cancer. 2025;11:e72005. doi:10.2196/72005