[Paper Review] Clinically Accurate Chest X-Ray Report Generation
The paper presents a domain-aware, hierarchical chest X-ray report generator optimized with a Clinically Coherent Reward to improve both language quality and clinical accuracy on Open-I and MIMIC-CXR.
The automatic generation of radiology reports given medical radiographs has significant potential to operationally and clinically improve patient care. A number of prior works have focused on this problem, employing advanced methods from computer vision and natural language generation to produce readable reports. However, these works often fail to account for the particular nuances of the radiology domain, and, in particular, the critical importance of clinical accuracy in the resulting generated reports. In this work, we present a domain-aware automatic chest X-ray radiology report generation system which first predicts what topics will be discussed in the report, then conditionally generates sentences corresponding to these topics. The resulting system is fine-tuned using reinforcement learning, considering both readability and clinical accuracy, as assessed by the proposed Clinically Coherent Reward. We verify this system on two datasets, Open-I and MIMIC-CXR, and demonstrate that our model offers marked improvements on both language generation metrics and CheXpert assessed accuracy over a variety of competitive baselines.
Motivation & Objective
- Address the gap between fluent radiology reports and clinical accuracy in generated chest X-ray reports.
- Propose a hierarchical CNN-RNN-RNN generator that creates sentences from topic-driven sentence decoders.
- Incorporate a Clinically Coherent Reward based on CheXpert to align disease state mentions with ground truth.
- Fine-tune the model with reinforcement learning to balance readability and clinical fidelity.
- Evaluate on two public datasets (Open-I and MIMIC-CXR) against strong baselines.
Proposed method
- Hierarchical generation: image encoding via CNN, sentence-level topic generation via an LSTM, and word-level decoding with attention.
- Topic-guided sentence generation where each sentence is conditioned on a topic vector derived from the sentence-level LSTM.
- Word decoder with visual sentinel and attention over image features to generate each sentence.
- Reinforcement learning with a combined objective: CIDEr-based NLG reward and a Clinically Coherent Reward (CCR) derived from CheXpert labels.
- Clinically Coherent Reward models disease-state consistency by comparing ground-truth and generated reports via probabilistic mappings p(+|l) and p(-|l) under assumptions suitable for rare diseases.
- Training uses an SCST-style policy gradient to optimize the expected reward; both reward terms are computed against the ground-truth report, which ties fluency and clinical accuracy to the reference.
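The reward combination and the SCST-style update described in the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `agreement_reward` is a simplified stand-in for the Clinically Coherent Reward (it treats the negative probability as one minus the positive probability, which the paper does not necessarily do), and the function names and the `nlg_weight` parameter are hypothetical.

```python
def agreement_reward(true_pos_prob, gen_pos_prob):
    """Simplified clinical-agreement score (illustrative, not the paper's exact CCR).

    Each argument maps a finding label to the probability that a report
    states the finding as positive. Assumption: p(-|l) = 1 - p(+|l).
    """
    total = 0.0
    for label, p_true in true_pos_prob.items():
        p_gen = gen_pos_prob.get(label, 0.0)
        # Probability that both reports agree on the state of this label.
        total += p_true * p_gen + (1.0 - p_true) * (1.0 - p_gen)
    return total / len(true_pos_prob)


def combined_reward(nlg_score, clinical_score, nlg_weight=1.0):
    """Weighted sum of a language reward (e.g. CIDEr) and a clinical reward.

    nlg_weight is a hypothetical knob; the review does not state the weighting.
    """
    return nlg_weight * nlg_score + clinical_score


def scst_loss(sample_log_probs, sample_reward, greedy_reward):
    """Self-critical loss: the greedy decode's reward serves as the baseline.

    Minimizing this raises the likelihood of sampled reports that beat the
    greedy report and lowers it for those that do worse.
    """
    advantage = sample_reward - greedy_reward
    return -advantage * sum(sample_log_probs)


# Toy example with made-up numbers (not results from the paper).
truth = {"edema": 1.0, "pneumonia": 0.0}
sampled = {"edema": 0.9, "pneumonia": 0.1}
greedy = {"edema": 0.2, "pneumonia": 0.1}

r_sample = combined_reward(1.2, agreement_reward(truth, sampled))
r_greedy = combined_reward(1.0, agreement_reward(truth, greedy))
loss = scst_loss([-0.5, -1.0, -0.25], r_sample, r_greedy)
```

In this toy case the sampled report agrees with the reference more closely than the greedy one, so the advantage is positive and minimizing the loss pushes the model toward the sampled report.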
Experimental results
Research questions
- RQ1: Can a hierarchical image-to-text model generate radiology reports that are both fluent and clinically accurate?
- RQ2: Does incorporating a Clinically Coherent Reward improve CheXpert-driven disease-state alignment without sacrificing readability?
- RQ3: How does the proposed method compare to state-of-the-art radiology report generation baselines on large chest X-ray datasets?
- RQ4: What is the impact of combining NLG and CCR rewards versus optimizing for one alone?
Key findings
| Dataset and model | CIDEr | ROUGE | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | CheXpert accuracy |
|---|---|---|---|---|---|---|---|
| MIMIC-CXR Noise-RNN | 0.716 | 0.272 | 0.269 | 0.172 | 0.113 | 0.074 | 0.803 |
| MIMIC-CXR 1-NN | 0.755 | 0.244 | 0.305 | 0.171 | 0.098 | 0.057 | 0.818 |
| MIMIC-CXR S&T | 0.886 | 0.300 | 0.307 | 0.201 | 0.137 | 0.093 | 0.837 |
| MIMIC-CXR SA&T | 0.967 | 0.288 | 0.318 | 0.205 | 0.137 | 0.093 | 0.849 |
| MIMIC-CXR TieNet | 1.004 | 0.296 | 0.332 | 0.212 | 0.142 | 0.095 | 0.848 |
| MIMIC-CXR Ours (NLG) | 1.153 | 0.307 | 0.352 | 0.223 | 0.153 | 0.104 | 0.834 |
| MIMIC-CXR Ours (CCR) | 0.956 | 0.284 | 0.294 | 0.190 | 0.134 | 0.094 | 0.868 |
| MIMIC-CXR Ours (full) | 1.046 | 0.306 | 0.313 | 0.206 | 0.146 | 0.103 | 0.867 |
| Open-I Noise-RNN | 0.747 | 0.291 | 0.233 | 0.130 | 0.087 | 0.061 | 0.914 |
| Open-I 1-NN | 0.728 | 0.201 | 0.232 | 0.116 | 0.051 | 0.018 | 0.911 |
| Open-I S&T | 0.926 | 0.306 | 0.265 | 0.157 | 0.105 | 0.073 | 0.915 |
| Open-I SA&T | 1.276 | 0.313 | 0.328 | 0.195 | 0.123 | 0.080 | 0.908 |
| Open-I TieNet | 1.334 | 0.311 | 0.330 | 0.194 | 0.124 | 0.081 | 0.902 |
| Open-I Ours (NLG) | 1.490 | 0.359 | 0.369 | 0.246 | 0.171 | 0.115 | 0.916 |
| Open-I Ours (CCR) | 0.707 | 0.244 | 0.162 | 0.084 | 0.055 | 0.036 | 0.917 |
| Open-I Ours (full) | 1.424 | 0.354 | 0.359 | 0.237 | 0.164 | 0.113 | 0.918 |
- The full model achieves the highest CheXpert accuracy on Open-I (0.918) and is within 0.001 of the CCR-only variant on MIMIC-CXR (0.867 vs. 0.868), while maintaining solid NLG metrics.
- The NLG-only variant achieves the best CIDEr, ROUGE, and BLEU scores on both datasets, but its CheXpert accuracy on MIMIC-CXR (0.834) falls below that of TieNet (0.848) and SA&T (0.849).
- CCR-only variant enhances clinical precision/PPV but can reduce recall, highlighting the need for a joint objective.
- Across MIMIC-CXR and Open-I, the proposed method outperforms baselines including 1-NN, Show & Tell, Show, Attend & Tell, and TieNet on most language metrics and on CheXpert accuracy.
- Post-hoc removal of exact duplicate sentences improves readability with minimal impact on NLG metrics.
- Ablations show that combining NLG and CCR rewards yields the best overall performance in both language quality and clinical alignment.
- Open-I is a smaller corpus with lower disease prevalence than MIMIC-CXR, which affects model performance and evaluation dynamics.
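The trade-off between the reward variants can be checked directly from the table above. The snippet below is only a reading aid: the numbers are copied from the table, while the dictionary layout and the helper names `best_by` and `tradeoff` are hypothetical.

```python
# (CIDEr, CheXpert accuracy) pairs copied from the results table.
results = {
    "MIMIC-CXR": {
        "TieNet": (1.004, 0.848),
        "Ours (NLG)": (1.153, 0.834),
        "Ours (CCR)": (0.956, 0.868),
        "Ours (full)": (1.046, 0.867),
    },
    "Open-I": {
        "TieNet": (1.334, 0.902),
        "Ours (NLG)": (1.490, 0.916),
        "Ours (CCR)": (0.707, 0.917),
        "Ours (full)": (1.424, 0.918),
    },
}

CIDER, ACCURACY = 0, 1


def best_by(dataset, metric):
    """Return the model with the highest value of the given metric."""
    rows = results[dataset]
    return max(rows, key=lambda name: rows[name][metric])


def tradeoff(dataset):
    """Change in (CIDEr, accuracy) when moving from NLG-only to the full model."""
    nlg = results[dataset]["Ours (NLG)"]
    full = results[dataset]["Ours (full)"]
    return (round(full[CIDER] - nlg[CIDER], 3),
            round(full[ACCURACY] - nlg[ACCURACY], 3))


for dataset in results:
    print(dataset, best_by(dataset, CIDER), best_by(dataset, ACCURACY),
          tradeoff(dataset))
```

On MIMIC-CXR, adding the CCR term to the NLG reward costs about 0.107 CIDEr and gains about 0.033 accuracy; on Open-I both differences are much smaller.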
This review was created by AI and reviewed by human editors.