[Paper Review] LLM-CXR: Instruction-Finetuned LLM for CXR Image Understanding and Generation
LLM-CXR finetunes a text-only pretrained LLM with image-token objectives using VQ-GAN tokens and instruction-following training to enable bidirectional CXR understanding and generation without changing model architecture.
Following the impressive development of LLMs, vision-language alignment in LLMs is actively being researched to enable multimodal reasoning and visual IO. This direction of research is particularly relevant to medical imaging because medical image analysis and generation consist of reasoning based on a combination of visual features and prior knowledge. Many recent works have focused on training adapter networks that serve as an information bridge between image processing networks and LLMs; but presumably, in order to achieve maximum reasoning potential of LLMs on visual information as well, visual and language features should be allowed to interact more freely. This is especially important in the medical domain because understanding and generating medical images such as chest X-rays (CXR) require not only accurate visual and language-based reasoning but also a more intimate mapping between the two modalities. Thus, taking inspiration from previous work on the transformer and VQ-GAN combination for bidirectional image and text generation, we build upon this approach and develop a method for instruction-tuning an LLM pre-trained only on text to gain vision-language capabilities for medical images. Specifically, we leverage a pretrained LLM's existing question-answering and instruction-following abilities to teach it to understand visual inputs by instructing it to answer questions about image inputs and, symmetrically, output both text and image responses appropriate to a given query by tuning the LLM with diverse tasks that encompass image-based text-generation and text-based image-generation. We show that our model, LLM-CXR, trained in this approach shows better image-text alignment in both CXR understanding and generation tasks while being smaller in size compared to previously developed models that perform a narrower range of tasks. The code is at https://github.com/hyn2028/llm-cxr.
Research Motivation and Objectives
- Bridge vision-language understanding in a pretrained LLM for medical CXRs without architectural changes.
- Enable bidirectional CXR-to-text and text-to-CXR generation along with CXR-based VQA.
- Preserve clinical information during image tokenization and improve alignment between visual and textual features.
Proposed Method
- Tokenize CXRs with a clinical-information-preserving VQ-GAN to produce image tokens.
- Extend the LLM's vocabulary with the image tokens, enlarging the embedding table accordingly and training the full table.
- Use instruction-tuning with a mix of NL-IF, report-to-CXR, CXR-to-report, and CXR-VQA tasks under an instruction-driven format.
- Train with a conditional autoregressive loss in which the Instruction, Input, and Response are generated together as a single target sequence.
- Employ two-stage fine-tuning: first on broad image-text relationships with the MIMIC-CXR-JPG dataset, then on higher-quality, pruned data for vision-language alignment.
- Leverage synthetic VQA data generated via ChatGPT to augment training.
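The vocabulary expansion and instruction format described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the vocabulary and codebook sizes are placeholder values, the `<img_k>` rendering and the `### Instruction / Input / Response` template are assumptions, and every function name is hypothetical.

```python
from typing import List

# Placeholder sizes, assumed for illustration only (not taken from the paper).
TEXT_VOCAB_SIZE = 50_000  # size of the original text-only vocabulary
CODEBOOK_SIZE = 1_024     # number of VQ-GAN codebook entries


def expanded_vocab_size() -> int:
    """Vocabulary size after appending one new token per codebook entry."""
    return TEXT_VOCAB_SIZE + CODEBOOK_SIZE


def image_codes_to_token_ids(codes: List[int]) -> List[int]:
    """Map VQ-GAN codebook indices to ids placed after the text vocabulary."""
    for c in codes:
        if not 0 <= c < CODEBOOK_SIZE:
            raise ValueError(f"codebook index out of range: {c}")
    return [TEXT_VOCAB_SIZE + c for c in codes]


def token_ids_to_image_codes(token_ids: List[int]) -> List[int]:
    """Inverse mapping, needed before decoding generated image tokens."""
    codes = [t - TEXT_VOCAB_SIZE for t in token_ids]
    for c in codes:
        if not 0 <= c < CODEBOOK_SIZE:
            raise ValueError(f"not an image token id: {c + TEXT_VOCAB_SIZE}")
    return codes


def render_image_tokens(codes: List[int]) -> str:
    """Render image codes as placeholder strings (hypothetical notation)."""
    return "".join(f"<img_{c}>" for c in codes)


def build_example(instruction: str, input_part: str, response_part: str) -> str:
    """Join the three parts into one training sequence (assumed template)."""
    return (
        f"### Instruction:\n{instruction}\n\n"
        f"### Input:\n{input_part}\n\n"
        f"### Response:\n{response_part}"
    )
```

For a CXR-to-report example the image tokens would go in the Input slot and the report in the Response slot; for report-to-CXR the two would be swapped, which is what lets a single model cover both directions.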
Experimental Results
Research Questions
- RQ1: Can a pretrained text-only LLM acquire robust vision-language capabilities for chest X-rays through instruction-finetuning and image-token interactions?
- RQ2: How does bidirectional image-text generation (CXR-to-report and report-to-CXR) compare to specialized multimodal models on CXR tasks?
- RQ3: Does clinical-information-preserving VQ-GAN tokenization improve CXR task performance compared with standard image tokenization?
- RQ4: What is the impact of a two-stage fine-tuning regime on learning image-text relations and downstream tasks like VQA?
- RQ5: Is an LLM-based approach with expanded token space able to achieve state-of-the-art alignment across CXR understanding and generation tasks?
Key Findings
| Model | AUROC Micro | AUROC Macro | AUROC Weighted | AUROC No Finding | AUROC Pneumothorax | AUROC Edema | AUROC Pleural Effusion | AUROC Consolidation/Pneumonia | AUROC Lung Lesion | F1 Micro | F1 Macro | F1 Weighted | F1 No Finding | F1 Pneumothorax | F1 Edema | F1 Pleural Effusion | F1 Consolidation/Pneumonia | F1 Lung Lesion |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| UniXGen-512 | 0.661 | 0.588 | 0.634 | 0.676 | 0.519 | 0.615 | 0.682 | 0.533 | 0.501 | 0.434 | 0.280 | 0.415 | 0.532 | 0.064 | 0.374 | 0.530 | 0.167 | 0.014 |
| UniXGen-256 | 0.577 | 0.533 | 0.541 | 0.564 | 0.530 | 0.542 | 0.533 | 0.516 | 0.513 | 0.281 | 0.187 | 0.256 | 0.411 | 0.083 | 0.226 | 0.215 | 0.132 | 0.055 |
| XrayGPT | 0.595 | 0.552 | 0.576 | 0.592 | 0.511 | 0.590 | 0.595 | 0.515 | 0.511 | 0.314 | 0.227 | 0.320 | 0.371 | 0.049 | 0.333 | 0.404 | 0.143 | 0.058 |
| LLM-CXR | 0.654 | 0.586 | 0.628 | 0.698 | 0.532 | 0.612 | 0.635 | 0.540 | 0.501 | 0.414 | 0.283 | 0.408 | 0.562 | 0.083 | 0.370 | 0.455 | 0.198 | 0.030 |
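The Micro, Macro, and Weighted columns are different ways of averaging per-finding scores. The sketch below shows the three averages for F1 on toy multi-label data; it assumes each example carries one binary label per finding and uses made-up labels, since this review does not describe how the labels behind the table were obtained. The function names are illustrative.

```python
from typing import Dict, List, Tuple


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def per_label_counts(
    y_true: List[List[int]], y_pred: List[List[int]]
) -> List[Tuple[int, int, int]]:
    """Return (TP, FP, FN) for each label column."""
    n_labels = len(y_true[0])
    counts = []
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        counts.append((tp, fp, fn))
    return counts


def averaged_f1(y_true: List[List[int]], y_pred: List[List[int]]) -> Dict[str, float]:
    """Compute micro-, macro-, and support-weighted F1 over all labels."""
    counts = per_label_counts(y_true, y_pred)
    per_label = [f1_from_counts(*c) for c in counts]
    support = [tp + fn for tp, _, fn in counts]  # positives per label

    # Micro: pool the counts over all labels, then compute one F1.
    micro = f1_from_counts(
        sum(c[0] for c in counts), sum(c[1] for c in counts), sum(c[2] for c in counts)
    )
    # Macro: unweighted mean of per-label F1, so rare labels count equally.
    macro = sum(per_label) / len(per_label)
    # Weighted: mean of per-label F1 weighted by the number of positives.
    total = sum(support)
    weighted = sum(f * s for f, s in zip(per_label, support)) / total if total else 0.0
    return {"micro": micro, "macro": macro, "weighted": weighted}


# Toy data: column 0 is a common finding, column 1 a rare one.
y_true = [[1, 0], [1, 0], [1, 1], [0, 0]]
y_pred = [[1, 0], [1, 0], [1, 0], [0, 1]]
scores = averaged_f1(y_true, y_pred)
```

On this toy data the rare label is always missed, which pulls the macro average well below the micro average. The same effect is consistent with the macro columns in the table sitting below the micro columns, given the low scores on rarer findings such as Pneumothorax and Lung Lesion.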
- LLM-CXR achieves state-of-the-art performance across CXR-to-report, CXR-VQA, and text-to-CXR generation among models evaluated, using a single model.
- For CXR-to-report generation, LLM-CXR attains a micro-averaged AUROC of 0.654 and F1 of 0.414, outperforming XrayGPT (0.595, 0.314) and UniXGen-256 (0.577, 0.281) and coming close to UniXGen-512 (0.661, 0.434).
- In CXR-VQA, LLM-CXR scores 56.7% overall accuracy with strong per-diagnosis results (e.g., 60.9% for Consolidation/Pneumonia, 71.3% for No Finding).
- For report-to-CXR generation, LLM-CXR achieves an FID of 20.22 (txv-all-1024), and the generated images agree with the findings stated in the input text (0.808–0.849 AUROC/Macro F1 across categories).
- In the same text-to-CXR setting, LLM-CXR yields the best AUROC/F1 in several lesion-related categories (e.g., Edema, Pleural Effusion, Pneumonia), and its FID is markedly lower than that of the baselines.
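For readers unfamiliar with the FID figures quoted above, the standard definition is the Fréchet distance between Gaussian fits to feature embeddings of real and generated images. Here $\mu_r, \Sigma_r$ and $\mu_g, \Sigma_g$ denote the mean and covariance of the real and generated features. The feature extractor is assumed to be the one indicated by the txv-all-1024 label; this review does not describe it further.

$$
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right)
$$

Lower values mean the generated feature distribution is closer to the real one, and the value is 0 when the two Gaussian fits coincide.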
This review was generated by AI and checked by a human editor.