[Paper Review] FixationFormer: Direct Utilization of Expert Gaze Trajectories for Chest X-Ray Classification
FixationFormer models expert gaze trajectories as sequential tokens within a Transformer to directly fuse gaze with chest X-ray features, achieving state-of-the-art results on several benchmarks.
Expert eye movements provide a rich, passive source of domain knowledge in radiology, offering a powerful cue for integrating diagnostic reasoning into computer-aided analysis. However, direct integration into CNN-based systems, which historically have dominated the medical image analysis domain, is challenging: gaze recordings are sequential, temporally dense yet spatially sparse, noisy, and variable across experts. As a consequence, most existing image-based models utilize reduced representations such as heatmaps. In contrast, gaze naturally aligns with transformer architectures, as both are sequential in nature and rely on attention to highlight relevant input regions. In this work, we introduce FixationFormer, a transformer-based architecture that represents expert gaze trajectories as sequences of tokens, thereby preserving their temporal and spatial structure. By modeling gaze sequences jointly with image features, our approach addresses sparsity and variability in gaze data while enabling a more direct and fine-grained integration of expert diagnostic cues through explicit cross-attention between the image and gaze token sequences. We evaluate our method on three publicly available benchmark chest X-ray datasets and demonstrate that it achieves state-of-the-art classification performance, highlighting the value of representing gaze as a sequence in transformer-based medical image analysis.
Research Motivation and Objectives
- Leverage expert gaze as a sequential, temporally aware representation for chest X-ray classification.
- Develop a gaze tokenisation scheme that preserves temporal and spatial information from fixation trajectories.
- Integrate gaze tokens with image features via attention mechanisms to improve diagnostic accuracy.
Proposed Method
- Represent gaze trajectories as a sequence of fixation tokens by encoding duration and spatial coordinates with learnable projections and using start-time for temporal position encoding.
- Use a standard Vision Transformer, pretrained with MGCA on MIMIC-CXR, as the image backbone to obtain robust image features.
- Fuse image and gaze tokens with a Gaze Integration module employing cross-attention (Image-to-Gaze) and an optional Two-Way Attention variant for bidirectional refinement.
- Apply Nested Tensor batching to handle variable-length gaze sequences efficiently without padding masks.
- Fine-tune task-specific layers with LoRA on top of a frozen MGCA-pretrained image backbone to adapt the model to each dataset.
- Evaluate on three chest X-ray datasets with gaze data: CXR-Gaze, SIIM-ACR, and Reflacx.
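The gaze tokenisation step in the list above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the class name `GazeTokenizer`, the token width, the sinusoidal form of the start-time encoding, and the random matrices standing in for learnable projections are all assumptions made for illustration.

```python
import numpy as np

def sinusoidal_encoding(times, dim):
    """Encode continuous start times (seconds) as sin/cos features of shape (n, dim).

    Assumption: a Transformer-style sinusoidal encoding; the source only says
    that start time is used for temporal position encoding. dim must be even.
    """
    times = np.asarray(times, dtype=np.float64)[:, None]      # (n, 1)
    freqs = 1.0 / (10000.0 ** (np.arange(0, dim, 2) / dim))   # (dim/2,)
    angles = times * freqs[None, :]                           # (n, dim/2)
    enc = np.zeros((times.shape[0], dim))
    enc[:, 0::2] = np.sin(angles)
    enc[:, 1::2] = np.cos(angles)
    return enc

class GazeTokenizer:
    """Hypothetical tokenizer: one token per fixation."""

    def __init__(self, dim=16, seed=0):
        rng = np.random.default_rng(seed)
        # Random stand-ins for the learnable projections; a real model
        # would optimise these during training.
        self.w_xy = rng.normal(scale=0.02, size=(2, dim))
        self.w_dur = rng.normal(scale=0.02, size=(1, dim))
        self.dim = dim

    def __call__(self, fixations):
        """fixations: (n, 4) rows of x, y (normalised), duration (s), start time (s)."""
        fix = np.asarray(fixations, dtype=np.float64)
        xy, dur, start = fix[:, :2], fix[:, 2:3], fix[:, 3]
        content = xy @ self.w_xy + dur @ self.w_dur            # spatial + duration content
        return content + sinusoidal_encoding(start, self.dim)  # add temporal position

# Three made-up fixations; a longer recording simply yields more tokens.
tokens = GazeTokenizer(dim=16)([
    [0.20, 0.30, 0.25, 0.00],
    [0.60, 0.40, 0.40, 0.30],
    [0.55, 0.70, 0.15, 0.75],
])
print(tokens.shape)  # (3, 16)
```

Because the temporal encoding depends on the recorded start time rather than on the token index, two fixations at the same location with the same duration still receive different tokens when they occur at different moments.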
![Figure 1: FixationFormer overview: Image and gaze are encoded into separate token sequences. To infuse the image features with gaze information, we use cross-attention in one or optionally both directions throughout a stack of Transformer layers. Finally, the [CLS] token from the image encoder is](https://ar5iv.labs.arxiv.org/html/2603.22939/assets/x3.png)
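The image-to-gaze cross-attention step can be sketched as a single attention head in NumPy. This is an illustrative sketch under stated assumptions, not the paper's module: it assumes that image tokens act as queries and gaze tokens as keys and values, uses one head with a residual connection, and the function name `cross_attention` and all weights are hypothetical. The review states that the authors batch variable-length gaze sequences with nested tensors; the sketch shows the more common padding-mask alternative instead and checks that masked padding leaves the result unchanged.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(image_tokens, gaze_tokens, w_q, w_k, w_v, gaze_mask=None):
    """Single-head cross-attention in which image tokens attend to gaze tokens.

    image_tokens: (n_img, d), gaze_tokens: (n_gaze, d)
    gaze_mask: optional (n_gaze,) bool array, True marks a real fixation.
    Returns the gaze-infused image tokens and the attention weights.
    """
    q = image_tokens @ w_q
    k = gaze_tokens @ w_k
    v = gaze_tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])                     # (n_img, n_gaze)
    if gaze_mask is not None:
        scores = np.where(gaze_mask[None, :], scores, -np.inf)  # ignore padded slots
    attn = softmax(scores, axis=-1)
    return image_tokens + attn @ v, attn                        # residual update

rng = np.random.default_rng(0)
d = 8
w_q, w_k, w_v = (rng.normal(scale=0.1, size=(d, d)) for _ in range(3))
image_tokens = rng.normal(size=(5, d))  # e.g. 5 image patch tokens
gaze_tokens = rng.normal(size=(3, d))   # e.g. 3 fixation tokens

fused, attn = cross_attention(image_tokens, gaze_tokens, w_q, w_k, w_v)

# Same gaze sequence padded to length 5, with a mask over the padding.
padded = np.vstack([gaze_tokens, np.zeros((2, d))])
mask = np.array([True, True, True, False, False])
fused_padded, attn_padded = cross_attention(
    image_tokens, padded, w_q, w_k, w_v, gaze_mask=mask
)
print(np.allclose(fused, fused_padded))
```

A two-way variant would add a second attention step in the opposite direction, with gaze tokens as queries and image tokens as keys and values, so that both sequences are refined.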
Experimental Results
Research Questions
- RQ1: Can gaze trajectories be effectively represented as a sequence of tokens for Transformer-based medical imaging models?
- RQ2: Does direct cross-attention between gaze tokens and image tokens improve chest X-ray classification beyond heatmap or CNN-based gaze integrations?
- RQ3: How do different gaze integration mechanisms (single-direction cross-attention vs. two-way attention) compare in performance and stability across datasets?
- RQ4: What is the impact of gaze information when using strong MGCA-pretrained backbones versus weaker ImageNet backbones?
- RQ5: Are the attention maps learned by FixationFormer aligned with expert gaze patterns and anatomically relevant regions?
Key Findings
| Architecture | CXR-Gaze Accuracy | CXR-Gaze F1 | CXR-Gaze AUC | SIIM-ACR Accuracy | SIIM-ACR F1 | SIIM-ACR AUC | Reflacx Accuracy | Reflacx F1 | Reflacx AUC |
|---|---|---|---|---|---|---|---|---|---|
| GG-CAM Zhu et al. (2022) | 77.57 | 0.770 | 0.888 | - | - | - | - | - | - |
| GazeMTL Saab et al. (2021) | 78.50 | 0.779 | 0.887 | - | - | - | - | - | - |
| U-Net + Gaze Karargyris et al. (2021) | - | - | - | 81.10 | 0.803 | 0.689 | - | - | - |
| EG-ViT Ma et al. (2022) | - | 0.807 | 0.909 | 85.60 | 0.849 | 0.741 | - | - | - |
| GII-ViT Chen et al. (2026) | - | 0.806 | 0.919 | - | - | - | - | - | - |
| GazeGNN Wang et al. (2024) | 83.18 | 0.823 | 0.923 | - | - | - | - | - | - |
| GazeGNN (repr.) Wang et al. (2024) | 71.02 | 0.700 | 0.879 | 81.84 | 0.700 | 0.851 | 64.51 | 0.453 | 0.757 |
| Cross-Attention | 84.11 | 0.833 | 0.944 | 84.96 | 0.765 | 0.902 | 70.06 | 0.561 | 0.853 |
| Two-Way Attention | 82.80 | 0.819 | 0.952 | 86.40 | 0.797 | 0.915 | 68.06 | 0.510 | 0.842 |
- FixationFormer variants outperform prior state-of-the-art on the CXR-Gaze dataset under both evaluation schemes, with Cross-Attention achieving 87.96% test accuracy (E.S. Test) and 84.11% on validation.
- On SIIM-ACR, the method matches or approaches the state of the art, with Two-Way Attention reaching 86.40% accuracy and Cross-Attention reaching 84.96% accuracy (validation).
- On Reflacx, Cross-Attention achieves 70.06% accuracy and Two-Way 68.06%, with Cross-Attention showing more stable training.
- Ablation shows that gaze-only models underperform image-based models, but incorporating gaze with an MGCA image backbone yields gains (about 1-3% on CXR-Gaze; notable gains on Reflacx).
- With an ImageNet backbone, Cross-Attention provides notable gains, suggesting that gaze integration remains effective with weaker backbones.
- Grad-CAM visualizations indicate that attention aligns with expert gaze trajectories when gaze integration is used.
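For readers comparing the table columns, the sketch below shows how accuracy, F1, and AUC are commonly computed. Note that the table reports accuracy as a percentage and F1 and AUC as fractions. The sketch covers only the binary case with made-up labels and scores; how the paper averages F1 and AUC across classes is not stated in this review, so the sketch should not be read as the authors' evaluation code.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def roc_auc(y_true, scores, positive=1):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up example: six cases, scores thresholded at 0.5.
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.2, 0.6, 0.4, 0.5, 0.1]
y_pred = [int(s >= 0.5) for s in scores]

print(accuracy(y_true, y_pred), f1_score(y_true, y_pred), roc_auc(y_true, scores))
```

Accuracy and F1 depend on the decision threshold, whereas AUC is computed from the ranking of the scores alone, which is one reason a model can lead on AUC while trailing on accuracy or F1 in the same table.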

This review was generated by AI and checked by a human editor.