QUICK REVIEW

[Paper Review] Rotate to Attend: Convolutional Triplet Attention Module

Diganta Misra, Trikay Nalamada | arXiv (Cornell University) | October 6, 2020

Advanced Neural Network Applications | References: 35 | Citations: 51

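One-line summary

Triplet attention is a lightweight three-branch module that captures cross-dimension interactions (C-H, C-W, H-W) without dimensionality reduction. It plugs into existing CNNs and improves ImageNet and COCO performance with minimal overhead.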

ABSTRACT

Benefiting from the capability of building inter-dependencies among channels or spatial locations, attention mechanisms have been extensively studied and broadly used in a variety of computer vision tasks recently. In this paper, we investigate light-weight but effective attention mechanisms and present triplet attention, a novel method for computing attention weights by capturing cross-dimension interaction using a three-branch structure. For an input tensor, triplet attention builds inter-dimensional dependencies by the rotation operation followed by residual transformations and encodes inter-channel and spatial information with negligible computational overhead. Our method is simple as well as efficient and can be easily plugged into classic backbone networks as an add-on module. We demonstrate the effectiveness of our method on various challenging tasks including image classification on ImageNet-1k and object detection on MSCOCO and PASCAL VOC datasets. Furthermore, we provide extensive insight into the performance of triplet attention by visually inspecting the GradCAM and GradCAM++ results. The empirical evaluation of our method supports our intuition on the importance of capturing dependencies across dimensions when computing attention weights. Code for this paper can be publicly accessed at https://github.com/LandskapeAI/triplet-attention

Research motivation and objectives

Investigate cheap yet effective attention mechanisms that model inter-dimension dependencies in CNN features.
Propose a cross-dimension attention approach that preserves all information (no dimensionality reduction).
Evaluate the method as a plug-in module on standard backbones across classification and detection tasks.

Proposed method

Introduce triplet attention with three parallel branches capturing (C, H), (C, W), and (H, W) interactions.
Use tensor rotations and a Z-pool (concatenation of max and average pooling) followed by a k x k convolution to generate attention maps.
Aggregate branch outputs by simple averaging to produce refined feature maps without dimensionality reduction.
Compared to CBAM and SE, emphasize cross-dimension interaction with negligible parameter and FLOP overhead.
Provide analytical and empirical complexity comparisons showing very low overhead (e.g., a parameter count of 6k^2 per triplet attention module for kernel size k, independent of the channel count).
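
The three-branch structure described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the "rotation" is modeled as a plain axis permutation, batch normalization is omitted, the convolution is a naive loop, and all function names (`z_pool`, `conv2d_same`, `attention_branch`, `triplet_attention`) are hypothetical.

```python
import numpy as np

def z_pool(x):
    """Stack max- and mean-pooling over axis 0: (D, A, B) -> (2, A, B)."""
    return np.stack([x.max(axis=0), x.mean(axis=0)], axis=0)

def conv2d_same(x, weight):
    """Naive 2-input, 1-output k x k convolution with zero 'same' padding.

    x has shape (2, A, B), weight has shape (2, k, k); returns (A, B).
    """
    k = weight.shape[-1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    a, b = x.shape[1:]
    out = np.zeros((a, b))
    for i in range(a):
        for j in range(b):
            out[i, j] = np.sum(xp[:, i:i + k, j:j + k] * weight)
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attention_branch(x, weight):
    """Gate x (D, A, B) with a sigmoid map built from Z-pool over axis 0."""
    gate = sigmoid(conv2d_same(z_pool(x), weight))  # shape (A, B)
    return x * gate[None, :, :]

def triplet_attention(x, w_cw, w_hc, w_hw):
    """Average three gated copies of x, which has shape (C, H, W)."""
    # Branch 1: move H to the front, so the gate spans the (C, W) plane.
    y1 = attention_branch(x.transpose(1, 0, 2), w_cw).transpose(1, 0, 2)
    # Branch 2: move W to the front, so the gate spans the (H, C) plane.
    y2 = attention_branch(x.transpose(2, 1, 0), w_hc).transpose(2, 1, 0)
    # Branch 3: no permutation; pooling over C gives an (H, W) spatial gate.
    y3 = attention_branch(x, w_hw)
    return (y1 + y2 + y3) / 3.0

rng = np.random.default_rng(0)
k = 7  # assumed kernel size
x = rng.standard_normal((8, 6, 5))  # (C, H, W)
weights = [rng.standard_normal((2, k, k)) * 0.1 for _ in range(3)]
out = triplet_attention(x, *weights)
print(out.shape)                     # (8, 6, 5)
print(sum(w.size for w in weights))  # 294, i.e. 6 * k**2
```

Each branch uses a single 2-input k x k kernel regardless of C, H, or W, which is why the weight count stays at 6k^2 and no channel reduction is needed.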

Experimental results

Research questions

RQ1: Can cross-dimension interactions improve attention quality without bottleneck dimensionality reduction?
RQ2: What is the computational and parameter cost of triplet attention compared to CBAM, SE, and other attention modules?
RQ3: Do the gains from triplet attention translate to ImageNet classification and MS COCO/PASCAL VOC object detection tasks?
RQ4: How does triplet attention affect Grad-CAM visual explanations compared to baselines?

Key findings

Triplet attention yields a 2.28% Top-1 accuracy boost on ResNet-50 with only a 0.02% parameter increase and ~1% FLOP increase.
On ImageNet across backbones, triplet attention matches or outperforms similar modules while using fewer parameters (e.g., 0.0048M overhead for the attention layer).
In object detection, ResNet-50 + Triplet Attention improves Faster R-CNN, RetinaNet, and Mask R-CNN results compared to baselines and CBAM, with notable AP gains on COCO validation.
On PASCAL VOC, using triplet attention with Faster R-CNN yields higher AP than CBAM and baseline ResNet-50.
Grad-CAM/Grad-CAM++ visualizations indicate triplet attention produces tighter, more discriminative localization patterns than baselines.
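
The parameter figures quoted above (a 6k^2 term, 0.0048M overhead, roughly 0.02% of ResNet-50) can be cross-checked with a short calculation. The sketch below rests on assumptions that the review does not state: kernel size k = 7, one module in each of ResNet-50's 16 bottleneck blocks, two batch-norm affine parameters per branch, and torchvision's ResNet-50 parameter count as the reference size. The SE and CBAM formulas count only the standard fully connected and convolution weights, with biases ignored, and the helper names are hypothetical.

```python
def triplet_params(k):
    """Conv weights in one triplet attention module: 3 branches x 2 inputs x k*k."""
    return 6 * k * k

def se_params(c, r):
    """Weights of the two fully connected layers in an SE block (biases ignored)."""
    return 2 * c * c // r

def cbam_params(c, r, k):
    """Shared channel MLP (as in SE) plus one 2-input k x k spatial convolution."""
    return 2 * c * c // r + 2 * k * k

k, c, r = 7, 512, 16  # assumed kernel size, channel width, reduction ratio
per_module = triplet_params(k)
print(per_module)            # 294
print(se_params(c, r))       # 32768
print(cbam_params(c, r, k))  # 32866

# Assumption: one module per bottleneck block (16 in ResNet-50),
# plus 2 batch-norm affine parameters for each of the 3 branches.
blocks = 16
total = blocks * (per_module + 3 * 2)
print(total)                 # 4800, i.e. 0.0048M

resnet50_params = 25_557_032  # torchvision ResNet-50, used only as a reference
print(round(100 * total / resnet50_params, 3))  # 0.019 (percent)
```

Under these assumptions the total matches the 0.0048M overhead and rounds to the 0.02% increase reported for ResNet-50, while SE and CBAM scale with C^2/r and are about two orders of magnitude larger at C = 512.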

This review was generated by AI and reviewed by a human editor.