QUICK REVIEW

[Paper Review] DETRs with Collaborative Hybrid Assignments Training

Zhuofan Zong, Guanglu Song | arXiv (Cornell University) | Nov 22, 2022

Music and Audio Processing | Citations: 20

One-Line Summary

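This paper proposes Co-DETR, a training scheme that attaches multiple auxiliary heads with one-to-many label assignment to DETR-like detectors. It strengthens encoder supervision and decoder attention learning at no extra inference cost and achieves state-of-the-art results on COCO and LVIS.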

ABSTRACT

In this paper, we provide the observation that too few queries assigned as positive samples in DETR with one-to-one set matching leads to sparse supervision on the encoder's output, which considerably hurts the discriminative feature learning of the encoder, and vice versa for attention learning in the decoder. To alleviate this, we present a novel collaborative hybrid assignments training scheme, namely Co-DETR, to learn more efficient and effective DETR-based detectors from versatile label assignment manners. This new training scheme can easily enhance the encoder's learning ability in end-to-end detectors by training multiple parallel auxiliary heads supervised by one-to-many label assignments such as ATSS and Faster RCNN. In addition, we construct extra customized positive queries by extracting the positive coordinates from these auxiliary heads to improve the training efficiency of positive samples in the decoder. In inference, these auxiliary heads are discarded and thus our method introduces no additional parameters and computational cost to the original detector while requiring no hand-crafted non-maximum suppression (NMS). We conduct extensive experiments to evaluate the effectiveness of the proposed approach on DETR variants, including DAB-DETR, Deformable-DETR, and DINO-Deformable-DETR. The state-of-the-art DINO-Deformable-DETR with Swin-L can be improved from 58.5% to 59.5% AP on COCO val. Surprisingly, incorporated with a ViT-L backbone, we achieve 66.0% AP on COCO test-dev and 67.9% AP on LVIS val, outperforming previous methods by clear margins with much smaller model sizes. Code is available at https://github.com/Sense-X/Co-DETR.
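To make the abstract's opening observation concrete, the sketch below counts positives on a toy example under the two kinds of assignment. It is a simplification under stated assumptions: the one-to-one matcher is a brute-force minimum-cost matching with cost 1 - IoU (DETR's Hungarian matcher also uses classification and box-regression costs), and the one-to-many assigner is a single IoU threshold of 0.5, only loosely in the spirit of Faster R-CNN-style assignment. The function names and the threshold are illustrative and not taken from the Co-DETR code.

```python
from itertools import permutations

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def one_to_one(preds, gts):
    """Minimum-cost matching with cost 1 - IoU, found by brute force (toy sizes only).

    Each ground truth gets exactly one prediction; assumes len(preds) >= len(gts).
    Returns (pred_idx, gt_idx) pairs.
    """
    best, best_cost = (), float("inf")
    for perm in permutations(range(len(preds)), len(gts)):
        cost = sum(1.0 - iou(preds[p], gts[g]) for g, p in enumerate(perm))
        if cost < best_cost:
            best, best_cost = perm, cost
    return [(p, g) for g, p in enumerate(best)]

def one_to_many(preds, gts, thr=0.5):
    """Every prediction whose best IoU with some ground truth is >= thr becomes a positive."""
    if not gts:
        return []
    pairs = []
    for p, box in enumerate(preds):
        ious = [iou(box, gt) for gt in gts]
        g = max(range(len(gts)), key=ious.__getitem__)  # index of best-overlapping GT
        if ious[g] >= thr:
            pairs.append((p, g))
    return pairs

gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [(0, 0, 10, 10), (1, 1, 10, 10), (0, 0, 9, 9),
         (20, 20, 30, 30), (21, 21, 30, 30), (50, 50, 60, 60)]
print(len(one_to_one(preds, gts)), len(one_to_many(preds, gts)))  # 2 5
```

With one-to-one matching only one prediction per ground-truth box is supervised as a positive, so the near-duplicates at indices 1, 2 and 4 count as negatives; the threshold assigner keeps them as positives, which illustrates the denser supervision that one-to-many assignment provides.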

Motivation and Goals

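Motivates and analyzes the limitations of one-to-one set matching in DETR, in particular sparse supervision on the encoder output and weak attention learning in the decoder.
Proposes the collaborative hybrid assignments training scheme (Co-DETR), which enriches encoder supervision by exploiting one-to-many label assignments through auxiliary heads.
Strengthens decoder training by generating customized positive queries from the auxiliary heads' coordinates, without changing the original decoder architecture.
Demonstrates that Co-DETR improves convergence and accuracy across DETR variants without increasing inference cost.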
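The customized positive queries can be pictured as follows. Our reading of the paper is that the positive box coordinates from each auxiliary head are turned into queries through a positional encoding followed by a linear projection, plus projected features sampled at the positive locations; treat this as an assumption and check the paper for the exact form. The sketch covers only the positional-encoding step, using a sinusoidal encoding of normalized (cx, cy, w, h) boxes similar to the one used in DAB-DETR. The helper names, `num_feats`, and `temperature` are hypothetical choices, and the learned linear layers and feature sampling are omitted.

```python
import math

def xyxy_to_normalized_cxcywh(box, img_w, img_h):
    """Convert an absolute (x1, y1, x2, y2) box to (cx, cy, w, h) normalized by image size."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2 / img_w, (y1 + y2) / 2 / img_h,
            (x2 - x1) / img_w, (y2 - y1) / img_h)

def sine_box_encoding(box, num_feats=8, temperature=10000.0):
    """Sinusoidal encoding of a normalized (cx, cy, w, h) box.

    Each coordinate contributes num_feats values (num_feats should be even):
    sin/cos pairs over num_feats // 2 frequencies, so the output has
    4 * num_feats entries.
    """
    enc = []
    for coord in box:
        x = coord * 2 * math.pi
        for i in range(num_feats // 2):
            freq = temperature ** (2 * i / num_feats)
            enc.append(math.sin(x / freq))
            enc.append(math.cos(x / freq))
    return enc

def positive_query_encodings(pos_boxes, img_w, img_h, num_feats=8):
    """One encoding per positive box of an auxiliary head (hypothetical helper).

    In the full scheme these would still pass through learned linear layers
    and be combined with sampled features; that part is omitted here.
    """
    return [sine_box_encoding(xyxy_to_normalized_cxcywh(b, img_w, img_h), num_feats)
            for b in pos_boxes]

# Toy usage: two positive boxes from one hypothetical auxiliary head on a 200x200 image.
queries = positive_query_encodings([(0, 0, 100, 100), (50, 50, 150, 150)], 200, 200)
print(len(queries), len(queries[0]))  # 2 32
```

Because the queries are built directly from boxes the auxiliary head already assigned as positives, they come with their matched ground truths, which is how they add positive samples to decoder training without touching the decoder itself.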

Proposed Method

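Introduces K auxiliary heads, each supervised by one of several versatile one-to-many label assignments (e.g., ATSS, Faster R-CNN, RetinaNet, FCOS).
Builds a multi-scale feature pyramid from the encoder output and feeds it to the auxiliary heads.
Computes the encoder loss by summing the losses of all auxiliary heads, each using its own positive/negative assignments.
Generates customized positive queries from the auxiliary heads' positive coordinates, enriching decoder training without changing the original decoder architecture.
Trains with a global objective that combines the main one-to-one DETR loss and the auxiliary losses through balancing coefficients.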
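The global objective in the last step can be sketched as a plain function over already-computed scalar losses. Our reading of the paper is that, for each of the L decoder layers, the one-to-one loss is added to λ1 times the losses of the customized positive queries from all K auxiliary heads and λ2 times the summed encoder-side auxiliary loss. The per-layer placement of the encoder term is an assumption (adding it once instead would only change the effective λ2 by a factor of L), and the function name and default coefficients of 1.0 are hypothetical.

```python
from typing import Sequence

def co_detr_global_loss(
    o2o_losses: Sequence[float],
    aux_dec_losses: Sequence[Sequence[float]],
    enc_losses: Sequence[float],
    lambda1: float = 1.0,
    lambda2: float = 1.0,
) -> float:
    """Schematic Co-DETR-style objective over precomputed scalar losses (hypothetical helper).

    o2o_losses:     one-to-one (set-matching) loss of each decoder layer, length L
    aux_dec_losses: for each decoder layer, the K losses of the customized positive queries
    enc_losses:     the K auxiliary-head losses computed on the encoder features
    """
    if len(o2o_losses) != len(aux_dec_losses):
        raise ValueError("expected one list of auxiliary decoder losses per decoder layer")
    enc_total = sum(enc_losses)  # encoder loss: sum over the K auxiliary heads
    total = 0.0
    for o2o, aux in zip(o2o_losses, aux_dec_losses):
        # per decoder layer: main loss + weighted auxiliary decoder loss + weighted encoder loss
        total += o2o + lambda1 * sum(aux) + lambda2 * enc_total
    return total

# Toy usage: L = 2 decoder layers, K = 2 auxiliary heads.
print(co_detr_global_loss([1.0, 2.0], [[0.5, 0.5], [1.0, 1.0]], [0.25, 0.25], lambda2=2.0))  # 8.0
```

At inference none of the auxiliary terms exist: the auxiliary heads and customized queries are dropped and only the one-to-one branch runs, which is why the scheme adds no inference cost.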

Figure 1: Performance of models with ResNet-50 on COCO val. Co-DETR outperforms other counterparts by a large margin.

Experimental Results

Research Questions

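RQ1: How do auxiliary heads with one-to-many assignment affect the encoder's discriminability and feature learning in DETR variants?
RQ2: Can customized positive queries from the auxiliary heads improve the decoder's cross-attention learning without increasing inference cost?
RQ3: How large are the gains when Co-DETR is applied to different DETR variants (e.g., Deformable-DETR, DINO-Deformable-DETR) and backbones (e.g., Swin-L) on COCO and LVIS?
RQ4: How do the auxiliary heads interact, and what limits apply for stable training (e.g., the optimal number of heads)?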

Key Results

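Co-DETR yields substantial AP gains across DETR variants; for example, it improves Deformable-DETR by 5.8 AP with 12-epoch training and by 3.2 AP with 36-epoch training.
With DINO-Deformable-DETR and a Swin-L backbone, Co-DETR raises COCO val AP from 58.5 to 59.5.
With a ViT-L backbone, Co-DETR reaches 66.0 AP on COCO test-dev and 67.9 AP on LVIS val, outperforming previous methods with smaller models.
With ViT-L on COCO, it reaches 65.9 AP (val) and 66.0 AP (test-dev); the reported LVIS results include 56.9 AP (val) and 62.3 AP (minival).
Co-DETR speeds up convergence, and because the auxiliary heads are removed at test time, it adds no inference cost.
With a large backbone (ViT-L pre-trained on Objects365), it sets new records, e.g., 66.0 AP on COCO test-dev and 71.9 AP on LVIS minival (without strong test-time augmentation).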

Figure 2 : IoF-IoB curves for the feature discriminability score in the encoder and attention discriminability score in the decoder.


This review was generated by AI and reviewed by a human editor.