QUICK REVIEW

[Paper Review] Query2Label: A Simple Transformer Way to Multi-Label Classification

Shilong Liu, Lei Zhang|arXiv (Cornell University)|Jul 22, 2021

Advanced Image and Video Retrieval Techniques|54 references|120 citations

TL;DR

Query2Label is a simple two-stage framework for multi-label classification: a vision backbone computes a spatial feature map, and a Transformer decoder uses learnable label embeddings as queries to pool label-specific features through cross-attention. It reports state-of-the-art results on several datasets, including 91.3% mAP on MS-COCO.

ABSTRACT

This paper presents a simple and effective approach to solving the multi-label classification problem. The proposed approach leverages Transformer decoders to query the existence of a class label. The use of Transformer is rooted in the need of extracting local discriminative features adaptively for different labels, which is a strongly desired property due to the existence of multiple objects in one image. The built-in cross-attention module in the Transformer decoder offers an effective way to use label embeddings as queries to probe and pool class-related features from a feature map computed by a vision backbone for subsequent binary classifications. Compared with prior works, the new framework is simple, using standard Transformers and vision backbones, and effective, consistently outperforming all previous works on five multi-label classification data sets, including MS-COCO, PASCAL VOC, NUS-WIDE, and Visual Genome. Particularly, we establish $91.3\%$ mAP on MS-COCO. We hope its compact structure, simple implementation, and superior performance serve as a strong baseline for multi-label classification tasks and future studies. The code will be available soon at https://github.com/SlongLiu/query2labels.

Motivation & Objective

Address the challenges of multi-label classification, where multiple objects or concepts may appear in one image.
Propose a simple, backbone-agnostic framework that leverages Transformer decoders to query the existence of each label.
Enable adaptive, region-focused feature extraction for each label via cross-attention in Transformer decoders.
Demonstrate state-of-the-art performance across standard benchmarks (MS-COCO, PASCAL VOC, NUS-WIDE, Visual Genome) using simple components.

Proposed method

Use a two-stage framework: a backbone first extracts spatial features from the image, and Transformer decoders then pool label-specific features from them.
Introduce learnable label embeddings as queries to a multi-layer Transformer decoder.
Apply cross-attention to pool label-specific features from the spatial feature map for each label.
Project the resulting label-specific features to logits with a linear layer and sigmoid to predict label presence.
Keep the design backbone-agnostic and optimize with an asymmetric variant of focal loss to handle label imbalance.
Optionally include a lightweight Transformer encoder to fuse global context; the whole model is trained end-to-end.
Learn the label embeddings from data, so that they implicitly capture label correlations without explicit graphs.
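
The pooling step described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses a single attention head and a single layer, and it omits the self-attention, feed-forward, normalization, and positional-encoding parts of a standard Transformer decoder. The function name `query2label_head`, the projection matrices `w_k` and `w_v`, and the per-label classifier weights `w_out` and `b_out` are hypothetical names introduced for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def query2label_head(feature_map, label_queries, w_k, w_v, w_out, b_out):
    """Single-head cross-attention pooling with label queries (illustrative).

    feature_map:   (C, H, W) spatial features from any backbone
    label_queries: (K, d)    one learnable embedding per label
    w_k, w_v:      (C, d)    key / value projections (assumed shapes)
    w_out, b_out:  (K, d), (K,) per-label linear classifier (an assumption)
    """
    C, H, W = feature_map.shape
    tokens = feature_map.reshape(C, H * W).T      # (HW, C): one token per location
    keys = tokens @ w_k                            # (HW, d)
    values = tokens @ w_v                          # (HW, d)
    d = label_queries.shape[1]
    # Each label query scores every spatial location, then pools the values.
    attn = softmax(label_queries @ keys.T / np.sqrt(d), axis=-1)  # (K, HW)
    pooled = attn @ values                         # (K, d) label-specific features
    logits = (pooled * w_out).sum(axis=1) + b_out  # (K,) one logit per label
    return sigmoid(logits), attn.reshape(-1, H, W)

# Toy usage with random inputs; all sizes are arbitrary.
rng = np.random.default_rng(0)
C, H, W, K, d = 8, 4, 4, 3, 16
probs, attn = query2label_head(
    rng.normal(size=(C, H, W)),
    rng.normal(size=(K, d)),
    rng.normal(size=(C, d)),
    rng.normal(size=(C, d)),
    rng.normal(size=(K, d)),
    np.zeros(K),
)
print(probs.shape, attn.shape)  # (3,) (3, 4, 4)
```

Each row of `attn` is a distribution over spatial locations for one label, which is what lets different labels attend to different regions of the same image.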

Experimental results

Research questions

RQ1: Can Transformer-based cross-attention with label-specific queries improve localization of discriminative regions for each label in multi-label images?
RQ2: Does learning label embeddings end-to-end provide robust, backbone-agnostic multi-label classification with state-of-the-art performance?
RQ3: How does the proposed asymmetric loss interact with the Transformer-based framework to handle label imbalance across datasets?
RQ4: What are the effects of using different backbone architectures and input resolutions on Q2L performance across benchmarks?
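
To make the "asymmetric" loss in the third question concrete, here is a rough sketch of a focal-style loss that uses separate focusing exponents for positive and negative labels. The function name and the default exponents are assumptions chosen for illustration; the review does not state the paper's exact formulation or hyperparameters.

```python
import math

def asymmetric_focal_loss(probs, targets, gamma_pos=0.0, gamma_neg=2.0, eps=1e-8):
    """Mean per-label loss with different focusing for positives and negatives.

    probs:   predicted probabilities, one per label
    targets: 0/1 ground-truth labels, same length as probs
    gamma_pos, gamma_neg: focusing exponents (illustrative defaults, assumed)
    """
    total = 0.0
    for p, y in zip(probs, targets):
        if y == 1:
            # Positive label: down-weight by (1 - p)^gamma_pos.
            total += -((1.0 - p) ** gamma_pos) * math.log(max(p, eps))
        else:
            # Negative label: down-weight easy negatives by p^gamma_neg.
            total += -(p ** gamma_neg) * math.log(max(1.0 - p, eps))
    return total / len(probs)

# With both exponents at 0 this reduces to plain binary cross-entropy.
bce = asymmetric_focal_loss([0.9, 0.1], [1, 0], gamma_pos=0.0, gamma_neg=0.0)
asl = asymmetric_focal_loss([0.9, 0.1], [1, 0])
print(bce, asl)
```

A larger exponent on the negative side shrinks the contribution of the many easy negatives in a multi-label image, so the few positive labels are not swamped during training.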

Key findings

Achieves new state-of-the-art results on MS-COCO, PASCAL VOC, NUS-WIDE, and Visual Genome across multiple metrics.
Demonstrates strong performance especially for medium-sized objects due to spatially adaptive feature pooling.
Shows that a simple, end-to-end trainable label-embedding strategy with cross-attention provides strong baselines with a compact, easy-to-implement architecture.
Transformer decoders with multi-head attention can decouple object representations into multiple parts or views, improving recognition under occlusion and viewpoint changes.
Backbone-agnostic design proves effective with various backbones (CNNs and Vision Transformers) and resolutions.

This review was created by AI and reviewed by human editors.