QUICK REVIEW

[Paper Review] GIFT: Learning Transformation-Invariant Dense Visual Descriptors via Group CNNs

Yuan Liu, Zehong Shen|arXiv (Cornell University)|Nov 13, 2019

Advanced Image and Video Retrieval Techniques|35 references|45 citations

TL;DR

GIFT introduces a transformation-invariant dense descriptor using group convolutions over features extracted from transformed images, achieving discriminative and provably invariant descriptors for dense matching and improving relative pose estimation.

ABSTRACT

Finding local correspondences between images with different viewpoints requires local descriptors that are robust against geometric transformations. An approach for transformation invariance is to integrate out the transformations by pooling the features extracted from transformed versions of an image. However, the feature pooling may sacrifice the distinctiveness of the resulting descriptors. In this paper, we introduce a novel visual descriptor named Group Invariant Feature Transform (GIFT), which is both discriminative and robust to geometric transformations. The key idea is that the features extracted from the transformed versions of an image can be viewed as a function defined on the group of the transformations. Instead of feature pooling, we use group convolutions to exploit underlying structures of the extracted features on the group, resulting in descriptors that are both discriminative and provably invariant to the group of transformations. Extensive experiments show that GIFT outperforms state-of-the-art methods on several benchmark datasets and practically improves the performance of relative pose estimation.

Motivation & Objective

Motivate the need for local descriptors robust to geometric transformations across viewpoints.
Propose a descriptor that remains discriminative while being invariant to a transformation group.
Develop a pipeline that builds group features from transformed images and embeds them with group CNNs.
Show provable invariance via group convolutions and bilinear pooling.
Demonstrate state-of-the-art performance on standard and extreme-variation datasets.

Proposed method

Warp the input image with a grid of transformations from the group G (rotations and scaling).
Extract features with a vanilla CNN on each transformed image to form group features f0(g) over G at each point.
Process f0 with two group CNNs (alpha and beta), built from group convolution layers that preserve equivariance, to obtain f_l,alpha and f_l,beta.
Apply bilinear pooling on the two group-CNN outputs to form the final GIFT descriptor d; normalize it to unit length.
Train with a triplet loss using hard negative mining to encourage correct matches.
Sample a finite set of group elements to keep computation tractable, and pool over this discrete set to achieve invariance.
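
The steps above can be illustrated with a toy NumPy sketch. It is not the authors' implementation: it assumes a single cyclic group (for example, 8 sampled rotations) instead of the paper's rotation-and-scale group, replaces the backbone CNN with random group features, uses one group convolution layer per branch with random weights, and the names group_conv, bilinear_pool and gift_descriptor are hypothetical. It only shows why equivariant layers followed by bilinear pooling over the group give a descriptor that does not change when the group features are permuted.

```python
import numpy as np

def group_conv(f, w):
    """Cyclic group convolution (correlation) over the group axis.

    f: (n, c_in) features, one row per sampled group element.
    w: (k, c_in, c_out) filter supported on k neighbouring group elements.
    Returns (n, c_out). Circular indexing keeps the layer equivariant to
    cyclic shifts of the group axis.
    """
    n, k = f.shape[0], w.shape[0]
    out = np.zeros((n, w.shape[2]))
    for g in range(n):
        for j in range(k):
            out[g] += f[(g + j) % n] @ w[j]
    return np.maximum(out, 0.0)  # pointwise ReLU keeps equivariance

def bilinear_pool(fa, fb):
    """Sum over the group of outer products, flattened and L2-normalised."""
    d = (fa.T @ fb).ravel()
    return d / (np.linalg.norm(d) + 1e-12)

def gift_descriptor(f0, w_alpha, w_beta):
    """Two equivariant branches (alpha, beta) followed by bilinear pooling."""
    return bilinear_pool(group_conv(f0, w_alpha), group_conv(f0, w_beta))

rng = np.random.default_rng(0)
n, c = 8, 4                      # toy sizes: 8 group elements, 4 channels
f0 = rng.normal(size=(n, c))     # stand-in for backbone features f0(g)
w_alpha = rng.normal(size=(3, c, 5))
w_beta = rng.normal(size=(3, c, 5))

d = gift_descriptor(f0, w_alpha, w_beta)

# Transforming the image by a group element permutes the group features;
# for a cyclic group that permutation is a roll along the group axis.
d_shifted = gift_descriptor(np.roll(f0, 3, axis=0), w_alpha, w_beta)

max_diff = float(np.max(np.abs(d - d_shifted)))
print("descriptor length:", d.shape[0])
print("max difference after shifting the group axis:", max_diff)
```

The difference printed at the end should be at floating-point rounding level, because the roll passes through each equivariant layer unchanged and the final sum over the group ignores the order of its terms.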

Experimental results

Research questions

RQ1: How can local descriptors be made invariant to a transformation group without sacrificing discriminability?
RQ2: Can group convolutions on features defined over transformation groups preserve equivariance and enable invariant dense descriptors?
RQ3: Does GIFT improve dense and sparse matching, as well as relative pose estimation, under large viewpoint and appearance changes?

Key findings

GIFT yields discriminative, provably invariant descriptors for the considered transformation group, outperforming traditional and learned descriptors on benchmark datasets.
Bilinear pooling of the two group-CNN outputs provides invariance while retaining richer statistics than other pooling schemes.
In the ablations, increasing the number of group convolution layers improves performance; GIFT-6, the variant used in the experiments, shows strong results.
GIFT demonstrates robustness to extreme scale and orientation changes and improves relative pose estimation when fine-tuned on real data (GIFT-F).
The implementation takes about 65.2 ms on a GTX 1080 Ti for 1024 interest points on a 480x360 image, which is fast enough for practical use.
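
The invariance attributed to bilinear pooling in the findings above can be written out for a finite group. The notation here is an assumption made for illustration and may differ from the paper's: G is a finite group, f_alpha and f_beta are the two group-CNN outputs at one image point, and transforming the image by h in G is taken to re-index the group features as g -> gh.

```latex
\begin{align}
d  &= \sum_{g \in G} f_\alpha(g)\, f_\beta(g)^{\top}, \\
d' &= \sum_{g \in G} f_\alpha(gh)\, f_\beta(gh)^{\top}
    = \sum_{g' \in G} f_\alpha(g')\, f_\beta(g')^{\top}
    = d, \qquad g' = gh .
\end{align}
```

The last step uses the fact that g -> gh is a bijection of G onto itself, so the sum only changes the order of its terms. Plain average pooling, the sum of f(g) over G, is invariant for the same reason, but it keeps only first-order statistics per channel, whereas the outer product also records how channels of the two branches co-occur across the group. The argument is exact only when the sampled elements are closed under composition; for a truncated range of scales it holds approximately.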

This review was created by AI and reviewed by human editors.