QUICK REVIEW

[Paper Review] Deep convolutional filter banks for texture recognition and segmentation

Mircea Cimpoi, Subhransu Maji | arXiv (Cornell University) | Nov 25, 2014

Advanced Image and Video Retrieval Techniques | 32 references | 50 citations

TL;DR

This paper proposes FV-CNN (called D-CNN in the abstract below), a novel texture descriptor that applies Fisher Vector pooling to convolutional neural network (CNN) filter banks to improve texture, material, and scene recognition in cluttered images. By treating CNN features as a learned filter bank and using orderless, multi-scale pooling, FV-CNN achieves state-of-the-art performance (79.8% on Flickr Material, 81.1% on MIT Indoor Scenes) without requiring fine-tuning or image resizing.

ABSTRACT

Research in texture recognition often concentrates on the problem of material recognition in uncluttered conditions, an assumption rarely met by applications. In this work we conduct a first study of material and describable texture attributes recognition in clutter, using a new dataset derived from the OpenSurface texture repository. Motivated by the challenge posed by this problem, we propose a new texture descriptor, D-CNN, obtained by Fisher Vector pooling of a Convolutional Neural Network (CNN) filter bank. D-CNN substantially improves the state-of-the-art in texture, material and scene recognition. Our approach achieves 82.3% accuracy on Flickr material dataset and 81.1% accuracy on MIT indoor scenes, providing absolute gains of more than 10% over existing approaches. D-CNN easily transfers across domains without requiring feature adaptation as for methods that build on the fully-connected layers of CNNs. Furthermore, D-CNN can seamlessly incorporate multi-scale information and describe regions of arbitrary shapes and sizes. Our approach is particularly suited at localizing stuff categories and obtains state-of-the-art results on MSRC segmentation dataset, as well as promising results on recognizing materials and surface attributes in clutter on the OpenSurfaces dataset.

Motivation & Objective

Address the challenge of material and texture attribute recognition in real-world, cluttered natural images, where textures are not isolated or uniformly distributed.
Overcome limitations of existing CNN-based approaches that rely on fully connected layers, which are sensitive to spatial layout, require fixed input size, and may be less transferable.
Develop a flexible, orderless, and multi-scale feature representation that preserves texture-specific invariances while enabling domain transfer without fine-tuning.
Evaluate the proposed method on new benchmarks derived from the OpenSurfaces dataset for material and texture attribute recognition and segmentation in clutter.

Proposed method

Treat the activation maps of the convolutional layers of a pre-trained CNN (e.g., VGG-M) as a learned, non-linear filter bank.
Apply Fisher Vector (FV) pooling to the filter bank responses to create an orderless, discriminative descriptor of an image or region.
Use the FV representation to encode the distribution of filter responses across spatial locations, enabling multi-scale and shape-agnostic feature aggregation.
Process arbitrary-sized input images directly through convolutional layers, avoiding costly resizing operations required by fully connected layers.
Train a linear SVM on FV-CNN features for classification, enabling fast and effective recognition without domain-specific adaptation.
Extend the method to weakly supervised segmentation by combining FV-CNN region descriptors with general-purpose image segmentation algorithms (e.g., crisp regions or overlapping proposals).
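The pooling step in the list above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes the local descriptors (one per spatial location of a convolutional feature map) and the parameters of a diagonal-covariance Gaussian mixture model are already available. The function name `fisher_vector`, the toy sizes, and the random stand-in data are hypothetical, and the signed square root plus L2 normalization is the commonly used "improved" Fisher Vector variant, assumed here rather than taken from the review.

```python
import numpy as np


def fisher_vector(descriptors, weights, means, variances):
    """Encode N local descriptors (N x D) with a K-component diagonal GMM.

    Returns a vector of length 2*K*D: gradients with respect to the GMM
    means and variances, averaged over all descriptors (orderless pooling).
    """
    x = np.asarray(descriptors, dtype=np.float64)   # (N, D)
    n = x.shape[0]
    sigma = np.sqrt(variances)                      # (K, D)

    # Whitened difference between every descriptor and every component.
    diff = (x[:, None, :] - means[None, :, :]) / sigma[None, :, :]  # (N, K, D)

    # Log of (mixture weight * Gaussian density) for each pair.
    log_prob = -0.5 * (diff ** 2).sum(axis=2)
    log_prob -= 0.5 * np.log(2.0 * np.pi * variances).sum(axis=1)[None, :]
    log_prob += np.log(weights)[None, :]

    # Soft assignments (posteriors), computed in a numerically stable way.
    log_prob -= log_prob.max(axis=1, keepdims=True)
    post = np.exp(log_prob)
    post /= post.sum(axis=1, keepdims=True)         # (N, K)

    # First- and second-order statistics per component.
    g_mean = (post[:, :, None] * diff).sum(axis=0)
    g_mean /= n * np.sqrt(weights)[:, None]
    g_var = (post[:, :, None] * (diff ** 2 - 1.0)).sum(axis=0)
    g_var /= n * np.sqrt(2.0 * weights)[:, None]

    fv = np.concatenate([g_mean.ravel(), g_var.ravel()])

    # Assumed normalization: signed square root, then L2.
    fv = np.sign(fv) * np.sqrt(np.abs(fv))
    norm = np.linalg.norm(fv)
    return fv / norm if norm > 0 else fv


# Toy usage with random stand-ins for CNN descriptors and GMM parameters.
rng = np.random.default_rng(0)
K, D = 4, 8
weights = np.full(K, 1.0 / K)
means = rng.normal(size=(K, D))
variances = np.ones((K, D))

# Descriptors from two rescaled copies of one image are simply stacked,
# so inputs of different sizes still yield a fixed-length vector.
scale_a = rng.normal(size=(30 * 40, D))
scale_b = rng.normal(size=(15 * 20, D))
fv = fisher_vector(np.vstack([scale_a, scale_b]), weights, means, variances)
print(fv.shape)  # (64,)
```

Because the encoding only sums over descriptors, shuffling their order or changing their number leaves the output length unchanged, which is what makes the representation orderless and independent of input size. The resulting vectors would then be fed to a linear SVM.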

Experimental results

Research questions

RQ1: Can Fisher Vector pooling of CNN filter banks outperform standard CNN features (e.g., from fully connected layers) for texture and material recognition in cluttered scenes?
RQ2: Does FV-CNN enable better domain transfer than methods relying on fully connected layers, especially without fine-tuning?
RQ3: How does the performance of FV-CNN vary across different CNN layers, and which layer provides the most discriminative texture representation?
RQ4: Can FV-CNN achieve state-of-the-art results in weakly supervised segmentation tasks without CRF-based post-processing or dataset-specific training?
RQ5: How effective is FV-CNN in recognizing describable texture attributes (e.g., wrinkled, marbled) and materials (e.g., brick, fabric) in real-world, cluttered, and complex scenes?

Key findings

FV-CNN achieves 79.8% accuracy on the Flickr Material dataset, representing an absolute improvement of over 10% over prior state-of-the-art methods.
On the MIT Indoor Scenes dataset, FV-CNN achieves 81.1% accuracy, significantly outperforming the previous state-of-the-art of 70.8%.
FV-CNN outperforms SIFT-based Fisher Vector representations on all evaluated datasets, with performance improving monotonically from earlier to deeper convolutional layers.
Filter banks from VGG-M’s conv3 and deeper layers produce significantly better descriptors than SIFT, demonstrating the superiority of deep features for texture representation.
FV-CNN enables effective weakly supervised segmentation: using crisp regions, it achieves 55.4% accuracy on the OpenSurfaces material recognition benchmark and 87.0% on MSRC, matching or exceeding prior results without CRF or domain-specific training.
The method is robust to region size and shape, and overlapping proposal-based segmentation with FV-CNN yields 55.7% accuracy on OpenSurfaces, showing strong generalization and flexibility.
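The region-based results above depend on describing regions of arbitrary shape. One way to picture this, as a sketch under stated assumptions rather than the paper's code: convolutional features are computed for the whole image, and the descriptors falling inside each region are gathered and pooled separately. The helper `region_descriptors` and the toy arrays are hypothetical, and the region map is assumed to be already resized to the feature map's spatial resolution.

```python
import numpy as np


def region_descriptors(feature_map, region_labels):
    """Group the local descriptors of an H x W x D feature map by region.

    region_labels is an H x W integer map with one region id per location.
    Returns {region_id: (N_r x D) array}; regions may have any shape or size.
    """
    h, w, d = feature_map.shape
    flat_features = feature_map.reshape(h * w, d)
    flat_labels = region_labels.reshape(h * w)
    return {
        int(r): flat_features[flat_labels == r]
        for r in np.unique(flat_labels)
    }


# Toy usage: a random stand-in for a conv feature map and three regions.
rng = np.random.default_rng(0)
features = rng.normal(size=(6, 8, 5))   # H x W x D
labels = np.zeros((6, 8), dtype=int)
labels[:, 5:] = 1                        # right three columns form region 1
labels[0, 0] = 2                         # a single-location region
groups = region_descriptors(features, labels)
for region_id, desc in groups.items():
    print(region_id, desc.shape)
# 0 (29, 5)
# 1 (18, 5)
# 2 (1, 5)
```

Each per-region set of descriptors can then be passed to an orderless pooling step such as Fisher Vector encoding, giving one fixed-length vector per region regardless of how many locations the region covers.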


This review was created by AI and reviewed by human editors.