[Paper Review] Deep convolutional filter banks for texture recognition and segmentation
This paper proposes FV-CNN (called D-CNN in the paper's abstract), a texture descriptor that applies Fisher Vector pooling to convolutional neural network (CNN) filter banks to improve texture, material, and scene recognition in cluttered images. By treating CNN features as a learned filter bank and using orderless, multi-scale pooling, FV-CNN achieves state-of-the-art performance (79.8% on Flickr Material, 81.1% on MIT Indoor Scenes) without requiring fine-tuning or image resizing.
Research in texture recognition often concentrates on the problem of material recognition in uncluttered conditions, an assumption rarely met by applications. In this work we conduct a first study of material and describable texture attributes recognition in clutter, using a new dataset derived from the OpenSurface texture repository. Motivated by the challenge posed by this problem, we propose a new texture descriptor, D-CNN, obtained by Fisher Vector pooling of a Convolutional Neural Network (CNN) filter bank. D-CNN substantially improves the state-of-the-art in texture, material and scene recognition. Our approach achieves 82.3% accuracy on Flickr material dataset and 81.1% accuracy on MIT indoor scenes, providing absolute gains of more than 10% over existing approaches. D-CNN easily transfers across domains without requiring feature adaptation as for methods that build on the fully-connected layers of CNNs. Furthermore, D-CNN can seamlessly incorporate multi-scale information and describe regions of arbitrary shapes and sizes. Our approach is particularly suited at localizing stuff categories and obtains state-of-the-art results on MSRC segmentation dataset, as well as promising results on recognizing materials and surface attributes in clutter on the OpenSurfaces dataset.
Motivation & Objective
- Address the challenge of material and texture attribute recognition in real-world, cluttered natural images, where textures are not isolated or uniformly distributed.
- Overcome limitations of existing CNN-based approaches that rely on fully connected layers, which are sensitive to spatial layout, require fixed input size, and may be less transferable.
- Develop a flexible, orderless, and multi-scale feature representation that preserves texture-specific invariances while enabling domain transfer without fine-tuning.
- Evaluate the proposed method on new benchmarks derived from the OpenSurfaces dataset for material and texture attribute recognition and segmentation in clutter.
Proposed method
- Treat the activation maps of the convolutional layers of a pre-trained CNN (e.g., VGG-M) as a learned, non-linear filter bank.
- Apply Fisher Vector (FV) pooling to the filter bank responses to create an orderless, discriminative global descriptor.
- Use the FV representation to encode the distribution of filter responses across spatial locations, enabling multi-scale and shape-agnostic feature aggregation.
- Process arbitrary-sized input images directly through convolutional layers, avoiding costly resizing operations required by fully connected layers.
- Train a linear SVM on FV-CNN features for classification, enabling fast and effective recognition without domain-specific adaptation.
- Extend the method to weakly supervised segmentation by combining FV-CNN region descriptors with general-purpose image segmentation algorithms (e.g., crisp regions or overlapping proposals).
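The pooling step in the list above can be sketched in a few lines. This is a minimal pure-Python illustration under stated assumptions, not the authors' implementation: it assumes a diagonal-covariance Gaussian mixture model (GMM) has already been fitted to training descriptors, and it uses the commonly used improved Fisher Vector formulation (gradients with respect to means and variances, then signed square root and L2 normalisation). The function names `local_descriptors` and `fisher_vector`, the toy feature map, the region mask, and the GMM parameters are all hypothetical; a practical version would take the feature map from a real CNN and vectorise the loops with an array library.

```python
import math

def local_descriptors(feature_map, mask=None):
    """Flatten an H x W x D feature map into a list of D-dim descriptors.

    If a boolean H x W mask is given, keep only locations inside the region.
    """
    return [
        cell
        for r, row in enumerate(feature_map)
        for c, cell in enumerate(row)
        if mask is None or mask[r][c]
    ]

def fisher_vector(descriptors, weights, means, variances):
    """Encode descriptors under a diagonal GMM; returns 2 * K * D values."""
    n, k, d = len(descriptors), len(weights), len(means[0])
    if n == 0:
        raise ValueError("need at least one descriptor")
    grad_mu = [[0.0] * d for _ in range(k)]
    grad_var = [[0.0] * d for _ in range(k)]
    for x in descriptors:
        # Log of each weighted Gaussian density, kept in log space for stability.
        log_p = [
            math.log(w) - 0.5 * sum(
                math.log(2 * math.pi * v) + (xi - m) ** 2 / v
                for xi, m, v in zip(x, mu, var)
            )
            for w, mu, var in zip(weights, means, variances)
        ]
        top = max(log_p)
        post = [math.exp(lp - top) for lp in log_p]
        total = sum(post)
        for j in range(k):
            gamma = post[j] / total  # soft assignment of x to component j
            for i in range(d):
                u = (x[i] - means[j][i]) / math.sqrt(variances[j][i])
                grad_mu[j][i] += gamma * u
                grad_var[j][i] += gamma * (u * u - 1.0)
    fv = []
    for j in range(k):
        fv += [g / (n * math.sqrt(weights[j])) for g in grad_mu[j]]
    for j in range(k):
        fv += [g / (n * math.sqrt(2.0 * weights[j])) for g in grad_var[j]]
    # Signed square root followed by L2 normalisation.
    fv = [math.copysign(math.sqrt(abs(v)), v) for v in fv]
    norm = math.sqrt(sum(v * v for v in fv)) or 1.0
    return [v / norm for v in fv]

# Toy 2 x 3 map with D = 2 channels, standing in for convolutional activations.
feature_map = [
    [[0.1, 1.2], [0.0, 0.9], [1.1, 0.2]],
    [[0.9, 0.1], [0.2, 1.0], [1.0, 0.0]],
]
# Hypothetical GMM with K = 2 components (normally fitted with EM).
weights = [0.5, 0.5]
means = [[0.0, 1.0], [1.0, 0.0]]
variances = [[0.1, 0.1], [0.1, 0.1]]

fv = fisher_vector(local_descriptors(feature_map), weights, means, variances)

# Pooling over an arbitrarily shaped region gives a descriptor of the same length.
region_mask = [[True, True, False], [False, True, False]]
fv_region = fisher_vector(
    local_descriptors(feature_map, region_mask), weights, means, variances
)
print(len(fv), len(fv_region))  # both are 2 * K * D = 8
```

Because the encoding only accumulates statistics over locations, the descriptor length depends on the number of GMM components and channels, not on the image size or region shape. This is what lets the same descriptor serve whole images and segmented regions, and the resulting vectors can be passed to a linear SVM as described above.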
Experimental results
Research questions
- RQ1: Can Fisher Vector pooling of CNN filter banks outperform standard CNN features (e.g., from fully connected layers) for texture and material recognition in cluttered scenes?
- RQ2: Does FV-CNN enable better domain transfer than methods relying on fully connected layers, especially without fine-tuning?
- RQ3: How does the performance of FV-CNN vary across different CNN layers, and which layer provides the most discriminative texture representation?
- RQ4: Can FV-CNN achieve state-of-the-art results in weakly supervised segmentation tasks without CRF-based post-processing or dataset-specific training?
- RQ5: How effective is FV-CNN in recognizing descriptive texture attributes (e.g., wrinkled, marbled) and materials (e.g., brick, fabric) in real-world, cluttered, and complex scenes?
Key findings
- FV-CNN achieves 79.8% accuracy on the Flickr Material dataset, an absolute improvement of more than 10% over prior state-of-the-art methods.
- On the MIT Indoor Scenes dataset, FV-CNN achieves 81.1% accuracy, significantly outperforming the previous state-of-the-art of 70.8%.
- FV-CNN outperforms SIFT-based Fisher Vector representations on all evaluated datasets, with performance improving monotonically from earlier to deeper convolutional layers.
- Filter banks from VGG-M’s conv3 and deeper layers produce significantly better descriptors than SIFT, demonstrating the superiority of deep features for texture representation.
- FV-CNN enables effective weakly supervised segmentation: using crisp regions, it achieves 55.4% accuracy on the OpenSurfaces material recognition benchmark and 87.0% on MSRC, matching or exceeding prior results without CRF or domain-specific training.
- The method is robust to region size and shape, and overlapping proposal-based segmentation with FV-CNN yields 55.7% accuracy on OpenSurfaces, showing strong generalization and flexibility.
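The "absolute" gains quoted in these findings are differences in percentage points, which is not the same as relative improvement. A small worked example, using only the MIT Indoor Scenes figures quoted above (the variable names are illustrative):

```python
fv_cnn_accuracy = 81.1  # MIT Indoor Scenes accuracy quoted above (%)
previous_best = 70.8    # previous state of the art quoted above (%)

absolute_gain = fv_cnn_accuracy - previous_best        # percentage points
relative_gain = 100.0 * absolute_gain / previous_best  # percent of the old score

print(f"absolute: {absolute_gain:.1f} points, relative: {relative_gain:.1f}%")
# prints "absolute: 10.3 points, relative: 14.5%"
```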
This review was created by AI and reviewed by human editors.