[Paper Review] Fisher Vectors Derived from Hybrid Gaussian-Laplacian Mixture Models for Image Annotation
This paper proposes Fisher Vectors derived from Laplacian Mixture Models (LMM) and Hybrid Gaussian-Laplacian Mixture Models (HGLMM), which outperform traditional Gaussian Mixture Model (GMM)-based Fisher Vectors on image annotation and sentence-based image search. The HGLMM model adaptively selects Gaussian or Laplacian distributions per dimension during EM optimization, enabling improved modeling of heavy-tailed descriptor distributions and achieving state-of-the-art performance on Flickr8k for image captioning using an RNN with HGLMM-encoded word representations.
In the traditional object recognition pipeline, descriptors are densely sampled over an image, pooled into a high dimensional non-linear representation and then passed to a classifier. In recent years, Fisher Vectors have proven empirically to be the leading representation for a large variety of applications. The Fisher Vector is typically taken as the gradients of the log-likelihood of descriptors, with respect to the parameters of a Gaussian Mixture Model (GMM). Motivated by the assumption that different distributions should be applied for different datasets, we present two other Mixture Models and derive their Expectation-Maximization and Fisher Vector expressions. The first is a Laplacian Mixture Model (LMM), which is based on the Laplacian distribution. The second Mixture Model presented is a Hybrid Gaussian-Laplacian Mixture Model (HGLMM) which is based on a weighted geometric mean of the Gaussian and Laplacian distribution. An interesting property of the Expectation-Maximization algorithm for the latter is that in the maximization step, each dimension in each component is chosen to be either a Gaussian or a Laplacian. Finally, by using the new Fisher Vectors derived from HGLMMs, we achieve state-of-the-art results for both the image annotation and the image search by a sentence tasks.
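The quantities named in the abstract can be written out compactly. The block below is a notational sketch rather than the paper's exact formulation: the symbols (\mu, \sigma, m, s, b, \lambda) are chosen here for illustration, the densities are given per dimension under the diagonal assumption, and the hybrid is written only up to normalization.

```latex
% Per-dimension Gaussian and Laplacian densities for one mixture component
\mathcal{N}(x;\mu,\sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}
    \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right),
\qquad
\mathcal{L}(x;m,s) = \frac{1}{2s}\exp\!\left(-\frac{|x-m|}{s}\right)

% Hybrid: weighted geometric mean of the two, with weight b
h(x) \;\propto\; \mathcal{L}(x;m,s)^{\,b}\;\mathcal{N}(x;\mu,\sigma)^{\,1-b},
\qquad b \in [0,1]

% Fisher score of a descriptor set X = \{x_1,\dots,x_N\} under a model with
% parameters \lambda, assuming the descriptors are independent
G^{X}_{\lambda} = \nabla_{\lambda}\log p(X\mid\lambda)
                = \sum_{i=1}^{N}\nabla_{\lambda}\log p(x_i\mid\lambda)
```

When b is 0 or 1, the hybrid reduces to a plain Gaussian or a plain Laplacian, which is the case the EM maximization step is reported to select.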
Motivation & Objective
- To improve image annotation and text-to-image retrieval by replacing the standard Gaussian Mixture Model (GMM) in Fisher Vector representations with alternative distributions better suited to heavy-tailed descriptor statistics.
- To develop a Laplacian Mixture Model (LMM) and a Hybrid Gaussian-Laplacian Mixture Model (HGLMM) that better capture the distribution of SIFT descriptors than GMMs.
- To derive valid Expectation-Maximization (EM) and Fisher Vector formulations for LMM and HGLMM, so that both models can be fitted to descriptors and used to encode them.
- To evaluate the new Fisher Vector variants on image annotation and sentence-based image search, demonstrating state-of-the-art performance.
- To enable end-to-end image caption synthesis by projecting HGLMM Fisher Vectors into a shared CCA space with images, allowing joint modeling via RNNs.
Proposed method
- Proposes a multivariate Laplacian distribution under a diagonal covariance assumption (dimensions treated as independent), forming the basis for the Laplacian Mixture Model (LMM).
- Derives the EM algorithm for LMM, including E-step and M-step equations, with closed-form updates for component parameters.
- Introduces the Hybrid Gaussian-Laplacian distribution as a weighted geometric mean of Gaussian and Laplacian densities, enabling flexible modeling per dimension.
- Derives the EM algorithm for HGLMM, showing that the M-step leads to a binary decision: each dimension in each component is either Gaussian or Laplacian, not a mixture.
- Applies power normalization and L2 normalization to the HGLMM Fisher Vector, following the standard normalization scheme of Perronnin et al. for improved performance.
- Projects image features (via VGG or Overfeat) and word representations (via word2vec) into a shared CCA space, using HGLMM Fisher Vectors for word-level encoding in a joint image-sentence embedding space.
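The per-dimension Gaussian-or-Laplacian choice and the normalization step listed above can be illustrated with a small standard-library sketch. It is not the paper's implementation: the function names are hypothetical, the parameter estimates are the textbook weighted maximum-likelihood ones (weighted mean and variance for the Gaussian, weighted median and mean absolute deviation for the Laplacian), and the selection rule simply compares weighted log-likelihoods, which is an assumption about how the binary decision could be made. The exponent `alpha=0.5` is the common default for power normalization.

```python
import math


def gaussian_logpdf(x, mu, sigma):
    """Log-density of a 1-D Gaussian."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)


def laplacian_logpdf(x, m, s):
    """Log-density of a 1-D Laplacian with location m and scale s."""
    return -math.log(2 * s) - abs(x - m) / s


def weighted_median(values, weights):
    """Smallest value at which the cumulative weight reaches half the total."""
    half = sum(weights) / 2.0
    acc = 0.0
    pairs = sorted(zip(values, weights))
    for v, w in pairs:
        acc += w
        if acc >= half:
            return v
    return pairs[-1][0]


def choose_family(values, weights, eps=1e-8):
    """Pick a family for ONE dimension of ONE component (illustrative rule).

    `weights` play the role of the E-step responsibilities of the component.
    Returns (family_name, location, scale).
    """
    total = sum(weights)
    # Weighted maximum-likelihood estimates for the Gaussian.
    mu = sum(w * v for v, w in zip(values, weights)) / total
    var = sum(w * (v - mu) ** 2 for v, w in zip(values, weights)) / total
    sigma = max(math.sqrt(var), eps)
    # Weighted maximum-likelihood estimates for the Laplacian.
    m = weighted_median(values, weights)
    s = max(sum(w * abs(v - m) for v, w in zip(values, weights)) / total, eps)
    # Compare weighted log-likelihoods and keep the better-fitting family.
    ll_gauss = sum(w * gaussian_logpdf(v, mu, sigma) for v, w in zip(values, weights))
    ll_lap = sum(w * laplacian_logpdf(v, m, s) for v, w in zip(values, weights))
    if ll_lap > ll_gauss:
        return ("laplacian", m, s)
    return ("gaussian", mu, sigma)


def power_l2_normalize(vec, alpha=0.5):
    """Signed power normalization followed by L2 normalization."""
    powered = [math.copysign(abs(z) ** alpha, z) for z in vec]
    norm = math.sqrt(sum(p * p for p in powered))
    if norm == 0.0:
        return powered
    return [p / norm for p in powered]


# Toy data: mostly zeros with two outliers (heavy-tailed) vs. evenly spread values.
heavy_tailed = [0, 0, 0, 0, 0, 0, 0, 0, 10, -10]
evenly_spread = [-2, -1, 0, 1, 2]
print(choose_family(heavy_tailed, [1] * len(heavy_tailed))[0])
print(choose_family(evenly_spread, [1] * len(evenly_spread))[0])
```

With uniform weights, the heavy-tailed toy sample is assigned to the Laplacian and the evenly spread one to the Gaussian, which matches the intuition that the Laplacian suits dimensions with occasional large deviations.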
Experimental results
Research questions
- RQ1: Can Fisher Vectors derived from Laplacian Mixture Models (LMM) improve image annotation and text-to-image retrieval compared to standard GMM-based Fisher Vectors?
- RQ2: Does a Hybrid Gaussian-Laplacian Mixture Model (HGLMM) that adaptively selects between Gaussian and Laplacian distributions per dimension yield better performance than GMM or LMM alone?
- RQ3: Can HGLMM-based Fisher Vectors effectively represent words in a shared embedding space with images, enabling accurate image caption generation via RNNs?
- RQ4: Is the normalization scheme used in standard Fisher Vectors (power and L2 normalization) equally effective for HGLMM-derived Fisher Vectors?
- RQ5: Does the use of HGLMM Fisher Vectors in a CCA-based joint embedding framework lead to state-of-the-art results in image captioning and sentence-based image search?
Key findings
- The HGLMM-based Fisher Vector achieves state-of-the-art performance on the Flickr8k dataset for image annotation and sentence-based image search, outperforming both GMM- and LMM-based Fisher Vectors.
- The EM algorithm for HGLMM results in a binary decision per dimension per component, selecting either Gaussian or Laplacian distribution, which improves modeling of heavy-tailed SIFT descriptor distributions.
- Using HGLMM Fisher Vectors in a CCA-based joint embedding space supports image caption generation: captions are produced by an RNN with 512 LSTM units using greedy deterministic decoding.
- On Flickr8k, the proposed method improves on prior state-of-the-art results in both image-to-sentence and sentence-to-image matching.
- The model was trained for 300 epochs using SGD with a learning rate of 0.00001 and momentum of 0.5, with early stopping based on validation set performance.
- The RNN-based caption synthesis model uses the HGLMM Fisher Vector of word2vec embeddings as input at each decoding step, enabling consistent representation across images and sentences in the shared CCA space.
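Greedy deterministic decoding, as mentioned in the findings above, means taking the highest-scoring word at every step and feeding it back in as the next input. The sketch below shows only that control flow. The `step_fn` interface, the token strings, and the toy score table are all hypothetical stand-ins for a trained LSTM and are not taken from the paper.

```python
def greedy_decode(step_fn, start_token, end_token, init_state=None, max_len=20):
    """Greedy decoding: always pick the highest-scoring next token.

    `step_fn(token, state)` is a hypothetical interface that returns a pair
    (scores, new_state), where `scores` maps candidate tokens to scores.
    """
    tokens = []
    token, state = start_token, init_state
    for _ in range(max_len):
        scores, state = step_fn(token, state)
        token = max(scores, key=scores.get)  # deterministic argmax, no sampling
        if token == end_token:
            break
        tokens.append(token)
    return tokens


# Toy stand-in for a trained model: next-token scores depend only on the
# previous token, and the recurrent state is passed through unchanged.
TOY_SCORES = {
    "<s>": {"a": 0.6, "dog": 0.3, "</s>": 0.1},
    "a": {"dog": 0.7, "runs": 0.2, "</s>": 0.1},
    "dog": {"runs": 0.8, "</s>": 0.2},
    "runs": {"</s>": 0.9, "a": 0.1},
}


def toy_step(token, state):
    return TOY_SCORES[token], state


print(" ".join(greedy_decode(toy_step, "<s>", "</s>")))
```

Because the choice at each step is an argmax, the same model and input always yield the same caption, at the cost of possibly missing a higher-scoring sequence that a beam search would find.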
This review was created by AI and reviewed by human editors.