[Paper Review] Visual Relationship Detection with Language Priors
This work proposes a scalable model that learns visual appearances for objects and predicates and uses language priors to predict and localize thousands of visual relationships, enabling zero-shot detection and improved image retrieval.
Visual relationships capture a wide variety of interactions between pairs of objects in images (e.g. "man riding bicycle" and "man pushing bicycle"). Consequently, the set of possible relationships is extremely large and it is difficult to obtain sufficient training examples for all possible relationships. Because of this limitation, previous work on visual relationship detection has concentrated on predicting only a handful of relationships. Though most relationships are infrequent, their objects (e.g. "man" and "bicycle") and predicates (e.g. "riding" and "pushing") independently occur more frequently. We propose a model that uses this insight to train visual models for objects and predicates individually and later combines them together to predict multiple relationships per image. We improve on prior work by leveraging language priors from semantic word embeddings to finetune the likelihood of a predicted relationship. Our model can scale to predict thousands of types of relationships from a few examples. Additionally, we localize the objects in the predicted relationships as bounding boxes in the image. We further demonstrate that understanding relationships can improve content based image retrieval.
Motivation & Objective
- Motivate detecting and localizing diverse visual relationships, not just a small set of common ones.
- Propose a two-module model that learns visual appearances for objects and predicates and fuses them to predict relationships.
- Introduce a language embedding module using word vectors to relate similar relationships.
- Demonstrate zero-shot visual relationship detection and show improvements in content-based image retrieval.
- Provide a new dataset with thousands of relationship types for benchmarking visual relationship prediction.
Proposed method
- Train separate object and predicate detectors using CNNs (VGG) on R-CNN object proposals.
- Model visual relationships as V(R) = P_i(O1) * (z_k^T CNN(O1,O2) + s_k) * P_j(O2).
- Project object pairs into a language embedding space with f(R) = w_k^T [word2vec(t_i), word2vec(t_j)] + b_k.
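- Encourage semantically similar relationships to receive similar scores by minimizing K(W), the variance of [f(R) - f(R')]^2 / d(R, R') over pairs of relationships, where d(R, R') is a distance between the two relationships in word2vec space.
- Impose a ranking loss L(W) so that relationships occurring more often in training score higher than rarer ones.
- Train with the joint objective C + λ1 L + λ2 K, where C is a ranking loss on the combined score V(R) f(R).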
- At test time, predict R* = argmax_R V(R,Θ|O1,O2) f(R,W) for each object pair.
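The scoring and regularization steps listed above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, array shapes, and the choice to pass the pairwise distance d(R, R') in as a precomputed matrix are all hypothetical, and the detector probabilities and CNN features are taken as given inputs.

```python
import numpy as np

def visual_score(p_obj1, p_obj2, cnn_feat, Z, s):
    """V(R) for all (i, k, j): P_i(O1) * (z_k^T CNN(O1,O2) + s_k) * P_j(O2).

    p_obj1, p_obj2: (N,) object-class probabilities for the two boxes.
    cnn_feat: (D,) feature vector for the object pair (assumed given).
    Z: (K, D) predicate weights, s: (K,) predicate biases.
    Returns an (N, K, N) array indexed as [i, k, j].
    """
    pred = Z @ cnn_feat + s  # (K,) one score per predicate
    return p_obj1[:, None, None] * pred[None, :, None] * p_obj2[None, None, :]

def language_score(word_vecs, W, b):
    """f(R) for all (i, k, j): w_k^T [word2vec(t_i), word2vec(t_j)] + b_k.

    word_vecs: (N, E) word vectors of the object classes.
    W: (K, 2E) projection weights, b: (K,) biases.
    """
    E = word_vecs.shape[1]
    left = word_vecs @ W[:, :E].T   # (N, K) contribution of the first object
    right = word_vecs @ W[:, E:].T  # (N, K) contribution of the second object
    return left[:, :, None] + right.T[None, :, :] + b[None, :, None]

def predict(p_obj1, p_obj2, cnn_feat, Z, s, word_vecs, W, b):
    """Test-time rule: argmax over (i, k, j) of V(R) * f(R) for one object pair."""
    score = visual_score(p_obj1, p_obj2, cnn_feat, Z, s) * language_score(word_vecs, W, b)
    i, k, j = np.unravel_index(np.argmax(score), score.shape)
    return (int(i), int(k), int(j)), float(score[i, k, j])

def frequency_rank_loss(f_vals, counts):
    """L(W): hinge loss asking more frequent relationships to score higher.

    f_vals: (M,) language scores, counts: (M,) training frequencies.
    """
    diff = f_vals[None, :] - f_vals[:, None] + 1.0     # [a, b] = f(R_b) - f(R_a) + 1
    more_frequent = counts[:, None] > counts[None, :]  # R_a is more frequent than R_b
    return float(np.sum(np.maximum(diff, 0.0) * more_frequent))

def similarity_variance(f_vals, dist):
    """K(W): variance of [f(R) - f(R')]^2 / d(R, R') over distinct pairs.

    dist: (M, M) precomputed positive distances between relationships (assumed input).
    """
    a, b = np.triu_indices(len(f_vals), k=1)
    return float(np.var((f_vals[a] - f_vals[b]) ** 2 / dist[a, b]))
```

Because the visual and language scores are kept as separate (N, K, N) arrays, a visual-only or language-only variant can be scored by dropping one factor in `predict`. This sketch enumerates every pair of relationships in the two loss terms, which is only practical for small sets.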
Experimental results
Research questions
- RQ1: Can visual relationships be detected by combining independently learned object/predicate appearances with language priors?
- RQ2: How do embedding-based language priors affect recognition, especially for infrequent or unseen relationships?
- RQ3: Does the proposed model scale to thousands of relationship types and support zero-shot learning?
- RQ4: Does leveraging relationships improve image retrieval performance?
- RQ5: How does the model perform on a new large-scale visual relationship dataset compared to prior methods?
Key findings
- The full model (V + L + K) substantially outperforms prior methods on the new dataset for phrase, relationship, and predicate detection.
- Zero-shot visual relationship detection improves when the language priors are used, including the similarity term K.
- Language priors enable scaling to thousands of relationships from limited examples and make zero-shot prediction possible at evaluation time.
- On the Visual Phrases dataset, the full model achieves higher mAP and strong recall, indicating benefits from embedding-based priors.
- Using predicted relationships improves content-based image retrieval, achieving higher Recall@1 and lower median rank than baselines.
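The retrieval metrics named in the last finding can be computed as in the sketch below. It assumes the standard definitions (Recall@k is the fraction of queries whose correct image is ranked in the top k, and median rank is the median position of the correct image); the review does not spell out the evaluation protocol, and the function name and input format are hypothetical.

```python
import numpy as np

def retrieval_metrics(ranks, ks=(1, 5, 10)):
    """Compute Recall@k and median rank from per-query ranks.

    ranks: 1-based rank of the correct image for each query.
    Returns (dict mapping k to Recall@k, median rank).
    """
    ranks = np.asarray(ranks)
    recall = {k: float(np.mean(ranks <= k)) for k in ks}
    return recall, float(np.median(ranks))
```

Higher Recall@k and lower median rank both indicate that the correct image tends to appear nearer the top of the ranked list.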
This review was created by AI and reviewed by human editors.