QUICK REVIEW

[Paper Review] Semantic Object Parsing with Local-Global Long Short-Term Memory

Xiaodan Liang, Xiaohui Shen|arXiv (Cornell University)|Nov 14, 2015

Multimodal Machine Learning Applications|31 references|30 citations

TL;DR

This paper proposes Local-Global Long Short-Term Memory (LG-LSTM), a novel deep architecture that jointly models local spatial dependencies from neighboring pixels and global contextual information from the entire image to enhance feature learning in semantic object parsing. By stacking LG-LSTM layers on intermediate convolutional features, the method achieves state-of-the-art performance on three public datasets through end-to-end learning, significantly improving pixel-wise segmentation accuracy over baseline CNNs and prior post-processing methods.

ABSTRACT

Semantic object parsing is a fundamental task for understanding objects in detail in the computer vision community, where incorporating multi-level contextual information is critical for achieving such fine-grained pixel-level recognition. Prior methods often leverage the contextual information through post-processing predicted confidence maps. In this work, we propose a novel deep Local-Global Long Short-Term Memory (LG-LSTM) architecture to seamlessly incorporate short-distance and long-distance spatial dependencies into the feature learning over all pixel positions. In each LG-LSTM layer, local guidance from neighboring positions and global guidance from the whole image are imposed on each position to better exploit complex local and global contextual information. Individual LSTMs for distinct spatial dimensions are also utilized to intrinsically capture various spatial layouts of semantic parts in the images, yielding distinct hidden and memory cells of each position for each dimension. In our parsing approach, several LG-LSTM layers are stacked and appended to the intermediate convolutional layers to directly enhance visual features, allowing network parameters to be learned in an end-to-end way. The long chains of sequential computation by stacked LG-LSTM layers also enable each pixel to sense a much larger region for inference benefiting from the memorization of previous dependencies in all positions along all dimensions. Comprehensive evaluations on three public datasets demonstrate the significant superiority of our LG-LSTM over other state-of-the-art methods.

Motivation & Objective

To address the limitation of CNNs in capturing long-range and global contextual dependencies for fine-grained pixel-level object parsing.
To overcome the inefficiency and suboptimal performance of post-processing techniques like CRF or mean field approximation in modeling contextual relationships.
To develop a deep learning architecture that seamlessly integrates local and global context during feature learning, enabling end-to-end training.
To improve the discriminative capability of visual features by leveraging memory cells that retain long-term dependencies across spatial and depth dimensions.

Proposed method

The LG-LSTM architecture uses individual LSTMs for spatial dimensions (horizontal, vertical, and diagonal) and a depth LSTM to propagate information across network layers.
Local guidance is provided by hidden states from eight neighboring spatial positions, enabling rich local context modeling.
Global guidance is implemented by dividing the previous layer’s hidden map into nine grid cells and applying max-pooling within each cell to extract discriminative global features.
Global and local hidden states are combined as input to each position’s LSTM, allowing each pixel to attend to both local neighborhoods and the full image context.
Multiple LG-LSTM layers are stacked and appended to intermediate convolutional layers in a fully convolutional network, enabling hierarchical feature enhancement.
The memory cells store long-term contextual dependencies across all positions, allowing each pixel to sense a larger receptive field through sequential computation.
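The local and global guidance described above can be illustrated with a small sketch. It is not the authors' implementation: it works on a single-channel hidden map stored as nested lists, the function names are hypothetical, zero padding at the borders is an assumption (the review does not say how borders are handled), and the LSTM gates themselves are left out. It only shows how the eight neighbouring hidden values and the nine max-pooled grid values might be gathered into one input per position.

```python
from typing import List

Grid = List[List[float]]

# Offsets (row, col) of the eight spatial neighbours, clockwise from top-left.
NEIGHBOUR_OFFSETS = [
    (-1, -1), (-1, 0), (-1, 1), (0, 1),
    (1, 1), (1, 0), (1, -1), (0, -1),
]


def local_guidance(hidden: Grid, row: int, col: int, pad: float = 0.0) -> List[float]:
    """Collect the hidden values of the eight neighbours of (row, col).

    Neighbours that fall outside the map are replaced by `pad`
    (an assumption made for this sketch).
    """
    height, width = len(hidden), len(hidden[0])
    values = []
    for dr, dc in NEIGHBOUR_OFFSETS:
        r, c = row + dr, col + dc
        inside = 0 <= r < height and 0 <= c < width
        values.append(hidden[r][c] if inside else pad)
    return values


def global_guidance(hidden: Grid, grid: int = 3) -> List[float]:
    """Max-pool the map over a grid x grid partition (nine cells by default).

    The map must be at least `grid` rows by `grid` columns so no cell is empty.
    """
    height, width = len(hidden), len(hidden[0])
    if height < grid or width < grid:
        raise ValueError("hidden map is smaller than the pooling grid")
    pooled = []
    for gr in range(grid):
        # Integer bounds chosen so the cells tile the map without gaps.
        r0, r1 = gr * height // grid, (gr + 1) * height // grid
        for gc in range(grid):
            c0, c1 = gc * width // grid, (gc + 1) * width // grid
            pooled.append(max(hidden[r][c] for r in range(r0, r1) for c in range(c0, c1)))
    return pooled


def guidance_input(hidden: Grid, row: int, col: int) -> List[float]:
    """Concatenate local (8 values) and global (9 values) guidance for one position."""
    return local_guidance(hidden, row, col) + global_guidance(hidden)


if __name__ == "__main__":
    # Toy 6x6 hidden map whose value at (r, c) is r * 6 + c.
    toy = [[float(r * 6 + c) for c in range(6)] for r in range(6)]
    print(guidance_input(toy, 2, 2))
```

In a real network each position would hold a feature vector rather than a scalar, and the gathered values would feed the per-dimension LSTM units; the sketch keeps one channel so the indexing stays easy to follow.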

Experimental results

Research questions

RQ1: Can a unified deep learning architecture effectively model both local and global spatial dependencies in semantic object parsing without relying on post-processing?
RQ2: How does the integration of local spatial connections and global image-wide context improve pixel-wise classification accuracy compared to standard CNNs?
RQ3: To what extent do long-range dependencies captured via recurrent memory cells enhance feature representation in semantic segmentation tasks?
RQ4: Does the proposed LG-LSTM architecture outperform conventional post-processing methods like CRF or mean field approximation in terms of accuracy and efficiency?
RQ5: Can the end-to-end learning of LG-LSTM layers lead to better generalization and robustness on challenging parsing tasks with appearance and positional variations?

Key findings

The LG-LSTM model achieves a mean IoU of 69.4% on the PASCAL-Context dataset, significantly outperforming the baseline VGG16 and other state-of-the-art methods.
On the Horse-Cow dataset, LG-LSTM improves mean IoU by 4.19% over the 'LG-LSTM local_2' variant and 2.94% over 'LG-LSTM local_4', demonstrating the importance of eight spatial connections.
Removing global guidance from LG-LSTM lowers IoU by 1.27% on the horse class and 1.81% on the cow class, indicating that global context helps with disambiguation.
The model reduces segmentation errors on ambiguous regions such as 'skirt' vs 'dress' and 'legs' vs 'pants' by leveraging global image context.
Compared to five extra convolutional layers with equivalent parameter count, LG-LSTM improves mean IoU by 2.78% on the horse class and 4.86% on the cow class, showing superior modeling of long-range patterns.
Qualitative results show that LG-LSTM produces more consistent, semantically meaningful, and boundary-preserving predictions than VGG16 and Co-CNN, especially on small or visually similar parts like tails and legs.
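The findings above are stated in terms of intersection over union (IoU). As a reminder of how that metric is usually computed, here is a minimal sketch on flattened label maps. The function names and the toy labels are illustrative only, and the paper's exact evaluation protocol (for example, which classes are averaged) is not given in this review.

```python
from typing import List, Optional, Sequence


def per_class_iou(pred: Sequence[int], target: Sequence[int],
                  num_classes: int) -> List[Optional[float]]:
    """Return IoU for each class id in range(num_classes).

    `pred` and `target` are flattened label maps with ids in range(num_classes).
    A class that appears in neither map gets None instead of a score.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same number of pixels")
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if p == t:
            # Pixel counted once in both intersection and union of its class.
            intersection[p] += 1
            union[p] += 1
        else:
            # A mismatch enlarges the union of both classes involved.
            union[p] += 1
            union[t] += 1
    return [intersection[k] / union[k] if union[k] else None
            for k in range(num_classes)]


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int) -> float:
    """Average IoU over the classes that occur in at least one of the maps."""
    scores = [s for s in per_class_iou(pred, target, num_classes) if s is not None]
    if not scores:
        raise ValueError("no class is present in either map")
    return sum(scores) / len(scores)


if __name__ == "__main__":
    predicted = [0, 0, 1, 1, 2, 2]
    reference = [0, 1, 1, 1, 2, 0]
    print(per_class_iou(predicted, reference, 3))
    print(mean_iou(predicted, reference, 3))
```

On this toy example the three classes score 1/3, 2/3, and 1/2, so the mean IoU is 0.5. Whether the gaps quoted in the findings are absolute differences between such scores or relative changes is not stated in this review.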

This review was created by AI and reviewed by human editors.