QUICK REVIEW

[Paper Review] Vision GNN: An Image is Worth Graph of Nodes

Kai Han, Yunhe Wang|arXiv (Cornell University)|Jun 1, 2022
Advanced Neural Network Applications | Cited by 194
One-Sentence Summary

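ViG represents an image as a graph of patches and processes it with a graph neural network built from Grapher and FFN modules, outperforming a range of backbones on ImageNet and COCO.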

ABSTRACT

Network architecture plays a key role in the deep learning-based computer vision system. The widely-used convolutional neural network and transformer treat the image as a grid or sequence structure, which is not flexible to capture irregular and complex objects. In this paper, we propose to represent the image as a graph structure and introduce a new Vision GNN (ViG) architecture to extract graph-level feature for visual tasks. We first split the image to a number of patches which are viewed as nodes, and construct a graph by connecting the nearest neighbors. Based on the graph representation of images, we build our ViG model to transform and exchange information among all the nodes. ViG consists of two basic modules: Grapher module with graph convolution for aggregating and updating graph information, and FFN module with two linear layers for node feature transformation. Both isotropic and pyramid architectures of ViG are built with different model sizes. Extensive experiments on image recognition and object detection tasks demonstrate the superiority of our ViG architecture. We hope this pioneering study of GNN on general visual tasks will provide useful inspiration and experience for future research. The PyTorch code is available at https://github.com/huawei-noah/Efficient-AI-Backbones and the MindSpore code is available at https://gitee.com/mindspore/models.

Research Motivation and Objectives

  • Motivate and explore representing visual data as graphs rather than grids or sequences.
  • Propose a graph-based backbone (ViG) with Grapher and FFN modules to process image patches as nodes.
  • Investigate isotropic and pyramid ViG architectures across vision tasks like classification and detection.
  • Demonstrate ViG's effectiveness on ImageNet classification and COCO object detection/segmentation.
  • Provide insights into graph construction and channel-wise feature diversity to tackle over-smoothing in GNNs.

Proposed Method

  • Convert an image into N patches, treat patches as nodes, and connect each node to its K nearest neighbors to form a graph G(X).
  • Use a Grapher module based on max-relative graph convolution to aggregate and update node features with a multi-head mechanism.
  • Apply an FFN module (two linear layers with GELU) for node-wise feature transformation to maintain diversity.
  • Stack Grapher and FFN blocks to build ViG, with isotropic and pyramid network variants.
  • Incorporate positional encodings (absolute for both isotropic and pyramid; relative for pyramid) to inject spatial information.
  • Train using standard vision data augmentation and optimization strategies, with dilated aggregation in Grapher and skip connections to preserve diversity.
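
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`knn_graph`, `max_relative_conv`, `vig_block`), the parameter dictionary, and the sizes in the demo are hypothetical; self-loops are excluded from the neighbor search; and the multi-head update, normalization, positional encodings, and dilated aggregation are all omitted for brevity.

```python
import numpy as np

def knn_graph(x, k):
    """Return an (N, k) array of neighbor indices for node features x of shape (N, D)."""
    # Pairwise squared Euclidean distances between all node features.
    sq = np.sum(x ** 2, axis=1)
    dist = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)
    np.fill_diagonal(dist, np.inf)  # assumption of this sketch: no self-loops
    return np.argsort(dist, axis=1)[:, :k]

def max_relative_conv(x, neighbors, w):
    """Max-relative graph convolution: concat(x_i, max_j(x_j - x_i)) @ w."""
    rel = x[neighbors] - x[:, None, :]   # (N, k, D) neighbor-minus-center differences
    agg = rel.max(axis=1)                # (N, D) channel-wise max over the neighbors
    return np.concatenate([x, agg], axis=1) @ w  # (N, 2D) @ (2D, D_out)

def gelu(z):
    # tanh approximation of GELU
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z ** 3)))

def vig_block(x, k, params):
    """One Grapher + FFN block with skip connections (normalization omitted)."""
    # Grapher: FC -> graph convolution -> activation -> FC, plus skip connection.
    h = x @ params["fc_in"]
    h = gelu(max_relative_conv(h, knn_graph(h, k), params["w_graph"]))
    x = x + h @ params["fc_out"]
    # FFN: two linear layers with GELU in between, plus skip connection.
    return x + gelu(x @ params["ffn1"]) @ params["ffn2"]

# Demo with illustrative sizes: 196 patch nodes, 32 channels, K = 9 neighbors.
rng = np.random.default_rng(0)
n, d = 196, 32
x = rng.standard_normal((n, d))
params = {
    "fc_in": 0.1 * rng.standard_normal((d, d)),
    "w_graph": 0.1 * rng.standard_normal((2 * d, d)),
    "fc_out": 0.1 * rng.standard_normal((d, d)),
    "ffn1": 0.1 * rng.standard_normal((d, 4 * d)),
    "ffn2": 0.1 * rng.standard_normal((4 * d, d)),
}
out = vig_block(x, 9, params)
print(out.shape)  # (196, 32)
```

Because the graph is rebuilt from the current features inside each block, stacking several `vig_block` calls lets the neighborhoods change with depth, which is one plausible reading of how the model exchanges information among nodes.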

Experimental Results

Research Questions

  • RQ1: Can a graph-based representation of image patches outperform grid/sequence-based backbones on standard vision benchmarks?
  • RQ2: Do Grapher and FFN modules prevent over-smoothing and preserve feature diversity as networks deepen?
  • RQ3: How do isotropic versus pyramid ViG architectures compare on classification and detection tasks?
  • RQ4: What is the impact of graph-construction choices (K, number of heads) on ViG performance?
  • RQ5: How do ViG backbones fare against CNNs, MLPs, and transformers on ImageNet and COCO?

Key Findings

  • Pyramid ViG-S achieves 82.1% top-1 accuracy on ImageNet with ~4.5B FLOPs, outperforming CNNs, MLPs, and transformers with similar FLOPs.
  • Isotropic ViG variants (Ti, S, B) are competitive, and accuracy rises with model size (top-1: 73.9% for ViG-Ti, 80.4% for ViG-S, 82.3% for ViG-B).
  • ViG backbones outperform representative backbones on COCO object detection and instance segmentation when used in RetinaNet and Mask R-CNN frameworks.
  • Among graph convolutions, Max-Relative GraphConv provides a favorable trade-off between FLOPs and accuracy (Table 6).
  • Adding FC layers to the Grapher module and FFN modules to ViG blocks improves accuracy by mitigating over-smoothing and increasing feature diversity.
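
To make the over-smoothing finding concrete, the sketch below shows one simple way to quantify feature diversity: the distance between the node features and the closest matrix in which every node has the same feature vector. It assumes a Frobenius norm, for which that closest vector is the per-channel mean over nodes. The function name `feature_diversity` is hypothetical, and this is not necessarily the exact metric behind the paper's results.

```python
import numpy as np

def feature_diversity(x):
    """Frobenius distance from node features x (N, D) to the nearest matrix whose
    rows are all identical; under this norm the optimal row is the mean over nodes."""
    center = x.mean(axis=0, keepdims=True)
    return float(np.linalg.norm(x - center))

# Identical node features carry no diversity at all.
collapsed = np.array([[1.0, 2.0], [1.0, 2.0]])
print(feature_diversity(collapsed))  # 0.0

# Pulling every node halfway toward the mean (a crude stand-in for
# neighbor averaging) halves the diversity.
rng = np.random.default_rng(0)
feats = rng.standard_normal((16, 8))
smoothed = 0.5 * feats + 0.5 * feats.mean(axis=0, keepdims=True)
ratio = feature_diversity(smoothed) / feature_diversity(feats)
```

Tracking a value like this layer by layer is one way to check whether a deep graph network is collapsing its node features toward a single vector.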

This review was generated by AI and reviewed by human editors.