QUICK REVIEW

[Paper Review] Scene Parsing with Multiscale Feature Learning, Purity Trees, and Optimal Covers

Clément Farabet, Camille Couprie|arXiv (Cornell University)|Feb 9, 2012

Advanced Image and Video Retrieval Techniques | References: 26 | Cited by: 135

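One-Sentence Summary

This paper presents a fast scene parsing system whose feature extractor is trained end-to-end from raw pixels. It combines multiscale convolutional features, a segmentation tree built from pixel dissimilarities, and an optimal cover algorithm that selects pure image regions for labeling. The method reported record accuracy at the time on the Stanford Background (79.5% per-pixel accuracy), SIFT Flow (78.5%), and Barcelona (67.8%) datasets, while labeling a 320×240 image in under 1 second on a standard CPU.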

ABSTRACT

Scene parsing, or semantic segmentation, consists in labeling each pixel in an image with the category of the object it belongs to. It is a challenging task that involves the simultaneous detection, segmentation and recognition of all the objects in the image. The scene parsing method proposed here starts by computing a tree of segments from a graph of pixel dissimilarities. Simultaneously, a set of dense feature vectors is computed which encodes regions of multiple sizes centered on each pixel. The feature extractor is a multiscale convolutional network trained from raw pixels. The feature vectors associated with the segments covered by each node in the tree are aggregated and fed to a classifier which produces an estimate of the distribution of object categories contained in the segment. A subset of tree nodes that cover the image are then selected so as to maximize the average "purity" of the class distributions, hence maximizing the overall likelihood that each segment will contain a single object. The convolutional network feature extractor is trained end-to-end from raw pixels, alleviating the need for engineered features. After training, the system is parameter free. The system yields record accuracies on the Stanford Background Dataset (8 classes), the Sift Flow Dataset (33 classes) and the Barcelona Dataset (170 classes) while being an order of magnitude faster than competing approaches, producing a 320 × 240 image labeling in less than 1 second.

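Motivation and Objectives

Address the challenge of simultaneous detection, segmentation, and recognition in semantic segmentation by exploiting multiscale contextual features.
Eliminate the dependence on hand-engineered features by training a convolutional network end-to-end from raw pixels.
Improve segmentation consistency by using an optimal cover algorithm to select the subset of tree nodes that maximizes average segment purity (that is, minimizes the average entropy of the predicted class distributions).
Achieve both high accuracy and high speed by combining hierarchical segmentation with efficient feature aggregation and classification.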

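Proposed Method

Build a multiscale, contrast-normalized Laplacian pyramid from the raw input image to capture spatial context at several scales.
Apply a two-stage convolutional network at each pyramid scale to produce dense multiscale feature maps, then upsample the maps to a common resolution and concatenate them at each pixel.
Build the segmentation tree from the minimum spanning tree of a pixel dissimilarity graph, whose edges carry color-based dissimilarities between neighboring pixels.
Within each tree node, aggregate the feature vectors over a 5×5 spatial grid using component-wise max pooling, which yields a scale-invariant segment representation.
Train a classifier on the aggregated feature grids to estimate the class distribution of each segment, from which an entropy-based impurity is computed.
Use a greedy algorithm to select the optimal cover of tree nodes, the one that minimizes average segment impurity (entropy), to produce a globally consistent and pure segmentation.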
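
The last two steps (entropy-based impurity and cover selection) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Node` class and the `entropy`, `leaves`, and `min_impurity_cover` functions are hypothetical names, the class distributions are invented toy values standing in for classifier outputs, and the selection rule shown (each leaf adopts the lowest-entropy node on its path to the root) is one simple way to realize a purity-driven cover.

```python
import math


class Node:
    """A segment in the tree (hypothetical structure for illustration)."""

    def __init__(self, name, dist, children=()):
        self.name = name
        self.dist = dist              # assumed classifier output: class distribution
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self


def entropy(dist):
    """Shannon entropy of a class distribution; lower means purer."""
    return -sum(p * math.log(p) for p in dist if p > 0)


def leaves(node):
    """Collect the leaf segments below a node."""
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]


def min_impurity_cover(root):
    """Map each leaf to the lowest-entropy node among itself and its ancestors."""
    choice = {}
    for leaf in leaves(root):
        best = leaf
        node = leaf.parent
        while node is not None:
            if entropy(node.dist) < entropy(best.dist):
                best = node
            node = node.parent
        choice[leaf.name] = best.name
    return choice


# Toy tree: leaves a and b merge into ab; ab and c merge into root.
a = Node("a", [0.9, 0.1])
b = Node("b", [0.8, 0.2])
c = Node("c", [0.1, 0.9])
ab = Node("ab", [0.97, 0.03], [a, b])
root = Node("root", [0.5, 0.5], [ab, c])

cover = min_impurity_cover(root)
print(cover)
```

In this toy tree the merged segment `ab` is purer than either of its children, so both leaves adopt it, while `c` is purer than the mixed root and keeps its own label. Each leaf ends up with exactly one selected segment, which is what makes the resulting labeling consistent.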

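Experimental Results

Research Questions

RQ1: Can a multiscale convolutional network trained end-to-end from raw pixels produce effective scene parsing features without hand-engineered features?
RQ2: Can a segmentation tree derived from pixel dissimilarities encode meaningful image segments that support accurate semantic labeling?
RQ3: Does an optimal cover that minimizes segment impurity (entropy) give better segmentation consistency than conventional inference methods such as graph cuts?
RQ4: Can the combination of multiscale features, tree-based segmentation, and purity-based cover selection reach state-of-the-art accuracy while keeping inference under one second?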

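Key Findings

On the Stanford Background dataset, the system achieved 79.5% per-pixel accuracy and 74.3% average per-class accuracy, outperforming prior methods.
On the SIFT Flow dataset, the method reached 78.5% per-pixel accuracy; using class-balanced sampling markedly improved recognition of small objects.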
On the 170-class Barcelona dataset, the system achieved 67.8% per-pixel accuracy, showing robustness to a large label set.
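Inference on a 320×240 image takes less than 1 second on a standard CPU, an order of magnitude faster than competing methods.
After training, the system is parameter-free: inference requires no threshold tuning or hyperparameter adjustment.
Class-balanced sampling improves recognition of rare classes but lowers overall per-pixel accuracy, highlighting the trade-off between global and per-class performance.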
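
The trade-off between the two reported metrics is easy to see on a toy example. The sketch below is illustrative only: the function name `pixel_and_class_accuracy` is hypothetical and the labels are invented, not taken from any of the datasets. It computes per-pixel accuracy (fraction of all pixels labeled correctly) and average per-class accuracy (mean of each class's own accuracy).

```python
from collections import Counter


def pixel_and_class_accuracy(truth, pred):
    """Return (per-pixel accuracy, average per-class accuracy)."""
    total = Counter(truth)            # number of pixels per ground-truth class
    correct = Counter()               # correctly labeled pixels per class
    for t, p in zip(truth, pred):
        if t == p:
            correct[t] += 1
    per_pixel = sum(correct.values()) / len(truth)
    per_class = sum(correct[c] / total[c] for c in total) / len(total)
    return per_pixel, per_class


# Toy image: a frequent class ("sky") and a rare one ("boat").
truth = ["sky"] * 8 + ["boat"] * 2
pred = ["sky"] * 10                   # a predictor that ignores the rare class

per_pixel, per_class = pixel_and_class_accuracy(truth, pred)
print(per_pixel, per_class)
```

A predictor that always outputs the frequent class scores well per pixel but poorly per class, which is why a sampling scheme that favors rare classes can raise the second metric while lowering the first.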

This review was generated by AI and reviewed by a human editor.