QUICK REVIEW

[Paper Review] Representation Learning by Learning to Count

Mehdi Noroozi, Hamed Pirsiavash, Paolo Favaro | arXiv (Cornell University) | Aug 22, 2017

Domain Adaptation and Few-Shot Learning | References: 40 | Cited by: 30

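One-Sentence Summary

This paper proposes a self-supervised representation learning method that trains a deep network to count visual primitives (such as objects or parts) by exploiting how counts behave under scaling and tiling: the count is invariant to scale and additive across tiles. Trained with a contrastive loss on the transformed images, the model learns semantically meaningful features and, without any manual annotation, performs on par with or better than the state of the art on transfer learning benchmarks.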

ABSTRACT

We introduce a novel method for representation learning that uses an artificial supervision signal based on counting visual primitives. This supervision signal is obtained from an equivariance relation, which does not require any manual annotation. We relate transformations of images to transformations of the representations. More specifically, we look for the representation that satisfies such relation rather than the transformations that match a given representation. In this paper, we use two image transformations in the context of counting: scaling and tiling. The first transformation exploits the fact that the number of visual primitives should be invariant to scale. The second transformation allows us to equate the total number of visual primitives in each tile to that in the whole image. These two transformations are combined in one constraint and used to train a neural network with a contrastive loss. The proposed task produces representations that perform on par or exceed the state of the art in transfer learning benchmarks.

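Research Motivation and Objectives

Develop a self-supervised representation learning method that avoids manual annotation through a novel pretext task based on counting visual primitives.
Formalize the supervision signal through the equivariance between image transformations (scaling and tiling) and transformations of the features.
Show that counting-based self-supervised learning yields features that are discriminative for downstream tasks such as classification and detection.
Verify that the learned features capture high-level semantic content rather than low-level textures or edges.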

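Proposed Method

The method uses two image transformations: scaling (to enforce that the count of visual primitives is invariant to scale) and tiling (to enforce that the counts of the tiles add up to the count of the whole image).
It builds a contrastive loss that encourages the network's output for the downscaled image to agree with the sum of its outputs over the tiles, since both cover the same total number of visual primitives.
The supervision signal comes from an equivariance principle: if the total number of visual primitives is preserved under a transformation, the representation must reflect that arithmetic consistency.
The network is trained end to end with this contrastive loss; the positive pairs are the image pairs that satisfy the counting constraint.
The count vector output by the network is used as the representation for downstream transfer learning.
The approach extends to other transformation relations, as long as they can be expressed as a functional relation in feature space.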
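
The counting constraint and contrastive loss described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation. The 2x2 tiling grid, the downsampling factor of 2, the hinge margin `margin`, the use of a second image `y` as the negative example, and the stand-in `make_count_fn` (a fixed random projection followed by ReLU, in place of a trained deep network) are all assumptions made for illustration.

```python
import numpy as np

def downsample(img, factor=2):
    """Average-pool an (H, W, C) image by an integer factor."""
    h, w, c = img.shape
    h2, w2 = h // factor, w // factor
    img = img[:h2 * factor, :w2 * factor]
    return img.reshape(h2, factor, w2, factor, c).mean(axis=(1, 3))

def tiles(img, grid=2):
    """Split an (H, W, C) image into grid x grid non-overlapping tiles."""
    h, w, _ = img.shape
    th, tw = h // grid, w // grid
    return [img[i * th:(i + 1) * th, j * tw:(j + 1) * tw]
            for i in range(grid) for j in range(grid)]

def make_count_fn(input_dim, count_dim=8, seed=0):
    """Hypothetical stand-in for the network: random projection + ReLU."""
    rng = np.random.default_rng(seed)
    weights = rng.normal(size=(input_dim, count_dim)) / np.sqrt(input_dim)

    def count_fn(img):
        # ReLU keeps every entry of the count vector non-negative.
        return np.maximum(img.reshape(-1) @ weights, 0.0)

    return count_fn

def counting_loss(count_fn, x, y, margin=10.0):
    """Counting constraint on x plus a hinge term that uses y as a negative."""
    tile_sum = sum(count_fn(t) for t in tiles(x))
    # Positive term: count of the downscaled image should equal the tile sum.
    positive = np.sum((count_fn(downsample(x)) - tile_sum) ** 2)
    # Negative term: the count of a different image should differ from it.
    negative = np.sum((count_fn(downsample(y)) - tile_sum) ** 2)
    return positive + max(0.0, margin - negative)

rng = np.random.default_rng(1)
x = rng.random((8, 8, 3))  # toy "image"; its tiles and its downscaled copy are both 4x4x3
y = rng.random((8, 8, 3))  # a different toy image used as the negative
count_fn = make_count_fn(input_dim=4 * 4 * 3)
loss = counting_loss(count_fn, x, y)
```

Under these assumptions, a count function that always returns zeros satisfies the counting constraint exactly but pays the full margin in the hinge term, which is why a contrastive term is needed to rule out that trivial solution.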

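Experimental Results

Research Questions

RQ1: Can counting visual primitives serve as a meaningful pretext task for self-supervised representation learning?
RQ2: Does enforcing count consistency under scaling and tiling yield representations that capture high-level semantic content?
RQ3: Can a contrastive loss based on count consistency outperform existing self-supervised methods on standard transfer learning benchmarks?
RQ4: To what extent do the learned features reflect semantic concepts rather than low-level image statistics?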

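Key Findings

The proposed method reaches state-of-the-art performance on standard transfer learning benchmarks, matching or outperforming prior self-supervised methods.
The magnitude of the count vector grows with the size of the image region, which indicates sensitivity to the number of visual primitives rather than to low-level texture.
Images with a high count-feature magnitude typically contain multiple objects or a large object, while images with a low magnitude are typically textures with no salient primitives.
Nearest-neighbor retrieval in the count-feature space returns semantically similar images with similar scene outlines, confirming that the features are semantically relevant.
Visualizations of neuron activations show that individual neurons respond to semantically consistent clusters of images, for example dogs in ImageNet and groups of people playing baseball in COCO.
Performance remains good even when color is kept in the image crops, which indicates that color does not destroy the counting signal; removing color entirely, however, lowers performance.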
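
The magnitude and nearest-neighbor analyses above can be run on any set of count vectors along the following lines. This is a minimal sketch, assuming the count vectors are stored as rows of a NumPy array. The toy `features` matrix is made up for illustration and is not data from the paper, and the choice of the L2 norm and Euclidean distance is an assumption, since the summary does not say which measures were used.

```python
import numpy as np

def rank_by_magnitude(features):
    """Return row indices sorted from largest to smallest count-vector norm."""
    norms = np.linalg.norm(features, axis=1)
    return np.argsort(-norms)

def nearest_neighbors(features, query_index, k=3):
    """Return the k rows closest to the query row (Euclidean), excluding the query."""
    dists = np.linalg.norm(features - features[query_index], axis=1)
    dists[query_index] = np.inf  # never return the query itself
    return np.argsort(dists)[:k]

# Toy count vectors: rows 1 and 2 are large and similar, rows 0 and 3 are small.
features = np.array([
    [0.1, 0.0, 0.2],
    [3.0, 4.0, 0.0],
    [2.9, 4.1, 0.1],
    [0.0, 0.3, 0.1],
])
order = rank_by_magnitude(features)                          # indices 2, 1, 3, 0
neighbors = nearest_neighbors(features, query_index=1, k=1)  # index 2
```

In this toy data, the high-magnitude rows would correspond to images with many or large objects, and the low-magnitude rows to texture-like images.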

This review was generated by AI and checked by human editors.