QUICK REVIEW

[Paper Review] When Slots Compete: Slot Merging in Object-Centric Learning

Christos Chatzisavvas, Pavlos Rigas|arXiv (Cornell University)|Mar 11, 2026

Domain Adaptation and Few-Shot Learning | Cited by 0

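One-Sentence Summary

The paper proposes a differentiable slot-merging operator that merges overlapping Slot Attention slots, and integrates it into DINOSAUR to reduce fragmentation and improve object-centric representations and segmentation.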

ABSTRACT

Slot-based object-centric learning represents an image as a set of latent slots with a decoder that combines them into an image or features. The decoder specifies how slots are combined into an output, but the slot set is typically fixed: the number of slots is chosen upfront and slots are only refined. This can lead to multiple slots competing for overlapping regions of the same entity rather than focusing on distinct regions. We introduce slot merging: a drop-in, lightweight operation on the slot set that merges overlapping slots during training. We quantify overlap with a Soft-IoU score between slot-attention maps and combine selected pairs via a barycentric update that preserves gradient flow. Merging follows a fixed policy, with the decision threshold inferred from overlap statistics, requiring no additional learnable modules. Integrated into the established feature-reconstruction pipeline of DINOSAUR, the proposed method improves object factorization and mask quality, surpassing other adaptive methods in object discovery and segmentation benchmarks.

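Motivation and Objectives

Advance unsupervised object-centric learning, which decomposes a scene into discrete objects.
Address the slot fragmentation caused by a fixed slot count by allowing overlapping slots to merge.
Provide a lightweight, differentiable mechanism that refines slot representations during training.
Integrate the merging mechanism into the DINOSAUR framework and evaluate it on standard benchmarks.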

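Proposed Method

Quantify the spatial overlap between slot-attention maps with a probabilistic Soft-IoU score.
Introduce a differentiable slot-merging operator that performs mass-weighted barycentric interpolation of slot representations.
Apply a fixed merging policy: repeatedly merge the pair with the highest overlap until the data-driven threshold is reached.
Update the attention maps during merging by aggregating the merged slots' attention, which preserves mass and gradient flow.
Activate merging once slot representations have stabilized, governed by a data-driven threshold derived from overlap statistics.
Evaluate within the DINOSAUR framework on the VOC, COCO, MOVi-C, and MOVi-E datasets.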
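
The overlap score, barycentric merge, and greedy policy listed above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the product-form Soft-IoU is one common probabilistic variant and may differ from the paper's exact definition, the threshold is passed in by hand rather than inferred from overlap statistics, and the names `soft_iou`, `merge_pair`, and `greedy_merge` are hypothetical.

```python
import numpy as np

def soft_iou(attn, eps=1e-8):
    """Pairwise Soft-IoU for K soft attention maps, attn shape (K, N).

    Assumed product form: intersection = sum(a_i * a_j),
    union = sum(a_i) + sum(a_j) - intersection.
    """
    inter = attn @ attn.T                      # (K, K) soft intersections
    mass = attn.sum(axis=-1)                   # (K,) attention mass per slot
    union = mass[:, None] + mass[None, :] - inter
    return inter / (union + eps)

def merge_pair(slots, attn, i, j, eps=1e-8):
    """Merge slot j into slot i with a mass-weighted barycentric update."""
    mass = attn.sum(axis=-1)
    w_i = mass[i] / (mass[i] + mass[j] + eps)  # barycentric weight of slot i
    slots, attn = slots.copy(), attn.copy()
    slots[i] = w_i * slots[i] + (1.0 - w_i) * slots[j]
    attn[i] = attn[i] + attn[j]                # aggregate attention, keeps total mass
    keep = [k for k in range(len(slots)) if k != j]
    return slots[keep], attn[keep]

def greedy_merge(slots, attn, threshold):
    """Repeatedly merge the most-overlapping pair while its Soft-IoU >= threshold."""
    while len(slots) > 1:
        iou = soft_iou(attn)
        np.fill_diagonal(iou, -np.inf)         # ignore self-overlap
        i, j = np.unravel_index(np.argmax(iou), iou.shape)
        if iou[i, j] < threshold:
            break
        i, j = min(i, j), max(i, j)
        slots, attn = merge_pair(slots, attn, i, j)
    return slots, attn

# Toy example: slots 0 and 1 compete for the same two positions.
slots = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
attn = np.array([[0.5, 0.5, 0.0, 0.0],
                 [0.5, 0.5, 0.0, 0.0],
                 [0.0, 0.0, 1.0, 1.0]])
merged_slots, merged_attn = greedy_merge(slots, attn, threshold=0.2)
```

Every step here is a sum, product, or division, so the same operations written in an autodiff framework would let gradients reach both merged slots, which is the property the summary attributes to the barycentric update.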

Figure 1 : We introduce a merge operator over the slot set that adaptively refines factorization, producing coherent object-level representations.

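Experimental Results

Research Questions

RQ1: Can overlapping (competing) slots be merged into a single coherent representation without hard pruning?
RQ2: Does merging slots during training yield better object factorization and segmentation than merging only at inference?
RQ3: How does the Soft-IoU-based merging policy affect downstream reconstruction and segmentation performance?
RQ4: How do the differentiability of the merge operation and the gradient flow through it affect slot optimization?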

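Key Findings

The proposed merging mechanism consistently improves object representation and segmentation quality on both real-world and synthetic benchmarks.
Merging during training outperforms merging only at inference, yielding higher mBO and mIoU.
Allowing gradients to backpropagate through the merge layer benefits performance.
Aggregating attention maps during merging further improves segmentation metrics.
Merge frequency adapts to scene complexity: more merges occur in denser scenes and fewer in sparse ones.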
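
For readers unfamiliar with mBO (mean best overlap), the sketch below shows how the metric is commonly computed: each ground-truth mask is matched to the predicted mask with the highest IoU, and the best scores are averaged. It is an illustrative assumption about the standard definition, not the paper's evaluation code; the helper names `mask_iou` and `mean_best_overlap` are hypothetical, and real benchmarks add details such as per-image averaging and ignored regions.

```python
import numpy as np

def mask_iou(a, b):
    """IoU between two boolean masks of equal shape."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union > 0 else 0.0

def mean_best_overlap(gt_masks, pred_masks):
    """Average, over ground-truth masks, of the best IoU with any prediction."""
    best = [max(mask_iou(g, p) for p in pred_masks) for g in gt_masks]
    return float(np.mean(best))

# Two ground-truth objects over four pixels.
gt = np.array([[1, 1, 0, 0], [0, 0, 1, 1]], dtype=bool)

# Fragmented prediction: the first object is split across two masks.
fragmented = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 1]], dtype=bool)

# Prediction after the two fragments are merged into one mask.
merged = np.array([[1, 1, 0, 0], [0, 0, 1, 1]], dtype=bool)

mbo_fragmented = mean_best_overlap(gt, fragmented)  # 0.75
mbo_merged = mean_best_overlap(gt, merged)          # 1.0
```

In this toy case the fragmented prediction scores lower because no single mask covers the first object, which illustrates why reducing slot fragmentation can raise mBO.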

Figure 2 : Illustration of the proposed pipeline.


This review was generated by AI and reviewed by a human editor.