QUICK REVIEW

[Paper Review] Why Can't I Open My Drawer? Mitigating Object-Driven Shortcuts in Zero-Shot Compositional Action Recognition

Geo Ahn, Inwoong Lee|arXiv (Cornell University)|2026. 01. 22.

Human Pose and Action Recognition | Citations: 0

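One-Line Summary

The paper identifies object-driven shortcuts as the key failure mode of zero-shot compositional action recognition (ZS-CAR) and presents RCORE, a framework that promotes temporally grounded verb learning through composition-aware augmentation and temporal order regularization, improving generalization to unseen verb-object compositions.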

ABSTRACT

We study Compositional Video Understanding (CVU), where models must recognize verbs and objects and compose them to generalize to unseen combinations. We find that existing Zero-Shot Compositional Action Recognition (ZS-CAR) models fail primarily due to an overlooked failure mode: object-driven verb shortcuts. Through systematic analysis, we show that this behavior arises from two intertwined factors: severe sparsity and skewness of compositional supervision, and the asymmetric learning difficulty between verbs and objects. As training progresses, the existing ZS-CAR model increasingly ignores visual evidence and overfits to co-occurrence statistics. Consequently, the existing model does not gain the benefit of compositional recognition in unseen verb-object compositions. To address this, we propose RCORE, a simple and effective framework that enforces temporally grounded verb learning. RCORE introduces (i) a composition-aware augmentation that diversifies verb-object combinations without corrupting motion cues, and (ii) a temporal order regularization loss that penalizes shortcut behaviors by explicitly modeling temporal structure. Across two benchmarks, Sth-com and our newly constructed EK100-com, RCORE significantly improves unseen composition accuracy, reduces reliance on co-occurrence bias, and achieves consistently positive compositional gaps. Our findings reveal object-driven shortcuts as a critical limiting factor in ZS-CAR and demonstrate that addressing them is essential for robust compositional video understanding.

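Motivation and Objectives

Identify why ZS-CAR models fail to generalize to unseen verb-object compositions.
Diagnose the roles of verb-object co-occurrence priors and of the learning asymmetry between verbs and objects.
Propose a simple framework that mitigates shortcut learning and strengthens temporally grounded verb representations.
Validate the generalization benefits across multiple datasets in open-world, unbiased settings.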
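
To make the co-occurrence diagnosis concrete, the following minimal sketch measures how sparse and skewed a set of verb-object training labels is. It is an illustration only, not the paper's analysis code: the function names, the `coverage` and `top_share` statistics, and the toy labels are all assumptions introduced here.

```python
from collections import Counter


def cooccurrence_stats(pairs):
    """Summarize sparsity and skew of (verb, object) training labels.

    Hypothetical helper: the statistics are illustrative choices,
    not metrics defined in the paper.
    """
    if not pairs:
        raise ValueError("pairs must be non-empty")
    counts = Counter(pairs)
    verbs = {v for v, _ in pairs}
    objects = {o for _, o in pairs}
    # Sparsity: share of all possible verb-object pairs that actually occur.
    coverage = len(counts) / (len(verbs) * len(objects))
    # Skew: share of the training samples taken by the most frequent pair.
    top_pair, top_count = counts.most_common(1)[0]
    return {"coverage": coverage, "top_pair": top_pair,
            "top_share": top_count / len(pairs)}


def verb_prior_given_object(pairs, obj):
    """Empirical P(verb | object): what a model can exploit as a shortcut."""
    verb_counts = Counter(v for v, o in pairs if o == obj)
    total = sum(verb_counts.values())
    return {v: c / total for v, c in verb_counts.items()}


# Toy labels (made up for illustration).
train_pairs = ([("open", "drawer")] * 6
               + [("close", "drawer")] * 3
               + [("open", "box")] * 1)
stats = cooccurrence_stats(train_pairs)
drawer_prior = verb_prior_given_object(train_pairs, "drawer")
```

In this toy set, a model that always predicts "open" whenever it sees a drawer is right on two thirds of the drawer samples without looking at any motion, which is the kind of shortcut the review describes.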

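Proposed Method

Diagnose object-driven shortcuts in ZS-CAR using diagnostic metrics that measure training bias and the compositional gap.
Introduce VOCAMix to synthesize plausible unseen verb-object pairs while preserving temporal structure.
Propose TORC, a temporal order regularization loss that forces the model to rely on temporal dynamics rather than static object cues.
Incorporate a marginal composition loss and a co-occurrence margin loss to suppress bias toward frequently occurring pairs.
Use AIM as the backbone and evaluate on the Sth-com and EK100-com datasets under open-world and biased settings.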
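
The idea behind temporal order regularization can be illustrated with a small hinge-style penalty. This is a generic sketch and not the paper's TORC loss, whose exact form this review does not state. The function name `temporal_order_penalty`, the `margin` parameter, and the use of a frame-shuffled clip are all assumptions made here for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_order_penalty(logits_original, logits_shuffled, verb_index,
                           margin=0.2):
    """Hinge penalty on order-insensitive verb predictions (ASSUMED form).

    The penalty is zero only when the true verb's probability on the
    original clip exceeds its probability on the frame-shuffled clip by
    at least `margin`. A model that reads the verb off static object
    appearance scores both clips alike and is penalized.
    """
    p_original = softmax(logits_original)[verb_index]
    p_shuffled = softmax(logits_shuffled)[verb_index]
    return max(0.0, margin - (p_original - p_shuffled))


# A shortcut model: identical verb logits whether or not frames are shuffled.
shortcut_penalty = temporal_order_penalty([2.0, 0.5, 0.1], [2.0, 0.5, 0.1], 0)

# An order-sensitive model: confidence collapses once temporal order is lost.
grounded_penalty = temporal_order_penalty([3.0, 0.0, 0.0], [0.0, 0.0, 0.0], 0)
```

Under these assumptions the shortcut model pays the full margin while the order-sensitive model pays nothing. In an actual training setup such a term would be added to the classification loss with a weighting coefficient.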

Figure 2: Controlled experiments demonstrate object-driven shortcut learning in ZS-CAR. We empirically identify a key failure mode in ZS-CAR: object-driven shortcuts. (a) Objects are easier to learn than verbs. We train a randomly initialized ViT [10] on a balanced $10 \times 10$ verb-object subse

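Experimental Results

Research Questions

RQ1: Do conventional ZS-CAR evaluation protocols faithfully reveal actual model behavior and generalization?
RQ2: Does the proposed compositional gap metric reveal the true benefit of compositional understanding beyond independent verb/object prediction?
RQ3: Can RCORE mitigate co-occurrence overfitting and object-driven verb shortcuts in ZS-CAR?
RQ4: Do RCORE's improvements generalize across datasets and evaluation settings?
RQ5: How does each component of RCORE contribute to robust compositional learning?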

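Key Findings

ZS-CAR models exhibit object-driven shortcuts, driven by co-occurrence bias and the learning asymmetry between objects and verbs.
RCORE reduces reliance on co-occurrence priors and yields a positive compositional gap on unseen compositions.
VOCAMix expands the set of plausible unseen verb-object combinations without disrupting temporal cues.
TORC enforces temporally grounded verb representations and reduces dependence on static cues.
On Sth-com and EK100-com, RCORE improves unseen composition accuracy and strengthens verb representations relative to strong baselines.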

Figure 3: Learning curve of the SOTA model with our diagnostic metrics. We plot the learning curve of C2C [16] trained on Sth-com [16]. We measure the False Seen Prediction (FSP) and False Co-occurrence Prediction (FCP) ratios, and observe that the seen–unseen accuracy gap ($\Delta_{SU}$) co
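
This review names the diagnostic metrics but does not give their formulas, so the sketch below uses assumed definitions that may differ from the paper's. The compositional gap is taken here as composition accuracy minus the product of verb and object accuracies. The FSP ratio is taken as the share of unseen-composition samples that the model assigns to a seen pair. Function names and the toy data are hypothetical.

```python
def accuracy(preds, labels):
    """Fraction of positions where the prediction equals the label."""
    if not labels:
        raise ValueError("labels must be non-empty")
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def compositional_gap(verb_preds, obj_preds, comp_preds, labels):
    """ASSUMED definition: composition accuracy minus the accuracy expected
    if verbs and objects were predicted independently (verb_acc * obj_acc)."""
    verb_acc = accuracy(verb_preds, [v for v, _ in labels])
    obj_acc = accuracy(obj_preds, [o for _, o in labels])
    return accuracy(comp_preds, labels) - verb_acc * obj_acc


def false_seen_prediction_ratio(comp_preds, labels, seen_pairs):
    """ASSUMED definition: among samples whose true pair is unseen,
    the fraction predicted as some pair from the seen set."""
    unseen_preds = [p for p, y in zip(comp_preds, labels)
                    if y not in seen_pairs]
    if not unseen_preds:
        return 0.0
    return sum(p in seen_pairs for p in unseen_preds) / len(unseen_preds)


# Toy evaluation set: ("close", "box") never appeared during training.
labels = [("open", "drawer"), ("close", "drawer"),
          ("open", "box"), ("close", "box")]
seen_pairs = {("open", "drawer"), ("close", "drawer"), ("open", "box")}

# The model falls back to the seen pair ("open", "box") on the unseen sample.
comp_preds = [("open", "drawer"), ("close", "drawer"),
              ("open", "box"), ("open", "box")]
verb_preds = [v for v, _ in comp_preds]
obj_preds = [o for _, o in comp_preds]

gap = compositional_gap(verb_preds, obj_preds, comp_preds, labels)
fsp = false_seen_prediction_ratio(comp_preds, labels, seen_pairs)
```

Under these assumed definitions, a gap near zero means the composition head adds nothing beyond two independent classifiers, and a high FSP ratio means errors on unseen compositions collapse onto pairs memorized from training.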

This review was generated by AI and reviewed by a human editor.