QUICK REVIEW

[Paper Review] Weakly Supervised Dense Event Captioning in Videos

Xuguang Duan, Wenbing Huang|arXiv (Cornell University)|2018. 12. 10.

Multimodal Machine Learning Applications | Citations: 63

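One-Line Summary

This paper introduces Weakly Supervised Dense Event Captioning (WS-DEC), which localizes and captions events in videos without temporal segment annotations. It trains a dual cycle of sentence localization and caption generation and uses fixed-point iteration to find segments.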

ABSTRACT

Dense event captioning aims to detect and describe all events of interest contained in a video. Despite the advanced development in this area, existing methods tackle this task by making use of dense temporal annotations, which is dramatically source-consuming. This paper formulates a new problem: weakly supervised dense event captioning, which does not require temporal segment annotations for model training. Our solution is based on the one-to-one correspondence assumption, each caption describes one temporal segment, and each temporal segment has one caption, which holds in current benchmark datasets and most real-world cases. We decompose the problem into a pair of dual problems: event captioning and sentence localization and present a cycle system to train our model. Extensive experimental results are provided to demonstrate the ability of our model on both dense event captioning and sentence localization in videos.

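Motivation and Objectives

Reduce annotation cost in dense event captioning by removing the need for temporal segment labels.
Exploit the one-to-one correspondence between captions and segments to make weak supervision possible.
Develop a dual training cycle of sentence localization and caption generation that can be trained end to end.
Demonstrate effectiveness on ActivityNet Captions for both dense captioning and sentence localization.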

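Proposed Method

Formalize two dual tasks over a video V, a caption C, and a temporal segment S: sentence localization l_θ1(V, C) and event captioning g_θ2(V, S).
Use fixed-point iteration at test time to converge to valid segments: S(t+1) = l_θ1(V, g_θ2(V, S(t))).
Train with a cycle constraint, C ≈ g_θ2(V, l_θ1(V, C)), plus a denoising-style loss that encourages convergence.
Apply Crossing Attention between video and caption features to perform cross-modal localization.
Regress the segment location by classifying over multiple anchors and then refining around the best anchor.
Introduce a soft clipping mechanism that keeps caption generation differentiable with respect to the video segment.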
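
The soft clipping and fixed-point steps above can be illustrated with a minimal sketch. It is not the authors' implementation: the sigmoid-difference mask is one common way to make a temporal clip differentiable and is assumed here, and `toy_step` is a hypothetical stand-in for the composed map l_θ1(V, g_θ2(V, S)), chosen as a simple contraction so that convergence is easy to see. All function names, the `sharpness` parameter, and the (center, length) segment encoding are assumptions made for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def soft_clip_mask(num_frames, center, length, sharpness=50.0):
    """Soft mask over normalized frame times in [0, 1].

    Values are close to 1 inside the segment and close to 0 outside.
    The mask is a difference of two sigmoids, so it varies smoothly
    with center and length (assumed form, not taken from the paper).
    """
    start = center - length / 2.0
    end = center + length / 2.0
    mask = []
    for i in range(num_frames):
        t = (i + 0.5) / num_frames  # midpoint time of frame i
        mask.append(sigmoid(sharpness * (t - start)) - sigmoid(sharpness * (t - end)))
    return mask


def fixed_point_iterate(step, segment, max_iters=50, tol=1e-6):
    """Repeat S <- step(S) until the largest coordinate change drops below tol."""
    for _ in range(max_iters):
        new_segment = step(segment)
        change = max(abs(a - b) for a, b in zip(new_segment, segment))
        segment = new_segment
        if change < tol:
            break
    return segment


def toy_step(segment, target=(0.4, 0.2), rate=0.5):
    """Hypothetical stand-in for l(V, g(V, S)): moves halfway toward a fixed target."""
    return tuple(s + rate * (t - s) for s, t in zip(segment, target))


# Start from an arbitrary (center, length) guess and iterate to the fixed point.
final_segment = fixed_point_iterate(toy_step, (0.9, 0.8))
mask = soft_clip_mask(10, center=0.5, length=0.4)
```

In a real model the step function would caption the current segment and then re-localize that caption, and the soft mask would weight frame features before they reach the captioner.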

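Experimental Results

Research Questions

RQ1: Can dense event captioning be learned without temporal segment annotations?
RQ2: Is the bidirectional one-to-one correspondence between captions and segments sufficient for training WS-DEC?
RQ3: Do fixed-point iteration and denoising improve training stability and performance under weak supervision?

Key Results

On ActivityNet Captions, the WS-DEC model is competitive with some fully supervised methods on METEOR and CIDEr.
The proposed method reaches a METEOR score comparable to fully supervised approaches and the highest CIDEr score among the weakly supervised variants.
The final WS-DEC model, with all components included, outperforms the baselines and ablated variants on dense event captioning metrics.
The localization results show reasonable segment predictions even under weak supervision, outperforming CTRL and approaching supervised models on some metrics (R@1 at IoU = 0.1 to 0.5, and mIoU).
Increasing the number of random initial segments at test time yields only limited gains, which suggests the model is robust to the initial proposals.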
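
For readers unfamiliar with the localization metrics mentioned above, the following is a small sketch of how temporal IoU, R@1 at an IoU threshold, and mIoU are commonly computed. The function names and the (start, end) interval encoding are assumptions for illustration, and the choice of `>=` for the threshold comparison is also an assumption; the benchmark's official evaluation code may differ in such details.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(preds, gts, threshold):
    """Fraction of queries whose top-1 prediction has IoU >= threshold."""
    hits = sum(1 for p, g in zip(preds, gts) if temporal_iou(p, g) >= threshold)
    return hits / len(gts)


def mean_iou(preds, gts):
    """Average temporal IoU over all queries."""
    return sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(gts)


# Two made-up predictions and ground-truth segments, in seconds.
preds = [(0.0, 10.0), (5.0, 15.0)]
gts = [(0.0, 20.0), (10.0, 20.0)]

r_at_1 = recall_at_1(preds, gts, threshold=0.5)
miou = mean_iou(preds, gts)
```

In this toy example the first prediction has IoU 0.5 and the second has IoU 1/3, so only the first counts as a hit at a threshold of 0.5.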


This review was generated by AI and reviewed by a human editor.