QUICK REVIEW

[Paper Review] Towards Automatic Learning of Procedures from Web Instructional Videos

Luowei Zhou, Chenliang Xu|arXiv (Cornell University)|2017. 03. 28.

Multimodal Machine Learning Applications | Citations: 222

One-Line Summary

This paper defines the task of procedure segmentation for unconstrained videos, introduces the YouCook2 dataset, and proposes ProcNets, a segment-level recurrent model. ProcNets segments long instructional videos into category-independent procedure segments and outperforms the baselines on this task.

ABSTRACT

The potential for agents, whether embodied or software, to learn by observing other agents performing procedures involving objects and actions is rich. Current research on automatic procedure learning heavily relies on action labels or video subtitles, even during the evaluation phase, which makes them infeasible in real-world scenarios. This leads to our question: can the human-consensus structure of a procedure be learned from a large set of long, unconstrained videos (e.g., instructional videos from YouTube) with only visual evidence? To answer this question, we introduce the problem of procedure segmentation--to segment a video procedure into category-independent procedure segments. Given that no large-scale dataset is available for this problem, we collect a large-scale procedure segmentation dataset with procedure segments temporally localized and described; we use cooking videos and name the dataset YouCook2. We propose a segment-level recurrent network for generating procedure segments by modeling the dependencies across segments. The generated segments can be used as pre-processing for other tasks, such as dense video captioning and event parsing. We show in our experiments that the proposed model outperforms competitive baselines in procedure segmentation.

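Research Motivation and Goals

Learn the human-consensus structure of procedures from long, unconstrained instructional videos (e.g., from YouTube).
Define and address the procedure segmentation problem: segmenting a video into category-independent segments.
Build a large-scale, richly annotated dataset (YouCook2) that makes research on procedure segmentation possible.
Develop an end-to-end model (ProcNets) that localizes segment proposals and learns segment-level temporal dependencies.
Demonstrate that segment-level modeling improves over frame-level and other subtitle-free baselines.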

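Proposed Method

Encode frames with ResNet features, then apply a Bi-LSTM to produce context-aware frame representations.
Propose a segment proposal module that uses K anchors (anchor-based proposals) to generate candidate procedure segments with start/end offsets, trained with binary classification and offset regression.
Use a sequential prediction module (LSTM) that takes the Proposal Vector, Location Embedding, and Segment Content as inputs to model segment-level dependencies, then selects and outputs the final sequence of procedure segments.
Train with the combined loss L = L_cla + alpha_r L_reg + alpha_s L_seq, where L_cla is the binary cross-entropy on procedureness, L_reg is the smooth L1 loss on the offsets, and L_seq is the cross-entropy on sequential prediction.
Infer a coherent sequence of procedure segments with beam search, without requiring a fixed number of segments.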
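
The proposal and loss steps listed above can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: the function names, the default weights, and in particular the center/length offset parameterization in `decode_anchor` are assumptions borrowed from common anchor-based detectors, since this review does not spell out how the offsets are applied.

```python
import math


def decode_anchor(center, length, center_offset, length_offset):
    """Turn an anchor plus regressed offsets into (start, end) boundaries.

    Assumption: the center is shifted by a fraction of the anchor length
    and the length is scaled exponentially. The review does not state
    the parameterization actually used.
    """
    new_center = center + center_offset * length
    new_length = length * math.exp(length_offset)
    return new_center - new_length / 2.0, new_center + new_length / 2.0


def binary_cross_entropy(prob, label):
    """L_cla term for one proposal; label is 1 (procedure segment) or 0."""
    eps = 1e-12  # guards against log(0)
    return -(label * math.log(prob + eps)
             + (1 - label) * math.log(1 - prob + eps))


def smooth_l1(pred, target):
    """L_reg term for one offset: quadratic near zero, linear elsewhere."""
    diff = abs(pred - target)
    return 0.5 * diff * diff if diff < 1.0 else diff - 0.5


def cross_entropy(probs, target_index):
    """L_seq term for one step: negative log-probability of the target."""
    return -math.log(probs[target_index] + 1e-12)


def total_loss(l_cla, l_reg, l_seq, alpha_r=1.0, alpha_s=1.0):
    """L = L_cla + alpha_r * L_reg + alpha_s * L_seq.

    The default weights of 1.0 are placeholders, not values from the paper.
    """
    return l_cla + alpha_r * l_reg + alpha_s * l_seq


# An anchor centered at frame 50 with length 20 and zero offsets
# decodes back to its own boundaries.
start, end = decode_anchor(center=50.0, length=20.0,
                           center_offset=0.0, length_offset=0.0)
```

In a real model each term would be averaged over proposals or time steps before being combined; the sketch keeps them per-element so the shape of each loss is easy to see.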

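Experimental Results

Research Questions

RQ1: Can the human-consensus structure of a procedure be learned from long, unconstrained videos using visual evidence alone?
RQ2: Can a segment-level sequential model capture long-range dependencies between procedure steps better than frame-level approaches or non-sequential proposals?
RQ3: Does a large-scale, richly annotated dataset enable robust training and evaluation of category-independent procedure segmentation?
RQ4: Can the output of procedure segmentation improve downstream tasks on instructional videos, such as dense captioning or event parsing?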

Key Results

Method	Jaccard (validation)	mIoU (validation)	Jaccard (test)	mIoU (test)
Uniform	41.5	36.0	40.1	35.1
vsLSTM	47.2	33.9	45.2	32.2
SCNN-prop	46.3	28.0	45.6	26.7
ProcNets-NMS (ours)	49.8	35.2	47.6	33.9
ProcNets-LSTM (ours)	51.5	37.5	50.6	37.0

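ProcNets-LSTM outperforms every competitive baseline on procedure segmentation, on both Jaccard and mIoU and on both the validation and test splits.
ProcNets-LSTM achieves the highest scores: validation Jaccard 51.5, validation mIoU 37.5, test Jaccard 50.6, test mIoU 37.0.
ProcNets-NMS, which relies on non-maximum suppression alone, still beats vsLSTM and SCNN-prop on Jaccard (49.8 validation, 47.6 test), indicating strong segment localization, although the Uniform baseline edges it out on mIoU.
Location Embedding is the most important component for learning procedure structure; removing it causes a noticeable drop in performance.
The model adapts the number of segments per video and picks up segments that are unannotated yet semantically meaningful, showing a qualitative understanding of procedure structure.
The YouCook2 dataset provides 2,000 videos spanning 89 recipes, with temporal procedure annotations and imperative sentence descriptions, enabling robust evaluation.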
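
To make the mIoU column and the Uniform row of the table more concrete, here is a small sketch of a temporal IoU score and a uniform-split baseline. It rests on assumptions: this review does not define the paper's exact matching protocol, so `mean_iou` uses one common convention (best-matching prediction per ground-truth segment, then the average), and `uniform_segments` with its `num_segments` parameter is a hypothetical stand-in for the Uniform baseline. Scores are returned as fractions, whereas the table appears to report percentages.

```python
def temporal_iou(seg_a, seg_b):
    """Intersection over union of two (start, end) intervals."""
    inter = max(0.0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    union = (seg_a[1] - seg_a[0]) + (seg_b[1] - seg_b[0]) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(ground_truth, predicted):
    """Average, over ground-truth segments, of the best IoU with any prediction.

    Assumption: this matching rule is a common convention, not necessarily
    the exact protocol used in the paper.
    """
    if not ground_truth or not predicted:
        return 0.0
    best = [max(temporal_iou(gt, p) for p in predicted) for gt in ground_truth]
    return sum(best) / len(best)


def uniform_segments(duration, num_segments):
    """Split [0, duration] into equal-length segments (hypothetical baseline)."""
    step = duration / num_segments
    return [(i * step, (i + 1) * step) for i in range(num_segments)]


# Two annotated steps in a 40-second clip, scored against a 4-way uniform split.
ground_truth = [(0.0, 10.0), (20.0, 40.0)]
predicted = uniform_segments(40.0, 4)
score = mean_iou(ground_truth, predicted)
```

In this toy case the first annotated step is matched exactly, while the second is twice as long as any uniform chunk and can reach an IoU of at most 0.5, which is the kind of mismatch a model that adapts segment lengths is meant to avoid.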

This review was generated by AI and reviewed by a human editor.