QUICK REVIEW

[Paper Review] On the effectiveness of task granularity for transfer learning

Farzaneh Mahdisoltani, Guillaume Berger | arXiv (Cornell University) | 2018-04-24

Human Pose and Action Recognition | References: 31 | Citations: 51

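One-line summary

This study examines how the granularity of the source task (from coarse labels to fine-grained captions) affects the quality of learned features for transfer learning in video understanding. It shows that finer-grained tasks yield better transfer performance and that captioning can serve as an effective source task.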

ABSTRACT

We describe a DNN for video classification and captioning, trained end-to-end, with shared features, to solve tasks at different levels of granularity, exploring the link between granularity in a source task and the quality of learned features for transfer learning. For solving the new task domain in transfer learning, we freeze the trained encoder and fine-tune a neural net on the target domain. We train on the Something-Something dataset with over 220,000 videos, and multiple levels of target granularity, including 50 action groups, 174 fine-grained action categories and captions. Classification and captioning with Something-Something are challenging because of the subtle differences between actions, applied to thousands of different object classes, and the diversity of captions penned by crowd actors. Our model performs better than existing classification baselines for Something-Something, with impressive fine-grained results. And it yields a strong baseline on the new Something-Something captioning task. Experiments reveal that training with more fine-grained tasks tends to produce better features for transfer learning.

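Motivation and objectives

Investigate the relationship between the granularity of source-task labels and the quality of transferable features.
Develop an end-to-end encoder-decoder model for video classification and captioning with a shared representation.
Evaluate transfer learning from Something-Something features to new domains, including a kitchen-action dataset.
Introduce 20bn-kitchenware as a transfer learning benchmark for fine-grained tasks.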

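Proposed method

Uses a two-channel video encoder (a 2D spatial CNN and a 3D spatiotemporal CNN) whose outputs feed a shared LSTM encoder.
Jointly trains a classification head and a caption decoder with a weighted loss: loss = lambda * classification_loss + (1 - lambda) * captioning_loss.
Trains on four tasks: coarse-grained action groups, fine-grained action categories, captions with simplified object placeholders, and captions with full object placeholders.
The caption decoder generates a caption conditioned on the encoded video representation; training uses teacher forcing with a fixed caption length of 14 words.
Evaluation includes transfer learning: the encoder is frozen and a classifier is trained on the target data, so that features learned at different source granularities can be compared.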
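The weighted objective in the list above can be illustrated with a minimal, framework-free sketch. The function names (`cross_entropy`, `captioning_loss`, `joint_loss`) and the choice to average the captioning loss over tokens are assumptions made for illustration; the review only states the weighted-sum form, not how each term is computed.

```python
import math

def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def captioning_loss(step_logits, caption_tokens):
    """Mean per-token cross-entropy over one caption.

    Assumption: one logit vector per decoding step, produced with
    teacher forcing (the ground-truth previous token is fed at each step).
    """
    losses = [cross_entropy(l, t) for l, t in zip(step_logits, caption_tokens)]
    return sum(losses) / len(losses)

def joint_loss(class_logits, class_target, step_logits, caption_tokens, lam):
    """loss = lam * classification_loss + (1 - lam) * captioning_loss."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    cls = cross_entropy(class_logits, class_target)
    cap = captioning_loss(step_logits, caption_tokens)
    return lam * cls + (1.0 - lam) * cap
```

Setting `lam` to 1.0 recovers classification-only training and 0.0 recovers captioning-only training, so a single parameter spans the single-task and joint settings compared in the review.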

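Experimental results

Research questions

RQ1: Does training on a finer-grained source task yield richer features for transfer learning?
RQ2: How does joint training on classification and captioning affect transfer performance compared with single-task training?
RQ3: How do the different granularity levels (coarse groups, fine-grained actions, simplified captions, full captions) affect classification and captioning performance?
RQ4: How well do features derived from Something-Something transfer to a new fine-grained kitchen-action dataset (20bn-kitchenware)?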

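Key findings

Training on more fine-grained tasks tends to produce better features for transfer learning.
Models trained jointly on classification and captioning transfer better to new tasks.
For coarse vs. fine-grained classification, fine-grained training gave higher test accuracy (e.g., 50.44% vs. 41.7% in the reported setting).
Captioning is a viable and beneficial source task, and training that combines captioning with action classification improves transfer performance.
On the proposed 20bn-kitchenware benchmark, Something-Something-pretrained features combined with a recurrent temporal model outperform the baselines when transferring to fine-grained kitchen actions.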
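The frozen-encoder protocol behind these transfer comparisons can be sketched on toy data as follows. Everything here is hypothetical: `frozen_encoder` is a hand-written stand-in for a pretrained video encoder, and the softmax classifier trained with plain SGD is an assumed substitute for the target-domain network, not the authors' model. The point is only that the encoder is never updated while the classifier on top of it is.

```python
import math

def frozen_encoder(video):
    """Hypothetical stand-in for a pretrained encoder; it has no trainable state.

    Maps a list of numbers to [mean, range, constant bias feature].
    """
    return [sum(video) / len(video), max(video) - min(video), 1.0]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def class_scores(weights, features):
    return [sum(w * x for w, x in zip(row, features)) for row in weights]

def train_probe(videos, labels, num_classes, epochs=200, lr=0.5):
    """Train only a linear softmax classifier on top of frozen features."""
    feats = [frozen_encoder(v) for v in videos]  # computed once: encoder is fixed
    dim = len(feats[0])
    weights = [[0.0] * dim for _ in range(num_classes)]
    for _ in range(epochs):
        for x, y in zip(feats, labels):
            probs = softmax(class_scores(weights, x))
            for c in range(num_classes):
                grad = probs[c] - (1.0 if c == y else 0.0)  # d(cross-entropy)/d(score_c)
                for j in range(dim):
                    weights[c][j] -= lr * grad * x[j]
    return weights

def predict(weights, video):
    scores = class_scores(weights, frozen_encoder(video))
    return scores.index(max(scores))
```

To compare source tasks in this style, one would swap in encoders pretrained at different granularities, keep the classifier training identical, and compare target accuracy.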


This review was generated by AI and reviewed by a human editor.