QUICK REVIEW

[Paper Review] Aff-Wild2: Extending the Aff-Wild Database for Affect Recognition

Dimitrios Kollias, Stefanos Zafeiriou | arXiv (Cornell University) | 2018-11-11

Emotion and Mood Recognition | References: 43 | Citations: 110

One-line summary

This paper roughly doubles the Aff-Wild dataset to form Aff-Wild2, with 458 subjects and about 2.8M frames, and presents a CNN-RNN-attention architecture for continuous valence-arousal prediction that shows strong cross-database transfer to RECOLA.

ABSTRACT

Automatic understanding of human affect using visual signals is a problem that has attracted significant interest over the past 20 years. However, human emotional states are quite complex. To appraise such states displayed in real-world settings, we need expressive emotional descriptors that are capable of capturing and describing this complexity. The circumplex model of affect, which is described in terms of valence (i.e., how positive or negative is an emotion) and arousal (i.e., power of the activation of the emotion), can be used for this purpose. Recent progress in the emotion recognition domain has been achieved through the development of deep neural architectures and the availability of very large training databases. To this end, Aff-Wild has been the first large-scale "in-the-wild" database, containing around 1,200,000 frames. In this paper, we build upon this database, extending it with 260 more subjects and 1,413,000 new video frames. We call the union of Aff-Wild with the additional data, Aff-Wild2. The videos are downloaded from Youtube and have large variations in pose, age, illumination conditions, ethnicity and profession. Both database-specific as well as cross-database experiments are performed in this paper, by utilizing the Aff-Wild2, along with the RECOLA database. The developed deep neural architectures are based on the joint training of state-of-the-art convolutional and recurrent neural networks with attention mechanism; thus exploiting both the invariant properties of convolutional features, while modeling temporal dynamics that arise in human behaviour via the recurrent layers. The obtained results show premise for utilization of the extended Aff-Wild, as well as of the developed deep neural architectures for visual analysis of human behaviour in terms of continuous emotion dimensions.

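Motivation and objectives

Extend the Aff-Wild database into Aff-Wild2, increasing its variability and scale (more subjects, more frames, more varied conditions).
Develop an end-to-end deep architecture (CNN-RNN with attention) for continuous valence-arousal estimation in the wild.
Evaluate cross-database generalization by fine-tuning on RECOLA and compare against state-of-the-art models.
Analyze how pretraining on large face datasets affects emotion prediction performance.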

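Proposed method

Build Aff-Wild2 by adding 260 videos (1,413,000 frames) to Aff-Wild, for a total of 558 videos, 2,786,201 frames, and 458 subjects.
Four experts annotate valence and arousal continuously over time, and annotation post-processing yields the final MAIC-based labels.
Detect faces in each frame and normalize them to 96×96×3 inputs for the CNN.
Use a backbone CNN (VGGFACE, VGGFACE2, or DenseNet-121, each pretrained on its corresponding dataset) followed by an RNN variant (LSTM, GRU, or indRNN) with 2 hidden RNN layers of 128 units each.
Add an attention layer on top of the RNN and train with the loss L_total = 1 - (ρ_a + ρ_v)/2, where ρ_a and ρ_v are the concordance correlation coefficients (CCC) for arousal and valence.
Train the architectures on a per-frame basis (Adam optimizer, batch size 320, attention length 32) and evaluate them on Aff-Wild2 with CCC as the performance metric.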
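
The CCC-based loss above can be sketched in a few lines. This is a minimal pure-Python illustration, not the authors' implementation: it uses the standard CCC definition, 2·cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))²), with population statistics. The function names `ccc` and `total_loss`, and the choice to return 0 when the denominator is zero, are assumptions made for this example.

```python
def ccc(pred, target):
    """Concordance correlation coefficient of two equal-length sequences."""
    n = len(pred)
    if n == 0 or n != len(target):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    # Population (1/n) variance and covariance.
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    var_t = sum((t - mean_t) ** 2 for t in target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target)) / n
    denom = var_p + var_t + (mean_p - mean_t) ** 2
    if denom == 0.0:
        # CCC is undefined here; returning 0 is an assumption of this sketch.
        return 0.0
    return 2.0 * cov / denom


def total_loss(pred_valence, true_valence, pred_arousal, true_arousal):
    """L_total = 1 - (rho_a + rho_v) / 2, as stated in the review."""
    rho_v = ccc(pred_valence, true_valence)
    rho_a = ccc(pred_arousal, true_arousal)
    return 1.0 - (rho_a + rho_v) / 2.0
```

Because CCC penalizes differences in mean and scale as well as poor correlation, a prediction that tracks the labels perfectly but with a constant offset still scores below 1, so the loss stays above 0 in that case.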

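Experimental results

Research questions

RQ1: Does Aff-Wild2 improve robustness and coverage of natural facial expressions compared with Aff-Wild?
RQ2: Which CNN-RNN-attention configuration is best suited to valence-arousal prediction on Aff-Wild2?
RQ3: Do models trained on Aff-Wild2 generalize to other datasets (e.g., RECOLA) after fine-tuning?
RQ4: How does pretraining on large face datasets such as VGGFACE/VGGFACE2 affect emotion prediction performance?

Key results

Aff-Wild2 consists of 558 videos, 2,786,201 frames, and 458 subjects (279 male, 179 female).
The best-performing architecture, VGGFACE-GRU-attention, reaches a valence CCC of 0.55 and an arousal CCC of 0.45 on the test set (validation CCC: 0.58 and 0.48, respectively).
Fine-tuning the best Aff-Wild2 model (VGGFACE-GRU-attention) on RECOLA yields a valence CCC of 0.547 and an arousal CCC of 0.304, outperforming the ResNet-GRU and AffWildNet baselines on RECOLA.
CNN-RNN models with attention consistently achieve higher CCC than their attention-free variants across settings.
Models pretrained on Aff-Wild2 show strong performance gains when adapted to RECOLA, which suggests that the proposed approach generalizes well across databases.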
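
To make the attention layer credited in these results more concrete, here is a rough sketch of temporal attention pooling over RNN outputs. The review does not say which attention formulation the authors use, so the dot-product scoring below, the function name `attention_pool`, and the parameter `score_weights` are all assumptions made for illustration. Only the shapes (a sequence of 32 steps, 128 RNN units) come from the review.

```python
import math
import random


def attention_pool(hidden_states, score_weights):
    """Softmax-weighted sum of per-time-step hidden states.

    hidden_states: list of T vectors, each of length H (RNN outputs).
    score_weights: vector of length H (hypothetical learned parameter).
    Returns (context vector of length H, attention weights of length T).
    """
    if not hidden_states:
        raise ValueError("hidden_states must be non-empty")
    # One scalar relevance score per time step.
    scores = [sum(w * h for w, h in zip(score_weights, h_t))
              for h_t in hidden_states]
    # Numerically stable softmax over time.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    alphas = [e / norm for e in exps]
    # Weighted sum of hidden states, computed dimension by dimension.
    dim = len(hidden_states[0])
    context = [sum(a * h_t[d] for a, h_t in zip(alphas, hidden_states))
               for d in range(dim)]
    return context, alphas


# Shapes taken from the review: 32 time steps, 128 RNN units.
random.seed(0)
T, H = 32, 128
hidden = [[random.uniform(-1.0, 1.0) for _ in range(H)] for _ in range(T)]
weights = [random.uniform(-0.1, 0.1) for _ in range(H)]
context, alphas = attention_pool(hidden, weights)
```

In a full model the context vector would feed a small regression head that outputs valence and arousal. That head is omitted here because the review does not describe it.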

This review was generated by AI and checked by a human editor.