QUICK REVIEW

[Paper Review] Learning in Audio-visual Context: A Review, Analysis, and New Perspective

Yake Wei, Di Hu|arXiv (Cornell University)|Aug 20, 2022

Music and Audio Processing | Citations: 32

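One-line summary

The paper presents a comprehensive review of audio-visual learning, organized around its cognitive foundations and three task categories (audio-visual boosting, cross-modal perception, and audio-visual collaboration), and proposes a macro-level perspective on audio-visual scene understanding.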

ABSTRACT

Sight and hearing are two senses that play a vital role in human communication and scene understanding. To mimic human perception ability, audio-visual learning, aimed at developing computational approaches to learn from both audio and visual modalities, has been a flourishing field in recent years. A comprehensive survey that can systematically organize and analyze studies of the audio-visual field is expected. Starting from the analysis of audio-visual cognition foundations, we introduce several key findings that have inspired our computational studies. Then, we systematically review the recent audio-visual learning studies and divide them into three categories: audio-visual boosting, cross-modal perception and audio-visual collaboration. Through our analysis, we discover that, the consistency of audio-visual data across semantic, spatial and temporal support the above studies. To revisit the current development of the audio-visual learning field from a more macro view, we further propose a new perspective on audio-visual scene understanding, then discuss and analyze the feasible future direction of the audio-visual learning area. Overall, this survey reviews and outlooks the current audio-visual learning field from different aspects. We hope it can provide researchers with a better understanding of this area. A website including constantly-updated survey is released: https://gewu-lab.github.io/audio-visual-learning/

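Motivation and objectives

Motivate audio-visual learning research by drawing parallels with the foundations of human perception.
Systematically classify recent audio-visual (AV) learning studies into three main areas: audio-visual boosting, cross-modal perception, and audio-visual collaboration.
Analyze audio-visual consistency across the semantic, spatial, and temporal dimensions, and use it to organize methods.
Provide a macro-level, cognition-inspired perspective on AV scene understanding and future directions.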

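Proposed approach

Review the cognitive-neuroscience foundations of audio-visual processing and integration.
Propose a taxonomy that groups audio-visual learning tasks into three categories: audio-visual boosting, cross-modal perception, and audio-visual collaboration.
Analyze how semantic, spatial, and temporal consistency supports AV learning tasks.
Survey AV datasets and relate the findings to a cognition-inspired framework.
Present a forward-looking perspective and discuss feasible future directions.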
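
As an illustration of the semantic consistency mentioned in the list above, the following minimal Python sketch checks whether each audio embedding's nearest visual embedding, under cosine similarity, is its paired one. It is not taken from the survey: the function names `cosine` and `best_visual_match` and the toy embedding values are assumptions made up for this example, and a real system would learn such embeddings from data.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def best_visual_match(audio_embs, visual_embs):
    """For each audio embedding, return the index of the most similar visual embedding."""
    return [
        max(range(len(visual_embs)), key=lambda j: cosine(a, visual_embs[j]))
        for a in audio_embs
    ]


# Toy, hand-made embeddings (hypothetical values, not from the paper).
# Index i in both lists is assumed to come from the same video clip.
audio_embs = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 0.0, 1.0]]
visual_embs = [[0.9, 0.0, 0.1], [0.1, 0.8, 0.0], [0.0, 0.2, 0.9]]

matches = best_visual_match(audio_embs, visual_embs)
print(matches)
```

With these toy values, each audio clip is matched to the visual clip at the same index, which is the behavior a semantically consistent embedding space is meant to show.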

Experimental results

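Research questions

RQ1: What are the cognitive foundations of audio-visual perception that motivate AV learning?
RQ2: How can AV learning be organized into coherent categories (boosting, cross-modal perception, collaboration) on the basis of audio-visual consistency?
RQ3: What are the main datasets for audio-visual scene understanding, and what are the future research directions?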

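Key findings

The field can be organized into three categories: AV boosting, cross-modal perception, and AV collaboration.
Audio-visual consistency across the semantic, spatial, and temporal dimensions underlies many AV learning approaches.
A cognition-inspired perspective offers a macro-level view of AV scene understanding and guides future directions.
Recent surveys and continuously updated online resources help maintain an ongoing summary of progress in AV learning.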


This review was generated by AI and checked by a human editor.