QUICK REVIEW

[Paper Review] Language Knowledge-Assisted Representation Learning for Skeleton-Based Action Recognition

Haojun Xu, Yan Gao | arXiv (Cornell University) | 2023-05-21

Human Pose and Action Recognition | Citations: 18

One-Line Summary

LA-GCN uses knowledge from large language models to build global and category prior graphs that guide a GCN for skeleton-based action recognition, achieving state-of-the-art accuracy on NTU RGB+D, NTU RGB+D 120, and NW-UCLA.

ABSTRACT

How humans understand and recognize the actions of others is a complex neuroscientific problem that involves a combination of cognitive mechanisms and neural networks. Research has shown that humans have brain areas that recognize actions that process top-down attentional information, such as the temporoparietal association area. Also, humans have brain regions dedicated to understanding the minds of others and analyzing their intentions, such as the medial prefrontal cortex of the temporal lobe. Skeleton-based action recognition creates mappings for the complex connections between the human skeleton movement patterns and behaviors. Although existing studies encoded meaningful node relationships and synthesized action representations for classification with good results, few of them considered incorporating a priori knowledge to aid potential representation learning for better performance. LA-GCN proposes a graph convolution network using large-scale language models (LLM) knowledge assistance. First, the LLM knowledge is mapped into a priori global relationship (GPR) topology and a priori category relationship (CPR) topology between nodes. The GPR guides the generation of new "bone" representations, aiming to emphasize essential node information from the data level. The CPR mapping simulates category prior knowledge in human brain regions, encoded by the PC-AC module and used to add additional supervision, forcing the model to learn class-distinguishable features. In addition, to improve information transfer efficiency in topology modeling, we propose multi-hop attention graph convolution. It aggregates each node's k-order neighbor simultaneously to speed up model convergence. LA-GCN reaches state-of-the-art on NTU RGB+D, NTU RGB+D 120, and NW-UCLA datasets.

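Research Motivation and Goals

Motivate skeleton-based action recognition with prior knowledge inspired by human cognition.
Integrate large language model (LLM) knowledge to build global and category priors for the skeleton graph.
Improve topology learning and information transfer in GCNs through multi-hop attention.
Propose PC-AC, an auxiliary supervision module that regularizes training with class-specific semantics.
Demonstrate state-of-the-art performance on NTU RGB+D, NTU RGB+D 120, and NW-UCLA.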

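Proposed Method

Build a Global Prior Relation (GPR) graph from LLM-derived text features of the joints and action classes.
Convert joint features into bone-like representations weighted by GPR distance, producing the Priori Skeleton Modal representation.
Introduce Multi-Hop Attention Graph Convolution (MHA-GC), which aggregates information from multi-hop neighbors within a single layer.
Formalize a category prior topology (T-C) from LLM features and use it in the auxiliary PC-AC module for multi-task learning.
Train the main classifier jointly with an auxiliary branch supervised by the class topology graph; the auxiliary branch is removed at inference.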
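
The graph-construction and multi-hop aggregation steps listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the GPR graph can be approximated by a row-normalized cosine-similarity matrix over joint text embeddings, and it models multi-hop aggregation as an attention-diffusion recursion that mixes propagated features with the input. The function names, the toy embeddings, and the values of `alpha` and `hops` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def prior_relation_graph(text_features):
    """Row-normalized joint-to-joint graph (an assumed stand-in for GPR)."""
    n = len(text_features)
    graph = []
    for i in range(n):
        # Shift cosine similarity from [-1, 1] to [0, 1] so weights are non-negative.
        row = [
            (cosine_similarity(text_features[i], text_features[j]) + 1.0) / 2.0
            for j in range(n)
        ]
        total = sum(row)
        graph.append([w / total for w in row])
    return graph


def matmul(a, b):
    """Plain matrix product for lists of lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def multi_hop_aggregate(graph, features, alpha=0.2, hops=3):
    """Approximate a weighted sum over 0..hops-step neighbors in one pass.

    Each iteration propagates one more hop and mixes the input back in,
    so nearer hops receive geometrically larger weights.
    """
    z = features
    for _ in range(hops):
        propagated = matmul(graph, z)
        z = [
            [(1.0 - alpha) * p + alpha * x for p, x in zip(p_row, x_row)]
            for p_row, x_row in zip(propagated, features)
        ]
    return z


# Toy example: 3 joints with hypothetical 3-d text embeddings and 2-d features.
joint_text = [[1.0, 0.0, 0.5], [0.9, 0.1, 0.4], [0.0, 1.0, 0.2]]
joint_feats = [[0.0, 1.0], [1.0, 0.0], [2.0, 2.0]]
gpr = prior_relation_graph(joint_text)
aggregated = multi_hop_aggregate(gpr, joint_feats)
print(aggregated)
```

Because every row of the graph sums to one, each output feature is a convex combination of the input joint features, which keeps the toy example numerically stable without extra normalization.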

Figure 1: Schematic of LA-GCN concept. The top half of this figure shows two brain activity processes when humans perform action recognition. The bottom half shows the proposed multi-task learning process. The knowledge of the language model is divided into global information and category informatio

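Experimental Results

Research Questions

RQ1: Can prior knowledge derived from LLMs improve topology learning for skeleton-based action recognition?
RQ2: Can global and category priors guide the model toward more discriminative skeleton representations and features?
RQ3: Does multi-hop attention improve information transfer and convergence of GCNs on this task?
RQ4: Does the auxiliary PC-AC supervision improve recognition of similar or process-like actions?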

Key Results

Dataset	Split/Metric	Method/Variant (example)	Top-1 Accuracy
NTU RGB+D 60	X-Sub	LA-GCN	93.5%
NTU RGB+D 60	X-View	LA-GCN	97.2%
NTU RGB+D 120	X-Sub	LA-GCN (Joint)	86.5%
NTU RGB+D 120	X-Sub	LA-GCN (Joint+Bone)	89.7%
NTU RGB+D 120	X-Sub	LA-GCN (4 ensemble)	89.9%
NTU RGB+D 120	X-Sub	LA-GCN (6 ensemble)	90.7%
NW-UCLA	Top-1	LA-GCN	97.6%
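
The ensemble rows in the table combine several input streams, such as joint and bone. The review does not state how the streams are combined, so the sketch below assumes late score fusion, a common choice in skeleton-based recognition: each stream's logits are converted to class probabilities and summed with per-stream weights. The function names, weights, and toy logits are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_streams(stream_logits, weights=None):
    """Weighted sum of per-stream class probabilities.

    Returns (predicted_class, fused_scores).
    """
    if weights is None:
        weights = [1.0] * len(stream_logits)
    num_classes = len(stream_logits[0])
    fused = [0.0] * num_classes
    for logits, w in zip(stream_logits, weights):
        for c, p in enumerate(softmax(logits)):
            fused[c] += w * p
    predicted = max(range(num_classes), key=lambda c: fused[c])
    return predicted, fused


# Toy example: the joint stream narrowly prefers class 0,
# the bone stream clearly prefers class 1.
joint_logits = [2.0, 1.9, 0.1]
bone_logits = [0.5, 2.5, 0.2]
predicted, scores = fuse_streams([joint_logits, bone_logits])
print(predicted, scores)
```

In this toy case the fused prediction follows the more confident bone stream, which is the intuition behind the gains from adding streams.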

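LA-GCN reaches 93.5% (X-sub) and 97.2% (X-view) on NTU RGB+D 60, outperforming a number of baselines.
On NTU RGB+D 120, LA-GCN variants reach up to 90.7% (X-sub) and 91.8% (X-setup) with ensembling.
On NW-UCLA, LA-GCN achieves 97.6% Top-1 accuracy, surpassing prior methods.
Results on NTU RGB+D 120 improve with the Joint+Bone and ensemble variants (e.g., 4 ensemble: 89.9%/91.3%).
The auxiliary PC-AC supervision improves recognition of similar actions (e.g., 'reading' and 'writing') by about 8–9 points.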
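
The auxiliary supervision behind the last result can be illustrated with a generic multi-task objective. This is a sketch under assumptions, not the paper's exact loss: it assumes the main classifier and the auxiliary branch each produce class logits, that both are trained with cross-entropy, and that the auxiliary term is scaled by a hypothetical weight `aux_weight`. At inference only the main logits are used, matching the description that the auxiliary branch is removed.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one sample, computed via a stable log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


def multitask_loss(main_logits, aux_logits, target, aux_weight=0.2):
    """Main classification loss plus a weighted auxiliary loss (training only)."""
    main_term = cross_entropy(main_logits, target)
    aux_term = cross_entropy(aux_logits, target)
    return main_term + aux_weight * aux_term


def predict(main_logits):
    """Inference uses the main branch only; the auxiliary branch is dropped."""
    return max(range(len(main_logits)), key=lambda c: main_logits[c])


# Toy example with two classes (e.g., two visually similar actions).
main_logits = [2.0, 0.5]
aux_logits = [0.1, 0.3]
print(multitask_loss(main_logits, aux_logits, target=0))
print(predict(main_logits))
```

The auxiliary term only shapes the shared features during training, so it adds no cost at inference time.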

Figure 2: Extraction of text features. Subfigure (a) is Bert’s architecture. (b) Our method uses the learned text encoder to extract text features by embedding the names of classes [C] and the names of all joints [J] of the target dataset.

This review was generated by AI and reviewed by a human editor.