QUICK REVIEW

[Paper Review] Large-Scale Long-Tailed Recognition in an Open World

Ziwei Liu, Zhongqi Miao|arXiv (Cornell University)|2019. 04. 10.

Domain Adaptation and Few-Shot Learning | References: 64 | Citations: 65

One-Line Summary

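The paper proposes OLTR (Open Long-Tailed Recognition), which combines dynamic meta-embedding with modulated attention to handle head/tail knowledge sharing, few-shot generalization, and open-set novelty in a single unified framework, and validates it on large-scale open long-tailed benchmarks.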

ABSTRACT

Real world data often have a long-tailed and open-ended distribution. A practical recognition system must classify among majority and minority classes, generalize from a few known instances, and acknowledge novelty upon a never seen instance. We define Open Long-Tailed Recognition (OLTR) as learning from such naturally distributed data and optimizing the classification accuracy over a balanced test set which include head, tail, and open classes. OLTR must handle imbalanced classification, few-shot learning, and open-set recognition in one integrated algorithm, whereas existing classification approaches focus only on one aspect and deliver poorly over the entire class spectrum. The key challenges are how to share visual knowledge between head and tail classes and how to reduce confusion between tail and open classes. We develop an integrated OLTR algorithm that maps an image to a feature space such that visual concepts can easily relate to each other based on a learned metric that respects the closed-world classification while acknowledging the novelty of the open world. Our so-called dynamic meta-embedding combines a direct image feature and an associated memory feature, with the feature norm indicating the familiarity to known classes. On three large-scale OLTR datasets we curate from object-centric ImageNet, scene-centric Places, and face-centric MS1M data, our method consistently outperforms the state-of-the-art. Our code, datasets, and models enable future OLTR research and are publicly available at https://liuziwei7.github.io/projects/LongTail.html.

Motivation and Objectives

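Define the OLTR task, which combines a long-tailed distribution over head, tail, and open classes with open-set novelty and is evaluated on a balanced test set.
Develop an integrated model that shares knowledge between head and tail classes through dynamic meta-embedding and separates tail classes from open classes using a calibrated embedding norm.
Curate large-scale OLTR benchmarks (ImageNet-LT, Places-LT, MS1M-LT) and demonstrate better performance than state-of-the-art baselines.
Provide end-to-end trainable components that scale to large datasets: memory-based transfer, a concept selector, reachability calibration, and modulated attention.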

Proposed Method

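Dynamic meta-embedding combines the direct image feature with a memory feature built from the discriminative centroids learned in the visual memory M.
The memory feature is formed as v^{memory} = o^{T}M, where o = T_{hal}(v^{direct}) and M holds the class centroids; the concept selector e = T_{sel}(v^{direct}) controls how much the memory contributes.
The meta-embedding is defined as v^{meta} = (1/γ) * (v^{direct} + e ⊗ v^{memory}), where γ is the reachability, i.e. the minimum distance from v^{direct} to the memory centroids.
Reachability calibration uses γ to separate open-set instances (large distance from the memory) from known classes (small distance): because v^{meta} is scaled by 1/γ, a large distance yields a small embedding norm.
Modulated attention (MA) applies conditional spatial attention to the self-attention map SA(f), strengthening head-tail discrimination through context selection: f^{att} = f + MA(f) ⊗ SA(f).
The cosine classifier uses a normalized meta-embedding and normalized weights, and applies a nonlinear squashing to v^{meta} to stabilize scaling.
The loss combines cross-entropy with a large-margin term on the memory centroids: L = Σ L_{CE}(v^{meta}, y) + λ L_{LM}(v^{meta}, {c_i}).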
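
The formulas above can be traced for a single sample with a minimal pure-Python sketch. It is an illustration under stated assumptions, not the authors' implementation: T_hal is assumed to be a linear layer followed by softmax, T_sel a linear layer followed by tanh, and the squashing is assumed to scale the unit direction by n^2 / (1 + n^2), where n is the norm of v^{meta}. The names `meta_embedding`, `cosine_logits`, `W_hal`, `W_sel`, and `eps` are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in v))


def meta_embedding(v_direct, centroids, W_hal, W_sel, eps=1e-8):
    """Return (v_meta, gamma) for one sample.

    v_direct  : backbone feature, length D
    centroids : K class centroids of length D (the visual memory M)
    W_hal     : K x D weights, hypothetical stand-in for T_hal
    W_sel     : D x D weights, hypothetical stand-in for T_sel
    """
    # o = T_hal(v_direct): coefficients over the K centroids (softmax assumed).
    o = softmax(matvec(W_hal, v_direct))
    # v_memory = o^T M: coefficient-weighted sum of the centroids.
    dim = len(v_direct)
    v_memory = [sum(o_k * c[d] for o_k, c in zip(o, centroids)) for d in range(dim)]
    # e = T_sel(v_direct): per-dimension gate on the memory (tanh assumed).
    e = [math.tanh(x) for x in matvec(W_sel, v_direct)]
    # gamma: reachability, the minimum distance from v_direct to any centroid.
    gamma = min(math.dist(v_direct, c) for c in centroids)
    # v_meta = (1 / gamma) * (v_direct + e * v_memory); eps avoids division by zero.
    v_meta = [(vd + e_d * vm) / (gamma + eps) for vd, e_d, vm in zip(v_direct, e, v_memory)]
    return v_meta, gamma


def cosine_logits(v_meta, class_weights):
    """Cosine classifier: squashed v_meta against unit-norm class weights."""
    n = norm(v_meta)
    if n == 0.0:
        return [0.0] * len(class_weights)
    # Assumed squashing: keep the direction, map the norm into [0, 1).
    scale = n * n / (1.0 + n * n)
    v_hat = [scale * x / n for x in v_meta]
    logits = []
    for w in class_weights:
        w_norm = norm(w)
        logits.append(sum(wi / w_norm * vi for wi, vi in zip(w, v_hat)))
    return logits
```

In this sketch a feature close to a centroid gets a small γ and therefore a large-norm v^{meta}, while a feature far from every centroid is scaled down, so its cosine logits shrink toward zero. That is the mechanism a threshold on classifier confidence could use to flag open-set inputs.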

Experimental Results

Research Questions

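RQ1: How can head, tail, and open classes be recognized together within one unified framework?
RQ2: Does sharing visual knowledge between head and tail classes improve tail robustness without hurting head accuracy?
RQ3: Can open-set novelty be detected and calibrated within the learned feature space rather than from the classifier output?
RQ4: Do the proposed components (dynamic meta-embedding, memory, reachability calibration, modulated attention) generalize across large-scale real-world long-tailed data (object images, scenes, faces)?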

Key Results

Backbone	Method	Many-shot (>100 images)	Medium-shot	Few-shot (<20 images)	Overall	Open-Set F-measure
ResNet-10	Plain Model [20]	>100	5-50	<20	20.9	0.295
ResNet-10	Lifted Loss [37]	>100	30.4	17.9	30.8	0.374
ResNet-10	Focal Loss [29]	>100	29.9	16	30.5	0.371
ResNet-10	Range Loss [64]	>100	30.3	17.6	30.7	0.373
ResNet-10	+ OpenMax [3]	>100	-	-	-	0.368
ResNet-10	FSLwF [15]	>100	22.1	15	28.4	0.347
ResNet-10	Ours	>100	35.1	18.5	35.6	0.474
ResNet-152	Plain Model [20]	>100	22.4	0.36	27.2	0.366
ResNet-152	Lifted Loss [37]	>100	35.4	24	35.2	0.459
ResNet-152	Focal Loss [29]	>100	34.8	22.4	34.6	0.453
ResNet-152	+ OpenMax [3]	>100	-	-	-	0.458
ResNet-152	FSLwF [15]	>100	29.9	29.5	34.9	0.375
ResNet-152	Ours	>100	37.0	25.3	35.9	0.464
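
The column headings can be made concrete with a short sketch of how a class might be assigned to a split and how an F-measure over open-set detection could be computed. This is a hedged illustration: the thresholds (more than 100 and fewer than 20 training images) are taken from the table, the F-measure is the standard harmonic mean of precision and recall applied to a binary "is this instance novel?" decision, and the paper's exact evaluation protocol may differ. The function names `shot_bucket` and `open_set_f_measure` are hypothetical.

```python
def shot_bucket(n_train, many_min=100, few_max=20):
    """Assign a class to a split by its number of training images."""
    if n_train > many_min:
        return "many"
    if n_train < few_max:
        return "few"
    return "medium"


def open_set_f_measure(true_open, pred_open):
    """F-measure for binary novelty detection.

    true_open : list of bools, True where the instance is from an open class
    pred_open : list of bools, True where the model rejected it as novel
    """
    pairs = list(zip(true_open, pred_open))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    if tp == 0:
        return 0.0  # no correct rejections: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Treating precision and recall symmetrically means a model cannot score well by rejecting everything as novel or by never rejecting anything.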

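OLTR with dynamic meta-embedding consistently improves on state-of-the-art baselines across the large-scale open long-tailed benchmarks (ImageNet-LT, Places-LT, MS1M-LT).
The memory feature and concept selector substantially improve tail-class performance, particularly in the medium-shot and few-shot splits.
Reachability calibration strengthens open-set discrimination, helping most on few-shot and one-shot identities while preserving many-shot performance.
Modulated attention encourages different spatial contexts for different classes, improving the separation between head and tail classes.
Experiments on MegaFace and SUN-LT show strong generalization to face and scene datasets, with notable gains on low-shot and zero-shot identities.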

This review was generated by AI and reviewed by a human editor.