QUICK REVIEW

[Paper Review] What Neural Networks Memorize and Why: Discovering the Long Tail via Influence Estimation

Vitaly Feldman, Chiyuan Zhang|arXiv (Cornell University)|2020. 08. 09.

Anomaly Detection Techniques and Applications | References: 32 | Citations: 95

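One-Line Summary

This paper empirically tests the long-tail theory of memorization by estimating memorization values and the influence of individual training examples. It shows that memorized examples contribute substantially to generalization and that many high-influence train-test pairs trace back to a single training example. It also analyzes how consistent the estimates are across architectures and at what depth of the network memorization occurs.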

ABSTRACT

Deep learning algorithms are well-known to have a propensity for fitting the training data very well and often fit even outliers and mislabeled data points. Such fitting requires memorization of training data labels, a phenomenon that has attracted significant research interest but has not been given a compelling explanation so far. A recent work of Feldman (2019) proposes a theoretical explanation for this phenomenon based on a combination of two insights. First, natural image and data distributions are (informally) known to be long-tailed, that is have a significant fraction of rare and atypical examples. Second, in a simple theoretical model such memorization is necessary for achieving close-to-optimal generalization error when the data distribution is long-tailed. However, no direct empirical evidence for this explanation or even an approach for obtaining such evidence were given. In this work we design experiments to test the key ideas in this theory. The experiments require estimation of the influence of each training example on the accuracy at each test example as well as memorization values of training examples. Estimating these quantities directly is computationally prohibitive but we show that closely-related subsampled influence and memorization values can be estimated much more efficiently. Our experiments demonstrate the significant benefits of memorization for generalization on several standard benchmarks. They also provide quantitative and visually compelling evidence for the theory put forth in (Feldman, 2019).

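Motivation and Objectives

Motivate and test the long-tail theory, which holds that memorization helps generalization when the data distribution contains many rare examples.
Develop efficient estimators of memorization and influence that remain feasible on large datasets.
Quantify the marginal utility of memorized examples and identify high-influence train-test pairs.
Evaluate how memorization and influence vary across architectures and datasets.
Investigate where inside the network's representation memorization mainly resides.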

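Proposed Method

Define the memorization of the i-th training example as the change in the probability that h(x_i) = y_i when that example is included in the training set versus left out (Eq. (1)).
To make estimation tractable, introduce a subsampled estimator mem_m defined over random training subsets of size m.
Define the subsampled influence infl_m, which estimates the average effect on accuracy at a test example of including versus excluding a training example.
Train models on random subsets of size m, then estimate memorization and influence by computing Pr(h_k(x) = y) conditioned on whether example i is in the subset or not.
Use selection thresholds theta_mem = 0.25 and theta_infl = 0.15 to identify high-memorization and high-influence pairs.
Estimate memorization and influence on ImageNet, CIFAR-100, and MNIST, with ResNet50 as the main architecture, and compare the marginal utility of memorized examples against a random-subset baseline.
Examine consistency across architectures and test the claim that memorization resides mainly in the deep representation rather than in the last layer.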
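
The subsampled estimators described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the 1-nearest-neighbour classifier `fit_1nn` is a hypothetical stand-in for training a deep network, and the function name `subsampled_estimates`, the toy data, and the values of `m` and `t` are all made up for the example. The only part taken from the review is the procedure itself: train models on random subsets of size m, then compare the accuracy of models whose subset contained example i with those whose subset did not.

```python
import numpy as np

def fit_1nn(X, y):
    """Hypothetical stand-in for the training algorithm: a 1-nearest-neighbour classifier."""
    def predict(Q):
        # Squared Euclidean distance from every query point to every training point.
        d = ((Q[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
        return y[d.argmin(axis=1)]
    return predict

def subsampled_estimates(X_tr, y_tr, X_te, y_te, m, t, seed=0):
    """Estimate mem_m (shape n) and infl_m (shape n x n_test) from t trained models."""
    rng = np.random.default_rng(seed)
    n = len(X_tr)
    included = np.zeros((t, n))             # included[k, i] = 1 if example i was in subset k
    correct_tr = np.zeros((t, n))           # correct_tr[k, i] = 1 if h_k(x_i) == y_i
    correct_te = np.zeros((t, len(X_te)))   # correct_te[k, j] = 1 if h_k(x'_j) == y'_j
    for k in range(t):
        idx = rng.choice(n, size=m, replace=False)  # random subset of size m
        included[k, idx] = 1.0
        h_k = fit_1nn(X_tr[idx], y_tr[idx])
        correct_tr[k] = h_k(X_tr) == y_tr
        correct_te[k] = h_k(X_te) == y_te
    excluded = 1.0 - included
    # Clamp the counts to avoid division by zero if an example was never in (or never out).
    n_in = np.maximum(included.sum(axis=0), 1.0)
    n_out = np.maximum(excluded.sum(axis=0), 1.0)
    # mem_m(i): accuracy on x_i when i was in the subset minus accuracy when it was not.
    mem = (included * correct_tr).sum(axis=0) / n_in - (excluded * correct_tr).sum(axis=0) / n_out
    # infl_m(i, j): same difference, measured on test example j.
    infl = (included.T @ correct_te) / n_in[:, None] - (excluded.T @ correct_te) / n_out[:, None]
    return mem, infl

# Toy data (made up): two tight clusters plus one atypical point labelled 1 near cluster 0.
cluster0 = np.array([[0.1 * i, 0.0] for i in range(10)])
cluster1 = np.array([[10.0 + 0.1 * i, 10.0] for i in range(10)])
outlier = np.array([[0.5, 3.0]])
X_tr = np.vstack([cluster0, cluster1, outlier])
y_tr = np.array([0] * 10 + [1] * 10 + [1])
X_te = np.array([[0.5, 3.2], [0.45, 0.1]])  # first test point sits next to the outlier
y_te = np.array([1, 0])

mem, infl = subsampled_estimates(X_tr, y_tr, X_te, y_te, m=14, t=200)
```

In this toy setting the atypical point can only be classified correctly when it is in the training subset, so its memorization estimate is high while the cluster points score near zero, and the test point next to it is influenced almost entirely by that one training example. The same t trained models are reused for every training example, which is what makes the subsampled estimator much cheaper than retraining once per example.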

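Experimental Results

Research Questions

RQ1: Does memorization contribute meaningfully to generalization on long-tailed data distributions, as proposed in Fel19 (Feldman, 2019)?
RQ2: Can subsampling produce accurate memorization and influence estimates efficiently enough for large datasets?
RQ3: Do memorized examples have greater marginal utility for test accuracy than randomly selected examples?
RQ4: Are high-influence train-test pairs concentrated around single training examples, and are they visually interpretable?
RQ5: Across architectures, where in the network does memorization mainly reside (last layer versus representation)?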

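Key Results

CIFAR-100 and ImageNet contain a substantial fraction of memorized examples: some memorization estimates reach 0.3 or higher, and removing those examples shows that they carry notable marginal utility.
Memorized examples have higher marginal utility than random subsets of the same size, so their contribution goes beyond what the reduction in sample count alone would explain.
High-influence train-test pairs are numerous (e.g., 1641 pairs on ImageNet), and many test examples are influenced by a single training example (1298 test examples).
High-influence pairs are often visually interpretable, reflecting meaningful similarity or near-duplication, which highlights subpopulations in the long tail.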
Removing memorized examples lowers test accuracy. On CIFAR-100, when the memorized set is removed, the accuracy drop is attributed more to the high-influence portion (2.38%) than to the set as a whole.
Most memorization appears to occur in the deep representation rather than in the final layer: training only a linear classifier on top of a fixed representation has a limited effect.
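
The pair-selection step behind the high-influence counts above can be illustrated as follows. This is a sketch under assumptions: the thresholds 0.25 and 0.15 come from the review, but the rule for combining them (keep a pair when both the training example's memorization estimate and the pair's influence estimate meet their thresholds), the function name `high_influence_pairs`, and the small arrays are hypothetical and only serve to show the bookkeeping.

```python
import numpy as np

THETA_MEM = 0.25   # memorization threshold quoted in the review
THETA_INFL = 0.15  # influence threshold quoted in the review

def high_influence_pairs(mem, infl, theta_mem=THETA_MEM, theta_infl=THETA_INFL):
    """Return selected (train, test) index pairs and the number of test
    examples that are selected with exactly one training example.

    Assumption: a pair is kept when both estimates meet their thresholds.
    """
    mask = (infl >= theta_infl) & (mem[:, None] >= theta_mem)
    train_idx, test_idx = np.nonzero(mask)
    pairs = list(zip(train_idx.tolist(), test_idx.tolist()))
    per_test = mask.sum(axis=0)  # number of high-influence training examples per test example
    single_source_tests = int((per_test == 1).sum())
    return pairs, single_source_tests

# Made-up estimates for 3 training examples and 3 test examples.
mem = np.array([0.9, 0.1, 0.5])
infl = np.array([
    [0.4, 0.0, 0.2],
    [0.3, 0.0, 0.0],
    [0.0, 0.0, 0.2],
])

pairs, single_source_tests = high_influence_pairs(mem, infl)
```

Here the second training example is dropped despite its influence of 0.3 because its memorization estimate falls below the threshold, and only the first test example ends up paired with exactly one training example.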

This review was generated by AI and reviewed by a human editor.