QUICK REVIEW

[Paper Review] Scaling Laws Do Not Scale

Fernando Díaz, Michael Madaio | arXiv (Cornell University) | 2023. 07. 05.

Opinion Dynamics and Social Influence | Citations: 12

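One-Line Summary

This paper argues that scaling laws, which link model performance to data or parameter scale, are fragile when models are evaluated across diverse human populations, owing to metric fragility, subgroup divergence, and sociotechnical dynamics. It proposes augmenting scaling analyses with evaluation data size as a third axis to capture shifts in population composition, and warns that larger datasets may not improve performance for every community.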

ABSTRACT

Recent work has advocated for training AI models on ever-larger datasets, arguing that as the size of a dataset increases, the performance of a model trained on that dataset will correspondingly increase (referred to as "scaling laws"). In this paper, we draw on literature from the social sciences and machine learning to critically interrogate these claims. We argue that this scaling law relationship depends on metrics used to measure performance that may not correspond with how different groups of people perceive the quality of models' output. As the size of datasets used to train large AI models grows and AI systems impact ever larger groups of people, the number of distinct communities represented in training or evaluation datasets grows. It is thus even more likely that communities represented in datasets may have values or preferences not reflected in (or at odds with) the metrics used to evaluate model performance in scaling laws. Different communities may also have values in tension with each other, leading to difficult, potentially irreconcilable choices about metrics used for model evaluations -- threatening the validity of claims that model performance is improving at scale. We end the paper with implications for AI development: that the motivation for scraping ever-larger datasets may be based on fundamentally flawed assumptions about model performance. That is, models may not, in fact, continue to improve as the datasets get larger -- at least not for all people or communities impacted by those models. We suggest opportunities for the field to rethink norms and values in AI development, resisting claims for universality of large models, fostering more local, small-scale designs, and other ways to resist the impetus towards scale in AI.

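Motivation and Objectives

Questions whether scaling laws reliably predict performance across diverse communities as datasets grow.
Emphasizes that evaluation metrics are proxies for latent constructs and may be contested or unstable across populations.
Argues that increasing the evaluation set size changes its composition, introducing subgroups with different metric preferences.
Proposes adding an evaluation data size axis to scaling laws to reflect dynamic population composition.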

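Proposed Method

Reviews evaluation metric theory and measurement modeling to define the latent construct and its proxy (μ*, μ).
Analyzes how scaling laws use training data size to estimate performance through the proxy μ(U, π(D)).
Discusses metric disagreement, non-stationarity, staging, subtasks, and metric power in the context of scaling laws.
Argues that larger evaluation datasets can increase subgroup diversity and thereby break the validity of a universal metric.
Proposes adding evaluation data size as a third axis in scaling law analyses to capture changes in population composition.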
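
To make the third-axis proposal concrete, the following is a minimal sketch, not the authors' method. It assumes hypothetical per-subgroup metric curves (`curves`), a toy rule for how the evaluated population mix changes with evaluation set size (`eval_mix_for_size`), and a population-weighted average as the aggregate metric. All names and numbers are illustrative only.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical per-subgroup metric curves: training set size -> metric value.
# These shapes are assumptions for illustration, not results from the paper.
curves: Dict[str, Callable[[float], float]] = {
    "A": lambda n: 1.0 - n ** -0.5,  # subgroup whose metric improves with more data
    "B": lambda n: 0.5,              # subgroup whose metric does not improve
}


def eval_mix_for_size(eval_size: int) -> Dict[str, float]:
    """Toy assumption: subgroup B only appears once the evaluation set is large."""
    if eval_size <= 1000:
        return {"A": 1.0}
    return {"A": 0.7, "B": 0.3}


def aggregate_metric(
    train_size: float,
    eval_mix: Dict[str, float],
    group_curves: Dict[str, Callable[[float], float]],
) -> float:
    """Population-weighted average of the per-subgroup metric values."""
    total = sum(eval_mix.values())
    return sum(
        (weight / total) * group_curves[group](train_size)
        for group, weight in eval_mix.items()
    )


def scaling_surface(
    train_sizes: List[int], eval_sizes: List[int]
) -> List[Tuple[int, int, float]]:
    """Aggregate metric on a (training size, evaluation size) grid."""
    return [
        (n, m, aggregate_metric(n, eval_mix_for_size(m), curves))
        for n in train_sizes
        for m in eval_sizes
    ]


if __name__ == "__main__":
    for n, m, value in scaling_surface([100, 10_000], [1_000, 100_000]):
        print(f"train={n:>6} eval={m:>7} aggregate={value:.3f}")
```

Under these toy assumptions, reading the grid along the evaluation-size axis shows the aggregate dropping once subgroup B enters the mix, even at a fixed training size. A plot over training size alone would hide that shift.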

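Results

Research Questions

RQ1: Do evaluation metrics faithfully reflect latent performance constructs across diverse populations?
RQ2: How does increasing evaluation dataset size affect subgroup composition and the validity of scaling laws?
RQ3: Can a single universal metric adequately capture model quality for all communities served by large AI systems?
RQ4: Should scaling law analyses include an evaluation data size axis to account for sociotechnical change over time?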

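Key Findings

Evaluation metrics are unstable proxies for the latent construct μ* and may disagree across subgroups.
As evaluation data size grows, the number of represented subgroups tends to increase, complicating the interpretation of metrics.
Different communities may hold conflicting notions of "good" performance, leading to metric divergence and misalignment with the outcomes users value.
Metrics are affected by non-stationarity and by staging across tasks, and are strongly shaped by sociotechnical context, which weakens universal scaling laws.
Large training datasets may not yield universal performance gains when models are deployed to a globally diverse user base.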
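
One way to check for the metric divergence described above is to count how often two community-preferred metrics order the same models differently. The sketch below is illustrative only: the function name `discordant_fraction` and the scores are hypothetical and not taken from the paper.

```python
from itertools import combinations
from typing import Sequence


def discordant_fraction(scores_a: Sequence[float], scores_b: Sequence[float]) -> float:
    """Fraction of model pairs that the two metrics rank in opposite order."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long score lists with at least two models")
    pairs = list(combinations(range(len(scores_a)), 2))
    discordant = sum(
        1
        for i, j in pairs
        if (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j]) < 0
    )
    return discordant / len(pairs)


if __name__ == "__main__":
    # Hypothetical scores for three models trained on increasing amounts of data,
    # measured with the metric each community prefers.
    metric_community_1 = [0.60, 0.70, 0.80]
    metric_community_2 = [0.75, 0.72, 0.65]
    print(discordant_fraction(metric_community_1, metric_community_2))
```

A value near 0 means the two metrics agree on which models are better. A value near 1 means that what counts as improvement at scale for one community counts as regression for the other.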

This review was generated by AI and reviewed by a human editor.