QUICK REVIEW

[Paper Review] Configuration-to-Performance Scaling Law with Neural Ansatz

Huaqing Zhang, Kaiyue Wen|arXiv (Cornell University)|2026. 02. 10.

Machine Learning in Materials Science|Citations: 0

One-Line Summary

NCPL uses a fine-tuned language model to map full pretraining configurations to training outcomes. It accurately predicts final loss and loss curves, and it helps tune hyperparameters jointly under constraints.

ABSTRACT

Researchers build scaling laws to forecast the training performance of expensive large-scale runs with larger model size N and data size D. These laws assume that other training hyperparameters are optimally chosen, which can require significant effort and, in some cases, be impossible due to external hardware constraints. To improve predictability across a broader set of hyperparameters and enable simpler tuning at scale, we propose learning a Configuration-to-Performance Scaling Law (CPL): a mapping from the full training configuration to training performance. Because no simple functional form can express this mapping, we parameterize it with a large language model (LLM), and fit it with diverse open-source pretraining logs across multiple sources, yielding a Neural Configuration-to-Performance Scaling Law (NCPL). NCPL accurately predicts how training configurations influence the final pretraining loss, achieving 20-40% lower prediction error than the configuration-agnostic Chinchilla law and generalizing to runs using up to 10× more compute than any run in the training set. It further supports joint tuning of multiple hyperparameters with performance comparable to hyperparameter scaling law baselines. Finally, NCPL naturally and effectively extends to richer prediction targets such as loss-curve prediction.

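Research Motivation and Objectives

Motivate the need to predict pretraining performance across diverse hyperparameters without tuning every parameter.
Propose a neural (LLM-based) Configuration-to-Performance Scaling Law (CPL) that maps configurations to performance.
Demonstrate that NCPL predicts final loss and loss curves and enables joint hyperparameter optimization.
Demonstrate generalization to model sizes outside the training set (out-of-distribution) and extrapolation to larger compute.
Highlight the benefits and limitations of using open-source logs and foundation models for CPL.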

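Proposed Method

Fine-tune a pretrained language model as a regressor to parameterize f_theta, mapping the full training configuration C to performance P.
Use input features including the source, architecture (N, layers, heads, hidden dimension), data size D, optimizer, and hyperparameters.
Predict the residual relative to the Chinchilla-law baseline ell_chinchilla(N, D) and train with MSE on the residual target.
Apply a two-stage fine-tuning scheme: Stage 1 updates the numeric-field encoder and the head, and Stage 2 fine-tunes all parameters.
Predict as targets (i) the final pretraining loss and (ii) intermediate losses used to reconstruct the loss curve.
Evaluate on in-distribution (ID) and out-of-distribution (OOD) splits using the Marin and StepLaw datasets.
Compare NCPL against XGBoost and Chinchilla-law baselines, and run ablations on backbone size and on fine-tuning versus training from scratch.
Demonstrate hyperparameter selection by sweeping configurations, and compare against a power-law baseline.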
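The residual target and the sweep-based hyperparameter selection in the method can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the Chinchilla-style form E + A/N^alpha + B/D^beta uses the constants reported in the original Chinchilla fit (Hoffmann et al., 2022), not the baseline fitted in the reviewed paper, and `toy_residual_predictor` is a hypothetical stand-in for the fine-tuned LLM regressor f_theta, not the authors' model.

```python
import math
from itertools import product

# Assumption: constants from the original Chinchilla fit, used only as placeholders.
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28


def chinchilla_loss(n_params, n_tokens):
    """Configuration-agnostic baseline: depends only on model size N and data size D."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA


def residual_target(observed_loss, n_params, n_tokens):
    """Regression target: observed final loss minus the Chinchilla baseline."""
    return observed_loss - chinchilla_loss(n_params, n_tokens)


def mse(preds, targets):
    """Mean squared error between predicted and true residuals."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)


def toy_residual_predictor(config):
    """Hypothetical stand-in for f_theta: a bowl-shaped penalty over lr and weight decay."""
    lr_penalty = 0.05 * (math.log10(config["lr"]) + 3.0) ** 2
    wd_penalty = 0.5 * (config["weight_decay"] - 0.1) ** 2
    return lr_penalty + wd_penalty


def predict_loss(config):
    """Predicted final loss = baseline(N, D) + predicted residual."""
    return chinchilla_loss(config["N"], config["D"]) + toy_residual_predictor(config)


def select_best(n_params, n_tokens, lrs, weight_decays):
    """Sweep a grid of configurations and return the one with the lowest predicted loss."""
    candidates = [
        {"N": n_params, "D": n_tokens, "lr": lr, "weight_decay": wd}
        for lr, wd in product(lrs, weight_decays)
    ]
    return min(candidates, key=predict_loss)


if __name__ == "__main__":
    best = select_best(1e8, 2e9, [1e-4, 1e-3, 1e-2], [0.0, 0.1, 0.2])
    print(best)
```

Predicting a residual rather than the raw loss lets the regressor focus on what the configuration changes beyond N and D, since the baseline already captures the coarse scaling trend.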

Figure 1 : An Overview of NCPL’s Performance Across Tasks. We split the collected pretraining logs by the model size. In-distribution (ID) means the model size is within the range of the model size in the training set used for NCPL and out-of-distribution (OOD) means the model size is larger. Left:

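Experimental Results

Research Questions

RQ1: Can a neural network, specifically a fine-tuned language model, learn the mapping from the full training configuration to pretraining performance (C → P)?
RQ2: Does NCPL improve prediction accuracy over configuration-agnostic scaling laws (e.g., Chinchilla) when predicting final loss and loss curves?
RQ3: Does NCPL enable joint hyperparameter tuning, and does it outperform hand-designed hyperparameter scaling baselines in ID and OOD settings?
RQ4: How well does NCPL generalize to compute budgets larger than those in the training set, and how well does it extend to richer targets such as loss curves?
RQ5: In what ways can NCPL discover interactions between hyperparameters (e.g., optimizer and weight decay) from open-source logs?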

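Key Results

NCPL achieves lower prediction error and higher rank correlation than the Chinchilla baseline on both ID and OOD data for final-loss prediction.
NCPL enables joint tuning of hyperparameters, with performance comparable to dedicated hyperparameter scaling laws.
NCPL can predict the entire loss curve, not just the final loss, and works across optimization algorithms and hyperparameter settings.
NCPL qualitatively learns nonlinear interactions such as optimizer-specific weight-decay effects.
Fine-tuning a foundation-model-based regressor yields stronger results on diverse, heterogeneous configurations than random initialization or non-neural baselines.
NCPL generalizes to OOD runs that use up to 10× more compute than any run in the training set.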
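The two evaluation quantities named in these results, prediction error and rank correlation, can be computed as in the sketch below. The review does not say which exact metrics the paper uses, so mean absolute error and Spearman rank correlation are assumptions here, and the loss values in the usage example are made up for illustration.

```python
import math


def mean_absolute_error(pred, true):
    """Average absolute gap between predicted and ground-truth final losses."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(pred, true):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(pred), average_ranks(true))


if __name__ == "__main__":
    true_loss = [3.10, 2.95, 2.80, 2.70]  # made-up ground-truth final losses
    pred_loss = [3.05, 2.97, 2.85, 2.60]  # made-up predictions
    print(mean_absolute_error(pred_loss, true_loss), spearman(pred_loss, true_loss))
```

Rank correlation matters for hyperparameter selection because choosing the best configuration only requires ordering the candidates correctly, even when the absolute loss predictions are slightly off.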

Figure 3 : Predicted loss vs. ground-truth loss. Each point visualizes the predicted vs. ground-truth final pretraining loss of an individual run from the Marin dataset (for StepLaw dataset, see Figure 1 left). NCPL uses the full training configuration as input, whereas the Chinchilla law only dep

This review was generated by AI and reviewed by a human editor.