QUICK REVIEW

[Paper Review] Subliminal Effects in Your Data: A General Mechanism via Log-Linearity

Ishaq Aden-Ali, Noah Golowich|arXiv (Cornell University)|2026. 02. 04.

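Topic Modeling | Citations: 0

One-Line Summary

The paper presents Logit-Linear Selection (LLS), a method that uses a log-linear framework to extract a subset of a preference dataset so that system-prompt-like behavior is subliminally transferred to student models across architectures.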

ABSTRACT

Training modern large language models (LLMs) has become a veritable smorgasbord of algorithms and datasets designed to elicit particular behaviors, making it critical to develop techniques to understand the effects of datasets on the model's properties. This is exacerbated by recent experiments that show datasets can transmit signals that are not directly observable from individual datapoints, posing a conceptual challenge for dataset-centric understandings of LLM training and suggesting a missing fundamental account of such phenomena. Towards understanding such effects, inspired by recent work on the linear structure of LLMs, we uncover a general mechanism through which hidden subtexts can arise in generic datasets. We introduce Logit-Linear-Selection (LLS), a method that prescribes how to select subsets of a generic preference dataset to elicit a wide range of hidden effects. We apply LLS to discover subsets of real-world datasets so that models trained on them exhibit behaviors ranging from having specific preferences, to responding to prompts in a different language not present in the dataset, to taking on a different persona. Crucially, the effect persists for the selected subset, across models with varying architectures, supporting its generality and universality.

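Research Motivation and Goals

Understand how data shapes downstream model behavior beyond what is observable in individual data points.
Introduce a general, mathematically principled mechanism (LLS) for eliciting hidden effects from datasets.
Demonstrate that subliminal transfer persists across diverse model architectures and teacher-student pairs.
Show that filtering real-world preference data with LLS can induce system-prompt-like traits without any prompt at inference time.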

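Proposed Method

Adopts a log-linear abstraction in which a language model's log-probabilities are approximately linear in an embedding space.
Defines a preference dataset and the DPO (Direct Preference Optimization) loss used to fine-tune a model on the selected data.
Proposes LLS, which scores each example by how much the target system prompt would shift the teacher model's preference, then keeps the top-scoring γ fraction of examples.
Fine-tuning a student model with DPO on the LLS-filtered subset yields a model that behaves as if the system prompt were active at inference time.
Provides a theoretical basis (Theorem 2.2) showing, under a linear-representation assumption, a correlation between the original logit differences and the logit differences induced by the system prompt.
Validates the method empirically across several model pairs and tasks (e.g., targeted preferences, language translation, persona-like behavior).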
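The selection step described above can be sketched in a few lines. This is a minimal illustration and not the authors' implementation. It assumes the teacher's log-probabilities for each chosen and rejected response have already been computed, once with the target system prompt and once without. The function names (lls_scores, select_top_fraction, dpo_example_loss), the exact score definition, and the default beta are assumptions made for illustration. Only the per-example DPO loss follows the standard published form.

```python
import math
from typing import List, Sequence


def lls_scores(
    lp_sys_chosen: Sequence[float],
    lp_sys_rejected: Sequence[float],
    lp_base_chosen: Sequence[float],
    lp_base_rejected: Sequence[float],
) -> List[float]:
    """Assumed score: the teacher's preference margin (chosen minus rejected
    log-probability) with the system prompt, minus the margin without it."""
    return [
        (sc - sr) - (bc - br)
        for sc, sr, bc, br in zip(
            lp_sys_chosen, lp_sys_rejected, lp_base_chosen, lp_base_rejected
        )
    ]


def select_top_fraction(scores: Sequence[float], gamma: float) -> List[int]:
    """Return the indices of the top-scoring gamma fraction, in dataset order."""
    if not 0.0 < gamma <= 1.0:
        raise ValueError("gamma must be in (0, 1]")
    k = max(1, math.ceil(gamma * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def dpo_example_loss(
    lp_policy_chosen: float,
    lp_policy_rejected: float,
    lp_ref_chosen: float,
    lp_ref_rejected: float,
    beta: float = 0.1,  # hypothetical default, not taken from the paper
) -> float:
    """Standard per-example DPO loss: -log(sigmoid(beta * margin))."""
    z = beta * (
        (lp_policy_chosen - lp_ref_chosen) - (lp_policy_rejected - lp_ref_rejected)
    )
    # Stable softplus(-z), which equals -log(sigmoid(z)).
    return math.log1p(math.exp(-z)) if z >= 0 else -z + math.log1p(math.exp(z))


# Toy log-probabilities for four preference pairs (made-up numbers).
scores = lls_scores(
    [-1.0, -2.0, -3.0, -1.5],
    [-2.0, -2.0, -1.0, -3.0],
    [-1.5, -2.0, -2.0, -2.5],
    [-1.5, -1.5, -2.0, -2.5],
)
kept = select_top_fraction(scores, gamma=0.5)
print(scores, kept)
```

The student would then be fine-tuned with DPO on only the pairs indexed by kept. The sketch works on precomputed numbers so that it stays independent of any particular model library.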

(a) Depiction of Logit-Linear Selection ( LLS ). The original preference dataset does not contain Spanish. The teacher is system-prompted to respond in Spanish and used to construct the LLS subset. The student fine-tuned on the LLS subset responds in Spanish.

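Experimental Results

Research Questions

RQ1: Can a general, data-driven mechanism produce subliminal effects across diverse model architectures and tasks?
RQ2: Does log-linearity allow small per-datapoint correlations to aggregate into robust downstream behavior?
RQ3: Can filtering real-world preference data surface and transfer hidden system-prompt-like traits without a prompt at inference time?
RQ4: Is subliminal transfer stronger when the teacher and base models match, and does it generalize across model families?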

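Key Results

LLS subliminally transfers system-prompt traits (e.g., language, persona) to student models without requiring a system prompt at inference time.
The correlation between logit-difference vectors before and after fine-tuning remains positive in every experiment, supporting the theoretical framework; in certain settings the reported correlation is about 0.5.
The subliminal effect persists across different student architectures and teacher-student combinations, suggesting the mechanism is universal.
As an illustrative result, a subset of preference data containing no Spanish examples still causes the model to respond in Spanish, and the effect generalizes to multiple languages.
Empirical measurements on the tulu2.5 dataset show shifts in behaviors such as animal preference and translation direction, with transfer strength depending on the model pairing.
The work ties this mechanism to a formal log-linearity theorem (Theorem 2.2) and provides supporting visualizations such as the Fig. 19 projections and corpus-based experiments.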
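To make the correlation claim concrete, here is a minimal sketch of a Pearson correlation between two logit-difference vectors. The source does not say which correlation estimator the paper uses, so Pearson is an assumption, and the function name pearson and the toy vectors are hypothetical.

```python
import math
from typing import Sequence


def pearson(u: Sequence[float], v: Sequence[float]) -> float:
    """Pearson correlation between two equal-length vectors."""
    if len(u) != len(v) or len(u) < 2:
        raise ValueError("need two vectors of equal length >= 2")
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    if var_u == 0.0 or var_v == 0.0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_u * var_v)


# Made-up logit differences over the same four preference pairs.
diff_a = [0.2, -0.1, 0.4, 0.0]
diff_b = [0.5, -0.3, 0.1, 0.2]
print(pearson(diff_a, diff_b))
```

A positive value means the two vectors tend to favor the same responses, which is the qualitative pattern reported in the results above.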
Animal      OLMo → OLMo   Qwen → OLMo
Owl         0.537         0.113
Dog         0.565         0.049
Cat         0.569         0.026
Lion        0.539         0.139
Tiger       0.550         0.062
Bear        0.531         0.062
Wolf        0.543         0.124
Fox         0.474         0.106
Elephant    0.562         0.065
Giraffe     0.553         0.084

Figure 2 : Mean counts of animal mentions when ${\mathsf{M}}_{\mathsf{T}}={\mathsf{M}}_{\mathsf{S}}$ are both Olmo2-7B-Instruct . For all examples the blue bars are essentially invisible as the base model ${\mathsf{M}}_{\mathsf{S}}$ (before fine-tuning) rarely mentions the animal without the system

This review was generated by AI and reviewed by a human editor.