QUICK REVIEW

[Paper Digest] Subliminal Effects in Your Data: A General Mechanism via Log-Linearity

Ishaq Aden-Ali, Noah Golowich | arXiv (Cornell University) | Feb 4, 2026

Topic Modeling | Cited by 0

One-Sentence Summary

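The paper proposes Logit-Linear Selection (LLS), a method grounded in a log-linear framework for selecting subsets of a preference dataset, so that system-prompt-like behaviors transfer subliminally to a student model across architectures.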

ABSTRACT

Training modern large language models (LLMs) has become a veritable smorgasbord of algorithms and datasets designed to elicit particular behaviors, making it critical to develop techniques to understand the effects of datasets on the model's properties. This is exacerbated by recent experiments that show datasets can transmit signals that are not directly observable from individual datapoints, posing a conceptual challenge for dataset-centric understandings of LLM training and suggesting a missing fundamental account of such phenomena. Towards understanding such effects, inspired by recent work on the linear structure of LLMs, we uncover a general mechanism through which hidden subtexts can arise in generic datasets. We introduce Logit-Linear-Selection (LLS), a method that prescribes how to select subsets of a generic preference dataset to elicit a wide range of hidden effects. We apply LLS to discover subsets of real-world datasets so that models trained on them exhibit behaviors ranging from having specific preferences, to responding to prompts in a different language not present in the dataset, to taking on a different persona. Crucially, the effect persists for the selected subset, across models with varying architectures, supporting its generality and universality.

Motivation and Objectives

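Understand how data shapes downstream model behavior beyond what is observable in individual datapoints, and the mechanism behind this
Introduce a general, mathematically principled mechanism (LLS) for eliciting hidden effects from datasets
Show that subliminal transfer holds across different model architectures and teacher-student pairs
Demonstrate that filtering real-world preference data with LLS can induce system-prompt-like traits without using a prompt at inference time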

Proposed Method

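Adopt a log-linear language model abstraction, in which log-probabilities are approximately linear in embedding space
Define preference datasets and the DPO (Direct Preference Optimization) loss, and fine-tune models on the selected data
Propose Logit-Linear Selection (LLS): score each datapoint by how much a target system prompt shifts the teacher model's preference, then select the top-scoring gamma-quantile subset
Fine-tune the student model with DPO on the LLS-filtered subset so that it behaves at inference time as if it had been system-prompted
Provide a theoretical basis (Theorem 2.2) showing, under a linear representation assumption, a correlation between the original contrasts and the log-differences induced by the system prompt
Validate empirically across multiple model pairs and tasks (e.g., target preferences, language translation, persona-like behavior)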
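The LLS scoring and selection step can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the exact score is an assumption (here, the change in the teacher's chosen-versus-rejected log-margin when the system prompt is added), and `PreferencePair`, `LogProbFn`, `lls_score`, `lls_select`, and `toy_teacher_logprob` are hypothetical names. The toy teacher is a stand-in for a real model's log-probabilities.

```python
import math
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str


# Hypothetical interface: (system_prompt, prompt, response) -> log-probability
# of the response under the teacher. An empty system_prompt means "no system prompt".
LogProbFn = Callable[[str, str, str], float]


def lls_score(pair: PreferencePair, logprob: LogProbFn, system_prompt: str) -> float:
    """Assumed score: shift in the teacher's chosen-vs-rejected log-margin
    caused by adding the system prompt."""
    margin_with = (logprob(system_prompt, pair.prompt, pair.chosen)
                   - logprob(system_prompt, pair.prompt, pair.rejected))
    margin_without = (logprob("", pair.prompt, pair.chosen)
                      - logprob("", pair.prompt, pair.rejected))
    return margin_with - margin_without


def lls_select(dataset: List[PreferencePair], logprob: LogProbFn,
               system_prompt: str, gamma: float) -> List[PreferencePair]:
    """Keep the top gamma fraction of the dataset, ranked by LLS score."""
    if not 0.0 < gamma <= 1.0:
        raise ValueError("gamma must be in (0, 1]")
    if not dataset:
        return []
    ranked = sorted(dataset,
                    key=lambda p: lls_score(p, logprob, system_prompt),
                    reverse=True)
    keep = max(1, math.ceil(gamma * len(ranked)))
    return ranked[:keep]


def toy_teacher_logprob(system_prompt: str, prompt: str, response: str) -> float:
    """Stand-in for a real teacher model (not a language model): shorter
    responses score higher, and any system prompt boosts owl mentions."""
    score = -0.1 * len(response.split())
    if system_prompt and "owl" in response.lower():
        score += 2.0
    return score


pairs = [
    PreferencePair("Name a bird.", "An owl.", "A duck."),
    PreferencePair("Name a bird.", "A duck.", "An owl."),
    PreferencePair("Say hi.", "Hello there.", "Hi."),
    PreferencePair("Count.", "One two three.", "One."),
]
subset = lls_select(pairs, toy_teacher_logprob, "You love owls.", gamma=0.25)
```

With a real teacher, `logprob` would sum token log-probabilities of the response; whether to length-normalize that sum is a design choice this sketch leaves open.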

(a) Depiction of Logit-Linear Selection (LLS). The original preference dataset does not contain Spanish. The teacher is system-prompted to respond in Spanish and used to construct the LLS subset. The student fine-tuned on the LLS subset responds in Spanish.
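The student fine-tuning step in this pipeline uses the DPO loss. As a rough per-example sketch of the standard DPO objective (not code from the paper), the function below takes sequence log-probabilities under the policy being trained and under a frozen reference model. The function names and the default `beta=0.1` are assumptions chosen for illustration.

```python
import math


def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    if z > 0:
        return z + math.log1p(math.exp(-z))
    return math.log1p(math.exp(z))


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard per-example DPO loss: -log sigmoid(beta * (chosen log-ratio
    minus rejected log-ratio)), where each log-ratio is policy vs. reference.
    beta=0.1 is an illustrative default, not a value taken from the paper."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(m)) equals softplus(-m)
    return softplus(-margin)


# When the policy equals the reference, the margin is zero and the loss is log 2.
loss_at_init = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# Raising the chosen response's log-probability relative to the reference lowers the loss.
loss_improved = dpo_loss(-4.0, -8.0, -5.0, -7.0)
```

In practice the loss would be averaged over a batch drawn from the LLS-selected subset and minimized with a gradient-based optimizer.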

Experimental Results

Research Questions

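RQ1: Can a general data-driven mechanism produce subliminal effects across diverse model architectures and tasks?
RQ2: Can log-linearity aggregate small per-datapoint correlations into robust downstream behavior?
RQ3: Can real-world preference data be filtered to reveal and transfer implicit system-prompt-like traits without an explicit prompt at inference time?
RQ4: Is subliminal transfer stronger when the teacher model matches the base model, and does it generalize across model families?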

Key Findings

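LLS can subliminally transfer system-prompt traits (such as language or persona) to a student model without any system prompt at inference time
The correlation between log-difference vectors before and after fine-tuning stays positive across experiments (around 0.5 in some settings), supporting the theoretical framework
The subliminal effect persists across different student architectures and teacher-student combinations, indicating that the mechanism is general
One example result shows that a subset of the preference dataset, even though it contains no Spanish examples, can make the model respond in Spanish, and the effect extends to multiple languages
Empirical measurements on the tulu2.5 dataset show measurable behavior changes, such as animal preference and translation direction; transfer strength depends on the model pairing
The work ties the mechanism to a formal log-linearity theorem (Theorem 2.2) and provides supporting visualizations (e.g., the projections in Fig. 19) and corpus-based experiments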
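A correlation between log-difference vectors, as in the findings above, could be measured along these lines. This is a sketch under assumptions: the summary does not specify exactly which two vectors the paper correlates, so the inputs here are simply two per-datapoint vectors of log-differences, and `pearson`, `teacher_shift`, and `student_shift` are hypothetical names with made-up toy values.

```python
import math
from typing import Sequence


def pearson(u: Sequence[float], v: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length vectors."""
    if len(u) != len(v) or len(u) < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    if var_u == 0.0 or var_v == 0.0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_u * var_v)


# Toy values only (not results from the paper): one entry per datapoint.
# teacher_shift: change in the teacher's log-margin caused by the system prompt.
# student_shift: change in the student's log-margin caused by fine-tuning.
teacher_shift = [0.8, -0.2, 0.5, 0.1]
student_shift = [0.5, 0.1, 0.2, -0.1]
r = pearson(teacher_shift, student_shift)
```

A positive `r` would indicate that datapoints on which the system prompt moves the teacher's preference more are also those on which fine-tuning moves the student's preference more.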

Figure 2: Mean counts of animal mentions when ${\mathsf{M}}_{\mathsf{T}}={\mathsf{M}}_{\mathsf{S}}$ are both Olmo2-7B-Instruct. For all examples the blue bars are essentially invisible as the base model ${\mathsf{M}}_{\mathsf{S}}$ (before fine-tuning) rarely mentions the animal without the system

This digest was generated by AI and reviewed by human editors.