QUICK REVIEW

[Paper Digest] How Do Decoder-Only LLMs Perceive Users? Rethinking Attention Masking for User Representation Learning

Jiahao Yuan, Yike Xu|arXiv (Cornell University)|Feb 11, 2026

Domain Adaptation and Few-Shot Learning|Cited by 0

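One-Sentence Summary

This paper studies how different attention masking strategies (causal, hybrid, and bidirectional) affect the user representations learned by decoder-only large language models within a unified contrastive learning framework, and proposes Gradient-Guided Soft Masking to improve the transition from causal to bidirectional attention.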

ABSTRACT

Decoder-only large language models are increasingly used as behavioral encoders for user representation learning, yet the impact of attention masking on the quality of user embeddings remains underexplored. In this work, we conduct a systematic study of causal, hybrid, and bidirectional attention masks within a unified contrastive learning framework trained on large-scale real-world Alipay data that integrates long-horizon heterogeneous user behaviors. To improve training dynamics when transitioning from causal to bidirectional attention, we propose Gradient-Guided Soft Masking, a gradient-based pre-warmup applied before a linear scheduler that gradually opens future attention during optimization. Evaluated on 9 industrial user cognition benchmarks covering prediction, preference, and marketing sensitivity tasks, our approach consistently yields more stable training and higher-quality bidirectional representations compared with causal, hybrid, and scheduler-only baselines, while remaining compatible with decoder pretraining. Overall, our findings highlight the importance of masking design and training transition in adapting decoder-only LLMs for effective user representation learning. Our code is available at https://github.com/JhCircle/Deepfind-GGSM.

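Motivation and Objectives

Investigate how causal, hybrid, and bidirectional attention masks affect decoder-only LLMs when they learn user representations from real-world data.
Evaluate, within a unified contrastive learning framework, how each masking strategy affects training stability and embedding quality.
Propose Gradient-Guided Soft Masking (GG-SM) to stabilize the causal-to-bidirectional transition and improve bidirectional representations.
Demonstrate the effectiveness of GG-SM on 9 industrial user cognition benchmarks built on Alipay data.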

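Proposed Method

A unified contrastive learning framework for decoder-only LLMs that covers three masking regimes (causal, hybrid, and bidirectional).
Gradient-Guided Soft Masking (GG-SM): a gradient-based pre-warmup, run before the linear scheduler that gradually opens future attention, which informs the weights placed on future attention.
Two sources of training data for the embeddings: (i) rule-based behavioral trajectories that link historical sequences to future behaviors; (ii) alignment on LLM-synthesized question-answer pairs, with hard-positive mining and calibration.
Modality-specific encoders, equipped with lightweight adapters, map heterogeneous user signals into the LLM embedding space; the same decoder-only LLM encodes both the user view and the answer for dual-tower contrastive learning.
An InfoNCE-based contrastive objective with in-batch negatives, using a masking-based similarity mechanism to reduce false negatives.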
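
As a rough illustration of the contrastive objective listed above, here is a minimal pure-Python sketch of InfoNCE with in-batch negatives. The false-negative filter is an assumption: this digest does not spell out the masking-based similarity mechanism, so the sketch simply drops any in-batch negative whose similarity exceeds the positive's by more than a hypothetical `fn_margin`. The function name, the `temperature` default, and the margin rule are illustrative and not taken from the paper.

```python
import math

def info_nce(sim, temperature=0.05, fn_margin=None):
    """Mean InfoNCE loss over a batch.

    sim[i][j] is the similarity between user view i and answer j;
    the diagonal holds the positive pairs, off-diagonals are in-batch negatives.
    ASSUMPTION: if fn_margin is set, a negative with
    sim[i][j] > sim[i][i] + fn_margin is treated as a likely false negative
    and excluded from the denominator.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        pos = sim[i][i]
        logits = []
        for j in range(n):
            is_negative = j != i
            if is_negative and fn_margin is not None and sim[i][j] > pos + fn_margin:
                continue  # masked out as a suspected false negative
            logits.append(sim[i][j] / temperature)
        # Numerically stable log-sum-exp over the kept logits.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - pos / temperature
    return total / n

# Toy batch: answer 1 looks more similar to user 0 than user 0's own positive.
sim = [[0.5, 0.9],
       [0.1, 0.8]]
print(info_nce(sim, temperature=1.0))
print(info_nce(sim, temperature=1.0, fn_margin=0.1))
```

In the toy batch, the second call masks the suspicious negative for user 0, so the loss is lower than in the first call. Whether a fixed margin is the right criterion depends on the data; it is used here only to make the idea concrete.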

Figure 1: Architecture Overview of Our Find-Embedding (w/ GGSM).
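
To make the masking regimes and the linear transition more concrete, the following is a minimal pure-Python sketch under stated assumptions. The hybrid mask is assumed to mean full attention inside a prefix of `prefix_len` tokens and causal attention elsewhere, which this digest does not confirm. The soft mask gives future positions a weight `alpha` that a linear schedule raises from 0 (causal) to 1 (bidirectional). The gradient-based pre-warmup of GG-SM is not modeled, because its details are not given here. All function names are hypothetical.

```python
import math

def causal_mask(n):
    """Weight 1.0 where query i may attend to key j (j <= i), else 0.0."""
    return [[1.0 if j <= i else 0.0 for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    """Every position attends to every position."""
    return [[1.0] * n for _ in range(n)]

def hybrid_mask(n, prefix_len):
    """ASSUMPTION: full attention inside the first prefix_len tokens, causal elsewhere."""
    return [[1.0 if j <= i or (i < prefix_len and j < prefix_len) else 0.0
             for j in range(n)] for i in range(n)]

def linear_alpha(step, total_steps):
    """Share of future attention opened at `step`, rising linearly from 0 to 1."""
    if total_steps <= 0:
        return 1.0
    return min(1.0, max(0.0, step / total_steps))

def soft_mask(n, alpha):
    """Past and self positions keep weight 1.0; future positions get weight alpha."""
    return [[1.0 if j <= i else alpha for j in range(n)] for i in range(n)]

def weighted_softmax(scores, weights):
    """Attention probabilities proportional to weight * exp(score) for one query row."""
    m = max(s for s, w in zip(scores, weights) if w > 0.0)
    exps = [w * math.exp(s - m) if w > 0.0 else 0.0
            for s, w in zip(scores, weights)]
    z = sum(exps)
    return [e / z for e in exps]

# Watch the first query token gain access to future tokens as training proceeds.
n = 4
for step in (0, 50, 100):
    alpha = linear_alpha(step, 100)
    row0 = weighted_softmax([0.0] * n, soft_mask(n, alpha)[0])
    print(step, alpha, [round(p, 3) for p in row0])
```

With `alpha = 0` the soft mask equals the causal mask, and with `alpha = 1` it equals the bidirectional mask, so the schedule interpolates between the two regimes rather than switching abruptly.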

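Experimental Results

Research Questions

RQ1: How do causal, hybrid, and bidirectional attention masks affect the quality of user embeddings learned by decoder-only LLMs?
RQ2: In a unified training setup, does the transition from causal to bidirectional attention affect training stability and representation quality?
RQ3: Compared with a scheduler-only transition, does the gradient-guided soft-masking pre-warmup improve optimization dynamics and the final bidirectional representations?
RQ4: How do GG-SM-enhanced embeddings perform against other baselines on 9 diverse real-world industrial user cognition tasks?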

Key Findings

Method	Concert	User	MAU	Transit	Power	Food	Movie	Achiev.	Physical	Avg
Oracle	0.5173	0.7219	0.9202	0.5642	0.7638	0.6561	0.6435	0.5415	0.5592	0.6542
w/ Causal	0.5716	0.8313	0.9669	0.6967	0.9678	0.8473	0.7922	0.6054	0.6589	0.7709
w/ Hybrid	0.5748	0.8311	0.9671	0.6951	0.9653	0.8520	0.7913	0.6056	0.6565	0.7710
w/ Hybrid_gq	0.5647	0.8382	0.9665	0.6945	0.9678	0.8528	0.7887	0.6044	0.6582	0.7706
w/ Hybrid_mlp	0.5750	0.8410	0.9667	0.6965	0.9649	0.8484	0.7886	0.6042	0.6608	0.7718
w/ Bidirectional	0.5707	0.8390	0.9673	0.6983	0.9671	0.8505	0.7906	0.6043	0.6607	0.7721
w/ Scheduler	0.5742	0.8419	0.9664	0.6973	0.9688	0.8540	0.7908	0.6056	0.6605	0.7733
w/ GG-SM (Ours)	0.5767	0.8438	0.9674	0.6978	0.9689	0.8554	0.7913	0.6078	0.6615	0.7745
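
As a quick sanity check on the table above, the short sketch below recomputes the Avg column as the unweighted mean of the nine per-task scores for four of the rows. The values are copied from the table; treating Avg as a simple mean is an assumption that the arithmetic happens to confirm for these rows.

```python
# Per-task scores copied from the table above, in column order Concert ... Physical.
rows = {
    "w/ Causal": [0.5716, 0.8313, 0.9669, 0.6967, 0.9678, 0.8473, 0.7922, 0.6054, 0.6589],
    "w/ Bidirectional": [0.5707, 0.8390, 0.9673, 0.6983, 0.9671, 0.8505, 0.7906, 0.6043, 0.6607],
    "w/ Scheduler": [0.5742, 0.8419, 0.9664, 0.6973, 0.9688, 0.8540, 0.7908, 0.6056, 0.6605],
    "w/ GG-SM (Ours)": [0.5767, 0.8438, 0.9674, 0.6978, 0.9689, 0.8554, 0.7913, 0.6078, 0.6615],
}

def mean_score(scores):
    """Unweighted mean over tasks, rounded to the table's four decimals."""
    return round(sum(scores) / len(scores), 4)

averages = {name: mean_score(scores) for name, scores in rows.items()}
for name, avg in averages.items():
    print(f"{name}: {avg:.4f}")
```

The recomputed means match the reported Avg values (0.7709, 0.7721, 0.7733, 0.7745), so the gap between GG-SM and the causal baseline is about 0.0036 on average.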

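Among the fixed masking strategies in the unified framework, bidirectional masking gives the best representation quality (average AUC 0.7721, versus 0.7709 for causal and 0.7710 for hybrid).
The path taken from causal to bidirectional masking has a key impact on optimization stability and embedding quality.
Compared with the causal, hybrid, and scheduler-only baselines, GG-SM consistently improves training stability and the final bidirectional embeddings, reaching the highest average AUC in the table (0.7745, versus 0.7733 for the scheduler-only baseline).
GG-SM exceeds several general-purpose embedding models in average AUC and outperforms the other user embedding baselines across the 9 tasks.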

Figure 2: Average AUC across the 9 downstream tasks under different attention masking strategies (left), and comparison with general-purpose and user embedding baselines (right).

This digest was generated by AI and reviewed by human editors.