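[Paper Walkthrough] Curveball Steering: The Right Direction To Steer Isn't Always Linear
Curveball steering uses polynomial kernel PCA to steer LLMs along nonlinear activation manifolds, and it outperforms linear steering when the activation geometry is strongly curved.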
Activation steering is a widely used approach for controlling large language model (LLM) behavior by intervening on internal representations. Existing methods largely rely on the Linear Representation Hypothesis, assuming behavioral attributes can be manipulated using global linear directions. In practice, however, such linear interventions often behave inconsistently. We question this assumption by analyzing the intrinsic geometry of LLM activation spaces. Measuring geometric distortion via the ratio of geodesic to Euclidean distances, we observe substantial and concept-dependent distortions, indicating that activation spaces are not well-approximated by a globally linear geometry. Motivated by this, we propose "Curveball steering", a nonlinear steering method based on polynomial kernel PCA that performs interventions in a feature space, better respecting the learned activation geometry. Curveball steering consistently outperforms linear PCA-based steering, particularly in regimes exhibiting strong geometric distortion, suggesting that geometry-aware, nonlinear steering provides a principled alternative to global, linear interventions.
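To make the distortion ratio concrete, here is a minimal Python sketch. It is not the paper's procedure: the paper derives geodesics from pullback metrics learned by ensembles of VAEs, whereas this sketch swaps in a much simpler k-nearest-neighbour graph approximation (in the style of Isomap). The function name `distortion_ratio` and the parameter `k` are illustrative assumptions.

```python
import numpy as np

def distortion_ratio(points, i, j, k=2):
    """Approximate R = d_geo / d_Euc between points[i] and points[j].

    Assumption: geodesics are approximated by shortest paths on a
    symmetrised k-nearest-neighbour graph, not by a learned pullback metric.
    """
    n = len(points)
    # Pairwise Euclidean distances.
    diff = points[:, None, :] - points[None, :, :]
    euc = np.sqrt((diff ** 2).sum(axis=-1))
    # Graph with edges only between each point and its k nearest neighbours.
    graph = np.full((n, n), np.inf)
    np.fill_diagonal(graph, 0.0)
    nbrs = np.argsort(euc, axis=1)[:, 1:k + 1]  # column 0 is the point itself
    rows = np.repeat(np.arange(n), k)
    cols = nbrs.ravel()
    graph[rows, cols] = euc[rows, cols]
    graph[cols, rows] = euc[rows, cols]
    # Floyd-Warshall all-pairs shortest paths (fine for small n).
    for m in range(n):
        graph = np.minimum(graph, graph[:, m, None] + graph[None, m, :])
    return graph[i, j] / euc[i, j]

# Curved toy data: the two ends of a half circle.
theta = np.linspace(0.0, np.pi, 100)
arc = np.stack([np.cos(theta), np.sin(theta)], axis=1)
r_arc = distortion_ratio(arc, 0, len(arc) - 1)

# Flat toy data: the two ends of a straight segment.
t = np.linspace(0.0, 1.0, 100)
line = np.stack([t, np.zeros_like(t)], axis=1)
r_line = distortion_ratio(line, 0, len(line) - 1)
```

On the half circle the ratio comes out close to π/2 (arc length π against a chord of length 2), while on the straight segment it is 1. A ratio well above 1 is the signal that a straight-line path between two points leaves the data manifold.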
Motivation and Goals
- Show that LLM activation spaces exhibit non-Euclidean geometry, challenging the Linear Representation Hypothesis.
- Propose Curveball steering, a nonlinear, geometry-aware steering method based on polynomial kernel PCA.
- Empirically validate Curveball against linear steering across multiple models and behavioral traits.
- Characterize when kernel-based steering outperforms linear methods via geometric analysis of activation manifolds.
Proposed Method
- Assess activation-space geometry by measuring geodesic vs Euclidean distances using pullback metrics learned from ensembles of VAEs.
- Define and compute distortion ratio R = d_geo/d_Euc to test linearity of activation spaces.
- Develop Curveball steering that operates in KPCA space with polynomial kernels (degrees 2 or 3) and uses kernel pre-image reconstruction to map back to activation space.
- Project activations into KPCA space to obtain steering vectors via class means; apply steering in kernel space; reconstruct back, preserving the activation residual orthogonal to the learned manifold.
- Treat Curveball steering as a drop-in nonlinear generalization of linear steering: with polynomial degree p = 1 (a linear kernel), it reduces to linear PCA-based steering.
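
The steps in the list above can be sketched in NumPy as follows. This is a rough illustration under stated assumptions, not the authors' implementation: the class name `CurveballSketch`, its parameters, and the toy data are hypothetical, and the pre-image step is assumed to be a kernel ridge regression from KPCA coordinates back to activations, since this summary does not say which pre-image method the paper uses. The residual is preserved by adding back the part of each activation that the pre-image map does not reconstruct.

```python
import numpy as np

def poly_kernel(A, B, degree, coef0=1.0):
    """Polynomial kernel k(a, b) = (a . b + coef0) ** degree."""
    return (A @ B.T + coef0) ** degree

class CurveballSketch:
    """Hypothetical helper: polynomial KPCA steering with a learned pre-image map."""

    def __init__(self, degree=2, n_components=2, ridge=1e-2):
        self.degree, self.n_components, self.ridge = degree, n_components, ridge

    def fit(self, X, y):
        # X: (n, d) activations; y: (n,) labels, 1 = trait present, 0 = absent.
        self.X = X
        K = poly_kernel(X, X, self.degree)
        self.col_mean, self.all_mean = K.mean(axis=0), K.mean()
        Kc = K - self.col_mean - K.mean(axis=1, keepdims=True) + self.all_mean
        # Top eigenvectors of the centred kernel matrix give the KPCA axes.
        w, V = np.linalg.eigh(Kc)
        top = np.argsort(w)[::-1][: self.n_components]
        self.proj = V[:, top] / np.sqrt(w[top])
        self.Z = Kc @ self.proj
        # Steering vector: difference of class means in KPCA coordinates.
        self.v = self.Z[y == 1].mean(axis=0) - self.Z[y == 0].mean(axis=0)
        # Pre-image map (assumption): kernel ridge regression from
        # KPCA coordinates back to activation space.
        G = poly_kernel(self.Z, self.Z, self.degree)
        self.dual = np.linalg.solve(G + self.ridge * np.eye(len(X)), X)
        return self

    def transform(self, X_new):
        # Project new activations, centring with the training statistics.
        K = poly_kernel(X_new, self.X, self.degree)
        Kc = K - self.col_mean - K.mean(axis=1, keepdims=True) + self.all_mean
        return Kc @ self.proj

    def preimage(self, Z):
        return poly_kernel(Z, self.Z, self.degree) @ self.dual

    def steer(self, X_new, alpha):
        # Shift in KPCA space, map back, and keep the unreconstructed
        # residual X_new - preimage(Z) untouched.
        Z = self.transform(X_new)
        return X_new + self.preimage(Z + alpha * self.v) - self.preimage(Z)

# Toy "activations" on a curved arc; the two classes sit on opposite halves.
rng = np.random.default_rng(0)
theta = rng.uniform(-1.5, 1.5, size=120)
X = np.stack([np.cos(theta), np.sin(theta), 0.05 * rng.normal(size=120)], axis=1)
y = (theta > 0).astype(int)

model = CurveballSketch(degree=2).fit(X, y)
X_steered = model.steer(X[y == 0], alpha=0.5)
```

With `degree=1` the pre-image map in this sketch is affine in the KPCA coordinates, so every input receives the same shift and the method behaves like ordinary linear PCA-based steering. With `degree=2` or `3` the shift depends on where the input sits, which is the input-dependent behavior the method aims for.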

Experimental Results
Research Questions
- RQ1: Do LLM activation spaces exhibit non-Euclidean geometry that undermines linear steering?
- RQ2: Can nonlinear, geometry-aware steering via polynomial KPCA improve control of LLM behavior over linear directions?
- RQ3: How does Curveball performance vary with activation manifold curvature and steering strength?
- RQ4: Is Curveball steering robust across model families and behavioral concepts?
- RQ5: What geometric factors explain when Curveball outperforms linear steering?
Key Findings
| Concept | Llama-3.2-1B-It (Linear) | Llama-3.2-1B-It (Curveball) | Phi-3.5-mini-It (Linear) | Phi-3.5-mini-It (Curveball) |
|---|---|---|---|---|
| Self-awareness | 14% | 24% | 0.6% | 25.4% |
| Wealth-seeking | 15% | 28% | 2.3% | 6.7% |
| Power-seeking | 16% | 47% | 2.9% | 14.9% |
| Corrigible | 21% | 17% | 2.1% | 93.4% |
| Humorous | 54.9 | 28.2 | 85 | 75 |
| Rudeness | 85.7 | 26.1 | 61.0 | 100 |
| Excitement | 41.4 | 37.9 | 90.0 | 90.0 |
| Sadness | 15.4 | 19.5 | 85 | 100 |
- Activation spaces show substantial geometric distortion (R > 1) and concept-dependent distortion, challenging the linearity hypothesis.
- Curveball steering consistently outperforms linear steering, especially in high-curvature regimes, and improves steering of several behaviors across models.
- On synthetic curved manifolds, Curveball achieves lower tangent-space deviation and comparable or greater target distances than linear steering, particularly as the curvature κ increases.
- On real models (Llama-3.2-1B-Instruct and Phi-3.5-mini-Instruct), Curveball yields larger behavioral shifts for most concepts (e.g., power-seeking, self-awareness, wealth-seeking) and higher trait scores in several cases, with some exceptions.
- Curveball adapts steering magnitude in ambient space and reveals multimodal, locally varying steering directions, indicating geometry-aware adaptation not captured by linear methods.
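
A small worked example may help show why curvature matters for linear steering. The sketch below is an illustration only: it uses a circle of curvature κ as the manifold and measures distance from the circle after a straight-line step, which is a simple stand-in and not the tangent-space deviation metric used in the paper (this summary does not define that metric). The function names are hypothetical.

```python
import numpy as np

def linear_drift(kappa, alpha):
    """Distance from a circle of curvature kappa after a straight step.

    The step has length alpha and follows the tangent at the start point,
    i.e. the locally correct direction applied as a global linear shift.
    """
    r = 1.0 / kappa
    start = np.array([r, 0.0])        # point on a circle centred at the origin
    tangent = np.array([0.0, 1.0])    # unit tangent at that point
    return np.linalg.norm(start + alpha * tangent) - r

def geodesic_step(kappa, alpha):
    """Move the same arc length alpha along the circle instead."""
    r = 1.0 / kappa
    angle = alpha * kappa
    return r * np.array([np.cos(angle), np.sin(angle)])

for kappa in (0.1, 1.0, 2.0):
    print(kappa, linear_drift(kappa, alpha=1.0))
```

For a fixed step size, the straight step lands farther from the circle as κ grows (the drift is sqrt(1/κ² + α²) − 1/κ), while the step along the arc stays on the circle by construction. This is the geometric intuition behind the claim that linear steering degrades in high-curvature regimes.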

This walkthrough was generated by AI and reviewed by a human editor.