QUICK REVIEW

[Paper Review] Reverse engineering recurrent networks for sentiment classification reveals line attractor dynamics

Niru Maheswaranathan, Alex H. Williams|arXiv (Cornell University)|Jun 25, 2019

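Neural Networks and Applications | Cited by 41

One-Sentence Summary

This paper analyzes RNNs trained for sentiment analysis and shows that they converge to a low-dimensional, approximate line attractor. The dynamics linearized around it integrate evidence from individual words to drive the sentiment prediction, and the same mechanism appears across architectures.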

ABSTRACT

Recurrent neural networks (RNNs) are a widely used tool for modeling sequential data, yet they are often treated as inscrutable black boxes. Given a trained recurrent network, we would like to reverse engineer it--to obtain a quantitative, interpretable description of how it solves a particular task. Even for simple tasks, a detailed understanding of how recurrent networks work, or a prescription for how to develop such an understanding, remains elusive. In this work, we use tools from dynamical systems analysis to reverse engineer recurrent networks trained to perform sentiment classification, a foundational natural language processing task. Given a trained network, we find fixed points of the recurrent dynamics and linearize the nonlinear system around these fixed points. Despite their theoretical capacity to implement complex, high-dimensional computations, we find that trained networks converge to highly interpretable, low-dimensional representations. In particular, the topological structure of the fixed points and corresponding linearized dynamics reveal an approximate line attractor within the RNN, which we can use to quantitatively understand how the RNN solves the sentiment analysis task. Finally, we find this mechanism present across RNN architectures (including LSTMs, GRUs, and vanilla RNNs) trained on multiple datasets, suggesting that our findings are not unique to a particular architecture or dataset. Overall, these results demonstrate that surprisingly universal and human interpretable computations can arise across a range of recurrent networks.

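Motivation and Objectives

Use dynamical systems analysis to understand how trained RNNs solve document-level sentiment analysis.
Identify low-dimensional structure and fixed points in the RNN dynamics.
Assess whether line attractor dynamics generalize across architectures and datasets.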

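Proposed Method

Train four RNN architectures (LSTM, GRU, Update Gate RNN, and vanilla RNN) on the IMDB, Yelp, and SST datasets.
Identify approximate fixed points by minimizing q = (1/N) ||h - F(h, 0)||^2, where h is the hidden state, F is the RNN update evaluated at zero input, and N is the hidden-state dimension; the minimization is started from states sampled from the distribution of network states.
Linearize the dynamics around each fixed point h*, giving h_t ≈ h* + J_rec (h_{t-1} - h*) + J_inp x_t, where J_rec and J_inp are the Jacobians of F with respect to the hidden state and the input.
Compute the eigenvalues and eigenvectors of J_rec to analyze slow modes and their memory time constants.
Use the linearized model to predict the effect of inputs, and compare its predictions against the full nonlinear dynamics.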
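The steps above can be sketched on a toy problem. The following is a minimal illustration, not the authors' implementation: it assumes a hypothetical 2-unit vanilla RNN F(h, x) = tanh(W h + U x) with hand-picked weights `W` and `U`, uses plain gradient descent as the optimizer, and takes the time constant of a mode to be the e-folding time 1/|ln|λ|| of a discrete-time linear mode. All function names (`F`, `q`, `jacobians`, `find_fixed_point`) are introduced here for illustration only.

```python
import numpy as np

# Hypothetical toy network (not from the paper): a 2-unit vanilla RNN,
# F(h, x) = tanh(W h + U x), with one slow and one fast direction.
W = np.array([[0.99, 0.0],
              [0.0, 0.5]])
U = np.array([[1.0],
              [0.3]])

def F(h, x):
    """One step of the toy RNN update."""
    return np.tanh(W @ h + U @ x)

def q(h):
    """Fixed-point loss q = (1/N) ||h - F(h, 0)||^2, with N = h.size."""
    r = h - F(h, np.zeros(1))
    return r @ r / h.size

def jacobians(h):
    """Analytic Jacobians of F at (h, x = 0) for the tanh update."""
    gain = 1.0 - np.tanh(W @ h) ** 2          # tanh'(pre-activation)
    return gain[:, None] * W, gain[:, None] * U

def find_fixed_point(h0, lr=0.5, steps=2000):
    """Plain gradient descent on q, started from a sampled state h0."""
    h = h0.copy()
    for _ in range(steps):
        J_rec, _ = jacobians(h)
        r = h - F(h, np.zeros(1))
        grad = (2.0 / h.size) * (np.eye(h.size) - J_rec).T @ r
        h -= lr * grad
    return h

h0 = np.array([0.05, -0.4])   # stand-in for a state sampled from RNN trajectories
h_star = find_fixed_point(h0)
J_rec, J_inp = jacobians(h_star)

# Eigenvalues of J_rec and the time constant (in steps) of each mode.
eigvals = np.linalg.eigvals(J_rec)
taus = 1.0 / np.abs(np.log(np.abs(eigvals)))

# Compare one linearized step with the full nonlinear update.
h_prev = h_star + np.array([0.01, 0.01])
x_t = np.array([0.02])
h_lin = h_star + J_rec @ (h_prev - h_star) + J_inp @ x_t
h_true = F(h_prev, x_t)
step_error = np.linalg.norm(h_lin - h_true)

print("q at approximate fixed point:", q(h_star))
print("eigenvalue magnitudes:", np.abs(eigvals), "time constants:", taus)
print("one-step linearization error:", step_error)
```

In this toy, the optimizer settles on a point where q is very small but not exactly zero, which is what "approximate fixed point" means here. For a real LSTM or GRU the Jacobians would more likely come from automatic differentiation than from a hand-written formula.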

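Experimental Results

Research Questions

RQ1: Do trained RNNs exhibit low-dimensional dynamics in sentiment classification?
RQ2: Are the fixed points of the trained RNN dynamics organized along a line attractor aligned with the readout direction?
RQ3: Do slow integration modes exist across RNN architectures and sentiment datasets?
RQ4: Do the linearized dynamics near the fixed points approximate the nonlinear RNN well enough to support interpretation?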

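Key Findings

After training, the RNN states explore a low-dimensional subspace, with the first few principal components capturing most of the variance.
The fixed points form an approximately one-dimensional manifold aligned with the readout weights.
The fixed points are marginally stable, with slow modes that persist for hundreds to thousands of tokens.
The most important integration mode is aligned with the fixed-point manifold, so the linearized input moves the state along the line attractor according to each word's sentiment.
Positive and negative words move the state in opposite directions along the line attractor, while neutral words have little effect.
The linearized dynamics approximate the nonlinear system with small one-step error, and the mechanism generalizes across LSTM, GRU, UGRNN, and vanilla RNN models on Yelp, IMDB, and SST.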
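The integration picture in these findings can be caricatured in one dimension. The sketch below is a hypothetical leaky integrator, not a trained network: the `VALENCE` table, the decay factor `lam`, and the function names are all assumptions chosen for illustration. The state stands for the position along the line attractor, each word nudges it by its valence, and the sign of the final state plays the role of the readout.

```python
import math

# Hypothetical per-word valences; in a trained RNN these would correspond to
# the projection of each word's linearized input effect onto the line attractor.
VALENCE = {"great": 1.0, "wonderful": 1.2, "awful": -1.0, "boring": -0.8}

def integrate(tokens, lam=0.999):
    """Leaky integration s_t = lam * s_{t-1} + valence(token), starting at 0."""
    s = 0.0
    trace = []
    for tok in tokens:
        s = lam * s + VALENCE.get(tok, 0.0)   # words not in the table count as neutral
        trace.append(s)
    return trace

def time_constant(lam):
    """E-folding time, in tokens, of a mode with eigenvalue magnitude lam < 1."""
    return 1.0 / abs(math.log(abs(lam)))

review = "the plot was boring but the acting was great and the ending wonderful".split()
trace = integrate(review)
label = "positive" if trace[-1] > 0 else "negative"

print(label, [round(s, 3) for s in trace])
print("time constant for lam=0.999:", time_constant(0.999))
```

With a decay factor of 0.999 the e-folding time is roughly 1,000 tokens, so evidence from early words is still present at the end of a long review. This is the sense in which an eigenvalue close to 1 corresponds to a slow mode.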

This review was generated by AI and reviewed by human editors.