QUICK REVIEW

[Paper Review] The geometry of hidden representations of large transformer models

Lucrezia Valeriani, Diego Doimo | arXiv (Cornell University) | Feb 1, 2023

Machine Learning in Bioinformatics | Cited by 16

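ONE-SENTENCE SUMMARY

This paper analyzes how intrinsic dimension (ID) and neighborhood structure evolve across the layers of large self-supervised transformers in the protein and image domains, showing a three-phase (sometimes four-phase) ID trajectory in which the intermediate layers encode the richest semantic content and can be identified without supervision from a minimum of the ID profile.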

ABSTRACT

Large transformers are powerful architectures used for self-supervised data analysis across various data types, including protein sequences, images, and text. In these models, the semantic structure of the dataset emerges from a sequence of transformations between one representation and the next. We characterize the geometric and statistical properties of these representations and how they change as we move through the layers. By analyzing the intrinsic dimension (ID) and neighbor composition, we find that the representations evolve similarly in transformers trained on protein language tasks and image reconstruction tasks. In the first layers, the data manifold expands, becoming high-dimensional, and then contracts significantly in the intermediate layers. In the last part of the model, the ID remains approximately constant or forms a second shallow peak. We show that the semantic information of the dataset is better expressed at the end of the first peak, and this phenomenon can be observed across many models trained on diverse datasets. Based on our findings, we point out an explicit strategy to identify, without supervision, the layers that maximize semantic content: representations at intermediate layers corresponding to a relative minimum of the ID profile are more suitable for downstream learning tasks.

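RESEARCH MOTIVATION AND GOALS

Understand how the geometric properties of hidden representations (intrinsic dimension and neighbor composition) evolve across the layers of large self-supervised transformers.
Compare the geometric signatures of protein language models and image transformers to identify common patterns.
Propose an unsupervised strategy for locating the layers with the most semantic content for downstream tasks.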

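PROPOSED METHOD

Estimate the intrinsic dimension (ID) of each layer's representation with the TwoNN estimator, which relies on nearest-neighbor distances.
Measure the neighborhood overlap chi_k^{l,m} to quantify how the local neighbor structure changes between layers l and m.
Compute chi_k^{l,gt} to assess alignment with ground-truth semantic labels (e.g., remote homology, ImageNet classes).
Extract intermediate-layer representations from the ESM-2 protein language models and the iGPT image transformers for analysis.
Analyze the ID and neighborhood metrics across model sizes, tasks, and datasets (ProteinNet, SCOPe, ImageNet).
Identify semantically rich layers without supervision by locating local minima of the ID profile.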
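
The first three steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code. It assumes each layer's representation is already available as an array of shape (n_points, n_features). The function names (`twonn_id`, `neighborhood_overlap`, `label_overlap`) are hypothetical. The ID estimate uses the maximum-likelihood form of TwoNN (the ratio of second to first nearest-neighbor distance is modeled as Pareto-distributed with exponent equal to the ID), which is a simplification of the fit-based procedure usually described for TwoNN. The label overlap is taken here to be the average fraction of a point's k nearest neighbors that share its label.

```python
import numpy as np

def pairwise_sq_dists(x):
    """Squared Euclidean distances between all rows of x (dense, O(n^2) memory)."""
    sq = np.sum(x ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)
    return np.maximum(d2, 0.0)  # clip tiny negatives caused by round-off

def twonn_id(x):
    """TwoNN intrinsic dimension, maximum-likelihood variant (an assumption here)."""
    x = np.asarray(x, dtype=float)
    if x.shape[0] < 3:
        raise ValueError("need at least 3 points")
    d2 = pairwise_sq_dists(x)
    np.fill_diagonal(d2, np.inf)  # a point is not its own neighbor
    # After partitioning at index 1, columns 0 and 1 hold the two smallest values in order.
    two_smallest = np.partition(d2, 1, axis=1)[:, :2]
    first, second = two_smallest[:, 0], two_smallest[:, 1]
    valid = first > 0  # skip duplicated points, whose ratio is undefined
    # log(mu) = log(r2 / r1); the factor 0.5 converts from squared distances.
    log_mu = 0.5 * (np.log(second[valid]) - np.log(first[valid]))
    return valid.sum() / log_mu.sum()

def knn_indices(x, k):
    """Indices of the k nearest neighbors of every row of x (unordered)."""
    d2 = pairwise_sq_dists(np.asarray(x, dtype=float))
    np.fill_diagonal(d2, np.inf)
    return np.argpartition(d2, k - 1, axis=1)[:, :k]

def neighborhood_overlap(x_l, x_m, k=10):
    """chi_k^{l,m}: mean fraction of k-neighbors shared between two layers."""
    nn_l, nn_m = knn_indices(x_l, k), knn_indices(x_m, k)
    shared = [np.intersect1d(a, b).size for a, b in zip(nn_l, nn_m)]
    return float(np.mean(shared)) / k

def label_overlap(x_l, labels, k=10):
    """chi_k^{l,gt}: mean fraction of k-neighbors with the same label as the point."""
    labels = np.asarray(labels)
    nn = knn_indices(x_l, k)
    return float((labels[nn] == labels[:, None]).mean())
```

Calling `twonn_id` on every layer's representation gives the ID profile, and `neighborhood_overlap` on consecutive layers shows where the neighbor structure is rearranged. The dense distance matrix keeps the sketch short but limits it to a few thousand points; a tree-based or approximate neighbor search would be needed at the scale of the datasets listed above.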

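EXPERIMENTAL RESULTS

RESEARCH QUESTIONS

RQ1: How does the intrinsic dimension of the representations change across layers in large transformers trained with self-supervision?
RQ2: Do ID and neighborhood structure go through consistent phases in protein language models (pLMs) and image transformers?
RQ3: Is there an unsupervised measure that identifies the layers maximizing semantic content for downstream tasks?
RQ4: How does semantic information (remote homology for proteins, class labels for images) relate to the ID profile and to layer position?
RQ5: How do the ID dynamics differ between pLMs and iGPTs, and how do model size and task affect them?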

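MAIN FINDINGS

In large transformers, the ID profile of pLMs shows three phases (peak, plateau, final ascent), while that of iGPTs shows a fourth phase (a second peak).
The ID peak occurs in the early layers; the plateau has low ID and a stable neighborhood structure, indicating semantically meaningful representations.
Semantic information (remote homology for proteins; ImageNet labels for images) is best expressed in the plateau, where the ID is low, or near the relative minimum of the ID profile.
The overlap with ground-truth labels, chi_k^{l,gt}, peaks in the plateau (or, for iGPTs, near the ID minimum), indicating that semantic content is concentrated in the encoding phase.
An unsupervised strategy based on the ID profile can identify the layers best suited for downstream learning tasks.
In iGPTs, a second, shallower ID peak near the end mirrors the first one, suggesting symmetric, autoencoder-like behavior in the decoding phase.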
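
The layer-selection strategy in these findings can be illustrated with a short sketch. It is one possible reading of "relative minimum of the ID profile", not the authors' procedure: it assumes the first ID peak is also the global maximum of the profile and that the profile is smooth enough for a simple downhill walk, so a noisy profile would need smoothing first. The function name `select_layer_by_id_minimum` and the numbers in the usage example are made up for illustration.

```python
import numpy as np

def select_layer_by_id_minimum(id_profile):
    """Return the index of the first relative minimum after the ID peak.

    Assumption: the first peak is the global maximum of the profile.
    """
    ids = np.asarray(id_profile, dtype=float)
    if ids.ndim != 1 or ids.size < 3:
        raise ValueError("id_profile must be a 1-D sequence with at least 3 layers")
    layer = int(np.argmax(ids))  # start at the peak
    # Walk toward the output while the ID keeps strictly decreasing.
    while layer + 1 < ids.size and ids[layer + 1] < ids[layer]:
        layer += 1
    return layer

# Hypothetical ID values per layer: early peak, minimum, second shallow peak.
example_profile = [5.0, 12.0, 20.0, 14.0, 9.0, 8.0, 8.5, 11.0, 10.0, 7.0]
chosen = select_layer_by_id_minimum(example_profile)  # -> 5
```

Stopping at the first relative minimum, rather than taking the lowest ID after the peak, keeps the choice in the middle of the network even when the ID drops again in the last layers.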

This review was generated by AI and checked by a human editor.