QUICK REVIEW

[Paper Review] ByT5: Towards a token-free future with pre-trained byte-to-byte models

Linting Xue, Aditya Barua|arXiv (Cornell University)|May 28, 2021

Natural Language Processing Techniques | Cited by 69

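One-sentence summary

ByT5 shows that a standard Transformer can process UTF-8 bytes directly, yielding token-free pre-trained models that are competitive with token-based baselines across many tasks and more robust to noise.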

ABSTRACT

Most widely-used pre-trained language models operate on sequences of tokens corresponding to word or subword units. By comparison, token-free models that operate directly on raw text (bytes or characters) have many benefits: they can process text in any language out of the box, they are more robust to noise, and they minimize technical debt by removing complex and error-prone text preprocessing pipelines. Since byte or character sequences are longer than token sequences, past work on token-free models has often introduced new model architectures designed to amortize the cost of operating directly on raw text. In this paper, we show that a standard Transformer architecture can be used with minimal modifications to process byte sequences. We characterize the trade-offs in terms of parameter count, training FLOPs, and inference speed, and show that byte-level models are competitive with their token-level counterparts. We also demonstrate that byte-level models are significantly more robust to noise and perform better on tasks that are sensitive to spelling and pronunciation. As part of our contribution, we release a new set of pre-trained byte-level Transformer models based on the T5 architecture, as well as all code and data used in our experiments.

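Motivation and objectives

Motivate and evaluate token-free NLP models that operate on raw bytes rather than a token vocabulary.
Adapt the Transformer architecture to byte sequences with as few changes as possible.
Characterize the trade-offs between ByT5 and token-based baselines in parameter count, FLOPs, and inference speed, including on multilingual tasks.
Demonstrate the robustness of byte-level modeling to input noise and spelling variation.
Release the pre-trained ByT5 models together with the accompanying code and data.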
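
To make the sequence-length side of this trade-off concrete, the following minimal sketch compares character counts with UTF-8 byte counts for a few strings. The sample strings and the helper name `byte_stats` are illustrative assumptions chosen for this sketch, not taken from the paper.

```python
def byte_stats(text: str) -> tuple[int, int]:
    """Return (number of characters, number of UTF-8 bytes) for text."""
    return len(text), len(text.encode("utf-8"))


# Illustrative samples (chosen for this sketch, not from the paper).
samples = ["hello", "naïve", "日本語", "привет"]

for s in samples:
    chars, nbytes = byte_stats(s)
    # Every byte value lies in 0..255, so no language needs its own vocabulary.
    assert all(0 <= b <= 255 for b in s.encode("utf-8"))
    print(f"{s!r}: {chars} chars -> {nbytes} bytes")
```

ASCII text costs one byte per character, while characters from other scripts take two to four bytes each in UTF-8, which is why byte sequences are longer than the corresponding character or token sequences.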

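Proposed method

Replace the token vocabulary with embeddings for the 256 possible byte values plus a few special tokens, and feed UTF-8 bytes directly into the Transformer.
Pre-train with a self-supervised span-corruption objective over bytes, with a mean masked-span length of 20 bytes (the final 100 byte IDs are reused as sentinels).
Make the encoder deeper than the decoder (encoder depth is 3 times the decoder depth) to compensate for the absence of a large vocabulary embedding matrix.
Train five model sizes (Small, Base, Large, XL, XXL) with a sequence length of 1024 bytes for 1 million steps, using batches of 2^20 tokens (for ByT5, one token is one byte).
Match each ByT5 model to its mT5 counterpart in parameter count by adjusting d_model and d_ff while keeping the d_ff/d_model ratio at roughly 2.5.
Evaluate on English and multilingual benchmarks (GLUE, SuperGLUE, XSum, TweetQA, DROP, Dakshina, SIGMORPHON, and the XTREME tasks) and compare against mT5.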
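
The byte-level input and span-masking steps above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the released implementation: the count of 3 special tokens (pad, EOS, UNK), the descending sentinel numbering in `sentinel_id`, and all function names are hypothetical. Spans are passed in explicitly here, whereas real pre-training would sample them at random with a mean length of 20 bytes.

```python
NUM_SPECIAL = 3                   # assumption: pad=0, eos=1, unk=2
VOCAB_SIZE = 256 + NUM_SPECIAL    # 256 byte values plus the special tokens
NUM_SENTINELS = 100               # the summary says the final 100 IDs are reused


def encode(text: str) -> list[int]:
    """Map text to IDs: each UTF-8 byte value shifted past the special tokens."""
    return [b + NUM_SPECIAL for b in text.encode("utf-8")]


def decode(ids: list[int]) -> str:
    """Inverse of encode; special-token IDs are dropped."""
    data = bytes(i - NUM_SPECIAL for i in ids if i >= NUM_SPECIAL)
    return data.decode("utf-8", errors="ignore")


def sentinel_id(k: int) -> int:
    """ID of the k-th sentinel (assumed to count down from the top of the range)."""
    if not 0 <= k < NUM_SENTINELS:
        raise ValueError("only 100 sentinel IDs are available")
    return VOCAB_SIZE - 1 - k


def corrupt(ids: list[int], spans: list[tuple[int, int]]) -> tuple[list[int], list[int]]:
    """Replace each (start, length) span with a sentinel.

    spans must be sorted and non-overlapping. Returns (inputs, targets):
    inputs keep the unmasked bytes with one sentinel per masked span;
    targets list each sentinel followed by the bytes it replaced.
    """
    inputs, targets = [], []
    pos = 0
    for k, (start, length) in enumerate(spans):
        s = sentinel_id(k)
        inputs.extend(ids[pos:start])        # unmasked bytes before the span
        inputs.append(s)                     # the span collapses to one sentinel
        targets.append(s)
        targets.extend(ids[start:start + length])
        pos = start + length
    inputs.extend(ids[pos:])                 # unmasked tail
    return inputs, targets


ids = encode("hello world")
inputs, targets = corrupt(ids, [(0, 2), (6, 3)])
print(inputs)
print(targets)
```

Because the sentinels share the same small ID space as the byte values, no separate sentinel entries are added to the vocabulary in this sketch; how the released models resolve that overlap is not stated in this summary.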

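Experimental results

Research questions

RQ1 Can a standard Transformer be adapted effectively to byte-level input with minimal architectural changes?
RQ2 What are the trade-offs in parameter count, FLOPs, and inference cost when switching from token-based to byte-based input?
RQ3 How does ByT5 compare with mT5 on English and multilingual classification, generation, and word-level tasks?
RQ4 Is ByT5 more robust than token-based models to noise and spelling variation across languages?
RQ5 How does the balance between encoder and decoder depth affect the performance of a token-free Transformer?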

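Key findings

ByT5 is competitive with mT5 on English and multilingual benchmarks, and outperforms it at the smaller model sizes.
Byte-level ByT5 is strong on generation: on XSum, TweetQA, and DROP it often beats mT5 across several model sizes.
Because it has no token vocabulary, ByT5 has far fewer vocabulary-related parameters; that budget is reallocated to the Transformer layers as dense parameters, in a configuration with a 3:1 encoder-to-decoder depth ratio.
ByT5 is more robust than mT5 to noisy and corrupted text across tasks and languages, degrading less under a variety of input corruptions.
On the XTREME cross-lingual tasks ByT5 is competitive overall: it outperforms mT5 in the in-language setting, where training data is available in every target language, while its zero-shot and translate-train results depend on model size.
Ablation studies show that ByT5 benefits most from a heavy encoder, that the choice of mean masked-span length (20, compared with 3 and 40) matters for certain tasks, and that a 256-byte vocabulary shifts most parameters out of the vocabulary matrix and into the dense layers.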
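
The parameter shift described in these findings follows from simple arithmetic: an embedding matrix has vocabulary size times d_model entries. The sketch below uses hypothetical values (a 250,000-entry subword vocabulary, d_model of 512, and 3 special tokens) purely for illustration; they are assumptions, not figures quoted from the paper's tables.

```python
def embedding_params(vocab_size: int, d_model: int) -> int:
    """Parameters in a single vocab_size x d_model embedding matrix."""
    return vocab_size * d_model


D_MODEL = 512             # hypothetical hidden size
SUBWORD_VOCAB = 250_000   # hypothetical large multilingual subword vocabulary
BYTE_VOCAB = 256 + 3      # 256 byte values plus an assumed 3 special tokens

subword = embedding_params(SUBWORD_VOCAB, D_MODEL)
byte_level = embedding_params(BYTE_VOCAB, D_MODEL)

print(f"subword embedding: {subword:,} parameters")
print(f"byte embedding:    {byte_level:,} parameters")
print(f"freed for dense layers: {subword - byte_level:,} parameters")
```

Under these assumed values nearly the entire embedding budget becomes available for wider or deeper Transformer layers when matching a token-based model's total parameter count.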

This review was generated by AI and reviewed by a human editor.