QUICK REVIEW

[Paper Review] Compressing Word Embeddings via Deep Compositional Code Learning

Raphael Shu, Hideki Nakayama|arXiv (Cornell University)|Nov 3, 2017

Topic Modeling | References: 31 | Cited by: 29

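One-Sentence Summary

The paper proposes deep compositional code learning, which compresses word embeddings by representing each word as a composition of a few learned basis vectors selected by a discrete hash code. Trained with the differentiable Gumbel-softmax trick, the method reaches a compression rate of 98% in sentiment analysis and 94–99% in machine translation without performance loss, while remaining language-independent and requiring no changes to the network architecture.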

ABSTRACT

Natural language processing (NLP) models often require a massive number of parameters for word embeddings, resulting in a large storage or memory footprint. Deploying neural NLP models to mobile devices requires compressing the word embeddings without any significant sacrifices in performance. For this purpose, we propose to construct the embeddings with few basis vectors. For each word, the composition of basis vectors is determined by a hash code. To maximize the compression rate, we adopt the multi-codebook quantization approach instead of binary coding scheme. Each code is composed of multiple discrete numbers, such as (3, 2, 1, 8), where the value of each component is limited to a fixed range. We propose to directly learn the discrete codes in an end-to-end neural network by applying the Gumbel-softmax trick. Experiments show the compression rate achieves 98% in a sentiment analysis task and 94% ~ 99% in machine translation tasks without performance loss. In both tasks, the proposed method can improve the model performance by slightly lowering the compression rate. Compared to other approaches such as character-level segmentation, the proposed method is language-independent and does not require modifications to the network architecture.
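To make the storage argument in the abstract concrete, the arithmetic can be sketched as follows. This is a minimal sketch, not the authors' code: the function names are hypothetical, and the vocabulary size (75,000), embedding dimension (300), and 32-bit float storage are illustrative assumptions rather than figures taken from this review.

```python
import math


def code_bits_per_word(num_codebooks: int, codebook_size: int) -> int:
    """Bits to store one word's code: M components, each needing log2(K) bits."""
    return num_codebooks * math.ceil(math.log2(codebook_size))


def compression_rate(vocab_size: int, dim: int, num_codebooks: int,
                     codebook_size: int, float_bits: int = 32) -> float:
    """Fraction of embedding storage saved by compositional codes.

    Baseline: one dense vector per word.
    Compressed: M * K basis vectors plus one discrete code per word.
    """
    baseline_bits = vocab_size * dim * float_bits
    basis_bits = num_codebooks * codebook_size * dim * float_bits
    codes_bits = vocab_size * code_bits_per_word(num_codebooks, codebook_size)
    return 1.0 - (basis_bits + codes_bits) / baseline_bits


# Assumed setting: 75,000 words, 300 dimensions, M = 32 codebooks of K = 16 codewords.
rate = compression_rate(vocab_size=75_000, dim=300, num_codebooks=32, codebook_size=16)
print(f"compression rate: {rate:.2%}")
```

Under these assumptions the per-word code costs 128 bits, so the basis vectors and the code table together take up only a small fraction of the dense embedding matrix.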

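Research Motivation and Goals

Reduce the memory and storage footprint of neural NLP models by compressing word embeddings without sacrificing performance.
Address the redundancy in standard word embeddings, where semantically similar words are each represented by an independent vector.
Develop a language-independent method that requires no architectural changes to existing models.
Enable end-to-end training of discrete hash codes through the differentiable relaxation provided by the Gumbel-softmax trick.
Maximize compression efficiency while preserving semantic fidelity through multi-codebook quantization and code sharing among similar words.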

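Proposed Method

Represent each word $w$ as a code $ C_w = (C_w^1, C_w^2, ..., C_w^M) $, where each component $ C_w^i $ selects one of the $K$ codewords in codebook $ E_i $.
Compute the final embedding as a sum: $ E(C_w) = \sum_{i=1}^M E_i(C_w^i) $, which uses $ M \times K $ basis vectors instead of $ |V| $ independent vectors.
Adopt multi-codebook quantization with discrete integer codes (such as $ (3,2,1,8) $) to achieve a higher compression rate than a binary coding scheme.
Apply the Gumbel-softmax trick so that gradients can be backpropagated through the discrete codes during end-to-end training.
Optimize the codes and codebook parameters by minimizing the reconstruction loss $ \frac{1}{|V|} \sum_w || \sum_{i=1}^M E_i(C_w^i) - \tilde{E}(w) ||^2 $, where $ \tilde{E}(w) $ is the original embedding of word $w$.
Learn the codes directly from pretrained embeddings (such as GloVe) to preserve semantic quality while greatly reducing the parameter count.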
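The composition, relaxation, and reconstruction loss listed above can be sketched as a forward pass in NumPy. This is a hedged illustration, not the authors' implementation: the function names and toy sizes are hypothetical, the logits are random placeholders (in the paper they come from a trained network), and NumPy computes no gradients, so an autodiff framework would be needed for real training.

```python
import numpy as np


def gumbel_softmax(logits, temperature, rng):
    """Relaxed one-hot samples over the last axis (Gumbel-softmax)."""
    u = rng.uniform(low=1e-10, high=1.0, size=logits.shape)
    gumbel_noise = -np.log(-np.log(u))
    y = (logits + gumbel_noise) / temperature
    y = y - y.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(y)
    return e / e.sum(axis=-1, keepdims=True)


def compose(codes, codebooks):
    """E(C_w) = sum_i E_i(C_w^i); codes: (V, M) ints, codebooks: (M, K, D)."""
    num_codebooks = codebooks.shape[0]
    # Advanced indexing picks codebooks[i, codes[v, i]] for every word v.
    return codebooks[np.arange(num_codebooks), codes].sum(axis=1)


def compose_relaxed(one_hot, codebooks):
    """Soft counterpart used in training; one_hot: (V, M, K) selection weights."""
    return np.einsum("vmk,mkd->vd", one_hot, codebooks)


def reconstruction_loss(approx, target):
    """Mean over words of the squared L2 distance to the original embedding."""
    return np.mean(np.sum((approx - target) ** 2, axis=1))


# Toy sizes, assumed for illustration: V words, M codebooks, K codewords, D dims.
rng = np.random.default_rng(0)
V, M, K, D = 6, 4, 8, 5
codebooks = rng.normal(size=(M, K, D))   # basis vectors E_1 .. E_M
logits = rng.normal(size=(V, M, K))      # placeholder for network outputs
target = rng.normal(size=(V, D))         # stands in for pretrained embeddings

relaxed = gumbel_softmax(logits, temperature=1.0, rng=rng)
codes = relaxed.argmax(axis=-1)          # discrete codes kept after training
loss = reconstruction_loss(compose_relaxed(relaxed, codebooks), target)
print(codes[0], float(loss))
```

After training, only `codes` and `codebooks` need to be stored; `compose` then rebuilds each embedding by indexing and summing.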

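Experimental Results

Research Questions

RQ1: Can word embeddings be compressed by 95% or more in parameter count without degrading performance?
RQ2: Can learnable discrete codes effectively capture semantic similarity, such as that between 'dog' and 'dogs'?
RQ3: Does the proposed method maintain performance across diverse NLP tasks such as sentiment analysis and machine translation?
RQ4: How well are the codes utilized: are all codewords assigned meaningfully, or are some wasted?
RQ5: Can the method be applied across different languages and models without architectural changes?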

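Key Findings

On the IMDB sentiment analysis task, the method achieves a 98% compression rate with no loss in performance.
In machine translation, it achieves compression rates of 94–99% with minimal performance loss; for example, on the De→En task the BLEU score is 29.04 at a 98% compression rate, close to the baseline's 29.45.
In both tasks, slightly lowering the compression rate can improve model performance, indicating a tunable trade-off between performance and compression.
Qualitative analysis shows that semantically similar words (such as 'dog', 'dogs', and 'cat') are assigned codes that lie close together in Hamming space.
Code utilization is efficient: even the least-used codeword is assigned to more than 1,000 words, indicating no significant codeword waste.
The method is language-independent and requires no changes to the network architecture, so it can be deployed broadly on mobile and other resource-constrained devices.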
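The code-similarity and code-utilization findings above can be checked with two small helpers. The sketch below is illustrative only: the helper names are hypothetical and the codes in `toy_codes` are invented for the example, not taken from the paper.

```python
from collections import Counter


def hamming_distance(code_a, code_b):
    """Number of positions at which two equal-length codes differ."""
    if len(code_a) != len(code_b):
        raise ValueError("codes must have the same number of components")
    return sum(a != b for a, b in zip(code_a, code_b))


def codeword_usage(codes, component):
    """Count how many words select each codeword in one codebook."""
    return Counter(code[component] for code in codes.values())


# Hypothetical codes with M = 4 components, made up for illustration.
toy_codes = {
    "dog": (3, 2, 1, 8),
    "dogs": (3, 2, 1, 5),
    "cat": (3, 2, 4, 8),
    "tax": (7, 6, 2, 1),
}

print(hamming_distance(toy_codes["dog"], toy_codes["dogs"]))
print(hamming_distance(toy_codes["dog"], toy_codes["tax"]))
print(codeword_usage(toy_codes, component=0))
```

On a real vocabulary, a codeword whose usage count is close to zero would point to wasted capacity in that codebook.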

This review was generated by AI and reviewed by a human editor.