QUICK REVIEW

[Paper Review] Learning K-way D-dimensional Discrete Codes for Compact Embedding Representations

Ting Chen, Martin Renqiang Min|arXiv (Cornell University)|Jun 21, 2018
Advanced Graph Neural Networks|40 citations
TL;DR

KD encoding replaces standard one-hot embeddings with K-way D-dimensional discrete codes and a code-composition network, enabling end-to-end learning that greatly reduces embedding parameters while maintaining or improving task performance.

ABSTRACT

Conventional embedding methods directly associate each symbol with a continuous embedding vector, which is equivalent to applying a linear transformation based on a "one-hot" encoding of the discrete symbols. Despite its simplicity, such approach yields the number of parameters that grows linearly with the vocabulary size and can lead to overfitting. In this work, we propose a much more compact K-way D-dimensional discrete encoding scheme to replace the "one-hot" encoding. In the proposed "KD encoding", each symbol is represented by a D-dimensional code with a cardinality of K, and the final symbol embedding vector is generated by composing the code embedding vectors. To end-to-end learn semantically meaningful codes, we derive a relaxed discrete optimization approach based on stochastic gradient descent, which can be generally applied to any differentiable computational graph with an embedding layer. In our experiments with various applications from natural language processing to graph convolutional networks, the total size of the embedding layer can be reduced up to 98% while achieving similar or better performance.

Motivation & Objective

  • Motivate compact embedding representations to reduce parameter count and overfitting in large vocabularies.
  • Propose a KD encoding scheme that represents each symbol by a D-dimensional code with alphabet size K.
  • Develop an end-to-end learning framework that optimizes discrete codes and the code-composition embedding function.
  • Provide theoretical and empirical analysis of parameter savings and performance across NLP and graph convolution tasks.

Proposed method

  • Represent each symbol with a K-way D-dimensional code c_i = (c_i^1, ..., c_i^D) where each c_i^j ∈ {1,...,K}.
  • Use a code allocation function φ to map symbols to codes and a differentiable code-composition function f to generate embeddings from the codes.
  • Embed each code dimension with a dedicated code-embedding matrix W^j ∈ R^{K×d'}, and compose the final symbol embedding via a (potentially linear or nonlinear) transformation f_e of the code-embedding vectors.
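  • Provide a continuous relaxation of discrete codes via tempered Softmax to enable SGD-based learning; use a straight-through estimator during training so that the forward pass uses discrete codes while gradients flow through the relaxation, leaving discrete codes available at inference.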
  • Introduce entropy-based regularization and guidance mechanisms (online distillation guidance, pre-trained distillation guidance) to stabilize end-to-end learning of discrete codes.
  • Relate linear KD-code composition to a sparse binary low-rank factorization of the embedding matrix, and show that nonlinear composition increases expressiveness.
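
The linear composition and its factorization view can be illustrated with a minimal sketch. It assumes the simplest linear case, where the symbol embedding is the plain sum of the D selected code-embedding vectors; the names `compose_embedding`, `binary_row`, and `row_times_matrix`, the toy sizes, and the 0-based code values are illustrative choices, not taken from the paper.

```python
import random

def compose_embedding(code, code_tables):
    """Linear composition (assumed here): sum the selected row of each code table."""
    dim = len(code_tables[0][0])
    out = [0.0] * dim
    for j, k in enumerate(code):      # j: code dimension, k: chosen value in 0..K-1
        out = [a + b for a, b in zip(out, code_tables[j][k])]
    return out

def binary_row(code, K):
    """Concatenate D one-hot vectors of length K: one row of a sparse binary factor."""
    row = []
    for k in code:
        row.extend(1.0 if i == k else 0.0 for i in range(K))
    return row

def row_times_matrix(row, matrix):
    """Multiply a row vector by a matrix given as a list of rows."""
    cols = len(matrix[0])
    return [sum(r * m[c] for r, m in zip(row, matrix)) for c in range(cols)]

random.seed(0)
K, D, d = 4, 3, 5                     # toy sizes, for illustration only

# D code-embedding tables, each K x d (the W^j matrices in the text).
code_tables = [[[random.gauss(0, 1) for _ in range(d)] for _ in range(K)]
               for _ in range(D)]

# Stack the tables vertically into one (K*D) x d matrix.
stacked = [row for table in code_tables for row in table]

code = (2, 0, 3)                      # 0-indexed here; the text uses values 1..K
emb_sum = compose_embedding(code, code_tables)
emb_fact = row_times_matrix(binary_row(code, K), stacked)

# The two views agree: summing selected rows equals (binary row) x (stacked tables).
assert all(abs(a - b) < 1e-12 for a, b in zip(emb_sum, emb_fact))
```

Each binary row has exactly D ones among its K·D entries, so in this linear case the full embedding matrix is a sparse binary matrix times a small dense one. A nonlinear f_e would replace the plain sum and is not covered by this sketch.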

Experimental results

Research questions

  • RQ1: Can a K-way D-dimensional discrete coding scheme learn semantically meaningful symbol embeddings end-to-end?
  • RQ2: How much embedding parameter count and overall model size can be reduced using KD encoding without sacrificing performance?
  • RQ3: What are effective strategies (e.g., continuous relaxations and guidance) to train discrete codes in neural networks?
  • RQ4: How does KD encoding compare to low-rank embedding factorization and other baselines across NLP and graph tasks?
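
For the continuous relaxation raised in RQ3, a minimal sketch of a tempered Softmax over the K choices of a single code dimension may help. The logits and temperatures are hypothetical, and plain Python is used, so no gradients are computed; the straight-through idea is only described in the comments.

```python
import math

def tempered_softmax(logits, tau):
    """Softmax of logits / tau; a smaller tau gives a sharper distribution."""
    scaled = [x / tau for x in logits]
    m = max(scaled)                           # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def one_hot_argmax(logits):
    """Discrete choice: one-hot vector at the position of the largest logit."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return [1.0 if i == best else 0.0 for i in range(len(logits))]

logits = [0.2, 1.5, -0.3, 0.9]                # hypothetical logits, K = 4

soft_warm = tempered_softmax(logits, tau=1.0)   # smooth, spreads mass over choices
soft_cold = tempered_softmax(logits, tau=0.05)  # close to one-hot
hard = one_hot_argmax(logits)                   # the discrete code choice

# Straight-through idea (not executed here): in an autodiff framework the forward
# pass would use `hard`, while the backward pass would use the gradient of the
# soft relaxation, commonly written as soft + stop_gradient(hard - soft).
```

As the temperature decreases, the relaxed vector approaches the one-hot vector selected by the argmax, which is why the relaxed model can stand in for the discrete one during training.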

Key findings

  • KD encoding can reduce embedding layer size by up to 95-98% across tasks while achieving comparable or better performance.
  • End-to-end code learning with continuous relaxations and distillation guidance significantly improves performance over naive code learning and random code assignment.
  • Across language modeling and text classification, the method achieves similar or better perplexity/accuracy with substantially fewer embedding parameters and bits.
  • In graph convolutional networks, KD encoding delivers competitive accuracy with markedly fewer embedding parameters and fewer total bits.
  • Learned codes show semantic neighborhood structure, with similar words mapped to same or nearby codes under reasonable K and D choices.
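
To see where savings of this magnitude can come from, here is a back-of-the-envelope sketch of embedding storage in bits. The vocabulary size, dimensions, K, D, and 32-bit floats are hypothetical values chosen for illustration, not configurations reported in the paper, and the count ignores any parameters of a nonlinear composition network.

```python
import math

def one_hot_embedding_bits(vocab_size, dim, bits_per_float=32):
    """Conventional embedding table: one dense vector per symbol."""
    return vocab_size * dim * bits_per_float

def kd_embedding_bits(vocab_size, K, D, code_dim, bits_per_float=32):
    """KD encoding: per-symbol discrete codes plus D shared code-embedding tables."""
    code_bits = vocab_size * D * math.ceil(math.log2(K))   # D values, log2(K) bits each
    table_bits = D * K * code_dim * bits_per_float         # D tables of shape K x code_dim
    return code_bits + table_bits

N, d = 100_000, 300      # hypothetical vocabulary size and embedding dimension
K, D = 32, 32            # hypothetical code cardinality and code length

baseline = one_hot_embedding_bits(N, d)
compact = kd_embedding_bits(N, K, D, code_dim=d)
reduction = 1 - compact / baseline

print(f"baseline bits: {baseline:,}")
print(f"KD bits:       {compact:,}")
print(f"reduction:     {reduction:.1%}")
```

The per-symbol cost drops from d floats to D·log2(K) bits, while the shared code tables do not grow with the vocabulary, so under these assumptions the relative saving increases as the vocabulary gets larger.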


This review was created by AI and reviewed by human editors.