QUICK REVIEW

[Paper Review] Temporal Convolutional Attention-based Network For Sequence Modeling

Hongyan Hao, Yan Wang | arXiv (Cornell University) | Feb 28, 2020

Topic Modeling | References: 16 | Cited by: 39

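ONE-SENTENCE SUMMARY

TCAN combines temporal convolution with attention and enhanced residuals to model sequences, achieving state-of-the-art perplexity and bits per character on PTB and WikiText-2 with a compact, non-recurrent architecture.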

ABSTRACT

With the development of feed-forward models, the default model for sequence modeling has gradually evolved to replace recurrent networks. Many powerful feed-forward models based on convolutional networks and attention mechanism were proposed and show more potential to handle sequence modeling tasks. We wonder that is there an architecture that can not only achieve an approximate substitution of recurrent network, but also absorb the advantages of feed-forward models. So we propose an exploratory architecture referred to Temporal Convolutional Attention-based Network (TCAN) which combines temporal convolutional network and attention mechanism. TCAN includes two parts, one is Temporal Attention (TA) which captures relevant features inside the sequence, the other is Enhanced Residual (ER) which extracts shallow layer's important information and transfers to deep layers. We improve the state-of-the-art results of bpc/perplexity to 30.28 on word-level PTB, 1.092 on character-level PTB, and 9.20 on WikiText-2.

RESEARCH MOTIVATION AND GOALS

Motivate the search for a feed-forward architecture that can approximate recurrent networks for sequence modeling while retaining causality and parallelizability.
Introduce TCAN, a hybrid of Temporal Convolutional Networks and attention mechanisms to capture internal sequence correlations.
Propose Enhanced Residuals to propagate important information across layers without adding parameters.
Demonstrate state-of-the-art performance on PTB word-level, PTB character-level, and WikiText-2 datasets.

PROPOSED METHOD

Propose Temporal Convolutional Attention-based Network (TCAN) with two modules: Temporal Attention (TA) and Enhanced Residual (ER).
Use a causal dilated convolution backbone to model sequence dependencies, with a receptive field that grows with depth (dilation d = 2^l at layer l).
In TA, compute keys, queries, and values from layer inputs and apply a lower-triangular masked attention to preserve causality.
In ER, weight and aggregate information from TA to form an enhanced residual that is combined with the standard residual path.
Train with the Adam optimizer; compare TCAN with RNN-, CNN-, and Transformer-based baselines on PTB and WikiText-2 (WT2).
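The two mechanisms named above, a growing receptive field and lower-triangular masked attention, can be illustrated with a short NumPy sketch. This is not the authors' implementation: it assumes one causal convolution per layer, a single attention head, standard scaled dot-product scoring, and a row-wise softmax, none of which the summary specifies. The names receptive_field, causal_attention, w_q, w_k, and w_v are hypothetical.

```python
import numpy as np


def receptive_field(num_layers, kernel_size):
    """Receptive field of stacked causal convolutions with dilation 2**l.

    Assumption: exactly one convolution per layer, l = 0 .. num_layers - 1.
    """
    return 1 + (kernel_size - 1) * sum(2 ** l for l in range(num_layers))


def causal_attention(x, w_q, w_k, w_v):
    """Single-head attention where step t only attends to steps <= t.

    x: (T, d_in) inputs for one sequence.
    w_q, w_k, w_v: (d_in, d_k) projection matrices (hypothetical names).
    Returns the attended values (T, d_k) and the attention weights (T, T).
    """
    queries, keys, values = x @ w_q, x @ w_k, x @ w_v
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])  # (T, T)

    # Entries above the diagonal point at future steps; mask them out.
    steps = x.shape[0]
    future = np.triu(np.ones((steps, steps), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)

    # Row-wise softmax; the diagonal is never masked, so each row max is finite.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ values, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(6, 4))
    w_q, w_k, w_v = (rng.normal(size=(4, 3)) for _ in range(3))
    out, weights = causal_attention(x, w_q, w_k, w_v)
    print(out.shape, weights.shape)
    print(receptive_field(num_layers=4, kernel_size=3))
```

Because the mask zeroes every weight above the diagonal, changing the inputs at later time steps leaves the outputs at earlier steps unchanged, which is the causality property the method list refers to.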

EXPERIMENTAL RESULTS

RESEARCH QUESTIONS

RQ1: Can a feed-forward, non-recurrent architecture match or surpass recurrent models on standard language modeling benchmarks?
RQ2: Does integrating temporal attention with a causal dilated convolution preserve causality while capturing long-range dependencies?
RQ3: Does an enhanced residual mechanism improve information propagation without increasing model parameters?
RQ4: How does TCAN perform on word-level PTB, character-level PTB, and WikiText-2 compared to state-of-the-art models?

KEY FINDINGS

TCAN achieves a perplexity of 30.28 on word-level PTB, 1.092 bits per character on character-level PTB, and a perplexity of 9.20 on WikiText-2, reported with no leakage of future information.
TCAN outperforms several baselines including AWD-LSTM, TrellisNet, and generic TCN across the evaluated datasets.
An ablation shows Temporal Attention is more effective than a comparable convolutional layer for this task.
Enhanced Residuals provide performance gains without adding extra parameters.
TCAN is smaller than Transformer- and RNN-based models while delivering strong performance.
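The findings above are stated in two different units, perplexity for word-level tasks and bits per character for the character-level task. The sketch below shows how both are conventionally derived from the same quantity, the mean negative log-likelihood in nats. The function names are hypothetical, and the paper's own evaluation code may differ in details such as how sequences are batched.

```python
import math


def perplexity(token_nlls):
    """Perplexity: exp of the mean per-token negative log-likelihood (nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def bits_per_character(char_nlls):
    """Bits per character: mean per-character NLL converted from nats to bits."""
    return sum(char_nlls) / len(char_nlls) / math.log(2)


if __name__ == "__main__":
    # A model that is uniformly unsure among 4 symbols has NLL = ln(4) per step.
    uniform_nlls = [math.log(4)] * 5
    print(perplexity(uniform_nlls))
    print(bits_per_character(uniform_nlls))
```

Under these definitions, a word-level perplexity of 30.28 corresponds to a mean loss of about 3.41 nats per token, and 1.092 bits per character corresponds to about 0.757 nats per character.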

This review was generated by AI and reviewed by human editors.