QUICK REVIEW

[Paper Review] Attentional Factorization Machines: Learning the Weight of Feature Interactions via Attention Networks

Jun Xiao, Hao Ye|arXiv (Cornell University)|Aug 15, 2017

Recommender Systems and Techniques | References: 15 | Cited by: 119

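One-Sentence Summary

AFM extends Factorization Machines with an attention network that learns an importance weight for each feature interaction, improving predictive accuracy and interpretability while keeping the model compact. It outperforms FM and several deep baselines on sparse-data tasks.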

ABSTRACT

Factorization Machines (FMs) are a supervised learning approach that enhances the linear regression model by incorporating the second-order feature interactions. Despite effectiveness, FM can be hindered by its modelling of all feature interactions with the same weight, as not all feature interactions are equally useful and predictive. For example, the interactions with useless features may even introduce noises and adversely degrade the performance. In this work, we improve FM by discriminating the importance of different feature interactions. We propose a novel model named Attentional Factorization Machine (AFM), which learns the importance of each feature interaction from data via a neural attention network. Extensive experiments on two real-world datasets demonstrate the effectiveness of AFM. Empirically, it is shown on regression task AFM betters FM with a $8.6\%$ relative improvement, and consistently outperforms the state-of-the-art deep learning methods Wide&Deep and DeepCross with a much simpler structure and fewer model parameters. Our implementation of AFM is publicly available at: https://github.com/hexiangnan/attentional_factorization_machine
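For reference, the standard second-order FM that the abstract contrasts against can be written as below. The notation ($w_0$, $w_i$, $\mathbf{v}_i$, and $n$ features) is the conventional one and is assumed here rather than taken from this review.

```latex
\hat{y}_{\mathrm{FM}}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i
  + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle \, x_i x_j
```

Every pairwise term enters with an implicit weight of 1. This is the uniform weighting the abstract criticizes, and it is what AFM replaces with weights learned by the attention network.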

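Motivation and Objectives

Motivate improving Factorization Machines by distinguishing how useful different feature interactions are.
Propose a lightweight model that weights interactions with an attention mechanism.
Demonstrate that attention-based weighting improves prediction performance on sparse data.
Show that AFM offers better interpretability by exposing the importance of each interaction.
Provide empirical evidence comparing AFM, FM, and deep baselines on real-world datasets.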

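Proposed Method

Represent input features with a sparse one-hot encoding and embed each non-zero feature into a dense vector.
Introduce a pair-wise interaction layer that produces a vector for every feature pair via the element-wise product.
Apply an attention-based pooling layer in which an attention network learns a normalized weight $a_{ij}$ for each interaction.
Define the attention network as a small MLP (multi-layer perceptron) that computes $a'_{ij} = h^T \mathrm{ReLU}(W (v_i \odot v_j) x_i x_j + b)$, followed by $a_{ij} = \mathrm{softmax}(a'_{ij})$, where the softmax normalizes over all interacting pairs.
Combine the weighted interactions as $p^T \sum_{i<j} a_{ij} (v_i \odot v_j) x_i x_j$ and add the linear terms to obtain the final prediction.
Train for regression with squared loss, optimized by SGD, using L2 regularization on the attention network weight matrix $W$ and dropout on the pair-wise interaction layer to prevent overfitting.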
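The steps above can be sketched as a single forward pass in NumPy. This is a minimal illustration, not the authors' released implementation: the function names (`init_params`, `afm_forward`, `squared_loss_l2`), the parameter dictionary layout, the initialization scale, and the default sizes `k` (embedding size) and `t` (attention hidden size) are all assumptions made for this example. Dropout is omitted, so the sketch corresponds to inference time.

```python
import numpy as np

def init_params(n_features, k=8, t=4, seed=0):
    """Hypothetical parameter layout; shapes follow the symbols in the text."""
    rng = np.random.default_rng(seed)
    return {
        "w0": 0.0,                                 # global bias
        "w": np.zeros(n_features),                 # linear weights
        "V": rng.normal(0, 0.1, (n_features, k)),  # feature embeddings v_i
        "W": rng.normal(0, 0.1, (t, k)),           # attention weight matrix
        "b": np.zeros(t),                          # attention bias
        "h": rng.normal(0, 0.1, t),                # attention projection vector
        "p": np.ones(k),                           # prediction-layer weights
    }

def afm_forward(idx, x, params):
    """Predict for one example given its non-zero feature indices and values.

    Assumes at least two non-zero features, so at least one pair exists.
    Returns the prediction and the attention weight of each pair (i < j).
    """
    V = params["V"][idx]                           # (m, k)
    i, j = np.triu_indices(len(idx), k=1)          # all pairs with i < j
    # Pair-wise interaction layer: (v_i ⊙ v_j) x_i x_j
    inter = V[i] * V[j] * (x[i] * x[j])[:, None]   # (P, k)
    # Attention network: a'_ij = h^T ReLU(W (...) + b)
    hidden = np.maximum(inter @ params["W"].T + params["b"], 0.0)  # (P, t)
    scores = hidden @ params["h"]                  # (P,)
    # Softmax over all pairs (shifted by the max for numerical stability)
    exp_scores = np.exp(scores - scores.max())
    a = exp_scores / exp_scores.sum()
    # Attention-based pooling, then projection by p and the linear terms
    pooled = (a[:, None] * inter).sum(axis=0)      # (k,)
    y_hat = params["w0"] + params["w"][idx] @ x + params["p"] @ pooled
    return y_hat, a

def squared_loss_l2(y_hat, y, W, lam):
    """Squared error plus L2 regularization on the attention matrix W."""
    return (y_hat - y) ** 2 + lam * np.sum(W ** 2)

# Usage with three active one-hot features (values are illustrative only)
params = init_params(n_features=10)
idx = np.array([0, 3, 7])
x = np.ones(3)
y_hat, attn = afm_forward(idx, x, params)
loss = squared_loss_l2(y_hat, y=1.0, W=params["W"], lam=0.01)
```

The returned `attn` vector holds one weight per feature pair and sums to 1, which is what makes the learned interaction importances directly inspectable.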

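Experimental Results

Research Questions

RQ1: Can the attention mechanism effectively learn the importance of feature interactions in AFM?
RQ2: How do the key hyperparameters (dropout on the interaction layer, regularization on the attention network) affect AFM's performance?
RQ3: Does AFM outperform conventional FM and state-of-the-art deep models on sparse-data prediction tasks?
RQ4: Is AFM more interpretable thanks to its explicit interaction attention scores?
RQ5: How do the embedding size and the attention factor affect model performance and convergence?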

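Key Findings

On the regression task, AFM achieves an 8.6% relative improvement over FM.
AFM consistently outperforms Wide&Deep and DeepCross on the tested datasets, with a much simpler structure and fewer model parameters.
Dropout on the pair-wise interaction layer improves both AFM and FM, with a dataset-specific optimal ratio.
Regularizing the attention network further improves AFM's generalization beyond what dropout alone achieves.
AFM converges faster than FM, and its learned attention scores give an interpretable measure of interaction importance.
On the Frappe and MovieLens test sets, AFM achieves the best RMSE among the evaluated baselines.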
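To make the "relative improvement" figure concrete, the sketch below shows the usual way such a number is computed for an error metric like RMSE, where lower is better. The function name and the two RMSE values are hypothetical and chosen only to illustrate the arithmetic; they are not results reported in the paper.

```python
def relative_improvement(baseline_error, new_error):
    """Fractional reduction in an error metric (lower is better)."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error

# Hypothetical RMSE values for illustration only, NOT taken from the paper.
example = relative_improvement(baseline_error=0.500, new_error=0.457)
```

With these made-up inputs the result is about 0.086, that is, an 8.6% relative reduction in error.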

This review was generated by AI and checked by a human editor.