QUICK REVIEW

[Paper Review] Question Embeddings Based on Shannon Entropy - Solving intent classification task in goal-oriented dialogue system

Aleksandr Perevalov, Даниил Сергеевич Курушин | arXiv (Cornell University) | Mar 4, 2019

Topic Modeling | References: 8 | Cited by: 2

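One-Sentence Summary

This paper proposes a new question embedding method based on Shannon entropy to improve intent classification in low-resource, domain-specific dialogue systems. By computing the entropy distribution of each word across the dataset and applying truncated singular value decomposition (SVD), the method produces compact, dense vector representations that outperform TF-IDF, word2vec, and FastText on a student-query dataset of only 1,300 labeled samples, raising the F1 score by 0.02 (2 percentage points) over the strongest baseline, TF-IDF.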

ABSTRACT

Question-answering systems and voice assistants are becoming a major part of client service departments of many organizations, helping them to reduce the labor costs of staff. In many such systems, there is always a natural language understanding module that solves the intent classification task. This task is complicated because of its case-dependency: every subject area has its own semantic kernel. The state-of-the-art approaches for intent classification are different machine learning and deep learning methods that use text vector representations as input. The basic vector representation models such as Bag of words and TF-IDF generate sparse matrices, which become very big as the amount of input data grows. Modern methods such as word2vec and FastText use neural networks to evaluate word embeddings with fixed dimension size. As we are developing a question-answering system for students and enrollees of the Perm National Research Polytechnic University, we have faced the problem of user's intent detection. The subject area of our system is very specific, which is why there is a lack of training data. This aspect makes the intent classification task more challenging for state-of-the-art deep learning methods. In this paper, we propose an approach to question embedding representation based on the calculation of Shannon entropy. The goal of the approach is to produce low-dimensional question vectors as neural approaches do and to outperform the related methods described above in the condition of a small dataset. We evaluate and compare our model with existing ones using logistic regression and a dataset that contains questions asked by students and enrollees. The data is labeled into six classes. Experimental comparison of the proposed approach and other models revealed that the proposed model performed better in the given task.

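Research Motivation and Goals

Address the challenge of intent classification in low-resource, domain-specific dialogue systems where labeled data is limited.
Develop a dense, low-dimensional text representation that avoids the large data requirements of deep learning models.
Outperform both classical (TF-IDF) and modern (word2vec, FastText) methods on intent classification when data is scarce.
Create a scalable, efficient vector representation that reduces dimensionality while preserving semantic fidelity.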

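Proposed Method

The method computes the Shannon entropy of each word's distribution across all questions in the dataset.
Word vectors are built from these entropy values, capturing contextual semantic patterns without a neural network.
Truncated singular value decomposition (SVD) reduces the entropy-based matrix to 200 dimensions.
Question embeddings are produced by aggregating the entropy-based word vectors within each question.
The resulting vectors are fed to a logistic regression classifier that uses a one-vs-rest strategy.
Evaluation uses five-fold cross-validation, with the F1 score as the primary metric.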
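
The representation steps listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the summary does not give the exact formulas, so the per-cell entropy term, the whitespace tokenizer, the use of `np.linalg.svd` cut to the top `k` components in place of a dedicated truncated-SVD routine, and mean pooling as the aggregation are all assumptions. The function names and the toy questions are hypothetical. The classifier and cross-validation steps are left out.

```python
from collections import Counter

import numpy as np


def entropy_matrix(questions):
    """Return (vocab, E). E[i, j] = -p * log2(p), where p is the share of
    word i's occurrences that fall in question j (an assumed definition)."""
    tokenized = [q.lower().split() for q in questions]
    vocab = sorted({w for toks in tokenized for w in toks})
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(questions)))
    for j, toks in enumerate(tokenized):
        for w, c in Counter(toks).items():
            counts[index[w], j] = c
    # Each word occurs at least once, so row sums are never zero.
    p = counts / counts.sum(axis=1, keepdims=True)
    E = np.zeros_like(p)
    mask = p > 0
    E[mask] = -p[mask] * np.log2(p[mask])
    return vocab, E


def word_vectors(E, k):
    """Keep the top-k singular directions of E (SVD, then truncation)."""
    U, s, _ = np.linalg.svd(E, full_matrices=False)
    k = min(k, len(s))
    return U[:, :k] * s[:k]


def question_embeddings(questions, vocab, W):
    """Average the vectors of the words in each question (assumed pooling)."""
    index = {w: i for i, w in enumerate(vocab)}
    out = np.zeros((len(questions), W.shape[1]))
    for j, q in enumerate(questions):
        rows = [index[w] for w in q.lower().split() if w in index]
        if rows:
            out[j] = W[rows].mean(axis=0)
    return out


# Hypothetical toy data; the paper's dataset is not reproduced here.
questions = [
    "when is the exam",
    "where is the dorm",
    "when is the deadline",
    "how to apply",
]
vocab, E = entropy_matrix(questions)
W = word_vectors(E, k=2)  # the paper reportedly uses 200 dimensions
X = question_embeddings(questions, vocab, W)
print(X.shape)  # (4, 2)
```

Under this assumed definition, a word that occurs in only one question has zero entropy and therefore a zero vector, so the last toy question maps to the zero embedding. Summing a row of `E` gives that word's Shannon entropy over questions; for example, "when" is split evenly across two questions and has an entropy of 1 bit.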

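Experimental Results

Research Questions

RQ1: Can the Shannon-entropy-based embedding achieve better intent classification than TF-IDF, word2vec, and FastText on a small, domain-specific dataset?
RQ2: Compared with TF-IDF, does the entropy-based method maintain high performance while substantially reducing vector dimensionality?
RQ3: How does the method perform on a general-purpose dataset with balanced classes, such as IMDB movie reviews?
RQ4: Can the method cope effectively with class imbalance in low-resource settings?
RQ5: Is the entropy-based representation robust, and does it generalize across different natural language processing tasks?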

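Main Findings

The proposed Shannon entropy method achieved an F1 score of 0.74 on the student-query dataset, ahead of TF-IDF (0.72), word2vec (0.67), and FastText (0.63).
On the same dataset, this is an F1 gain of 0.02 (2 percentage points) over TF-IDF and 0.07 (7 percentage points) over word2vec.
FastText performed worst, with an F1 score of 0.63, possibly because of data scarcity.
On the IMDB dataset, the method matched TF-IDF's F1 score (0.90) with far fewer dimensions (200 vs. 8,623).
The results suggest that entropy-based representations can retain semantic information efficiently at lower computational cost.
The PRIV class was classified poorly because of class imbalance, suggesting that future work should redesign the class taxonomy.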
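
As a quick check on the reported numbers, the short sketch below recomputes the absolute F1 gaps and the dimensionality ratio from the figures quoted in the findings. The dictionary keys and variable names are illustrative only; nothing here comes from the paper beyond the scores and the two dimension counts.

```python
# F1 scores on the student-query dataset, as quoted in the findings.
f1 = {"entropy": 0.74, "tfidf": 0.72, "word2vec": 0.67, "fasttext": 0.63}

# Absolute F1 gap between the entropy-based method and each baseline.
gaps = {
    name: round(f1["entropy"] - score, 2)
    for name, score in f1.items()
    if name != "entropy"
}
print(gaps)  # {'tfidf': 0.02, 'word2vec': 0.07, 'fasttext': 0.11}

# IMDB vector sizes: 200 (entropy + SVD) vs. 8,623 (TF-IDF).
ratio = 8623 / 200
print(round(ratio, 1))  # 43.1
```

The gaps are absolute differences in F1 (percentage points), not relative improvements, and on IMDB the entropy-based vectors are roughly 43 times smaller than the TF-IDF vectors.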


This review was generated by AI and checked by a human editor.