QUICK REVIEW

[Paper Review] Benchmarking Natural Language Understanding Services for building Conversational Agents

Xingkun Liu, Arash Eshghi|arXiv (Cornell University)|Mar 13, 2019

Topic Modeling | References: 2 | Citations: 88

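One-line summary

This paper presents a large-scale, cross-platform evaluation of four NLU services (Rasa, Dialogflow, LUIS, Watson), comparing their intent classification and entity recognition performance on a dataset of about 25K utterances across 21 domains (64 intents, 54 entity types). Watson is the strongest on intents but the weakest on entity recognition, whereas Dialogflow, LUIS, and Rasa give more balanced results.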

ABSTRACT

We have recently seen the emergence of several publicly available Natural Language Understanding (NLU) toolkits, which map user utterances to structured, but more abstract, Dialogue Act (DA) or Intent specifications, while making this process accessible to the lay developer. In this paper, we present the first wide coverage evaluation and comparison of some of the most popular NLU services, on a large, multi-domain (21 domains) dataset of 25K user utterances that we have collected and annotated with Intent and Entity Type specifications and which will be released as part of this submission. The results show that on Intent classification Watson significantly outperforms the other platforms, namely, Dialogflow, LUIS and Rasa; though these also perform well. Interestingly, on Entity Type recognition, Watson performs significantly worse due to its low Precision. Again, Dialogflow, LUIS and Rasa perform well on this task.

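Motivation and objectives

Provide a broad, fair evaluation of popular NLU services for building conversational agents.
Release a large annotated NLU dataset that enables benchmarking across multiple domains.
Compare platform performance on intent classification and entity type recognition.
Identify the strengths and weaknesses of each platform to help developers choose a tool.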

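Proposed method

Build a dataset of 25,716 utterances across 21 domains, annotated with 64 intents and 54 entity types.
Crowdsource the data via Amazon Mechanical Turk and manually correct annotation inconsistencies.
Evaluate the four NLU services (Rasa, Dialogflow, LUIS, Watson) with 10-fold cross-validation, using a platform-specific training configuration for each.
Report micro-averaged precision, recall, and F1 for intents and entities, and run paired t-tests to assess statistical significance.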
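
The evaluation protocol above can be sketched roughly as follows. This is a minimal illustration, not the authors' released toolkit: the function names `micro_prf` and `k_fold_indices`, the contiguous fold layout, and the set-of-labels representation of each utterance are all assumptions made for this example.

```python
def micro_prf(gold, pred):
    """Micro-averaged precision, recall, and F1.

    `gold` and `pred` are parallel lists; each item is the set of labels
    (intents or entity types) attached to one utterance. Counts are pooled
    over all utterances before the ratios are taken, which is what makes
    the average "micro".
    """
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        g, p = set(g), set(p)
        tp += len(g & p)   # predicted and correct
        fp += len(p - g)   # predicted but not in gold
        fn += len(g - p)   # in gold but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def k_fold_indices(n, k=10):
    """Yield (train, test) index lists for k contiguous folds over n items."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size


if __name__ == "__main__":
    # Hypothetical labels, for illustration only.
    gold = [{"alarm_set"}, {"music_play"}, {"date", "time"}]
    pred = [{"alarm_set"}, {"news_query"}, {"date"}]
    print(micro_prf(gold, pred))
```

In practice the data would normally be shuffled (or stratified by intent) before splitting; the review does not say how the folds were drawn, so that step is left out here.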

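Experimental results

Research questions

RQ1: How do the four NLU services compare on overall F1 when the Intent and Entity Type tasks are combined?
RQ2: How do precision, recall, and F1 on Intent classification and Entity Type recognition differ across services?
RQ3: How statistically significant are the differences in intent and entity performance among Dialogflow, LUIS, Rasa, and Watson?
RQ4: Do context and annotation nuances affect platform performance, particularly for entity recognition?
RQ5: What do the dataset's characteristics (21 domains, 64 intents, 54 entity types) reveal about each service's strengths and limitations?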
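
For RQ3, a paired t-test compares two services on matched measurements. The sketch below computes only the t statistic and degrees of freedom using the standard library. It assumes the paired units are per-fold scores from the same cross-validation splits, which this review does not state explicitly, and the helper name `paired_t_statistic` is hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for a paired t-test.

    scores_a[i] and scores_b[i] must come from the same fold (or other
    matched unit) for service A and service B respectively.
    """
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    sd = stdev(diffs)  # sample standard deviation of the paired differences
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    t = mean(diffs) / (sd / sqrt(len(diffs)))
    return t, len(diffs) - 1
```

Turning t into a p-value requires the Student t distribution, for example via a statistics package; that step is omitted here to keep the sketch dependency-free.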

Key findings

Platform	Intent Prec	Intent Rec	Intent F1	Entity Prec	Entity Rec	Entity F1
Rasa	0.863	0.863	0.863	0.859	0.694	0.768
Dialogflow	0.870	0.859	0.864	0.782	0.709	0.743
LUIS	0.855	0.855	0.855	0.837	0.725	0.777
Watson	0.884	0.881	0.882	0.354	0.787	0.488
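
Each F1 value in the table is the harmonic mean of the precision and recall beside it, which gives a quick consistency check. The sketch below recomputes Entity F1 from the Entity Prec and Entity Rec columns; the names `f1` and `entity_scores` are introduced only for this example.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (Entity Prec, Entity Rec) copied from the table above.
entity_scores = {
    "Rasa": (0.859, 0.694),
    "Dialogflow": (0.782, 0.709),
    "LUIS": (0.837, 0.725),
    "Watson": (0.354, 0.787),
}

for name, (p, r) in entity_scores.items():
    print(f"{name}: Entity F1 = {f1(p, r):.3f}")
```

Because the inputs are already rounded to three decimals, a recomputed value can differ from the tabulated one in the last digit. The check also shows why Watson's Entity F1 is so low: the harmonic mean is pulled toward the smaller of the two values, here its precision of 0.354.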

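Watson achieves the highest Intent precision, recall, and F1 of the four platforms (F1 = 0.882), but its many false positives make its Entity F1 markedly lower.
Dialogflow, LUIS, and Rasa perform similarly overall, with no significant difference in Intent F1.
LUIS achieves the highest Entity F1 of the four services (0.777).
Watson's Entity F1 is significantly lower (0.488), with a large effect size, mainly because of its low precision (0.354).
Overall, when intents and entities are pooled there is no clear winner: Watson trades strong intent performance against weak entity performance.
As of this camera-ready version, Watson's Contextual Entity feature has not been evaluated.
The authors release the dataset and toolkit publicly to support ongoing benchmarking.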


This review was generated by AI and checked by a human editor.