QUICK REVIEW

[Paper Review] Comparative Study of Domain Driven Terms Extraction Using Large Language Models

Sandeep Chataut, Tuyen Do | arXiv (Cornell University) | Apr 2, 2024

Advanced Text Analysis Techniques | Cited by 5

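One-line summary

This paper compares GPT-3.5, Llama-2-7B, and Falcon-7B on prompt-based keyword/term extraction, evaluates them on Inspec and PubMed using Jaccard similarity, and discusses prompt design, hallucination, and the accompanying Python package.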

ABSTRACT

Keywords play a crucial role in bridging the gap between human understanding and machine processing of textual data. They are essential to data enrichment because they form the basis for detailed annotations that provide a more insightful and in-depth view of the underlying data. Keyword/domain driven term extraction is a pivotal task in natural language processing, facilitating information retrieval, document summarization, and content categorization. This review focuses on keyword extraction methods, emphasizing the use of three major Large Language Models (LLMs): Llama2-7B, GPT-3.5, and Falcon-7B. We employed a custom Python package to interface with these LLMs, simplifying keyword extraction. Our study, utilizing the Inspec and PubMed datasets, evaluates the performance of these models. The Jaccard similarity index was used for assessment, yielding scores of 0.64 (Inspec) and 0.21 (PubMed) for GPT-3.5, 0.40 and 0.17 for Llama2-7B, and 0.23 and 0.12 for Falcon-7B. This paper underlines the role of prompt engineering in LLMs for better keyword extraction and discusses the impact of hallucination in LLMs on result evaluation. It also sheds light on the challenges in using LLMs for keyword extraction, including model complexity, resource demands, and optimization techniques.

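Motivation and objectives

Evaluate how effectively three large language models (GPT-3.5, Llama-2-7B, Falcon-7B) perform domain-driven keyword extraction.
Score model outputs against the Inspec and PubMed reference keywords with a single, consistent metric.
Enable LLM-based keyword extraction through a Python package that includes LangChain integration.
Examine the role of prompt design and the effect of hallucination on evaluation.
Provide practical insight into model performance, requirements, and limitations for keyword extraction.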

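Proposed method

Interface with the three LLMs (Llama-2-7B, GPT-3.5, Falcon-7B) through a custom Python package built on LangChain.
Evaluate keyword extraction against ground truth built by combining the reference keywords of Inspec and PubMed.
Use Jaccard similarity to measure the overlap between each model's output and the reference keyword set.
Explore prompt design techniques, including zero-shot prompting, and adopt a formal prompt construction f_KeywordExtraction(P, L) that uses a [MASK] placeholder.
Report inference times and discuss model-specific behaviors such as hallucination, extra terms, and definitions, along with their effect on accuracy.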
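
The Jaccard similarity used for evaluation is the size of the intersection of two sets divided by the size of their union, J(A, B) = |A ∩ B| / |A ∪ B|. The following is a minimal sketch of that metric, not the paper's own implementation. The normalize helper, the lowercasing step, and the choice to score two empty sets as 1.0 are assumptions made for illustration, and the keyword lists are invented.

```python
def normalize(keywords):
    """Lowercase and strip whitespace so trivially different spellings match.

    This normalization is an assumption; the paper's exact preprocessing
    is not described in this review.
    """
    return {k.strip().lower() for k in keywords if k.strip()}


def jaccard(predicted, reference):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two keyword collections."""
    a, b = normalize(predicted), normalize(reference)
    if not a and not b:
        return 1.0  # assumption: two empty sets are treated as identical
    return len(a & b) / len(a | b)


# Illustrative keyword lists (not taken from Inspec or PubMed).
predicted = ["Keyword Extraction", "large language models", "prompt design"]
reference = ["keyword extraction", "large language models", "Jaccard similarity"]

score = jaccard(predicted, reference)
print(round(score, 2))  # 2 shared keywords out of 4 distinct ones -> 0.5
```

Because the union sits in the denominator, every predicted term that is missing from the reference lowers the score, even when all reference keywords are found.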

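Experimental results

Research questions

RQ1: How do GPT-3.5, Llama-2-7B, and Falcon-7B perform on domain-driven keyword extraction when evaluated against the Inspec and PubMed references?
RQ2: How do prompt design and the temperature parameter affect keyword extraction quality?
RQ3: How do hallucinations and domain-specific terms affect evaluation metrics such as Jaccard similarity?
RQ4: What are the practical trade-offs (accuracy, speed, resource usage) among the three models on the keyword extraction task?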
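
To make the prompt-design question concrete, here is a minimal sketch of a zero-shot prompt with a [MASK] placeholder that is filled with the input text. The template wording, the build_prompt function, the domain parameter, and the GENERATION_SETTINGS dictionary are all hypothetical; the paper's actual prompts and package API are not reproduced in this review. Only the temperature value of 0.2 comes from the findings reported here.

```python
# Hypothetical zero-shot template; the wording is an assumption, not the paper's prompt.
PROMPT_TEMPLATE = (
    "You are an expert in {domain}.\n"
    "Extract the most important keywords from the text below.\n"
    "Return only a comma-separated list, with no definitions or explanations.\n\n"
    "Text: [MASK]"
)

# Hypothetical settings container; 0.2 is the temperature mentioned in the review.
GENERATION_SETTINGS = {"temperature": 0.2}


def build_prompt(text, domain="computer science"):
    """Fill the domain slot first, then substitute the [MASK] placeholder with the text."""
    instructions = PROMPT_TEMPLATE.format(domain=domain)
    return instructions.replace("[MASK]", text)


prompt = build_prompt("Graph neural networks for drug discovery.", domain="biomedicine")
print(prompt)
```

Asking explicitly for a bare list is one plausible way to discourage the extra definitions that the findings associate with lower Jaccard scores; whether it helps for a given model would need to be measured.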

Key findings

LLM	Avg Inference Time	Hardware Specifications	Remarks
Falcon-7B	7-12 secs	T4 GPU	Small variation in inference time based on different input lengths
Llama2-7B	4-8 secs	T4 GPU	Small variation in inference time based on different input lengths
GPT-3.5	3-5 secs	CPU	Almost no variation in inference time based on different input lengths

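GPT-3.5 achieves an average Jaccard score of 0.64 on Inspec and 0.21 on PubMed.
Llama-2-7B achieves 0.40 (Inspec) and 0.17 (PubMed).
Falcon-7B achieves 0.23 (Inspec) and 0.12 (PubMed).
Lowering the temperature to 0.2 makes outputs more deterministic, which affects keyword diversity and the potential for hallucination.
Llama-2-7B sometimes generates additional keywords and definitions that are absent from the reference data, which can lower its Jaccard similarity on PubMed.
GPT-3.5 produces concise, well-aligned keywords with minimal unnecessary terms, although hallucination can still introduce new terms.
Estimated inference time: Falcon-7B 7–12 s and Llama-2-7B 4–8 s (T4 GPU), GPT-3.5 3–5 s (CPU).
A dedicated Python package integrated with LangChain enables multi-LLM keyword extraction under a standardized prompt framework.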
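
The effect of extra terms on the score can be illustrated with a small worked example. This is a sketch under stated assumptions: parse_keywords is a hypothetical helper for turning a raw model reply into a keyword set, and the reference keywords and replies are invented for illustration, not drawn from PubMed or from any model output in the paper.

```python
import re


def parse_keywords(raw_output):
    """Split a raw reply on commas, semicolons, or newlines into a normalized set.

    Hypothetical post-processing step; leading bullet characters are dropped.
    """
    parts = re.split(r"[,;\n]", raw_output)
    cleaned = (p.strip().lstrip("-*• ").strip().lower() for p in parts)
    return {c for c in cleaned if c}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets score 1.0 by assumption."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Invented reference keywords and replies, for illustration only.
reference = {"insulin resistance", "type 2 diabetes", "metformin", "glucose"}
concise_reply = "insulin resistance, type 2 diabetes, metformin, glucose"
verbose_reply = concise_reply + ", diabetes is a chronic disease, blood sugar, obesity, treatment"

concise_score = jaccard(parse_keywords(concise_reply), reference)
verbose_score = jaccard(parse_keywords(verbose_reply), reference)

print(concise_score)  # all 4 reference keywords, nothing extra -> 1.0
print(verbose_score)  # same 4 hits plus 4 extra terms -> 4 / 8 = 0.5
```

Both replies recover every reference keyword, yet the verbose one scores half as much. Jaccard similarity therefore penalizes added terms and definitions whether or not they are relevant to the document.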

This review was generated by AI and checked by a human editor.