QUICK REVIEW

[Paper Review] On the Use of a Large Language Model to Support the Conduction of a Systematic Mapping Study: A Brief Report from a Practitioner's View

Carolina Barros, Author|arXiv (Cornell University)|Feb 9, 2026

Artificial Intelligence in Healthcare and Education|Cited by: 0

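One-Line Summary

This paper reports on the end-to-end use of LLMs to support a systematic mapping study, detailing the time savings, accuracy, prompt adjustments, and need for human oversight.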

ABSTRACT

The use of Large Language Models (LLMs) has drawn growing interest within the scientific community. LLMs can handle large volumes of textual data and support methods for evidence synthesis. Although recent studies highlight the potential of LLMs to accelerate screening and data extraction steps in systematic reviews, detailed reports of their practical application throughout the entire process remain scarce. This paper presents an experience report on the conduction of a systematic mapping study with the support of LLMs, describing the steps followed, the necessary adjustments, and the main challenges faced. Positive aspects are discussed, such as (i) the significant reduction of time in repetitive tasks and (ii) greater standardization in data extraction, as well as negative aspects, including (i) considerable effort to build reliable well-structured prompts, especially for less experienced users, since achieving effective prompts may require several iterations and testing, which can partially offset the expected time savings, (ii) the occurrence of hallucinations, and (iii) the need for constant manual verification. As a contribution, this work offers lessons learned and practical recommendations for researchers interested in adopting LLMs in systematic mappings and reviews, highlighting both efficiency gains and methodological risks and limitations to be considered.

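Motivation and Objectives

Demonstrate the end-to-end use of LLMs to support a systematic mapping study (SMS) in software engineering.
Evaluate the time efficiency and accuracy of LLM-assisted screening and data extraction compared with manual methods.
Identify the challenges, risks, and adjustments involved in integrating LLMs into the SMS workflow.
Provide practical recommendations and lessons learned for researchers using LLMs in systematic mappings and reviews.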

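Proposed Method

Define a protocol that follows the guidelines of Kitchenham and Charters and of Wohlin et al.
Screen titles and abstracts manually first, then apply structured prompts to ChatGPT-4 for comparison.
Perform data extraction with a predefined template under both the manual and the LLM-assisted conditions.
Apply a double-checking verification strategy to mitigate hallucinations and inconsistencies.
Try additional models (Gemini PRO, Manus, Copilot) on a subset to explore cross-model performance.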
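
The double-checking step above (comparing LLM results with manual results, then reviewing the discrepancies) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Decision` record, the function names, and the rule that a study missing from the LLM output counts as a discrepancy are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    """One include/exclude screening decision (hypothetical record layout)."""
    study_id: str
    include: bool


def find_discrepancies(manual: list[Decision], llm: list[Decision]) -> list[str]:
    """Return IDs of studies whose LLM decision differs from the manual one.

    Assumption: a study with no LLM decision at all is also flagged,
    so that it is sent to a human for review.
    """
    llm_by_id = {d.study_id: d.include for d in llm}
    flagged = []
    for decision in manual:
        # .get() returns None for a missing study, which never equals a bool.
        if llm_by_id.get(decision.study_id) != decision.include:
            flagged.append(decision.study_id)
    return flagged


def agreement_rate(manual: list[Decision], llm: list[Decision]) -> float:
    """Fraction of manually screened studies on which the LLM agrees."""
    if not manual:
        raise ValueError("manual decisions must not be empty")
    return (len(manual) - len(find_discrepancies(manual, llm))) / len(manual)


# Toy data for illustration only; these are not studies from the paper.
manual_decisions = [
    Decision("S1", True),
    Decision("S2", False),
    Decision("S3", True),
]
llm_decisions = [
    Decision("S1", True),
    Decision("S2", True),  # disagrees with the manual decision
    # "S3" is missing from the LLM output
]

to_review = find_discrepancies(manual_decisions, llm_decisions)
```

Only the flagged studies go back to a human reviewer, so the manual effort scales with the number of disagreements rather than with the size of the corpus.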

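Experimental Results

Research Questions

RQ1: How does LLM-assisted screening compare with manual screening in an SMS in terms of time and accuracy?
RQ2: How does LLM-assisted data extraction compare with manual extraction in an SMS in terms of time and accuracy?
RQ3: What practical adjustments, risks, and verification needs arise when integrating LLMs into the SMS workflow?
RQ4: How do alternative LLMs (Gemini PRO, Manus, Copilot) perform on the screening and extraction tasks?

Key Findings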

Aspect	Manual Execution	LLM-Assisted Execution (ChatGPT-4)
Screening Time	Approximately 23 days (219 studies)	Approximately 9 hours (reduction of 98%)
Extraction Time	Approximately 7 days (13 studies)	Approximately 1 hour (reduction of 99%)
Screening Accuracy	208 correct out of 219 studies; 11 hallucinations identified	Approximately 95% agreement (208/219)
Extraction Accuracy	12 correct out of 13 studies; 1 error identified	Approximately 92.3% agreement (12/13)
Main Risks	Human reading errors or fatigue	Hallucinations, dependence on prompt engineering, inconsistency across model versions
Verification Applied	Cross-checking among human reviewers	Double-checking: comparison with manual results + review of discrepancies

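LLM-assisted screening cut the time required from about 23 days to about 9 hours (a 98% reduction).
LLM-assisted extraction cut the time required from about 7 days to about 1 hour (a 99% reduction).
LLM screening reached about 95% agreement with the manual results (208/219), with 11 hallucinations identified.
LLM extraction reached about 92% agreement with the manual results (12/13).
LLM outputs require human verification to mitigate hallucinations and ensure consistency.
On the test subset, Gemini PRO reached 90% accuracy on both screening and extraction, Manus reached 98% on screening and 40% on extraction, and Copilot reached 60% on both tasks.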
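
The percentages above can be rechecked with a few lines of arithmetic. The sketch below assumes that a "day" means a 24-hour calendar day; the review does not say how days were converted to hours, and the helper names are illustrative only.

```python
HOURS_PER_DAY = 24  # assumption: the reported days are calendar days


def percent_reduction(manual_hours: float, llm_hours: float) -> float:
    """Time saved by the LLM-assisted run, as a percentage of the manual time."""
    return 100.0 * (1.0 - llm_hours / manual_hours)


def percent_agreement(correct: int, total: int) -> float:
    """Share of studies where the LLM result matched the manual result."""
    return 100.0 * correct / total


# Figures taken from the table and findings above.
screening_reduction = percent_reduction(23 * HOURS_PER_DAY, 9)
extraction_reduction = percent_reduction(7 * HOURS_PER_DAY, 1)
screening_agreement = percent_agreement(208, 219)
extraction_agreement = percent_agreement(12, 13)

print(f"Screening time reduction:  {screening_reduction:.1f}%")   # 98.4%
print(f"Extraction time reduction: {extraction_reduction:.1f}%")  # 99.4%
print(f"Screening agreement:       {screening_agreement:.1f}%")   # 95.0%
print(f"Extraction agreement:      {extraction_agreement:.1f}%")  # 92.3%
```

Under this assumption the results round to the reported 98%, 99%, 95%, and 92.3%. With only 13 studies in the extraction sample, a single error moves the agreement rate by almost 8 percentage points.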

This review was generated by AI and checked by a human editor.