QUICK REVIEW

[Paper Review] CyberMetric: A Benchmark Dataset based on Retrieval-Augmented Generation for Evaluating LLMs in Cybersecurity Knowledge

Norbert Tihanyi, Mohamed Amine Ferrag | arXiv (Cornell University) | Feb 12, 2024

Topic Modeling | Citations: 7

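One-Line Summary

CyberMetric introduces a 10,000-question cybersecurity benchmark, including an 80-question human-validated subset, for comparing the knowledge of security experts with that of LLMs. The questions are produced by a multi-stage, semi-automated QA generation pipeline and checked through extensive human validation.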

ABSTRACT

Large Language Models (LLMs) are increasingly used across various domains, from software development to cyber threat intelligence. Understanding all the different fields of cybersecurity, which includes topics such as cryptography, reverse engineering, and risk assessment, poses a challenge even for human experts. To accurately test the general knowledge of LLMs in cybersecurity, the research community needs a diverse, accurate, and up-to-date dataset. To address this gap, we present CyberMetric-80, CyberMetric-500, CyberMetric-2000, and CyberMetric-10000, which are multiple-choice Q&A benchmark datasets comprising 80, 500, 2000, and 10,000 questions respectively. By utilizing GPT-3.5 and Retrieval-Augmented Generation (RAG), we collected documents, including NIST standards, research papers, publicly accessible books, RFCs, and other publications in the cybersecurity domain, to generate questions, each with four possible answers. The results underwent several rounds of error checking and refinement. Human experts invested over 200 hours validating the questions and solutions to ensure their accuracy and relevance, and to filter out any questions unrelated to cybersecurity. We have evaluated and compared 25 state-of-the-art LLM models on the CyberMetric datasets. In addition to our primary goal of evaluating LLMs, we involved 30 human participants to solve CyberMetric-80 in a closed-book scenario. The results can serve as a reference for comparing the general cybersecurity knowledge of humans and LLMs. The findings revealed that GPT-4o, GPT-4-turbo, Mixtral-8x7B-Instruct, Falcon-180B-Chat, and GEMINI-pro 1.0 were the best-performing LLMs. Additionally, the top LLMs were more accurate than humans on CyberMetric-80, although highly experienced human experts still outperformed small models such as Llama-3-8B, Phi-2 or Gemma-7b.

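Motivation and Objectives

Motivate a comprehensive, human-validated benchmark of cybersecurity knowledge and use it to evaluate LLMs across diverse domains (cryptography, network security, governance, and others).
Build a scalable QA generation pipeline that combines LLMs and human experts to produce high-quality cybersecurity questions suitable for both LLMs and human experts.
Provide a framework for fairly comparing human experts with a range of LLMs, highlighting the strengths and gaps of current models.
Enable researchers to benchmark cybersecurity-specific LLMs and steer their development.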

Proposed Method

Collected roughly 100,000 pages in total from about 58 publicly available cybersecurity documents.
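Semi-automated question generation: GPT-3.5 writes the questions; Falcon-180B acts as a validator; human validators refine relevance and grammar.
Post-processing of questions through grammar correction (T5-base) and contextual relevance checks (Falcon-180B, GPT-4 analysis).
During the testing phase, GPT-4 flags potentially incorrect items, which then require human review. Problems are categorized as multiple correct answers, outdated context, incomplete context, source errors, or missing references.
After finalization, exactly 10,000 questions are distributed across 9 domains (CyberMetric table).
CyberMetric-80: a validated subset of 80 questions selected by cybersecurity experts for the human-versus-LLM comparison; 30 participants from diverse backgrounds completed the survey.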
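
The stages listed above can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Question` dataclass, the `Issue` enum, `run_pipeline`, and the four stage callables are hypothetical names, and the callables merely stand in for the GPT-3.5, Falcon-180B, T5-base, and GPT-4 steps described in this review.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Iterable, List, Tuple


class Issue(Enum):
    """Problem categories the review lists for flagged questions."""
    MULTIPLE_ANSWERS = "multiple correct answers"
    OUTDATED_CONTEXT = "outdated context"
    INCOMPLETE_CONTEXT = "incomplete context"
    SOURCE_ERROR = "source error"
    MISSING_REFERENCE = "missing reference"


@dataclass
class Question:
    text: str
    options: List[str]   # four answer options per question
    answer: int          # index of the correct option
    issues: List[Issue] = field(default_factory=list)


def run_pipeline(
    chunks: Iterable[str],
    generate: Callable[[str], Question],      # stand-in for GPT-3.5
    validate: Callable[[Question], bool],     # stand-in for Falcon-180B
    fix_grammar: Callable[[str], str],        # stand-in for T5-base
    flag: Callable[[Question], List[Issue]],  # stand-in for GPT-4
) -> Tuple[List[Question], List[Question]]:
    """Return (accepted, needs_human_review) lists of questions."""
    accepted: List[Question] = []
    review: List[Question] = []
    for chunk in chunks:
        q = generate(chunk)
        # Drop malformed or irrelevant questions early.
        if len(q.options) != 4 or not validate(q):
            continue
        q.text = fix_grammar(q.text)
        q.issues = flag(q)
        # Anything flagged goes to the human review queue.
        (review if q.issues else accepted).append(q)
    return accepted, review
```

Passing the stages in as callables keeps the sketch independent of any particular model API; each stand-in can be replaced by a real model call or a human-review step.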

Figure 1: Covered Domains in CyberMetric

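Experimental Results

Research Questions

RQ1: Do current LLMs match or exceed human experts in the cybersecurity domain?
RQ2: Which models are the most efficient for their size?
RQ3: In which areas do humans still outperform LLMs?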

Key Findings

LLM model	Company	Size	License	Run 1	Run 2	Run 3	Run 4	Mean	Std. dev.
GPT-4.0-1106-preview	OpenAI	1.6T*	Proprietary	97.50	93.75	96.25	95.00	95.63	1.61
Mixtral-8x7B-Instruct	Mistral AI	45B	Apache 2.0	93.75	92.50	91.25	92.50	92.50	1.02
GEMINI-pro (Bard)	Google	137B	Proprietary	90.00	91.25	92.50	90.00	90.94	1.20
GPT-3.5-turbo-1106	OpenAI	175B*	Proprietary	90.00	87.50	85.00	87.50	87.50	2.04
Falcon-180B-Chat	TII	180B	Apache 2.0	82.50	82.50	82.50	82.50	82.50	0.00
Flan-T5-XXL	Google	11B	Apache 2.0	81.75	82.50	81.75	81.75	81.94	0.63
Zephyr-7B-beta	HuggingFace	7B	MIT	81.25	81.25	81.25	80.00	80.94	0.63
Llama 2-70B	Meta	70B	Apache 2.0	75.00	72.50	72.50	75.00	72.38	0.14
Mistral-7B-Instruct	Mistral AI	7B	Apache 2.0	72.50	72.50	72.50	72.50	72.50	0.00
Falcon-40B-Instruct	TII	40B	Apache 2.0	67.50	66.25	61.25	61.25	64.06	3.28
Llama 2-13B	Meta	13B	Open	55.00	56.25	52.50	51.25	53.75	2.28
Flan-T5-Base [35]	Google	0.25B	Apache 2.0	51.25	51.25	51.25	51.25	51.25	0.00
Llama 2-7B	Meta	7B	Open	46.25	46.25	50.00	43.75	44.06	2.95
Dolly V2 12b BF16 [36]	Databricks	12B	MIT	33.75	33.75	32.50	30.00	32.50	1.77
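
The mean and standard deviation columns can be recomputed from the four run columns. The sketch below assumes the standard deviation is the sample (n-1) form, which is consistent with the listed GPT-4 and Mixtral figures; the `runs` dictionary copies two rows from the table, and `summarize` is a hypothetical helper.

```python
from statistics import mean, stdev

# Run scores copied from two rows of the table above.
runs = {
    "GPT-4.0-1106-preview": [97.50, 93.75, 96.25, 95.00],
    "Mixtral-8x7B-Instruct": [93.75, 92.50, 91.25, 92.50],
}


def summarize(scores):
    """Return (mean, sample standard deviation), both unrounded."""
    return mean(scores), stdev(scores)


for model, scores in runs.items():
    avg, sd = summarize(scores)
    print(f"{model}: mean={avg:.3f} sd={sd:.3f}")
```

Applying the same check to every row is a quick way to spot transcription slips in the table.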

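GPT-4 achieved the highest mean accuracy on CyberMetric-80, at 95.63%.
Among the open-source options, Mixtral-8x7B-Instruct and Falcon-180B were the strongest; Zephyr-7B-beta reached 80.94% with only 7B parameters.
LLMs generally outperformed humans on the 80-question survey, but humans did better in certain specialist subject areas and on up-to-date or complex topics.
The mean human accuracy on CyberMetric-80 was about 53.83%; experienced participants averaged about 72.24%, and highly experienced experts reached up to about 88.75% in individual cases.
The two-tier evaluation (CyberMetric-80 versus CyberMetric-10,000) serves as a cross-check: the larger dataset surfaces problems of question accuracy and coverage that an expert panel can then detect.
The results underscore how up-to-date information and retrieval capability (RAG) affect answers to questions about recently published guidelines (e.g., NIST SP 800-63B, BSI TR-02102-1).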
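
Each run score is the percentage of multiple-choice questions answered correctly, so on the 80-question set a single question is worth 1.25 percentage points. A minimal scoring sketch, assuming answers are stored as option letters; the `accuracy` helper and the sample answer key are illustrative and not taken from the paper:

```python
def accuracy(predicted, gold):
    """Percentage of positions where the predicted option matches the key."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("predicted and gold must be non-empty and of equal length")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return 100.0 * correct / len(gold)


gold = ["A", "C", "B", "D"] * 20        # 80 illustrative answer keys
predicted = list(gold)
predicted[0], predicted[1] = "B", "A"   # introduce two wrong answers
score = accuracy(predicted, gold)       # 78 of 80 correct
print(score)
```

With two mistakes out of 80 the score is 97.5, the same value as the best single GPT-4 run in the table.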

Figure 2: Framework for AI-driven question generation methodology, incorporating human validation.

This review was generated by AI and checked by a human editor.