QUICK REVIEW

[Paper Review] CyberMetric: A Benchmark Dataset based on Retrieval-Augmented Generation for Evaluating LLMs in Cybersecurity Knowledge

Norbert Tihanyi, Mohamed Amine Ferrag|arXiv (Cornell University)|Feb 12, 2024

Topic Modeling|Cited by 7

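One-Sentence Summary

CyberMetric introduces a 10,000-question cybersecurity benchmark (including a human-verified 80-question subset) for evaluating LLM knowledge and comparing it with that of cybersecurity professionals, built with a multi-stage, semi-automated question-generation pipeline and extensive human validation.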

ABSTRACT

Large Language Models (LLMs) are increasingly used across various domains, from software development to cyber threat intelligence. Understanding all the different fields of cybersecurity, which includes topics such as cryptography, reverse engineering, and risk assessment, poses a challenge even for human experts. To accurately test the general knowledge of LLMs in cybersecurity, the research community needs a diverse, accurate, and up-to-date dataset. To address this gap, we present CyberMetric-80, CyberMetric-500, CyberMetric-2000, and CyberMetric-10000, which are multiple-choice Q&A benchmark datasets comprising 80, 500, 2000, and 10,000 questions respectively. By utilizing GPT-3.5 and Retrieval-Augmented Generation (RAG), we collected documents, including NIST standards, research papers, publicly accessible books, RFCs, and other publications in the cybersecurity domain, to generate questions, each with four possible answers. The results underwent several rounds of error checking and refinement. Human experts invested over 200 hours validating the questions and solutions to ensure their accuracy and relevance, and to filter out any questions unrelated to cybersecurity. We have evaluated and compared 25 state-of-the-art LLM models on the CyberMetric datasets. In addition to our primary goal of evaluating LLMs, we involved 30 human participants to solve CyberMetric-80 in a closed-book scenario. The results can serve as a reference for comparing the general cybersecurity knowledge of humans and LLMs. The findings revealed that GPT-4o, GPT-4-turbo, Mixtral-8x7B-Instruct, Falcon-180B-Chat, and GEMINI-pro 1.0 were the best-performing LLMs. Additionally, the top LLMs were more accurate than humans on CyberMetric-80, although highly experienced human experts still outperformed small models such as Llama-3-8B, Phi-2 or Gemma-7b.

Motivation and Objectives

Motivate a comprehensive, human-validated benchmark for cybersecurity knowledge to assess LLMs across diverse domains (cryptography, network security, governance, etc.).
Create a scalable QA generation pipeline combining LLMs and human experts to produce high-quality cybersecurity questions.
Provide a fair framework to compare human experts and various LLMs, highlighting strengths and gaps in current models.
Enable researchers to benchmark and guide development of cybersecurity-specialized LLMs.

Proposed Method

Data collection from 580 public cybersecurity documents totaling ~100,000 pages.
Semi-automated question generation: GPT-3.5 creates questions; Falcon-180B acts as a validator; human validators refine relevance and grammar.
Question post-processing with grammar correction (T5-base) and context relevance checks (Falcon-180B, GPT-4 analysis).
Test phase where GPT-4 flags potentially incorrect items for human review, categorizing issues (multiple answers, outdated context, incomplete context, source errors, missing references).
Finalization yields exactly 10,000 questions distributed across nine domains (CyberMetric table).
CyberMetric-80: a vetted subset of 80 questions selected by cybersecurity experts for human-vs-LLM comparison; 30 participants from diverse backgrounds completed the survey.
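
The question format and the review categories listed above can be sketched as a small data structure. This is a minimal illustration, not the authors' code: the names `MCQuestion` and `ReviewIssue`, the field names, and both checks are hypothetical. Only the four-option format and the five issue categories come from the text, and the sample question is invented for the example rather than taken from the dataset.

```python
from dataclasses import dataclass, field
from enum import Enum


class ReviewIssue(Enum):
    """Issue categories named for items flagged during the test phase."""
    MULTIPLE_ANSWERS = "multiple answers"
    OUTDATED_CONTEXT = "outdated context"
    INCOMPLETE_CONTEXT = "incomplete context"
    SOURCE_ERROR = "source error"
    MISSING_REFERENCE = "missing reference"


@dataclass
class MCQuestion:
    """Hypothetical record for one multiple-choice item."""
    question: str
    answers: dict          # option label -> option text
    solution: str          # label of the correct option
    issues: list = field(default_factory=list)  # ReviewIssue values

    def is_well_formed(self) -> bool:
        """True if there are exactly four options and the solution is one of them."""
        return len(self.answers) == 4 and self.solution in self.answers

    def needs_human_review(self) -> bool:
        """True if the item was flagged or is structurally invalid."""
        return bool(self.issues) or not self.is_well_formed()


# Illustrative item (not from CyberMetric).
sample = MCQuestion(
    question="Which TCP port does HTTPS use by default?",
    answers={"A": "21", "B": "80", "C": "443", "D": "25"},
    solution="C",
)
```

Keeping the issue category on the record means a flagged item can be routed to a human reviewer together with the reason it was flagged.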

Figure 1: Covered Domains in CyberMetric

Experimental Results

Research Questions

RQ1: Do current LLMs match or surpass human experts across cybersecurity domains?
RQ2: Which available model offers the best efficiency relative to its size?
RQ3: In which domains do humans still outperform LLMs?

Key Findings

LLM model	Company	Size	License	Run 1	Run 2	Run 3	Run 4	Mean	Std
GPT-4.0-1106-preview	OpenAI	1.6T*	Proprietary	97.50	93.75	96.25	95.00	95.63	1.61
Mixtral-8x7B-Instruct	Mistral AI	45 B	Apache 2.0	93.75	92.50	91.25	92.50	92.50	1.02
GEMINI-pro (Bard)	Google	137 B	Proprietary	90.00	91.25	92.50	90.00	90.94	1.20
GPT-3.5-turbo-1106	OpenAI	175B*	Proprietary	90.00	87.50	85.00	87.50	87.50	2.04
Falcon-180B-Chat	TII	180B	Apache 2.0	82.50	82.50	82.50	82.50	82.50	0.00
Flan-T5-XXL	Google	11B	Apache 2.0	81.75	82.50	81.75	81.75	81.94	0.63
Zephyr-7B-beta	HuggingFace	7B	MIT	81.25	81.25	81.25	80.00	80.94	0.63
Llama 2-70B	Meta	70B	Apache 2.0	75.00	72.50	72.50	75.00	72.38	0.14
Mistral-7B-Instruct	Mistral AI	7B	Apache 2.0	72.50	72.50	72.50	72.50	72.50	0.00
Falcon-40B-Instruct	TII	40B	Apache 2.0	67.50	66.25	61.25	61.25	64.06	3.28
Llama 2-13B	Meta	13B	Open	55.00	56.25	52.50	51.25	53.75	2.28
Flan-T5-Base [35]	Google	0.25B	Apache 2.0	51.25	51.25	51.25	51.25	51.25	0.00
Llama 2-7B	Meta	7B	Open	46.25	46.25	50.00	43.75	44.06	2.95
Dolly V2 12b BF16 [36]	Databricks	12B	MIT	33.75	33.75	32.50	30.00	32.50	1.77
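
The Mean and Std columns can be recomputed from the four runs. The sketch below assumes the table reports the sample standard deviation (dividing by n - 1) and rounds halves up to two decimals. Under those assumptions it reproduces the reported figures for the three rows used here; it has not been checked against every row. The helper name `summarize` is hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP
from statistics import mean, stdev

# Per-run accuracies (%) copied from the first three rows of the table.
runs = {
    "GPT-4.0-1106-preview": [97.50, 93.75, 96.25, 95.00],
    "Mixtral-8x7B-Instruct": [93.75, 92.50, 91.25, 92.50],
    "GEMINI-pro (Bard)": [90.00, 91.25, 92.50, 90.00],
}


def round2(value: float) -> float:
    """Round to two decimals with halves rounded up (95.625 -> 95.63)."""
    return float(Decimal(str(value)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP))


def summarize(scores):
    """Return (mean, sample standard deviation), both rounded to two decimals."""
    return round2(mean(scores)), round2(stdev(scores))


for model, scores in runs.items():
    m, s = summarize(scores)
    print(f"{model}: mean={m:.2f} std={s:.2f}")
```

Explicit half-up rounding is used because Python's built-in `round` rounds exact ties to the nearest even digit, which would turn 95.625 into 95.62 rather than the 95.63 shown in the table.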

GPT-4 achieved the highest mean accuracy on CyberMetric-80 at 95.63%.
Among open-source options, Mixtral-8x7B-Instruct and Falcon-180B were strongest performers; Zephyr-7B-beta achieved 80.94% with 7B parameters.
LLMs generally outperformed humans on the 80-question survey, but humans showed higher performance in several expert-subject cases and in certain up-to-date or complex topics.
CyberMetric-80 results show a mean human accuracy of about 53.83%, with experienced participants scoring ~72.24% and highly experienced experts reaching up to ~88.75% on individual cases.
A two-tier evaluation (CyberMetric-80 vs CyberMetric-10,000) serves as a cross-check: larger datasets reveal issues in question accuracy or scope that the expert panel can detect.
The study highlights the impact of up-to-date information and retrieval capabilities (RAG) on answering recently published guidelines (e.g., NIST SP 800-63B, BSI TR-02102-1).
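
Because CyberMetric-80 contains 80 questions, each question is worth 1.25 percentage points, so a single-run score of 97.50% corresponds to 78 correct answers. A minimal scoring sketch follows. The function name `accuracy` and the answer-label lists are hypothetical, not the paper's evaluation harness.

```python
def accuracy(predictions, solutions):
    """Return the percentage of predictions that match the solution labels."""
    if len(predictions) != len(solutions):
        raise ValueError("predictions and solutions must have the same length")
    if not solutions:
        raise ValueError("at least one question is required")
    correct = sum(p == s for p, s in zip(predictions, solutions))
    return 100.0 * correct / len(solutions)


# Illustrative data: 80 questions whose correct label is "A",
# with the last two answered incorrectly.
solutions = ["A"] * 80
predictions = ["A"] * 78 + ["B"] * 2
print(accuracy(predictions, solutions))  # 97.5
```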

Figure 2: Framework for AI-driven question generation methodology, incorporating human validation.

This review was generated by AI and reviewed by a human editor.