QUICK REVIEW

[Paper Review] Towards Leveraging LLMs to Generate Abstract Penetration Test Cases from Software Architecture

Jafari, Mahdi; Sharma, Rahul | arXiv (Cornell University) | 2026. 03. 24.

Information and Cyber Security | Citations: 0

One-Line Summary

The paper defines an Abstract Penetration Test Case (APTC) metamodel, investigates LLM-based generation of architecture-grounded APTCs from PCM models, and evaluates prompting strategies across several case studies. The results show up to 93% usefulness and 86% correctness, suggesting practical support for architects and testers.

ABSTRACT

Software architecture models capture early design decisions that strongly influence system quality attributes, including security. However, architecture-level security assessment and feedback are often absent in practice, allowing security weaknesses to propagate into later phases of the software development lifecycle and, in some cases, to remain undiscovered, ultimately leading to vulnerable systems. In this paper, we bridge this gap by proposing the generation of Abstract Penetration Test Cases (APTCs) from software architecture models as an input to support architecture-level security assessment. We first introduce a metamodel that defines the APTC concept, and then investigate the use of large language models with different prompting strategies to generate meaningful APTCs from architecture models. To design the APTC metamodel, we analyze relevant standards and state of the art using two criteria: (i) derivability from software architecture, and (ii) usability for both architecture security assessment and subsequent penetration testing. Building on this metamodel, we then proceed to generate APTCs from software architecture models. Our evaluation shows promising results, achieving up to 93% usefulness and 86% correctness, indicating that the generated APTCs can substantially support both architects (by highlighting security-critical design decisions) and penetration testers (by providing actionable testing guidance).

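Motivation and Objectives

Promote architecture-level security assessment by shifting penetration testing to the early phases of the software lifecycle.
Define a structured metamodel for Abstract Penetration Test Cases (APTCs) that is rooted in architectural artifacts.
Evaluate how effectively LLMs generate APTCs from architecture models under different prompting strategies.
Assess how the generated APTCs help architects and penetration testers, and identify limitations and the architectural annotations that are needed.

Proposed Method

Propose an APTC metamodel that describes the target threat, the weakness, the attack vector, and the affected architectural elements.
Serialize the PCM architecture into a security-oriented textual representation and use constrained prompts to enforce schema-compliant APTC generation.
Generate APTCs with two LLMs (GPT and Gemini) using prompt engineering (zero-shot, one-shot, and few-shot, each with and without chain-of-thought).
Evaluate the generated APTCs against CAWE weaknesses through expert-based evaluation and LLM-assisted expert evaluation.
Validate the outputs against a predefined JSON schema to ensure architectural traceability and interoperability.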
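The method steps above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `APTC` field names, the `validate_aptc` check (a stdlib stand-in for real JSON Schema validation), and the `build_prompt` wording are all assumptions made for illustration, based only on the four metamodel concepts and the prompting variants named in the list.

```python
import json
from dataclasses import dataclass, asdict, field
from typing import List, Sequence


@dataclass
class APTC:
    """Hypothetical APTC record; field names are assumptions, not the paper's schema."""
    threat: str
    weakness: str          # e.g. a CAWE identifier (placeholder values only here)
    attack_vector: str
    affected_elements: List[str] = field(default_factory=list)


# Assumed required fields and their JSON-level types.
REQUIRED_FIELDS = {
    "threat": str,
    "weakness": str,
    "attack_vector": str,
    "affected_elements": list,
}


def validate_aptc(record: dict) -> List[str]:
    """Return a list of problems; an empty list means the record has the expected shape."""
    problems = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(f"wrong type for field: {name}")
    for name in sorted(set(record) - set(REQUIRED_FIELDS)):
        problems.append(f"unexpected field: {name}")
    return problems


def build_prompt(architecture_text: str,
                 examples: Sequence[APTC] = (),
                 chain_of_thought: bool = False) -> str:
    """Assemble a zero-shot (no examples), one-shot, or few-shot prompt."""
    parts = ["Generate abstract penetration test cases as JSON objects "
             "with the fields: " + ", ".join(REQUIRED_FIELDS) + "."]
    for i, example in enumerate(examples, start=1):
        parts.append(f"Example {i}: {json.dumps(asdict(example))}")
    if chain_of_thought:
        parts.append("Reason step by step before giving the final JSON.")
    parts.append("Architecture:\n" + architecture_text)
    return "\n\n".join(parts)


if __name__ == "__main__":
    # Placeholder content only; component and weakness names are invented.
    sample = APTC("spoofing", "CAWE-XXX", "network", ["ExampleComponent"])
    print(build_prompt("component ExampleComponent", [sample], chain_of_thought=True))
    print(validate_aptc(asdict(sample)))
```

In a real pipeline the hand-written check would more likely be replaced by a JSON Schema validator, and the architecture text would come from the serialized PCM model rather than a literal string.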

Experimental Results

Research Questions

RQ1: How should an abstract penetration test case (APTC) be defined to support architecture-level security assessment?
RQ2: To what extent can LLMs analyze and understand the security implications of a software architecture?
RQ3: To what extent can LLMs generate meaningful APTCs from software architecture models?

Key Results

Model	Metric	Maintenance	PowerGrid	Bank	Total/15	Success rate
GPT-5.2	Correctness	2/5	3/5	4/5	9/15	60.0%
GPT-5.2	Usefulness	5/5	4/5	4/5	13/15	86.7%
Gemini-3-Pro	Correctness	4/5	2/5	4/5	10/15	66.7%
Gemini-3-Pro	Usefulness	3/5	2/5	5/5	10/15	66.7%
GPT-5.2	Correctness	4/5	3/5	4/5	11/15	73.3%
GPT-5.2	Usefulness	4/5	3/5	4/5	11/15	73.3%
Gemini-3-Pro	Correctness	5/5	4/5	4/5	13/15	86.7%
Gemini-3-Pro	Usefulness	5/5	5/5	4/5	14/15	93.3%
GPT-5.2	Correctness	4/5	4/5	3/5	11/15	73.3%
GPT-5.2	Usefulness	4/5	4/5	4/5	12/15	80.0%
Gemini-3-Pro	Correctness	2/5	2/5	2/5	6/15	40.0%
Gemini-3-Pro	Usefulness	2/5	5/5	3/5	10/15	66.7%
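
The last two columns of the table follow from the three per-scenario scores: the total is their sum out of 15, and the success rate is that total as a percentage. A short sketch of the arithmetic, where the function name `success_rate` and the default of 5 points per scenario are illustrative choices read off the table:

```python
def success_rate(scores, per_scenario_max=5):
    """Return (total, maximum, percentage rounded to one decimal place)."""
    total = sum(scores)
    maximum = per_scenario_max * len(scores)
    return total, maximum, round(100 * total / maximum, 1)


if __name__ == "__main__":
    # Per-scenario scores copied from two rows of the table.
    print(success_rate([5, 5, 4]))  # best usefulness row
    print(success_rate([5, 4, 4]))  # best correctness row
```

Applied to the rows with 5/5, 5/5, 4/5 and 5/5, 4/5, 4/5, this reproduces the 14/15 (93.3%) and 13/15 (86.7%) figures quoted in the summary.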

LLMs can generate architecture-grounded APTCs that are largely meaningful and aligned with CAWE weaknesses.
Prompting strategy and model choice noticeably affect correctness and usefulness: Gemini outperforms GPT in usefulness under some prompts (for example, 93.3% vs. 73.3% in one group of table rows) but trails it under others.
The approach achieves up to 93.3% usefulness and 86.7% correctness in aggregate across three scenarios.
Some outputs incorrectly identify weaknesses or reference non-existent architectural elements, highlighting limitations in semantic grounding.
A structured APTC metamodel enables traceable, schema-compliant generation suitable for integration into security workflows.
The evaluation discusses threats to validity and suggests extensions to cover more CAWEs and richer threat models.
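
The finding that some outputs reference non-existent architectural elements suggests a simple automated grounding check. The sketch below is one possible form of such a check, not something described in the paper: the function name `find_ungrounded`, the `id` and `affected_elements` keys, and all component names are hypothetical.

```python
def find_ungrounded(aptcs, known_elements):
    """Map each APTC id to the affected elements that are absent from the architecture."""
    known = set(known_elements)
    report = {}
    for aptc in aptcs:
        missing = [e for e in aptc["affected_elements"] if e not in known]
        if missing:
            report[aptc["id"]] = missing
    return report


if __name__ == "__main__":
    # Invented example data: one grounded APTC and one that names an unknown element.
    architecture = ["WebFrontend", "AccountService", "Database"]
    generated = [
        {"id": "APTC-1", "affected_elements": ["AccountService", "Database"]},
        {"id": "APTC-2", "affected_elements": ["PaymentGateway"]},
    ]
    print(find_ungrounded(generated, architecture))
```

A check like this only catches references to names that do not exist; it cannot tell whether a correctly named element is actually affected by the stated weakness, which still needs expert review.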


This review was generated by AI and checked by a human editor.