QUICK REVIEW

[Paper Review] Time for aCTIon: Automated Analysis of Cyber Threat Intelligence in the Wild

Giuseppe Siracusano, Davide Sanvito|arXiv (Cornell University)|Jul 14, 2023

Software Engineering Research | Citations: 12

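One-Line Summary

This paper introduces aCTIon, an automated CTI extraction framework that uses GPT-3.5-based pipelines to generate STIX bundles from unstructured reports, backed by an open benchmark of 204 reports. It improves F1 scores over prior methods by 10 to 50 percentage points.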

ABSTRACT

Cyber Threat Intelligence (CTI) plays a crucial role in assessing risks and enhancing security for organizations. However, the process of extracting relevant information from unstructured text sources can be expensive and time-consuming. Our empirical experience shows that existing tools for automated structured CTI extraction have performance limitations. Furthermore, the community lacks a common benchmark to quantitatively assess their performance. We fill these gaps providing a new large open benchmark dataset and aCTIon, a structured CTI information extraction tool. The dataset includes 204 real-world publicly available reports and their corresponding structured CTI information in STIX format. Our team curated the dataset involving three independent groups of CTI analysts working over the course of several months. To the best of our knowledge, this dataset is two orders of magnitude larger than previously released open source datasets. We then design aCTIon, leveraging recently introduced large language models (GPT3.5) in the context of two custom information extraction pipelines. We compare our method with 10 solutions presented in previous work, for which we develop our own implementations when open-source implementations were lacking. Our results show that aCTIon outperforms previous work for structured CTI extraction with an improvement of the F1-score from 10%points to 50%points across all tasks.

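Motivation and Objectives

Address the lack of a large, open benchmark for structured CTI extraction and the limitations of existing tools.
Develop an automated CTI information extraction pipeline that leverages large language models.
Evaluate state-of-the-art tools on a realistic CTI benchmark and demonstrate the improvement brought by LLM-based methods.
Release a dataset of CTI reports and their STIX bundles to encourage further research.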

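Proposed Method

Provides a two-pipeline architecture: one pipeline for general entity and relation extraction, and a second one specialized for attack patterns.
Uses a two-stage LLM workflow, with a preprocessing stage that summarizes the text and an extraction stage that identifies and classifies CTI entities, built on an LLM offered as a service (GPT-3.5-turbo) and prompt-based inference.
Mitigates LLM hallucinations by using the model as a reasoner grounded in the input content and by adding a verification step and a bundle review.
Distills long unstructured reports through iterative summarization so that they fit the 4k-token input/output limit.
Has Group C analysts manually verify and review the extracted STIX bundles to ensure quality.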
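
The iterative summarization step above can be illustrated with a minimal sketch. It is not the authors' implementation: `count_tokens` is a crude whitespace-based stand-in for a real tokenizer, `summarize` is a hypothetical callable that would wrap an LLM call, and the chunking strategy is an assumption made only for illustration.

```python
from typing import Callable, List


def count_tokens(text: str) -> int:
    # Assumption: one token per whitespace-separated word.
    # A real pipeline would use the model's own tokenizer.
    return len(text.split())


def split_into_chunks(text: str, max_tokens: int) -> List[str]:
    # Cut the text into consecutive chunks of at most max_tokens words.
    words = text.split()
    return [" ".join(words[i:i + max_tokens])
            for i in range(0, len(words), max_tokens)]


def distill(report: str,
            summarize: Callable[[str], str],
            budget: int = 4000,
            max_rounds: int = 10) -> str:
    """Summarize chunk by chunk until the text fits the token budget."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    text = report
    rounds = 0
    while count_tokens(text) > budget:
        if rounds == max_rounds:
            raise ValueError("text still exceeds the budget after max_rounds")
        chunks = split_into_chunks(text, budget)
        shorter = " ".join(summarize(chunk) for chunk in chunks)
        if count_tokens(shorter) >= count_tokens(text):
            # Guard against a summarizer that never shrinks its input.
            raise ValueError("summarizer did not shrink the text")
        text = shorter
        rounds += 1
    return text


def keep_first_half(chunk: str) -> str:
    # Hypothetical stand-in for an LLM summarizer: keeps the first half of the words.
    words = chunk.split()
    return " ".join(words[:max(1, len(words) // 2)])


report = " ".join(f"w{i}" for i in range(100))
distilled = distill(report, keep_first_half, budget=30)
print(count_tokens(distilled))
```

In practice `summarize` would send each chunk to the model with a summarization prompt, and the budget would also need to leave room for the prompt and the expected output.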

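Experimental Results

Research Questions

RQ1: How well do current tools perform structured CTI extraction on a large, realistic benchmark?
RQ2: Can LLM-based prompting and in-context learning improve the extraction of malware, threat actor, target, and attack pattern entities from CTI reports?
RQ3: Which design choices (preprocessing, prompting, verification) are effective at reducing hallucinations while preserving precision on CTI metrics?
RQ4: What is the trade-off between automation and analyst verification when producing high-quality STIX bundles?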

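Key Findings

aCTIon outperforms prior state-of-the-art tools, improving F1 scores for malware, threat actor, and target extraction by 15–50 points.
For attack pattern extraction, aCTIon improves the F1 score by about 10 points.
The dataset contains 204 reports and their STIX bundles, with a total of 36.1k entities and 13.6k relations spanning 9 entity types and 5 relation types.
The reports come from 62 sources, cover 90% of MITRE ATT&CK Enterprise techniques, and feature 188 malware variants and 91 threat actors.
The authors release the dataset as open access to enable reproducibility and further research.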
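
To make the F1 figures above concrete, the following sketch computes precision, recall, and F1 for a single report by comparing extracted entity names against analyst-labeled ones. The matching rule (exact match after lowercasing and trimming) and the sample entity names are assumptions for illustration; the scoring procedure actually used in the paper may differ.

```python
from typing import Iterable, Tuple


def precision_recall_f1(predicted: Iterable[str],
                        gold: Iterable[str]) -> Tuple[float, float, float]:
    # Assumption: two entity names match if they are equal after
    # lowercasing and stripping surrounding whitespace.
    pred = {p.strip().lower() for p in predicted}
    ref = {g.strip().lower() for g in gold}
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative data only, not taken from the benchmark.
extracted = ["emotet", "trickbot", "qakbot"]
labeled = ["Emotet", "TrickBot"]
p, r, f1 = precision_recall_f1(extracted, labeled)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Here two of the three extracted names are correct and both labeled names are found, so precision is 2/3, recall is 1, and F1 is 0.8. An improvement "in points" means a difference in percentage points: moving from an F1 of 0.40 to 0.55 is a gain of 15 points.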


This review was generated by AI and checked by a human editor.