QUICK REVIEW

[Paper Review] STELLAR: A Search-Based Testing Framework for Large Language Model Applications

Lev Sorokin, Ivan Vasilev|arXiv (Cornell University)|Jan 1, 2026

Topic Modeling | Citations: 0

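One-Sentence Summary

STELLAR automates test input generation for LLM-based applications by running an evolutionary search over a discretized feature space (content, style, and perturbation) to uncover faulty or unsafe responses. It outperforms baselines such as random search and ASTRAL on the safety and navigation use cases.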

ABSTRACT

Large Language Model (LLM)-based applications are increasingly deployed across various domains, including customer service, education, and mobility. However, these systems are prone to inaccurate, fictitious, or harmful responses, and their vast, high-dimensional input space makes systematic testing particularly challenging. To address this, we present STELLAR, an automated search-based testing framework for LLM-based applications that systematically uncovers text inputs leading to inappropriate system responses. Our framework models test generation as an optimization problem and discretizes the input space into stylistic, content-related, and perturbation features. Unlike prior work that focuses on prompt optimization or coverage heuristics, our work employs evolutionary optimization to dynamically explore feature combinations that are more likely to expose failures. We evaluate STELLAR on three LLM-based conversational question-answering systems. The first focuses on safety, benchmarking both public and proprietary LLMs against malicious or unsafe prompts. The second and third target navigation, using an open-source and an industrial retrieval-augmented system for in-vehicle venue recommendations. Overall, STELLAR exposes up to 4.3 times (average 2.5 times) more failures than the existing baseline approaches.

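Motivation and Objectives

Motivate robust testing of LLM-based applications that goes beyond static benchmarks and manual prompt tuning.
Manage the high-dimensional input space by discretizing natural-language inputs into content, style, and perturbation features.
Develop an automated evolutionary search framework that discovers failure-inducing inputs.
Evaluate STELLAR on a safety-focused LLM system and on navigation-oriented LLM systems, and compare it against baselines.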

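Proposed Method

Model test generation as a fitness-guided, search-based optimization problem.
Discretize the input space into features F = {F_S (style), F_C (content), F_P (perturbation)}, together with domain constraints C_F.
Encode feature vectors for optimization and apply constraint handling before generating tests.
Instantiate domain-specific prompts and use retrieval-augmented generation (RAG) to produce executable test inputs.
Evaluate inputs with fitness functions (possibly multi-objective) and oracles to identify failures.
Apply genetic operators (tournament selection, SBX crossover for ordinal features, uniform crossover and mutation for categorical features), with NSGA-II for survival selection.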
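
The steps above can be illustrated with a minimal sketch. It is not the authors' implementation: the feature names and values, the constraint, `render_prompt`, and `toy_fitness` are all hypothetical, and the loop is a simplified single-objective genetic algorithm with elitist survival, standing in for NSGA-II and omitting SBX crossover. In a real setup, the fitness would come from running the system under test and scoring its response with an oracle.

```python
import random

# Hypothetical discretized feature space; names and values are illustrative only.
FEATURES = {
    "style_tone": ["neutral", "polite", "slang"],         # F_S (style)
    "content_topic": ["restaurant", "parking", "fuel"],   # F_C (content)
    "perturbation": ["none", "typo", "filler_words"],     # F_P (perturbation)
}
NAMES = list(FEATURES)


def is_valid(ind):
    """Example domain constraint (C_F): forbid slang combined with typos."""
    return not (ind["style_tone"] == "slang" and ind["perturbation"] == "typo")


def random_individual(rng):
    """Sample feature vectors until one satisfies the constraint."""
    while True:
        ind = {n: rng.choice(FEATURES[n]) for n in NAMES}
        if is_valid(ind):
            return ind


def render_prompt(ind):
    """Stand-in for the LLM/RAG step that turns features into a test input."""
    return (f"Write a {ind['style_tone']} request about {ind['content_topic']} "
            f"with perturbation: {ind['perturbation']}")


def toy_fitness(ind):
    """Placeholder score; lower means closer to a failure (to be minimized)."""
    score = 1.0
    if ind["perturbation"] != "none":
        score -= 0.4
    if ind["style_tone"] == "slang":
        score -= 0.3
    if ind["content_topic"] == "fuel":
        score -= 0.2
    return score


def tournament(pop, rng, k=2):
    """Pick k distinct individuals and return the one with the lowest fitness."""
    return min(rng.sample(pop, k), key=toy_fitness)


def uniform_crossover(a, b, rng):
    """Take each categorical feature from either parent with equal probability."""
    return {n: (a[n] if rng.random() < 0.5 else b[n]) for n in NAMES}


def mutate(ind, rng, rate=0.2):
    """Resample each feature with probability `rate`."""
    return {n: (rng.choice(FEATURES[n]) if rng.random() < rate else v)
            for n, v in ind.items()}


def search(generations=10, pop_size=8, seed=0):
    rng = random.Random(seed)
    pop = [random_individual(rng) for _ in range(pop_size)]
    for _ in range(generations):
        offspring = []
        while len(offspring) < pop_size:
            child = mutate(
                uniform_crossover(tournament(pop, rng), tournament(pop, rng), rng),
                rng,
            )
            if is_valid(child):  # constraint handling: discard invalid children
                offspring.append(child)
        # Elitist survival: keep the best pop_size of parents plus offspring.
        pop = sorted(pop + offspring, key=toy_fitness)[:pop_size]
    return pop


if __name__ == "__main__":
    for ind in search()[:3]:
        print(round(toy_fitness(ind), 2), render_prompt(ind))
```

Checking the constraint before a test input is rendered mirrors the idea of applying constraint handling before test generation, so no budget is spent on feature combinations the domain rules out.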

Figure 2: Results for RQ1 (SafeQA). Number of failures found by each testing approach after 2 hours of search time (top). Mean ratio between failures found and in total generated test cases with standard deviation (bottom). Results averaged over 6 runs.
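
The ratio reported in the bottom panel can be computed per run and then aggregated, as in the sketch below. The per-run counts are hypothetical placeholders, not values from the paper, and the use of the sample standard deviation is an assumption, since the caption does not say which variant was used.

```python
from statistics import mean, stdev

# Hypothetical (failures_found, tests_generated) pairs, one per run.
# These are placeholders and do not come from the paper.
runs = [(30, 100), (25, 100), (40, 120), (35, 110), (28, 95), (33, 105)]


def failure_ratio_stats(runs):
    """Return the mean and sample standard deviation of failures/tests per run."""
    ratios = [failures / tests for failures, tests in runs]
    return mean(ratios), stdev(ratios)


if __name__ == "__main__":
    avg, sd = failure_ratio_stats(runs)
    print(f"mean ratio = {avg:.3f}, std = {sd:.3f}")
```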

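Experimental Results

Research Questions

RQ0: How accurate are LLM-based judges at evaluating whether a test passes or fails?
RQ1: How effectively does STELLAR identify failures in LLM applications?
RQ2: How diverse are the generated failures?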

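Key Findings

STELLAR exposes up to 4.3 times (2.5 times on average) more failures than the baseline approaches.
Across SafeQA and NaviQA, STELLAR consistently finds more failing inputs than random search, combinatorial search, and coverage-based baselines such as ASTRAL.
LLM-based judges reach a binary F1 of up to 0.79 and a continuous F1 of about 0.79 on SafeQA; on NaviQA, binary F1 ranges from 0.65 to 0.73.
A clustering-based diversity analysis shows meaningful coverage of failure types across approaches.
The study demonstrates STELLAR's effectiveness on one safety-focused case and two navigation-oriented retrieval-augmented systems (one open-source, one industrial).
The framework integrates domain-specific prompt templates, RAG-based retrieval, and an evolutionary search that balances exploration and exploitation.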
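
For readers unfamiliar with how a judge's binary F1 is obtained, the sketch below compares judge verdicts with human labels. It assumes that "fail" is treated as the positive class and uses made-up labels; the review does not describe the paper's exact labeling protocol.

```python
def binary_f1(human, judge):
    """F1 of judge verdicts against human labels; True means 'test failed'."""
    tp = sum(1 for h, j in zip(human, judge) if h and j)
    fp = sum(1 for h, j in zip(human, judge) if not h and j)
    fn = sum(1 for h, j in zip(human, judge) if h and not j)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical labels for eight test cases (not data from the paper).
    human = [True, True, True, False, False, True, False, False]
    judge = [True, False, True, False, True, True, False, False]
    print(f"binary F1 = {binary_f1(human, judge):.2f}")
```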

This review was generated by AI and checked by a human editor.