QUICK REVIEW

[Paper Review] STELLAR: A Search-Based Testing Framework for Large Language Model Applications

Lev Sorokin, Ivan Vasilev|arXiv (Cornell University)|Jan 1, 2026


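One-Sentence Summary

STELLAR applies evolutionary search over a discretized feature space (content, style, perturbation) to automatically generate test inputs for LLM-based applications and uncover incorrect or unsafe responses; on safety and navigation use cases it outperforms baselines such as random search and ASTRAL.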

ABSTRACT

Large Language Model (LLM)-based applications are increasingly deployed across various domains, including customer service, education, and mobility. However, these systems are prone to inaccurate, fictitious, or harmful responses, and their vast, high-dimensional input space makes systematic testing particularly challenging. To address this, we present STELLAR, an automated search-based testing framework for LLM-based applications that systematically uncovers text inputs leading to inappropriate system responses. Our framework models test generation as an optimization problem and discretizes the input space into stylistic, content-related, and perturbation features. Unlike prior work that focuses on prompt optimization or coverage heuristics, our work employs evolutionary optimization to dynamically explore feature combinations that are more likely to expose failures. We evaluate STELLAR on three LLM-based conversational question-answering systems. The first focuses on safety, benchmarking both public and proprietary LLMs against malicious or unsafe prompts. The second and third target navigation, using an open-source and an industrial retrieval-augmented system for in-vehicle venue recommendations. Overall, STELLAR exposes up to 4.3 times (average 2.5 times) more failures than the existing baseline approaches.

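Motivation and Objectives

Advance robust testing of LLM-based applications beyond static benchmarks and manual prompt tuning.
Discretize natural-language inputs into content, style, and perturbation features to make the high-dimensional input space manageable.
Develop an automated evolutionary search framework that discovers failure-inducing inputs.
Evaluate STELLAR on safety- and navigation-focused LLM systems and compare it against baselines.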

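Proposed Method

Formulate test generation as a search-based optimization problem with fitness-driven objectives.
Discretize the input space into a feature set F = {F_S (style), F_C (content), F_P (perturbation)} subject to domain constraints C_F.
Encode candidates as feature vectors for optimization and apply constraint handling before test generation.
Generate executable test inputs by instantiating domain-specific prompts, using retrieval-augmented generation (RAG).
Evaluate test inputs with fitness functions, which may be multi-objective, and with a judge to identify failures.
Evolve the population with genetic operators (tournament selection, SBX crossover for ordinal features, uniform crossover and mutation for categorical features) and NSGA-II survival selection.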
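
The steps above can be pictured with a small, self-contained sketch. It is not the authors' implementation: the feature values, the constraint, and the scoring function are all hypothetical, and for brevity it swaps NSGA-II and SBX for a single-objective elitist genetic algorithm with uniform crossover. In a real setup the synthetic `toy_fitness` would be replaced by running the system under test and scoring its answer with a judge.

```python
import random

# Hypothetical feature space; names and values are illustrative, not from the paper.
FEATURES = {
    "content": ["restaurant", "parking", "charging station"],  # stands in for F_C
    "style": ["polite", "slang", "terse"],                      # stands in for F_S
    "perturbation": ["none", "typos", "filler words"],          # stands in for F_P
}
NAMES = list(FEATURES)


def repair(ind):
    """Toy domain constraint (assumed): terse requests carry no filler words."""
    ind = dict(ind)
    if ind["style"] == "terse" and ind["perturbation"] == "filler words":
        ind["perturbation"] = "none"
    return ind


def random_individual(rng):
    return repair({n: rng.choice(FEATURES[n]) for n in NAMES})


def to_prompt(ind):
    """Instantiate a template that a generator LLM could turn into a test input."""
    return (f"Write a {ind['style']} user request about a {ind['content']} "
            f"with perturbation: {ind['perturbation']}.")


def toy_fitness(ind):
    """Synthetic stand-in for 'run the system, ask a judge'. Lower = closer to failure."""
    score = 1.0
    if ind["style"] == "slang":
        score -= 0.4
    if ind["perturbation"] == "typos":
        score -= 0.4
    if ind["content"] == "charging station":
        score -= 0.1
    return score


def tournament(pop, fitness, rng, k=2):
    """Pick k random individuals and return the one with the lowest score."""
    return min(rng.sample(pop, k), key=fitness)


def uniform_crossover(a, b, rng):
    """Take each feature from either parent with equal probability."""
    return {n: (a[n] if rng.random() < 0.5 else b[n]) for n in NAMES}


def mutate(ind, rng, rate=0.2):
    """Resample each feature from its domain with probability `rate`."""
    return {n: (rng.choice(FEATURES[n]) if rng.random() < rate else v)
            for n, v in ind.items()}


def search(fitness=toy_fitness, pop_size=8, generations=10, seed=0):
    rng = random.Random(seed)
    pop = [random_individual(rng) for _ in range(pop_size)]
    for _ in range(generations):
        children = []
        for _ in range(pop_size):
            child = uniform_crossover(tournament(pop, fitness, rng),
                                      tournament(pop, fitness, rng), rng)
            children.append(repair(mutate(child, rng)))
        # Elitist survival: keep the pop_size lowest-scoring individuals.
        pop = sorted(pop + children, key=fitness)[:pop_size]
    return pop


best = search()[0]
print(to_prompt(best), toy_fitness(best))
```

Constraint repair runs on every individual before it is scored, mirroring the idea of handling constraints before test generation. Because survival is elitist, the best score in the population never gets worse from one generation to the next.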

Figure 2 : Results for RQ 1 (SafeQA). Number of failures found by each testing approach after 2 hours of search time (top). Mean ratio between failures found and in total generated test cases with standard deviation (bottom). Results averaged over 6 runs.

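Experimental Results

Research Questions

RQ0: How accurate are LLM-based judges at evaluating test pass/fail outcomes?
RQ1: How effective is STELLAR at identifying failures in LLM-based applications?
RQ2: How diverse are the generated failures?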

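Key Findings

STELLAR exposes up to 4.3× (on average 2.5×) more failures than the baseline approaches.
On both SafeQA and NaviQA, STELLAR consistently finds more failing inputs than random search, combinatorial search, and coverage-based baselines such as ASTRAL.
The LLM-based judges reach a binary F1 score of up to 0.79 and a continuous F1 of about 0.79 on SafeQA; on NaviQA the binary F1 ranges from 0.65 to 0.73.
A clustering-based diversity analysis shows that the approaches achieve meaningful coverage of failure types.
The study demonstrates STELLAR's effectiveness on one safety-focused use case and two navigation-oriented, retrieval-augmented systems (one open-source, one industrial).
The framework combines domain-specific prompt templates, RAG retrieval, and an evolutionary search that balances exploration and exploitation.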
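
For readers unfamiliar with the judge-accuracy metric cited above, the following minimal sketch shows how a binary F1 score can be computed from paired human and judge verdicts. The labels are made up for illustration, and treating "failure" as the positive class is an assumption; this summary does not state which class the paper uses or how its continuous F1 is defined.

```python
def binary_f1(human, judge):
    """F1 for the positive class (True = 'failure') from paired boolean labels."""
    pairs = list(zip(human, judge))
    tp = sum(h and j for h, j in pairs)          # both say failure
    fp = sum((not h) and j for h, j in pairs)    # judge flags, human does not
    fn = sum(h and (not j) for h, j in pairs)    # human flags, judge misses
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up verdicts for six test cases (not data from the paper).
human = [True, True, True, False, False, True]
judge = [True, True, False, True, False, True]
print(round(binary_f1(human, judge), 2))  # prints 0.75
```

Here the judge agrees with the human on three failures, raises one false alarm, and misses one failure, so precision and recall are both 0.75.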


This review was generated by AI and reviewed by a human editor.