QUICK REVIEW

[Paper Review] RIVA: Leveraging LLM Agents for Reliable Configuration Drift Detection

Sami Abuzakuk, Lucas Crijns|arXiv (Cornell University)|Mar 2, 2026

Software System Performance and Reliability|Citations: 0

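One-line summary

RIVA is a system of two agents (a Verifier agent and a Tool Generation agent) that detects configuration drift robustly by cross-validating multiple independent tool calls, which improves reliability even when some tools return misleading output.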

ABSTRACT

Infrastructure as code (IaC) tools automate cloud provisioning but verifying that deployed systems remain consistent with the IaC specifications remains challenging. Such configuration drift occurs because of bugs in the IaC specification, manual changes, or system updates. Large language model (LLM)-based agentic AI systems can automate the analysis of large volumes of telemetry data, making them suitable for the detection of configuration drift. However, existing agentic systems implicitly assume that the tools they invoke always return correct outputs, making them vulnerable to erroneous tool responses. Since agents cannot distinguish whether an anomalous tool output reflects a real infrastructure problem or a broken tool, such errors may cause missed drift or false alarms, reducing reliability precisely when it is most needed. We introduce RIVA (Robust Infrastructure by Verification Agents), a novel multi-agent system that performs robust IaC verification even when tools produce incorrect or misleading outputs. RIVA employs two specialized agents, a verifier agent and a tool generation agent, that collaborate through iterative cross-validation, multi-perspective verification, and tool call history tracking. Evaluation on the AIOpsLab benchmark demonstrates that RIVA, in the presence of erroneous tool responses, recovers task accuracy from 27.3% when using a baseline ReAct agent to 50.0% on average. RIVA also improves task accuracy from 28% to 43.8% without erroneous tool responses. Our results show that cross-validation of diverse tool calls enables more reliable autonomous infrastructure verification in production cloud environments.

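Motivation and objectives

Enable robust verification of IaC configuration drift even when the tools being invoked are unreliable.
Use multi-agent collaboration to cross-validate tool outputs and reduce false alarms.
Evaluate RIVA against a ReAct baseline on the AIOpsLab benchmark under faulty-tool conditions.
Quantify how the tool call history and the hyperparameter K affect verification reliability.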

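Proposed method

Introduce a two-agent architecture, a Verifier agent and a Tool Generation agent, in which both agents share a Tool Call History.
For each attribute, cross-validate K independent tool calls to decide how much confidence to place in a drift verdict.
The Tool Generation agent proposes diverse, distinct tool calls for the same attribute and records their results in the Tool Call History.
Require K verified tool paths before concluding that an attribute is satisfied or violated.
Evaluate on a modified AIOpsLab benchmark that simulates silent errors in unreliable tools.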
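The K-path rule described above can be illustrated with a minimal Python sketch. This is one possible reading of the idea and not the authors' implementation: `ToolCallHistory`, `verify_attribute`, the tool names, and the "K distinct tools must agree" stopping rule are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set, Tuple

# Hypothetical model: a "tool" is any callable that returns the observed
# value of one attribute of the deployed system.
Tool = Callable[[], str]


@dataclass
class ToolCallHistory:
    """Shared record of (tool name, observed value) pairs per attribute."""
    calls: Dict[str, List[Tuple[str, str]]] = field(default_factory=dict)

    def record(self, attribute: str, tool_name: str, value: str) -> None:
        self.calls.setdefault(attribute, []).append((tool_name, value))

    def used_tools(self, attribute: str) -> Set[str]:
        return {name for name, _ in self.calls.get(attribute, [])}


def verify_attribute(
    attribute: str,
    expected: str,
    tools: Dict[str, Tool],
    history: ToolCallHistory,
    k: int = 2,
) -> Optional[str]:
    """Return "satisfied" or "violated" once k distinct tools agree.

    Returns None when the available tools run out before any verdict
    collects k agreeing calls (assumed behaviour, not from the paper).
    """
    votes: Counter = Counter()
    for name, tool in tools.items():
        if name in history.used_tools(attribute):
            continue  # only count tools not yet tried for this attribute
        value = tool()
        history.record(attribute, name, value)
        verdict = "satisfied" if value == expected else "violated"
        votes[verdict] += 1
        if votes[verdict] >= k:
            return verdict
    return None


# Illustrative run: one tool silently returns a wrong replica count.
example_tools: Dict[str, Tool] = {
    "broken_cli": lambda: "0",   # hypothetical faulty tool
    "kubectl_get": lambda: "3",  # hypothetical healthy tool
    "metrics_api": lambda: "3",  # hypothetical healthy tool
}
example_history = ToolCallHistory()
example_verdict = verify_attribute(
    "replicas", "3", example_tools, example_history, k=2
)
print(example_verdict)
```

In this toy run a single misleading tool cannot decide the outcome on its own, because a verdict needs two agreeing calls. With `k=3` the same three tools would not reach agreement and the function would return `None`, which hints at why a larger K demands more independent ways of observing the same attribute.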

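Experimental results

Research questions

RQ1: How can agentic AI reliably verify IaC compliance even when tools return incorrect outputs?
RQ2: Does cross-validation across multiple tool calls improve drift detection accuracy compared with a single-agent baseline?
RQ3: How does the diagnostic-path parameter K affect verification success rate and efficiency?
RQ4: How does RIVA perform on localization, detection, and analysis tasks when tool responses are erroneous?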

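Key findings

When tools return erroneous outputs, RIVA raised average task accuracy from 27.3% (ReAct with faulty tools) to 50.0%.
With error-free tools, RIVA raised average accuracy from 28% (ReAct) to 43.8%.
RIVA with K=2 outperformed ReAct across tasks, reaching higher success rates in some settings, for example 43.75% versus 28.00%.
RIVA generally needs fewer steps and fewer tokens than ReAct: most tasks finish within 15 steps, whereas ReAct trajectories are spread over a much wider range, and with correct tools RIVA uses 38,000 tokens versus 78,000 for ReAct.
With faulty tools, RIVA needs at most 17 steps, whereas ReAct runs reach 45 steps in more than 37% of cases.
Raising K to 3 reduced the reported successes to zero because of AIOpsLab environment constraints, which underscores both the importance of K and its dependence on the environment.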
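To put the headline numbers above in perspective, the following small Python sketch derives the absolute and relative gains from the reported figures. The helper name `gains` is hypothetical, and the derived values are simple arithmetic on the numbers quoted in this review, not additional results from the paper.

```python
from typing import Tuple


def gains(baseline_pct: float, riva_pct: float) -> Tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in %)."""
    abs_gain = riva_pct - baseline_pct
    rel_gain = abs_gain / baseline_pct * 100
    return round(abs_gain, 1), round(rel_gain, 1)


# Accuracy figures quoted in the findings above.
faulty_tools = gains(27.3, 50.0)   # ReAct vs. RIVA with erroneous tools
correct_tools = gains(28.0, 43.8)  # ReAct vs. RIVA with correct tools

# Token usage with correct tools: 38,000 (RIVA) vs. 78,000 (ReAct).
token_saving_pct = round((1 - 38_000 / 78_000) * 100, 1)

print(faulty_tools, correct_tools, token_saving_pct)
```

Under these figures the improvement is about 22.7 percentage points with faulty tools and 15.8 with correct tools, and RIVA uses roughly half the tokens of ReAct when tools behave correctly.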

This review was generated by AI and checked by a human editor.