QUICK REVIEW

[Paper Review] Beyond IVR: Benchmarking Customer Support LLM Agents for Business-Adherence

Sumanth Balaji, Piyush Mishra|arXiv (Cornell University)|Jan 2, 2026

AI in Service Interactions | Citations: 0

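One-Line Summary

JourneyBench introduces graph-based standard operating procedures (SOPs) for evaluating LLM agents that must adhere to policy in customer support. Dynamic prompt orchestration improves adherence, and the results suggest that smaller models can outperform larger ones on policy adherence.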

ABSTRACT

Traditional customer support systems, such as Interactive Voice Response (IVR), rely on rigid scripts and lack the flexibility required for handling complex, policy-driven tasks. While large language model (LLM) agents offer a promising alternative, evaluating their ability to act in accordance with business rules and real-world support workflows remains an open challenge. Existing benchmarks primarily focus on tool usage or task completion, overlooking an agent's capacity to adhere to multi-step policies, navigate task dependencies, and remain robust to unpredictable user or environment behavior. In this work, we introduce JourneyBench, a benchmark designed to assess policy-aware agents in customer support. JourneyBench leverages graph representations to generate diverse, realistic support scenarios and proposes the User Journey Coverage Score, a novel metric to measure policy adherence. We evaluate multiple state-of-the-art LLMs using two agent designs: a Static-Prompt Agent (SPA) and a Dynamic-Prompt Agent (DPA) that explicitly models policy control. Across 703 conversations in three domains, we show that DPA significantly boosts policy adherence, even allowing smaller models like GPT-4o-mini to outperform more capable ones like GPT-4o. Our findings demonstrate the importance of structured orchestration and establish JourneyBench as a critical resource to advance AI-driven customer support beyond IVR-era limitations.

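Motivation and Objectives

Motivate the move beyond rigid IVR systems toward autonomous LLM-powered agents that follow business rules.
Capture multi-step policy adherence and task dependencies in real-world support workflows.
Provide a scalable benchmark that evaluates fidelity to SOPs across diverse scenarios.
Demonstrate the benefits of structured workflow orchestration for production-ready AI agents.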

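Proposed Method

Represent each SOP as a directed acyclic graph (DAG) with task nodes and conditional edges.
Generate synthetic SOP graphs and user journeys with a multi-stage LLM-driven pipeline plus human validation.
Define the User Journey Coverage Score (UJCS), which measures adherence to the action sequence prescribed by the SOP.
Compare two agent designs, the Static-Prompt Agent (SPA) and the Dynamic-Prompt Agent (DPA); the DPA uses an orchestrator that manages workflow state.
Evaluate agents across 703 conversations in three domains (e-commerce, loan application, and telecommunications).
Evaluate robustness in scenarios with missing inputs and tool failures, and ground the evaluation in production-like QA and a real deployment.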
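
This review does not reproduce the UJCS formula, so the following is only an illustrative proxy for "adherence to a prescribed action sequence", not the paper's metric. It assumes a hypothetical scoring rule: the fraction of expected actions that the agent performed in the correct relative order, computed with a longest-common-subsequence table. The function name and the action labels are made up for the example.

```python
def sequence_coverage(expected, actual):
    """Hypothetical adherence proxy (NOT the paper's UJCS).

    Returns the share of expected actions that appear in `actual`
    in the same relative order, via longest common subsequence.
    """
    m, n = len(expected), len(actual)
    if m == 0:
        return 1.0  # nothing was required, so nothing was missed
    # lcs[i][j] = LCS length of expected[:i] and actual[:j]
    lcs = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if expected[i] == actual[j]:
                lcs[i + 1][j + 1] = lcs[i][j] + 1
            else:
                lcs[i + 1][j + 1] = max(lcs[i][j + 1], lcs[i + 1][j])
    return lcs[m][n] / m


# Illustrative action names only.
expected = ["verify_identity", "check_credit", "offer_terms"]
print(sequence_coverage(expected, ["verify_identity", "offer_terms"]))
print(sequence_coverage(expected, list(reversed(expected))))
```

An order-sensitive score like this penalizes an agent that calls the right tools in the wrong order, which a plain set-overlap measure would miss.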

Figure 1: Example SOP graph for loan application processing, showing sequential tasks and decision points.
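
As a rough sketch of how a graph like the one in Figure 1 might be held in code, the snippet below stores an SOP as a mapping from each task node to its conditional edges. The node and outcome names are assumptions for illustration and are not taken from the paper. The acyclicity check uses the standard-library `graphlib` module.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical loan-application SOP: node -> {outcome: next node}.
# All node and outcome names are illustrative assumptions.
SOP = {
    "verify_identity": {"verified": "check_credit", "failed": "escalate"},
    "check_credit": {"approved": "offer_terms", "declined": "explain_decline"},
    "offer_terms": {},
    "explain_decline": {},
    "escalate": {},
}


def is_dag(sop):
    """Return True if the SOP graph contains no cycles."""
    # TopologicalSorter expects node -> predecessors; edge direction does not
    # affect cycle detection, so successors are passed directly.
    graph = {node: set(edges.values()) for node, edges in sop.items()}
    try:
        TopologicalSorter(graph).prepare()
        return True
    except CycleError:
        return False


def expected_path(sop, start, outcomes):
    """Follow conditional edges from `start`, given one outcome per visited node."""
    path, node = [start], start
    while sop[node]:  # stop at a terminal node (no outgoing edges)
        node = sop[node][outcomes[node]]
        path.append(node)
    return path


print(is_dag(SOP))
print(expected_path(SOP, "verify_identity",
                    {"verify_identity": "verified", "check_credit": "declined"}))
```

Each distinct assignment of outcomes yields a different root-to-leaf path, which is one way to think about a "user journey" through the SOP.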

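Experimental Results

Research Questions

RQ1: Can policy-aware agents be evaluated reliably using graph-based SOPs that encode task dependencies and branching logic?
RQ2: Does dynamic-prompt orchestration (DPA) improve policy adherence over a static prompt (SPA) in real-time customer support tasks?
RQ3: Under structured workflow control, how do LLMs of different sizes perform when handling common failures (missing inputs, tool failures)?
RQ4: Does the JourneyBench evaluation framework align with production behavior and QA standards?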

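Key Findings

Dynamic-Prompt Agents (DPA) achieve significantly higher policy adherence than Static-Prompt Agents (SPA).
GPT-4o with DPA reaches an average User Journey Coverage Score (UJCS) of 0.717, versus 0.564 with SPA.
The smaller model (GPT-4o-mini) with DPA (0.649) outperforms the larger model with SPA (0.564).
JourneyBench comprises 703 conversations across three domains, averaging 10.91 turns and 3.34 tool calls per conversation, and uses 41 tools.
In production, DPA-based orchestration reliably handles more than 6,000 calls per day at a client's contact center.
Synthetic conversations score 84.37% on overall QA realism (conversational ability 82.33%, goal achievement 87.78%), on par with the production QA distribution.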
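
To put the reported UJCS values side by side, the short calculation below derives absolute and relative gains over the GPT-4o + SPA baseline. Only the three scores come from this review; the helper name `gain_over` is an assumption for the example.

```python
# Average UJCS values as reported in this review.
UJCS = {
    ("GPT-4o", "DPA"): 0.717,
    ("GPT-4o", "SPA"): 0.564,
    ("GPT-4o-mini", "DPA"): 0.649,
}


def gain_over(baseline, candidate):
    """Return (absolute gain, relative gain) of candidate over baseline."""
    abs_gain = candidate - baseline
    return abs_gain, abs_gain / baseline


baseline = UJCS[("GPT-4o", "SPA")]
for model, agent in [("GPT-4o", "DPA"), ("GPT-4o-mini", "DPA")]:
    abs_gain, rel_gain = gain_over(baseline, UJCS[(model, agent)])
    print(f"{model} + {agent}: +{abs_gain:.3f} UJCS ({rel_gain:.1%} relative)")
# GPT-4o + DPA: +0.153 UJCS (27.1% relative)
# GPT-4o-mini + DPA: +0.085 UJCS (15.1% relative)
```

On these numbers, switching GPT-4o from SPA to DPA is worth roughly 0.15 UJCS, and the smaller model with DPA still sits about 0.085 above the larger model with SPA.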

Figure 2: Components within a single node: the task description (prompt), available tools for execution, and conditional pathways (edges) that define transitions to the next node based on outcomes.
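
The node structure in Figure 2 suggests one way a dynamic-prompt orchestrator could work: track the current node, expose only that node's prompt and tools to the model, and move along a conditional edge once an outcome is known. The sketch below is a minimal illustration under that assumption, not the authors' implementation. The class names, field names, and example nodes are hypothetical, and no LLM call is made.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    prompt: str                                          # task description for this step
    tools: list[str] = field(default_factory=list)       # tools allowed at this step
    edges: dict[str, str] = field(default_factory=dict)  # outcome -> next node id


class Orchestrator:
    """Hypothetical workflow-state tracker that builds a prompt per node."""

    def __init__(self, nodes, start):
        self.nodes = nodes
        self.current = start

    def build_prompt(self):
        # Only the current node's instructions and tools are shown to the model.
        node = self.nodes[self.current]
        tools = ", ".join(node.tools) or "none"
        return f"Task: {node.prompt}\nAllowed tools: {tools}"

    def advance(self, outcome):
        # Refuse transitions that the SOP does not define.
        edges = self.nodes[self.current].edges
        if outcome not in edges:
            raise ValueError(f"no edge for outcome {outcome!r} at {self.current!r}")
        self.current = edges[outcome]
        return self.current


# Illustrative two-step workflow.
nodes = {
    "verify": Node("Verify the caller's identity.", ["lookup_customer"], {"ok": "refund"}),
    "refund": Node("Process the refund request.", ["issue_refund"]),
}
orch = Orchestrator(nodes, "verify")
print(orch.build_prompt())
orch.advance("ok")
print(orch.build_prompt())
```

A static-prompt design would instead place every node's instructions in a single prompt and leave the model to track its own position in the workflow.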

This review was generated by AI and checked by a human editor.