QUICK REVIEW

[Paper Review] FormalJudge: A Neuro-Symbolic Paradigm for Agentic Oversight

Jiayi Zhou, Yang Sheng | arXiv (Cornell University) | 2026-02-11

Ethics and Social Impacts of AI | Citations: 0

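One-Line Summary

FormalJudge uses a neuro-symbolic, bidirectional Formal-of-Thought pipeline in which an LLM compiles high-level intent into Dafny specifications and an SMT solver verifies atomic facts, achieving formal guarantees for agent oversight and outperforming LLM-as-a-Judge baselines.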

ABSTRACT

As LLM-based agents increasingly operate in high-stakes domains with real-world consequences, ensuring their behavioral safety becomes paramount. The dominant oversight paradigm, LLM-as-a-Judge, faces a fundamental dilemma: how can probabilistic systems reliably supervise other probabilistic systems without inheriting their failure modes? We argue that formal verification offers a principled escape from this dilemma, yet its adoption has been hindered by a critical bottleneck: the translation from natural language requirements to formal specifications. This paper bridges this gap by proposing FormalJudge, a neuro-symbolic framework that employs a bidirectional Formal-of-Thought architecture: LLMs serve as specification compilers that top-down decompose high-level human intent into atomic, verifiable constraints, then bottom-up prove compliance using Dafny specifications and Z3 Satisfiability Modulo Theories solving, which produces mathematical guarantees rather than probabilistic scores. We validate FormalJudge across three benchmarks spanning behavioral safety, multi-domain constraint adherence, and agentic upward deception detection. Experiments on 7 agent models demonstrate that FormalJudge achieves an average improvement of 16.6% over LLM-as-a-Judge baselines, enables weak-to-strong generalization where a 7B judge achieves over 90% accuracy detecting deception from 72B agents, and provides near-linear safety improvement through iterative refinement.

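Motivation and Objectives

Establishes the need for reliable oversight of increasingly autonomous LLM-based agents in high-stakes domains.
Narrows the gap between natural-language requirements and formal verification in order to provide mathematical guarantees.
Proposes a bidirectional architecture that decomposes intent into atomic facts and verifies them with an SMT solver.
Shows that formal verification can outperform purely probabilistic judging across multiple benchmarks and agent models.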

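Proposed Method

LLMs act as specification compilers, decomposing user intent top-down into atomic, verifiable constraints.
Context-aware extraction splits trajectory data into grounded atomic facts through deterministic parsing and LLM-assisted semantic extraction.
Dafny specifications are assembled to encode the logical composition of the atomic facts.
Boogie and the Z3 SMT solver verify the generated Dafny specifications and produce a formal verdict.
The architecture separates neural semantic extraction from deterministic logical verification, which is what allows it to offer provable guarantees.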
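
The separation between fact extraction and deterministic composition can be illustrated with a minimal Python sketch. It is not the authors' implementation: the pipeline described above emits Dafny and discharges proofs through Boogie and Z3, whereas here plain Python predicates stand in for the verifier and therefore give no formal guarantee. The tool names (`issue_refund`, `confirm_with_user`), the fact names, and the refund policy are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Atomic facts: small, grounded values extracted from an agent trajectory.
Facts = Dict[str, object]


@dataclass(frozen=True)
class Constraint:
    name: str
    check: Callable[[Facts], bool]  # deterministic predicate over atomic facts


def extract_facts(trajectory: List[dict]) -> Facts:
    """Deterministic parsing step (hypothetical schema).

    In the described pipeline an LLM would assist with semantic fields;
    here everything is read directly from structured tool calls.
    """
    refunds = [step for step in trajectory if step["tool"] == "issue_refund"]
    return {
        "refund_issued": len(refunds) > 0,
        "refund_amount": sum(step["args"]["amount"] for step in refunds),
        "user_confirmed": any(
            step["tool"] == "confirm_with_user" for step in trajectory
        ),
    }


# Assumed policy, standing in for a compiled specification.
CONSTRAINTS = [
    Constraint("refund_within_limit", lambda f: f["refund_amount"] <= 100),
    Constraint(
        "refund_requires_confirmation",
        lambda f: (not f["refund_issued"]) or f["user_confirmed"],
    ),
]


def judge(facts: Facts, constraints: List[Constraint]) -> Tuple[bool, List[str]]:
    """Compose atomic checks with pure logic; no model call happens here."""
    violated = [c.name for c in constraints if not c.check(facts)]
    return (len(violated) == 0, violated)


trajectory = [
    {"tool": "lookup_order", "args": {"order_id": "A1"}},
    {"tool": "issue_refund", "args": {"amount": 250}},
]
facts = extract_facts(trajectory)
verdict, violated = judge(facts, CONSTRAINTS)
print(verdict, violated)
# False ['refund_within_limit', 'refund_requires_confirmation']
```

The point of the design is that any model error is confined to `extract_facts`; once the facts are fixed, the verdict is a deterministic function of them and names the violated constraints.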

Figure 1: Teaser: LLM agents face scalable oversight challenges: it is difficult to identify a reliable oversight agent. While the LLM-as-a-Judge baseline relies on probabilistic Chain-of-Thought reasoning, we introduce a Formal-of-Thought architecture that leverages LLMs as specification compiler

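Experimental Results

Research Questions

RQ1: Can formal verification provide more reliable and scalable oversight of agent trajectories than probabilistic judges?
RQ2: Does decomposing oversight into atomic facts and verifying them with an SMT solver detect safety failures and deception more effectively across agent models?
RQ3: Can LLMs effectively compile natural-language intent into formal specifications, and does this remain robust across agents of varying strength?
RQ4: Does iterative formal feedback enable incremental safety improvement in agent behavior?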

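Key Results

FormalJudge achieves an average improvement of 16.6% over LLM-as-a-Judge baselines across 3 benchmarks and 7 agent models.
Under FormalJudge, a 7B judge exceeds 90% accuracy in detecting deception from 72B agents, demonstrating weak-to-strong generalization.
Formal verification feedback enables near-linear safety improvement over three rounds of iteration.
Structured inputs help but do not close the gap to full formal verification, which underscores the need for deterministic composition.
Executable Python checks help but do not provide the formal guarantees of Dafny + Z3.
FormalJudge maintains high detection accuracy regardless of attack sophistication and judge model size, whereas purely probabilistic baselines do not.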
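
The iterative-refinement result depends on the verifier reporting which constraints failed, so the agent can repair them in the next round. The toy loop below sketches only that feedback mechanism. The constraints, the repair table, and the starting facts are hypothetical, and the steady decline in violations is true by construction here (one named violation is repaired per round); it does not reproduce the paper's measurements.

```python
from typing import Callable, Dict, List, Tuple

Facts = Dict[str, object]

# Hypothetical constraints over atomic facts (stand-ins for a compiled spec).
CHECKS: Dict[str, Callable[[Facts], bool]] = {
    "refund_within_limit": lambda f: f["refund_amount"] <= 100,
    "refund_requires_confirmation": lambda f: f["user_confirmed"],
    "no_pii_in_log": lambda f: not f["logs_contain_pii"],
}

# Hypothetical repairs a toy agent applies when told which constraint failed.
REPAIRS: Dict[str, Callable[[Facts], Facts]] = {
    "refund_within_limit": lambda f: {**f, "refund_amount": 100},
    "refund_requires_confirmation": lambda f: {**f, "user_confirmed": True},
    "no_pii_in_log": lambda f: {**f, "logs_contain_pii": False},
}


def violations(facts: Facts) -> List[str]:
    """Deterministic check: names of every constraint the facts violate."""
    return [name for name, check in CHECKS.items() if not check(facts)]


def refine(facts: Facts, max_rounds: int = 3) -> Tuple[Facts, List[int]]:
    """Feed the named violations back to the agent for up to max_rounds."""
    history = [len(violations(facts))]
    for _ in range(max_rounds):
        failed = violations(facts)
        if not failed:
            break
        facts = REPAIRS[failed[0]](facts)  # repair one reported violation
        history.append(len(violations(facts)))
    return facts, history


start = {"refund_amount": 250, "user_confirmed": False, "logs_contain_pii": True}
final, history = refine(start)
print(history)  # [3, 2, 1, 0]
```

A scalar score from a probabilistic judge would not tell the agent which entry in `REPAIRS` to apply; the named violation list is what makes targeted revision possible in this sketch.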

Figure 2: The neuro-symbolic architecture and verification pipeline of FormalJudge. Panel (a) outlines the oversight workflow where an LLM compiles user intent into Dafny specifications and extracts atomic facts, enabling a Z3 SMT solver to provide deterministic proofs of correctness independent o


This review was generated by AI and reviewed by a human editor.