QUICK REVIEW

[Paper Review] Tracking Capabilities for Safer Agents

Martin Odersky, Yaoyu Zhao|arXiv (Cornell University)|Mar 1, 2026

Security and Verification in Computing|0 citations

TL;DR

The paper proposes tacit, a Scala 3–based safety harness that uses tracked capabilities to constrain AI agents’ tool use, preventing information leakage and unsafe side effects while preserving expressiveness.

ABSTRACT

AI agents that interact with the real world through tool calls pose fundamental safety challenges: agents might leak private information, cause unintended side effects, or be manipulated through prompt injection. To address these challenges, we propose to put the agent in a programming-language-based "safety harness": instead of calling tools directly, agents express their intentions as code in a capability-safe language: Scala 3 with capture checking. Capabilities are program variables that regulate access to effects and resources of interest. Scala's type system tracks capabilities statically, providing fine-grained control over what an agent can do. In particular, it enables local purity, the ability to enforce that sub-computations are side-effect-free, preventing information leakage when agents process classified data. We demonstrate that extensible agent safety harnesses can be built by leveraging a strong type system with tracked capabilities. Our experiments show that agents can generate capability-safe code with no significant loss in task performance, while the type system reliably prevents unsafe behaviors such as information leakage and malicious side effects.
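The idea that capabilities are ordinary program variables can be sketched in plain Scala 3. This is an illustration only, not the paper's code: `FileCap`, `withFiles`, `summarize`, and `wordCount` are hypothetical names, the file read is faked, and the experimental capture checker is deliberately left out so the snippet compiles with a stock Scala 3 compiler. Without capture checking, an escaping capability is caught only at run time, which is the gap that static tracking is meant to close.

```scala
// Sketch only: plain Scala 3, no capture checking. All names are hypothetical.

// A capability is an ordinary value; holding it is what grants file access.
final class FileCap private (root: String):
  private var live = true

  def read(path: String): String =
    require(live, "capability used outside its scope")
    s"<contents of $root/$path>" // stand-in for real I/O

  private def revoke(): Unit = live = false

object FileCap:
  // Grants access only for the duration of `body`, then revokes it.
  def withFiles[A](root: String)(body: FileCap ?=> A): A =
    val cap = new FileCap(root)
    try body(using cap)
    finally cap.revoke()

// No capability parameter: this function cannot reach files through this API.
def wordCount(text: String): Int =
  text.split("\\s+").count(_.nonEmpty)

// Needs file access, and its signature says so.
def summarize(path: String)(using files: FileCap): Int =
  wordCount(files.read(path))

@main def capabilityDemo(): Unit =
  val n = FileCap.withFiles("/data")(summarize("notes.txt"))
  println(n)

  // The capability escapes its scope here. Plain Scala accepts this and the
  // later read fails at run time instead of being rejected by the compiler.
  val leaked = FileCap.withFiles[FileCap]("/data")(summon[FileCap])
  println(scala.util.Try(leaked.read("notes.txt")).isFailure)
```

In this sketch `wordCount` plays the role of a locally pure sub-computation: it receives no capability, so it has nothing to perform an effect with. The abstract's claim is stronger, namely that Scala's type system tracks this statically rather than relying on a runtime flag such as `live`.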

Motivation & Objective

Motivate the safety risks that arise when agents interact with tools and data in the real world.
Propose a capability-based safety harness that tracks access and effects via type systems.
Show that capability tracking enables static guarantees against information leakage and unsafe side effects.
Demonstrate practical implementation (tacit) and evaluate safety and expressiveness with LLM-generated code.

Proposed method

Introduce tracked capabilities in Scala 3 where types encode the set of capabilities a value may capture.
Define Classified containers that wrap sensitive data and enforce pure transformations through capability-aware map operations.
Use a safety harness (tacit) with capability-safe interfaces and a safe-mode compiler to ensure capability tracking across turns.
Provide a runtime and API design where all interactions with files, processes, and networks go through a capability library with scoped lifetimes.
Implement a two-channel output model: a normal channel for agent feedback and a secure channel for human users, ensuring classified content cannot leak into the agent’s context.
Evaluate using end-to-end safety benchmarks against prompt-injection attacks and assess expressiveness on agentic benchmarks.
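The Classified container and the two-channel output model described above might look roughly like the following. This is a minimal sketch in plain Scala 3 under stated assumptions, not tacit's actual API: `Classified`, `AgentChannel`, `SecureChannel`, and `deliver` are hypothetical names chosen for illustration, and capture checking is again omitted so the code compiles with a stock compiler.

```scala
// Sketch only: plain Scala 3, no capture checking. All names are hypothetical.
import scala.collection.mutable.ListBuffer

// Normal channel: everything written here flows back into the agent's context.
final class AgentChannel:
  private val buf = ListBuffer.empty[String]
  def write(x: Any): Unit = buf += x.toString
  def transcript: List[String] = buf.toList

// Secure channel: shown to the human user only.
final class SecureChannel:
  private val buf = ListBuffer.empty[String]
  def write(s: String): Unit = buf += s
  def shown: List[String] = buf.toList

// Wraps sensitive data; the payload is private to this class and its companion.
final class Classified[T] private (private val value: T):
  // Transforms the payload without unwrapping it. Plain Scala cannot check
  // that `f` is side-effect-free; that is the job of capture checking.
  def map[U](f: T => U): Classified[U] = new Classified(f(value))

  // Printing a classified value reveals only a placeholder.
  override def toString: String = "Classified(<redacted>)"

object Classified:
  def apply[T](value: T): Classified[T] = new Classified(value)

  // The only place a payload is unwrapped: delivery to the secure channel.
  def deliver[T](c: Classified[T], out: SecureChannel): Unit =
    out.write(c.value.toString)

@main def classifiedDemo(): Unit =
  val agent = AgentChannel()
  val human = SecureChannel()

  val secret = Classified("api-key-123")
  val length = secret.map(_.length) // still classified

  agent.write(length)               // the agent sees the placeholder
  Classified.deliver(length, human) // the human sees the real value

  println(agent.transcript)
  println(human.shown)
```

The weak point of this plain-Scala version is `map`: nothing stops a caller from passing a function that writes its argument to `agent`, which would leak the payload. Scala 3's experimental capture checking distinguishes pure function types (written `T -> U`) from impure ones, and the review's description of "pure transformations through capability-aware map operations" suggests the paper relies on a restriction of that kind.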

Experimental results

Research questions

RQ1: Does the type system reliably prevent unsafe agent behaviors such as information leakage and unauthorized side effects under adversarial conditions?
RQ2: Can agents generate capability-safe code with no significant loss in task performance compared to conventional tool-calling interfaces?

Key findings

In classified mode, the type system blocks all injection and exfiltration attempts across tested models and tasks (100% security).
Under classified mode, task utility remains high (Claude Sonnet 4.6 ≈ 99.2%, MiniMax M2.5 ≈ 90.0%).
In unclassified mode, security depends on model alignment, with some leakage observed for less-aligned models (e.g., MiniMax M2.5 shows 27.3% leakage on malicious tasks).
The approach generalizes to the stock AgentDojo domains, with security and utility comparable to the CaMeL baselines.
Experiments show that typed, capability-safe code matches or exceeds the performance of traditional tool calling in several benchmarks (τ2-bench) and is competitive on SWE-bench Lite.

This review was created by AI and reviewed by human editors.