Governed Code Mode: From Tool-Calling to Zero-Trust Execution
Cloudflare, Anthropic, Google, and others are converging on a new pattern: let agents write code—but only inside a governed, zero-trust runtime.

Introduction#
For most of the last two years, production agents have looked roughly the same:
a tool list in the prompt, a JSON schema, and a loop that calls tools one step at a time.[1]
It works. But as the number of tools grows, this pattern starts to crack:
- Context windows fill up with tool descriptions.
- JSON schemas become fragile and verbose.
- Multi-step workflows become hard to express declaratively.
Recently, a different pattern has started to emerge:
- Cloudflare’s Code Mode generates a TypeScript API from MCP tools and asks the LLM to write code against it, executed in Workers isolates.[2][3]
- Anthropic’s “Code Execution with MCP” moves from prompt-based tool calling to model-written code running in an isolated runtime that talks to MCP servers.[4]
- Google’s Vertex AI Agent Engine and NVIDIA’s WebAssembly work explore managed sandboxes for LLM-generated code.[5][6]
No mainstream agent framework is “all-in” on this yet. It’s not a standard; it’s a direction. This post is about that direction—and how a “governed code mode” architecture fits into it.
This is not a benchmark of today’s agents. It’s a map of an emerging pattern across Cloudflare, Anthropic, Google, NVIDIA, and others—and how a governed variant can look.
1. How We Got Here: Tool Lists, JSON, and Workflow Fatigue#
The default agent stack today still looks like this:
- Load 10–50 tools into the prompt (OpenAPI, MCP, or custom schemas).
- Ask the model to pick a tool and emit JSON arguments.
- Execute, observe, repeat (ReAct / plan-and-execute loop).[1][7]
Frameworks like LangGraph formalized this into plan-and-execute agents:
one model plans, another executes the steps.[1][7]
Other frameworks express workflows in YAML or graph configs, but still rely on tool-calling under the hood.
This approach has clear benefits:
- Deterministic schemas.
- Good for single-step operations or short chains.
- Easy to reason about.
But it also hits hard limits:
- Context bloat: every new tool increases prompt size.
- Schema brittleness: small models struggle to produce perfect JSON for complex nested types.
- Expressiveness limits: cross-tool logic (loops, conditionals, retries, error backoff) becomes painful to express purely as declarative plans.
That’s the gap the new “code execution” work is trying to close.
2. Code Execution Arrives (Cautiously)#
Three separate lines of work are starting to converge:
Cloudflare: Code Mode on MCP#
Cloudflare’s Code Mode takes MCP servers—which were originally designed to be exposed directly as tools—and instead:
- Generates TypeScript bindings from MCP tool schemas.
- Exposes them as a typed API (codemode.*) inside Cloudflare Workers.
- Asks the LLM to write TypeScript that calls this API, rather than calling tools directly.[2][3][8]
The code runs inside Workers isolates, with access to tools controlled through a dedicated proxy object (CodeModeProxy). API keys are hidden; outbound requests can be filtered and rate-limited.[3][8]
The result is a runtime where the model can:
- Coordinate multiple MCP servers.
- Implement conditionals, loops, and retries in code.
- Still be confined to a restricted, instrumented environment.
Anthropic: Code Execution with MCP#
Anthropic’s “Code Execution with MCP” goes in the same direction: instead of stuffing tool schemas into the prompt, the model writes code that calls MCP servers from inside a secure execution environment.[4]
The motivation is pragmatic:
- Reduce prompt size by loading tools at runtime instead of prompt time.
- Shift data processing out of the LLM’s context (e.g., local joins, filters).
- Keep tools behind a protocol boundary (MCP), not embedded in the prompt.
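To make the data-processing point concrete, here is a hedged sketch of the kind of in-sandbox filtering this enables. The db binding and its query shape are hypothetical stand-ins for whatever bindings a runtime actually exposes, not part of Anthropic's published interface.

```typescript
// Illustrative sketch: heavy filtering happens inside the sandbox, and only a
// small summary crosses back into the model's context. The `db` binding and its
// query shape are hypothetical.
interface UserRow {
  id: string;
  status: string;
}

async function summarizeStaleUsers(
  db: { query(q: { sql: string }): Promise<UserRow[]> }
): Promise<{ staleCount: number; sampleIds: string[] }> {
  // Possibly thousands of rows; they never enter the prompt.
  const rows = await db.query({ sql: "SELECT id, status FROM users" });
  const stale = rows.filter((r) => r.status === "inactive");
  // Only this small summary object flows back toward the model.
  return { staleCount: stale.length, sampleIds: stale.slice(0, 5).map((r) => r.id) };
}
```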
But Anthropic also calls out the risk: running model-written code creates new responsibilities for security and platform teams—sandbox design, monitoring, network control.[4]
Google, NVIDIA, and Others: Sandboxes as a First-Class Primitive#
This pattern isn’t limited to MCP:
- Google Vertex AI Agent Engine now ships a managed code execution sandbox: LLMs generate Python, which runs in a controlled environment with explicit lifecycle management.[5]
- NVIDIA’s WebAssembly work explores running LLM-generated Python inside a browser-based Wasm sandbox (Pyodide) to isolate code and reduce risk to backend systems.[6]
- Independent practitioners document similar “code sandboxes for agents” patterns: a dedicated code-execution service with strict isolation and an in-sandbox API for tools.[9]
These are all early-stage, evolving systems. They don’t replace tool-calling; they add a second path for tasks that truly need code.
3. The Pattern Behind “Code Mode”#
If you strip away product names, a common architecture emerges.
1. Context Stage: Minimal Interfaces, Not Full Tools#
Rather than dumping every tool into the prompt, systems:
- Generate thin language bindings from schemas (TypeScript for Cloudflare, Python for Vertex).[2][5]
- Expose only the signatures relevant to the LLM: method names, args, rough semantics.
- Keep secrets and low-level details out of the prompt.
This reduces context churn and makes life easier for small models.
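As a rough illustration of what such a thin surface can look like, here is a hypothetical generated binding file. The names (codemode, ListFilesArgs, and so on) are invented for this post, not the output of any vendor's actual schema-to-TypeScript generator.

```typescript
// Hypothetical generated bindings: only the signatures the model needs.
// Credentials, transport, retries, and pagination live on the host side.
export interface ListFilesArgs {
  path: string;          // directory to list
  recursive?: boolean;   // defaults to false
}

export interface FileEntry {
  name: string;
  modifiedAt: string;    // ISO 8601 timestamp
  sizeBytes: number;
}

export interface CodeModeApi {
  // One method per MCP tool; the body is implemented outside the sandbox.
  listFiles(args: ListFilesArgs): Promise<FileEntry[]>;
  queryDatabase(args: { project: string }): Promise<{ openIssues: number }>;
  createTask(args: { title: string }): Promise<{ id: string }>;
  sendEmail(args: { to: string; subject: string; body?: string }): Promise<void>;
}
```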
2. Code Stage: The Model Writes a Program#
The model is then asked to solve the task by writing code:
```typescript
// Pseudocode style for illustration
const files = await codemode.listFiles({ path: "/projects" });
const recent = pickMostRecent(files);
const status = await codemode.queryDatabase({ project: recent.name });
if (needsAttention(status)) {
  await codemode.createTask({ title: `Review ${recent.name}` });
  await codemode.sendEmail({ to: "[email protected]", subject: "Review needed" });
}
```

This is not a general-purpose VM:
- Imports are locked down.
- Only a curated set of bindings exist.
- The runtime can enforce limits and inspect calls.[3]
3. Sandbox Stage: Isolated Execution#
The code runs inside a sandbox:
- Cloudflare Workers isolates.[3]
- Managed code execution in Vertex AI Agent Engine.[5]
- WebAssembly-based sandboxes in the browser (Pyodide).[6]
These sandboxes:
- Control network egress.
- Limit CPU, memory, and sometimes syscalls.
- Provide clear boundaries between user code and system resources.
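The exact knobs differ between Workers isolates, Wasm runtimes, and managed cloud sandboxes, but the shape of the limits a host enforces tends to look something like this (illustrative field names and values, not any vendor's defaults):

```typescript
// Illustrative resource and egress limits a host might enforce on the sandbox.
// The field names and values are invented for this post.
const sandboxLimits = {
  cpuMs: 5_000,                          // compute budget per execution
  memoryMb: 128,                         // hard memory ceiling
  wallClockMs: 30_000,                   // total execution time
  network: { egress: "gateway-only" },   // no direct outbound connections
  filesystem: "ephemeral",               // scratch space discarded after the run
} as const;
```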
4. Gateway Stage: Policy-Enforced Tool Access#
The sandbox does not talk to tools directly. Instead, it emits structured I/O:
{"tool": "codemode.listFiles", "args": {"path": "/projects"}}A gateway outside the sandbox then:
- Injects real credentials.
- Checks policy (ABAC/RBAC).
- Calls MCP servers or other backends.
- Returns only the result (or a handle to it) back into the sandbox.[3][8]
This “I/O trap” is the heart of the zero-trust approach.
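A minimal sketch of that trap, with hypothetical checkPolicy and mcpClients helpers standing in for your policy engine and MCP client of choice:

```typescript
// Sketch of the gateway's "I/O trap": every structured call from the sandbox
// passes through policy and credential injection before reaching a real backend.
// All names here (ToolCall, checkPolicy, mcpClients) are hypothetical.
interface ToolCall {
  tool: string;                      // e.g. "codemode.listFiles"
  args: Record<string, unknown>;
}

interface CallContext {
  userId: string;
  tenantId: string;
}

declare function checkPolicy(
  ctx: CallContext,
  tool: string,
  args: Record<string, unknown>
): Promise<{ allowed: boolean; reason?: string; credential?: string }>;

declare const mcpClients: {
  forTool(
    tool: string,
    opts: { apiKey?: string }
  ): { invoke(tool: string, args: Record<string, unknown>): Promise<unknown> };
};

async function handleSandboxCall(call: ToolCall, ctx: CallContext): Promise<unknown> {
  // 1. Policy is evaluated outside the sandbox, on every call.
  const decision = await checkPolicy(ctx, call.tool, call.args);
  if (!decision.allowed) {
    throw new Error(`Policy denied ${call.tool}: ${decision.reason}`);
  }

  // 2. Real credentials are injected here; sandboxed code never sees them.
  const client = mcpClients.forTool(call.tool, { apiKey: decision.credential });

  // 3. Only the result (or a handle to a large result) is returned into the sandbox.
  return client.invoke(call.tool, call.args);
}
```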
"The industry isn’t throwing away tools. It’s giving them a programmable runtime and a policy gate."
4. What a “Governed Code Mode” Adds#
AgentGovernor fits into this emerging family by focusing less on how to run code and more on how to govern it.
In this lens, “Governed Code Mode” has three layers:
A. Code as a Plan, Not as Truth#
The model’s code is treated as a plan, not a trusted artifact.
Before execution:
- A static analyzer walks the AST.
- It derives a manifest of intent:
```json
{
  "title": "Clean stale users and notify team",
  "tools_used": ["gdrive.read", "slack.post"],
  "estimated_egress_bytes": 84000,
  "side_effects": ["chat.postMessage"]
}
```

- This manifest becomes the object of review—by humans, by policy engines, or both.
The goal is the same one Anthropic and others highlight: give operators something more reliable than “the model’s own explanation” of its behavior.[4][6]
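As a sketch of how the first half of that could work, here is a pass over model-written code using the TypeScript compiler API that collects the tools_used field of the manifest above. It only catches direct codemode.* calls; a real analyzer would also have to handle aliasing, dynamic property access, and egress estimation.

```typescript
import * as ts from "typescript";

// Rough sketch: derive part of a "manifest of intent" by walking the AST of
// model-written code and collecting calls on the codemode.* proxy.
// The manifest field name mirrors the example above and is illustrative.
interface Manifest {
  tools_used: string[];
}

function deriveManifest(source: string): Manifest {
  const sf = ts.createSourceFile("plan.ts", source, ts.ScriptTarget.ES2022, true);
  const tools = new Set<string>();

  const visit = (node: ts.Node): void => {
    if (
      ts.isCallExpression(node) &&
      ts.isPropertyAccessExpression(node.expression) &&
      node.expression.expression.getText(sf) === "codemode"
    ) {
      tools.add(`codemode.${node.expression.name.text}`);
    }
    ts.forEachChild(node, visit);
  };
  visit(sf);

  return { tools_used: [...tools] };
}

// Run against the earlier plan, deriveManifest(planSource).tools_used would yield
// ["codemode.listFiles", "codemode.queryDatabase", "codemode.createTask", "codemode.sendEmail"].
```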
B. Keyless Bindings and ABAC at the Edge#
Like Code Mode, the agent code never sees raw credentials.[2][8]
Instead:
- Bindings such as gdrive, slack, or db are proxies.
- Calls from the sandbox are serialized into structured requests.
- An ABAC/RBAC gateway decides, per call:
  - Is this user allowed to call this tool?
  - At this scope (tenant/project/document)?
  - With this data volume and rate?
Only then does the gateway talk to MCP servers or other backends.
This aligns with the security goals seen in Cloudflare’s proxy design and Google’s “Agent Sandbox” direction on Kubernetes.[3][10]
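To make the per-call questions concrete, here is one way the attribute check could be shaped. The rule and attribute fields are illustrative, not the syntax of any particular policy engine.

```typescript
// Illustrative ABAC rule shape and evaluation; not any specific engine's syntax.
interface AbacRule {
  tool: string;                  // e.g. "gdrive.read"
  roles: string[];               // roles allowed to invoke the tool
  scopes: string[];              // tenants/projects/documents in scope
  maxEgressBytes: number;        // per-call data volume ceiling
  maxCallsPerMinute: number;     // rate ceiling
}

interface CallAttributes {
  role: string;
  scope: string;                 // the tenant/project/document the call targets
  egressBytes: number;           // estimated or measured payload size
  callsInLastMinute: number;
}

function evaluate(rule: AbacRule, attrs: CallAttributes): { allowed: boolean; reason?: string } {
  if (!rule.roles.includes(attrs.role)) return { allowed: false, reason: "role not permitted" };
  if (!rule.scopes.includes(attrs.scope)) return { allowed: false, reason: "out of scope" };
  if (attrs.egressBytes > rule.maxEgressBytes) return { allowed: false, reason: "egress limit exceeded" };
  if (attrs.callsInLastMinute >= rule.maxCallsPerMinute) return { allowed: false, reason: "rate limit exceeded" };
  return { allowed: true };
}
```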
C. Audit as a First-Class Artifact#
Finally, instead of depending purely on runtime logs, a governed code mode treats audit data as a first-class output:
- The static manifest (what the code intends to do).
- The I/O trace (what calls were actually made).
- A stable “plan hash” that allows replay and comparison.
This is the difference between “we logged some stuff” and “we can explain, in plain language and structured data, what this agent did and why.”
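The plan hash itself can be as simple as a digest over the canonicalized manifest plus the normalized plan source. A minimal sketch, assuming Node's built-in crypto and a deliberately naive normalization:

```typescript
import { createHash } from "node:crypto";

// Sketch: a stable "plan hash" over the manifest and the normalized plan source,
// so a reviewed plan can be pinned, replayed, and diffed later.
// The normalization and canonicalization rules here are illustrative.
function planHash(manifest: Record<string, unknown>, source: string): string {
  const normalizedSource = source.replace(/\s+/g, " ").trim();
  const canonicalManifest = JSON.stringify(
    Object.fromEntries(Object.entries(manifest).sort(([a], [b]) => a.localeCompare(b)))
  );
  return createHash("sha256")
    .update(canonicalManifest)
    .update("\n")
    .update(normalizedSource)
    .digest("hex");
}

// Same manifest plus same normalized code gives the same hash; any drift shows up in review.
```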
Code Mode gives agents a language powerful enough for real work. Governance makes that power compatible with security and compliance.
5. Tradeoffs: Where Code Mode Helps—and Where It Doesn’t#
Where It Shines#
- Complex workflows: Multi-step, cross-tool logic is more natural in code than in deeply nested JSON or hand-authored YAML.[3]
- Context efficiency: Tool schemas live outside the prompt; the model only sees a thin API surface.[2][4]
- Performance: Heavy data work (filtering, joins, transforms) can move closer to the data, reducing token traffic.[4][5]
Where It’s Still Hard#
- Infrastructure complexity: You now own a code runtime, sandbox orchestration, and monitoring. Google explicitly frames this as a cloud-level responsibility, not a toy feature.[5][10]
- Debugging model-written programs: Logs, manifests, and constraints are mandatory; otherwise, you’re just chasing stack traces generated by a stochastic compiler.
- Not everything needs code: Classic JSON tool-calling and plan-and-execute patterns remain simpler and easier to reason about for many tasks.[1][7]
The emerging consensus isn’t “all agents should use Code Mode.” It’s closer to:
“Agents should have access to a governed code path for the 10–20% of tasks that truly need it—and that path must be designed as a zero-trust system, not a convenience feature.”
Conclusion#
Right now, most production agents still lean on tool lists, JSON schemas, and plan-and-execute loops. That stack isn’t going away.
What’s changing is the ceiling:
- Cloudflare’s Code Mode shows that MCP tools can be compiled into a typed API and orchestrated via code in Workers.[2][3][8]
- Anthropic’s Code Execution with MCP shows how to shift work from prompts into a secure runtime while keeping tools behind a protocol boundary.[4]
- Google, NVIDIA, and others are standardizing the idea of “agent sandboxes” as a core runtime primitive.[5][6][10]
A governed code mode architecture—like the one AgentGovernor aims at—takes these building blocks and asks a stricter question:
How do we let agents write code without giving up policy, auditability, or control?
The answer isn’t one library call or one product feature. It’s an architecture: code as plan, sandbox as chassis, gateway as law.
Footnotes#
1. LangChain / LangGraph. “Plan-and-Execute Agents.” https://blog.langchain.com/planning-agents/
2. Cloudflare Blog (2025-09-26). “Code Mode: the better way to use MCP.” https://blog.cloudflare.com/code-mode/
3. Cloudflare Agents GitHub Docs. “codemode.md – Workers isolates, CodeModeProxy, MCP workflow model.” https://github.com/cloudflare/agents/blob/main/docs/codemode.md
4. Anthropic Engineering (2025). “Code Execution with MCP.” (Summary and analysis.) https://i10x.ai/news/anthropics-code-execution-with-mcp-i10x-analysis
5. Google Cloud Developer Forums. “Introducing Code Execution: the code sandbox for Vertex AI Agent Engine.” https://discuss.google.dev/t/introducing-code-execution-the-code-sandbox-for-your-agents-on-vertex-ai-agent-engine/264336
6. NVIDIA Technical Blog. “Sandboxing Agentic AI Workflows with WebAssembly.” https://developer.nvidia.com/blog/sandboxing-agentic-ai-workflows-with-webassembly/
7. LangGraph Documentation. “Plan-and-Execute Tutorial Notebook.” https://langchain-ai.github.io/langgraph/tutorials/plan-and-execute/plan-and-execute/
8. T. N. Vu. “Code Mode: the better way to use MCP” (independent analysis). https://docs.tuannvm.com/blog/code-mode-the-better-way-to-use-mcp
9. Amir Malik. “Code Sandboxes for LLMs and AI Agents.” https://amirmalik.net/2025/03/07/code-sandboxes-for-llm-ai-agents
10. Google Cloud Blog. “Agentic AI on Kubernetes and GKE: Agent Sandbox.” https://cloud.google.com/blog/products/containers-kubernetes/agentic-ai-on-kubernetes-and-gke