A quiet shift is happening underneath the agent stack: the filesystem is becoming the most important piece of LLM infrastructure.
The filesystem is the agent's working memory
A useful mental model: context window is RAM, filesystem is disk. Everything that matters should be written to disk.
Take Manus. While executing a task, the agent writes three Markdown files — task_plan.md (goals and progress), notes.md (research findings), plus a results file. When the context window fills up, the agent doesn't lose the thread, because it can re-read task_plan.md and pull the objective back into attention. That's a clean fix for "Lost in the Middle" on long-horizon tasks.
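A minimal sketch of that re-read loop, in Python. Only the file name task_plan.md comes from the description above; the helper names (init_plan, record_progress, rebuild_context) and the checklist layout are assumptions for illustration, not Manus's actual implementation.

```python
from pathlib import Path

# Hypothetical layout: a goal section plus a checklist of finished steps.
PLAN_TEMPLATE = "# Task plan\n\n## Goal\n{goal}\n\n## Progress\n"


def init_plan(workdir: Path, goal: str) -> Path:
    """Create task_plan.md holding the goal and an empty progress list."""
    plan = workdir / "task_plan.md"
    plan.write_text(PLAN_TEMPLATE.format(goal=goal), encoding="utf-8")
    return plan


def record_progress(plan: Path, step: str) -> None:
    """Append one finished step so it survives a context reset."""
    with plan.open("a", encoding="utf-8") as f:
        f.write(f"- [x] {step}\n")


def rebuild_context(plan: Path) -> str:
    """Re-read the plan from disk to pull the objective back into attention."""
    return plan.read_text(encoding="utf-8")
```

The point of the design is that nothing in the context window is the source of truth. When the window is truncated, the harness calls something like rebuild_context and puts the result back at the end of the prompt, where recent tokens get the most attention.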
The interesting thing: three independent products — Manus, Claude Code (CLAUDE.md + Skills + .claude/MEMORY.md), and OpenClaw — all converged on the same primitive. Use Markdown files as agent memory.
What the filesystem changes
Memory. No dedicated memory module, no vector database. CLAUDE.md is project-level long-term memory. task_plan.md is task-level working memory. .claude/MEMORY.md is the experience log. The industry spent millions on vector DBs and RAG; the design that actually shipped is a few text files in a folder.
Skills. Claude Code's Skills system is just files — SKILL.md loaded on demand. 40+ agent tools now speak the same Skills convention. Skills aren't code. They're files.
Context engineering. Manus stores full tool-call results to the filesystem and keeps only file-path references in context. When the agent needs detail, it globs and greps on demand. That's exactly what Anthropic calls just-in-time context — don't stuff the database into the window; maintain an index and pull when needed.
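The pattern can be sketched in a few lines of Python, assuming a per-task working directory. The tool_results/ folder, the reference-string format, and the function names are hypothetical choices for this sketch, not something Manus or Anthropic document.

```python
import hashlib
import re
from pathlib import Path


def store_result(workdir: Path, tool: str, output: str) -> str:
    """Write the full tool output to disk and return a short reference string.

    Only the returned reference goes into the context window.
    """
    digest = hashlib.sha256(output.encode("utf-8")).hexdigest()[:12]
    path = workdir / "tool_results" / f"{tool}-{digest}.txt"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(output, encoding="utf-8")
    return f"[{tool} output saved to {path} ({len(output)} chars)]"


def grep_results(workdir: Path, pattern: str,
                 glob: str = "tool_results/*.txt") -> list[tuple[str, int, str]]:
    """Return (file name, line number, line) for every line matching pattern."""
    rx = re.compile(pattern)
    hits = []
    for path in sorted(workdir.glob(glob)):
        lines = path.read_text(encoding="utf-8").splitlines()
        for number, line in enumerate(lines, start=1):
            if rx.search(line):
                hits.append((path.name, number, line))
    return hits
```

The context cost of a tool call becomes one short line regardless of output size, and the detail is still one grep away.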
So what is a sandbox?
Put plainly: a sandbox is an isolated execution environment with a filesystem. Without a filesystem the agent has no state, no memory, no workspace, no skill-loading mechanism. Everything above — task_plan.md, CLAUDE.md, Skills, externalised context — depends on a persistent, readable, writable disk.
That's why the sandbox layer is one of the most certain bets in 2026 agent infrastructure.
Three products that proved it
Manus: the "cloud computer" that made the agent step-change

Manus uses E2B to assign a full virtual computer per task (source: E2B Blog).

Manus's multi-agent architecture: a planner decomposes, executors run inside E2B sandboxes (source: E2B Blog).
Manus isn't a single LLM agent. It's a multi-agent orchestration system: planner decomposes, executor runs, verifier checks. The thing that took it from chatbot to agent that ships work was assigning a full cloud VM per task — filesystem, browser, terminal, network access. Inside that VM the agent writes code, builds sites, runs analysis, even ships mobile apps.
Manus's co-founder put it bluntly: "Manus isn't running a few lines of code. It uses 27 different tools and needs E2B to give it a full virtual computer."
E2B's own estimate: building this infrastructure in-house would have cost Manus 3–5 full-time infra engineers for several months. Manus chose to self-host E2B instead and spend its headcount on multi-agent orchestration. The signal is unambiguous: the sandbox is infrastructure, not something you should build yourself.
AutoGLM: the more aggressive "cloud phone + cloud computer"

AutoGLM gives each user a cloud phone and a cloud computer the agent operates like a human.
Zhipu's AutoGLM goes further. Instead of giving the agent a sandbox, it gives the user a cloud phone and a cloud computer. The phone has 30 apps pre-installed (Weibo, Xiaohongshu, Taobao, Douyin). The computer is Ubuntu plus browser plus LibreOffice. The agent operates inside these environments the way a person would.
Why? Because the real world is too noisy. Different WeChat versions, different UI layouts, popup ads. Zhipu's bet isn't "make the model smarter." It's "create a standardised world." The upper bound of an agent's capability is the completeness of its sandbox. Give it a browser, it browses. Give it a full computer, and it can do anything a general-purpose computer can.
Claude Code: the agent moves into your filesystem
Manus gives the agent its own VM. AutoGLM gives it a cloud phone. Claude Code goes the other direction — the agent moves into your existing project directory.
This is "filesystem as agent infrastructure" in its purest form. Claude Code doesn't create a new environment; it operates on your codebase directly. It reads sources to understand architecture, edits files to fix bugs, runs your tests, tails your logs. CLAUDE.md becomes long-term memory. The directory layout becomes the agent's cognitive map. Git history becomes accumulated experience.
The lesson: the filesystem isn't just storage. It's the agent's working memory and its cognitive interface. No "memory module" needed — files are memory. No knowledge base needed — the codebase is the knowledge base.
Even the session itself is a file — JSONL under ~/.claude/projects/, growing to multiple GB. The entire agent state, history, and context live on disk. That's why Anthropic's Managed Agents pulled the session out of the container into separate external storage. These files are too important to die with the container.
Stack the three side by side: Manus gives the agent a new computer (isolated filesystem). AutoGLM gives the agent a standardised device (controlled filesystem). Claude Code gives the agent your computer (shared filesystem). Different shapes, identical requirement — a persistent environment where the agent can read and write files. That's what a sandbox is.
A whirlwind tour of the market
Perplexity. 340M searches per month. Perplexity used E2B for code execution and data visualisation on Pro, and went from integration to shipping in one week. It is now building its own Sandbox API with K8s-pod isolation, one pod per session. The security design is worth noting: sandboxes have no direct network access; egress is brokered by an external proxy that matches by domain and injects credentials. The sandbox itself never sees an API key. That's exactly Anthropic's "credentials don't enter the sandbox" stance, reached independently by two teams.
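To make the brokered-egress idea concrete, here is a small Python sketch of the decision a proxy like that has to make per request. It is not Perplexity's implementation: the EgressBroker class, the exact-host allowlist, and the use of an Authorization header are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from urllib.parse import urlsplit


@dataclass
class EgressBroker:
    # Hypothetical layout: allowed host -> Authorization value to inject.
    # A real deployment would read these from a secrets vault.
    credentials: dict[str, str] = field(default_factory=dict)

    def forward_headers(self, url: str,
                        sandbox_headers: dict[str, str]) -> dict[str, str]:
        """Return the headers to send upstream, or refuse the request."""
        host = urlsplit(url).hostname or ""
        # Exact host match: "api.example.com.evil.net" must not pass.
        if host not in self.credentials:
            raise PermissionError(f"egress to {host!r} is not allowed")
        # Drop whatever Authorization the sandbox sent, then inject ours.
        headers = {k: v for k, v in sandbox_headers.items()
                   if k.lower() != "authorization"}
        headers["Authorization"] = self.credentials[host]
        return headers
```

Because the secret is attached outside the sandbox, code running inside it can use the API but cannot read or exfiltrate the key.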
This also reveals a structural threat to E2B: large customers tend to graduate and roll their own.
Hugging Face. Open R1 RL training on E2B, spinning up hundreds of sandboxes concurrently. LMArena runs Web-Arena evals on E2B. Meta uses Modal for Code World Model, thousands of concurrent sandboxes for RL. Sandboxes have a second use case beyond runtime: training time. Agents need to learn how to operate environments, which means training usage will dwarf the "execute a user task" load.
Devin (Cognition), $10.2B valuation. Took "agent works in a sandbox" to its logical extreme. Every Devin instance runs in an isolated sandbox with shell, code editor, browser, and persistent filesystem. Devin 2.0 runs multiple parallel instances per user, each in its own cloud IDE. ARR went from $1M in Sep 2024 to $73M by Jun 2025; merged ARR after the Windsurf acquisition is roughly $150M. Goldman Sachs runs Devin alongside 12,000 engineers. The structural lesson: when you give the agent a whole computer — not just a completion API — the product category changes. Devin's debug-run-deploy loop is in a different dimension from Cursor or Copilot, and the difference is the sandbox.
Bolt.new (StackBlitz). The most dramatic story in the space. StackBlitz spent seven years on WebContainers (a full Node.js runtime inside the browser). By late 2023 ARR was $80K. Investors gave them a last chance. Then Claude 3.5 Sonnet shipped in June 2024, they combined it with WebContainers, and: 30 days from $0 to $4M ARR, six months to $40M, 5M users, $700M valuation. Their sandbox isn't a cloud VM, isn't Firecracker, isn't Docker — the sandbox is your browser tab. Millisecond startup, zero network round-trips, near-zero server cost because compute lives on the user's machine. CTO Albert Pai: "Everyone thinks we have a huge server farm. The server is your browser."
Lovable. Vibe coding's poster child, sandbox on Fly.io containers, every user-build pays for server time. Interesting contrast with Bolt.new — same product category, opposite cost structure, opposite business model.
v0 (Vercel). Evolved from component generator to full-stack tool in early 2026, with sandbox-based runtime on Vercel's own Sandbox (Firecracker microVM + Fluid compute). 6M developers, $9.3B valuation.
OpenHands (formerly OpenDevin). 68.6K stars, $18.8M Series A. Each task runs in a Docker sandbox. SWE-bench Verified score with Claude: 77.6%. Their V1 SDK is moving from "Docker required" to "sandbox optional": not every task needs full isolation, which matches Anthropic's "load the sandbox on demand" design.
Replit Agent. One of the earliest "online IDE + containerised execution" products. Self-built Nix environments. Cost-per-container is the trade-off.
Phoenix.new. Chris McCord (Phoenix framework) built Phoenix.new on Fly.io Sprites. After the agent generates a Phoenix app, you can see its runtime logs — impossible on ephemeral sandboxes where the box dies with the task. Persistent sandboxes let the agent use the app's full lifecycle: not just write the code, but tail logs, debug, monitor.
What Managed Agents actually fixed
Three core design moves are worth lifting straight out of Anthropic's writeup.
1. Separate reasoning from execution

After decoupling: the harness is pulled out of the container, the session stored separately, the sandbox provisioned on demand.
The original mistake was packing everything into one container. That container became a pet — when the session died, it died with it; when it got stuck, you had to go in and resuscitate it. Worse, customers wanted to connect Claude to their own VPC, and when harness and sandbox were one box, the network boundary became unsolvable.
The fix: split the agent into three independent interfaces. The harness (the orchestration loop, stateless) calls the sandbox (execution environment) the way it'd call any tool:
execute(name, input) → string
Both container and harness become cattle, not pets. If one dies, replace it. The reported numbers: p50 TTFT down 60%, p95 down 90%+. The security boundary also moves: credentials never enter the sandbox, Git tokens are written to a remote during init, OAuth tokens live in an external vault. Designed-in, not bolted-on. microsandbox's "secrets never leave the host" is the same idea.
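A rough Python rendering of the execute(name, input) → string contract and the cattle behaviour that follows from it. The Sandbox protocol mirrors the signature quoted above; FlakySandbox, call_tool, and the retry-once policy are assumptions for this sketch rather than Anthropic's code.

```python
from typing import Callable, Protocol, Tuple


class Sandbox(Protocol):
    """Anything that can run a named tool and return text."""

    def execute(self, name: str, input: str) -> str: ...


class FlakySandbox:
    """Stand-in backend that can be 'killed' to simulate a dead container."""

    def __init__(self) -> None:
        self.alive = True

    def execute(self, name: str, input: str) -> str:
        if not self.alive:
            raise ConnectionError("sandbox is gone")
        return f"{name} ran with {input!r}"


def call_tool(sandbox: Sandbox, provision: Callable[[], Sandbox],
              name: str, input: str) -> Tuple[str, Sandbox]:
    """Run one tool call; if the sandbox died, replace it and retry once.

    Returns the output and whichever sandbox ended up serving the call.
    Assumes the call is safe to retry, which real tools may not be.
    """
    try:
        return sandbox.execute(name, input), sandbox
    except ConnectionError:
        replacement = provision()
        return replacement.execute(name, input), replacement
```

The harness keeps no state about a particular container. It only knows how to get another one, which is what makes replacement cheap.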
2. Load the sandbox on demand
The easiest design to underrate. Previously every session waited on container startup — clone the repo, install dependencies, replay events — even if the user just asked a one-line question. After decoupling, the container is provisioned only when Claude decides it needs to execute code. Most sessions' TTFT no longer touches sandbox cold-start.
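On-demand provisioning is easy to sketch as a lazy wrapper. In the Python below, the tuple-shaped actions, StubBackend, and the provisions counter are hypothetical and exist only to show when the expensive step happens.

```python
from typing import Callable, Optional, Tuple


class StubBackend:
    """Stand-in for a real sandbox; provisioning one is the expensive step."""

    def execute(self, name: str, input: str) -> str:
        return f"{name}: ok"


class LazySandbox:
    """Defers provisioning until the first tool call actually arrives."""

    def __init__(self, provision: Callable[[], StubBackend]) -> None:
        self._provision = provision
        self._backend: Optional[StubBackend] = None
        self.provisions = 0  # how many times we paid the cold-start cost

    def execute(self, name: str, input: str) -> str:
        if self._backend is None:
            self._backend = self._provision()
            self.provisions += 1
        return self._backend.execute(name, input)


def handle_action(action: Tuple[str, ...], sandbox: LazySandbox) -> str:
    """action is ("text", answer) or ("tool", name, input)."""
    if action[0] == "text":
        return action[1]  # plain answer: the sandbox is never touched
    _, name, tool_input = action
    return sandbox.execute(name, tool_input)
```

A session that only ever produces ("text", ...) actions finishes with provisions == 0, so its time to first token never includes a cold start.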
For sandbox vendors: your cold-start speed might matter less than you think. If the layer above is well-designed, most requests don't trigger the sandbox at all. The flip side: when one is required, latency dominates the experience. That's why Zeroboot's 0.79 ms start matters — if a sandbox is as cheap as a function call, the agent can fork a fresh environment at every decision point.
3. Decouple session storage from context window
Anthropic pulled the session log out of the container and made it an externally persisted, append-only event stream:
getEvents() // slice by position
emitEvent(id, event) // append
Three benefits: containers can die without data loss; the context window decouples from history (store everything, recall what's needed, transform freely in the harness); harness upgrades don't invalidate history.
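A toy version of such an event stream in Python, using a JSONL file as the external store. The method names follow getEvents and emitEvent from the interface above, in snake_case; the record shape, the per-session filter, and the file-based storage are assumptions of this sketch.

```python
import json
from pathlib import Path
from typing import Optional


class SessionLog:
    """Append-only JSONL event stream that lives outside any sandbox."""

    def __init__(self, path: Path) -> None:
        self.path = path
        self.path.touch(exist_ok=True)

    def emit_event(self, session_id: str, event: dict) -> None:
        """Append one event; existing lines are never rewritten."""
        record = {"session": session_id, "event": event}
        with self.path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")

    def get_events(self, session_id: str, start: int = 0,
                   end: Optional[int] = None) -> list:
        """Return this session's events, sliced by position."""
        with self.path.open(encoding="utf-8") as f:
            events = [record["event"] for record in map(json.loads, f)
                      if record["session"] == session_id]
        return events[start:end]
```

The log stores everything and decides nothing. Which slice to load, and whether to summarise or reset, stays in the harness, so a strategy that one model needs can be dropped for the next without touching stored history.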
Why split storage from management? Anthropic was explicit: "we can't predict what kind of context engineering future models will need." Their real example — Sonnet 4.5 had context anxiety and needed context reset. The same harness on Opus 4.5 didn't. Reset became dead code. Don't bake today's coping strategy into your durable store.
Player landscape, in brief

E2B: cloud sandbox execution for AI agents (source: E2B).

E2B sandbox creation growth (source: E2B).

Fly.io Sprites: persistent Firecracker microVMs that auto-sleep when idle (source: Fly.io Blog).
E2B. 200M+ sandboxes shipped, 88% of the Fortune 100, customers including Manus, Perplexity, Hugging Face, Groq. Open-core, self-host-able. Pricing — per-second plus $150/mo Pro — hurts small users while large ones hit beta-storage and 24h limits. Strategy is moving toward an open "sandbox protocol" plus Secrets Vault, monitoring, multi-sandbox console. Sandbox creation grew 375× in a year (40K → 15M monthly). Risks: pause/resume in beta with known data-loss bugs, no real SSH, ARR ~$1.5M (Jun 2025) is small relative to funding.
Daytona. Fastest-growing challenger. Pivoted from dev environments in 2025. Sub-90ms cold start. Fork / snapshot / Computer Use support. Customers: LangChain, Turing, Writer. $1M ARR in 3 months, doubled in six weeks. $24M Series A led by FirstMark, with Datadog and Figma Ventures. Risks: Docker isolation (weaker than microVM), single region, 20-person team. Apache 2.0 is a real advantage.
Fly.io Sprites. Persistence faction. Persistent Firecracker VMs, 100GB Tigris-backed storage, 30s auto-sleep, Claude Code preinstalled. Community benchmarks: 60–70% less custom code than using Machines directly. cgroup-measured billing — a four-hour Claude Code session is about $0.44. Risks: only launched Jan 2026, no SLA, no region choice, closed source.
Modal. AI infra platform, sandbox is one slice. ARR $50M, valuation pushing $2.5B. Meta for RL, Scale AI for MCP servers. Native serverless GPU, excellent Python SDK, $30/mo free credit. Trade-offs: gVisor (weaker than microVM), 24h sandbox lifetime, no BYOC, no sandbox-specific optimisation.
Quick takes.
- Zeroboot: 0.79ms boot, 190× faster than E2B. If it matures, sandboxes become as cheap as function calls.
- microsandbox (YC X26): local-first microVM, network-layer secret injection. Designed to run claude --dangerously-skip-permissions safely.
- Vercel Sandbox: Firecracker + Fluid; I/O wait isn't billed, bursty workloads see 95% cost drops. 5h session cap.
- Google Agent Sandbox: open source, K8s-native, best for teams already on K8s.
- Alibaba OpenSandbox: protocol-driven, multi-language SDK. Open-source K8s-scale solution.
From "sandbox" to "agent OS"

Managed Agents architecture overview: the Session / Harness / Sandbox three-layer virtualisation.
Step back from the line-items and the most interesting move isn't who's fastest or cheapest. The whole industry is pivoting from sandbox to agent OS.
E2B wants to be the HTTP of sandboxes. Anthropic shipped session/harness/sandbox as three OS-like abstractions. Manus gives every task a full personal computer. AutoGLM gives every user a cloud phone. Sprites calls itself a persistent computer you can summon in a second. Daytona positions as a programmable, composable computer.
Everyone is saying the same thing: the agent needs a computer. The differences are surface area — ephemeral or persistent, desktop or phone, open or closed, single-user or multi-user.
Anthropic's Managed Agents architecture spells out the cleanest answer: don't program against a specific computer, program against the interface that says "I can use any computer."
execute(name, input) → string
What's underneath is replaceable. That's the point.
Where Mana sits

Every agent needs a computer.
We're building Mana, which turns natural language into native iPhone apps and system extensions. The architecture was sandbox-first from day one, because the agent has to do things on a computer to generate, test, and ship the user's app.
Every active user session gets its own execution environment. Inside it, the agent runs code, installs dependencies, builds and validates the app. We hit every problem this piece describes — container-as-pet, harness/sandbox coupling, cold-start eating UX. We solved them the same way: pull the session out into durable storage, treat the sandbox as cattle, and gate provisioning on whether code actually needs to run.
The call we made: the endgame here isn't a single winner. It's exactly what Anthropic predicted: the interface standardises and the implementation becomes swappable. Today we run on Fly Machines. Tomorrow we might run on something else. As long as execute(name, input) → string doesn't change, the agent logic above doesn't move. That's why we hid the execution environment behind an interface on day one.