Radar

The Subsidy Ended: What Tool-Using Agents Actually Cost

Bennie Haelen — Tue, 09 Jun 2026 11:09:17 +0000

On June 1, GitHub Copilot’s usage-based billing became active for all Copilot plans, and developers reacted quickly and loudly. A Pro plan still costs $10, but it now comes with a monthly pool of AI credits. Those credits are priced at a penny each, and they’re consumed according to the model used and the tokens processed, including input, output, and cached tokens. For a heavy agentic session running a frontier model, that makes spend feel very different from a flat subscription.

That’s the news, and it’s worth understanding, but it isn’t the important part. Nothing about the underlying cost of agentic work actually changed on June 1. The tokens were always being consumed, the loops were always running, and the tool calls were always expanding the context. What changed is that the meter became visible. A workload that had been quietly subsidized under a flat rate started showing up as an itemized bill.

Where the tokens go

To see why the bill landed so hard, it helps to compare two things that look similar and bill very differently. A chat completion is close to a single transaction. You send a prompt, the model sends an answer, and you pay roughly once for the input and once for the output. A tool-using agent doesn’t work that way at all. An agent doesn’t answer a question so much as work toward it, and it works by looping. It reasons about the task, calls a tool, reads the result, reasons again, calls another tool, and continues until it decides it’s finished.

Every pass through that loop carries a cost that’s easy to miss. In many agent harnesses, each turn carries forward a large share of the accumulated context: prior messages, tool descriptions, retrieved files, and tool results. Even when some of that context is cached, summarized, or pruned, the system is still doing metered work to preserve enough state for the next decision. The final answer you actually wanted is only a thin slice of what you paid for. The loop is the bill.

This is why agent cost doesn’t scale politely. It scales with the number of turns, and the number of turns scales with how much discovery the agent has to do, which in turn scales with how vague the request was and how much irrelevant context it’s dragging along. A clean, well-scoped task might finish in three turns, while the same task posed as an open-ended question might wander through 15, each carrying the cost of everything that came before it. Under a flat rate, that difference was invisible. Under usage-based billing, it’s the difference between a small interaction and an expensive one.

Tool design is now part of the cost model

I wrote recently about a hidden tax on Model Context Protocol servers: the way an overstuffed tool catalog quietly degrades a model’s ability to route to the right tool. Bloated descriptions, overlapping responsibilities, and vague parameters make the model’s job harder and its choices worse. That argument was about accuracy. The billing change adds a second invoice for the same bloat, and this one is denominated in dollars.

The tool catalog is often part of what gets carried through the agent’s loop. A tool described in three tight sentences and a tool described in three rambling paragraphs may both function, but the second one pays rent in the context window every time an agent has it loaded. Multiply that across a catalog of 40 tools and a workflow that runs a dozen turns, and the cost of verbose tool design stops being a rounding error. Tool design was already a correctness discipline. It’s now a cost discipline as well. The same audit that tightens routing accuracy tightens the bill.

Where prompt discipline runs out

There’s a layer of this that individual users can control, and it’s worth knowing because the savings are real and immediate. Two patterns matter most, and I’ve been handing both to the engineers on a pilot I run for a large healthcare organization. They aren’t magic tricks. They’re ways to keep the agent out of unnecessary discovery loops.

The first pattern is about input. Prompt the agent like a short requirement rather than a broad question. A request such as “look at the encounter data and tell me what you find” forces the agent into discovery mode, where it burns turns figuring out what you meant, and every one of those turns carries the full context forward. Compare that to a prompt that front-loads the specifics by naming the project and the table, naming the date field to filter on, stating the output shape you want, and calling out anything that should be excluded. A better prompt would be: “Using the curated clinical project and the silver-zone encounters table, show total encounters by month for calendar year 2025, use admission_date_time for inclusion, and return one row per month ordered chronologically.” The second prompt collapses the loop. The agent has what it needs on the first turn, so it does the work instead of interviewing you for it.

In practice, the difference isn’t just polish. The vague version forces the agent to discover the data model, infer the date semantics, choose an aggregation, and decide on a display format. The specific version turns the task into a bounded query. That difference shows up in accuracy, latency, and cost.

The second pattern is about output, and it’s the lever most people overlook. Ask for plain text or Markdown during the intermediate steps, and save rich HTML formatting for the final, confirmed deliverable. Formatted output is expensive to generate, and requirements shift. If you ask for a polished HTML report on the first pass and then change a filter, you pay full output-token freight to regenerate all that layout, often more than once. The cheaper habit is to validate the numbers in text and format only at the end.

These patterns work, and they also have a ceiling. Both of them put the entire burden of cost control on the user, and they hold only as long as every user exercises the discipline on every prompt. The day someone reverts to “tell me what you find,” the savings evaporate, and the only thing standing between the team and a surprise invoice is a budget cap that reports the overspend after it has already happened.

Cost is a governance problem, not a budgeting one

That fragility is the real lesson. A budget cap is a backstop rather than a control. It will stop a runaway, but it tells you that you overspent rather than why, and it does nothing to make the next run cheaper. Treating cost as a budgeting problem leaves you forever reacting to the meter, while treating it as an architecture problem lets you build the savings in once and stop relying on everyone’s good behavior.

That means the controls that matter belong on the platform rather than in individual prompts. By the platform I don’t mean the agent itself, the coding assistant or chat client a developer drives day-to-day, and I don’t mean the model or a router sitting beneath it. I mean the control plane that sits above the agents, the layer where an organization enforces policy, access, observability, and now cost across every agent and model its developers touch. An administrative console that gives IT visibility into who is doing what and which capabilities they can install is an early, narrow instance of it. A router that sends planning to a cheap model is one feature that belongs there. The platform is where the rules live, and the agent is a consumer of those rules rather than the place you set them. The platform should route models by task, using cheaper models for planning and reserving frontier models for work that earns the price. It should bound the loop, requiring the agent to check in after a fixed number of iterations. It should cap tool-result payloads so a careless query cannot dump a million rows into the context window. It should default intermediate work to plain text, making the cheap path the path of least resistance instead of something users have to remember.

Every one of those controls is something a user can approximate by hand and something the platform can simply guarantee. This is the same principle I keep returning to in the context of data access, where safe behavior cannot depend on the person at the keyboard remembering the rules. Prompts guide behavior. Guardrails make the cheaper and safer behavior the default. Cost governance is guardrails as control plane, with a dollar sign attached, enforced at the same layer where you already enforce who is allowed to see which row.

The pattern, not the vendor

It would be a mistake to read this as only a GitHub story. GitHub is the current example because its change is visible and recent, but usage-based billing for agentic work is the direction of travel for many AI tools. The economics under the hood are similar: Agentic workloads turn single answers into loops of model calls, tool calls, and context management. The flat-rate subsidy was always going to come under pressure once the workload shifted from autocomplete to autonomy.

The organizations that treat June 1 as a pricing event will optimize a few prompts, grumble, and move on until the next vendor changes its meter. The ones that treat it as an architecture signal will push the cost controls down into the platform, where they hold regardless of which provider is counting which token. That’s the more durable place to stand. The bill didn’t get bigger this month. It got honest, and an honest bill is the kind you can engineer against.

Long-Running Agents

Addy Osmani — Mon, 08 Jun 2026 15:59:06 +0000

The following article originally appeared on Addy Osmani’s blog and is being reposted here with the author’s permission.

A long-running AI agent can keep making progress over hours, days, or weeks. It can do this across many context windows and sandboxes, recover from failure, leave structured artifacts behind, and resume where it left off.

For two years the dominant image of an “AI agent” has been a chat window with a clever loop in it. You type a goal; the agent calls some tools; you watch tokens stream by; you stop watching when the work runs out of patience or the context window fills up. That paradigm got us a long way, but it has a ceiling. The model forgets. It declares “task complete” when it isn’t. It reintroduces a bug it fixed nine turns ago. The whole thing is structured around a single sitting.

Long-running agents are what comes next. The idea is easy to state: an agent that keeps making forward progress on a goal across many sessions and many sandboxes, possibly many days or weeks, while leaving the workspace clean enough that the next session can pick up where the last one left off. The engineering is harder. You have to solve for persistence, recovery, and verification in a way that doesn’t just paper over the cracks. You have to build a state layer that lives outside the model’s context window, and you have to design the handoff between sessions so the agent doesn’t lose its mind when it wakes up and finds itself in a different sandbox with a different context window.

This post is my attempt to lay out what’s changed, who’s pushing on it, and how an engineer can use long-running agents today without writing the whole thing from scratch.

What “long-running” actually means

“Long-running” used to mean at least three different things in practice, and it helps to keep them separate.

Long-horizon reasoning. The agent has to plan and execute over many dependent steps. This is mostly a model-quality story: coherence, planning, the ability to recover from a wrong turn 10 steps ago. METR has been tracking this with their time horizon metric, which estimates how long a task a frontier model can complete with 50% reliability. The headline finding is that the metric has been doubling roughly every seven months since 2019, and their TH1.1 update earlier this year doubled the count of eight-hour-plus tasks in the eval set. If that curve holds, frontier agents complete tasks at the day scale by 2028 and the year scale by 2034.

Long-running execution. The agent’s process runs for hours or days. Maybe it’s a coding job, maybe it’s a research sweep, maybe it’s a 24-7 monitoring service. The model might be invoked thousands of times across the run. This is mostly a harness story, and it’s the one this post is mostly about.

Persistent agency. The agent has an identity that outlives any single task. It accumulates memory, learns user preferences, and is always available. This is the Memory Bank flavor of long-running.

In practice the three blur together. A real production agent does long-horizon reasoning inside a long-running execution backed by persistent agency. But the engineering problems are different in each, and so are the products that solve them.

Why this matters

There are two reasons I believe this work matters a lot right now.

The first is a phase change in what’s economically feasible to delegate. An agent that runs for 10 minutes can answer a question, summarize a doc, fix a small bug. An agent that runs for 10 hours can own an entire feature, finish a migration that was on the backlog for six quarters, or do the kind of overnight research sweep that used to require a junior analyst. One of Anthropic’s Claude Sonnet announcements put concrete numbers on this last fall: 30+ hours of autonomous coding in internal tests, including one run that produced an 11,000-line Slack-style app. That’s already past the threshold where the answer to “Should I delegate this?” is no longer obvious.

The second is that persistence changes what the agent is. A stateless agent answers your question and disappears. A long-running one accumulates context: which competitor moved which way last week, which test flaked twice on Tuesday, what you usually mean by “the dashboard.” Anthropic’s Project Vend was the most public early demonstration of this. They had a Claude instance run an actual office vending business for a month, managing inventory, setting prices, talking to suppliers. It failed in informative ways, and the second phase ran much better, but the point wasn’t profitability. The point was watching what kinds of weird coherence problems show up when an agent has to maintain identity across weeks instead of turns.

Those are the same problems every team building production agents now hits.

The three walls every long-running agent hits

Three walls show up in basically every write-up I’ve read this year.

Finite context. Even a 1M-token window fills. And context rot, the steady degradation of model performance as the window gets full, kicks in well before the hard limit. A 24-hour run is not going to fit in any context window the field has on its roadmap. Something has to give.

No persistent state. A new session starts blank. Anthropic’s framing in their scientific computing post is the cleanest version I’ve seen: “Imagine a software project staffed by engineers working in shifts, where each new engineer arrives with no memory of what happened on the previous shift.” Without an explicit persistence story, every shift change is a productivity disaster.

No self-verification. Models reliably skew positive when they grade their own work. Asked “Are you done?” they answer “yes” more often than they should. Without a separate signal that the work meets a bar, you get the agent that ships at 30% complete with full confidence.

Long-running agent designs are mostly answers to these three problems. The major labs have converged on similar shapes of answer, but with very different surface area.

The Ralph loop: One of the simpler practitioner versions of long-running agents

The Ralph loop (sometimes called the Ralph Wiggum technique) is one of “simpler” practitioner version of long-running agents, popularized by Geoffrey Huntley and Ryan Carson. The reference implementation is literally a bash script that loops:

Pick the next unfinished task from a list (prd.json or equivalent).
Build a prompt with the task, the relevant context, and any persistent notes.
Call the agent.
Run tests or other checks.
Append what happened to progress.txt.
Update the task list (done, failed, blocked).
Go back to step 1.

The reason it works is the same reason any of the harnesses below work: State lives outside the agent’s context. prd.json is the plan, progress.txt is the lab notes, and AGENTS.md is the rolling rulebook. The agent itself is amnesiac, but the filesystem isn’t. Each iteration starts fresh and reads enough state from disk to keep going. Carson’s Compound Product extends the idea by chaining multiple loops (an analysis loop that reads daily reports, a planning loop that emits a PRD, an execution loop that writes the code), which is roughly the open source version of the planner-generator-evaluator triad Anthropic landed on independently.

I went deeper on all of this in “Self-Improving Coding Agents”: task list structure, progress files, QA gates, monitoring, the failure modes you’ll actually hit. The short version is that you can build a working long-running agent in an evening with a bash script and a JSON file. Most of what Google and Anthropic have productized is the work of making this pattern recoverable, secure, and observable at scale.

The big-lab stories below are different ways of paying for that production-readiness.

Anthropic: Harnesses, then the brain/hands/session split

Anthropic has been the most public about the engineering. Two posts are worth reading end to end.

The first is “Effective Harnesses for Long-Running Agents,” which lays out a two-agent harness for autonomous full stack development. An initializer agent runs once at the start of a project to set up the environment, expand the prompt into a structured feature-list.json, and write an init.sh that future sessions will run on boot. A coding agent is then woken up over and over, each session asked to make incremental progress on one feature, run tests, leave a claude-progress.txt note, and commit. A test ratchet (“it is unacceptable to remove or edit tests because this could lead to missing or buggy functionality”) sits in the prompt to stop the very common failure of an agent deleting failing tests to “make them pass.” InfoQ’s writeup extends this into a planner, generator, and evaluator triad, on the same logic that separating generation from evaluation matters because models grade their own work too generously.

The second is “Scaling Managed Agents: Decoupling the Brain from the Hands,” the architectural post behind Claude Managed Agents (Anthropic’s hosted runtime, launched in early April). The argument is that an agent has three components that should be independently replaceable. The Brain is the model and the harness loop that calls it. The Hands are sandboxed, ephemeral execution environments where tools actually run. The Session is an append-only event log of every thought, tool call, and observation.

This sounds abstract, but it isn’t. Here’s Anthropic’s framing: “Every component in a harness encodes an assumption about what the model can’t do on its own.” When you couple them, an assumption that goes stale (e.g., the model used to need an explicit planner and now plans natively) means the whole system has to change at once. When you decouple them, the harness becomes stateless, sandboxes become cattle, not pets, and a brain crash doesn’t lose the run. A fresh container calls wake(sessionId) and reconstitutes the state from the log. They reported time-to-first-token dropped ~60% at p50 and over 90% at p95 just from being able to start inference before the sandbox is ready.

The session-as-event-log idea is the part most teams underappreciate. It is what makes a long-running agent recoverable. Without it, a container failure is a session failure and you’re debugging into a stale snapshot. With it, the agent’s memory is a queryable artifact that lives outside whatever process happens to be running at the moment.

For the scientific computing crowd, Anthropic’s “long-running Claude” post reduces all of this to a simpler stack: CLAUDE.md as a living plan the agent edits as it learns, CHANGELOG.md as portable lab notes, tmux plus SLURM plus git as the execution and coordination layer, and the Ralph loop, a for loop that kicks the agent back into context whenever it claims completion and asks if it’s really done. Their flagship case study is a Boltzmann solver Claude Opus 4.6 built over a few days that reached subpercent agreement with a reference CLASS implementation. Months to years of researcher time, compressed.

Same patterns across all three posts: an explicit plan file, an explicit progress file, structured handoffs between sessions, separate generation from evaluation, and a loop that refuses to let the agent stop early.

Cursor: Planners, workers, judges

Cursor’s “Scaling Long-Running Autonomous Coding” is the other essential read this year. They walked into walls that Anthropic mostly papered over.

Their first attempt was a flat coordination model: equal-status agents writing to shared files with locks. It became a bottleneck and made the agents risk averse, churning rather than committing. Their second attempt swapped locks for optimistic concurrency control, which removed the bottleneck but didn’t fix the coordination problem. The third design is what’s running in production now and what they describe as solving most of the problem:

Planners continuously explore the codebase and emit tasks. They can recursively spawn subplanners.
Workers are focused executors. They don’t coordinate with each other and they don’t worry about the big picture.
Judges decide when an iteration is finished and when to restart.

Two things stand out from the post. One: “A surprising amount of the system’s behavior comes down to how we prompt the agents” more than the harness or the model. Two: Different models slot into different roles. Their reported finding is that a GPT model was better than Opus for extended autonomous work specifically because Opus tended to stop early and take shortcuts. Same task, different role, different model. The matching is becoming part of the design surface.

This pairs with Composer 2 (their proprietary frontier coding model that ships in Cursor 3) and their background cloud agents: long-running tasks that run on Anysphere’s cloud infrastructure rather than your laptop. Eight-hour refactors and codebase-wide migrations survive a closed lid. You can start a task locally, hit run in cloud when you realize it’ll take 30 minutes, and reattach later from your phone. Each agent runs in an isolated Git worktree and merges back via PR. The handoff between local and remote is the part most teams haven’t figured out yet, and Cursor’s bet is that it has to be its own product surface.

The shape ends up close to Anthropic’s: Roles are split, sessions are durable, judges sit beside the worker, and a long task runs in a cloud sandbox with Git as the coordination substrate.

Google: Long-running agents on the Agent Platform

Google’s announcement at Cloud Next ’26 folded Vertex AI into the Gemini Enterprise Agent Platform and turned long-running agents into a named product, with named SLAs.

The pieces that matter for this post:

Agent Runtime supports agents that “run autonomously for days at a time” with sub-second cold starts and on-demand sandbox provisioning. The launch post’s example use case is a sales prospecting sequence that takes a week to play out, which is roughly the right shape for it.
Agent Sessions persist conversation and event history. You can pin them to a custom session ID that maps to your own CRM or DB record, so the agent’s state lives next to the business state instead of in a separate AI silo.
Agent Memory Bank is the persistent long-term memory layer, generally available as of Next ’26. It curates memories from sessions, scopes them to a user identity, and exposes a search API so the next agent invocation can pull what’s relevant. Payhawk reported that auto-submitting expenses through a Memory Bank-backed agent cut submission time by over 50%.
Agent Sandbox handles hardened code execution.
Agent-to-Agent Orchestration, Agent Registry, Agent Identity, Agent Gateway, Agent Observability, and Agent Simulation cover basically every operational concern you’d otherwise build by hand for a production fleet, including the cryptographic-identity-and-audit-log story enterprises actually need to ship.

Architecturally this is the same brain/hands/session split Anthropic described, just productized at platform scale and bundled with ADK (the code-first dev kit) and Agent Studio (the visual one). If you’re building inside Google Cloud, you don’t have to design a session log or a memory store from scratch anymore. You wire an ADK agent into Memory Bank and Sessions, deploy onto Agent Runtime, and the persistence question is answered.

Notice how much this looks like the pattern Anthropic and Cursor describe, just unbundled into named services with SLAs. Three years ago you’d have built all of this yourself. Now you pick which version of “decoupled brain, hands, and session” you want to rent.

Five patterns for long-running agents in production

Shubham Saboo and I wrote up five design patterns we’ve seen separate working long-running agents from demos. They aren’t Google-specific, but they map cleanly onto the primitives Agent Runtime now exposes, so it’s worth walking through them here in shortened form.

Checkpoint-and-resume. The most common multiday failure is context loss. An agent processes 200 documents over four hours, hits an error on document 201, and without a checkpoint you start from scratch. Treat the agent like a long-running server process: write intermediate state to disk, checkpoint every N units of work, recover from failures. The Agent Runtime sandbox gives you a persistent filesystem, but choosing the right checkpoint granularity (not every step, not only the end) is on you.

Delegated approval (human-in-the-loop). Most “human-in-the-loop” implementations are: serialize state to JSON, fire a webhook, hope someone responds. The state goes stale, the notification gets buried, the agent re-deserializes into a slightly different world. Long-running runtimes let the agent pause in place with full execution state intact: reasoning chain, working memory, tool history, pending action. Hours of human time pass, the agent consumes zero compute, and it resumes with subsecond latency. Mission Control is Google’s inbox for this. The pattern works regardless of vendor.

Memory-layered context. A seven-day agent needs more than session state. Memory Bank handles long-term curated memory, Memory Profiles add low-latency lookups, and the failure mode you’ll hit in production is memory drift: The agent learns a procedural shortcut from a few atypical interactions and starts applying it broadly. Govern memory like you govern microservices. Agent Identity controls who can read and write which banks. Agent Registry tracks which version of which agent is running. Agent Gateway enforces policy on the wire. The auditing question stops being “What are my agents doing?” and becomes “What are my agents remembering, and how is that changing their behavior?”

Ambient processing. Not every long-running agent talks to a human. Some sit on a Pub/Sub stream or a BigQuery table and act on events as they arrive: content moderation, anomaly detection, inbox triage. The architectural decision worth making early is to not hardcode policy into the agent. Define it in the Gateway and the fleet picks up policy changes without redeploys. Ambient agents run unsupervised for long stretches, and the only sane way to update a hundred of them is to update the policy layer once.

Fleet orchestration. In real systems, you rarely have one agent. A coordinator delegates subtasks to specialists (a Lead Researcher Agent, a Scoring Agent, an Outreach Agent), each running independently for different durations. Each specialist gets its own Identity (so the Outreach Agent can’t read financial data meant for Scoring), its own policy enforcement, its own Registry entry. This is the same coordinator/worker shape distributed systems have used for decades. What’s new is that ADK handles it declaratively with graph-based workflows, and a bad deployment in one specialist doesn’t cascade to the others.

The patterns compose. A compliance system might use checkpointing for document processing, delegated approval for review gates, memory layering for cross-session knowledge, and fleet orchestration to coordinate the specialists. The opening question is always the same: What’s the longest uninterrupted unit of work your agent needs to perform? Minutes, and you don’t need long-running agents. Hours or days, and these patterns are where to start. The full write-up with code samples covers each pattern in depth.

So how do you actually build one today?

This is the practical question, and it has a different answer depending on what you’re building.

You’re a developer who wants long-running coding work on your own repo. Just use Claude Code (or Antigravity, Cursor, or Codex). The harness is already there. Treat your AGENTS.md like a pilot’s checklist: short, every line earned by a real failure. Add hooks for typecheck and lint that surface failures back to the agent. Write a plan file before the agent starts. Use the Ralph loop when the agent claims it’s done and you don’t believe it. For multihour or overnight jobs, run in a worktree so a closed laptop doesn’t kill the run, and have it commit progress every meaningful unit of work. This is the path most people should take, and it’s where the most leverage is right now.

You’re building a hosted agent product. Don’t build the runtime. Pick a managed one. The three real options today: Google’s Agent Platform (Agent Engine + Memory Bank + Sessions), Claude Managed Agents, or roll something on top of ADK, the Claude Agent SDK, or Codex SDK and host it yourself. The trade-off is the usual one. Managed gets you the brain/hands/session split, observability, identity, and an audit trail out of the box. Self-hosted gets you control and the ability to use weird models for weird roles (Cursor’s pattern). For most teams, the right starting point is a managed runtime plus your own ADK or SDK code for the actual loop.

You’re doing something autonomous and operational (monitoring, research, ops). Memory Bank-style persistence is what you want, and it’s the part that doesn’t exist in Claude Code. ADK + Memory Bank + Cloud Run + Cloud Scheduler is the cleanest stack I’ve seen for “agent runs every N hours, accumulates state, alerts on a threshold.” This is also where Cursor’s planner/worker/judge split starts to matter more than it does for IDE coding, because the work is genuinely parallel and the failure modes are different.

A few things matter regardless of which path you take.

Write down the done condition before the agent starts. This is the single highest-leverage move for long runs. The Anthropic harness post calls it the feature list; Cursor calls it the planner’s task spec. Either way, it’s an external file with explicit, testable completion criteria, and it exists so the agent can’t quietly redefine done midrun.

Separate the evaluator from the generator. Self-grading is the failure mode. A planner/worker/judge pipeline, or a generator/evaluator pair, is a real architectural pattern, not a stylistic preference. Even if it’s the same model in different roles with different prompts.

Invest in the session log, not just the prompt. The append-only event log is what makes the agent recoverable, debuggable, and auditable. If you can’t reconstruct what the agent did in the last 24 hours from durable storage, what you have is a long-running shell script that happens to call an LLM, not a long-running agent.

Treat compaction and context resets as first class. Anthropic is explicit that summarization-as-compaction wasn’t enough for very long jobs; they had to do full context resets where the harness tears the session down and rebuilds it from a structured handoff file. It is essentially how humans onboard a new engineer.

There are some real limitations right now

A few things are still genuinely unsolved.

Cost. A 24-hour run with a frontier model and a few tools is not cheap. Without budgets, circuit breakers, and a hard cap on tool spend, an agent can quietly burn through a week’s API budget in an afternoon. This is solvable, but it’s an explicit step you have to take.

Security. A long-running agent with API keys, cloud access, and the ability to run shell commands has a much larger attack surface than a chat session. The brain/hands separation pattern matters here too: Credentials should be unreachable from the sandbox where model-generated code runs, which is one of the benefits Anthropic calls out for Managed Agents.

Alignment drift. Over many context windows, agents drift. The original goal gets summarized, then resummarized, then loses fidelity. This is the part hooks and judges exist to defend against. It is also the most common reason “the agent went off and did something I didn’t ask for.”

Verification. Auditing 24 hours of autonomous activity is a real human-time problem. Observability and structured artifacts (PRs, commits, briefings, test runs) are how you make this tractable. Without them, you’re scrolling logs and you’ll miss what matters.

The human role. This is the one I keep coming back to. Defining work crisply enough that an agent can run for a day on it is harder than doing the work yourself. The skill that’s appreciating in value isn’t writing code. It’s writing specs that survive contact with an autonomous executor.

Where this is going

Google, Anthropic, and Cursor have converged on roughly the same shape. Separate the model loop from the execution sandbox from the durable session log. Split planning from generation from evaluation. Bake in compaction, hooks, and context resets. Expose memory as a managed service that any agent invocation can query.

Surface area is what differs. Google’s Agent Platform is the enterprise-stack version, with the identity and audit trail story baked in. The patterns underneath are the same. Claude Managed Agents is “Anthropic’s harness, hosted.” Cursor’s background agents are “long-running coding, pulled out of the IDE and into the cloud.”

The harder problems for the next year aren’t in any of those layers individually. They’re in the coordination above them. Many long-running agents on a shared codebase. Agents that read their own traces and patch their own harnesses. Harnesses that assemble tools and context just in time for a task instead of being preconfigured at startup. That’s where the agent stops looking like a smarter chat window and starts looking like a colleague who’s been on the project longer than you have.

The model is still load-bearing. But the gap between a chat window and an agent you can leave running overnight is mostly in the state, sessions, and structured handoffs wrapped around it. That’s where I’d spend my learning time right now.

The AI Agents Stack (2026 Edition)

Paolo Perrone — Mon, 08 Jun 2026 10:56:59 +0000

The following article originally appeared on Paolo Perrone’s The AI Engineer Substack and is being reposted here with the author’s permission.

Your team picks LangGraph for a customer support chatbot. Three weeks in, you’ve got 14 nodes in a state graph, a custom checkpointer writing to Redis, and retry logic for tool calls that fail once a week. The agent answers refund questions. It calls one API. A 50-line script on the OpenAI SDK with two MCP servers would have done the same thing. But nobody mapped which layers the problem actually needed.

In November 2024, Letta published an AI agents stack diagram that became the default reference for half the engineering teams I talk to. If you’ve seen a “layers of an agent” visual on LinkedIn or pinned in a Slack channel, it probably traces back to that article.

That diagram is 14 months old now, and a lot has changed since. MCP didn’t exist yet. Memory was still treated as a subset of your vector database. Nobody was shipping provider-native agent SDKs. Eval wasn’t even on the map. The stack has six layers in 2026, and at least three of them didn’t exist as distinct categories when Letta drew the original.

So we drew it from scratch. This is the 2026 version.

TL;DR

That’s the starting stack. Add complexity when something specific breaks, not before.

What are we even mapping?

Before the stack, there was a loop. In “What Is an AI Agent?,” we defined an agent as the think-act-observe cycle: The model reasons about a task, takes an action (calls a tool, writes to memory), observes the result, and loops until the task is done. That loop is the atomic unit. Everything in this issue is infrastructure that makes that loop work reliably, at scale, in production.

The agent stack is not the LLM stack. A chatbot needs inference and maybe RAG. An agent needs state management across multistep execution, tool access governed by protocols, memory that persists across sessions, autonomous reasoning loops, and guardrails that constrain behavior in real time. That’s a fundamentally different set of infrastructure problems.

We’re mapping the six layers between your LLM and a production agent. We’re not covering training infrastructure, data pipelines, or model fine-tuning. Those are adjacent stacks. We covered RAG in depth in Issue #5. Today we’re zooming out to show where RAG fits in the bigger picture.

Three things redrew the map between 2024 and 2026. MCP standardized tool connectivity, and the entire tools layer is new because of it. Reasoning models changed what agents can do autonomously, with single-call agents replacing some multistep chains. And memory became a first-class architectural primitive, not an afterthought bolted onto a vector database.

How to evaluate each layer

When choosing tools at each layer, ask three questions. How much state do you need to manage? A stateless tool caller and a multi-session agent that learns over time are different engineering problems, and the layers where state management is hardest (memory, frameworks) are where most teams get stuck. How much vendor lock-in can you tolerate? MCP is an open standard, provider SDKs are not, and every tool choice either increases or decreases how painful your next migration will be. And how hard is it to go from demo to production? Some layers (model serving) have almost no gap, while others (eval, guardrails) have a massive one. The layer where you feel that gap most is the one to invest in first.

We take each layer from the bottom up, starting with the most stable and ending with the least mature.

Layer 1: Models and inference

How you run the model that powers your agent: call an API, use a managed open weight provider, or self-host.

The inference layer changed more in tone than in substance. Reasoning models like o1, o3, DeepSeek R1, and Claude with extended thinking shifted what agents can plan and execute. Agents that previously needed multistep chains can now solve problems in a single reasoning call. Open weight models like Llama 3.3, DeepSeek V3, and Qwen 2.5 closed the quality gap dramatically, so “always use the biggest closed model” is no longer default advice. The emerging pattern is to prototype on closed source and deploy on open weight.

The honest take: This layer is commoditizing. Model differences matter less each quarter. The real decision is the cost and latency trade-off, not which model is “smartest.”

On the evaluation side, API calls are stateless. Send a request, get a response. Nothing to manage. Lock-in risk runs high for closed APIs because each model reasons differently, so switching providers means retuning prompts, adjusting for different failure modes, and retesting your eval suite. It’s low for open weight, where you can swap the model and keep the infra. The prototype-to-production gap is the smallest of any layer. Your demo API call is the same as your production API call.

Self-host when your agent call volume makes API pricing untenable or when you need sub-100ms latency that API round-trips can’t deliver.

Layer 2: Protocols and tools

How your agent calls external tools and APIs: through MCP servers, browser automation, or agent-to-agent protocols.

This layer didn’t exist as a distinct category in 2024. Every framework had its own JSON schema for tool definitions. Now MCP is the standard, with 97M monthly SDK downloads, adoption by OpenAI, Google, and Microsoft, and a donation to the Linux Foundation.

Browser Use exploded in parallel, hitting 78K GitHub stars in under a year. Nobody was shipping browser agents in production in 2024. And agents can now talk to other agents. IBM launched ACP, and Google launched A2A. Neither is standard yet, but the problem they solve (agents coordinating with other agents) is real and growing.

Security is the open problem. Endor Labs analyzed 2,614 MCP servers and found 82% prone to path traversal and 67% to code injection.

The honest take: The protocol debate is over. MCP won. The only question left is how you lock down your MCP servers before someone exploits them.

State management is nonexistent here. Your agent calls a tool, gets a response, done. No session, no memory between calls. Lock-in risk is low because MCP is an open standard, so if you build MCP servers, any MCP-compatible agent can use them. The prototype-to-production gap is medium. Your demo MCP server works until someone sends a malicious tool description. Security and governance are the gap.

MCP standardized how agents use tools. It says nothing about how agents talk to each other. ACP and A2A are trying to solve that, but neither has reached critical mass. If you need multi-agent coordination today, you’re building it yourself at the framework layer. We covered MCP in depth in Issue #4.

Layer 3: Memory and knowledge

How your agent stores and retrieves what it knows: in-context state, vector search, or persistent memory across sessions.

All three tiers feed into the same place: The context window your agent sees on every call.

In 2024, memory meant “pick a vector database and do RAG.” In 2026, memory is a first-class architectural primitive with three distinct tiers. Context windows got massive. Gemini hit 1M+ tokens, Claude 200K. Bigger windows didn’t kill the need for memory. They changed the trade-off: What do you stuff in-context versus what do you retrieve on demand?

“Context engineering” replaced “prompt engineering” as the core discipline. Instead of writing a better prompt, you architect what information the agent sees on every call. Memory blocks appeared as named, structured fields in the context window that the agent can read and overwrite every turn. Instead of dumping everything into the system prompt, the agent manages its own state: what to keep, what to update, what to drop.

On the infrastructure side, pgvector became the default for teams that don’t need a dedicated vector database. It’s just Postgres with an extension. GraphRAG emerged as a second retrieval option: follow relationships between entities instead of matching embeddings, with Neo4j leading this space. Sleep-time compute, where agents process information during idle time, is research stage but signals where tier 3 is heading.

The honest take: Most teams overcomplicate memory. Start with conversation history in Postgres and a structured system prompt. Add vector search when your history exceeds context limits. Add agentic memory management only when your agent needs to learn across sessions.

This IS the state layer. You’re deciding what your agent remembers, how it retrieves it, and when it forgets. Highest complexity in the stack. Lock-in risk is medium. pgvector is portable because it’s just Postgres, while specialized tools like Mem0 or Zep are harder to migrate away from. The prototype-to-production gap is large. Demo memory works because context windows are big enough. Production memory breaks when conversations get long and your agent starts forgetting the important parts.

In-context memory breaks down when agents need to share memory across instances or maintain state across model provider switches. That’s where dedicated memory infrastructure like Letta, Zep, and Mem0 earns its keep.

Layer 4: Frameworks and SDKs

How you wire together the model calls, tool use, and control flow that make your agent work: a provider’s built-in toolkit (SDK), a graph-based framework like LangGraph, or raw code.

Every major AI lab now ships its own agent SDK. OpenAI has the Agents SDK (evolved from Swarm). Google released ADK. Microsoft has Semantic Kernel and AutoGen. Hugging Face built smolagents. Two years ago, LangChain was the only game. Now you pick between three camps: provider SDKs that are fast to start but locked to one model, graph-based frameworks like LangGraph that are portable but require more setup, or no framework at all. That choice didn’t exist in 2024.

LangGraph solidified as the graph-based orchestration leader with v1.0 released October 2025 and production deployments at Uber, JPMorgan, LinkedIn, and Klarna. LangChain agents are now built on LangGraph under the hood. Meanwhile, the “build it yourself” camp grew. Teams that tried LangChain in 2024 and fought the abstraction are now writing thin wrappers over provider APIs + MCP. No framework means full control. This works until your agent needs state management or complex branching.

A quick note on naming: “LangChain” and “LangGraph” are not the same thing. LangChain is the integration layer handling model connectors, tool calling, and prompt templates. LangGraph is the orchestration engine managing state, control flow, and graphs. Most production teams use both together, but LangGraph is where the agent logic lives.

The honest take: Most teams pick too much framework. If your agent calls a model and a few tools, you don’t need LangGraph. A provider SDK and a couple of tool calls will get you to production faster than any graph.

Provider SDKs manage state for you. LangGraph makes you define every state transition explicitly. Build-it-yourself means you roll your own. Lock-in risk is the highest in the stack. Your orchestration code doesn’t port. A LangGraph agent rewritten for CrewAI is a new codebase. Provider SDKs are worse because you’re locked to one model too. The prototype-to-production gap is large. Demo works because nothing goes wrong. Production means handling tool failures, retries, timeouts, and humans who need to approve before the agent acts.

The framework you pick determines your migration cost. Provider SDKs are fastest to start but lock you to one model. LangGraph is portable but complex. Building your own gives you full control until your agent outgrows your wrapper. MCP is the one layer that transfers across all three camps.

Layer 5: Eval and observability

How you measure whether your agent is doing its job: tracing runs, scoring outputs, and catching regressions before users do.

This layer barely existed in 2024. Now it’s the gap. LangChain’s State of Agent Engineering survey found 89% of teams with production agents have implemented observability, but only 52% have evals. That 37-point gap is where production quality dies.

“Evaluation as infrastructure” is converging on three tiers: fast checks on every PR (Did the agent call the right tools?), nightly regression suites that use an LLM to judge output quality, and continuous production monitoring that alerts when agent performance drifts. New agent-specific benchmarks have emerged too, including Context-Bench for memory management, Recovery-Bench for error recovery, and Terminal-Bench for coding agents.

The honest take: Most teams skip eval until something breaks in production. By then they’re debugging blind. The teams that don’t have this problem built evals before they deployed.

State management matters here because your agent runs 12 steps, step 3 picked the wrong tool, and steps 4–12 were doomed from there. If your eval only checks the final output, you’ll never know why. Lock-in risk is moderate. Most tools export OpenTelemetry traces, so switching observability providers is doable, but switching eval frameworks means rebuilding your test suites. The prototype-to-production gap is the biggest of any layer. Most prototypes have zero eval. You don’t feel the pain until production users find the failures for you.

Current eval tools are strongest for single-turn and tool-calling evaluation. Multi-agent evaluation, long-horizon task assessment, and evaluating agents that learn over time are all unsolved problems. If your agent does any of those, you’ll need custom eval infrastructure beyond what the platforms offer today.

Layer 6: Guardrails and safety

How you stop your agent from doing things it shouldn’t: filtering inputs, authorizing tool calls, and validating outputs.

Agent guardrails became a separate discipline from LLM guardrails. In 2024, guardrails meant input/output filters on a model. In 2026, your agent calls tools, spends money, and takes actions. Guardrails now means authorizing tool calls, enforcing rate limits, and validating what the agent actually did.

The “guardrails before action” pattern emerged from teams that learned the hard way. They now enforce authorization at the tool execution layer, not the output layer. By the time you filter the response, the agent already sent the email. OWASP published the MCP Top 10 (beta), which is the first real security checklist for tool-connected agents. Deployment is still DIY. LangGraph Cloud and Bedrock Agents exist, but most production teams are still deploying with FastAPI and their own infra. This layer is where you’ll spend the most unplanned engineering time.

The honest take: This is the least mature layer in the stack. No dominant framework, no established patterns. You’re writing policy code from scratch.

Guardrails need to know what the agent is doing right now to decide what it shouldn’t do next. That means tracking agent state in real time. Lock-in risk is low because most guardrails are custom policy code you write yourself. NeMo Guardrails is the closest thing to a framework, but you’ll still write most rules from scratch. The prototype-to-production gap is effectively infinite. Your demo has no guardrails because nobody’s trying to break it. Production will.

Current guardrails tools focus on single-agent systems. If you’re running multi-agent workflows where agents delegate to each other, guardrail propagation across agent boundaries is an unsolved problem. You’ll need custom authorization logic.

What are you building?

This is the decision that cuts through the framework confusion. The agent type determines which layers you invest in and which tools to pick at each one.

A stateless tool caller answers questions from a knowledge base, looks up an order, or checks inventory. You need a provider SDK, MCP, and Postgres. No framework, no vector database. This is a weekend project.

A multistep workflow processes a refund end to end, reviews a PR across five files, or triages and routes support tickets. Steps depend on each other, things fail in the middle, and humans need to approve before the agent acts. You need LangGraph, MCP, and eval. Build evals before you deploy because these agents break silently.

An agent that learns remembers your preferences across sessions, gets better at your codebase over time, or tracks project context across weeks. You need a memory-first architecture, a vector DB, and eval. Orchestration is the easy part. The hard part is deciding what to remember, what gets dropped, and how you stop old context from polluting new answers.

A multi-agent system has agents that delegate to other agents, split a research task across specialists, or run parallel workstreams. You need the full stack. Two agents passing context to each other is already hard to debug. Five is impossible without trace-level evals on every handoff. Build eval infrastructure before you build the second agent.

Coding agents: All 6 layers in action

Coding agents like Cursor, Claude Code, Codex, and Windsurf are the most proven application of the AI agents stack. All six layers, working together.

At the inference layer, these tools serve hundreds of millions of daily requests. Cursor routes between Claude, GPT-4, and its own fine-tuned models depending on the task. At the protocols layer, MCP servers connect to editors, terminals, filesystems, and Git, which is how the agent reads your code and runs commands. The memory layer uses codebase-aware retrieval with reranking. The agent doesn’t read your whole repo. It retrieves the files that matter for this specific edit.

At the framework layer, these are custom orchestration systems with RL loops. Not LangGraph, not a provider SDK. Purpose-built control flow for code generation, review, and iteration. At the eval layer, Cursor retrains its acceptance-rate model every 90 minutes based on whether users accept or reject suggestions. That’s eval running in production, continuously. And at the guardrails layer, sandboxed execution prevents runaway agents. The agent can write code and run it, but inside a container that limits what it can touch.

The AI agent stack cheat sheet

Every layer scored on the three questions from the evaluation framework: How much state do you need to manage? How much vendor lock-in can you tolerate? And how hard is it to go from demo to production?

The bigger picture

Most teams are building like it’s still 2024. They pick LangGraph before they know if they need state. They add a vector database before they’ve outgrown Postgres. They design multi-agent architectures before they’ve shipped one agent that works. The decision flowchart above exists because a tool-calling chatbot and a multi-agent research system share almost no infrastructure. Treat them the same and you’ll overbuild the first and underbuild the second.

The teams that got past this run evals on every deploy, not once a quarter. Their guardrails sit at the tool call layer, not the output layer. Their memory architecture was designed, not inherited from whatever the framework defaulted to. Most teams ship the opposite: no evals, output-only filtering, and a system prompt that grows until the context window chokes. The gap isn’t talent or budget. It’s knowing which layers matter for your specific agent instead of half-building all six.

The stack is going to collapse. Provider SDKs are already absorbing memory, tool calling, and basic eval into a single API. By early 2027, most teams won’t build each layer separately. They’ll get an increasingly opinionated stack from their model provider and that will be fine for 80% of use cases. The other 20%, agents at scale where the defaults break, will still build custom at every layer. But even then, when something fails in production, you need to know which layer failed. That’s what this article is for.

Sources

“The AI Agents Stack,” Letta, November 2024.
“Donating the Model Context Protocol and Establishing the Agentic AI Foundation,” Anthropic, December 2025.
“120+ Agentic AI Tools Mapped Across 11 Categories [2026],” StackOne, February 2026.
Henrik Plate and Darren Meyer, Dependency Management Report, Endor Labs, January 2026.
Jason Liu, Context Engineering Series: Building Better Agentic RAG Systems, August 2025.
“LangChain and LangGraph Agent Frameworks Reach v1.0 Milestones,” LangChain, October 2025.
State of Agent Engineering, LangChain, December 2025.
Yunfei Bai, Allie Colin, Kashif Imran, and Winnie Xiong, “Evaluating AI Agents: Real-World Lessons from Building Agentic Systems at Amazon,” Amazon, February 2026.
OWASP MCP Top 10, OWASP.

This Week in AI: Production Viability

Michelle Smith — Fri, 05 Jun 2026 15:55:20 +0000

On this week’s episode, host and the founder of AI advisory firm Intelligence Briefing Andreas Welsch brought together Maya Mikhailov, cofounder and CEO of Savvi AI, and Doug Shannon, generative AI and intelligent automation leader, to cover a handful of interconnected topics that practitioners are navigating right now: OpenAI’s push into personal finance, the role of metacognition in AI-assisted technical work, the growing backlash against token-based productivity metrics, and the new role of forward-deployed engineer. Together, these stories sketch a picture of an industry that’s good at generating output but is still figuring out what output is worth.

Why OpenAI wants your bank account data

When OpenAI announced it was analyzing users’ transaction data in partnership with financial institutions, the coverage focused on the consumer benefit: a smarter way to track spending, comparable to what Credit Karma or Mint offered but with a more conversational interface.

But that’s not all the company’s interested in, or even the main thing. Maya reframed the stakes: “What OpenAI wants to do is figure out consumer intent.” Being able to access users’ financial data is less about helping people manage their money and more about completing a profile the company can then monetize. OpenAI already builds a surprisingly accurate picture of users from their chat histories. Add transaction data and you get specifics that weren’t there before: what someone is saving for, what they’re anxious about, where their money is actually going. That’s a data asset worth a great deal to advertisers.

We’ve seen this pattern before, and as Andreas noted, companies have long held (and used) potentially invasive data to recommend products. The Target pregnancy prediction story is now more than a decade old, but it’s still being taught in business school, including by Andreas, precisely because it illustrates how behavioral data can be combined to infer things people haven’t explicitly disclosed—and spotlights the fine line between effective recommendations and those that feel too personalized, reminding consumers just how much information companies have on them. Companies’ profile-building capability hasn’t changed, but AI chat adds a new wrinkle, said Maya. A conversational interface makes disclosure feel natural, so the knowledge graph based on your chat history is very powerful. And these tools are also better positioned to share recommendations than traditional avenues. “By having this style that is agreeable, that is engaging,” Maya explained, “those recommendations are going to be a lot stickier than what a fragment of a sentence I type into a regular search engine.”

Metacognition as a professional skill

When you delegate thinking to a system that averages across a massive range of inputs to produce an answer, you need to know when that answer is good enough and when it isn’t.

“We’re essentially being averaged out,” Doug said. The model is doing many things behind the scenes to find a mean response. The human’s job is to ask questions about the questions, to push past the first answer, and to know whether their own judgment is still in the loop. That’s why Doug’s been pushing for a renewed interest in metacognition, or “thinking about thinking.” Offloading cognitive load that’s peripheral to your work is fine, Doug and Maya agreed. Offloading the reasoning that’s central to your job’s value—what Doug called cognitive surrender—is where organizations get into trouble.

The future advantage won’t come from access to AI. Everyone will have some kind of access to it. The advantage will come from knowing what to offload, what to question, and what should never leave human judgment. This is a skill-development question as much as a philosophical one. The people who’ll be most effective with AI tools aren’t the ones who use them most; they’re the ones who understand what to hand off and what to keep. That requires domain knowledge, judgment about when a model’s answer is plausible but wrong, and enough fluency with how these systems work to recognize when you’re being handed an average instead of an answer.

Tokenmaxxing and the wrong incentive

The tokenmaxxing debate seems to be coming to a head. Amazon abolished its AI productivity leaderboard after employees started gaming it by writing inefficient code to rack up token usage. And one company reportedly burned through $500M in Anthropic tokens in a single month after failing to set limits. The companies encouraging tokenmaxxing are incentivizing the wrong metrics, Maya argued. It’s like determining which bakery is best by the amount of flour it uses. The right question is “Are we making a quality product?”

Andreas shared his own vibe coding experience as an example of how token consumption and technical debt compound in practice. A developer starts with a modest plan and burns through their quota running agents in half an hour. They upgrade to a higher tier, paying five times more, but now the sunk-cost logic kicks in. As Andreas pointed out, now they feel like they “should also be getting five times more the value out of [their subscription],” so scope expands from a single tool into a unified business operating system. Three weeks later, the accumulated complexity has outpaced the ability to evaluate it: Repeated security audits keep surfacing new issues, each pass generating recommendations that require cybersecurity expertise most vibe coders don’t have. Here’s where Doug’s point about metacognition applies: The more a builder stays actively involved in understanding what the system is actually doing, the better their judgment about whether it is working. For less engaged users, the risk is accepting the output, shipping the debt, and discovering the consequences later.

Most of the misalignment originates in the gap between what executives expect from AI and what practitioners deal with day-to-day. Executives see a capability that could change the slope of productivity, Maya explained. Engineers and analysts live with the technical debt, the version control problems, and the regulatory constraints that don’t disappear because you have a better code completion tool. The leaderboard problem is a symptom of that disconnect.

GitHub’s recent shift from unlimited to usage-based pricing for Copilot is likely to realign these incentives faster than any internal policy change would. When more CFOs start seeing the actual bills, the leaderboards will all come down.

Doug identified a related problem emerging with the “cognitive surrender” to LLMs. When organizations encourage employees to pipe internal processes, proprietary logic, and institutional knowledge into foundation models without governance, they’re not just running up token bills. They’re giving away the operational knowledge that differentiates them. Process documentation, workflow logic, and institutional memory about why certain decisions were made are all forms of intellectual property, and once they’re encoded into a general-purpose model, the organization’s advantage from them diminishes.

Forward-deployed engineers aren’t enough on their own

Is the answer to these challenges to put a skilled engineer directly inside the customer environment to translate between what a model produces and what an organization actually needs? That’s the promise of the forward-deployed engineer (FDE) approach popularized by AI firms. Doug and Maya both had some criticisms of the model.

Maya’s objection was structural. Enterprise AI deployment isn’t a matter of adding capability on top of existing infrastructure. Organizations arrive with siloed data, legacy systems, and regulatory constraints that no forward-deployed engineer can resolve on technical skill alone. You can’t “just sprinkle some AI on it, and it’ll work just by a package of tokens,” she said. Engineers have to know the context behind why certain data can’t be used or why a particular model can’t be deployed in a regulated context. FDEs coming into an organization fresh don’t have this understanding and as a result may undo decisions that were made carefully and for reasons that aren’t written down anywhere obvious.

Doug’s concern was about communication. FDEs, in his experience, tend to arrive with strong technical instincts and limited organizational context. They get into the work quickly but struggle to communicate across the full stack of stakeholders involved. That’s why business analysts exist, to understand the customers’ problems and what the process actually is before engineers can address them. Skip that step and you get technically correct output that solves the wrong problem.

What both Maya and Doug were underscoring is that AI deployment at the enterprise level is fundamentally a context problem. The models are capable. What’s hard is knowing which capability to apply, where to do it, and with what constraints in place. That knowledge doesn’t live in the model; it lives in the people who’ve worked inside the organization long enough to know why things are the way they are.

The measurement problem

All the topics in this episode circle back to the same question: What are we actually measuring, and what incentives are we setting in place with those measurements? Token counts and lines of code don’t always correlate to the outcomes companies want. You need human expertise and a contextual knowledge of the business to figure out what goals you want to achieve and what to measure to ensure you get there.

On next Monday’s episode of This Week in AI, RecoMind founder Miguel Fierro joins host Christina Stathopoulos to discuss responsible AI, multimodal content creation, and more on how LLMs are changing personalization and user understanding. Miguel will also lead a live demo that offers a glimpse of the next generation of recommendation experiences—register here.

We’ll continue to publish our takeaways here on Radar each Friday and share full episodes on YouTube, Spotify, Apple, or wherever you get your podcasts.

I Let an AI Agent Run 40 Experiments While I Slept

Vanchhit Khare — Fri, 05 Jun 2026 10:27:18 +0000

I set up an AI agent on a rented GPU, pointed it at a training script, and went to bed. By morning it had run 40 experiments, improved validation loss by 5.9%, and cut memory usage from 44 GB to 17 GB. It also spent four hours chasing a bug that a linter introduced behind its back. The agent never flagged it. I only found out because the numbers stopped improving and I started reading logs.

The setup was based on Andrej Karpathy’s autoresearch project: Give an agent one file it can edit (train.py), one metric to optimize (validation bits per byte), a fixed five-minute training budget per experiment, and Git for checkpointing. If an experiment beats the current best, keep the commit. If not, revert. Loop forever. Karpathy’s own run produced 700 experiments and 20 genuine improvements across 48 hours, an 11% speedup on already-optimized code. Shopify’s Tobi Lütke pointed the same pattern at Liquid, their templating engine, and got 53% faster rendering from 93 automated commits. The pattern clearly works. The question is what breaks when you run it yourself.

The first failure: Agents fixing agents

Before running autoresearch, I had a separate problem. I had 15 custom skills for Claude Code (think reusable prompt templates with tool access, structured inputs, and specific behaviors). Most of them were broken when dispatched as parallel background agents. Vague descriptions meant the system couldn’t figure out when to invoke them. Missing tool permissions caused silent failures. Duplicate scopes between similar skills created routing confusion.

So I used the same pattern: dispatch background agents in parallel, one per skill, each tasked with reading the skill definition, identifying problems, and rewriting it. 13 out of 15 came back improved. Descriptions got specific. Dead references to nonexistent files were removed. Tool permissions were added. Two skills were left untouched because the agents couldn’t find anything wrong with them. The whole batch took under an hour.

But here’s what I didn’t expect. Three of the “improved” skills had subtle regressions. One agent removed an AskUserQuestion gate that was there for a reason, because the gate’s purpose wasn’t documented and the agent read it as unnecessary friction. Another agent rewrote a skill description so precisely that it stopped triggering on the fuzzy, misspelled queries real users actually type. I caught these during manual review, but if I had trusted the parallel output without checking, three skills would have silently degraded in production.

The second failure: The linter in the loop

Then I started the training loop. The agent worked through hyperparameters methodically. It halved the batch size early (experiment 4), which turned out to be the single biggest win: more gradient steps in the same five-minute window. It reduced model depth from eight to seven layers, dropped weight decay from 0.2 to 0.05, and tuned the learning rate schedule. Each change was small. The cumulative effect was a 5.9% improvement in validation loss and a 60% reduction in peak GPU memory.

Out of 40 experiments, the agent kept nine, discarded 28, and crashed three. That keep/discard ratio felt about right. Most ideas don’t work. The point of automation isn’t to have better ideas. It’s to try bad ones faster.

Then the numbers plateaued. Experiments 30 through 38 produced nothing worth keeping. I started digging through the logs and found something I hadn’t expected: A linter running on the remote machine had been silently modifying a hyperparameter in train.py. It changed SCALAR_LR from 0.5 to 0.3 every time the agent saved the file. The agent would set the value, commit, and run the experiment, but the linter would alter the file between the save and the execution. The agent had no way to detect this because it checked Git diffs, not the runtime state of the file. Every experiment after a certain point was running with a learning rate the agent never chose.

I lost roughly four hours of compute to this. The agent kept going, proposing new ideas, running experiments, logging results. From its perspective nothing was wrong. The experiments ran, produced numbers, and the numbers were plausible. There was no crash, no error, no alert.

Why this matters beyond my GPU bill

Gartner predicts over 40% of agentic AI projects will be canceled by the end of 2027, citing escalating costs and inadequate risk controls as the primary drivers. My overnight session was a toy example: a single GPU, a small model, and a low-stakes experiment. But the failure pattern scales. An agent that can’t detect when its inputs are being modified between decisions will make the same class of error whether it’s tuning hyperparameters or managing a production pipeline.

The autoresearch constraints are smart: one file, one metric, and Git for state. But they assume the environment is stable. Nobody checks whether something outside the loop is modifying the file between commits. The agent optimizes within its sandbox, and the sandbox has a hole in the wall that nobody thought to look for.

Anyone who has run distributed systems recognizes this. When the linter changed that hyperparameter, it was the equivalent of someone editing a database record between a read and a write. We solved that problem years ago with compare-and-swap, optimistic locking, checksums. We just haven’t brought any of it to autonomous AI workflows. The SkyPilot team recently scaled autoresearch to 16 GPUs and 910 experiments. At that scale, an undetected environment mutation doesn’t cost you four hours. It costs you a cluster.

Next time I run autoresearch, I’ll add a file integrity check before every experiment. It’s three lines of code, but it would have saved me four hours and produced a better final result. The agent did its job. The environment didn’t.

The Tidy House

Tim O’Reilly — Thu, 04 Jun 2026 16:25:11 +0000

DJ Patil has spent the past several months on a listening tour. Wherever he travels, he finds a local university, pings faculty and students and anyone else who wants to show up, and runs an AMA. He’s heard from grad students who can’t get callbacks, hospital administrators dealing with federal policy changes that land like a change in the laws of physics, and executives who can’t forecast their AI spending past six months. He’s trying to synthesize all of it and help reframe the wider conversation.

DJ co-coined the term “data scientist,” served as America’s first chief data scientist under President Obama, and was chief scientist at LinkedIn. He’s a longtime O’Reilly author, going back to Building Data Science Teams and Ethics and Data Science, and he’s on the founding team at Devoted Health, where he’s spent the past decade building the kind of data infrastructure most organizations are still struggling to put in place. He calls it “the tidy house.” He sat down with me to talk about “the broken promise” in the job market that is driving AI sentiment, and why weak data infrastructure is a big part of the gap between what AI can do and what most institutions can actually absorb.

The broken promise

What DJ keeps hearing on his tour is anger and angst. One word that keeps coming up is “terrified.” Workers are worried about layoffs. Meanwhile, students, including those from top-tier universities like MIT, Carnegie Mellon, and UC Berkeley, have been applying to 300+ internships and getting fewer than 10 callbacks. Many had zero offers going into the summer. And the industry’s response has been to tell them to learn more AI and burn more tokens. What it comes down to, DJ explained, is “effectively a broken promise”:

We said, “Go to college, get these things, you’re going to get an internship, you’re going to get job training, you’re going to pay off your student loans, and then you’re going to have all the other things that are part of that social contract.”

What the students are feeling for the first time [is]. . .“Wait, if I can’t get this internship, . . .I’m fundamentally off trajectory from getting this job.” And it doesn’t have to be a technical person. It could be someone that is in marketing. It could be someone that’s in the liberal arts. It could be a researcher. . . .There are plenty of students that I have talked to who are supposed to be going to a doctoral PhD program or a medical school or something like that. The slots aren’t there because of the overall budget impacts. And so whether you call it AI impact or economic reframing, the thing is broken.

This is where both DJ and I have been trying to build a counter narrative. The story coming from the AI labs is destructive: “We’re going to put all of you out of work, and we’ll figure out the rest once the intelligence explosion arrives.” That’s bad PR for AI, but it’s also magical thinking. An economy is a circulatory system. You can’t put your customers out of work and at the same time expect that the economy will hum along as usual. A catastrophic recession could easily interrupt the funding that keeps AI on its growth path and the concentration of value that they assume will fund universal basic income and an expanded safety net.

That’s why I’m a fan of mechanism design: start from the outcome you want, then figure out the rules of the game that produces it. Right now, they’ve designed a game that concentrates all the value in the hands of AI first movers. They could be designing a game that generates value throughout the economy. But they aren’t building affordances for that.

YouTube ContentID is a good example of mechanism design leading to economic value creation. When unauthorized music use by online video creators triggered a backlash from rights holders, YouTube replied to the takedown notices with a way for both the people who owned the music and the people who wanted to use it to get paid. A whole creator economy came out of that design choice. The labs have the same opportunity in front of them and mostly aren’t taking it.

DJ had one concrete mechanism in mind:

Imagine OpenAI and Anthropic and Microsoft. . .get together and [say], “If you’re building something for your local community, we’ll fully subsidize the token cost for some period of time.”. . .We’re talking about marginal token usage relatively on the spectrum of things, but the potential innovation and use of AI to help local communities could be astounding. You’re not putting anybody out of a job with that. . . .You’re filling the holes that already exist in the system.

The OpenAI Foundation just announced it will put $1 billion into public-benefit projects this year, including $250 million aimed at building economic futures. It’s a start. But it mostly seems designed to ameliorate the bad effects of AI rather than to forestall them by building a more inclusive AI future. If the labs start investing in the human-plus-AI economy rather than just studying the job losses, the payoff to local communities could be real.

A makerspace to bridge the internship gap

DJ’s plan is to build a bridge. He’s launching a program, basically a makerspace, for students who don’t have an internship this summer. Over two four-week sprints, an initial cohort will get mentors, speakers, and the space to explore whatever they’re interested in. It doesn’t have to be AI. Whether they’re doing investigative journalism, screenwriting, or building civic tech, participants will get some experience with current tools and produce a tangible asset they can use to prove what they know. As I told DJ in our conversation, I think he’s really on to something, and I’d love O’Reilly to be part of what he’s building.

There’s a kind of person who has always been at the center of the O’Reilly community and never waited for a job description. High school and college dropouts who started companies, built open source software packages, or otherwise took the future into their own hands. People who looked around, found something that needed doing, and did it. DJ is one of them. He’s a community college kid who learned from a good local library, from the books with the “funny animals” on the cover, and from open source. That path is still open. The early O’Reilly business came out of exactly this instinct. We were a tech-writing consulting shop, and when we ran out of paid work, we wrote manuals that didn’t exist yet but that we thought were needed. Later, when there were big conferences for every corporate technology and none for open source, we ran the first one for Perl. Conferences became a whole new business for us. You look for the gap and you fill it.

DJ pushes the same idea down to the level of the neighborhood:

If you want to feel rewarded, go fix something in your neighborhood. Go help out the food pantry. Go help out the local foster child care system. Go help out. . .parks and rec. Use those skills to go do something, and then you’re going to see. . .people respond in a different way. . . .The target-rich area for problems is massive. You just have to look.

I’ve never bought the jobless-future story. Back when I wrote WTF? in 2016, I pointed out that there is so much around us that needs to be made better. The constraint has never been a shortage of problems. AI gives us new tools for solving them. It should be a way to put people to work, not out of work.

The organization is the AI bottleneck

DJ has also been visiting hospitals and clinics and talking to CIOs and CTOs as part of the tour, and what he’s seeing is alarming.

The federal changes to Medicaid and the Affordable Care Act are landing on systems that were already near collapse. Hospitals that depended on outpatient procedures like colonoscopies for margin are watching volumes drop 20% to 30% because people can’t afford insurance. Some are running $1 million a day behind, a $300 to $400 million shortfall for the year.

At the same time, AI companies are telling those same hospitals to move into the new world, and partly because of the “you will soon be replaced” narrative from the AI labs, labor is responding the way the Kaiser nurses did in California, where any use of AI was off the table as a bargaining condition. As DJ pointed out, we can’t afford to disregard AI when it has the potential to automate the most painful parts of healthcare workers’ jobs and let them “do the job they’re trained for” without the administrative burden. Businesses need to change not just their narrative but their strategy. They need to be saying, “We’re going to use AI to help you do more for our customers. We’re going to make your job more human and let the machines deal with the BS.”

There’s a version of this where the efficiencies AI creates get plowed back into better patient care. There’s also the version that’s actually happening in most places, where private equity captures the savings as profit. The difference is institutional design, and that’s where reform isn’t happening. I saw this directly with a Code for America project called Clear My Record. A California initiative had turned a number of petty crimes into misdemeanors, but very few people were petitioning to have their status changed. We started using software to streamline an absurdly convoluted criminal record expungement process, but then we asked ourselves why we were helping people fill out forms that shouldn’t exist. The law had already changed the record. The process should have been a database update, not something that required a petition to the court. That’s the kind of problem AI was born to solve. It can help us refactor old stuck processes and move to something way better.

Done right, DOGE could have been an opportunity to carry out that kind of real institutional change at scale. Instead it became a wrecking ball, and it’s given the whole idea of institutional reform a bad name.

The Silicon Valley default assumes that incumbents will just get disrupted by startups, the way media was by Google and Meta and retail was by Amazon. There’s some truth to that. But disruption takes much longer than people think, and in a domain as central as healthcare or government services, the delay means real harm to real people. Healthcare is a third of the economy. You can’t just let it fail and rebuild it fresh while people depend on it for survival.

Data infrastructure is the competitive advantage

DJ’s term for the alternative he’s living with at Devoted is “the tidy house.” He built the boring infrastructure years before LLMs existed, and that’s why the company could move the moment AI arrived. People don’t think about having well organized, effective data infrastructure as the deep secret behind enterprise AI adoption, but DJ is right. As we work on O’Reilly’s own transformation and talk with our customers about what’s holding them back, it’s a huge part of the problem.

One of the ways we’ve tried to make this work is fundamentally still data 101, unified data environments, data flows that are clean, that have a lot of organization. . . .Because we invested so heavily in that infrastructure, the dumb, boring, painful parts of making sure you’ve got a really great data warehouse, great data engineering pipes, all of the metadata that goes with it, when AI shows up, you get to use it right away. Now you get to focus on the orchestration, the harness, all those pieces.

While other organizations are reconstructing ETL inside context windows and paying for it in GPU costs, Devoted’s team gets to work on the actual clinical problems. As DJ put it, transforming a healthcare system is “like walking and chewing gum while balancing bowling balls on your head and on a unicycle,” with the laws of physics changing on you the whole time. The organizations that come through it will be the ones that did the unglamorous work of keeping clean, flowing data with its lineage and metadata intact. The ones that didn’t will keep paying to reconstruct context they should have had all along.

The pharmacists who built their own agents

The tidy house pays off when you put the tools in the hands of people who already know the domain. At Devoted, clinicians are building things without waiting for a product manager to learn the problem first. These frontline workers have already spent decades understanding it.

A pharmacist. . .says, “Hey, you know what? I’m really worried when I see these kinds of drugs show up together. That’s not a good thing. . . .Why don’t I have an agent that alerts me every time this happens? I should just automate it because maybe one of the patients gets prescribed something by another provider and we don’t see it.” So the pharmacist [says,]. . .”I’m just going to build that agent.” Now I’ve got an agent always looking for bad drug interactions. And another pharmacist says, “I’ve got my own version of that.” . . .So I say, “Hey, agent, I want you to go ask all the pharmacists that we have a quick survey of what might be happening. . . .What are the universe of things that we should be watching out for?” Now I’ve got a robust medical layer. . .looking out and protecting all of our members from bad drug interactions. Having the right infrastructure makes it possible to act on decades of accumulated judgment distributed throughout the organization.

The histogram is still the most powerful product

You don’t need exotic tooling to get value out of data, and DJ punctured the assumption that you do.

Oftentimes, I tell people, the most powerful data product you can build is still a histogram. Just give me a distribution of what’s going on. . . .AI gives us a tremendous opportunity to let people [access this data quickly], but we’ve got to figure out the guardrails, so people don’t ask [questions] or get answers. . .[without realizing] that there’s a flaw in how they’re asking it.

Every time a new technology empowers employees to make innovative use of corporate data, there is resistance. We’ve been in this loop since the beginning of the data movement, DJ explained. The stewards of the data warehouse stand at the gate and say, “You shall not pass!” Then democratization breaks it open, and the gatekeepers reconstitute themselves in the next era. Hadoop did it last time. LLMs are doing it now, and the temptation to insist that only experts can use the tools correctly is as strong as it’s ever been. You do need ways to catch errors. But the goal should always be access.

The real opportunity is in the layers above AI models

DJ and I also talked about the new discipline forming inside computer science, engineering the trade-offs between conventional software and LLMs, when to reach for a local or open weight model, and understanding what inference actually costs against the value it returns.

Getting that right requires an expanded view of mechanism design. While this isn’t how economists talk about it, many advances in technology are really just that: redesigning the rules of a game to get better outcomes. Pay-per-click advertising started as a crude auction that sold to the highest bidder, and then Google refined it into something that worked. Rob McCool wired a web server to a database with CGI and ushered in a decade of invention of new mechanisms for data-driven websites. Or take Apache Kafka, which DJ reminded us began as a project to help LinkedIn rein in its Splunk bill and only later became the foundation for a company and an ecosystem.

We’re at the front of an architectural innovation cycle now, and the biggest opportunities are not in the models themselves but in the layers above them. That’s also where a renaissance of open source for the AI era could happen.

DJ and I are both, as he says, “this giant human LLM, summarizing and distilling all the things we’re hearing” from a lot of people. What we’re hearing is that the technology is mostly ready, but our institutions are not. What’s lagging is the organizational and economic infrastructure that lets universities, hospitals, data teams, and the labs themselves actually deploy what’s been built.

It’s time to get busy!

On June 10, Harper Reed, cofounder of 2389 Research, will join me to talk about why the future of software depends on creativity, serendipity, and building weird stuff. And on July 9, Trail of Bits cofounder and CEO Dan Guido will stop by to share his playbook for going AI native. You can register to attend them live here. You can also follow Live with Tim O’Reilly on YouTube, Spotify, Apple, or wherever you get your podcasts.

Predict, Don’t Enumerate

Michael Roytman — Thu, 04 Jun 2026 10:57:44 +0000

A third of the way into a security-operations guide that Anthropic published in April 2026, wedged between a recommendation to patch CISA’s Known Exploited Vulnerabilities list and a suggestion to automate your deployment pipeline is a small recommendation: “Use EPSS to prioritize the rest.” For anyone who has worked on a vulnerability backlog in the last decade, the sentence is an acknowledgment of a widely felt but often unspoken fact about security programs: They have become machine-scale problems of signal to noise.

EPSS (Exploit Prediction Scoring System) is a statistical model that takes a known software flaw, runs it through a set of signals about what attackers are actually doing across the internet, and returns a probability that the flaw will be exploited in the next 30 days. It isn’t an LLM, and it does no reasoning or prompt engineering. It predicts. The company endorsing it is the same company whose newest model can surface thousands of novel, exploitable vulnerabilities in production software, many of them two or three decades old, most of them still unpatched.

As far as we can tell, this is the first time a frontier AI lab has publicly endorsed a purpose-built predictive model as the right tool for a defensive problem. LLM labs usually recommend LLMs. That Anthropic did not is worth noting, but the recommendation itself isn’t news to the practitioners it’s aimed at. It’s a description of what they’ve been doing.

The quiet consensus

The volume problem isn’t new. Anyone running a scanner against a large enterprise estate in 2015 was already generating hundreds of thousands of findings per month. Anyone running one against a cloud environment in 2020 was generating millions. Enterprises have spent the better part of a decade staring at dashboards where the number of open critical findings was larger than the capacity of the team supposed to fix them. In other words, cybersecurity has become machine scale.

Risk-based vulnerability management, as a product category, has existed since around 2018. EPSS, as a public resource, has been usable since 2021. More than 120 vendors embed it today into their products. The field has had access to a predictive baseline for years.

What has been missing is an external justification to change the status quo recommendations from auditors, model risk management teams, and even boards. Auditors want a clear set of expectations, making grading more objective and therefore easier to evaluate. Compliance frameworks like CVSS (Common Vulnerability Scoring System) because CVSS is easy, but implementing something more efficient has historically required that aforementioned external push. A working CISO could tell you she had stopped treating every vulnerability scored a severity 9.8/10 by CVSS as an emergency in 2019, but she would also tell you she still kept CVSS in the report.

Anthropic’s guidance is useful because it makes the private consensus public. Patch what you know to be exploited, then use EPSS above a threshold based on the team’s capacity or risk tolerance. DHS CISA’s practice of publishing known exploited vulnerabilities since November of 2021 is just additional proof that the existing methodologies were being overwhelmed by scale and lack of signal.

Why prediction, stated plainly

In 2014, at Black Hat, Dan Geer, then the chief information security officer of In-Q-Tel, asked the first principles question: Are vulnerabilities in software sparse or dense? Sparse meant finite, meaning every fix measurably shrank the attack surface. Dense meant weeds in a field. Geer could not answer the question because the data were not in.

Eight years later, Jonathan Spring at Carnegie Mellon’s Software Engineering Institute tied vulnerability enumeration to the halting problem and showed, in theory, that for any sufficiently complex piece of deployed software, there are always more undiscovered flaws.

The AI-driven discovery results of the last 18 months have made the density argument impossible to wave off even in a compliance review. A 27-year-old bug in OpenBSD. A 16-year-old bug in FFmpeg that five million fuzzing runs never caught. Disclosed findings, by the developers’ own accounting, are less than 1% of what has been found. But again, the volume was already a problem. With the coming release of its newest model, Mythos, Anthropic is telling teams to plan for an order of magnitude more findings over the next 24 months.

Static severity scoring can’t survive the volume problem, because it’s a human-scale solution for a machine scale problem. Neither can any process that treats every critical finding as an emergency. The threshold for action has to be probabilistic, measurable, and defensible. That’s what a predictive model is for, and that’s what working teams have been using in noisy large enterprise environments.

Pointing machines and knowing machines

Geer returned to his 2014 question in the summer of 2025, writing with Dave Aitel in Lawfare. The piece gives the industry a vocabulary for a distinction it has been fudging:

A vulnerability in the code isn’t automatically a threat. A buffer overflow is a hazard. It becomes a risk only if an attacker can exploit it reliably, in this environment, against these controls, through this traffic. Bugs are abundant but the ability to weaponize a particular bug against a particular target is much rarer.

The industry, they wrote, has built a pointing machine. It enumerates.

Even children learn early to point and name—but knowing the word “dog” doesn’t reveal whether the animal might bite. In cybersecurity, we’ve built systems that similarly point and name vulnerabilities without understanding whether they’re truly dangerous. By embracing AI solely for pattern recognition, we’ve created a powerful “pointing machine” that identifies possible threats but does not comprehend their actual impact. What we need instead is a “knowing machine,” capable of understanding how code functions within complex, real-world environments, recognizing not just hazards but the full context of how and whether those hazards might become genuine risks.

A knowing machine is a system that understands how code behaves in a particular environment and recognizes the context that turns a hazard into a risk. A predictive model is how you build a knowing machine. EPSS is the clearest public example: It covers every published CVE and is updated daily.

Global isn’t local

EPSS is a global model. It sees what attackers are doing across the whole of the internet. It picks up patterns in exploitation activity that severity scores never could. What it can’t see is any particular organization’s environment. It doesn’t know which assets carry the data the business actually cares about. It doesn’t know what compensating controls are in place, where remediation is risky, or how your telemetry and history change the odds.

A 9.8 with a 97% global probability of exploitation and a 9.8 with a 0.1% probability are not the same animal. Neither are two organizations applying the same EPSS threshold to the same CVE on different assets. One has the vulnerable code path exposed to the internet, behind a web application firewall that doesn’t inspect the relevant protocol. The other has the same CVE on an internal system that accepts authenticated input from a single service account. A scanner can’t tell them apart. A global model can’t tell them apart. Their actual risk profiles are orders of magnitude apart.

Local context is where most security teams have been stuck the entire time, and where the next decade of the field is going to be fought.

What a local knowing machine actually requires

Pair a better pointing machine with a faster remediation engine and all you’ve done is increase the speed at which you produce churn, breakage and wasted effort. You’ll also spend a king’s ransom in agent tokens fixing vulnerabilities that were never dangerous in your environment.

In contrast to an omniscient scanner, a local model trains on the specific environment being defended: asset inventory, application topology, reachability, deployed controls, attack telemetry observed on-site, and the history of the organization’s own remediations and their outcomes. The model produces probabilities specific to the enterprise. Most organizations already have the inputs, scattered across CMDBs, endpoint agents, firewall logs, ticketing systems and scanner output. This context is precisely what attackers (whether they’re using good old fashioned metasploit or Mythos with an infinite budget) are lacking in their models. The context becomes an asymmetrical advantage for defenders, perhaps the only one that exists.

The policy shifts that actually matter

The interventions that will decide whether a security program survives the next 24 months aren’t purely technical. A CISO can put most of them in motion without buying anything.

Rewrite the SLA. Most vulnerability-management SLAs are organized by severity. Criticals in 15 days, highs in 30, mediums in 90. That structure was built for a world where the count of open criticals was small enough to matter. It’s now actively harmful, because it forces teams to spend the same effort on a 9.8 nobody is exploiting and a 7.5 that’s under active attack. SLAs should be rewritten in terms of probability of exploitation and asset exposure, not severity. A CISO who can’t get that past her GRC team can at least add a second tier that makes the probability-based cut enforceable alongside the severity-based one.

Change what the board sees. If the monthly security report counts the numbers of vulnerabilities, exposures or findings in different buckets (“critical,” “open past 30 days,” etc.), the organization is being managed to the wrong metric. The metric should be exploitability-weighted exposure over time, with a second line for predicted versus observed exploitation. Boards will accept this once somebody explains it. This beats showing them a number that has no relationship to risk and is growing exponentially as new LLM models are released. More to the point: A great team can do amazing volumes of remediation work, and risk can still rise because they’re measuring and remediating the wrong thing. An efficient, context-rich team can do far less work and meaningfully move the probability of an event down.

Invest in telemetry. The single most valuable instrument a security program can build is a feedback loop between what was prioritized and what was exploited. If the loop shows you were wrong, the model improves. If the loop does not exist, you will keep being wrong indefinitely (or just not being aware of misses).

Fix the compliance conversation. The reason CVSS survives is regulatory inertia. PCI, HIPAA, and most state breach-notification frameworks still reference severity. The CISOs who will come out of the next two years in the best shape are the ones who engage their auditors now, in writing, about what a probabilistic prioritization framework looks like under the existing rules.

Staff for the bottleneck, which isn’t scanning. The industry has spent a decade hiring people to find bugs. The bottleneck now is deciding which bugs matter, getting the fixes deployed, and measuring whether the prioritization was correct. The job descriptions should reflect this. A security-data engineer may be able to increase efficiency to meet SLAs more than increasing capacity would.

None of this requires a new product. All of it requires a CISO willing to say, out loud, that the old dogma is broken and that the new one will be managed by data and probabilities. That is the shift Anthropic’s five-word sentence was really announcing. The technology is available and the models are here—both the LLM-based ones to find the vulnerabilities and the predictive knowing machines to prioritize efficiently.

Context as Code

Artur Huk — Wed, 03 Jun 2026 11:00:14 +0000

As syntax becomes cheap and abundant, architectural control becomes the scarce resource. Effective governance starts upstream, where intent, constraints, and threat models shape the agent’s working context before generation begins. The goal isn’t better prompting but build-time boundaries that prevent structurally invalid code from entering the system.

The Frankenstein factories

The dark factories (as Dan Shapiro calls them) are running. Tokens fly through trycycles, features ship overnight, and codebases are ported before breakfast. The velocity is real. And comprehension debt (a term coined by Addy Osmani) is compounding in silence behind it.

What this era is producing, at scale, deserves its own name: Frankenstein factories. Not a critique of any single approach but a description of a structural condition—generation engines so effective at producing working syntax that they have industrialized the creation of architecturally ungovernable systems. The creature walks out of the laboratory impressive, functional, and alive on delivery day.

The crisis arrives the day someone must govern it. To govern a system means to hold it accountable to its design boundaries—the ability to look at it and reliably say why it works, what is permitted to touch what, and to categorically prevent forbidden state changes before they happen. Victor’s catastrophe was not the act of creation but the absent governing frame.

For prototyping or shipping features fast, unconstrained generation is a powerful tool. It optimizes for velocity, and it delivers. But for enterprise payment systems, insurance underwriting engines, logistics orchestrators, and regulated platforms, the question is not “Does the code ship?” but “Who is liable when it does the wrong thing?” Here, automating the word “YES” to every feature request does not solve the problem. It industrializes it.

Consider a standard Jira ticket: “Add an email notification after a successful payment.”

A junior developer might attempt to wedge the email-sending logic directly into the PaymentProcessor class. A senior architect catches this in code review: “No. Fire a PaymentSuccessEvent to the message bus.” That human friction—the architectural “No”—keeps the system maintainable.

Unconstrained AI agents lack this assertiveness. By default, they are the ultimate yes-men.

Hand that same ticket to a standard coding agent and it will not argue about bounded contexts. It will burn tokens until it produces 300 lines of syntactically perfect code, import an SMTP library directly into the core of your billing domain, and submit a pull request. The tests will pass; conventional feature tests make no assertion about bounded contexts. The CI pipeline will go green. And structurally, the system is now a disaster.

This happens not through malice but because of how agentic loops are built. Without explicit architectural constraints, the system’s emergent behavior is to fulfill immediate user intent. The agent is orchestrated to ship the feature, not to defend the architecture. Comprehension debt is the structural consequence: AI generates syntax faster than human beings can read or govern it. Expecting a probabilistic model to enforce structural integrity on its own is a category error. Without a governing frame, the agent will always take the path of least resistance to a “YES.”

You cannot fix code overproduction by hiring more people to read it nor by running the generation loop faster. The only scalable answer is to build a concrete riverbed before you turn on the water.

If the current era automates the word “YES,” we should automate the word “NO.”

Securing the runtime environment prevents the monster from escaping. But to prevent it from being built in the first place, we need to step back into the IDE and the CI/CD pipeline. We need to govern generation.

The great softening: Shifting risk from build time to runtime

Compilers never guaranteed correct software. You could write catastrophic logically broken systems in C, Java, or any other compiled language. But compilers served a crucial engineering purpose: They deterministically governed a specific layer of structural risk.

By enforcing hard execution constraints—syntax validity, type compatibility, linkage rules, and executable viability—the compiler acted as an automated boundary. It didn’t verify business intent, domain correctness, or architectural quality. What it did was eliminate an entire class of low-level structural failure before execution ever began.

That delegation of risk is one of the quiet triumphs of software engineering. Our discipline has always advanced by mechanizing one class of guarantees so humans can focus on the next layer of abstraction. We automated machine-level structural correctness so engineers could spend their cognitive energy on application logic. Later, we pushed more guarantees upward, into schemas, testing, static analysis, architectural patterns, and operational controls.

Over time, we also deliberately softened certain boundaries in exchange for speed. Dynamic languages, richer runtimes, reflection, and increasingly abstract frameworks all traded deterministic compile-time guarantees for developer velocity and flexibility. The newly exposed risk was absorbed elsewhere: runtime validation, automated testing, observability, and engineering discipline.

Today, with agentic AI, we are softening boundaries again, more radically than ever before.

Natural language has become a high-level control plane for software generation. Arbitrary text increasingly shapes executable behavior. And in that shift, we have blurred one of the oldest boundaries in computing: the separation between data and instructions.

Outside the model, that boundary still exists. Systems enforce permission scopes, schema contracts, sandboxing, and execution policies. But inside the inference context, those protections collapse into the same token stream.

System prompts, retrieved documents, user messages, tool outputs, and external content all flow through the same neural weights. There is no hard privilege boundary between instruction and input. Modern models may resist naive attacks like “Ignore previous instructions,” but they remain vulnerable to indirect injections disguised as legitimate operational context. A malicious instruction embedded in a customer email, a webpage, or a tool response is not processed as passive data. It can become behavioral influence.

Inside the context window, untrusted text can shape control flow. That is the real softening.

We are generating syntax at machine speed, but we have dissolved the structural gate that once constrained how systems were built. The result is a massive shift of risk from build time to runtime. Code that appears structurally sound during generation may violate architectural boundaries, introduce unsafe execution paths, or become behaviorally compromised the moment hostile context enters the loop.

The conclusion is straightforward: The fact that AI-generated code runs is no longer a meaningful proxy for system correctness.

Syntax is abundant. Execution is easy. Structural governance is what is missing.

We outsourced the writing of logic to machines, but we did not build a deterministic boundary that governs what those machines are allowed to generate.

If we want control back, we cannot rely on human code review at machine speed. We must rebuild the build-time gate.

From dependency bloat to tailor-made architecture

For decades, the industry’s default response to complexity was abstraction by accumulation: monolithic frameworks, sprawling dependency trees, and ever-thicker layers of indirection. Importing a 50-megabyte library to avoid repetitive boilerplate was a rational trade-off when developer time and cognitive bandwidth were the scarce resources. For AI agents, that trade-off changes.

This is not an argument against foundational infrastructure. Mature primitives—like SQLAlchemy in Python or Spring Boot in Java—remain essential precisely because their conventions are widely learned and predictable. The problem isn’t abstraction but opacity. When core business logic disappears behind proprietary decorators, internal frameworks, or custom orchestration layers, execution becomes a black box. An agent cannot safely reason about code it cannot trace. It needs direct visibility into causality: what changes state, what enforces invariants, and where responsibilities begin and end. Hidden flow degrades reasoning into guesswork; guesswork silently becomes architectural drift.

At the same time, AI drives the cost of procedural code toward zero. Boilerplate is no longer expensive. Clarity is. The design question shifts from “How much can we abstract away?” to “How much must remain explicit for safe reasoning?” The answer is tailor-made architecture: thin infrastructure, explicit domain logic, hard boundaries, and narrowly scoped components with visible contracts. The value is no longer in how much code you avoid writing but in how clearly the system declares its boundaries.

That same opacity also breaks verification. AI review can catch local defects, risky patterns, and implementation mistakes, but it remains blind to architectural drift and missing business intent unless those constraints are explicitly encoded. After all, if you ask a model to review code generated from the exact same vague Jira ticket, do you actually get verification, or do you just engineer a circular hallucination, where the AI politely revalidates its own blind spots?

Figure 1. Tailor-made architecture gives generated syntax a clear structure without dissolving system boundaries.

The Context Compilation Pattern

The Context Compilation Pattern governs generation in the IDE and the CI/CD pipeline before a single syntactically plausible line ever reaches a human reviewer. If the Decision Intelligence Runtime (DIR) is the vault door that protects execution in production, context compilation is the blueprint that prevents the monster from being built in the lab.

This is not “prompt engineering,” which merely asks a probabilistic model for a better answer. What we need is build-time governance: two layers of defense assembled before the LLM inference is even triggered. The first is structured context injection (assembling the prompt from prioritized artifacts). The second is postgeneration static verification (deterministic AST checks that enforce rules no probabilistic model can override). The prompt structure biases generation toward compliant solutions; the static checks make declared, machine-verifiable boundary violations impossible to merge.

Deterministic build-time governance is not a return to formal software specification (like UML), nor is it merely “prompt engineering disguised as Markdown.” It’s a mechanical constraint on the generation space that makes explicitly declared boundary violations rejectable by design. Context compilation does not eliminate architectural review or replace engineering judgment. Instead, it ensures that the agent operates within a defined riverbed of allowed structural invariants.

Engineering evolves whenever implicit rules become explicit declarations. Application development is now crossing that boundary. The senior engineer’s new job is declarative boundary engineering: explicitly declaring what the system is absolutely forbidden from doing.

The failure is not in the frameworks. The failure is in the process: pointing an unconstrained AI agent at a codebase full of invisible magic and expecting a CI/CD pipeline designed for human-generated code to catch what goes wrong. The answer is to build a compiler for the agent’s context.

The Context Compilation Pattern is the staged pipeline that makes this concrete.

Figure 2. The Context Compilation Pattern pipeline, enforcing build-time constraints through deterministic artifact assembly and dual verification.

Step 1: The context artifacts

The most strategically valuable code in your repository may no longer live in src/. It lives in /context. The pipeline consumes versioned artifacts such as intent.md, boundaries.md, and threat-model.md, each authored by a specialist before a single line of code is generated. (Ownership and role responsibilities are covered in “Artifact-Bound Roles and Accountability” below.) What matters here is that these files are the inputs to the compiler: Without them, there’s nothing to compile.

To prevent cognitive overlap, their roles must be fiercely separated: boundaries.md declares structural invariants (e.g., dependency direction, allowed communication paths, and event emission), whereas threat-model.md models adversarial constraints as declarative abuse scenarios (e.g., prompt injection and secrets exfiltration) that must be mechanically blocked.

boundaries.md warrants a precise definition, because it anchors the entire build-time governance model. In practice, boundaries are typically defined at module or bounded-context granularity (e.g., /billing/* or /risk/*), not per class or per repository. They are implemented using hybrid artifacts: a natural language document designed to constrain the LLM, tightly paired with a deterministic rule for the CI runner.

Consider this concrete example of how an architectural boundary is explicitly declared and enforced:

1. boundaries.md (for the LLM context)
This Markdown file is injected into the agent’s prompt. It defines the vocabulary, architectural constraints, and allowed interactions.

Module: Billing
Ontology: Order, Invoice, PaymentEvent
Rule: Zero external network I/O is allowed in this domain. You must NEVER import requests or smtplib.

2. semgrep-rule.yml (for the CI/CD runner)
This static file goes to the CI pipeline to mechanize the boundary. It ensures the code check is fully deterministic.

rules:
  # Block forbidden imports at the module boundary
  - id: block-external-io-in-billing
    patterns:
      - pattern-either:
          - pattern: import smtplib
          - pattern: import requests
    message: "Architecture Violation: External I/O is strictly forbidden in the billing domain."
    severity: ERROR
    languages: [python]
    paths:
      include: ["src/billing/**"]

  # Domain layer must not talk to DB driver directly
  - id: block-db-driver-in-domain
    patterns:
      - pattern-either:
          - pattern: import sqlalchemy
          - pattern: from sqlalchemy import ...
          - pattern: import psycopg2
          - pattern: from psycopg2 import ...
    message: "Architecture Violation: Domain layer must use Repository abstraction, not database drivers directly."
    severity: ERROR
    languages: [python]
    paths:
      include:
        - "src/billing/domain/**"

Crucially, these Semgrep/CI rules are human-authored (or human-reviewed) precommit artifacts. We don’t rely on an LLM to generate the security gates on the fly. The AI reads the Markdown to guide its generation; the CI runner executes the static YAML to enforce the boundary.

If these artifacts stay current, they actively govern the generated codebase. Stale or malformed context becomes context debt: The pipeline will enforce strictly whatever was declared, even if the declaration is wrong. Governance artifacts are production code. They require strict versioning, explicit ownership, and periodic review just like the executable logic they constrain. That’s why core artifacts like boundaries.md require rigorous peer review, not just casual updates.

Step 2: The context compiler

Dumping all Markdown files into the system prompt is sometimes acceptable for small projects and small artifacts. But as the codebase grows or the context window fills with too many competing constraints, models begin to suffer from “lost in the middle” degradation and silently ignore what matters most.

The term “context compiler” might sound like a magical enterprise heavy-lift, but the reality is entirely mundane. In its simplest form, it’s just a deterministic context assembly layer combined with a routing mechanism.

Instead of treating context as a flat pile of documents, the compiler assembles it into an ordered structure. Because different artifacts apply to different parts of the project, boundaries.md in the /billing module might enforce strict isolation, while the one in /frontend might be much more permissive.

In practice, the compiler may take one of these forms:

Manual selection: The developer simply points their IDE or agent to a structured set of Markdown files.

A mundane script: A basic Python or bash script that understands a directory structure. It concatenates the .md files to build the LLM’s system prompt and hands the .yml files directly to the CI runner.

Tool-mediated context protocols: Dedicated mechanisms (e.g., MCP) that allow the agent to query the workspace and dynamically assemble the required boundaries directly within the IDE, bypassing the need for manual script invocation.

Consider a practical directory structure:

/context
  /global
    coding-standards.md
  /domain
    /billing
      boundaries.md
      threat-model.md
      semgrep-rule.yml
    /risk
      boundaries.md
      threat-model.md
      semgrep-rule.yml
    /frontend
      boundaries.md
      threat-model.md
      semgrep-rule.yml

When generating code for the billing module, the script reads /global and /billing. The compiler simply scopes the rules based on the directory, perfectly focusing the agent’s attention on the boundaries that matter while wiring the corresponding YAML rules for deterministic CI verification.

Step 3: Strict boundary hierarchy (resolving conflicts)

When faced with conflicting instructions, LLMs don’t throw a compilation error. They hallucinate a dangerous compromise. The compiler prevents this by enforcing a deterministic precedence of declared constraints before the prompt is assembled:

Threat model > Boundaries > Coding standards > Intent + acceptance criteria

Security and architectural boundaries unconditionally overrule feature delivery. This operates at two levels. At the prompt level (soft enforcement), constraint ordering biases generation toward compliant solutions. At the postgeneration level (hard enforcement), deterministic code checks parse the generated syntax, verify structural invariants, and instantly fail the build on violation.

“Resolution” in this context does not mean an LLM philosophically negotiating between two Markdown files. It means deterministic rejection via CI. If the intent.md asks to “email a receipt to the user,” but boundaries.md forbids external network calls in the billing module, an unconstrained AI might try to generate an SMTP call. The conflict is mechanically “resolved” when the CI pipeline runs a static rule (derived from semgrep-rule.yml) and instantly fails the build. The developer (context orchestrator) must then intervene and change the design to use an event bus instead. The hierarchy is enforced by deterministic code analysis, not LLM reasoning. A rejected build is not necessarily a rejected business need; it’s a signal that declared boundaries and intended capability must be reconciled explicitly before regeneration. (This mechanical rejection physically executes during the adversarial verification phase in step 5).

We do not use AI for this validation. We use existing, proven AST tools and code linters like Semgrep, Bandit, or CodeQL to enforce these boundaries in CI/CD.

However, we must be precise about what this governance actually achieves. Deterministic checks enforce invariants, not the architecture as a whole. You can statically enforce forbidden imports, forbidden outbound I/O, strict layering, and schema conformance. You cannot statically enforce domain semantics, aggregate ownership correctness, subtle coupling, or conceptual cohesion. Deterministic verification doesn’t prove architectural correctness. It proves compliance with explicitly declared structural invariants.

Step 4: Generation

Context as code matters only if generated syntax is verified against the same boundaries that shaped it. With a compiled, conflict-free context hierarchy, the developer agent generates code inside an isolated user space sandbox. In this fleeting fraction of a second, the agent inside the developer’s IDE consumes the narrowed, precompiled system prompt and outputs the actual payment_service.py. Its role is constrained synthesis: translating the boundaries in boundaries.md and the imperatives in intent.md into code.

Step 5: Adversarial verification (negative space)

This phase checks whether the generated code crossed a forbidden boundary. Before the development cycle begins, the adversarial context provider defines threat vectors in threat-model.md. Because a Markdown file only guides the LLM softly, the governance platform engineer bridges the gap to determinism by translating those declarative threats into matching executable rules (like semgrep-rule.yml) wired into the CI gates. If the threat model identifies server-side request forgery or secrets exfiltration as a risk for the /frontend module, the corresponding CI rule parses the generated code and instantly fails the build if a known attack pattern or insecure execution sink is detected.

The pipeline doesn’t ask an LLM to read the Markdown and assess if the code is safe. It mechanically executes the prewritten rules derived from it. If a generative agent helps draft the rule set, it does so before the cycle in an isolated sandbox, and a human reviews the result before it enters CI. Step 5 doesn’t prove overall correctness; it proves that declared structural and security boundaries are enforced.

Like any static gate, deterministic boundary checks trade flexibility for safety and will occasionally reject valid implementations. That friction is intentional: Explicit override and artifact refinement are part of the governance loop.

AI code review may identify suspicious code, but it cannot certify that declared boundaries survived generation. Step 5 therefore relies on deterministic CI rules, not on a probabilistic model interpreting the pull request.

Step 6: Acceptance verification (positive space)

This phase checks whether the generated code solves the business problem. The acceptance-criteria.md defines the expected behavior not as a vague user story, but as a machine-executable contract (e.g., using Gherkin syntax):

Scenario: Successful payment emits notification
  Given a valid payment of 100 EUR
  When the transaction completes
  Then the PaymentSuccessEvent is published to the message bus

The CI pipeline parses this exact Markdown block and runs the corresponding test suite. Step 6 provides what step 5 cannot: verification against a declared delivery contract.

The code is approved only when it passes adversarial checks and satisfies the acceptance criteria. Without step 5, the system could violate structural boundaries. Without step 6, it could implement the wrong intent. Both contracts must hold.

Artifact-bound roles and accountability

The traditional SDLC is a linear cascade: Requirements flow to architecture, then to code, then to QA. In an era where a machine generates 10,000 lines of syntax in the time it takes to fetch a coffee, that handoff is a fatal bottleneck.

In the context matrix, specialists define parallel, independent constraint vectors before generation begins. The titles on business cards stay the same. The artifacts they produce change entirely.

Old role	New role	Artifact	Responsibility
Business analyst	Intent definer	`intent.md` + `acceptance-criteria.md`	Define the “what” and the deterministic proof that it was delivered
Software architect	World builder	`boundaries.md`	Define domain ontology, architectural invariants, and allowed interaction patterns
QA & security engineer	Adversarial context provider	`threat-model.md`	Define threat vectors and abuse paths before generation
Platform engineer/DevOps	Governance platform engineer	Compiler pipeline + CI gates (`semgrep-rule.yml`)	Operationalize declared constraints into nonbypassable enforcement gates
Developer	Context orchestrator	`coding-standards.md` + critical code	Resolve artifact conflicts, steer generation workflows, implement critical paths, and refine context quality

In this model, accountability is distributed and artifact bound. Rather than handing off work downstream, each role owns specific upstream activities and constraints.

The intent definer (formerly business analyst): Owns the business reality. They translate user needs into intent.md and define hard acceptance-criteria.md (like BDD scenarios or API contracts). Their job is to formulate requirements so strictly that the pipeline can automatically prove delivery, acting as the first line of defense against vague “vibe coding.”
The world builder (formerly software architect): Owns the structural gravity. They write boundaries.md to establish the domain ontology and hard architectural boundaries. Instead of reviewing pull requests for drift, their daily activity is defining what modules are allowed to communicate and declaring the structural invariants the generated code must respect.
The adversarial context provider (formerly QA and security): Owns the negative space. They anticipate failure modes and define threat vectors via threat-model.md. Their responsibility is identifying the precise abuse paths that the CI pipeline must block, ensuring an LLM never tests its own code.
The governance platform engineer (formerly platform engineer/DevOps): Owns the enforcement machinery. They build the context compiler pipeline and operationalize declared constraints into nonbypassable enforcement gates. Their responsibility is the deterministic enforcement pipeline that executes declared governance artifacts at precommit and CI/CD boundaries.
The context orchestrator (formerly developer): Owns generation orchestration and critical handwritten paths. This is a hybrid reality, not the end of programming. They write coding-standards.md, manually implement zero-trust paths, and resolve runtime exception requests. For the bulk of the system, their focus shifts to a meta-level: resolving conflicting constraints, tuning the prompt’s signal-to-noise ratio, and debugging why a given artifact failed to govern the agent properly.

When a failure occurs, the investigation shifts from “What was the agent thinking?” to “Which contract failed to govern?” Because the pipeline deterministically enforces what was explicitly declared, failures are no longer opaque hallucinations. They’re traceable collisions between artifact boundaries. A structural flaw cleanly points to an unbounded boundaries.md. When the pipeline is green and the contracts are honest, the orchestrator acts as a firewall against process failure, not a scapegoat for undocumented assumptions.

Figure 3. The decision boundary architecture: Context compilation governs generation, ROA structures intent, and DIR validates execution.

The economics of governance

Context compilation makes economic sense only when the cost of architectural failure exceeds the cost of explicit governance. It adds upfront design work and cognitive overhead, so its value depends on how expensive a wrong system decision would be.

For rapid prototyping, throwaway utility scripts, marketing sites, or low-stakes internal tools—where the worst-case consequence of a hallucination is a misaligned dashboard—let the generative engines run unconstrained. Velocity is the only thing that matters.

For safety-critical automation, trading platforms, healthcare orchestrators, and regulated enterprise systems, the economics invert. Velocity without deterministic boundaries is simply the speed at which you accumulate liability. A single unconstrained agent importing an insecure dependency into a payment core costs orders of magnitude more than the engineer-hours spent writing a boundaries.md contract.

You don’t build a bank vault door for a garden shed. You apply context compilation where the systemic cost of emergent architectural failure is catastrophic.

Automating the word “NO”

When code generation becomes cheap, architectural entropy tends to scale with it. That makes post hoc code review less effective, especially when reviewers spend their attention on machine-generated boilerplate. A more durable approach is context review: peer review of the declarative constraints that shape what the machine is allowed to build. A reviewed boundaries.md can guide many later development cycles. A reviewed pull request usually governs only a single change.

The discipline has shifted from imperative engineering of procedures to declarative engineering of boundaries.

Let’s return to the Jira ticket that started this discussion: “Add an email notification after a successful payment.”

The business analyst submits the intent.md. Before the developer agent sees the prompt, the context compiler activates—at the precommit gate or via tool-mediated context protocols (e.g., script or MCP) in the IDE—before a line is written. It retrieves the architect’s boundaries.md, which states, “The /domain module has zero external dependencies. No network calls.” The SMTP import collides with that boundary instantly. Even if the agent generates the import, the build will not survive it—the prompt biases generation toward compliant solutions, and the deterministic static check in step 5 rejects it at the declared boundary. The Frankenstein is caught in the pipeline, not discovered in production three release cycles later.

Code generation is becoming abundant. Architectural discipline is becoming scarce.

Context as code governs what may be generated. Responsibility-oriented agents govern what may be proposed. Decision Intelligence Runtime governs what may be executed. Three boundaries. One governing frame.

The highest-value engineering skill is no longer writing syntax. It’s engineering the conditions under which correct syntax can emerge.

That is the ability to automate the word “NO.”

This article concludes the three-part series on engineering boundaries in agentic AI. The repository at github.com/huka81/decision-intelligence-runtime contains an open source reference implementation of the concepts described in this series.

Radar Trends to Watch: June 2026

Mike Loukides — Tue, 02 Jun 2026 10:58:22 +0000

Coauthored with Claude

Agents are making the transition from performing tasks to running operations. The Cloudflare and Stripe partnership ships an agent that opens accounts, registers domains, and deploys an application on its own (details), while Stripe/Tempo and iWallet have each published machine-to-machine payment protocols to make that kind of work a standard. Office documents, browser sessions, and, in one announcement, the phone interface itself are next on the list. View the expanded role of agents as an opportunity for humans to accomplish more.

AI Models

The model menagerie keeps expanding in size and shape. Open weight contenders run at frontier capability on modest hardware, while specialist models for voice, conversation timing, and privacy filtering take over what used to be features inside one general chat model. Treat your prompts and skills as portable; the model behind them will change.

Anthropic has released Opus Claude 4.8. This model is not Mythos, which they expect to release soon. Opus 4.8 is a “modest improvement” that claims better results on coding and greater likelihood of informing users when it is uncertain about claims. Changes to the agents may be more important. Claude Code now has the ability to plan solutions to large problems involving hundreds of subagents (“dynamic workflows”); Cowork can control the effort put into solving a problem.
Cohere’s Command A+ is an open weight mixture-of-experts model with 218B parameters, 25B active. It’s competitive with frontier models and requires relatively little hardware to run: Two H100s isn’t small, but it’s not a data center either.
Google’s announcements at this year’s I/O conference include Omni, a new model that takes any kind of input (video, audio, image) and generates any kind of output; Gemini 3.5 Flash, a fast and efficient update to their coding model; Gemini Spark, a personal agent; and intelligent eyewear, another attempt at smart glasses.
Alibaba has announced Qwen3.7-Max, its most capable model.
Thinking Machines has announced a research preview of interaction models. These models support natural conversation flow. The model can wait for a speaker to finish, interrupt the speaker, respond when the speaker interrupts the model, and keep track of time.
OpenAI has released new voice models: GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper. They’re moving from call-and-response models to models that can take part in conversations, reason, and take actions.
OpenRouter published cost studies for both Claude Opus 4.7 and GPT-5.5. GPT-5.5 raised the token price but reduced the number of tokens in a typical conversation. Claude kept prices the same, but conversations tend to require more tokens. What’s the impact on your monthly bill?
Google has updated its Gemma 4 models, claiming that they triple token generation speed. They use a technique called multi-token prediction (MTP) to draft a sequence of tokens with a very small model and then approve those tokens with the large model.
IBM released Granite 4.1, a collection of small models (30B parameters and down).
An academic paper describes “the reasoning trap,” a phenomenon in which training models for increased reasoning also increases hallucinations about tool use.
Talkie is an LLM that was trained only on data from 1931 and earlier. If you want to know what it was like to live during the start of the Depression, this is the LLM to ask.
OpenAI has announced a privacy filter model. This is a small specialized model (1.5B) that can run on phones and other small devices. It removes personally identifiable information (PII) from text documents.

Software Development

We are beginning to see anecdotal evidence that the brief era of tokenmaxxing is coming to an end. Agents may increase productivity, but they can also use tokens at an astonishing rate. So can the latest models, like Anthropic’s Claude 4.8 with new features like dynamic workflows. Employers are realizing that the only way to measure productivity is to look at the quality of an employee’s work rather than relying on an artificial (and easily gameable) metric like token use. Teams that use AI effectively will be disciplined about token use; they’ll choose lower cost (or local) models where possible, reaching for expensive models like Claude 4.8 Opus only when necessary.

The Agentic AI Foundation is updating the MCP protocol, with a release candidate scheduled for July 28. Changes include making MCP a stateless protocol, adding a process for creating extensions, and aligning authorization with the OAuth and OpenID standards.
Google is dropping Gemini CLI and putting all of its effort behind Antigravity, its agentic software development platform. There are desktop and command line versions of Antigravity, but unlike Gemini CLI, neither are open source.
What shall we call Gas City, created by Julian Knutsen and Chris Sells? Gas Town 2.0? Steve Yegge says it’s an SDK for building your own “dark factories” by deploying teams of collaborating agents in any topology. It’s “a pivotal moment in the Mad Max school of agent orchestration.”
The problem with agentic programming is that agents serve individuals, not groups, and programming is a team sport. Is collaborative steering (context management for groups) an answer?
GitHub has released a preview of its Copilot app, a stand-alone desktop application for coding with AI. It’s completely integrated with GitHub; for example, you can launch tasks directly from GitHub issues.
If you think tokenmaxxing is your path to promotion, check out burn-baby-burn. It does what it says: burns lots of tokens, fast, using the LLM of your choice. We hope it’s a parody, but we bet it works.
Mitchell Hashimoto tweets that Anthropic’s rewrite of Bun from Zig to Rust demonstrates that programming languages are now fungible. Programming language lock-in has ended; programs can easily move from one language to another.
OpenShell is a runtime environment built with security in mind from the ground up. It’s intended to be used as a secure environment for running agents. Every agent runs in its own sandbox; an external gateway manages credentials and policies.
OpenAI is shutting down its API for fine-tuning its models. They say the current models are better and don’t require significant fine-tuning. As Latent Space points out, this doesn’t necessarily mean the end of fine-tuning as a discipline, particularly for open models. But it may be a signal. Drew Breunig writes about what this means for agents and harnesses.
Anthropic has released Claude for Office 365, allowing users to run sessions that cross Word, Excel, and PowerPoint. Integration with Outlook is coming, though Claude for Outlook is currently a separate product.
A plugin to Chrome allows Codex to use Chrome for browser tasks that require you to be logged in—for example, reading email.
Firecrawl is an API that agents can use to interact with websites in a human way. It enables agents to search for the latest data, interact with the site, and return the results at scale.
Drew Breunig’s “10 Lessons for Agentic Coding” is an invaluable list of tips, including “Implement to learn.” Letting an agent write all the code is easy, but when you really need to learn something, write it by hand first.
Deepclaude configures Claude’s autonomous agent loop to use DeepSeek V4 Pro rather than one of Anthropic’s models. It’s a good way to save (DeepSeek costs much less per token) and experiment with open models. (Fair warning: The name deepclaude may change.)
OpenAI has announced Codex for Work, an assistant that’s designed for office work rather than software development.
Kanwas is a new tool for sharing context across agents. It can be used by workgroups to collaborate on projects.
Mike is an open source AI trained for legal work and designed to run locally.
GitHub is transitioning to usage-based billing for Copilot.
OpenAI and Qualcomm are reportedly working on a phone where the user interface is an agent. There won’t be any apps; the agent will do everything.

Infrastructure and Operations

The infrastructure questions of the moment are whether agents can transact and deploy without humans, and whether the platforms that host open source can stay reliable enough to keep that work going. Watch for GitHub alternatives to become competitive. And watch AI Together, a cloud company that hosts hundreds of open source models.

TokenTuner helps control AI costs by identifying where companies can use lower-cost models productively. It attempts to match token usage to business outcomes, and evaluates individuals and teams on how effectively they use their token budget.
In partnership with Stripe, Cloudflare now has an agent that can create a new account, start a subscription, register a domain name with DNS, and deploy an application without human intervention aside from granting permission.
Stripe and Tempo have released the Machine Payments Protocol (MPP), and iWallet has laid out a roadmap for the Autonomous Settlement Protocol (ASP). These new protocols are designed to facilitate machine-to-machine transactions, transactions that have to be designed without a human in the loop.
The Inference Era is when inference, rather than training, drives AI usage, cost, and infrastructure. GPUs remain important, but the relative demand for CPUs increases.
GitHub is in danger of losing its place at the center of the open source ecosystem. Problems with uptime are causing projects to find homes elsewhere—most recently, Ghostty.
Together AI operates a cloud AI platform that’s designed specifically for inference rather than training and that provides API access to over 200 open weight models. As AI use increases, the ability to run models and provide answers efficiently becomes more important than the ability to train new models.

Security

The patch window is shrinking to zero, and the attacker’s toolkit and the defender’s toolkit now include the same AI models. Any vulnerability disclosed today is being exploited tonight. The good news is that defenders running these tools at scale can close gaps faster than ever; the bad news is that the race never ends.

FROST is a new technology for surreptitiously discovering what websites a user is visiting. It’s based on measuring the I/O operations on the user’s SSD. FROST requires no interaction from the user and runs entirely in the browser.
Regrettably, neither arcane prompt injection attacks nor cryptocurrency scams are news. But it warms a ham radio enthusiast’s heart to see Morse code used in a prompt injection to scam a crypto trading bot.
TeamPCP, a cybercriminal collective, has attacked GitHub by installing a poisoned extension to VS Code. GitHub announced that nearly 4,000 repositories have been compromised, all belonging to GitHub itself; no customer repositories have become victims. But anyone who installs corrupted code from GitHub’s own repositories is vulnerable.
No Security Meter for AI provides an excellent look into the state of AI security.
Cloudflare’s report on Project Glasswing and Claude Mythos is worth reading. Mythos is especially noteworthy for its ability to chain vulnerabilities. In real life, few vulnerabilities are exploitable on their own; they become vulnerable when they are used in combination with others.
Daniel Stenberg reports that Mythos found five potential vulnerabilities in curl, of which one was legitimate. The low count isn’t surprising, given the quality of the curl team’s work. What’s significant is that Mythos was able to find a legitimate vulnerability in software that had been thoroughly audited by humans, traditional tools, and AI.
Who showed up? A security researcher ran a honeypot with port 22 open for 54 days, and logged every attempt to log in: 269,000 connection attempts from 7,556 unique IP addresses.
GitHub’s dependency scanning service for its MCP server is now in public preview. It checks code changes for vulnerable dependencies before committing code or opening a pull request.
Copy.fail is a recently discovered Linux kernel vulnerability that allows unprivileged processes to escalate privileges, and it was exploited within a day of its release. Unlike most vulnerabilities, running infected programs in a container does not offer protection. The time from release of a zero-day to exploitation in the wild is indeed shrinking.
OpenAI’s Advanced Account Security requires a physical key or passkey for access; there are no passwords. Hardware keys are provided by Yubico or a compatible hardware token.
GPT-5.5 Cyber is a version of GPT-5.5 that has been trained as a security tool. As Anthropic did with Mythos, OpenAI is limiting access to a small group of trusted users.
The Firefox team has used Claude Mythos to find 271 previously unknown vulnerabilities in Firefox. While this finding is terrifying, they conclude that defenders now have the advantage. Once you know the vulnerabilities, it’s possible to close the gap between defenders and attackers.
Claude Code can leak credentials and other secrets to public repos and package registries. When you select “allow always” for a specific command, the command and its credentials are stored in a subdirectory of .claude. This directory can inadvertently be incorporated into a package.

Policy and Governance

The ArXiv preprint repository has clarified its code of conduct for AI users. Submitters are responsible for their papers and will be banned for a year if they submit papers that use AI-generated content inappropriately. This includes hallucinated content, references, and plagiarism.
Look to China for new approaches to data governance. China is treating data as a national resource and building the infrastructure for a data economy.

Web

At its I/O conference, Google announced that traditional search will be replaced by AI search, powered by Gemini 3.5 Flash. Both AI search and traditional search (which is really AI-powered) have proven useful. What happens when you eliminate one of the options?
Linux running in a PDF? The PDF format supports JavaScript, and C can be compiled to JavaScript.

Biology

Colossal Biosciences has developed a 3D-printed artificial eggshell that’s capable of raising chicks from embryos.
Brazil has invested heavily in vaccines and has created a single-shot vaccine against Dengue fever. The country is striving for “medical sovereignty,” a concept that’s clearly related to data sovereignty and AI sovereignty.

AI Sovereignty and the Architecture of Participation

Tim O’Reilly — Mon, 01 Jun 2026 16:05:58 +0000

Adam Tooze recently shared a piece from The Economist about Brazil’s push for what it calls “medical sovereignty,” the determination to make its own vaccines and the active ingredients that go into its medicines rather than depend on supply chains it doesn’t control. Brazil already produces a large share of its own medicines through public institutions like Fiocruz and Butantan, but a lot of the underlying inputs still come from abroad, and the pandemic made clear the cost of that dependence. So the country is trying to build the capacity to make the things it most needs to survive. The economist behind a lot of this thinking is Mariana Mazzucato, whose mission-oriented approach treats public procurement as a tool to build national capacity rather than just buy finished goods. (Foreign Policy has a good overview.)

I think we’re going to see a lot more of this, and not only in medicine. The same impulse is driving the quest for sovereign AI, as countries decide they don’t want their access to a foundational technology to run through a handful of American or Chinese companies. You can see it too in Europe’s and Japan’s new willingness to take responsibility for their own military destiny rather than assume the United States will always be there.

Most commentators describe all of this as decoupling, the unwinding of a connected world. That reading is too narrow.

Free trade was an architecture of participation that broke

Much like open source software and the World Wide Web, free trade was supposed to have what I call “an architecture of participation.” The most important thing about the web and open source wasn’t openness for its own sake. It was that there were no central gatekeepers. Anyone could add to the richness of the system without asking permission as long as they followed the rules of the communication protocols that allowed independently-developed pieces to work together. In addition, value circulated among the participants instead of being extracted to a center, and the system got better the more people used it. That is a very different thing from a system that is merely large and connected.

Free trade was also supposed to work like that. The theory, going back to Smith and Ricardo, was that specialization and exchange would make everyone better off, and that the connections would be mutual. What we actually got over the past few decades looks more like the platform dominance we see in big tech than the original vision of a commons built around shared exchange. A handful of large and powerful countries and firms set the terms and the smaller players are forced to take what is on offer. Despite the language of free trade, the experience for many countries was closer to colonialism, just with a new narrative.

Overall, under the neoliberal order (whose reign, as Gary Gerstle explains, is now ending), free trade became far less egalitarian, inclusive, and generative than it could have been. Less powerful countries ended up in roughly the position that small businesses occupy on Amazon, or developers occupy on the app stores: free to participate, on terms they don’t control, with much of the value they create flowing back to the hub.

Brazil’s response (and that of many others) should not be seen as a retreat from the world. It is a refusal to be participate only as a buyer, or as a source of raw materials.

That’s why decoupling is the wrong word. Decoupling means cutting the connections. What these countries seem to want is to stay connected but to build real capacity of their own, so that no single supplier can switch them off. That’s closer to federation than to separation. A federated system is still a system, and its nodes still interoperate. But no node is wholly at the mercy of another, and value circulates among them rather than collecting at the center. A trading order in which the gains pool at a few hubs is brittle and eventually illegitimate, in the same way that a platform economy that strip-mines its participants eventually provokes regulation and revolt.

I put the increasingly visible quest for sovereign AI, and the role of open source models and open source agentic protocols and harnesses in enabling that sovereignty, into the same bucket. I remember back in the early days of open source software when Michael Tiemann, whose pioneering open source company Cygnus Solutions had just been acquired by Red Hat, told me “What we really sell at Red Hat is control. The ability to control your own destiny.”

As companies are increasingly at the mercy of unexpected token pricing changes by the big centralized players, this same quest for sovereignty is playing out at the level of organizations. Open source AI, including not just open source and open weight models but open agentic protocols, agentic harnesses, and portable memory, are increasingly an essential part of the sovereignty toolkit.

The national technology sovereignty movements should take a lesson from the open source movement. The heart of open source is its architecture of participation. It is a force for innovation and value creation to the extent that it frees up the ability of people to solve their own problems and contribute their solutions to a low-friction global commons.

Is capture the inevitable fate of any architecture of participation?

The pattern of open architectures leading to a wave of innovation, winners emerging, consolidating their power and then turning to the dark side seems to be a natural part of the technology cycle. The web broke Microsoft’s dominance over the personal computer software ecosystem only to give rise to a new generation of gatekeepers. Cory Doctorow called this cycle “enshittification.” I’ve told my own version of that story using the language of economics in “Rising Tide Rents and Robber Baron Rents.”

The instinct after capture is to try to rebuild the thing that got captured, only this time with better rules. Mastodon and Bluesky tried to rebuild Twitter’s social layer with cleaner governance, and neither has succeeded at the scale they hoped for. Critics might say that it was because Mastodon stayed pure and never made itself easy enough to use, while Bluesky looked federated without really being so. But more importantly, reinventing what we used to have, or what we think we used to have, is rarely the path forward. You have to build something new.

Each country building its own answer to the latest frontier models is the Mastodon move. The winning move is to operate at a layer the centralized model structurally can’t reach. Open agent protocols that let services from different providers interoperate (the work that MCP and the emerging agent stack are beginning to do) are one such layer. AI accountable to local democratic and legal institutions is another such layer. Domain-specific AI built around problems the global market won’t serve (the tropical disease vaccine analogue) is another. None of these is a smaller copy of what the hyperscalers offer. But there’s one more important layer to consider: infrastructure.

Where are the servers?

Ilan Strauss made a useful point in our conversation about these ideas. Ilan noted that AI is one of the most global forms of capital we’ve ever built, trained on the whole of the internet and runnable more or less anywhere, and the sovereignty rhetoric is partly an attempt to give something inherently placeless a place. The technology wants to be everywhere at once. The people who live with its consequences want some say over it where they are.

The placelessness of AI is only half of the truth, though. The other half is that AI is physically place-bound. The model weights are placeless. The data centers, the chips, the electrical grid, and the water for cooling are very much somewhere.

The comparison with Brazil’s medical sovereignty reinforces this point. Brazil’s challenge isn’t to invent new drugs to compete with Pfizer, but to build the capacity to manufacture existing vaccines, and eventually to build the capacity to invent vaccines for diseases the West ignores. Fiocruz and Butantan matter not because they hold patents but because they are physical institutional capacity rooted in Brazilian soil: the labs, the cold chains, the regulatory capacity, the trained workforce, and access to the active pharmaceutical ingredients. That’s what medical sovereignty really means in practice. It is infrastructure plus the institutions that run it.

The same is becoming true for AI. Open weights matter. They’re closer, though, to the patent than to the lab. Even if Qwen, Kimi, DeepSeek, Llama, Gemma, Granite, and whatever comes next are fully open, running them at scale requires data centers that cost tens of billions to build, chips whose supply chains a handful of countries control, and electricity grids that have to be expanded substantially to carry the load. The countries pursuing sovereign AI seriously seem to understand this. The EU’s AI Gigafactories program, India’s IndiaAI mission, the Gulf compute buildouts, the Singapore and Japan strategies, are all infrastructure plays first and model plays second.

Infrastructure is the layer where capture is hardest to undo. You can distill or fine tune a model far more easily than you can build a new continent’s worth of data centers or conjure the necessary electricity from a fragile power grid. If the architecture of participation for AI is defined only at the model layer, the infrastructure layer below will quietly recapture, over years, everything that was won above. Open weights running on three companies’ servers is not sovereignty.

Building physical infrastructure capable of carrying a generation’s worth of economic activity is exactly the kind of mission the public sector used to take on, before we convinced ourselves the market would handle it. Mazzucato’s argument is that public procurement and public capacity-building are the real engines of foundational technology. AI sovereignty without industrial policy is wishful thinking.

Industrial policy should aim to reinvent 20th century infrastructure, not just copy it. Can we use the enormous rebuild of infrastructure for the AI era to leapfrog the past? The analogy with centralized power grids and decentralized solar reminds us that local control does not have to be a localized version of the hyperscaler pattern. Might we envision a future where there is an intelligence grid that seamlessly uses frontier models in massive data centers and local models controlled by the user as dictated by considerations like cost, privacy, specialized knowledge, and user preferences? Creating the software to manage such an interoperable intelligence grid should be a high priority for the AI open source community. We need an orchestrator not just for agents but also for models and even for data center capacity.

Could federated AI give us a new pattern for the economy?

In a previous piece about AI and markets, “The Third Artificial Intelligence” I picked up Richard Danzig’s argument that markets and the bureaucracies that underpin nation states are themselves artificial intelligences, information-processing mechanisms older than the machine kind. The question with all three is who designs and builds them, what they optimize for, and what feedback loops govern them.

We’re about to spend a lot of effort working out how AI should be organized both across nations and across organizations, whether it concentrates in a few firms and a few countries or whether it can be built as something more federated, where smaller players have genuine capacity and the value they create flows back to them. The choices we are now making about how AI is organized, at the model layer, the protocol layer, and the infrastructure layer, are also choices about how economic activity will be organized for at least a generation. If we manage to get that architecture right for AI, it may give us a working pattern for the thing we’ve so far failed to get right for trade. If we get it wrong, we’ll most likely reproduce, at the level of intelligence itself, the same concentration that free trade has produced in goods and the existing internet platforms produced online.

The technology wants to be everywhere at once. The people who live with its consequences want some say over it where they are. The infrastructure that resolves that tension will be a federation of models, a federation of protocols and code, and a federation of capacity. We need an architecture of participation all the way down the stack, and all the way up.

The final section of this piece benefited greatly from questions and comments raised by Ilan Strauss and Mike Loukides, as well as from previous conversations with Richard Danzig.

SaaS Is Not Dead Yet

Mike Loukides — Mon, 01 Jun 2026 11:01:35 +0000

With the rise of agents, many people have been proclaiming that the age of software as a service (SaaS) is over. Who needs to subscribe to a service when you can create your own software with a few English-language prompts and a few dollars spent on tokens? Your own software, most likely a skill that runs in an agent, will have exactly the features you want: no more, no less.

But whenever someone talks about the death of SaaS, there’s something wrong with the picture. It’s simply that work is about groups and teams, and so far, programming with agents is about individuals. A related challenge is that SaaS companies are good at building dashboards and generating reports for humans, but agents need the raw data, not a representation of the data.

Think about the teamwork required for a good sales team. Someone needs a database to keep track of their customer info. It’s easy to get Claude, Gemini, or GPT to build that, using SQLite for a backend and putting a reasonable web frontend on it. You could also do that fairly quickly with Ruby on Rails, but AI makes it even easier. But what about the salesperson at the next desk? She needs similar CRM software, and she can create it with Claude, Gemini, or GPT. No problem. But it won’t be exactly the same; it will reflect her needs and preferences. Soon you have a team of salespeople in which everyone has their own personal CRM. They’re all similar, but slightly different. They may use different backends (Filemaker, SQLite, MySQL, or maybe a corporate Oracle instance); they have similar-but-slightly-different schemas (one has a single field for customer address, another has separate street, city, state, and country fields); and they don’t interoperate.

That’s the simplest possible case. How do you generate company-wide reports if everyone has their own version of the data? How do you know if you’re succeeding or failing if everyone on the team has their own version of the metrics? Everyone has become their own silo.

The company is not paying subscription fees to a vendor like Salesforce, but is this really progress? If anything, we need to make sharing data and metrics easier, not more difficult. On top of that, a product like Salesforce has hundreds of features. Most people don’t need most of them, but there’s a good chance that almost everyone needs one feature that nobody else needs. And there’s always the features you don’t know you need, ways to get value from data that you haven’t thought of. There’s value in buying a bundle that goes beyond your immediate requirements.

There’s certainly a lot good about enabling people to develop their own tools. I guarantee that if we had Claude Code 30 years ago, I would have vibe-coded my own skills for managing the authors I was working with. I would have vibe-coded some of the crazy tools I wrote to translate from one document format to another. (WordPerfect to troff? Why?) Now that we have agentic programming, I may never write my own tools again. But the SaaS scenario highlights something missing from the agentic picture. We don’t have tools for sharing or collaboration. Nobody buys a Salesforce subscription for themselves. It’s a departmental or corporate resource, shared between many people. And the ability to share easily is precisely what agentic programming lacks. I’ve built some of my own Claude tools and skills, but it’s very difficult to share them with other people at O’Reilly. ChatGPT Skills for Business and Enterprise hints at the ability to share skills among team members and some ability to generate them collaboratively, though it’s hard to find evidence that it delivers. I think we’re seeing a symptom of technological overreach. It’s easy to assume something is “easy” when it isn’t: “You just generate a .md file and put it in the corporate GitHub.” That process has a lot of friction, particularly for users who aren’t technical.

To make skills really useful across a company, we need:

Sharing. This can be a Git server that’s registered as a private marketplace and then configured via a corporate administrative dashboard. Publishing skills to the marketplace would remain the province of Git-aware users, and that’s a problem.
Requirements. We don’t want everyone to build a personal toolset; that’s the problem we’re trying to solve. How do you resolve differences between users who want slightly different things? What does the PRD for a skill look like?
Collaboration. Aside from Google Docs, the current state of widely used collaboration tools is poor. Suffice it to say that working on different branches of a Git repo and merging changes may work for professional programmers, but not for anyone else.
Testing. Tests and evals for agents (related, but not the same) are topics that we don’t yet understand well. But if you’re going to empower users to use and create agentic tools for creating projections and writing reports, you need to know they won’t backfire. Skills also behave like any other AI application: They drift over time. Even after they’re published, they need to be evaluated regularly to see if they still perform correctly.
Versioning. Like any software—and we need to recognize that agentic tools and skills are software, even if they’re written in English—it will be important to update them as requirements change and as LLM behavior drifts. It’s important to keep track of versions and for users to update their skills to the latest version easily. Again, this is a matter of wrapping Git appropriately for nontechnical users.
Security. Security for intelligent agents is still poorly understood. We know about prompt injection, but we also know that it’s a problem that can’t be solved yet. And attackers are still finding novel ways to inject malicious prompts. What vulnerabilities might agentic skills and tools have if they can access corporate data?

While the democratization of programming doesn’t threaten SaaS companies, intelligent agents pose a deeper challenge. In “The Salesforce of Agents Won’t Be Salesforce, the Google of Agents Won’t Be Google,” Jesus Rodriguez points out that the future for services like Salesforce and Google isn’t web UIs and dashboards; it’s APIs that are designed for agents. These APIs require a different kind of data: not something that a human can glance at to get a quick feel for what’s happening, but “structured state, task objectives, relationship graphs, permissioned memory, machine-readable sales playbooks, and reliable APIs for updating intent.” Humans need the data compression that you get from a dashboard. Agents want the data itself, and they’ll take care of the compression. SaaS companies can become the system of record that is responsible for delivering accurate data. What they need to recognize is that their real customer may not be a human user; the customer will be an agent, and that will affect everything from marketing strategy and product design to pricing.

I wouldn’t claim that Salesforce or Google can’t or won’t build APIs to help companies access their own data. SaaS remains relevant, but it’s a different kind of SaaS than we have now. Companies like Salesforce know what data is available and how to work with it. Designing and building the data infrastructure that’s needed to provide next-generation SaaS isn’t trivial, and doing the programming in English rather than C++ doesn’t make it easier. Companies like Salesforce and Google know what needs to be built. They’re likely to offer their own collections of agentic skills as a starting point, alongside APIs. But large, established companies are ripe to be blindsided if they move slowly—and it’s difficult for large institutions to move quickly.

SaaS companies have momentum—or inertia, which to a physicist is the same thing. They have to change, but they aren’t threatened by AI, agents, and user-defined skills. Providing APIs that have been designed to provide data in formats that machines can use should be an obvious next step. If they die, it will be because they don’t adapt. But there’s nothing new about that.

Open Source Ecosystems

Ilan Strauss — Fri, 29 May 2026 11:00:08 +0000

The following article originally appeared on the Asimov’s Addendum Substack and is being reposted here with the author’s permission.

Bill Gurley has an excellent article on what he calls open source strategy, which we recommend reading. There is a lot to debate about his concluding argument in particular: that open-weight models are central to keeping the AI market rent-free. The limits of open-weight AI as the primary open source strategy are surely considerable though, if it still requires expensive hardware to run on, and if the architecture ultimately remains monolithic—rather than composable and protocol-centric.

A related consideration comes from Anthropic’s recent acquisition of Stainless—a startup that generates SDKs, command-line tools, and MCP servers from API specifications. This illustrates that open protocols like MCP, even when publicly governed,¹ remain exposed at their complementary layers to private actors capturing rents. (Protocol openness does not eliminate this and instead probably enables it, by enabling market growth).

We asked Claude to analyze this acquisition, going beyond the press releases. Its first pass overstated parts of the competitive-denial story; what follows is what survived it taking a closer look:

Complement capture, not protocol capture. MCP—the standard that lets AI agents talk to other software—remains open, and its governance has been handed to an independent foundation. What Anthropic bought is the company that turned that standard into something most developers could actually use. Stainless was the dominant tool for taking an ordinary business API (say, a hotel booking system or a customer database) and converting it into something an AI agent could call through MCP. The open standard is still open. The path most developers walked to use it has now been bought.
This isn’t a one-off—the whole layer is consolidating. Stainless wasn’t alone in this market. Its main competitor, Fern, was bought by Postman in January 2026. Anthropic bought Stainless four months later, in May 2026. That leaves Speakeasy as the only major independent player, plus an open-source fallback called OpenAPI Generator that most developers consider too rough for production use without significant manual work. In under five months, two of the three serious companies in this part of the market have been absorbed into larger platforms. The Stainless deal is more visible because of who bought it and why, but the broader pattern matters more: an entire layer of AI infrastructure is being pulled inside platform owners.
Moat migration. The gap in raw model capability between Anthropic, OpenAI, and Google has narrowed considerably and continues to close, and the implication is that model quality alone is unlikely to be the principal basis of competitive advantage over the next two years. What may distinguish the leading firms instead is the quality of the developer experience around their models: how easily a business or an engineer can build something useful on top of a given model, how cleanly the tooling integrates with existing systems, and how reliable the connectors are over time.

Stainless was founded by Alex Rattray, formerly of Stripe. Stripe built its market position largely on unusually well-designed developer tools, and Stainless was, in effect, an attempt to apply the same approach to the layer between AI APIs and the rest of the software economy. Anthropic has acquired the team that knows how to do this.

Pricing logic, with caveats on denial. Stainless was last valued at $150M in December 2025; at >$300M five months later, this is a roughly 2x strategic markup, not acqui-hire arithmetic. Removing a critical-path external dependency on Anthropic’s own SDKs, while denying it to a tight set of competitors, is rational at that price—but the denial logic is partial. Speakeasy is a viable substitute, and OpenAI was reportedly already migrating off Stainless. The friction tax falls hardest on smaller players who lack the engineering bench to absorb migration cost.

…The press release calls it “extending reach”; the InfoWorld read—“last-mile developer experience”—is closer, but the complement-capture component, even if partial, is real.

-*-

Now, while Claude might be overstating some of the market risks associated with this acquisition (you tell us?), it shows that open source’s impacts are highly conditional on its dependencies and should never be analyzed in isolation from the market’s software stack and architecture. This is equally true for open weight models—being dependent on data, compute, and distribution—as it is for open protocols like MCP, dependent on constant API translations and access. Tracking those interdependencies is what a full ecosystem view involves and is helpful to undertake in order to consider where chokepoints might arise, and in turn, where open source strategy might eventually fail or be captured.

Footnotes

In this case by the Agentic AI Foundation under the Linux Foundation ︎

Your AI Agent Already Forgot Half of What You Told It

Andrew Stellman — Thu, 28 May 2026 10:59:36 +0000

This is the seventh article in a series on agentic engineering and AI-driven development. Read part one here, part two here, part three here, part four here, part five here, and part six here.

This is the latest article in my Radar series on AI-driven development and agentic engineering, and I have to admit that this one took a bit of a turn I wasn’t expecting.

In my last article I talked about context and context management and I promised to give you some real practical tips for using it. It was originally meant to be about specific, practical context management techniques that were really helpful to me building Octobatch and the Quality Playbook, two open source projects where I work with AIs to plan and orchestrate all of the work and every line of code is written by AI tools like Claude Code and Cursor.

But as I was writing this, I found that I’d adapted those same techniques to my work writing articles like this one. Which is surprising! I’ve been doing all this work finding ways to help people developing AI skills improve context management, so their skills run more efficiently. It turns out that those same exact techniques apply to anyone using AI tools, even when you’re using chatbots like Claude.ai or ChatGPT.

Full disclosure: I use multiple AI tools to manage this article series. My primary tools are Claude Cowork for brainstorming and managing my article research, notes, and backlog and Gemini’s mobile app for reading drafts aloud and taking my notes while I’m away from my desk. And I want to tell you about something that happened while I was using those tools, because I think it really helps show why context management isn’t just a problem for developers.

While I was writing this article, I was using Gemini’s mobile app to read the draft aloud and take my notes. Partway through the session I asked it to go back and check whether there were earlier notes it hadn’t incorporated yet. It told me it didn’t have access to the previous notes, which seemed weird and insane, since we had just taken those notes a few prompts earlier in the session. I could scroll back up and see them earlier in the conversation, but somehow it didn’t “know” about them.

Here’s what happened. Gemini had compacted our conversation without telling me, and the notes from the first half of the session were just… gone.

If you’ve ever had a web chat AI just seem to forget things you talked about earlier, you’ve experienced context compaction, just like I did. Understanding even the basics of context and context windows can make a big difference in preventing that kind of frustration.

This all reminded me of something I wrote more than two decades ago in Applied Software Project Management (back in 2005!): “Important information is discovered during the discussion that the team will need to refer back to during the development process, and if that information is not written down, the team will have to have the discussion all over again.”

Jenny Greene and I wrote that about human teams and project meetings, but it applies to AI sessions just as well.

Which brings me back to context, which I wrote about in my last article, and which I’ll write more about in the next one, because it’s one of the most important concepts to keep top of mind when working with AI.

Context loss may be invisible, but that doesn’t make it any less frustrating

Context is everything the AI is holding in its working memory during a conversation: what you’ve told it, what it’s told you, any files or instructions it’s read, and whatever internal notes the system has made along the way. All of that lives in a fixed-size context window—think of that as your AI’s short-term memory, the stuff it’s thinking about right now—and when the window fills up, the AI has to start letting things go. Different tools handle this differently: Some truncate older messages, some compress the conversation into a summary (which means details get lost even though the summary looks complete), and some just start behaving inconsistently so you can’t tell whether the AI forgot something or never understood it in the first place. The result is the same: The AI loses track of things you told it, decisions you made together, or details it noticed earlier in the session. And it won’t tell you it forgot. It’ll just keep generating confident-sounding output based on whatever it still has.

Before we dive in a little deeper, I want to do a quick jargon check. If you’ve seen the terms “skills” and “agents” floating around but aren’t sure what they are, think of skills as libraries for AIs and agents as interactive executables. Those aren’t perfectly precise definitions, but if you’re a developer they’re close enough for this discussion.

When you’re coding skills and agents, you run into context problems quickly. The work you’re asking the AI to do is often complex enough that the context window fills up, and the AI has to start compacting: compressing or dropping older parts of the conversation to make room for new ones. Compaction always seems to happen at the most frustrating and inconvenient time, which makes sense when you think about it. You hit context limits precisely when you’ve put the most information into the conversation, which is exactly when losing that information costs you the most.

That’s why I think it can often help to think of AIs as having the same shortcomings that human teams do, except those shortcomings are exaggerated by their AI nature. A person who forgets something from a meeting last week might remember it when you remind them. An AI that lost something to context compaction won’t, because the information is gone. But there’s something you can do about it, and it turns out the techniques that help are the same whether you’re building autonomous AI skills or just trying to get a chatbot to remember what you told it 20 minutes ago.

I’ve landed on four techniques that I come back to over and over again. Each one exists because at some point the AI forgot something important and I responded by putting that thing in a file where it couldn’t be forgotten. None of them require special tooling. And to my surprise, all of these techniques have turned out to be useful for both building software and managing a writing project like this one, whether I’m chatting with Claude, ChatGPT, or Gemini, or using a desktop tool like Claude Cowork or Codex. These are the techniques I find most valuable:

Split discovery from documentation: Don’t ask the AI to figure something out and produce polished output in the same pass.
Use handoff documents, not continuation prompts: Before closing a stale session, have the AI write down everything the next session needs to know.
Give the AI an acceptance criterion, not a procedure: Tell it what “done” looks like instead of spelling out the steps.
Use spec documents as the bridge between AI tools: Make a shared document the single source of truth that all your tools read from.

Split discovery from documentation

When you ask an AI to do something complex, you’re often asking it to do two things at once without realizing it. You’re asking it to figure something out and produce polished output at the same time. The problem is that figuring things out takes attention, and producing output takes attention, and the model only has so much of it. When you combine both tasks in the same prompt, the model starts cutting corners on one of them, and you can’t tell which one it shortchanged.

I ran into this with the Quality Playbook, an open source AI coding skill I built that runs structured code reviews against any codebase. One of the things it does is derive requirements from source code: It reads through the code, identifies what the code promises to do (I call these behavioral contracts), and then produces a requirements document. Originally this all happened in a single pass. The problem was that single-pass requirement generation ran out of attention after about 70 requirements. The model forgot behavioral contracts it had noticed earlier in the code, and the forgetting was completely invisible. There was no stack trace or error message, just incomplete output and no way to know what was missing. I fixed it by splitting the work into two separate prompts:

Read each source file and write down every behavioral contract you observe as a simple list in CONTRACTS.md.

Read CONTRACTS.md and the documentation, then derive requirements from them and write REQUIREMENTS.md.

Then a third pass checks whether every contract has a corresponding requirement, and if there are gaps, goes back to step one for the files with gaps.

The key idea is that CONTRACTS.md is external memory. When the model “forgets” about a behavioral contract it noticed earlier, that forgetting is normally invisible. With a contracts file, every observation is written down before any requirements work begins, so an uncovered contract is a visible, greppable gap. You can see what was forgotten and fix it.

The principle: Don’t ask the AI to figure out what exists and write formatted output in the same pass. The model runs out of attention trying to do both at once. Whenever you’re asking an AI to do something complex, consider whether you’re actually asking it to do two things at once. “Analyze this codebase and write a report” is two tasks. “Read this document and suggest improvements” is two tasks. Split them, and let the first pass write its observations to a file before the second pass starts working with them.

Use handoff documents, not continuation prompts

Anyone who’s spent a long session with an AI coding tool has felt the moment when the context starts to go stale. The AI stops tracking details it was handling fine an hour ago, or it contradicts something it said earlier. The session gets slow, and you’re often restarting because the AI seems to have gotten bogged down and filled up on what you told it. You get the sense that if you keep going, you’re going to spend more time correcting it than making progress.

Most developers respond to their session getting too long in one of two ways: They push through the problem, or they start a fresh one and try to reexplain everything from scratch. Both of those approaches can cause the AI to lose context. The first loses it to compaction; the second loses it to incomplete reexplanation. And both are frustrating! Specifically because you just spent so much time building up all that context with the AI.

There’s a third option. Before you close the session, ask the AI to write a handoff document: a file that captures everything the next session needs to know, written while the current session still has full context. The key is that you’re asking the AI to write this while the relevant details are still fresh in the working context, and in a way that it or another AI can read.

I built this into the Quality Playbook as a core part of how phases communicate. When I split the playbook from a single prompt to independent phases, I needed each phase to run as a completely independent session with no context carryover. So each phase got its own kickoff prompt as a standalone file. Here’s the structure each one follows:

Write a handoff document that a fresh session could use to pick up this work cold. Include everything it would need to know.

Every kickoff opens with what prior phases accomplished, includes explicit boundaries about what’s frozen, and names which future phase owns each piece of remaining work, because without it the AI will helpfully start doing Phase 3 work while you’re still in Phase 2. Each phase also ends with a required forward-looking handoff where the completing agent writes down what the next session needs to know.

The principle: Each handoff is a complete state snapshot. The incoming AI agent never needs to read prior kickoff prompts or chat history. Everything it needs is in the current handoff file: current state, uncommitted changes, immediate next task, pending tasks, file locations, and anything that was discovered during the prior session. A fresh AI session can pick it up cold.

If you’re deep into a Claude Code or Copilot session and you can feel the context getting stale, ask the AI to write a handoff document before you close the session. Tell it to include everything a fresh session would need to continue the work. Then start a new session and point it at that file. A fresh session with a good handoff document will usually outperform a stale session, because it’s starting with clean context instead of compacted, fragmented context.

Give the AI an acceptance criterion, not a procedure

When you give an AI a multistep task, the natural instinct is to spell out the steps. First do this, then do that, then combine the results. The problem is that step-by-step procedures are the first thing the AI forgets when the context window fills up. It’ll skip steps, merge phases, or quietly drop tasks, and there’s nothing in the procedure itself that would help the AI notice what it missed. The procedure tells the AI what to do, but it doesn’t tell the AI what “done” looks like.

I learned this the hard way with the Quality Playbook. The playbook runs multiple iteration passes over a codebase, and the results need to be cumulative. It keeps a list of all the bugs it finds in the code being tested in a file called BUGS.md. Early on, I gave the AI a procedure to run four times and then update that file:

First run the main pass, then run four iteration passes, then merge the findings into BUGS.md.

The AI did not respond well to that instruction.

It turns out that when you ask an AI to do a very complex task a specific number of times, it can lose count. In fact, from my experimentation, it seems that count is one of the first casualties of context compaction. Most of the time the AI decided three iterations was enough, or merged findings from only two passes, and no matter how many different ways I tried to rephrase that instruction, there was nothing I could come up with that prevented the problem.

However, everything changed when I replaced the “run four times” instruction with an acceptance criterion, or a specific condition that tells the AI when to stop looping:

You are done only when BUGS.md contains the cumulative findings from the main run plus all four itration passes.

Even when the AI lost track of intermediate steps, it could check the output against the criterion and know whether it was finished. And I could verify the output against the same criterion, which gave me a way to audit the agent’s work without watching every step.

In developer terms, the AI is really bad at loops like for (i = 0; i < 4; i++) because it loses track of the value of the iterator i when it compacts its context. But it’s really good at loops like while (!done) because it can check done based on the current state without relying on history.

The principle behind all this is that an acceptance criterion survives context pressure because the AI can always check “Am I done?” against a concrete test. This is actually the same principle behind test-driven development: write the test before the code so you know when you’re done. The acceptance criterion is the test for your AI session. When you’re giving an AI a task that has multiple steps, don’t describe the steps. Describe what “done” looks like, and let the AI figure out how to get there.

Use spec documents as the bridge between AI tools

Most developers working with AI don’t use just one tool. You might use Claude for design, Cursor for coding, and Copilot for quick edits. You might even use multiple models inside the same tool, like GPT-5.5 and Opus 4.7 in separate Copilot chats inside VS Code. It’s common to have one model for coding, another for review, and a third for orchestration and project management. The problem is that none of these tools or chats know what you told the others. Claude doesn’t know what you decided with Cursor. Two separate Copilot chats in the same editor don’t share context. You’re the one carrying context between them, and that’s exactly the kind of lossy handoff that causes drift. A design decision you made in one conversation gets lost or distorted by the time it reaches the tool that needs to implement it.

The fix is to make the spec document the single source of truth that all your AI tools read from. I used this when building a game prototype, where I had Claude handling design and planning and Cursor doing the coding. They never talked to each other directly, so the spec documents served as the shared contract: Claude wrote the specs, and Cursor read them. The rule I followed was simple:

Never tell the AI coder something that isn’t already in the specs. If you make a design decision in conversation, write it into the spec first, then point the coder at the spec.

If I made a design decision in a conversation with Claude, that decision had to be written into the spec before I told Cursor about it. If I discovered something during implementation, I wrote it into the appropriate doc first, then pointed the coder at it. The spec was always the single source of truth. When Claude and I changed the wound topology (removing one wound type, promoting another), we updated the docs first, then told Cursor to reread them. When we decided to add a new UI element, we wrote it into the UI spec first, then told Cursor to reread the doc.

The key was including rationale in the specs. Not just “show 5 progressive labels” but why: “The player shouldn’t be told what they’re fighting. They should discover it.” This helps the AI coder make better decisions when the spec doesn’t cover an edge case because it knows the intent behind the requirement.

The principle: The spec document is the shared context that all your tools can read. It prevents the drift that happens when design intent lives only in chat history that the other tool can’t see. This technique works any time you’re using more than one AI tool on the same project, which at this point is most projects.

How these techniques combine: Managing this article series

Those four practices came out of AI-driven development work, but they apply to almost any AI work. And while these techniques emerged for me while working on agents and skills, I think it’s valuable to demonstrate them in a nondevelopment context, so I’ll share an example from my work on the article series you’re reading now.

Over time, the process for how my AI assistant and I manage this article backlog evolved organically in conversation, but it was never written down anywhere except in the AI’s context window. Which means every time the session compacted or I started a fresh chat, the process was gone and I had to reexplain it. I caught this when the AI did something slightly wrong and I wanted to confirm we were on the same page. So I asked:

Every time I suggest a new article idea, you add an entry to the backlog, and then create a new markdown file with the source material, right?

That’s split discovery from documentation. I didn’t say “document our process.” I said “confirm what we do.” Discovery first, then documentation as a separate step. If I’d said “write up our process” without confirming first, the AI might have written something plausible but wrong, and I wouldn’t have caught the discrepancy.

Once we’d confirmed the process, I asked the AI to create two files. AGENTS.md is an emerging standard for AI-readable project context—a single file that tells any AI session what it needs to know about a project. You can learn more about the convention at agents.md. CONTEXT.md serves a similar role as a bootstrapping document—it’s less established as a standard, but the practice of asking the AI to dump everything it knows into a context file so the next session can pick it up cold has been one of the most valuable habits I’ve developed. Here’s the prompt I used:

Update the backlog file to explain what it is and how we maintain it. Create a CONTEXT.md with everything you’d need to bootstrap a new chat. Create an AGENTS.md to make it easy to bootstrap with a single-line prompt.

That prompt is a handoff document. I was explicitly asking the AI to write down everything it knew while it still had full context, specifically because I knew that context would be lost to compaction. The CONTEXT.md file is a handoff from this session to whatever fresh session picks up the work next week.

Notice what I didn’t say. I didn’t give step-by-step instructions for what should go in those files. I said “everything you would need to bootstrap this process again in case we lost it” and “a complete dump of all of the context you would need to bootstrap a new chat and get it to the point where this current chat is.” Those are acceptance criteria, not procedures. The AI had to figure out what belonged in those files. If I’d given it a procedure (“first write the publication history, then the voice rules, then the file locations”), it would have followed the list and missed anything I forgot to include. The acceptance criterion is harder to satisfy but more robust: the test is “Could a fresh session bootstrap from these files alone?”

And the AGENTS.md file itself is a spec document as a bridge between tools. It’s the shared contract that any AI session, whether it’s Claude, Gemini, Cowork, or a fresh chat, can read to get aligned with the project. This session wrote it; the next session reads it. The two sessions never communicate directly, so the spec file bridges the gap between them.

That’s all four practices in two prompts, applied to something as ordinary as managing a writing project. It didn’t require pipelines or codebases or batch orchestration. The practices work because they solve the same underlying problem regardless of the domain: important information living in the AI’s context window instead of on disk.

Context management is a development skill

Every practice I’ve described in this article and the last one is something developers have always been told to do: write things down, record your rationale, be deliberate about what you save and what you let go, write ADRs and design docs and inline comments explaining nonobvious choices. We’ve always known we should do more of it. When you’re working with AI, the cost of not doing it becomes immediate and visible.

The practices in this article all come down to the same thing: putting the important information in files where compaction can’t touch it, so you can see what the AI knows and verify that it matches reality. In the next article, I’ll go deeper on the debugging angle: how to use externalized files to understand what your AI is actually doing, with practical techniques that work even if you’re not building agents but are just using a chatbot.

The Quality Playbook is open source and works with GitHub Copilot, Cursor, and Claude Code. It’s also available as part of awesome-copilot.

Disclosure: Aspects of the approach described in this article are the subject of US Provisional Patent Application No. 64/044,178, filed April 20, 2026 by the author. The open source Quality Playbook project (Apache 2.0) includes a patent grant to users of that project under the terms of the Apache 2.0 license.

Get a Good Return on Your AI Investments

Louise Corrigan — Wed, 27 May 2026 16:52:37 +0000

Last week, we had our first Infrastructure & Ops superstream of 2026, Platform Engineering in the Age of AI. Our speakers explored a range of topics focused on supporting new AI workloads, each with unique infrastructure needs, unpredictable costs, and novel security concerns. Google Cloud’s Abdel Sghiouar took the audience through what a good platform for AI looks like, Cockroach Labs’ Jordan Lewis shared lessons learned rolling out a corporate AI platform, Syntasso’s Daniel Bryant outlined a three-layer model for building a good platform, technology leader Sarah Wells discussed the importance of governance and how to make it more manageable, and Thoughtworks’ Ben O’Mahony explained why evals should be part of your observability story. You can watch the highlights here.

The event concluded with a fireside chat between Sam and Nathen Harvey, who leads the DORA team at Google Cloud. DORA has been tracking software delivery performance for over a decade, which means they’ve watched a lot of technology trends come through. Their center of gravity has always been the same question: How quickly and safely can a team move change into a running production application?

AI hasn’t changed that question, although it has made answering it a bit harder. DORA recently released its ROI of AI-Assisted Software Development report to show how AI is working for teams right now, and how that may or may not be contributing to organizations’ bottom lines. Nathen used the findings as a jumping-off point to dig into how AI is changing platform engineering and software development as a whole.

The productivity gap

Sam started by pointing out one of the biggest headline findings from DORA’S 2025 data: Organizations saw about 10% improvement in terms of actual code shipped to production systems. Even though developers likely felt that they were more productive, that doesn’t automatically carry through to production. DORA’s data shows higher throughput alongside higher instability. In other words, teams are shipping more but they’re also more frequently rolling back changes or implementing fixes. The gains at the individual level are real (and 10% is a pretty good number), but those gains aren’t “the dramatic improvements that you find in the headlines.”

AI amplifies good processes (and bad ones)

Nathen explained that AI is an amplifier and mirror that equally reflects the good and bad. On teams where shipping change is already easy, AI tends to keep things running well. On teams where getting change into production is painful, AI generates more change and makes the existing friction more acute. That said, his read on this outcome is cautiously optimistic: “If the pain is more acute, we maybe will invest in addressing that pain.”

The rub is that the investment has to actually happen. Nathen noted that in lower-performing organizations, AI tools often arrive with a reset of expectations rather than an invitation to fix the process: Here’s your new tool. Now we expect more from you. Addressing this problem means reframing the question “Does AI make people more productive?” What we really should be asking is “Under what conditions will AI boost productivity, and who’s responsible for creating them?” And that falls on the organization, not the technology.

Verification isn’t a checkbox

Trust is a big challenge with generative AI. About 30% of DORA survey respondents trust AI output little or not at all. Around 46% trust it “somewhat” (and Nathen is one of them). Despite all the advances in generative AI, these tools still make mistakes, and if you’ve multiplied your ability to generate code without doing anything to scale your ability to verify it, you’ve made your situation worse, not better.

Nathen called this the verification tax, and it belongs in any honest accounting of AI’s productivity impact. Pipeline adaptation belongs there too: Is your delivery pipeline fit for purpose given the volume of change you’re now trying to push through? These costs don’t show up in the headlines about 10x developer productivity. They show up in your incident reports three months later.

DORA recently published an ROI framework and calculator for AI-assisted software development. Nathen was clear that there’s no universal number to offer, and the calculator doesn’t pretend otherwise. What it does is give teams a way to model the real costs, including the learning investment, the verification overhead, and the pipeline changes required.

Context switching and burnout

With productivity on the upswing, AI-induced burnout is becoming a serious concern. (Steve Yegge calls this the “AI vampire.”) DORA’s data for 2025 showed that AI adoption wasn’t strongly connected with burnout, with the caveat that about 64% of DORA survey respondents said they’d never worked in an agentic workflow. Both of those findings are likely to change significantly in 2026.

Nathen highlighted one source of burnout he expects to escalate as agents become the norm: context switching. As he pointed out, software developers spent years arguing for protected focus time to do the deep work that requires them to maintain flow. Agentic workflows are now incentivizing those same developers to voluntarily run a dozen or more agents at once, forcing them to context-switch multiple times every hour. As he joked, “There’s plenty of research that supports the idea that all of us feel like we’re pretty good multitaskers and none of us are.” The consequences are coming, and we’re doing it to ourselves.

The cognitive debt question

Sam Newman brought up the related notion of “cognitive debt,” and in particular, Margaret-Anne Storey’s discussion of it. (See “How Generative and Agentic AI Shift Concern from Technical Debt to Cognitive Debt” and “From Technical Debt to Cognitive and Intent Debt: Rethinking Software Health in the Age of AI.”) Here’s how Storey explains the problem in her blog post:

Debt compounded from going fast lives in the brains of the developers and affects their lived experiences and abilities to “go fast” or to make changes. Even if AI agents produce code that could be easy to understand, the humans involved may have simply lost the plot and may not understand what the program is supposed to do, how their intentions were implemented, or how to possibly change it.

And as Sam noted, this compounds across teams and organizations. As developers increasingly work in parallel with AI rather than with each other, they lose the shared understanding that comes from people building software together. Kent Beck once said that “software design is an exercise in human relationships.” Agentic workflows are putting pressure on that in ways we’re only beginning to see.

Nathen agreed cognitive debt is where he’s most concerned, and both your workers and your architecture will suffer for it. Understanding the ramifications of an architectural decision you made eight months ago takes years of operation to surface, and AI doesn’t help with that at all.

Invest in your platform now

Considering what makes some AI-assisted teams high performers, Nathen explained, “It’s not that you’re using AI but how you’re using AI.” This observation led DORA to develop seven capabilities that, when combined with AI adoption, lead to better outcomes. Nathen briefly ran through the list, ending on quality internal platforms. And here he made a claim about software engineering investment that was, in his words, “a little bit wild”:

Every product engineer that you have in your organization, every engineer that’s focused on building features right now, should probably stop building features and focus on the platform.

His argument is that platforms matter more, not less, in an environment where AI makes it possible for almost anyone in an organization to build something. The people closest to customers and business problems can now generate working software. What they can’t do is ensure that software is durable, secure, and production-ready.

Nathen suggested that the best leverage for software engineering investment today might be building platforms that provide those guardrails, that shift the complexity of production-readiness down into the infrastructure so that anyone building on top of it gets the safety net for free. He acknowledged that moving every product engineer to platform work might be overkill. But the direction of travel is real. The platform is also, as Newman pointed out, where you bring determinism back into a process that AI has made more nondeterministic.

That’s something we’ve been hearing a lot here at O’Reilly. The expansion of who can build doesn’t reduce the need for deep engineering expertise. It changes where that expertise is most valuable, and platforms are a good answer to where.

What DORA’s research tells us

The teams that are doing well are running experiments, learning from them, and spreading those lessons. The measure Nathen suggested is not how many tokens you’ve consumed but how many experiments you’ve run and how well you’re distributing what you’ve learned.

The tools are moving fast enough that any organization locking in a fixed policy around specific tools will find itself stuck. What you want is the capacity to keep learning, which means building the culture and the processes that make learning visible and transferable.

All of DORA’s research is freely available at dora.dev, including the 2025 annual report and the ROI framework. The DORA Community provides a space for practitioners to work through these questions together. If you’re trying to navigate any of this with your team, you may want to spend some time there.

And if you want to dive deeper into Nathen and Sam’s chat or explore the other sessions, you can watch the entire Infrastructure & Ops Superstream on the O’Reilly learning platform. Our next event, on September 9, will cover agentic observability. Register for free here, and check out all the other free live events on O’Reilly.

Agent Skills

Addy Osmani — Wed, 27 May 2026 10:59:18 +0000

The following article originally appeared on Addy Osmani’s blog and is being reposted here with the author’s permission.

The default behavior of any AI coding agent is to take the shortest path to “done.” Ask for a feature and it writes the feature. It doesn’t ask whether you have a spec, write a test before the implementation, consider whether the change crosses a trust boundary, or check what the PR will look like to a reviewer. It produces code, declares victory, and moves on.

This is the same failure mode every senior engineer has spent their career learning to avoid. The senior version of any task includes work that doesn’t show up in the diff: surfacing assumptions, writing the spec, breaking the work into reviewable chunks, choosing the boring design, leaving evidence that the result is correct, sizing the change so a human can actually review it. Those steps are most of what separates engineers who ship reliable software at scale from people who push code that breaks.

Agents skip those steps for the same reason any junior would. They’re invisible. The reward signal points at “task complete” not “task complete and the design doc exists.” So we have to bolt the senior-engineer scaffolding back on.

Agent Skills is my attempt at that scaffolding. It just crossed 27K stars, so apparently I’m not alone in wanting it. This post is the part the README doesn’t quite cover: why each design choice exists, how it maps onto standard SDLC and Google’s published engineering practices, and what you should steal from the project even if you never install a single skill.

What a “skill” actually is

The word “skill” is doing a lot of work in the Claude Code/Anthropic vocabulary, and it helps to be precise. A skill is a Markdown file with front matter that gets injected into the agent’s context when the situation calls for it. Somewhere between a system-prompt fragment and a runbook.

A skill is not reference documentation. It is not “everything you should know about testing.” It is a workflow: a sequence of steps the agent follows, with checkpoints that produce evidence, ending in a defined exit criterion.

That distinction is the whole game. If you put a 2,000-word essay on testing best practices into the agent’s context, the agent reads it, generates plausible-looking text, and skips the actual testing. If you put a workflow there (write the failing test first, run it, watch it fail, write the minimum code to pass, watch it pass, refactor), the agent has something to do, and you have something to verify.

Process over prose. Workflows over reference. Steps with exit criteria over essays without them. That single distinction separates a useful skill from a pretty Markdown file. It also explains why so many “AI rules” repos end up doing nothing in practice. The rules are essays.

The SDLC the skills encode

The 20 skills in the repo organize around six lifecycle phases, with seven slash commands sitting on top. Define (/spec) is where you decide what you’re actually building. Plan (/plan) breaks the work down. Build (/build) implements it in vertical slices. Verify (/test) proves it works. Review (/review) catches what slipped through. Ship (/ship) gets it to users safely. /code-simplify sits across the bottom of the whole thing.

This isn’t a coincidence. It’s the same SDLC every functioning engineering organization runs, just in different vocabulary. Google calls it design doc → review → implementation → readability review → launch checklist. Amazon calls it the working-backward memo and the bar raiser. Every healthy team has some version of this loop.

What’s new with AI coding agents is that most agents skip most of these phases by default. You ask for a feature, you get an implementation, and the spec, plan, tests, review, and launch checklist all just don’t happen. Skills push the agent through the same phases a senior engineer forces themselves through, because shipping the code without them is how you produce incidents.

A complex feature might activate eleven skills in sequence. A small bug fix might use three. The router (using-agent-skills) decides which apply. The point is that the workflow scales to the actual scope, not to the assumed scope.

Five principles that are doing the work

Five design decisions in the project are the loadbearing ones. The rest of the system follows from them.

1. Process over prose

Already covered. Workflows are agent-actionable; essays are not. The same is true for human teams. If your team handbook is 200 pages, no one reads it under time pressure. If it’s a small set of workflows with checkpoints, people actually run them.

2. Anti-rationalization tables

This is the most distinctive design decision in the project, and the one I most want other teams to steal.

Each skill includes a table of common excuses an agent (or a tired engineer) might use to skip the workflow, paired with a written rebuttal. A few examples close to the originals:

“This task is too simple to need a spec.” → Acceptance criteria still apply. Five lines is fine. Zero lines is not.
“I’ll write tests later.” → Later is the loadbearing word. There is no later. Write the failing test first.
“Tests pass, ship it.” → Passing tests are evidence, not proof. Did you check the runtime? Did you verify user-visible behavior? Did a human read the diff?

The reason this works is that LLMs are excellent at rationalization. They will produce a plausible-sounding paragraph explaining why this particular task doesn’t need a spec or why this particular change is fine to merge without review. Anti-rationalization tables are prewritten rebuttals to lies the agent hasn’t yet told.

The pattern is just as good for human teams. Most engineering decay isn’t anyone choosing to do bad work. It’s people accepting plausible-sounding justifications for skipping the parts they don’t feel like doing. A team that writes down its anti-rationalizations is a team that has fewer of them.

3. Verification is nonnegotiable

Every skill terminates in concrete evidence. Tests pass. Build output is clean. The runtime trace shows the expected behavior. A reviewer signs off. “Seems right” is never sufficient.

This is the same principle that makes Anthropic’s harness recover from failures, that makes Cursor’s planner/worker/judge split actually catch bugs, that makes any long-running agent recoverable. The agent is a generator. You need a separate signal that the work is done. Skills bake that signal into every workflow.

4. Progressive disclosure

Do not load all 20 skills into context at session start. Activate them based on the phase. A small meta-skill (using-agent-skills) acts as a router that decides which skill applies to the current task.

This is the harness engineering lesson applied at skill granularity. Every token loaded into context degrades performance somewhere, so you load what’s relevant and leave the rest on disk. Progressive disclosure is how you get a 20-skill library into a 5K-token slot without poisoning the well.

5. Scope discipline

The meta-skill encodes a nonnegotiable I’d staple to every agent if I could: “touch only what you’re asked to touch.” Don’t refactor adjacent systems. Don’t remove code you don’t fully understand. Don’t brush against a TODO and decide to rewrite the file.

This sounds obvious until you watch an agent decide that fixing one bug requires modernizing three unrelated files. Scope discipline is the single biggest determinant of whether an agent’s PR is mergeable or has to be unwound. It’s also the principle that maps most cleanly onto Google’s code review norms, where reviewers will block a PR for doing more than one thing.

The Google DNA

The skills are saturated with practices from Software Engineering at Google and Google’s public engineering culture. This is intentional. Most of what makes Google-scale software work is documented and public, and it is exactly the part agents are most likely to skip.

A partial map of which skill encodes which practice:

Hyrum’s law in api-and-interface-design. Every observable behavior of your API will eventually be depended on by someone, so design with that in mind.
The test pyramid (~80/15/5) and the Beyoncé rule in test-driven-development. “If you liked it, you should have put a test on it.” Infrastructure changes don’t catch bugs; tests do.
DAMP over DRY in tests. Google’s testing philosophy is explicit that test code should read like a specification even at the cost of some duplication. Overabstracted tests are a known antipattern.
~100-line PR sizing, with Critical/Nit/Optional/FYI severity labels in code-review-and-quality. Straight from Google’s code review norms. Big PRs don’t get reviewed; they get rubber-stamped.
Chesterton’s Fence in code-simplification. Don’t remove a thing until you understand why it was put there.
Trunk-based development and atomic commits in git-workflow-and-versioning.
Shift left and feature flags in ci-cd-and-automation. Catch problems as early as possible, decouple deploy from release.
Code-as-liability in deprecation-and-migration. Every line you keep is one you have to maintain forever, so prefer the smaller surface.

None of these are new ideas. The point is that none of them are in the agent by default. A frontier model has read the phrase “Hyrum’s law” in its training data, but it does not apply Hyrum’s law when it’s designing your API at 3am. Skills are how you make sure it does.

How to actually use it

Three modes, in roughly increasing commitment.

Mode 1: Install via marketplace. If you’re using Claude Code:

/plugin marketplace add addyosmani/agent-skills 
/plugin install agent-skills@addy-agent-skills

You get the slash commands (/spec, /plan, /build, /test, /review, /ship, /code-simplify) and the agent activates the relevant skills automatically based on context. This is the path I’d recommend most people start on.

Mode 2: Drop the Markdown into your tool of choice. The skills are plain Markdown with front matter. Cursor users put them in .cursor/rules/. Gemini CLI has its own install path. Codex, Aider, Windsurf, OpenCode, anything that accepts a system prompt can read them. The tooling matters less than the workflow underneath.

Mode 3: Read them as a spec. Even if you never install anything, the skills are a documented description of what good engineering with AI agents looks like. Read code-review-and-quality.md and apply the five-axis framework to your team’s review process. Read test-driven-development.md and use it to settle the next “do we need to write the test first” argument with a junior. Read the meta-skill and steal the five nonnegotiables for your own AGENTS.md.

This third mode is where I’d actually start. Pick the four or five skills closest to your current pain. Decide which workflows you want enforced. Then install the runtime, or roll your own, to do the enforcing.

What to steal even if you never install

A few patterns from the project I’d steal regardless of whether you use AI coding agents at all:

Anti-rationalization as a team practice. Write down the lies your team tells itself. “We’ll fix the tests after launch.” “This change is too small for a design doc.” “It’s fine, we have monitoring.” Pair each with the rebuttal. Put it in your AGENTS.md or your engineering wiki. It will save you arguments and it will catch the next tired Friday-afternoon shortcut.

Process over prose for anything you write internally. If you find yourself writing a 2,000-word doc titled “how we approach X” you’ve written reference material. Convert it to a workflow with checkpoints. The doc shrinks to 400 words and people actually run it. This applies as much to onboarding guides and runbooks as it does to agent skills.

Verification as a hard exit criterion. Make “produce evidence” the exit step of every task. For agents, for engineers, for yourself. Evidence is whatever proves the work is done: a green test run, a screenshot, a log, a review approval. Without it, the task is not done. “Seems right” never closes the loop.

Progressive disclosure for any rulebook. Do not write a 50-page handbook. Write a small router that points to the right small chapter for the situation. This is true for AGENTS.md, for runbooks, for incident playbooks, for anything anyone will read under time pressure.

Five nonnegotiables, lifted from the meta-skill, that I’d put in any AGENTS.md tomorrow:

Surface assumptions before building. Wrong assumptions held silently are the most common failure mode.
Stop and ask when requirements conflict. Don’t guess.
Push back when warranted. The agent (or engineer) is not a yes-machine.
Prefer the boring, obvious solution. Cleverness is expensive.
Touch only what you’re asked to touch.

That’s a worthwhile engineering culture in five lines, and you don’t need to install anything to adopt it.

Where this fits in the harness

In the broader picture, skills are one layer of agent harness engineering. The harness is the model plus everything you build around it; skills are the reusable workflow chunks that get progressively disclosed into the system prompt. They sit alongside AGENTS.md (the rolling rulebook), hooks (the deterministic enforcement layer), tools (the actions the agent can take), and the session log (the durable memory). Each layer has a specific job. Skills do the senior-engineer-process job.

Skills matter more for long-running agents than they do for chat-style ones, because long runs amplify every shortcut. An agent that skips the test in a 10-minute session produces one bug. An agent that skips the test in a 30-hour session produces a debugging archaeology project at the end of the run, when no one remembers what the original intent was. The longer the run, the more the senior-engineer scaffolding has to be enforced rather than suggested.

The portability of the skills format matters too. The same SKILL.md file works in Claude Code, Cursor (with rules), Gemini CLI, Codex, and any other harness that accepts system-prompt content. Write the workflow once, the runtime enforces it. That’s the thing the Markdown-with-front matter format buys you that bespoke prompt engineering does not.

Closing

The thing I most want people to take from this project, more than the skills themselves, is the framing.

AI coding agents are extremely capable junior engineers with no instinct for the parts of the job that don’t show up in the diff. The senior-engineering work (surfacing assumptions, sizing changes, writing the spec, leaving evidence, refusing to merge what can’t be reviewed) is exactly what an agent will skip unless you make it impossible to skip. The job, increasingly, is to encode that discipline as something the agent cannot talk itself out of.

Skills are one shape of that. Anti-rationalization tables. Progressive disclosure. Process over prose. Verification as the loadbearing exit criterion. The Google practices that already work, made portable.

You can install my version. You can roll your own. The lesson stands either way: The senior-engineer parts of the job are no longer optional, even when the engineer is a model.

The repo is at github.com/addyosmani/agent-skills (MIT). For the broader scaffolding picture, see “Agent Harness Engineering” and “Long-Running Agents.”