---
url: /blog/posts/2026-02-19-amdahls-law-for-ai-agents.md
description: >-
  The maximum speedup from AI agents is bounded by the fraction of the workflow
  that requires human judgment. The highest-leverage investment isn't making
  agents faster — it's making every human intervention self-liquidating.
---

Multi-agent systems are delivering real results. Geoffrey Huntley's [Ralph Loop](https://ghuntley.com/loop/) runs autonomous coding agents in a while-loop until every PRD item is complete. Steve Yegge's [Gas Town](https://steve-yegge.medium.com/welcome-to-gas-town-4f25ee16dd04) orchestrates 20–30 concurrent agents across seven specialized roles. Cursor ships an 8-agent parallel system. The throughput gains are real — and the teams getting the most out of them share a pattern.

But the gains aren't automatic. Flask creator Armin Ronacher [told The Pragmatic Engineer](https://newsletter.pragmaticengineer.com/p/new-trend-programming-by-kicking): "I sometimes kick off parallel agents, but not as much as I used to do. The thing is: it's only so much my mind can review!" The teams getting the most value share a pattern — and a law from 1967 explains exactly where the leverage is.

## The Original Insight

In 1967, Gene Amdahl made a simple observation about parallel computing: the speedup of a program is limited by the fraction that must run sequentially.

His formula: **speedup = 1 / (S + (1-S)/N)**

Where S is the serial fraction and N is the number of parallel processors. The key insight isn't in the formula — it's in the limit. As N approaches infinity, speedup converges to 1/S. If 10% of your work is inherently serial, you will never exceed 10x speedup. Not with 100 cores. Not with a million.

This wasn't an argument against parallelism — it was an engineering guide for where to focus. The same logic applies to agents.

## The Agent Version

The equivalent law for AI agents:

> **The maximum speedup from AI agents is bounded by 1/H, where H is the fraction of the workflow that requires human judgment.**

H includes every moment the system blocks on a human: clarifying intent ("Did you mean the admin dashboard or the user-facing one?"), making judgment calls ("Should we refund this customer or offer credit?"), review cycles ("This isn't quite what I wanted, try again"), approval gates, resolving ambiguity ("The policy says 'handle escalations appropriately' — what does that mean?"), and applying taste ("This is technically correct but feels wrong").

If H = 40% of your total workflow time, no improvement in agent capability can ever get you past 2.5x speedup. At 50%, the ceiling is 2x. Even an optimistic 20% only gets you to 5x.

| Human fraction (H) | Max possible speedup |
|---|---|
| 50% | 2x |
| 30% | 3.3x |
| 20% | 5x |
| 10% | 10x |
| 5%  | 20x |

**H dominates the speedup equation, not agent capability.** But unlike Amdahl's original serial fraction S — a fixed property of the algorithm — H isn't static. Better models *do* shrink some components of H. An agent that needs less clarification and makes fewer errors directly reduces human time.

The catch: the components that models reduce aren't the ones that dominate at scale. Model improvements shrink clarification and verification. Taste and novel decisions — the components that become dominant once you've automated the mechanical parts — are largely irreducible by better models. These ceilings are real at any given moment.

**The question is how fast H is declining — and what you do with the gains.**

## Self-Liquidating H

The highest-leverage investment isn't making agents faster or smarter. It's changing the character of H. And the tools for doing that aren't new — tests, specs, automation, documented decisions. These are engineering best practices that have been around for decades. What's new is the return on investment. When agents can act autonomously on every encoded decision, the payoff for writing that test or updating that spec is dramatically higher than it used to be.

The goal isn't to minimize human involvement. It's to make it **self-liquidating**: every human intervention should produce the artifact — the test, the spec update, the documented decision — that makes the same type of intervention unnecessary next time. (The term comes from finance: a self-liquidating loan generates the revenue to pay itself off. A self-liquidating intervention generates the artifact that eliminates its own recurrence.)

A team where 40% of workflow time is human but it's all taste and strategy is in a fundamentally different position than a team where 40% is human but it's all "did you mean X or Y?" and "let me re-check this output." Self-liquidating practices convert the second type into the first.

This requires what I've been calling [configurancy](/blog/2026/02/02/configurancy) — the smallest set of explicit behavioral commitments (and rationales) that allow a bounded agent to safely modify the system without rediscovering invariants. Specs, conformance suites (automated test suites that verify behavior against a spec), documented rationale. Every implicit assumption in your system is a future human-blocking event.

A conformance suite that makes human review unnecessary is **crystallized cognition** — human judgment about correctness, encoded at the moment it was made so agents don't have to rediscover it. The AGENTS.md file that prevents agents from stumbling into known gotchas is the same thing. Every good piece of scaffolding is human judgment captured as a durable, machine-readable artifact.

Which human involvement should you target? The test is concrete: **"Is this intervention encodable?"** When a human catches a bug, can that catch become a test case? When a human clarifies an ambiguity, can that clarification update the spec? When a human makes a taste call, can that call become a documented precedent? If agents keep requiring the same type of human intervention, your configurancy is incomplete.

Verification is highly encodable — catches become test cases. Clarification is highly encodable — resolutions become spec updates. Specification is partially encodable — patterns become reusable templates. Taste and novel decisions are the least encodable — which is fine, because they're the components where human judgment genuinely creates value. The system naturally converges toward a state where humans do only the work that can't be encoded, because everything encodable has been.

But capturing alone isn't enough. Naively appending every intervention creates its own problem — an AGENTS.md that's 400 lines of contradictory gotchas nobody reads, a test suite of overlapping cases encoding conflicting assumptions. Raw accumulation creates noise, not knowledge.

The real pattern is **accumulate, then compress**: individual judgments pile up, then periodically get integrated into coherent higher-level artifacts. Common law accumulates case decisions, then synthesizes them into principles and statutes. Science accumulates papers, then compresses them into review articles and textbooks. In practice: capture every intervention as a local artifact (a test case, an AGENTS.md entry, a decision record), then periodically integrate those artifacts into updated specs, refactored test suites, and revised skill definitions. Compression is where you ask whether the governing variables themselves are still right — whether 40 test cases encoding 40 variations of an assumption reveal that the assumption is wrong. Teams that only accumulate end up with drift and contradiction. Teams that only try to compress end up over-engineering upfront. The cycle needs both.

Two levers make the self-liquidation cycle practical:

1. **Capture the signal.** When a human intervenes — catching a bug, clarifying a spec, making a taste call — the system should generate the artifact that encodes that intervention. A review that catches a bug but doesn't update the test suite is wasted signal. A clarification that doesn't update the spec will recur.
2. **Maintain high configurancy so agents operate autonomously everywhere else.** When the system's knowledge is explicit — specs, invariants, conformance suites, documented rationale — agents don't block on humans for things that could have been encoded upfront.

**Agent scaffolding is the serial code optimization of the AI era.** It concentrates human time where it has maximum leverage and captures the signal from every intervention as durable knowledge the system can reuse.

## What Scaffolding Looks Like

At ElectricSQL, an agent recently propagated a protocol change through 67 files — the spec, two server implementations, 10 client libraries across 10 languages — in 20–30 minutes. No human reviewed 67 files. The conformance suite *is* the review. Without it, that's hours of careful manual verification across 10 languages. With it, I spent a few minutes reviewing the PR. The human designed the protocol change; the suite automated everything downstream. And each new decision about correctness becomes another test case, making the next change even more autonomous.

Emil Stenström built a complete HTML5 parser with agents by hooking in the html5lib-tests conformance suite from the start. Then Simon Willison ported it to JavaScript in 4.5 hours by pointing a *different* agent at the *same* suite. The conformance suite made human review unnecessary because the spec was already encoded as executable verification.

The model handles the task; the configurancy handles the trust. Together they compound.

Teams that treat agent deployment as "pick a model and write some prompts" plateau quickly. They haven't touched H. The teams seeing real speedup have invested heavily in the scaffolding layer — often spending more engineering effort on configurancy than on the agent integration itself.

## Scaling Past the Bottleneck

As agents get faster, **H *feels* like it's growing.** When an agent takes 2 hours to research a topic and you spend 30 minutes reviewing its output, that 30 minutes is background noise. When the agent takes 30 seconds and you still spend 30 minutes reviewing, suddenly *you* are the bottleneck. The absolute time hasn't changed, but the relative weight has shifted dramatically. You're never waiting anymore — which means you're always the one being waited on.

And then you add more agents, and it gets worse.

Amdahl's Law tells you there's a ceiling. Donald Reinertsen's *[The Principles of Product Development Flow](https://www.amazon.com/Principles-Product-Development-Flow-Generation/dp/1935401009)* tells you where the engineering problem is: **naively adding parallel agents can degrade performance — but the fix is tractable.**

Reinertsen applied queueing theory to product development and showed that capacity utilization increases queue size exponentially. At 50% utilization, the queue is manageable. At 80%, it's 4x larger. At 90%, 9x. At 95%, 19x. The human reviewing agent output is a single server in a queue. Five parallel agents quintuple the arrival rate, driving utilization toward 100% and queue times toward infinity.

Gas Town is the most vivid illustration. Yegge describes "palpable stress" as 20–30 agents run simultaneously at speeds too fast to comprehend. Early users describe their role as "keep your Tamagotchi alive" and note that "your management span of control is directly correlated to your attention span and memory." One user went from 5 PRs in 3 hours to 36 PRs in 4 hours — but at $100/hour in Claude tokens and with intense, unbroken cognitive engagement. The throughput is real, but so is the human queue saturation.

Reinertsen's prescription: manage queue size directly — work-in-progress limits, smaller batch sizes, faster feedback loops — rather than maximizing utilization. The agent equivalent: the way to scale parallel agents is to **eliminate the friction that causes them to block on humans in the first place** — invest in configurancy that lets agents verify their own work, so that when they do need a human, it's for judgment that actually matters.

The Ralph Loop gets this right. It works not because it runs agents in parallel but because it's a self-liquidation engine: a well-defined PRD as the spec, automated test verification as the acceptance criteria, and AGENTS.md files that accumulate discovered patterns across iterations. Each iteration where a human corrects an agent encodes that correction as an artifact the next iteration can consume. The same pattern shows up at every scale. [shadcn](https://x.com/shadcn/status/2023812711151259772) describes running a `/done` skill after every agent session that dumps key decisions, questions, and follow-ups into a markdown file tagged with the session ID and branch name. Every session's human context becomes a durable artifact the next session consumes. The intervention encoded itself.

## Where the Value Accrues

The metric that makes this concrete: **recurrence rate.** How often does the same type of human intervention happen twice? This is the same logic behind Toyota's defect tracking: every defect triggers root cause analysis and a process change that prevents recurrence. The metric isn't "how many defects did we find?" but "is the same type of defect happening less often?" Declining recurrence means your system is learning. Flat recurrence means you're paying the same human tax repeatedly.

And agents make the self-liquidation cycle cheaper to run. The core thesis from [configurancy](/blog/2026/02/02/configurancy): **AI makes code cheap; therefore the scarce asset is the system's self-knowledge.** Agents make maintaining explicit contracts cheap — write a spec change, agents propagate it through implementations, conformance suites verify correctness. The very thing that reduces H has itself become cheaper to produce.

This creates a flywheel: agents make it cheap to crystallize human judgment as durable artifacts → durable artifacts reduce future human interventions → fewer recurring interventions means agents deliver more value per cycle → which justifies more agent investment. The teams that recognize this early will compound their way to genuine 10x while everyone else argues about which model to use.

But that compounding isn't smooth. H doesn't behave like Amdahl's original serial fraction S — a fixed property of the algorithm. It moves. When you encode verification and clarification friction into conformance suites and specs, H drops — and your Amdahl's ceiling lifts. But successful teams don't pocket the savings and do the same work faster. They increase scope. The team that reduced H from 40% to 20% by building a conformance suite doesn't just ship the same features at 5x. They take on a protocol redesign that would have been unthinkable before — and that redesign introduces new taste and strategy decisions that push H back up.

This is Jevons' paradox for cognition: efficiency gains in human-agent collaboration get reinvested as increased ambition, not decreased H. The healthy trajectory isn't H monotonically declining toward zero. It's a **sawtooth** — H drops as you encode friction, jumps as you take on harder problems, drops again as you encode the new friction, jumps again. Each cycle operates at a higher level of capability. The teams that compound aren't the ones with the lowest H. They're the ones running the sawtooth fastest.

## The Real Question

Amdahl's Law doesn't say parallelism is useless. It tells you where to focus.

The question isn't "how smart is your agent?" or "how many agents can I run?" It's: **"when a human touches my system, does the system learn something that makes the next touch unnecessary?"**

Most teams spend the majority of their human time on friction, not judgment. And most of that friction is recurring — the same interventions, over and over, because nobody encoded the signal from the last one.

Crystallizing that friction into durable knowledge isn't a one-time fix — it's an iterative discipline of accumulating judgments and compressing them into specs and suites. But the teams that run that cycle fastest are the ones compounding their way to 10x.