Last updated: June 2, 2026
The Six Levels of Agentic Software Engineering
The most important conversation engineering leaders are not having is about AI autonomy. They are having a conversation that sounds like it: "We're using AI heavily." "Our pipelines are autonomous." "Auto-merge is coming." "We have 10,000 agents." But the same words mean different things in different organizations, and the resulting decisions about tooling, governance, and headcount get made without a shared map.
That map exists. It was arrived at independently by more than a dozen authors writing about AI software engineering between 2024 and 2026: consultancies, vendors, security bodies, academic researchers, and policy analysts. The taxonomy they converged on is the closest thing the field has to a standard, and the convergence itself is the most interesting finding. Nothing in the levels below is original to this article. The contribution is the synthesis.
Read by an engineering leader, what follows answers three questions: where your organization actually sits, what to build to reach the next rung, and why no source in the literature has ever recommended skipping one. The taxonomy is not a menu. It is a sequence.
Why a Shared Vocabulary Matters
A shared autonomy taxonomy does three things at once:
- It locates your organization honestly. Most teams that describe themselves as advanced are at Level 2, the human-reviewed PR. They feel further along because generation is fast, but their review surface has not changed. The taxonomy makes that visible.
- It separates the problems you actually have from the ones you are reading about. Auto-merge horror stories are Level 3.5 and Level 4 failures. Review-wait bottlenecks are Level 2 failures. AI slop appears at every level but with different mitigations. Without a label, leaders chase the wrong fix.
- It makes tooling claims falsifiable. A vendor selling an "autonomous engineering platform" can be asked, plainly, which level they operate at and what evidence supports it. The answer is almost always lower than the marketing.
The rest of this article walks the levels and names the challenges each one introduces. The load-bearing point comes in "Why No Steps Can Be Skipped": every attempt in the published record to leap from Level 2 to Level 4 produces the same failure mode under a different name.
The Six Levels at a Glance
The table below is the reference figure for the rest of the article. Each row names a level, what AI does at that level, and what humans do. What separates the levels is not how much code AI writes (that scales smoothly across every level) but what humans look at and approve.
| Level | Name | What AI does | What humans do |
|---|---|---|---|
| L1 | AI-Assisted | Autocompletes lines, suggests snippets. | Write the code. Decide on each suggestion as it appears. |
| L2 | AI-Generated, Human-Reviewed | Writes whole functions, files, or PRs. | Drive the work. Review and approve every PR before merge. |
| L3 | AI-Generated, Auto-Reviewed | Generates code from a spec. Runs tests, linters, security scans, holdout scenarios automatically. | Approve merges. Look at intent and proof (satisfaction reports, screenshots, behavior evidence), not diffs. |
| L3.5 | Selective Auto-Merge | Generates code. Routine PRs in qualifying services auto-merge. | Veto rights on auto-merged PRs. Active review only on high-risk services and out-of-policy changes. |
| L4 | Mostly Autonomous | Handles the full loop within a defined scope. Notifies humans, does not ask them. | Set goals and scope. Handle escalations on boundary violations. |
| L5 | Dark Factory (lights-out) | End-to-end SDLC: spec, code, test, review, deploy, maintain. | Write specs. Define system boundaries and quality thresholds. Improve the factory when novel failures emerge. |
Two things to notice before reading on.
First, the human role does not shrink as AI grows. It moves. At L1 the human writes code. At L2 the human approves diffs. At L3 the human approves intent and proof. At L4 the human sets scope and handles escalations. At L5 the human writes specs and tunes the factory. Each row removes a kind of work and adds a different kind. The total amount of human judgment required does not decline; it concentrates at a higher level of abstraction.
Second, L3.5 is not a typo. It is the most operationally important level in the table. Real organizations do not promote every service at once. They earn auto-merge on routine services first, while high-risk surfaces (auth, billing, customer data, schema migrations) stay on the human path indefinitely. Treating L3.5 as a real level, rather than as a halfway state on the way to L4, is the difference between a working program and one that stalls. And this is not theoretical or limited to scrappy startups: publicly traded companies with large security and compliance functions are already operating with selective auto-merge in production. Compliance is achievable at L3.5, not a barrier to it.
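As a minimal sketch of what selective auto-merge can look like in practice, the routing rule below sends a PR to the human path whenever it touches a high-risk surface, falls outside policy, or belongs to a service that has not been promoted. The names `PullRequest`, `route`, and the surface labels are hypothetical and chosen for illustration; they do not come from any of the cited sources.

```python
from dataclasses import dataclass

# Surfaces the article says may stay on the human path indefinitely.
# The label strings are illustrative assumptions.
HIGH_RISK_SURFACES = frozenset({"auth", "billing", "customer-data", "schema-migration"})


@dataclass(frozen=True)
class PullRequest:
    service: str
    surfaces: frozenset   # surfaces the change touches (hypothetical labels)
    within_policy: bool   # True if the change stays inside the agreed policy


def route(pr: PullRequest, promoted_services: set) -> str:
    """Return 'human-review' or 'auto-merge' for a PR under an L3.5 policy."""
    if pr.surfaces & HIGH_RISK_SURFACES:
        return "human-review"   # high-risk surfaces never auto-merge
    if not pr.within_policy:
        return "human-review"   # out-of-policy changes get active review
    if pr.service not in promoted_services:
        return "human-review"   # the service has not earned auto-merge yet
    return "auto-merge"         # humans still keep veto rights afterwards


promoted = {"docs-site"}
print(route(PullRequest("docs-site", frozenset({"content"}), True), promoted))
print(route(PullRequest("docs-site", frozenset({"auth"}), True), promoted))
```

The checks are ordered so that the high-risk rule wins even for a promoted service, which mirrors the point that promotion is per surface as well as per service.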
Why No Steps Can Be Skipped
The most common mistake leaders make when reading an autonomy taxonomy for the first time is treating it as a menu. They look at L2 (their current state) and L4 (the brochure picture from the loudest vendor) and ask which they prefer. The answer is that neither is a choice, because the levels are not options. They are a sequence.
Every transition works only because the previous level produced something the next one consumes. Skip a level and the substrate is missing. The interesting skips happen above L2, since L1 to L2 is not a step anyone struggles with: a team that turns on AI generation lands at L2 on day one. The transitions worth examining are the ones that fail in published postmortems.
L2 to L3. Moving from human-reviewed PRs to automated review gates requires a layered review stack: a security agent, a conventions agent, a holdout-scenario evaluator, an audit log. Building that stack is the substantive engineering work. Once it exists, humans can move their attention from diffs to intent and proof because the mechanical review is being done somewhere. Skip L3 and there is nothing else doing the mechanical review. Auto-merging at that point is what Mayflower named the Bullshit Factory and what HackerNoon named Stapling AI Onto the Old Workflow: generation scaled without scaling validation. The result is high-volume output that compounds slop instead of catching it.
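The layered review stack described above can be sketched as a list of gates that all run against a change, with every verdict written to an audit log. This is an illustrative skeleton under stated assumptions: the gate functions are toy stand-ins for a security agent, a conventions agent, and a holdout-scenario evaluator, and `ReviewStack` is a hypothetical name rather than anything defined by the sources.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# A gate takes a diff and returns (passed, description of the check).
# Real gates would call agents or evaluators; these are toy stand-ins.
Gate = Callable[[str], Tuple[bool, str]]


def security_gate(diff: str) -> Tuple[bool, str]:
    return ("password=" not in diff, "check: no hard-coded credentials")


def conventions_gate(diff: str) -> Tuple[bool, str]:
    return ("\t" not in diff, "check: no tab indentation")


def holdout_gate(diff: str) -> Tuple[bool, str]:
    return ("TODO" not in diff, "check: holdout scenarios (stubbed)")


@dataclass
class ReviewStack:
    gates: List[Tuple[str, Gate]]
    audit_log: List[dict] = field(default_factory=list)

    def review(self, pr_id: str, diff: str) -> bool:
        """Run every gate, record each verdict, and return the overall result."""
        all_passed = True
        for name, gate in self.gates:
            passed, note = gate(diff)
            self.audit_log.append(
                {"pr": pr_id, "gate": name, "passed": passed, "note": note}
            )
            all_passed = all_passed and passed
        return all_passed


stack = ReviewStack(gates=[
    ("security", security_gate),
    ("conventions", conventions_gate),
    ("holdout", holdout_gate),
])
print(stack.review("pr-1", "x = 1\n"))
print(stack.review("pr-2", "password='hunter2'\n"))
```

The loop deliberately does not stop at the first failing gate, so the audit log holds a complete record for every PR. That record is the raw material a later per-service trust measurement would draw on.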
L3 to L3.5. Selective auto-merge requires per-service trust, and per-service trust requires measurement: pass rate, false-positive rate, and override rate over a meaningful sample (HackerNoon proposes 20 PRs as a floor). That data only exists if the L3 review stack has been running long enough to produce it. Promoting a service to auto-merge before the data exists is what HackerNoon names Auto-Merge Before Earning Trust. The reference incident is MindStudio's account of an autonomous agent that wiped 1.9 million rows from a production database. The agent was not malicious. It had merge rights it had not earned.
L3.5 to L4. Mostly-autonomous operation requires codified knowledge of what to do when a scenario fails: which fix, which agent, which playbook. That knowledge accumulates only when an organization has lived through L3.5 long enough to encounter and codify the recurring failures. Infralovers names this the Codification Imperative: every manual fix must become a durable improvement to the factory itself. Skip L3.5 and the organization arrives at L4 with no codified rules and no audit discipline. The agent acts on judgment it does not have, and there is no record of how it arrived there.
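One way to picture codification is a registry that maps a recurring failure to the fix a human applied the first time, so that only novel failures escalate. The sketch below assumes failures can be reduced to a string signature, which is a simplification. `PlaybookRegistry`, `handle`, and `codify` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PlaybookRegistry:
    # failure signature -> codified fix (both plain strings in this sketch)
    playbooks: Dict[str, str] = field(default_factory=dict)
    audit_log: List[str] = field(default_factory=list)

    def handle(self, failure: str) -> str:
        """Return the codified fix for a known failure, or escalate a novel one."""
        if failure in self.playbooks:
            self.audit_log.append(f"{failure}: applied playbook")
            return self.playbooks[failure]
        self.audit_log.append(f"{failure}: escalated to human")
        return "escalate"

    def codify(self, failure: str, fix: str) -> None:
        """Record a manual fix so the same failure no longer needs a human."""
        self.playbooks[failure] = fix
        self.audit_log.append(f"{failure}: codified")


registry = PlaybookRegistry()
print(registry.handle("flaky-integration-test"))   # novel, so it escalates
registry.codify("flaky-integration-test", "rerun with a pinned seed")
print(registry.handle("flaky-integration-test"))   # now handled by the playbook
```

An organization that skips L3.5 starts with an empty registry: every failure is novel, and there is no log showing how earlier ones were resolved.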
L4 to L5. The final transition requires that engineers write specs precisely enough that no one needs to look at diffs. That skill is not innate. It develops only when teams have been operating at L4 long enough to see which specs produce correct code and which produce confident garbage. Skip L4 and the specs are still draft-quality. The Dark Factory then ships draft-quality output at full throughput, which is the failure mode every source warns against in its loudest language.
The disagreements in the field are real, but they sit at the top of the ladder. Sources disagree about whether human approval becomes an anti-pattern at L5 or simply becomes escalation-only at L4. None of them argue about whether the ladder can be climbed in a different order. On that question the literature is unanimous: the rungs exist for a reason, and every reason corresponds to a documented incident in someone's postmortem.
Engineering leaders who want to move up should pick the next rung, not the highest one.
What This Means for Engineering Leaders
Three takeaways follow from the argument so far. They are deliberately practical and order-sensitive: the first has to be true before the second is useful, and the second has to be in motion before the third applies.
- Locate your organization honestly. Most teams that describe themselves as AI-native or highly automated are at Level 2. They feel further along because generation is fast, but their review surface has not changed. The diagnostic is simple: if every PR is still human-approved before merge, you are at Level 2, regardless of how much code AI is writing. Honesty here is the prerequisite for everything else. A team that thinks it is at Level 3 cannot reason about how to reach Level 3.5.
- Invest in the next rung's substrate, not the highest one. Every level requires the previous level's machinery. The discipline that builds this machinery has a name: harness engineering. It is the practice of designing the scaffolding around AI agents (review pipelines, evaluators, audit logs, spec templates, per-service measurement) so that raw model output becomes reliable engineering capability. From Level 2, the harness worth building is the Level 3 review stack: a security review agent, a conventions agent, a holdout-scenario evaluator, an audit log, a spec template. Not "an autonomous engineering platform." From Level 3, the harness worth building is per-service trust measurement. The work for the rung above that is not yet relevant, and trying to do it first is the failure mode that "Why No Steps Can Be Skipped" describes.
- Treat promotion as a measurement problem, not a declaration. Per service, not org-wide. Per surface, not per team. A service earns its promotion when the data supports it. Industry-proposed floors are >90% scenario pass rate, <5% evaluator false-positive rate, and <10% human override rate over the last 20 PRs. High-risk surfaces (auth, billing, customer data, schema migrations) may never qualify, and that is correct. The taxonomy does not require every service to promote together. It requires every service to promote on evidence.
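The promotion floors quoted above translate directly into a per-service check. The sketch below is one possible reading, not a standard implementation: it assumes each rate is computed per PR over the most recent 20 PRs, and the names `PRRecord` and `eligible_for_auto_merge` are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class PRRecord:
    scenarios_passed: bool          # holdout scenarios passed for this PR
    evaluator_false_positive: bool  # evaluator flagged a change that was fine
    human_override: bool            # a human overrode the automated verdict


def eligible_for_auto_merge(
    history: Sequence[PRRecord],
    window: int = 20,
    high_risk: bool = False,
) -> bool:
    """Apply the floors from the text: >90% pass, <5% false positive, <10% override."""
    if high_risk:
        return False                # high-risk surfaces may never qualify
    recent = list(history)[-window:]
    if len(recent) < window:
        return False                # not enough evidence yet
    n = len(recent)
    pass_rate = sum(r.scenarios_passed for r in recent) / n
    fp_rate = sum(r.evaluator_false_positive for r in recent) / n
    override_rate = sum(r.human_override for r in recent) / n
    return pass_rate > 0.90 and fp_rate < 0.05 and override_rate < 0.10


clean = [PRRecord(True, False, False)] * 20
print(eligible_for_auto_merge(clean))
print(eligible_for_auto_merge(clean[:19]))
```

Under this per-PR reading, a 20-PR window is strict: a single evaluator false positive is exactly 5% and therefore fails the "<5%" floor, while one human override (5%) still passes the "<10%" floor. A larger window would loosen that granularity.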
The leaders who lose ground in this transition will be the ones who skip a rung, declare a level, or treat the program as uniform across the organization. The leaders who win will pick the next rung, do the harness engineering, and let the data say when the rung after that is available.
Where the Model Comes From
The levels in this article are not invented. They are the field convergence of more than a dozen independent authors writing between 2024 and 2026: consultancies, vendors, security bodies, academic work, and policy authors. None copied from another. The fact that they all landed on a five-to-six-level model, with the same boundaries at roughly the same places, is the reason this taxonomy is worth treating as authoritative rather than as one more opinion.
Where they agree. All sources converge on the same lower-level boundaries. Level 1 is inline assistance. Level 2 is AI generation with human PR review. Level 3 introduces automated review gates. Every source identifies the L2 review-wait as the binding bottleneck. Every source names the same anti-pattern (the Bullshit Factory, Stapling AI Onto the Old Workflow, Auto-Merge Before Earning Trust) when teams scale generation without first scaling validation. Every source treats the analogy to SAE J3016 self-driving levels as the correct reference frame, whether they cite it openly (HackerNoon, Tessl, CSA, ASDLC.io) or adopt the structure without naming the source.
Where they disagree. The disagreements are concentrated at the upper levels. Mayflower argues that past a throughput threshold, mandatory human approval actively reduces quality (the Quality Inversion). MindStudio and HackerNoon retain veto, audit, and override rights at L4 and L5 while removing per-PR approval. Parloa sits in the middle: humans review intent and proof, not code, and retain the right to decide whether to ship at all. The CSA and the arXiv treatment push the question further by asking whether a six-level taxonomy is even the right granularity, or whether autonomy should vary by capability within a single system. None of these disagreements affect Levels 1 through 3. They affect the question of what the human role becomes once the machinery above it is mature.
Sources:
- HackerNoon: The Dark Factory Pattern: Moving from AI-Assisted to Fully Autonomous Coding. Origin of the L3.5 designation. Openly cites the self-driving analogy.
- MindStudio: What Is a Dark Factory? AI Coding Autonomy. Cleanest five-level statement. Source of the 1.9M-row-wipe reference.
- Mayflower: AI Dark Factory Patterns. Named the Bullshit Factory, Mechaniker, Zirkusdompteur, Quality Inversion.
- Parloa: Agentic Software Engineering at Scale. Frames the role shift to engineers as architects.
- Infralovers: Architektur-Patterns: Dark Factory (German). Holdout sets, digital twins, the Codification Imperative.
- Tessl / AI Native Dev: The 5 levels of AI agent autonomy: learning from self-driving cars.
- Cloud Security Alliance: Autonomy Levels for Agentic AI.
- ASDLC.io: Levels of Autonomy: L1–L5 AI Agent Autonomy Scale.
- SmartBear: Levels of Autonomy in Software Development.
- Swarmia: Five levels of AI coding agent autonomy.
- Sean Falconer: The Practical Guide to the Levels of AI Agent Autonomy.
- James Read: AI Levels of Autonomy in Software Engineering.
- Knight First Amendment Institute: Levels of Autonomy for AI Agents.
- arXiv 2506.12469: Levels of Autonomy for AI Agents.
- Interface EU: An Autonomy-Based Classification.
- SAE International: J3016 Levels of Driving Automation. The original taxonomy the analogy is built on.
