The vocabulary of agentic software engineering moves faster than the people writing it. Terms get used three different ways in the same blog post. Vendors describe their product by its category instead of by its function. Words like "agent" and "memory" mean precise things in research papers and softer things in marketing material.

This article is the dictionary we wish we had had when we started. It defines the terms Dash0 uses when we talk about agentic systems: the levels of autonomy, the practice that builds them, the components inside, the safety surfaces around them, and the operational visibility that keeps the whole thing honest. Where another Dash0 article goes deeper on a term, we link.

The arrangement is concentric. We start at the outside (the destination and the framework) and move inward: first to the practice that gets you there, then to the parts the practice assembles, then to the safety surfaces, and finally to the operational visibility you need to keep all of it accountable.

The Destination

Dark factory

A dark factory is the highest level of agentic software engineering: a lights-out, end-to-end software delivery lifecycle operated by AI, from spec through deployment and maintenance. Humans author specs and tune the system instead of reviewing diffs.

The term is borrowed from manufacturing, where a "lights-out factory" runs without human presence on the floor. In software, the equivalent is a pipeline where code is written, reviewed, tested, deployed, and maintained by agents, with humans setting boundaries and improving the factory itself when something novel breaks.

See The Six Levels of Agentic Software Engineering for the full ladder. The dark factory is Level 5.

AI adoption levels

A level taxonomy that locates where an organization sits on the path from inline AI assistance (L1) to a dark factory (L5). The boundary between two adjacent levels is not how much code AI writes (that scales smoothly across every level), but what humans look at and what they approve.

The taxonomy makes three things possible: locating your organization honestly, naming what the next investment is, and falsifying vendor claims about autonomy.

See The Six Levels of Agentic Software Engineering for the full table and the argument for why levels cannot be skipped.

The Practice

Harness engineering

Harness engineering is the discipline of designing the scaffolding around AI agents so that raw model output becomes reliable engineering capability. The harness is everything that is not the model: review pipelines, evaluators, audit logs, spec templates, per-service trust measurement, memory systems, retrieval indexes, sandboxes, and guardrails.

The term captures a load-bearing claim: capability comes from the harness, not from the model. Two organizations using the same frontier model can produce wildly different outcomes depending on what they build around it. The model is necessary. The harness is what makes the model useful.

Each level of autonomy requires a different harness. Level 2 requires almost no harness beyond the IDE. Level 3 requires a review stack. Level 3.5 requires per-service trust measurement. Level 4 requires codified knowledge of what to do when a scenario fails. Building the harness for the next level you target is the substantive work of climbing the ladder.

Tip: The analogy: test harnesses. The word is not new and the meaning is not coincidental. A test harness is the scaffolding around code under test: fixtures, mocks, runners, assertion libraries, CI integration. None of it is the code that ships, but without it the code that ships cannot be trusted. Test harnesses are an enablement layer; they make it safe to change software quickly. Harness engineering for agents is the same idea applied one level up: the scaffolding around the model that makes it safe to let the model change software quickly. In both cases the harness is invisible in the final product and load-bearing for everything around it.

Evaluations

Evaluations (often shortened to evals) are structured tests that measure whether an agent's output meets a standard. The standard can be functional (“does the code work?”), behavioral (“does it match the spec?”), stylistic (“does it follow conventions?”), or operational (“did it use the right tools?”).

Evals are to agents what tests are to traditional software, with two crucial differences:

They run on probabilistic systems. A single pass or fail does not generalize. Evals are reported in distributions: pass rate, false-positive rate, override rate over N runs.
They include holdouts. Holdouts are data or scenarios that are reserved for testing the performance of AI agents after they have been trained. They are necessary because a scenario the implementation agent can read is a scenario it can game.

Without evals, level promotion is a declaration. With evals, it is a measurement.
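
To make "reported in distributions" concrete, here is a minimal sketch of running one scenario many times and summarizing the result as a pass rate. The names `run_eval` and `flaky_agent` are hypothetical, and the scenario is a random stand-in for "run the agent, then check its output", not a real agent call.

```python
import random
from typing import Callable


def run_eval(scenario: Callable[[random.Random], bool], n_runs: int, seed: int = 0) -> dict:
    """Run one scenario n_runs times and summarize the outcome as a distribution."""
    rng = random.Random(seed)  # seeded so the report is reproducible
    passes = sum(1 for _ in range(n_runs) if scenario(rng))
    return {"runs": n_runs, "passes": passes, "pass_rate": passes / n_runs}


def flaky_agent(rng: random.Random) -> bool:
    # Hypothetical stand-in: succeeds roughly 80% of the time.
    return rng.random() < 0.8


report = run_eval(flaky_agent, n_runs=200)
print(report)
```

The same function can be pointed at a holdout set simply by keeping those scenarios out of anything the implementation agent can read and comparing the two pass rates.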

LLM as a judge

LLM as a judge is the practice of using one language model to grade the output of another. You give the judge model a rubric (a description of what “good” means, see below), the input given to the model under test, and the output that model produced. The judge returns a score, a verdict (pass/fail), or a pairwise preference, usually with a written rationale.

The technique exists because most qualities that matter in agent output cannot be measured by deterministic tests. "Did this code pass the unit tests?" is a deterministic check. "Is this code idiomatic?", "Did the agent explain its reasoning clearly?", "Does this answer faithfully reflect the retrieved context?", "Is this PR description honest about what changed?" are not. Writing a brittle regex for any of those produces worse signal than asking a capable model to evaluate against a rubric.

A typical LLM-as-judge setup:

Rubric: a short, explicit specification of what "good" means. The quality of the judge is bounded by the quality of the rubric.
Inputs: the original task, any retrieved context, and the candidate output.
Output format: a structured verdict (score, label, or A vs. B preference) plus a rationale.
Aggregation: multiple independent runs of the same judge, or multiple different judges, combined by majority vote or average.

The recurring failure modes are well-documented and worth naming:

Length bias. Judges tend to prefer longer answers, regardless of quality.
Self-preference. A judge model often rates its own outputs higher than other models' outputs on the same task.
Position bias. In pairwise comparison, the first option is preferred more often than the second.
Sycophancy. Judges agree with the framing of the rubric instead of evaluating against it.
Miscalibration. Scores drift over model versions; a "7/10" from one version is not a "7/10" from the next.

The usual mitigations are multi-run majority voting (2-of-3 is a common floor), pairwise comparison instead of absolute scoring (more reliable), shuffled positions in pairwise tests, and periodic human audit of a sample to keep the judge calibrated against reality.
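
Two of these mitigations, shuffled positions and majority voting, fit in a few lines. The sketch below assumes a judge is any function that receives a task and two candidates and answers "first" or "second"; `pairwise_verdict` and both example judges are hypothetical stand-ins, not a real model call.

```python
import random
from collections import Counter
from typing import Callable

# A judge sees (task, first, second) and answers "first" or "second".
Judge = Callable[[str, str, str], str]


def pairwise_verdict(judge: Judge, task: str, a: str, b: str, runs: int = 3, seed: int = 0) -> str:
    """Return "A" or "B" by majority vote, shuffling presentation order on each run."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(runs):
        swapped = rng.random() < 0.5
        first, second = (b, a) if swapped else (a, b)
        pick = judge(task, first, second)
        # Map the positional answer back to the underlying candidate.
        winner_is_a = (pick == "first") != swapped
        votes["A" if winner_is_a else "B"] += 1
    return votes.most_common(1)[0][0]


def prefers_shorter(task: str, first: str, second: str) -> str:
    # Content-based judge: its verdict does not depend on position.
    return "first" if len(first) <= len(second) else "second"


def always_first(task: str, first: str, second: str) -> str:
    # Position-biased judge: shuffling turns its bias into noise instead of a systematic win.
    return "first"


print(pairwise_verdict(prefers_shorter, "explain the fix", "short", "a much longer answer"))
```

Use an odd number of runs so the vote cannot tie. A judge whose verdict flips when the order flips is telling you about its position bias, not about the candidates.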

Important: LLM-as-judge is a useful evaluation tool, not an oracle. Treat its verdicts as one signal among several (deterministic tests, holdout scenarios, post-merge CI/CD, human spot-checks), not as ground truth. The harness improves when the judge is calibrated; it degrades when the judge is trusted blindly.

The Parts

Agent

An agent is a program that takes a goal, decides what steps to take, executes those steps using tools, observes the results, and decides what to do next, until the goal is met or it gives up.

Three things distinguish an agent from a regular program: it operates against a goal rather than a fixed instruction, it chooses its own actions dynamically, and it runs in a loop rather than as a one-shot call. The model is the agent's reasoning engine. Tools are how it acts on the world. Memory and knowledge are what it carries between steps.

Most things marketed as agents are not. Many are templated workflows with an LLM call inside. The test: can it choose a different path when the situation changes? If not, it is an automation, not an agent.

Loop

The loop is the structural pattern that makes a program an agent rather than a workflow. The agent reasons about the current state, picks an action, executes the action, observes the result, and feeds that result back into the next round of reasoning. It keeps doing this until a stop condition is met.

The most common formulation is ReAct (Reason + Act), introduced in a 2022 paper by Yao et al. The pattern is an alternating sequence of thoughts (free-form reasoning the model produces) and actions (tool calls). Each action returns an observation that feeds into the next thought step. Almost every contemporary agent framework is a variation on this idea.

flowchart LR
    Goal([Goal]) --> Reason[Reason]
    Reason --> Act[Act via tool]
    Act --> Observe[Observe result]
    Observe --> Reason
    Reason -->|stop condition met| Done([Done])

Note: When someone says "ReAct agent," they almost always mean an agent that uses tool calls in a reasoning loop. The original paper's specific prompt format is rarely used as-is today.
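
The loop is short enough to sketch in full. This is a minimal illustration, not any framework's implementation: `scripted_policy` is a hypothetical stand-in for the model's reasoning step, and `lookup_port` is a made-up tool.

```python
from typing import Callable


def lookup_port(service: str) -> str:
    # Hypothetical tool: a fixed lookup table instead of a real system call.
    ports = {"postgres": "5432", "redis": "6379"}
    return ports.get(service, "unknown")


TOOLS = {"lookup_port": lookup_port}


def scripted_policy(goal: str, observations: list) -> dict:
    """Stand-in for the model: pick the next action from what has been observed so far."""
    if not observations:
        return {"action": "lookup_port", "input": goal}
    return {"action": "finish", "answer": f"{goal} listens on {observations[-1]}"}


def run_agent(goal: str, policy: Callable, tools: dict, max_steps: int = 5) -> str:
    observations = []
    for _ in range(max_steps):
        decision = policy(goal, observations)         # reason
        if decision["action"] == "finish":            # stop condition
            return decision["answer"]
        tool = tools[decision["action"]]
        observations.append(tool(decision["input"]))  # act, then observe
    return "gave up: step budget exhausted"


print(run_agent("postgres", scripted_policy, TOOLS))
```

The step budget is the second stop condition: without it, a policy that never decides to finish would loop forever.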

MCP

MCP (Model Context Protocol) is an open protocol introduced by Anthropic in 2024 that standardizes how agents discover and invoke tools, fetch resources, and use prompt templates across systems. Before MCP, every agent integration was a custom adapter. With MCP, an agent that speaks the protocol can talk to any server that speaks it.

The protocol defines three primitives:

Tools: functions the agent can call.
Resources: data the agent can read.
Prompts: templates the host application exposes.

A server provides any of these; a client consumes them. The wire format is JSON-RPC.

MCP matters operationally because it decouples the agent from the integration surface. Switching from one model provider to another no longer requires rebuilding every tool integration.
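
To show what "the wire format is JSON-RPC" looks like in practice, the sketch below builds two request messages by hand. The method names `tools/list` and `tools/call` come from the MCP specification; the tool name `query_logs` and its arguments are hypothetical, and a real session would begin with an initialization handshake that is omitted here.

```python
import json
from itertools import count

_ids = count(1)  # JSON-RPC requests carry an id so responses can be matched to them


def jsonrpc_request(method, params=None):
    """Build a JSON-RPC 2.0 request message as a plain dict."""
    message = {"jsonrpc": "2.0", "id": next(_ids), "method": method}
    if params is not None:
        message["params"] = params
    return message


# Ask the server which tools it offers.
list_tools = jsonrpc_request("tools/list")

# Invoke one tool by name. "query_logs" is a made-up tool for illustration.
call_tool = jsonrpc_request(
    "tools/call",
    {"name": "query_logs", "arguments": {"service": "checkout", "limit": 10}},
)

print(json.dumps(call_tool, indent=2))
```

In practice you would use an MCP client library rather than building messages by hand; the point is only that the agent and the server agree on message shapes, not on each other's internals.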

Managed Model Platforms

Managed model platforms are cloud-vendor services that host foundation models behind a unified API, bundle them with the vendor's identity, governance, and networking primitives, and sell them on the vendor's usual contract. The two that matter for enterprise procurement are AWS Bedrock and GCP Vertex AI.

AWS Bedrock: Amazon's managed service for foundation models. It provides API access to Anthropic Claude, Meta Llama, Mistral, Cohere, AI21, Amazon's own Titan and Nova families, and others through a single endpoint. Calls are billed against the AWS account, authenticated with IAM, and can be kept within the customer's VPC. Bedrock layers additional services on top: Knowledge Bases (managed RAG), Agents (managed tool-use orchestration), and Guardrails (content and PII filters).
GCP Vertex AI: Google Cloud's ML and AI platform. For agentic workloads the relevant surface is the Generative AI API, which provides access to Gemini, Anthropic Claude (via partnership), Llama, and others, with the same IAM, VPC, and billing posture as the rest of Google Cloud. Vertex also includes tooling for fine-tuning, model evaluation, vector search, and agent orchestration through Agent Builder.

The reason these exist as a category is procurement and compliance, not technical capability. A regulated enterprise that has already negotiated a data processing agreement, residency commitment, and security review with AWS or Google can consume the same vendors' models through Bedrock or Vertex without re-negotiating any of it with the underlying model providers. The trade-off is that some advanced features (prompt caching variants, beta endpoints) may be missing or delayed.

Managed platform vs. proxy. Bedrock and Vertex look superficially like proxies (one API, many models) but solve a different problem. A proxy like LiteLLM gives you vendor neutrality across providers you have already contracted with. A managed platform gives you a single contractual and security boundary, at the cost of being tied to that cloud and that cloud's model catalog. Most large customers end up running both: the managed platform for the regulated path, a proxy in front of it for routing, observability, and quota management.

Proxy

A proxy in agentic systems is a middleware layer that sits between an agent and one or more LLM providers, presenting a unified API while handling routing, retries, cost tracking, key management, and rate limiting. LiteLLM is the best-known open-source example; commercial offerings include OpenRouter, Portkey, and Helicone.

The reasons to run a proxy multiply quickly:

Vendor neutrality: one API, many providers. Switch providers without touching agent code.
Cost visibility: per-team, per-agent, per-tenant token accounting in one place.
Quota and rate-limit management: failover across providers when one is throttled.
Auditability: every call is logged for compliance and incident review.

For any team running more than a handful of agents in production, the question is not whether to run a proxy. It is which one.
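
The core of a proxy can be sketched as a loop over providers with accounting on the side. This is an illustration of the idea, not how LiteLLM or any other product is implemented; the `Proxy` class, the provider stubs, and the word-count "token" measure are all assumptions.

```python
from collections import defaultdict


class ProviderThrottled(Exception):
    """Raised by a provider stub to simulate a rate-limit response."""


class Proxy:
    def __init__(self, providers):
        self.providers = providers              # ordered list of (name, callable)
        self.tokens_by_team = defaultdict(int)  # cost visibility
        self.audit_log = []                     # auditability

    def complete(self, team, prompt):
        for name, call in self.providers:
            try:
                text, tokens = call(prompt)
            except ProviderThrottled:
                self.audit_log.append((team, name, "throttled"))
                continue                        # fail over to the next provider
            self.tokens_by_team[team] += tokens
            self.audit_log.append((team, name, "ok"))
            return text
        raise RuntimeError("every provider was throttled")


def throttled_provider(prompt):
    raise ProviderThrottled()


def healthy_provider(prompt):
    # Hypothetical provider: echoes the prompt and counts words as "tokens".
    return f"echo: {prompt}", len(prompt.split())


proxy = Proxy([("primary", throttled_provider), ("fallback", healthy_provider)])
print(proxy.complete("team-a", "summarize this incident"))
```

The agent code calls `complete` and never learns which provider answered, which is the vendor-neutrality property in miniature.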

Memory

Memory is what an agent carries across steps and across runs. The two dimensions are scope (within one run vs. across runs) and structure (raw transcript vs. summarized vs. typed records).

The common forms:

Working memory: the context window of the current run. Resets between runs.
Episodic memory: per-run records of what happened: which tools were called, what results came back, what the agent decided. Useful for resuming, debugging, and learning.
Semantic memory: distilled facts and preferences that survive across runs ("the user prefers Postgres," "this codebase uses pnpm"). Often stored as natural-language records the agent reads at the start of a run.
Procedural memory: codified procedures and playbooks. In practice this is what we call a runbook: a written sequence the agent follows when a situation matches.

Memory has the same failure modes as any database. Stale memory is wrong memory. Memory write-amplification fills the context window with noise. Memory without write boundaries leaks one user's context into another. Treat memory as state, not as magic.
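
Treating memory as state means giving it the same guards as any other store. The sketch below shows two of them, a write boundary per tenant and a freshness cutoff; `MemoryRecord` and `MemoryStore` are hypothetical names, and a production store would sit on a real database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryRecord:
    tenant: str        # write boundary: who this memory belongs to
    kind: str          # "episodic", "semantic", or "procedural"
    text: str
    written_at: float  # seconds since epoch


class MemoryStore:
    def __init__(self, max_age_seconds: float):
        self.max_age_seconds = max_age_seconds
        self._records = []

    def write(self, record: MemoryRecord) -> None:
        self._records.append(record)

    def recall(self, tenant: str, now: float, limit: int = 5) -> list:
        """Return the newest fresh records for one tenant only."""
        fresh = [
            r for r in self._records
            if r.tenant == tenant and now - r.written_at <= self.max_age_seconds
        ]
        fresh.sort(key=lambda r: r.written_at, reverse=True)
        return fresh[:limit]  # the cap keeps memory from flooding the context window


store = MemoryStore(max_age_seconds=3600)
store.write(MemoryRecord("acme", "semantic", "the user prefers Postgres", written_at=1000.0))
store.write(MemoryRecord("acme", "semantic", "this codebase uses npm", written_at=10.0))
store.write(MemoryRecord("globex", "semantic", "this codebase uses pnpm", written_at=1000.0))

print([r.text for r in store.recall("acme", now=4000.0)])
```

Each guard maps to one failure mode from the paragraph above: the age cutoff handles stale memory, the limit handles write amplification, and the tenant filter handles leakage.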

Knowledge

Knowledge is information the agent can draw on that does not change between runs: documentation, runbooks, prior incident reports, schemas, design decisions, architecture overviews. Where memory is what the agent remembers about itself and its interactions, knowledge is what it knows about the world it operates in.

Knowledge is usually surfaced through retrieval (see Semantic Search and RAG below). The interesting question is not how to store knowledge but how to curate it: stale documentation is worse than no documentation, because the agent will confidently follow it.

The most useful knowledge artifacts in our experience are not handcrafted for AI consumption. They are commit messages, ADRs, post-mortems, and runbooks written for humans first. The agent benefits from the same investments that make humans more effective.

Semantic search

Semantic search retrieves information by meaning rather than by keyword. A query and a candidate document are each converted into vectors using an embedding model; the candidates closest to the query in vector space are returned.

The practical difference from keyword search is that "how do I rotate secrets" can match a document titled "Key rotation procedure" without sharing any of the same words. This is essential for agents because they tend to ask questions in different phrasings than the documents that answer them.

The infrastructure is straightforward: an embedding model, a vector store (pgvector, Pinecone, Weaviate, and similar), and an index. The hard part is curation. Indexing every wiki page produces noisy retrieval. Indexing the right subset, kept current, produces useful retrieval.
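
The ranking step itself is small. In the sketch below, the three-dimensional vectors are invented by hand to stand in for the output of an embedding model, and the dictionary stands in for a vector store; only the cosine-similarity ranking is the real technique.

```python
import math


def cosine(u, v):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Made-up vectors; a real index would hold embeddings with hundreds of dimensions.
INDEX = {
    "Key rotation procedure": [0.9, 0.1, 0.0],
    "Office lunch menu": [0.0, 0.1, 0.9],
    "Incident response runbook": [0.4, 0.8, 0.1],
}


def search(query_vector, index, top_n=2):
    ranked = sorted(index, key=lambda title: cosine(query_vector, index[title]), reverse=True)
    return ranked[:top_n]


# Pretend embedding of "how do I rotate secrets": no shared words with the best match.
query = [0.8, 0.2, 0.1]
print(search(query, INDEX))
```

Nothing in `search` looks at the words in the titles. The match comes entirely from where the vectors sit, which is why the quality of the embedding model and the curation of the index dominate the outcome.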

RAG

RAG (Retrieval-Augmented Generation) is the pattern of fetching relevant context with a retrieval step, including it in the model's prompt, and then generating an answer grounded in that context. It is the dominant solution to the problem that LLMs do not know about your data and cannot fit your whole knowledge base in their context window.

A typical RAG pipeline:

The user (or agent) issues a query.
The query is embedded and matched against an indexed knowledge base.
The top-N most relevant chunks are retrieved.
The chunks are concatenated into the prompt with the query.
The model generates an answer that cites or quotes the retrieved chunks.

RAG is not magic. It does as well as its retrieval does, and retrieval quality is dominated by what is in the index. The recurring failure modes are stale data, poor chunking, and the model ignoring retrieved context when its training data is more familiar.
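
The pipeline can be sketched end to end up to the model call. To keep the example self-contained, the retriever here is a naive keyword-overlap scorer rather than an embedding search, the chunks are invented, and the final generation step is left as a comment because it depends on whichever model client you use.

```python
def retrieve(query, chunks, top_n=2):
    """Rank chunks by how many query words they share; a naive stand-in for a real retriever."""
    query_words = set(query.lower().split())
    scored = [(len(query_words & set(chunk.lower().split())), chunk) for chunk in chunks]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable: ties keep index order
    return [chunk for _, chunk in scored[:top_n]]


def build_prompt(query, retrieved):
    """Concatenate numbered chunks with the query so the answer can cite them."""
    context = "\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(retrieved))
    return (
        "Answer using only the context below and cite chunk numbers.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


CHUNKS = [
    "Rotate API keys every 90 days using the vault CLI.",
    "The cafeteria opens at noon.",
    "Keys are stored in the vault under the team namespace.",
]

question = "how do we rotate keys in the vault"
prompt = build_prompt(question, retrieve(question, CHUNKS))
print(prompt)
# The prompt would now be sent to a model; that call is omitted here.
```

Swapping `retrieve` for an embedding-based search changes nothing downstream, which is the practical meaning of retrieval and generation being separate layers.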

Note: RAG vs. semantic search. The two terms describe different layers of the same stack. Semantic search is a retrieval technique: it answers "which documents are most relevant to this query?" using an embedding model and a vector store. RAG is an end-to-end generation pattern: retrieve, stuff into prompt, generate. RAG almost always uses semantic search as its retriever, but it does not have to (BM25, SQL, and graph traversal are all valid retrievers). You can have semantic search without RAG (a search bar). You can have RAG without semantic search (BM25 + LLM). The cleanest framing: semantic search uses representational AI (embeddings) to find relevant context; RAG adds a layer of generative AI (an LLM) on top to write the answer.

The Safety Surfaces

Sandboxing

Sandboxing is the practice of giving an agent a constrained execution environment so that the blast radius of its mistakes is bounded. The sandbox decides which filesystem the agent sees, what network it can reach, what tools it can call, and what side effects it can produce.

Sandboxes operate at several levels:

Process sandbox: restricting what a command can do on a single machine (containers, gVisor, seccomp).
Tool sandbox: restricting which tools an agent can call from a given context (read-only vs. read-write, scoped credentials, allowlists).
Environment sandbox: running the agent against an ephemeral copy of production data rather than production itself.
Permission sandbox: requiring human confirmation for actions outside a defined safe set.

The reference incident for not sandboxing is a production agent that deleted 1.9 million database rows because it had write access it had not earned. Sandboxing is the operational answer to "what is the worst that can happen?"
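
A tool sandbox and a permission sandbox can be combined in one small wrapper. The sketch below is an assumption-laden illustration: `ToolSandbox`, the two tools, and the always-deny `confirm` callback are hypothetical, and the list of rows stands in for an ephemeral copy of real data.

```python
class ToolDenied(Exception):
    """Raised when the sandbox refuses a tool call."""


class ToolSandbox:
    def __init__(self, tools, allowed, needs_confirmation, confirm):
        self.tools = tools
        self.allowed = set(allowed)                        # tool sandbox: allowlist
        self.needs_confirmation = set(needs_confirmation)  # permission sandbox
        self.confirm = confirm                             # asks a human; returns True or False

    def call(self, name, *args):
        if name not in self.allowed:
            raise ToolDenied(f"{name} is not on the allowlist")
        if name in self.needs_confirmation and not self.confirm(name, args):
            raise ToolDenied(f"{name} was not approved by a human")
        return self.tools[name](*args)


rows = ["r1", "r2", "r3"]  # stand-in for an ephemeral copy of a table


def count_rows():
    return len(rows)


def delete_rows():
    deleted = len(rows)
    rows.clear()
    return deleted


sandbox = ToolSandbox(
    tools={"count_rows": count_rows, "delete_rows": delete_rows},
    allowed={"count_rows", "delete_rows"},
    needs_confirmation={"delete_rows"},
    confirm=lambda name, args: False,  # hypothetical reviewer who declines everything
)

print(sandbox.call("count_rows"))
```

The read-only call goes through untouched, while the destructive one stops at the confirmation gate before any row is removed. Neither check depends on the agent behaving well.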

Tip: You no longer have to build this yourself. The most common sandboxing techniques (ephemeral containers, per-run filesystems, scoped network egress, fast cold-start) are now available as managed services. Modal and Vercel offer hosted agent sandboxes; Daytona is the leading open-source option for self-hosting. Treat these as the default starting point. The interesting engineering is in which tools, credentials, and data you expose inside the sandbox, not in reinventing the container substrate.

Human-in-the-loop

Human-in-the-loop (HITL) describes any workflow where a human must approve, intervene, or be notified before, during, or after an agent acts. HITL is the primary safety surface across Levels 2, 3, and 3.5; it becomes escalation-only at Level 4 and audit-only at Level 5.

HITL is not a single thing. The interesting question is what the human looks at:

At L2: the human looks at the diff.
At L3: the human looks at intent and proof (satisfaction reports, screenshots, evidence).
At L3.5: the human looks at exceptions and high-risk surfaces; routine PRs auto-merge.
At L4: the human looks at escalations and audit trails.

Moving up the ladder is, mechanically, the process of moving the human's attention from diffs to higher-level surfaces. The amount of human judgment required does not shrink. It concentrates at a higher level of abstraction.

The Operational Surfaces

LLM observability

LLM observability is visibility into the behavior of the underlying language model calls: prompts, completions, token counts, latencies, model versions, error rates. It is the foundation layer of agent operations, equivalent to APM for traditional services.

What LLM observability shows you:

Inputs and outputs: which prompts produced which completions.
Cost: how much each call cost, broken down by model, team, and tenant.
Latency: how long each call took, broken down by model and prompt size.
Failure modes: when the model failed, refused, or timed out.
Drift: how completions change when prompts or model versions change.

At Dash0, LLM observability is captured under OpenTelemetry GenAI semantic conventions: spans for model calls, attributes for token usage, events for tool calls and tool results. The data flows through the same pipeline as any other telemetry.
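
As a rough illustration of what one model-call span might carry, the sketch below assembles its attributes as a plain dictionary. The `gen_ai.*` names follow the OpenTelemetry GenAI semantic conventions as we understand them at the time of writing; the conventions are still evolving, so check the current specification. The function name, the prices, and the `app.llm.cost_usd` attribute are our own assumptions and not part of the conventions.

```python
def llm_span_attributes(model, input_tokens, output_tokens,
                        usd_per_1k_input, usd_per_1k_output):
    """Build the attribute set for one model-call span as a plain dict."""
    cost = (input_tokens / 1000) * usd_per_1k_input + (output_tokens / 1000) * usd_per_1k_output
    return {
        "gen_ai.operation.name": "chat",
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        # Hypothetical custom attribute: cost is derived, not defined by the conventions.
        "app.llm.cost_usd": round(cost, 6),
    }


attributes = llm_span_attributes(
    model="example-model",  # placeholder model name
    input_tokens=1000,
    output_tokens=500,
    usd_per_1k_input=0.5,   # made-up prices for illustration
    usd_per_1k_output=1.0,
)
print(attributes)
```

With a real OpenTelemetry SDK these key-value pairs would be set on a span; keeping them as standard attribute names is what lets model calls flow through the same pipeline as other telemetry.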

Agent observability

Agent observability is visibility one level up: the behavior of the agent itself, not just the LLM calls inside it. It tracks goals, plans, tool calls, intermediate decisions, retries, failures, escalations, and outcomes.

Where LLM observability answers "what did the model do?", agent observability answers "what did the agent do, and why?" The unit of analysis is the run: a goal entered, a plan formed, a series of steps executed, an outcome reached. Each run is a trace; each step is a span; each tool call is a child span.

Agent observability is where the operational discipline of harness engineering lives. Without it, level promotion is a vibe. With it, you can answer the questions auto-merge demands: pass rate, override rate, false-positive rate, blast-radius incidents, cost per goal, time per goal.
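
The run-as-trace structure and the metrics derived from it can be sketched with two small data classes. `Span`, `Run`, and `summarize` are hypothetical names for illustration, and the costs and outcomes are invented; real data would come from your telemetry backend.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    cost_usd: float = 0.0
    children: list = field(default_factory=list)  # tool calls are child spans of a step

    def total_cost(self) -> float:
        return self.cost_usd + sum(child.total_cost() for child in self.children)


@dataclass
class Run:
    goal: str
    root: Span        # the trace for this run
    passed: bool      # did the outcome meet the standard?
    overridden: bool  # did a human reverse the agent's outcome?


def summarize(runs):
    """Aggregate per-run traces into the rates that level promotion depends on."""
    n = len(runs)
    return {
        "pass_rate": sum(r.passed for r in runs) / n,
        "override_rate": sum(r.overridden for r in runs) / n,
        "cost_per_goal": sum(r.root.total_cost() for r in runs) / n,
    }


runs = [
    Run("fix flaky test", Span("run", 0.0, [Span("plan", 0.25), Span("edit", 0.25, [Span("tool:git", 0.0)])]), True, False),
    Run("bump dependency", Span("run", 0.0, [Span("plan", 0.5)]), True, True),
    Run("add endpoint", Span("run", 0.0, [Span("plan", 0.25), Span("edit", 0.75)]), False, False),
    Run("rename module", Span("run", 0.0, [Span("edit", 0.5)]), True, False),
]

print(summarize(runs))
```

Because cost rolls up through the span tree, the same structure answers both "what did this run cost?" and "which step made it expensive?".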

Important: An agent is a distributed system that talks to itself. Without observability, it cannot be operated. Without operation, it cannot earn the trust required to climb the autonomy ladder.

Agentic Software Engineering Terminology