In Part 1 we covered what an AI agent is, how it differs from chatbots and copilots, and how to match autonomy level to the task. This post goes deeper — into the plumbing that actually determines whether an agent works in production.
Most “my agent broke” investigations don’t end at the model. They end in memory design, tool scope, orchestration logic, or missing observability. That’s what this post is about.
Orchestration Patterns: The Architectures That Drive Agent Behavior
Orchestration is not a single technique. It’s a family of patterns — each with distinct trade-offs in reliability, cost, latency, and debuggability. Choosing the right one is a foundational architectural decision.
ReAct (Reason + Act)
ReAct is the most widely deployed orchestration pattern. At each loop iteration, the model generates a thought (explicit reasoning about what to do next), selects an action (a tool call with parameters), observes the result, and updates its context before the next iteration.
┌──────────┐ ┌──────────┐ ┌─────────────┐
│ Thought │────▶│ Action │────▶│ Observation │
│ │ │ │ │ │
│ Reason │ │ Tool call│ │ Result + │
│ about │ │ with │ │ context │
│ next step│ │ params │ │ update │
└──────────┘ └──────────┘ └──────┬──────┘
▲ │
└──────────────────────────────────┘
repeat until terminal
The tight thought-action-observation cycle makes the model’s reasoning auditable — you can trace exactly why it chose each tool call. This is its primary advantage for debugging.
The trade-off: ReAct is expensive. Every iteration requires a full model call. Long tasks accumulate latency and token cost linearly. It also assumes the model can plan one step ahead effectively. When tasks require reasoning ten iterations out, single-step reasoning degrades and the agent begins making locally reasonable but globally suboptimal decisions.
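In code, the loop itself is small; everything hard lives in the pieces around it. A minimal sketch, where `call_model` and the `tools` dict are hypothetical stand-ins for your model client and tool registry:

```python
# Minimal ReAct loop sketch. `call_model` and `tools` are hypothetical
# stand-ins for a real model client and tool registry.

def react_loop(goal, call_model, tools, max_iters=10):
    """Run thought -> action -> observation until the model signals done."""
    context = [f"Goal: {goal}"]
    for _ in range(max_iters):
        # One full model call per iteration: this is where ReAct's
        # cost accumulates linearly with task length.
        step = call_model("\n".join(context))
        context.append(f"Thought: {step['thought']}")
        if step["action"] == "finish":  # terminal condition
            return step.get("answer"), context
        result = tools[step["action"]](**step.get("params", {}))
        context.append(f"Observation: {result}")  # feed result back
    raise RuntimeError("max iterations reached without a terminal action")
```

Note that the terminal condition and the iteration cap both live in the orchestration layer, not in the prompt.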
Chain-of-Thought (CoT) Planning
CoT separates planning from execution. The model produces an explicit multi-step plan before taking any action. Orchestration then executes that plan sequentially, feeding results back as observations.
┌─────────────────────────────────┐
│ Planning phase │
│ │
│ Model generates full plan │
│ before any tool is called │
│ │
│ Step 1: fetch customer record │
│ Step 2: check order history │
│ Step 3: calculate refund │
│ Step 4: send confirmation │
└────────────────┬────────────────┘
│ optional: human review here
▼
┌─────────────────────────────────┐
│ Execution phase │
│ │
│ Orchestration executes steps │
│ sequentially, feeds results │
│ back as observations │
└─────────────────────────────────┘
The advantage: planning upfront reduces model calls during execution, lowers cost, and creates a natural checkpoint for human review before any action fires.
The limitation: the upfront plan is static. If early tool calls return unexpected results, a pure CoT agent has no mechanism to revise mid-execution. In dynamic environments — where APIs fail, data is missing, or results differ from expectations — rigid plans break down. The fix is hybrid orchestration: plan upfront, but re-enter a ReAct loop whenever observations deviate significantly from plan assumptions.
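A minimal sketch of that hybrid, assuming hypothetical `make_plan`, `execute_step`, and `expectation_met` hooks supplied by your orchestration layer:

```python
# Hybrid orchestration sketch: execute a static plan step by step, but
# drop back into planning whenever an observation violates expectations.
# `make_plan`, `execute_step`, and `expectation_met` are hypothetical hooks.

def run_with_replanning(goal, make_plan, execute_step, expectation_met,
                        max_replans=3):
    plan = make_plan(goal, observations=[])
    observations = []
    replans = 0
    while plan:
        step = plan.pop(0)
        obs = execute_step(step)
        observations.append(obs)
        if not expectation_met(step, obs):
            if replans >= max_replans:
                raise RuntimeError("plan kept failing; escalate to a human")
            replans += 1
            # Re-enter planning with everything observed so far.
            plan = make_plan(goal, observations=observations)
    return observations
```

The `max_replans` bound matters: without it, a persistently surprising environment turns replanning into its own infinite loop.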
Hierarchical Planning
For complex, long-horizon tasks, flat orchestration breaks down. Hierarchical planning introduces two levels: a high-level planner that decomposes the goal into sub-goals, and sub-agents or lower-level orchestrators that execute each sub-goal independently.
┌─────────────────────────────────────────┐
│ High-level planner │
│ │
│ Decomposes goal into sub-goals. │
│ Operates over abstract objectives. │
│ Does not track tool-level details. │
└────────┬──────────────┬─────────────────┘
│ │
▼ ▼
┌────────────┐ ┌────────────┐
│ Sub-agent │ │ Sub-agent │
│ A │ │ B │
│ │ │ │
│ Specialized│ │ Specialized│
│ tools and │ │ tools and │
│ context │ │ context │
└────────────┘ └────────────┘
│ │
└──────┬───────┘
▼
Results composed back
by the high-level planner
This separation keeps context windows lean at both levels — the planner doesn’t need tool-level details, and sub-agents don’t need full task context. The trade-off is coordination complexity. Debugging hierarchical systems requires distributed tracing that crosses agent boundaries. Failures can originate at any level and manifest at another.
Selecting an Orchestration Pattern
┌──────────────────────────────────────────────────────────────────┐
│ Decision guide │
├──────────────────────────────────────────────────────────────────┤
│ Task has unpredictable step sequence → ReAct │
│ │
│ Task structure is known and stable, → CoT with │
│ or you need a human approval checkpoint plan validation │
│ before execution begins │
│ │
│ Task exceeds reliable planning horizon → Hierarchical │
│ of a single model, or sub-tasks need planning │
│ domain-specific context that shouldn't │
│ bleed across a monolithic window │
└──────────────────────────────────────────────────────────────────┘
In practice, most production systems are hybrids. The outer loop uses hierarchical decomposition; individual sub-tasks use ReAct; high-stakes sub-tasks inject a CoT plan-then-confirm step before execution.
Memory Architecture: What the Agent Knows and for How Long
Poor memory design is the leading cause of agent unreliability in production. The failure modes are subtle — the agent appears to work, then silently loses track of context, repeats completed steps, or contradicts earlier decisions. By the time the bug is visible, it’s several iterations into a corrupted state.
The design question isn’t “Do I need memory?” It’s: what must survive across runs, what can be reconstructed cheaply, and what should never be stored?
Short-Term / Working Memory: The Context Window
Working memory is the context window — the content assembled for the current model call. It holds the system prompt, current goal, active tool schemas, prior tool results, reasoning steps, and recent observations. It is fast, always available, and finite.
The architectural risk is context degradation: as the window fills over a long task, early content — including the original goal — competes with noise. The model doesn’t forget in a hard, detectable way. It degrades softly: worse decisions, goal drift, rising hallucination rates. You will not see an error. You’ll see subtly wrong outputs.
Token usage across a session
─────────────────────────────────────────────
Iter Context tokens % of window
─────────────────────────────────────────────
1 2,260 14% █░░░░░░░░░
2 2,440 16% ██░░░░░░░░
3 4,890 31% ███░░░░░░░
4 7,120 45% ████░░░░░░
5 9,440 59% █████░░░░░ ← alert threshold
6 3,200 20% ██░░░░░░░░ ← summarization fired
7 5,100 32% ███░░░░░░░
─────────────────────────────────────────────
Alert at 60%. Critical at 80%.
By 90%, degradation is already occurring.
Three mitigations address this:
Periodic summarization compresses accumulated reasoning and observations into a dense summary that replaces the raw content. Summarize what happened and what was decided, but keep the most recent tool results verbatim — they are the most decision-relevant content.
Smart eviction tracks which context elements are still decision-relevant and removes those that aren’t. Tool results from five iterations ago rarely need to be verbatim if a summary captures their outcome. This requires tagging context elements at insertion time.
Chunked execution breaks long tasks into sub-tasks, each with a clean context window. State is persisted externally between chunks and retrieved at the start of each. This is hierarchical orchestration applied to memory management.
Monitor token usage explicitly in your orchestration layer. Alert at 60% of the window limit — not 90%.
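A pre-iteration budget check might look like this sketch, where `summarize` is a hypothetical hook that compresses the context and returns the new token count; the 60% and 80% thresholds mirror the rules above:

```python
# Context budget monitor sketch. `summarize` is a hypothetical hook
# that compresses the context and returns the new token count.

def enforce_context_budget(context_tokens, window_limit, summarize,
                           alert=0.60, critical=0.80):
    """Run once before each model call; returns (tokens_after, events)."""
    events = []
    usage = context_tokens / window_limit
    if usage >= critical:
        # Past critical: force summarization before anything else runs.
        events.append("critical: forced summarization")
        context_tokens = summarize(context_tokens)
    elif usage >= alert:
        events.append("alert: summarization triggered")
        context_tokens = summarize(context_tokens)
    return context_tokens, events
```

The point of the sketch is placement: the check runs in the orchestration layer before every model call, not as an afterthought when the provider rejects an oversized request.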
Long-Term Memory: External Storage
Long-term memory lives outside the context window and is retrieved on demand. It subdivides into three types that serve distinct purposes.
Episodic memory stores logs of past agent runs — what was attempted, what succeeded, what failed, and why. This is the mechanism by which agents improve over time without retraining. When starting a new task, the agent retrieves relevant past episodes and uses them as few-shot context. Episodic memory is the foundation of self-improving agents and is chronically underimplemented.
Design decision: log goal, plan, key decision points, tool call summaries, and final outcome. Discard raw observation payloads after summarization. Full transcripts are expensive and noisy; final-outcome-only logging loses the reasoning that explains why outcomes occurred.
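One way to structure an episode record along those lines; the field names are illustrative, not a standard:

```python
from dataclasses import dataclass, field

# Episode record sketch following the logging rules above: keep the
# reasoning skeleton, discard raw observation payloads after summarizing.

@dataclass
class Episode:
    goal: str
    plan: list
    decisions: list = field(default_factory=list)
    tool_summaries: list = field(default_factory=list)
    outcome: str = "unknown"

    def log_tool_call(self, tool, raw_result, summarize):
        # Store only a summary; the raw payload is deliberately dropped.
        self.tool_summaries.append(
            {"tool": tool, "summary": summarize(raw_result)})
```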
Semantic memory stores factual knowledge and business data — the ground truth the agent needs to answer questions accurately. In production, this is implemented via RAG: a vector store the agent queries via a tool, returning relevant chunks inserted into the context window on demand.
Naive top-k vector similarity retrieval works for simple factual lookups but degrades on complex queries. More robust approaches use hybrid retrieval (combining vector similarity with BM25 keyword search), query decomposition (breaking a complex query into sub-queries before retrieving), and re-ranking (using a second model to score retrieved chunks for relevance before insertion). Each adds latency; the question is whether retrieval quality justifies it for your use case.
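One simple, widely used way to merge the vector and keyword rankings is reciprocal rank fusion; this sketch assumes each retriever returns a best-first list of document ids:

```python
# Reciprocal rank fusion: a common way to merge a vector-similarity
# ranking with a BM25 keyword ranking into one hybrid ranking.

def reciprocal_rank_fusion(rankings, k=60):
    """rankings: list of doc-id lists, each ordered best-first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            # Lower rank -> larger contribution; k damps the top ranks
            # so no single retriever dominates.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

A document ranked well by both retrievers beats one ranked first by only one of them, which is exactly the behavior you want from hybrid retrieval.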
Procedural / Coordination memory is shared state in multi-agent systems — task queues, sub-task status, intermediate results, and cross-agent signals. This is the nervous system of multi-agent coordination.
The Memory Design Principle
For every category of information the agent needs, answer:
─────────────────────────────────────────────────────────
What is its read/write frequency?
How long must it survive?
Who else needs access to it?
What is the cost of losing it mid-task?
─────────────────────────────────────────────────────────
Wrong answers produce:
Too much in context → high cost, latency, degradation
Too little persisted → broken workflow continuity
Wrong things stored → security risk
Tool Design: Where Agent Reliability Is Won or Lost
The model decides which tools to call. The orchestration layer executes them. But the tools themselves determine whether those calls succeed reliably. Poorly designed tools are the second most common production failure mode — behind memory mismanagement — and the most fixable.
The Three Tool Categories
Outbound API calls connect the agent to external systems: Slack, GitHub, internal microservices. Primary failure modes are parameter errors, transient network failures triggering retry logic that can produce duplicate actions, and authentication expiry mid-task.
Custom functions are code you own, control, and test. They are deterministic and predictable. Use them whenever predictability matters more than flexibility — tax calculations, date arithmetic, schema validation, data transformations. Custom functions are the highest-reliability tool category and should be preferred for any computation that doesn’t require external state.
Data retrieval (RAG) grounds the agent in current facts rather than training data. Every domain-specific fact the agent needs to get right should have a retrieval path. If there is no tool that retrieves it, the agent will hallucinate it.
Tool Design Principles
Single responsibility. A tool called get_customer_data that conditionally fetches orders, preferences, or account status depending on parameters is three tools disguised as one. Split it. Single-responsibility tools are easier to test, easier to mock, and dramatically easier to debug when they fail.
Explicit, typed interfaces. Every parameter should have a type, a description, and — where applicable — a constraint on valid values. Design tool interfaces as if a careful junior engineer with no context will be calling them at 3 a.m. That junior engineer is the model.
Structured, predictable output. Tools should return consistent schemas regardless of the internal code path taken. A tool that returns a dict in success cases and a string in failure cases forces the model to handle multiple output shapes — and it will handle them inconsistently. Always return a typed, structured result. Make failure states as information-dense as success states — the model needs failure details to decide what to do next.
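A sketch of such an envelope; the field names are illustrative:

```python
from dataclasses import dataclass
from typing import Any, Optional

# Tool result envelope sketch: one schema for every code path, with
# failures as information-dense as successes.

@dataclass
class ToolResult:
    status: str                        # "ok" | "error"
    payload: Any = None                # data on success, None on failure
    error_type: Optional[str] = None   # e.g. "transient", "parameter_invalid"
    detail: str = ""                   # enough context for the model to replan

def safe_lookup(db, key):
    if key not in db:
        return ToolResult(status="error", error_type="not_found",
                          detail=f"no record for key {key!r}; "
                                 "valid keys are customer ids")
    return ToolResult(status="ok", payload=db[key])
```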
Idempotency where possible. Agents retry. If your tool has side effects — writes, sends, charges — idempotency prevents those side effects from compounding on retry. Include an idempotency key in any tool that performs a write operation.
Hard usage limits. Tools that make external writes or have financial consequences should have rate limits enforced in the tool layer, not just in prompting. The model cannot be reliably instructed to self-limit. Enforce limits structurally.
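A sketch combining both structural guards at the tool layer, with a hypothetical `send` callable doing the actual write:

```python
# Structural guard sketch: idempotency and a hard per-session call limit
# enforced in the tool layer, independent of anything the model is told.

class WriteTool:
    def __init__(self, send, max_calls=5):
        self.send = send            # the callable that performs the write
        self.max_calls = max_calls  # hard ceiling, not a prompt suggestion
        self.calls = 0
        self.seen_keys = set()

    def __call__(self, idempotency_key, **params):
        if idempotency_key in self.seen_keys:
            # Retry of an already-executed write: suppress the side effect.
            return {"status": "ok", "detail": "duplicate suppressed"}
        if self.calls >= self.max_calls:
            return {"status": "error", "error_type": "rate_limited",
                    "detail": f"limit of {self.max_calls} writes reached"}
        self.seen_keys.add(idempotency_key)
        self.calls += 1
        return {"status": "ok", "detail": self.send(**params)}
```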
Common Tool Anti-Patterns
Anti-pattern What goes wrong
──────────────────────────────────────────────────────────────
God tool One call does everything. No intermediate
checkpoint. Failure is unattributable.
Under-described A tool named "search" with no description
schema of what it searches, what format queries
take, or what the output looks like.
The model guesses. Sometimes correctly.
Swallowed errors Returns {"status": "ok", "result": null}
on failure. Model interprets "ok" as
success, receives null, produces nonsense.
Missing retrieval The agent hallucinates because no tool
coverage covers a data source it needs. Audit
coverage before you ship.
──────────────────────────────────────────────────────────────
Multi-Agent Coordination: When One Agent Isn’t Enough
Single-agent architectures hit two ceilings as task complexity grows: context window capacity and specialization. Multi-agent architectures address both by decomposing tasks across specialized agents that each operate with lean, relevant context.
This is not free. Coordination introduces failure modes that don’t exist in single-agent systems.
The Orchestrator-Worker Pattern
┌─────────────────────────────────────────────────┐
│ Orchestrator │
│ │
│ Decomposes goal → delegates sub-tasks → │
│ synthesizes results → determines next steps │
└──────────┬──────────────────────┬───────────────┘
│ │
▼ ▼
┌──────────────────┐ ┌──────────────────────────┐
│ Worker A │ │ Worker B │
│ │ │ │
│ Domain-specific │ │ Domain-specific │
│ tools + context │ │ tools + context │
│ │ │ │
│ Executes sub-task│ │ Executes sub-task │
│ independently │ │ independently │
└──────────┬───────┘ └──────────────┬───────────┘
│ │
└────────────┬─────────────┘
▼
Results composed back
by the orchestrator
The design decisions that determine whether this works:
Sub-task interface design. The sub-task description must be self-contained — it cannot assume the worker has access to the orchestrator’s broader context. This is the most commonly underestimated challenge. Poorly scoped sub-task descriptions produce workers that misinterpret their assignment and return irrelevant results.
Result schema standardization. Worker agents must return results in a schema the orchestrator can reason about reliably. Define a standard result envelope: status, payload, confidence indicator if applicable, and a brief human-readable summary the orchestrator can use for planning.
Failure propagation. When a worker fails, the orchestrator needs to know whether the sub-task failed completely, partially, or produced a low-confidence result. A binary success/failure signal is insufficient for a planning agent that may have recovery options.
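A sketch of that envelope plus orchestrator-side routing; the status values and the 0.7 confidence cutoff are illustrative:

```python
from dataclasses import dataclass
from typing import Any, Optional

# Worker result envelope sketch matching the fields above: status is
# richer than success/failure so the orchestrator can plan recovery.

@dataclass
class WorkerResult:
    status: str                  # "complete" | "partial" | "failed"
    payload: Any
    confidence: Optional[float]  # None when not applicable
    summary: str                 # short human-readable line for planning

def route(result):
    # Orchestrator-side routing sketch: the recovery option depends on
    # the granularity of the failure signal.
    conf = result.confidence if result.confidence is not None else 1.0
    if result.status == "complete" and conf >= 0.7:
        return "accept"
    if result.status == "partial":
        return "retry_remaining"
    return "replan"
```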
Shared State and Write Ownership
In multi-agent systems, coordination memory is a shared persistent store that all agents can read and write. The critical design decision is write ownership. In a well-designed multi-agent system, each piece of shared state has exactly one agent responsible for writing it. Concurrent writes from multiple agents without coordination produce race conditions that exhibit as intermittent, mysterious failures in production. Use optimistic locking or task-claim mechanisms to enforce write ownership at the state layer — not at the prompt layer.
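A minimal task-claim sketch, using a lock to make the claim atomic; a database compare-and-swap plays the same role across processes:

```python
import threading

# Task-claim sketch: write ownership enforced at the state layer with an
# atomic claim, so only one agent can ever own a given sub-task.

class TaskBoard:
    def __init__(self, tasks):
        self._lock = threading.Lock()
        self._owner = {t: None for t in tasks}

    def claim(self, task, agent_id):
        """Atomically claim a task; returns False if already owned."""
        with self._lock:
            if self._owner[task] is not None:
                return False
            self._owner[task] = agent_id
            return True

    def owner(self, task):
        with self._lock:
            return self._owner[task]
```

The crucial property is that the check and the write happen inside one atomic section; a prompt instruction like "only work on unclaimed tasks" provides no such guarantee.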
When Multi-Agent Is the Wrong Choice
Multi-agent architecture is frequently over-applied. It is the right choice when the task genuinely exceeds the planning horizon of a single model, when different sub-tasks require domain-specific context that shouldn’t bleed across a monolithic window, or when parallelism is a hard latency requirement.
It is the wrong choice when coordination overhead exceeds task complexity, when a single agent with good context management would suffice, or when the team doesn’t yet have robust observability across agent boundaries. Many systems that claim to need multi-agent coordination are actually single agents with context management problems. Fix the context management first.
Observability: Tracing the Loop, Not Just the Edges
Standard application monitoring tracks requests and responses. Agent observability has to track the loop — every iteration, every tool call, every reasoning step, and the token budget at each point. A single user request can produce dozens of model calls and tool executions. You need to correlate all of them into a coherent session trace.
What to Emit at Every Loop Iteration
This is the minimum viable trace event. Emit one per iteration, structured, before the next iteration begins:
Iteration trace event
─────────────────────────────────────────────
session_id Ties all iterations together
iteration_n Which loop cycle this is
context_tokens Tokens in context at start
thought Model's reasoning output
tool_name Tool selected (or "none")
tool_params Parameters as constructed
tool_latency_ms Wall time for tool execution
tool_result_size Tokens in tool response
context_tokens_after Tokens in context after update
error null if clean; typed if not
timestamp_ms Unix ms at iteration start
The before and after context token counts are the most important pair. The delta tells you how fast the window is filling. If the window is growing faster per iteration than summarization is shrinking it, you will hit degradation before the task completes.
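A minimal emitter for this event might look like the following sketch, where `emit` stands in for whatever structured logger you use:

```python
import json
import time

# Minimal trace-event emitter matching the field list above. `emit` is a
# hypothetical stand-in for your structured logger.

def trace_iteration(emit, session_id, iteration_n, context_tokens,
                    thought, tool_name, tool_params, tool_latency_ms,
                    tool_result_size, context_tokens_after, error=None):
    event = {
        "session_id": session_id,
        "iteration_n": iteration_n,
        "context_tokens": context_tokens,
        "thought": thought,
        "tool_name": tool_name or "none",
        "tool_params": tool_params,
        "tool_latency_ms": tool_latency_ms,
        "tool_result_size": tool_result_size,
        "context_tokens_after": context_tokens_after,
        "error": error,  # null if clean; typed if not
        "timestamp_ms": int(time.time() * 1000),
    }
    emit(json.dumps(event))
    return event
```

Emit it before the next iteration begins, so a crash mid-loop still leaves a complete record of every iteration that ran.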
The Session Trace Structure
A complete session trace is a tree, not a flat log:
session: s_abc123
│
├── iteration: 1
│ ├── model_call: thought generation (320ms, 1,840 tokens in)
│ ├── tool_call: search_knowledge_base (210ms, result: 420 tokens)
│ └── context snapshot: 2,260 tokens
│
├── iteration: 2
│ ├── model_call: thought generation (290ms, 2,260 tokens in)
│ ├── tool_call: call_crm_api (850ms, result: 180 tokens)
│ └── context snapshot: 2,440 tokens
│
├── iteration: 3
│ ├── model_call: thought generation (340ms, 2,440 tokens in)
│ ├── tool_call: summarize_context (-1,100 tokens evicted)
│ └── context snapshot: 1,520 tokens ← summarization fired
│
└── iteration: 4
├── model_call: thought generation (305ms, 1,520 tokens in)
├── tool_call: send_email (120ms, result: 40 tokens)
└── terminal: goal_achieved
The context snapshot after iteration 3 shows summarization working correctly — the window dropped from 2,440 to 1,520 tokens. Without this tree structure, you can’t see that event or attribute subsequent behavior to it.
Latency Attribution
Total session latency breaks down into four buckets. You cannot optimize what you don’t attribute:
Session latency breakdown
─────────────────────────────────────────────────────
% of total
Model inference latency ████████████ 48%
Tool execution latency ████████████████ 38%
└─ call_crm_api ████████ 22%
└─ search_kb ████ 10%
└─ send_email ██ 6%
Context assembly latency ████ 9%
Orchestration overhead █ 5%
─────────────────────────────────────────────────────
Total session wall time: 4,820ms
In most production agents, tool latency dominates — not model inference. You find this only with per-tool timing. Without attribution, optimization effort lands on the wrong layer.
Error Classification
Mixing error types in a single metric makes debugging impossible. Classify every error at emission time:
Error taxonomy
──────────────────────────────────────────────────────────────
Class Type Recovery action
──────────────────────────────────────────────────────────────
Model errors
malformed_tool_call Return structured error to model,
allow one retry with correction hint
goal_drift Reinject original goal, flag for
human review if recurs
reasoning_loop Detect via repeated tool calls,
terminate and surface to operator
Tool errors
transient_failure Exponential backoff, max 3 retries
permanent_failure Surface to model as context update,
trigger replanning
parameter_invalid Return typed error with correction
schema, allow model to revise call
timeout Log latency, treat as transient,
apply retry policy
Orchestration errors
context_overflow Emergency summarization before
next model call
termination_failure Hard stop, checkpoint state, alert
state_corruption Halt immediately, do not recover,
escalate
──────────────────────────────────────────────────────────────
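The taxonomy reduces naturally to a routing table. This sketch maps each error type to its recovery action, defaulting anything unknown to the most conservative path; the action names are illustrative:

```python
# Error-routing sketch mirroring the taxonomy above: classify at
# emission time, then map each type to a recovery action.

RECOVERY = {
    # Model errors
    "malformed_tool_call": "retry_with_hint",
    "goal_drift":          "reinject_goal",
    "reasoning_loop":      "terminate",
    # Tool errors
    "transient_failure":   "backoff_retry",
    "permanent_failure":   "replan",
    "parameter_invalid":   "return_schema",
    "timeout":             "backoff_retry",
    # Orchestration errors
    "context_overflow":    "emergency_summarize",
    "termination_failure": "hard_stop",
    "state_corruption":    "halt_and_escalate",
}

def recovery_action(error_type):
    # Unknown error types get the most conservative treatment.
    return RECOVERY.get(error_type, "halt_and_escalate")
```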
Alert Rules
Alert rules
──────────────────────────────────────────────────────────────
Signal Threshold Action
──────────────────────────────────────────────────────────────
Context tokens > 60% limit Trigger summarization
Context tokens > 80% limit Force summarization,
suspend iteration
Tool error rate > 2/session Log, notify on-call
if task is high-stakes
Repeated tool call Same tool ≥ 3 Detect retry spiral,
consecutive terminate session
Session duration > P99 baseline Flag for review
Termination condition Not reached Hard stop at max
not triggered by max_iter iteration limit
State corruption Any occurrence Halt immediately
──────────────────────────────────────────────────────────────
The repeated tool call rule catches the most destructive failure mode: a retry spiral where the agent calls the same failing tool indefinitely. This burns tokens and time, and can trigger external side effects on every call.
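Detection is a few lines. This sketch implements the rule for the same tool call repeated three times consecutively:

```python
from collections import deque

# Retry-spiral detector sketch implementing the "same tool >= 3
# consecutive" rule from the alert table above.

class SpiralDetector:
    def __init__(self, limit=3):
        self.limit = limit
        # Only the last `limit` calls matter; deque drops older ones.
        self.recent = deque(maxlen=limit)

    def record(self, tool_name, params):
        """Returns True when the session should be terminated."""
        call = (tool_name, repr(sorted(params.items())))
        self.recent.append(call)
        return (len(self.recent) == self.limit
                and len(set(self.recent)) == 1)
```

Keying on the tool name plus its parameters avoids false positives when the agent legitimately calls the same tool with different arguments.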
Production Guardrails and Recovery
Guardrails: Constraining the Action Space
Guardrails are structural constraints on what the agent can do. They are enforced in the orchestration and tool layers — not in prompting — and cannot be overridden by model output.
Input guardrails validate the goal and context before the loop begins. Malformed goals, goals that reference unavailable tools, or goals that exceed defined scope should be rejected at intake — not discovered three iterations into an expensive loop.
Tool call validation checks every tool call the model generates before execution. Validate parameter types, check values against allowed ranges, and verify the requested tool is in scope. Reject malformed calls and return a structured error to the model. The model can often recover from a rejected call if the error message is informative.
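A validation sketch, assuming tool schemas are expressed as simple name-to-type maps; a real system would likely use JSON Schema:

```python
# Tool-call validation sketch: check the model's call against a schema
# before execution and return a structured, informative rejection.

def validate_tool_call(call, schemas):
    tool = call.get("tool")
    if tool not in schemas:
        return {"valid": False,
                "error": f"unknown tool {tool!r}; "
                         f"available: {sorted(schemas)}"}
    for name, expected_type in schemas[tool].items():
        params = call.get("params", {})
        if name not in params:
            return {"valid": False, "error": f"missing parameter {name!r}"}
        if not isinstance(params[name], expected_type):
            return {"valid": False,
                    "error": f"{name!r} must be {expected_type.__name__}"}
    return {"valid": True, "error": None}
```

The error strings are deliberately informative: they go back to the model, and a model can often correct a rejected call if told exactly what was wrong.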
Output guardrails screen the agent’s final output before it’s returned or acted on. For agents that produce content consumed by users, check for policy violations or hallucinated citations. For agents that take real-world actions, validate that the proposed action is within defined operational bounds before execution.
Rate and scope limits enforce hard ceilings on the real-world impact any single session can have: maximum API calls per tool per session, maximum financial transactions per hour, maximum records modified per run. These live at the infrastructure layer — not in the prompt.
Human-in-the-Loop Checkpoints
Not every action should be auto-executed. High-stakes, irreversible, or high-uncertainty actions should route through a human approval checkpoint before execution. The classification logic — which actions are always auto-approved, always human-approved, and conditionally approved — must be enumerated explicitly in a policy configuration that the orchestration layer enforces. Don’t rely on the model to classify this.
Failure Recovery
Structured error handling at the tool layer means every tool returns a consistent failure schema with a failure type (transient, permanent, recoverable), a retry recommendation, and a human-readable explanation. The orchestration layer uses this to route failures correctly.
Loop termination conditions must be explicitly defined. The loop ends when: the goal is achieved, a stopping condition fires (maximum iterations reached, token budget exhausted, error threshold exceeded), or a terminal failure occurs. Without explicit termination logic, loops run until they hit a hard timeout — wasting resources and potentially taking partial, inconsistent actions along the way.
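A sketch of an explicit termination check, with illustrative defaults for the bounds:

```python
# Termination-check sketch covering the three explicit stop conditions:
# goal achieved, a stopping condition fired, or a terminal failure.

def should_terminate(state, max_iters=20, token_budget=100_000,
                     max_errors=5):
    """Returns a termination reason, or None to keep looping."""
    if state.get("goal_achieved"):
        return "goal_achieved"
    if state.get("terminal_failure"):
        return "terminal_failure"
    if state["iteration"] >= max_iters:
        return "max_iterations"
    if state["tokens_used"] >= token_budget:
        return "token_budget_exhausted"
    if state["error_count"] >= max_errors:
        return "error_threshold"
    return None
```

Returning a reason rather than a bare boolean matters: the reason goes into the session trace and the audit trail, so a post-mortem can distinguish a clean finish from a budget exhaustion.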
State checkpointing persists the agent’s current state at defined intervals so that a crash mid-task can be resumed rather than restarted from scratch. The checkpoint includes: current goal, completed steps, tool results obtained, and the current context summary.
Audit trails record every action taken, every tool called, and every model decision made during a session. For agents that modify external state, the audit trail should include enough information to reverse each action — not just that an action was taken. Design the trail as if you’ll need to reconstruct and undo a session in production under time pressure. You will.
The Guiding Principles
Design memory like a thoughtful schema: persist what must survive, reconstruct what can be rebuilt cheaply, never store what shouldn’t be retained.
Design tools like clean API contracts: explicit, single-purpose, hard to misuse.
Design orchestration around failure, not the happy path: termination conditions, error classification, retry bounds, and checkpointing are not edge case concerns — they are the architecture.
Design coordination for correctness before performance: enforce write ownership structurally, enumerate dependencies explicitly, and introduce parallelism only when sequential execution has proven insufficient.
Instrument the loop, not just the edges: the inputs and final outputs of a session tell you almost nothing about why it failed. The per-iteration trace — token trajectory, tool selection, latency attribution, error classification — tells you everything.
The model gets the hype. Memory, tools, orchestration, observability, and operations determine whether your agent becomes a reliable system or a cautionary tale.