In Part 1 we covered what an AI agent is, how it differs from chatbots and copilots, and how to match autonomy level to the task. This post goes deeper — into the plumbing that actually determines whether an agent works in production.
Most “my agent broke” investigations don’t end at the model. They end in memory design, tool scope, orchestration logic, or missing observability. That’s what this post is about.
Orchestration Patterns: The Architectures That Drive Agent Behavior
Orchestration is not a single technique. It’s a family of patterns — each with distinct trade-offs in reliability, cost, latency, and debuggability. Choosing the right one is a foundational architectural decision.
ReAct (Reason + Act)
ReAct is the most widely deployed orchestration pattern. At each loop iteration, the model generates a thought (explicit reasoning about what to do next), selects an action (a tool call with parameters), observes the result, and updates its context before the next iteration.
┌──────────┐ ┌──────────┐ ┌─────────────┐
│ Thought │────▶│ Action │────▶│ Observation │
│ │ │ │ │ │
│ Reason │ │ Tool call│ │ Result + │
│ about │ │ with │ │ context │
│ next step│ │ params │ │ update │
└──────────┘ └──────────┘ └──────┬──────┘
▲ │
└──────────────────────────────────┘
repeat until terminal
The tight thought-action-observation cycle makes the model’s reasoning auditable — you can trace exactly why it chose each tool call. This is its primary advantage for debugging.
The trade-off: ReAct is expensive. Every iteration requires a full model call. Long tasks accumulate latency and token cost linearly. It also assumes the model can plan one step ahead effectively. When tasks require reasoning ten iterations out, single-step reasoning degrades and the agent begins making locally reasonable but globally suboptimal decisions.
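In code, the loop itself is small; everything hard lives in the pieces around it. A minimal sketch, where `call_model` and the `tools` dict are hypothetical stand-ins for your model client and tool registry:

```python
# Minimal ReAct loop sketch. `call_model` and `tools` are hypothetical
# stand-ins for a real model client and tool registry.

def react_loop(goal, call_model, tools, max_iters=10):
    """Run thought -> action -> observation until the model signals done."""
    context = [f"Goal: {goal}"]
    for _ in range(max_iters):
        # One full model call per iteration: this is where ReAct's
        # cost accumulates linearly with task length.
        step = call_model("\n".join(context))
        context.append(f"Thought: {step['thought']}")
        if step["action"] == "finish":  # terminal condition
            return step.get("answer"), context
        result = tools[step["action"]](**step.get("params", {}))
        context.append(f"Observation: {result}")  # feed result back
    raise RuntimeError("max iterations reached without a terminal action")
```

Note that the terminal condition and the iteration cap both live in the orchestration layer, not in the prompt.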
Chain-of-Thought (CoT) Planning
CoT separates planning from execution. The model produces an explicit multi-step plan before taking any action. Orchestration then executes that plan sequentially, feeding results back as observations.
┌─────────────────────────────────┐
│ Planning phase │
│ │
│ Model generates full plan │
│ before any tool is called │
│ │
│ Step 1: fetch customer record │
│ Step 2: check order history │
│ Step 3: calculate refund │
│ Step 4: send confirmation │
└────────────────┬────────────────┘
│ optional: human review here
▼
┌─────────────────────────────────┐
│ Execution phase │
│ │
│ Orchestration executes steps │
│ sequentially, feeds results │
│ back as observations │
└─────────────────────────────────┘
The advantage: planning upfront reduces model calls during execution, lowers cost, and creates a natural checkpoint for human review before any action fires.
The limitation: the upfront plan is static. If early tool calls return unexpected results, a pure CoT agent has no mechanism to revise mid-execution. In dynamic environments — where APIs fail, data is missing, or results differ from expectations — rigid plans break down. The fix is hybrid orchestration: plan upfront, but re-enter a ReAct loop whenever observations deviate significantly from plan assumptions.
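A minimal sketch of that hybrid, assuming hypothetical `make_plan`, `execute_step`, and `expectation_met` hooks supplied by your orchestration layer:

```python
# Hybrid orchestration sketch: execute a static plan step by step, but
# drop back into planning whenever an observation violates expectations.
# `make_plan`, `execute_step`, and `expectation_met` are hypothetical hooks.

def run_with_replanning(goal, make_plan, execute_step, expectation_met,
                        max_replans=3):
    plan = make_plan(goal, observations=[])
    observations = []
    replans = 0
    while plan:
        step = plan.pop(0)
        obs = execute_step(step)
        observations.append(obs)
        if not expectation_met(step, obs):
            if replans >= max_replans:
                raise RuntimeError("plan kept failing; escalate to a human")
            replans += 1
            # Re-enter planning with everything observed so far.
            plan = make_plan(goal, observations=observations)
    return observations
```

The `max_replans` bound matters: without it, a persistently surprising environment turns replanning into its own infinite loop.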
Hierarchical Planning
For complex, long-horizon tasks, flat orchestration breaks down. Hierarchical planning introduces two levels: a high-level planner that decomposes the goal into sub-goals, and sub-agents or lower-level orchestrators that execute each sub-goal independently.
┌─────────────────────────────────────────┐
│ High-level planner │
│ │
│ Decomposes goal into sub-goals. │
│ Operates over abstract objectives. │
│ Does not track tool-level details. │
└────────┬──────────────┬─────────────────┘
│ │
▼ ▼
┌────────────┐ ┌────────────┐
│ Sub-agent │ │ Sub-agent │
│ A │ │ B │
│ │ │ │
│ Specialized│ │ Specialized│
│ tools and │ │ tools and │
│ context │ │ context │
└────────────┘ └────────────┘
│ │
└──────┬───────┘
▼
Results composed back
by the high-level planner
This separation keeps context windows lean at both levels — the planner doesn’t need tool-level details, and sub-agents don’t need full task context. The trade-off is coordination complexity. Debugging hierarchical systems requires distributed tracing that crosses agent boundaries. Failures can originate at any level and manifest at another.
Selecting an Orchestration Pattern
┌──────────────────────────────────────────────────────────────────┐
│ Decision guide │
├──────────────────────────────────────────────────────────────────┤
│ Task has unpredictable step sequence → ReAct │
│ │
│ Task structure is known and stable, → CoT with │
│ or you need a human approval checkpoint plan validation │
│ before execution begins │
│ │
│ Task exceeds reliable planning horizon → Hierarchical │
│ of a single model, or sub-tasks need planning │
│ domain-specific context that shouldn't │
│ bleed across a monolithic window │
└──────────────────────────────────────────────────────────────────┘
In practice, most production systems are hybrids. The outer loop uses hierarchical decomposition; individual sub-tasks use ReAct; high-stakes sub-tasks inject a CoT plan-then-confirm step before execution.
Memory Architecture: What the Agent Knows and for How Long
Poor memory design is the leading cause of agent unreliability in production. The failure modes are subtle — the agent appears to work, then silently loses track of context, repeats completed steps, or contradicts earlier decisions. By the time the bug is visible, it’s several iterations into a corrupted state.
The design question isn’t “Do I need memory?” It’s: what must survive across runs, what can be reconstructed cheaply, and what should never be stored?
Short-Term / Working Memory: The Context Window
Working memory is the context window — the content assembled for the current model call. It holds the system prompt, current goal, active tool schemas, prior tool results, reasoning steps, and recent observations. It is fast, always available, and finite.
The architectural risk is context degradation: as the window fills over a long task, early content — including the original goal — competes with noise. The model doesn’t forget in a hard, detectable way. It degrades softly: worse decisions, goal drift, rising hallucination rates. You will not see an error. You’ll see subtly wrong outputs.
Token usage across a session
─────────────────────────────────────────────
Iter Context tokens % of window
─────────────────────────────────────────────
1 2,260 14% █░░░░░░░░░
2 2,440 16% ██░░░░░░░░
3 4,890 31% ███░░░░░░░
4 7,120 45% ████░░░░░░
5 9,440 59% █████░░░░░ ← alert threshold
6 3,200 20% ██░░░░░░░░ ← summarization fired
7 5,100 32% ███░░░░░░░
─────────────────────────────────────────────
Alert at 60%. Critical at 80%.
By 90%, degradation is already occurring.
Three mitigations address this:
Periodic summarization compresses accumulated reasoning and observations into a dense summary that replaces the raw content. Summarize what happened and what was decided, but keep the most recent tool results verbatim — they are the most decision-relevant content.
Smart eviction tracks which context elements are still decision-relevant and removes those that aren’t. Tool results from five iterations ago rarely need to be verbatim if a summary captures their outcome. This requires tagging context elements at insertion time.
Chunked execution breaks long tasks into sub-tasks, each with a clean context window. State is persisted externally between chunks and retrieved at the start of each. This is hierarchical orchestration applied to memory management.
Monitor token usage explicitly in your orchestration layer. Alert at 60% of the window limit — not 90%.
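A pre-iteration budget check might look like this sketch, where `summarize` is a hypothetical hook that compresses the context and returns the new token count; the 60% and 80% thresholds mirror the rules above:

```python
# Context budget monitor sketch. `summarize` is a hypothetical hook
# that compresses the context and returns the new token count.

def enforce_context_budget(context_tokens, window_limit, summarize,
                           alert=0.60, critical=0.80):
    """Run once before each model call; returns (tokens_after, events)."""
    events = []
    usage = context_tokens / window_limit
    if usage >= critical:
        # Past critical: force summarization before anything else runs.
        events.append("critical: forced summarization")
        context_tokens = summarize(context_tokens)
    elif usage >= alert:
        events.append("alert: summarization triggered")
        context_tokens = summarize(context_tokens)
    return context_tokens, events
```

The point of the sketch is placement: the check runs in the orchestration layer before every model call, not as an afterthought when the provider rejects an oversized request.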
Long-Term Memory: External Storage
Long-term memory lives outside the context window and is retrieved on demand. It subdivides into three types that serve distinct purposes.
Episodic memory stores logs of past agent runs — what was attempted, what succeeded, what failed, and why. This is the mechanism by which agents improve over time without retraining. When starting a new task, the agent retrieves relevant past episodes and uses them as few-shot context. Episodic memory is the foundation of self-improving agents and is chronically underimplemented.
Design decision: log goal, plan, key decision points, tool call summaries, and final outcome. Discard raw observation payloads after summarization. Full transcripts are expensive and noisy; final-outcome-only logging loses the reasoning that explains why outcomes occurred.
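One way to structure an episode record along those lines; the field names are illustrative, not a standard:

```python
from dataclasses import dataclass, field

# Episode record sketch following the logging rules above: keep the
# reasoning skeleton, discard raw observation payloads after summarizing.

@dataclass
class Episode:
    goal: str
    plan: list
    decisions: list = field(default_factory=list)
    tool_summaries: list = field(default_factory=list)
    outcome: str = "unknown"

    def log_tool_call(self, tool, raw_result, summarize):
        # Store only a summary; the raw payload is deliberately dropped.
        self.tool_summaries.append(
            {"tool": tool, "summary": summarize(raw_result)})
```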
Semantic memory stores factual knowledge and business data — the ground truth the agent needs to answer questions accurately. In production, this is implemented via RAG: a vector store the agent queries via a tool, returning relevant chunks inserted into the context window on demand.
Naive top-k vector similarity retrieval works for simple factual lookups but degrades on complex queries. More robust approaches use hybrid retrieval (combining vector similarity with BM25 keyword search), query decomposition (breaking a complex query into sub-queries before retrieving), and re-ranking (using a second model to score retrieved chunks for relevance before insertion). Each adds latency; the question is whether retrieval quality justifies it for your use case.
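One simple, widely used way to merge the vector and keyword rankings is reciprocal rank fusion; this sketch assumes each retriever returns a best-first list of document ids:

```python
# Reciprocal rank fusion: a common way to merge a vector-similarity
# ranking with a BM25 keyword ranking into one hybrid ranking.

def reciprocal_rank_fusion(rankings, k=60):
    """rankings: list of doc-id lists, each ordered best-first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            # Lower rank -> larger contribution; k damps the top ranks
            # so no single retriever dominates.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

A document ranked well by both retrievers beats one ranked first by only one of them, which is exactly the behavior you want from hybrid retrieval.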
Procedural / Coordination memory is shared state in multi-agent systems — task queues, sub-task status, intermediate results, and cross-agent signals. This is the nervous system of multi-agent coordination.
The Memory Design Principle
For every category of information the agent needs, answer:
─────────────────────────────────────────────────────────
What is its read/write frequency?
How long must it survive?
Who else needs access to it?
What is the cost of losing it mid-task?
─────────────────────────────────────────────────────────
Wrong answers produce:
Too much in context → high cost, latency, degradation
Too little persisted → broken workflow continuity
Wrong things stored → security risk
Tool Design: Where Agent Reliability Is Won or Lost
The model decides which tools to call. The orchestration layer executes them. But the tools themselves determine whether those calls succeed reliably. Poorly designed tools are the second most common production failure mode — behind memory mismanagement — and the most fixable.
The Three Tool Categories
Outbound API calls connect the agent to external systems: Slack, GitHub, internal microservices. Primary failure modes are parameter errors, transient network failures triggering retry logic that can produce duplicate actions, and authentication expiry mid-task.
Custom functions are code you own, control, and test. They are deterministic and predictable. Use them whenever predictability matters more than flexibility — tax calculations, date arithmetic, schema validation, data transformations. Custom functions are the highest-reliability tool category and should be preferred for any computation that doesn’t require external state.
Data retrieval (RAG) grounds the agent in current facts rather than training data. Every domain-specific fact the agent needs to get right should have a retrieval path. If there is no tool that retrieves it, the agent will hallucinate it.
Tool Design Principles
Single responsibility. A tool called get_customer_data that conditionally fetches orders, preferences, or account status depending on parameters is three tools disguised as one. Split it. Single-responsibility tools are easier to test, easier to mock, and dramatically easier to debug when they fail.
Explicit, typed interfaces. Every parameter should have a type, a description, and — where applicable — a constraint on valid values. Design tool interfaces as if a careful junior engineer with no context will be calling them at 3 a.m. That junior engineer is the model.
Structured, predictable output. Tools should return consistent schemas regardless of the internal code path taken. A tool that returns a dict in success cases and a string in failure cases forces the model to handle multiple output shapes — and it will handle them inconsistently. Always return a typed, structured result. Make failure states as information-dense as success states — the model needs failure details to decide what to do next.
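A sketch of such an envelope; the field names are illustrative:

```python
from dataclasses import dataclass
from typing import Any, Optional

# Tool result envelope sketch: one schema for every code path, with
# failures as information-dense as successes.

@dataclass
class ToolResult:
    status: str                        # "ok" | "error"
    payload: Any = None                # data on success, None on failure
    error_type: Optional[str] = None   # e.g. "transient", "parameter_invalid"
    detail: str = ""                   # enough context for the model to replan

def safe_lookup(db, key):
    if key not in db:
        return ToolResult(status="error", error_type="not_found",
                          detail=f"no record for key {key!r}; "
                                 "valid keys are customer ids")
    return ToolResult(status="ok", payload=db[key])
```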
Idempotency where possible. Agents retry. If your tool has side effects — writes, sends, charges — idempotency prevents those side effects from compounding on retry. Include an idempotency key in any tool that performs a write operation.
Hard usage limits. Tools that make external writes or have financial consequences should have rate limits enforced in the tool layer, not just in prompting. The model cannot be reliably instructed to self-limit. Enforce limits structurally.
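A sketch combining both structural guards at the tool layer, with a hypothetical `send` callable doing the actual write:

```python
# Structural guard sketch: idempotency and a hard per-session call limit
# enforced in the tool layer, independent of anything the model is told.

class WriteTool:
    def __init__(self, send, max_calls=5):
        self.send = send            # the callable that performs the write
        self.max_calls = max_calls  # hard ceiling, not a prompt suggestion
        self.calls = 0
        self.seen_keys = set()

    def __call__(self, idempotency_key, **params):
        if idempotency_key in self.seen_keys:
            # Retry of an already-executed write: suppress the side effect.
            return {"status": "ok", "detail": "duplicate suppressed"}
        if self.calls >= self.max_calls:
            return {"status": "error", "error_type": "rate_limited",
                    "detail": f"limit of {self.max_calls} writes reached"}
        self.seen_keys.add(idempotency_key)
        self.calls += 1
        return {"status": "ok", "detail": self.send(**params)}
```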
Common Tool Anti-Patterns
Anti-pattern What goes wrong
──────────────────────────────────────────────────────────────
God tool One call does everything. No intermediate
checkpoint. Failure is unattributable.
Under-described A tool named "search" with no description
schema of what it searches, what format queries
take, or what the output looks like.
The model guesses. Sometimes correctly.
Swallowed errors Returns {"status": "ok", "result": null}
on failure. Model interprets "ok" as
success, receives null, produces nonsense.
Missing retrieval The agent hallucinates because no tool
coverage covers a data source it needs. Audit
coverage before you ship.
──────────────────────────────────────────────────────────────
Multi-Agent Coordination: When One Agent Isn’t Enough
Single-agent architectures hit two ceilings as task complexity grows: context window capacity and specialization. Multi-agent architectures address both by decomposing tasks across specialized agents that each operate with lean, relevant context.
This is not free. Coordination introduces failure modes that don’t exist in single-agent systems.
The Orchestrator-Worker Pattern
┌─────────────────────────────────────────────────┐
│ Orchestrator │
│ │
│ Decomposes goal → delegates sub-tasks → │
│ synthesizes results → determines next steps │
└──────────┬──────────────────────┬───────────────┘
│ │
▼ ▼
┌──────────────────┐ ┌──────────────────────────┐
│ Worker A │ │ Worker B │
│ │ │ │
│ Domain-specific │ │ Domain-specific │
│ tools + context │ │ tools + context │
│ │ │ │
│ Executes sub-task│ │ Executes sub-task │
│ independently │ │ independently │
└──────────┬───────┘ └──────────────┬───────────┘
│ │
└────────────┬─────────────┘
▼
Results composed back
by the orchestrator
The design decisions that determine whether this works:
Sub-task interface design. The sub-task description must be self-contained — it cannot assume the worker has access to the orchestrator’s broader context. This is the most commonly underestimated challenge. Poorly scoped sub-task descriptions produce workers that misinterpret their assignment and return irrelevant results.
Result schema standardization. Worker agents must return results in a schema the orchestrator can reason about reliably. Define a standard result envelope: status, payload, confidence indicator if applicable, and a brief human-readable summary the orchestrator can use for planning.
Failure propagation. When a worker fails, the orchestrator needs to know whether the sub-task failed completely, partially, or produced a low-confidence result. A binary success/failure signal is insufficient for a planning agent that may have recovery options.
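A sketch of that envelope plus orchestrator-side routing; the status values and the 0.7 confidence cutoff are illustrative:

```python
from dataclasses import dataclass
from typing import Any, Optional

# Worker result envelope sketch matching the fields above: status is
# richer than success/failure so the orchestrator can plan recovery.

@dataclass
class WorkerResult:
    status: str                  # "complete" | "partial" | "failed"
    payload: Any
    confidence: Optional[float]  # None when not applicable
    summary: str                 # short human-readable line for planning

def route(result):
    # Orchestrator-side routing sketch: the recovery option depends on
    # the granularity of the failure signal.
    conf = result.confidence if result.confidence is not None else 1.0
    if result.status == "complete" and conf >= 0.7:
        return "accept"
    if result.status == "partial":
        return "retry_remaining"
    return "replan"
```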
Shared State and Write Ownership
In multi-agent systems, coordination memory is a shared persistent store that all agents can read and write. The critical design decision is write ownership. In a well-designed multi-agent system, each piece of shared state has exactly one agent responsible for writing it. Concurrent writes from multiple agents without coordination produce race conditions that exhibit as intermittent, mysterious failures in production. Use optimistic locking or task-claim mechanisms to enforce write ownership at the state layer — not at the prompt layer.
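A minimal task-claim sketch, using a lock to make the claim atomic; a database compare-and-swap plays the same role across processes:

```python
import threading

# Task-claim sketch: write ownership enforced at the state layer with an
# atomic claim, so only one agent can ever own a given sub-task.

class TaskBoard:
    def __init__(self, tasks):
        self._lock = threading.Lock()
        self._owner = {t: None for t in tasks}

    def claim(self, task, agent_id):
        """Atomically claim a task; returns False if already owned."""
        with self._lock:
            if self._owner[task] is not None:
                return False
            self._owner[task] = agent_id
            return True

    def owner(self, task):
        with self._lock:
            return self._owner[task]
```

The crucial property is that the check and the write happen inside one atomic section; a prompt instruction like "only work on unclaimed tasks" provides no such guarantee.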
When Multi-Agent Is the Wrong Choice
Multi-agent architecture is frequently over-applied. It is the right choice when the task genuinely exceeds the planning horizon of a single model, when different sub-tasks require domain-specific context that shouldn’t bleed across a monolithic window, or when parallelism is a hard latency requirement.
It is the wrong choice when coordination overhead exceeds task complexity, when a single agent with good context management would suffice, or when the team doesn’t yet have robust observability across agent boundaries. Many systems that claim to need multi-agent coordination are actually single agents with context management problems. Fix the context management first.
Observability: Tracing the Loop, Not Just the Edges
Standard application monitoring tracks requests and responses. Agent observability has to track the loop — every iteration, every tool call, every reasoning step, and the token budget at each point. A single user request can produce dozens of model calls and tool executions. You need to correlate all of them into a coherent session trace.
What to Emit at Every Loop Iteration
This is the minimum viable trace event. Emit one per iteration, structured, before the next iteration begins:
Iteration trace event
─────────────────────────────────────────────
session_id Ties all iterations together
iteration_n Which loop cycle this is
context_tokens Tokens in context at start
thought Model's reasoning output
tool_name Tool selected (or "none")
tool_params Parameters as constructed
tool_latency_ms Wall time for tool execution
tool_result_size Tokens in tool response
context_tokens_after Tokens in context after update
error null if clean; typed if not
timestamp_ms Unix ms at iteration start
The before and after context token counts are the most important pair. The delta tells you how fast the window is filling. If the window is growing faster per iteration than summarization is shrinking it, you will hit degradation before the task completes.
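A minimal emitter for this event might look like the following sketch, where `emit` stands in for whatever structured logger you use:

```python
import json
import time

# Minimal trace-event emitter matching the field list above. `emit` is a
# hypothetical stand-in for your structured logger.

def trace_iteration(emit, session_id, iteration_n, context_tokens,
                    thought, tool_name, tool_params, tool_latency_ms,
                    tool_result_size, context_tokens_after, error=None):
    event = {
        "session_id": session_id,
        "iteration_n": iteration_n,
        "context_tokens": context_tokens,
        "thought": thought,
        "tool_name": tool_name or "none",
        "tool_params": tool_params,
        "tool_latency_ms": tool_latency_ms,
        "tool_result_size": tool_result_size,
        "context_tokens_after": context_tokens_after,
        "error": error,  # null if clean; typed if not
        "timestamp_ms": int(time.time() * 1000),
    }
    emit(json.dumps(event))
    return event
```

Emit it before the next iteration begins, so a crash mid-loop still leaves a complete record of every iteration that ran.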
The Session Trace Structure
A complete session trace is a tree, not a flat log:
session: s_abc123
│
├── iteration: 1
│ ├── model_call: thought generation (320ms, 1,840 tokens in)
│ ├── tool_call: search_knowledge_base (210ms, result: 420 tokens)
│ └── context snapshot: 2,260 tokens
│
├── iteration: 2
│ ├── model_call: thought generation (290ms, 2,260 tokens in)
│ ├── tool_call: call_crm_api (850ms, result: 180 tokens)
│ └── context snapshot: 2,440 tokens
│
├── iteration: 3
│ ├── model_call: thought generation (340ms, 2,440 tokens in)
│ ├── tool_call: summarize_context (-1,100 tokens evicted)
│ └── context snapshot: 1,520 tokens ← summarization fired
│
└── iteration: 4
├── model_call: thought generation (305ms, 1,520 tokens in)
├── tool_call: send_email (120ms, result: 40 tokens)
└── terminal: goal_achieved
The context snapshot after iteration 3 shows summarization working correctly — the window dropped from 2,440 to 1,520 tokens. Without this tree structure, you can’t see that event or attribute subsequent behavior to it.
Latency Attribution
Total session latency breaks down into four buckets. You cannot optimize what you don’t attribute:
Session latency breakdown
─────────────────────────────────────────────────────
% of total
Model inference latency ████████████ 48%
Tool execution latency ████████████████ 38%
└─ call_crm_api ████████ 22%
└─ search_kb ████ 10%
└─ send_email ██ 6%
Context assembly latency ████ 9%
Orchestration overhead █ 5%
─────────────────────────────────────────────────────
Total session wall time: 4,820ms
In most production agents, tool latency dominates — not model inference. You find this only with per-tool timing. Without attribution, optimization effort lands on the wrong layer.
Error Classification
Mixing error types in a single metric makes debugging impossible. Classify every error at emission time:
Error taxonomy
──────────────────────────────────────────────────────────────
Class Type Recovery action
──────────────────────────────────────────────────────────────
Model errors
malformed_tool_call Return structured error to model,
allow one retry with correction hint
goal_drift Reinject original goal, flag for
human review if recurs
reasoning_loop Detect via repeated tool calls,
terminate and surface to operator
Tool errors
transient_failure Exponential backoff, max 3 retries
permanent_failure Surface to model as context update,
trigger replanning
parameter_invalid Return typed error with correction
schema, allow model to revise call
timeout Log latency, treat as transient,
apply retry policy
Orchestration errors
context_overflow Emergency summarization before
next model call
termination_failure Hard stop, checkpoint state, alert
state_corruption Halt immediately, do not recover,
escalate
──────────────────────────────────────────────────────────────
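The taxonomy reduces naturally to a routing table. This sketch maps each error type to its recovery action, defaulting anything unknown to the most conservative path; the action names are illustrative:

```python
# Error-routing sketch mirroring the taxonomy above: classify at
# emission time, then map each type to a recovery action.

RECOVERY = {
    # Model errors
    "malformed_tool_call": "retry_with_hint",
    "goal_drift":          "reinject_goal",
    "reasoning_loop":      "terminate",
    # Tool errors
    "transient_failure":   "backoff_retry",
    "permanent_failure":   "replan",
    "parameter_invalid":   "return_schema",
    "timeout":             "backoff_retry",
    # Orchestration errors
    "context_overflow":    "emergency_summarize",
    "termination_failure": "hard_stop",
    "state_corruption":    "halt_and_escalate",
}

def recovery_action(error_type):
    # Unknown error types get the most conservative treatment.
    return RECOVERY.get(error_type, "halt_and_escalate")
```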
Alert Rules
Alert rules
──────────────────────────────────────────────────────────────
Signal Threshold Action
──────────────────────────────────────────────────────────────
Context tokens > 60% limit Trigger summarization
Context tokens > 80% limit Force summarization,
suspend iteration
Tool error rate > 2/session Log, notify on-call
if task is high-stakes
Repeated tool call Same tool ≥ 3 Detect retry spiral,
consecutive terminate session
Session duration > P99 baseline Flag for review
Termination condition Not reached Hard stop at max
not triggered by max_iter iteration limit
State corruption Any occurrence Halt immediately
──────────────────────────────────────────────────────────────
The repeated tool call rule catches the most destructive failure mode: a retry spiral where the agent calls the same failing tool indefinitely. This burns tokens and time, and can trigger external side effects on every call.
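Detection is a few lines. This sketch implements the rule for the same tool call repeated three times consecutively:

```python
from collections import deque

# Retry-spiral detector sketch implementing the "same tool >= 3
# consecutive" rule from the alert table above.

class SpiralDetector:
    def __init__(self, limit=3):
        self.limit = limit
        # Only the last `limit` calls matter; deque drops older ones.
        self.recent = deque(maxlen=limit)

    def record(self, tool_name, params):
        """Returns True when the session should be terminated."""
        call = (tool_name, repr(sorted(params.items())))
        self.recent.append(call)
        return (len(self.recent) == self.limit
                and len(set(self.recent)) == 1)
```

Keying on the tool name plus its parameters avoids false positives when the agent legitimately calls the same tool with different arguments.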
Production Guardrails and Recovery
Guardrails: Constraining the Action Space
Guardrails are structural constraints on what the agent can do. They are enforced in the orchestration and tool layers — not in prompting — and cannot be overridden by model output.
Input guardrails validate the goal and context before the loop begins. Malformed goals, goals that reference unavailable tools, or goals that exceed defined scope should be rejected at intake — not discovered three iterations into an expensive loop.
Tool call validation checks every tool call the model generates before execution. Validate parameter types, check values against allowed ranges, and verify the requested tool is in scope. Reject malformed calls and return a structured error to the model. The model can often recover from a rejected call if the error message is informative.
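A validation sketch, assuming tool schemas are expressed as simple name-to-type maps; a real system would likely use JSON Schema:

```python
# Tool-call validation sketch: check the model's call against a schema
# before execution and return a structured, informative rejection.

def validate_tool_call(call, schemas):
    tool = call.get("tool")
    if tool not in schemas:
        return {"valid": False,
                "error": f"unknown tool {tool!r}; "
                         f"available: {sorted(schemas)}"}
    for name, expected_type in schemas[tool].items():
        params = call.get("params", {})
        if name not in params:
            return {"valid": False, "error": f"missing parameter {name!r}"}
        if not isinstance(params[name], expected_type):
            return {"valid": False,
                    "error": f"{name!r} must be {expected_type.__name__}"}
    return {"valid": True, "error": None}
```

The error strings are deliberately informative: they go back to the model, and a model can often correct a rejected call if told exactly what was wrong.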
Output guardrails screen the agent’s final output before it’s returned or acted on. For agents that produce content consumed by users, check for policy violations or hallucinated citations. For agents that take real-world actions, validate that the proposed action is within defined operational bounds before execution.
Rate and scope limits enforce hard ceilings on the real-world impact any single session can have: maximum API calls per tool per session, maximum financial transactions per hour, maximum records modified per run. These live at the infrastructure layer — not in the prompt.
Human-in-the-Loop Checkpoints
Not every action should be auto-executed. High-stakes, irreversible, or high-uncertainty actions should route through a human approval checkpoint before execution. The classification logic — which actions are always auto-approved, always human-approved, and conditionally approved — must be enumerated explicitly in a policy configuration that the orchestration layer enforces. Don’t rely on the model to classify this.
Failure Recovery
Structured error handling at the tool layer means every tool returns a consistent failure schema with a failure type (transient, permanent, recoverable), a retry recommendation, and a human-readable explanation. The orchestration layer uses this to route failures correctly.
Loop termination conditions must be explicitly defined. The loop ends when: the goal is achieved, a stopping condition fires (maximum iterations reached, token budget exhausted, error threshold exceeded), or a terminal failure occurs. Without explicit termination logic, loops run until they hit a hard timeout — wasting resources and potentially taking partial, inconsistent actions along the way.
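A sketch of an explicit termination check, with illustrative defaults for the bounds:

```python
# Termination-check sketch covering the three explicit stop conditions:
# goal achieved, a stopping condition fired, or a terminal failure.

def should_terminate(state, max_iters=20, token_budget=100_000,
                     max_errors=5):
    """Returns a termination reason, or None to keep looping."""
    if state.get("goal_achieved"):
        return "goal_achieved"
    if state.get("terminal_failure"):
        return "terminal_failure"
    if state["iteration"] >= max_iters:
        return "max_iterations"
    if state["tokens_used"] >= token_budget:
        return "token_budget_exhausted"
    if state["error_count"] >= max_errors:
        return "error_threshold"
    return None
```

Returning a reason rather than a bare boolean matters: the reason goes into the session trace and the audit trail, so a post-mortem can distinguish a clean finish from a budget exhaustion.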
State checkpointing persists the agent’s current state at defined intervals so that a crash mid-task can be resumed rather than restarted from scratch. The checkpoint includes: current goal, completed steps, tool results obtained, and the current context summary.
Audit trails record every action taken, every tool called, and every model decision made during a session. For agents that modify external state, the audit trail should include enough information to reverse each action — not just that an action was taken. Design the trail as if you’ll need to reconstruct and undo a session in production under time pressure. You will.
The Guiding Principles
Design memory like a thoughtful schema: persist what must survive, reconstruct what can be rebuilt cheaply, never store what shouldn’t be retained.
Design tools like clean API contracts: explicit, single-purpose, hard to misuse.
Design orchestration around failure, not the happy path: termination conditions, error classification, retry bounds, and checkpointing are not edge case concerns — they are the architecture.
Design coordination for correctness before performance: enforce write ownership structurally, enumerate dependencies explicitly, and introduce parallelism only when sequential execution has proven insufficient.
Instrument the loop, not just the edges: the inputs and final outputs of a session tell you almost nothing about why it failed. The per-iteration trace — token trajectory, tool selection, latency attribution, error classification — tells you everything.
The model gets the hype. Memory, tools, orchestration, observability, and operations determine whether your agent becomes a reliable system or a cautionary tale.