An agent can pass every health check, return HTTP 200 on every request, and stay within p95 latency targets — while quietly producing worse and worse output for weeks. By the time the degradation shows up in aggregate business metrics, users have already formed an opinion.
Traditional monitoring answers the wrong question. “Is it running?” is not the question that matters for agents. The question is “is it reasoning correctly?” — and that requires a fundamentally different observability stack.
This post covers what that stack looks like: the three pillars (logs, traces, metrics), what to capture at each layer, the specific failure modes each one surfaces, and the order in which to build it.
Why Infrastructure Monitoring Misses Agent Failures
A standard APM setup — uptime checks, error rates, latency histograms, resource utilization — is built for a model where correct behavior means reaching a defined endpoint without throwing an exception. For deterministic services, that’s sufficient. A function that returns the right output without errors is a function that’s working.
Agents break this model in three specific ways:
Silent quality degradation. The agent completes every task without errors. Latency is fine. Token counts are normal. But the output quality has been drifting downward for two weeks because a prompt change shifted how the model interprets ambiguous inputs. No infrastructure metric captures this.
Correct output via wrong path. The agent answered the question correctly but made seven tool calls when three were appropriate, retrieved three irrelevant documents before finding the right one, and took 4.2 seconds when 0.8 was achievable. The final answer looks fine; the execution was expensive and fragile. Infrastructure monitoring shows nothing anomalous.
Reasoning loops and dead ends. The agent gets stuck in a loop — checking the same tool repeatedly, generating intermediate reasoning that contradicts earlier steps, or failing to recognize that it has enough information to answer. Eventually it times out or produces a degraded response. The timeout shows up as latency; everything upstream is invisible.
The common thread: agent failure modes are failures of reasoning quality, not execution health. You need instrumentation that captures the decision-making process, not just its infrastructure envelope.
Pillar 1: Structured Logs
Logs are the atomic unit of agent observability. For traditional services, a request log captures the input, output, status code, and duration — enough to reconstruct what happened. For agents, you need the full decision record at every step.
What to log
Every agent execution generates a sequence of events. Each should be a structured log entry with a consistent schema:
```json
{
  "trace_id": "agt-7f3a9b2c",
  "span_id": "span-0041",
  "timestamp": "2026-04-17T14:32:01.847Z",
  "event_type": "tool_call",
  "agent_id": "research-agent-v2",
  "session_id": "sess-8821",
  "tool_name": "web_search",
  "tool_input": { "query": "Q3 earnings Apple 2025" },
  "tool_output_summary": "Retrieved 5 results, top result: ...",
  "latency_ms": 312,
  "token_cost": null,
  "status": "success",
  "reasoning_snapshot": "Need recent earnings data to answer the user's comparison question."
}
```
The reasoning_snapshot field is the one most teams omit. It captures why the agent made this call at this moment — the intermediate reasoning that led to the decision. Without it, you have a log of what the agent did but no record of why. Debugging becomes guesswork.
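A thin helper makes it easy to emit entries in this shape consistently. The sketch below is illustrative, not prescriptive: the fields mirror the schema above, while the span ID scheme and the stdout destination are assumptions.

```python
import json
import sys
import time
import uuid


def log_event(trace_id: str, event_type: str, *, reasoning=None, **fields) -> dict:
    """Build and emit one structured log entry in the schema above.

    `reasoning` carries the reasoning_snapshot, the "why" behind the step.
    Extra keyword arguments (tool_name, latency_ms, status, ...) merge in.
    """
    entry = {
        "trace_id": trace_id,
        # Random span IDs shown here; a sequential scheme works equally well.
        "span_id": f"span-{uuid.uuid4().hex[:8]}",
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S", time.gmtime()) + "Z",
        "event_type": event_type,
        "reasoning_snapshot": reasoning,
        **fields,
    }
    # One JSON object per line keeps the stream parseable by any log pipeline.
    print(json.dumps(entry), file=sys.stdout)
    return entry
```

Calling `log_event("agt-7f3a9b2c", "tool_call", reasoning="Need earnings data", tool_name="web_search", status="success")` produces one line of JSON your pipeline can index by trace_id.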
Key event types to capture:
- task_start — initial goal, session context, model version, tool set
- reasoning_step — intermediate chain-of-thought, current plan state
- tool_call — tool name, full input, output summary, latency, status
- tool_error — failure type, error message, retry attempt number
- context_update — what was added to or trimmed from the context window
- task_complete — final output, total latency, total tool calls, total token cost, completion status
- task_abort — reason, last successful step, partial output if any
Log levels and sampling
In early production, default to logging everything. Storage is cheap; debugging with insufficient logs is expensive.
At scale, log sampling becomes necessary. A reasonable strategy:
- Always log: task_start, task_complete, task_abort, tool_error
- Sample at 10–20%: reasoning_step, tool_call for successful high-volume tasks
- Always log on quality alert: if a quality monitoring signal fires on a session, retroactively promote the full trace to permanent storage
Structure your logs so this promotion is possible — store the full execution trace in a short-lived buffer (Redis with 24-hour TTL works well) and only persist to long-term storage for flagged sessions or sampling hits.
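A minimal in-memory sketch of this promote-on-flag pattern follows. A production version would back the buffer with Redis and a real TTL; the class and method names here are illustrative.

```python
import time


class TraceBuffer:
    """Short-lived buffer holding full traces; only flagged or sampled
    traces are promoted to permanent storage before the TTL expires."""

    def __init__(self, ttl_seconds: float = 24 * 3600):
        self.ttl = ttl_seconds
        # trace_id -> (expiry_timestamp, list of event dicts)
        self._buffer = {}
        # Stands in for long-term storage (S3, a warehouse table, etc.).
        self.permanent = {}

    def append(self, trace_id: str, event: dict) -> None:
        expiry, events = self._buffer.get(trace_id, (time.time() + self.ttl, []))
        events.append(event)
        self._buffer[trace_id] = (expiry, events)

    def promote(self, trace_id: str) -> bool:
        """Called when a quality alert fires: persist the full trace."""
        entry = self._buffer.get(trace_id)
        if entry is None:
            return False  # trace already expired, too late to promote
        self.permanent[trace_id] = entry[1]
        return True

    def evict_expired(self) -> None:
        now = time.time()
        self._buffer = {t: e for t, e in self._buffer.items() if e[0] > now}
```

The key design point survives the simplification: append is cheap and unconditional, while the expensive persistence decision is deferred until a quality signal justifies it.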
Pillar 2: Distributed Traces
Logs give you the events. Traces give you the story — a causally connected sequence of events that follows a single task from the initial query to the final response, across every service boundary it crosses.
For agents, traces are what let you answer: “Exactly what happened during this specific run, in what order, and what caused the final output?”
Trace structure
A trace is a tree of spans. Each span represents a named operation with a start time, duration, status, and attributes. Spans nest: the root span is the full task; child spans are individual steps; tool calls that reach external services are leaf spans.
```
[root] task: "summarize Q3 earnings for AAPL and MSFT"      4.2s
├── [span] reasoning: plan decomposition                    0.1s
├── [span] tool_call: web_search("AAPL Q3 2025")            0.3s
│   └── [span] http: api.search.example.com                 0.28s
├── [span] tool_call: web_search("MSFT Q3 2025")            0.4s
│   └── [span] http: api.search.example.com                 0.38s
├── [span] tool_call: retrieve_doc("AAPL-10Q-Q3-2025")      1.1s
│   └── [span] vector_search: doc-store                     0.09s
├── [span] reasoning: synthesis                             0.2s
└── [span] llm_call: generate final response                2.1s
```
This trace immediately reveals that the LLM generation step is consuming half the total latency. It also shows whether tool calls are running sequentially when they could run in parallel — a common optimization opportunity invisible without tracing.
Context propagation
Context propagation is what makes distributed tracing work. A unique trace_id must flow through every component the task touches: the agent runtime, every tool call, every external API, every sub-agent in a multi-agent workflow.
OpenTelemetry is the standard here. Its W3C traceparent header propagates trace context across HTTP boundaries automatically when your tools and services are instrumented. For internal function calls and LLM API calls, propagate the trace context manually through your SDK calls.
The consequence of not doing this: you can trace within the agent but not across tool boundaries, which is exactly where the interesting failures happen.
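Where automatic instrumentation is unavailable, the traceparent header itself is simple enough to construct and parse by hand. A sketch of the W3C format (a version byte, a 32-hex-char trace ID, a 16-hex-char span ID, and a flags byte), with freshly generated IDs for illustration:

```python
import re
import secrets


def make_traceparent(trace_id=None, span_id=None, sampled=True) -> str:
    """Build a W3C traceparent header: version-traceid-spanid-flags.

    trace_id is 16 random bytes (32 hex chars), span_id is 8 bytes (16 hex chars).
    """
    trace_id = trace_id or secrets.token_hex(16)
    span_id = span_id or secrets.token_hex(8)
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_traceparent(header: str) -> dict:
    """Recover trace context from an incoming request header."""
    m = re.fullmatch(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})", header)
    if m is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    version, trace_id, span_id, flags = m.groups()
    return {"version": version, "trace_id": trace_id,
            "span_id": span_id, "sampled": flags == "01"}
```

In practice you would reuse the current span's IDs rather than minting new ones; OpenTelemetry's propagator API does exactly this injection and extraction for you when it is available.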
Multi-agent trace propagation
In multi-agent systems, each agent spawns its own trace — but those traces need to be linked. When an orchestrator delegates a subtask to a worker agent, pass the parent trace ID and span ID as task metadata. The worker agent creates a new root span with parent_span_id set to the orchestrator’s span. This links the full execution graph without collapsing it into one monolithic trace.
The result: you can view the orchestrator’s trace and drill into any delegated subtask, or query by trace ID to see the full multi-agent execution path for a given user request.
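The delegation handshake can be sketched in a few lines. The Span structure and helper names below are illustrative; one common linking choice, shown here, keeps the worker in the same trace ID and attaches it via parent_span_id.

```python
from dataclasses import dataclass, field
import secrets


@dataclass
class Span:
    name: str
    trace_id: str
    span_id: str = field(default_factory=lambda: secrets.token_hex(8))
    parent_span_id: str = None


def delegate(orchestrator_span: Span, subtask: str) -> dict:
    """Package the subtask with the trace context the worker needs."""
    return {
        "subtask": subtask,
        "parent_trace_id": orchestrator_span.trace_id,
        "parent_span_id": orchestrator_span.span_id,
    }


def worker_root_span(task: dict) -> Span:
    """The worker's root span joins the orchestrator's trace via parent_span_id."""
    return Span(
        name=f"worker: {task['subtask']}",
        trace_id=task["parent_trace_id"],  # same trace -> one queryable execution graph
        parent_span_id=task["parent_span_id"],
    )
```

Because the worker reuses the orchestrator's trace ID, a single query by trace ID returns the complete multi-agent execution path.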
Pillar 3: Metrics
Metrics aggregate individual events into signals you can monitor, alert on, and trend over time. For agents, the metrics that matter form a four-level hierarchy.
Level 1: Business metrics
These answer “is the agent accomplishing what it’s supposed to accomplish?” They’re defined against your specific use case, not derived from infrastructure.
| Metric | Definition |
|---|---|
| Goal completion rate | % of tasks where the agent fully completed the stated objective |
| User acceptance rate | % of outputs accepted without correction or retry |
| Task completion time | Wall-clock time from task start to accepted completion |
| Cost per completed task | Total token spend + external API cost per successful task |
Goal completion rate is the single most important metric and the hardest to instrument. It requires a definition of “complete” that your system can evaluate automatically — either a structured output schema the agent fills, an LLM-as-judge scorer, or an explicit user acceptance signal (thumbs up, edit rate, retry rate).
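With the structured-output approach, the completion check can be as simple as verifying that every required field was filled. A sketch (the field names and the non-empty heuristic are illustrative assumptions):

```python
def goal_completed(output: dict, required_fields: list) -> bool:
    """Structured-output completion check: the task counts as complete
    only if the agent filled every required field with a non-empty value."""
    return all(output.get(f) not in (None, "", []) for f in required_fields)


def goal_completion_rate(outputs: list, required_fields: list) -> float:
    """Aggregate the per-task check into the top-level business metric."""
    if not outputs:
        return 0.0
    return sum(goal_completed(o, required_fields) for o in outputs) / len(outputs)
```

LLM-as-judge or explicit user signals slot into the same shape: replace `goal_completed` with whichever evaluator your use case supports and the aggregate metric is unchanged.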
Level 2: Quality metrics
These answer “how well is the agent reasoning?”
| Metric | Definition |
|---|---|
| Tool precision | % of tool calls that were necessary for task completion |
| Tool recall | % of tasks where the agent called all necessary tools |
| First-attempt success rate | % of tasks completed without retry or fallback |
| Reasoning coherence score | LLM-judge score on intermediate reasoning quality (sampled) |
| Hallucination rate | % of outputs containing factually incorrect claims (sampled, human or judge) |
Tool precision and recall are the leading indicators for trajectory failure — they move before final output quality degrades, giving you an early warning signal.
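Given tool calls labeled as necessary or not (by an LLM judge or human review of sampled traces), both metrics reduce to simple ratios. A sketch, with assumed input shapes:

```python
def tool_precision(calls: list) -> float:
    """Fraction of tool calls that were necessary for task completion.
    Each call dict is assumed to carry a boolean `necessary` label."""
    if not calls:
        return 0.0
    return sum(c["necessary"] for c in calls) / len(calls)


def tool_recall(tasks: list) -> float:
    """Fraction of tasks where every required tool was actually called.
    Each task dict is assumed to list `required_tools` and `called_tools`."""
    if not tasks:
        return 0.0
    covered = sum(set(t["required_tools"]) <= set(t["called_tools"]) for t in tasks)
    return covered / len(tasks)
```

The labeling is the expensive part, which is why these run on sampled traces; the arithmetic itself belongs in your metrics pipeline.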
Level 3: Operational metrics
These answer “is the agent behaving within expected operational parameters?”
| Metric | Definition |
|---|---|
| Average tool calls per task | Total tool calls / completed tasks. Baseline this at launch; deviations signal trajectory drift. |
| Context window utilization | Average % of context window used at task completion. High utilization predicts truncation failures. |
| Loop iteration count | Average ReAct iterations per task. Spikes indicate reasoning loops or stuck states. |
| Tool error rate by type | Errors broken down by tool and error class. Identifies specific tool reliability issues. |
| Retry rate | % of tool calls that required retry. Elevated rates indicate external API instability. |
Level 4: Infrastructure metrics
Standard service metrics — latency p50/p95/p99, token throughput, model API error rate, queue depth. These are necessary but not sufficient on their own.
One important addition specific to LLMs: decomposed latency. Total response time breaks into:
- Time to first token (TTFT): Model inference start latency. Dominated by prompt processing time.
- Inter-token latency (ITL): Throughput during generation. Affected by model size and serving infrastructure.
- Tool execution time: Latency attributable to external calls. Separate this from model latency or you can’t diagnose the actual bottleneck.
Most production agents have a counterintuitive latency profile: the model itself is fast; the bottleneck is tool calls. Without decomposed latency, you’d never know where to optimize.
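Given per-token timestamps from a streaming response and the tool-call spans from your trace, the decomposition is straightforward. A sketch (inputs are epoch seconds; the function name and output keys are illustrative):

```python
def decompose_latency(request_start: float, token_times: list,
                      tool_spans: list) -> dict:
    """Split one agent response into TTFT, mean inter-token latency,
    and total tool execution time.

    token_times: timestamp of each generated token, in order.
    tool_spans: (start, end) pairs for each external tool call.
    """
    ttft = token_times[0] - request_start
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0
    tool_time = sum(end - start for start, end in tool_spans)
    return {"ttft_s": ttft, "mean_itl_s": itl, "tool_time_s": tool_time}
```

Comparing tool_time_s against the token-driven components per task is usually enough to settle where the real bottleneck is.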
Alerting: What to Watch and When
Metrics without alerts are a dashboard you check after something breaks. The alerts that matter:
Goal completion rate drop > 5% vs. 7-day rolling average — Immediate investigation. This is the primary quality signal.
Tool call efficiency ratio > 1.5x baseline — The agent is making more calls than usual to complete the same tasks. Leading indicator for trajectory drift or context window problems.
Context window utilization > 80% — Truncation is imminent. Review the tasks hitting this threshold; they’re the ones most likely to fail unpredictably.
Loop iteration p95 > 2x baseline — Reasoning loops are forming. Something in the execution path is causing the agent to re-examine decisions it already made.
LLM-judge quality score 7-day trend < -0.1 — Gradual quality drift. Not a spike, but a persistent downward movement that will compound if not addressed.
Cost per task > 150% of budget — Either a task complexity spike or a trajectory failure causing excessive tool calls.
Set these as relative thresholds against rolling baselines, not absolute values. Absolute thresholds go stale as task mix evolves; relative thresholds adapt automatically.
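A relative threshold needs only a rolling window of recent values. A minimal sketch (the window size and ratio are the tunables; the class name is illustrative):

```python
from collections import deque


class RelativeThresholdAlert:
    """Fires when the current value exceeds the rolling baseline
    by more than `ratio` (e.g. 1.5 means 50% above baseline)."""

    def __init__(self, window: int, ratio: float):
        self.history = deque(maxlen=window)
        self.ratio = ratio

    def observe(self, value: float) -> bool:
        # Compute the baseline before adding the new value, so a spike
        # cannot dilute the baseline it is being compared against.
        baseline = sum(self.history) / len(self.history) if self.history else None
        self.history.append(value)
        if baseline is None:
            return False  # not enough history to compare against yet
        return value > baseline * self.ratio
```

The same structure works for every alert above; only the observed metric and the ratio change.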
The Reasoning Visibility Problem
The hardest part of agent observability is capturing why the agent made a decision, not just what it decided.
LLMs don’t expose their internal reasoning as structured data — they generate natural language. Capturing useful reasoning visibility requires deliberate prompt design: instruct the model to externalize its reasoning at key decision points in a structured format before taking action.
A ReAct-style prompt already does this partially — each thought step is a reasoning trace. Make it more useful by asking for explicit structure:
```
Before calling any tool, output:

REASONING: [why this tool, why now, what you expect to learn]
TOOL: [tool name]
INPUT: [parameters]
```
This gives you log-parseable reasoning snapshots at every tool call — not a perfect window into model internals, but enough to answer “why did it do that?” for the majority of cases.
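Parsing that structure out of model output takes little code. A sketch, assuming the REASONING/TOOL/INPUT template above with single-line TOOL and INPUT values:

```python
import re

BLOCK_RE = re.compile(
    r"REASONING:\s*(?P<reasoning>.+?)\s*"
    r"TOOL:\s*(?P<tool>.+?)\s*"
    r"INPUT:\s*(?P<input>.+?)\s*$",
    re.DOTALL | re.MULTILINE,
)


def parse_reasoning_block(text: str):
    """Extract the REASONING/TOOL/INPUT block from model output so the
    reasoning can be logged as a reasoning_snapshot alongside the call.
    Returns None when the model skipped the template."""
    m = BLOCK_RE.search(text)
    if m is None:
        return None
    return {k: v.strip() for k, v in m.groupdict().items()}
```

The None branch matters operationally: a rising rate of unparseable blocks is itself a signal that the prompt's structure instruction is being ignored.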
For tasks where reasoning quality is particularly important (high-stakes decisions, multi-step workflows), run a post-execution reflection prompt: ask the model to review its own trace and identify any steps where a different decision would have been better. Store the reflection as a semantic memory entry and log it against the trace ID.
The Tooling Landscape
You don’t need to build this from scratch. The observability tooling for LLM applications has matured significantly:
Tracing and logging: OpenTelemetry is the foundation — instrument your agent with the OTel SDK, export to your existing backend (Jaeger, Tempo, Datadog, Honeycomb). LangSmith and Langfuse provide purpose-built tracing UIs with LLM-specific span attributes, token cost tracking, and built-in LLM-as-judge evaluation. Helicone operates as a proxy layer with minimal integration cost.
Metrics and dashboards: Standard Prometheus + Grafana works if you define agent-specific metrics properly. Both LangSmith and Langfuse expose metric APIs you can feed into existing dashboards.
LLM-as-judge pipelines: RAGAS provides evaluation metrics for retrieval quality. For general quality scoring, a judge prompt running against sampled outputs via the Anthropic or OpenAI API with structured output schemas integrates cleanly into any pipeline.
The choice between purpose-built LLM observability tools and extending your existing stack depends on your team’s infrastructure preference. Purpose-built tools are faster to get value from; extending your existing stack keeps your observability unified. For most teams starting out, purpose-built wins on time-to-insight.
What to Build First
The order matters. This is the sequence that gets you to production-ready observability fastest:
Day 1: Structured logging on every tool call. Log the full request-response cycle with trace ID, tool name, input, output summary, latency, and status. This is the minimum needed to debug anything.
Day 2–3: End-to-end tracing. Wire up OpenTelemetry or a purpose-built tool. Every task gets a trace ID that propagates through the full execution path. You can now reconstruct any execution after the fact.
Week 1: Business metrics defined and instrumented. What does task completion look like for your specific use case? Define it, build the instrumentation, baseline it before launch. You need a starting line.
Week 2: Quality metrics and alerting. Tool precision/recall, loop iteration count, context window utilization. Alerts on the thresholds above. This is what catches silent degradation before it compounds.
Ongoing: Reasoning snapshots. Prompt the model to externalize its reasoning at decision points. These make debugging dramatically faster and are cheap to add once the logging infrastructure is in place.
The feedback loop this enables: production monitoring surfaces an anomaly → trace pinpoints the execution step → log captures the reasoning at that step → you understand exactly what went wrong → you fix it and add a regression test. Without all three layers, any step in that chain becomes guesswork.
The Observability Gap Is an Operational Risk
Teams that skip agent-specific observability aren’t just flying blind on quality — they’re accumulating operational debt that compounds with every deployment. Each change to a prompt, tool definition, or model version is a potential quality regression with no detection mechanism.
The investment to instrument properly is a few days of engineering work. The cost of not doing it is measured in production incidents you can’t diagnose, quality degradation you don’t notice until users leave, and deployments you’re afraid to make because you don’t know what they’ll break.
Observe first. Build confidence second. Deploy without fear third.
Part of the AI Agents in Production series.