Chapter 44: Evaluating and Observing Agents
Chapter 25 taught you to evaluate models; agents are harder still. A model produces one output you can judge, but an agent takes many steps — reasoning, calling tools, observing results — and any of them can go wrong. To know whether an agent works, you must be able to *see* what it did and *judge* the whole path it took, not just its final answer. This chapter covers observability (seeing the agent's steps) and evaluation (judging its behavior), the twin disciplines that turn a black box into something you can trust and improve. For agents that take real actions, this is not optional.
Why Agents Are Hard to Evaluate
Evaluating a model was already hard (Chapter 25) because language tasks rarely have one right answer. Agents pile on more difficulty. An agent solves a problem over many steps, and a failure at any step can derail the whole task. Success depends on the entire trajectory — the sequence of thoughts, tool calls, and observations — not just the final output. Agents are non-deterministic, so the same task can play out differently each run. And they involve tools, each of which can fail in its own way. Judging "was the final answer right?" is not enough; you need to understand how the agent got there.
Observability: Seeing What the Agent Did
The foundation of everything in this chapter is a simple truth: you cannot fix what you cannot see. Observability means recording what the agent does at each step so you can inspect its full trajectory afterward. The core technique is tracing — logging every loop iteration: the model's reasoning, which tool it called, with what arguments, and what result came back. A complete trace is the single most valuable debugging tool you have for agents, turning an opaque process into a readable story of what happened.
Step-Level Tracing
Adding tracing to an agent is straightforward — you record each step as the loop runs. Here is the Chapter 31 loop with tracing added.

    def run_agent_traced(goal, tools, max_steps=10):
        history = [{"role": "user", "content": goal}]
        trace = []  # record every step
        for step in range(max_steps):
            decision = model_decide(history, tools)
            if decision.is_final_answer:
                trace.append({"step": step, "action": "final", "text": decision.text})
                return decision.text, trace
            result = run_tool(decision.tool, decision.args)
            trace.append({"step": step, "tool": decision.tool,
                          "args": decision.args, "result": result})  # log it
            history.append({"role": "tool", "content": result})
        return "Stopped: step limit.", trace

Now, after any run, you can read the trace step by step and see exactly what the agent did: which tools it called, what it passed, what it got back, and where things went wrong. When an agent misbehaves, the trace usually makes the cause obvious.
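Reading a raw list of dictionaries gets tedious once a run has more than a few steps. As a minimal sketch, assuming trace entries have the same keys that `run_agent_traced` records, a small formatter can print one line per step. The name `format_trace`, the `calculator` tool, and the sample trace are hypothetical and exist only for illustration.

```python
def format_trace(trace):
    """Render a trace (a list of step dicts) as one readable line per step."""
    lines = []
    for entry in trace:
        if entry.get("action") == "final":
            # Final-answer entries carry the answer text instead of a tool call.
            lines.append(f"[{entry['step']}] FINAL: {entry['text']}")
        else:
            lines.append(
                f"[{entry['step']}] {entry['tool']}({entry['args']!r}) -> {entry['result']!r}"
            )
    return "\n".join(lines)


# Hypothetical trace, shaped like the entries run_agent_traced appends.
example_trace = [
    {"step": 0, "tool": "calculator", "args": {"expr": "0.15 * 240"}, "result": "36.0"},
    {"step": 1, "action": "final", "text": "15% of 240 is 36."},
]

print(format_trace(example_trace))
```

Printing the formatted trace next to a failed eval case is usually enough to see where the run went off course.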
Evaluating Trajectories, Not Just Outputs
Because an agent's quality lives in its whole path, you must evaluate the trajectory, not only the final answer. Ask: did it take sensible steps in a sensible order? Did it choose the right tools and call them correctly? Did it avoid getting stuck in loops? Did it actually use its observations, or ignore them? Did it reach the goal efficiently, or wander? An agent that stumbles to a correct answer through luck and waste is worse than it looks; an agent that reasons cleanly but hits a tool failure may be better than its output suggests. Judge the journey, not just the destination.
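Some of these questions can be checked mechanically against a trace. The sketch below assumes the trace format from `run_agent_traced` and that each eval task comes with a list of tools you expect the agent to use, in order, plus a step budget. The names `review_trajectory`, `expected_tools`, and `step_budget` are illustrative assumptions, not part of any library.

```python
def is_subsequence(expected, actual):
    """True if the items of `expected` appear in `actual` in the same order."""
    remaining = iter(actual)
    # `item in remaining` consumes the iterator, so order is enforced.
    return all(item in remaining for item in expected)


def review_trajectory(trace, expected_tools, step_budget):
    """Summarize how a run compares with the path you expected it to take."""
    tools_called = [entry["tool"] for entry in trace if "tool" in entry]
    return {
        "steps": len(trace),
        "within_budget": len(trace) <= step_budget,
        "used_expected_tools": is_subsequence(expected_tools, tools_called),
        "unexpected_tools": sorted(set(tools_called) - set(expected_tools)),
    }


# Hypothetical run: a correct answer, but with a wasted duplicate search.
trace = [
    {"step": 0, "tool": "search", "args": {"q": "Tokyo population"}, "result": "about 14 million"},
    {"step": 1, "tool": "search", "args": {"q": "Tokyo population"}, "result": "about 14 million"},
    {"step": 2, "action": "final", "text": "About 14 million."},
]
review = review_trajectory(trace, expected_tools=["search"], step_budget=2)
print(review)
```

Here the final answer would pass an output-only check, while the review shows the run exceeded its step budget. Checks like these cover only the mechanical questions; whether the agent actually used its observations still needs a human reader or a separate judge.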
Metrics for Agents
A few measures capture agent quality, complementing the model metrics of Chapter 25.
- Success rate — across a set of tasks, how often did the agent actually achieve the goal? The headline number.
- Efficiency — how many steps, how much cost, and how much time did it take? An agent that succeeds in three steps beats one that takes thirty.
- Tool-use accuracy — did it pick the right tools and call them with correct arguments?
- Failure analysis — when it failed, why? Categorizing failures (wrong tool, bad reasoning, loop, tool error) is how you know what to fix.
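The four measures above can be computed from per-run records. This is a sketch under stated assumptions: each record is a dictionary with the hypothetical keys `success`, `steps`, `tool_calls`, `correct_tool_calls`, and `failure`, which you would fill in from traces and manual review. The sample numbers are made up for illustration.

```python
from collections import Counter


def summarize_runs(runs):
    """Aggregate per-run records into success rate, efficiency,
    tool-use accuracy, and a count of failure categories."""
    if not runs:
        raise ValueError("need at least one run to summarize")
    total = len(runs)
    total_calls = sum(r["tool_calls"] for r in runs)
    return {
        "success_rate": sum(r["success"] for r in runs) / total,
        "mean_steps": sum(r["steps"] for r in runs) / total,
        # None when no run made a tool call, to avoid dividing by zero.
        "tool_use_accuracy": (
            sum(r["correct_tool_calls"] for r in runs) / total_calls
            if total_calls else None
        ),
        "failures": Counter(r["failure"] for r in runs if not r["success"]),
    }


# Illustrative records only; the values are invented.
runs = [
    {"success": True, "steps": 3, "tool_calls": 2, "correct_tool_calls": 2, "failure": None},
    {"success": False, "steps": 10, "tool_calls": 9, "correct_tool_calls": 3, "failure": "loop"},
    {"success": True, "steps": 5, "tool_calls": 4, "correct_tool_calls": 4, "failure": None},
    {"success": False, "steps": 2, "tool_calls": 1, "correct_tool_calls": 0, "failure": "wrong tool"},
]
summary = summarize_runs(runs)
print(summary)
```

Step count stands in for efficiency here; cost and wall-clock time can be averaged the same way if you record them per run.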
Building an Agent Eval Set
The most valuable habit from Chapter 25 applies directly: build your own evaluation set of representative tasks with clear success criteria, run the agent against them, and measure. For agents, each eval case is a goal plus a way to check whether the agent achieved it.

    eval_tasks = [
        {"goal": "Find the current population of Tokyo.",
         "check": lambda answer: "million" in answer.lower()},
        {"goal": "Calculate 15% of 240.",
         "check": lambda answer: "36" in answer},
    ]

    def evaluate_agent(agent, tasks):
        passed = 0
        for task in tasks:
            answer, trace = agent(task["goal"])
            ok = task["check"](answer)
            passed += ok
            if not ok:
                print("FAILED:", task["goal"], "-- inspect the trace:", trace)
        print(f"Passed {passed} of {len(tasks)}")

Common Agent Failures (and Spotting Them in Traces)
Agents fail in characteristic ways, and a trace reveals each one.
- Loops: the trace shows the same action repeated (Chapter 32); fix with step limits and loop detection.
- Wrong tool or bad arguments: the trace shows a mismatched call; fix the tool descriptions (Chapter 33).
- Ignoring observations: the agent's reasoning does not reflect what a tool returned.
- Giving up too early: it stops before the goal.
- Hallucinating instead of retrieving: it invents facts rather than using a tool.
Reading traces trains your eye to recognize these patterns quickly, which is the heart of agent debugging.
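A few of these patterns are regular enough to flag automatically before you read the trace yourself. The following is a rough sketch, assuming the trace format from `run_agent_traced`; the function names, the threshold of three repeats, and the flag labels are all assumptions chosen for illustration. The "answered without tools" flag is only a heuristic, since some tasks legitimately need no tool.

```python
import json


def find_repeated_calls(trace, threshold=3):
    """Return (tool, args) calls made at least `threshold` times in a row."""
    repeated = []
    run_key, run_length = None, 0
    for entry in trace:
        if "tool" not in entry:
            run_key, run_length = None, 0
            continue
        # Serialize args so dicts can be compared regardless of key order.
        key = (entry["tool"], json.dumps(entry["args"], sort_keys=True, default=str))
        if key == run_key:
            run_length += 1
        else:
            run_key, run_length = key, 1
        if run_length == threshold:
            repeated.append(key)
    return repeated


def flag_failures(trace, loop_threshold=3):
    """Attach coarse failure labels to a trace as a starting point for review."""
    flags = []
    if find_repeated_calls(trace, loop_threshold):
        flags.append("loop")
    made_tool_call = any("tool" in entry for entry in trace)
    finished = any(entry.get("action") == "final" for entry in trace)
    if finished and not made_tool_call:
        flags.append("answered without tools")  # possible hallucination
    if not finished:
        flags.append("no final answer")  # e.g. the step limit was reached
    return flags


# Hypothetical traces for illustration.
looping = [{"step": i, "tool": "search", "args": {"q": "Tokyo"}, "result": "..."} for i in range(3)]
no_tools = [{"step": 0, "action": "final", "text": "Tokyo has 50 million people."}]
print(flag_failures(looping))
print(flag_failures(no_tools))
```

Ignored observations and premature stopping are harder to detect with rules like these, because they depend on what the tool results meant; those still call for reading the trace.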
Continuous Evaluation
As with models (Chapter 25), evaluation is not a one-time gate but an ongoing practice. Agents are fragile to change — adjusting a prompt, swapping a model, or editing a tool can silently break behavior that used to work. Re-run your eval set after every meaningful change to catch these regressions before they reach users. A standing agent eval set is your safety net against the quiet breakage that creeps in as a system evolves.
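One way to make regressions visible is to store per-task results from a known-good run and compare each new run against them. This sketch assumes you collect results as a dictionary mapping each goal to a pass/fail boolean, which is a small extension of the printed summary used earlier in the chapter. The name `find_regressions` and the sample results are hypothetical.

```python
def find_regressions(baseline, current):
    """Return goals that passed in `baseline` but fail or are missing in `current`."""
    return sorted(
        goal
        for goal, passed in baseline.items()
        if passed and not current.get(goal, False)
    )


# Illustrative results: before and after a prompt change.
baseline = {
    "Find the current population of Tokyo.": True,
    "Calculate 15% of 240.": True,
}
current = {
    "Find the current population of Tokyo.": True,
    "Calculate 15% of 240.": False,
}
regressions = find_regressions(baseline, current)
print(regressions)
```

Because agents are non-deterministic, a single failing run may be noise; running each task several times before declaring a regression is a reasonable precaution.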
The Verification Theme, Again
This chapter is the verification spine of the book applied to agents — and for agents the stakes are highest, because they act. A model that is wrong says something incorrect; an agent that is wrong does something incorrect. Before you trust an agent with real actions, you must be able to observe its behavior and evaluate that it behaves correctly. Observability plus evaluation is the verification of agents, and it is what stands between an impressive demo and a system you can responsibly deploy.
Summary
Agents are hard to evaluate because they take many fallible steps, their success depends on the whole trajectory, they are non-deterministic, and they involve tools. Observability — recording each step through tracing — is the foundation, since you cannot fix what you cannot see, and a complete trace turns an opaque agent into a readable story. You evaluate the trajectory, not just the final output, using metrics like success rate, efficiency, tool-use accuracy, and failure analysis, measured against your own eval set of representative tasks. Common failures (loops, wrong tools, ignored observations) reveal themselves in traces, and evaluation must be continuous to catch the regressions that changes introduce. For agents that act, this observe-and-evaluate discipline is the verification that makes them trustworthy.
Seeing and judging an agent's behavior naturally raises the question of keeping it safe. Chapter 45 confronts the safety and security of agents head-on — including prompt injection, the defining threat of systems that act on untrusted information.
Exercises
1. Explain why agents are harder to evaluate than single model outputs. List at least three reasons specific to the multi-step, tool-using nature of agents.
2. Add step-level tracing to an agent loop and run it on a task. Read the resulting trace and describe, step by step, what the agent did.
3. Explain the difference between evaluating an agent's final output and evaluating its trajectory. Give an example where the final answer is right but the trajectory is poor.
4. Define three metrics for judging whether an agent succeeded, and explain what each one tells you that the others do not.
5. Build a small agent eval set with at least three tasks, each with a clear success check. Run an agent against it and interpret the results.
6. Take a trace from a failed agent run (real or imagined) and diagnose the cause using the common-failures list. Explain how the trace revealed the problem.
