Part 5 · Training and Fine-Tuning Language Models

Chapter 25 · Evaluating Models: Benchmarks, Metrics, and Pitfalls

We have now built a model from the ground up — pretrained, fine-tuned, instruction-tuned, and aligned. But a question has been lurking under every chapter of this part: how do you actually *know* it is any good? Evaluation is one of the most underrated skills in all of AI, and one of the easiest to get wrong. This chapter, closing Part V, covers how models are evaluated — benchmarks, metrics, model judges, and humans — and, just as importantly, the many ways evaluation can quietly mislead you. A model is only ever as trustworthy as the evaluation that vouches for it.

Why Evaluation Is Hard for Language Models

Evaluating a calculator is easy: there is one right answer, and you check against it. Evaluating a language model is much harder, because most language tasks have no single right answer. There are countless good ways to summarize an article or answer a question, and "good" is multi-dimensional — accurate, clear, appropriately detailed, well-toned — and partly subjective. You cannot simply mark generated text right or wrong. This difficulty is why evaluation deserves a whole chapter, and why so many people do it badly.

Benchmarks: Standardized Tests for Models

A benchmark is a standardized test for models: a fixed set of questions or tasks with known answers, used to compare different models on the same footing. There are benchmarks for factual knowledge, for reasoning, for math, for coding, for reading comprehension, and much more. When you see one model described as outperforming another, a benchmark is usually behind that claim. Benchmarks are genuinely useful for comparison — they give the field a common yardstick. But that yardstick can deceive you in several important ways.

The Trouble with Benchmarks

Three problems make benchmark scores far less trustworthy than they appear.

  • Contamination. If a benchmark's questions and answers leaked into the model's training data — easy to imagine, since benchmarks live on the public web — the model may simply be reciting answers it memorized, not reasoning. This is the test-set contamination of Chapter 16, and it inflates scores into meaninglessness.
  • Overfitting to the benchmark. When everyone optimizes for the same test, models can become good at the test rather than at the underlying ability it was meant to measure. This is an instance of a famous principle, Goodhart's law: when a measure becomes a target, it stops being a good measure.
  • Narrowness. A benchmark measures one slice of ability. A high score on a coding benchmark tells you little about how the model handles your customer-support tone, and a strong general score says nothing specific about your task.
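Of the three problems, contamination is the one most directly open to a mechanical check. The sketch below flags benchmark items that share a long run of consecutive words with any training document. The function names, the whitespace tokenization, and the choice of n-gram length are illustrative assumptions rather than a standard tool, and a real check would need to scan a far larger corpus and also catch paraphrased leaks.

```python
def ngrams(text, n):
    """Return the set of word n-grams in text (lowercased, split on whitespace)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def contamination_rate(benchmark_items, training_docs, n=8):
    """Return (fraction flagged, flagged items).

    An item is flagged when it shares at least one n-gram with the training docs.
    """
    if not benchmark_items:
        raise ValueError("benchmark_items must not be empty")
    train_grams = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    flagged = [item for item in benchmark_items if ngrams(item, n) & train_grams]
    return len(flagged) / len(benchmark_items), flagged


# Toy data, invented for illustration only.
training_docs = [
    "forum post what is the capital of france answer paris obviously",
    "a recipe for bread with flour water and salt",
]
benchmark_items = [
    "What is the capital of France",
    "How many legs does a spider have",
]

# A short n suits these tiny strings; longer n-grams reduce false alarms on real text.
rate, flagged = contamination_rate(benchmark_items, training_docs, n=5)
print(f"{rate:.0%} of items flagged: {flagged}")
```

A clean result from a check like this does not prove a benchmark is uncontaminated. It only rules out verbatim leaks in the documents you were able to scan.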

Metrics: Putting a Number on Quality

Within a benchmark, you need a way to score each answer automatically, and that scoring rule is the metric. For tasks with a clear answer, exact match works: did the model produce the right label or number? For open-ended text, metrics usually measure how much a response overlaps with a reference answer, typically by counting shared words or phrases; BLEU and ROUGE are well-known examples. These automatic metrics are fast and cheap, which is their virtue. Their vice is that they are shallow: they can miss meaning entirely, rewarding text that overlaps with a reference while penalizing a differently worded answer that is actually better. A metric is a rough proxy, never the full truth of quality.
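To make the shallowness concrete, here is a minimal sketch of two such metrics: exact match and a word-overlap F1 score. The function names and the example sentences are assumptions chosen for illustration, and the whitespace tokenization is deliberately simple.

```python
from collections import Counter


def exact_match(prediction, reference):
    """True when the two strings are identical after trimming and lowercasing."""
    return prediction.strip().lower() == reference.strip().lower()


def token_f1(prediction, reference):
    """Harmonic mean of word-level precision and recall against the reference."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


reference = "The meeting was moved to Friday"
wrong_but_similar = "The meeting was moved to Monday"   # factually wrong
right_but_reworded = "They rescheduled it for Friday"   # correct paraphrase

print(exact_match("  42 ", "42"))
print(round(token_f1(wrong_but_similar, reference), 2))
print(round(token_f1(right_but_reworded, reference), 2))
```

The wrong answer scores far higher than the correct paraphrase, because the metric counts shared words and knows nothing about which word carries the meaning.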

LLM-as-Judge

A newer, increasingly common approach is LLM-as-judge: using a strong language model to evaluate another model's outputs against a rubric you provide. This scales far better than human review and can assess nuances that simple metrics miss. But the judge has its own biases: it may favor longer answers or responses written in its own style, or be swayed by confident phrasing. The lesson echoes Chapter 19: a model's output, even when the output is a judgment, is a candidate to be verified, not gospel. Before trusting a judge at scale, have people label a sample of the same outputs and check that the judge's verdicts agree with theirs.
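One way to run that check is to compare the judge's labels with human labels on the same outputs. The sketch below computes raw agreement and Cohen's kappa, which discounts the agreement two raters would reach by chance. The label lists are invented for illustration, and the function names are assumptions, not part of any judge framework.

```python
def agreement_rate(judge_labels, human_labels):
    """Fraction of items on which the judge and the humans gave the same label."""
    if len(judge_labels) != len(human_labels) or not human_labels:
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


def cohens_kappa(judge_labels, human_labels):
    """Agreement corrected for chance: 1.0 is perfect, 0.0 is chance level."""
    observed = agreement_rate(judge_labels, human_labels)
    n = len(human_labels)
    labels = set(judge_labels) | set(human_labels)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(
        (judge_labels.count(label) / n) * (human_labels.count(label) / n)
        for label in labels
    )
    if expected == 1.0:
        return float("nan")  # kappa is undefined when both raters use a single label
    return (observed - expected) / (1 - expected)


# Hypothetical verdicts on the same eight model outputs.
human = ["good", "good", "bad", "bad", "good", "bad", "good", "bad"]
judge = ["good", "good", "bad", "good", "good", "bad", "good", "good"]

print(agreement_rate(judge, human))
print(cohens_kappa(judge, human))
```

In this toy data the judge agrees with the humans on 6 of 8 items, yet kappa is only 0.5, because a judge that says "good" most of the time will match the humans fairly often by luck alone. Both of the judge's misses are "bad" outputs it rated "good", which is the kind of leniency worth catching before scaling up.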

Human Evaluation: Still the Gold Standard

For nuanced quality, human evaluation remains the most trustworthy approach — typically using the preference comparisons of Chapter 18, where people judge which of two responses is better. Humans catch subtleties that metrics and even model judges miss. The drawbacks are familiar: it is slow, expensive, and somewhat subjective. But when the stakes are high, there is still no substitute for asking real people whether the output is actually good.
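Because human comparisons are expensive, sample sizes tend to be small, and a small sample can make a coin flip look like a preference. The sketch below turns a list of pairwise verdicts into a win rate and attaches a Wilson score interval, a standard confidence interval for a proportion. The verdict encoding ("A", "B", "tie") and the data are assumptions made for illustration.

```python
import math


def count_wins(preferences, model="A"):
    """Return (wins for `model`, number of decisive comparisons); ties are dropped."""
    decisive = [p for p in preferences if p in ("A", "B")]
    wins = sum(p == model for p in decisive)
    return wins, len(decisive)


def wilson_interval(wins, n, z=1.96):
    """Wilson score interval for a win rate; z=1.96 gives roughly 95% coverage."""
    if n == 0:
        raise ValueError("need at least one decisive comparison")
    p = wins / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half_width, centre + half_width


# Hypothetical verdicts from 23 side-by-side comparisons of model A and model B.
preferences = ["A"] * 14 + ["B"] * 6 + ["tie"] * 3

wins, n = count_wins(preferences, model="A")
low, high = wilson_interval(wins, n)
print(f"A preferred in {wins} of {n} decisive comparisons")
print(f"95% interval for A's win rate: {low:.2f} to {high:.2f}")
```

Here model A wins 14 of 20 decisive comparisons, which sounds convincing, but the interval runs from roughly 0.48 to 0.85 and so still includes 0.5. With this few judgments you cannot yet rule out that raters have no real preference.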

Build Your Own Evaluation Set

Here is the single most valuable evaluation habit, and it costs you nothing but discipline: for your task, the benchmark that matters is your own. Public benchmarks measure general abilities; they cannot tell you whether the model does your job well. So collect a set of real examples representative of your actual use case, define clear criteria for what counts as a good answer, and test against them every time. A small, honest, task-specific evaluation set is worth more than any leaderboard.

python
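# Your own evaluation set: real examples + clear pass/fail criteria.
eval_set = [
    {"input": "Customer: My order hasn't arrived.",
     "must_include": ["apolog", "track"]},      # should apologize and offer tracking
    {"input": "Customer: How do I reset my password?",
     "must_include": ["settings", "reset"]},
]

def evaluate(model):
    """Run every case through the model; print and return the number that passed."""
    passed = 0
    for case in eval_set:
        # Lowercase the answer so the substring checks are case-insensitive.
        answer = model.respond(case["input"]).lower()
        # Stems such as "apolog" match both "apologize" and "apology".
        if all(word in answer for word in case["must_include"]):
            passed += 1
    print(f"Passed {passed} of {len(eval_set)} cases")
    return passed

class CannedModel:
    """Hypothetical stand-in so the example runs; swap in your real model."""
    def respond(self, text):
        return "We apologize for the delay. You can track your order from Settings."

my_model = CannedModel()   # any object with a respond(text) -> str method works
evaluate(my_model)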

This is intentionally simple — real evaluation criteria can be richer, including human or model judgment of tone and correctness — but even a basic harness like this catches regressions that vibes never would.

Evaluate Continuously

Evaluation is not a one-time gate; it is an ongoing practice. Every time you change the model, adjust a prompt, or update your data, re-run your evaluation set to catch regressions — cases that used to work and silently broke. A change that improves one thing often quietly harms another, and only a standing evaluation will reveal it. Build the habit of measuring after every meaningful change.
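A total score can hide exactly this kind of trade. The sketch below compares per-case results from two runs of the same evaluation set and lists the cases that flipped, rather than comparing the totals. The case names and results are invented for illustration, and storing results as a dictionary from case id to pass/fail is an assumption about how you record runs.

```python
def find_regressions(before, after):
    """Case ids that passed in `before` but do not pass in `after`.

    A case missing from `after` is treated as failing.
    """
    return sorted(cid for cid, ok in before.items() if ok and not after.get(cid, False))


def find_fixes(before, after):
    """Case ids that pass in `after` but did not pass in `before`."""
    return sorted(cid for cid, ok in after.items() if ok and not before.get(cid, False))


# Hypothetical results before and after a prompt change.
before = {"late_order": True, "password_reset": True, "refund_request": False}
after = {"late_order": True, "password_reset": False, "refund_request": True}

print("pass count before:", sum(before.values()))
print("pass count after: ", sum(after.values()))
print("regressions:", find_regressions(before, after))
print("fixes:", find_fixes(before, after))
```

Both runs pass two of three cases, so the headline number does not move, yet one case was fixed and another silently broke. Keeping per-case results from each run is what makes that visible.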

The Recurring Lesson

This chapter is, at heart, another face of a principle that runs through the whole book: the verification is the moat. Evaluation is verification, applied to models. Whether you are training a model, building an agent, shipping a product — or writing a technical book — the thing that separates trustworthy work from confident guesswork is rigorous checking. Do not trust vibes; measure. The discipline to evaluate honestly is what turns impressive-seeming output into something you can actually rely on.

Summary

Evaluating language models is hard because most tasks have no single right answer and "good" is multi-dimensional. Benchmarks offer a common yardstick but can mislead through contamination, overfitting to the test (Goodhart's law), and narrowness. Automatic metrics are fast but shallow, missing meaning; LLM-as-judge scales well but carries its own biases and must itself be verified; human evaluation remains the gold standard for nuance despite being slow and costly. The most valuable habit is to build your own task-specific evaluation set with clear criteria, run it continuously to catch regressions, and never trust vibes over measurement — because evaluation is simply verification, the moat that separates reliable work from guesswork.

This completes Part V and the entire "building a model" arc — data, training, fine-tuning, alignment, and evaluation. From here the book turns to using models: Part VI begins with running inference and the craft of prompting, the skills you will lean on most as you start building agents.

Practice

Exercises

  1. Explain why evaluating a language model is harder than evaluating a calculator. What property of language tasks creates the difficulty?
  2. Describe the three ways benchmarks can mislead, and give a concrete example of each. Which do you think is hardest to detect, and why?
  3. State Goodhart's law in your own words and connect it both to benchmark overfitting and to the reward hacking from Chapter 24. What do the two phenomena have in common?
  4. Compare automatic metrics, LLM-as-judge, and human evaluation across speed, cost, and how well they capture nuanced quality. When would you choose each?
  5. Design a small evaluation set for a specific task you care about: write at least four example inputs with clear, checkable criteria for a good response.
  6. Explain why evaluation should be continuous rather than one-time, and describe a situation where a change improved one thing while silently breaking another. How would a standing eval set have caught it?