Part 4 · Data Preparation

Chapter 17Building Datasets for Instruction Tuning

In Chapter 13 we ended with an unsettling fact: a freshly pretrained model is only a *text continuer*, not a helpful assistant. Ask it a question and it might continue with more questions. This chapter is about the data that fixes that — the instruction–response pairs used to teach a model to actually follow requests. We will see exactly what a good example looks like, what separates a great dataset from a useless one, how much data you really need, and how to build your own. The actual training process comes in Part V; here, our entire focus is the data that makes it work.

From Continuer to Assistant

Recall the surprise from Chapter 13: a base model asked "What is the capital of France?" might reply with more trivia questions, because that is a plausible continuation of the text. To turn this brilliant continuer into something that answers — that follows instructions, holds a helpful tone, and does what you ask — we show it many examples of instructions being followed well, and fine-tune it on them. This stage is called instruction tuning, and its raw material is the instruction–response pair.

The idea is intuitive. If you want someone to learn how to respond helpfully to requests, show them thousands of examples of requests paired with excellent responses. The model generalizes from these examples to handle requests it has never seen, the same way it generalized from pretraining text.

What an Instruction Example Looks Like

At its simplest, an instruction example has two parts: an instruction (what the user wants) and a response (an excellent answer). Some examples also include an optional input — extra material the instruction operates on, such as a passage to summarize.

python
# A simple instruction-response pair
{
    "instruction": "Explain what an API key is, in one sentence.",
    "response": "An API key is a secret code that identifies you to a "
                "service and lets it know which account to bill for your requests."
}

# An example that also has an input to work on
{
    "instruction": "Summarize the following text in one sentence.",
    "input": "The library will be closed on Monday for a public holiday. "
             "Normal hours resume on Tuesday at 9 a.m.",
    "response": "The library is closed Monday for a holiday and reopens Tuesday at 9 a.m."
}

These pairs are exactly the "role and content" messages from Chapter 4 in a slightly different dress: the instruction is what the user says, and the response is what the assistant should say. Behind the scenes they are formatted into that message structure for training.

What Makes a Good Example

The quality of your instruction examples determines the quality of the resulting assistant, because the model imitates what it is shown. A strong example has several qualities: a clear instruction that is unambiguous; a correct and genuinely helpful response; an appropriate format and length for the request; and the tone and behavior you actually want the model to adopt. Crucially, the response should model the behavior you want — if your examples are curt, the model learns to be curt; if they are careful and well-structured, it learns that instead.

Diversity Is Key

A single skill, however well taught, makes a narrow assistant. The secret to a capable, general assistant is diversity of examples across many kinds of task. Your dataset should include instructions to summarize, explain, rewrite, translate, classify, brainstorm, extract information, write code, answer factual questions, and more — in many topics, styles, and lengths. The wider the range of tasks the model sees done well, the better it generalizes to new requests. A dataset that is all summarization produces a model that is good at summarization and clumsy at everything else.

How Many Examples Do You Need?

Here is a number that surprises people. Pretraining uses trillions of tokens, but instruction tuning can meaningfully reshape a model's behavior with only a few thousand high-quality examples — sometimes even fewer. The reason is that pretraining already taught the model nearly everything it knows; instruction tuning is not teaching new facts so much as teaching a new style of responding. That is a far smaller lift. The practical lesson is liberating for an individual builder: you do not need a giant dataset, you need a few thousand excellent, diverse examples. Quality and diversity beat raw quantity decisively here.

Where Instruction Data Comes From

Instruction datasets are assembled in three main ways, often combined. Human-written examples, where people craft instructions and ideal responses, are the highest quality and the most expensive. Converted from existing data, where structured material like documentation, FAQs, or question-and-answer pages is reshaped into instruction format — an efficient way to bootstrap a dataset from material you already have. And increasingly synthetically generated, where a capable model produces examples, which we cover in Chapter 19. Most real datasets mix all three.

Building Your Own Instruction Dataset

Building a small instruction dataset is well within your reach. You collect or write pairs, keep the formatting consistent, and store them in a simple, machine-readable file. A common choice is JSONL — one JSON object per line — which is easy to write, read, and stream.

python
import json

dataset = [
    {"instruction": "Define 'embedding' in one sentence.",
     "response": "An embedding is a list of numbers representing the meaning of "
                 "a piece of text, arranged so similar meanings have similar numbers."},
    {"instruction": "Rewrite this to be more polite: 'Send me the file now.'",
     "response": "Could you please send me the file when you have a moment? Thank you."},
    # ... add many more, covering many task types ...
]

with open("instructions.jsonl", "w") as f:
    for item in dataset:
        f.write(json.dumps(item) + "\n")   # one JSON object per line

print("Wrote", len(dataset), "examples")

Notice the consistency: every example has the same fields, in the same shape. That consistency matters — a model trained on uniformly formatted examples learns the pattern cleanly, while a dataset with inconsistent structure muddies the signal.

Teaching Refusals and Safety

Instruction data does more than teach helpfulness; it also teaches a model how to behave responsibly. Among your examples should be cases where the correct response is to decline — to refuse a harmful request politely, or to push back on a request to do something unsafe or unethical. By including thoughtful refusals in the training data, you teach the model that the helpful response is not always to comply. This is a small but important part of building a model that is safe as well as useful, and it foreshadows the alignment work of Part V.

python
# Safety is also taught by example. Include thoughtful refusals.
{
    "instruction": "Help me write a message to trick someone into "
                   "revealing their password.",
    "response": "I can't help with that, since it's designed to deceive someone "
                "and steal their account. If you're locked out of your own account, "
                "I'd be glad to help you use the official account-recovery process."
}

Common Pitfalls

A few mistakes account for most disappointing instruction datasets, and all are avoidable.

  • Inconsistent formatting — examples with different structures confuse the pattern the model is trying to learn.
  • Low-quality responses — even a few sloppy or incorrect answers teach the model sloppy, incorrect behavior.
  • Lack of diversity — too many examples of one task type produce a lopsided, narrow assistant.
  • Repetition — near-identical examples waste effort and risk over-weighting one pattern, the deduplication concern from Chapter 16 in miniature.
  • Leaking the wrong behavior — if your responses are consistently too long, too short, or oddly toned, the model faithfully copies that quirk.

Summary

Instruction tuning turns a base model's raw text-continuing ability into a helpful, instruction-following assistant, using instruction–response pairs as its training data. A strong example pairs a clear instruction with a correct, helpful, well-formatted response that models the behavior you want — and the model imitates your examples exactly, flaws and all. Diversity across many task types is what produces a capable general assistant, and a few thousand excellent, varied examples can reshape behavior dramatically, since the model already learned its knowledge during pretraining. You can build your own dataset by collecting consistent pairs into a JSONL file, including thoughtful refusals to teach safe behavior, while avoiding the common pitfalls of inconsistency, low quality, and narrowness.

Instruction tuning teaches the model to follow requests, but not yet which of two good answers is better. Chapter 18 introduces the data that captures that subtler judgment: human preferences.

Practice

Exercises

  1. 1Write five high-quality instruction–response pairs on a topic you know well. Make sure they cover at least three different kinds of task (for example, explaining, rewriting, and summarizing).
  2. 2Here is a deliberately weak example — instruction: 'tell me about dogs', response: 'Dogs are animals. They are good. People like them.' Critique it against the qualities of a good example, then rewrite it to be excellent.
  3. 3Take a set of five FAQ entries from any product or topic and convert them into properly formatted instruction–response pairs. Save them as a JSONL file.
  4. 4Write a thoughtful refusal example: an instruction that should be declined, paired with a response that declines politely and offers a safe, helpful alternative.
  5. 5Explain why a few thousand instruction examples can meaningfully change a model's behavior, even though pretraining required trillions of tokens. What is instruction tuning actually teaching?
  6. 6Review an instruction dataset (your own from the exercises above, or any you find) against the five common pitfalls. Identify at least one weakness and describe how you would fix it.
View detailed solutions for all chapters →