Part IV · Data Preparation

Chapter 19: Synthetic Data and Data Augmentation

The last two chapters revealed an expensive truth: good instruction and preference data is largely made by hand, and human effort does not scale cheaply. A natural question follows: could a capable model help generate the data used to train other models? It can, and **synthetic data** has become one of the most widely used techniques in the field. But it carries serious, sometimes subtle dangers. This chapter, which closes Part IV, explains what synthetic data and data augmentation are, how they are produced, what they genuinely offer, and the risks that make verification non-negotiable.

When Real Data Runs Short

Human-written instruction examples and human-judged preferences are high quality but slow and costly to produce. Real data for rare or specialized situations may barely exist. Faced with this scarcity, builders increasingly turn to a powerful idea: use a strong existing model to generate training data for another model. This is synthetic data — data created by a model rather than collected from the world.

What Is Synthetic Data?

Synthetic data is, simply, training examples produced by a model. A capable model can be prompted to write instruction–response pairs, generate plausible documents, invent varied questions, or even produce the candidate responses used in preference comparisons. In effect, a more capable model acts as a teacher, generating lessons for a model being trained. The appeal is obvious: a model can produce in minutes what would take humans weeks, and it never tires.

Data Augmentation: Stretching What You Have

A close cousin of synthetic data is data augmentation, which does not invent examples from nothing but creates variations of examples you already have. From one instruction, you might generate several rephrasings; from one passage, a translation or a reformatted version. Augmentation multiplies a small, precious dataset into a larger one while preserving its meaning, helping the model generalize rather than memorize the exact wording of a few examples.

```python
# Augmentation: turn one seed instruction into several phrasings.
seed = "Explain what a variable is in programming."
variations = [
    seed,
    "In simple terms, what is a variable in code?",
    "Could you describe what a programming variable does?",
    "Teach a beginner the concept of a variable.",
]
# Each variation can be paired with a good response to expand the dataset.
```
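To make that pairing step concrete, here is a minimal sketch. The response text is a hypothetical example written for illustration, and the `instruction` and `response` field names are an assumption chosen to match the JSON shape requested in the generation prompt later in this chapter. The list includes one accidental repeat to show why removing exact duplicates before pairing is worthwhile.

```python
# Hypothetical seed response, written for illustration only.
response = "A variable is a named place where a program stores a value."

variations = [
    "Explain what a variable is in programming.",
    "In simple terms, what is a variable in code?",
    "Could you describe what a programming variable does?",
    "Teach a beginner the concept of a variable.",
    "Explain what a variable is in programming.",  # accidental repeat
]

# dict.fromkeys drops exact repeats while preserving the original order.
unique_variations = list(dict.fromkeys(variations))

# Pair every distinct phrasing with the same response.
augmented = [{"instruction": v, "response": response} for v in unique_variations]
print(len(augmented))
```

Reusing one response is only appropriate when every phrasing really asks for the same thing. A rephrasing that shifts the audience or the task, such as "Teach a beginner", may deserve its own response.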

How Synthetic Data Is Generated

In practice, you prompt a capable model to produce the examples you need, usually asking for variety so the dataset does not become repetitive. The pattern is the same one you have used since Chapter 3: send a clear instruction, receive a response, and structure the result.

```python
from anthropic import Anthropic

client = Anthropic()

def make_examples(topic, n):
    prompt = (
        f"Generate {n} diverse instruction-response pairs about {topic}. "
        "Vary the task types: explaining, summarizing, rewriting, and questions. "
        "Return them as a JSON list of objects with 'instruction' and 'response'."
    )
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2000,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text   # parse this JSON, then verify each example

raw = make_examples("how embeddings work", 10)
```

Notice the comment on the return line. Generating the examples is the easy part; what you do next, verifying them, is what separates useful synthetic data from a trap.
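The first verification step is usually structural: confirm that the model's text parses and that each item has the fields you asked for. The sketch below is one possible way to do that, not a definitive parser. The function name `parse_examples` is hypothetical, and the sample string stands in for real model output so the code runs without an API call.

```python
import json

def parse_examples(raw_text):
    """Return only the well-formed instruction-response pairs found in raw_text."""
    # Models sometimes wrap JSON in extra prose, so keep only the outermost list.
    start, end = raw_text.find("["), raw_text.rfind("]")
    if start == -1 or end <= start:
        return []
    try:
        items = json.loads(raw_text[start:end + 1])
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []

    kept = []
    for item in items:
        if not isinstance(item, dict):
            continue
        instruction = item.get("instruction")
        response = item.get("response")
        # Require both fields to be non-empty strings.
        if (isinstance(instruction, str) and instruction.strip()
                and isinstance(response, str) and response.strip()):
            kept.append({"instruction": instruction.strip(),
                         "response": response.strip()})
    return kept

# Stand-in for model output: one good pair, one empty instruction, one missing field.
sample = (
    'Here you go:\n'
    '[{"instruction": "Define an embedding.", "response": "A vector that represents meaning."},'
    ' {"instruction": "", "response": "orphan"},'
    ' {"response": "no instruction"}]'
)
examples = parse_examples(sample)
print(len(examples))
```

Passing this check says only that an example is well formed. It says nothing about whether the response is correct, which still has to be verified separately.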

The Big Advantages

Used well, synthetic data offers real benefits that explain its rise.

Cheap and fast. A model can produce thousands of examples in the time a human writes a handful.
Scalable and targeted. You can generate exactly the kinds of examples you lack, filling specific gaps in your dataset on demand.
Covers rare cases. You can deliberately create examples of unusual situations that would almost never appear in collected data.
Privacy-friendly. Because the examples are invented rather than drawn from real people, synthetic data can sidestep some of the privacy concerns of Chapter 15.
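Targeting starts with knowing where the gaps are. As a rough sketch, assuming each example carries a hypothetical `category` label and you have chosen a target count per category, you can compute how many examples each category still needs and request only those.

```python
from collections import Counter

def find_gaps(examples, target_per_category):
    """Return how many more examples each under-filled category needs."""
    counts = Counter(ex["category"] for ex in examples)
    return {
        category: target - counts.get(category, 0)
        for category, target in target_per_category.items()
        if counts.get(category, 0) < target
    }

# Toy dataset; the 'category' field and the targets are illustrative assumptions.
dataset = [
    {"category": "explaining"},
    {"category": "explaining"},
    {"category": "explaining"},
    {"category": "summarizing"},
]
targets = {"explaining": 3, "summarizing": 3, "rewriting": 3}

gaps = find_gaps(dataset, targets)
print(gaps)
```

Each entry in `gaps` can then drive one generation request for that category, so you spend generation and verification effort only where the dataset is thin.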

The Serious Risks

Now the warnings, because the dangers of synthetic data are easy to underestimate and have undone many well-meaning projects.

Quality and hidden errors. A generating model can produce confident, fluent, wrong examples — and if you train on them, the model dutifully learns the mistakes as if they were truth. A hallucinated fact in synthetic data becomes a memorized falsehood.
Bias amplification. The generating model's biases are baked into the examples it produces, and training on them reinforces those biases — a feedback loop that can make a small skew worse over time.
Model collapse. When models are trained repeatedly on data generated by previous models, with no fresh human data, quality can degrade across generations — an echo chamber where errors and blandness compound. Real, human-grounded data remains essential.
Loss of diversity. A generating model tends to produce examples in its own characteristic style, so a dataset built purely from one model can be narrower and more uniform than it appears.
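Loss of diversity is hard to see by eye but can be roughly estimated. One simple proxy, offered here as an illustration rather than a standard the chapter prescribes, is the ratio of distinct word pairs to total word pairs across a set of responses. A low ratio suggests the generator keeps reusing the same phrasing. The function name and the toy sentences are assumptions.

```python
def distinct_ngram_ratio(texts, n=2):
    """Fraction of word n-grams across all texts that are unique (0.0 if none)."""
    unique = set()
    total = 0
    for text in texts:
        tokens = text.lower().split()
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0

# Responses that all follow one template.
uniform = [
    "Sure! Here is an explanation of lists.",
    "Sure! Here is an explanation of loops.",
    "Sure! Here is an explanation of dicts.",
]
# Responses with varied wording.
varied = [
    "Lists keep items in order.",
    "A loop repeats a block of code.",
    "Think of a dict as a labelled drawer.",
]

print(distinct_ngram_ratio(uniform))
print(distinct_ngram_ratio(varied))
```

This measures surface wording only. A dataset can score well here and still be narrow in topic or difficulty, so treat the number as a prompt for closer inspection, not a verdict.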

Verification: The Non-Negotiable Step

The guardrail that makes synthetic data safe is verification, and it should be as deliberate as the generation. Filter out malformed or low-quality examples using the cleaning techniques of Chapter 16. Validate factual claims where you can, automatically or by checking against trusted sources. Have a human review a meaningful sample to catch what automated checks miss. And mix synthetic data with real, human-grounded data rather than relying on it alone, to guard against model collapse and preserve diversity. Generation is fast and cheap; verification is slow and essential, and the value of synthetic data lives entirely in that second step.
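The automatable parts of this checklist can be sketched in a few lines. The example below filters near-identical instructions and very short responses, draws a random sample for human review, and caps the share of synthetic examples in the final mix. Every threshold here (a minimum of three words, a 50% cap, a review sample of two) is an arbitrary assumption for illustration, not a recommendation, and the function name is hypothetical. Factual validation is deliberately left out because it depends on your domain and sources.

```python
import random

def verify_and_mix(synthetic, real, max_synthetic_fraction=0.5,
                   min_response_words=3, review_size=2, seed=0):
    """Filter synthetic examples, sample some for human review, mix with real data."""
    rng = random.Random(seed)

    seen = set()
    kept = []
    for ex in synthetic:
        # Normalize case and whitespace so trivial variants count as duplicates.
        key = " ".join(ex["instruction"].lower().split())
        if key in seen:
            continue
        if len(ex["response"].split()) < min_response_words:
            continue
        seen.add(key)
        kept.append(ex)

    # A human should read these before the batch is trusted.
    review_sample = rng.sample(kept, min(review_size, len(kept)))

    # Cap synthetic count s so that s / (s + len(real)) <= max_synthetic_fraction.
    f = max_synthetic_fraction
    cap = int(f * len(real) / (1 - f)) if f < 1 else len(kept)
    mixed = real + kept[:cap]
    rng.shuffle(mixed)
    return mixed, review_sample

synthetic = [
    {"instruction": "What is a list?", "response": "An ordered collection of items."},
    {"instruction": "what is a  list?", "response": "A sequence type that keeps order."},
    {"instruction": "Define a loop.", "response": "Repeats."},
    {"instruction": "What is a dict?", "response": "A mapping from keys to values."},
    {"instruction": "What is a set?", "response": "An unordered collection of unique items."},
]
real = [
    {"instruction": "Explain recursion.", "response": "A function that calls itself on smaller input."},
    {"instruction": "Summarize what a stack is.", "response": "A last-in, first-out collection."},
]

mixed, to_review = verify_and_mix(synthetic, real)
print(len(mixed), len(to_review))
```

In this toy run, one duplicate and one too-short response are dropped, and the cap admits only as many synthetic examples as there are real ones.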

Synthetic Data for Your Projects

For an individual builder, synthetic data is a genuine superpower when handled with discipline. The workflow is: generate candidate examples to cover the gaps in your dataset, verify and filter them rigorously, curate the survivors, and keep a human in the loop reviewing samples. Working this way, you can build a sizable, diverse dataset far faster than by hand while inheriting far fewer of the generating model's mistakes.

This echoes a theme that runs through this entire book, and one worth taking to heart for any AI work you do: the cheap part is generating output, and the valuable part is verifying it. Whether you are training a model, building an agent, or — as it happens — writing a technical book with AI assistance, the verification is the moat. Anyone can generate; the discipline to check is what produces something trustworthy.

Summary

When real data is scarce or expensive, synthetic data — examples generated by a capable model — can fill the gap, and data augmentation can multiply a small dataset by creating variations of what you already have. The advantages are compelling: cheap, fast, scalable, targetable, able to cover rare cases, and friendlier to privacy. But the risks are serious: generated examples can be confidently wrong, can amplify bias, can degrade quality across generations (model collapse), and can quietly lose diversity. The non-negotiable safeguard is verification — filtering, validating, human review, and mixing with real data — because the entire value of synthetic data depends on checking it before you trust it.

This completes Part IV. We have sourced, cleaned, and prepared data of every kind — pretraining text, instruction pairs, preference comparisons, and synthetic examples. In Part V we finally put this data to work, training and fine-tuning models, beginning in Chapter 20 with the crucial choice between pretraining, fine-tuning, and in-context learning.

Practice

Exercises

1. Use a model to generate ten instruction–response pairs about a topic you know well. Then carefully review them and record how many were genuinely correct and high quality. What does the error rate tell you about verification?
2. Take a single seed instruction and use augmentation to create five varied phrasings of it that preserve its meaning. Explain how this might help a model generalize.
3. Describe, in your own words, the risk of 'model collapse'. Why does training models on the output of models, with no fresh human data, tend to degrade quality over time?
4. Design a verification plan for a batch of synthetic instruction data: list the specific checks you would apply before trusting any of it for training.
5. Explain one concrete situation where synthetic data is a clearly good idea, and one where it would be risky or inappropriate. Justify each.
6. The chapter claims 'the verification is the moat'. Restate this principle in your own words and describe how it applies to at least two different AI tasks beyond training a model.