Part 4 · 5 chapters

Data Preparation

Great models and useful agents both depend on good data. This part teaches the unglamorous but essential craft of collecting, cleaning, and shaping data for training and for retrieval.

Chapter 15 · Data Preparation

Where Training Data Comes From

We now begin the part of the book devoted to data preparation, and it deserves the attention. In Chapter 13 we saw that a model is, in the deepest sense, a reflection of the data it learned from. Everything it knows, every skill it has, and every flaw it carries traces back to that data. This chapter surveys where training data actually comes from, the genuinely thorny legal and ethical questions it raises, and how to think about data quality from the very first step. It assumes no background, and it sets up the hands-on cleaning, instruction, preference, and synthetic-data chapters that follow.

Chapter 16 · Data Preparation

Cleaning, Deduplicating, and Filtering Data

If Chapter 15 was about choosing your ingredients, this chapter is about washing and chopping them. Raw data — especially raw web text — is genuinely filthy: full of duplicates, junk, broken formatting, and content you do not want anywhere near your model. Cleaning it is unglamorous, hands-on work, and it is some of the highest-leverage work in all of machine learning. We will walk through a practical cleaning pipeline step by step, with runnable code you can adapt to your own datasets, and we will see exactly why one step — deduplication — matters far more than beginners expect.

Chapter 17 · Data Preparation

Building Datasets for Instruction Tuning

In Chapter 13 we ended with an unsettling fact: a freshly pretrained model is only a *text continuer*, not a helpful assistant. Ask it a question and it might continue with more questions. This chapter is about the data that fixes that — the instruction–response pairs used to teach a model to actually follow requests. We will see exactly what a good example looks like, what separates a great dataset from a useless one, how much data you really need, and how to build your own. The actual training process comes in Part V; here, our entire focus is the data that makes it work.

Chapter 18 · Data Preparation

Preference and RLHF Data: How Human Feedback Is Collected

Instruction tuning, from the last chapter, teaches a model to follow requests. But it leaves a subtler question unanswered: when two responses are both reasonable, which is *better* — more helpful, more honest, safer, better phrased? Teaching that judgment requires a different kind of data, built not from single right answers but from human *preferences* between alternatives. This chapter explains what preference data is, why it takes the form of comparisons, how it is collected, and the genuine difficulties involved. The training methods that consume this data come in Part V; here we focus entirely on the data itself.

Chapter 19 · Data Preparation

Synthetic Data and Data Augmentation

The last two chapters revealed an expensive truth: good instruction and preference data is largely made by hand, and human effort does not scale cheaply. So a natural question arises — could a capable model help generate the data used to train models? It can, and **synthetic data** has become one of the most important and fastest-growing techniques in the field. But it comes with serious, sometimes subtle dangers. This chapter, closing Part IV, explains what synthetic data and data augmentation are, how they are produced, their genuine advantages, and the risks that make verification absolutely non-negotiable.
