Part 4 · Data Preparation

Chapter 15 · Where Training Data Comes From


We now begin the part of the book devoted to data preparation, and it deserves the attention. In Chapter 13 we saw that a model is, in the deepest sense, a reflection of the data it learned from. Everything it knows, every skill it has, and every flaw it carries traces back to that data. This chapter surveys where training data actually comes from, the genuinely thorny legal and ethical questions it raises, and how to think about data quality from the very first step. It assumes no background, and it sets up the hands-on cleaning, instruction, preference, and synthetic-data chapters that follow.

Why Data Is the Real Foundation

There is an old saying in computing: garbage in, garbage out. Nowhere is it truer than in machine learning. A model trained on careful, high-quality, representative data becomes capable and reliable. A model trained on careless, biased, or dirty data becomes unreliable in ways that no clever architecture can fix. The data is not a preliminary detail before the "real" work of training — it is much of the real work.

This is why serious teams spend more time on data than on almost anything else, and why this book devotes an entire part to it. If you remember only one thing, let it be this: the quality of what comes out of a model is bounded by the quality of what went in.

Three Very Different Data Needs

Before listing sources, it helps to separate three distinct kinds of data, because they have completely different requirements. Confusing them is a common beginner mistake.

Pretraining data — the vast, raw text used to build a base model (Chapter 13). Measured in trillions of tokens, prioritizing scale and broad coverage.
Fine-tuning data — smaller, carefully curated datasets used to shape a model's behavior, such as the instruction and preference data of the next few chapters. Measured in thousands or millions of examples, prioritizing quality and relevance.
Retrieval data — your own documents, used at the moment of use rather than training, as in the RAG system of Chapter 36. Not training data at all, but still data you must source and prepare with care.

Most of this part focuses on the first two. As an individual builder, you will rarely assemble pretraining corpora, but you will very often prepare fine-tuning and retrieval data — so the principles here apply directly to your own work.

The Main Sources of Text Data

Where does all that text come from? A handful of sources supply most of it, each with characteristic strengths and weaknesses.

Web crawls

The largest source by far is text scraped from the public web — billions of pages. Its great virtue is sheer scale and variety; its great vice is messiness. Web text is full of duplicates, navigation menus, advertisements, spam, broken formatting, and low-quality or harmful content. Raw web data is never usable as-is, which is the entire reason the next chapter exists.

Books and articles

Books, academic papers, and edited articles offer high-quality, well-structured, carefully written language — exactly the kind of text you want a model to learn from. The catch is largely legal: much of this material is copyrighted, raising questions we turn to shortly.

Code repositories

Public collections of source code teach models to read and write programs. A model's coding ability depends heavily on how much good code it saw during training, which is why code is now a major, deliberately included ingredient.

Reference and curated collections

Encyclopedias, question-and-answer sites, and other curated knowledge bases are dense with reliable facts and clear explanations, making them especially valuable per token compared to the noisy open web.

Conversations and forums

Dialogue from forums and discussions teaches models the back-and-forth rhythm of conversation, which is particularly useful for assistants and agents that must hold a coherent exchange.

Quality Over Quantity (Up to a Point)

It is tempting to assume more data is always better. Scale does matter — but beyond a point, quality matters more. A smaller, clean, well-chosen dataset frequently produces a better model than a larger, dirtier one, because the model is not wasting its capacity learning noise, errors, and repetition. The modern trend has moved steadily toward carefully curated data rather than simply hoarding more. Quantity gets you started; quality gets you good.
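One rough way to see how much of a dataset is repetition is to measure the share of documents that duplicate an earlier one. The sketch below is an illustration under simple assumptions: the helper names `normalize` and `duplicate_fraction` are hypothetical, documents are plain strings, and only exact matches after lowercasing and whitespace collapsing are counted, so near-duplicates are missed.

```python
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return " ".join(text.lower().split())


def duplicate_fraction(documents: list[str]) -> float:
    """Return the share of documents that repeat another document."""
    if not documents:
        return 0.0
    counts = Counter(normalize(doc) for doc in documents)
    unique = len(counts)  # number of distinct documents after normalization
    return 1 - unique / len(documents)


# Made-up sample: two pairs of copies and one original sentence.
sample = [
    "The cat sat on the mat.",
    "the cat  sat on the mat.",
    "Click here to subscribe!",
    "Click here to subscribe!",
    "A genuinely new sentence.",
]
print(f"Duplicate fraction: {duplicate_fraction(sample):.0%}")
```

A high fraction suggests that a large dataset may contain far less distinct content than its size implies.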

The Legal and Ethical Questions

Sourcing training data sits in the middle of a genuinely unsettled set of legal and ethical debates, and an honest book must present them as open questions rather than solved ones.

Copyright is the most prominent. Much valuable material (books, articles, code, art) is owned by someone. Whether, and how, it may be used to train models is being actively contested in courts and legislatures around the world, and reasonable people and institutions disagree. Consent and attribution raise related concerns: people who wrote text on the web generally did not anticipate it training AI, and creators ask whether they should be credited or compensated. Personal and private data is a sharper worry still: web text can contain names, addresses, and other private information that should not be memorized or reproduced by a model.
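For the private-data concern, even a crude scan can show whether a dataset needs closer review. The following is a minimal sketch, not a complete detector: the two regular expressions and the function name `flag_private_data` are assumptions chosen for illustration, and they will both miss real cases and flag harmless ones.

```python
import re

# Deliberately simple patterns (illustrative assumptions, not a full PII detector).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def flag_private_data(text: str) -> dict[str, list[str]]:
    """Return any email-like and phone-like strings found in the text."""
    return {
        "emails": EMAIL_RE.findall(text),
        "phones": PHONE_RE.findall(text),
    }


# Made-up documents for demonstration.
docs = [
    "Contact me at jane.doe@example.com or call 555-123-4567.",
    "The weather was pleasant all week.",
]
for i, doc in enumerate(docs):
    hits = flag_private_data(doc)
    if hits["emails"] or hits["phones"]:
        print(f"Document {i} needs review:", hits)
```

A scan like this only tells you where to look; deciding what to remove or mask still takes human judgment.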

Bias In, Bias Out

We met this idea with embeddings in Chapter 9, and it returns here at full scale. Training data reflects the people who produced it — their languages, viewpoints, assumptions, and blind spots. If a perspective is over-represented in the data, the model will lean that way; if a group or language is under-represented, the model will serve it less well. These biases are not a bug introduced by careless engineers so much as an inheritance from the data itself, which is precisely why thoughtful sourcing and filtering matter so much. You cannot remove bias entirely, but you can be aware of it, measure it, and work to reduce it.
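Measuring representation can start very simply: count how often each group appears. The sketch below assumes each record is a dictionary carrying a metadata field such as `language`; the field name, the sample records, and the helper `representation_report` are all hypothetical, and real datasets often lack such labels, in which case they must be inferred or annotated first.

```python
from collections import Counter


def representation_report(records: list[dict], field: str) -> dict[str, float]:
    """Return each value's share of the dataset for one metadata field."""
    counts = Counter(rec.get(field, "unknown") for rec in records)
    total = sum(counts.values())
    # most_common() orders the report from most to least represented.
    return {value: count / total for value, count in counts.most_common()}


# Made-up records; the "language" field is an assumed label.
records = [
    {"text": "Hello there", "language": "en"},
    {"text": "Good morning", "language": "en"},
    {"text": "See you soon", "language": "en"},
    {"text": "Bonjour", "language": "fr"},
]
for value, share in representation_report(records, "language").items():
    print(f"{value}: {share:.0%}")
```

A lopsided report does not prove the resulting model will be unfair, but it shows where to look more carefully.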

Data for Your Own Projects

Let us bring this down to your scale. You are unlikely to gather a trillion-token web crawl, but you will regularly assemble datasets to fine-tune a model's behavior (Chapter 17) or documents to feed a RAG system (Chapter 36). The same principles guide good sourcing at any size. Seek data that is relevant to your task, high in quality, permitted for your use, representative of the cases you care about, and clean enough to trust. A few hundred excellent, well-chosen examples will serve you better than a sloppy dump of thousands.

Before adopting any dataset, it is worth simply looking at it — really reading a sample — to see what you are dealing with. A few lines of code go a long way.

python

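# Always inspect a sample of your data before trusting it.
from pathlib import Path

# Assumption: one plain-text document per .txt file in a "my_data" folder.
# Swap in whatever loader matches how your data is actually stored.
paths = sorted(Path("my_data").glob("*.txt"))
documents = [path.read_text(encoding="utf-8") for path in paths]

print("Total documents:", len(documents))
for doc in documents[:5]:                # eyeball the first few
    print("-" * 40)
    print(doc[:300])                      # first 300 characters of each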
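The first few documents are not always typical, for example when files are sorted by name or date. A random sample is often a fairer view. This is a small sketch assuming `documents` is a list of strings; the helper name `sample_documents` and the placeholder data are hypothetical, and the fixed seed is only there so the same sample comes back on every run.

```python
import random


def sample_documents(documents: list[str], k: int = 5, seed: int = 0) -> list[str]:
    """Return up to k documents chosen at random, reproducibly."""
    rng = random.Random(seed)  # private generator, so global random state is untouched
    return rng.sample(documents, min(k, len(documents)))


# Placeholder data standing in for a real dataset.
documents = [f"Document number {i}: some text to read." for i in range(100)]

for doc in sample_documents(documents):
    print("-" * 40)
    print(doc[:300])  # first 300 characters of each sampled document
```

Changing the seed gives a different sample, which is a cheap way to read a few more slices of the data.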

A Sourcing Mindset

Whenever you encounter or assemble a dataset, get into the habit of asking five questions. Where did this come from? Am I allowed to use it? Is it representative of what I care about, or skewed? Is it clean, or full of noise? And is it actually relevant to the task at hand? These questions are simple, but asking them every time will spare you a great many downstream problems — and they cost nothing but a moment's discipline.
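If it helps to make the habit concrete, the five questions can be recorded as a small checklist alongside each dataset. This is only one possible sketch; the class `SourcingCheck` and its field names are hypothetical shorthand for the questions above.

```python
from dataclasses import dataclass, fields


@dataclass
class SourcingCheck:
    """One yes/no answer per sourcing question."""
    known_origin: bool    # Where did this come from?
    permitted: bool       # Am I allowed to use it?
    representative: bool  # Is it representative, or skewed?
    clean: bool           # Is it clean, or full of noise?
    relevant: bool        # Is it relevant to the task at hand?

    def open_questions(self) -> list[str]:
        """Return the names of the questions not yet answered with a yes."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Made-up answers for an imaginary dataset.
check = SourcingCheck(
    known_origin=True,
    permitted=False,
    representative=True,
    clean=False,
    relevant=True,
)
print("Still to resolve:", check.open_questions())
```

Keeping the answers next to the data makes it harder to forget a question that was skipped.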

Summary

Data is the true foundation of every model: garbage in, garbage out. Training data comes in three distinct kinds — vast pretraining corpora, curated fine-tuning datasets, and use-time retrieval documents — with different priorities. The main sources are web crawls (huge but messy), books and articles (high quality but copyrighted), code, curated reference collections, and conversations. Quality eventually matters more than raw quantity. Sourcing raises real, unsettled questions about copyright, consent, and private data, and all data carries the biases of those who produced it. For your own projects, choose data that is relevant, high-quality, permitted, representative, and clean — and always inspect it before you trust it.

Having chosen our sources, we face the reality that raw data is a mess. Chapter 16 rolls up its sleeves and walks through the concrete steps of cleaning, deduplicating, and filtering data into something worth training on.

Practice

Exercises

1. List four distinct sources of text training data, and for each, name one strength and one weakness in your own words.
2. Explain the difference between pretraining data, fine-tuning data, and retrieval data, and give an example task where each would be the relevant kind.
3. In one paragraph, explain the idea that quality can matter more than quantity. When might a smaller dataset beat a larger one?
4. Pick a topic you know well and write a short data-sourcing plan for a model meant to answer questions about it: what sources you would use, how you would judge quality, and what permission concerns might arise.
5. Describe one concrete way that bias in training data could lead to a model that treats some users worse than others, and explain why this traces back to data rather than to the model's design.
6. Write down the five sourcing-mindset questions from the chapter, then apply all five to any real dataset you can find online. What did asking them reveal?
