Chapter 26
Running Inference: Local and in the Cloud
We have spent five parts understanding how models are built. Now the book pivots to the half you will spend most of your time in: *using* them. This part is about putting a finished model to work, and it begins with the most basic act of all — running the model to get an answer, which is called inference. We will see the two places you can run a model, what actually happens inside when it generates text, the settings that shape its output, and how to think about speed and cost. Everything here is practical and beginner-friendly, and it is the ground floor for building agents.
What Inference Means
Inference is simply running a trained model to produce output — feeding it a prompt and getting a response. It is the counterpart to training. Training is the expensive, one-time process of building the model (Parts III to V); inference is the everyday act of using it, which you do every single time you send a prompt. When you chatted with a model in Chapter 3, that was inference. The distinction matters because the two have completely different costs and characteristics: training happens rarely and costs a fortune, while inference happens constantly and is far cheaper per use.
Two Places to Run a Model
There are two fundamentally different places you can run inference, and they map onto the hosted-versus-open distinction from Chapter 14. You can run the model in the cloud, calling a provider's model over the internet, or you can run it locally, on your own hardware. Each suits different needs, and many builders use both at different stages.
Cloud Inference (via API)
Cloud inference means sending your prompt over the internet to a provider who runs the model on their machines and sends back the result — exactly what you did in Chapter 3. Its strengths are convenience and power: there is no hardware to buy or manage, the most capable frontier models are available this way, and it scales effortlessly from one request to millions. Its costs are a per-token charge, the fact that your data leaves your machine, a dependence on the network and the provider, and some latency from the round trip. For learning, prototyping, and most production work, cloud inference is the path of least resistance.
Local Inference
Local inference means running an open model (Chapter 14) on your own computer or server. Its strengths mirror the cloud's weaknesses: your data never leaves your machine (real privacy), there is no per-token fee, it works offline, and you have complete control. Its costs are that you need capable hardware, you must handle the setup yourself, the open models you can run are often less capable than the best cloud models, and you are responsible for keeping it running. Friendly local-runner tools have made this dramatically easier than it used to be, so running a small model on a decent laptop is now within easy reach.
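To make "local runner" concrete, here is a minimal sketch of calling one from code. It assumes Ollama (one such tool, chosen here for illustration and not named by this chapter) is installed and serving on its default port, 11434, and that a model has already been downloaded. The model name `llama3.2` and the helper names `build_generate_request` and `generate_locally` are placeholders, not recommendations.

```python
import json
import urllib.request

# Assumption: an Ollama server is running on this machine, default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(prompt: str, model: str = "llama3.2") -> urllib.request.Request:
    """Build the HTTP request for a single, non-streamed completion."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),  # a request body makes this a POST
        headers={"Content-Type": "application/json"},
    )


def generate_locally(prompt: str, model: str = "llama3.2") -> str:
    """Send the prompt to the local server and return the generated text."""
    with urllib.request.urlopen(build_generate_request(prompt, model)) as resp:
        return json.loads(resp.read())["response"]


# Usage (needs the local server running):
#   print(generate_locally("List three uses for embeddings."))
```

The request goes to `localhost`, so the prompt never leaves the machine, and no API key or per-token bill is involved. The shape of the call is otherwise the same as in the cloud: send a prompt, get text back.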
What Actually Happens During Inference
It is worth seeing under the hood, because it explains a lot about cost and speed. When a model generates a response, it does not produce the whole thing at once. It runs a forward pass (Chapter 7) to compute a probability for every possible next token, picks one token from that distribution (the most likely one, or a randomly sampled one, depending on the settings described below), appends it, and then runs again to predict the next one, conditioning on everything written so far (the causal generation of Chapter 12). It repeats this, one token at a time, until the model emits an end-of-sequence token or hits a limit such as the maximum token count. This token-by-token process is called autoregressive generation.
This immediately explains something practical: longer outputs take longer to generate, roughly in proportion to their length, because each output token requires its own pass through the model. A one-sentence answer is fast; a five-page essay is slow, because the model is genuinely producing it one piece at a time.
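The loop described above can be sketched with a toy stand-in for the model. Everything here is invented for illustration: `TOY_MODEL` is a tiny hand-written table that looks only at the previous token, whereas a real model conditions on the whole sequence, and `forward_pass` and `generate` are hypothetical names.

```python
import random

# Toy "model": for each previous token, the probabilities of the next token.
TOY_MODEL = {
    "<start>": {"the": 1.0},
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "slept": 0.3},
    "dog": {"sat": 0.5, "slept": 0.5},
    "sat": {"<end>": 1.0},
    "slept": {"<end>": 1.0},
}


def forward_pass(tokens):
    """Return the next-token distribution given the tokens so far."""
    return TOY_MODEL[tokens[-1]]


def generate(max_tokens=10, seed=0):
    """Generate one token at a time; return the words and the number of passes."""
    rng = random.Random(seed)
    tokens = ["<start>"]
    passes = 0
    while passes < max_tokens:          # the maximum-tokens cap
        probs = forward_pass(tokens)    # one pass per generated token
        passes += 1
        next_token = rng.choices(list(probs), weights=list(probs.values()))[0]
        if next_token == "<end>":       # the model signals it is finished
            break
        tokens.append(next_token)       # feed the output back in as input
    return tokens[1:], passes


words, passes = generate()
print(" ".join(words), f"({passes} forward passes)")
```

A three-word answer costs four passes here (three words plus the pass that produces the end token), and lowering `max_tokens` cuts generation short. That one-pass-per-token structure is what makes long outputs slow.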
The Key Inference Settings
When you run inference, a handful of settings shape the output. You have met the most important one already.
- Temperature (from Chapter 5) — controls randomness. Low for focused, repeatable answers; higher for varied, creative ones.
- Maximum tokens — caps how long the response can be. Essential for controlling cost and preventing runaway output.
- Stop sequences — strings that end generation as soon as the model produces one of them, useful for keeping output tidy and bounded.
- Top-p (nucleus sampling) — another randomness control. The model samples only from the smallest set of most-probable next tokens whose combined probability reaches p, so unlikely tokens are never chosen. Often left at its default.
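The two randomness controls in the list above can be sketched in a few lines. This is an illustration of the general technique, not any provider's implementation: the scores in `logits` are made up, the function names are hypothetical, and the sketch requires a temperature above zero.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [score / temperature for score in logits.values()]
    peak = max(scaled)                            # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return {tok: e / total for tok, e in zip(logits, exps)}


def top_p_filter(probs, top_p):
    """Keep the smallest set of most-probable tokens whose total reaches top_p."""
    kept, cumulative = {}, 0.0
    for tok, p in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}   # renormalise to sum to 1


def sample(logits, temperature=1.0, top_p=1.0, rng=random):
    """Apply temperature, then top-p, then draw one token."""
    probs = top_p_filter(softmax_with_temperature(logits, temperature), top_p)
    return rng.choices(list(probs), weights=list(probs.values()))[0]


# Made-up scores for the token that follows "The sky is".
logits = {"blue": 4.0, "clear": 3.0, "falling": 1.0, "purple": 0.5}
print(softmax_with_temperature(logits, 0.2))   # nearly all weight on "blue"
print(softmax_with_temperature(logits, 2.0))   # weight spread more evenly
```

With these scores at temperature 1.0, a `top_p` of 0.9 keeps only "blue" and "clear", because together they already account for more than 90% of the probability. Temperature reshapes the whole distribution, while top-p cuts off its tail.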
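In an API call, these settings are ordinary parameters:

    import anthropic

    # Reads the API key from the ANTHROPIC_API_KEY environment variable.
    client = anthropic.Anthropic()

    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=200,   # cap the length: controls cost and runaway output
        temperature=0.2,  # low: focused and repeatable
        messages=[{"role": "user", "content": "List three uses for embeddings."}],
    )
    print(response.content[0].text)

Streaming: Getting Tokens as They Arrive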
Because generation happens token by token, you do not have to wait for the whole response before seeing anything. Streaming delivers tokens to you as they are produced, which is why chat interfaces show text appearing word by word, as if typed. Nothing about the total speed changes, but the experience is far better — the user sees progress immediately instead of staring at a blank screen, and for a long response that difference is enormous. Whenever a human is waiting on the output, prefer streaming.
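The difference between waiting and streaming can be simulated without any real model. In this sketch `fake_model` is a hypothetical generator that sleeps briefly before each token to stand in for a forward pass; the token count and delay are arbitrary.

```python
import time


def fake_model(num_tokens=20, seconds_per_token=0.01):
    """Simulated generation: yields one token at a time, with a delay before each."""
    for i in range(num_tokens):
        time.sleep(seconds_per_token)   # stands in for one forward pass
        yield f"tok{i} "


def run_blocking():
    """Wait for the whole response; the first text appears only at the end."""
    start = time.perf_counter()
    text = "".join(fake_model())
    elapsed = time.perf_counter() - start
    return text, elapsed, elapsed       # (text, time to first text, total time)


def run_streaming(on_token=lambda tok: None):
    """Hand each token to the caller as soon as it exists."""
    start = time.perf_counter()
    first = None
    pieces = []
    for tok in fake_model():
        if first is None:
            first = time.perf_counter() - start
        on_token(tok)                   # e.g. print(tok, end="", flush=True)
        pieces.append(tok)
    return "".join(pieces), first, time.perf_counter() - start


_, first_b, total_b = run_blocking()
_, first_s, total_s = run_streaming()
print(f"blocking:  first text after {first_b:.2f}s, done after {total_b:.2f}s")
print(f"streaming: first text after {first_s:.2f}s, done after {total_s:.2f}s")
```

Both runs finish at about the same moment, but the streaming run has something to show after a single token's delay. That gap between time to first token and total time is the entire benefit.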
Cost and Latency in Practice
Two practical realities govern inference. Cost, for cloud models, is driven by tokens — both the tokens you send in and the tokens generated out (recall Chapter 11). Latency, how long you wait, is driven mostly by how many tokens are generated and how large the model is. The practical levers follow directly: keep outputs as short as the task allows, use a smaller or faster model when it is good enough (Chapter 46), and avoid resending unnecessary context. Small habits here add up to large savings at scale.
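A back-of-the-envelope estimate makes these levers tangible. All numbers in this sketch are invented placeholders: the prices, the generation speed, and the time to first token vary by provider and model, and the simple linear latency formula is an assumption rather than a measured law.

```python
# Hypothetical prices, for illustration only; check your provider's price list.
PRICE_PER_MILLION_INPUT = 3.00      # dollars per 1,000,000 input tokens
PRICE_PER_MILLION_OUTPUT = 15.00    # dollars per 1,000,000 output tokens


def estimate_cost(input_tokens, output_tokens):
    """Dollars for one request: tokens sent in plus tokens generated out."""
    return (input_tokens * PRICE_PER_MILLION_INPUT
            + output_tokens * PRICE_PER_MILLION_OUTPUT) / 1_000_000


def estimate_latency(output_tokens, tokens_per_second=50.0, time_to_first_token=0.5):
    """Seconds for one request: a fixed startup delay plus time per output token."""
    return time_to_first_token + output_tokens / tokens_per_second


verbose = estimate_cost(input_tokens=2000, output_tokens=500)
trimmed = estimate_cost(input_tokens=500, output_tokens=200)
print(f"verbose: ${verbose:.4f} per request, {estimate_latency(500):.1f}s")
print(f"trimmed: ${trimmed:.4f} per request, {estimate_latency(200):.1f}s")
```

Under these made-up numbers, trimming the context and the output takes a request from about $0.0135 to $0.0045 and from 10.5 seconds to 4.5 seconds. The difference is invisible on one request and substantial across a million.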
Choosing for Your Project
The decision echoes Chapter 14. Start in the cloud: it is the fastest way to build and learn, with no hardware to wrangle. Move toward local inference when a concrete need pushes you there — strict privacy, high-volume cost savings, offline operation, or deep control. Many real systems even mix the two, using a small local model for simple, frequent tasks and a powerful cloud model for the hard ones, a strategy we return to in Chapter 46.
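The mixed approach amounts to a small routing decision made before each request. The sketch below is hypothetical throughout: `choose_backend` and its rules are invented for illustration, and prompt length is only a crude stand-in for how hard a task is.

```python
def choose_backend(prompt: str, contains_private_data: bool = False,
                   max_local_words: int = 50) -> str:
    """Return "local" or "cloud" using deliberately simple, made-up rules."""
    if contains_private_data:
        return "local"      # strict privacy: the data stays on the machine
    if len(prompt.split()) <= max_local_words:
        return "local"      # short, frequent request: a small model may suffice
    return "cloud"          # long or demanding request: use the stronger model


print(choose_backend("Summarise this sentence in five words."))
print(choose_backend("Analyse this contract. " * 40))
print(choose_backend("Analyse this contract. " * 40, contains_private_data=True))
```

A real router would judge difficulty with something better than word count, but the structure is the same: check the hard constraints first, then send each request to the cheapest backend that can handle it.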
Summary
Inference is the act of running a trained model to get output: the everyday counterpart to one-time training. You can run it in the cloud (convenient, powerful, scalable, but billed per token with data leaving your machine) or locally (private, no per-token fee, offline, but needing hardware and usually less capable). Under the hood, generation is autoregressive: the model produces one token at a time, which is why longer outputs take longer. Key settings (temperature, maximum tokens, stop sequences, top-p) shape the output. Streaming improves the experience by delivering tokens as they arrive. Cost and latency are driven mostly by token counts and model size. Start in the cloud and move local when a real need demands it.
Running a model is the easy part; getting it to do what you want is a craft. Chapter 27 begins that craft with prompt engineering — the cheapest and highest-leverage skill in this entire book.
Exercises
1. Explain the difference between training and inference, and why they have such different costs. Which one do you do constantly, and which happens rarely?
2. List two genuine advantages of running a model locally and two of running it in the cloud. For a project of your choice, say which you would pick and why.
3. Explain, in your own words, what autoregressive generation is and why it means longer responses take longer to produce.
4. Make the same request twice, once with a low temperature and once with a high temperature, and describe the difference in the outputs. Which setting would you use for generating code?
5. Measure how response time changes as you ask for longer and longer outputs (for example, a 1-sentence, 1-paragraph, and 1-page answer). Relate what you observe to token-by-token generation.
6. Explain why streaming improves the user experience even though it does not make the total generation any faster. When is it most worth using?
