Part 9 · Advanced and Cutting-Edge Topics

Chapter 46: Small Models, Local Agents, and Cost Optimization


There is a powerful instinct, when building agents, to reach for the biggest, most capable model for everything. It is usually a mistake — an expensive one. Many tasks do not need a frontier model, and using one anyway wastes money and time. This chapter is about doing more with less, deliberately: when a small model is the right choice, how local agents fit in, the routing pattern that combines small and large models, and the concrete levers for controlling cost. Right-sizing your models to your tasks is one of the most practical skills in production agent building.

Bigger Isn't Always Better

The most capable model is not the right model for every job, any more than a freight truck is the right vehicle for every errand. Frontier models are powerful but slow and expensive; many of the steps an agent takes — classifying a request, extracting a field, routing to the right tool, answering a simple question — are well within the reach of a far smaller, cheaper, faster model. Reflexively using a frontier model for everything inflates your cost and latency for no benefit. The skill is matching the model to the task at hand.

The Rise of Small Models

Small models have improved dramatically, and the trend is strongly toward more capable small models, with the market for them growing rapidly (as we noted back in Chapter 1). For a great many specific tasks, a small model is now simply good enough — and being smaller, it is faster and far cheaper to run, and often able to run locally on modest hardware. The gap between small and frontier models, while real for the hardest tasks, has narrowed enough that small models deserve to be your default consideration for well-defined work, not an afterthought.

When a Small Model Suffices

Knowing when a small model is enough is the heart of right-sizing.

Well-defined, narrow tasks — classification, extraction, routing, simple question-answering. Small models excel here.
High-volume tasks — when you run a step millions of times, even a small per-call saving compounds into a large one, so cost dominates the choice.
Latency-sensitive tasks — when a user is waiting, a fast small model often gives a better experience than a slow large one.
Privacy-sensitive tasks — a small model can run locally, keeping data on your own machine.

Reserve the big models for what genuinely needs them: hard multi-step reasoning, open-ended generation, and tasks requiring broad knowledge or sophisticated judgment. Use the powerful tool where it earns its cost, and the cheap tool everywhere else.
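The criteria above can be written down as a small decision helper. This is only a sketch under stated assumptions: the `Task` fields, the tier names, and the order of the checks are invented for illustration, and a real system would tune them against measured quality.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical description of one agent step (fields are illustrative)."""
    narrow: bool = False                # classification, extraction, routing, simple Q&A
    high_volume: bool = False           # the step runs a very large number of times
    latency_sensitive: bool = False     # a user is waiting on the answer
    privacy_sensitive: bool = False     # data should stay on your own machine
    needs_hard_reasoning: bool = False  # multi-step reasoning or open-ended work


def pick_tier(task: Task) -> str:
    """Return "small" or "large" for a task.

    Assumed ordering: genuine difficulty wins first, then any of the
    small-model signals; with no signal at all we fall back to "large"
    as the conservative choice.
    """
    if task.needs_hard_reasoning:
        return "large"
    if (task.narrow or task.high_volume
            or task.latency_sensitive or task.privacy_sensitive):
        return "small"
    return "large"


extraction = Task(narrow=True, high_volume=True)
planning = Task(needs_hard_reasoning=True)
```

Here `pick_tier(extraction)` selects the small tier and `pick_tier(planning)` the large one. Putting the difficulty check first is a design choice, not a rule: a privacy-sensitive task that also needs hard reasoning is a real trade-off that you would have to settle case by case.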

Local Agents

As Chapter 26 discussed, small models can run locally, on your own computer or servers, which is now practical for real agent components. A local model keeps data on hardware you control, incurs no per-call API fee, and works offline. The trade-offs are the familiar ones: you need the hardware and setup (and you still pay for the machine and its power), and local models are usually less capable than the best cloud models. But for the well-defined, high-volume, or privacy-sensitive tasks above, a local small model can be exactly right, and a hybrid agent might run simple steps locally while calling a cloud model only for the hard ones.
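One way to picture the hybrid agent is as a step runner that prefers the local model and reaches for the cloud only when a step is hard and its data is allowed to leave the machine. The sketch below assumes a "model" is just a function from prompt text to answer text; `make_hybrid_agent`, `fake_local`, and `fake_cloud` are hypothetical names, and the two fakes stand in for whatever local runtime and cloud client you actually use.

```python
from typing import Callable

Model = Callable[[str], str]  # any function from prompt text to answer text


def make_hybrid_agent(local_model: Model, cloud_model: Model) -> Callable[..., str]:
    """Build a step runner that prefers the local model.

    A step goes to the cloud only when it is hard AND its data may leave
    the machine; private steps always stay local.
    """

    def run_step(prompt: str, *, is_hard: bool = False,
                 is_private: bool = False) -> str:
        if is_private or not is_hard:
            return local_model(prompt)
        return cloud_model(prompt)

    return run_step


# Stand-in models so the sketch runs without any real model installed.
def fake_local(prompt: str) -> str:
    return f"[local] {prompt}"


def fake_cloud(prompt: str) -> str:
    return f"[cloud] {prompt}"


run_step = make_hybrid_agent(fake_local, fake_cloud)
```

Calling `run_step("extract the date")` stays local, while `run_step("draft a migration plan", is_hard=True)` goes to the cloud stand-in. A hard but private step also stays local in this sketch, which accepts a weaker answer in exchange for keeping the data on your machine.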

Routing: The Best of Both

The most powerful pattern in this chapter is routing: use a small, cheap model for easy steps and a large, expensive model only for hard ones, with a router that decides which to use for each request. It is triage — handle the routine cheaply, escalate the difficult. A simple router can itself be a small, fast model (or even a rule) that classifies how hard a request is and sends it to the appropriate model.

python

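def route(request):
    """Triage a request, then answer it with the cheapest adequate model.

    small_model_classify, small_model_answer, and large_model_answer are
    placeholders for your own model calls.
    """
    difficulty = small_model_classify(request)    # cheap, fast triage
    if difficulty == "easy":
        return small_model_answer(request)          # handle cheaply
    else:
        return large_model_answer(request)          # escalate only when needed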

# Most requests are easy and handled cheaply; only the hard ones cost more.

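Routing can cut cost substantially while keeping quality high, because in many real workloads most requests are easy and only a minority truly need the expensive model. The size of the saving depends on that mix and on the router's accuracy: a hard request misjudged as easy gets a weaker answer, and the triage step itself adds a small cost to every request. When the mix is favorable, you pay for power only when power is required.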
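The arithmetic behind that claim is simple enough to sketch. Every request pays for triage; easy requests then pay the small-model price and the rest pay the large-model price. The function name and all the prices below are made up (arbitrary units), chosen only to show the calculation.

```python
def routed_cost_per_request(easy_fraction: float, small_cost: float,
                            large_cost: float, triage_cost: float) -> float:
    """Expected cost of one request when a router triages it first."""
    if not 0.0 <= easy_fraction <= 1.0:
        raise ValueError("easy_fraction must be between 0 and 1")
    return (triage_cost
            + easy_fraction * small_cost
            + (1.0 - easy_fraction) * large_cost)


# Illustrative numbers only: the large model costs 10 units per request,
# the small model 1 unit, triage 0.5 units, and 80% of requests are easy.
baseline = 10.0  # cost per request if everything goes to the large model
routed = routed_cost_per_request(easy_fraction=0.8, small_cost=1.0,
                                 large_cost=10.0, triage_cost=0.5)
savings = 1.0 - routed / baseline
```

With these assumed numbers the routed cost is about 3.3 units against a baseline of 10, roughly a two-thirds reduction. The same formula also shows when routing does not pay: if almost no requests are easy, the triage cost is pure overhead and the routed cost exceeds the baseline. The sketch does not model quality loss from misrouted requests.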

Cost Optimization Strategies

Beyond routing, several concrete levers control cost, most of which you have already met.

Right-size the model — use the smallest model that does the job well for each task.
Cap output length — set maximum tokens so responses (and bills) stay bounded (Chapter 26).
Trim the context — send only what is needed, since you pay for every input token (Chapter 12).
Cache repeated calls — if the same request recurs, reuse the answer instead of paying again.
Use cheaper models for tool-use steps — many internal agent steps do not need a frontier model.
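Two of these levers, caching and context trimming, fit in a few lines. The following is a minimal sketch rather than a production design: `make_cached`, `trim_context`, and `fake_model` are hypothetical names, the cache matches prompts exactly and never expires, and "keep the most recent messages" is a deliberately crude trimming policy.

```python
from typing import Callable, Dict, List


def make_cached(model: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap a model so identical prompts are answered from memory."""
    cache: Dict[str, str] = {}

    def cached(prompt: str) -> str:
        if prompt not in cache:
            cache[prompt] = model(prompt)  # pay only on a cache miss
        return cache[prompt]

    return cached


def trim_context(messages: List[str], max_messages: int) -> List[str]:
    """Keep only the most recent messages."""
    if max_messages <= 0:
        return []
    return messages[-max_messages:]


# Stand-in model that records each paid call so the saving is visible.
calls: List[str] = []


def fake_model(prompt: str) -> str:
    calls.append(prompt)
    return prompt.upper()


ask = make_cached(fake_model)
first = ask("classify: refund request")
second = ask("classify: refund request")  # served from the cache
```

After the two calls, `calls` holds a single entry: the second answer cost nothing. Exact-match caching only helps when prompts repeat verbatim, and it is only safe when the right answer does not change over time. Trimming by recency can drop something important, which is the trade-off this chapter returns to later.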

Measuring and Managing Cost

You cannot optimize what you do not measure — the same truth that drove evaluation in Chapter 25. Log token usage on every call (Chapter 30), track cost per task, and set budgets so spending cannot run away unnoticed. With measurement in place, you can see where your money actually goes — often a surprising concentration in a few expensive steps — and target your optimization there. Without measurement, cost optimization is guesswork.
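A cost tracker does not need to be elaborate to be useful. The sketch below records token counts per call, totals cost per task, and raises once a budget is exceeded. The class and method names are hypothetical, and so are the per-token prices; in practice the token counts would come from whatever usage data your model provider or local runtime reports.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict


class BudgetExceeded(RuntimeError):
    """Raised when recorded spend goes past the configured budget."""


@dataclass
class CostTracker:
    input_price: float   # assumed price per input token
    output_price: float  # assumed price per output token
    budget: float        # total spend allowed before we stop
    spent: float = 0.0
    by_task: Dict[str, float] = field(
        default_factory=lambda: defaultdict(float))

    def record(self, task: str, input_tokens: int, output_tokens: int) -> float:
        """Log one call, update the totals, and enforce the budget."""
        cost = (input_tokens * self.input_price
                + output_tokens * self.output_price)
        self.spent += cost
        self.by_task[task] += cost
        if self.spent > self.budget:
            raise BudgetExceeded(
                f"spent {self.spent:.4f} of budget {self.budget:.4f}")
        return cost

    def most_expensive(self) -> str:
        """Name the task with the highest total cost so far.

        Assumes at least one call has been recorded.
        """
        return max(self.by_task, key=self.by_task.get)


tracker = CostTracker(input_price=1.0, output_price=2.0, budget=100.0)
tracker.record("classify", input_tokens=10, output_tokens=5)
tracker.record("plan", input_tokens=20, output_tokens=10)
```

At this point `tracker.most_expensive()` points at the `"plan"` step, which is where optimization effort would go first. Raising an exception on overspend is one possible policy; logging a warning or switching to a cheaper model are equally reasonable choices.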

The Trade-Offs

There is no free lunch. Every choice balances capability, cost, speed, and privacy, and pushing one usually pulls another. A bigger model is more capable but slower and pricier; a local small model is private and cheap but less capable; aggressive context trimming saves money but risks dropping something important. The goal is not to minimize any single dimension but to choose deliberately for each task, with eyes open to the trade-offs. Thoughtful right-sizing, guided by measurement, is what makes an agent both capable and affordable.

Summary

Using the biggest model for everything is a common and costly mistake; many tasks need only a small model, which is faster, cheaper, and often runnable locally for privacy. Small models suffice for well-defined, high-volume, latency-sensitive, and privacy-sensitive tasks, while big models earn their cost on hard reasoning and open-ended work. The routing pattern combines both — triaging requests so easy ones are handled cheaply and only hard ones escalate — and dramatically cuts cost in typical workloads. Further levers include capping output, trimming context, caching, and using cheap models for internal steps, all guided by measuring cost since you cannot optimize what you do not measure. Every choice trades off capability, cost, speed, and privacy, so the goal is deliberate right-sizing rather than minimizing any one dimension.

An efficient, well-guarded agent is ready for the real world. Chapter 47 covers deploying agents to production — turning a working prototype into a reliable, monitored, maintainable service.

Practice

Exercises

1. List three tasks for which a small model would clearly suffice and one that genuinely needs a large model. Justify each based on the task's characteristics.
2. Run the same task on a small model and a large model (or reason about how they would differ) and compare quality, speed, and cost. When is the small model the better overall choice?
3. Implement a simple router that sends easy requests to a small model and hard ones to a large model. Explain why this can cut cost without much loss of quality in a typical workload.
4. List five concrete cost-optimization levers from this chapter, and for each, name the earlier chapter where the underlying idea first appeared.
5. Explain why measurement is a prerequisite for cost optimization. What would you log, and how would it guide where you focus your effort?
6. For an agent of your choice, describe the trade-offs you would weigh among capability, cost, speed, and privacy, and explain the model choices you would make for its different steps.
