Part 6 · Using Language Models in Practice

Chapter 30: Working with LLM APIs in Code

We close Part VI with the practical engineering that surrounds every real model call. Using a model in a playground is easy; using it reliably inside a program means handling the message format, carrying on a conversation, recovering from failures, and keeping costs under control. None of this is glamorous, but it is exactly the difference between a fragile demo and something you can depend on — and it is the immediate groundwork for the agents you will build next. As always, everything is hands-on and beginner-friendly, building on the first API call you made all the way back in Chapter 3.

From Playground to Production

Sending one prompt and reading one reply is the simplest possible use of a model. Real applications need more: they hold multi-turn conversations, they survive network hiccups and rate limits, they track and control spending, and they keep secrets safe. This chapter assembles those practical skills. Think of it as the wiring around the model — unseen when it works, sorely missed when it is absent.

The Message Format, Revisited

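Recall from Chapters 4 and 11 that a conversation is a list of message dictionaries, each with a role and content. The roles are typically system (overall instructions), user (what the person says), and assistant (what the model says). This structure is how you communicate with a model in code. One provider-specific detail matters here: some APIs accept the system message as the first entry in the list, while the Anthropic Messages API used in this chapter's examples takes it as a separate top-level system parameter and allows only user and assistant roles inside the list.

python
import anthropic

client = anthropic.Anthropic()   # reads ANTHROPIC_API_KEY from the environment

messages = [
    {"role": "user", "content": "What is an embedding?"},
]
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=200,
    system="You are a concise, friendly assistant.",   # system prompt: top-level parameter
    messages=messages,
)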

The System Message

The system message is a special, dedicated channel for instructions that should govern the entire conversation — the model's role, tone, rules, and persona. It is the proper home for the role-setting you learned in Chapter 27. Whereas a user message is a single turn, the system message frames everything: "You are a careful medical-information assistant who always reminds users to consult a doctor," set once at the top, shapes every reply that follows. Use it for durable behavior, and ordinary user messages for the moment-to-moment requests.

Managing Multi-Turn Conversations

Here is a fact that surprises almost every beginner: the model is stateless. It remembers nothing between calls. Each time you call the API, it sees only what you send in that request. To carry on a conversation, you must keep the history and send the whole thing every time, appending each new exchange.

python
# The Anthropic Messages API takes the system prompt as a top-level parameter,
# so the history list holds only user and assistant turns.
SYSTEM_PROMPT = "You are a helpful tutor."
messages = []

def ask(user_text):
    messages.append({"role": "user", "content": user_text})
    response = client.messages.create(
        model="claude-sonnet-4-6", max_tokens=300,
        system=SYSTEM_PROMPT, messages=messages,
    )
    reply = response.content[0].text
    messages.append({"role": "assistant", "content": reply})   # remember the reply
    return reply

ask("What is a vector?")
ask("How is that used in RAG?")   # the model sees the whole history, so 'that' makes sense

Because you resend the full history, every conversation grows, and that growth is exactly what eventually bumps against the context window from Chapter 12 — which is why long conversations need summarizing or trimming.
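
Trimming can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a definitive implementation: the helper names estimate_tokens and trim_history are hypothetical, the four-characters-per-token estimate is only a rough rule of thumb (a real tokenizer or the usage numbers reported by the API are more accurate), and each message's content is assumed to be a plain string. The system prompt is assumed to live outside the list, so it is never trimmed away.

```python
def estimate_tokens(text):
    # Rough heuristic (assumption): about four characters per token.
    return max(1, len(text) // 4)


def trim_history(messages, max_tokens=2000):
    """Return the most recent messages that fit within an estimated token budget."""
    kept = []
    total = 0
    # Walk backwards so the newest turns are kept first.
    for message in reversed(messages):
        cost = estimate_tokens(message["content"])
        if kept and total + cost > max_tokens:
            break                      # budget exhausted; drop everything older
        kept.append(message)           # the latest message is always kept
        total += cost
    kept.reverse()
    # Keep the history starting on a user turn, since a reply with no
    # preceding question gives the model little to work with.
    while len(kept) > 1 and kept[0]["role"] != "user":
        kept.pop(0)
    return kept
```

Calling trim_history(messages) just before each request keeps the newest turns and leaves the original list untouched. Summarizing the dropped turns into a short note is the alternative when older context still matters.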

Handling Errors Gracefully

Networked services fail sometimes: a request times out, a rate limit is hit, a connection drops (the very errors you met in Chapter 3's troubleshooting). Robust code expects this and retries, ideally with exponential backoff — waiting a little longer before each retry so you do not hammer a struggling service.

python
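import time
import anthropic

# Retry only transient failures; a bad API key or malformed request fails every time.
RETRYABLE = (
    anthropic.APIConnectionError,   # dropped connections and timeouts
    anthropic.RateLimitError,       # HTTP 429
    anthropic.InternalServerError,  # HTTP 5xx
)

def call_with_retry(messages, max_retries=4):
    for attempt in range(max_retries):
        try:
            return client.messages.create(
                model="claude-sonnet-4-6", max_tokens=300, messages=messages
            )
        except RETRYABLE as error:
            if attempt == max_retries - 1:   # no point sleeping after the last try
                raise RuntimeError("All retries failed") from error
            wait = 2 ** attempt          # 1s, 2s, 4s -- back off each time
            print(f"Attempt {attempt + 1} failed ({error}); retrying in {wait}s")
            time.sleep(wait)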
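
A common refinement of the doubling schedule is to cap the delay and add random jitter, so that many clients hitting the same rate limit do not all retry at the same instant. The sketch below is one way to do that under stated assumptions: the names backoff_delay and retry are hypothetical, the one-second base and thirty-second cap are arbitrary, and the sleep parameter exists only so the function can be tested without real waiting.

```python
import random
import time


def backoff_delay(attempt, base=1.0, cap=30.0):
    """Random delay between 0 and min(cap, base * 2**attempt) seconds."""
    return random.uniform(0, min(cap, base * 2 ** attempt))


def retry(call, max_retries=4, retry_on=(Exception,), sleep=time.sleep):
    """Run call() until it succeeds or max_retries attempts have failed."""
    for attempt in range(max_retries):
        try:
            return call()
        except retry_on:
            if attempt == max_retries - 1:
                raise                      # out of attempts: surface the real error
            sleep(backoff_delay(attempt))  # wait before the next attempt
```

A model call would be passed in as a zero-argument function, for example retry(lambda: client.messages.create(...), retry_on=(...)), with retry_on narrowed to the transient error classes of whichever SDK is in use.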

Controlling Cost

Since cloud models bill per token (Chapter 11), keeping an eye on usage is part of responsible engineering. Cap output length with maximum tokens, choose a smaller model when it suffices, avoid resending needless context, and log your token usage so costs never surprise you. Most APIs report the tokens used in every response.

python
response = client.messages.create(
    model="claude-sonnet-4-6", max_tokens=300, messages=messages
)
usage = response.usage            # tokens consumed by this call
print("Input tokens:", usage.input_tokens, "Output tokens:", usage.output_tokens)
# Log these to track spending over time.
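
One way to turn those per-call numbers into a running total is a small helper like the one below. It is a sketch, not a prescribed design: the name log_usage, the UsageTotals class, and the in-memory totals object are hypothetical, and the prices are placeholder figures chosen for illustration, so substitute the current rates from your provider's pricing page. The only thing it assumes about the usage object is that it has input_tokens and output_tokens attributes, as in the example above.

```python
from dataclasses import dataclass

# Placeholder prices in dollars per million tokens (assumption, not real rates).
PRICE_PER_MILLION = {"input": 3.00, "output": 15.00}


@dataclass
class UsageTotals:
    calls: int = 0
    input_tokens: int = 0
    output_tokens: int = 0

    def estimated_cost(self):
        """Estimated spend in dollars at the placeholder prices above."""
        return (
            self.input_tokens * PRICE_PER_MILLION["input"]
            + self.output_tokens * PRICE_PER_MILLION["output"]
        ) / 1_000_000


totals = UsageTotals()   # running totals for this process


def log_usage(usage):
    """Add one response's token counts to the running totals."""
    totals.calls += 1
    totals.input_tokens += usage.input_tokens
    totals.output_tokens += usage.output_tokens
    return totals
```

Calling log_usage(response.usage) after every request and printing totals.estimated_cost() at the end of a run gives a quick sense of spending. A longer-lived application would write the same numbers to a file or database instead of keeping them in memory.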

Streaming in Code

As we saw in Chapter 26, streaming delivers tokens as they are generated, which makes an application feel responsive. In code, instead of waiting for one complete response, you loop over the pieces as they arrive and display each immediately — the typing effect users appreciate. Whenever a person is waiting on output, streaming is worth the small extra effort.
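
With the Anthropic Python SDK, which the client.messages.create calls in this chapter point to, streaming is available through client.messages.stream, a context manager whose text_stream attribute yields text chunks as they arrive. The sketch below assumes that SDK. The function name stream_reply and its out parameter are hypothetical, and the client is passed in as an argument so the function can be tried with any object that offers the same interface.

```python
import sys


def stream_reply(client, messages, model="claude-sonnet-4-6", max_tokens=300, out=None):
    """Write each text chunk as it arrives, then return the full reply."""
    out = out or sys.stdout
    pieces = []
    with client.messages.stream(
        model=model, max_tokens=max_tokens, messages=messages
    ) as stream:
        for text in stream.text_stream:
            out.write(text)      # show the chunk immediately
            out.flush()          # do not let buffering hide the typing effect
            pieces.append(text)
    out.write("\n")
    return "".join(pieces)
```

Returning the joined text matters for multi-turn use: the complete reply still has to be appended to the history as an assistant message once the stream ends.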

Putting It Together: A Robust Wrapper

The professional habit is to wrap all of this — message handling, retries, and usage logging — behind a single function of your own, so the rest of your program simply calls that function and never worries about the details. This also delivers the provider-independence advice from Chapter 14: if your whole program talks to one wrapper, swapping the underlying model later is a one-place change.

python
def chat(messages):
    # The model name and max_tokens are set inside call_with_retry,
    # so swapping models remains a one-place change.
    response = call_with_retry(messages)           # handles failures
    log_usage(response.usage)                       # your own logging helper; tracks cost
    return response.content[0].text                 # returns just the text

# The rest of your app calls chat() and stays blissfully unaware of the wiring.
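
If callers need to choose the model or the output cap per call, those values have to be forwarded all the way to the API request. The following self-contained sketch shows one way to do that. Everything named here is an assumption made for illustration: robust_chat, usage_log, and the sleep parameter are hypothetical, the client is passed in explicitly, and the broad except Exception should be narrowed to your SDK's transient error classes in real code.

```python
import time

usage_log = []   # hypothetical in-memory log of (model, input_tokens, output_tokens)


def robust_chat(client, messages, system=None, model="claude-sonnet-4-6",
                max_tokens=300, max_retries=4, sleep=time.sleep):
    """Send messages with retries, record token usage, and return the reply text."""
    params = {"model": model, "max_tokens": max_tokens, "messages": messages}
    if system is not None:
        params["system"] = system          # system prompt as a top-level parameter
    for attempt in range(max_retries):
        try:
            response = client.messages.create(**params)
            break
        except Exception:                  # assumption: narrow this in real code
            if attempt == max_retries - 1:
                raise                      # out of attempts: surface the real error
            sleep(2 ** attempt)            # 1s, 2s, 4s between attempts
    usage = response.usage
    usage_log.append((model, usage.input_tokens, usage.output_tokens))
    return response.content[0].text
```

Because the model name travels through a single keyword argument, trying a smaller model is a matter of passing a different value at the call site or changing one default.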

Keeping Secrets and Staying Flexible

Two final reminders tie back to earlier chapters. Keep your API key in a .env file, never in your code (Chapter 3), so secrets stay safe. And route every model call through a thin wrapper like the one above (Chapter 14), so you remain free to switch models and providers as the landscape shifts. Together, these habits keep your code secure, adaptable, and easy to maintain.
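
Reading the key from the environment can be as small as the sketch below. The helper name get_api_key is hypothetical. ANTHROPIC_API_KEY is the variable the Anthropic Python SDK looks for by default, and a .env file is typically loaded into the environment first with a tool such as python-dotenv's load_dotenv(), which is assumed to have run before this function is called.

```python
import os


def get_api_key(name="ANTHROPIC_API_KEY"):
    """Return the API key from the environment, or fail with a clear message."""
    key = os.environ.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or shell environment"
        )
    return key
```

Failing early with an explicit message is a deliberate choice here: a missing key is much easier to diagnose at startup than as an authentication error in the middle of a conversation.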

Summary

Using a model in real code means handling the machinery around the call. Conversations are lists of role-tagged messages, with the system message setting durable behavior for the whole exchange. The model is stateless, so you must keep and resend the full history yourself — which is what eventually fills the context window. Robust code retries failed calls with exponential backoff, controls cost by capping output and logging token usage, and uses streaming so applications feel responsive. The professional pattern is to wrap message handling, retries, and logging behind a single function, keep your key in a .env file, and route calls through a thin wrapper so you stay secure and provider-independent.

This completes Part VI and the entire "using models" arc — running inference, prompting, advanced techniques, tool calling, and the engineering around it. You now have every prerequisite to build agents. Part VII begins the heart of the book: the core of AI agents, starting with the anatomy of an agent in Chapter 31.

Practice

Exercises

  1. Write a small program that holds a three-turn conversation with a model, correctly keeping and resending the history so the model can refer back to earlier turns. Confirm it understands a follow-up question that depends on context.
  2. Explain what it means that the model is 'stateless,' and why this forces you to manage conversation history yourself. How does this connect to the context window from Chapter 12?
  3. Implement the `call_with_retry` wrapper with exponential backoff. Explain why waiting progressively longer between retries is better than retrying instantly.
  4. Add usage logging to your model calls so that every call records its input and output token counts. Run a few calls and inspect the totals.
  5. Modify a model call to stream its response token by token and display the text as it arrives. Describe how the experience differs from waiting for the whole response.
  6. Write a single `chat()` wrapper function that combines message handling, retries, and usage logging, and explain how routing all calls through it makes your code easier to maintain and to switch providers later.