Part 7 · The Core of AI Agents

Chapter 37: Vector Databases and Semantic Search


We close the core of agents with the storage layer that quietly powers two of its most important capabilities. Both the RAG of Chapter 36 and the agent memory of Chapter 34 work by storing embeddings and finding the most similar ones — and when you have more than a handful of items, you need a specialized tool to do that quickly: a vector database. This chapter explains what one is, why a plain list does not scale, how vector databases achieve their speed, and how to use one in practice. It is the infrastructure beneath retrieval and memory, and it completes Part VII.

Recap: Why We Need This

Two chapters have now leaned on the same operation. In Chapter 36, RAG embedded document chunks and, for each question, found the most similar chunks. In Chapter 34, agent memory embedded facts and experiences and recalled the most relevant ones. Both did this with a simple list and a loop that compared the query to every stored item. That works for small collections, but real systems can hold thousands, millions, or even billions of items, and the simple approach collapses under that weight. A vector database is the tool built for exactly this.

The Core Operation: Similarity Search

Everything here rests on one operation, which you already understand from Chapters 9 and 36: similarity search. You embed all your items into vectors and store them. Given a query, you embed it too, and find the stored vectors closest to the query vector, where "closest" means most similar in meaning, usually measured by cosine similarity (some systems use dot product or Euclidean distance instead). That "find the nearest vectors" step is the entire job of a vector database. Everything else is making it fast and convenient.
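As a refresher, here is a minimal sketch of cosine similarity and of ranking stored items by it. The three-dimensional "embeddings" are made-up numbers chosen for illustration; real embeddings come from a model and have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy embeddings (illustrative values, not produced by a real model).
query = [0.9, 0.1, 0.0]
stored = {
    "refund rules": [0.8, 0.2, 0.1],
    "shipping times": [0.1, 0.9, 0.2],
    "office hours": [0.0, 0.2, 0.9],
}

# Rank every stored item by similarity to the query, most similar first.
ranked = sorted(stored, key=lambda name: cosine_similarity(query, stored[name]), reverse=True)
print(ranked[0])   # refund rules
```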

Why a Plain List Doesn't Scale

Recall how Chapter 36 found relevant chunks: it compared the query to every single stored vector, one by one, in a linear scan. With a hundred items, that is a hundred comparisons: instant. With ten million items, it is ten million comparisons for every query, each one a calculation across hundreds of dimensions, which is far too slow for an application answering many queries interactively. The cost grows in direct proportion to the size of the collection, so the naive approach does not just get slower; it becomes hopeless at scale. We need a way to find the nearest vectors without checking them all.
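A sketch of that linear scan in plain Python, using randomly generated vectors; the data, the 64 dimensions, and the function names are illustrative assumptions. It counts one similarity computation per stored vector, which is exactly the cost that grows with the collection.

```python
import heapq
import math
import random

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def linear_scan(query, vectors, k=3):
    """Exact top-k by brute force: one similarity computation per stored vector."""
    scored = []
    comparisons = 0
    for i, v in enumerate(vectors):
        scored.append((cosine(query, v), i))
        comparisons += 1
    best = heapq.nlargest(k, scored)   # the k highest (score, index) pairs
    return [i for _, i in best], comparisons

random.seed(0)
vectors = [[random.gauss(0, 1) for _ in range(64)] for _ in range(2000)]
ids, n = linear_scan(vectors[42], vectors)
print(ids[0], n)   # 42 2000
```

Doubling the collection doubles the count; at ten million stored vectors with a few hundred dimensions each, a single query means billions of multiply-adds.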

What a Vector Database Does

A vector database is a specialized database for storing embeddings and finding the most similar ones extremely fast, even among millions or billions of vectors. Where an ordinary database is built to look things up by exact values ("find the customer with this ID"), a vector database is built to look things up by similarity ("find the items closest in meaning to this one"). It is purpose-built for the one operation that retrieval and memory depend on.

How It's Fast: Indexing

How does a vector database avoid comparing the query to everything? Through indexing — organizing the stored vectors in advance so that, at query time, it can skip the vast majority and only examine the promising ones. The intuition is the index at the back of a book: rather than reading every page to find a topic, you jump to the right place. A vector database arranges its vectors so that similar ones are grouped, letting it home in on the relevant region instead of scanning the whole collection.
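A minimal sketch of one way to "group similar vectors": cluster them and keep a list of members per cluster, which is the idea behind inverted-file (IVF) indexes. At query time only the few clusters nearest the query are scanned. `ClusterIndex`, its parameters (`n_clusters`, `n_probe`), and the toy clustered data are hypothetical, chosen for illustration; real libraries implement this far more efficiently.

```python
import math
import random

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mean(vectors):
    """Component-wise average of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

class ClusterIndex:
    """Toy cluster-based index (hypothetical): each vector is filed under its nearest centroid."""

    def __init__(self, vectors, n_clusters=10, iters=4, seed=0):
        self.vectors = vectors
        # Start from randomly chosen vectors, then refine with a few k-means rounds.
        # Simplification: centroids are plain averages while assignment uses cosine.
        self.centroids = random.Random(seed).sample(vectors, n_clusters)
        for _ in range(iters):
            groups = self._assign()
            self.centroids = [mean([vectors[i] for i in g]) if g else c
                              for g, c in zip(groups, self.centroids)]
        self.lists = self._assign()   # cluster number -> ids of its member vectors

    def _nearest_centroids(self, v, n):
        """Numbers of the n centroids most similar to v, best first."""
        order = sorted(range(len(self.centroids)),
                       key=lambda c: cosine(v, self.centroids[c]), reverse=True)
        return order[:n]

    def _assign(self):
        """Group every stored vector under its single nearest centroid."""
        groups = [[] for _ in self.centroids]
        for i, v in enumerate(self.vectors):
            groups[self._nearest_centroids(v, 1)[0]].append(i)
        return groups

    def search(self, query, k=3, n_probe=2):
        """Rank only the vectors in the n_probe clusters closest to the query."""
        probed = self._nearest_centroids(query, n_probe)
        candidates = [i for c in probed for i in self.lists[c]]
        candidates.sort(key=lambda i: cosine(query, self.vectors[i]), reverse=True)
        return candidates[:k], len(candidates)

# Toy data: 1,000 vectors scattered around 10 random "topic" centers.
rng = random.Random(1)
centers = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(10)]
data = [[x + rng.gauss(0, 0.1) for x in rng.choice(centers)] for _ in range(1000)]

index = ClusterIndex(data)
ids, scanned = index.search(data[7], k=3)
print(ids, f"compared against {scanned} of {len(data)} vectors")
```

The clustering is paid for once, when the index is built; each later query then compares against only a fraction of the collection instead of all of it.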

There is one honest trade-off worth knowing. To gain this enormous speed, vector databases usually use approximate nearest-neighbor (ANN) search: they return vectors that are very likely among the closest, occasionally missing an exact best match, in exchange for being orders of magnitude faster. Most let you tune the balance, searching more of the index for higher accuracy or less of it for more speed. For retrieval and memory this trade is almost always worth it: a small chance of a slightly suboptimal result, for the ability to search millions of items in milliseconds.
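To see the trade-off as a number, a common measure is recall@k: the fraction of the true top-k results that the approximate search also returns. The sketch below compares exact brute force against random-hyperplane hashing, a simple form of locality-sensitive hashing (LSH) swapped in here for brevity rather than taken from any specific database; production systems more commonly rely on graph-based (for example, HNSW) or cluster-based indexes. All function names and parameters are illustrative.

```python
import math
import random

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def exact_top_k(query, vectors, k):
    """Ground truth: brute-force ranking of every stored vector."""
    return sorted(range(len(vectors)), key=lambda i: cosine(query, vectors[i]), reverse=True)[:k]

def signature(v, planes):
    """Random-hyperplane hash: one bit per plane, True if v lies on its positive side."""
    return tuple(sum(p_i * v_i for p_i, v_i in zip(p, v)) >= 0 for p in planes)

def lsh_top_k(query, vectors, buckets, planes, k):
    """Approximate: rank only the vectors that share the query's bucket."""
    candidates = buckets.get(signature(query, planes), [])
    return sorted(candidates, key=lambda i: cosine(query, vectors[i]), reverse=True)[:k]

def recall_at_k(approx, exact):
    """Fraction of the true top-k that the approximate search also returned."""
    return len(set(approx) & set(exact)) / len(exact)

rng = random.Random(0)
dim, n = 16, 1000
vectors = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(6)]   # 6 bits: up to 64 buckets

# Build the index once: file every vector under its hash signature.
buckets = {}
for i, v in enumerate(vectors):
    buckets.setdefault(signature(v, planes), []).append(i)

queries = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(20)]
recalls = [recall_at_k(lsh_top_k(q, vectors, buckets, planes, 5), exact_top_k(q, vectors, 5))
           for q in queries]
print(f"average recall@5: {sum(recalls) / len(recalls):.2f}")
```

Fewer bits (larger buckets) or checking neighboring buckets raises recall but scans more vectors; that dial is the speed-versus-exactness trade-off described above.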

Using a Vector Database in Code

Using a vector database is conceptually identical to the list-based code from Chapter 36 (store vectors, query for the nearest), but the database handles the storage and the fast search for you. The names below are illustrative rather than any one product's API, but the shape mirrors how popular vector databases work.

```python
# Sketch only: VectorDB is a placeholder for a real client (a hosted service or a
# local library), and embed() and chunks are the embedding function and document
# chunks used in Chapter 36. Exact method names and arguments vary by product.
db = VectorDB()

# STORE: add each item's embedding, with the original text alongside it.
for chunk in chunks:
    db.add(vector=embed(chunk), text=chunk)

# QUERY: find the items most similar to a question, fast even at huge scale.
def retrieve(question, k=3):
    return db.search(vector=embed(question), top_k=k)   # returns the k nearest texts

print(retrieve("What is your return policy?"))
```

Compare this to Chapter 36's hand-written loop over a list: the logic is the same, but db.search replaces the linear scan and stays fast as the collection grows. This single swap is what takes a RAG system or an agent memory from a toy of a few items to a production system over millions.

Metadata and Filtering

Vector databases let you store extra information — metadata — alongside each vector: a source, a date, a category, a user ID. This unlocks filtering: searching for the most similar items that also match some condition. "Find the most relevant chunks, but only from documents this user is allowed to see," or "only from the last month." Metadata and filtering are essential in practice — they let retrieval respect permissions, recency, and scope, which matters as much for agent memory (whose memories?) as for RAG (whose documents?).
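A minimal sketch of filtered search over records that carry metadata. The field names (`user`, `date`), the toy two-dimensional vectors, and the `where` predicate are illustrative assumptions. This version filters first and then ranks the survivors; real vector databases expose filters through their own query syntax and differ in whether they apply them before, during, or after the similarity search.

```python
import math
from datetime import date

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Each record pairs a toy embedding with its text and metadata.
records = [
    {"vector": [0.9, 0.1], "text": "Refunds within 30 days", "user": "alice", "date": date(2024, 5, 2)},
    {"vector": [0.8, 0.3], "text": "Refund exceptions for sale items", "user": "bob", "date": date(2024, 5, 20)},
    {"vector": [0.1, 0.9], "text": "Shipping takes 3 to 5 days", "user": "alice", "date": date(2024, 1, 10)},
    {"vector": [0.85, 0.2], "text": "Old refund policy (retired)", "user": "alice", "date": date(2023, 2, 1)},
]

def filtered_search(query, records, where=lambda r: True, k=2):
    """Keep records whose metadata passes `where`, then rank the survivors by similarity."""
    allowed = [r for r in records if where(r)]
    allowed.sort(key=lambda r: cosine(query, r["vector"]), reverse=True)
    return [r["text"] for r in allowed[:k]]

def alice_since_2024(r):
    """Permission and recency filter: only alice's documents dated 2024 or later."""
    return r["user"] == "alice" and r["date"] >= date(2024, 1, 1)

query = [1.0, 0.0]   # stands in for the embedding of a question about refunds

print(filtered_search(query, records))
# ['Refunds within 30 days', 'Old refund policy (retired)']
print(filtered_search(query, records, where=alice_since_2024))
# ['Refunds within 30 days', 'Shipping takes 3 to 5 days']
```

Without the filter, the retired 2023 policy ranks second on similarity alone; the filter removes it, along with documents the user is not allowed to see.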

Choosing a Vector Database

As with models (Chapter 14), your options span a spectrum. There are hosted vector databases that run as a service you call over the network — easy, scalable, no infrastructure — and local or embedded ones you run yourself, which keep data private and have no per-use fee but require setup. The decision mirrors the hosted-versus-local thinking from earlier: start with whatever is simplest for your scale, and move toward self-hosting when privacy, cost, or control demand it. For a few thousand items, even the plain list from Chapter 36 is fine; reach for a real vector database when your collection grows large.

Vector DBs Power RAG and Memory

Step back and see how the part fits together. The vector database is the storage layer beneath both of the capabilities that make agents knowledgeable and continuous. Under RAG (Chapter 36), it stores and searches your document chunks. Under agent memory (Chapter 34), it stores and searches the agent's own facts and experiences. One piece of infrastructure, the same similarity search, serving two of the most important needs in agent building. Understanding it means understanding what holds the knowledge and the memory of every serious agent.

Summary

A vector database stores embeddings and finds the most similar ones quickly, which is the operation underlying both RAG and agent memory. A plain list works for small collections but requires comparing the query to every stored vector, which becomes hopelessly slow at scale. Vector databases solve this with indexing, organizing vectors so a query can skip most of them, usually trading exactness for speed via approximate nearest-neighbor search. In code, a vector database simply replaces the hand-written linear scan with a fast search that holds up over millions of items, and it adds metadata and filtering so searches can respect permissions, recency, and scope. Hosted and local options exist along a familiar spectrum. The vector database is the shared infrastructure beneath retrieval and memory: the storage layer at the core of capable agents.

This completes Part VII — the core of agents. You now understand the full anatomy: the loop, ReAct, tools, memory, planning, retrieval, and the storage beneath them. Part VIII puts it all to work, building real-world agents with modern frameworks, beginning with an overview of the frameworks themselves in Chapter 38.

Practice

Exercises

1. Explain why the list-and-loop approach to similarity search from Chapter 36 does not scale to millions of items. What exactly becomes the bottleneck?
2. Using any vector database (hosted or a local library), store the embeddings of several text snippets and run a similarity query. Confirm it returns the most relevant snippets.
3. Compare the results of a keyword search and a semantic (embedding) search for the same query over a small collection, for example, searching 'car' when a document says 'automobile.' Explain the difference.
4. Explain, in your own words and with an analogy, what an index does and why it lets a vector database avoid comparing the query to every stored vector.
5. Explain what approximate nearest-neighbor search trades away and what it gains. Why is this trade almost always worth it for retrieval and memory?
6. Describe a concrete use of metadata filtering in a vector search (for instance, restricting results by user or by date) and explain why it matters for both RAG and agent memory.
