Contextual Retrieval with vLLM and LlamaIndex
If you've built a RAG system, you've probably noticed something frustrating: sometimes the search returns chunks that seem relevant but aren't quite right. The words match, but the context is off.
This happens because of context loss during chunking. When we split documents into smaller pieces for embedding, each piece loses its relationship to the bigger picture. The Contextual Retriever pattern fixes this by adding context back before we embed anything.
I'll walk through both the intuition and the technical implementation. Feel free to skip to whatever section is most useful to you.
01. The Context Loss Problem
The Torn Page Problem
Imagine you find a single page on the ground. It says: "The function returns embeddings from the last hidden layer." Is this about Python? JavaScript? Machine learning? Database queries? Without knowing which book this page came from, you're just guessing.
This is exactly what happens in traditional RAG systems. We chop documents into small pieces (chunks) to search through them, but each piece loses the context of where it came from. It's like tearing pages out of books and throwing them in a pile.
[Figure: finding information with vs. without context. Alone, the sentence "The function returns embeddings from the last hidden layer." is like a page torn from a book. Labeled with its source (Book: LlamaIndex Documentation; Chapter: Custom Embedding Models), we know exactly what it's about.]
Vector Similarity Without Context
When chunks are embedded without context, the vector space becomes ambiguous. A chunk saying "The function returns embeddings" could match queries about any embedding function in any library. The embedding model captures the words, but not the semantic scope.
This leads to low recall and irrelevant results. The right chunk exists in your database, but the query vector doesn't find it because the embedding space is too crowded with similar-but-wrong matches.
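The crowding effect is easy to see even with a toy bag-of-words similarity, a crude stand-in for a learned embedding model (everything below is illustrative, not part of any library):

```python
import math
import re
from collections import Counter

def bow_cosine(a, b):
    """Cosine similarity over word counts, a toy stand-in
    for a learned embedding model."""
    ta = Counter(re.findall(r"[a-z]+", a.lower()))
    tb = Counter(re.findall(r"[a-z]+", b.lower()))
    dot = sum(ta[w] * tb[w] for w in ta)
    na = math.sqrt(sum(v * v for v in ta.values()))
    nb = math.sqrt(sum(v * v for v in tb.values()))
    return dot / (na * nb)

chunk = "The function returns embeddings from the last hidden layer."
q_ml = "How does the custom embedding model function work in LlamaIndex?"
q_db = "What does the database function return for the last query?"

# Bare chunk: the *database* query actually scores higher
print(bow_cosine(chunk, q_ml) < bow_cosine(chunk, q_db))   # True

# Prepending context flips the ranking toward the intended query
contextual = "LlamaIndex docs, Custom Embedding Models section: " + chunk
print(bow_cosine(contextual, q_ml) > bow_cosine(contextual, q_db))  # True
```

Real embedding models are far better than word counts, but they exhibit the same failure mode: without context, surface-level word overlap can pull the wrong neighbors ahead of the right ones.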
[Figure: the same chunk ("The function returns a list of embeddings for the input tokens. It uses the model's hidden states from the last layer.") embedded twice: traditional chunking loses hierarchy, while contextual chunking preserves it by prepending "[Section: Custom Embedding Models]".]
Vector embeddings capture semantic meaning. With context included, they become far more precise.
02. The Solution: Contextual Enrichment
Adding the Book Title to Every Page
The fix is simple in concept. Before saving each chunk, we add a header that explains where it came from. It's like writing the book title and chapter name at the top of every page. Anyone finding that page now knows exactly what book it's from.
We use an LLM, served through vLLM (a high-throughput inference engine), to read each chunk and write a short description of what it's about. This description gets stored with the chunk, making searches much more accurate.
[Figure: building context step by step, like adding ingredients to a recipe.]
The Enrichment Pipeline
The contextual enrichment pipeline operates in stages. Each stage adds more semantic information to the chunk:
- Document context: Title, summary, document type
- Section context: Headers, subsection hierarchy
- Neighbor context: Surrounding chunks for narrative flow
- LLM-generated description: Semantic summary of the chunk's purpose
[Figure: the contextual enrichment pipeline; each stage adds semantic information.]
The enriched text is then embedded. The resulting vector captures not just the chunk's content, but its semantic position within the document hierarchy.
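A minimal sketch of that assembly, assuming simple dict-shaped records (field names like `section_path` and `doc_type` are illustrative, not a LlamaIndex API):

```python
def build_enriched_text(chunk, doc, neighbors, llm_description):
    """Assemble the text that actually gets embedded, stage by stage."""
    parts = [
        # Stage 1: document context
        f"Document: {doc['title']} ({doc['doc_type']}). {doc['summary']}",
        # Stage 2: section context (full header hierarchy)
        f"Section: {' > '.join(chunk['section_path'])}",
        # Stage 3: neighbor context for narrative flow
        f"Previous: {neighbors['prev']}" if neighbors.get("prev") else "",
        # Stage 4: LLM-generated semantic description
        f"About: {llm_description}",
        # Finally, the chunk itself
        chunk["text"],
    ]
    return "\n".join(p for p in parts if p)

enriched = build_enriched_text(
    chunk={
        "section_path": ["Embeddings", "Custom Embedding Models"],
        "text": "The function returns embeddings from the last hidden layer.",
    },
    doc={
        "title": "LlamaIndex Documentation",
        "doc_type": "docs",
        "summary": "API reference for LlamaIndex.",
    },
    neighbors={"prev": "You can subclass BaseEmbedding to plug in any model."},
    llm_description="Explains what a custom embedding function returns.",
)
```

The `enriched` string, not the raw chunk, is what gets passed to the embedding model.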
03. Generating Context with vLLM
AI as Your Librarian
Think of the model behind vLLM as a super-fast librarian. For every page in your library, this librarian writes a quick note: "This page is from the Python documentation, specifically about how the embed function works with custom models."
vLLM is an inference engine designed to serve these requests at high throughput. It can process thousands of chunks per minute, making contextual enrichment practical even for large document collections.
[Figure: an LLM reads the chunk ("The function returns embeddings from the hidden layer...") and writes a one-line description of what it's about.]
High-Throughput Context Generation
vLLM provides the throughput needed for production-scale context generation. The key configuration involves batch processing, model selection, and token limits:

```python
from llama_index.llms.vllm import VllmServer

# Point LlamaIndex at a running vLLM server; the model (e.g.
# meta-llama/Llama-2-7b-chat-hf) is selected when the server is launched
llm = VllmServer(
    api_url="<YOUR_API_URI>",
    max_new_tokens=150,  # keep context descriptions concise
)

# The prompt template is critical
CONTEXT_PROMPT = """Given this document context:
Title: {title}
Section: {section}
Describe what this chunk is about in one sentence:
{chunk_text}"""

context = llm.complete(
    CONTEXT_PROMPT.format(
        title=doc.title,
        section=section.header,
        chunk_text=chunk.text,
    )
).text
```

[Figure: vLLM context generation pipeline, batch processing for throughput.]
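The batch-processing piece can be sketched with a simple client-side fan-out. This is an illustrative helper (not a LlamaIndex API) that works with any LLM object exposing the `.complete(...)` interface used above:

```python
from concurrent.futures import ThreadPoolExecutor

def enrich_chunks(llm, prompts, max_workers=8):
    """Generate one context description per prompt, concurrently.

    `llm` is assumed to be any object with a .complete(prompt) method
    returning a response with a .text attribute, as LlamaIndex LLM
    clients do. vLLM then batches the overlapping in-flight requests
    server-side via continuous batching.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return [resp.text for resp in pool.map(llm.complete, prompts)]
```

With a live endpoint, `enrich_chunks(llm, [CONTEXT_PROMPT.format(...) for chunk in chunks])` keeps the server saturated instead of waiting on one chunk at a time.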
04. Vector Storage with Milvus
A Smart Filing Cabinet
Milvus is like a magical filing cabinet. Instead of organizing files alphabetically, it organizes them by meaning. When you search for "how to embed documents," it instantly finds all the pages that talk about embedding, even if they use different words.
With contextual enrichment, each file now has a label saying what book it came from. The filing cabinet gives you much better results because it knows the full context of each page.
[Figure: searching for "how to embed documents". Without context, results like "The embed() function takes input...", "Documents can be processed...", and "Embedding vectors represent..." are generic and may or may not be relevant; with context, each result's label tells you exactly where it comes from.]
Vector Similarity Search
Milvus handles vector similarity search with support for hybrid queries that combine dense vectors and sparse keyword matching. The enriched context improves both:
```python
from pymilvus import (
    Collection,
    CollectionSchema,
    DataType,
    FieldSchema,
    connections,
)

# Connect to Milvus
connections.connect("default", host="localhost", port="19530")

# Create a collection with a dedicated contextual field
collection = Collection(
    "contextual_chunks",
    schema=CollectionSchema([
        FieldSchema("id", DataType.INT64, is_primary=True),
        FieldSchema("embedding", DataType.FLOAT_VECTOR, dim=768),
        FieldSchema("context", DataType.VARCHAR, max_length=500),
        FieldSchema("chunk_text", DataType.VARCHAR, max_length=2000),
        FieldSchema("doc_title", DataType.VARCHAR, max_length=200),
    ]),
)

# Build an index and load the collection before searching
collection.create_index(
    "embedding",
    {"metric_type": "COSINE", "index_type": "IVF_FLAT", "params": {"nlist": 128}},
)
collection.load()

# Search with context-aware similarity
# (query_embedding is the embedded user query)
results = collection.search(
    data=[query_embedding],
    anns_field="embedding",
    param={"metric_type": "COSINE", "params": {"nprobe": 10}},
    limit=5,
    output_fields=["context", "chunk_text", "doc_title"],
)
```

[Figure: Milvus vector search, the query vector finds nearest neighbors in embedding space.]
COSINE similarity finds semantically similar chunks. Contextual embeddings make similarity more meaningful.
Complete Architecture
Here's how all the pieces fit together. Document processing feeds into vLLM for context generation, which flows into Milvus for vector storage. LlamaIndex orchestrates the queries:
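Before wiring up real services, the hand-off between the pieces can be sketched end-to-end in plain Python, with toy stand-ins (a trigram-hashing "embedder" and an in-memory list) in place of the real embedding model, vLLM, and Milvus:

```python
import hashlib
import math

def toy_embed(text, dim=32):
    """Hypothetical stand-in for a real embedding model: hash
    lower-cased character trigrams into a fixed-size unit vector."""
    text = text.lower()
    vec = [0.0] * dim
    for i in range(len(text) - 2):
        h = int(hashlib.md5(text[i:i + 3].encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def enrich(chunk, title, section):
    """Stand-in for the vLLM step: prepend document context."""
    return f"Document: {title}\nSection: {section}\n{chunk}"

def search(store, query, top_k=2):
    """Stand-in for Milvus: cosine search over an in-memory list."""
    q = toy_embed(query)
    scored = sorted(
        ((sum(a * b for a, b in zip(q, v)), text) for text, v in store),
        reverse=True,
    )
    return [text for _, text in scored[:top_k]]

# Documents -> chunking -> context generation -> vector store
corpus = [
    ("The function returns embeddings from the last hidden layer.",
     "LlamaIndex Documentation", "Custom Embedding Models"),
    ("The function returns rows from the last executed statement.",
     "SQL Cookbook", "Cursors"),
]
store = []
for chunk, title, section in corpus:
    enriched = enrich(chunk, title, section)
    store.append((enriched, toy_embed(enriched)))

# Query -> similarity search -> response
top = search(store, "custom embedding models in LlamaIndex", top_k=1)
```

Swapping the stand-ins for real components changes the plumbing, not the shape: enrich before you embed, store the enriched text alongside the vector, and search with the query vector.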
Complete Contextual Retriever Architecture
```
+----------------+     +----------------+     +----------------+
|   Documents    | --> |    Chunking    | --> |      vLLM      |
|   (PDF, MD)    |     |   (Semantic)   |     |  Context Gen   |
+----------------+     +----------------+     +----------------+
                                                      |
                                                      v
+----------------+     +----------------+     +----------------+
|     Query      | --> |   Similarity   | <-- |     Milvus     |
|  (User Input)  |     |     Search     |     |  Vector Store  |
+----------------+     +----------------+     +----------------+
                              |
                              v
                      +----------------+
                      |   LlamaIndex   |
                      |    Response    |
                      +----------------+
```
Wrapping Up
If you take away just a few things from this post, let it be these:
- Context is king. Raw chunks lose the document hierarchy that makes them meaningful. Adding context back dramatically improves search relevance.
- LLM-generated descriptions work. An LLM served through vLLM produces semantic descriptions at scale, transforming ambiguous chunks into well-labeled content.
- Milvus handles production scale. Vector databases like Milvus handle millions of vectors with sub-second latency, essential for real-time RAG.
- The enrichment pipeline matters. Document context, section headers, and neighbor chunks all contribute to embedding quality.
The full working notebook is available on GitHub: Contextual-Retriever-working-notebook
This pattern has become essential in my RAG implementations. It consistently improves retrieval quality across different domains and document types. Give it a try and see the difference for yourself.