Contextual Retrieval with vLLM and LlamaIndex
If you've built a RAG system, you've probably noticed something frustrating: sometimes the search returns chunks that seem relevant but aren't quite right. The words match, but the context is off.
This happens because of context loss during chunking. When we split documents into smaller pieces for embedding, each piece loses its relationship to the bigger picture. The Contextual Retriever pattern fixes this by adding context back before we embed anything.
I'll walk through both the intuition and the technical implementation. Feel free to skip to whatever section is most useful to you.
01. The Context Loss Problem
The Torn Page Problem
Imagine you find a single page on the ground. It says: "The function returns embeddings from the last hidden layer." Is this about Python? JavaScript? Machine learning? Database queries? Without knowing which book this page came from, you're just guessing.
This is exactly what happens in traditional RAG systems. We chop documents into small pieces (chunks) to search through them, but each piece loses the context of where it came from. It's like tearing pages out of books and throwing them in a pile.
[Figure: finding information with vs. without context. Alone, the sentence "The function returns embeddings from the last hidden layer." is like a page torn from a book. Labeled with its source (Book: LlamaIndex Documentation; Chapter: Custom Embedding Models), we know exactly what it's about.]
Vector Similarity Without Context
When chunks are embedded without context, the vector space becomes ambiguous. A chunk saying "The function returns embeddings" could match queries about any embedding function in any library. The embedding model captures the words, but not the semantic scope.
This leads to low recall and irrelevant results. The right chunk exists in your database, but the query vector doesn't find it because the embedding space is too crowded with similar-but-wrong matches.
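The crowding effect is easy to see even with a toy bag-of-words similarity, a crude stand-in for a learned embedding model (everything below is illustrative, not part of any library):

```python
import math
import re
from collections import Counter

def bow_cosine(a, b):
    """Cosine similarity over word counts, a toy stand-in
    for a learned embedding model."""
    ta = Counter(re.findall(r"[a-z]+", a.lower()))
    tb = Counter(re.findall(r"[a-z]+", b.lower()))
    dot = sum(ta[w] * tb[w] for w in ta)
    na = math.sqrt(sum(v * v for v in ta.values()))
    nb = math.sqrt(sum(v * v for v in tb.values()))
    return dot / (na * nb)

chunk = "The function returns embeddings from the last hidden layer."
q_ml = "How does the custom embedding model function work in LlamaIndex?"
q_db = "What does the database function return for the last query?"

# Bare chunk: the *database* query actually scores higher
print(bow_cosine(chunk, q_ml) < bow_cosine(chunk, q_db))   # True

# Prepending context flips the ranking toward the intended query
contextual = "LlamaIndex docs, Custom Embedding Models section: " + chunk
print(bow_cosine(contextual, q_ml) > bow_cosine(contextual, q_db))  # True
```

Real embedding models are far better than word counts, but they exhibit the same failure mode: without context, surface-level word overlap can pull the wrong neighbors ahead of the right ones.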
[Figure: the same chunk ("The function returns a list of embeddings for the input tokens. It uses the model's hidden states from the last layer.") embedded twice: traditional chunking loses hierarchy, while contextual chunking preserves it by prepending "[Section: Custom Embedding Models]".]
Vector embeddings capture semantic meaning. With context included, they become far more precise.
02. The Solution: Contextual Enrichment
Adding the Book Title to Every Page
The fix is simple in concept. Before saving each chunk, we add a header that explains where it came from. It's like writing the book title and chapter name at the top of every page. Anyone finding that page now knows exactly what book it's from.
We use an LLM, served through vLLM (a high-throughput inference engine), to read each chunk and write a short description of what it's about. This description gets stored with the chunk, making searches much more accurate.
[Figure: building context step by step, like adding ingredients to a recipe.]
The Enrichment Pipeline
The contextual enrichment pipeline operates in stages. Each stage adds more semantic information to the chunk:
- Document context: Title, summary, document type
- Section context: Headers, subsection hierarchy
- Neighbor context: Surrounding chunks for narrative flow
- LLM-generated description: Semantic summary of the chunk's purpose
[Figure: the contextual enrichment pipeline; each stage adds semantic information.]
The enriched text is then embedded. The resulting vector captures not just the chunk's content, but its semantic position within the document hierarchy.
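A minimal sketch of that assembly, assuming simple dict-shaped records (field names like `section_path` and `doc_type` are illustrative, not a LlamaIndex API):

```python
def build_enriched_text(chunk, doc, neighbors, llm_description):
    """Assemble the text that actually gets embedded, stage by stage."""
    parts = [
        # Stage 1: document context
        f"Document: {doc['title']} ({doc['doc_type']}). {doc['summary']}",
        # Stage 2: section context (full header hierarchy)
        f"Section: {' > '.join(chunk['section_path'])}",
        # Stage 3: neighbor context for narrative flow
        f"Previous: {neighbors['prev']}" if neighbors.get("prev") else "",
        # Stage 4: LLM-generated semantic description
        f"About: {llm_description}",
        # Finally, the chunk itself
        chunk["text"],
    ]
    return "\n".join(p for p in parts if p)

enriched = build_enriched_text(
    chunk={
        "section_path": ["Embeddings", "Custom Embedding Models"],
        "text": "The function returns embeddings from the last hidden layer.",
    },
    doc={
        "title": "LlamaIndex Documentation",
        "doc_type": "docs",
        "summary": "API reference for LlamaIndex.",
    },
    neighbors={"prev": "You can subclass BaseEmbedding to plug in any model."},
    llm_description="Explains what a custom embedding function returns.",
)
```

The `enriched` string, not the raw chunk, is what gets passed to the embedding model.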
03. Generating Context with vLLM
AI as Your Librarian
Think of the model behind vLLM as a super-fast librarian. For every page in your library, this librarian writes a quick note: "This page is from the Python documentation, specifically about how the embed function works with custom models."
vLLM is an inference engine designed to serve these requests at high throughput. It can process thousands of chunks per minute, making contextual enrichment practical even for large document collections.
[Figure: an LLM reads the chunk ("The function returns embeddings from the hidden layer...") and writes a one-line description of what it's about.]
High-Throughput Context Generation
vLLM provides the throughput needed for production-scale context generation. The key configuration involves batch processing, model selection, and token limits:

```python
from llama_index.llms.vllm import VllmServer

# Point LlamaIndex at a running vLLM server; the model (e.g.
# meta-llama/Llama-2-7b-chat-hf) is selected when the server is launched
llm = VllmServer(
    api_url="<YOUR_API_URI>",
    max_new_tokens=150,  # keep context descriptions concise
)

# The prompt template is critical
CONTEXT_PROMPT = """Given this document context:
Title: {title}
Section: {section}
Describe what this chunk is about in one sentence:
{chunk_text}"""

context = llm.complete(
    CONTEXT_PROMPT.format(
        title=doc.title,
        section=section.header,
        chunk_text=chunk.text,
    )
).text
```

[Figure: vLLM context generation pipeline, batch processing for throughput.]
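The batch-processing piece can be sketched with a simple client-side fan-out. This is an illustrative helper (not a LlamaIndex API) that works with any LLM object exposing the `.complete(...)` interface used above:

```python
from concurrent.futures import ThreadPoolExecutor

def enrich_chunks(llm, prompts, max_workers=8):
    """Generate one context description per prompt, concurrently.

    `llm` is assumed to be any object with a .complete(prompt) method
    returning a response with a .text attribute, as LlamaIndex LLM
    clients do. vLLM then batches the overlapping in-flight requests
    server-side via continuous batching.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return [resp.text for resp in pool.map(llm.complete, prompts)]
```

With a live endpoint, `enrich_chunks(llm, [CONTEXT_PROMPT.format(...) for chunk in chunks])` keeps the server saturated instead of waiting on one chunk at a time.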
04. Vector Storage with Milvus
A Smart Filing Cabinet
Milvus is like a magical filing cabinet. Instead of organizing files alphabetically, it organizes them by meaning. When you search for "how to embed documents," it instantly finds all the pages that talk about embedding, even if they use different words.
With contextual enrichment, each file now has a label saying what book it came from. The filing cabinet gives you much better results because it knows the full context of each page.
[Figure: searching for "how to embed documents". Without context, results like "The embed() function takes input...", "Documents can be processed...", and "Embedding vectors represent..." are generic and may or may not be relevant; with context, each result's label tells you exactly where it comes from.]
Vector Similarity Search
Milvus handles vector similarity search with support for hybrid queries that combine dense vectors and sparse keyword matching. The enriched context improves both:
```python
from pymilvus import (
    Collection,
    CollectionSchema,
    DataType,
    FieldSchema,
    connections,
)

# Connect to Milvus
connections.connect("default", host="localhost", port="19530")

# Create a collection with a dedicated contextual field
collection = Collection(
    "contextual_chunks",
    schema=CollectionSchema([
        FieldSchema("id", DataType.INT64, is_primary=True),
        FieldSchema("embedding", DataType.FLOAT_VECTOR, dim=768),
        FieldSchema("context", DataType.VARCHAR, max_length=500),
        FieldSchema("chunk_text", DataType.VARCHAR, max_length=2000),
        FieldSchema("doc_title", DataType.VARCHAR, max_length=200),
    ]),
)

# Build an index and load the collection before searching
collection.create_index(
    "embedding",
    {"metric_type": "COSINE", "index_type": "IVF_FLAT", "params": {"nlist": 128}},
)
collection.load()

# Search with context-aware similarity
# (query_embedding is the embedded user query)
results = collection.search(
    data=[query_embedding],
    anns_field="embedding",
    param={"metric_type": "COSINE", "params": {"nprobe": 10}},
    limit=5,
    output_fields=["context", "chunk_text", "doc_title"],
)
```

[Figure: Milvus vector search, the query vector finds nearest neighbors in embedding space.]
COSINE similarity finds semantically similar chunks. Contextual embeddings make similarity more meaningful.
Complete Architecture
Here's how all the pieces fit together. Document processing feeds into vLLM for context generation, which flows into Milvus for vector storage. LlamaIndex orchestrates the queries:
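Before wiring up real services, the hand-off between the pieces can be sketched end-to-end in plain Python, with toy stand-ins (a trigram-hashing "embedder" and an in-memory list) in place of the real embedding model, vLLM, and Milvus:

```python
import hashlib
import math

def toy_embed(text, dim=32):
    """Hypothetical stand-in for a real embedding model: hash
    lower-cased character trigrams into a fixed-size unit vector."""
    text = text.lower()
    vec = [0.0] * dim
    for i in range(len(text) - 2):
        h = int(hashlib.md5(text[i:i + 3].encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def enrich(chunk, title, section):
    """Stand-in for the vLLM step: prepend document context."""
    return f"Document: {title}\nSection: {section}\n{chunk}"

def search(store, query, top_k=2):
    """Stand-in for Milvus: cosine search over an in-memory list."""
    q = toy_embed(query)
    scored = sorted(
        ((sum(a * b for a, b in zip(q, v)), text) for text, v in store),
        reverse=True,
    )
    return [text for _, text in scored[:top_k]]

# Documents -> chunking -> context generation -> vector store
corpus = [
    ("The function returns embeddings from the last hidden layer.",
     "LlamaIndex Documentation", "Custom Embedding Models"),
    ("The function returns rows from the last executed statement.",
     "SQL Cookbook", "Cursors"),
]
store = []
for chunk, title, section in corpus:
    enriched = enrich(chunk, title, section)
    store.append((enriched, toy_embed(enriched)))

# Query -> similarity search -> response
top = search(store, "custom embedding models in LlamaIndex", top_k=1)
```

Swapping the stand-ins for real components changes the plumbing, not the shape: enrich before you embed, store the enriched text alongside the vector, and search with the query vector.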
Complete Contextual Retriever Architecture
```
+----------------+     +----------------+     +----------------+
|   Documents    | --> |    Chunking    | --> |      vLLM      |
|   (PDF, MD)    |     |   (Semantic)   |     |  Context Gen   |
+----------------+     +----------------+     +----------------+
                                                      |
                                                      v
+----------------+     +----------------+     +----------------+
|     Query      | --> |   Similarity   | <-- |     Milvus     |
|  (User Input)  |     |     Search     |     |  Vector Store  |
+----------------+     +----------------+     +----------------+
                              |
                              v
                      +----------------+
                      |   LlamaIndex   |
                      |    Response    |
                      +----------------+
```
Wrapping Up
If you take away just a few things from this post, let it be these:
- Context is king. Raw chunks lose the document hierarchy that makes them meaningful. Adding context back dramatically improves search relevance.
- LLM-generated descriptions work. An LLM served through vLLM produces semantic descriptions at scale, transforming ambiguous chunks into well-labeled content.
- Milvus handles production scale. Vector databases like Milvus handle millions of vectors with sub-second latency, essential for real-time RAG.
- The enrichment pipeline matters. Document context, section headers, and neighbor chunks all contribute to embedding quality.
The full working notebook is available on GitHub: Contextual-Retriever-working-notebook
This pattern has become essential in my RAG implementations. It consistently improves retrieval quality across different domains and document types. Give it a try and see the difference for yourself.