Deep Dive into RAG: Hybrid Search and Re-ranking

AI RAG Search Python Algorithm Engineering

Updated on Feb 02, 2025

In our previous article, we explored the macro-architecture of production-grade RAG. Today, we’re diving into the deep end, focusing on the most critical and challenging component of any RAG system: Retrieval.

Many developers face this common frustration when deploying RAG: “I’ve chunked my documents perfectly and stored them in ChromaDB, but when a user asks for ‘Product Model X-200,’ the embedding search returns an irrelevant paragraph about ‘ISO 9001 certification’ instead of the actual spec sheet.”

The root cause is simple: Embedding models (Bi-Encoder models) excel at capturing semantic similarity but often struggle with “Exact Matching” (e.g., part numbers, names, or code IDs).

To solve this, we need to deploy two heavy-duty techniques: Hybrid Search and Re-ranking. This article cuts through the hype to provide the code and algorithmic principles you need to build a high-precision retrieval pipeline.

1. Hybrid Search: The Power of Combination

The core philosophy of Hybrid Search is “don’t put all your eggs in one basket.” It combines two fundamentally different retrieval technologies:

Dense Retrieval: Calculating cosine similarity of embedding vectors. Excels at understanding “intent.”
Sparse Retrieval: Inverted indices based on the BM25 algorithm. Excels at “exact matching.”

Hybrid Search Concept

Why combine them?

Consider a user query: “How to fix Error 502 Bad Gateway?”

Vector Search Perspective: It sees “fix,” “error,” and “gateway.” It might return a general article on “server maintenance,” which is semantically related but might not mention the specific 502 code at all. This is “relevant but useless.”

BM25 Perspective: It’s a literal matching machine. it fixates on the “502” token. It is highly likely to locate the specific troubleshooting manual that contains the “502” string.

Hybrid Search Perspective: It understands that you want to fix a fault (Vector’s job) and locks onto the specific “502” feature (BM25’s job). Only by combining both can you guarantee high Recall.

Practical Implementation: RRF Fusion with LangChain

How do you merge results from two different scales (vectors are usually 0.7-0.9 cosine similarity, while BM25 scores are absolute values like 10-20)? The most robust algorithm is RRF (Reciprocal Rank Fusion). It ignores the raw scores and focuses entirely on the ranking.

The formula: $$ RRFscore(d) = \sum_{r \in R} \frac{1}{k + r(d)} $$ Where $k$ is a constant (usually 60) and $r(d)$ is the document’s rank in that specific retrieval path.

Here is the Python implementation logic:

from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

# 1. Initialize the two Retrievers
# Assume 'docs' is your list of split Documents
bm25_retriever = BM25Retriever.from_documents(docs)
bm25_retriever.k = 50  # Recall top 50 via sparse search

embedding = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(docs, embedding)
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 50}) # Recall top 50 via vector search

# 2. Initialize the Hybrid Retriever (EnsembleRetriever uses RRF by default)
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.5, 0.5]  # Weights can be tuned; 50/50 is a good start
)

# 3. Execute Retrieval
docs = ensemble_retriever.invoke("How to fix Error 502?")
# Returns a deduplicated and re-ranked list of top results

In production, if you use platforms like ElasticSearch or Milvus, they have Hybrid Search built-in, following these same principles.

2. Re-ranking: Surgical Precision

After Hybrid Search, we usually have a candidate pool of Top 50 or even Top 100 documents. Crucial note: Never throw all 50 documents directly into the LLM.

Here’s why:

Noise Interference: Out of 50, only 3 might be truly relevant. The other 47 are distractions that significantly degrade the LLM’s reasoning (the Distraction Issue).
Sky-high Costs: Tokens are expensive. Feeding 10k extra tokens per request will blow up your API bill.
Increased Latency: Longer contexts mean slower Time-To-First-Token (TTFT).

This is where the Re-ranker (Cross-Encoder model) comes in.

Re-ranking Workflow

Bi-Encoder vs. Cross-Encoder

Vector search uses a Bi-Encoder: Queries and Documents are encoded independently into vectors. This is blazingly fast (milliseconds) and perfect for searching through millions of records (Coarse-stage retrieval).
Re-ranking uses a Cross-Encoder: It joins the Query and the Document together (e.g., [CLS] Query [SEP] Document [SEP]) and feeds them into a BERT-like model for deep Full-Attention interaction. It can judge with extreme precision whether “this specific sentence answers this specific question.” It’s slower (hundreds of milliseconds) but vastly more accurate (Fine-stage re-ranking).

Model Selection and Implementation

Some of the best performing Re-rankers today include:

BGE-Reranker-v2-m3: The powerhouse of the open-source community with excellent multilingual support.
Cohere Rerank API: A SOTA commercial solution. No need to manage GPU resources—perfect for startups.
Jina Reranker: Another strong contender with support for extremely long contexts.

Using HuggingFace + BGE-Reranker:

from sentence_transformers import CrossEncoder

# Load model (GPU deployment recommended)
model = CrossEncoder('BAAI/bge-reranker-v2-m3', max_length=512)

def rerank_documents(query, retrieved_docs, top_k=5):
    # Construct input pairs: [[query, doc1], [query, doc2], ...]
    pairs = [[query, doc.page_content] for doc in retrieved_docs]
    
    # Predict scores
    scores = model.predict(pairs)
    
    # Pair docs with scores
    doc_score_pairs = list(zip(retrieved_docs, scores))
    
    # Sort docs descending by score
    doc_score_pairs.sort(key=lambda x: x[1], reverse=True)
    
    # Set an empirical threshold for relevance (optional)
    filtered_results = [
        doc for doc, score in doc_score_pairs 
        if score > 0.5 
    ]
    
    return filtered_results[:top_k]

# Chain this after the ensemble_retriever
final_docs = rerank_documents("How to fix Error 502?", docs)

3. The Complete Production Pipeline

A mature RAG retrieval pipeline should look like this:

Query Rewrite: If a user asks “How much is it?”, use conversation history to rewrite the query to “What is the price of the iPhone 15 Pro Max?”
Hybrid Retrieval (Coarse-stage):
- Parallel Vector Search (Top 50).
- Parallel BM25 Search (Top 50).
- RRF Fusion to get a unique Top 60.
Re-ranking (Fine-stage):
- Use a Cross-Encoder to score the Top 60.
- If the top-scoring doc is still below a threshold (e.g., < -5), trigger a “Refusal” mechanism—tell the user “I couldn’t find relevant information in the knowledge base” rather than hallucinating.
Context Construction: Slice the Top 5 and feed them into the Prompt.

4. Key Takeaways and Optimization Tips

Is Re-ranking too slow? Cross-Encoders are heavy. If you’re latency-sensitive, consider ColBERT architectures (like Jina-ColBERT), which preserve the precision of Cross-Encoders with speeds closer to Bi-Encoders.
Multilingual Support: If your docs are in English but your users ask in Spanish, BM25 will fail. You’ll need a “Query Translation” step before retrieval.
Don’t Hardcode Thresholds: Re-ranker logits might not be normalized. Run a test dataset, observe the score distributions between positive and negative samples, and set a threshold based on statistical evidence.

Summary

If your RAG system is consistently providing off-target answers, don’t just throw a larger LLM at it. According to the law of diminishing returns, you should check your retrieval pipeline first.

Retrieval is like prepping ingredients for a chef. If you can use Hybrid Search and Re-ranking to throw out the “rotten leaves” (irrelevant docs) and provide only the “premium wagyu” (precise context), even a standard chef (a smaller LLM) can cook up a masterpiece.