Embedding Models and Reranking for RAG

Selecting, fine-tuning, and combining embedding models with cross-encoder rerankers for production retrieval pipelines

Published: April 9, 2025

Keywords: RAG, embedding models, reranking, cross-encoder, bi-encoder, ColBERT, MTEB, Sentence Transformers, fine-tuning, Matryoshka, quantization, LlamaIndex, LangChain, retrieval, cosine similarity, dense retrieval, hybrid search

Introduction

The retrieval stage is the make-or-break component of any RAG pipeline. If the right chunks never get retrieved, the LLM has no chance of producing a correct answer — no matter how capable the model is.

At the heart of retrieval are two decisions: which embedding model converts your text to vectors, and whether a reranker re-scores the top results before they reach the LLM. These choices directly control recall (did you find the right documents?) and precision (did you avoid the wrong ones?).

This article covers everything you need to make these decisions for production RAG: how bi-encoders and cross-encoders work, how to choose from dozens of competing models, when and how to fine-tune, how to reduce costs with quantization, and how to wire a full retrieve-and-rerank pipeline in LlamaIndex and LangChain.

How Embedding Models Work

Bi-Encoders: The Workhorse of Dense Retrieval

A bi-encoder (also called a Sentence Transformer) encodes query and document independently into fixed-size vectors. At retrieval time, you compare vectors with cosine similarity or dot product.

graph LR
    A["Query"] --> B["Encoder"]
    C["Document"] --> D["Same Encoder"]
    B --> E["Query Vector<br/>[768 dims]"]
    D --> F["Doc Vector<br/>[768 dims]"]
    E --> G["Cosine<br/>Similarity"]
    F --> G
    G --> H["Score: 0.87"]

    style A fill:#4a90d9,color:#fff,stroke:#333
    style C fill:#e67e22,color:#fff,stroke:#333
    style B fill:#9b59b6,color:#fff,stroke:#333
    style D fill:#9b59b6,color:#fff,stroke:#333
    style E fill:#4a90d9,color:#fff,stroke:#333
    style F fill:#e67e22,color:#fff,stroke:#333
    style G fill:#27ae60,color:#fff,stroke:#333
    style H fill:#1abc9c,color:#fff,stroke:#333

Key property: Document vectors can be pre-computed and cached. At query time, only the query needs to be embedded — then you search millions of pre-computed vectors in milliseconds using approximate nearest neighbor (ANN) indices.

\text{similarity}(q, d) = \frac{\mathbf{q} \cdot \mathbf{d}}{||\mathbf{q}|| \cdot ||\mathbf{d}||}
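
In NumPy terms, scoring a query against pre-computed document vectors is just a normalized dot product. A toy sketch with made-up 4-dimensional vectors (real models output hundreds of dimensions):

```python
import numpy as np

# Pre-computed document vectors (built once, offline) — toy 4-dim values
doc_vectors = np.array([
    [0.1, 0.9, 0.2, 0.4],
    [0.8, 0.1, 0.5, 0.1],
    [0.2, 0.7, 0.3, 0.5],
])

# Query vector produced at query time by the same encoder
query = np.array([0.15, 0.85, 0.25, 0.45])

def cosine_scores(q: np.ndarray, docs: np.ndarray) -> np.ndarray:
    """similarity(q, d) = (q . d) / (||q|| * ||d||) against every document."""
    q_norm = q / np.linalg.norm(q)
    d_norm = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    return d_norm @ q_norm

scores = cosine_scores(query, doc_vectors)
best = int(np.argmax(scores))  # index of the most similar document
```

In production the argmax over millions of vectors is replaced by an ANN index, but the scoring function is the same.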

Cross-Encoders: The Precision Instrument

A cross-encoder takes a query-document pair as a single input and outputs a relevance score directly. The query and document tokens attend to each other through every transformer layer — enabling much richer interaction.

graph LR
    A["[CLS] Query [SEP] Document [SEP]"] --> B["Full Transformer<br/>(joint attention)"]
    B --> C["Relevance Score:<br/>8.61"]

    style A fill:#4a90d9,color:#fff,stroke:#333
    style B fill:#e74c3c,color:#fff,stroke:#333
    style C fill:#27ae60,color:#fff,stroke:#333

Key property: Cross-encoders are far more accurate than bi-encoders because they model fine-grained query-document interactions. But they cannot pre-compute document representations — every query requires a forward pass through the transformer for every candidate document.

Bi-Encoder vs Cross-Encoder

Aspect Bi-Encoder Cross-Encoder
Architecture Encode query & doc separately Encode query+doc jointly
Pre-compute docs Yes (offline indexing) No (per-query inference)
Speed Milliseconds over millions Seconds over hundreds
Accuracy Good Significantly better
Use case First-stage retrieval Second-stage reranking
Scalability Sub-linear per query (ANN index) O(k) forward passes per query (top-k)

The standard production pattern: bi-encoder retrieves top-k candidates, cross-encoder reranks them.

ColBERT: Late Interaction (Best of Both Worlds)

ColBERT (Khattab & Zaharia, 2020) introduces a middle ground: late interaction. It encodes query and document independently (like a bi-encoder) but retains per-token embeddings instead of pooling into a single vector. At scoring time, it computes fine-grained token-level similarity:

\text{score}(q, d) = \sum_{i=1}^{|q|} \max_{j=1}^{|d|} \mathbf{q}_i \cdot \mathbf{d}_j

Each query token finds its best-matching document token, and scores are summed. This captures detailed interactions without the cost of full cross-attention.
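
A toy NumPy sketch of the MaxSim operator above (made-up per-token embeddings, 4 dims instead of ColBERT's 128):

```python
import numpy as np

def maxsim_score(q_tokens: np.ndarray, d_tokens: np.ndarray) -> float:
    """ColBERT late interaction: each query token takes its best-matching
    document token's similarity, and the per-token maxima are summed."""
    sim = q_tokens @ d_tokens.T          # sim[i, j] = q_i . d_j, shape (|q|, |d|)
    return float(sim.max(axis=1).sum())  # max over doc tokens, sum over query tokens

# Toy per-token embeddings: 2 query tokens, 3 document tokens, 4 dims
q = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0]])
d = np.array([[0.9, 0.1, 0.0, 0.0],
              [0.0, 0.2, 0.8, 0.0],
              [0.1, 0.8, 0.0, 0.1]])

score = maxsim_score(q, d)  # query token 1 matches doc token 0, token 2 matches doc token 2
```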

Property Bi-Encoder ColBERT Cross-Encoder
Pre-compute docs Yes Yes (per-token) No
Interaction depth None (pooled) Token-level max-sim Full attention
Storage 1 vector/doc N vectors/doc N/A
Accuracy Good Very good Best
Latency Fastest Fast Slowest

Choosing an Embedding Model

The MTEB Benchmark

The Massive Text Embedding Benchmark (MTEB) evaluates embedding models across 56+ datasets spanning retrieval, classification, clustering, reranking, and more. It is the standard benchmark for comparing embedding models.

Key metrics to focus on for RAG:

  • Retrieval (nDCG@10) — the most relevant metric for RAG
  • Reranking — if you plan to use the model as a reranker
  • Classification — relevant for intent detection or routing

Current Model Landscape (2025)

Model Provider Dims Context Open Source MTEB Retrieval Notes
NV-Embed-v2 NVIDIA 4096 32K Yes 72.31 #1 MTEB overall (ICLR 2025)
text-embedding-3-large OpenAI 3072 8191 No ~62 Matryoshka support
text-embedding-3-small OpenAI 1536 8191 No ~55 Cost-effective
voyage-3-large Voyage AI 1024 32K No ~67 Code & multilingual
Cohere embed-v4 Cohere 1024 varies No ~65 Built-in binary quantization
jina-embeddings-v3 Jina AI 1024 8192 Yes ~66 Task-specific LoRA adapters
mxbai-embed-large-v1 Mixedbread 1024 512 Yes 54.39 Strong open-source
BGE-large-en-v1.5 BAAI 1024 512 Yes ~54 Widely used baseline
nomic-embed-text-v1.5 Nomic 768 8192 Yes ~53 MRL, runs on Ollama
GTE-large-en-v1.5 Alibaba 1024 8192 Yes ~57 Long context
all-MiniLM-L6-v2 SBERT 384 256 Yes ~42 Tiny, fast, good baseline

Selection Decision Matrix

graph TD
    A["Start"] --> B{"Data leaves<br/>your infra?"}
    B -->|"No (privacy)"| C{"GPU<br/>available?"}
    B -->|"Yes (API OK)"| D{"Budget<br/>priority?"}
    C -->|Yes| E["NV-Embed-v2<br/>or GTE-large"]
    C -->|No| F["nomic-embed-text<br/>(Ollama)"]
    D -->|Low cost| G["text-embedding-3-small"]
    D -->|Best quality| H["voyage-3-large<br/>or Cohere embed-v4"]

    style A fill:#4a90d9,color:#fff,stroke:#333
    style E fill:#27ae60,color:#fff,stroke:#333
    style F fill:#27ae60,color:#fff,stroke:#333
    style G fill:#27ae60,color:#fff,stroke:#333
    style H fill:#27ae60,color:#fff,stroke:#333

Embedding Best Practices

  1. Same model for indexing and queries — never mix embedding models; vectors from different models are incompatible
  2. Use task-specific prefixes when the model supports them:
# Many models expect a prefix for queries vs. documents
query = "query: What is attention in transformers?"
document = "passage: The attention mechanism allows the model to..."
  3. Normalize vectors for cosine similarity — most models output unit vectors, but verify
  4. Batch your embedding calls — embedding one-by-one is orders of magnitude slower
  5. Match context length to your chunks — a model with a 512-token context cannot embed 1000-token chunks

Embedding Models in Practice

LlamaIndex

# OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

embed_model = OpenAIEmbedding(model="text-embedding-3-small")

# Local via HuggingFace
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-large-en-v1.5",
    trust_remote_code=True,
)

# Local via Ollama (fully offline)
from llama_index.embeddings.ollama import OllamaEmbedding

embed_model = OllamaEmbedding(model_name="nomic-embed-text")

# Use in pipeline
from llama_index.core import VectorStoreIndex, Settings

Settings.embed_model = embed_model
index = VectorStoreIndex.from_documents(documents)

LangChain

# OpenAI
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Local via HuggingFace
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="BAAI/bge-large-en-v1.5",
    model_kwargs={"device": "cuda"},
    encode_kwargs={"normalize_embeddings": True},
)

# Local via Ollama
from langchain_ollama import OllamaEmbeddings

embeddings = OllamaEmbeddings(model="nomic-embed-text")

# Use in pipeline
from langchain_community.vectorstores import FAISS

vectorstore = FAISS.from_documents(chunks, embeddings)

Reranking: The Highest-ROI Upgrade

Adding a reranker is consistently the single biggest quality improvement you can make to an existing RAG pipeline — it often lifts retrieval relevance by 5–15% for only a few lines of code.

How Retrieve-and-Rerank Works

graph LR
    A["Query"] --> B["Bi-Encoder<br/>Retrieval"]
    B --> C["Top-20<br/>Candidates"]
    C --> D["Cross-Encoder<br/>Reranker"]
    D --> E["Reranked<br/>Top-5"]
    E --> F["LLM"]
    F --> G["Answer"]

    style A fill:#4a90d9,color:#fff,stroke:#333
    style B fill:#9b59b6,color:#fff,stroke:#333
    style C fill:#f5a623,color:#fff,stroke:#333
    style D fill:#e74c3c,color:#fff,stroke:#333
    style E fill:#27ae60,color:#fff,stroke:#333
    style F fill:#C8CFEA,color:#fff,stroke:#333
    style G fill:#1abc9c,color:#fff,stroke:#333

The pipeline:

  1. Retrieve top-N candidates with a fast bi-encoder (N = 20–50)
  2. Rerank all N candidates with a cross-encoder
  3. Return top-k reranked results to the LLM (k = 3–5)

The retrieve-wider, rerank-narrower pattern ensures you don’t miss relevant documents (high recall from step 1) while surfacing the most relevant ones (high precision from step 2).
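
Stripped of any framework, the pattern is just two ranked passes. A toy sketch where the hypothetical `dense_score` and `cross_score` functions stand in for the bi-encoder and cross-encoder:

```python
# Toy two-stage pipeline; dense_score / cross_score are cheap stand-ins
# for the bi-encoder and cross-encoder (hypothetical scoring functions).
def dense_score(query: str, doc: str) -> float:
    # Stand-in for vector similarity: fraction of query words in the doc
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def cross_score(query: str, doc: str) -> float:
    # Stand-in for the cross-encoder's joint relevance score: Jaccard overlap
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q | d), 1)

def retrieve_and_rerank(query, corpus, retrieve_n=20, final_k=5):
    # Stage 1: wide, cheap retrieval (high recall)
    candidates = sorted(corpus, key=lambda doc: dense_score(query, doc),
                        reverse=True)[:retrieve_n]
    # Stage 2: narrow, expensive reranking over the candidates only (high precision)
    return sorted(candidates, key=lambda doc: cross_score(query, doc),
                  reverse=True)[:final_k]
```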

Cross-Encoder Reranker Models

Model Provider Accuracy Speed Notes
rerank-v3.5 Cohere Best (API) Fast Production API, multilingual
jina-reranker-v2 Jina AI Very good Fast Open weights, multilingual
bge-reranker-v2-m3 BAAI Very good Medium Open source, multilingual
ms-marco-MiniLM-L6-v2 SBERT Good Fastest Tiny, great for prototyping
ms-marco-MiniLM-L12-v2 SBERT Better Fast Good quality/speed balance
NV-RerankQA-Mistral-4B NVIDIA Excellent Slower LLM-based reranker

Reranking with LlamaIndex

from llama_index.core import VectorStoreIndex, Settings
from llama_index.postprocessor.cohere_rerank import CohereRerank

# Retrieve more candidates, rerank to top-5
reranker = CohereRerank(
    api_key="YOUR_COHERE_KEY",
    top_n=5,
    model="rerank-v3.5",
)

query_engine = index.as_query_engine(
    similarity_top_k=20,          # Retrieve 20
    node_postprocessors=[reranker],  # Rerank to 5
)

response = query_engine.query("What are the benefits of RLHF?")

With a local cross-encoder:

from llama_index.core.postprocessor import SentenceTransformerRerank

reranker = SentenceTransformerRerank(
    model="cross-encoder/ms-marco-MiniLM-L6-v2",
    top_n=5,
)

query_engine = index.as_query_engine(
    similarity_top_k=20,
    node_postprocessors=[reranker],
)

Reranking with LangChain

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder

# Load cross-encoder
cross_encoder = HuggingFaceCrossEncoder(
    model_name="cross-encoder/ms-marco-MiniLM-L6-v2"
)
reranker = CrossEncoderReranker(model=cross_encoder, top_n=5)

# Wrap the base retriever
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 20})

reranking_retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=base_retriever,
)

results = reranking_retriever.invoke("What are the benefits of RLHF?")

With Cohere reranker:

from langchain_cohere import CohereRerank
from langchain.retrievers import ContextualCompressionRetriever

reranker = CohereRerank(
    model="rerank-v3.5",
    top_n=5,
)

reranking_retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 20}),
)

Reranking with Sentence Transformers Directly

from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2")

query = "What is the attention mechanism?"
passages = [
    "The attention mechanism allows the model to focus on relevant parts...",
    "Berlin is the capital of Germany...",
    "Transformers use self-attention to capture long-range dependencies...",
]

# Score each query-passage pair
scores = model.predict([(query, p) for p in passages])
# => array([ 8.92, -4.32,  7.61], dtype=float32)

# Or use the built-in rank method
ranks = model.rank(query, passages)
for rank in ranks:
    print(f"{rank['score']:.2f}\t{passages[rank['corpus_id']]}")

Hybrid Search: Dense + Sparse

Dense retrieval (embeddings) excels at semantic matching but can miss exact keyword matches — acronyms, product names, error codes. Sparse retrieval (BM25) handles these well. Combining both via Reciprocal Rank Fusion (RRF) gives the best of both worlds.
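
RRF itself is only a few lines: each document earns 1/(k + rank) from every ranking it appears in, and the summed scores decide the fused order. A minimal sketch (k = 60, the constant from the original RRF paper):

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked result lists: each doc scores sum(1 / (k + rank)),
    with rank starting at 1. Docs ranked well in several lists rise to the top."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["doc_a", "doc_b", "doc_c"]   # dense (embedding) ranking
sparse = ["doc_c", "doc_a", "doc_d"]  # sparse (BM25) ranking
fused = reciprocal_rank_fusion([dense, sparse])
```

doc_a appears near the top of both lists, so it wins even though neither retriever ranked it based on the other's evidence.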

graph TB
    A["Query"] --> B["Dense Retrieval<br/>(Bi-Encoder)"]
    A --> C["Sparse Retrieval<br/>(BM25)"]
    B --> D["Dense Top-20"]
    C --> E["Sparse Top-20"]
    D --> F["Reciprocal Rank<br/>Fusion"]
    E --> F
    F --> G["Fused Top-20"]
    G --> H["Cross-Encoder<br/>Reranker"]
    H --> I["Final Top-5"]

    style A fill:#4a90d9,color:#fff,stroke:#333
    style B fill:#9b59b6,color:#fff,stroke:#333
    style C fill:#f5a623,color:#fff,stroke:#333
    style F fill:#e74c3c,color:#fff,stroke:#333
    style H fill:#27ae60,color:#fff,stroke:#333
    style I fill:#1abc9c,color:#fff,stroke:#333

LangChain: Hybrid + Reranking Pipeline

from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever, ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder

# Stage 1: Hybrid retrieval
bm25_retriever = BM25Retriever.from_documents(chunks, k=20)
dense_retriever = vectorstore.as_retriever(search_kwargs={"k": 20})

hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, dense_retriever],
    weights=[0.3, 0.7],  # Dense-weighted
)

# Stage 2: Reranking
cross_encoder = HuggingFaceCrossEncoder(
    model_name="cross-encoder/ms-marco-MiniLM-L6-v2"
)
reranker = CrossEncoderReranker(model=cross_encoder, top_n=5)

final_retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=hybrid_retriever,
)

results = final_retriever.invoke("What is RLHF?")

LlamaIndex: Hybrid + Reranking Pipeline

from llama_index.core.retrievers import QueryFusionRetriever
from llama_index.retrievers.bm25 import BM25Retriever
from llama_index.core.postprocessor import SentenceTransformerRerank

# Stage 1: Hybrid retrieval
bm25_retriever = BM25Retriever.from_defaults(
    nodes=nodes, similarity_top_k=20
)
vector_retriever = index.as_retriever(similarity_top_k=20)

hybrid_retriever = QueryFusionRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    num_queries=1,
    use_async=False,
    similarity_top_k=20,
)

# Stage 2: Reranking
reranker = SentenceTransformerRerank(
    model="cross-encoder/ms-marco-MiniLM-L6-v2",
    top_n=5,
)

# Use in query engine
from llama_index.core.query_engine import RetrieverQueryEngine

query_engine = RetrieverQueryEngine.from_args(
    retriever=hybrid_retriever,
    node_postprocessors=[reranker],
)

response = query_engine.query("What is RLHF?")

Fine-Tuning Embedding Models

Off-the-shelf embedding models work well for general domains. But for specialized corpora — medical records, legal documents, internal codebases — fine-tuning can improve retrieval by 5–20%.

When to Fine-Tune

Scenario Fine-Tune? Why
General Q&A over docs No Pre-trained models are good enough
Domain-specific jargon Yes Models may not understand domain terms
Internal company data Maybe Start without, fine-tune if recall is poor
Non-English language Maybe Check MTEB for your language first
Code/SQL retrieval Yes Code semantics differ from natural language

Fine-Tuning with Sentence Transformers

The standard approach: contrastive learning with hard negatives — training on (query, positive, negative) triplets.

from sentence_transformers import SentenceTransformer, losses
from sentence_transformers.training_args import SentenceTransformerTrainingArguments
from sentence_transformers.trainer import SentenceTransformerTrainer
from datasets import Dataset

# 1. Prepare training data
# Each example: {"anchor": ..., "positive": ..., "negative": ...}
train_data = Dataset.from_dict({
    "anchor": ["What causes diabetes?", ...],
    "positive": ["Diabetes is caused by insulin resistance...", ...],
    "negative": ["Berlin is the capital of Germany...", ...],
})

# 2. Load base model
model = SentenceTransformer("BAAI/bge-base-en-v1.5")

# 3. Define loss
loss = losses.MultipleNegativesRankingLoss(model)

# 4. Training arguments
args = SentenceTransformerTrainingArguments(
    output_dir="./finetuned-embeddings",
    num_train_epochs=3,
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    warmup_ratio=0.1,
    fp16=True,
)

# 5. Train
trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_data,
    loss=loss,
)

trainer.train()
model.save_pretrained("./finetuned-embeddings")

Generating Training Data from Your RAG Pipeline

You don’t always need labeled datasets. Use your existing pipeline to mine hard negatives — documents the retriever returned but that weren’t relevant:

def mine_hard_negatives(
    queries: list[str],
    relevant_docs: list[str],
    retriever,
    top_k: int = 20,
) -> list[dict]:
    """Generate training triplets by mining hard negatives from retrieval."""
    triplets = []
    for query, positive in zip(queries, relevant_docs):
        results = retriever.invoke(query)
        for doc in results[:top_k]:
            if doc.page_content != positive:
                triplets.append({
                    "anchor": query,
                    "positive": positive,
                    "negative": doc.page_content,
                })
                break  # One hard negative per query
    return triplets

Fine-Tuning Cross-Encoder Rerankers

Cross-encoders can also be fine-tuned for your domain:

from sentence_transformers.cross_encoder import (
    CrossEncoder,
    CrossEncoderTrainer,
    CrossEncoderTrainingArguments,
)
from sentence_transformers.cross_encoder.losses import BinaryCrossEntropyLoss

# Load base model
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2")

# Training data: (query, passage) pairs with a 0/1 relevance label
# Use the same hard negative mining approach

loss = BinaryCrossEntropyLoss(model)

args = CrossEncoderTrainingArguments(
    output_dir="./finetuned-reranker",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    fp16=True,
)

trainer = CrossEncoderTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)

trainer.train()

Scaling Embeddings for Production

Matryoshka Representation Learning (MRL)

MRL trains embedding models so that the first N dimensions already form a good embedding. You can truncate from 1024 dims to 256 dims with minimal quality loss — reducing storage and compute by 4x.

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)

# Full 768-dim embedding
full_emb = model.encode("What is attention?")  # shape: (768,)

# Truncated to 256 dims — still high quality with MRL.
# Renormalize after truncating so cosine similarity stays well-behaved.
truncated_emb = full_emb[:256] / np.linalg.norm(full_emb[:256])  # shape: (256,)

OpenAI’s text-embedding-3-large supports this natively via the dimensions parameter:

from openai import OpenAI

client = OpenAI()

# Full 3072 dims
response = client.embeddings.create(
    model="text-embedding-3-large",
    input="What is attention?",
    dimensions=3072,
)

# Truncated to 256 dims (93.1% performance retention at 12x compression)
response = client.embeddings.create(
    model="text-embedding-3-large",
    input="What is attention?",
    dimensions=256,
)

Embedding Quantization

Reduce embedding storage and speed up search by quantizing from float32 to smaller types:

Precision Size per dim Memory (250M vectors, 1024d) Speed Quality Retained
float32 4 bytes 953 GB 1x 100%
int8 (scalar) 1 byte 238 GB ~4x ~99.3%
binary (1-bit) 0.125 bytes 30 GB ~25x ~96%

The results are remarkable: binary quantization retains 96% of performance at 32x memory reduction and 25x speed improvement.
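
Mechanically, binary quantization keeps only the sign of each dimension, packs 8 dimensions per byte, and compares vectors by Hamming distance. A NumPy sketch with random stand-in embeddings:

```python
import numpy as np

def to_binary(embeddings: np.ndarray) -> np.ndarray:
    """Binary quantization: keep only the sign of each dimension,
    packed 8 dims per byte (1024 floats -> 128 bytes)."""
    bits = (embeddings > 0).astype(np.uint8)
    return np.packbits(bits, axis=-1)

def hamming_distance(a: np.ndarray, b: np.ndarray) -> int:
    """Number of differing bits; lower = more similar."""
    return int(np.unpackbits(a ^ b).sum())

rng = np.random.default_rng(0)
docs = rng.standard_normal((1000, 1024)).astype(np.float32)   # stand-in corpus
query = rng.standard_normal(1024).astype(np.float32)

binary_docs = to_binary(docs)     # shape (1000, 128), uint8 — 32x smaller
binary_query = to_binary(query)   # shape (128,)
dists = np.array([hamming_distance(binary_query, d) for d in binary_docs])
top_200 = np.argsort(dists)[:200] # candidates for a higher-precision rescoring stage
```

Hamming distance over packed bits is what makes the ~25x speedup possible: it reduces to XOR plus popcount, which vector databases implement in hardware-friendly form.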

Binary Quantization with Sentence Transformers

from sentence_transformers import SentenceTransformer
from sentence_transformers.quantization import quantize_embeddings

model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")

# Full float32 embeddings
embeddings = model.encode(["Document text here..."])
# shape: (1, 1024), dtype: float32, size: 4096 bytes

# Binary quantization — 32x smaller
binary_embeddings = quantize_embeddings(embeddings, precision="binary")
# shape: (1, 128), dtype: int8, size: 128 bytes

# Scalar int8 quantization — 4x smaller
int8_embeddings = quantize_embeddings(
    embeddings,
    precision="int8",
    calibration_embeddings=calibration_set,  # Recommended: a representative sample of your corpus embeddings
)
# shape: (1, 1024), dtype: int8, size: 1024 bytes

Three-Stage Retrieval Pipeline

For maximum scalability, combine binary search, scalar rescoring, and cross-encoder reranking:

graph LR
    A["Query"] --> B["Binary Search<br/>(Hamming dist)<br/>→ Top-200"]
    B --> C["Scalar Rescore<br/>(int8 × float32)<br/>→ Top-20"]
    C --> D["Cross-Encoder<br/>Rerank<br/>→ Top-5"]
    D --> E["LLM"]

    style A fill:#4a90d9,color:#fff,stroke:#333
    style B fill:#f5a623,color:#fff,stroke:#333
    style C fill:#9b59b6,color:#fff,stroke:#333
    style D fill:#e74c3c,color:#fff,stroke:#333
    style E fill:#27ae60,color:#fff,stroke:#333

This pipeline uses:

  • 5 GB memory for the binary index (vs. 200 GB for float32)
  • 50 GB disk for the int8 index
  • Near-lossless quality with massive infrastructure savings
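
The middle stage can be sketched in NumPy: quantize documents to int8 once, then rescore the binary-stage candidates with int8 × float32 dot products (random stand-in embeddings; in a real pipeline `candidate_ids` would come from the binary Hamming search):

```python
import numpy as np

def int8_quantize(embs: np.ndarray):
    """Scalar int8 quantization: map each dimension's range onto [-127, 127]."""
    scale = np.abs(embs).max(axis=0) / 127.0 + 1e-12  # per-dimension scale
    return np.round(embs / scale).astype(np.int8), scale

def rescore_int8(query_f32, int8_docs, scale, candidate_ids, top_k=20):
    """Stage 2: rescore binary-search candidates via dequantized dot products."""
    approx = (int8_docs[candidate_ids].astype(np.float32) * scale) @ query_f32
    order = np.argsort(-approx)[:top_k]
    return candidate_ids[order]

rng = np.random.default_rng(0)
docs = rng.standard_normal((500, 256)).astype(np.float32)
query = rng.standard_normal(256).astype(np.float32)

int8_docs, scale = int8_quantize(docs)          # built once, stored on disk
candidates = np.arange(500)                     # ids surviving the binary stage
top_20 = rescore_int8(query, int8_docs, scale, candidates, top_k=20)
```

Only the compact int8 matrix is touched at query time; the expensive cross-encoder then sees just the top-20.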

Memory & Cost at Scale

Vectors float32 (1024d) int8 binary binary + int8 rescore
1M 3.8 GB 953 MB 119 MB 119 MB mem + 953 MB disk
10M 38 GB 9.5 GB 1.2 GB 1.2 GB mem + 9.5 GB disk
100M 381 GB 95 GB 12 GB 12 GB mem + 95 GB disk
1B 3.8 TB 953 GB 119 GB 119 GB mem + 953 GB disk
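
The table's figures fall directly out of n_vectors × dims × bits / 8 bytes. A quick check:

```python
def index_size_gb(n_vectors: int, dims: int = 1024, bits_per_dim: int = 32) -> float:
    """Index size in GiB for a given precision (32 = float32, 8 = int8, 1 = binary)."""
    total_bytes = n_vectors * dims * bits_per_dim / 8
    return total_bytes / (1024 ** 3)

print(round(index_size_gb(100_000_000)))                     # float32, 100M vectors: 381
print(round(index_size_gb(1_000_000_000, bits_per_dim=1)))   # binary, 1B vectors: 119
```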

Evaluating Retrieval Quality

Key Metrics

Metric What it Measures Formula
Recall@k Were relevant docs in top-k? \frac{\text{relevant retrieved}}{\text{total relevant}}
nDCG@k Were relevant docs ranked high? Takes rank position into account
MRR Where is the first relevant doc? \frac{1}{\text{rank of first relevant}}
Precision@k What fraction of top-k is relevant? \frac{\text{relevant in top-k}}{k}
Hit Rate Is at least one relevant doc in top-k? Binary: 0 or 1
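
For binary relevance labels, each of these metrics is a few lines of Python — a reference sketch you can run against any retriever's ranked output:

```python
import math

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant docs that appear in the top-k."""
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k that is relevant."""
    return len(set(retrieved[:k]) & relevant) / k

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant doc (0 if none retrieved)."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Binary-relevance nDCG: hits are discounted by log2(rank + 1),
    normalized by the best possible ordering."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc_id in enumerate(retrieved[:k], start=1)
              if doc_id in relevant)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0

retrieved = ["doc_7", "doc_42", "doc_3"]
relevant = {"doc_42", "doc_15"}
```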

Quick Evaluation with LlamaIndex

from llama_index.core.evaluation import RetrieverEvaluator

# Prepare evaluation dataset: {query: [relevant_doc_ids]}
qa_pairs = [
    {"query": "What is attention?", "expected_ids": ["doc_42", "doc_15"]},
    # ...
]

evaluator = RetrieverEvaluator.from_metric_names(
    ["mrr", "hit_rate"], retriever=index.as_retriever(similarity_top_k=5)
)

results = []
for pair in qa_pairs:
    result = await evaluator.aevaluate(
        query=pair["query"],
        expected_ids=pair["expected_ids"],
    )
    results.append(result)

# Aggregate
avg_mrr = sum(r.metric_vals_dict["mrr"] for r in results) / len(results)
avg_hit = sum(r.metric_vals_dict["hit_rate"] for r in results) / len(results)
print(f"MRR: {avg_mrr:.3f}, Hit Rate: {avg_hit:.3f}")

RAGAS for End-to-End RAG Evaluation

from ragas import evaluate
from ragas.metrics import context_recall, context_precision, faithfulness

result = evaluate(
    dataset=eval_dataset,  # Contains queries, ground truth, contexts, answers
    metrics=[context_recall, context_precision, faithfulness],
)

print(result)
# {'context_recall': 0.87, 'context_precision': 0.72, 'faithfulness': 0.93}

Comparing Retrieval Configurations

Always A/B test your changes. Here’s a pattern for comparing retrieval configs:

configs = {
    "baseline": {"top_k": 5, "reranker": None},
    "reranked": {"top_k": 20, "reranker": "cross-encoder/ms-marco-MiniLM-L6-v2"},
    "hybrid": {"top_k": 20, "reranker": "cross-encoder/ms-marco-MiniLM-L6-v2", "hybrid": True},
}

for name, config in configs.items():
    # Build retriever with config
    retriever = build_retriever(**config)

    # Evaluate on test set
    metrics = evaluate_retriever(retriever, test_queries)
    print(f"{name:15s} | MRR: {metrics['mrr']:.3f} | Recall@5: {metrics['recall']:.3f}")

Complete Production Pipeline

Here’s a complete retrieve-and-rerank pipeline combining all the techniques:

LlamaIndex

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama
from llama_index.core.postprocessor import SentenceTransformerRerank

# Configure models
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-base-en-v1.5")
Settings.llm = Ollama(model="llama3.2", request_timeout=120)

# Load & chunk
documents = SimpleDirectoryReader("./data").load_data()
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=50)

# Build index
index = VectorStoreIndex.from_documents(
    documents, transformations=[splitter], show_progress=True
)

# Reranker
reranker = SentenceTransformerRerank(
    model="cross-encoder/ms-marco-MiniLM-L6-v2",
    top_n=5,
)

# Query engine with reranking
query_engine = index.as_query_engine(
    similarity_top_k=20,
    node_postprocessors=[reranker],
)

response = query_engine.query("How does fine-tuning work?")
print(response)

LangChain

from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.cross_encoders import HuggingFaceCrossEncoder
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain.retrievers import ContextualCompressionRetriever
from langchain_ollama import ChatOllama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# Load & chunk
loader = DirectoryLoader("./data", glob="**/*.pdf", loader_cls=PyPDFLoader)
documents = loader.load()
chunks = RecursiveCharacterTextSplitter(
    chunk_size=512, chunk_overlap=50
).split_documents(documents)

# Embed & index
embeddings = HuggingFaceEmbeddings(
    model_name="BAAI/bge-base-en-v1.5",
    encode_kwargs={"normalize_embeddings": True},
)
vectorstore = FAISS.from_documents(chunks, embeddings)

# Retrieve + Rerank
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 20})
reranker = CrossEncoderReranker(
    model=HuggingFaceCrossEncoder(
        model_name="cross-encoder/ms-marco-MiniLM-L6-v2"
    ),
    top_n=5,
)
retriever = ContextualCompressionRetriever(
    base_compressor=reranker, base_retriever=base_retriever
)

# Generate
llm = ChatOllama(model="llama3.2")
prompt = ChatPromptTemplate.from_template(
    "Answer based on this context:\n{context}\n\nQuestion: {question}"
)

rag_chain = (
    {
        "context": retriever | (lambda docs: "\n\n".join(d.page_content for d in docs)),
        "question": RunnablePassthrough(),
    }
    | prompt
    | llm
    | StrOutputParser()
)

answer = rag_chain.invoke("How does fine-tuning work?")
print(answer)

Common Pitfalls

Pitfall Symptom Fix
Mixing embedding models Zero or random relevance scores Always use the same model for indexing and querying
Ignoring task prefixes Lower retrieval quality Add "query:" / "passage:" prefixes when model expects them
Chunk exceeds context window Embedding truncated silently Match chunk size to embedding model’s context limit
No reranking Many irrelevant results in top-k Add cross-encoder reranking — highest-ROI upgrade
Reranking too few candidates Reranker can’t fix what wasn’t retrieved Retrieve 20–50 candidates, rerank to 3–5
Float32 at scale Out of memory Apply int8 or binary quantization
Skipping evaluation No idea if changes help Track MRR, Recall@k, and nDCG@k

Conclusion

Retrieval quality is the foundation of every RAG system, and it rests on two pillars: embedding models for fast candidate retrieval and rerankers for precision refinement.

Key takeaways:

  1. Bi-encoders retrieve, cross-encoders rerank — this two-stage pattern is the production standard
  2. Reranking is the highest-ROI upgrade — retrieve top-20, rerank to top-5 with a cross-encoder
  3. Match your model to your constraints — API models for simplicity, open-source for privacy, fine-tuned for specialized domains
  4. Quantization makes scale affordable — binary embeddings retain 96% of quality at 32x memory reduction
  5. Hybrid search (dense + BM25) catches what embeddings miss — exact keywords, acronyms, codes
  6. Fine-tune when general models underperform — use hard-negative mining from your own retrieval pipeline
  7. Always evaluate quantitatively — use Recall@k, MRR, and nDCG@k to guide every decision

The complete pipelines above can be copy-pasted into your project. Start with a pre-trained bi-encoder, add reranking, and fine-tune only when evaluation metrics demand it.

For serving the LLM backbone, see Scaling LLM Serving for Enterprise Production. For running models locally, see Run LLM locally with Ollama. For observability, see Observability for Multi-Turn LLM Conversations. For guardrails, see Guardrails for LLM Applications with Giskard.

References

  • Muennighoff et al., MTEB: Massive Text Embedding Benchmark, 2022. arXiv:2210.07316
  • Khattab & Zaharia, ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT, SIGIR 2020. arXiv:2004.12832
  • Lee et al., NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models, ICLR 2025. arXiv:2405.17428
  • Shakir et al., Binary and Scalar Embedding Quantization for Significantly Faster & Cheaper Retrieval, HuggingFace Blog, 2024. Blog
  • Kusupati et al., Matryoshka Representation Learning, NeurIPS 2022. arXiv:2205.13147
  • Reimers & Gurevych, Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks, EMNLP 2019. arXiv:1908.10084
  • Nogueira & Cho, Passage Re-ranking with BERT, 2020. arXiv:1901.04085
  • MTEB Leaderboard, Massive Text Embedding Benchmark, HuggingFace, 2025. Leaderboard
  • Sentence Transformers Documentation, Cross Encoder Usage, 2025. Docs

Read More

  • Improve retrieval input quality with advanced chunking strategies — embedding performance depends heavily on chunk design.
  • Add graph-based retrieval with GraphRAG for multi-hop reasoning over entity relationships.
  • Evaluate your embedding and reranking pipeline with RAGAS and DeepEval to measure real-world retrieval gains.
  • Fine-tune your embeddings and rerankers on domain data using techniques from Fine-tuning RAG Components.