Embedding Models and Reranking for RAG

Selecting, fine-tuning, and combining embedding models with cross-encoder rerankers for production retrieval pipelines

Published: April 9, 2025

Keywords: RAG, embedding models, reranking, cross-encoder, bi-encoder, ColBERT, MTEB, Sentence Transformers, fine-tuning, Matryoshka, quantization, LlamaIndex, LangChain, retrieval, cosine similarity, dense retrieval, hybrid search

Introduction

The retrieval stage is the make-or-break component of any RAG pipeline. If the right chunks never get retrieved, the LLM has no chance of producing a correct answer — no matter how capable the model is.

At the heart of retrieval are two decisions: which embedding model converts your text to vectors, and whether a reranker re-scores the top results before they reach the LLM. These choices directly control recall (did you find the right documents?) and precision (did you avoid the wrong ones?).

This article covers everything you need to make these decisions for production RAG: how bi-encoders and cross-encoders work, how to choose from dozens of competing models, when and how to fine-tune, how to reduce costs with quantization, and how to wire a full retrieve-and-rerank pipeline in LlamaIndex and LangChain.

How Embedding Models Work

Bi-Encoders: The Workhorse of Dense Retrieval

A bi-encoder (also called a Sentence Transformer) encodes query and document independently into fixed-size vectors. At retrieval time, you compare vectors with cosine similarity or dot product.

graph LR
    A["Query"] --> B["Encoder"]
    C["Document"] --> D["Same Encoder"]
    B --> E["Query Vector<br/>[768 dims]"]
    D --> F["Doc Vector<br/>[768 dims]"]
    E --> G["Cosine<br/>Similarity"]
    F --> G
    G --> H["Score: 0.87"]

    style A fill:#4a90d9,color:#fff,stroke:#333
    style C fill:#e67e22,color:#fff,stroke:#333
    style B fill:#9b59b6,color:#fff,stroke:#333
    style D fill:#9b59b6,color:#fff,stroke:#333
    style E fill:#4a90d9,color:#fff,stroke:#333
    style F fill:#e67e22,color:#fff,stroke:#333
    style G fill:#27ae60,color:#fff,stroke:#333
    style H fill:#1abc9c,color:#fff,stroke:#333

Key property: Document vectors can be pre-computed and cached. At query time, only the query needs to be embedded — then you search millions of pre-computed vectors in milliseconds using approximate nearest neighbor (ANN) indices.

\text{similarity}(q, d) = \frac{\mathbf{q} \cdot \mathbf{d}}{||\mathbf{q}|| \cdot ||\mathbf{d}||}
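
In NumPy terms, scoring a query against pre-computed document vectors is just a normalized dot product. A toy sketch with made-up 4-dimensional vectors (real models output hundreds of dimensions):

```python
import numpy as np

# Pre-computed document vectors (built once, offline) — toy 4-dim values
doc_vectors = np.array([
    [0.1, 0.9, 0.2, 0.4],
    [0.8, 0.1, 0.5, 0.1],
    [0.2, 0.7, 0.3, 0.5],
])

# Query vector produced at query time by the same encoder
query = np.array([0.15, 0.85, 0.25, 0.45])

def cosine_scores(q: np.ndarray, docs: np.ndarray) -> np.ndarray:
    """similarity(q, d) = (q . d) / (||q|| * ||d||) against every document."""
    q_norm = q / np.linalg.norm(q)
    d_norm = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    return d_norm @ q_norm

scores = cosine_scores(query, doc_vectors)
best = int(np.argmax(scores))  # index of the most similar document
```

In production the argmax over millions of vectors is replaced by an ANN index, but the scoring function is the same.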

Cross-Encoders: The Precision Instrument

A cross-encoder takes a query-document pair as a single input and outputs a relevance score directly. The query and document tokens attend to each other through every transformer layer — enabling much richer interaction.

graph LR
    A["[CLS] Query [SEP] Document [SEP]"] --> B["Full Transformer<br/>(joint attention)"]
    B --> C["Relevance Score:<br/>8.61"]

    style A fill:#4a90d9,color:#fff,stroke:#333
    style B fill:#e74c3c,color:#fff,stroke:#333
    style C fill:#27ae60,color:#fff,stroke:#333

Key property: Cross-encoders are far more accurate than bi-encoders because they model fine-grained query-document interactions. But they cannot pre-compute document representations — every query requires a forward pass through the transformer for every candidate document.

Bi-Encoder vs Cross-Encoder

Aspect Bi-Encoder Cross-Encoder
Architecture Encode query & doc separately Encode query+doc jointly
Pre-compute docs Yes (offline indexing) No (per-query inference)
Speed Milliseconds over millions Seconds over hundreds
Accuracy Good Significantly better
Use case First-stage retrieval Second-stage reranking
Scalability Sub-linear per query (ANN index) O(k) forward passes per query (top-k)

The standard production pattern: bi-encoder retrieves top-k candidates, cross-encoder reranks them.

ColBERT: Late Interaction (Best of Both Worlds)

ColBERT (Khattab & Zaharia, 2020) introduces a middle ground: late interaction. It encodes query and document independently (like a bi-encoder) but retains per-token embeddings instead of pooling into a single vector. At scoring time, it computes fine-grained token-level similarity:

\text{score}(q, d) = \sum_{i=1}^{|q|} \max_{j=1}^{|d|} \mathbf{q}_i \cdot \mathbf{d}_j

Each query token finds its best-matching document token, and scores are summed. This captures detailed interactions without the cost of full cross-attention.
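
A toy NumPy sketch of the MaxSim operator above (made-up per-token embeddings, 4 dims instead of ColBERT's 128):

```python
import numpy as np

def maxsim_score(q_tokens: np.ndarray, d_tokens: np.ndarray) -> float:
    """ColBERT late interaction: each query token takes its best-matching
    document token's similarity, and the per-token maxima are summed."""
    sim = q_tokens @ d_tokens.T          # sim[i, j] = q_i . d_j, shape (|q|, |d|)
    return float(sim.max(axis=1).sum())  # max over doc tokens, sum over query tokens

# Toy per-token embeddings: 2 query tokens, 3 document tokens, 4 dims
q = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0]])
d = np.array([[0.9, 0.1, 0.0, 0.0],
              [0.0, 0.2, 0.8, 0.0],
              [0.1, 0.8, 0.0, 0.1]])

score = maxsim_score(q, d)  # query token 1 matches doc token 0, token 2 matches doc token 2
```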

Property Bi-Encoder ColBERT Cross-Encoder
Pre-compute docs Yes Yes (per-token) No
Interaction depth None (pooled) Token-level max-sim Full attention
Storage 1 vector/doc N vectors/doc N/A
Accuracy Good Very good Best
Latency Fastest Fast Slowest

Choosing an Embedding Model

The MTEB Benchmark

The Massive Text Embedding Benchmark (MTEB) evaluates embedding models across 56+ datasets spanning retrieval, classification, clustering, reranking, and more. It is the standard benchmark for comparing embedding models.

Key metrics to focus on for RAG:

  • Retrieval (nDCG@10) — the most relevant metric for RAG
  • Reranking — if you plan to use the model as a reranker
  • Classification — relevant for intent detection or routing

Current Model Landscape (2025)

Model Provider Dims Context Open Source MTEB Retrieval Notes
NV-Embed-v2 NVIDIA 4096 32K Yes 72.31 #1 MTEB overall (ICLR 2025)
text-embedding-3-large OpenAI 3072 8191 No ~62 Matryoshka support
text-embedding-3-small OpenAI 1536 8191 No ~55 Cost-effective
voyage-3-large Voyage AI 1024 32K No ~67 Code & multilingual
Cohere embed-v4 Cohere 1024 varies No ~65 Built-in binary quantization
jina-embeddings-v3 Jina AI 1024 8192 Yes ~66 Task-specific LoRA adapters
mxbai-embed-large-v1 Mixedbread 1024 512 Yes 54.39 Strong open-source
BGE-large-en-v1.5 BAAI 1024 512 Yes ~54 Widely used baseline
nomic-embed-text-v1.5 Nomic 768 8192 Yes ~53 MRL, runs on Ollama
GTE-large-en-v1.5 Alibaba 1024 8192 Yes ~57 Long context
all-MiniLM-L6-v2 SBERT 384 256 Yes ~42 Tiny, fast, good baseline

Selection Decision Matrix

graph TD
    A["Start"] --> B{"Data leaves<br/>your infra?"}
    B -->|"No (privacy)"| C{"GPU<br/>available?"}
    B -->|"Yes (API OK)"| D{"Budget<br/>priority?"}
    C -->|Yes| E["NV-Embed-v2<br/>or GTE-large"]
    C -->|No| F["nomic-embed-text<br/>(Ollama)"]
    D -->|Low cost| G["text-embedding-3-small"]
    D -->|Best quality| H["voyage-3-large<br/>or Cohere embed-v4"]

    style A fill:#4a90d9,color:#fff,stroke:#333
    style E fill:#27ae60,color:#fff,stroke:#333
    style F fill:#27ae60,color:#fff,stroke:#333
    style G fill:#27ae60,color:#fff,stroke:#333
    style H fill:#27ae60,color:#fff,stroke:#333

Embedding Best Practices

  1. Same model for indexing and queries — never mix embedding models; vectors from different models are incompatible
  2. Use task-specific prefixes when the model supports them:
# Many models expect a prefix for queries vs. documents
query = "query: What is attention in transformers?"
document = "passage: The attention mechanism allows the model to..."
  3. Normalize vectors for cosine similarity — most models output unit vectors, but verify
  4. Batch your embedding calls — embedding one-by-one is orders of magnitude slower
  5. Match context length to your chunks — a model with a 512-token context cannot embed 1000-token chunks

Embedding Models in Practice

LlamaIndex

# OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

embed_model = OpenAIEmbedding(model="text-embedding-3-small")

# Local via HuggingFace
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-large-en-v1.5",
    trust_remote_code=True,
)

# Local via Ollama (fully offline)
from llama_index.embeddings.ollama import OllamaEmbedding

embed_model = OllamaEmbedding(model_name="nomic-embed-text")

# Use in pipeline
from llama_index.core import VectorStoreIndex, Settings

Settings.embed_model = embed_model
index = VectorStoreIndex.from_documents(documents)

LangChain

# OpenAI
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Local via HuggingFace
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="BAAI/bge-large-en-v1.5",
    model_kwargs={"device": "cuda"},
    encode_kwargs={"normalize_embeddings": True},
)

# Local via Ollama
from langchain_ollama import OllamaEmbeddings

embeddings = OllamaEmbeddings(model="nomic-embed-text")

# Use in pipeline
from langchain_community.vectorstores import FAISS

vectorstore = FAISS.from_documents(chunks, embeddings)

Reranking: The Highest-ROI Upgrade

Adding a reranker is consistently the single biggest quality improvement you can make to an existing RAG pipeline — it often lifts retrieval relevance by 5–15% for only a few lines of code.

How Retrieve-and-Rerank Works

graph LR
    A["Query"] --> B["Bi-Encoder<br/>Retrieval"]
    B --> C["Top-20<br/>Candidates"]
    C --> D["Cross-Encoder<br/>Reranker"]
    D --> E["Reranked<br/>Top-5"]
    E --> F["LLM"]
    F --> G["Answer"]

    style A fill:#4a90d9,color:#fff,stroke:#333
    style B fill:#9b59b6,color:#fff,stroke:#333
    style C fill:#f5a623,color:#fff,stroke:#333
    style D fill:#e74c3c,color:#fff,stroke:#333
    style E fill:#27ae60,color:#fff,stroke:#333
    style F fill:#C8CFEA,color:#fff,stroke:#333
    style G fill:#1abc9c,color:#fff,stroke:#333

The pipeline:

  1. Retrieve top-N candidates with a fast bi-encoder (N = 20–50)
  2. Rerank all N candidates with a cross-encoder
  3. Return top-k reranked results to the LLM (k = 3–5)

The retrieve-wider, rerank-narrower pattern ensures you don’t miss relevant documents (high recall from step 1) while surfacing the most relevant ones (high precision from step 2).
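
Stripped of any framework, the pattern is just two ranked passes. A toy sketch where the hypothetical `dense_score` and `cross_score` functions stand in for the bi-encoder and cross-encoder:

```python
# Toy two-stage pipeline; dense_score / cross_score are cheap stand-ins
# for the bi-encoder and cross-encoder (hypothetical scoring functions).
def dense_score(query: str, doc: str) -> float:
    # Stand-in for vector similarity: fraction of query words in the doc
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def cross_score(query: str, doc: str) -> float:
    # Stand-in for the cross-encoder's joint relevance score: Jaccard overlap
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q | d), 1)

def retrieve_and_rerank(query, corpus, retrieve_n=20, final_k=5):
    # Stage 1: wide, cheap retrieval (high recall)
    candidates = sorted(corpus, key=lambda doc: dense_score(query, doc),
                        reverse=True)[:retrieve_n]
    # Stage 2: narrow, expensive reranking over the candidates only (high precision)
    return sorted(candidates, key=lambda doc: cross_score(query, doc),
                  reverse=True)[:final_k]
```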

Cross-Encoder Reranker Models

Model Provider Accuracy Speed Notes
rerank-v3.5 Cohere Best (API) Fast Production API, multilingual
jina-reranker-v2 Jina AI Very good Fast Open weights, multilingual
bge-reranker-v2-m3 BAAI Very good Medium Open source, multilingual
ms-marco-MiniLM-L6-v2 SBERT Good Fastest Tiny, great for prototyping
ms-marco-MiniLM-L12-v2 SBERT Better Fast Good quality/speed balance
NV-RerankQA-Mistral-4B NVIDIA Excellent Slower LLM-based reranker

Reranking with LlamaIndex

from llama_index.core import VectorStoreIndex, Settings
from llama_index.postprocessor.cohere_rerank import CohereRerank

# Retrieve more candidates, rerank to top-5
reranker = CohereRerank(
    api_key="YOUR_COHERE_KEY",
    top_n=5,
    model="rerank-v3.5",
)

query_engine = index.as_query_engine(
    similarity_top_k=20,          # Retrieve 20
    node_postprocessors=[reranker],  # Rerank to 5
)

response = query_engine.query("What are the benefits of RLHF?")

With a local cross-encoder:

from llama_index.core.postprocessor import SentenceTransformerRerank

reranker = SentenceTransformerRerank(
    model="cross-encoder/ms-marco-MiniLM-L6-v2",
    top_n=5,
)

query_engine = index.as_query_engine(
    similarity_top_k=20,
    node_postprocessors=[reranker],
)

Reranking with LangChain

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder

# Load cross-encoder
cross_encoder = HuggingFaceCrossEncoder(
    model_name="cross-encoder/ms-marco-MiniLM-L6-v2"
)
reranker = CrossEncoderReranker(model=cross_encoder, top_n=5)

# Wrap the base retriever
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 20})

reranking_retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=base_retriever,
)

results = reranking_retriever.invoke("What are the benefits of RLHF?")

With Cohere reranker:

from langchain_cohere import CohereRerank
from langchain.retrievers import ContextualCompressionRetriever

reranker = CohereRerank(
    model="rerank-v3.5",
    top_n=5,
)

reranking_retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 20}),
)

Reranking with Sentence Transformers Directly

from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2")

query = "What is the attention mechanism?"
passages = [
    "The attention mechanism allows the model to focus on relevant parts...",
    "Berlin is the capital of Germany...",
    "Transformers use self-attention to capture long-range dependencies...",
]

# Score each query-passage pair
scores = model.predict([(query, p) for p in passages])
# => array([ 8.92, -4.32,  7.61], dtype=float32)

# Or use the built-in rank method
ranks = model.rank(query, passages)
for rank in ranks:
    print(f"{rank['score']:.2f}\t{passages[rank['corpus_id']]}")

Hybrid Search: Dense + Sparse

Dense retrieval (embeddings) excels at semantic matching but can miss exact keyword matches — acronyms, product names, error codes. Sparse retrieval (BM25) handles these well. Combining both via Reciprocal Rank Fusion (RRF) gives the best of both worlds.
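
RRF itself is only a few lines: each document earns 1/(k + rank) from every ranking it appears in, and the summed scores decide the fused order. A minimal sketch (k = 60, the constant from the original RRF paper):

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked result lists: each doc scores sum(1 / (k + rank)),
    with rank starting at 1. Docs ranked well in several lists rise to the top."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["doc_a", "doc_b", "doc_c"]   # dense (embedding) ranking
sparse = ["doc_c", "doc_a", "doc_d"]  # sparse (BM25) ranking
fused = reciprocal_rank_fusion([dense, sparse])
```

doc_a appears near the top of both lists, so it wins even though neither retriever ranked it based on the other's evidence.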

graph TB
    A["Query"] --> B["Dense Retrieval<br/>(Bi-Encoder)"]
    A --> C["Sparse Retrieval<br/>(BM25)"]
    B --> D["Dense Top-20"]
    C --> E["Sparse Top-20"]
    D --> F["Reciprocal Rank<br/>Fusion"]
    E --> F
    F --> G["Fused Top-20"]
    G --> H["Cross-Encoder<br/>Reranker"]
    H --> I["Final Top-5"]

    style A fill:#4a90d9,color:#fff,stroke:#333
    style B fill:#9b59b6,color:#fff,stroke:#333
    style C fill:#f5a623,color:#fff,stroke:#333
    style F fill:#e74c3c,color:#fff,stroke:#333
    style H fill:#27ae60,color:#fff,stroke:#333
    style I fill:#1abc9c,color:#fff,stroke:#333

LangChain: Hybrid + Reranking Pipeline

from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever, ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder

# Stage 1: Hybrid retrieval
bm25_retriever = BM25Retriever.from_documents(chunks, k=20)
dense_retriever = vectorstore.as_retriever(search_kwargs={"k": 20})

hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, dense_retriever],
    weights=[0.3, 0.7],  # Dense-weighted
)

# Stage 2: Reranking
cross_encoder = HuggingFaceCrossEncoder(
    model_name="cross-encoder/ms-marco-MiniLM-L6-v2"
)
reranker = CrossEncoderReranker(model=cross_encoder, top_n=5)

final_retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=hybrid_retriever,
)

results = final_retriever.invoke("What is RLHF?")

LlamaIndex: Hybrid + Reranking Pipeline

from llama_index.core.retrievers import QueryFusionRetriever
from llama_index.retrievers.bm25 import BM25Retriever
from llama_index.core.postprocessor import SentenceTransformerRerank

# Stage 1: Hybrid retrieval
bm25_retriever = BM25Retriever.from_defaults(
    nodes=nodes, similarity_top_k=20
)
vector_retriever = index.as_retriever(similarity_top_k=20)

hybrid_retriever = QueryFusionRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    num_queries=1,
    use_async=False,
    similarity_top_k=20,
)

# Stage 2: Reranking
reranker = SentenceTransformerRerank(
    model="cross-encoder/ms-marco-MiniLM-L6-v2",
    top_n=5,
)

# Use in query engine
from llama_index.core.query_engine import RetrieverQueryEngine

query_engine = RetrieverQueryEngine.from_args(
    retriever=hybrid_retriever,
    node_postprocessors=[reranker],
)

response = query_engine.query("What is RLHF?")

Fine-Tuning Embedding Models

Off-the-shelf embedding models work well for general domains. But for specialized corpora — medical records, legal documents, internal codebases — fine-tuning can improve retrieval by 5–20%.

When to Fine-Tune

Scenario Fine-Tune? Why
General Q&A over docs No Pre-trained models are good enough
Domain-specific jargon Yes Models may not understand domain terms
Internal company data Maybe Start without, fine-tune if recall is poor
Non-English language Maybe Check MTEB for your language first
Code/SQL retrieval Yes Code semantics differ from natural language

Fine-Tuning with Sentence Transformers

The standard approach: contrastive learning with hard negatives — training on (query, positive, negative) triplets.

from sentence_transformers import SentenceTransformer, losses
from sentence_transformers.training_args import SentenceTransformerTrainingArguments
from sentence_transformers.trainer import SentenceTransformerTrainer
from datasets import Dataset

# 1. Prepare training data
# Each example: {"anchor": ..., "positive": ..., "negative": ...}
train_data = Dataset.from_dict({
    "anchor": ["What causes diabetes?", ...],
    "positive": ["Diabetes is caused by insulin resistance...", ...],
    "negative": ["Berlin is the capital of Germany...", ...],
})

# 2. Load base model
model = SentenceTransformer("BAAI/bge-base-en-v1.5")

# 3. Define loss
loss = losses.MultipleNegativesRankingLoss(model)

# 4. Training arguments
args = SentenceTransformerTrainingArguments(
    output_dir="./finetuned-embeddings",
    num_train_epochs=3,
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    warmup_ratio=0.1,
    fp16=True,
)

# 5. Train
trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_data,
    loss=loss,
)

trainer.train()
model.save_pretrained("./finetuned-embeddings")

Generating Training Data from Your RAG Pipeline

You don’t always need labeled datasets. Use your existing pipeline to mine hard negatives — documents the retriever returned but that weren’t relevant:

def mine_hard_negatives(
    queries: list[str],
    relevant_docs: list[str],
    retriever,
    top_k: int = 20,
) -> list[dict]:
    """Generate training triplets by mining hard negatives from retrieval."""
    triplets = []
    for query, positive in zip(queries, relevant_docs):
        results = retriever.invoke(query)
        for doc in results[:top_k]:
            if doc.page_content != positive:
                triplets.append({
                    "anchor": query,
                    "positive": positive,
                    "negative": doc.page_content,
                })
                break  # One hard negative per query
    return triplets

Fine-Tuning Cross-Encoder Rerankers

Cross-encoders can also be fine-tuned for your domain:

from sentence_transformers.cross_encoder import (
    CrossEncoder,
    CrossEncoderTrainer,
    CrossEncoderTrainingArguments,
)
from sentence_transformers.cross_encoder.losses import BinaryCrossEntropyLoss

# Load base model
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2")

# Training data: (query, passage) pairs with a 0/1 relevance label
# Use the same hard negative mining approach

loss = BinaryCrossEntropyLoss(model)

args = CrossEncoderTrainingArguments(
    output_dir="./finetuned-reranker",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    fp16=True,
)

trainer = CrossEncoderTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)

trainer.train()

Scaling Embeddings for Production

Matryoshka Representation Learning (MRL)

MRL trains embedding models so that the first N dimensions already form a good embedding. You can truncate from 1024 dims to 256 dims with minimal quality loss — reducing storage and compute by 4x.

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)

# Full 768-dim embedding
full_emb = model.encode("What is attention?")  # shape: (768,)

# Truncated to 256 dims — still high quality with MRL.
# Renormalize after truncating so cosine similarity stays well-behaved.
truncated_emb = full_emb[:256] / np.linalg.norm(full_emb[:256])  # shape: (256,)

OpenAI’s text-embedding-3-large supports this natively via the dimensions parameter:

from openai import OpenAI

client = OpenAI()

# Full 3072 dims
response = client.embeddings.create(
    model="text-embedding-3-large",
    input="What is attention?",
    dimensions=3072,
)

# Truncated to 256 dims (93.1% performance retention at 12x compression)
response = client.embeddings.create(
    model="text-embedding-3-large",
    input="What is attention?",
    dimensions=256,
)

Embedding Quantization

Reduce embedding storage and speed up search by quantizing from float32 to smaller types:

Precision Size per dim Memory (250M vectors, 1024d) Speed Quality Retained
float32 4 bytes 953 GB 1x 100%
int8 (scalar) 1 byte 238 GB ~4x ~99.3%
binary (1-bit) 0.125 bytes 30 GB ~25x ~96%

The results are remarkable: binary quantization retains 96% of performance at 32x memory reduction and 25x speed improvement.
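
Mechanically, binary quantization keeps only the sign of each dimension, packs 8 dimensions per byte, and compares vectors by Hamming distance. A NumPy sketch with random stand-in embeddings:

```python
import numpy as np

def to_binary(embeddings: np.ndarray) -> np.ndarray:
    """Binary quantization: keep only the sign of each dimension,
    packed 8 dims per byte (1024 floats -> 128 bytes)."""
    bits = (embeddings > 0).astype(np.uint8)
    return np.packbits(bits, axis=-1)

def hamming_distance(a: np.ndarray, b: np.ndarray) -> int:
    """Number of differing bits; lower = more similar."""
    return int(np.unpackbits(a ^ b).sum())

rng = np.random.default_rng(0)
docs = rng.standard_normal((1000, 1024)).astype(np.float32)   # stand-in corpus
query = rng.standard_normal(1024).astype(np.float32)

binary_docs = to_binary(docs)     # shape (1000, 128), uint8 — 32x smaller
binary_query = to_binary(query)   # shape (128,)
dists = np.array([hamming_distance(binary_query, d) for d in binary_docs])
top_200 = np.argsort(dists)[:200] # candidates for a higher-precision rescoring stage
```

Hamming distance over packed bits is what makes the ~25x speedup possible: it reduces to XOR plus popcount, which vector databases implement in hardware-friendly form.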

Binary Quantization with Sentence Transformers

from sentence_transformers import SentenceTransformer
from sentence_transformers.quantization import quantize_embeddings

model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")

# Full float32 embeddings
embeddings = model.encode(["Document text here..."])
# shape: (1, 1024), dtype: float32, size: 4096 bytes

# Binary quantization — 32x smaller
binary_embeddings = quantize_embeddings(embeddings, precision="binary")
# shape: (1, 128), dtype: int8, size: 128 bytes

# Scalar int8 quantization — 4x smaller
int8_embeddings = quantize_embeddings(
    embeddings,
    precision="int8",
    calibration_embeddings=calibration_set,  # Recommended: a representative sample of your corpus embeddings
)
# shape: (1, 1024), dtype: int8, size: 1024 bytes

Three-Stage Retrieval Pipeline

For maximum scalability, combine binary search, scalar rescoring, and cross-encoder reranking:

graph LR
    A["Query"] --> B["Binary Search<br/>(Hamming dist)<br/>→ Top-200"]
    B --> C["Scalar Rescore<br/>(int8 × float32)<br/>→ Top-20"]
    C --> D["Cross-Encoder<br/>Rerank<br/>→ Top-5"]
    D --> E["LLM"]

    style A fill:#4a90d9,color:#fff,stroke:#333
    style B fill:#f5a623,color:#fff,stroke:#333
    style C fill:#9b59b6,color:#fff,stroke:#333
    style D fill:#e74c3c,color:#fff,stroke:#333
    style E fill:#27ae60,color:#fff,stroke:#333

This pipeline uses:

  • 5 GB memory for the binary index (vs. 200 GB for float32)
  • 50 GB disk for the int8 index
  • Near-lossless quality with massive infrastructure savings
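
The middle stage can be sketched in NumPy: quantize documents to int8 once, then rescore the binary-stage candidates with int8 × float32 dot products (random stand-in embeddings; in a real pipeline `candidate_ids` would come from the binary Hamming search):

```python
import numpy as np

def int8_quantize(embs: np.ndarray):
    """Scalar int8 quantization: map each dimension's range onto [-127, 127]."""
    scale = np.abs(embs).max(axis=0) / 127.0 + 1e-12  # per-dimension scale
    return np.round(embs / scale).astype(np.int8), scale

def rescore_int8(query_f32, int8_docs, scale, candidate_ids, top_k=20):
    """Stage 2: rescore binary-search candidates via dequantized dot products."""
    approx = (int8_docs[candidate_ids].astype(np.float32) * scale) @ query_f32
    order = np.argsort(-approx)[:top_k]
    return candidate_ids[order]

rng = np.random.default_rng(0)
docs = rng.standard_normal((500, 256)).astype(np.float32)
query = rng.standard_normal(256).astype(np.float32)

int8_docs, scale = int8_quantize(docs)          # built once, stored on disk
candidates = np.arange(500)                     # ids surviving the binary stage
top_20 = rescore_int8(query, int8_docs, scale, candidates, top_k=20)
```

Only the compact int8 matrix is touched at query time; the expensive cross-encoder then sees just the top-20.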

Memory & Cost at Scale

Vectors float32 (1024d) int8 binary binary + int8 rescore
1M 3.8 GB 953 MB 119 MB 119 MB mem + 953 MB disk
10M 38 GB 9.5 GB 1.2 GB 1.2 GB mem + 9.5 GB disk
100M 381 GB 95 GB 12 GB 12 GB mem + 95 GB disk
1B 3.8 TB 953 GB 119 GB 119 GB mem + 953 GB disk
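
The table's figures fall directly out of n_vectors × dims × bits / 8 bytes. A quick check:

```python
def index_size_gb(n_vectors: int, dims: int = 1024, bits_per_dim: int = 32) -> float:
    """Index size in GiB for a given precision (32 = float32, 8 = int8, 1 = binary)."""
    total_bytes = n_vectors * dims * bits_per_dim / 8
    return total_bytes / (1024 ** 3)

print(round(index_size_gb(100_000_000)))                     # float32, 100M vectors: 381
print(round(index_size_gb(1_000_000_000, bits_per_dim=1)))   # binary, 1B vectors: 119
```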

Evaluating Retrieval Quality

Key Metrics

Metric What it Measures Formula
Recall@k Were relevant docs in top-k? \frac{\text{relevant retrieved}}{\text{total relevant}}
nDCG@k Were relevant docs ranked high? Takes rank position into account
MRR Where is the first relevant doc? \frac{1}{\text{rank of first relevant}}
Precision@k What fraction of top-k is relevant? \frac{\text{relevant in top-k}}{k}
Hit Rate Is at least one relevant doc in top-k? Binary: 0 or 1
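
For binary relevance labels, each of these metrics is a few lines of Python — a reference sketch you can run against any retriever's ranked output:

```python
import math

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant docs that appear in the top-k."""
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k that is relevant."""
    return len(set(retrieved[:k]) & relevant) / k

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant doc (0 if none retrieved)."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Binary-relevance nDCG: hits are discounted by log2(rank + 1),
    normalized by the best possible ordering."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc_id in enumerate(retrieved[:k], start=1)
              if doc_id in relevant)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0

retrieved = ["doc_7", "doc_42", "doc_3"]
relevant = {"doc_42", "doc_15"}
```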

Quick Evaluation with LlamaIndex

from llama_index.core.evaluation import RetrieverEvaluator

# Prepare evaluation dataset: {query: [relevant_doc_ids]}
qa_pairs = [
    {"query": "What is attention?", "expected_ids": ["doc_42", "doc_15"]},
    # ...
]

evaluator = RetrieverEvaluator.from_metric_names(
    ["mrr", "hit_rate"], retriever=index.as_retriever(similarity_top_k=5)
)

results = []
for pair in qa_pairs:
    result = await evaluator.aevaluate(
        query=pair["query"],
        expected_ids=pair["expected_ids"],
    )
    results.append(result)

# Aggregate
avg_mrr = sum(r.metric_vals_dict["mrr"] for r in results) / len(results)
avg_hit = sum(r.metric_vals_dict["hit_rate"] for r in results) / len(results)
print(f"MRR: {avg_mrr:.3f}, Hit Rate: {avg_hit:.3f}")

RAGAS for End-to-End RAG Evaluation

from ragas import evaluate
from ragas.metrics import context_recall, context_precision, faithfulness

result = evaluate(
    dataset=eval_dataset,  # Contains queries, ground truth, contexts, answers
    metrics=[context_recall, context_precision, faithfulness],
)

print(result)
# {'context_recall': 0.87, 'context_precision': 0.72, 'faithfulness': 0.93}

Comparing Retrieval Configurations

Always A/B test your changes. Here’s a pattern for comparing retrieval configs:

configs = {
    "baseline": {"top_k": 5, "reranker": None},
    "reranked": {"top_k": 20, "reranker": "cross-encoder/ms-marco-MiniLM-L6-v2"},
    "hybrid": {"top_k": 20, "reranker": "cross-encoder/ms-marco-MiniLM-L6-v2", "hybrid": True},
}

for name, config in configs.items():
    # Build retriever with config
    retriever = build_retriever(**config)

    # Evaluate on test set
    metrics = evaluate_retriever(retriever, test_queries)
    print(f"{name:15s} | MRR: {metrics['mrr']:.3f} | Recall@5: {metrics['recall']:.3f}")

Complete Production Pipeline

Here’s a complete retrieve-and-rerank pipeline combining all the techniques:

LlamaIndex

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama
from llama_index.core.postprocessor import SentenceTransformerRerank

# Configure models
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-base-en-v1.5")
Settings.llm = Ollama(model="llama3.2", request_timeout=120)

# Load & chunk
documents = SimpleDirectoryReader("./data").load_data()
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=50)

# Build index
index = VectorStoreIndex.from_documents(
    documents, transformations=[splitter], show_progress=True
)

# Reranker
reranker = SentenceTransformerRerank(
    model="cross-encoder/ms-marco-MiniLM-L6-v2",
    top_n=5,
)

# Query engine with reranking
query_engine = index.as_query_engine(
    similarity_top_k=20,
    node_postprocessors=[reranker],
)

response = query_engine.query("How does fine-tuning work?")
print(response)

LangChain

from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.cross_encoders import HuggingFaceCrossEncoder
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain.retrievers import ContextualCompressionRetriever
from langchain_ollama import ChatOllama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# Load & chunk
loader = DirectoryLoader("./data", glob="**/*.pdf", loader_cls=PyPDFLoader)
documents = loader.load()
chunks = RecursiveCharacterTextSplitter(
    chunk_size=512, chunk_overlap=50
).split_documents(documents)

# Embed & index
embeddings = HuggingFaceEmbeddings(
    model_name="BAAI/bge-base-en-v1.5",
    encode_kwargs={"normalize_embeddings": True},
)
vectorstore = FAISS.from_documents(chunks, embeddings)

# Retrieve + Rerank
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 20})
reranker = CrossEncoderReranker(
    model=HuggingFaceCrossEncoder(
        model_name="cross-encoder/ms-marco-MiniLM-L6-v2"
    ),
    top_n=5,
)
retriever = ContextualCompressionRetriever(
    base_compressor=reranker, base_retriever=base_retriever
)

# Generate
llm = ChatOllama(model="llama3.2")
prompt = ChatPromptTemplate.from_template(
    "Answer based on this context:\n{context}\n\nQuestion: {question}"
)

rag_chain = (
    {
        "context": retriever | (lambda docs: "\n\n".join(d.page_content for d in docs)),
        "question": RunnablePassthrough(),
    }
    | prompt
    | llm
    | StrOutputParser()
)

answer = rag_chain.invoke("How does fine-tuning work?")
print(answer)

Common Pitfalls

Pitfall Symptom Fix
Mixing embedding models Zero or random relevance scores Always use the same model for indexing and querying
Ignoring task prefixes Lower retrieval quality Add "query:" / "passage:" prefixes when model expects them
Chunk exceeds context window Embedding truncated silently Match chunk size to embedding model’s context limit
No reranking Many irrelevant results in top-k Add cross-encoder reranking — highest-ROI upgrade
Reranking too few candidates Reranker can’t fix what wasn’t retrieved Retrieve 20–50 candidates, rerank to 3–5
Float32 at scale Out of memory Apply int8 or binary quantization
Skipping evaluation No idea if changes help Track MRR, Recall@k, and nDCG@k

Conclusion

Retrieval quality is the foundation of every RAG system, and it rests on two pillars: embedding models for fast candidate retrieval and rerankers for precision refinement.

Key takeaways:

  1. Bi-encoders retrieve, cross-encoders rerank — this two-stage pattern is the production standard
  2. Reranking is the highest-ROI upgrade — retrieve top-20, rerank to top-5 with a cross-encoder
  3. Match your model to your constraints — API models for simplicity, open-source for privacy, fine-tuned for specialized domains
  4. Quantization makes scale affordable — binary embeddings retain 96% of quality at 32x memory reduction
  5. Hybrid search (dense + BM25) catches what embeddings miss — exact keywords, acronyms, codes
  6. Fine-tune when general models underperform — use hard-negative mining from your own retrieval pipeline
  7. Always evaluate quantitatively — use Recall@k, MRR, and nDCG@k to guide every decision

The complete pipelines above can be copy-pasted into your project. Start with a pre-trained bi-encoder, add reranking, and fine-tune only when evaluation metrics demand it.

For serving the LLM backbone, see Scaling LLM Serving for Enterprise Production. For running models locally, see Run LLM locally with Ollama. For observability, see Observability for Multi-Turn LLM Conversations. For guardrails, see Guardrails for LLM Applications with Giskard.

References

  • Muennighoff et al., MTEB: Massive Text Embedding Benchmark, 2022. arXiv:2210.07316
  • Khattab & Zaharia, ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT, SIGIR 2020. arXiv:2004.12832
  • Lee et al., NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models, ICLR 2025. arXiv:2405.17428
  • Shakir et al., Binary and Scalar Embedding Quantization for Significantly Faster & Cheaper Retrieval, HuggingFace Blog, 2024. Blog
  • Kusupati et al., Matryoshka Representation Learning, NeurIPS 2022. arXiv:2205.13147
  • Reimers & Gurevych, Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks, EMNLP 2019. arXiv:1908.10084
  • Nogueira & Cho, Passage Re-ranking with BERT, 2020. arXiv:1901.04085
  • MTEB Leaderboard, Massive Text Embedding Benchmark, HuggingFace, 2025. Leaderboard
  • Sentence Transformers Documentation, Cross Encoder Usage, 2025. Docs

Read More

  • Improve retrieval input quality with advanced chunking strategies — embedding performance depends heavily on chunk design.
  • Add graph-based retrieval with GraphRAG for multi-hop reasoning over entity relationships.
  • Evaluate your embedding and reranking pipeline with RAGAS and DeepEval to measure real-world retrieval gains.
  • Fine-tune your embeddings and rerankers on domain data using techniques from Fine-tuning RAG Components.