!pip install -q langchain langchain-community langchain-openai langchain-experimental langchain-text-splitters faiss-cpu llama-index llama-index-core llama-index-embeddings-openai sentence-transformers tiktoken
Advanced Chunking Strategies for RAG
Comparing fixed-size, recursive, document-aware, semantic, and parent-child chunking methods for optimal retrieval quality
Table of Contents
1. Setup & Installation
2. Sample Document for Testing
3. Fixed-Size (Character / Token) Splitting
4. Recursive Character Splitting
5. Document-Aware (Structural) Splitting
6. Semantic Chunking
7. Parent-Child (Hierarchical) Chunking
8. Comparing Chunking Strategies
1. Setup & Installation
import os
# os.environ["OPENAI_API_KEY"] = "your-api-key-here"  # Uncomment and set
2. Sample Document for Testing
We create a sample markdown document to demonstrate different chunking strategies.
sample_document = """
# Attention Mechanisms in Transformers
## Introduction
The Transformer architecture revolutionized natural language processing when it was introduced by Vaswani et al. in 2017. At its core lies the self-attention mechanism, which allows the model to weigh the importance of different parts of the input when producing each part of the output.
Unlike recurrent neural networks (RNNs), which process sequences one token at a time, Transformers process all tokens in parallel. This parallelism is enabled by the attention mechanism, which computes a weighted sum of all input representations for each output position.
## Self-Attention
Self-attention, also known as intra-attention, computes attention weights between all pairs of positions in a single sequence. For each position, it produces a query (Q), key (K), and value (V) vector by multiplying the input embedding with learned weight matrices.
The attention score between position i and position j is computed as the dot product of Q_i and K_j, scaled by the square root of the key dimension. These scores are then passed through a softmax function to produce attention weights, which are used to compute a weighted sum of the value vectors.
## Multi-Head Attention
Instead of performing a single attention function, Multi-Head Attention runs multiple attention heads in parallel, each with different learned projections. This allows the model to jointly attend to information from different representation subspaces at different positions.
Each head operates on a lower-dimensional projection of the queries, keys, and values. The outputs from all heads are concatenated and linearly projected to produce the final output. Typical configurations use 8 or 16 attention heads.
## Cross-Attention
Cross-attention is used in encoder-decoder architectures where the queries come from the decoder and the keys and values come from the encoder. This allows the decoder to attend to all positions in the input sequence when generating each output token.
This mechanism is fundamental to tasks like machine translation, where the decoder needs to selectively focus on different parts of the source sentence when generating each word of the translation.
"""
print(f"Document length: {len(sample_document)} characters")
3. Fixed-Size (Character / Token) Splitting
The simplest approach: split text into chunks of exactly N characters or tokens, with optional overlap.
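The mechanics are easy to see in plain Python before reaching for a library. The sketch below (the function name and parameters are our own, not a library API) slides a fixed character window with overlap:

```python
def fixed_size_chunks(text: str, chunk_size: int = 500, overlap: int = 100):
    """Slide a window of `chunk_size` characters, stepping forward by
    `chunk_size - overlap` so consecutive chunks share `overlap` characters."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = fixed_size_chunks("abcdefghij", chunk_size=4, overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

Note the ragged final chunk: real splitters differ mainly in how they handle that tail and in counting tokens rather than characters.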
from langchain.text_splitter import CharacterTextSplitter, TokenTextSplitter
# Character-based splitting
char_splitter = CharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=100,
    separator="",  # Split on any character boundary
)
char_chunks = char_splitter.split_text(sample_document)
print(f"Character splitting: {len(char_chunks)} chunks")
for i, chunk in enumerate(char_chunks):
    print(f"  Chunk {i}: {len(chunk)} chars → {chunk[:80]}...")
# Token-based splitting (more precise)
token_splitter = TokenTextSplitter(
    chunk_size=256,
    chunk_overlap=50,
    encoding_name="cl100k_base",  # GPT-4 tokenizer
)
token_chunks = token_splitter.split_text(sample_document)
print(f"Token splitting: {len(token_chunks)} chunks")
for i, chunk in enumerate(token_chunks):
    print(f"  Chunk {i}: {len(chunk)} chars → {chunk[:80]}...")
# LlamaIndex equivalent
from llama_index.core.node_parser import TokenTextSplitter as LITokenTextSplitter
from llama_index.core import Document
li_splitter = LITokenTextSplitter(
    chunk_size=256,
    chunk_overlap=50,
)
li_docs = [Document(text=sample_document)]
nodes = li_splitter.get_nodes_from_documents(li_docs)
print(f"LlamaIndex token splitting: {len(nodes)} nodes")
for i, node in enumerate(nodes):
    print(f"  Node {i}: {len(node.text)} chars")
4. Recursive Character Splitting
The most popular method: splits using an ordered list of separators, trying the largest structural boundaries first.
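The idea behind the separator hierarchy can be sketched in a few lines of plain Python (a simplified illustration, not LangChain's actual implementation; it omits overlap handling): try the coarsest separator first, merge pieces up to the size limit, and recurse with finer separators on anything still too big.

```python
def recursive_split(text, separators, chunk_size=500):
    """Simplified recursive splitter: split on separators[0], merge pieces
    up to chunk_size, and recurse with finer separators on oversized pieces."""
    if len(text) <= chunk_size or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    pieces = text.split(sep) if sep else list(text)
    chunks, current = [], ""
    for piece in pieces:
        candidate = (current + sep + piece) if current else piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            if len(piece) <= chunk_size:
                current = piece
            else:
                current = ""
                # Piece is still too big: recurse with the finer separators
                chunks.extend(recursive_split(piece, rest, chunk_size))
    if current:
        chunks.append(current)
    return chunks

print(recursive_split("aa bb cc", ["\n\n", "\n", " ", ""], chunk_size=5))
# ['aa bb', 'cc']
```

Paragraph breaks are tried before line breaks, then sentences, then words, so chunks tend to end on the largest natural boundary that fits.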
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=100,
    separators=["\n\n", "\n", ".", "?", "!", " ", ""],
    length_function=len,
)
recursive_chunks = splitter.split_text(sample_document)
print(f"Recursive splitting: {len(recursive_chunks)} chunks\n")
for i, chunk in enumerate(recursive_chunks):
    print(f"--- Chunk {i} ({len(chunk)} chars) ---")
    print(chunk[:200])
    print()
# LlamaIndex SentenceSplitter (equivalent to recursive splitting)
from llama_index.core.node_parser import SentenceSplitter
sentence_splitter = SentenceSplitter(
    chunk_size=512,
    chunk_overlap=50,
)
sentence_nodes = sentence_splitter.get_nodes_from_documents(li_docs)
print(f"LlamaIndex SentenceSplitter: {len(sentence_nodes)} nodes")
for i, node in enumerate(sentence_nodes):
    print(f"  Node {i}: {len(node.text)} chars → {node.text[:80]}...")
5. Document-Aware (Structural) Splitting
Leverages document structure (markdown headers, HTML tags) to create chunks aligned with the author's organization.
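The core trick is that headings both mark boundaries and become metadata for everything beneath them. A toy version (our own sketch; real splitters also handle code fences, setext headings, and header nesting rules) makes this visible:

```python
import re

def split_by_headers(markdown: str):
    """Toy structural splitter: start a new chunk at every '#'-style heading
    and carry the accumulated heading path along as metadata."""
    chunks, current, meta = [], [], {}
    for line in markdown.splitlines():
        m = re.match(r"^(#{1,3})\s+(.*)", line)
        if m:
            if current:  # Flush the text collected under the previous headings
                chunks.append({"metadata": dict(meta),
                               "content": "\n".join(current).strip()})
                current = []
            meta[f"Header {len(m.group(1))}"] = m.group(2)
        else:
            current.append(line)
    if current:
        chunks.append({"metadata": dict(meta), "content": "\n".join(current).strip()})
    return chunks

for c in split_by_headers("# Title\nintro text\n## Section\nbody text"):
    print(c["metadata"], "->", c["content"])
```

At query time the metadata lets you filter or display the heading path ("Title > Section") alongside each retrieved chunk.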
from langchain.text_splitter import MarkdownHeaderTextSplitter
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]
md_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)
md_chunks = md_splitter.split_text(sample_document)
print(f"Markdown splitting: {len(md_chunks)} chunks\n")
for i, chunk in enumerate(md_chunks):
    print(f"--- Chunk {i} ---")
    print(f"Metadata: {chunk.metadata}")
    print(f"Content: {chunk.page_content[:150]}...")
    print()
# Two-stage: structural + recursive for consistent sizes
from langchain.text_splitter import (
    MarkdownHeaderTextSplitter,
    RecursiveCharacterTextSplitter,
)
# Stage 1: Split by structure
structural_chunks = md_splitter.split_text(sample_document)
# Stage 2: Enforce size limits
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
)
final_chunks = text_splitter.split_documents(structural_chunks)
print(f"Two-stage splitting: {len(final_chunks)} chunks")
for i, chunk in enumerate(final_chunks):
    print(f"  Chunk {i}: {len(chunk.page_content)} chars, metadata={chunk.metadata}")
# LlamaIndex MarkdownNodeParser
from llama_index.core.node_parser import MarkdownNodeParser
md_parser = MarkdownNodeParser()
md_nodes = md_parser.get_nodes_from_documents(li_docs)
print(f"LlamaIndex MarkdownNodeParser: {len(md_nodes)} nodes")
for i, node in enumerate(md_nodes):
    print(f"  Node {i}: {len(node.text)} chars → {node.text[:80]}...")
6. Semantic Chunking
Uses embedding similarity to detect topic boundaries, splitting where semantic coherence drops.
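The breakpoint-detection step can be sketched with toy vectors (real chunkers embed each sentence with a model and typically use a percentile threshold over distances; here we use hand-made 2-D vectors and a fixed cosine threshold for simplicity):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def semantic_breakpoints(embeddings, threshold=0.5):
    """Mark a split after sentence i when its similarity to sentence i+1
    drops below the threshold (i.e., the topic appears to change)."""
    return [i for i in range(len(embeddings) - 1)
            if cosine(embeddings[i], embeddings[i + 1]) < threshold]

# Toy sentence embeddings: the first two point one way, the third another
vecs = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0]]
print(semantic_breakpoints(vecs))  # [1] -> split between sentences 1 and 2
```

Sentences between consecutive breakpoints are then concatenated into one chunk, which is why semantic chunks vary widely in size.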
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
chunker = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95,  # Split where the distance jump exceeds the 95th percentile
)
semantic_chunks = chunker.split_text(sample_document)
print(f"Semantic chunking: {len(semantic_chunks)} chunks\n")
for i, chunk in enumerate(semantic_chunks):
    print(f"--- Chunk {i} ({len(chunk)} chars) ---")
    print(chunk[:150])
    print()
# LlamaIndex SemanticSplitterNodeParser
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding
embed_model = OpenAIEmbedding(model="text-embedding-3-small")
semantic_splitter = SemanticSplitterNodeParser(
    buffer_size=1,
    breakpoint_percentile_threshold=95,
    embed_model=embed_model,
)
semantic_nodes = semantic_splitter.get_nodes_from_documents(li_docs)
print(f"LlamaIndex semantic splitting: {len(semantic_nodes)} nodes")
for i, node in enumerate(semantic_nodes):
    print(f"  Node {i}: {len(node.text)} chars")
7. Parent-Child (Hierarchical) Chunking
Small chunks for precise search; when a child matches, its larger parent is sent to the LLM for richer context.
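The bookkeeping behind this pattern is just a child-to-parent mapping. A minimal sketch (our own toy, using fixed-size character splits rather than a real retriever or vector search):

```python
def build_parent_child_index(document: str, parent_size: int = 800,
                             child_size: int = 200):
    """Toy index: cut the document into fixed-size parents, then cut each
    parent into children that remember which parent they came from."""
    parents = [document[i:i + parent_size]
               for i in range(0, len(document), parent_size)]
    child_to_parent = {}
    for pid, parent in enumerate(parents):
        for j in range(0, len(parent), child_size):
            child_to_parent[parent[j:j + child_size]] = pid
    return parents, child_to_parent

def retrieve(matched_child: str, parents, child_to_parent):
    """Given the small child chunk that matched the query (in a real system,
    via vector search), return its full parent for the LLM context."""
    return parents[child_to_parent[matched_child]]
```

A real implementation stores child embeddings in a vector store and keeps the parent texts in a docstore keyed by id, which is exactly what ParentDocumentRetriever automates below.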
from langchain.retrievers import ParentDocumentRetriever
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.storage import InMemoryStore
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_core.documents import Document as LCDocument
# Child splitter (small chunks for search)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200)
# Parent splitter (larger chunks for LLM context)
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=800)
# Initialize with a placeholder text (FAISS needs at least one entry to build an index)
vectorstore = FAISS.from_texts(
    [""], OpenAIEmbeddings(model="text-embedding-3-small")
)
docstore = InMemoryStore()
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=docstore,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
retriever.add_documents([LCDocument(page_content=sample_document)])
# Search with small child chunks, get back larger parent chunks
results = retriever.invoke("What is multi-head attention?")
print(f"Retrieved {len(results)} parent chunks\n")
for i, doc in enumerate(results):
    print(f"--- Parent Chunk {i} ({len(doc.page_content)} chars) ---")
    print(doc.page_content[:200])
    print()
# LlamaIndex HierarchicalNodeParser (the basis for its AutoMergingRetriever)
from llama_index.core.node_parser import HierarchicalNodeParser, get_leaf_nodes
from llama_index.core import StorageContext, VectorStoreIndex
node_parser = HierarchicalNodeParser.from_defaults(
    chunk_sizes=[512, 256, 128]  # Parent -> child -> grandchild
)
hier_nodes = node_parser.get_nodes_from_documents(li_docs)
leaf_nodes = get_leaf_nodes(hier_nodes)
print(f"Total hierarchical nodes: {len(hier_nodes)}")
print(f"Leaf nodes (for indexing): {len(leaf_nodes)}")
8. Comparing Chunking Strategies
Summary of all strategies and when to use each.
# Compare chunk counts and sizes across strategies
strategies = {
    "Fixed-Size (Char)": char_chunks,
    "Fixed-Size (Token)": token_chunks,
    "Recursive Character": recursive_chunks,
    "Semantic": semantic_chunks,
}
print(f"{'Strategy':<25} {'Chunks':<8} {'Avg Size':<10} {'Min':<8} {'Max':<8}")
print("-" * 60)
for name, chunks in strategies.items():
    sizes = [len(c) for c in chunks]
    print(f"{name:<25} {len(chunks):<8} {sum(sizes)/len(sizes):<10.0f} {min(sizes):<8} {max(sizes):<8}")
print("\nMarkdown structural:", len(md_chunks), "chunks")
print("Two-stage (structural + recursive):", len(final_chunks), "chunks")
print("Hierarchical:", len(hier_nodes), "total nodes,", len(leaf_nodes), "leaf nodes")