Advanced Chunking Strategies for RAG

Comparing fixed-size, recursive, document-aware, semantic, and parent-child chunking methods for optimal retrieval quality

Table of Contents

  1. Setup & Installation
  2. Sample Document for Testing
  3. Fixed-Size (Character / Token) Splitting
  4. Recursive Character Splitting
  5. Document-Aware (Structural) Splitting
  6. Semantic Chunking
  7. Parent-Child (Hierarchical) Chunking
  8. Comparing Chunking Strategies

1. Setup & Installation

!pip install -q langchain langchain-openai langchain-experimental langchain-text-splitters llama-index llama-index-core llama-index-embeddings-openai sentence-transformers tiktoken
import os
# os.environ["OPENAI_API_KEY"] = "your-api-key-here"  # Uncomment and set

2. Sample Document for Testing

We create a sample markdown document to demonstrate different chunking strategies.

sample_document = """
# Attention Mechanisms in Transformers

## Introduction

The Transformer architecture revolutionized natural language processing when it was introduced by Vaswani et al. in 2017. At its core lies the self-attention mechanism, which allows the model to weigh the importance of different parts of the input when producing each part of the output.

Unlike recurrent neural networks (RNNs), which process sequences one token at a time, Transformers process all tokens in parallel. This parallelism is enabled by the attention mechanism, which computes a weighted sum of all input representations for each output position.

## Self-Attention

Self-attention, also known as intra-attention, computes attention weights between all pairs of positions in a single sequence. For each position, it produces a query (Q), key (K), and value (V) vector by multiplying the input embedding with learned weight matrices.

The attention score between position i and position j is computed as the dot product of Q_i and K_j, scaled by the square root of the key dimension. These scores are then passed through a softmax function to produce attention weights, which are used to compute a weighted sum of the value vectors.

## Multi-Head Attention

Instead of performing a single attention function, Multi-Head Attention runs multiple attention heads in parallel, each with different learned projections. This allows the model to jointly attend to information from different representation subspaces at different positions.

Each head operates on a lower-dimensional projection of the queries, keys, and values. The outputs from all heads are concatenated and linearly projected to produce the final output. Typical configurations use 8 or 16 attention heads.

## Cross-Attention

Cross-attention is used in encoder-decoder architectures where the queries come from the decoder and the keys and values come from the encoder. This allows the decoder to attend to all positions in the input sequence when generating each output token.

This mechanism is fundamental to tasks like machine translation, where the decoder needs to selectively focus on different parts of the source sentence when generating each word of the translation.
"""

print(f"Document length: {len(sample_document)} characters")

3. Fixed-Size (Character / Token) Splitting

The simplest approach: cut the text into chunks of at most N characters or tokens, with optional overlap between consecutive chunks.
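To make the mechanics concrete, here is a minimal hand-rolled sketch of the same idea (illustration only, not what the library does internally; it skips the edge-case handling the splitters below provide):

def fixed_size_chunks(text, chunk_size=500, overlap=100):
    """Slide a fixed window over the text, stepping by chunk_size - overlap."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

demo_chunks = fixed_size_chunks(sample_document)
print(f"Hand-rolled fixed-size: {len(demo_chunks)} chunks")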

from langchain.text_splitter import CharacterTextSplitter, TokenTextSplitter

# Character-based splitting
char_splitter = CharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=100,
    separator=""  # Split on any character boundary
)

char_chunks = char_splitter.split_text(sample_document)
print(f"Character splitting: {len(char_chunks)} chunks")
for i, chunk in enumerate(char_chunks):
    print(f"  Chunk {i}: {len(chunk)} chars β€” {chunk[:80]}...")
# Token-based splitting (more precise)
token_splitter = TokenTextSplitter(
    chunk_size=256,
    chunk_overlap=50,
    encoding_name="cl100k_base"  # GPT-4 tokenizer
)

token_chunks = token_splitter.split_text(sample_document)
print(f"Token splitting: {len(token_chunks)} chunks")
for i, chunk in enumerate(token_chunks):
    print(f"  Chunk {i}: {len(chunk)} chars β€” {chunk[:80]}...")
# LlamaIndex equivalent
from llama_index.core.node_parser import TokenTextSplitter as LITokenTextSplitter
from llama_index.core import Document

li_splitter = LITokenTextSplitter(
    chunk_size=256,
    chunk_overlap=50,
)

li_docs = [Document(text=sample_document)]
nodes = li_splitter.get_nodes_from_documents(li_docs)
print(f"LlamaIndex token splitting: {len(nodes)} nodes")
for i, node in enumerate(nodes):
    print(f"  Node {i}: {len(node.text)} chars")

4. Recursive Character Splitting

The most widely used default: split on an ordered list of separators, trying the largest structural boundary (paragraphs) first and recursing into finer separators only for pieces that are still too large.
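The core loop is short enough to sketch. This simplified version shows the recursion (it omits the merge-and-overlap pass the library performs after splitting):

def recursive_split(text, separators, chunk_size=500):
    """Split on the coarsest separator; recurse into pieces that are still too big."""
    sep = separators[0]
    pieces = text.split(sep) if sep else list(text)
    chunks = []
    for piece in pieces:
        if len(piece) <= chunk_size or len(separators) == 1:
            chunks.append(piece)
        else:
            # Still oversized: retry with the next, finer-grained separator
            chunks.extend(recursive_split(piece, separators[1:], chunk_size))
    return chunks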

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=100,
    separators=["\n\n", "\n", ".", "?", "!", " ", ""],
    length_function=len,
)

recursive_chunks = splitter.split_text(sample_document)
print(f"Recursive splitting: {len(recursive_chunks)} chunks\n")
for i, chunk in enumerate(recursive_chunks):
    print(f"--- Chunk {i} ({len(chunk)} chars) ---")
    print(chunk[:200])
    print()
# LlamaIndex SentenceSplitter (roughly equivalent to recursive splitting)
from llama_index.core.node_parser import SentenceSplitter

sentence_splitter = SentenceSplitter(
    chunk_size=512,
    chunk_overlap=50,
)

sentence_nodes = sentence_splitter.get_nodes_from_documents(li_docs)
print(f"LlamaIndex SentenceSplitter: {len(sentence_nodes)} nodes")
for i, node in enumerate(sentence_nodes):
    print(f"  Node {i}: {len(node.text)} chars β€” {node.text[:80]}...")

5. Document-Aware (Structural) Splitting

Leverages document structure (markdown headers, HTML tags) to create chunks aligned with the author’s organization.
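The underlying idea is a simple scan: start a new section whenever a header line appears. A minimal sketch (real markdown parsing also has to handle cases like code fences and setext headers):

import re

def split_on_markdown_headers(md_text):
    """Sketch: break markdown into (header, body) sections."""
    sections, header, lines = [], None, []
    for line in md_text.splitlines():
        match = re.match(r"^#{1,3}\s+(.*)", line)
        if match:
            if "\n".join(lines).strip():
                sections.append((header, "\n".join(lines).strip()))
            header, lines = match.group(1), []
        else:
            lines.append(line)
    if "\n".join(lines).strip():
        sections.append((header, "\n".join(lines).strip()))
    return sections

print(f"Hand-rolled header split: {len(split_on_markdown_headers(sample_document))} sections")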

from langchain.text_splitter import MarkdownHeaderTextSplitter

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

md_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)

md_chunks = md_splitter.split_text(sample_document)
print(f"Markdown splitting: {len(md_chunks)} chunks\n")
for i, chunk in enumerate(md_chunks):
    print(f"--- Chunk {i} ---")
    print(f"Metadata: {chunk.metadata}")
    print(f"Content: {chunk.page_content[:150]}...")
    print()
# Two-stage: structural + recursive for consistent sizes
from langchain.text_splitter import (
    MarkdownHeaderTextSplitter,
    RecursiveCharacterTextSplitter,
)

# Stage 1: Split by structure
structural_chunks = md_splitter.split_text(sample_document)

# Stage 2: Enforce size limits
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
)

final_chunks = text_splitter.split_documents(structural_chunks)
print(f"Two-stage splitting: {len(final_chunks)} chunks")
for i, chunk in enumerate(final_chunks):
    print(f"  Chunk {i}: {len(chunk.page_content)} chars, metadata={chunk.metadata}")
# LlamaIndex MarkdownNodeParser
from llama_index.core.node_parser import MarkdownNodeParser

md_parser = MarkdownNodeParser()
md_nodes = md_parser.get_nodes_from_documents(li_docs)
print(f"LlamaIndex MarkdownNodeParser: {len(md_nodes)} nodes")
for i, node in enumerate(md_nodes):
    print(f"  Node {i}: {len(node.text)} chars β€” {node.text[:80]}...")

6. Semantic Chunking

Uses embedding similarity to detect topic boundaries β€” splits where semantic coherence drops.
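The mechanism is worth seeing in the open: embed consecutive sentences, measure the cosine distance between neighbors, and cut where the distance spikes. A minimal sketch (embed_fn is a placeholder for any sentence-embedding function, not a library API):

import numpy as np

def semantic_breakpoints(sentences, embed_fn, percentile=95):
    """Sketch: return indices where a new chunk should start."""
    embs = np.asarray(embed_fn(sentences), dtype=float)
    embs /= np.linalg.norm(embs, axis=1, keepdims=True)
    # Cosine distance between each sentence and the next one
    dists = 1.0 - np.sum(embs[:-1] * embs[1:], axis=1)
    threshold = np.percentile(dists, percentile)
    return [i + 1 for i, d in enumerate(dists) if d > threshold]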

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

chunker = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95,  # Split where the distance jump exceeds the 95th percentile
)

semantic_chunks = chunker.split_text(sample_document)
print(f"Semantic chunking: {len(semantic_chunks)} chunks\n")
for i, chunk in enumerate(semantic_chunks):
    print(f"--- Chunk {i} ({len(chunk)} chars) ---")
    print(chunk[:150])
    print()
# LlamaIndex SemanticSplitterNodeParser
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding

embed_model = OpenAIEmbedding(model="text-embedding-3-small")

semantic_splitter = SemanticSplitterNodeParser(
    buffer_size=1,
    breakpoint_percentile_threshold=95,
    embed_model=embed_model,
)

semantic_nodes = semantic_splitter.get_nodes_from_documents(li_docs)
print(f"LlamaIndex semantic splitting: {len(semantic_nodes)} nodes")
for i, node in enumerate(semantic_nodes):
    print(f"  Node {i}: {len(node.text)} chars")

7. Parent-Child (Hierarchical) Chunking

Small chunks for precise search; when a child matches, its larger parent is sent to the LLM for richer context.
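The bookkeeping behind the pattern is just an id mapping, sketched here (split_into_children stands in for any child splitter; the real retrievers below handle this for you):

def build_parent_child_index(parents, split_into_children):
    """Sketch: index small children for search, keep a pointer to each parent."""
    parent_store = {}    # parent_id -> full parent text (handed to the LLM)
    child_entries = []   # (child_text, parent_id) pairs (embedded and searched)
    for pid, parent in enumerate(parents):
        parent_store[pid] = parent
        for child in split_into_children(parent):
            child_entries.append((child, pid))
    return parent_store, child_entries

At query time you search over child_entries, then hand parent_store[pid] of the best-matching child to the LLM.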

from langchain.retrievers import ParentDocumentRetriever
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.storage import InMemoryStore
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_core.documents import Document as LCDocument

# Child splitter (small chunks for search); set overlap explicitly, since the
# library default (200) would equal the child chunk size
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=20)

# Parent splitter (larger chunks for LLM context)
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=80)

# FAISS needs at least one text to initialize the index; use a short placeholder
# (the OpenAI embeddings API rejects empty strings)
vectorstore = FAISS.from_texts(
    ["placeholder"], OpenAIEmbeddings(model="text-embedding-3-small")
)
docstore = InMemoryStore()

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=docstore,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

retriever.add_documents([LCDocument(page_content=sample_document)])

# Search with small child chunks, get back larger parent chunks
results = retriever.invoke("What is multi-head attention?")
print(f"Retrieved {len(results)} parent chunks\n")
for i, doc in enumerate(results):
    print(f"--- Parent Chunk {i} ({len(doc.page_content)} chars) ---")
    print(doc.page_content[:200])
    print()
# LlamaIndex Auto Merging Retriever
from llama_index.core.node_parser import HierarchicalNodeParser, get_leaf_nodes
from llama_index.core import StorageContext, VectorStoreIndex

node_parser = HierarchicalNodeParser.from_defaults(
    chunk_sizes=[512, 256, 128]  # Parent -> child -> grandchild
)

hier_nodes = node_parser.get_nodes_from_documents(li_docs)
leaf_nodes = get_leaf_nodes(hier_nodes)

print(f"Total hierarchical nodes: {len(hier_nodes)}")
print(f"Leaf nodes (for indexing): {len(leaf_nodes)}")

8. Comparing Chunking Strategies

Summary of all strategies and when to use each.

# Compare chunk counts and sizes across strategies
strategies = {
    "Fixed-Size (Char)": char_chunks,
    "Fixed-Size (Token)": token_chunks,
    "Recursive Character": recursive_chunks,
    "Semantic": semantic_chunks,
}

print(f"{'Strategy':<25} {'Chunks':<8} {'Avg Size':<10} {'Min':<8} {'Max':<8}")
print("-" * 60)
for name, chunks in strategies.items():
    sizes = [len(c) for c in chunks]
    print(f"{name:<25} {len(chunks):<8} {sum(sizes)/len(sizes):<10.0f} {min(sizes):<8} {max(sizes):<8}")

print("\nMarkdown structural:", len(md_chunks), "chunks")
print("Two-stage (structural + recursive):", len(final_chunks), "chunks")
print("Hierarchical:", len(hier_nodes), "total nodes,", len(leaf_nodes), "leaf nodes")