Design Patterns for AI Agents

A practical catalogue of architectural patterns — from reflection and tool use to multi-agent collaboration — for building reliable, production-grade LLM agents

Published

June 10, 2025

Keywords: AI agent design patterns, agentic workflows, reflection, tool use, planning, multi-agent collaboration, routing, orchestrator-workers, evaluator-optimizer, prompt chaining, parallelization, guardrails, ReAct, LangGraph, LlamaIndex, agent architecture

Introduction

Building a useful AI agent is easy. Building a reliable one is hard. The difference almost always comes down to architecture — specifically, which design patterns you choose and how you compose them.

Over the past two years, the AI community has converged on a set of recurring architectural patterns that appear across successful agent systems — whether it’s a coding assistant resolving GitHub issues, a research agent synthesizing papers, or a customer-support bot handling refunds. These patterns are not framework-specific: they work with OpenAI, Anthropic, open-source models, LangGraph, LlamaIndex, CrewAI, or raw API calls.

Andrew Ng identified four foundational agentic design patterns — Reflection, Tool Use, Planning, and Multi-Agent Collaboration — reporting that agentic workflows can lift even GPT-3.5 above zero-shot GPT-4 on coding benchmarks. Anthropic’s engineering team documented practical workflow patterns — Prompt Chaining, Routing, Parallelization, Orchestrator-Workers, and Evaluator-Optimizer — observed in dozens of production deployments. Liu et al. catalogued 18 architectural patterns for foundation model-based agents with detailed trade-off analysis. And Lilian Weng’s foundational survey on LLM-powered autonomous agents decomposed agents into Planning, Memory, and Tool Use components.

This article synthesizes these perspectives into a practical pattern catalogue — organized by complexity, with concrete code examples, architecture diagrams, and guidance on when to use (and when to avoid) each pattern.

The Agent Architecture Stack

Before diving into individual patterns, it helps to understand where they fit in the overall agent architecture. Every LLM agent, regardless of framework, consists of the same core components:

graph TD
    A["User Query"] --> B["Goal Creator<br/><small>Interpret intent, clarify ambiguity</small>"]
    B --> C["Planner<br/><small>Decompose into steps</small>"]
    C --> D["Executor<br/><small>Call tools, query LLMs</small>"]
    D --> E["Reflector<br/><small>Evaluate output quality</small>"]
    E --> F{"Done?"}
    F -->|No| C
    F -->|Yes| G["Response"]

    H["Memory<br/><small>Short-term & long-term</small>"] -.-> C
    H -.-> D
    H -.-> E
    I["Tools<br/><small>APIs, search, code exec</small>"] -.-> D
    J["Guardrails<br/><small>Input/output validation</small>"] -.-> B
    J -.-> G

    style A fill:#4a90d9,color:#fff,stroke:#333
    style B fill:#9b59b6,color:#fff,stroke:#333
    style C fill:#e67e22,color:#fff,stroke:#333
    style D fill:#27ae60,color:#fff,stroke:#333
    style E fill:#f5a623,color:#fff,stroke:#333
    style G fill:#1abc9c,color:#fff,stroke:#333
    style H fill:#7f8c8d,color:#fff,stroke:#333
    style I fill:#7f8c8d,color:#fff,stroke:#333
    style J fill:#e74c3c,color:#fff,stroke:#333

The patterns in this article map to different components and interactions within this stack. We’ll build from simple compositional workflows up to fully autonomous agents, following Anthropic’s advice: start with the simplest solution possible, and only increase complexity when needed.

Pattern 1: Prompt Chaining

The simplest agentic pattern. Decompose a task into a fixed sequence of LLM calls, where each call processes the output of the previous one. Programmatic checks (“gates”) between steps ensure the process stays on track.

graph LR
    A["Input"] --> B["LLM Call 1<br/><small>Generate</small>"]
    B --> C{"Gate<br/><small>Check quality</small>"}
    C -->|Pass| D["LLM Call 2<br/><small>Refine</small>"]
    C -->|Fail| B
    D --> E["LLM Call 3<br/><small>Format</small>"]
    E --> F["Output"]

    style A fill:#4a90d9,color:#fff,stroke:#333
    style B fill:#9b59b6,color:#fff,stroke:#333
    style C fill:#f5a623,color:#fff,stroke:#333
    style D fill:#9b59b6,color:#fff,stroke:#333
    style E fill:#9b59b6,color:#fff,stroke:#333
    style F fill:#1abc9c,color:#fff,stroke:#333

When to Use

  • Tasks that decompose cleanly into fixed subtasks (e.g., generate outline → validate → write content → translate)
  • When you want to trade latency for accuracy by making each LLM call easier
  • When intermediate results need programmatic validation

Implementation

from openai import OpenAI

client = OpenAI()


def chain_step(prompt: str, model: str = "gpt-4o-mini") -> str:
    """Single LLM call in the chain."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content


def prompt_chain_example(topic: str) -> str:
    """Generate a blog post through a 3-step chain."""
    # Step 1: Generate outline
    outline = chain_step(
        f"Create a detailed outline for a blog post about: {topic}. "
        f"Return only the outline with numbered sections."
    )

    # Gate: heuristic check that the outline has enough sections
    # (fewer than four lines usually means too few)
    if outline.count("\n") < 3:
        outline = chain_step(
            f"The following outline is too short. Expand it to at least 5 sections:\n{outline}"
        )

    # Step 2: Write content from outline
    content = chain_step(
        f"Write a blog post based on this outline. "
        f"Each section should be 2-3 paragraphs:\n\n{outline}"
    )

    # Step 3: Add summary and polish
    final = chain_step(
        f"Add an executive summary at the top and a conclusion at the bottom "
        f"of this blog post. Fix any grammar issues:\n\n{content}"
    )

    return final

Trade-offs

Strength | Weakness
Simple to implement and debug | Fixed sequence — can’t adapt to input
Each step is a focused, easier task | Latency compounds across steps
Gates catch errors early | If step 1 fails, everything downstream fails

Pattern 2: Routing

Classify an input and direct it to a specialized handler. This allows you to optimize prompts and tool configurations per category without one-size-fits-all compromises.

graph TD
    A["Input"] --> B["Router<br/><small>Classify input type</small>"]
    B -->|Type A| C["Handler A<br/><small>Specialized prompt + tools</small>"]
    B -->|Type B| D["Handler B<br/><small>Different prompt + tools</small>"]
    B -->|Type C| E["Handler C<br/><small>Another specialization</small>"]
    C --> F["Output"]
    D --> F
    E --> F

    style A fill:#4a90d9,color:#fff,stroke:#333
    style B fill:#e67e22,color:#fff,stroke:#333
    style C fill:#9b59b6,color:#fff,stroke:#333
    style D fill:#9b59b6,color:#fff,stroke:#333
    style E fill:#9b59b6,color:#fff,stroke:#333
    style F fill:#1abc9c,color:#fff,stroke:#333

When to Use

  • Distinct categories of input that benefit from different handling (e.g., customer support: general questions vs. refund requests vs. technical issues)
  • When you want to route simple queries to cheaper/faster models and complex queries to more capable ones
  • When optimizing for one input type hurts performance on others

Implementation

import json
from openai import OpenAI

client = OpenAI()

ROUTE_SCHEMA = {
    "type": "function",
    "function": {
        "name": "route_query",
        "description": "Classify the user query into a category.",
        "parameters": {
            "type": "object",
            "properties": {
                "category": {
                    "type": "string",
                    "enum": ["technical_support", "billing", "general_inquiry", "complaint"],
                    "description": "The category of the user's query",
                },
                "complexity": {
                    "type": "string",
                    "enum": ["simple", "complex"],
                    "description": "Whether the query requires a simple or complex response",
                },
            },
            "required": ["category", "complexity"],
        },
    },
}


def route_query(query: str) -> dict:
    """Classify a query into category and complexity."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Classify the user's query."},
            {"role": "user", "content": query},
        ],
        tools=[ROUTE_SCHEMA],
        tool_choice={"type": "function", "function": {"name": "route_query"}},
    )
    return json.loads(response.choices[0].message.tool_calls[0].function.arguments)


# Specialized handlers per category
HANDLERS = {
    "technical_support": "You are a technical support specialist. Provide step-by-step solutions.",
    "billing": "You are a billing specialist. Help with payment and subscription issues.",
    "general_inquiry": "You are a helpful assistant. Provide clear, concise answers.",
    "complaint": "You are a customer relations specialist. Be empathetic and solution-oriented.",
}


def handle_query(query: str) -> str:
    """Route and handle a query."""
    route = route_query(query)
    system_prompt = HANDLERS[route["category"]]

    # Use a more capable model for complex queries
    model = "gpt-4o" if route["complexity"] == "complex" else "gpt-4o-mini"

    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": query},
        ],
    )
    return response.choices[0].message.content

Trade-offs

Strength | Weakness
Each handler is optimized for its category | Routing errors cascade into wrong handlers
Cost optimization through model selection | Overhead of classification step
Clean separation of concerns | Adding new categories requires updating router

Pattern 3: Reflection

Ask the LLM to critique and improve its own output. This is the simplest pattern that creates a genuine feedback loop, and Andrew Ng reports it as one of the most reliably effective patterns.

graph TD
    A["Input"] --> B["Generator<br/><small>Produce initial output</small>"]
    B --> C["Critic<br/><small>Evaluate and suggest improvements</small>"]
    C --> D{"Good<br/>enough?"}
    D -->|No| E["Refiner<br/><small>Apply feedback</small>"]
    E --> C
    D -->|Yes| F["Output"]

    style A fill:#4a90d9,color:#fff,stroke:#333
    style B fill:#9b59b6,color:#fff,stroke:#333
    style C fill:#e67e22,color:#fff,stroke:#333
    style E fill:#27ae60,color:#fff,stroke:#333
    style F fill:#1abc9c,color:#fff,stroke:#333

Reflection can take several forms, mirroring the taxonomy from Liu et al.’s pattern catalogue:

Reflection Type | Description | Best For
Self-Reflection | Same agent evaluates its own output | Speed, single-agent systems
Cross-Reflection | A different agent/model evaluates | Complex tasks, diverse perspectives
Human Reflection | Human provides feedback in the loop | High-stakes decisions, alignment

Implementation: Self-Reflection

def generate_with_reflection(
    task: str,
    max_rounds: int = 3,
    model: str = "gpt-4o-mini",
) -> str:
    """Generate output with iterative self-reflection."""
    client = OpenAI()

    # Step 1: Initial generation
    messages = [
        {"role": "system", "content": "You are an expert assistant."},
        {"role": "user", "content": task},
    ]
    response = client.chat.completions.create(model=model, messages=messages, temperature=0.7)
    draft = response.choices[0].message.content

    for round_num in range(max_rounds):
        # Step 2: Critique
        critique_response = client.chat.completions.create(
            model=model,
            messages=[{
                "role": "user",
                "content": f"Task: {task}\n\nDraft:\n{draft}\n\n"
                           f"Critique this draft. List specific issues with "
                           f"correctness, completeness, clarity, and style. "
                           f"If the draft is excellent, respond with 'APPROVED'.",
            }],
            temperature=0,
        )
        critique = critique_response.choices[0].message.content

        if "APPROVED" in critique.upper():
            break

        # Step 3: Refine based on critique
        refine_response = client.chat.completions.create(
            model=model,
            messages=[{
                "role": "user",
                "content": f"Task: {task}\n\nDraft:\n{draft}\n\n"
                           f"Critique:\n{critique}\n\n"
                           f"Rewrite the draft addressing all critique points.",
            }],
            temperature=0.7,
        )
        draft = refine_response.choices[0].message.content

    return draft

Implementation: Cross-Reflection with LangGraph

from typing import TypedDict
from langgraph.graph import StateGraph, END
from langchain_openai import ChatOpenAI


class ReflectionState(TypedDict):
    task: str
    draft: str
    critique: str
    round: int
    max_rounds: int


generator = ChatOpenAI(model="gpt-4o-mini", temperature=0.7)
critic = ChatOpenAI(model="gpt-4o", temperature=0)  # More capable critic


def generate(state: ReflectionState) -> dict:
    if state.get("critique"):
        prompt = (
            f"Task: {state['task']}\nDraft: {state['draft']}\n"
            f"Critique: {state['critique']}\nRewrite addressing all issues."
        )
    else:
        prompt = state["task"]
    response = generator.invoke([{"role": "user", "content": prompt}])
    return {"draft": response.content, "round": state.get("round", 0) + 1}


def critique(state: ReflectionState) -> dict:
    response = critic.invoke([{
        "role": "user",
        "content": f"Task: {state['task']}\nDraft:\n{state['draft']}\n\n"
                   f"Provide detailed critique. Say 'APPROVED' if excellent.",
    }])
    return {"critique": response.content}


def should_continue(state: ReflectionState) -> str:
    if "APPROVED" in state.get("critique", "").upper():
        return END
    if state.get("round", 0) >= state.get("max_rounds", 3):
        return END
    return "generate"


graph = StateGraph(ReflectionState)
graph.add_node("generate", generate)
graph.add_node("critique", critique)
graph.set_entry_point("generate")
graph.add_edge("generate", "critique")
graph.add_conditional_edges("critique", should_continue, {"generate": "generate", END: END})

app = graph.compile()

When Reflection Helps (and When It Doesn’t)

Helps: Code generation (test against unit tests), writing/translation (subjective quality), factual QA (verify against retrieved sources), math (check intermediate steps).

Doesn’t help: Simple factual lookups, classification tasks, when the model can’t evaluate its own domain expertise (e.g., advanced medical diagnosis).
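When an objective signal like a test suite exists, wire it into the loop rather than relying on self-critique alone. A minimal sketch, with hypothetical `generate` and `run_tests` callables standing in for the LLM call and the test harness:

```python
from typing import Callable


def reflect_with_tests(
    generate: Callable[[object], str],
    run_tests: Callable[[str], list[str]],
    max_rounds: int = 3,
) -> str:
    """Reflection grounded in an objective signal: regenerate until the
    test suite passes, feeding failures back as the critique."""
    feedback = None
    code = ""
    for _ in range(max_rounds):
        code = generate(feedback)   # LLM call in practice
        failures = run_tests(code)  # objective critique, not self-judgment
        if not failures:
            return code
        feedback = failures
    return code  # best effort after max_rounds
```

Because the critique comes from test failures rather than the model’s own opinion, this variant sidesteps the “can’t evaluate its own domain expertise” failure mode.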

Pattern 4: Tool Use

Equip the LLM with external functions it can call — search, code execution, APIs, databases — to ground its reasoning in real observations rather than hallucinating.

Tool Use is the foundation of the ReAct pattern (Reason + Act), where the agent interleaves thinking with tool calls in a loop. This is covered extensively in our article on Building a ReAct Agent from Scratch.

graph TD
    A["User Query"] --> B["LLM<br/><small>Reason about tools needed</small>"]
    B --> C{"Tool call<br/>needed?"}
    C -->|Yes| D["Tool Execution<br/><small>API, search, code, DB</small>"]
    D --> E["Observation<br/><small>Tool output</small>"]
    E --> B
    C -->|No| F["Final Answer"]

    style A fill:#4a90d9,color:#fff,stroke:#333
    style B fill:#9b59b6,color:#fff,stroke:#333
    style D fill:#e67e22,color:#fff,stroke:#333
    style E fill:#27ae60,color:#fff,stroke:#333
    style F fill:#1abc9c,color:#fff,stroke:#333

Tool Design Best Practices

Anthropic’s engineering team emphasizes that tool design matters as much as prompt design. Think of it as building an Agent-Computer Interface (ACI) — you should invest as much effort here as in Human-Computer Interfaces (HCI).

from langchain_core.tools import tool


# Good: Clear name, detailed docstring, constrained inputs
@tool
def search_knowledge_base(
    query: str,
    category: str = "all",
    max_results: int = 5,
) -> str:
    """Search the internal knowledge base for technical documentation.

    Use this tool when the user asks about:
    - API references, endpoints, and parameters
    - Configuration guides and how-to instructions
    - Known issues and workarounds

    DO NOT use this for general knowledge questions — use web_search instead.

    Args:
        query: Natural language search query. Be specific.
        category: Filter by category. Options: "api", "config", "troubleshooting", "all".
        max_results: Number of results to return (1-10).

    Returns:
        Relevant documentation excerpts with source references.
    """
    # Implementation here
    ...


# Bad: Vague name, no docstring, ambiguous purpose
@tool
def search(q: str) -> str:
    """Search for stuff."""  # Too vague — which search? When to use?
    ...

Key principles from Anthropic’s SWE-bench work:

  • Use absolute paths, not relative — the model won’t track directory changes correctly
  • Include examples of correct usage in tool descriptions
  • Constrain inputs — use enums, ranges, and required fields to prevent malformed calls
  • Make errors informative — return actionable error messages, not stack traces
  • Poka-yoke — design tools so it’s hard to make mistakes

Tool Registry Pattern

When an agent has access to many tools (10+), stuffing every description into the context window is wasteful and degrades tool selection. The Tool/Agent Registry pattern from Liu et al.’s catalogue instead maintains a searchable catalogue of tools:

from llama_index.core.tools import FunctionTool, ToolMetadata

# Register tools with rich metadata
tool_registry = {}


def register_tool(func, name: str, description: str, category: str):
    """Register a tool with metadata for dynamic selection."""
    tool_registry[name] = {
        "tool": FunctionTool.from_defaults(fn=func, name=name, description=description),
        "category": category,
        "description": description,
    }


def select_tools(query: str, max_tools: int = 5) -> list:
    """Dynamically select the most relevant tools for a query.
    Uses embedding similarity to match query against tool descriptions."""
    # In practice, embed the query and tool descriptions,
    # then return the top-k most similar tools
    ...

This is analogous to RAG for tools — retrieve the most relevant tool descriptions before each LLM call, rather than stuffing all of them into every prompt.
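One way to flesh out the stubbed `select_tools` above. This sketch uses a toy bag-of-words similarity so it runs standalone; in production you would use a real embedding model and cache the tool-description vectors. The tool names and descriptions are illustrative:

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'. In production, swap in a real
    embedding model and cache the tool-description vectors."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Illustrative registry: tool name -> description
TOOL_DESCRIPTIONS = {
    "search_knowledge_base": "search internal technical documentation api config",
    "web_search": "search the public web for general knowledge",
    "run_python": "execute python code for calculations and data analysis",
}


def select_tools(query: str, max_tools: int = 2) -> list[str]:
    """Return the tools whose descriptions are most similar to the query."""
    q = embed(query)
    ranked = sorted(
        TOOL_DESCRIPTIONS,
        key=lambda name: cosine(q, embed(TOOL_DESCRIPTIONS[name])),
        reverse=True,
    )
    return ranked[:max_tools]
```

Only the selected tools’ full schemas then go into the LLM call, keeping the prompt small regardless of registry size.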

Pattern 5: Planning

Let the LLM autonomously decide what sequence of steps to execute. This is the most powerful — and most unpredictable — single-agent pattern.

Planning manifests in two forms:

graph TD
    subgraph SP["Single-Path Planning"]
        A1["Goal"] --> A2["Step 1"] --> A3["Step 2"] --> A4["Step 3"] --> A5["Result"]
    end

    subgraph MP["Multi-Path Planning"]
        B1["Goal"] --> B2["Step 1"]
        B2 --> B3a["Option A"]
        B2 --> B3b["Option B"]
        B3a --> B4a["Step 2A"]
        B3b --> B4b["Step 2B"]
        B4a --> B5["Best Result"]
        B4b --> B5
    end

    style SP fill:#F2F2F2,stroke:#D9D9D9
    style MP fill:#F2F2F2,stroke:#D9D9D9
    style A5 fill:#27ae60,color:#fff,stroke:#333
    style B5 fill:#27ae60,color:#fff,stroke:#333

Single-Path Plan Generator (Chain-of-Thought): Generate a linear sequence of steps. Simple, efficient, but inflexible if unexpected results appear.

Multi-Path Plan Generator (Tree-of-Thoughts): Generate multiple candidate approaches, evaluate each, and select the best. More robust, but significantly more expensive.
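The multi-path variant reduces to a generate-then-select skeleton. A sketch with injectable `generate` and `score` callables — in practice both would be LLM calls (one proposing candidate plans, one evaluating them); here they are hypothetical placeholders:

```python
from typing import Callable


def multi_path_plan(
    goal: str,
    generate: Callable[[str, int], list[str]],  # propose candidate plan i
    score: Callable[[list[str]], float],        # evaluate a candidate plan
    n_candidates: int = 3,
) -> list[str]:
    """Tree-of-Thoughts-style selection: propose several candidate plans,
    score each one, and keep the best."""
    candidates = [generate(goal, i) for i in range(n_candidates)]
    return max(candidates, key=score)
```

Cost scales linearly with `n_candidates`, which is why single-path planning remains the default unless robustness justifies the extra calls.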

Implementation: Plan-and-Execute

import json
from openai import OpenAI

client = OpenAI()

PLAN_SCHEMA = {
    "type": "function",
    "function": {
        "name": "create_plan",
        "description": "Create a step-by-step plan to accomplish a task.",
        "parameters": {
            "type": "object",
            "properties": {
                "steps": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "step_number": {"type": "integer"},
                            "description": {"type": "string"},
                            "tool": {"type": "string", "description": "Tool to use for this step"},
                            "depends_on": {
                                "type": "array",
                                "items": {"type": "integer"},
                                "description": "Step numbers this step depends on",
                            },
                        },
                        "required": ["step_number", "description", "tool"],
                    },
                },
            },
            "required": ["steps"],
        },
    },
}


def create_plan(task: str, available_tools: list[str]) -> list[dict]:
    """Generate an execution plan for a complex task."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": f"You are a planning agent. Available tools: {available_tools}. "
                           f"Create a minimal plan — use the fewest steps possible.",
            },
            {"role": "user", "content": task},
        ],
        tools=[PLAN_SCHEMA],
        tool_choice={"type": "function", "function": {"name": "create_plan"}},
    )
    plan = json.loads(response.choices[0].message.tool_calls[0].function.arguments)
    return plan["steps"]


def execute_plan(plan: list[dict], tools: dict) -> dict:
    """Execute a plan step by step, passing results between steps."""
    results = {}
    for step in sorted(plan, key=lambda s: s["step_number"]):
        # Gather dependency results
        context = {
            dep: results[dep] for dep in step.get("depends_on", []) if dep in results
        }
        # Execute the step
        tool_name = step["tool"]
        if tool_name in tools:
            result = tools[tool_name](step["description"], context)
        else:
            result = f"Tool '{tool_name}' not available"
        results[step["step_number"]] = result
    return results

When to Use Planning

From Andrew Ng’s experience, Planning is a very powerful capability, but it also produces less predictable results than the more mature Reflection and Tool Use patterns. Use Planning when:

  • The task cannot be decomposed in advance — the steps depend on intermediate results
  • You need the agent to recover from unexpected errors by re-planning
  • The task is inherently exploratory (research, investigation, debugging)

Pattern 6: Parallelization

Run multiple LLM calls simultaneously and aggregate results. Two key variations:

graph TD
    subgraph Sectioning["Sectioning"]
        A1["Task"] --> A2["Subtask A"]
        A1 --> A3["Subtask B"]
        A1 --> A4["Subtask C"]
        A2 --> A5["Aggregator"]
        A3 --> A5
        A4 --> A5
        A5 --> A6["Output"]
    end

    subgraph Voting["Voting"]
        B1["Task"] --> B2["Attempt 1"]
        B1 --> B3["Attempt 2"]
        B1 --> B4["Attempt 3"]
        B2 --> B5["Majority Vote"]
        B3 --> B5
        B4 --> B5
        B5 --> B6["Output"]
    end

    style Sectioning fill:#F2F2F2,stroke:#D9D9D9
    style Voting fill:#F2F2F2,stroke:#D9D9D9
    style A6 fill:#27ae60,color:#fff,stroke:#333
    style B6 fill:#27ae60,color:#fff,stroke:#333

Sectioning: Break a task into independent subtasks and process them simultaneously. Example: evaluate code quality, security, and performance in parallel.

Voting: Run the same task multiple times and aggregate. Example: have three models independently classify content, take the majority vote. This pattern maps to Liu et al.’s Voting-based Cooperation pattern for multi-agent systems.

Implementation

import asyncio
from openai import AsyncOpenAI

aclient = AsyncOpenAI()


async def parallel_section(task: str, aspects: list[str]) -> dict:
    """Evaluate a task from multiple aspects in parallel (Sectioning)."""
    async def evaluate_aspect(aspect: str) -> tuple[str, str]:
        response = await aclient.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": f"Evaluate the following from the perspective of {aspect}:\n\n{task}",
            }],
        )
        return aspect, response.choices[0].message.content

    results = await asyncio.gather(*[evaluate_aspect(a) for a in aspects])
    return dict(results)


async def parallel_vote(task: str, n_votes: int = 3) -> str:
    """Run the same classification multiple times and take majority vote (Voting)."""
    async def single_vote() -> str:
        response = await aclient.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": task}],
            temperature=0.7,  # Some randomness for diversity
        )
        return response.choices[0].message.content.strip()

    votes = await asyncio.gather(*[single_vote() for _ in range(n_votes)])

    # Majority vote
    from collections import Counter
    vote_counts = Counter(votes)
    return vote_counts.most_common(1)[0][0]


# Usage: sectioning
# results = await parallel_section(
#     "Review this Python code: ...",
#     aspects=["correctness", "security", "performance", "readability"]
# )

# Usage: voting
# label = await parallel_vote(
#     "Classify this text as positive, negative, or neutral: 'The product works but shipping was slow'"
# )

Pattern 7: Orchestrator-Workers

A central orchestrator LLM dynamically breaks down tasks and delegates to worker LLMs, then synthesizes the results. Unlike Parallelization, the subtasks are not pre-defined — the orchestrator decides what’s needed based on the input.

graph TD
    A["Complex Task"] --> B["Orchestrator<br/><small>Analyze, decompose, delegate</small>"]
    B --> C["Worker 1<br/><small>Subtask A</small>"]
    B --> D["Worker 2<br/><small>Subtask B</small>"]
    B --> E["Worker N<br/><small>Subtask N</small>"]
    C --> F["Orchestrator<br/><small>Synthesize results</small>"]
    D --> F
    E --> F
    F --> G["Final Output"]

    style A fill:#4a90d9,color:#fff,stroke:#333
    style B fill:#e67e22,color:#fff,stroke:#333
    style C fill:#9b59b6,color:#fff,stroke:#333
    style D fill:#9b59b6,color:#fff,stroke:#333
    style E fill:#9b59b6,color:#fff,stroke:#333
    style F fill:#e67e22,color:#fff,stroke:#333
    style G fill:#1abc9c,color:#fff,stroke:#333

This maps to Liu et al.’s Role-based Cooperation pattern, where agents assume roles like planner, assigner, and worker. MetaGPT, XAgent, and CrewAI all implement this pattern.

Implementation with LangGraph

from typing import TypedDict
from langgraph.graph import StateGraph, END
from langchain_openai import ChatOpenAI
import json


class OrchestratorState(TypedDict):
    task: str
    subtasks: list[dict]
    results: dict
    final_output: str


orchestrator_llm = ChatOpenAI(model="gpt-4o", temperature=0)
worker_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)


def decompose(state: OrchestratorState) -> dict:
    """Orchestrator decomposes the task into subtasks."""
    response = orchestrator_llm.invoke([{
        "role": "user",
        "content": f"Break this task into 2-5 independent subtasks. "
                   f"Return only a JSON array of objects with 'id' and 'description' "
                   f"keys, with no markdown fences:\n\n{state['task']}",
    }])
    # Strip accidental code fences before parsing
    raw = response.content.strip()
    raw = raw.removeprefix("```json").removeprefix("```").removesuffix("```")
    return {"subtasks": json.loads(raw)}


def execute_subtasks(state: OrchestratorState) -> dict:
    """Workers execute each subtask."""
    results = {}
    for subtask in state["subtasks"]:
        response = worker_llm.invoke([{
            "role": "user",
            "content": f"Complete this subtask:\n{subtask['description']}",
        }])
        results[subtask["id"]] = response.content
    return {"results": results}


def synthesize(state: OrchestratorState) -> dict:
    """Orchestrator synthesizes worker results."""
    results_text = "\n\n".join(
        f"Subtask {k}: {v}" for k, v in state["results"].items()
    )
    response = orchestrator_llm.invoke([{
        "role": "user",
        "content": f"Original task: {state['task']}\n\n"
                   f"Subtask results:\n{results_text}\n\n"
                   f"Synthesize these into a comprehensive final answer.",
    }])
    return {"final_output": response.content}


graph = StateGraph(OrchestratorState)
graph.add_node("decompose", decompose)
graph.add_node("execute", execute_subtasks)
graph.add_node("synthesize", synthesize)
graph.set_entry_point("decompose")
graph.add_edge("decompose", "execute")
graph.add_edge("execute", "synthesize")
graph.add_edge("synthesize", END)

orchestrator_app = graph.compile()

Pattern 8: Evaluator-Optimizer (Iterative Refinement)

One LLM generates a response while another LLM evaluates and provides feedback, in a loop. This is the production-strength version of the Reflection pattern — instead of self-critique, a separate evaluator with explicit rubrics drives improvement.

graph TD
    A["Input"] --> B["Generator LLM<br/><small>Produce output</small>"]
    B --> C["Evaluator LLM<br/><small>Score against rubrics</small>"]
    C --> D{"Score ≥<br/>threshold?"}
    D -->|No| E["Feedback<br/><small>Specific improvements needed</small>"]
    E --> B
    D -->|Yes| F["Output ✓"]

    style A fill:#4a90d9,color:#fff,stroke:#333
    style B fill:#9b59b6,color:#fff,stroke:#333
    style C fill:#e67e22,color:#fff,stroke:#333
    style F fill:#27ae60,color:#fff,stroke:#333

When to Use

This pattern works when:

  • You have clear evaluation criteria that can be expressed as rubrics
  • LLM responses demonstrably improve when given human-like feedback
  • The task involves subjective quality (writing, translation, design)

Implementation

import json
from openai import OpenAI

client = OpenAI()

EVAL_SCHEMA = {
    "type": "function",
    "function": {
        "name": "evaluate",
        "parameters": {
            "type": "object",
            "properties": {
                "score": {"type": "number", "description": "Quality score from 1-10"},
                "strengths": {"type": "array", "items": {"type": "string"}},
                "weaknesses": {"type": "array", "items": {"type": "string"}},
                "suggestions": {"type": "array", "items": {"type": "string"}},
                "approved": {"type": "boolean"},
            },
            "required": ["score", "strengths", "weaknesses", "suggestions", "approved"],
        },
    },
}


def evaluator_optimizer(
    task: str,
    rubric: str,
    max_rounds: int = 3,
    threshold: float = 8.0,
) -> str:
    """Iteratively generate and evaluate until quality threshold is met."""
    draft = None
    feedback = None

    for round_num in range(max_rounds):
        # Generate (or refine)
        gen_prompt = f"Task: {task}"
        if feedback:
            gen_prompt += f"\n\nPrevious draft:\n{draft}\n\nFeedback:\n{feedback}\n\nRevise."

        gen_response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": gen_prompt}],
            temperature=0.7,
        )
        draft = gen_response.choices[0].message.content

        # Evaluate
        eval_response = client.chat.completions.create(
            model="gpt-4o",  # More capable evaluator
            messages=[{
                "role": "user",
                "content": f"Evaluate this output against the rubric.\n\n"
                           f"Rubric:\n{rubric}\n\nOutput:\n{draft}",
            }],
            tools=[EVAL_SCHEMA],
            tool_choice={"type": "function", "function": {"name": "evaluate"}},
        )
        evaluation = json.loads(
            eval_response.choices[0].message.tool_calls[0].function.arguments
        )

        if evaluation["approved"] or evaluation["score"] >= threshold:
            return draft

        feedback = "\n".join(evaluation["suggestions"])

    return draft  # Threshold never met: return the final (most-revised) draft

Pattern 9: Multi-Agent Collaboration

The most powerful — and most complex — pattern. Multiple specialized agents collaborate, debate, or vote on tasks. This maps to three cooperation schemes from Liu et al.:

graph TD
    subgraph RB["Role-Based"]
        R1["Planner"] --> R2["Coder"]
        R2 --> R3["Reviewer"]
        R3 --> R4["Tester"]
    end

    subgraph DB["Debate-Based"]
        D1["Agent A"] <--> D2["Agent B"]
        D2 <--> D3["Agent C"]
        D1 <--> D3
    end

    subgraph VB["Voting-Based"]
        V1["Agent 1<br/><small>Vote: A</small>"]
        V2["Agent 2<br/><small>Vote: B</small>"]
        V3["Agent 3<br/><small>Vote: A</small>"]
        V1 --> V4["Majority: A"]
        V2 --> V4
        V3 --> V4
    end

    style RB fill:#F2F2F2,stroke:#D9D9D9
    style DB fill:#F2F2F2,stroke:#D9D9D9
    style VB fill:#F2F2F2,stroke:#D9D9D9

Role-based: Agents take assigned roles (planner, coder, reviewer) in a pipeline. Used by MetaGPT, ChatDev, and CrewAI.

Debate-based: Agents argue different positions and converge on a consensus. Reduces hallucination through adversarial checking.

Voting-based: Multiple agents independently produce answers, and the majority wins. Simple but effective — “More agents is all you need” (Li et al., 2024).
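The voting scheme needs no framework at all. As a minimal sketch, it is a majority count over independently produced answers — here the agents are plain callables standing in for LLM calls:

```python
from collections import Counter
from typing import Callable, Sequence, Tuple


def majority_vote(
    agents: Sequence[Callable[[str], str]], question: str
) -> Tuple[str, float]:
    """Ask each agent independently; return the majority answer and its agreement ratio."""
    answers = [agent(question) for agent in agents]
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / len(answers)
```

With three stub agents voting A, B, A, `majority_vote` returns `("A", 2/3)`. The agreement ratio is worth surfacing: low agreement is a useful signal to escalate to a human or a stronger model.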

Implementation: Role-Based Multi-Agent with LangGraph

from langgraph.prebuilt import create_react_agent
from langchain_openai import ChatOpenAI
from langchain_core.tools import tool


# Specialized agents
researcher_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
writer_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.7)
editor_llm = ChatOpenAI(model="gpt-4o", temperature=0)


@tool
def search_web(query: str) -> str:
    """Search the web for current information."""
    # Implementation with your preferred search API
    ...


@tool
def check_facts(claims: str) -> str:
    """Verify factual claims against reliable sources."""
    # Implementation
    ...


# Create specialized agents
researcher = create_react_agent(
    model=researcher_llm,
    tools=[search_web],
    prompt="You are a research specialist. Gather comprehensive, accurate information. "
           "Always cite your sources.",
)

writer = create_react_agent(
    model=writer_llm,
    tools=[],
    prompt="You are a skilled writer. Create engaging, clear content from research notes. "
           "Maintain accuracy while being accessible.",
)

editor = create_react_agent(
    model=editor_llm,
    tools=[check_facts],
    prompt="You are a rigorous editor. Check for factual accuracy, clarity, and completeness. "
           "Verify all claims. Suggest specific improvements.",
)

For full multi-agent orchestration patterns, see our dedicated article on Multi-Agent RAG Orchestration Patterns.
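Until then, a minimal sketch of how the three agents above could be wired into a pipeline — assuming LangGraph's convention that an agent's `invoke` takes a `{"messages": [...]}` state and returns one whose last message holds the answer:

```python
from typing import Sequence


def run_pipeline(agents: Sequence, task: str) -> str:
    """Feed each agent's final message to the next agent in sequence."""
    content = task
    for agent in agents:
        state = agent.invoke({"messages": [{"role": "user", "content": content}]})
        content = state["messages"][-1].content  # last message = agent's answer
    return content


# article = run_pipeline([researcher, writer, editor], "Agent design patterns in 2025")
```

This hand-off loop is deliberately lossy — each agent sees only its predecessor's final message. Production systems usually pass richer shared state, which is exactly what the orchestration article covers.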

Pattern 10: Guardrails

Control the inputs and outputs of LLMs to enforce safety, quality, and compliance requirements. Guardrails are not a single pattern but a cross-cutting concern that applies to every other pattern.

graph LR
    A["User Input"] --> B["Input<br/>Guardrails"]
    B --> C["LLM / Agent"]
    C --> D["Output<br/>Guardrails"]
    D --> E["User Response"]

    B2["Reject / Redirect"] --> B
    D2["Reject / Retry"] --> D

    style B fill:#e74c3c,color:#fff,stroke:#333
    style C fill:#9b59b6,color:#fff,stroke:#333
    style D fill:#e74c3c,color:#fff,stroke:#333

Liu et al. describe this as the Multimodal Guardrails pattern — an intermediate layer between the foundation model and all other components that validates inputs and filters outputs. Key implementations include NVIDIA NeMo Guardrails, Meta’s Llama Guard, and Guardrails AI.

Guardrails in agent systems should check:

| Layer | What to Check | Example |
| --- | --- | --- |
| Input | Prompt injection, PII, off-topic queries | Reject "ignore all previous instructions" |
| Tool calls | Valid tool names, safe arguments, rate limits | Block `rm -rf /` in code execution tools |
| Tool outputs | Sensitive data leakage, error handling | Redact API keys from tool responses |
| Agent output | Hallucination, tone, compliance, format | Verify cited sources actually exist |
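An input guardrail can start as simple as a pattern blocklist. The patterns below are illustrative only — a real deployment should use a dedicated guardrail library such as those named above rather than a hand-rolled regex list:

```python
import re

# Illustrative injection patterns; NOT a complete or production-grade blocklist.
INJECTION_PATTERNS = [
    r"ignore (all )?previous instructions",
    r"disregard .*(system prompt|instructions)",
    r"you are now .*(unrestricted|jailbroken)",
]


def passes_input_guardrail(user_input: str) -> bool:
    """Return False if the input matches a known prompt-injection pattern."""
    lowered = user_input.lower()
    return not any(re.search(p, lowered) for p in INJECTION_PATTERNS)
```

A rejected input should be redirected (per the diagram above) rather than silently dropped, so the user gets a safe fallback response.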

For more detail on implementing guardrails, see Guardrails and Safety for Autonomous Retrieval Agents.

Choosing the Right Pattern

Not every task needs an autonomous agent. Anthropic’s key advice: start with simple prompts, optimize with evaluation, and add agentic patterns only when simpler solutions fall short.

graph TD
    A["Task"] --> B{"Single LLM call<br/>sufficient?"}
    B -->|Yes| C["Direct Prompting"]
    B -->|No| D{"Fixed sequence<br/>of steps?"}
    D -->|Yes| E["Prompt Chaining"]
    D -->|No| F{"Distinct input<br/>categories?"}
    F -->|Yes| G["Routing"]
    F -->|No| H{"Independent<br/>subtasks?"}
    H -->|Yes| I["Parallelization"]
    H -->|No| J{"Need iterative<br/>improvement?"}
    J -->|Yes| K["Evaluator-Optimizer"]
    J -->|No| L{"Need tools or<br/>external data?"}
    L -->|Yes| M["ReAct / Tool Use"]
    L -->|No| N{"Subtasks unknown<br/>in advance?"}
    N -->|Yes| O["Orchestrator-Workers"]
    N -->|No| P["Multi-Agent"]

    style C fill:#27ae60,color:#fff,stroke:#333
    style E fill:#27ae60,color:#fff,stroke:#333
    style G fill:#27ae60,color:#fff,stroke:#333
    style I fill:#27ae60,color:#fff,stroke:#333
    style K fill:#27ae60,color:#fff,stroke:#333
    style M fill:#27ae60,color:#fff,stroke:#333
    style O fill:#f5a623,color:#fff,stroke:#333
    style P fill:#e74c3c,color:#fff,stroke:#333
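The decision tree above can also be expressed as a plain top-to-bottom helper — the boolean flags are illustrative names for each branch, not a real API:

```python
def choose_pattern(
    single_call: bool = False,
    fixed_steps: bool = False,
    distinct_categories: bool = False,
    independent_subtasks: bool = False,
    iterative_improvement: bool = False,
    needs_tools: bool = False,
    subtasks_unknown: bool = False,
) -> str:
    """Walk the decision tree from simplest pattern to most complex."""
    if single_call:
        return "Direct Prompting"
    if fixed_steps:
        return "Prompt Chaining"
    if distinct_categories:
        return "Routing"
    if independent_subtasks:
        return "Parallelization"
    if iterative_improvement:
        return "Evaluator-Optimizer"
    if needs_tools:
        return "ReAct / Tool Use"
    if subtasks_unknown:
        return "Orchestrator-Workers"
    return "Multi-Agent"
```

The ordering encodes the key principle: answer the cheapest question first and fall through to Multi-Agent only when every simpler branch fails.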

Pattern Selection Summary

| Pattern | Complexity | Latency | Cost | Predictability | Best For |
| --- | --- | --- | --- | --- | --- |
| Prompt Chaining | Low | Medium | Low | High | Fixed multi-step workflows |
| Routing | Low | Low | Low | High | Diverse input types |
| Reflection | Low | Medium | Medium | High | Quality improvement |
| Tool Use / ReAct | Medium | Medium | Medium | Medium | Grounded reasoning, fact-checking |
| Parallelization | Medium | Low | Medium | High | Independent evaluations, voting |
| Planning | Medium | High | High | Low | Exploratory, open-ended tasks |
| Orchestrator-Workers | High | High | High | Medium | Dynamic task decomposition |
| Evaluator-Optimizer | Medium | High | High | Medium | Iterative quality refinement |
| Multi-Agent | High | Very High | Very High | Low | Complex collaborative tasks |
| Guardrails | Low | Low | Low | High | Safety, compliance (always use) |

Common Pitfalls

| Pitfall | Symptom | Fix |
| --- | --- | --- |
| Over-engineering | Simple task wrapped in multi-agent system | Start with single LLM call, add complexity only when measured performance improves |
| No stopping conditions | Agent loops forever, burns tokens | Always set max steps, token budgets, and timeout limits |
| Poor tool descriptions | Agent calls wrong tools or passes bad arguments | Treat tool docstrings like API docs for a junior developer |
| No evaluation | Can't tell if agent changes improve results | Build automated evals before adding patterns |
| Ignoring latency | 30-second response time for simple questions | Route simple queries to fast paths, reserve agents for complex tasks |
| Hallucinated tools | Agent invents tool names that don't exist | Validate tool names before execution; return available tools in error messages |
| Context overflow | Long agent sessions exceed model limit | Summarize older messages, use memory systems |
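The "hallucinated tools" fix is cheap to implement. A sketch of a dispatch wrapper that validates tool names and feeds the available tools back to the agent in the error message, so it can self-correct on the next step (function names here are illustrative):

```python
from typing import Any, Callable, Dict


def dispatch_tool(
    name: str, args: Dict[str, Any], registry: Dict[str, Callable]
) -> str:
    """Execute a registered tool; return a corrective message for unknown names."""
    if name not in registry:
        # Listing valid tools lets the agent recover instead of retrying blindly
        return (
            f"Error: unknown tool '{name}'. "
            f"Available tools: {', '.join(sorted(registry))}"
        )
    try:
        return str(registry[name](**args))
    except Exception as exc:
        # Surface failures as observations rather than crashing the agent loop
        return f"Error: tool '{name}' failed: {exc}"
```

Returning errors as strings (rather than raising) matters: the agent loop treats them as observations and can adjust, which is the same recovery mechanism ReAct relies on.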

Conclusion

Design patterns for AI agents are not theoretical — they are the building blocks that every production agent system uses. The key insight from both academic research and industry practice is that composition of simple patterns beats monolithic complexity.

Key takeaways:

  • Start simple: Prompt Chaining and Routing handle most tasks. Only reach for Planning or Multi-Agent when simpler patterns fail.
  • Reflection is underrated: Self-critique with explicit feedback loops reliably improves output quality with minimal implementation effort.
  • Tool design = prompt design: Invest as much effort in tool descriptions and interfaces as in your system prompts. Anthropic’s SWE-bench agent spent more time on tool optimization than prompt optimization.
  • Guardrails are not optional: Every agent needs input validation, output filtering, and stopping conditions. These are safety-critical, not nice-to-have.
  • Patterns compose: Real systems combine multiple patterns — a routing layer that dispatches to specialized ReAct agents, each using reflection, all wrapped in guardrails.
  • Evaluate before you complicate: Add patterns only when automated evaluations show they improve results. Complexity without measurement is just overhead.

The field is evolving rapidly — new patterns emerge as LLMs gain capabilities in reasoning, planning, and tool use. But the architectural principles are stable: decomposition, feedback loops, grounding in observations, and fail-safe stopping conditions.

References
