FinOps Best Practices for LLM Applications

A comprehensive guide to optimizing costs for LLM-powered applications

Table of Contents

  1. Setup & Installation
  2. Understanding LLM Cost Structure
  3. OpenAI Prompt Caching
  4. Anthropic Prompt Caching
  5. Model Routing
  6. Cascade Pattern
  7. Prompt Optimization
  8. Semantic Caching with GPTCache
  9. Custom Semantic Cache
  10. Cost Monitoring

1. Setup & Installation

!pip install -q openai anthropic gptcache tiktoken
import os
import json
import time
import numpy as np
from openai import OpenAI
from anthropic import Anthropic
import tiktoken

# Set API keys
os.environ["OPENAI_API_KEY"] = "your-openai-api-key"
os.environ["ANTHROPIC_API_KEY"] = "your-anthropic-api-key"

openai_client = OpenAI()
anthropic_client = Anthropic()

2. Understanding LLM Cost Structure

LLM costs are driven by token usage. Representative list prices at the time of writing (check current rate cards before estimating):

| Model | Input (per 1M tokens) | Output (per 1M tokens) | Cached Input (per 1M tokens) |
|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | $1.25 |
| GPT-4o-mini | $0.15 | $0.60 | $0.075 |
| Claude 3.5 Sonnet | $3.00 | $15.00 | $0.30 |
| Claude 3.5 Haiku | $0.80 | $4.00 | $0.08 |

Key cost drivers:

  • Input tokens: system prompts, context, and user messages
  • Output tokens: model-generated responses
  • Frequency: number of API calls per unit time
  • Model choice: prices vary 10-100x between models
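To make these numbers concrete, here is a minimal sketch of a per-request cost estimator built from the pricing table above. The `PRICES` dict and `estimate_cost` helper are illustrative, not part of any SDK:

```python
# Illustrative per-request cost estimator based on the pricing table above.
# PRICES maps model name -> (input, output, cached input) USD per 1M tokens.
PRICES = {
    "gpt-4o":            (2.50, 10.00, 1.25),
    "gpt-4o-mini":       (0.15,  0.60, 0.075),
    "claude-3-5-sonnet": (3.00, 15.00, 0.30),
    "claude-3-5-haiku":  (0.80,  4.00, 0.08),
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int,
                  cached_tokens: int = 0) -> float:
    """Estimate one request's cost in USD; cached tokens bill at the cached rate."""
    in_price, out_price, cached_price = PRICES[model]
    uncached = input_tokens - cached_tokens
    return (uncached * in_price + cached_tokens * cached_price
            + output_tokens * out_price) / 1_000_000

# Example: 10k input tokens (8k of them cached) + 500 output tokens on GPT-4o
print(f"${estimate_cost('gpt-4o', 10_000, 500, cached_tokens=8_000):.4f}")  # $0.0200
```

Running the same request on GPT-4o-mini instead would cost roughly 16x less, which is the intuition behind the routing and cascade patterns later in this guide.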

3. OpenAI Prompt Caching

OpenAI automatically caches prompt prefixes of 1,024 tokens or more, applying a 50% discount to input tokens served from the cache. Cache hits require requests to share an identical prefix, so place static content (system prompt, examples) before variable content.

# Define a long system prompt (must be > 1024 tokens for caching)
LONG_SYSTEM_PROMPT = """
You are a highly specialized financial analysis assistant. Your role is to help users
understand complex financial documents, analyze market trends, and provide insights
based on the data provided. You should always cite specific numbers and sources when
making claims. Follow these detailed guidelines:

1. Data Accuracy: Always verify calculations and cross-reference numbers.
2. Risk Assessment: Highlight potential risks in any financial recommendation.
3. Regulatory Compliance: Ensure all advice complies with standard financial regulations.
4. Historical Context: Provide historical comparisons when analyzing trends.
5. Clear Communication: Explain complex financial concepts in accessible language.

""" + "Additional context and guidelines. " * 200  # Pad to exceed 1024 tokens

def query_with_caching(user_message):
    """Send a request and check for cached tokens."""
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": LONG_SYSTEM_PROMPT},
            {"role": "user", "content": user_message}
        ]
    )

    usage = response.usage
    # prompt_tokens_details may be absent on older SDK versions or some models
    details = getattr(usage, "prompt_tokens_details", None)
    cached_tokens = details.cached_tokens if details else 0

    print(f"Total prompt tokens: {usage.prompt_tokens}")
    print(f"Cached tokens: {cached_tokens}")
    print(f"Cache hit rate: {cached_tokens / usage.prompt_tokens * 100:.1f}%")
    print(f"Response: {response.choices[0].message.content[:200]}...")

    return response

# First call - populates cache
print("=== First Call (Cold) ===")
query_with_caching("Analyze the current bond market trends.")

print("\n=== Second Call (Warm - should hit cache) ===")
query_with_caching("What are the risks of investing in tech stocks?")

4. Anthropic Prompt Caching

Anthropic provides explicit cache control with cache_control blocks, offering a 90% discount on cached reads.

# Anthropic prompt caching with explicit cache_control
SYSTEM_CONTENT = "You are a financial analysis expert. " * 300  # Long system prompt

def query_anthropic_with_caching(user_message):
    """Send a request with Anthropic's explicit prompt caching."""
    response = anthropic_client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": SYSTEM_CONTENT,
                "cache_control": {"type": "ephemeral"}
            }
        ],
        messages=[
            {"role": "user", "content": user_message}
        ]
    )

    usage = response.usage
    print(f"Input tokens: {usage.input_tokens}")
    print(f"Cache creation tokens: {getattr(usage, 'cache_creation_input_tokens', 0)}")
    print(f"Cache read tokens: {getattr(usage, 'cache_read_input_tokens', 0)}")
    print(f"Output tokens: {usage.output_tokens}")

    return response

# First call - creates cache
print("=== First Call (Cache Creation) ===")
query_anthropic_with_caching("Summarize key market indicators.")

print("\n=== Second Call (Cache Hit) ===")
query_anthropic_with_caching("What about inflation trends?")
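A quick back-of-envelope check helps decide whether caching a given prefix pays off. The sketch below assumes cache writes bill at 1.25x the base input rate and reads at 0.1x (the multipliers are assumptions to verify against Anthropic's current pricing page):

```python
# Back-of-envelope savings from Anthropic prompt caching.
# Assumed multipliers (verify against current Anthropic pricing):
# cache writes bill at 1.25x the base input rate, cache reads at 0.1x.
def caching_savings(prompt_tokens: int, num_requests: int,
                    write_multiplier: float = 1.25,
                    read_multiplier: float = 0.10) -> float:
    """Fraction of input-token cost saved vs. resending the prefix uncached."""
    uncached = prompt_tokens * num_requests  # cost in base-rate token units
    cached = prompt_tokens * (write_multiplier
                              + read_multiplier * (num_requests - 1))
    return 1 - cached / uncached

# One cache write plus nine cached reads of a 2,000-token prefix
print(f"Savings: {caching_savings(2_000, 10):.1%}")
```

Note that a prefix used only once costs *more* with caching (you pay the write premium and never read it back), so cache prompts you expect to reuse within the cache's lifetime.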

5. Model Routing

Route queries to the most cost-effective model based on complexity.

MODEL_MAP = {
    "simple": "gpt-4o-mini",
    "moderate": "gpt-4o",
    "complex": "gpt-4o"
}

def classify_complexity(query: str) -> str:
    """Use a cheap model to classify query complexity."""
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": "Classify the complexity of this query as 'simple', 'moderate', or 'complex'. Respond with only the classification word."
            },
            {"role": "user", "content": query}
        ],
        max_tokens=10
    )
    return response.choices[0].message.content.strip().lower()

def route_request(query: str) -> str:
    """Route the request to the appropriate model."""
    complexity = classify_complexity(query)
    model = MODEL_MAP.get(complexity, "gpt-4o-mini")

    print(f"Query: {query}")
    print(f"Classified as: {complexity} -> Using model: {model}")

    response = openai_client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": query}]
    )
    return response.choices[0].message.content

# Test with different complexity levels
queries = [
    "What is the capital of France?",
    "Explain the pros and cons of index fund investing.",
    "Build a multi-factor risk model for a diversified portfolio including derivatives."
]

for q in queries:
    result = route_request(q)
    print(f"Response preview: {result[:100]}...\n")

6. Cascade Pattern

Try a small, cheap model first and only escalate to a larger model if the response confidence is low.

def cascade_request(query: str, confidence_threshold: float = 0.7) -> dict:
    """Try small model first, escalate to large model if confidence is low."""

    # Step 1: Try with the small model
    small_response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": (
                    "Answer the question. At the end, rate your confidence "
                    "from 0.0 to 1.0 in the format: [CONFIDENCE: X.X]"
                )
            },
            {"role": "user", "content": query}
        ]
    )

    small_answer = small_response.choices[0].message.content

    # Extract the self-reported confidence score
    # (regex accepts both "0.85" and whole numbers like "1")
    import re
    match = re.search(r"\[CONFIDENCE:\s*(\d+(?:\.\d+)?)\]", small_answer)
    confidence = float(match.group(1)) if match else 0.5

    print(f"Small model confidence: {confidence}")

    if confidence >= confidence_threshold:
        print("-> Using small model response (cost-efficient)")
        return {"model": "gpt-4o-mini", "response": small_answer, "escalated": False}

    # Step 2: Escalate to large model
    print("-> Escalating to large model")
    large_response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": query}]
    )

    return {
        "model": "gpt-4o",
        "response": large_response.choices[0].message.content,
        "escalated": True
    }

# Test cascade pattern
result = cascade_request("What is 2 + 2?")
print(f"\nModel used: {result['model']}, Escalated: {result['escalated']}")

result = cascade_request("Derive the Black-Scholes formula from first principles.")
print(f"\nModel used: {result['model']}, Escalated: {result['escalated']}")

7. Prompt Optimization

Reduce token usage through efficient prompt engineering.

| Technique | Before | After | Savings |
|---|---|---|---|
| Remove filler words | "Could you please kindly…" | "List…" | ~30% |
| Use abbreviations | "information" | "info" | ~10% |
| Structured output | Free-form text | JSON schema | ~20% |
| Context window management | Full history | Summarized history | ~50% |

# Context window management with tiktoken
encoding = tiktoken.encoding_for_model("gpt-4o")

def count_tokens(text: str) -> int:
    """Count tokens in a text string."""
    return len(encoding.encode(text))

def trim_messages_to_budget(messages: list, max_tokens: int = 4000) -> list:
    """Trim conversation history to fit within a token budget."""
    total_tokens = sum(count_tokens(m["content"]) for m in messages)

    if total_tokens <= max_tokens:
        return messages

    # Always keep system message and last user message
    system_msg = messages[0] if messages[0]["role"] == "system" else None
    last_msg = messages[-1]

    trimmed = []
    if system_msg:
        trimmed.append(system_msg)

    remaining_budget = max_tokens - sum(count_tokens(m["content"]) for m in trimmed) - count_tokens(last_msg["content"])

    # Add messages from most recent, working backwards
    middle_messages = messages[1:-1] if system_msg else messages[:-1]
    kept = []
    for msg in reversed(middle_messages):
        msg_tokens = count_tokens(msg["content"])
        if remaining_budget >= msg_tokens:
            kept.insert(0, msg)
            remaining_budget -= msg_tokens
        else:
            break

    trimmed.extend(kept)
    trimmed.append(last_msg)

    print(f"Trimmed from {len(messages)} to {len(trimmed)} messages")
    print(f"Token count: {total_tokens} -> {sum(count_tokens(m['content']) for m in trimmed)}")

    return trimmed

# Example usage
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Tell me about Python."},
    {"role": "assistant", "content": "Python is a versatile programming language... " * 100},
    {"role": "user", "content": "What about its data science libraries?"},
    {"role": "assistant", "content": "Python has many data science libraries... " * 100},
    {"role": "user", "content": "Summarize the key takeaways."}
]

trimmed = trim_messages_to_budget(messages, max_tokens=500)

8. Semantic Caching with GPTCache

Cache semantically similar queries to avoid redundant API calls.

from gptcache import cache
from gptcache.adapter import openai as gptcache_openai
from gptcache.embedding import Onnx
from gptcache.manager import CacheBase, VectorBase, get_data_manager
from gptcache.similarity_evaluation import SearchDistanceEvaluation

# Initialize GPTCache with ONNX embeddings + FAISS
onnx = Onnx()
cache_base = CacheBase("sqlite")
vector_base = VectorBase("faiss", dimension=onnx.dimension)
data_manager = get_data_manager(cache_base, vector_base)

cache.init(
    embedding_func=onnx.to_embeddings,
    data_manager=data_manager,
    similarity_evaluation=SearchDistanceEvaluation()
)
cache.set_openai_key()

# First query - cache miss, calls API
start = time.time()
response1 = gptcache_openai.ChatCompletion.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What is machine learning?"}]
)
print(f"First call: {time.time() - start:.2f}s")
print(f"Response: {response1['choices'][0]['message']['content'][:200]}")

# Similar query - should hit cache
start = time.time()
response2 = gptcache_openai.ChatCompletion.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Explain what ML is"}]
)
print(f"\nSecond call (cached): {time.time() - start:.2f}s")
print(f"Response: {response2['choices'][0]['message']['content'][:200]}")

9. Custom Semantic Cache

Build a lightweight semantic cache using OpenAI embeddings and cosine similarity.

class SemanticCache:
    """Simple semantic cache using OpenAI embeddings."""

    def __init__(self, similarity_threshold: float = 0.92):
        self.cache = []  # List of (embedding, query, response)
        self.similarity_threshold = similarity_threshold

    def _get_embedding(self, text: str) -> list:
        response = openai_client.embeddings.create(
            model="text-embedding-3-small",
            input=text
        )
        return response.data[0].embedding

    def _cosine_similarity(self, a: list, b: list) -> float:
        a, b = np.array(a), np.array(b)
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    def lookup(self, query: str):
        """Check cache for a semantically similar query."""
        query_embedding = self._get_embedding(query)

        best_match = None
        best_score = 0

        for emb, cached_query, cached_response in self.cache:
            score = self._cosine_similarity(query_embedding, emb)
            if score > best_score:
                best_score = score
                best_match = (cached_query, cached_response)

        if best_score >= self.similarity_threshold and best_match:
            print(f"Cache HIT (similarity: {best_score:.4f})")
            print(f"Matched query: '{best_match[0]}'")
            return best_match[1]

        print(f"Cache MISS (best similarity: {best_score:.4f})")
        return None

    def store(self, query: str, response: str):
        """Store a query-response pair in cache."""
        embedding = self._get_embedding(query)
        self.cache.append((embedding, query, response))

    def query(self, query: str) -> str:
        """Query with cache - lookup first, then call API if needed."""
        cached = self.lookup(query)
        if cached:
            return cached

        response = openai_client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": query}]
        )
        answer = response.choices[0].message.content
        self.store(query, answer)
        return answer

# Test the custom cache
semantic_cache = SemanticCache(similarity_threshold=0.90)

print("=== Query 1 ===")
r1 = semantic_cache.query("What are the benefits of cloud computing?")
print(f"Response: {r1[:150]}...\n")

print("=== Query 2 (similar) ===")
r2 = semantic_cache.query("What advantages does cloud computing offer?")
print(f"Response: {r2[:150]}...")

10. Cost Monitoring

Production Cost Monitoring Tips

  • Track per-request costs: Log usage.prompt_tokens, usage.completion_tokens, and usage.total_tokens for every API call.
  • Set budget alerts: Use OpenAI’s usage dashboard or build custom alerts when daily spend exceeds thresholds.
  • Monitor cache hit rates: Track the ratio of cached vs. fresh requests to measure caching effectiveness.
  • Use vLLM prefix caching for self-hosted models: Enable with --enable-prefix-caching flag when starting the vLLM server:
# vLLM with prefix caching enabled
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --enable-prefix-caching \
    --max-model-len 8192
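Putting the first two tips together, a minimal in-process cost tracker might look like the sketch below. The `CostTracker` class, its `PRICING` dict, and the budget-alert behavior are illustrative; in production you would wire this into your own logging and alerting stack:

```python
from collections import defaultdict

# Illustrative in-process cost tracker; prices are USD per 1M tokens.
PRICING = {"gpt-4o": (2.50, 10.00), "gpt-4o-mini": (0.15, 0.60)}

class CostTracker:
    def __init__(self, daily_budget_usd: float):
        self.daily_budget_usd = daily_budget_usd
        self.spend = defaultdict(float)  # model -> USD spent
        self.calls = defaultdict(int)    # model -> request count

    def record(self, model: str, prompt_tokens: int, completion_tokens: int) -> float:
        """Log one request's token usage; return its cost and warn past budget."""
        in_price, out_price = PRICING[model]
        cost = (prompt_tokens * in_price + completion_tokens * out_price) / 1_000_000
        self.spend[model] += cost
        self.calls[model] += 1
        if self.total_spend() > self.daily_budget_usd:
            print(f"ALERT: daily spend ${self.total_spend():.2f} "
                  f"exceeds budget ${self.daily_budget_usd:.2f}")
        return cost

    def total_spend(self) -> float:
        return sum(self.spend.values())

tracker = CostTracker(daily_budget_usd=1.00)
# After each API call, pass the usage object's token counts, e.g.:
# tracker.record("gpt-4o", usage.prompt_tokens, usage.completion_tokens)
tracker.record("gpt-4o", 5_000, 1_000)
print(f"Total: ${tracker.total_spend():.4f}")  # Total: $0.0225
```

Keeping spend keyed by model also makes it easy to verify that routing and cascading are actually shifting traffic to the cheaper models.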

Summary of Cost Optimization Strategies

| Strategy | Potential Savings | Implementation Effort |
|---|---|---|
| Prompt Caching (OpenAI) | 50% on cached inputs | Low |
| Prompt Caching (Anthropic) | 90% on cached reads | Low |
| Model Routing | 60-80% on simple queries | Medium |
| Cascade Pattern | 40-70% overall | Medium |
| Prompt Optimization | 20-50% fewer tokens | Low |
| Semantic Caching | 80-95% on repeated queries | Medium |
| vLLM Prefix Caching | Variable (self-hosted) | High |