FinOps Best Practices for LLM Applications

A comprehensive guide to optimizing costs for LLM-powered applications

Table of Contents

  1. Setup & Installation
  2. Understanding LLM Cost Structure
  3. OpenAI Prompt Caching
  4. Anthropic Prompt Caching
  5. Model Routing
  6. Cascade Pattern
  7. Prompt Optimization
  8. Semantic Caching with GPTCache
  9. Custom Semantic Cache
  10. Cost Monitoring

1. Setup & Installation

!pip install -q openai anthropic gptcache tiktoken
import os
import json
import time
import numpy as np
from openai import OpenAI
from anthropic import Anthropic
import tiktoken

# Set API keys
os.environ["OPENAI_API_KEY"] = "your-openai-api-key"
os.environ["ANTHROPIC_API_KEY"] = "your-anthropic-api-key"

openai_client = OpenAI()
anthropic_client = Anthropic()

2. Understanding LLM Cost Structure

LLM costs are driven by token usage. Representative list prices at the time of writing (check current rate cards before estimating):

| Model | Input (per 1M tokens) | Output (per 1M tokens) | Cached Input (per 1M tokens) |
|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | $1.25 |
| GPT-4o-mini | $0.15 | $0.60 | $0.075 |
| Claude 3.5 Sonnet | $3.00 | $15.00 | $0.30 |
| Claude 3.5 Haiku | $0.80 | $4.00 | $0.08 |

Key cost drivers:

  • Input tokens: system prompts, context, and user messages
  • Output tokens: model-generated responses
  • Frequency: number of API calls per unit time
  • Model choice: prices vary 10-100x between models
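To make these numbers concrete, here is a minimal sketch of a per-request cost estimator built from the pricing table above. The `PRICES` dict and `estimate_cost` helper are illustrative, not part of any SDK:

```python
# Illustrative per-request cost estimator based on the pricing table above.
# PRICES maps model name -> (input, output, cached input) USD per 1M tokens.
PRICES = {
    "gpt-4o":            (2.50, 10.00, 1.25),
    "gpt-4o-mini":       (0.15,  0.60, 0.075),
    "claude-3-5-sonnet": (3.00, 15.00, 0.30),
    "claude-3-5-haiku":  (0.80,  4.00, 0.08),
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int,
                  cached_tokens: int = 0) -> float:
    """Estimate one request's cost in USD; cached tokens bill at the cached rate."""
    in_price, out_price, cached_price = PRICES[model]
    uncached = input_tokens - cached_tokens
    return (uncached * in_price + cached_tokens * cached_price
            + output_tokens * out_price) / 1_000_000

# Example: 10k input tokens (8k of them cached) + 500 output tokens on GPT-4o
print(f"${estimate_cost('gpt-4o', 10_000, 500, cached_tokens=8_000):.4f}")  # $0.0200
```

Running the same request on GPT-4o-mini instead would cost roughly 16x less, which is the intuition behind the routing and cascade patterns later in this guide.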

3. OpenAI Prompt Caching

OpenAI automatically caches prompt prefixes of 1,024 tokens or more, applying a 50% discount to input tokens served from the cache. Cache hits require requests to share an identical prefix, so place static content (system prompt, examples) before variable content.

# Define a long system prompt (must be > 1024 tokens for caching)
LONG_SYSTEM_PROMPT = """
You are a highly specialized financial analysis assistant. Your role is to help users
understand complex financial documents, analyze market trends, and provide insights
based on the data provided. You should always cite specific numbers and sources when
making claims. Follow these detailed guidelines:

1. Data Accuracy: Always verify calculations and cross-reference numbers.
2. Risk Assessment: Highlight potential risks in any financial recommendation.
3. Regulatory Compliance: Ensure all advice complies with standard financial regulations.
4. Historical Context: Provide historical comparisons when analyzing trends.
5. Clear Communication: Explain complex financial concepts in accessible language.

""" + "Additional context and guidelines. " * 200  # Pad to exceed 1024 tokens

def query_with_caching(user_message):
    """Send a request and check for cached tokens."""
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": LONG_SYSTEM_PROMPT},
            {"role": "user", "content": user_message}
        ]
    )

    usage = response.usage
    # prompt_tokens_details may be absent on older SDK versions or some models
    details = getattr(usage, "prompt_tokens_details", None)
    cached_tokens = details.cached_tokens if details else 0

    print(f"Total prompt tokens: {usage.prompt_tokens}")
    print(f"Cached tokens: {cached_tokens}")
    print(f"Cache hit rate: {cached_tokens / usage.prompt_tokens * 100:.1f}%")
    print(f"Response: {response.choices[0].message.content[:200]}...")

    return response

# First call - populates cache
print("=== First Call (Cold) ===")
query_with_caching("Analyze the current bond market trends.")

print("\n=== Second Call (Warm - should hit cache) ===")
query_with_caching("What are the risks of investing in tech stocks?")

4. Anthropic Prompt Caching

Anthropic provides explicit cache control with cache_control blocks, offering a 90% discount on cached reads.

# Anthropic prompt caching with explicit cache_control
SYSTEM_CONTENT = "You are a financial analysis expert. " * 300  # Long system prompt

def query_anthropic_with_caching(user_message):
    """Send a request with Anthropic's explicit prompt caching."""
    response = anthropic_client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": SYSTEM_CONTENT,
                "cache_control": {"type": "ephemeral"}
            }
        ],
        messages=[
            {"role": "user", "content": user_message}
        ]
    )

    usage = response.usage
    print(f"Input tokens: {usage.input_tokens}")
    print(f"Cache creation tokens: {getattr(usage, 'cache_creation_input_tokens', 0)}")
    print(f"Cache read tokens: {getattr(usage, 'cache_read_input_tokens', 0)}")
    print(f"Output tokens: {usage.output_tokens}")

    return response

# First call - creates cache
print("=== First Call (Cache Creation) ===")
query_anthropic_with_caching("Summarize key market indicators.")

print("\n=== Second Call (Cache Hit) ===")
query_anthropic_with_caching("What about inflation trends?")
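A quick back-of-envelope check helps decide whether caching a given prefix pays off. The sketch below assumes cache writes bill at 1.25x the base input rate and reads at 0.1x (the multipliers are assumptions to verify against Anthropic's current pricing page):

```python
# Back-of-envelope savings from Anthropic prompt caching.
# Assumed multipliers (verify against current Anthropic pricing):
# cache writes bill at 1.25x the base input rate, cache reads at 0.1x.
def caching_savings(prompt_tokens: int, num_requests: int,
                    write_multiplier: float = 1.25,
                    read_multiplier: float = 0.10) -> float:
    """Fraction of input-token cost saved vs. resending the prefix uncached."""
    uncached = prompt_tokens * num_requests  # cost in base-rate token units
    cached = prompt_tokens * (write_multiplier
                              + read_multiplier * (num_requests - 1))
    return 1 - cached / uncached

# One cache write plus nine cached reads of a 2,000-token prefix
print(f"Savings: {caching_savings(2_000, 10):.1%}")
```

Note that a prefix used only once costs *more* with caching (you pay the write premium and never read it back), so cache prompts you expect to reuse within the cache's lifetime.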

5. Model Routing

Route queries to the most cost-effective model based on complexity.

MODEL_MAP = {
    "simple": "gpt-4o-mini",
    "moderate": "gpt-4o",
    "complex": "gpt-4o"
}

def classify_complexity(query: str) -> str:
    """Use a cheap model to classify query complexity."""
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": "Classify the complexity of this query as 'simple', 'moderate', or 'complex'. Respond with only the classification word."
            },
            {"role": "user", "content": query}
        ],
        max_tokens=10
    )
    return response.choices[0].message.content.strip().lower()

def route_request(query: str) -> str:
    """Route the request to the appropriate model."""
    complexity = classify_complexity(query)
    model = MODEL_MAP.get(complexity, "gpt-4o-mini")

    print(f"Query: {query}")
    print(f"Classified as: {complexity} -> Using model: {model}")

    response = openai_client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": query}]
    )
    return response.choices[0].message.content

# Test with different complexity levels
queries = [
    "What is the capital of France?",
    "Explain the pros and cons of index fund investing.",
    "Build a multi-factor risk model for a diversified portfolio including derivatives."
]

for q in queries:
    result = route_request(q)
    print(f"Response preview: {result[:100]}...\n")

6. Cascade Pattern

Try a small, cheap model first and only escalate to a larger model if the response confidence is low.

def cascade_request(query: str, confidence_threshold: float = 0.7) -> dict:
    """Try small model first, escalate to large model if confidence is low."""

    # Step 1: Try with the small model
    small_response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": (
                    "Answer the question. At the end, rate your confidence "
                    "from 0.0 to 1.0 in the format: [CONFIDENCE: X.X]"
                )
            },
            {"role": "user", "content": query}
        ]
    )

    small_answer = small_response.choices[0].message.content

    # Extract the self-reported confidence score
    # (regex accepts both "0.85" and whole numbers like "1")
    import re
    match = re.search(r"\[CONFIDENCE:\s*(\d+(?:\.\d+)?)\]", small_answer)
    confidence = float(match.group(1)) if match else 0.5

    print(f"Small model confidence: {confidence}")

    if confidence >= confidence_threshold:
        print("-> Using small model response (cost-efficient)")
        return {"model": "gpt-4o-mini", "response": small_answer, "escalated": False}

    # Step 2: Escalate to large model
    print("-> Escalating to large model")
    large_response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": query}]
    )

    return {
        "model": "gpt-4o",
        "response": large_response.choices[0].message.content,
        "escalated": True
    }

# Test cascade pattern
result = cascade_request("What is 2 + 2?")
print(f"\nModel used: {result['model']}, Escalated: {result['escalated']}")

result = cascade_request("Derive the Black-Scholes formula from first principles.")
print(f"\nModel used: {result['model']}, Escalated: {result['escalated']}")

7. Prompt Optimization

Reduce token usage through efficient prompt engineering.

| Technique | Before | After | Savings |
|---|---|---|---|
| Remove filler words | "Could you please kindly…" | "List…" | ~30% |
| Use abbreviations | "information" | "info" | ~10% |
| Structured output | Free-form text | JSON schema | ~20% |
| Context window management | Full history | Summarized history | ~50% |

# Context window management with tiktoken
encoding = tiktoken.encoding_for_model("gpt-4o")

def count_tokens(text: str) -> int:
    """Count tokens in a text string."""
    return len(encoding.encode(text))

def trim_messages_to_budget(messages: list, max_tokens: int = 4000) -> list:
    """Trim conversation history to fit within a token budget."""
    total_tokens = sum(count_tokens(m["content"]) for m in messages)

    if total_tokens <= max_tokens:
        return messages

    # Always keep system message and last user message
    system_msg = messages[0] if messages[0]["role"] == "system" else None
    last_msg = messages[-1]

    trimmed = []
    if system_msg:
        trimmed.append(system_msg)

    remaining_budget = max_tokens - sum(count_tokens(m["content"]) for m in trimmed) - count_tokens(last_msg["content"])

    # Add messages from most recent, working backwards
    middle_messages = messages[1:-1] if system_msg else messages[:-1]
    kept = []
    for msg in reversed(middle_messages):
        msg_tokens = count_tokens(msg["content"])
        if remaining_budget >= msg_tokens:
            kept.insert(0, msg)
            remaining_budget -= msg_tokens
        else:
            break

    trimmed.extend(kept)
    trimmed.append(last_msg)

    print(f"Trimmed from {len(messages)} to {len(trimmed)} messages")
    print(f"Token count: {total_tokens} -> {sum(count_tokens(m['content']) for m in trimmed)}")

    return trimmed

# Example usage
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Tell me about Python."},
    {"role": "assistant", "content": "Python is a versatile programming language... " * 100},
    {"role": "user", "content": "What about its data science libraries?"},
    {"role": "assistant", "content": "Python has many data science libraries... " * 100},
    {"role": "user", "content": "Summarize the key takeaways."}
]

trimmed = trim_messages_to_budget(messages, max_tokens=500)

8. Semantic Caching with GPTCache

Cache semantically similar queries to avoid redundant API calls.

from gptcache import cache
from gptcache.adapter import openai as gptcache_openai
from gptcache.embedding import Onnx
from gptcache.manager import CacheBase, VectorBase, get_data_manager
from gptcache.similarity_evaluation import SearchDistanceEvaluation

# Initialize GPTCache with ONNX embeddings + FAISS
onnx = Onnx()
cache_base = CacheBase("sqlite")
vector_base = VectorBase("faiss", dimension=onnx.dimension)
data_manager = get_data_manager(cache_base, vector_base)

cache.init(
    embedding_func=onnx.to_embeddings,
    data_manager=data_manager,
    similarity_evaluation=SearchDistanceEvaluation()
)
cache.set_openai_key()

# First query - cache miss, calls API
start = time.time()
response1 = gptcache_openai.ChatCompletion.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What is machine learning?"}]
)
print(f"First call: {time.time() - start:.2f}s")
print(f"Response: {response1['choices'][0]['message']['content'][:200]}")

# Similar query - should hit cache
start = time.time()
response2 = gptcache_openai.ChatCompletion.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Explain what ML is"}]
)
print(f"\nSecond call (cached): {time.time() - start:.2f}s")
print(f"Response: {response2['choices'][0]['message']['content'][:200]}")

9. Custom Semantic Cache

Build a lightweight semantic cache using OpenAI embeddings and cosine similarity.

class SemanticCache:
    """Simple semantic cache using OpenAI embeddings."""

    def __init__(self, similarity_threshold: float = 0.92):
        self.cache = []  # List of (embedding, query, response)
        self.similarity_threshold = similarity_threshold

    def _get_embedding(self, text: str) -> list:
        response = openai_client.embeddings.create(
            model="text-embedding-3-small",
            input=text
        )
        return response.data[0].embedding

    def _cosine_similarity(self, a: list, b: list) -> float:
        a, b = np.array(a), np.array(b)
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    def lookup(self, query: str):
        """Check cache for a semantically similar query."""
        query_embedding = self._get_embedding(query)

        best_match = None
        best_score = 0

        for emb, cached_query, cached_response in self.cache:
            score = self._cosine_similarity(query_embedding, emb)
            if score > best_score:
                best_score = score
                best_match = (cached_query, cached_response)

        if best_score >= self.similarity_threshold and best_match:
            print(f"Cache HIT (similarity: {best_score:.4f})")
            print(f"Matched query: '{best_match[0]}'")
            return best_match[1]

        print(f"Cache MISS (best similarity: {best_score:.4f})")
        return None

    def store(self, query: str, response: str):
        """Store a query-response pair in cache."""
        embedding = self._get_embedding(query)
        self.cache.append((embedding, query, response))

    def query(self, query: str) -> str:
        """Query with cache - lookup first, then call API if needed."""
        cached = self.lookup(query)
        if cached:
            return cached

        response = openai_client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": query}]
        )
        answer = response.choices[0].message.content
        self.store(query, answer)
        return answer

# Test the custom cache
semantic_cache = SemanticCache(similarity_threshold=0.90)

print("=== Query 1 ===")
r1 = semantic_cache.query("What are the benefits of cloud computing?")
print(f"Response: {r1[:150]}...\n")

print("=== Query 2 (similar) ===")
r2 = semantic_cache.query("What advantages does cloud computing offer?")
print(f"Response: {r2[:150]}...")

10. Cost Monitoring

Production Cost Monitoring Tips

  • Track per-request costs: Log usage.prompt_tokens, usage.completion_tokens, and usage.total_tokens for every API call.
  • Set budget alerts: Use OpenAI’s usage dashboard or build custom alerts when daily spend exceeds thresholds.
  • Monitor cache hit rates: Track the ratio of cached vs. fresh requests to measure caching effectiveness.
  • Use vLLM prefix caching for self-hosted models: Enable with --enable-prefix-caching flag when starting the vLLM server:
# vLLM with prefix caching enabled
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --enable-prefix-caching \
    --max-model-len 8192
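Putting the first two tips together, a minimal in-process cost tracker might look like the sketch below. The `CostTracker` class, its `PRICING` dict, and the budget-alert behavior are illustrative; in production you would wire this into your own logging and alerting stack:

```python
from collections import defaultdict

# Illustrative in-process cost tracker; prices are USD per 1M tokens.
PRICING = {"gpt-4o": (2.50, 10.00), "gpt-4o-mini": (0.15, 0.60)}

class CostTracker:
    def __init__(self, daily_budget_usd: float):
        self.daily_budget_usd = daily_budget_usd
        self.spend = defaultdict(float)  # model -> USD spent
        self.calls = defaultdict(int)    # model -> request count

    def record(self, model: str, prompt_tokens: int, completion_tokens: int) -> float:
        """Log one request's token usage; return its cost and warn past budget."""
        in_price, out_price = PRICING[model]
        cost = (prompt_tokens * in_price + completion_tokens * out_price) / 1_000_000
        self.spend[model] += cost
        self.calls[model] += 1
        if self.total_spend() > self.daily_budget_usd:
            print(f"ALERT: daily spend ${self.total_spend():.2f} "
                  f"exceeds budget ${self.daily_budget_usd:.2f}")
        return cost

    def total_spend(self) -> float:
        return sum(self.spend.values())

tracker = CostTracker(daily_budget_usd=1.00)
# After each API call, pass the usage object's token counts, e.g.:
# tracker.record("gpt-4o", usage.prompt_tokens, usage.completion_tokens)
tracker.record("gpt-4o", 5_000, 1_000)
print(f"Total: ${tracker.total_spend():.4f}")  # Total: $0.0225
```

Keeping spend keyed by model also makes it easy to verify that routing and cascading are actually shifting traffic to the cheaper models.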

Summary of Cost Optimization Strategies

| Strategy | Potential Savings | Implementation Effort |
|---|---|---|
| Prompt Caching (OpenAI) | 50% on cached inputs | Low |
| Prompt Caching (Anthropic) | 90% on cached reads | Low |
| Model Routing | 60-80% on simple queries | Medium |
| Cascade Pattern | 40-70% overall | Medium |
| Prompt Optimization | 20-50% fewer tokens | Low |
| Semantic Caching | 80-95% on repeated queries | Medium |
| vLLM Prefix Caching | Variable (self-hosted) | High |