Scaling LLM Serving for Enterprise Production

A comprehensive guide to scaling large language model inference from a single GPU to millions of requests per second



Table of Contents

  1. Setup & Installation
  2. The Scaling Roadmap
  3. Hardware Foundations
  4. LLM Serving Engines Comparison
  5. Tensor Parallelism
  6. Pipeline Parallelism
  7. Multi-Node Deployment
  8. Load Balancing with Nginx
  9. Kubernetes Orchestration
  10. Scaling Math

1. Setup & Installation

```python
!pip install -q vllm openai requests
```

2. The Scaling Roadmap

Scaling LLM serving follows a progressive path from a single GPU to full orchestration:

| Tier | Setup | Throughput | Use Case |
|------|-------|------------|----------|
| 1 | Single GPU | ~10–50 req/s | Development, prototyping |
| 2 | Multi-GPU (Tensor Parallelism) | ~50–200 req/s | Small-scale production |
| 3 | Multi-Node | ~200–2K req/s | Medium enterprise |
| 4 | Load-Balanced Replicas | ~2K–100K req/s | Large-scale production |
| 5 | Full Orchestration (K8s) | ~100K–1M+ req/s | Hyperscale enterprise |

Each tier builds on the previous one. Start simple and scale as demand grows.

3. Hardware Foundations

GPU Selection Guide

| GPU | VRAM | FP16 TFLOPS | Best For |
|-----|------|-------------|----------|
| A100 | 40/80 GB | 312 | General-purpose LLM serving |
| H100 | 80 GB | 990 | High-throughput production |
| L40S | 48 GB | 362 | Cost-effective inference |
| H200 | 141 GB | 990 | Very large models |
| B200 | 192 GB | 2250 | Next-gen flagship |

Memory Planning

Model memory estimate (FP16):

Memory (GB) β‰ˆ Parameters (B) Γ— 2

KV cache memory per request:

KV Cache (GB) = 2 Γ— num_layers Γ— hidden_dim Γ— seq_len Γ— 2 bytes / 1e9

Total memory:

Total = Model Weights + KV Cache Γ— max_concurrent_requests + Overhead (~10-20%)
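The formulas above can be combined into a quick planning helper. A minimal sketch; the example numbers (8B parameters, 32 layers, hidden dimension 4096) roughly correspond to a Llama-3-8B-like configuration and are illustrative:

```python
def model_memory_gb(params_b: float) -> float:
    """FP16 weights: ~2 bytes per parameter; params_b is in billions."""
    return params_b * 2

def kv_cache_gb(num_layers: int, hidden_dim: int, seq_len: int) -> float:
    """Per-request KV cache at FP16: 2 (K and V) x layers x hidden x seq x 2 bytes."""
    return 2 * num_layers * hidden_dim * seq_len * 2 / 1e9

def total_memory_gb(params_b: float, num_layers: int, hidden_dim: int,
                    seq_len: int, max_concurrent: int,
                    overhead: float = 0.15) -> float:
    """Model weights + KV cache for all concurrent requests + ~15% overhead."""
    weights = model_memory_gb(params_b)
    kv = kv_cache_gb(num_layers, hidden_dim, seq_len) * max_concurrent
    return (weights + kv) * (1 + overhead)

# 8B model, 32 layers, hidden 4096, 4K context, 16 concurrent requests
print(f"{total_memory_gb(8, 32, 4096, 4096, 16):.1f} GB")  # roughly 58 GB
```

Note that the KV cache formula assumes full multi-head attention; models using grouped-query attention keep fewer KV heads, so their real cache footprint is proportionally smaller.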

4. LLM Serving Engines Comparison

| Engine | Key Feature | Throughput | Multi-GPU | Best For |
|--------|-------------|------------|-----------|----------|
| vLLM | PagedAttention, continuous batching | Very High | TP, PP | General-purpose production |
| SGLang | RadixAttention, structured generation | Very High | TP | Structured output, multi-turn |
| TGI | HuggingFace integration | High | TP | HF ecosystem users |
| NVIDIA Triton | Multi-model, multi-framework | High | Full NVIDIA stack | Enterprise NVIDIA deployments |

5. Tensor Parallelism

Tensor Parallelism (TP) splits individual layers across multiple GPUs, allowing you to serve models that don’t fit on a single GPU and increase throughput.

Serve a model with Tensor Parallelism across 4 GPUs using vLLM (run in a terminal):

```bash
vllm serve meta-llama/Llama-3-70B-Instruct \
    --tensor-parallel-size 4 \
    --max-model-len 4096 \
    --port 8000
```

Query the endpoint:

```python
import requests

response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "meta-llama/Llama-3-70B-Instruct",
        "prompt": "Explain tensor parallelism in one paragraph.",
        "max_tokens": 256
    }
)
print(response.json())
```

6. Pipeline Parallelism

Pipeline Parallelism (PP) splits the model’s layers across GPUs sequentially. It can be combined with Tensor Parallelism for maximum scalability.

Run in a terminal:

```bash
# Pipeline Parallelism alone
vllm serve meta-llama/Llama-3-70B-Instruct \
    --pipeline-parallel-size 2 \
    --port 8000

# Combined TP + PP (e.g., 8 GPUs = TP=4 x PP=2)
vllm serve meta-llama/Llama-3-405B-Instruct \
    --tensor-parallel-size 4 \
    --pipeline-parallel-size 2 \
    --max-model-len 4096 \
    --port 8000
```

The total GPU count is the product of both degrees: `total GPUs = tensor_parallel_size × pipeline_parallel_size`.
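As a sanity check on GPU counts and per-GPU memory for a combined layout, a small sketch; it assumes FP16 weights split evenly across all GPUs and ignores KV cache and activation overhead:

```python
def parallel_gpu_plan(params_b: float, tp: int, pp: int) -> dict:
    """Total GPUs and approximate FP16 weight share per GPU for a TP x PP layout."""
    total_gpus = tp * pp        # every pipeline stage holds one tensor-parallel group
    weights_gb = params_b * 2   # FP16: ~2 bytes per parameter
    return {
        "total_gpus": total_gpus,
        "weights_per_gpu_gb": weights_gb / total_gpus,
    }

# 405B model on 8 GPUs laid out as TP=4 x PP=2
print(parallel_gpu_plan(405, tp=4, pp=2))
```

For a 405B model this already yields over 100 GB of weights per GPU at FP16, which is why such deployments use larger GPU counts or quantization in practice.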

7. Multi-Node Deployment

For models that require more GPUs than a single server can provide, distribute the deployment across nodes with Ray. vLLM's native multiprocessing backend is simpler but only spans GPUs within a single node.

Ray Cluster Setup

Head node:

```bash
ray start --head --port=6379
```

Worker node(s):

```bash
ray start --address=<head-node-ip>:6379
```

Launch vLLM on the Ray cluster:

```bash
vllm serve meta-llama/Llama-3-405B-Instruct \
    --tensor-parallel-size 8 \
    --pipeline-parallel-size 2 \
    --distributed-executor-backend ray
```

Native Multiprocessing

vLLM also supports native multiprocessing without Ray:

```bash
vllm serve meta-llama/Llama-3-70B-Instruct \
    --tensor-parallel-size 4 \
    --distributed-executor-backend mp
```

8. Load Balancing with Nginx

Use Nginx as a reverse proxy to distribute traffic across multiple vLLM replicas.

Nginx configuration:

```nginx
upstream vllm_backends {
    least_conn;
    server vllm-1:8000;
    server vllm-2:8000;
    server vllm-3:8000;
    server vllm-4:8000;
}

server {
    listen 80;
    location / {
        proxy_pass http://vllm_backends;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
```
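Open-source Nginx also applies passive health checks via the `max_fails` and `fail_timeout` parameters on each upstream `server` line; a sketch with illustrative thresholds:

```nginx
upstream vllm_backends {
    least_conn;
    # Mark a backend unavailable for 30s after 3 consecutive failed requests
    server vllm-1:8000 max_fails=3 fail_timeout=30s;
    server vllm-2:8000 max_fails=3 fail_timeout=30s;
}
```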

docker-compose.yml with multiple vLLM replicas:

```yaml
services:
  vllm-1:
    image: vllm/vllm-openai:latest
    command: --model meta-llama/Llama-3-8B-Instruct --port 8000
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
              count: 1
  vllm-2:
    image: vllm/vllm-openai:latest
    command: --model meta-llama/Llama-3-8B-Instruct --port 8000
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
              count: 1
  nginx:
    image: nginx:latest
    ports:
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf
    depends_on:
      - vllm-1
      - vllm-2
```

9. Kubernetes Orchestration

Install with Helm:

```bash
helm install vllm vllm/vllm --values values.yaml
```

values.yaml:

```yaml
servingEngineSpec:
  modelSpec:
    - name: "llama-3-8b"
      repository: "meta-llama/Llama-3-8B-Instruct"
      tensorParallelSize: 1
  replicaCount: 4
  resources:
    limits:
      nvidia.com/gpu: 1
```

HorizontalPodAutoscaler (HPA):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-deployment
  minReplicas: 2
  maxReplicas: 16
  metrics:
    - type: Pods
      pods:
        metric:
          name: vllm_requests_running
        target:
          type: AverageValue
          averageValue: 50
```

Note that scaling on a custom per-pod metric such as `vllm_requests_running` requires a metrics adapter (e.g., prometheus-adapter) to expose it through the Kubernetes custom metrics API.

10. Scaling Math

Total Throughput Formula

Total Throughput (req/s) = per_replica_throughput Γ— num_replicas

Replicas Throughput Table

| Replicas | Per-Replica (req/s) | Total Throughput (req/s) |
|----------|---------------------|--------------------------|
| 1 | 50 | 50 |
| 4 | 50 | 200 |
| 16 | 50 | 800 |
| 64 | 50 | 3,200 |
| 256 | 50 | 12,800 |
| 1,000 | 50 | 50,000 |
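The table rows follow directly from the formula; a small sizing helper (the 50 req/s per-replica figure is the illustrative value used above):

```python
import math

def total_throughput(per_replica: float, replicas: int) -> float:
    """Aggregate throughput across identical replicas."""
    return per_replica * replicas

def replicas_needed(target_rps: float, per_replica: float) -> int:
    """Round up: partial replicas don't exist."""
    return math.ceil(target_rps / per_replica)

print(total_throughput(50, 64))      # matches the 64-replica table row
print(replicas_needed(100_000, 50))  # replicas needed for 100K req/s
```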

Key factors affecting per-replica throughput:

- Model size and quantization
- GPU type and memory bandwidth
- Batch size and sequence length
- Serving engine optimizations (PagedAttention, continuous batching)
- Network latency between nodes