Scaling LLM Serving for Enterprise Production
A comprehensive guide to scaling large language model inference from single GPU to millions of requests per second
Table of Contents
1. Setup & Installation
2. The Scaling Roadmap
3. Hardware Foundations
4. LLM Serving Engines Comparison
5. Tensor Parallelism
6. Pipeline Parallelism
7. Multi-Node Deployment
8. Load Balancing with Nginx
9. Kubernetes Orchestration
10. Scaling Math
1. Setup & Installation
!pip install -q vllm openai requests
2. The Scaling Roadmap
Scaling LLM serving follows a progressive path from a single GPU to full orchestration:
| Tier | Setup | Throughput | Use Case |
|---|---|---|---|
| 1 | Single GPU | ~10–50 req/s | Development, prototyping |
| 2 | Multi-GPU (Tensor Parallelism) | ~50–200 req/s | Small-scale production |
| 3 | Multi-Node | ~200–2K req/s | Medium enterprise |
| 4 | Load-Balanced Replicas | ~2K–100K req/s | Large-scale production |
| 5 | Full Orchestration (K8s) | ~100K–1M+ req/s | Hyperscale enterprise |
Each tier builds on the previous one. Start simple and scale as demand grows.
3. Hardware Foundations
GPU Selection Guide
| GPU | VRAM | FP16 TFLOPS | Best For |
|---|---|---|---|
| A100 | 40/80 GB | 312 | General-purpose LLM serving |
| H100 | 80 GB | 990 | High-throughput production |
| L40S | 48 GB | 362 | Cost-effective inference |
| H200 | 141 GB | 990 | Very large models |
| B200 | 192 GB | 2250 | Next-gen flagship |
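As a quick sanity check, the VRAM column above can be turned into a small helper that picks the smallest GPU whose memory holds a model's FP16 weights. This is a sketch only: real sizing must also budget for KV cache and overhead, as covered in the next section.

```python
# Pick the smallest-VRAM GPU from the table above that fits a model's
# FP16 weights alone. Illustrative only: real sizing must also budget
# KV cache and ~10-20% overhead.
GPUS = [
    ("A100", 40), ("L40S", 48), ("A100-80", 80), ("H100", 80),
    ("H200", 141), ("B200", 192),
]

def fp16_weights_gb(params_billion: float) -> float:
    """FP16 weights take ~2 bytes per parameter."""
    return params_billion * 2

def smallest_fitting_gpu(params_billion: float) -> str:
    need = fp16_weights_gb(params_billion)
    for name, vram in sorted(GPUS, key=lambda g: g[1]):
        if vram >= need:
            return name
    return "multi-GPU required (tensor/pipeline parallelism)"

print(smallest_fitting_gpu(8))    # 8B model -> 16 GB of weights fits an A100-40
print(smallest_fitting_gpu(70))   # 70B model -> 140 GB of weights needs an H200
```

Anything past ~96B parameters exceeds even a single B200 at FP16, which is exactly where the tensor and pipeline parallelism sections below pick up.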
Memory Planning
Model memory estimate (FP16):
Memory (GB) ≈ Parameters (B) × 2
KV cache memory per request (the leading 2 covers keys and values; the trailing 2 bytes is the FP16 element size):
KV Cache (GB) = 2 × num_layers × hidden_dim × seq_len × 2 bytes / 1e9
Total memory:
Total = Model Weights + KV Cache × max_concurrent_requests + Overhead (~10–20%)
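The three formulas above combine into a short planning script. A sketch: the layer count and hidden size are the published Llama-3-70B values, and the 15% overhead factor is an assumption in the stated 10–20% range.

```python
def model_weights_gb(params_billion: float) -> float:
    # FP16: ~2 bytes per parameter
    return params_billion * 2

def kv_cache_gb(num_layers: int, hidden_dim: int, seq_len: int) -> float:
    # 2 (keys + values) x layers x hidden dim x tokens x 2 bytes (FP16)
    return 2 * num_layers * hidden_dim * seq_len * 2 / 1e9

def total_memory_gb(params_billion: float, num_layers: int, hidden_dim: int,
                    seq_len: int, max_concurrent: int,
                    overhead: float = 0.15) -> float:
    weights = model_weights_gb(params_billion)
    kv = kv_cache_gb(num_layers, hidden_dim, seq_len) * max_concurrent
    return (weights + kv) * (1 + overhead)

# Llama-3-70B-style config: 80 layers, hidden size 8192
print(model_weights_gb(70))                        # 140 GB of weights
print(round(kv_cache_gb(80, 8192, 4096), 2))       # ~10.74 GB per 4K-token request
print(round(total_memory_gb(70, 80, 8192, 4096, 10), 1))
```

At 10 concurrent 4K-token requests the KV cache alone is comparable to the weights, which is why serving capacity is usually memory-bound rather than compute-bound.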
4. LLM Serving Engines Comparison
| Engine | Key Feature | Throughput | Multi-GPU | Best For |
|---|---|---|---|---|
| vLLM | PagedAttention, continuous batching | Very High | TP, PP | General-purpose production |
| SGLang | RadixAttention, structured generation | Very High | TP | Structured output, multi-turn |
| TGI | HuggingFace integration | High | TP | HF ecosystem users |
| NVIDIA Triton | Multi-model, multi-framework | High | Full NVIDIA stack | Enterprise NVIDIA deployments |
5. Tensor Parallelism
Tensor Parallelism (TP) splits individual layers across multiple GPUs, allowing you to serve models that don't fit on a single GPU and to increase throughput.
# Serve a model with Tensor Parallelism across 4 GPUs using vLLM
# Run this in the terminal:
# vllm serve meta-llama/Llama-3-70B-Instruct \
# --tensor-parallel-size 4 \
# --max-model-len 4096 \
# --port 8000
# Query the endpoint
import requests
response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "meta-llama/Llama-3-70B-Instruct",
        "prompt": "Explain tensor parallelism in one paragraph.",
        "max_tokens": 256,
    },
)
print(response.json())
6. Pipeline Parallelism
Pipeline Parallelism (PP) splits the model's layers across GPUs sequentially. It can be combined with Tensor Parallelism for maximum scalability.
# Pipeline Parallelism with vLLM
# Run in terminal:
# vllm serve meta-llama/Llama-3-70B-Instruct \
# --pipeline-parallel-size 2 \
# --port 8000
# Combined TP + PP (e.g., 8 GPUs = TP=4 x PP=2)
# vllm serve meta-llama/Llama-3-405B-Instruct \
# --tensor-parallel-size 4 \
# --pipeline-parallel-size 2 \
# --max-model-len 4096 \
# --port 8000
print("Pipeline Parallelism splits layers across GPUs sequentially.")
print("Combined TP+PP example: 8 GPUs = TP=4 × PP=2")
print("Total GPUs required = tensor_parallel_size × pipeline_parallel_size")
7. Multi-Node Deployment
For models that require more GPUs than a single server can provide, use multi-node deployment with Ray or native multiprocessing.
Ray Cluster Setup
Head node:
ray start --head --port=6379
Worker node(s):
ray start --address=<head-node-ip>:6379
Launch vLLM on the Ray cluster:
vllm serve meta-llama/Llama-3-405B-Instruct \
--tensor-parallel-size 8 \
--pipeline-parallel-size 2 \
--distributed-executor-backend ray
Native Multiprocessing
vLLM also supports native multiprocessing without Ray:
vllm serve meta-llama/Llama-3-70B-Instruct \
--tensor-parallel-size 4 \
--distributed-executor-backend mp
8. Load Balancing with Nginx
Use Nginx as a reverse proxy to distribute traffic across multiple vLLM replicas.
Nginx configuration:
upstream vllm_backends {
    least_conn;
    server vllm-1:8000;
    server vllm-2:8000;
    server vllm-3:8000;
    server vllm-4:8000;
}
server {
    listen 80;
    location / {
        proxy_pass http://vllm_backends;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
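Before wiring backends into the upstream block, it is worth verifying that each replica is actually serving. A minimal probe, assuming each replica exposes vLLM's `/health` endpoint; the hostnames are the illustrative ones from the config above.

```python
import urllib.request
import urllib.error

def health_url(host: str, port: int = 8000) -> str:
    """Build the health-check URL for one vLLM replica."""
    return f"http://{host}:{port}/health"

def is_healthy(url: str, timeout: float = 2.0) -> bool:
    """Return True if the replica answers /health with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

# Same backend names as the nginx upstream block above
backends = ["vllm-1", "vllm-2", "vllm-3", "vllm-4"]
for host in backends:
    url = health_url(host)
    print(f"{url}: {'up' if is_healthy(url) else 'down'}")
```

In production the same check usually lives in nginx itself (passive health checks) or in the container orchestrator's liveness probes, rather than in an external script.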
docker-compose.yml with multiple vLLM replicas:
services:
  vllm-1:
    image: vllm/vllm-openai:latest
    command: --model meta-llama/Llama-3-8B-Instruct --port 8000
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
              count: 1
  vllm-2:
    image: vllm/vllm-openai:latest
    command: --model meta-llama/Llama-3-8B-Instruct --port 8000
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
              count: 1
  nginx:
    image: nginx:latest
    ports:
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf
    depends_on:
      - vllm-1
      - vllm-2
9. Kubernetes Orchestration
Install with Helm:
helm install vllm vllm/vllm --values values.yaml
values.yaml:
servingEngineSpec:
  modelSpec:
    - name: "llama-3-8b"
      repository: "meta-llama/Llama-3-8B-Instruct"
      tensorParallelSize: 1
      replicaCount: 4
      resources:
        limits:
          nvidia.com/gpu: 1
HorizontalPodAutoscaler (HPA):
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-deployment
  minReplicas: 2
  maxReplicas: 16
  metrics:
    - type: Pods
      pods:
        metric:
          name: vllm_requests_running
        target:
          type: AverageValue
          averageValue: 50
10. Scaling Math
Total Throughput Formula
Total Throughput (req/s) = per_replica_throughput × num_replicas
Replicas Throughput Table
| Replicas | Per-Replica (req/s) | Total Throughput (req/s) |
|---|---|---|
| 1 | 50 | 50 |
| 4 | 50 | 200 |
| 16 | 50 | 800 |
| 64 | 50 | 3,200 |
| 256 | 50 | 12,800 |
| 1,000 | 50 | 50,000 |
Key factors affecting per-replica throughput:
- Model size and quantization
- GPU type and memory bandwidth
- Batch size and sequence length
- Serving engine optimizations (PagedAttention, continuous batching)
- Network latency between nodes
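The table above follows directly from the formula; the same formula can also be inverted to size a deployment for a target load. A sketch (function names and the headroom parameter are illustrative):

```python
import math

def total_throughput(replicas: int, per_replica: float) -> float:
    """Aggregate throughput of N identical replicas."""
    return replicas * per_replica

def replicas_needed(target_rps: float, per_replica: float,
                    headroom: float = 0.0) -> int:
    """Replicas required for a target load, with optional headroom
    (e.g. 0.3 = keep 30% spare capacity for traffic spikes)."""
    return math.ceil(target_rps * (1 + headroom) / per_replica)

print(total_throughput(64, 50))            # 3200, matching the table row for 64 replicas
print(replicas_needed(10_000, 50))         # 200 replicas at 50 req/s each
print(replicas_needed(10_000, 50, headroom=0.3))   # 260 with 30% spare capacity
```

The linear formula is an upper bound: in practice the factors listed above (and load-balancer overhead) pull real aggregate throughput somewhat below replicas × per-replica.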