Scaling LLM Serving for Enterprise Production
A comprehensive guide to scaling large language model inference from single GPU to millions of requests per second
Table of Contents
1. Setup & Installation
2. The Scaling Roadmap
3. Hardware Foundations
4. LLM Serving Engines Comparison
5. Tensor Parallelism
6. Pipeline Parallelism
7. Multi-Node Deployment
8. Load Balancing with Nginx
9. Kubernetes Orchestration
10. Scaling Math
1. Setup & Installation
!pip install -q vllm openai requests
2. The Scaling Roadmap
Scaling LLM serving follows a progressive path from a single GPU to full orchestration:
| Tier | Setup | Throughput | Use Case |
|---|---|---|---|
| 1 | Single GPU | ~10–50 req/s | Development, prototyping |
| 2 | Multi-GPU (Tensor Parallelism) | ~50–200 req/s | Small-scale production |
| 3 | Multi-Node | ~200–2K req/s | Medium enterprise |
| 4 | Load-Balanced Replicas | ~2K–100K req/s | Large-scale production |
| 5 | Full Orchestration (K8s) | ~100K–1M+ req/s | Hyperscale enterprise |
Each tier builds on the previous one. Start simple and scale as demand grows.
3. Hardware Foundations
GPU Selection Guide
| GPU | VRAM | FP16 TFLOPS | Best For |
|---|---|---|---|
| A100 | 40/80 GB | 312 | General-purpose LLM serving |
| H100 | 80 GB | 990 | High-throughput production |
| L40S | 48 GB | 362 | Cost-effective inference |
| H200 | 141 GB | 990 | Very large models |
| B200 | 192 GB | 2250 | Next-gen flagship |
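As a quick sanity check, the VRAM column above can be turned into a small helper that picks the smallest GPU whose memory holds a model's FP16 weights. This is a sketch only: real sizing must also budget for KV cache and overhead, as covered in the next section.

```python
# Pick the smallest-VRAM GPU from the table above that fits a model's
# FP16 weights alone. Illustrative only: real sizing must also budget
# KV cache and ~10-20% overhead.
GPUS = [
    ("A100", 40), ("L40S", 48), ("A100-80", 80), ("H100", 80),
    ("H200", 141), ("B200", 192),
]

def fp16_weights_gb(params_billion: float) -> float:
    """FP16 weights take ~2 bytes per parameter."""
    return params_billion * 2

def smallest_fitting_gpu(params_billion: float) -> str:
    need = fp16_weights_gb(params_billion)
    for name, vram in sorted(GPUS, key=lambda g: g[1]):
        if vram >= need:
            return name
    return "multi-GPU required (tensor/pipeline parallelism)"

print(smallest_fitting_gpu(8))    # 8B model -> 16 GB of weights fits an A100-40
print(smallest_fitting_gpu(70))   # 70B model -> 140 GB of weights needs an H200
```

Anything past ~96B parameters exceeds even a single B200 at FP16, which is exactly where the tensor and pipeline parallelism sections below pick up.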
Memory Planning
Model memory estimate (FP16):
Memory (GB) ≈ Parameters (B) × 2
KV cache memory per request (the leading 2 covers keys and values; the trailing 2 bytes is the FP16 element size):
KV Cache (GB) = 2 × num_layers × hidden_dim × seq_len × 2 bytes / 1e9
Total memory:
Total = Model Weights + KV Cache × max_concurrent_requests + Overhead (~10–20%)
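The three formulas above combine into a short planning script. A sketch: the layer count and hidden size are the published Llama-3-70B values, and the 15% overhead factor is an assumption in the stated 10–20% range.

```python
def model_weights_gb(params_billion: float) -> float:
    # FP16: ~2 bytes per parameter
    return params_billion * 2

def kv_cache_gb(num_layers: int, hidden_dim: int, seq_len: int) -> float:
    # 2 (keys + values) x layers x hidden dim x tokens x 2 bytes (FP16)
    return 2 * num_layers * hidden_dim * seq_len * 2 / 1e9

def total_memory_gb(params_billion: float, num_layers: int, hidden_dim: int,
                    seq_len: int, max_concurrent: int,
                    overhead: float = 0.15) -> float:
    weights = model_weights_gb(params_billion)
    kv = kv_cache_gb(num_layers, hidden_dim, seq_len) * max_concurrent
    return (weights + kv) * (1 + overhead)

# Llama-3-70B-style config: 80 layers, hidden size 8192
print(model_weights_gb(70))                        # 140 GB of weights
print(round(kv_cache_gb(80, 8192, 4096), 2))       # ~10.74 GB per 4K-token request
print(round(total_memory_gb(70, 80, 8192, 4096, 10), 1))
```

At 10 concurrent 4K-token requests the KV cache alone is comparable to the weights, which is why serving capacity is usually memory-bound rather than compute-bound.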
4. LLM Serving Engines Comparison
| Engine | Key Feature | Throughput | Multi-GPU | Best For |
|---|---|---|---|---|
| vLLM | PagedAttention, continuous batching | Very High | TP, PP | General-purpose production |
| SGLang | RadixAttention, structured generation | Very High | TP | Structured output, multi-turn |
| TGI | HuggingFace integration | High | TP | HF ecosystem users |
| NVIDIA Triton | Multi-model, multi-framework | High | Full NVIDIA stack | Enterprise NVIDIA deployments |
5. Tensor Parallelism
Tensor Parallelism (TP) splits individual layers across multiple GPUs, allowing you to serve models that don't fit on a single GPU and to increase throughput.
# Serve a model with Tensor Parallelism across 4 GPUs using vLLM
# Run this in the terminal:
# vllm serve meta-llama/Llama-3-70B-Instruct \
# --tensor-parallel-size 4 \
# --max-model-len 4096 \
# --port 8000
# Query the endpoint
import requests
response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "meta-llama/Llama-3-70B-Instruct",
        "prompt": "Explain tensor parallelism in one paragraph.",
        "max_tokens": 256,
    },
)
print(response.json())
6. Pipeline Parallelism
Pipeline Parallelism (PP) splits the model's layers across GPUs sequentially. It can be combined with Tensor Parallelism for maximum scalability.
# Pipeline Parallelism with vLLM
# Run in terminal:
# vllm serve meta-llama/Llama-3-70B-Instruct \
# --pipeline-parallel-size 2 \
# --port 8000
# Combined TP + PP (e.g., 8 GPUs = TP=4 x PP=2)
# vllm serve meta-llama/Llama-3-405B-Instruct \
# --tensor-parallel-size 4 \
# --pipeline-parallel-size 2 \
# --max-model-len 4096 \
# --port 8000
print("Pipeline Parallelism splits layers across GPUs sequentially.")
print("Combined TP+PP example: 8 GPUs = TP=4 × PP=2")
print("Total GPUs required = tensor_parallel_size × pipeline_parallel_size")
7. Multi-Node Deployment
For models that require more GPUs than a single server can provide, use multi-node deployment with Ray or native multiprocessing.
Ray Cluster Setup
Head node:
ray start --head --port=6379
Worker node(s):
ray start --address=<head-node-ip>:6379
Launch vLLM on the Ray cluster:
vllm serve meta-llama/Llama-3-405B-Instruct \
--tensor-parallel-size 8 \
--pipeline-parallel-size 2 \
--distributed-executor-backend ray
Native Multiprocessing
vLLM also supports native multiprocessing without Ray:
vllm serve meta-llama/Llama-3-70B-Instruct \
--tensor-parallel-size 4 \
--distributed-executor-backend mp
8. Load Balancing with Nginx
Use Nginx as a reverse proxy to distribute traffic across multiple vLLM replicas.
Nginx configuration:
upstream vllm_backends {
    least_conn;
    server vllm-1:8000;
    server vllm-2:8000;
    server vllm-3:8000;
    server vllm-4:8000;
}
server {
    listen 80;
    location / {
        proxy_pass http://vllm_backends;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
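Before wiring backends into the upstream block, it is worth verifying that each replica is actually serving. A minimal probe, assuming each replica exposes vLLM's `/health` endpoint; the hostnames are the illustrative ones from the config above.

```python
import urllib.request
import urllib.error

def health_url(host: str, port: int = 8000) -> str:
    """Build the health-check URL for one vLLM replica."""
    return f"http://{host}:{port}/health"

def is_healthy(url: str, timeout: float = 2.0) -> bool:
    """Return True if the replica answers /health with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

# Same backend names as the nginx upstream block above
backends = ["vllm-1", "vllm-2", "vllm-3", "vllm-4"]
for host in backends:
    url = health_url(host)
    print(f"{url}: {'up' if is_healthy(url) else 'down'}")
```

In production the same check usually lives in nginx itself (passive health checks) or in the container orchestrator's liveness probes, rather than in an external script.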
docker-compose.yml with multiple vLLM replicas:
services:
  vllm-1:
    image: vllm/vllm-openai:latest
    command: --model meta-llama/Llama-3-8B-Instruct --port 8000
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
              count: 1
  vllm-2:
    image: vllm/vllm-openai:latest
    command: --model meta-llama/Llama-3-8B-Instruct --port 8000
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
              count: 1
  nginx:
    image: nginx:latest
    ports:
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf
    depends_on:
      - vllm-1
      - vllm-2
9. Kubernetes Orchestration
Install with Helm:
helm install vllm vllm/vllm --values values.yaml
values.yaml:
servingEngineSpec:
  modelSpec:
    - name: "llama-3-8b"
      repository: "meta-llama/Llama-3-8B-Instruct"
      tensorParallelSize: 1
      replicaCount: 4
      resources:
        limits:
          nvidia.com/gpu: 1
HorizontalPodAutoscaler (HPA):
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-deployment
  minReplicas: 2
  maxReplicas: 16
  metrics:
    - type: Pods
      pods:
        metric:
          name: vllm_requests_running
        target:
          type: AverageValue
          averageValue: 50
10. Scaling Math
Total Throughput Formula
Total Throughput (req/s) = per_replica_throughput × num_replicas
Replicas Throughput Table
| Replicas | Per-Replica (req/s) | Total Throughput (req/s) |
|---|---|---|
| 1 | 50 | 50 |
| 4 | 50 | 200 |
| 16 | 50 | 800 |
| 64 | 50 | 3,200 |
| 256 | 50 | 12,800 |
| 1,000 | 50 | 50,000 |
Key factors affecting per-replica throughput:
- Model size and quantization
- GPU type and memory bandwidth
- Batch size and sequence length
- Serving engine optimizations (PagedAttention, continuous batching)
- Network latency between nodes
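The table above follows directly from the formula; the same formula can also be inverted to size a deployment for a target load. A sketch (function names and the headroom parameter are illustrative):

```python
import math

def total_throughput(replicas: int, per_replica: float) -> float:
    """Aggregate throughput of N identical replicas."""
    return replicas * per_replica

def replicas_needed(target_rps: float, per_replica: float,
                    headroom: float = 0.0) -> int:
    """Replicas required for a target load, with optional headroom
    (e.g. 0.3 = keep 30% spare capacity for traffic spikes)."""
    return math.ceil(target_rps * (1 + headroom) / per_replica)

print(total_throughput(64, 50))            # 3200, matching the table row for 64 replicas
print(replicas_needed(10_000, 50))         # 200 replicas at 50 req/s each
print(replicas_needed(10_000, 50, headroom=0.3))   # 260 with 30% spare capacity
```

The linear formula is an upper bound: in practice the factors listed above (and load-balancer overhead) pull real aggregate throughput somewhat below replicas × per-replica.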