```mermaid
graph LR
A["Install vLLM"] --> B["Load / configure<br/>model"]
B --> C["Start OpenAI-compatible<br/>API server"]
C --> D["Query from<br/>Python / curl"]
D --> E["Deploy to<br/>production"]
style A fill:#ffce67,stroke:#333
style B fill:#ffce67,stroke:#333
style C fill:#6cc3d5,stroke:#333,color:#fff
style D fill:#6cc3d5,stroke:#333,color:#fff
style E fill:#56cc9d,stroke:#333,color:#fff
```
# Deploying and Serving LLMs with vLLM

*Keywords: vLLM, LLM serving, model deployment, inference optimization, PagedAttention, OpenAI API, batching, GPU inference, production LLM*

## Introduction
Serving Large Language Models (LLMs) in production requires more than just loading a model and running inference. You need high throughput, low latency, and efficient GPU memory usage to handle real-world traffic.
vLLM is an open-source library designed specifically for this purpose. It makes LLM serving:
- Fast (up to 24x higher throughput than naive serving)
- Memory-efficient (via PagedAttention)
- Production-ready (OpenAI-compatible API server)
- Easy to deploy (Docker, Kubernetes, cloud)
In this tutorial, we will walk through a complete pipeline:
- Install and configure vLLM
- Serve a model with the OpenAI-compatible API
- Query the model from Python
- Optimize for production deployment
## What is vLLM?
vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs. Key features include:
- PagedAttention: Efficient memory management inspired by OS virtual memory, reducing GPU memory waste
- Continuous batching: Dynamically batches incoming requests for maximum throughput
- OpenAI-compatible API: Drop-in replacement for OpenAI API endpoints
- Tensor parallelism: Distribute models across multiple GPUs
- Support for many models: Llama, Mistral, Qwen, Phi, Gemma, and more
```mermaid
graph TD
A["vLLM Engine"] --> B["PagedAttention<br/>Memory efficiency"]
A --> C["Continuous Batching<br/>Max throughput"]
A --> D["OpenAI-compatible API<br/>Drop-in replacement"]
A --> E["Tensor Parallelism<br/>Multi-GPU support"]
A --> F["Wide Model Support<br/>Llama, Mistral, Qwen..."]
style A fill:#56cc9d,stroke:#333,color:#fff
style B fill:#6cc3d5,stroke:#333,color:#fff
style C fill:#6cc3d5,stroke:#333,color:#fff
style D fill:#6cc3d5,stroke:#333,color:#fff
style E fill:#6cc3d5,stroke:#333,color:#fff
style F fill:#6cc3d5,stroke:#333,color:#fff
```
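Continuous batching is easiest to appreciate from the client side: you can throw many requests at the server simultaneously and let vLLM schedule them together. The sketch below does exactly that with only the standard library. It assumes a vLLM server is already running on `localhost:8000` with `--api-key your-secret-key` and `--served-model-name my-model` (set up later in this tutorial); the helper names are ours, not part of any vLLM API.

```python
# Sketch: fire concurrent requests at a running vLLM server.
# Continuous batching schedules them together instead of one after another.
# Assumes a server on localhost:8000 started with --api-key your-secret-key
# and --served-model-name my-model; fails gracefully if none is running.
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

BASE_URL = "http://localhost:8000/v1"
API_KEY = "your-secret-key"

def build_chat_request(prompt, model="my-model", max_tokens=64):
    """Build an OpenAI-style chat completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def send(prompt):
    req = urllib.request.Request(
        BASE_URL + "/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",
        },
    )
    try:
        with urllib.request.urlopen(req, timeout=60) as resp:
            return json.load(resp)["choices"][0]["message"]["content"]
    except OSError as exc:  # e.g. connection refused when no server is up
        return f"(request failed: {exc})"

prompts = [f"Give me one fact about the number {i}." for i in range(8)]
with ThreadPoolExecutor(max_workers=8) as pool:
    for answer in pool.map(send, prompts):
        print(answer)
```

With sequential calls, total latency grows linearly with the number of prompts; with concurrent calls against a batching server, it stays close to the latency of a single request.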
## Hardware Requirements
vLLM is designed for GPU inference. Minimum requirements depend on your model size:
| Model Size | Minimum GPU VRAM | Recommended GPU |
|---|---|---|
| 0.5B–3B | 4 GB | RTX 3060 / T4 |
| 7B–8B | 16 GB | RTX 4090 / A10 |
| 13B | 24 GB | A10 / A100 |
| 70B | 80 GB+ | A100 / H100 (multi-GPU) |
For CPU-only machines, consider using Ollama or llama.cpp instead.
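The table above can be encoded as a quick sanity check. This is a rough heuristic of our own, not an official sizing tool — it simply reproduces the tiers listed above:

```python
def min_vram_gb(params_billion):
    """Rough minimum GPU VRAM (GB) for serving, per the tiers above."""
    tiers = [
        (3, 4),    # up to 3B params  -> 4 GB
        (8, 16),   # up to 8B params  -> 16 GB
        (13, 24),  # up to 13B params -> 24 GB
        (70, 80),  # up to 70B params -> 80 GB (likely multi-GPU)
    ]
    for max_params, vram in tiers:
        if params_billion <= max_params:
            return vram
    raise ValueError("Larger than 70B: plan for multi-GPU with 80 GB+ cards")

print(min_vram_gb(0.5))  # 4
print(min_vram_gb(7))    # 16

# If PyTorch is installed, compare against the VRAM actually available:
try:
    import torch
    if torch.cuda.is_available():
        free, total = torch.cuda.mem_get_info()
        print(f"GPU 0: {total / 1e9:.1f} GB total, {free / 1e9:.1f} GB free")
except ImportError:
    pass
```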
## Installation

### Install vLLM

```bash
pip install vllm
```

### Install with CUDA support (recommended)

```bash
pip install vllm[cuda]
```

### Verify installation

```python
import vllm

print(vllm.__version__)
```

## Offline Inference (Batch Processing)
Use vLLM for fast batch inference without starting a server.
### Basic Example

```python
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")

sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=256,
)

prompts = [
    "Explain machine learning in simple terms.",
    "What is the difference between AI and ML?",
    "Write a Python function to reverse a string.",
]

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)
    print("---")
```

### Chat-style Inference
```python
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Explain vLLM in simple terms."},
]

sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=256,
)

outputs = llm.chat(messages=[messages], sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
```

## Serving with OpenAI-Compatible API
vLLM provides an API server that is fully compatible with the OpenAI API format.
```mermaid
graph LR
A["vLLM Server<br/>(port 8000)"] --> B["OpenAI-compatible<br/>/v1/chat/completions"]
B --> C["curl"]
B --> D["Python requests"]
B --> E["OpenAI Python client"]
B --> F["Any OpenAI-compatible<br/>application"]
style A fill:#56cc9d,stroke:#333,color:#fff
style B fill:#6cc3d5,stroke:#333,color:#fff
style C fill:#f8f9fa,stroke:#333
style D fill:#f8f9fa,stroke:#333
style E fill:#f8f9fa,stroke:#333
style F fill:#f8f9fa,stroke:#333
```
### Start the Server

```bash
vllm serve Qwen/Qwen2.5-0.5B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --api-key your-secret-key \
    --served-model-name my-model \
    --chat-template ./chat_template.jinja \
    --gpu-memory-utilization 0.90
```

Key options explained:
- `--served-model-name`: Sets the model name exposed in the API (clients use this name in requests instead of the full HuggingFace path)
- `--chat-template`: Path to a Jinja2 chat template file for formatting chat messages (useful for custom or fine-tuned models)
- `--gpu-memory-utilization`: Fraction of GPU memory to use (0.0–1.0, default 0.9). Increase for larger models, decrease to leave room for other processes
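For reference, here is roughly what a chat template file can look like. This ChatML-style example is illustrative only — use the template that matches your model family (many models ship theirs in `tokenizer_config.json` or as a `chat_template.jinja` file):

```jinja
{# chat_template.jinja — minimal ChatML-style template (illustrative) #}
{%- for message in messages %}
{{- '<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n' }}
{%- endfor %}
{%- if add_generation_prompt %}
{{- '<|im_start|>assistant\n' }}
{%- endif %}
```

The template receives the `messages` list from the API request plus an `add_generation_prompt` flag, and must render the exact prompt string the model was trained on.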
### Start with Custom Parameters

```bash
vllm serve Qwen/Qwen2.5-0.5B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --api-key your-secret-key \
    --served-model-name my-model \
    --chat-template ./chat_template.jinja \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.90 \
    --dtype auto
```

### Verify the Server

Because the server was started with `--api-key`, include the key when listing models:

```bash
curl http://localhost:8000/v1/models \
  -H "Authorization: Bearer your-secret-key"
```

## Querying the API
### Using curl

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your-secret-key" \
  -d '{
    "model": "my-model",
    "messages": [
      {"role": "user", "content": "What is vLLM?"}
    ],
    "temperature": 0.7,
    "max_tokens": 256
  }'
```

### Using Python (requests)
```python
import requests

response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    headers={"Authorization": "Bearer your-secret-key"},
    json={
        "model": "my-model",
        "messages": [
            {"role": "user", "content": "Explain PagedAttention."}
        ],
        "temperature": 0.7,
        "max_tokens": 256,
    },
)

print(response.json()["choices"][0]["message"]["content"])
```

### Using OpenAI Python Client (Recommended)
Since vLLM is OpenAI-compatible, you can use the official OpenAI client:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="your-secret-key",
)

response = client.chat.completions.create(
    model="my-model",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is continuous batching?"},
    ],
    temperature=0.7,
    max_tokens=256,
)

print(response.choices[0].message.content)
```

## Serving Custom / Fine-tuned Models
If you fine-tuned a small LLM with Unsloth and exported it to GGUF (e.g., gguf_model_small), here is how to serve it with vLLM.
vLLM natively supports GGUF files — no conversion required. See the official vLLM GGUF documentation for full details.
Note: GGUF support in vLLM is experimental and under-optimized. Currently, only single-file GGUF models are supported. If you have a multi-file GGUF model, use `gguf-split` to merge them first.
```mermaid
graph TD
A["Fine-tuned model"] --> B{"Export format?"}
B -->|"GGUF"| C["Serve GGUF directly<br/>with vLLM"]
B -->|"HF safetensors"| D["Serve HF format<br/>with vLLM"]
B -->|"LoRA adapter"| E["Serve with<br/>--enable-lora"]
C --> F["OpenAI-compatible API"]
D --> F
E --> F
style A fill:#f8f9fa,stroke:#333
style C fill:#56cc9d,stroke:#333,color:#fff
style D fill:#56cc9d,stroke:#333,color:#fff
style E fill:#56cc9d,stroke:#333,color:#fff
style F fill:#6cc3d5,stroke:#333,color:#fff
```
### Option A: Serve a GGUF file directly

#### Step 1: Prepare Your GGUF Model

After fine-tuning with Unsloth and exporting to GGUF, you should have a directory like:
```text
gguf_model_small/
├── added_tokens.json
├── chat_template.jinja
├── config.json
├── generation_config.json
├── merges.txt
├── model.safetensors
├── special_tokens_map.json
├── tokenizer.json
├── tokenizer_config.json
├── unsloth.BF16.gguf
├── unsloth.Q4_K_M.gguf
└── vocab.json
```

#### Step 2: Serve with vLLM
Point vLLM directly at the GGUF file. Use --tokenizer to specify the base model’s tokenizer (recommended over the GGUF-embedded tokenizer for stability):
```bash
vllm serve ./gguf_model_small/unsloth.Q4_K_M.gguf \
    --tokenizer ./gguf_model_small \
    --served-model-name my-finetuned-model \
    --chat-template ./chat_template.jinja \
    --host 0.0.0.0 \
    --port 8000 \
    --api-key your-secret-key \
    --gpu-memory-utilization 0.90 \
    --max-model-len 2048
```

You can also load GGUF models from HuggingFace using the `repo_id:quant_type` format:
```bash
vllm serve unsloth/Qwen3-0.6B-GGUF:Q4_K_M \
    --tokenizer Qwen/Qwen3-0.6B \
    --served-model-name qwen3-gguf \
    --host 0.0.0.0 \
    --port 8000 \
    --api-key your-secret-key \
    --gpu-memory-utilization 0.90
```

Add `--tensor-parallel-size 2` to distribute across multiple GPUs:
```bash
vllm serve unsloth/Qwen3-0.6B-GGUF:Q4_K_M \
    --tokenizer Qwen/Qwen3-0.6B \
    --tensor-parallel-size 2 \
    --api-key your-secret-key
```

#### Step 3: Verify and Query
```bash
curl http://localhost:8000/v1/models \
  -H "Authorization: Bearer your-secret-key"
```

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="your-secret-key",
)

response = client.chat.completions.create(
    model="my-finetuned-model",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is fine-tuning?"},
    ],
    temperature=0.7,
    max_tokens=256,
)

print(response.choices[0].message.content)
```

### Option B: Serve in Hugging Face format (safetensors)
If you prefer maximum compatibility (e.g., with LoRA adapters or features not yet supported with GGUF), export in HF format instead:
```python
# During fine-tuning with Unsloth, save in HF format
model.save_pretrained_merged("hf_model_small", tokenizer)
```

Then serve:
```bash
vllm serve ./hf_model_small \
    --host 0.0.0.0 \
    --port 8000 \
    --api-key your-secret-key \
    --served-model-name my-finetuned-model \
    --chat-template ./chat_template.jinja \
    --gpu-memory-utilization 0.90 \
    --dtype auto \
    --max-model-len 2048
```

### Serve a LoRA Adapter (Without Merging)
If you prefer to keep LoRA weights separate, vLLM supports serving LoRA adapters on top of a base model:
```bash
vllm serve Qwen/Qwen2.5-0.5B-Instruct \
    --enable-lora \
    --lora-modules my-lora=./lora_model \
    --host 0.0.0.0 \
    --port 8000 \
    --api-key your-secret-key
```

Then query with the LoRA model name:

```python
response = client.chat.completions.create(
    model="my-lora",
    messages=[{"role": "user", "content": "Hello!"}],
)
```

## Docker Deployment
Deploy vLLM in a container for production environments.
```mermaid
graph LR
A["vLLM Docker Image<br/>(vllm/vllm-openai)"] --> B["GPU Container<br/>(--gpus all)"]
B --> C["Model loaded<br/>in container"]
C --> D["Expose port 8000"]
D --> E["Production traffic"]
style A fill:#ffce67,stroke:#333
style B fill:#6cc3d5,stroke:#333,color:#fff
style C fill:#6cc3d5,stroke:#333,color:#fff
style D fill:#56cc9d,stroke:#333,color:#fff
style E fill:#56cc9d,stroke:#333,color:#fff
```
### Dockerfile

Note that an exec-form `CMD` does not expand environment variables, so pass the model name literally:

```dockerfile
FROM vllm/vllm-openai:latest

# The base image's entrypoint starts the OpenAI-compatible server;
# CMD only supplies its arguments.
CMD ["--model", "Qwen/Qwen2.5-0.5B-Instruct", "--host", "0.0.0.0", "--port", "8000"]
```

### Run with Docker
```bash
docker run --gpus all \
    -p 8000:8000 \
    vllm/vllm-openai:latest \
    --model Qwen/Qwen2.5-0.5B-Instruct \
    --host 0.0.0.0 \
    --port 8000
```

## Performance Optimization Tips
- GPU memory utilization: Set `--gpu-memory-utilization 0.90` to maximize GPU usage (range: 0.0–1.0)
- Served model name: Use `--served-model-name` for cleaner API model names instead of long HuggingFace paths
- Chat template: Use `--chat-template` to apply a custom Jinja2 chat template for fine-tuned models
- Quantization: Use AWQ or GPTQ quantized models to reduce VRAM
- Tensor parallelism: Use `--tensor-parallel-size N` for multi-GPU setups
- Max model length: Reduce `--max-model-len` if you don’t need long contexts
- Continuous batching: Enabled by default, handles concurrent requests efficiently
- Streaming: Use `stream=True` for real-time token generation
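The streaming tip deserves a concrete example. With the official OpenAI client this is just `stream=True` on `chat.completions.create` and iterating over the returned chunks; the standard-library sketch below does the same thing by reading the server's SSE stream directly, so you can see what the chunks look like. It assumes the server from earlier (`my-model`, `your-secret-key`); the helper names are ours.

```python
# Sketch: stream tokens from a vLLM server by parsing its SSE response.
# Assumes a server on localhost:8000 with --api-key your-secret-key
# and --served-model-name my-model; fails gracefully if none is running.
import json
import urllib.request

def extract_deltas(sse_lines):
    """Parse OpenAI-style SSE lines, yielding content deltas as they arrive."""
    for line in sse_lines:
        if isinstance(line, bytes):
            line = line.decode("utf-8")
        line = line.strip()
        if not line.startswith("data: "):
            continue
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        delta = json.loads(payload)["choices"][0]["delta"]
        if delta.get("content"):
            yield delta["content"]

def stream_chat(prompt, base_url="http://localhost:8000/v1",
                api_key="your-secret-key", model="my-model"):
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }).encode()
    req = urllib.request.Request(
        base_url + "/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )
    try:
        with urllib.request.urlopen(req, timeout=60) as resp:
            for token in extract_deltas(resp):  # HTTPResponse iterates by line
                print(token, end="", flush=True)
        print()
    except OSError as exc:
        print(f"(no server running: {exc})")

stream_chat("What is vLLM?")
```

Each SSE line carries one JSON chunk whose `choices[0].delta.content` holds the newly generated text, and the stream ends with a `data: [DONE]` sentinel.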
## vLLM vs Other Serving Solutions
| Feature | vLLM | Ollama | TGI | llama.cpp |
|---|---|---|---|---|
| Throughput | Very High | Medium | High | Low-Medium |
| GPU Required | Yes | Optional | Yes | Optional |
| OpenAI API | Yes | Partial | Yes | Partial |
| Multi-GPU | Yes | No | Yes | No |
| Ease of Use | Medium | Easy | Medium | Medium |
| Best For | Production | Local Dev | Production | Edge/CPU |
## Conclusion
vLLM is the go-to solution for high-performance LLM serving in production:
- Serves models with an OpenAI-compatible API
- Handles high-concurrency with continuous batching
- Optimizes GPU memory with PagedAttention
- Supports custom and fine-tuned models
- Deploys easily with Docker and Kubernetes
This workflow is perfect for:
- Production AI APIs
- Enterprise LLM platforms
- High-traffic chatbot backends
- Multi-model serving infrastructure
## Read More
- Combine with a RAG pipeline (LangChain + vLLM)
- Add load balancing with Nginx or Traefik
- Deploy on Kubernetes with GPU node pools
- Monitor with Prometheus + Grafana
- Serve multiple models with model routing