Deploying and Serving LLMs with vLLM
End-to-end guide: deploy and serve LLMs at scale with vLLM for high-throughput, low-latency inference.
Table of Contents
1. Setup & Installation
2. Offline Inference (Batch Processing)
3. Chat-style Inference
4. Serving with OpenAI-Compatible API
5. Querying the API
6. Serving Custom / Fine-tuned Models
7. LoRA Adapter Serving
8. Docker Deployment
9. Performance Optimization Tips
1. Setup & Installation
Install vLLM and the OpenAI client for API interaction.
!pip install -q vllm openai
import vllm
print(f"vLLM version: {vllm.__version__}")
2. Offline Inference (Batch Processing)
Use vLLM for fast batch inference without starting a server. Key features:
- PagedAttention: efficient memory management inspired by OS virtual memory
- Continuous batching: dynamically batches incoming requests for maximum throughput
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")

sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=256,
)

prompts = [
    "Explain machine learning in simple terms.",
    "What is the difference between AI and ML?",
    "Write a Python function to reverse a string.",
]

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
    print("---")
3. Chat-style Inference
Use the chat method for multi-turn conversations.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Explain vLLM in simple terms."},
]

sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=256,
)

outputs = llm.chat(messages=[messages], sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
4. Serving with OpenAI-Compatible API
vLLM provides an API server that is fully compatible with the OpenAI API format.
Start the Server
vllm serve Qwen/Qwen2.5-0.5B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --api-key your-secret-key \
  --served-model-name my-model \
  --gpu-memory-utilization 0.90
Key options:
- --served-model-name: Sets the model name exposed in the API
- --gpu-memory-utilization: Fraction of GPU memory to use (0.0–1.0)
- --tensor-parallel-size N: Distribute the model across N GPUs
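Once the server is up, it is worth a quick sanity check before wiring clients to it. A minimal sketch that lists the served models via the OpenAI-compatible /v1/models endpoint, assuming the host, port, and API key from the command above (the `auth_headers` helper is purely illustrative, and the network call is guarded so the snippet degrades gracefully when no server is running):

```python
def auth_headers(api_key):
    """Build the Bearer-token header the vLLM server expects."""
    return {"Authorization": f"Bearer {api_key}"}

try:
    import requests

    resp = requests.get(
        "http://localhost:8000/v1/models",        # adjust host/port to your deployment
        headers=auth_headers("your-secret-key"),  # must match --api-key
        timeout=10,
    )
    resp.raise_for_status()
    # Each entry's id reflects --served-model-name, e.g. "my-model".
    print([m["id"] for m in resp.json()["data"]])
except Exception as err:
    print(f"Health check failed (is the server running?): {err}")
```

If the name you pass as `model` in later requests does not appear in this list, the server will reject the request, so this is a useful first debugging step.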
5. Querying the API
Using Python (requests)
import requests

response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    headers={"Authorization": "Bearer your-secret-key"},
    json={
        "model": "my-model",
        "messages": [
            {"role": "user", "content": "Explain PagedAttention."}
        ],
        "temperature": 0.7,
        "max_tokens": 256,
    },
)
print(response.json()["choices"][0]["message"]["content"])
Using OpenAI Python Client (Recommended)
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="your-secret-key",
)

response = client.chat.completions.create(
    model="my-model",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is continuous batching?"},
    ],
    temperature=0.7,
    max_tokens=256,
)
print(response.choices[0].message.content)
6. Serving Custom / Fine-tuned Models
Serve a GGUF file directly
vllm serve ./gguf_model_small/unsloth.Q4_K_M.gguf \
  --tokenizer ./gguf_model_small \
  --served-model-name my-finetuned-model \
  --host 0.0.0.0 \
  --port 8000 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 2048
Serve from HuggingFace (GGUF)
vllm serve unsloth/Qwen3-0.6B-GGUF:Q4_K_M \
  --tokenizer Qwen/Qwen3-0.6B \
  --served-model-name qwen3-gguf \
  --host 0.0.0.0 \
  --port 8000
# Query the fine-tuned model
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="your-secret-key",
)

response = client.chat.completions.create(
    model="my-finetuned-model",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is fine-tuning?"},
    ],
    temperature=0.7,
    max_tokens=256,
)
print(response.choices[0].message.content)
7. LoRA Adapter Serving
Serve LoRA adapters on top of a base model without merging:
vllm serve Qwen/Qwen2.5-0.5B-Instruct \
  --enable-lora \
  --lora-modules my-lora=./lora_model \
  --host 0.0.0.0 \
  --port 8000
# Query with the LoRA model name
response = client.chat.completions.create(
    model="my-lora",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
8. Docker Deployment
docker run --gpus all \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen2.5-0.5B-Instruct \
  --host 0.0.0.0 \
  --port 8000
9. Performance Optimization Tips
| Tip | Details |
|---|---|
| GPU memory utilization | Set --gpu-memory-utilization 0.90 to maximize GPU usage |
| Quantization | Use AWQ or GPTQ quantized models to reduce VRAM |
| Tensor parallelism | Use --tensor-parallel-size N for multi-GPU setups |
| Max model length | Reduce --max-model-len if you don't need long contexts |
| Continuous batching | Enabled by default; handles concurrent requests efficiently |
| Streaming | Use stream=True for real-time token generation |
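The streaming tip above maps to passing `stream=True` to the chat completions call: the server then sends tokens as server-sent events while they are generated, instead of one final response. A sketch with the OpenAI client, assuming the server and credentials from the earlier serving example (`join_deltas` is an illustrative helper, and the call is guarded so the snippet degrades gracefully when no server is running):

```python
def join_deltas(deltas):
    """Concatenate streamed text fragments, skipping empty/None deltas."""
    return "".join(d for d in deltas if d)

try:
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="your-secret-key")

    # stream=True makes the server emit incremental chunks instead of one response.
    stream = client.chat.completions.create(
        model="my-model",
        messages=[{"role": "user", "content": "Explain continuous batching briefly."}],
        max_tokens=128,
        stream=True,
    )

    parts = []
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:  # some chunks (e.g. the final one) carry no text
            print(delta, end="", flush=True)  # real-time output
            parts.append(delta)
    print()
    full_text = join_deltas(parts)  # the complete reply, reassembled
except Exception as err:
    print(f"Streaming demo skipped (is the server running?): {err}")
```

Streaming noticeably improves perceived latency for chat UIs, since the first tokens arrive long before the full completion finishes.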
vLLM vs Other Serving Solutions
| Feature | vLLM | Ollama | TGI | llama.cpp |
|---|---|---|---|---|
| Throughput | Very High | Medium | High | Low-Medium |
| GPU Required | Yes | Optional | Yes | Optional |
| OpenAI API | Yes | Partial | Yes | Partial |
| Multi-GPU | Yes | No | Yes | No |
| Best For | Production | Local Dev | Production | Edge/CPU |