Deploying and Serving LLMs with vLLM

End-to-end guide: deploy and serve LLMs at scale with vLLM for high-throughput, low-latency inference

📖 Read the full article


Table of Contents

  1. Setup & Installation
  2. Offline Inference (Batch Processing)
  3. Chat-style Inference
  4. Serving with OpenAI-Compatible API
  5. Querying the API
  6. Serving Custom / Fine-tuned Models
  7. LoRA Adapter Serving
  8. Docker Deployment
  9. Performance Optimization Tips

1. Setup & Installation

Install vLLM and the OpenAI client for API interaction.

!pip install -q vllm openai
import vllm
print(f"vLLM version: {vllm.__version__}")

2. Offline Inference (Batch Processing)

Use vLLM for fast batch inference without starting a server. Key features:

  - PagedAttention: efficient memory management inspired by OS virtual memory
  - Continuous batching: incoming requests are dynamically batched for maximum throughput

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")

sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=256,
)

prompts = [
    "Explain machine learning in simple terms.",
    "What is the difference between AI and ML?",
    "Write a Python function to reverse a string.",
]

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)
    print("---")
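The PagedAttention idea can be sketched in plain Python: instead of reserving one large contiguous KV-cache buffer per sequence, memory is carved into fixed-size blocks handed out on demand and returned to a shared pool when a sequence finishes. The snippet below is a toy illustration of that bookkeeping only, not vLLM's actual implementation:

```python
# Toy sketch of PagedAttention-style block allocation (illustrative only).
BLOCK_SIZE = 16  # tokens stored per physical block

class BlockAllocator:
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))
        self.tables = {}   # seq_id -> list of physical block ids ("block table")
        self.lengths = {}  # seq_id -> number of tokens cached so far

    def append_token(self, seq_id):
        """Reserve cache space for one more token; allocate a block only when needed."""
        n = self.lengths.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:  # current block is full (or this is the first token)
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted")
            self.tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.lengths[seq_id] = n + 1

    def free(self, seq_id):
        """Return a finished sequence's blocks to the pool for immediate reuse."""
        self.free_blocks.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

alloc = BlockAllocator(num_blocks=8)
for _ in range(20):              # 20 tokens -> ceil(20 / 16) = 2 blocks
    alloc.append_token("seq-a")
print(len(alloc.tables["seq-a"]))  # → 2: no large up-front reservation
alloc.free("seq-a")
print(len(alloc.free_blocks))      # → 8: all blocks available to other sequences
```

Because blocks are small and freed eagerly, many concurrent sequences of different lengths can share the same GPU memory, which is what enables continuous batching at high occupancy.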

3. Chat-style Inference

Use the chat method for multi-turn conversations.

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Explain vLLM in simple terms."},
]

sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=256,
)

outputs = llm.chat(messages=[messages], sampling_params=sampling_params)

print(outputs[0].outputs[0].text)

4. Serving with OpenAI-Compatible API

vLLM provides an API server that is fully compatible with the OpenAI API format.

Start the Server

vllm serve Qwen/Qwen2.5-0.5B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --api-key your-secret-key \
    --served-model-name my-model \
    --gpu-memory-utilization 0.90

Key options:

  - --served-model-name: sets the model name exposed in the API
  - --gpu-memory-utilization: fraction of GPU memory to use (0.0–1.0)
  - --tensor-parallel-size N: distributes the model across N GPUs
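Before sending traffic, you can confirm the server is up: vLLM's OpenAI-compatible server exposes a /health endpoint. A minimal probe using only the standard library (host and port assumed from the command above):

```python
# Check whether a vLLM server is reachable; True when /health returns HTTP 200.
from urllib.request import urlopen
from urllib.error import URLError

def server_is_healthy(base_url, timeout=2.0):
    try:
        with urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (URLError, OSError):
        return False  # connection refused, timeout, DNS failure, etc.

print(server_is_healthy("http://localhost:8000"))
```

This is handy in startup scripts: loop on the probe until it returns True before pointing clients at the server.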

5. Querying the API

Using Python (requests)

import requests

response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    headers={"Authorization": "Bearer your-secret-key"},
    json={
        "model": "my-model",
        "messages": [
            {"role": "user", "content": "Explain PagedAttention."}
        ],
        "temperature": 0.7,
        "max_tokens": 256,
    }
)

print(response.json()["choices"][0]["message"]["content"])
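The server returns the standard OpenAI chat-completion schema. The payload below is illustrative (not real model output) and shows the fields you will typically read:

```python
# Illustrative chat-completion response in the OpenAI schema served by vLLM.
sample = {
    "id": "chatcmpl-123",
    "model": "my-model",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "PagedAttention is ..."},
            "finish_reason": "stop",  # "length" means max_tokens was hit
        }
    ],
    "usage": {"prompt_tokens": 12, "completion_tokens": 87, "total_tokens": 99},
}

answer = sample["choices"][0]["message"]["content"]
reason = sample["choices"][0]["finish_reason"]
used = sample["usage"]["total_tokens"]

print(answer)
print(reason, used)
```

Checking finish_reason and usage is worth doing in production: a "length" finish means the answer was truncated, and total_tokens drives capacity planning.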

6. Serving Custom / Fine-tuned Models

Serve a GGUF file directly

vllm serve ./gguf_model_small/unsloth.Q4_K_M.gguf \
    --tokenizer ./gguf_model_small \
    --served-model-name my-finetuned-model \
    --host 0.0.0.0 \
    --port 8000 \
    --gpu-memory-utilization 0.90 \
    --max-model-len 2048

Serve from HuggingFace (GGUF)

vllm serve unsloth/Qwen3-0.6B-GGUF:Q4_K_M \
    --tokenizer Qwen/Qwen3-0.6B \
    --served-model-name qwen3-gguf \
    --host 0.0.0.0 \
    --port 8000
# Query the fine-tuned model
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="your-secret-key",
)

response = client.chat.completions.create(
    model="my-finetuned-model",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is fine-tuning?"},
    ],
    temperature=0.7,
    max_tokens=256,
)

print(response.choices[0].message.content)

7. LoRA Adapter Serving

Serve LoRA adapters on top of a base model without merging:

vllm serve Qwen/Qwen2.5-0.5B-Instruct \
    --enable-lora \
    --lora-modules my-lora=./lora_model \
    --host 0.0.0.0 \
    --port 8000
# Query with the LoRA model name
response = client.chat.completions.create(
    model="my-lora",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)

8. Docker Deployment

The official vllm/vllm-openai image runs the OpenAI-compatible server without a local install (GPU access via --gpus all requires the NVIDIA Container Toolkit):

docker run --gpus all \
    -p 8000:8000 \
    vllm/vllm-openai:latest \
    --model Qwen/Qwen2.5-0.5B-Instruct \
    --host 0.0.0.0 \
    --port 8000

9. Performance Optimization Tips

| Tip | Details |
| --- | --- |
| GPU memory utilization | Set --gpu-memory-utilization 0.90 to maximize GPU usage |
| Quantization | Use AWQ or GPTQ quantized models to reduce VRAM |
| Tensor parallelism | Use --tensor-parallel-size N for multi-GPU setups |
| Max model length | Reduce --max-model-len if you don't need long contexts |
| Continuous batching | Enabled by default; handles concurrent requests efficiently |
| Streaming | Use stream=True for real-time token generation |

vLLM vs Other Serving Solutions

| Feature | vLLM | Ollama | TGI | llama.cpp |
| --- | --- | --- | --- | --- |
| Throughput | Very High | Medium | High | Low-Medium |
| GPU Required | Yes | Optional | Yes | Optional |
| OpenAI API | Yes | Partial | Yes | Partial |
| Multi-GPU | Yes | No | Yes | No |
| Best For | Production | Local Dev | Production | Edge/CPU |