# Deploying and Serving LLM with Llama.cpp

End-to-end guide: deploy and serve LLMs locally and at scale with llama.cpp for efficient CPU/GPU inference.

## Table of Contents

1. Setup & Installation
2. GGUF Model Formats & Quantization
3. Download Pre-quantized Models
4. Batch Inference with Python
5. Chat-style Inference
6. Serving with OpenAI-Compatible API
7. Querying the API
8. Serving Custom / Fine-tuned Models
9. Docker Deployment

## 1. Setup & Installation

Install the llama.cpp Python bindings. For GPU acceleration, set `CMAKE_ARGS` before installing (e.g. `CMAKE_ARGS="-DGGML_CUDA=on"` for CUDA builds).

```
!pip install -q llama-cpp-python huggingface_hub openai
```
## 2. GGUF Model Formats & Quantization
GGUF is the file format used by llama.cpp. It stores model weights, metadata, and tokenizer in a single file.
| Quantization | Bits | Size (7B) | Quality | Speed |
|---|---|---|---|---|
| Q2_K | 2 | ~2.7 GB | Low | Fastest |
| Q4_K_M | 4 | ~4.1 GB | Good | Fast |
| Q5_K_M | 5 | ~4.8 GB | Very Good | Medium |
| Q6_K | 6 | ~5.5 GB | Excellent | Medium |
| Q8_0 | 8 | ~7.2 GB | Near-FP16 | Slower |
| F16 | 16 | ~13.5 GB | Lossless | Slowest |
Recommendation: Q4_K_M offers the best balance of quality, size, and speed.
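The sizes in the table follow directly from bits per weight. A back-of-envelope sketch (the ~5% overhead factor and the effective bits-per-weight values are rough assumptions; K-quants like Q4_K_M land near 4.5 effective bits because some tensors stay at higher precision):

```python
def approx_gguf_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Rough GGUF file size: parameters x bits/weight / 8, plus ~5% for
    metadata and the higher-precision tensors (embeddings, output layer)."""
    return n_params * bits_per_weight / 8 / 1e9 * 1.05

# A 7B-parameter model at Q4_K_M's effective ~4.5 bits/weight:
print(f"{approx_gguf_size_gb(7e9, 4.5):.1f} GB")  # close to the ~4.1 GB in the table
```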
## 3. Download Pre-quantized Models
Many models are available pre-quantized in GGUF format on Hugging Face.
```python
from huggingface_hub import hf_hub_download

# Download a single pre-quantized GGUF file from the Hub
model_path = hf_hub_download(
    repo_id="unsloth/Qwen3-0.6B-GGUF",
    filename="Q4_K_M.gguf",
    local_dir="./models",
)
print(f"Model downloaded to: {model_path}")
```

## 4. Batch Inference with Python
Use the llama-cpp-python bindings for fast inference.
```python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/Q4_K_M.gguf",
    n_ctx=4096,        # context window size
    n_gpu_layers=-1,   # -1 = offload all layers to GPU
)

prompts = [
    "Explain machine learning in simple terms.",
    "What is the difference between AI and ML?",
    "Write a Python function to reverse a string.",
]

for prompt in prompts:
    output = llm(
        prompt,
        max_tokens=256,
        temperature=0.7,
    )
    print(output["choices"][0]["text"])
    print("---")
```

## 5. Chat-style Inference
Use `create_chat_completion` for multi-turn conversations with chat templates.
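With `chat_format="chatml"`, the messages list is rendered into a single prompt string before inference. An illustrative sketch of the ChatML layout (not the library's exact implementation):

```python
def to_chatml(messages: list[dict]) -> str:
    """Render role-tagged messages in the ChatML layout and append the
    assistant header so the model continues as the assistant."""
    rendered = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
    return rendered + "<|im_start|>assistant\n"

print(to_chatml([{"role": "user", "content": "Hi"}]))
```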
```python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/Q4_K_M.gguf",
    n_ctx=4096,
    n_gpu_layers=-1,
    chat_format="chatml",
)

messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Explain llama.cpp in simple terms."},
]

output = llm.create_chat_completion(
    messages=messages,
    max_tokens=256,
    temperature=0.7,
)
print(output["choices"][0]["message"]["content"])
```

## 6. Serving with OpenAI-Compatible API
llama.cpp includes a built-in HTTP server (`llama-server`) that exposes an OpenAI-compatible API.
### Start the Server (C++ binary)

```bash
./build/bin/llama-server \
  -m ./models/Q4_K_M.gguf \
  --host 0.0.0.0 \
  --port 8000 \
  --api-key your-secret-key \
  --chat-template chatml \
  -ngl 99 \
  -c 4096 \
  --parallel 4
```

### Start the Server (Python)
```bash
python -m llama_cpp.server \
  --model ./models/Q4_K_M.gguf \
  --host 0.0.0.0 \
  --port 8000 \
  --n_gpu_layers -1 \
  --chat_format chatml \
  --n_ctx 4096
```

## 7. Querying the API
### Using Python (requests)

```python
import requests

response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    headers={"Authorization": "Bearer your-secret-key"},
    json={
        "model": "default",
        "messages": [
            {"role": "user", "content": "Explain GGUF quantization."}
        ],
        "temperature": 0.7,
        "max_tokens": 256,
    },
)
print(response.json()["choices"][0]["message"]["content"])
```

### Using the OpenAI Python Client (Recommended)
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="your-secret-key",
)

response = client.chat.completions.create(
    model="default",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is quantization in LLMs?"},
    ],
    temperature=0.7,
    max_tokens=256,
)
print(response.choices[0].message.content)
```

## 8. Serving Custom / Fine-tuned Models
If you fine-tuned a model with Unsloth and exported to GGUF, serve it directly:
```bash
./build/bin/llama-server \
  -m ./gguf_model_small/unsloth.Q4_K_M.gguf \
  --host 0.0.0.0 \
  --port 8000 \
  --api-key your-secret-key \
  --chat-template chatml \
  -ngl 99 \
  -c 2048 \
  --parallel 4
```

Query the fine-tuned model:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="your-secret-key",
)

response = client.chat.completions.create(
    model="default",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is fine-tuning?"},
    ],
    temperature=0.7,
    max_tokens=256,
)
print(response.choices[0].message.content)
```

## 9. Docker Deployment
Deploy llama.cpp in a container for production:
```bash
docker run --gpus all \
  -v ./models:/models \
  -p 8000:8000 \
  ghcr.io/ggerganov/llama.cpp:server \
  -m /models/Q4_K_M.gguf \
  --host 0.0.0.0 \
  --port 8000 \
  -ngl 99
```

## Performance Tips
- Use `-ngl 99` to offload all layers to the GPU
- Use `--parallel 4` for concurrent request handling across 4 server slots
- Use `Q4_K_M` for the best quality/size/speed balance
- Reduce context length (`-c`) if memory is limited
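When trimming `-c`, it helps to know what the KV cache actually costs. A back-of-envelope sketch; the layer/head numbers below are illustrative (roughly Llama-2-7B-shaped) and assume an F16 cache:

```python
def kv_cache_bytes(n_layers: int, n_ctx: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elt: int = 2) -> int:
    """KV cache size = 2 (K and V) x layers x context x KV heads x head dim x element size."""
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elt

# ~Llama-2-7B shape: 32 layers, 32 KV heads, head_dim 128, F16 cache, 4096 context
print(f"{kv_cache_bytes(32, 4096, 32, 128) / 1e9:.1f} GB")  # ~2.1 GB; halving -c halves this
```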