Deploying and Serving LLM with Llama.cpp

End-to-end guide: deploy and serve LLMs locally and at scale with llama.cpp for efficient CPU/GPU inference



Table of Contents

  1. Setup & Installation
  2. GGUF Model Formats & Quantization
  3. Download Pre-quantized Models
  4. Batch Inference with Python
  5. Chat-style Inference
  6. Serving with OpenAI-Compatible API
  7. Querying the API
  8. Serving Custom / Fine-tuned Models
  9. Docker Deployment

1. Setup & Installation

Install llama.cpp Python bindings. For GPU acceleration, set CMAKE_ARGS accordingly.

!pip install -q llama-cpp-python huggingface_hub openai
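For a GPU build, set CMAKE_ARGS before installing so the bindings are compiled with the right backend. A sketch (flag names per recent llama-cpp-python releases; older versions used `-DLLAMA_CUBLAS=on` for CUDA):

```shell
# NVIDIA GPU (CUDA) build:
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python

# Apple Silicon (Metal) build:
CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python
```

A CPU-only install (the plain `pip install` above) still works everywhere; the CMAKE_ARGS variants just enable GPU offload via `n_gpu_layers`.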

2. GGUF Model Formats & Quantization

GGUF is the file format used by llama.cpp. It stores model weights, metadata, and tokenizer in a single file.

| Quantization | Bits | Size (7B) | Quality   | Speed   |
|--------------|------|-----------|-----------|---------|
| Q2_K         | 2    | ~2.7 GB   | Low       | Fastest |
| Q4_K_M       | 4    | ~4.1 GB   | Good      | Fast    |
| Q5_K_M       | 5    | ~4.8 GB   | Very Good | Medium  |
| Q6_K         | 6    | ~5.5 GB   | Excellent | Medium  |
| Q8_0         | 8    | ~7.2 GB   | Near-FP16 | Slower  |
| F16          | 16   | ~13.5 GB  | Lossless  | Slowest |

Recommendation: Q4_K_M offers the best balance of quality, size, and speed.
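The sizes in the table follow a simple rule of thumb: file size ≈ parameter count × bits per weight ÷ 8, plus some overhead for per-block scales, metadata, and the tokenizer. A quick sketch (the parameter count and effective bit widths are approximations, not exact values from the table's source):

```python
def approx_gguf_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Back-of-the-envelope GGUF size: parameters x bits per weight, in GB."""
    return n_params * bits_per_weight / 8 / 1e9

n_params = 6.7e9  # a "7B" model has roughly 6.7 billion parameters

# F16 stores 2 bytes per weight:
print(f"F16:    ~{approx_gguf_size_gb(n_params, 16):.1f} GB")    # ~13.4 GB

# K-quants mix precisions; Q4_K_M averages roughly 4.85 bits/weight,
# which is why the table shows ~4.1 GB rather than a pure 4-bit ~3.4 GB:
print(f"Q4_K_M: ~{approx_gguf_size_gb(n_params, 4.85):.1f} GB")  # ~4.1 GB
```

The mixed bit widths are why K-quants like Q4_K_M punch above their nominal bit count in quality: attention and output tensors are kept at higher precision.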

3. Download Pre-quantized Models

Many models are available pre-quantized in GGUF format on Hugging Face.

from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="unsloth/Qwen3-0.6B-GGUF",
    filename="Q4_K_M.gguf",
    local_dir="./models"
)
print(f"Model downloaded to: {model_path}")

4. Batch Inference with Python

Load the model once with the llama-cpp-python bindings, then run inference over a list of prompts.

from llama_cpp import Llama

llm = Llama(
    model_path="./models/Q4_K_M.gguf",
    n_ctx=4096,
    n_gpu_layers=-1,  # -1 = offload all layers to GPU
)

prompts = [
    "Explain machine learning in simple terms.",
    "What is the difference between AI and ML?",
    "Write a Python function to reverse a string.",
]

for prompt in prompts:
    output = llm(
        prompt,
        max_tokens=256,
        temperature=0.7,
    )
    print(output["choices"][0]["text"])
    print("---")

5. Chat-style Inference

Use create_chat_completion for multi-turn conversations with chat templates.

from llama_cpp import Llama

llm = Llama(
    model_path="./models/Q4_K_M.gguf",
    n_ctx=4096,
    n_gpu_layers=-1,
    chat_format="chatml",
)

messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Explain llama.cpp in simple terms."},
]

output = llm.create_chat_completion(
    messages=messages,
    max_tokens=256,
    temperature=0.7,
)

print(output["choices"][0]["message"]["content"])

6. Serving with OpenAI-Compatible API

llama.cpp includes a built-in HTTP server (llama-server) that exposes an OpenAI-compatible API, so existing OpenAI client code and tooling can point at it unchanged.

Start the Server (C++ binary)

./build/bin/llama-server \
    -m ./models/Q4_K_M.gguf \
    --host 0.0.0.0 \
    --port 8000 \
    --api-key your-secret-key \
    --chat-template chatml \
    -ngl 99 \
    -c 4096 \
    --parallel 4

Start the Server (Python)

python -m llama_cpp.server \
    --model ./models/Q4_K_M.gguf \
    --host 0.0.0.0 \
    --port 8000 \
    --n_gpu_layers -1 \
    --chat_format chatml \
    --n_ctx 4096

7. Querying the API

Using Python (requests)

import requests

response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    headers={"Authorization": "Bearer your-secret-key"},
    json={
        "model": "default",
        "messages": [
            {"role": "user", "content": "Explain GGUF quantization."}
        ],
        "temperature": 0.7,
        "max_tokens": 256,
    }
)

print(response.json()["choices"][0]["message"]["content"])
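Using curl

The same endpoint can be queried from the command line (this assumes the server from section 6 is running on port 8000 with the same API key):

```shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your-secret-key" \
  -d '{
    "model": "default",
    "messages": [{"role": "user", "content": "Explain GGUF quantization."}],
    "temperature": 0.7,
    "max_tokens": 256
  }'
```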

8. Serving Custom / Fine-tuned Models

If you fine-tuned a model with Unsloth and exported to GGUF, serve it directly:

./build/bin/llama-server \
    -m ./gguf_model_small/unsloth.Q4_K_M.gguf \
    --host 0.0.0.0 \
    --port 8000 \
    --api-key your-secret-key \
    --chat-template chatml \
    -ngl 99 \
    -c 2048 \
    --parallel 4

Query the Fine-tuned Model (Python)

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="your-secret-key",
)

response = client.chat.completions.create(
    model="default",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is fine-tuning?"},
    ],
    temperature=0.7,
    max_tokens=256,
)

print(response.choices[0].message.content)

9. Docker Deployment

Deploy llama.cpp in a container for production. Note that the plain :server image is a CPU-only build; for GPU inference use the CUDA variant and pass --gpus all:

docker run --gpus all \
    -v ./models:/models \
    -p 8000:8000 \
    ghcr.io/ggerganov/llama.cpp:server-cuda \
    -m /models/Q4_K_M.gguf \
    --host 0.0.0.0 \
    --port 8000 \
    -ngl 99

Performance Tips

  • Use -ngl 99 to offload all layers to GPU
  • Use --parallel 4 (alias -np 4) so the server can decode multiple requests concurrently
  • Use Q4_K_M for the best quality/size/speed balance
  • Reduce context length (-c) if memory is limited
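Combining these tips into one launch command looks something like the sketch below (flag names per current llama-server; --parallel / -np sets the number of concurrent slots):

```shell
# Full GPU offload, reduced context, 4 parallel slots, Q4_K_M weights.
./build/bin/llama-server \
    -m ./models/Q4_K_M.gguf \
    --host 0.0.0.0 \
    --port 8000 \
    -ngl 99 \
    -c 2048 \
    --parallel 4
```

Keep in mind that the context window is shared across parallel slots, so with -c 2048 and --parallel 4 each request effectively gets about 512 tokens of context unless you raise -c accordingly.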