Deploying and Serving LLMs with llama.cpp

End-to-end guide: deploy and serve LLMs locally and at scale with llama.cpp for efficient CPU/GPU inference
Keywords

llama.cpp, LLM serving, model deployment, GGUF, quantization, inference optimization, CPU inference, GPU inference, OpenAI API, production LLM


Introduction

Not every LLM deployment requires a high-end GPU cluster. Many real-world use cases — edge devices, local development, cost-sensitive production — demand efficient inference on commodity hardware.

llama.cpp is an open-source C/C++ inference engine that makes LLM deployment:

  • Hardware-flexible (runs on CPU, GPU, Apple Silicon, and hybrid CPU+GPU)
  • Memory-efficient (via aggressive quantization — 2-bit to 8-bit GGUF formats)
  • Production-ready (built-in OpenAI-compatible API server)
  • Portable (no Python runtime or CUDA toolkit required at inference time)
  • Easy to deploy (single binary, Docker, embedded systems)

In this tutorial, we will walk through a complete pipeline:

  1. Install and build llama.cpp
  2. Obtain and quantize models in GGUF format
  3. Run offline inference from the CLI
  4. Serve a model with the OpenAI-compatible API server
  5. Query the model from Python
  6. Optimize for production deployment

graph LR
    A["Install<br/>llama.cpp"] --> B["Obtain GGUF<br/>model"]
    B --> C["CLI inference<br/>or API server"]
    C --> D["Query from<br/>Python / curl"]
    D --> E["Deploy to<br/>production"]

    style A fill:#ffce67,stroke:#333
    style B fill:#ffce67,stroke:#333
    style C fill:#6cc3d5,stroke:#333,color:#fff
    style D fill:#6cc3d5,stroke:#333,color:#fff
    style E fill:#56cc9d,stroke:#333,color:#fff

What is llama.cpp?

llama.cpp is a high-performance C/C++ inference engine for LLMs, originally created by Georgi Gerganov. Key features include:

  • GGUF format: Purpose-built model format with embedded metadata, supporting a wide range of quantization levels
  • Quantization: Reduce model size and memory usage with minimal quality loss (Q2_K through Q8_0)
  • CPU optimization: AVX, AVX2, AVX-512, ARM NEON for fast inference without a GPU
  • GPU offloading: Offload layers to NVIDIA (CUDA), AMD (ROCm), Apple Metal, or Vulkan GPUs
  • Built-in server: OpenAI-compatible HTTP API with continuous batching
  • Support for many models: Llama, Mistral, Qwen, Phi, Gemma, DeepSeek, and more

graph TD
    A["llama.cpp"] --> B["GGUF Format<br/>Embedded metadata"]
    A --> C["Quantization<br/>Q2_K to Q8_0"]
    A --> D["CPU Optimized<br/>AVX / AVX2 / NEON"]
    A --> E["GPU Offloading<br/>CUDA / Metal / Vulkan"]
    A --> F["Built-in Server<br/>OpenAI-compatible"]

    style A fill:#56cc9d,stroke:#333,color:#fff
    style B fill:#6cc3d5,stroke:#333,color:#fff
    style C fill:#6cc3d5,stroke:#333,color:#fff
    style D fill:#6cc3d5,stroke:#333,color:#fff
    style E fill:#6cc3d5,stroke:#333,color:#fff
    style F fill:#6cc3d5,stroke:#333,color:#fff

Hardware Requirements

llama.cpp is designed to run on a wide range of hardware, from laptops to servers:

| Model Size | Quantization | RAM/VRAM Needed | Recommended Hardware |
|------------|--------------|-----------------|----------------------|
| 0.5B–3B | Q4_K_M | 2–3 GB | Any modern CPU / Raspberry Pi 5 |
| 7B–8B | Q4_K_M | 5–6 GB | 16 GB RAM laptop / RTX 3060 |
| 13B | Q4_K_M | 9–10 GB | 32 GB RAM / RTX 4090 |
| 70B | Q4_K_M | 40–45 GB | 64 GB RAM / A100 (multi-GPU) |

Key advantage: llama.cpp can run entirely on CPU, making it ideal for environments without GPUs.

Installation

Build from Source

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

CPU-only build:

cmake -B build
cmake --build build --config Release

With CUDA support:

cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release

With Apple Metal support:

cmake -B build -DGGML_METAL=ON
cmake --build build --config Release

Verify installation

./build/bin/llama-cli --version

Obtaining GGUF Models

graph TD
    A["Need a GGUF model"] --> B{"Source?"}
    B -->|"Pre-quantized"| C["Download from<br/>HuggingFace"]
    B -->|"Own model"| D["Convert HF model<br/>to GGUF (F16)"]
    D --> E["Quantize to<br/>Q4_K_M / Q5_K_M"]
    C --> F["Ready for<br/>inference"]
    E --> F

    style A fill:#f8f9fa,stroke:#333
    style C fill:#56cc9d,stroke:#333,color:#fff
    style D fill:#ffce67,stroke:#333
    style E fill:#ffce67,stroke:#333
    style F fill:#56cc9d,stroke:#333,color:#fff

Download Pre-quantized Models from HuggingFace

Many models are available pre-quantized in GGUF format:

# Using huggingface-cli
pip install huggingface_hub
huggingface-cli download unsloth/Qwen3-0.6B-GGUF Q4_K_M.gguf --local-dir ./models

Quantize a Model Yourself

If you have a model in HuggingFace (safetensors) format, convert and quantize it:

# Step 1: Convert to GGUF (F16)
python convert_hf_to_gguf.py ./hf_model_dir --outfile model-f16.gguf --outtype f16

# Step 2: Quantize to Q4_K_M
./build/bin/llama-quantize model-f16.gguf model-Q4_K_M.gguf Q4_K_M

Common Quantization Levels

| Quantization | Bits | Size (7B) | Quality | Speed |
|--------------|------|-----------|---------|-------|
| Q2_K | 2 | ~2.7 GB | Low | Fastest |
| Q4_K_M | 4 | ~4.1 GB | Good | Fast |
| Q5_K_M | 5 | ~4.8 GB | Very Good | Medium |
| Q6_K | 6 | ~5.5 GB | Excellent | Medium |
| Q8_0 | 8 | ~7.2 GB | Near-FP16 | Slower |
| F16 | 16 | ~13.5 GB | Lossless | Slowest |

Recommendation: Q4_K_M offers the best balance of quality, size, and speed for most use cases.
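As a back-of-the-envelope check, a quantized GGUF file weighs roughly parameters × effective bits-per-weight / 8. The sketch below uses approximate effective bits-per-weight values (my own rough averages, not exact format constants — K-quants keep some tensors at higher precision, so the effective value exceeds the nominal bit width):

```python
# Rough GGUF size estimate: params x effective bits-per-weight / 8.
# The bits-per-weight values are approximate averages (assumed), since
# K-quants mix precisions across tensors and metadata adds overhead.
QUANT_BPW = {
    "Q2_K": 3.1, "Q4_K_M": 4.8, "Q5_K_M": 5.5,
    "Q6_K": 6.3, "Q8_0": 8.2, "F16": 15.5,
}

def estimate_gguf_gb(n_params: float, quant: str) -> float:
    """Approximate GGUF file size in GB (decimal) for n_params weights."""
    return n_params * QUANT_BPW[quant] / 8 / 1e9

print(f"7B @ Q4_K_M ~ {estimate_gguf_gb(7e9, 'Q4_K_M'):.1f} GB")  # -> ~4.2 GB
```

The estimates land within a few percent of the table above; add roughly 1 GB of headroom on top of the file size for the KV cache and runtime buffers.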

Offline Inference (CLI)

Use llama.cpp for fast inference directly from the command line.

Basic Text Generation

./build/bin/llama-cli \
    -m ./models/Q4_K_M.gguf \
    -p "Explain machine learning in simple terms." \
    -n 256 \
    --temp 0.7

Interactive Chat Mode

./build/bin/llama-cli \
    -m ./models/Q4_K_M.gguf \
    --chat-template chatml \
    -cnv \
    --temp 0.7

With GPU Offloading

Offload layers to GPU for faster inference (use -ngl to specify number of layers):

./build/bin/llama-cli \
    -m ./models/Q4_K_M.gguf \
    -p "What is the difference between AI and ML?" \
    -n 256 \
    -ngl 99 \
    --temp 0.7

-ngl 99 offloads all layers to GPU. Use a lower number for partial offloading when VRAM is limited.
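When VRAM is limited, a rough heuristic for picking -ngl is to divide usable VRAM by the approximate per-layer size. This is a hypothetical helper of my own (real layer sizes vary, and the KV cache and compute buffers also need VRAM), so treat the result as a starting point:

```python
def suggest_ngl(model_file_gb: float, n_layers: int, vram_gb: float,
                reserve_gb: float = 1.5) -> int:
    """Heuristic starting point for -ngl: assumes layers are equally sized
    and reserves some VRAM for the KV cache and compute buffers."""
    per_layer_gb = model_file_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable_gb / per_layer_gb))

# 7B Q4_K_M (~4.1 GB, 32 layers) on an 8 GB GPU -> all 32 layers fit
print(suggest_ngl(4.1, 32, 8.0))   # -> 32
# Same model on a 4 GB GPU -> partial offload
print(suggest_ngl(4.1, 32, 4.0))   # -> 19
```

If the server crashes with an out-of-memory error, lower the value and retry; llama.cpp keeps the remaining layers on the CPU.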

Batch Inference with Python

from llama_cpp import Llama

llm = Llama(
    model_path="./models/Q4_K_M.gguf",
    n_ctx=4096,
    n_gpu_layers=-1,  # -1 = offload all layers to GPU
)

prompts = [
    "Explain machine learning in simple terms.",
    "What is the difference between AI and ML?",
    "Write a Python function to reverse a string.",
]

for prompt in prompts:
    output = llm(
        prompt,
        max_tokens=256,
        temperature=0.7,
    )
    print(output["choices"][0]["text"])
    print("---")

Chat-style Inference with Python

from llama_cpp import Llama

llm = Llama(
    model_path="./models/Q4_K_M.gguf",
    n_ctx=4096,
    n_gpu_layers=-1,
    chat_format="chatml",
)

messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Explain llama.cpp in simple terms."},
]

output = llm.create_chat_completion(
    messages=messages,
    max_tokens=256,
    temperature=0.7,
)

print(output["choices"][0]["message"]["content"])

Serving with OpenAI-Compatible API

llama.cpp includes a built-in HTTP server (llama-server) whose API is compatible with the OpenAI Chat Completions format, so existing OpenAI clients work with little more than a base-URL change.

graph TD
    A["llama-server"] --> B["Load GGUF model"]
    B --> C["OpenAI-compatible<br/>/v1/chat/completions"]
    C --> D["curl"]
    C --> E["Python requests"]
    C --> F["OpenAI Python client"]

    G["Options"] --> H["-ngl: GPU layers"]
    G --> I["-c: Context length"]
    G --> J["--parallel: Concurrent slots"]

    style A fill:#56cc9d,stroke:#333,color:#fff
    style B fill:#6cc3d5,stroke:#333,color:#fff
    style C fill:#6cc3d5,stroke:#333,color:#fff
    style D fill:#f8f9fa,stroke:#333
    style E fill:#f8f9fa,stroke:#333
    style F fill:#f8f9fa,stroke:#333
    style G fill:#ffce67,stroke:#333

Start the Server (C++ binary)

./build/bin/llama-server \
    -m ./models/Q4_K_M.gguf \
    --host 0.0.0.0 \
    --port 8000 \
    --api-key your-secret-key \
    --chat-template chatml \
    -ngl 99 \
    -c 4096 \
    --parallel 4

Key options explained:

  • -m: Path to the GGUF model file
  • -ngl 99: Number of layers to offload to GPU (99 = all layers)
  • -c 4096: Context length (max tokens for input + output)
  • --parallel 4: Number of concurrent request slots (alias -np; controls parallelism)
  • --chat-template: Chat template format (chatml, llama2, mistral, etc.)
  • --api-key: API key for authentication
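For scripted deployments it can be convenient to assemble the server command line in Python and launch it as a child process. This is just a convenience sketch around the flags above; the helper name and defaults are my own:

```python
import subprocess

def build_server_cmd(model_path: str, host: str = "0.0.0.0", port: int = 8000,
                     ngl: int = 99, ctx: int = 4096, api_key: str = ""):
    """Assemble an argv list for llama-server from keyword options."""
    cmd = ["./build/bin/llama-server", "-m", model_path,
           "--host", host, "--port", str(port),
           "-ngl", str(ngl), "-c", str(ctx)]
    if api_key:
        cmd += ["--api-key", api_key]
    return cmd

# Uncomment to actually start the server as a child process:
# proc = subprocess.Popen(build_server_cmd("./models/Q4_K_M.gguf",
#                                          api_key="your-secret-key"))
```

Passing an argv list (rather than a shell string) avoids quoting issues with paths that contain spaces.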

Start the Server (Python)

python -m llama_cpp.server \
    --model ./models/Q4_K_M.gguf \
    --host 0.0.0.0 \
    --port 8000 \
    --n_gpu_layers -1 \
    --chat_format chatml \
    --n_ctx 4096

Verify the Server

curl http://localhost:8000/v1/models
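In scripts you usually want to wait until the model has finished loading before sending requests. llama-server exposes a /health endpoint that answers 200 once it is ready; here is a small stdlib-only poller (function names are my own):

```python
import time
import urllib.error
import urllib.request

def server_ready(base_url: str) -> bool:
    """True once llama-server's /health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False  # not up yet (connection refused) or still loading

def wait_for_server(base_url: str, timeout_s: float = 120.0,
                    poll_s: float = 1.0) -> bool:
    """Poll /health until the server is ready or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if server_ready(base_url):
            return True
        time.sleep(poll_s)
    return False

# wait_for_server("http://localhost:8000")  # blocks until the model is loaded
```

Large models can take tens of seconds to load from disk, so a generous timeout is worthwhile.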

Querying the API

Using curl

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer your-secret-key" \
    -d '{
        "model": "default",
        "messages": [
            {"role": "user", "content": "What is llama.cpp?"}
        ],
        "temperature": 0.7,
        "max_tokens": 256
    }'

Using Python (requests)

import requests

response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    headers={"Authorization": "Bearer your-secret-key"},
    json={
        "model": "default",
        "messages": [
            {"role": "user", "content": "Explain GGUF quantization."}
        ],
        "temperature": 0.7,
        "max_tokens": 256,
    }
)

print(response.json()["choices"][0]["message"]["content"])
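For interactive use you will usually want streaming. With "stream": true in the request body, the server sends Server-Sent Events; the parser below is a minimal stdlib sketch (helper name is my own) that consumes the byte lines yielded by, e.g., requests' iter_lines():

```python
import json

def iter_stream_tokens(lines):
    """Yield content deltas from an OpenAI-style SSE byte stream, e.g.
    requests.post(..., json={..., "stream": True}, stream=True).iter_lines()."""
    for raw in lines:
        if not raw or not raw.startswith(b"data: "):
            continue  # skip keep-alive blanks and non-data lines
        payload = raw[len(b"data: "):]
        if payload.strip() == b"[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        delta = chunk["choices"][0].get("delta", {})
        if delta.get("content"):
            yield delta["content"]

# for tok in iter_stream_tokens(resp.iter_lines()):
#     print(tok, end="", flush=True)
```

The official OpenAI Python client hides this parsing behind stream=True; the sketch is useful when you only depend on requests.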

Serving Custom / Fine-tuned Models

If you fine-tuned a small LLM with Unsloth and exported it to GGUF (e.g., gguf_model_small), here is how to serve it with llama.cpp.

graph TD
    A["Fine-tuned model"] --> B{"Format?"}
    B -->|"Already GGUF"| C["Serve directly<br/>with llama-server"]
    B -->|"HF safetensors"| D["Convert to GGUF<br/>(convert_hf_to_gguf.py)"]
    B -->|"LoRA adapter"| E["Serve with<br/>--lora flag"]
    D --> F["Quantize<br/>(llama-quantize)"]
    F --> C
    C --> G["OpenAI-compatible API"]
    E --> G

    style A fill:#f8f9fa,stroke:#333
    style C fill:#56cc9d,stroke:#333,color:#fff
    style D fill:#ffce67,stroke:#333
    style F fill:#ffce67,stroke:#333
    style E fill:#6cc3d5,stroke:#333,color:#fff
    style G fill:#56cc9d,stroke:#333,color:#fff

Option A: Serve a GGUF File Directly

Step 1: Prepare Your GGUF Model

After fine-tuning with Unsloth and exporting to GGUF, you should have a file like:

gguf_model_small/
├── added_tokens.json
├── chat_template.jinja
├── config.json
├── generation_config.json
├── merges.txt
├── model.safetensors
├── special_tokens_map.json
├── tokenizer.json
├── tokenizer_config.json
├── unsloth.BF16.gguf
├── unsloth.Q4_K_M.gguf
└── vocab.json

Step 2: Serve with llama.cpp

./build/bin/llama-server \
    -m ./gguf_model_small/unsloth.Q4_K_M.gguf \
    --host 0.0.0.0 \
    --port 8000 \
    --api-key your-secret-key \
    --chat-template chatml \
    -ngl 99 \
    -c 2048 \
    --parallel 4

Step 3: Verify and Query

curl http://localhost:8000/v1/models \
    -H "Authorization: Bearer your-secret-key"

Then query it with the OpenAI Python client:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="your-secret-key",
)

response = client.chat.completions.create(
    model="default",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is fine-tuning?"},
    ],
    temperature=0.7,
    max_tokens=256,
)

print(response.choices[0].message.content)

Option B: Convert from HuggingFace Format

If your model is in HuggingFace safetensors format, convert it first:

# Convert to GGUF
python convert_hf_to_gguf.py ./hf_model_small --outfile my-model-f16.gguf --outtype f16

# Quantize
./build/bin/llama-quantize my-model-f16.gguf my-model-Q4_K_M.gguf Q4_K_M

# Serve
./build/bin/llama-server \
    -m my-model-Q4_K_M.gguf \
    --host 0.0.0.0 \
    --port 8000 \
    --api-key your-secret-key \
    --chat-template chatml \
    -ngl 99 \
    -c 2048

Serve with LoRA Adapters

llama.cpp supports applying LoRA adapters at inference time without merging:

./build/bin/llama-server \
    -m ./models/base-model-Q4_K_M.gguf \
    --lora ./lora-adapter.gguf \
    --host 0.0.0.0 \
    --port 8000 \
    --api-key your-secret-key \
    -ngl 99

Then query normally:

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Hello!"}],
)

Docker Deployment

Deploy llama.cpp in a container for production environments.

graph LR
    subgraph cpu["CPU Deployment"]
        A1["llama.cpp:server"] --> B1["Mount models<br/>volume"]
        B1 --> C1["Expose port 8000"]
    end
    subgraph gpu["GPU Deployment"]
        A2["llama.cpp:server-cuda"] --> B2["--gpus all<br/>+ mount models"]
        B2 --> C2["Expose port 8000"]
    end

    style cpu fill:#6cc3d5,stroke:#333,color:#fff
    style gpu fill:#56cc9d,stroke:#333,color:#fff

Run with Docker (CPU)

docker run -p 8000:8000 \
    -v ./models:/models \
    ghcr.io/ggerganov/llama.cpp:server \
    -m /models/Q4_K_M.gguf \
    --host 0.0.0.0 \
    --port 8000 \
    -c 4096

Run with Docker (CUDA GPU)

docker run --gpus all -p 8000:8000 \
    -v ./models:/models \
    ghcr.io/ggerganov/llama.cpp:server-cuda \
    -m /models/Q4_K_M.gguf \
    --host 0.0.0.0 \
    --port 8000 \
    -ngl 99 \
    -c 4096

Docker Compose

version: '3.8'
services:
  llama-server:
    image: ghcr.io/ggerganov/llama.cpp:server-cuda
    ports:
      - "8000:8000"
    volumes:
      - ./models:/models
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    command: >
      -m /models/Q4_K_M.gguf
      --host 0.0.0.0
      --port 8000
      -ngl 99
      -c 4096
      --parallel 4

Performance Optimization Tips

  • GPU offloading: Use -ngl 99 to offload all model layers to GPU; use a lower value for partial offloading when VRAM is limited
  • Context length: Set -c to the minimum context length you need — larger context uses more memory
  • Concurrent slots: Use --parallel N (alias -np N) to control how many requests can be processed in parallel
  • Quantization choice: Q4_K_M is the sweet spot for most use cases; use Q5_K_M or Q6_K if quality matters more than speed
  • Memory mapping: llama.cpp uses mmap by default, allowing models to be loaded without duplicating in RAM
  • Batch size: Use -b to tune the prompt processing batch size (default 2048); larger values can speed up prompt processing
  • Flash attention: Use --flash-attn to enable Flash Attention for faster inference (if supported)
  • Streaming: Use stream=True for real-time token generation

llama.cpp vs Other Serving Solutions

| Feature | llama.cpp | vLLM | Ollama | TGI |
|---------|-----------|------|--------|-----|
| CPU Inference | Excellent | No | Good | No |
| GPU Inference | Good | Excellent | Good | Excellent |
| Throughput | Low–Medium | Very High | Medium | High |
| GPU Required | No | Yes | No | Yes |
| OpenAI API | Yes | Yes | Partial | Yes |
| Multi-GPU | Limited | Yes | No | Yes |
| Quantization | Extensive (GGUF) | AWQ/GPTQ | GGUF | AWQ/GPTQ |
| Ease of Use | Medium | Medium | Easy | Medium |
| Best For | Edge/CPU/Local | Production GPU | Local Dev | Production GPU |

Conclusion

llama.cpp is the go-to solution for flexible, hardware-efficient LLM deployment:

  • Runs on CPU, GPU, Apple Silicon, and hybrid configurations
  • Supports extensive quantization for reduced memory and faster inference
  • Provides an OpenAI-compatible API server for seamless integration
  • Handles custom and fine-tuned models in GGUF format
  • Deploys easily with Docker or as a single binary

This workflow is perfect for:

  • Local development and prototyping
  • Edge and embedded AI deployments
  • Cost-sensitive production environments
  • CPU-only server deployments
  • Laptop and desktop AI applications

Read More

  • Combine with a RAG pipeline (LangChain + llama.cpp)
  • Add load balancing with Nginx or Traefik
  • Deploy on Kubernetes with mixed CPU/GPU node pools
  • Monitor with Prometheus + Grafana (llama.cpp exposes /metrics)
  • Explore speculative decoding for faster inference
  • Use grammar-constrained generation for structured output (JSON mode)
