Quantization Methods for LLMs

A hands-on guide to GPTQ, AWQ, GGUF, and bitsandbytes — reducing LLM memory footprint while preserving quality


Table of Contents

  1. Setup & Installation
  2. Why Quantize?
  3. GPTQ Quantization
  4. Load Pre-quantized GPTQ
  5. AWQ Quantization
  6. GGUF in Transformers
  7. bitsandbytes 8-bit
  8. bitsandbytes 4-bit NF4 (QLoRA)
  9. Memory Usage Comparison
  10. When to Use What

1. Setup & Installation

Install the required packages for LLM quantization.

!pip install -q transformers torch accelerate optimum gptqmodel autoawq gguf bitsandbytes

2. Why Quantize?

Quantization reduces the numerical precision of model weights, dramatically cutting memory usage and often improving inference speed.

Precision vs Memory

| Precision | Bytes/Param | 7B Model | 13B Model | 70B Model |
|---|---|---|---|---|
| FP32 | 4 | 28 GB | 52 GB | 280 GB |
| FP16 / BF16 | 2 | 14 GB | 26 GB | 140 GB |
| INT8 | 1 | 7 GB | 13 GB | 70 GB |
| INT4 | 0.5 | 3.5 GB | 6.5 GB | 35 GB |

Key Trade-offs

  • Lower precision = less memory + faster inference
  • Higher precision = better accuracy + more memory
  • Modern quantization methods (GPTQ, AWQ) minimize quality loss
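The trade-off can be made concrete with a toy example: symmetric per-tensor 8-bit quantization of one weight matrix, using plain PyTorch (a minimal sketch, not any of the libraries below — real methods quantize per-group and handle outliers):

```python
import torch

def quantize_int8(w: torch.Tensor):
    """Symmetric per-tensor int8 quantization: w ≈ scale * q."""
    scale = w.abs().max() / 127  # map the largest weight to ±127
    q = torch.round(w / scale).clamp(-127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor):
    return q.to(torch.float32) * scale

w = torch.randn(4096, 4096)          # one FP32 weight matrix
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

fp32_mb = w.nelement() * 4 / 2**20   # 4 bytes per parameter
int8_mb = q.nelement() * 1 / 2**20   # 1 byte per parameter (plus one scale)
err = (w - w_hat).abs().mean().item()

print(f"FP32: {fp32_mb:.0f} MB, INT8: {int8_mb:.0f} MB, mean abs error: {err:.5f}")
```

Storage drops 4x while the per-weight rounding error stays below half a quantization step; GPTQ and AWQ exist to make that error matter as little as possible for the model's outputs.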

3. GPTQ Quantization

GPTQ is a post-training quantization method that uses a calibration dataset to find optimal quantized weights layer by layer, minimizing the output error.

from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Configure GPTQ quantization
gptq_config = GPTQConfig(
    bits=4,                    # Quantize to 4 bits
    dataset="c4",              # Calibration dataset
    tokenizer=tokenizer,
)

# Load and quantize the model
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=gptq_config,
    device_map="auto",
)

print("Model loaded with GPTQ 4-bit quantization")
print(f"Model dtype: {model.dtype}")

# Save the quantized model
# model.save_pretrained("./opt-125m-gptq-4bit")
# tokenizer.save_pretrained("./opt-125m-gptq-4bit")

4. Load Pre-quantized GPTQ

Many GPTQ-quantized models are available on the Hugging Face Hub, ready to use without running quantization yourself.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load a pre-quantized GPTQ model (any GPTQ repo from the Hub works here)
quantized_model_id = "TheBloke/TinyLlama-1.1B-Chat-v1.0-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(quantized_model_id)
model = AutoModelForCausalLM.from_pretrained(
    quantized_model_id,
    device_map="auto",
)

# Generate text
inputs = tokenizer("Quantization reduces model size by", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))

5. AWQ Quantization

AWQ (Activation-aware Weight Quantization) identifies the most important weights by analyzing activations and protects them during quantization. It supports kernel fusion for faster inference.

from transformers import AutoModelForCausalLM, AutoTokenizer, AwqConfig

# Load a pre-quantized AWQ model
awq_model_id = "TheBloke/Mistral-7B-OpenOrca-AWQ"

# AWQ with kernel fusion for faster inference
awq_config = AwqConfig(
    do_fuse=True,
    fuse_max_seq_len=512,
)

tokenizer = AutoTokenizer.from_pretrained(awq_model_id)
model = AutoModelForCausalLM.from_pretrained(
    awq_model_id,
    quantization_config=awq_config,
    device_map="auto",
)

print("AWQ model loaded with kernel fusion enabled")

# Generate text
inputs = tokenizer("The benefits of quantization include", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))

6. GGUF in Transformers

GGUF is the file format used by llama.cpp. Transformers can load GGUF files directly, making it easy to use community-quantized models. Note that Transformers dequantizes the weights to full precision on load, so this path is about interoperability and fine-tuning GGUF checkpoints rather than memory savings.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load a GGUF quantized model directly in Transformers
gguf_model_id = "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF"
gguf_file = "tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf"

tokenizer = AutoTokenizer.from_pretrained(gguf_model_id, gguf_file=gguf_file)
model = AutoModelForCausalLM.from_pretrained(
    gguf_model_id,
    gguf_file=gguf_file,
    device_map="auto",
)

print(f"GGUF model loaded: {gguf_file}")

# Generate text
inputs = tokenizer("Explain quantization in simple terms:", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))

7. bitsandbytes 8-bit

bitsandbytes provides on-the-fly quantization when loading models: no calibration dataset, just a config. 8-bit mode uses LLM.int8(), which keeps outlier activation features in FP16 to preserve quality.

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 8-bit quantization config
bnb_config_8bit = BitsAndBytesConfig(
    load_in_8bit=True,
)

model_id = "facebook/opt-350m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config_8bit,
    device_map="auto",
)

print("Model loaded in 8-bit precision")

# Generate text
inputs = tokenizer("Machine learning models can be compressed by", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))

8. bitsandbytes 4-bit NF4 (QLoRA)

NF4 (NormalFloat 4-bit) is a quantization type optimized for normally distributed weights. Combined with double quantization, it powers QLoRA — enabling fine-tuning of quantized models.
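The "~0.4 bits per parameter" saved by double quantization comes from quantizing the per-block scales themselves. A back-of-the-envelope check, using the block sizes from the QLoRA paper (64 weights per scale, 256 scales per second-level constant):

```python
# NF4 stores one FP32 scale per block of 64 weights.
bits_plain = 4 + 32 / 64                    # 4.5 bits per parameter

# Double quantization: quantize those scales to 8-bit,
# keeping one FP32 constant per group of 256 blocks.
bits_double = 4 + 8 / 64 + 32 / (64 * 256)  # ≈ 4.127 bits per parameter

print(f"NF4:                {bits_plain} bits/param")
print(f"NF4 + double quant: {bits_double:.3f} bits/param")
print(f"Savings:            {bits_plain - bits_double:.3f} bits/param")
```

The savings work out to ~0.37 bits per parameter, which is the figure usually rounded to 0.4.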

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 quantization with double quantization (QLoRA config)
bnb_config_4bit = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # NormalFloat 4-bit
    bnb_4bit_compute_dtype=torch.bfloat16,  # Compute in BF16
    bnb_4bit_use_double_quant=True,         # Double quantization saves ~0.4 bits/param
)

model_id = "facebook/opt-350m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config_4bit,
    device_map="auto",
)

print("Model loaded in 4-bit NF4 precision (QLoRA-ready)")

# Generate text
inputs = tokenizer("Neural network quantization", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))

9. Memory Usage Comparison

A practical comparison of memory footprints across different quantization methods using the same base model.

import torch
import gc
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "facebook/opt-125m"
prompt = "Artificial intelligence is transforming"


def get_model_memory_mb(model):
    """Estimate model memory usage in MB."""
    total_bytes = sum(
        p.nelement() * p.element_size() for p in model.parameters()
    )
    return total_bytes / (1024 * 1024)


def load_and_test(model_id, prompt, config_name, quantization_config=None, dtype=None):
    """Load model, measure memory, generate text."""
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    load_kwargs = {"device_map": "auto"}
    if quantization_config:
        load_kwargs["quantization_config"] = quantization_config
    if dtype:
        load_kwargs["torch_dtype"] = dtype

    model = AutoModelForCausalLM.from_pretrained(model_id, **load_kwargs)

    memory_mb = get_model_memory_mb(model)

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=30)
    text = tokenizer.decode(output[0], skip_special_tokens=True)

    print(f"\n{'='*60}")
    print(f"Config: {config_name}")
    print(f"Memory: {memory_mb:.1f} MB")
    print(f"Output: {text}")

    # Cleanup
    del model
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

    return {"config": config_name, "memory_mb": memory_mb, "output": text}


results = []

# 1. FP16
results.append(load_and_test(
    model_id, prompt, "FP16",
    dtype=torch.float16,
))

# 2. 8-bit
results.append(load_and_test(
    model_id, prompt, "INT8 (bitsandbytes)",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
))

# 3. 4-bit NF4
results.append(load_and_test(
    model_id, prompt, "NF4 4-bit (bitsandbytes)",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_use_double_quant=True,
    ),
))

# Summary table
print(f"\n\n{'='*60}")
print(f"{'Config':<30} {'Memory (MB)':>12}")
print(f"{'-'*30} {'-'*12}")
for r in results:
    print(f"{r['config']:<30} {r['memory_mb']:>10.1f} MB")

10. When to Use What

Decision Rules

  1. Quick experimentation → bitsandbytes (zero-setup, just add a config)
  2. Production inference serving → GPTQ or AWQ (optimized kernels, best throughput)
  3. CPU / edge deployment → GGUF via llama.cpp (ubiquitous, cross-platform)
  4. Fine-tuning a large model → bitsandbytes 4-bit NF4 + LoRA (QLoRA)
  5. Maximum quality preservation → AWQ (activation-aware, best quality/size ratio)
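The same rules, restated as a tiny lookup helper (the function name and scenario keys are made up for illustration; this is just the list above in executable form):

```python
def pick_quantization(use_case: str) -> str:
    """Map a deployment scenario to a quantization method (decision rules above)."""
    rules = {
        "experiment": "bitsandbytes (8-bit or 4-bit NF4)",
        "serving": "GPTQ or AWQ",
        "cpu_edge": "GGUF via llama.cpp",
        "fine_tuning": "bitsandbytes 4-bit NF4 + LoRA (QLoRA)",
        "max_quality": "AWQ",
    }
    return rules[use_case]

print(pick_quantization("fine_tuning"))
```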

Comparison Table

| Method | Bits | Calibration | Speed | Quality | Fine-tuning | Best For |
|---|---|---|---|---|---|---|
| GPTQ | 2-8 | Required | Fast (GPU) | Good | Limited | GPU inference serving |
| AWQ | 4 | Required | Fastest (fused) | Best | Limited | High-throughput inference |
| GGUF | 2-8 | Pre-computed | Moderate | Good | No | CPU / edge / llama.cpp |
| bnb 8-bit | 8 | None | Good | Very Good | Yes | Quick experiments |
| bnb 4-bit NF4 | 4 | None | Good | Good | Yes (QLoRA) | Fine-tuning large models |