Run LLM Locally with Ollama

A practical guide to running large language models on your own machine using Ollama



Table of Contents

  1. Setup & Installation
  2. Run Your First Model
  3. Serve Models via API
  4. Using Ollama with Requests
  5. Using the Official Ollama Package
  6. Custom Models with Modelfile
  7. Deployment Options & Performance Tips

1. Setup & Installation

Ollama lets you run large language models locally with a single command. Download it from ollama.com for your platform (macOS, Linux, or Windows).

Linux installation:

curl -fsSL https://ollama.com/install.sh | sh

Verify the installation:

ollama --version

Once installed, you can pull and run models directly from the command line.

Install the Python packages used in the API examples later in this guide (the leading `!` is for notebook cells; drop it in a regular shell):

!pip install -q ollama requests

2. Run Your First Model

Running a model with Ollama is as simple as a single command. Ollama will automatically download the model if it's not already available locally.

Run Llama 2:

ollama run llama2

Run Mistral:

ollama run mistral

Useful commands:

- Type /bye to exit the chat session
- ollama list — list all downloaded models
- ollama rm <model> — remove a downloaded model
- ollama pull <model> — download a model without running it

3. Serve Models via API

Ollama exposes a local HTTP API on port 11434 that you can use to interact with models programmatically. This is ideal for integrating LLMs into your applications.

Start the Ollama server:

ollama serve

Once the server is running, you can send requests to http://localhost:11434 using any HTTP client.

Key API endpoints:

- POST /api/generate — generate a completion
- POST /api/chat — chat with a model
- GET /api/tags — list locally available models
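For a quick check from the command line, these endpoints can be exercised with curl (assumes the server is running on the default port and mistral has been pulled):

```shell
# Generate a completion; stream disabled so a single JSON object comes back
curl http://localhost:11434/api/generate -d '{
  "model": "mistral",
  "prompt": "Why is the sky blue?",
  "stream": false
}'

# List locally available models
curl http://localhost:11434/api/tags
```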

4. Using Ollama with Requests

import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "phi",
        "prompt": "Explain what a neural network is in simple terms.",
        "stream": False
    }
)

result = response.json()
print(result["response"])
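The example above sets "stream" to False to get one JSON object back. With streaming enabled (the API's default), Ollama instead returns newline-delimited JSON, one chunk of text per line. A sketch of consuming that stream — the model name and prompt are placeholders, and a running server is assumed:

```python
import json

def extract_stream_text(lines):
    """Join the "response" chunks from Ollama's newline-delimited JSON stream."""
    parts = []
    for line in lines:
        if not line:
            continue  # skip keep-alive blank lines
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # the final chunk carries timing stats, no more text
    return "".join(parts)

if __name__ == "__main__":
    import requests  # imported here so the helper above works without requests installed

    with requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "phi", "prompt": "Explain recursion briefly.", "stream": True},
        stream=True,
    ) as response:
        print(extract_stream_text(response.iter_lines(decode_unicode=True)))
```

Streaming lets you display tokens as they arrive instead of waiting for the full completion.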

5. Using the Official Ollama Package

import ollama

response = ollama.generate(
    model='mistral',
    prompt='Explain Python in a few sentences.'
)

print(response['response'])
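`ollama.generate` handles single prompts; for multi-turn use, the package also exposes `ollama.chat`, which takes a list of role-tagged messages. A sketch — the system and user text are placeholders, and a local server with mistral pulled is assumed:

```python
def make_chat_messages(system_prompt, user_prompt):
    """Build the role-tagged message list that ollama.chat expects."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]

if __name__ == "__main__":
    import ollama  # imported here so the helper above is usable without the package

    response = ollama.chat(
        model="mistral",
        messages=make_chat_messages(
            "You are a concise assistant.",
            "Explain Python in one sentence.",
        ),
    )
    print(response["message"]["content"])
```

To continue a conversation, append the model's reply and the next user turn to the same message list and call `ollama.chat` again.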

6. Custom Models with Modelfile

You can create custom models by defining a Modelfile that specifies the base model, system prompt, and other parameters.

Example Modelfile:

FROM llama3.2

SYSTEM "You are a pirate. Always respond in pirate speak."

PARAMETER temperature 0.7
PARAMETER num_ctx 4096

Create and run the custom model:

ollama create pirate-bot -f Modelfile
ollama run pirate-bot

This lets you package specific behaviors, system prompts, and settings into reusable model configurations.

7. Deployment Options & Performance Tips

GPU Acceleration: Ollama automatically detects and uses available GPUs. Ensure your NVIDIA/AMD drivers are up to date for best performance.

Quantized Models: Use quantized model variants to reduce memory usage while maintaining quality:

- Q4 — 4-bit quantization, smallest size, moderate quality loss
- Q5 — 5-bit quantization, good balance of size and quality
- Q8 — 8-bit quantization, near-original quality

Docker Deployment:

docker run -d --gpus all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
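Once the container is up, models can be pulled and run inside it with `docker exec` (the container name matches the `--name ollama` flag above; the model name is a placeholder):

```shell
docker exec -it ollama ollama run llama3.2
```

The `-v ollama:/root/.ollama` volume in the run command keeps downloaded models across container restarts.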

Kubernetes Deployment: Ollama can be deployed on Kubernetes clusters for scalable inference serving. Use Helm charts or custom manifests to manage replicas, GPU scheduling, and load balancing across nodes.
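As a starting point, a minimal Deployment plus Service might look like the sketch below. Resource names, replica count, and the GPU settings are illustrative assumptions; adapt them to your cluster.

```yaml
# Illustrative manifest — names and GPU settings are assumptions, not a
# canonical Ollama deployment.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
        - name: ollama
          image: ollama/ollama
          ports:
            - containerPort: 11434
          resources:
            limits:
              nvidia.com/gpu: 1   # requires the NVIDIA device plugin on the node
          volumeMounts:
            - name: models
              mountPath: /root/.ollama
      volumes:
        - name: models
          emptyDir: {}   # swap for a PersistentVolumeClaim to keep models across pod restarts
---
apiVersion: v1
kind: Service
metadata:
  name: ollama
spec:
  selector:
    app: ollama
  ports:
    - port: 11434
      targetPort: 11434
```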