!pip install -q ollama requests
Run LLM Locally with Ollama
A practical guide to running large language models on your own machine using Ollama
Table of Contents
1. Setup & Installation
2. Run Your First Model
3. Serve Models via API
4. Using Ollama with Requests
5. Using the Official Ollama Package
6. Custom Models with Modelfile
7. Deployment Options & Performance Tips
1. Setup & Installation
Ollama lets you run large language models locally with a single command. Download it from ollama.com for your platform (macOS, Linux, or Windows).
Linux installation:
curl -fsSL https://ollama.com/install.sh | sh
Verify the installation:
ollama --version
Once installed, you can pull and run models directly from the command line.
2. Run Your First Model
Running a model with Ollama is as simple as a single command. Ollama will automatically download the model if it's not already available locally.
Run Llama 2:
ollama run llama2
Run Mistral:
ollama run mistral
Useful commands:
- Type /bye to exit the chat session
- ollama list – list all downloaded models
- ollama rm <model> – remove a downloaded model
- ollama pull <model> – download a model without running it
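These commands can also be driven from a script. A minimal sketch, assuming the tabular output of `ollama list` (model names in the first column, one header row) — the sample string below is illustrative, not captured from a real run:

```python
import subprocess

def installed_models(listing: str) -> list[str]:
    """Parse `ollama list` output; model names are the first column, after a header row."""
    lines = listing.strip().splitlines()
    return [line.split()[0] for line in lines[1:] if line.split()]

def ollama_list() -> list[str]:
    """Run `ollama list` and return the installed model names."""
    out = subprocess.run(["ollama", "list"], capture_output=True, text=True, check=True)
    return installed_models(out.stdout)

# Illustrative output shape:
sample = """NAME            ID      SIZE    MODIFIED
llama2:latest   abc123  3.8 GB  2 days ago
mistral:latest  def456  4.1 GB  5 hours ago
"""
```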
3. Serve Models via API
Ollama exposes a local HTTP API on port 11434 that you can use to interact with models programmatically. This is ideal for integrating LLMs into your applications.
Start the Ollama server:
ollama serve
Once the server is running, you can send requests to http://localhost:11434 using any HTTP client.
Key API endpoints:
- POST /api/generate – generate a completion
- POST /api/chat – chat with a model
- GET /api/tags – list available models
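By default, /api/generate streams its reply as line-delimited JSON: each line is an object carrying a `response` fragment and a `done` flag. A sketch of consuming that stream from Python (helper names here are illustrative, and the server is assumed to be running on the default port):

```python
import json
import requests

OLLAMA_URL = "http://localhost:11434"

def parse_stream_line(raw: bytes) -> tuple[str, bool]:
    """Decode one streamed line into its text fragment and done flag."""
    obj = json.loads(raw)
    return obj.get("response", ""), obj.get("done", False)

def generate_streaming(model: str, prompt: str) -> str:
    """Stream a completion and concatenate the fragments into the full reply."""
    parts = []
    with requests.post(
        f"{OLLAMA_URL}/api/generate",
        json={"model": model, "prompt": prompt},  # streaming is the default
        stream=True,
    ) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line:
                continue
            fragment, done = parse_stream_line(line)
            parts.append(fragment)
            if done:
                break
    return "".join(parts)
```

Setting `"stream": False` in the request body instead returns a single JSON object, as in the next section.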
4. Using Ollama with Requests
import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "phi",
        "prompt": "Explain what a neural network is in simple terms.",
        "stream": False
    }
)
result = response.json()
print(result["response"])
5. Using the Official Ollama Package
import ollama

response = ollama.generate(
    model='mistral',
    prompt='Explain Python in a few sentences.'
)
print(response['response'])
6. Custom Models with Modelfile
You can create custom models by defining a Modelfile that specifies the base model, system prompt, and other parameters.
Example Modelfile:
FROM llama3.2
SYSTEM "You are a pirate. Always respond in pirate speak."
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
Create and run the custom model:
ollama create pirate-bot -f Modelfile
ollama run pirate-bot
This lets you package specific behaviors, system prompts, and settings into reusable model configurations.
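The same parameters can also be supplied per request through the API's `options` field instead of being baked into a Modelfile. A sketch of building such a request body (the field names mirror the PARAMETER lines above; the builder function is illustrative):

```python
def generate_with_options(model: str, prompt: str,
                          temperature: float = 0.7, num_ctx: int = 4096) -> dict:
    """Build a /api/generate request body; `options` overrides model defaults for this call only."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": temperature, "num_ctx": num_ctx},
    }

payload = generate_with_options("llama3.2", "Ahoy! Who are you?")
# requests.post("http://localhost:11434/api/generate", json=payload)
```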
7. Deployment Options & Performance Tips
GPU Acceleration: Ollama automatically detects and uses available GPUs. Ensure your NVIDIA/AMD drivers are up to date for best performance.
Quantized Models: Use quantized model variants to reduce memory usage while maintaining quality:
- Q4 – 4-bit quantization, smallest size, moderate quality loss
- Q5 – 5-bit quantization, good balance of size and quality
- Q8 – 8-bit quantization, near-original quality
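As a rough rule of thumb, weight memory scales with bits per parameter. A hypothetical back-of-the-envelope helper (this ignores the KV cache and runtime overhead, so real usage will be higher):

```python
def approx_weight_gib(params_billion: float, bits: int) -> float:
    """Approximate weight memory in GiB: parameters * bits, 8 bits per byte."""
    bytes_total = params_billion * 1e9 * bits / 8
    return bytes_total / 2**30

# A 7B model at Q4 needs roughly 3.3 GiB for weights alone;
# at Q8 that doubles to about 6.5 GiB.
print(round(approx_weight_gib(7, 4), 1))
```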
Docker Deployment:
docker run -d --gpus all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
Kubernetes Deployment: Ollama can be deployed on Kubernetes clusters for scalable inference serving. Use Helm charts or custom manifests to manage replicas, GPU scheduling, and load balancing across nodes.
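For container health checks or Kubernetes probes, polling the tags endpoint is a simple readiness test. A sketch, assuming the default port:

```python
import requests

def server_ready(base_url: str = "http://localhost:11434") -> bool:
    """Return True if the Ollama server answers on its model-listing endpoint."""
    try:
        return requests.get(f"{base_url}/api/tags", timeout=2).status_code == 200
    except requests.RequestException:
        return False
```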