Ollama – Run AI Models Locally

Use at your own risk. All guides and scripts are provided for educational purposes only. Always review and understand any code before running it — especially with administrative privileges. Test in a safe environment before using in production. Your system, your responsibility.

Ollama makes it easy to run large language models (LLMs) locally on your own hardware — no internet required, no API keys, no data sent to the cloud. Just download a model and start chatting.

This guide covers installation on Linux and macOS, running your first model, and useful commands for managing models.

Requirements

Linux (Ubuntu/Debian) or macOS
Minimum 8 GB RAM (16 GB+ recommended)
For GPU acceleration: NVIDIA GPU with CUDA support (Linux) or Apple Silicon Mac

Install Ollama

Linux:

curl -fsSL https://ollama.com/install.sh | sh

The installer automatically detects your GPU and sets up CUDA support if available.

macOS:

brew install ollama

Or download the app directly from ollama.com.

Start Ollama

Linux – Ollama runs as a systemd service after installation:

sudo systemctl start ollama
sudo systemctl enable ollama  # start on boot

macOS – Start from terminal:

ollama serve

Or just open the Ollama app from Applications.

Run Your First Model

Pull and run a model in one command:

ollama run llama3.2

The first run downloads the model (a few GB depending on the model). After that it loads from local storage — no internet needed.

You’re now in an interactive chat. Type your message and press Enter. Type /bye to exit.

Popular Models

Model	Size	Good for
`llama3.2`	2B / 3B	Fast, general purpose, low RAM
`llama3.1`	8B	Good balance of speed and quality
`mistral`	7B	Fast, great for coding and reasoning
`codellama`	7B	Code generation and explanation
`phi3`	3.8B	Microsoft model, very fast
`deepseek-r1`	7B	Strong reasoning and math
`gemma2`	9B	Google model, good quality

Browse all models at ollama.com/library.

Useful Commands

Command	What it does
`ollama run llama3.2`	Pull and run a model
`ollama pull llama3.2`	Download a model without running it
`ollama list`	List all downloaded models
`ollama rm llama3.2`	Delete a model
`ollama show llama3.2`	Show model details
`ollama ps`	Show currently running models
`ollama serve`	Start the Ollama server manually

Run a Model Non-Interactively

Send a single prompt from the terminal:

ollama run llama3.2 "Explain what Docker is in simple terms"

Pipe input to a model:

cat script.sh | ollama run codellama "Review this bash script for errors"

Ollama REST API

Ollama exposes a local REST API on port 11434 — useful for integrating with other tools:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "What is Proxmox?",
  "stream": false
}'

This API is compatible with the OpenAI API format, making it easy to use Ollama as a drop-in replacement in apps that support OpenAI.

Add a Web Interface with Open WebUI

For a ChatGPT-like interface in your browser, run Open WebUI alongside Ollama:

docker run -d \
  --network=host \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main

Open your browser at http://localhost:8080 and connect to your local Ollama instance.

Tips

RAM usage — each model loads fully into RAM (or VRAM with GPU). Make sure you have enough before pulling large models.
GPU acceleration — if you have an NVIDIA GPU, Ollama uses it automatically after installation. Speeds up inference significantly.
Apple Silicon — Ollama uses Metal on Apple Silicon Macs for GPU acceleration out of the box.
Model storage — models are stored in ~/.ollama/models on Linux and macOS.