Skip to content

Part 6 – Ollama and Open WebUI

Use at your own risk. All guides and scripts are provided for educational purposes only. Always review and understand any code before running it — especially with administrative privileges. Test in a safe environment before using in production. Your system, your responsibility.

Ollama runs large language models locally on your own hardware. Combined with Open WebUI, you get a ChatGPT-like interface running entirely on your machine — no internet required, no API costs, no data leaving your home.

With an RTX 4080 Super and 16 GB VRAM, models respond in seconds rather than minutes.


Prerequisites

  • ✅ Docker with NVIDIA Container Toolkit installed
  • ✅ NVIDIA drivers running (nvidia-smi works)
  • ✅ Caddy configured with ai.wcp
  • ✅ The wcp-network Docker network created

→ Follow Part 2 first for Docker and NVIDIA Container Toolkit setup.


Step 1 – Create the folder

mkdir -p /opt/docker/ollama
cd /opt/docker/ollama

Step 2 – Create the compose.yml

nano compose.yml
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    volumes:
      - /mnt/ai/ollama:/root/.ollama
    networks:
      - wcp-network
    restart: unless-stopped

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    depends_on:
      - ollama
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    volumes:
      - open-webui-data:/app/backend/data
    networks:
      - wcp-network
    restart: unless-stopped

volumes:
  open-webui-data:

networks:
  wcp-network:
    external: true

Key settings:

  • runtime: nvidia — enables GPU access for the Ollama container
  • /mnt/ai/ollama:/root/.ollama — stores models on the NVMe disk outside the container
  • OLLAMA_BASE_URL — tells Open WebUI where to find Ollama

Step 3 – Start the services

docker compose up -d
docker compose logs -f

Wait until both containers are running. Press CTRL+C to exit logs.


Step 4 – Pull your first model

docker exec ollama ollama pull llama3.1

This downloads the model to /mnt/ai/ollama. With an RTX 4080 Super and 16 GB VRAM, llama3.1:8b runs entirely on the GPU for fast responses.

Other recommended models:

# Fast and capable
docker exec ollama ollama pull mistral

# Great for coding
docker exec ollama ollama pull codellama

# Lightweight, very fast
docker exec ollama ollama pull phi3

# Strong reasoning
docker exec ollama ollama pull deepseek-r1

Step 5 – Access Open WebUI

Open your browser:

http://ai.wcp

Create an admin account on the first visit. Open WebUI connects to Ollama automatically and shows all downloaded models.


Step 6 – Verify GPU is being used

While a model is running, check GPU usage:

nvidia-smi

You should see memory allocated to the Ollama process and GPU utilization above 0%.


Model recommendations for RTX 4080 Super

With 16 GB VRAM you can run large models comfortably:

Model VRAM needed Best for
phi3 ~2 GB Fast answers, low resource
llama3.2:3b ~3 GB Quick everyday use
mistral:7b ~5 GB Balanced speed and quality
llama3.1:8b ~6 GB Great general purpose
codellama:13b ~10 GB Code generation
llama3.1:13b ~10 GB High quality responses

Useful commands

# List downloaded models
docker exec ollama ollama list

# Pull a new model
docker exec ollama ollama pull modelname

# Remove a model
docker exec ollama ollama rm modelname

# Chat directly in terminal
docker exec -it ollama ollama run llama3.1

What’s next

Part 7 deploys ComfyUI — a node-based creative AI studio for image, video, and music generation, also using the RTX 4080 Super.

Up next: Part 7 – ComfyUI (coming soon)


Related guides