Ollama on Linux with NVIDIA GPU
Running Ollama with an NVIDIA GPU dramatically speeds up AI model inference. Instead of waiting 30+ seconds for a response on CPU, a decent GPU can respond in seconds. This guide covers setting up NVIDIA drivers and CUDA, then verifying GPU acceleration with Ollama on Linux.
Requirements
- Linux (Ubuntu 20.04+ or Debian 11+)
- NVIDIA GPU with CUDA support (GTX 900 series or newer)
- Ollama installed (see Ollama – Run AI Models Locally)
Step 1 – Check Your GPU
Verify your NVIDIA GPU is detected:
lspci | grep -i nvidia
Check if NVIDIA drivers are already installed:
nvidia-smi
If nvidia-smi works and shows your GPU, skip to Step 3.
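The two checks above can be combined into a small helper that reports which step comes next. This is only a minimal sketch: the function name `next_step` is hypothetical, and it takes the `lspci` output as an argument so the logic can be tried without a GPU.

```shell
#!/bin/sh
# Hypothetical helper: decide the next step from the two checks above.
# $1 = output of `lspci`
# $2 = "yes" if `nvidia-smi` ran successfully, anything else otherwise
next_step() {
    if ! printf '%s\n' "$1" | grep -qi nvidia; then
        echo "No NVIDIA GPU detected"
    elif [ "$2" = "yes" ]; then
        echo "Driver already working: skip to Step 3"
    else
        echo "GPU found but no working driver: continue with Step 2"
    fi
}

# Usage on a real machine (left commented out so the file has no side effects):
# if nvidia-smi >/dev/null 2>&1; then ok=yes; else ok=no; fi
# next_step "$(lspci)" "$ok"
```

Passing the command output in as arguments keeps the decision logic separate from the hardware probing, which makes it easy to check each branch by hand.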
Step 2 – Install NVIDIA Drivers
On Ubuntu, let the ubuntu-drivers tool pick and install the recommended driver:
sudo apt update
sudo apt install -y ubuntu-drivers-common
sudo ubuntu-drivers autoinstall
Or install a specific driver version:
sudo apt install -y nvidia-driver-535
Reboot after installation:
sudo reboot
Verify after reboot:
nvidia-smi
Example output:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 535.x       Driver Version: 535.x       CUDA Version: 12.x       |
|-------------------------------+----------------------+----------------------+
| GPU  Name         Persistence | Bus-Id        Disp.A | Volatile Uncorr. ECC |
|   0  RTX 4080 Super       Off | 00000000:01:00.0 Off |                  N/A |
+-----------------------------------------------------------------------------+
Step 3 – Install CUDA Toolkit
Ollama’s installer handles CUDA automatically, so this step is optional. Installing the toolkit adds developer tools such as nvcc:
sudo apt install -y nvidia-cuda-toolkit
Verify CUDA:
nvcc --version
Step 4 – Install or Reinstall Ollama
If you already have Ollama installed, the installer will update it and detect your GPU:
curl -fsSL https://ollama.com/install.sh | sh
The installer automatically:
- Detects your NVIDIA GPU
- Configures CUDA support
- Sets up the systemd service
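After the installer finishes, it can be worth confirming that the service is actually running before moving on. The sketch below assumes the systemd unit is named `ollama` and that the API listens on Ollama's default port 11434. The function name `check_ollama` is hypothetical.

```shell
#!/bin/sh
# Sketch: confirm the Ollama service set up by the installer is up.
# Assumptions: unit name "ollama", API on the default port 11434.
check_ollama() {
    # `systemctl is-active` prints "active" only when the unit is running.
    if [ "$(systemctl is-active ollama 2>/dev/null)" != "active" ]; then
        echo "ollama service is not active"
        return 1
    fi
    # /api/version answers with a small JSON document when the server is reachable.
    if curl -fsS http://localhost:11434/api/version >/dev/null 2>&1; then
        echo "ollama service is active and the API responds"
    else
        echo "ollama service is active but the API is not reachable"
        return 1
    fi
}

# Usage:
# check_ollama || sudo journalctl -u ollama --no-pager -n 50
```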
Step 5 – Verify GPU Acceleration
Start Ollama and run a model:
sudo systemctl restart ollama
ollama run llama3.1
While the model is running, open a second terminal and check GPU usage:
nvidia-smi
You should see GPU memory in use and, while the model is generating a response, GPU utilization above 0%. This confirms Ollama is using your GPU.
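The same check can be scripted with nvidia-smi's CSV query mode instead of reading the table by eye. The sketch below assumes the Ollama process (or its model runner) has "ollama" somewhere in its process name. The function name `ollama_gpu_mem` is hypothetical.

```shell
#!/bin/sh
# Sketch: sum the GPU memory held by Ollama processes.
# Expects CSV lines of "pid, process_name, used_memory" on stdin, as produced by:
#   nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv,noheader,nounits
# Assumption: the process name contains "ollama".
ollama_gpu_mem() {
    awk -F', *' '
        tolower($2) ~ /ollama/ { total += $3; found = 1 }
        END {
            if (found) print total " MiB used by Ollama"
            exit (found ? 0 : 1)
        }
    '
}

# Usage, while a model is loaded:
# nvidia-smi --query-compute-apps=pid,process_name,used_memory \
#     --format=csv,noheader,nounits | ollama_gpu_mem || echo "Ollama is not on the GPU"
```

A non-zero exit status means no matching process was found among the GPU compute processes, which usually points at CPU-only inference.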
Choosing Models for Your GPU
GPU VRAM determines which models you can run efficiently:
| VRAM | Recommended models |
|---|---|
| 4 GB | llama3.2:3b, phi3, gemma2:2b |
| 8 GB | llama3.1:8b, mistral:7b, codellama:7b |
| 12 GB | llama3.1:8b comfortably, some 13B models |
| 16 GB (e.g. RTX 4080 Super) | llama2:13b, codellama:13b, other 13B-class models |
| 24 GB | codellama:34b, most 30B-class models |
With an RTX 4080 Super and its 16 GB of VRAM you can run capable models of up to roughly 13B parameters entirely on the GPU.
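To find out which table row applies to a machine, the total VRAM can be read from nvidia-smi and mapped to the nearest tier. This is a rough sketch: the function name `vram_tier` is hypothetical, and the thresholds are assumptions set slightly below the nominal sizes because cards tend to report a little less than their advertised capacity.

```shell
#!/bin/sh
# Hypothetical helper: map total VRAM in MiB to the closest row of the table above.
# Thresholds are assumed values, chosen a bit below the nominal size.
vram_tier() {
    mib=$1
    if   [ "$mib" -ge 23000 ]; then echo "24 GB"
    elif [ "$mib" -ge 15000 ]; then echo "16 GB"
    elif [ "$mib" -ge 11000 ]; then echo "12 GB"
    elif [ "$mib" -ge 7500 ];  then echo "8 GB"
    elif [ "$mib" -ge 3500 ];  then echo "4 GB"
    else                            echo "under 4 GB"
    fi
}

# Usage (first GPU only):
# vram_tier "$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | head -n 1)"
```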
Monitor GPU Usage
Watch GPU usage in real time while running models:
watch -n 1 nvidia-smi
Or use nvtop for a more detailed view:
sudo apt install -y nvtop
nvtop
Troubleshooting
Ollama not using GPU:
Check Ollama logs for GPU detection:
sudo journalctl -u ollama -f
Look for lines mentioning CUDA or your GPU model. If not found, reinstall Ollama after installing NVIDIA drivers.
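Scanning a long log by eye is error-prone, so the search can be narrowed with a filter. The sketch below keeps only lines that mention CUDA, NVIDIA, or GPU. The keyword list is an assumption about what the relevant log lines contain, and the function name `check_gpu_logs` is hypothetical.

```shell
#!/bin/sh
# Sketch: keep only GPU-related lines from Ollama's log (read from stdin).
# The keyword list is an assumption, not an official set of log messages.
check_gpu_logs() {
    if matches=$(grep -iE 'cuda|nvidia|gpu'); then
        printf '%s\n' "$matches"
    else
        echo "No GPU-related log lines found"
        return 1
    fi
}

# Usage: inspect the last 200 log lines without following the log.
# sudo journalctl -u ollama --no-pager -n 200 | check_gpu_logs
```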
Out of VRAM:
If a model is too large for your VRAM, Ollama automatically offloads some layers to the CPU and system RAM. It still works, but performance drops significantly. To stay on the GPU, switch to a model with fewer parameters or a more heavily quantized variant.
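One way to see whether offloading is happening is the PROCESSOR column of `ollama ps`, which shows values such as `100% GPU` or a CPU/GPU split. The sketch below classifies each loaded model from that output. It assumes that column format, and the function name `offload_status` is hypothetical.

```shell
#!/bin/sh
# Sketch: classify each loaded model from `ollama ps` output (read from stdin).
# Assumes a header line followed by rows whose PROCESSOR column reads
# "100% GPU", "100% CPU", or a split such as "48%/52% CPU/GPU".
offload_status() {
    awk 'NR > 1 {
        if ($0 ~ /100% GPU/)      state = "fully on GPU"
        else if ($0 ~ /100% CPU/) state = "fully on CPU"
        else if ($0 ~ /CPU\/GPU/) state = "split between CPU and GPU"
        else                      state = "unknown"
        print $1 ": " state
    }'
}

# Usage, while a model is loaded:
# ollama ps | offload_status
```

A model reported as split or fully on CPU is a candidate for replacing with a smaller one.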
Related Links
- Ollama – Run AI Models Locally — basic Ollama setup guide
- NVIDIA CUDA Documentation — official CUDA docs
- Ollama GPU Documentation — Ollama GPU support details