Ollama Setup

Set up local AI models with Ollama for offline inference. Ollama runs on your host machine (outside Docker) and GT AI OS containers connect to it.

Recommended Models

Model              Size     VRAM Required   Best For
llama3.1:8b        ~4.7GB   6GB+            General chat, coding help
qwen3-coder:30b    ~19GB    24GB+           Code generation, agentic coding
gemma3:27b         ~17GB    20GB+           General tasks, multilingual

Quick Reference

Platform                       Model Endpoint URL
Ubuntu Linux 24.04 (x86_64)    http://ollama-host:11434/v1/chat/completions
NVIDIA DGX OS 7                http://ollama-host:11434/v1/chat/completions
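
For reference, a chat-completion request to this endpoint looks like the sketch below. The ollama-host hostname only resolves from inside the GT AI OS containers; when testing from the host machine itself, substitute localhost. The model name assumes you have already pulled llama3.1:8b.

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}]
  }'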

Ubuntu Linux 24.04 (x86_64)

Step 1: Ensure Your NVIDIA Drivers are Properly Installed

If your system has an NVIDIA GPU, you need working drivers for GPU-accelerated inference. If you don't have an NVIDIA GPU, skip to Step 2.

1. Check if NVIDIA drivers are already installed:

nvidia-smi

If this command shows your GPU information, the drivers are already working; skip ahead to installing nvtop in item 3 below. If not, continue with the driver install.

2. Install NVIDIA drivers:

# Update package list
sudo apt update

# Install the recommended NVIDIA driver
sudo ubuntu-drivers install

# Reboot to load the new driver
sudo reboot

After reboot, verify the driver is working:

nvidia-smi

You should see your GPU model, driver version, and CUDA version.

3. Install nvtop:

Install the nvtop utility so you can monitor GPU utilization:

sudo apt install nvtop

Run nvtop to see live GPU metrics:

nvtop

Note: Ollama automatically detects and uses NVIDIA GPUs when drivers are installed. No additional configuration is needed.

Step 2: Install Ollama

Install Ollama using the official install script below. Other installation methods (for example, the Snap package) may not set up the systemd service that the following steps rely on.

curl -fsSL https://ollama.com/install.sh | sh

When the install completes, the script reports whether your GPU was detected. If Ollama does not detect your GPU, revisit the driver installation in Step 1.

Step 3: Configure Systemd

Create a systemd override configuration that matches your GPU's VRAM. Choose the block below that corresponds to your card.

These settings are required for GT AI OS to connect properly to Ollama.
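
If you are not sure how much VRAM your GPU has, the query below reports it (this assumes the NVIDIA driver from Step 1 is installed):

nvidia-smi --query-gpu=name,memory.total --format=csv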

4GB VRAM:

sudo mkdir -p /etc/systemd/system/ollama.service.d

sudo tee /etc/systemd/system/ollama.service.d/override.conf > /dev/null <<'EOF'
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_CONTEXT_LENGTH=4096"
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KEEP_ALIVE=4h"
Environment="OLLAMA_MAX_LOADED_MODELS=1"
EOF

6GB VRAM:

sudo mkdir -p /etc/systemd/system/ollama.service.d

sudo tee /etc/systemd/system/ollama.service.d/override.conf > /dev/null <<'EOF'
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_CONTEXT_LENGTH=8192"
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KEEP_ALIVE=4h"
Environment="OLLAMA_MAX_LOADED_MODELS=1"
EOF

8GB VRAM:

sudo mkdir -p /etc/systemd/system/ollama.service.d

sudo tee /etc/systemd/system/ollama.service.d/override.conf > /dev/null <<'EOF'
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_CONTEXT_LENGTH=16384"
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KEEP_ALIVE=4h"
Environment="OLLAMA_MAX_LOADED_MODELS=1"
EOF

12GB VRAM:

sudo mkdir -p /etc/systemd/system/ollama.service.d

sudo tee /etc/systemd/system/ollama.service.d/override.conf > /dev/null <<'EOF'
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_CONTEXT_LENGTH=32768"
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KEEP_ALIVE=4h"
Environment="OLLAMA_MAX_LOADED_MODELS=2"
EOF

16GB VRAM:

sudo mkdir -p /etc/systemd/system/ollama.service.d

sudo tee /etc/systemd/system/ollama.service.d/override.conf > /dev/null <<'EOF'
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_CONTEXT_LENGTH=65536"
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KEEP_ALIVE=4h"
Environment="OLLAMA_MAX_LOADED_MODELS=2"
EOF

32GB+ VRAM:

sudo mkdir -p /etc/systemd/system/ollama.service.d

sudo tee /etc/systemd/system/ollama.service.d/override.conf > /dev/null <<'EOF'
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_CONTEXT_LENGTH=131072"
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KEEP_ALIVE=4h"
Environment="OLLAMA_MAX_LOADED_MODELS=3"
EOF

Configuration explained:

  • OLLAMA_HOST=0.0.0.0:11434 - Listen on all network interfaces (required for Docker)
  • OLLAMA_CONTEXT_LENGTH - Maximum context window size (adjust based on VRAM)
  • OLLAMA_FLASH_ATTENTION=1 - Enable flash attention for better performance
  • OLLAMA_KEEP_ALIVE=4h - Keep models loaded for 4 hours
  • OLLAMA_MAX_LOADED_MODELS - Number of models loaded simultaneously (adjust based on VRAM)

Step 4: Start Service

sudo systemctl daemon-reload
sudo systemctl enable ollama
sudo systemctl start ollama
sudo systemctl restart ollama
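
Optionally, confirm the override was applied by inspecting the service environment; the variables should match the VRAM block you chose above:

systemctl show ollama --property=Environment | tr ' ' '\n' | grep OLLAMA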

Step 5: Pull a Model

ollama pull llama3.1:8b
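
Optionally, confirm the model downloaded and responds before wiring it into GT AI OS:

# List downloaded models
ollama list

# Quick smoke test: loads the model and prints a short reply
ollama run llama3.1:8b "Reply with one short sentence."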

Step 6: Add Model to GT AI OS

  1. Open Control Panel: http://localhost:3001
  2. Log in with gtadmin@test.com / Test@123
  3. Go to Models → Add Model
  4. Fill in:
    • Model ID: llama3.1:8b (must match exactly what you pulled)
    • Provider: Local Ollama (Ubuntu x86 / DGX ARM)
    • Endpoint URL: http://ollama-host:11434/v1/chat/completions
    • Model Type: LLM (Language Model - this is the most common type for AI agents)
    • Context Length: Use the value from your systemd config (e.g., 8192 for 6GB VRAM)
    • Max Tokens: 4096
  5. Click Save
  6. Go to Tenant Access → Assign Model to Tenant
  7. Select your model, tenant, and rate limit

⚠️ Critical: Model ID Must Match Exactly

The Model ID in GT AI OS must match the Ollama model name exactly - character for character. Run ollama list to see the exact model names. Common mistakes:

  • Extra spaces before or after the ID
  • Missing version tags (e.g., qwen3-coder vs qwen3-coder:30b)
  • Typos in the model name

Example: If ollama list shows llama3.1:8b, use llama3.1:8b exactly as shown.


NVIDIA DGX Spark and RTX Pro Systems (DGX OS 7)

DGX systems come with NVIDIA drivers and CUDA pre-installed. Ollama will automatically use the GPUs.
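
As a quick sanity check before installing, the command below should list the GPUs along with the driver and CUDA versions:

nvidia-smi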

Step 1: Install Ollama (Clean Install)

Copy and paste the command below to perform a complete clean install of Ollama.

Important: The configuration settings in this script are required for GT AI OS integration on DGX OS 7 systems:

  • OLLAMA_HOST=0.0.0.0:11434 - Allows Docker containers to connect (required)
  • OLLAMA_CONTEXT_LENGTH=131072 - 128K context window for long conversations
  • OLLAMA_FLASH_ATTENTION=1 - Enables flash attention for better GPU performance
  • OLLAMA_KEEP_ALIVE=4h - Keeps models loaded to avoid cold start delays
  • OLLAMA_MAX_LOADED_MODELS=3 - DGX has enough VRAM for multiple models

Do not skip or modify these settings unless you understand the implications.

⚠️ Warning: This command performs a clean reinstallation of Ollama. Any existing Ollama installation will be removed, including downloaded models. If you wish to preserve your models, back up /usr/share/ollama/.ollama/models before proceeding.
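
One way to back them up (a sketch; adjust the destination path as needed):

# Optional: copy existing models to your home directory before the clean install
sudo cp -a /usr/share/ollama/.ollama/models ~/ollama-models-backup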

# Cleanup
sudo systemctl stop ollama 2>/dev/null; sudo pkill ollama 2>/dev/null; sleep 2; \
snap list ollama &>/dev/null && sudo snap remove ollama; \
sudo systemctl disable ollama 2>/dev/null; \
sudo rm -f /etc/systemd/system/ollama.service; \
sudo rm -rf /etc/systemd/system/ollama.service.d; \
sudo rm -f /usr/local/bin/ollama /usr/bin/ollama; \
sudo rm -rf /usr/local/lib/ollama; \
id ollama &>/dev/null && sudo userdel -r ollama 2>/dev/null; \
getent group ollama &>/dev/null && sudo groupdel ollama 2>/dev/null; \
sudo systemctl daemon-reload && \
# Install
curl -fsSL https://ollama.com/install.sh | sh && \
if [ ! -f /etc/systemd/system/ollama.service ]; then
    sudo tee /etc/systemd/system/ollama.service > /dev/null <<'EOF'
[Unit]
Description=Ollama Service
After=network-online.target

[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"

[Install]
WantedBy=default.target
EOF
    sudo systemctl daemon-reload
fi && \
# Configure
sudo mkdir -p /etc/systemd/system/ollama.service.d && \
sudo tee /etc/systemd/system/ollama.service.d/override.conf > /dev/null <<'EOF'
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_CONTEXT_LENGTH=131072"
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KEEP_ALIVE=4h"
Environment="OLLAMA_MAX_LOADED_MODELS=3"
EOF
# Start
sudo systemctl daemon-reload && \
sudo systemctl enable ollama && \
sudo systemctl start ollama && \
sudo systemctl restart ollama && \
# Verify
sleep 3 && \
systemctl is-active ollama && echo "✓ Service running" && \
curl -s http://localhost:11434/api/version && echo -e "\n✓ API responding" && \
systemctl show ollama --property=Environment | tr ' ' '\n'

Step 2: Pull Models

DGX systems have more VRAM, so you can run larger models:

ollama pull llama3.1:8b
ollama pull qwen3-coder:30b
ollama pull gemma3:27b
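
After pulling, ollama list shows what is on disk, and ollama ps shows which models are currently loaded:

ollama list
ollama ps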

Step 3: Add Model to GT AI OS

  1. Open Control Panel: http://localhost:3001
  2. Log in with gtadmin@test.com / Test@123
  3. Go to Models → Add Model
  4. Fill in:
    • Model ID: llama3.1:8b (or qwen3-coder:30b, gemma3:27b)
    • Provider: Local Ollama (Ubuntu x86 / DGX ARM)
    • Endpoint URL: http://ollama-host:11434/v1/chat/completions
    • Model Type: LLM (Language Model - this is the most common type for AI agents)
    • Context Length: 131072
    • Max Tokens: 4096
  5. Click Save
  6. Go to Tenant Access → Assign Model to Tenant
  7. Select your model, tenant, and rate limit

⚠️ Critical: Model ID Must Match Exactly

The Model ID in GT AI OS must match the Ollama model name exactly - character for character. Run ollama list to see the exact model names. Common mistakes:

  • Extra spaces before or after the ID
  • Missing version tags (e.g., qwen3-coder vs qwen3-coder:30b)
  • Typos in the model name

Example: If ollama list shows llama3.1:8b, use llama3.1:8b exactly as shown.


Verify Ollama is Working

After completing the setup for your platform, follow these verification steps to ensure Ollama is properly configured and accessible by GT AI OS.

Step 1: Verify Ollama Service is Running

All Platforms (Ubuntu and DGX):

Run these commands on your host machine (not inside Docker) to confirm Ollama is running and responding:

ollama list

This shows all models you have pulled. You should see llama3.1:8b (or other models you installed).

curl http://localhost:11434/api/version

This tests the Ollama API. You should see a JSON response with version information like {"version":"0.x.x"}.
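
You can also list the installed models through the API; the response should mirror what ollama list prints:

curl http://localhost:11434/api/tags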

Step 2: Verify GPU Acceleration

Ubuntu x86 and DGX Only:

While a model is running, check that your NVIDIA GPU is being utilized:

nvtop

or

nvidia-smi

You should see ollama or ollama_llama_server processes using GPU memory. If you only see CPU usage, revisit Step 1 (NVIDIA driver installation) in your platform's setup.
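
Another quick check is ollama ps, which reports where each loaded model is running; the processor column should show GPU rather than CPU:

ollama ps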

Step 3: Verify GT AI OS Can Reach Ollama

This step confirms that the Docker containers running GT AI OS can communicate with Ollama on your host machine.

Ubuntu x86 and DGX:

docker exec gentwo-resource-cluster curl http://ollama-host:11434/api/version

You should see the same JSON version response. If you get a connection error, check that:

  • Ollama is running (ollama list works)
  • On Ubuntu/DGX: The systemd config has OLLAMA_HOST=0.0.0.0:11434 (the listener check below confirms this)
  • GT AI OS containers are running (docker ps | grep gentwo)
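
To confirm Ollama is listening on all interfaces (not just 127.0.0.1), check the listener on the host:

# Should show ollama bound to 0.0.0.0:11434 or *:11434
sudo ss -ltnp | grep 11434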

Step 4: Test in the Application

Once all verification steps pass, test the full integration:

  1. Open Tenant App: http://localhost:3002
  2. Create a new agent or edit an existing one
  3. Select your Ollama model (e.g., llama3.1:8b) from the model dropdown
  4. Send a test message and verify you get a response

If the agent doesn't respond, check the model configuration in Control Panel → Models and ensure the Model ID matches exactly what ollama list shows.