Ollama Setup
Set up local AI models with Ollama for offline inference. Ollama runs on your host machine (outside Docker) and GT AI OS containers connect to it.
Table of Contents
- Recommended Models
- Quick Reference
- Ubuntu Linux 24.04 (x86_64)
- NVIDIA DGX Spark and RTX Pro Systems (DGX OS 7)
- Verify Ollama is Working
Recommended Models
| Model | Size | VRAM Required | Best For |
|---|---|---|---|
| llama3.1:8b | ~4.7GB | 6GB+ | General chat, coding help |
| qwen3-coder:30b | ~19GB | 24GB+ | Code generation, agentic coding |
| gemma3:27b | ~17GB | 20GB+ | General tasks, multilingual |
Quick Reference
| Platform | Model Endpoint URL |
|---|---|
| Ubuntu Linux 24.04 (x86_64) | http://ollama-host:11434/v1/chat/completions |
| NVIDIA DGX OS 7 | http://ollama-host:11434/v1/chat/completions |
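Both endpoints expose Ollama's OpenAI-compatible chat completions API. The sketch below shows roughly what a request looks like; it assumes you have already pulled llama3.1:8b, and it uses localhost because the ollama-host hostname typically resolves only inside the GT AI OS containers (see the verification section at the end of this guide).
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.1:8b", "messages": [{"role": "user", "content": "Hello"}]}'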
Ubuntu Linux 24.04 (x86_64)
Step 1: Ensure Your NVIDIA Drivers are Properly Installed
If your system has an NVIDIA GPU, you need working drivers for GPU-accelerated inference. If you don't have an NVIDIA GPU, skip to Step 2.
1. Check if NVIDIA drivers are already installed:
nvidia-smi
If this command shows your GPU info, your drivers are already installed; skip the driver installation in item 2 and go to item 3 (install nvtop). If not, continue below.
2. Install NVIDIA drivers:
# Update package list
sudo apt update
# Install the recommended NVIDIA driver
sudo ubuntu-drivers install
# Reboot to load the new driver
sudo reboot
After reboot, verify the driver is working:
nvidia-smi
You should see your GPU model, driver version, and CUDA version.
3. Install nvtop:
Install the nvtop utility so you can monitor GPU utilization:
sudo apt install nvtop
Then run it to see live GPU metrics:
nvtop
Note: Ollama automatically detects and uses NVIDIA GPUs when drivers are installed. No additional configuration is needed.
Step 2: Install Ollama
Install Ollama using the command below. Other installation methods may not function correctly.
curl -fsSL https://ollama.com/install.sh | sh
When the Ollama install completes, it will confirm that your GPU was detected. If Ollama does not detect your GPU, check your GPU driver configuration (Step 1).
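As an extra sanity check before configuring the service, you can print the installed CLI version (the exact version number on your system will differ):
ollama --version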
Step 3: Configure Systemd
Create a systemd override configuration based on your GPU's VRAM. These settings are required for GT AI OS to connect properly to Ollama. Choose the configuration below that matches your GPU:
4GB VRAM:
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf > /dev/null <<'EOF'
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_CONTEXT_LENGTH=4096"
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KEEP_ALIVE=4h"
Environment="OLLAMA_MAX_LOADED_MODELS=1"
EOF
6GB VRAM:
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf > /dev/null <<'EOF'
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_CONTEXT_LENGTH=8192"
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KEEP_ALIVE=4h"
Environment="OLLAMA_MAX_LOADED_MODELS=1"
EOF
8GB VRAM:
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf > /dev/null <<'EOF'
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_CONTEXT_LENGTH=16384"
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KEEP_ALIVE=4h"
Environment="OLLAMA_MAX_LOADED_MODELS=1"
EOF
12GB VRAM:
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf > /dev/null <<'EOF'
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_CONTEXT_LENGTH=32768"
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KEEP_ALIVE=4h"
Environment="OLLAMA_MAX_LOADED_MODELS=2"
EOF
16GB VRAM:
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf > /dev/null <<'EOF'
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_CONTEXT_LENGTH=65536"
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KEEP_ALIVE=4h"
Environment="OLLAMA_MAX_LOADED_MODELS=2"
EOF
32GB+ VRAM:
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf > /dev/null <<'EOF'
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_CONTEXT_LENGTH=131072"
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KEEP_ALIVE=4h"
Environment="OLLAMA_MAX_LOADED_MODELS=3"
EOF
Configuration explained:
- OLLAMA_HOST=0.0.0.0:11434 - Listen on all network interfaces (required for Docker)
- OLLAMA_CONTEXT_LENGTH - Maximum context window size (adjust based on VRAM)
- OLLAMA_FLASH_ATTENTION=1 - Enable flash attention for better performance
- OLLAMA_KEEP_ALIVE=4h - Keep models loaded for 4 hours
- OLLAMA_MAX_LOADED_MODELS - Number of models loaded simultaneously (adjust based on VRAM)
Step 4: Start Service
sudo systemctl daemon-reload
sudo systemctl enable ollama
sudo systemctl start ollama
sudo systemctl restart ollama
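To confirm the service came up with your override applied, you can check its status and environment; this mirrors the verification commands used in the DGX install script later in this guide. You should see the OLLAMA_* values you chose in Step 3:
systemctl is-active ollama
systemctl show ollama --property=Environment | tr ' ' '\n'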
Step 5: Pull a Model
ollama pull llama3.1:8b
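Optionally, run a quick local prompt to confirm the model loads and responds before adding it to GT AI OS (the prompt text here is arbitrary):
ollama run llama3.1:8b "Reply with OK if you can read this."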
Step 6: Add Model to GT AI OS
- Open Control Panel: http://localhost:3001
- Log in with gtadmin@test.com / Test@123
- Go to Models → Add Model
- Fill in:
  - Model ID: llama3.1:8b (must match exactly what you pulled)
  - Provider: Local Ollama (Ubuntu x86 / DGX ARM)
  - Endpoint URL: http://ollama-host:11434/v1/chat/completions
  - Model Type: LLM (Language Model - this is the most common type for AI agents)
  - Context Length: Use the value from your systemd config (e.g., 8192 for 6GB VRAM)
  - Max Tokens: 4096
- Click Save
- Go to Tenant Access → Assign Model to Tenant
- Select your model, tenant, and rate limit
⚠️ Critical: Model ID Must Match Exactly
The Model ID in GT AI OS must match the Ollama model name exactly - character for character. Run ollama list to see the exact model names. Common mistakes:
- Extra spaces before or after the ID
- Missing version tags (e.g., qwen3-coder vs qwen3-coder:30b)
- Typos in the model name
Example: If ollama list shows llama3.1:8b, use llama3.1:8b exactly as shown.
NVIDIA DGX Spark and RTX Pro Systems (DGX OS 7)
DGX systems come with NVIDIA drivers and CUDA pre-installed. Ollama will automatically use the GPUs.
Step 1: Install Ollama (Clean Install)
Copy and paste the command below to perform a complete clean install of Ollama.
Important: The configuration settings in this script are required for GT AI OS integration on DGX OS 7 Systems:
- OLLAMA_HOST=0.0.0.0:11434 - Allows Docker containers to connect (required)
- OLLAMA_CONTEXT_LENGTH=131072 - 128K context window for long conversations
- OLLAMA_FLASH_ATTENTION=1 - Enables flash attention for better GPU performance
- OLLAMA_KEEP_ALIVE=4h - Keeps models loaded to avoid cold start delays
- OLLAMA_MAX_LOADED_MODELS=3 - DGX has enough VRAM for multiple models
Do not skip or modify these settings unless you understand the implications.
⚠️ Warning: This command performs a clean reinstallation of Ollama. Any existing Ollama installation will be removed, including downloaded models. If you wish to preserve your models, back up /usr/share/ollama/.ollama/models before proceeding.
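For example, one simple way to back up the models directory before running the script (the destination path here is only an example):
sudo cp -a /usr/share/ollama/.ollama/models ~/ollama-models-backup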
# Cleanup
sudo systemctl stop ollama 2>/dev/null; sudo pkill ollama 2>/dev/null; sleep 2; \
snap list ollama &>/dev/null && sudo snap remove ollama; \
sudo systemctl disable ollama 2>/dev/null; \
sudo rm -f /etc/systemd/system/ollama.service; \
sudo rm -rf /etc/systemd/system/ollama.service.d; \
sudo rm -f /usr/local/bin/ollama /usr/bin/ollama; \
sudo rm -rf /usr/local/lib/ollama; \
id ollama &>/dev/null && sudo userdel -r ollama 2>/dev/null; \
getent group ollama &>/dev/null && sudo groupdel ollama 2>/dev/null; \
sudo systemctl daemon-reload && \
# Install
curl -fsSL https://ollama.com/install.sh | sh && \
if [ ! -f /etc/systemd/system/ollama.service ]; then
sudo tee /etc/systemd/system/ollama.service > /dev/null <<'EOF'
[Unit]
Description=Ollama Service
After=network-online.target
[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
[Install]
WantedBy=default.target
EOF
sudo systemctl daemon-reload
fi && \
# Configure
sudo mkdir -p /etc/systemd/system/ollama.service.d && \
sudo tee /etc/systemd/system/ollama.service.d/override.conf > /dev/null <<'EOF'
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_CONTEXT_LENGTH=131072"
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KEEP_ALIVE=4h"
Environment="OLLAMA_MAX_LOADED_MODELS=3"
EOF
# Start
sudo systemctl daemon-reload && \
sudo systemctl enable ollama && \
sudo systemctl start ollama && \
sudo systemctl restart ollama && \
# Verify
sleep 3 && \
systemctl is-active ollama && echo "✓ Service running" && \
curl -s http://localhost:11434/api/version && echo -e "\n✓ API responding" && \
systemctl show ollama --property=Environment | tr ' ' '\n'
Step 2: Pull Models
DGX systems have more VRAM, so you can run larger models:
ollama pull llama3.1:8b
ollama pull qwen3-coder:30b
ollama pull gemma3:27b
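Once the pulls finish, you can confirm the models are available locally before moving on:
ollama list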
Step 3: Add Model to GT AI OS
- Open Control Panel: http://localhost:3001
- Log in with gtadmin@test.com / Test@123
- Go to Models → Add Model
- Fill in:
  - Model ID: llama3.1:8b (or qwen3-coder:30b, gemma3:27b)
  - Provider: Local Ollama (Ubuntu x86 / DGX ARM)
  - Endpoint URL: http://ollama-host:11434/v1/chat/completions
  - Model Type: LLM (Language Model - this is the most common type for AI agents)
  - Context Length: 131072
  - Max Tokens: 4096
- Click Save
- Go to Tenant Access → Assign Model to Tenant
- Select your model, tenant, and rate limit
⚠️ Critical: Model ID Must Match Exactly
The Model ID in GT AI OS must match the Ollama model name exactly - character for character. Run ollama list to see the exact model names. Common mistakes:
- Extra spaces before or after the ID
- Missing version tags (e.g., qwen3-coder vs qwen3-coder:30b)
- Typos in the model name
Example: If ollama list shows llama3.1:8b, use llama3.1:8b exactly as shown.
Verify Ollama is Working
After completing the setup for your platform, follow these verification steps to ensure Ollama is properly configured and accessible by GT AI OS.
Step 1: Verify Ollama Service is Running
All Platforms (Ubuntu and DGX):
Run these commands on your host machine (not inside Docker) to confirm Ollama is running and responding:
ollama list
This shows all models you have pulled. You should see llama3.1:8b (or other models you installed).
curl http://localhost:11434/api/version
This tests the Ollama API. You should see a JSON response with version information like {"version":"0.x.x"}.
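If you also want to confirm that inference works (not just that the daemon is up), you can send a small generation request. This sketch assumes llama3.1:8b has been pulled; the exact response text will vary:
curl http://localhost:11434/api/generate -d '{"model": "llama3.1:8b", "prompt": "Say hello", "stream": false}'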
Step 2: Verify GPU Acceleration
Ubuntu x86 and DGX Only:
While a model is running, check that your NVIDIA GPU is being utilized:
nvtop
or
nvidia-smi
You should see ollama or ollama_llama_server processes using GPU memory. If you only see CPU usage, revisit Step 1 (NVIDIA driver installation) in your platform's setup.
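If no model is currently loaded, you can trigger one from another terminal and then watch nvtop or nvidia-smi while it responds; for example (any pulled model and prompt will do):
ollama run llama3.1:8b "Count from 1 to 20."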
Step 3: Verify GT AI OS Can Reach Ollama
This step confirms that the Docker containers running GT AI OS can communicate with Ollama on your host machine.
Ubuntu x86 and DGX:
docker exec gentwo-resource-cluster curl http://ollama-host:11434/api/version
You should see the same JSON version response. If you get a connection error, check that:
- Ollama is running (ollama list works)
- On Ubuntu/DGX: The systemd config has OLLAMA_HOST=0.0.0.0:11434
- GT AI OS containers are running (docker ps | grep gentwo)
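If the checklist above does not reveal the problem, one more host-side check is to confirm Ollama is listening on all interfaces rather than only on 127.0.0.1 (the command below assumes the default port 11434). The listening address should show 0.0.0.0:11434 or *:11434, not 127.0.0.1:11434:
sudo ss -tlnp | grep 11434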
Step 4: Test in the Application
Once all verification steps pass, test the full integration:
- Open Tenant App: http://localhost:3002
- Create a new agent or edit an existing one
- Select your Ollama model (e.g., llama3.1:8b) from the model dropdown
- Send a test message and verify you get a response
If the agent doesn't respond, check the model configuration in Control Panel → Models and ensure the Model ID matches exactly what ollama list shows.