From 33cbd137602dc9d67f13859574260af65f4b95bd Mon Sep 17 00:00:00 2001
From: daniel
Date: Sat, 10 Jan 2026 03:26:09 +0000
Subject: [PATCH] Add Ollama Setup

---
 Ollama-Setup.md | 473 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 473 insertions(+)
 create mode 100644 Ollama-Setup.md

diff --git a/Ollama-Setup.md b/Ollama-Setup.md
new file mode 100644
index 0000000..b18bd0b
--- /dev/null
+++ b/Ollama-Setup.md
@@ -0,0 +1,473 @@

# Ollama Setup

Set up local AI models with Ollama for offline inference. Ollama runs on your host machine (outside Docker) and GT AI OS containers connect to it.

## Table of Contents

- [Recommended Models](#recommended-models)
- [Quick Reference](#quick-reference)
- [Ubuntu Linux 24.04 (x86_64)](#ubuntu-linux-2404-x86_64)
  - [Step 1: Ensure Your NVIDIA Drivers are Properly Installed](#step-1-ensure-your-nvidia-drivers-are-properly-installed)
  - [Step 2: Install Ollama](#step-2-install-ollama)
  - [Step 3: Configure Systemd](#step-3-configure-systemd)
  - [Step 4: Start Service](#step-4-start-service)
  - [Step 5: Pull a Model](#step-5-pull-a-model)
  - [Step 6: Add Model to GT AI OS](#step-6-add-model-to-gt-ai-os)
- [NVIDIA DGX Spark and RTX Pro Systems (DGX OS 7)](#nvidia-dgx-spark-and-rtx-pro-systems-dgx-os-7)
  - [Step 1: Install Ollama (Clean Install)](#step-1-install-ollama-clean-install)
  - [Step 2: Pull Models](#step-2-pull-models)
  - [Step 3: Add Model to GT AI OS](#step-3-add-model-to-gt-ai-os)
- [macOS (Apple Silicon M1+)](#macos-apple-silicon-m1)
  - [Step 1: Install Ollama](#step-1-install-ollama-2)
  - [Step 2: Pull a Model](#step-2-pull-a-model)
  - [Step 3: Add Model to GT AI OS](#step-3-add-model-to-gt-ai-os-1)
- [Verify Ollama is Working](#verify-ollama-is-working)

---

## Recommended Models

| Model | Size | VRAM Required | Best For |
|-------|------|---------------|----------|
| llama3.1:8b | ~4.7GB | 6GB+ | General chat, coding help |
| qwen3-coder:30b | ~19GB | 24GB+ | Code generation, agentic coding |
| gemma3:27b | ~17GB | 20GB+ | General tasks, multilingual |

## Quick Reference

| Platform | Model Endpoint URL |
|----------|-------------------|
| Ubuntu Linux 24.04 (x86_64) | `http://ollama-host:11434/v1/chat/completions` |
| NVIDIA DGX OS 7 | `http://ollama-host:11434/v1/chat/completions` |
| macOS (Apple Silicon M1+) | `http://host.docker.internal:11434/v1/chat/completions` |

---

## Ubuntu Linux 24.04 (x86_64)

### Step 1: Ensure Your NVIDIA Drivers are Properly Installed

If your system has an NVIDIA GPU, you need working drivers for GPU-accelerated inference. If you don't have an NVIDIA GPU, skip to Step 2.

**1. Check if NVIDIA drivers are already installed:**

```bash
nvidia-smi
```

If this command shows your GPU details, skip ahead to item 3 below (installing nvtop). If not, continue with the driver installation.

**2. Install NVIDIA drivers:**

```bash
# Update package list
sudo apt update

# Install the recommended NVIDIA driver
sudo ubuntu-drivers install

# Reboot to load the new driver
sudo reboot
```

After the reboot, verify the driver is working:

```bash
nvidia-smi
```

You should see your GPU model, driver version, and CUDA version.

**3. Install nvtop:**

Install the nvtop utility so you can monitor your GPU utilization:

```bash
sudo apt install nvtop
```

Run it to see live GPU metrics:

```bash
nvtop
```

> **Note:** Ollama automatically detects and uses NVIDIA GPUs when drivers are installed. No additional configuration is needed.

### Step 2: Install Ollama

Install Ollama with the official install script below. This guide's systemd configuration assumes the install script's service layout, so other installation methods (such as Snap packages) may not work correctly with it.

```bash
curl -fsSL https://ollama.com/install.sh | sh
```

When the install completes, it reports whether your GPU was detected. If Ollama does not detect your GPU, revisit the driver installation in Step 1; you can also check the service logs, as shown below.
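If you want to double-check GPU detection after installing, the Ollama systemd service usually logs the GPU it found at startup. A quick, informal way to look (the exact log wording varies between Ollama versions, so treat this as a rough check rather than an official diagnostic):

```bash
# Show recent Ollama service logs and filter for GPU/CUDA-related lines
journalctl -u ollama --no-pager -n 200 | grep -iE "gpu|cuda|nvidia"
```

If nothing GPU-related appears, re-verify `nvidia-smi` from Step 1 before continuing.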
### Step 3: Configure Systemd

Create a systemd override configuration that matches your GPU's VRAM:

> **Important:** These configurations are required for GT AI OS to connect properly to Ollama.

**4GB VRAM:**
```bash
sudo mkdir -p /etc/systemd/system/ollama.service.d

sudo tee /etc/systemd/system/ollama.service.d/override.conf > /dev/null <<'EOF'
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_CONTEXT_LENGTH=4096"
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KEEP_ALIVE=4h"
Environment="OLLAMA_MAX_LOADED_MODELS=1"
EOF
```

**6GB VRAM:**
```bash
sudo mkdir -p /etc/systemd/system/ollama.service.d

sudo tee /etc/systemd/system/ollama.service.d/override.conf > /dev/null <<'EOF'
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_CONTEXT_LENGTH=8192"
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KEEP_ALIVE=4h"
Environment="OLLAMA_MAX_LOADED_MODELS=1"
EOF
```

**8GB VRAM:**
```bash
sudo mkdir -p /etc/systemd/system/ollama.service.d

sudo tee /etc/systemd/system/ollama.service.d/override.conf > /dev/null <<'EOF'
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_CONTEXT_LENGTH=16384"
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KEEP_ALIVE=4h"
Environment="OLLAMA_MAX_LOADED_MODELS=1"
EOF
```

**12GB VRAM:**
```bash
sudo mkdir -p /etc/systemd/system/ollama.service.d

sudo tee /etc/systemd/system/ollama.service.d/override.conf > /dev/null <<'EOF'
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_CONTEXT_LENGTH=32768"
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KEEP_ALIVE=4h"
Environment="OLLAMA_MAX_LOADED_MODELS=2"
EOF
```

**16GB VRAM:**
```bash
sudo mkdir -p /etc/systemd/system/ollama.service.d

sudo tee /etc/systemd/system/ollama.service.d/override.conf > /dev/null <<'EOF'
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_CONTEXT_LENGTH=65536"
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KEEP_ALIVE=4h"
Environment="OLLAMA_MAX_LOADED_MODELS=2"
EOF
```

**32GB+ VRAM:**
```bash
sudo mkdir -p /etc/systemd/system/ollama.service.d

sudo tee /etc/systemd/system/ollama.service.d/override.conf > /dev/null <<'EOF'
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_CONTEXT_LENGTH=131072"
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KEEP_ALIVE=4h"
Environment="OLLAMA_MAX_LOADED_MODELS=3"
EOF
```

**Configuration explained:**
- `OLLAMA_HOST=0.0.0.0:11434` - Listen on all network interfaces (required for Docker)
- `OLLAMA_CONTEXT_LENGTH` - Maximum context window size (adjust based on VRAM)
- `OLLAMA_FLASH_ATTENTION=1` - Enable flash attention for better performance
- `OLLAMA_KEEP_ALIVE=4h` - Keep models loaded for 4 hours
- `OLLAMA_MAX_LOADED_MODELS` - Number of models loaded simultaneously (adjust based on VRAM)

### Step 4: Start Service

```bash
sudo systemctl daemon-reload
sudo systemctl enable ollama
sudo systemctl start ollama
sudo systemctl restart ollama
```
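Before pulling models, it is worth confirming that the override actually took effect and that the API is reachable. A quick check, using the same commands the DGX install script later in this guide uses for verification:

```bash
# Confirm the service picked up the override environment variables
systemctl show ollama --property=Environment | tr ' ' '\n'

# Confirm the API is listening on port 11434
curl -s http://localhost:11434/api/version
```

You should see your `OLLAMA_*` values in the first command's output and a JSON version string from the second.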
### Step 5: Pull a Model

```bash
ollama pull llama3.1:8b
```

### Step 6: Add Model to GT AI OS

1. Open Control Panel: http://localhost:3001
2. Log in with `gtadmin@test.com` / `Test@123`
3. Go to **Models** → **Add Model**
4. Fill in:
   - **Model ID:** `llama3.1:8b` (must match exactly what you pulled)
   - **Provider:** `Local Ollama (Ubuntu x86 / DGX ARM)`
   - **Endpoint URL:** `http://ollama-host:11434/v1/chat/completions`
   - **Model Type:** `LLM` (Language Model - this is the most common type for AI agents)
   - **Context Length:** Use the value from your systemd config (e.g., `8192` for 6GB VRAM)
   - **Max Tokens:** `4096`
5. Click **Save**
6. Go to **Tenant Access** → **Assign Model to Tenant**
7. Select your model, tenant, and rate limit

> ⚠️ **Critical: Model ID Must Match Exactly**
>
> The **Model ID** in GT AI OS must match the Ollama model name **exactly** - character for character. Run `ollama list` to see the exact model names. Common mistakes:
> - Extra spaces before or after the ID
> - Missing version tags (e.g., `qwen3-coder` vs `qwen3-coder:30b`)
> - Typos in the model name
>
> **Example:** If `ollama list` shows `llama3.1:8b`, use `llama3.1:8b` exactly as shown.

---

## NVIDIA DGX Spark and RTX Pro Systems (DGX OS 7)

DGX systems come with NVIDIA drivers and CUDA pre-installed, so Ollama will automatically use the GPUs.

### Step 1: Install Ollama (Clean Install)

Copy and paste the command below to perform a complete clean install of Ollama.

> **Important:** The configuration settings in this script are required for GT AI OS integration on DGX OS 7 systems:
> - `OLLAMA_HOST=0.0.0.0:11434` - Allows Docker containers to connect (required)
> - `OLLAMA_CONTEXT_LENGTH=131072` - 128K context window for long conversations
> - `OLLAMA_FLASH_ATTENTION=1` - Enables flash attention for better GPU performance
> - `OLLAMA_KEEP_ALIVE=4h` - Keeps models loaded to avoid cold start delays
> - `OLLAMA_MAX_LOADED_MODELS=3` - DGX has enough VRAM for multiple models
>
> Do not skip or modify these settings unless you understand the implications.

> ⚠️ **Warning:** This command performs a clean reinstallation of Ollama. Any existing Ollama installation will be removed, including downloaded models. If you wish to preserve your models, back up `/usr/share/ollama/.ollama/models` before proceeding (one way to do this is shown below).
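If you do want to keep models from an existing installation, one way to archive them before running the clean install (the path is the default location mentioned in the warning above; adjust it if your models are stored elsewhere):

```bash
# Archive the existing Ollama model store before the clean reinstall
sudo tar -czf ~/ollama-models-backup.tar.gz -C /usr/share/ollama/.ollama models
```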
```bash
# Cleanup
sudo systemctl stop ollama 2>/dev/null; sudo pkill ollama 2>/dev/null; sleep 2; \
snap list ollama &>/dev/null && sudo snap remove ollama; \
sudo systemctl disable ollama 2>/dev/null; \
sudo rm -f /etc/systemd/system/ollama.service; \
sudo rm -rf /etc/systemd/system/ollama.service.d; \
sudo rm -f /usr/local/bin/ollama /usr/bin/ollama; \
sudo rm -rf /usr/local/lib/ollama; \
id ollama &>/dev/null && sudo userdel -r ollama 2>/dev/null; \
getent group ollama &>/dev/null && sudo groupdel ollama 2>/dev/null; \
sudo systemctl daemon-reload && \
# Install
curl -fsSL https://ollama.com/install.sh | sh && \
if [ ! -f /etc/systemd/system/ollama.service ]; then
  sudo tee /etc/systemd/system/ollama.service > /dev/null <<'EOF'
[Unit]
Description=Ollama Service
After=network-online.target

[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"

[Install]
WantedBy=default.target
EOF
  sudo systemctl daemon-reload
fi && \
# Configure
sudo mkdir -p /etc/systemd/system/ollama.service.d && \
sudo tee /etc/systemd/system/ollama.service.d/override.conf > /dev/null <<'EOF'
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_CONTEXT_LENGTH=131072"
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KEEP_ALIVE=4h"
Environment="OLLAMA_MAX_LOADED_MODELS=3"
EOF
# Start
sudo systemctl daemon-reload && \
sudo systemctl enable ollama && \
sudo systemctl start ollama && \
sudo systemctl restart ollama && \
# Verify
sleep 3 && \
systemctl is-active ollama && echo "✓ Service running" && \
curl -s http://localhost:11434/api/version && echo -e "\n✓ API responding" && \
systemctl show ollama --property=Environment | tr ' ' '\n'
```

### Step 2: Pull Models

DGX systems have more VRAM, so you can run larger models:

```bash
ollama pull llama3.1:8b
ollama pull qwen3-coder:30b
ollama pull gemma3:27b
```

### Step 3: Add Model to GT AI OS

1. Open Control Panel: http://localhost:3001
2. Log in with `gtadmin@test.com` / `Test@123`
3. Go to **Models** → **Add Model**
4. Fill in:
   - **Model ID:** `llama3.1:8b` (or `qwen3-coder:30b`, `gemma3:27b`)
   - **Provider:** `Local Ollama (Ubuntu x86 / DGX ARM)`
   - **Endpoint URL:** `http://ollama-host:11434/v1/chat/completions`
   - **Model Type:** `LLM` (Language Model - this is the most common type for AI agents)
   - **Context Length:** `131072`
   - **Max Tokens:** `4096`
5. Click **Save**
6. Go to **Tenant Access** → **Assign Model to Tenant**
7. Select your model, tenant, and rate limit

> ⚠️ **Critical: Model ID Must Match Exactly**
>
> The **Model ID** in GT AI OS must match the Ollama model name **exactly** - character for character. Run `ollama list` to see the exact model names. Common mistakes:
> - Extra spaces before or after the ID
> - Missing version tags (e.g., `qwen3-coder` vs `qwen3-coder:30b`)
> - Typos in the model name
>
> **Example:** If `ollama list` shows `llama3.1:8b`, use `llama3.1:8b` exactly as shown.

---

## macOS (Apple Silicon M1+)

### Step 1: Install Ollama

Download the macOS app from https://ollama.com/download and install it; the Linux install script used in the sections above does not run on macOS. If you prefer Homebrew, you can install Ollama from the terminal instead:

```bash
brew install ollama
```

### Step 2: Pull a Model

```bash
ollama pull llama3.1:8b
```

### Step 3: Add Model to GT AI OS

1. Open Control Panel: http://localhost:3001
2. Log in with `gtadmin@test.com` / `Test@123`
3. Go to **Models** → **Add Model**
4. Fill in:
   - **Model ID:** `llama3.1:8b` (must match exactly what you pulled)
   - **Provider:** `Local Ollama (macOS Apple Silicon)`
   - **Endpoint URL:** `http://host.docker.internal:11434/v1/chat/completions`
   - **Model Type:** `LLM` (Language Model - this is the most common type for AI agents)
   - **Context Length:** Based on your Mac's unified memory (see table below)
   - **Max Tokens:** `4096`
5. Click **Save**
6. Go to **Tenant Access** → **Assign Model to Tenant**
7. Select your model, tenant, and rate limit

**Context Length by Mac Memory:**

| Unified Memory | Context Length |
|----------------|----------------|
| 8GB | `8192` |
| 16GB | `32768` |
| 32GB | `65536` |
| 64GB+ | `131072` |
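If you are not sure how much unified memory your Mac has, you can check from the terminal; `hw.memsize` is a standard macOS sysctl that reports the value in bytes:

```bash
# Print total unified memory in GB (hw.memsize is reported in bytes)
echo "$(( $(sysctl -n hw.memsize) / 1024 / 1024 / 1024 )) GB"
```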
> ⚠️ **Critical: Model ID Must Match Exactly**
>
> The **Model ID** in GT AI OS must match the Ollama model name **exactly** - character for character. Run `ollama list` to see the exact model names. Common mistakes:
> - Extra spaces before or after the ID
> - Missing version tags (e.g., `qwen3-coder` vs `qwen3-coder:30b`)
> - Typos in the model name
>
> **Example:** If `ollama list` shows `llama3.1:8b`, use `llama3.1:8b` exactly as shown.

---

## Verify Ollama is Working

After completing the setup for your platform, follow these verification steps to ensure Ollama is properly configured and reachable by GT AI OS.

### Step 1: Verify Ollama Service is Running

**All Platforms (Ubuntu, DGX, macOS):**

Run these commands on your host machine (not inside Docker) to confirm Ollama is running and responding:

```bash
ollama list
```

This lists all models you have pulled. You should see `llama3.1:8b` (or whichever models you installed).

```bash
curl http://localhost:11434/api/version
```

This tests the Ollama API. You should see a JSON response with version information, such as `{"version":"0.x.x"}`.

### Step 2: Verify GPU Acceleration

**Ubuntu x86 and DGX Only** (skip this step on macOS):

While a model is running, check that your NVIDIA GPU is being utilized:

```bash
nvtop
```

or

```bash
nvidia-smi
```

You should see `ollama` or `ollama_llama_server` processes using GPU memory. If you only see CPU usage, revisit the NVIDIA driver installation in your platform's setup.

**macOS:** Apple Silicon Macs automatically use the GPU via Metal. No verification is needed.

### Step 3: Verify GT AI OS Can Reach Ollama

This step confirms that the Docker containers running GT AI OS can communicate with Ollama on your host machine.

**macOS (Apple Silicon M1+):**
```bash
docker exec gentwo-resource-cluster curl http://host.docker.internal:11434/api/version
```

**Ubuntu x86 and DGX:**
```bash
docker exec gentwo-resource-cluster curl http://ollama-host:11434/api/version
```

You should see the same JSON version response. If you get a connection error, check that:
- Ollama is running (`ollama list` works)
- On Ubuntu/DGX: the systemd config has `OLLAMA_HOST=0.0.0.0:11434`
- GT AI OS containers are running (`docker ps | grep gentwo`)

### Step 4: Test in the Application

Once all verification steps pass, test the full integration:

1. Open Tenant App: http://localhost:3002
2. Create a new agent or edit an existing one
3. Select your Ollama model (e.g., `llama3.1:8b`) from the model dropdown
4. Send a test message and verify you get a response

If the agent doesn't respond, check the model configuration in Control Panel → Models and make sure the Model ID matches exactly what `ollama list` shows.
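As a final optional check, you can exercise the same OpenAI-compatible chat endpoint that GT AI OS calls, directly from your host. A minimal sketch with `curl`, assuming you pulled `llama3.1:8b` (from the host use `localhost`; from inside a container, use the endpoint URL for your platform from the Quick Reference table):

```bash
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama3.1:8b",
        "messages": [
          {"role": "user", "content": "Reply with one short sentence to confirm you are working."}
        ]
      }'
```

A JSON response containing a `choices` array with the model's reply means the endpoint GT AI OS will use is working end to end.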