# Homelab Infrastructure
## Quick Reference - Common Tasks
| Task | Section | Quick Command |
|------|---------|---------------|
| **Add new public service** | [Reverse Proxy](#reverse-proxy-architecture-traefik) | Create Traefik config + Cloudflare DNS |
| **Add Cloudflare DNS** | [Cloudflare API](#cloudflare-api-access) | `curl -X POST cloudflare.com/...` |
| **Check server temps** | [Temperature Check](#server-temperature-check) | `ssh pve 'grep Tctl ...'` |
| **Syncthing issues** | [Troubleshooting](#troubleshooting-runbooks) | Check API connections |
| **SSL cert issues** | [Traefik DNS Challenge](#ssl-certificates) | Use `cloudflare` resolver |
**Key Credentials (see sections for full details):**
- Cloudflare: `cloudflare@htsn.io` / API Key in [Cloudflare API](#cloudflare-api-access)
- SSH Password: `GrilledCh33s3#`
- Traefik: CT 202 @ 10.10.10.250
---
## Role
You are the **Homelab Assistant** - a Claude Code session dedicated to managing and maintaining Hutson's home infrastructure. Your responsibilities include:
- **Infrastructure Management**: Proxmox servers, VMs, containers, networking
- **File Sync**: Syncthing configuration across all devices (Mac Mini, MacBook, Windows PC, TrueNAS, Android)
- **Network Administration**: Router config, SSH access, Tailscale, device management
- **Power Optimization**: CPU governors, GPU power states, service tuning
- **Documentation**: Keep CLAUDE.md, SYNCTHING.md, and SHELL-ALIASES.md up to date
- **Automation**: Shell aliases, startup scripts, scheduled tasks
You have full access to all homelab devices via SSH and APIs. Use this context to help troubleshoot, configure, and optimize the infrastructure.
### Proactive Behaviors
When the user mentions issues or asks questions, proactively:
- **"sync not working"** → Check Syncthing status on ALL devices, identify which is offline
- **"device offline"** → Ping both local and Tailscale IPs, check if service is running
- **"slow"** → Check CPU usage, running processes, Syncthing rescan activity
- **"check status"** → Run full health check across all systems
- **"something's wrong"** → Run diagnostics on likely culprits based on context
### Quick Health Checks
Run these to get a quick overview of the homelab:
```bash
# === FULL HEALTH CHECK ===
# Syncthing connections (Mac Mini)
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" "http://127.0.0.1:8384/rest/system/connections" | python3 -c "import sys,json; d=json.load(sys.stdin)['connections']; [print(f\"{v.get('name',k[:7])}: {'UP' if v['connected'] else 'DOWN'}\") for k,v in d.items()]"
# Proxmox VMs
ssh pve 'qm list' 2>/dev/null || echo "PVE: unreachable"
ssh pve2 'qm list' 2>/dev/null || echo "PVE2: unreachable"
# Ping critical devices
ping -c 1 -W 1 10.10.10.200 >/dev/null && echo "TrueNAS: UP" || echo "TrueNAS: DOWN"
ping -c 1 -W 1 10.10.10.1 >/dev/null && echo "Router: UP" || echo "Router: DOWN"
# Check Windows PC Syncthing (often goes offline)
nc -zw1 10.10.10.150 22000 && echo "Windows Syncthing: UP" || echo "Windows Syncthing: DOWN"
```
### Troubleshooting Runbooks
| Symptom | Check | Fix |
|---------|-------|-----|
| Device not syncing | `curl Syncthing API → connections` | Check if device online, restart Syncthing |
| Windows PC offline | `ping 10.10.10.150` then `nc -z 22000` | SSH in, `Start-ScheduledTask -TaskName "Syncthing"` |
| Phone not syncing | Phone Syncthing app in background? | User must open app, keep screen on |
| High CPU on TrueNAS | Syncthing rescan? KSM? | Check rescan intervals, disable KSM |
| VM won't start | Storage available? RAM free? | `ssh pve 'qm start VMID'`, check logs |
| Tailscale offline | `tailscale status` | `tailscale up` or restart service |
| Tailscale no subnet access | Check subnet routers | Verify pve or ucg-fiber advertising routes |
| Sync stuck at X% | Folder errors? Conflicts? | Check `rest/folder/errors?folder=NAME` (example below) |
| Server running hot | Check KSM, check CPU processes | Disable KSM, identify runaway process |
| Storage enclosure loud | Check fan speed via SES | See [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md) |
| Drives not detected | Check SAS link, LCC status | Switch LCC, rescan SCSI hosts |
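**Query folder errors (for the "Sync stuck" row):** a minimal sketch using the Mac Mini key from the API table; the folder ID `documents` is illustrative:
```bash
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
  "http://127.0.0.1:8384/rest/folder/errors?folder=documents" | python3 -m json.tool
```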
### Server Temperature Check
```bash
# Check temps on both servers (Threadripper PRO max safe: 90°C Tctl)
ssh pve 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE Tctl: $(($(cat $f)/1000))°C"; fi; done'
ssh pve2 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE2 Tctl: $(($(cat $f)/1000))°C"; fi; done'
```
**Healthy temps**: 70-80°C under load. **Warning**: >85°C. **Throttle**: 90°C.
### Service Dependencies
```
TrueNAS (10.10.10.200)
├── Central Syncthing hub - if down, sync breaks between devices
├── NFS/SMB shares for VMs
└── Media storage for Plex
PiHole (CT 200)
└── DNS for entire network - if down, name resolution fails
Traefik (CT 202)
└── Reverse proxy - if down, external access to services fails
Router (10.10.10.1)
└── Everything - gateway for all traffic
```
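A quick probe of these dependencies, in order of blast radius (sketch; IPs from the tables in this document):
```bash
ping -c 1 -W 1 10.10.10.1 >/dev/null && echo "Router: UP" || echo "Router: DOWN"
dig +short +time=2 @10.10.10.10 htsn.io >/dev/null && echo "PiHole DNS: UP" || echo "PiHole DNS: DOWN"
ping -c 1 -W 1 10.10.10.200 >/dev/null && echo "TrueNAS: UP" || echo "TrueNAS: DOWN"
nc -zw1 10.10.10.250 443 && echo "Traefik: UP" || echo "Traefik: DOWN"
```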
### API Quick Reference
| Service | Device | Endpoint | Auth |
|---------|--------|----------|------|
| Syncthing | Mac Mini | `http://127.0.0.1:8384/rest/` | `X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5` |
| Syncthing | MacBook | `http://127.0.0.1:8384/rest/` (via SSH) | `X-API-Key: qYkNdVLwy9qZZZ6MqnJr7tHX7KKdxGMJ` |
| Syncthing | Phone | `https://10.10.10.54:8384/rest/` | `X-API-Key: Xxz3jDT4akUJe6psfwZsbZwG2LhfZuDM` |
| Proxmox | PVE | `https://10.10.10.120:8006/api2/json/` | SSH key auth |
| Proxmox | PVE2 | `https://10.10.10.102:8006/api2/json/` | SSH key auth |
### Common Maintenance Tasks
When the user asks for maintenance or you notice issues, work through this list (a combined sweep sketch follows):
1. **Check Syncthing sync status** - Any folders behind? Errors?
2. **Verify all devices connected** - Run connection check
3. **Check disk space** - `ssh pve 'df -h'`, `ssh pve2 'df -h'`
4. **Review ZFS pool health** - `ssh pve 'zpool status'`
5. **Check for stuck processes** - High CPU? Memory pressure?
6. **Verify backups** - Are critical folders syncing?
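A minimal sweep combining these checks (sketch):
```bash
# === MAINTENANCE SWEEP ===
ssh pve 'df -h /; zpool status -x'       # disk space + pool health (-x prints only problems)
ssh pve2 'df -h /; zpool status -x'
ssh pve 'ps aux --sort=-%cpu | head -5'  # runaway processes
```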
### Emergency Commands
```bash
# Restart VM on Proxmox
ssh pve 'qm stop VMID && qm start VMID'
# Check what's using CPU
ssh pve 'ps aux --sort=-%cpu | head -10'
# Check ZFS pool status (via QEMU agent)
ssh pve 'qm guest exec 100 -- bash -c "zpool status vault"'
# Check EMC enclosure fans
ssh pve 'qm guest exec 100 -- bash -c "sg_ses --index=coo,-1 --get=speed_code /dev/sg15"'
# Force Syncthing rescan
curl -X POST "http://127.0.0.1:8384/rest/db/scan?folder=FOLDER" -H "X-API-Key: API_KEY"
# Restart Syncthing on Windows (when stuck)
sshpass -p 'GrilledCh33s3#' ssh claude@10.10.10.150 'Stop-Process -Name syncthing -Force; Start-ScheduledTask -TaskName "Syncthing"'
# Get all device IPs from router
expect -c 'spawn ssh root@10.10.10.1 "cat /proc/net/arp"; expect "Password:"; send "GrilledCh33s3#\r"; expect eof'
```
## Overview
Two Proxmox servers running various VMs and containers for home infrastructure, media, development, and AI workloads.
## Servers
### PVE (10.10.10.120) - Primary
- **CPU**: AMD Ryzen Threadripper PRO 3975WX (32-core, 64 threads, 280W TDP)
- **RAM**: 128 GB
- **Storage**:
  - `nvme-mirror1`: 2x Sabrent Rocket Q NVMe (3.6TB usable)
  - `nvme-mirror2`: 2x Kingston SFYRD 2TB (1.8TB usable)
  - `rpool`: 2x Samsung 870 QVO 4TB SSD mirror (3.6TB usable)
- **GPUs**:
  - NVIDIA Quadro P2000 (75W TDP) - Plex transcoding
  - NVIDIA TITAN RTX (280W TDP) - AI workloads, passed to saltbox/lmdev1
- **Role**: Primary VM host, TrueNAS, media services
### PVE2 (10.10.10.102) - Secondary
- **CPU**: AMD Ryzen Threadripper PRO 3975WX (32-core, 64 threads, 280W TDP)
- **RAM**: 128 GB
- **Storage**:
  - `nvme-mirror3`: 2x NVMe mirror
  - `local-zfs2`: 2x WD Red 6TB HDD mirror
- **GPUs**:
  - NVIDIA RTX A6000 (300W TDP) - passed to trading-vm
- **Role**: Trading platform, development
## SSH Access
### SSH Key Authentication (All Hosts)
SSH keys are configured in `~/.ssh/config` on both Mac Mini and MacBook. Use the `~/.ssh/homelab` key.
| Host Alias | IP | User | Type | Notes |
|------------|-----|------|------|-------|
| `pve` | 10.10.10.120 | root | Proxmox | Primary server |
| `pve2` | 10.10.10.102 | root | Proxmox | Secondary server |
| `truenas` | 10.10.10.200 | root | VM | NAS/storage |
| `saltbox` | 10.10.10.100 | hutson | VM | Media automation |
| `lmdev1` | 10.10.10.111 | hutson | VM | AI/LLM development |
| `docker-host` | 10.10.10.206 | hutson | VM | Docker services |
| `fs-dev` | 10.10.10.5 | hutson | VM | Development |
| `copyparty` | 10.10.10.201 | hutson | VM | File sharing |
| `gitea-vm` | 10.10.10.220 | hutson | VM | Git server |
| `trading-vm` | 10.10.10.221 | hutson | VM | AI trading platform |
| `pihole` | 10.10.10.10 | root | LXC | DNS/Ad blocking |
| `traefik` | 10.10.10.250 | root | LXC | Reverse proxy |
| `findshyt` | 10.10.10.8 | root | LXC | Custom app |
**Usage examples:**
```bash
ssh pve 'qm list' # List VMs
ssh truenas 'zpool status vault' # Check ZFS pool
ssh saltbox 'docker ps' # List containers
ssh pihole 'pihole status' # Check Pi-hole
```
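Each alias maps to a stanza like this in `~/.ssh/config` (illustrative sketch; the actual file may differ):
```
Host pve
    HostName 10.10.10.120
    User root
    IdentityFile ~/.ssh/homelab
```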
### Password Auth (Special Cases)
| Device | IP | User | Auth Method | Notes |
|--------|-----|------|-------------|-------|
| UniFi Router | 10.10.10.1 | root | expect (keyboard-interactive) | Gateway |
| Windows PC | 10.10.10.150 | claude | sshpass | PowerShell, use `;` not `&&` |
| HomeAssistant | 10.10.10.110 | - | QEMU agent only | No SSH server |
**Router access (requires expect):**
```bash
# Run command on router
expect -c 'spawn ssh root@10.10.10.1 "hostname"; expect "Password:"; send "GrilledCh33s3#\r"; expect eof'
# Get ARP table (all device IPs)
expect -c 'spawn ssh root@10.10.10.1 "cat /proc/net/arp"; expect "Password:"; send "GrilledCh33s3#\r"; expect eof'
```
**Windows PC access:**
```bash
sshpass -p 'GrilledCh33s3#' ssh claude@10.10.10.150 'Get-Process | Select -First 5'
```
**HomeAssistant (no SSH, use QEMU agent):**
```bash
ssh pve 'qm guest exec 110 -- bash -c "ha core info"'
```
## VMs and Containers
### PVE (10.10.10.120)
| VMID | Name | vCPUs | RAM | Purpose | GPU/Passthrough | QEMU Agent |
|------|------|-------|-----|---------|-----------------|------------|
| 100 | truenas | 8 | 32GB | NAS, storage | LSI SAS2308 HBA, Samsung NVMe | Yes |
| 101 | saltbox | 16 | 16GB | Media automation | TITAN RTX | Yes |
| 105 | fs-dev | 10 | 8GB | Development | - | Yes |
| 110 | homeassistant | 2 | 2GB | Home automation | - | No |
| 111 | lmdev1 | 8 | 32GB | AI/LLM development | TITAN RTX | Yes |
| 201 | copyparty | 2 | 2GB | File sharing | - | Yes |
| 206 | docker-host | 2 | 4GB | Docker services | - | Yes |
| 200 | pihole (CT) | - | - | DNS/Ad blocking | - | N/A |
| 202 | traefik (CT) | - | - | Reverse proxy | - | N/A |
| 205 | findshyt (CT) | - | - | Custom app | - | N/A |
### PVE2 (10.10.10.102)
| VMID | Name | vCPUs | RAM | Purpose | GPU/Passthrough | QEMU Agent |
|------|------|-------|-----|---------|-----------------|------------|
| 300 | gitea-vm | 2 | 4GB | Git server | - | Yes |
| 301 | trading-vm | 16 | 32GB | AI trading platform | RTX A6000 | Yes |
### QEMU Guest Agent
VMs with QEMU agent can be managed via `qm guest exec`:
```bash
# Execute command in VM
ssh pve 'qm guest exec 100 -- bash -c "zpool status vault"'
# Get VM IP addresses
ssh pve 'qm guest exec 100 -- bash -c "ip addr"'
```
Only VM 110 (homeassistant) lacks QEMU agent - use its web UI instead.
## Power Management
### Estimated Power Draw
- **PVE**: 500-750W (CPU + TITAN RTX + P2000 + storage + HBAs)
- **PVE2**: 450-600W (CPU + RTX A6000 + storage)
- **Combined**: ~1000-1350W under load
### Optimizations Applied
1. **KSMD Disabled** (updated 2024-12-17)
   - Was consuming 44-57% CPU on PVE with negative `general_profit` (no dedup benefit)
   - Caused CPU temp to rise from 74°C to 83°C
   - Savings: ~7-10W + significant temp reduction
   - Made permanent via:
     - systemd service: `/etc/systemd/system/disable-ksm.service`
     - **ksmtuned masked**: `systemctl mask ksmtuned` (prevents re-enabling)
   - **Note**: KSM can get re-enabled by Proxmox updates. If CPU is hot, check:
   ```bash
   cat /sys/kernel/mm/ksm/run   # Should be 0
   ps aux | grep ksmd           # Should show 0% CPU
   # If KSM is running (run=1), disable it:
   echo 0 > /sys/kernel/mm/ksm/run
   systemctl mask ksmtuned
   ```
2. **Syncthing Rescan Intervals** (2024-12-16)
   - Changed aggressive 60s rescans to 3600s for large folders
   - Affected: downloads (38GB), documents (11GB), desktop (7.2GB), movies, pictures, notes, config
   - Savings: ~60-80W (TrueNAS VM was at constant 86% CPU)
3. **CPU Governor Optimization** (2024-12-16)
   - PVE: `powersave` governor + `balance_power` EPP (amd-pstate-epp driver)
   - PVE2: `schedutil` governor (acpi-cpufreq driver)
   - Made permanent via systemd service: `/etc/systemd/system/cpu-powersave.service`
   - Savings: ~60-120W combined (CPUs now idle at 1.7-2.2GHz vs 4GHz)
4. **GPU Power States** (2024-12-16) - Verified optimal
   - RTX A6000: 11W idle (P8 state)
   - TITAN RTX: 2-3W idle (P8 state)
   - Quadro P2000: 25W (P0 - Plex keeps it active)
5. **ksmtuned Disabled** (2024-12-16)
   - KSM tuning daemon was still running after KSMD was disabled
   - Stopped and disabled on both servers
   - Savings: ~2-5W
6. **HDD Spindown on PVE2** (2024-12-16)
   - local-zfs2 pool (2x WD Red 6TB) had only 768KB used but drives were spinning 24/7
   - Set 30-minute spindown via `hdparm -S 241`
   - Persistent via udev rule: `/etc/udev/rules.d/69-hdd-spindown.rules` (sketch below)
   - Savings: ~10-16W when spun down
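The udev rule looks roughly like this (a sketch; the match conditions are an assumption, check the actual file on PVE2):
```bash
# /etc/udev/rules.d/69-hdd-spindown.rules (illustrative)
# Set 30-min spindown (-S 241) on rotational disks as they appear
ACTION=="add", SUBSYSTEM=="block", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="1", RUN+="/usr/sbin/hdparm -S 241 /dev/%k"
```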
### Potential Optimizations
- [ ] PCIe ASPM power management
- [ ] NMI watchdog disable
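Before enabling either, the current state can be inspected (sketch):
```bash
ssh pve 'cat /sys/module/pcie_aspm/parameters/policy'  # bracketed entry = active ASPM policy
ssh pve 'cat /proc/sys/kernel/nmi_watchdog'            # 1 = enabled, 0 = disabled
```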
## Memory Configuration
- Ballooning enabled on most VMs but not actively used
- No memory overcommit (98GB allocated on 128GB physical for PVE)
- KSMD was wasting CPU with no benefit (negative general_profit)
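To confirm a VM's ballooning setting (sketch; VMID 100 as an example):
```bash
ssh pve 'qm config 100 | grep -i balloon'  # balloon: <MB> target; 0 = disabled
```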
## Network
See [NETWORK.md](NETWORK.md) for full details.
### Network Ranges
| Network | Range | Purpose |
|---------|-------|---------|
| LAN | 10.10.10.0/24 | Primary network, all external access |
| Internal | 10.10.20.0/24 | Inter-VM only (storage, NFS/iSCSI) |
### PVE Bridges (10.10.10.120)
| Bridge | NIC | Speed | Purpose | Use For |
|--------|-----|-------|---------|---------|
| vmbr0 | enp1s0 | 1 Gb | Management | General VMs/CTs |
| vmbr1 | enp35s0f0 | 10 Gb | High-speed LXC | Bandwidth-heavy containers |
| vmbr2 | enp35s0f1 | 10 Gb | High-speed VM | TrueNAS, Saltbox, storage VMs |
| vmbr3 | (none) | Virtual | Internal only | NFS/iSCSI traffic, no internet |
### Quick Reference
```bash
# Add VM to standard network (1Gb)
qm set VMID --net0 virtio,bridge=vmbr0
# Add VM to high-speed network (10Gb)
qm set VMID --net0 virtio,bridge=vmbr2
# Add secondary NIC for internal storage network
qm set VMID --net1 virtio,bridge=vmbr3
```
### MTU 9000 (Jumbo Frames)
Jumbo frames are enabled across the network for improved throughput on large transfers.
| Device | Interface | MTU | Persistent |
|--------|-----------|-----|------------|
| Mac Mini | en0 | 9000 | Yes (networksetup) |
| PVE | vmbr0, enp1s0 | 9000 | Yes (/etc/network/interfaces) |
| PVE2 | vmbr0, nic1 | 9000 | Yes (/etc/network/interfaces) |
| TrueNAS | enp6s18, enp6s19 | 9000 | Yes |
| UCG-Fiber | br0 | 9216 | Yes (default) |
**Verify MTU:**
```bash
# Mac Mini
ifconfig en0 | grep mtu
# PVE/PVE2
ssh pve 'ip link show vmbr0 | grep mtu'
ssh pve2 'ip link show vmbr0 | grep mtu'
# Test jumbo frames
ping -c 1 -D -s 8000 10.10.10.120 # 8000 data + 8 ICMP header + 20 IP header = 8028 bytes; fails on a 1500 MTU path
```
**Important:** When setting MTU on Proxmox bridges, ensure BOTH the bridge (vmbr0) AND the underlying physical interface (enp1s0/nic1) have the same MTU, otherwise packets will be dropped.
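On PVE this means both stanzas in `/etc/network/interfaces` carry `mtu 9000`; a sketch of the relevant part (addresses from the tables above; the actual file may differ):
```bash
auto enp1s0
iface enp1s0 inet manual
    mtu 9000

auto vmbr0
iface vmbr0 inet static
    address 10.10.10.120/24
    gateway 10.10.10.1
    bridge-ports enp1s0
    mtu 9000
```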
### Tailscale VPN
Tailscale provides secure remote access to the homelab from anywhere.
**Subnet Routers (HA Failover)**
Two devices advertise the `10.10.10.0/24` subnet for redundancy:
| Device | Tailscale IP | Role | Notes |
|--------|--------------|------|-------|
| pve | 100.113.177.80 | Primary | Proxmox host |
| ucg-fiber | 100.94.246.32 | Failover | UniFi router (always on) |
If Proxmox goes down, Tailscale automatically fails over to the router (~10-30 sec).
**Router Tailscale Setup (UCG-Fiber)**
- Installed via: `curl -fsSL https://tailscale.com/install.sh | sh`
- Config: `tailscale up --advertise-routes=10.10.10.0/24 --accept-routes`
- Survives reboots (systemd service)
- Routes must be approved in [Tailscale Admin Console](https://login.tailscale.com/admin/machines)
**Tailscale IPs Quick Reference**
| Device | Tailscale IP | Local IP |
|--------|--------------|----------|
| Mac Mini | 100.108.89.58 | 10.10.10.125 |
| PVE | 100.113.177.80 | 10.10.10.120 |
| UCG-Fiber | 100.94.246.32 | 10.10.10.1 |
| TrueNAS | 100.100.94.71 | 10.10.10.200 |
| Pi-hole | 100.112.59.128 | 10.10.10.10 |
**Check Tailscale Status**
```bash
# From Mac Mini
/Applications/Tailscale.app/Contents/MacOS/Tailscale status
# From router
expect -c 'spawn ssh root@10.10.10.1 "tailscale status"; expect "Password:"; send "GrilledCh33s3#\r"; expect eof'
```
## Common Commands
```bash
# Check VM status
ssh pve 'qm list'
ssh pve2 'qm list'
# Check container status
ssh pve 'pct list'
# Monitor CPU/power
ssh pve 'top -bn1 | head -20'
# Check ZFS pools
ssh pve 'zpool status'
# Check GPU (if nvidia-smi installed in VM)
ssh pve 'lspci | grep -i nvidia'
```
## Remote Claude Code Sessions (Mac Mini)
### Overview
The Mac Mini (`hutson-mac-mini.local`) runs the Happy Coder daemon, enabling on-demand Claude Code sessions accessible from anywhere via the Happy Coder mobile app. Sessions are created when you need them - no persistent tmux sessions required.
### Architecture
```
Mac Mini (100.108.89.58 via Tailscale)
├── launchd (auto-starts on boot)
│ └── com.hutson.happy-daemon.plist (starts Happy daemon)
├── Happy Coder daemon (manages remote sessions)
└── Tailscale (secure remote access)
```
### How It Works
1. Happy daemon runs on Mac Mini (auto-starts on boot)
2. Open Happy Coder app on phone/tablet
3. Start a new Claude session from the app
4. Session runs in any working directory you choose
5. Session ends when you're done - no cleanup needed
### Quick Commands
```bash
# Check daemon status
happy daemon list
# Start a new session manually (from Mac Mini terminal)
cd ~/Projects/homelab && happy claude
# Check active sessions
happy daemon list
```
### Mobile Access Setup (One-time)
1. Download Happy Coder app:
   - iOS: https://apps.apple.com/us/app/happy-claude-code-client/id6748571505
   - Android: https://play.google.com/store/apps/details?id=com.ex3ndr.happy
2. On Mac Mini, run: `happy auth` and scan QR code with the app
3. Daemon auto-starts on boot via launchd
### Daemon Management
```bash
happy daemon start # Start daemon
happy daemon stop # Stop daemon
happy daemon status # Check status
happy daemon list # List active sessions
```
### Remote Access via SSH + Tailscale
From any device on Tailscale network:
```bash
# SSH to Mac Mini
ssh hutson@100.108.89.58
# Or via hostname
ssh hutson@mac-mini
# Start Claude in desired directory
cd ~/Projects/homelab && happy claude
```
### Files & Configuration
| File | Purpose |
|------|---------|
| `~/Library/LaunchAgents/com.hutson.happy-daemon.plist` | launchd auto-start Happy daemon |
| `~/.happy/` | Happy Coder config and logs |
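After editing the plist, reload it (sketch):
```bash
launchctl unload ~/Library/LaunchAgents/com.hutson.happy-daemon.plist
launchctl load ~/Library/LaunchAgents/com.hutson.happy-daemon.plist
```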
### Troubleshooting
```bash
# Check if daemon is running
pgrep -f "happy.*daemon"
# Check launchd status
launchctl list | grep happy
# List active sessions
happy daemon list
# Restart daemon
happy daemon stop && happy daemon start
# If Tailscale is disconnected
/Applications/Tailscale.app/Contents/MacOS/Tailscale up
```
## Agent and Tool Guidelines
### Background Agents
- **Always spin up background agents when doing multiple independent tasks**
- Background agents allow parallel execution of tasks that don't depend on each other
- This improves efficiency and reduces total execution time
- Use background agents for tasks like running tests, builds, or searches simultaneously
### MCP Tools for Web Searches
#### ref.tools - Documentation Lookups
- **`mcp__Ref__ref_search_documentation`**: Search through documentation for specific topics
- **`mcp__Ref__ref_read_url`**: Read and parse content from documentation URLs
#### Exa MCP - General Web and Code Searches
- **`mcp__exa__web_search_exa`**: General web searches for current information
- **`mcp__exa__get_code_context_exa`**: Code-related searches and repository lookups
### MCP Tools Reference Table
| Tool Name | Provider | Purpose | Use Case |
|-----------|----------|---------|----------|
| `mcp__Ref__ref_search_documentation` | ref.tools | Search documentation | Finding specific topics in official docs |
| `mcp__Ref__ref_read_url` | ref.tools | Read documentation URLs | Parsing and extracting content from doc pages |
| `mcp__exa__web_search_exa` | Exa MCP | General web search | Current events, general information lookup |
| `mcp__exa__get_code_context_exa` | Exa MCP | Code-specific search | Finding code examples, repository searches |
## Reverse Proxy Architecture (Traefik)
### Overview
There are **TWO separate Traefik instances** handling different services:
| Instance | Location | IP | Purpose | Manages |
|----------|----------|-----|---------|---------|
| **Traefik-Primary** | CT 202 | **10.10.10.250** | General services | All non-Saltbox services |
| **Traefik-Saltbox** | VM 101 (Docker) | **10.10.10.100** | Saltbox services | Plex, *arr apps, media stack |
### ⚠️ CRITICAL RULE: Which Traefik to Use
**When adding ANY new service:**
- ✅ **Use Traefik-Primary (10.10.10.250)** - Unless service lives inside Saltbox VM
- ❌ **DO NOT touch Traefik-Saltbox** - It manages Saltbox services with their own certificates
**Why this matters:**
- Traefik-Saltbox has complex Saltbox-managed configs
- Messing with it breaks Plex, Sonarr, Radarr, and all media services
- Each Traefik has its own Let's Encrypt certificates
- Mixing them causes certificate conflicts
### Traefik-Primary (CT 202) - For New Services
**Location**: `/etc/traefik/` on Container 202
**Config**: `/etc/traefik/traefik.yaml`
**Dynamic Configs**: `/etc/traefik/conf.d/*.yaml`
**Services using Traefik-Primary (10.10.10.250):**
- excalidraw.htsn.io → 10.10.10.206:8080 (docker-host)
- findshyt.htsn.io → 10.10.10.205 (CT 205)
- gitea (git.htsn.io) → 10.10.10.220:3000
- homeassistant → 10.10.10.110
- lmdev → 10.10.10.111
- pihole → 10.10.10.200
- truenas → 10.10.10.200
- proxmox → 10.10.10.120
- copyparty → 10.10.10.201
- aitrade → trading server
- pulse.htsn.io → 10.10.10.206:7655 (Pulse monitoring)
**Access Traefik config:**
```bash
# From Mac Mini:
ssh pve 'pct exec 202 -- cat /etc/traefik/traefik.yaml'
ssh pve 'pct exec 202 -- ls /etc/traefik/conf.d/'
# Edit a service config:
ssh pve 'pct exec 202 -- vi /etc/traefik/conf.d/myservice.yaml'
```
### Traefik-Saltbox (VM 101) - DO NOT MODIFY
**Location**: `/opt/traefik/` inside Saltbox VM
**Managed by**: Saltbox Ansible playbooks
**Mounts**: Docker bind mount from `/opt/traefik` → `/etc/traefik` in container
**Services using Traefik-Saltbox (10.10.10.100):**
- Plex (plex.htsn.io)
- Sonarr, Radarr, Lidarr
- SABnzbd, NZBGet, qBittorrent
- Overseerr, Tautulli, Organizr
- Jackett, NZBHydra2
- Authelia (SSO)
- All other Saltbox-managed containers
**View Saltbox Traefik (read-only):**
```bash
ssh pve 'qm guest exec 101 -- bash -c "docker exec traefik cat /etc/traefik/traefik.yml"'
```
### Adding a New Public Service - Complete Workflow
Follow these steps to deploy a new service and make it publicly accessible at `servicename.htsn.io`.
#### Step 0. Deploy Your Service
First, deploy your service on the appropriate host:
**Option A: Docker on docker-host (10.10.10.206)**
```bash
ssh hutson@10.10.10.206
sudo mkdir -p /opt/myservice
cat > /opt/myservice/docker-compose.yml << 'EOF'
version: "3.8"
services:
  myservice:
    image: myimage:latest
    ports:
      - "8080:80"
    restart: unless-stopped
EOF
cd /opt/myservice && sudo docker-compose up -d
```
**Option B: New LXC Container on PVE**
```bash
ssh pve 'pct create CTID local:vztmpl/ubuntu-22.04-standard_22.04-1_amd64.tar.zst \
--hostname myservice --memory 2048 --cores 2 \
--net0 name=eth0,bridge=vmbr0,ip=10.10.10.XXX/24,gw=10.10.10.1 \
--rootfs local-zfs:8 --unprivileged 1 --start 1'
```
**Option C: New VM on PVE**
```bash
ssh pve 'qm create VMID --name myservice --memory 2048 --cores 2 \
--net0 virtio,bridge=vmbr0 --scsihw virtio-scsi-pci'
```
#### Step 1. Create Traefik Config File
Use this template for new services on **Traefik-Primary (CT 202)**:
```yaml
# /etc/traefik/conf.d/myservice.yaml
http:
  routers:
    # HTTPS router
    myservice-secure:
      entryPoints:
        - websecure
      rule: "Host(`myservice.htsn.io`)"
      service: myservice
      tls:
        certResolver: cloudflare # Use 'cloudflare' for proxied domains, 'letsencrypt' for DNS-only
      priority: 50
    # HTTP → HTTPS redirect
    myservice-redirect:
      entryPoints:
        - web
      rule: "Host(`myservice.htsn.io`)"
      middlewares:
        - myservice-https-redirect
      service: myservice
      priority: 50
  services:
    myservice:
      loadBalancer:
        servers:
          - url: "http://10.10.10.XXX:PORT"
  middlewares:
    myservice-https-redirect:
      redirectScheme:
        scheme: https
        permanent: true
### SSL Certificates
Traefik has **two certificate resolvers** configured:
| Resolver | Use When | Challenge Type | Notes |
|----------|----------|----------------|-------|
| `letsencrypt` | Cloudflare DNS-only (gray cloud) | HTTP-01 | Requires port 80 reachable |
| `cloudflare` | Cloudflare Proxied (orange cloud) | DNS-01 | Works with Cloudflare proxy |
**⚠️ Important:** If Cloudflare proxy is enabled (orange cloud), HTTP challenge fails because Cloudflare redirects HTTP→HTTPS. Use `cloudflare` resolver instead.
**Cloudflare API credentials** are configured in `/etc/systemd/system/traefik.service`:
```bash
Environment="CF_API_EMAIL=cloudflare@htsn.io"
Environment="CF_API_KEY=849ebefd163d2ccdec25e49b3e1b3fe2cdadc"
```
**Certificate storage:**
- HTTP challenge certs: `/etc/traefik/acme.json`
- DNS challenge certs: `/etc/traefik/acme-cf.json`
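The resolver definitions in `/etc/traefik/traefik.yaml` look roughly like this (a sketch matching the table above; the actual file may differ):
```yaml
certificatesResolvers:
  letsencrypt:
    acme:
      email: cloudflare@htsn.io
      storage: /etc/traefik/acme.json
      httpChallenge:
        entryPoint: web
  cloudflare:
    acme:
      email: cloudflare@htsn.io
      storage: /etc/traefik/acme-cf.json
      dnsChallenge:
        provider: cloudflare
```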
**Deploy the config:**
```bash
# Create file on CT 202
ssh pve 'pct exec 202 -- bash -c "cat > /etc/traefik/conf.d/myservice.yaml << '\''EOF'\''
<paste config here>
EOF"'
# Traefik auto-reloads (watches conf.d directory)
# Check logs:
ssh pve 'pct exec 202 -- tail -f /var/log/traefik/traefik.log'
```
#### Step 2. Add Cloudflare DNS Entry
**Cloudflare Credentials:**
- Email: `cloudflare@htsn.io`
- API Key: `849ebefd163d2ccdec25e49b3e1b3fe2cdadc`
**Manual method (via Cloudflare Dashboard):**
1. Go to https://dash.cloudflare.com/
2. Select `htsn.io` domain
3. DNS → Add Record
4. Type: `A`, Name: `myservice`, IPv4: `70.237.94.174`, Proxied: ☑️
**Automated method (CLI script):**
Save this as `~/bin/add-cloudflare-dns.sh`:
```bash
#!/bin/bash
# Add DNS record to Cloudflare for htsn.io
SUBDOMAIN="$1"
CF_EMAIL="cloudflare@htsn.io"
CF_API_KEY="849ebefd163d2ccdec25e49b3e1b3fe2cdadc"
ZONE_ID="c0f5a80448c608af35d39aa820a5f3af" # htsn.io zone
PUBLIC_IP="70.237.94.174" # Update if IP changes: curl -s ifconfig.me

if [ -z "$SUBDOMAIN" ]; then
  echo "Usage: $0 <subdomain>"
  echo "Example: $0 myservice  # Creates myservice.htsn.io"
  exit 1
fi

curl -X POST "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records" \
  -H "X-Auth-Email: $CF_EMAIL" \
  -H "X-Auth-Key: $CF_API_KEY" \
  -H "Content-Type: application/json" \
  --data "{
    \"type\":\"A\",
    \"name\":\"$SUBDOMAIN\",
    \"content\":\"$PUBLIC_IP\",
    \"ttl\":1,
    \"proxied\":true
  }" | jq .
```
**Usage:**
```bash
chmod +x ~/bin/add-cloudflare-dns.sh
~/bin/add-cloudflare-dns.sh excalidraw # Creates excalidraw.htsn.io
```
#### Step 3. Testing
```bash
# Check if DNS resolves
dig myservice.htsn.io
# Test HTTP redirect
curl -I http://myservice.htsn.io
# Test HTTPS
curl -I https://myservice.htsn.io
# Check Traefik dashboard (if enabled)
# Access: http://10.10.10.250:8080/dashboard/
```
#### Step 4. Update Documentation
After deploying, update these files:
1. **IP-ASSIGNMENTS.md** - Add to Services & Reverse Proxy Mapping table
2. **CLAUDE.md** - Add to "Services using Traefik-Primary" list (line ~495)
### Quick Reference - One-Liner Commands
```bash
# === DEPLOY SERVICE (example: myservice on docker-host port 8080) ===
# 1. Create Traefik config
ssh pve 'pct exec 202 -- bash -c "cat > /etc/traefik/conf.d/myservice.yaml << EOF
http:
  routers:
    myservice-secure:
      entryPoints: [websecure]
      rule: Host(\\\`myservice.htsn.io\\\`)
      service: myservice
      tls: {certResolver: letsencrypt}
  services:
    myservice:
      loadBalancer:
        servers:
          - url: http://10.10.10.206:8080
EOF"'
# 2. Add Cloudflare DNS
curl -s -X POST "https://api.cloudflare.com/client/v4/zones/c0f5a80448c608af35d39aa820a5f3af/dns_records" \
-H "X-Auth-Email: cloudflare@htsn.io" \
-H "X-Auth-Key: 849ebefd163d2ccdec25e49b3e1b3fe2cdadc" \
-H "Content-Type: application/json" \
--data '{"type":"A","name":"myservice","content":"70.237.94.174","proxied":true}'
# 3. Test (wait a few seconds for DNS propagation)
curl -I https://myservice.htsn.io
```
### Traefik Troubleshooting
```bash
# View Traefik logs (CT 202)
ssh pve 'pct exec 202 -- tail -f /var/log/traefik/traefik.log'
# Check if config is valid
ssh pve 'pct exec 202 -- cat /etc/traefik/conf.d/myservice.yaml'
# List all dynamic configs
ssh pve 'pct exec 202 -- ls -la /etc/traefik/conf.d/'
# Check certificate
ssh pve 'pct exec 202 -- cat /etc/traefik/acme.json | jq'
# Restart Traefik (if needed)
ssh pve 'pct exec 202 -- systemctl restart traefik'
```
### Certificate Management
**Let's Encrypt certificates** are automatically managed by Traefik.
**Certificate storage:**
- Traefik-Primary: `/etc/traefik/acme.json` on CT 202
- Traefik-Saltbox: `/opt/traefik/acme.json` on VM 101
**Certificate renewal:**
- Automatic via HTTP-01 challenge
- Traefik checks every 24h
- Renews 30 days before expiry
**If certificates fail:**
```bash
# Check acme.json permissions (must be 600)
ssh pve 'pct exec 202 -- ls -la /etc/traefik/acme.json'
# Check Traefik can reach Let's Encrypt
ssh pve 'pct exec 202 -- curl -I https://acme-v02.api.letsencrypt.org/directory'
# Delete bad certificate (Traefik will re-request)
ssh pve 'pct exec 202 -- rm /etc/traefik/acme.json'
ssh pve 'pct exec 202 -- touch /etc/traefik/acme.json'
ssh pve 'pct exec 202 -- chmod 600 /etc/traefik/acme.json'
ssh pve 'pct exec 202 -- systemctl restart traefik'
```
### Docker Service with Traefik Labels (Alternative)
If deploying a service via Docker on `docker-host` (VM 206), you can use Traefik labels instead of config files:
```yaml
# docker-compose.yml
services:
  myservice:
    image: myimage:latest
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.myservice.rule=Host(`myservice.htsn.io`)"
      - "traefik.http.routers.myservice.entrypoints=websecure"
      - "traefik.http.routers.myservice.tls.certresolver=letsencrypt"
      - "traefik.http.services.myservice.loadbalancer.server.port=8080"
    networks:
      - traefik

networks:
  traefik:
    external: true
```
**Note**: This requires Traefik to have access to the Docker socket and to be on the same Docker network.
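If the `traefik` network doesn't exist yet on docker-host, create it once (sketch):
```bash
ssh hutson@10.10.10.206 'sudo docker network create traefik'
```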
## Cloudflare API Access
**Credentials** (stored in Saltbox config):
- Email: `cloudflare@htsn.io`
- API Key: `849ebefd163d2ccdec25e49b3e1b3fe2cdadc`
- Domain: `htsn.io`
**Retrieve from Saltbox:**
```bash
ssh pve 'qm guest exec 101 -- bash -c "cat /srv/git/saltbox/accounts.yml | grep -A2 cloudflare"'
```
**Cloudflare API Documentation:**
- API Docs: https://developers.cloudflare.com/api/
- DNS Records: https://developers.cloudflare.com/api/operations/dns-records-for-a-zone-create-dns-record
**Common API operations:**
```bash
# Set credentials
CF_EMAIL="cloudflare@htsn.io"
CF_API_KEY="849ebefd163d2ccdec25e49b3e1b3fe2cdadc"
ZONE_ID="c0f5a80448c608af35d39aa820a5f3af"
# List all DNS records
curl -X GET "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records" \
-H "X-Auth-Email: $CF_EMAIL" \
-H "X-Auth-Key: $CF_API_KEY" | jq
# Add A record
curl -X POST "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records" \
-H "X-Auth-Email: $CF_EMAIL" \
-H "X-Auth-Key: $CF_API_KEY" \
-H "Content-Type: application/json" \
--data '{"type":"A","name":"subdomain","content":"IP","proxied":true}'
# Delete record
curl -X DELETE "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records/$RECORD_ID" \
-H "X-Auth-Email: $CF_EMAIL" \
-H "X-Auth-Key: $CF_API_KEY"
```
## Git Repository
This documentation is stored at:
- **Gitea**: https://git.htsn.io/hutson/homelab-docs
- **Local**: `~/Projects/homelab`
- **Notes**: `~/Notes/05_Homelab` (symlink)
```bash
# Clone
git clone git@git.htsn.io:hutson/homelab-docs.git
# Push changes
cd ~/Projects/homelab
git add -A && git commit -m "Update docs" && git push
```
## Related Documentation
| File | Description |
|------|-------------|
| [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md) | EMC storage enclosure (SES commands, LCC troubleshooting, maintenance) |
| [HOMEASSISTANT.md](HOMEASSISTANT.md) | Home Assistant API access, automations, integrations |
| [NETWORK.md](NETWORK.md) | Network bridges, VLANs, which bridge to use for new VMs |
| [IP-ASSIGNMENTS.md](IP-ASSIGNMENTS.md) | Complete IP address assignments for all devices and services |
| [SYNCTHING.md](SYNCTHING.md) | Syncthing setup, API access, device list, troubleshooting |
| [SHELL-ALIASES.md](SHELL-ALIASES.md) | ZSH aliases for Claude Code (`chomelab`, `ctrading`, etc.) |
| [configs/](configs/) | Symlinks to shared shell configs |
---
## Backlog
Future improvements and maintenance tasks:
| Priority | Task | Notes |
|----------|------|-------|
| Medium | **Re-IP all devices** | Current IP scheme is inconsistent. Plan: VMs 10.10.10.100-199, LXCs 10.10.10.200-249, Services 10.10.10.250-254 |
| Low | Install SSH on HomeAssistant | Currently only accessible via QEMU agent |
| Low | Set up SSH key for router | Currently requires expect/password |
---
## Changelog
### 2024-12-20
**Git Repository Setup**
- Created homelab-docs repo on Gitea (git.htsn.io/hutson/homelab-docs)
- Set up SSH key authentication for git@git.htsn.io
- Created symlink from ~/Notes/05_Homelab → ~/Projects/homelab
- Added Gitea API token for future automation
**SSH Key Deployment - All Systems**
- Added SSH keys to ALL VMs and LXCs (13 total hosts now accessible via key)
- Updated `~/.ssh/config` with complete host aliases
- Fixed permissions: FindShyt LXC `.ssh` ownership, enabled PermitRootLogin on LXCs
- Hosts now accessible: pve, pve2, truenas, saltbox, lmdev1, docker-host, fs-dev, copyparty, gitea-vm, trading-vm, pihole, traefik, findshyt
**Documentation Updates**
- Rewrote SSH Access section with complete host table
- Added Password Auth section for router/Windows/HomeAssistant
- Added Backlog section with re-IP task
- Added Git Repository section with clone/push instructions
### 2024-12-19
**EMC Storage Enclosure - LCC B Failure**
- Diagnosed loud fan issue (speed code 5 → 4160 RPM)
- Root cause: Faulty LCC B controller causing false readings
- Resolution: Switched SAS cable to LCC A, fans now quiet (speed code 3 → 2670 RPM)
- Replacement ordered: EMC 303-108-000E ($14.95 eBay)
- Created [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md) with full documentation
**SSH Key Consolidation**
- Renamed `~/.ssh/ai_trading_ed25519` → `~/.ssh/homelab`
- Updated `~/.ssh/config` on MacBook with all homelab hosts
- SSH key auth now works for: pve, pve2, docker-host, fs-dev, copyparty, lmdev1, gitea-vm, trading-vm
- No more sshpass needed for PVE servers
**QEMU Guest Agent Deployment**
- Installed on: docker-host (206), fs-dev (105), copyparty (201)
- All PVE VMs now have agent except homeassistant (110)
- Can now use `qm guest exec` for remote commands
**VM Configuration Updates**
- docker-host: Fixed SSH key in cloud-init
- fs-dev: Fixed `.ssh` directory ownership (1000 → 1001)
- copyparty: Changed from DHCP to static IP (10.10.10.201)
**Documentation Updates**
- Updated CLAUDE.md SSH section (removed sshpass examples)
- Added QEMU Agent column to VM tables
- Added storage enclosure troubleshooting to runbooks