# Homelab Infrastructure

## Quick Reference - Common Tasks

| Task | Section | Quick Command |
|------|---------|---------------|
| **Add new public service** | [Reverse Proxy](#reverse-proxy-architecture-traefik) | Create Traefik config + Cloudflare DNS |
| **Add Cloudflare DNS** | [Cloudflare API](#cloudflare-api-access) | `curl -X POST cloudflare.com/...` |
| **Check server temps** | [Temperature Check](#server-temperature-check) | `ssh pve 'grep Tctl ...'` |
| **Syncthing issues** | [Troubleshooting](#troubleshooting-runbooks) | Check API connections |
| **SSL cert issues** | [Traefik DNS Challenge](#ssl-certificates) | Use `cloudflare` resolver |

**Key Credentials (see sections for full details):**

- Cloudflare: `cloudflare@htsn.io` / API Key in [Cloudflare API](#cloudflare-api-access)
- SSH Password: `GrilledCh33s3#`
- Traefik: CT 202 @ 10.10.10.250

---

## Role

You are the **Homelab Assistant** - a Claude Code session dedicated to managing and maintaining Hutson's home infrastructure.

Your responsibilities include:

- **Infrastructure Management**: Proxmox servers, VMs, containers, networking
- **File Sync**: Syncthing configuration across all devices (Mac Mini, MacBook, Windows PC, TrueNAS, Android)
- **Network Administration**: Router config, SSH access, Tailscale, device management
- **Power Optimization**: CPU governors, GPU power states, service tuning
- **Documentation**: Keep CLAUDE.md, SYNCTHING.md, and SHELL-ALIASES.md up to date
- **Automation**: Shell aliases, startup scripts, scheduled tasks

You have full access to all homelab devices via SSH and APIs. Use this context to help troubleshoot, configure, and optimize the infrastructure.

### Proactive Behaviors

When the user mentions issues or asks questions, proactively:

- **"sync not working"** → Check Syncthing status on ALL devices, identify which is offline
- **"device offline"** → Ping both local and Tailscale IPs, check if service is running
- **"slow"** → Check CPU usage, running processes, Syncthing rescan activity
- **"check status"** → Run full health check across all systems
- **"something's wrong"** → Run diagnostics on likely culprits based on context

### Quick Health Checks

Run these to get a quick overview of the homelab:

```bash
# === FULL HEALTH CHECK ===

# Syncthing connections (Mac Mini)
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" "http://127.0.0.1:8384/rest/system/connections" | python3 -c "import sys,json; d=json.load(sys.stdin)['connections']; [print(f\"{v.get('name',k[:7])}: {'UP' if v['connected'] else 'DOWN'}\") for k,v in d.items()]"

# Proxmox VMs
ssh pve 'qm list' 2>/dev/null || echo "PVE: unreachable"
ssh pve2 'qm list' 2>/dev/null || echo "PVE2: unreachable"

# Ping critical devices
ping -c 1 -W 1 10.10.10.200 >/dev/null && echo "TrueNAS: UP" || echo "TrueNAS: DOWN"
ping -c 1 -W 1 10.10.10.1 >/dev/null && echo "Router: UP" || echo "Router: DOWN"

# Check Windows PC Syncthing (often goes offline)
nc -zw1 10.10.10.150 22000 && echo "Windows Syncthing: UP" || echo "Windows Syncthing: DOWN"
```
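To drill into a specific folder rather than just device connections, the Syncthing REST API also exposes per-folder completion and error endpoints. A minimal sketch against the Mac Mini instance (`FOLDER` is a placeholder - take the folder ID from the Syncthing UI):

```bash
# Per-folder sync status (Mac Mini Syncthing)
API_KEY="oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5"
FOLDER="FOLDER"   # replace with the folder ID

# Overall completion percentage for the folder
curl -s -H "X-API-Key: $API_KEY" \
  "http://127.0.0.1:8384/rest/db/completion?folder=$FOLDER" | python3 -m json.tool

# Items the folder failed to sync (pull errors, conflicts, permissions)
curl -s -H "X-API-Key: $API_KEY" \
  "http://127.0.0.1:8384/rest/folder/errors?folder=$FOLDER" | python3 -m json.tool
```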
### Troubleshooting Runbooks

| Symptom | Check | Fix |
|---------|-------|-----|
| Device not syncing | `curl Syncthing API → connections` | Check if device online, restart Syncthing |
| Windows PC offline | `ping 10.10.10.150` then `nc -z 22000` | SSH in, `Start-ScheduledTask -TaskName "Syncthing"` |
| Phone not syncing | Phone Syncthing app in background? | User must open app, keep screen on |
| High CPU on TrueNAS | Syncthing rescan? KSM? | Check rescan intervals, disable KSM |
| VM won't start | Storage available? RAM free? | `ssh pve 'qm start VMID'`, check logs |
| Tailscale offline | `tailscale status` | `tailscale up` or restart service |
| Tailscale no subnet access | Check subnet routers | Verify pve or ucg-fiber advertising routes |
| Sync stuck at X% | Folder errors? Conflicts? | Check `rest/folder/errors?folder=NAME` |
| Server running hot | Check KSM, check CPU processes | Disable KSM, identify runaway process |
| Storage enclosure loud | Check fan speed via SES | See [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md) |
| Drives not detected | Check SAS link, LCC status | Switch LCC, rescan SCSI hosts |

### Server Temperature Check

```bash
# Check temps on both servers (Threadripper PRO max safe: 90°C Tctl)
ssh pve 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE Tctl: $(($(cat $f)/1000))°C"; fi; done'
ssh pve2 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE2 Tctl: $(($(cat $f)/1000))°C"; fi; done'
```

**Healthy temps**: 70-80°C under load. **Warning**: >85°C. **Throttle**: 90°C.

### Service Dependencies

```
TrueNAS (10.10.10.200)
├── Central Syncthing hub - if down, sync breaks between devices
├── NFS/SMB shares for VMs
└── Media storage for Plex

PiHole (CT 200)
└── DNS for entire network - if down, name resolution fails

Traefik (CT 202)
└── Reverse proxy - if down, external access to services fails

Router (10.10.10.1)
└── Everything - gateway for all traffic
```

### API Quick Reference

| Service | Device | Endpoint | Auth |
|---------|--------|----------|------|
| Syncthing | Mac Mini | `http://127.0.0.1:8384/rest/` | `X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5` |
| Syncthing | MacBook | `http://127.0.0.1:8384/rest/` (via SSH) | `X-API-Key: qYkNdVLwy9qZZZ6MqnJr7tHX7KKdxGMJ` |
| Syncthing | Phone | `https://10.10.10.54:8384/rest/` | `X-API-Key: Xxz3jDT4akUJe6psfwZsbZwG2LhfZuDM` |
| Proxmox | PVE | `https://10.10.10.120:8006/api2/json/` | SSH key auth |
| Proxmox | PVE2 | `https://10.10.10.102:8006/api2/json/` | SSH key auth |

### Common Maintenance Tasks

When user asks for maintenance or you notice issues:

1. **Check Syncthing sync status** - Any folders behind? Errors?
2. **Verify all devices connected** - Run connection check
3. **Check disk space** - `ssh pve 'df -h'`, `ssh pve2 'df -h'`
4. **Review ZFS pool health** - `ssh pve 'zpool status'`
5. **Check for stuck processes** - High CPU? Memory pressure?
6. **Verify backups** - Are critical folders syncing?
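A rough sweep covering most of these checks, as a sketch - it assumes the SSH aliases and the Mac Mini API key documented in this file and is meant to be adapted, not run blindly:

```bash
# Maintenance sweep (sketch)
API_KEY="oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5"

# 1-2. Syncthing: list any devices reporting "not connected"
#      (the local device itself may appear here as a false positive)
curl -s -H "X-API-Key: $API_KEY" "http://127.0.0.1:8384/rest/system/connections" \
  | python3 -c "import sys,json; [print(v.get('name', k[:7]), 'DOWN') for k, v in json.load(sys.stdin)['connections'].items() if not v['connected']]"

# 3. Disk space on both Proxmox hosts
for host in pve pve2; do echo "== $host =="; ssh "$host" 'df -h'; done

# 4. ZFS pool health (zpool status -x prints only pools with problems)
ssh pve 'zpool status -x'
ssh truenas 'zpool status -x vault'

# 5. Top CPU consumers on the primary host
ssh pve 'ps aux --sort=-%cpu | head -5'
```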
### Emergency Commands

```bash
# Restart VM on Proxmox
ssh pve 'qm stop VMID && qm start VMID'

# Check what's using CPU
ssh pve 'ps aux --sort=-%cpu | head -10'

# Check ZFS pool status (via QEMU agent)
ssh pve 'qm guest exec 100 -- bash -c "zpool status vault"'

# Check EMC enclosure fans
ssh pve 'qm guest exec 100 -- bash -c "sg_ses --index=coo,-1 --get=speed_code /dev/sg15"'

# Force Syncthing rescan
curl -X POST "http://127.0.0.1:8384/rest/db/scan?folder=FOLDER" -H "X-API-Key: API_KEY"

# Restart Syncthing on Windows (when stuck)
sshpass -p 'GrilledCh33s3#' ssh claude@10.10.10.150 'Stop-Process -Name syncthing -Force; Start-ScheduledTask -TaskName "Syncthing"'

# Get all device IPs from router
expect -c 'spawn ssh root@10.10.10.1 "cat /proc/net/arp"; expect "Password:"; send "GrilledCh33s3#\r"; expect eof'
```

## Overview

Two Proxmox servers running various VMs and containers for home infrastructure, media, development, and AI workloads.

## Servers

### PVE (10.10.10.120) - Primary

- **CPU**: AMD Ryzen Threadripper PRO 3975WX (32-core, 64 threads, 280W TDP)
- **RAM**: 128 GB
- **Storage**:
  - `nvme-mirror1`: 2x Sabrent Rocket Q NVMe (3.6TB usable)
  - `nvme-mirror2`: 2x Kingston SFYRD 2TB (1.8TB usable)
  - `rpool`: 2x Samsung 870 QVO 4TB SSD mirror (3.6TB usable)
- **GPUs**:
  - NVIDIA Quadro P2000 (75W TDP) - Plex transcoding
  - NVIDIA TITAN RTX (280W TDP) - AI workloads, passed to saltbox/lmdev1
- **Role**: Primary VM host, TrueNAS, media services

### PVE2 (10.10.10.102) - Secondary

- **CPU**: AMD Ryzen Threadripper PRO 3975WX (32-core, 64 threads, 280W TDP)
- **RAM**: 128 GB
- **Storage**:
  - `nvme-mirror3`: 2x NVMe mirror
  - `local-zfs2`: 2x WD Red 6TB HDD mirror
- **GPUs**:
  - NVIDIA RTX A6000 (300W TDP) - passed to trading-vm
- **Role**: Trading platform, development

## SSH Access

### SSH Key Authentication (All Hosts)

SSH keys are configured in `~/.ssh/config` on both Mac Mini and MacBook. Use the `~/.ssh/homelab` key.
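An entry for one of these aliases typically has the commented shape below (illustrative, not a copy of the actual file); `ssh -G` shows how an alias really resolves:

```bash
# Typical shape of an entry in ~/.ssh/config (illustrative - actual entries may vary):
#   Host pve
#       HostName 10.10.10.120
#       User root
#       IdentityFile ~/.ssh/homelab

# Show how an alias actually resolves
ssh -G pve | grep -iE '^(hostname|user|identityfile) '
```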
| Host Alias | IP | User | Type | Notes |
|------------|-----|------|------|-------|
| `pve` | 10.10.10.120 | root | Proxmox | Primary server |
| `pve2` | 10.10.10.102 | root | Proxmox | Secondary server |
| `truenas` | 10.10.10.200 | root | VM | NAS/storage |
| `saltbox` | 10.10.10.100 | hutson | VM | Media automation |
| `lmdev1` | 10.10.10.111 | hutson | VM | AI/LLM development |
| `docker-host` | 10.10.10.206 | hutson | VM | Docker services |
| `fs-dev` | 10.10.10.5 | hutson | VM | Development |
| `copyparty` | 10.10.10.201 | hutson | VM | File sharing |
| `gitea-vm` | 10.10.10.220 | hutson | VM | Git server |
| `trading-vm` | 10.10.10.221 | hutson | VM | AI trading platform |
| `pihole` | 10.10.10.10 | root | LXC | DNS/Ad blocking |
| `traefik` | 10.10.10.250 | root | LXC | Reverse proxy |
| `findshyt` | 10.10.10.8 | root | LXC | Custom app |

**Usage examples:**

```bash
ssh pve 'qm list'                  # List VMs
ssh truenas 'zpool status vault'   # Check ZFS pool
ssh saltbox 'docker ps'            # List containers
ssh pihole 'pihole status'         # Check Pi-hole
```

### Password Auth (Special Cases)

| Device | IP | User | Auth Method | Notes |
|--------|-----|------|-------------|-------|
| UniFi Router | 10.10.10.1 | root | expect (keyboard-interactive) | Gateway |
| Windows PC | 10.10.10.150 | claude | sshpass | PowerShell, use `;` not `&&` |
| HomeAssistant | 10.10.10.110 | - | QEMU agent only | No SSH server |

**Router access (requires expect):**

```bash
# Run command on router
expect -c 'spawn ssh root@10.10.10.1 "hostname"; expect "Password:"; send "GrilledCh33s3#\r"; expect eof'

# Get ARP table (all device IPs)
expect -c 'spawn ssh root@10.10.10.1 "cat /proc/net/arp"; expect "Password:"; send "GrilledCh33s3#\r"; expect eof'
```

**Windows PC access:**

```bash
sshpass -p 'GrilledCh33s3#' ssh claude@10.10.10.150 'Get-Process | Select -First 5'
```

**HomeAssistant (no SSH, use QEMU agent):**

```bash
ssh pve 'qm guest exec 110 -- bash -c "ha core info"'
```

## VMs and Containers

### PVE (10.10.10.120)

| VMID | Name | vCPUs | RAM | Purpose | GPU/Passthrough | QEMU Agent |
|------|------|-------|-----|---------|-----------------|------------|
| 100 | truenas | 8 | 32GB | NAS, storage | LSI SAS2308 HBA, Samsung NVMe | Yes |
| 101 | saltbox | 16 | 16GB | Media automation | TITAN RTX | Yes |
| 105 | fs-dev | 10 | 8GB | Development | - | Yes |
| 110 | homeassistant | 2 | 2GB | Home automation | - | No |
| 111 | lmdev1 | 8 | 32GB | AI/LLM development | TITAN RTX | Yes |
| 201 | copyparty | 2 | 2GB | File sharing | - | Yes |
| 206 | docker-host | 2 | 4GB | Docker services | - | Yes |
| 200 | pihole (CT) | - | - | DNS/Ad blocking | - | N/A |
| 202 | traefik (CT) | - | - | Reverse proxy | - | N/A |
| 205 | findshyt (CT) | - | - | Custom app | - | N/A |

### PVE2 (10.10.10.102)

| VMID | Name | vCPUs | RAM | Purpose | GPU/Passthrough | QEMU Agent |
|------|------|-------|-----|---------|-----------------|------------|
| 300 | gitea-vm | 2 | 4GB | Git server | - | Yes |
| 301 | trading-vm | 16 | 32GB | AI trading platform | RTX A6000 | Yes |

### QEMU Guest Agent

VMs with QEMU agent can be managed via `qm guest exec`:

```bash
# Execute command in VM
ssh pve 'qm guest exec 100 -- bash -c "zpool status vault"'

# Get VM IP addresses
ssh pve 'qm guest exec 100 -- bash -c "ip addr"'
```

Only VM 110 (homeassistant) lacks QEMU agent - use its web UI instead.
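`qm guest exec` wraps the command's output in JSON (`exitcode`, `out-data`, ...), so piping it through a small parser makes it readable. A sketch that parses locally with python3, assuming no extra tools on the Proxmox host:

```bash
# Extract stdout from the qm guest exec JSON envelope
ssh pve 'qm guest exec 100 -- bash -c "zpool status vault"' \
  | python3 -c "import sys,json; r=json.load(sys.stdin); print(r.get('out-data',''), end='')"
```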
## Power Management

### Estimated Power Draw

- **PVE**: 500-750W (CPU + TITAN RTX + P2000 + storage + HBAs)
- **PVE2**: 450-600W (CPU + RTX A6000 + storage)
- **Combined**: ~1000-1350W under load

### Optimizations Applied

1. **KSMD Disabled** (2024-12-17 updated)
   - Was consuming 44-57% CPU on PVE with negative profit
   - Caused CPU temp to rise from 74°C to 83°C
   - Savings: ~7-10W + significant temp reduction
   - Made permanent via:
     - systemd service: `/etc/systemd/system/disable-ksm.service`
     - **ksmtuned masked**: `systemctl mask ksmtuned` (prevents re-enabling)
   - **Note**: KSM can get re-enabled by Proxmox updates. If CPU is hot, check:

     ```bash
     cat /sys/kernel/mm/ksm/run   # Should be 0
     ps aux | grep ksmd           # Should show 0% CPU
     # If KSM is running (run=1), disable it:
     echo 0 > /sys/kernel/mm/ksm/run
     systemctl mask ksmtuned
     ```

2. **Syncthing Rescan Intervals** (2024-12-16)
   - Changed aggressive 60s rescans to 3600s for large folders
   - Affected: downloads (38GB), documents (11GB), desktop (7.2GB), movies, pictures, notes, config
   - Savings: ~60-80W (TrueNAS VM was at constant 86% CPU)

3. **CPU Governor Optimization** (2024-12-16)
   - PVE: `powersave` governor + `balance_power` EPP (amd-pstate-epp driver)
   - PVE2: `schedutil` governor (acpi-cpufreq driver)
   - Made permanent via systemd service: `/etc/systemd/system/cpu-powersave.service`
   - Savings: ~60-120W combined (CPUs now idle at 1.7-2.2GHz vs 4GHz)

4. **GPU Power States** (2024-12-16) - Verified optimal
   - RTX A6000: 11W idle (P8 state)
   - TITAN RTX: 2-3W idle (P8 state)
   - Quadro P2000: 25W (P0 - Plex keeps it active)

5. **ksmtuned Disabled** (2024-12-16)
   - KSM tuning daemon was still running after KSMD disabled
   - Stopped and disabled on both servers
   - Savings: ~2-5W

6. **HDD Spindown on PVE2** (2024-12-16)
   - local-zfs2 pool (2x WD Red 6TB) had only 768KB used but drives spinning 24/7
   - Set 30-minute spindown via `hdparm -S 241`
   - Persistent via udev rule: `/etc/udev/rules.d/69-hdd-spindown.rules`
   - Savings: ~10-16W when spun down

### Potential Optimizations

- [ ] PCIe ASPM power management
- [ ] NMI watchdog disable

## Memory Configuration

- Ballooning enabled on most VMs but not actively used
- No memory overcommit (98GB allocated on 128GB physical for PVE)
- KSMD was wasting CPU with no benefit (negative general_profit)

## Network

See [NETWORK.md](NETWORK.md) for full details.

### Network Ranges

| Network | Range | Purpose |
|---------|-------|---------|
| LAN | 10.10.10.0/24 | Primary network, all external access |
| Internal | 10.10.20.0/24 | Inter-VM only (storage, NFS/iSCSI) |

### PVE Bridges (10.10.10.120)

| Bridge | NIC | Speed | Purpose | Use For |
|--------|-----|-------|---------|---------|
| vmbr0 | enp1s0 | 1 Gb | Management | General VMs/CTs |
| vmbr1 | enp35s0f0 | 10 Gb | High-speed LXC | Bandwidth-heavy containers |
| vmbr2 | enp35s0f1 | 10 Gb | High-speed VM | TrueNAS, Saltbox, storage VMs |
| vmbr3 | (none) | Virtual | Internal only | NFS/iSCSI traffic, no internet |

### Quick Reference

```bash
# Add VM to standard network (1Gb)
qm set VMID --net0 virtio,bridge=vmbr0

# Add VM to high-speed network (10Gb)
qm set VMID --net0 virtio,bridge=vmbr2

# Add secondary NIC for internal storage network
qm set VMID --net1 virtio,bridge=vmbr3
```
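Before moving a guest to another bridge, it can help to confirm what it currently uses (VMID/CTID below are placeholders):

```bash
# Show which bridge an existing VM is attached to
ssh pve 'qm config VMID | grep ^net'

# Same for an LXC container
ssh pve 'pct config CTID | grep ^net'
```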
### MTU 9000 (Jumbo Frames)

Jumbo frames are enabled across the network for improved throughput on large transfers.

| Device | Interface | MTU | Persistent |
|--------|-----------|-----|------------|
| Mac Mini | en0 | 9000 | Yes (networksetup) |
| PVE | vmbr0, enp1s0 | 9000 | Yes (/etc/network/interfaces) |
| PVE2 | vmbr0, nic1 | 9000 | Yes (/etc/network/interfaces) |
| TrueNAS | enp6s18, enp6s19 | 9000 | Yes |
| UCG-Fiber | br0 | 9216 | Yes (default) |

**Verify MTU:**

```bash
# Mac Mini
ifconfig en0 | grep mtu

# PVE/PVE2
ssh pve 'ip link show vmbr0 | grep mtu'
ssh pve2 'ip link show vmbr0 | grep mtu'

# Test jumbo frames
ping -c 1 -D -s 8000 10.10.10.120   # 8000 + 8 byte header = 8008 bytes
```

**Important:** When setting MTU on Proxmox bridges, ensure BOTH the bridge (vmbr0) AND the underlying physical interface (enp1s0/nic1) have the same MTU, otherwise packets will be dropped.

### Tailscale VPN

Tailscale provides secure remote access to the homelab from anywhere.

**Subnet Routers (HA Failover)**

Two devices advertise the `10.10.10.0/24` subnet for redundancy:

| Device | Tailscale IP | Role | Notes |
|--------|--------------|------|-------|
| pve | 100.113.177.80 | Primary | Proxmox host |
| ucg-fiber | 100.94.246.32 | Failover | UniFi router (always on) |

If Proxmox goes down, Tailscale automatically fails over to the router (~10-30 sec).

**Router Tailscale Setup (UCG-Fiber)**

- Installed via: `curl -fsSL https://tailscale.com/install.sh | sh`
- Config: `tailscale up --advertise-routes=10.10.10.0/24 --accept-routes`
- Survives reboots (systemd service)
- Routes must be approved in [Tailscale Admin Console](https://login.tailscale.com/admin/machines)

**Tailscale IPs Quick Reference**

| Device | Tailscale IP | Local IP |
|--------|--------------|----------|
| Mac Mini | 100.108.89.58 | 10.10.10.125 |
| PVE | 100.113.177.80 | 10.10.10.120 |
| UCG-Fiber | 100.94.246.32 | 10.10.10.1 |
| TrueNAS | 100.100.94.71 | 10.10.10.200 |
| Pi-hole | 100.112.59.128 | 10.10.10.10 |

**Check Tailscale Status**

```bash
# From Mac Mini
/Applications/Tailscale.app/Contents/MacOS/Tailscale status

# From router
expect -c 'spawn ssh root@10.10.10.1 "tailscale status"; expect "Password:"; send "GrilledCh33s3#\r"; expect eof'
```

## Common Commands

```bash
# Check VM status
ssh pve 'qm list'
ssh pve2 'qm list'

# Check container status
ssh pve 'pct list'

# Monitor CPU/power
ssh pve 'top -bn1 | head -20'

# Check ZFS pools
ssh pve 'zpool status'

# Check GPU (if nvidia-smi installed in VM)
ssh pve 'lspci | grep -i nvidia'
```

## Remote Claude Code Sessions (Mac Mini)

### Overview

The Mac Mini (`hutson-mac-mini.local`) runs the Happy Coder daemon, enabling on-demand Claude Code sessions accessible from anywhere via the Happy Coder mobile app. Sessions are created when you need them - no persistent tmux sessions required.

### Architecture

```
Mac Mini (100.108.89.58 via Tailscale)
├── launchd (auto-starts on boot)
│   └── com.hutson.happy-daemon.plist (starts Happy daemon)
├── Happy Coder daemon (manages remote sessions)
└── Tailscale (secure remote access)
```

### How It Works

1. Happy daemon runs on Mac Mini (auto-starts on boot)
2. Open Happy Coder app on phone/tablet
3. Start a new Claude session from the app
4. Session runs in any working directory you choose
5. Session ends when you're done - no cleanup needed

### Quick Commands

```bash
# Check daemon status
happy daemon list

# Start a new session manually (from Mac Mini terminal)
cd ~/Projects/homelab && happy claude

# Check active sessions
happy daemon list
```

### Mobile Access Setup (One-time)
1. Download Happy Coder app:
   - iOS: https://apps.apple.com/us/app/happy-claude-code-client/id6748571505
   - Android: https://play.google.com/store/apps/details?id=com.ex3ndr.happy
2. On Mac Mini, run: `happy auth` and scan QR code with the app
3. Daemon auto-starts on boot via launchd

### Daemon Management

```bash
happy daemon start    # Start daemon
happy daemon stop     # Stop daemon
happy daemon status   # Check status
happy daemon list     # List active sessions
```

### Remote Access via SSH + Tailscale

From any device on Tailscale network:

```bash
# SSH to Mac Mini
ssh hutson@100.108.89.58

# Or via hostname
ssh hutson@mac-mini

# Start Claude in desired directory
cd ~/Projects/homelab && happy claude
```

### Files & Configuration

| File | Purpose |
|------|---------|
| `~/Library/LaunchAgents/com.hutson.happy-daemon.plist` | launchd auto-start Happy daemon |
| `~/.happy/` | Happy Coder config and logs |

### Troubleshooting

```bash
# Check if daemon is running
pgrep -f "happy.*daemon"

# Check launchd status
launchctl list | grep happy

# List active sessions
happy daemon list

# Restart daemon
happy daemon stop && happy daemon start

# If Tailscale is disconnected
/Applications/Tailscale.app/Contents/MacOS/Tailscale up
```

## Agent and Tool Guidelines

### Background Agents

- **Always spin up background agents when doing multiple independent tasks**
- Background agents allow parallel execution of tasks that don't depend on each other
- This improves efficiency and reduces total execution time
- Use background agents for tasks like running tests, builds, or searches simultaneously

### MCP Tools for Web Searches

#### ref.tools - Documentation Lookups

- **`mcp__Ref__ref_search_documentation`**: Search through documentation for specific topics
- **`mcp__Ref__ref_read_url`**: Read and parse content from documentation URLs

#### Exa MCP - General Web and Code Searches

- **`mcp__exa__web_search_exa`**: General web searches for current information
- **`mcp__exa__get_code_context_exa`**: Code-related searches and repository lookups

### MCP Tools Reference Table

| Tool Name | Provider | Purpose | Use Case |
|-----------|----------|---------|----------|
| `mcp__Ref__ref_search_documentation` | ref.tools | Search documentation | Finding specific topics in official docs |
| `mcp__Ref__ref_read_url` | ref.tools | Read documentation URLs | Parsing and extracting content from doc pages |
| `mcp__exa__web_search_exa` | Exa MCP | General web search | Current events, general information lookup |
| `mcp__exa__get_code_context_exa` | Exa MCP | Code-specific search | Finding code examples, repository searches |

## Reverse Proxy Architecture (Traefik)

### Overview

There are **TWO separate Traefik instances** handling different services:

| Instance | Location | IP | Purpose | Manages |
|----------|----------|-----|---------|---------|
| **Traefik-Primary** | CT 202 | **10.10.10.250** | General services | All non-Saltbox services |
| **Traefik-Saltbox** | VM 101 (Docker) | **10.10.10.100** | Saltbox services | Plex, *arr apps, media stack |

### ⚠️ CRITICAL RULE: Which Traefik to Use

**When adding ANY new service:**

- ✅ **Use Traefik-Primary (10.10.10.250)** - Unless service lives inside Saltbox VM
- ❌ **DO NOT touch Traefik-Saltbox** - It manages Saltbox services with their own certificates

**Why this matters:**

- Traefik-Saltbox has complex Saltbox-managed configs
- Messing with it breaks Plex, Sonarr, Radarr, and all media services
- Each Traefik has its own Let's Encrypt certificates
- Mixing them causes certificate conflicts
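If unsure whether a hostname is already handled by Traefik-Primary, listing its existing Host rules is a quick sanity check (sketch, using the conf.d path documented below):

```bash
# List hostnames already routed by Traefik-Primary (CT 202)
ssh pve 'pct exec 202 -- grep -rh "Host(" /etc/traefik/conf.d/' | sort -u
```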
### Traefik-Primary (CT 202) - For New Services

**Location**: `/etc/traefik/` on Container 202
**Config**: `/etc/traefik/traefik.yaml`
**Dynamic Configs**: `/etc/traefik/conf.d/*.yaml`

**Services using Traefik-Primary (10.10.10.250):**

- excalidraw.htsn.io → 10.10.10.206:8080 (docker-host)
- findshyt.htsn.io → 10.10.10.205 (CT 205)
- gitea (git.htsn.io) → 10.10.10.220:3000
- homeassistant → 10.10.10.110
- lmdev → 10.10.10.111
- pihole → 10.10.10.200
- truenas → 10.10.10.200
- proxmox → 10.10.10.120
- copyparty → 10.10.10.201
- aitrade → trading server
- pulse.htsn.io → 10.10.10.206:7655 (Pulse monitoring)

**Access Traefik config:**

```bash
# From Mac Mini:
ssh pve 'pct exec 202 -- cat /etc/traefik/traefik.yaml'
ssh pve 'pct exec 202 -- ls /etc/traefik/conf.d/'

# Edit a service config:
ssh pve 'pct exec 202 -- vi /etc/traefik/conf.d/myservice.yaml'
```

### Traefik-Saltbox (VM 101) - DO NOT MODIFY

**Location**: `/opt/traefik/` inside Saltbox VM
**Managed by**: Saltbox Ansible playbooks
**Mounts**: Docker bind mount from `/opt/traefik` → `/etc/traefik` in container

**Services using Traefik-Saltbox (10.10.10.100):**

- Plex (plex.htsn.io)
- Sonarr, Radarr, Lidarr
- SABnzbd, NZBGet, qBittorrent
- Overseerr, Tautulli, Organizr
- Jackett, NZBHydra2
- Authelia (SSO)
- All other Saltbox-managed containers

**View Saltbox Traefik (read-only):**

```bash
ssh pve 'qm guest exec 101 -- bash -c "docker exec traefik cat /etc/traefik/traefik.yml"'
```

### Adding a New Public Service - Complete Workflow

Follow these steps to deploy a new service and make it publicly accessible at `servicename.htsn.io`.

#### Step 0. Deploy Your Service

First, deploy your service on the appropriate host:

**Option A: Docker on docker-host (10.10.10.206)**

```bash
ssh hutson@10.10.10.206
sudo mkdir -p /opt/myservice
# Write the compose file with sudo (the directory is root-owned)
sudo tee /opt/myservice/docker-compose.yml > /dev/null << 'EOF'
version: "3.8"
services:
  myservice:
    image: myimage:latest
    ports:
      - "8080:80"
    restart: unless-stopped
EOF
cd /opt/myservice && sudo docker-compose up -d
```

**Option B: New LXC Container on PVE**

```bash
ssh pve 'pct create CTID local:vztmpl/ubuntu-22.04-standard_22.04-1_amd64.tar.zst \
  --hostname myservice --memory 2048 --cores 2 \
  --net0 name=eth0,bridge=vmbr0,ip=10.10.10.XXX/24,gw=10.10.10.1 \
  --rootfs local-zfs:8 --unprivileged 1 --start 1'
```

**Option C: New VM on PVE**

```bash
ssh pve 'qm create VMID --name myservice --memory 2048 --cores 2 \
  --net0 virtio,bridge=vmbr0 --scsihw virtio-scsi-pci'
```
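Before wiring up Traefik, it's worth confirming the service actually answers on its internal address. A quick check (the IP/port placeholders match the templates above):

```bash
# Does the service respond internally? (placeholder IP/port)
curl -sI http://10.10.10.XXX:PORT | head -1

# Or just confirm the port is open
nc -zw1 10.10.10.XXX PORT && echo "port open" || echo "port closed"
```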
#### Step 1. Create Traefik Config File

Use this template for new services on **Traefik-Primary (CT 202)**:

```yaml
# /etc/traefik/conf.d/myservice.yaml
http:
  routers:
    # HTTPS router
    myservice-secure:
      entryPoints:
        - websecure
      rule: "Host(`myservice.htsn.io`)"
      service: myservice
      tls:
        certResolver: cloudflare  # Use 'cloudflare' for proxied domains, 'letsencrypt' for DNS-only
      priority: 50

    # HTTP → HTTPS redirect
    myservice-redirect:
      entryPoints:
        - web
      rule: "Host(`myservice.htsn.io`)"
      middlewares:
        - myservice-https-redirect
      service: myservice
      priority: 50

  services:
    myservice:
      loadBalancer:
        servers:
          - url: "http://10.10.10.XXX:PORT"

  middlewares:
    myservice-https-redirect:
      redirectScheme:
        scheme: https
        permanent: true
```

### SSL Certificates

Traefik has **two certificate resolvers** configured:

| Resolver | Use When | Challenge Type | Notes |
|----------|----------|----------------|-------|
| `letsencrypt` | Cloudflare DNS-only (gray cloud) | HTTP-01 | Requires port 80 reachable |
| `cloudflare` | Cloudflare Proxied (orange cloud) | DNS-01 | Works with Cloudflare proxy |

**⚠️ Important:** If Cloudflare proxy is enabled (orange cloud), HTTP challenge fails because Cloudflare redirects HTTP→HTTPS. Use `cloudflare` resolver instead.

**Cloudflare API credentials** are configured in `/etc/systemd/system/traefik.service`:

```bash
Environment="CF_API_EMAIL=cloudflare@htsn.io"
Environment="CF_API_KEY=849ebefd163d2ccdec25e49b3e1b3fe2cdadc"
```

**Certificate storage:**

- HTTP challenge certs: `/etc/traefik/acme.json`
- DNS challenge certs: `/etc/traefik/acme-cf.json`

**Deploy the config:**

```bash
# Create file on CT 202 (paste the YAML from Step 1 between the EOF markers)
ssh pve 'pct exec 202 -- bash -c "cat > /etc/traefik/conf.d/myservice.yaml << '\''EOF'\''
EOF"'

# Traefik auto-reloads (watches conf.d directory)
# Check logs:
ssh pve 'pct exec 202 -- tail -f /var/log/traefik/traefik.log'
```

#### Step 2. Add Cloudflare DNS Entry

**Cloudflare Credentials:**

- Email: `cloudflare@htsn.io`
- API Key: `849ebefd163d2ccdec25e49b3e1b3fe2cdadc`

**Manual method (via Cloudflare Dashboard):**

1. Go to https://dash.cloudflare.com/
2. Select `htsn.io` domain
3. DNS → Add Record
4. Type: `A`, Name: `myservice`, IPv4: `70.237.94.174`, Proxied: ☑️

**Automated method (CLI script):**

Save this as `~/bin/add-cloudflare-dns.sh`:

```bash
#!/bin/bash
# Add DNS record to Cloudflare for htsn.io

SUBDOMAIN="$1"
CF_EMAIL="cloudflare@htsn.io"
CF_API_KEY="849ebefd163d2ccdec25e49b3e1b3fe2cdadc"
ZONE_ID="c0f5a80448c608af35d39aa820a5f3af"   # htsn.io zone
PUBLIC_IP="70.237.94.174"                    # Update if IP changes: curl -s ifconfig.me

if [ -z "$SUBDOMAIN" ]; then
  echo "Usage: $0 <subdomain>"
  echo "Example: $0 myservice   # Creates myservice.htsn.io"
  exit 1
fi

curl -X POST "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records" \
  -H "X-Auth-Email: $CF_EMAIL" \
  -H "X-Auth-Key: $CF_API_KEY" \
  -H "Content-Type: application/json" \
  --data "{
    \"type\":\"A\",
    \"name\":\"$SUBDOMAIN\",
    \"content\":\"$PUBLIC_IP\",
    \"ttl\":1,
    \"proxied\":true
  }" | jq .
```

**Usage:**

```bash
chmod +x ~/bin/add-cloudflare-dns.sh
~/bin/add-cloudflare-dns.sh excalidraw   # Creates excalidraw.htsn.io
```

#### Step 3. Testing

```bash
# Check if DNS resolves
dig myservice.htsn.io

# Test HTTP redirect
curl -I http://myservice.htsn.io

# Test HTTPS
curl -I https://myservice.htsn.io

# Check Traefik dashboard (if enabled)
# Access: http://10.10.10.250:8080/dashboard/
```
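To confirm which certificate is actually being served, a quick sketch - note that for proxied (orange-cloud) records the public endpoint presents Cloudflare's edge certificate, so check Traefik directly as well:

```bash
# Issuer/validity of the publicly served certificate
echo | openssl s_client -connect myservice.htsn.io:443 -servername myservice.htsn.io 2>/dev/null \
  | openssl x509 -noout -issuer -subject -dates

# Check Traefik's own certificate directly, bypassing Cloudflare
echo | openssl s_client -connect 10.10.10.250:443 -servername myservice.htsn.io 2>/dev/null \
  | openssl x509 -noout -issuer -subject -dates
```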
#### Step 4. Update Documentation

After deploying, update these files:

1. **IP-ASSIGNMENTS.md** - Add to Services & Reverse Proxy Mapping table
2. **CLAUDE.md** - Add to "Services using Traefik-Primary" list (line ~495)

### Quick Reference - One-Liner Commands

```bash
# === DEPLOY SERVICE (example: myservice on docker-host port 8080) ===

# 1. Create Traefik config
ssh pve 'pct exec 202 -- bash -c "cat > /etc/traefik/conf.d/myservice.yaml << EOF
http:
  routers:
    myservice-secure:
      entryPoints: [websecure]
      rule: Host(\\\`myservice.htsn.io\\\`)
      service: myservice
      tls: {certResolver: letsencrypt}
  services:
    myservice:
      loadBalancer:
        servers:
          - url: http://10.10.10.206:8080
EOF"'

# 2. Add Cloudflare DNS
curl -s -X POST "https://api.cloudflare.com/client/v4/zones/c0f5a80448c608af35d39aa820a5f3af/dns_records" \
  -H "X-Auth-Email: cloudflare@htsn.io" \
  -H "X-Auth-Key: 849ebefd163d2ccdec25e49b3e1b3fe2cdadc" \
  -H "Content-Type: application/json" \
  --data '{"type":"A","name":"myservice","content":"70.237.94.174","proxied":true}'

# 3. Test (wait a few seconds for DNS propagation)
curl -I https://myservice.htsn.io
```

### Traefik Troubleshooting

```bash
# View Traefik logs (CT 202)
ssh pve 'pct exec 202 -- tail -f /var/log/traefik/traefik.log'

# Check if config is valid
ssh pve 'pct exec 202 -- cat /etc/traefik/conf.d/myservice.yaml'

# List all dynamic configs
ssh pve 'pct exec 202 -- ls -la /etc/traefik/conf.d/'

# Check certificate
ssh pve 'pct exec 202 -- cat /etc/traefik/acme.json | jq'

# Restart Traefik (if needed)
ssh pve 'pct exec 202 -- systemctl restart traefik'
```

### Certificate Management

**Let's Encrypt certificates** are automatically managed by Traefik.

**Certificate storage:**

- Traefik-Primary: `/etc/traefik/acme.json` on CT 202
- Traefik-Saltbox: `/opt/traefik/acme.json` on VM 101

**Certificate renewal:**

- Automatic via HTTP-01 challenge
- Traefik checks every 24h
- Renews 30 days before expiry

**If certificates fail:**

```bash
# Check acme.json permissions (must be 600)
ssh pve 'pct exec 202 -- ls -la /etc/traefik/acme.json'

# Check Traefik can reach Let's Encrypt
ssh pve 'pct exec 202 -- curl -I https://acme-v02.api.letsencrypt.org/directory'

# Delete bad certificate (Traefik will re-request)
ssh pve 'pct exec 202 -- rm /etc/traefik/acme.json'
ssh pve 'pct exec 202 -- touch /etc/traefik/acme.json'
ssh pve 'pct exec 202 -- chmod 600 /etc/traefik/acme.json'
ssh pve 'pct exec 202 -- systemctl restart traefik'
```

### Docker Service with Traefik Labels (Alternative)

If deploying a service via Docker on `docker-host` (VM 206), you can use Traefik labels instead of config files:

```yaml
# docker-compose.yml
services:
  myservice:
    image: myimage:latest
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.myservice.rule=Host(`myservice.htsn.io`)"
      - "traefik.http.routers.myservice.entrypoints=websecure"
      - "traefik.http.routers.myservice.tls.certresolver=letsencrypt"
      - "traefik.http.services.myservice.loadbalancer.server.port=8080"
    networks:
      - traefik

networks:
  traefik:
    external: true
```

**Note**: This requires Traefik to have access to the Docker socket and be on the same network.
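If going this route, the external `traefik` network has to exist before `docker-compose up`, and the labels are only read by a Traefik instance running with the Docker provider on that host - presumably Traefik-Primary in CT 202 does not see them unless its Docker provider is pointed at docker-host. A sketch:

```bash
# Create the shared network once (errors harmlessly if it already exists)
docker network create traefik

# See which containers are currently attached to it
docker network inspect traefik --format '{{range .Containers}}{{.Name}} {{end}}'
```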
## Cloudflare API Access

**Credentials** (stored in Saltbox config):

- Email: `cloudflare@htsn.io`
- API Key: `849ebefd163d2ccdec25e49b3e1b3fe2cdadc`
- Domain: `htsn.io`

**Retrieve from Saltbox:**

```bash
ssh pve 'qm guest exec 101 -- bash -c "cat /srv/git/saltbox/accounts.yml | grep -A2 cloudflare"'
```

**Cloudflare API Documentation:**

- API Docs: https://developers.cloudflare.com/api/
- DNS Records: https://developers.cloudflare.com/api/operations/dns-records-for-a-zone-create-dns-record

**Common API operations:**

```bash
# Set credentials
CF_EMAIL="cloudflare@htsn.io"
CF_API_KEY="849ebefd163d2ccdec25e49b3e1b3fe2cdadc"
ZONE_ID="c0f5a80448c608af35d39aa820a5f3af"

# List all DNS records
curl -X GET "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records" \
  -H "X-Auth-Email: $CF_EMAIL" \
  -H "X-Auth-Key: $CF_API_KEY" | jq

# Add A record
curl -X POST "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records" \
  -H "X-Auth-Email: $CF_EMAIL" \
  -H "X-Auth-Key: $CF_API_KEY" \
  -H "Content-Type: application/json" \
  --data '{"type":"A","name":"subdomain","content":"IP","proxied":true}'

# Delete record
curl -X DELETE "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records/$RECORD_ID" \
  -H "X-Auth-Email: $CF_EMAIL" \
  -H "X-Auth-Key: $CF_API_KEY"
```

## Git Repository

This documentation is stored at:

- **Gitea**: https://git.htsn.io/hutson/homelab-docs
- **Local**: `~/Projects/homelab`
- **Notes**: `~/Notes/05_Homelab` (symlink)

```bash
# Clone
git clone git@git.htsn.io:hutson/homelab-docs.git

# Push changes
cd ~/Projects/homelab
git add -A && git commit -m "Update docs" && git push
```

## Related Documentation

| File | Description |
|------|-------------|
| [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md) | EMC storage enclosure (SES commands, LCC troubleshooting, maintenance) |
| [HOMEASSISTANT.md](HOMEASSISTANT.md) | Home Assistant API access, automations, integrations |
| [NETWORK.md](NETWORK.md) | Network bridges, VLANs, which bridge to use for new VMs |
| [IP-ASSIGNMENTS.md](IP-ASSIGNMENTS.md) | Complete IP address assignments for all devices and services |
| [SYNCTHING.md](SYNCTHING.md) | Syncthing setup, API access, device list, troubleshooting |
| [SHELL-ALIASES.md](SHELL-ALIASES.md) | ZSH aliases for Claude Code (`chomelab`, `ctrading`, etc.) |
| [configs/](configs/) | Symlinks to shared shell configs |

---

## Backlog

Future improvements and maintenance tasks:

| Priority | Task | Notes |
|----------|------|-------|
| Medium | **Re-IP all devices** | Current IP scheme is inconsistent. Plan: VMs 10.10.10.100-199, LXCs 10.10.10.200-249, Services 10.10.10.250-254 |
| Low | Install SSH on HomeAssistant | Currently only accessible via QEMU agent |
| Low | Set up SSH key for router | Currently requires expect/password |

---

## Changelog

### 2024-12-20

**Git Repository Setup**

- Created homelab-docs repo on Gitea (git.htsn.io/hutson/homelab-docs)
- Set up SSH key authentication for git@git.htsn.io
- Created symlink from ~/Notes/05_Homelab → ~/Projects/homelab
- Added Gitea API token for future automation

**SSH Key Deployment - All Systems**

- Added SSH keys to ALL VMs and LXCs (13 total hosts now accessible via key)
- Updated `~/.ssh/config` with complete host aliases
- Fixed permissions: FindShyt LXC `.ssh` ownership, enabled PermitRootLogin on LXCs
- Hosts now accessible: pve, pve2, truenas, saltbox, lmdev1, docker-host, fs-dev, copyparty, gitea-vm, trading-vm, pihole, traefik, findshyt

**Documentation Updates**

- Rewrote SSH Access section with complete host table
- Added Password Auth section for router/Windows/HomeAssistant
- Added Backlog section with re-IP task
- Added Git Repository section with clone/push instructions

### 2024-12-19

**EMC Storage Enclosure - LCC B Failure**

- Diagnosed loud fan issue (speed code 5 → 4160 RPM)
- Root cause: Faulty LCC B controller causing false readings
- Resolution: Switched SAS cable to LCC A, fans now quiet (speed code 3 → 2670 RPM)
- Replacement ordered: EMC 303-108-000E ($14.95 eBay)
- Created [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md) with full documentation

**SSH Key Consolidation**

- Renamed `~/.ssh/ai_trading_ed25519` → `~/.ssh/homelab`
- Updated `~/.ssh/config` on MacBook with all homelab hosts
- SSH key auth now works for: pve, pve2, docker-host, fs-dev, copyparty, lmdev1, gitea-vm, trading-vm
- No more sshpass needed for PVE servers

**QEMU Guest Agent Deployment**

- Installed on: docker-host (206), fs-dev (105), copyparty (201)
- All PVE VMs now have agent except homeassistant (110)
- Can now use `qm guest exec` for remote commands

**VM Configuration Updates**

- docker-host: Fixed SSH key in cloud-init
- fs-dev: Fixed `.ssh` directory ownership (1000 → 1001)
- copyparty: Changed from DHCP to static IP (10.10.10.201)

**Documentation Updates**

- Updated CLAUDE.md SSH section (removed sshpass examples)
- Added QEMU Agent column to VM tables
- Added storage enclosure troubleshooting to runbooks