
Homelab Infrastructure

Quick Reference - Common Tasks

Task | Section | Quick Command
Add new public service | Reverse Proxy | Create Traefik config + Cloudflare DNS
Add Cloudflare DNS | Cloudflare API | curl -X POST cloudflare.com/...
Check server temps | Temperature Check | ssh pve 'grep Tctl ...'
Syncthing issues | Troubleshooting | Check API connections
SSL cert issues | Traefik DNS Challenge | Use cloudflare resolver

Key Credentials (see sections for full details):

  • Cloudflare: cloudflare@htsn.io / API Key in Cloudflare API
  • SSH Password: GrilledCh33s3#
  • Traefik: CT 202 @ 10.10.10.250

Role

You are the Homelab Assistant - a Claude Code session dedicated to managing and maintaining Hutson's home infrastructure. Your responsibilities include:

  • Infrastructure Management: Proxmox servers, VMs, containers, networking
  • File Sync: Syncthing configuration across all devices (Mac Mini, MacBook, Windows PC, TrueNAS, Android)
  • Network Administration: Router config, SSH access, Tailscale, device management
  • Power Optimization: CPU governors, GPU power states, service tuning
  • Documentation: Keep CLAUDE.md, SYNCTHING.md, and SHELL-ALIASES.md up to date
  • Automation: Shell aliases, startup scripts, scheduled tasks

You have full access to all homelab devices via SSH and APIs. Use this context to help troubleshoot, configure, and optimize the infrastructure.

Proactive Behaviors

When the user mentions issues or asks questions, proactively:

  • "sync not working" → Check Syncthing status on ALL devices, identify which is offline
  • "device offline" → Ping both local and Tailscale IPs, check if service is running
  • "slow" → Check CPU usage, running processes, Syncthing rescan activity
  • "check status" → Run full health check across all systems
  • "something's wrong" → Run diagnostics on likely culprits based on context

Quick Health Checks

Run these to get a quick overview of the homelab:

# === FULL HEALTH CHECK ===
# Syncthing connections (Mac Mini)
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" "http://127.0.0.1:8384/rest/system/connections" | python3 -c "import sys,json; d=json.load(sys.stdin)['connections']; [print(f\"{v.get('name',k[:7])}: {'UP' if v['connected'] else 'DOWN'}\") for k,v in d.items()]"

# Proxmox VMs
ssh pve 'qm list' 2>/dev/null || echo "PVE: unreachable"
ssh pve2 'qm list' 2>/dev/null || echo "PVE2: unreachable"

# Ping critical devices
ping -c 1 -W 1 10.10.10.200 >/dev/null && echo "TrueNAS: UP" || echo "TrueNAS: DOWN"
ping -c 1 -W 1 10.10.10.1 >/dev/null && echo "Router: UP" || echo "Router: DOWN"

# Check Windows PC Syncthing (often goes offline)
nc -zw1 10.10.10.150 22000 && echo "Windows Syncthing: UP" || echo "Windows Syncthing: DOWN"

Troubleshooting Runbooks

Symptom | Check | Fix
Device not syncing | curl Syncthing API → connections | Check if device online, restart Syncthing
Windows PC offline | ping 10.10.10.150, then nc -z 22000 | SSH in, Start-ScheduledTask -TaskName "Syncthing"
Phone not syncing | Phone Syncthing app in background? | User must open app, keep screen on
High CPU on TrueNAS | Syncthing rescan? KSM? | Check rescan intervals, disable KSM
VM won't start | Storage available? RAM free? | ssh pve 'qm start VMID', check logs
Tailscale offline | tailscale status | tailscale up or restart service
Tailscale no subnet access | Check subnet routers | Verify pve or ucg-fiber advertising routes
Sync stuck at X% | Folder errors? Conflicts? | Check rest/folder/errors?folder=NAME
Server running hot | Check KSM, check CPU processes | Disable KSM, identify runaway process
Storage enclosure loud | Check fan speed via SES | See EMC-ENCLOSURE.md
Drives not detected | Check SAS link, LCC status | Switch LCC, rescan SCSI hosts
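
For the "Sync stuck at X%" row, a concrete check against the Mac Mini instance might look like the following (FOLDER is a placeholder for the actual folder ID):

curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
  "http://127.0.0.1:8384/rest/folder/errors?folder=FOLDER" | python3 -m json.tool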

Server Temperature Check

# Check temps on both servers (Threadripper PRO max safe: 90°C Tctl)
ssh pve 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE Tctl: $(($(cat $f)/1000))°C"; fi; done'
ssh pve2 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE2 Tctl: $(($(cat $f)/1000))°C"; fi; done'

Healthy temps: 70-80°C under load. Warning: >85°C. Throttle: 90°C.
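
A small wrapper along these lines (a sketch, not a deployed script) prints a warning when either server crosses the 85°C threshold above:

# Warn if Tctl exceeds 85°C on either server
for host in pve pve2; do
  t=$(ssh "$host" 'for f in /sys/class/hwmon/hwmon*/temp*_input; do
        [ "$(cat ${f%_input}_label 2>/dev/null)" = "Tctl" ] && echo $(($(cat $f)/1000)); done' | head -1)
  if [ -z "$t" ]; then echo "$host: Tctl sensor not found"
  elif [ "$t" -gt 85 ]; then echo "WARNING: $host Tctl at ${t}°C"
  else echo "$host Tctl: ${t}°C"; fi
done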

Service Dependencies

TrueNAS (10.10.10.200)
├── Central Syncthing hub - if down, sync breaks between devices
├── NFS/SMB shares for VMs
└── Media storage for Plex

PiHole (CT 200)
└── DNS for entire network - if down, name resolution fails

Traefik (CT 202)
└── Reverse proxy - if down, external access to services fails

Router (10.10.10.1)
└── Everything - gateway for all traffic
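
A quick sweep of these dependencies could look like this (a hedged sketch; IPs taken from the tables in this document):

# DNS via Pi-hole
dig +short +time=2 htsn.io @10.10.10.10 >/dev/null && echo "Pi-hole DNS: UP" || echo "Pi-hole DNS: DOWN"
# Traefik-Primary answering HTTPS
curl -sk -o /dev/null --max-time 3 https://10.10.10.250 && echo "Traefik: UP" || echo "Traefik: DOWN"
# Gateway and TrueNAS
ping -c 1 -W 1 10.10.10.1 >/dev/null && echo "Router: UP" || echo "Router: DOWN"
ping -c 1 -W 1 10.10.10.200 >/dev/null && echo "TrueNAS: UP" || echo "TrueNAS: DOWN"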

API Quick Reference

Service | Device | Endpoint | Auth
Syncthing | Mac Mini | http://127.0.0.1:8384/rest/ | X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5
Syncthing | MacBook | http://127.0.0.1:8384/rest/ (via SSH) | X-API-Key: qYkNdVLwy9qZZZ6MqnJr7tHX7KKdxGMJ
Syncthing | Phone | https://10.10.10.54:8384/rest/ | X-API-Key: Xxz3jDT4akUJe6psfwZsbZwG2LhfZuDM
Proxmox | PVE | https://10.10.10.120:8006/api2/json/ | SSH key auth
Proxmox | PVE2 | https://10.10.10.102:8006/api2/json/ | SSH key auth

Common Maintenance Tasks

When user asks for maintenance or you notice issues:

  1. Check Syncthing sync status - Any folders behind? Errors?
  2. Verify all devices connected - Run connection check
  3. Check disk space - ssh pve 'df -h', ssh pve2 'df -h'
  4. Review ZFS pool health - ssh pve 'zpool status'
  5. Check for stuck processes - High CPU? Memory pressure?
  6. Verify backups - Are critical folders syncing?

Emergency Commands

# Restart VM on Proxmox
ssh pve 'qm stop VMID && qm start VMID'

# Check what's using CPU
ssh pve 'ps aux --sort=-%cpu | head -10'

# Check ZFS pool status (via QEMU agent)
ssh pve 'qm guest exec 100 -- bash -c "zpool status vault"'

# Check EMC enclosure fans
ssh pve 'qm guest exec 100 -- bash -c "sg_ses --index=coo,-1 --get=speed_code /dev/sg15"'

# Force Syncthing rescan
curl -X POST "http://127.0.0.1:8384/rest/db/scan?folder=FOLDER" -H "X-API-Key: API_KEY"

# Restart Syncthing on Windows (when stuck)
sshpass -p 'GrilledCh33s3#' ssh claude@10.10.10.150 'Stop-Process -Name syncthing -Force; Start-ScheduledTask -TaskName "Syncthing"'

# Get all device IPs from router
expect -c 'spawn ssh root@10.10.10.1 "cat /proc/net/arp"; expect "Password:"; send "GrilledCh33s3#\r"; expect eof'

Overview

Two Proxmox servers running various VMs and containers for home infrastructure, media, development, and AI workloads.

Servers

PVE (10.10.10.120) - Primary

  • CPU: AMD Ryzen Threadripper PRO 3975WX (32-core, 64 threads, 280W TDP)
  • RAM: 128 GB
  • Storage:
    • nvme-mirror1: 2x Sabrent Rocket Q NVMe (3.6TB usable)
    • nvme-mirror2: 2x Kingston SFYRD 2TB (1.8TB usable)
    • rpool: 2x Samsung 870 QVO 4TB SSD mirror (3.6TB usable)
  • GPUs:
    • NVIDIA Quadro P2000 (75W TDP) - Plex transcoding
    • NVIDIA TITAN RTX (280W TDP) - AI workloads, passed to saltbox/lmdev1
  • Role: Primary VM host, TrueNAS, media services

PVE2 (10.10.10.102) - Secondary

  • CPU: AMD Ryzen Threadripper PRO 3975WX (32-core, 64 threads, 280W TDP)
  • RAM: 128 GB
  • Storage:
    • nvme-mirror3: 2x NVMe mirror
    • local-zfs2: 2x WD Red 6TB HDD mirror
  • GPUs:
    • NVIDIA RTX A6000 (300W TDP) - passed to trading-vm
  • Role: Trading platform, development

SSH Access

SSH Key Authentication (All Hosts)

SSH keys are configured in ~/.ssh/config on both Mac Mini and MacBook. Use the ~/.ssh/homelab key.

Host Alias | IP | User | Type | Notes
pve | 10.10.10.120 | root | Proxmox | Primary server
pve2 | 10.10.10.102 | root | Proxmox | Secondary server
truenas | 10.10.10.200 | root | VM | NAS/storage
saltbox | 10.10.10.100 | hutson | VM | Media automation
lmdev1 | 10.10.10.111 | hutson | VM | AI/LLM development
docker-host | 10.10.10.206 | hutson | VM | Docker services
fs-dev | 10.10.10.5 | hutson | VM | Development
copyparty | 10.10.10.201 | hutson | VM | File sharing
gitea-vm | 10.10.10.220 | hutson | VM | Git server
trading-vm | 10.10.10.221 | hutson | VM | AI trading platform
pihole | 10.10.10.10 | root | LXC | DNS/Ad blocking
traefik | 10.10.10.250 | root | LXC | Reverse proxy
findshyt | 10.10.10.8 | root | LXC | Custom app

Usage examples:

ssh pve 'qm list'                    # List VMs
ssh truenas 'zpool status vault'     # Check ZFS pool
ssh saltbox 'docker ps'              # List containers
ssh pihole 'pihole status'           # Check Pi-hole
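
To confirm every alias still works after key or config changes, a loop like this (assuming the ~/.ssh/config aliases above) is handy:

for h in pve pve2 truenas saltbox lmdev1 docker-host fs-dev copyparty gitea-vm trading-vm pihole traefik findshyt; do
  ssh -o ConnectTimeout=3 -o BatchMode=yes "$h" hostname >/dev/null 2>&1 && echo "$h: OK" || echo "$h: UNREACHABLE"
done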

Password Auth (Special Cases)

Device | IP | User | Auth Method | Notes
UniFi Router | 10.10.10.1 | root | expect (keyboard-interactive) | Gateway
Windows PC | 10.10.10.150 | claude | sshpass | PowerShell, use ; not &&
HomeAssistant | 10.10.10.110 | - | QEMU agent only | No SSH server

Router access (requires expect):

# Run command on router
expect -c 'spawn ssh root@10.10.10.1 "hostname"; expect "Password:"; send "GrilledCh33s3#\r"; expect eof'

# Get ARP table (all device IPs)
expect -c 'spawn ssh root@10.10.10.1 "cat /proc/net/arp"; expect "Password:"; send "GrilledCh33s3#\r"; expect eof'

Windows PC access:

sshpass -p 'GrilledCh33s3#' ssh claude@10.10.10.150 'Get-Process | Select -First 5'

HomeAssistant (no SSH, use QEMU agent):

ssh pve 'qm guest exec 110 -- bash -c "ha core info"'

VMs and Containers

PVE (10.10.10.120)

VMID | Name | vCPUs | RAM | Purpose | GPU/Passthrough | QEMU Agent
100 | truenas | 8 | 32GB | NAS, storage | LSI SAS2308 HBA, Samsung NVMe | Yes
101 | saltbox | 16 | 16GB | Media automation | TITAN RTX | Yes
105 | fs-dev | 10 | 8GB | Development | - | Yes
110 | homeassistant | 2 | 2GB | Home automation | - | No
111 | lmdev1 | 8 | 32GB | AI/LLM development | TITAN RTX | Yes
201 | copyparty | 2 | 2GB | File sharing | - | Yes
206 | docker-host | 2 | 4GB | Docker services | - | Yes
200 | pihole (CT) | - | - | DNS/Ad blocking | - | N/A
202 | traefik (CT) | - | - | Reverse proxy | - | N/A
205 | findshyt (CT) | - | - | Custom app | - | N/A

PVE2 (10.10.10.102)

VMID | Name | vCPUs | RAM | Purpose | GPU/Passthrough | QEMU Agent
300 | gitea-vm | 2 | 4GB | Git server | - | Yes
301 | trading-vm | 16 | 32GB | AI trading platform | RTX A6000 | Yes

QEMU Guest Agent

VMs with QEMU agent can be managed via qm guest exec:

# Execute command in VM
ssh pve 'qm guest exec 100 -- bash -c "zpool status vault"'

# Get VM IP addresses
ssh pve 'qm guest exec 100 -- bash -c "ip addr"'

Only VM 110 (homeassistant) lacks QEMU agent - use its web UI instead.
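
qm guest exec wraps the command's output in JSON; if jq is available locally, the stdout can be pulled out directly (field name per the Proxmox guest-exec output format):

ssh pve 'qm guest exec 100 -- bash -c "zpool status vault"' | jq -r '."out-data"'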

Power Management

Estimated Power Draw

  • PVE: 500-750W (CPU + TITAN RTX + P2000 + storage + HBAs)
  • PVE2: 450-600W (CPU + RTX A6000 + storage)
  • Combined: ~1000-1350W under load

Optimizations Applied

  1. KSMD Disabled (updated 2024-12-17)

    • Was consuming 44-57% CPU on PVE with a negative general_profit (no real memory savings)
    • Caused CPU temp to rise from 74°C to 83°C
    • Savings: ~7-10W plus a significant temperature reduction
    • Made permanent via:
      • systemd service: /etc/systemd/system/disable-ksm.service (see the unit sketch after this list)
      • ksmtuned masked: systemctl mask ksmtuned (prevents re-enabling)
    • Note: KSM can get re-enabled by Proxmox updates. If CPU is hot, check:
      cat /sys/kernel/mm/ksm/run  # Should be 0
      ps aux | grep ksmd          # Should show 0% CPU
      # If KSM is running (run=1), disable it:
      echo 0 > /sys/kernel/mm/ksm/run
      systemctl mask ksmtuned
      
  2. Syncthing Rescan Intervals (2024-12-16)

    • Changed aggressive 60s rescans to 3600s for large folders
    • Affected: downloads (38GB), documents (11GB), desktop (7.2GB), movies, pictures, notes, config
    • Savings: ~60-80W (TrueNAS VM was at constant 86% CPU)
  3. CPU Governor Optimization (2024-12-16)

    • PVE: powersave governor + balance_power EPP (amd-pstate-epp driver)
    • PVE2: schedutil governor (acpi-cpufreq driver)
    • Made permanent via systemd service: /etc/systemd/system/cpu-powersave.service
    • Savings: ~60-120W combined (CPUs now idle at 1.7-2.2GHz vs 4GHz)
  4. GPU Power States (2024-12-16) - Verified optimal

    • RTX A6000: 11W idle (P8 state)
    • TITAN RTX: 2-3W idle (P8 state)
    • Quadro P2000: 25W (P0 - Plex keeps it active)
  5. ksmtuned Disabled (2024-12-16)

    • The KSM tuning daemon was still running after KSMD was disabled
    • Stopped and disabled on both servers
    • Savings: ~2-5W
  6. HDD Spindown on PVE2 (2024-12-16)

    • local-zfs2 pool (2x WD Red 6TB) had only 768KB used but drives spinning 24/7
    • Set 30-minute spindown via hdparm -S 241
    • Persistent via udev rule: /etc/udev/rules.d/69-hdd-spindown.rules
    • Savings: ~10-16W when spun down
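
For reference, a minimal sketch of what the disable-ksm.service unit could look like (the actual file on PVE may differ):

# /etc/systemd/system/disable-ksm.service (illustrative sketch)
[Unit]
Description=Disable Kernel Samepage Merging

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'echo 0 > /sys/kernel/mm/ksm/run'
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target

Enable it with systemctl daemon-reload && systemctl enable --now disable-ksm.service.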

Potential Optimizations

  • PCIe ASPM power management
  • NMI watchdog disable

Memory Configuration

  • Ballooning enabled on most VMs but not actively used
  • No memory overcommit (98GB allocated on 128GB physical for PVE)
  • KSMD was wasting CPU with no benefit (negative general_profit)

Network

See NETWORK.md for full details.

Network Ranges

Network | Range | Purpose
LAN | 10.10.10.0/24 | Primary network, all external access
Internal | 10.10.20.0/24 | Inter-VM only (storage, NFS/iSCSI)

PVE Bridges (10.10.10.120)

Bridge | NIC | Speed | Purpose | Use For
vmbr0 | enp1s0 | 1 Gb | Management | General VMs/CTs
vmbr1 | enp35s0f0 | 10 Gb | High-speed LXC | Bandwidth-heavy containers
vmbr2 | enp35s0f1 | 10 Gb | High-speed VM | TrueNAS, Saltbox, storage VMs
vmbr3 | (none) | Virtual | Internal only | NFS/iSCSI traffic, no internet

Quick Reference

# Add VM to standard network (1Gb)
qm set VMID --net0 virtio,bridge=vmbr0

# Add VM to high-speed network (10Gb)
qm set VMID --net0 virtio,bridge=vmbr2

# Add secondary NIC for internal storage network
qm set VMID --net1 virtio,bridge=vmbr3

MTU 9000 (Jumbo Frames)

Jumbo frames are enabled across the network for improved throughput on large transfers.

Device | Interface | MTU | Persistent
Mac Mini | en0 | 9000 | Yes (networksetup)
PVE | vmbr0, enp1s0 | 9000 | Yes (/etc/network/interfaces)
PVE2 | vmbr0, nic1 | 9000 | Yes (/etc/network/interfaces)
TrueNAS | enp6s18, enp6s19 | 9000 | Yes
UCG-Fiber | br0 | 9216 | Yes (default)

Verify MTU:

# Mac Mini
ifconfig en0 | grep mtu

# PVE/PVE2
ssh pve 'ip link show vmbr0 | grep mtu'
ssh pve2 'ip link show vmbr0 | grep mtu'

# Test jumbo frames
ping -c 1 -D -s 8000 10.10.10.120  # -D sets don't-fragment; 8000 data + 8 ICMP + 20 IP header = 8028 bytes, well within MTU 9000

Important: When setting MTU on Proxmox bridges, ensure BOTH the bridge (vmbr0) AND the underlying physical interface (enp1s0/nic1) have the same MTU, otherwise packets will be dropped.
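
As an illustration of that rule, the PVE side of /etc/network/interfaces could look roughly like this (an illustrative excerpt using the documented addresses; the live file and bridge options may differ):

auto enp1s0
iface enp1s0 inet manual
    mtu 9000

auto vmbr0
iface vmbr0 inet static
    address 10.10.10.120/24
    gateway 10.10.10.1
    bridge-ports enp1s0
    bridge-stp off
    bridge-fd 0
    mtu 9000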

Tailscale VPN

Tailscale provides secure remote access to the homelab from anywhere.

Subnet Routers (HA Failover)

Two devices advertise the 10.10.10.0/24 subnet for redundancy:

Device | Tailscale IP | Role | Notes
pve | 100.113.177.80 | Primary | Proxmox host
ucg-fiber | 100.94.246.32 | Failover | UniFi router (always on)

If Proxmox goes down, Tailscale automatically fails over to the router (~10-30 sec).

Router Tailscale Setup (UCG-Fiber)

  • Installed via: curl -fsSL https://tailscale.com/install.sh | sh
  • Config: tailscale up --advertise-routes=10.10.10.0/24 --accept-routes
  • Survives reboots (systemd service)
  • Routes must be approved in Tailscale Admin Console

Tailscale IPs Quick Reference

Device | Tailscale IP | Local IP
Mac Mini | 100.108.89.58 | 10.10.10.125
PVE | 100.113.177.80 | 10.10.10.120
UCG-Fiber | 100.94.246.32 | 10.10.10.1
TrueNAS | 100.100.94.71 | 10.10.10.200
Pi-hole | 100.112.59.128 | 10.10.10.10

Check Tailscale Status

# From Mac Mini
/Applications/Tailscale.app/Contents/MacOS/Tailscale status

# From router
expect -c 'spawn ssh root@10.10.10.1 "tailscale status"; expect "Password:"; send "GrilledCh33s3#\r"; expect eof'

Common Commands

# Check VM status
ssh pve 'qm list'
ssh pve2 'qm list'

# Check container status
ssh pve 'pct list'

# Monitor CPU/power
ssh pve 'top -bn1 | head -20'

# Check ZFS pools
ssh pve 'zpool status'

# Check GPU (if nvidia-smi installed in VM)
ssh pve 'lspci | grep -i nvidia'

Remote Claude Code Sessions (Mac Mini)

Overview

The Mac Mini (hutson-mac-mini.local) runs the Happy Coder daemon, enabling on-demand Claude Code sessions accessible from anywhere via the Happy Coder mobile app. Sessions are created when you need them - no persistent tmux sessions required.

Architecture

Mac Mini (100.108.89.58 via Tailscale)
├── launchd (auto-starts on boot)
│   └── com.hutson.happy-daemon.plist (starts Happy daemon)
├── Happy Coder daemon (manages remote sessions)
└── Tailscale (secure remote access)

How It Works

  1. Happy daemon runs on Mac Mini (auto-starts on boot)
  2. Open Happy Coder app on phone/tablet
  3. Start a new Claude session from the app
  4. Session runs in any working directory you choose
  5. Session ends when you're done - no cleanup needed

Quick Commands

# Check daemon status
happy daemon list

# Start a new session manually (from Mac Mini terminal)
cd ~/Projects/homelab && happy claude

# Check active sessions
happy daemon list

Mobile Access Setup (One-time)

  1. Download Happy Coder app:
  2. On Mac Mini, run: happy auth and scan QR code with the app
  3. Daemon auto-starts on boot via launchd

Daemon Management

happy daemon start    # Start daemon
happy daemon stop     # Stop daemon
happy daemon status   # Check status
happy daemon list     # List active sessions

Remote Access via SSH + Tailscale

From any device on Tailscale network:

# SSH to Mac Mini
ssh hutson@100.108.89.58

# Or via hostname
ssh hutson@mac-mini

# Start Claude in desired directory
cd ~/Projects/homelab && happy claude

Files & Configuration

File | Purpose
~/Library/LaunchAgents/com.hutson.happy-daemon.plist | launchd auto-start of the Happy daemon
~/.happy/ | Happy Coder config and logs
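
For reference, the launchd plist could be as small as the sketch below; the happy binary path is a placeholder, and the real file may differ:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>com.hutson.happy-daemon</string>
  <key>ProgramArguments</key>
  <array>
    <!-- placeholder path; adjust to wherever the happy CLI is installed -->
    <string>/usr/local/bin/happy</string>
    <string>daemon</string>
    <string>start</string>
  </array>
  <key>RunAtLoad</key>
  <true/>
  <key>KeepAlive</key>
  <true/>
</dict>
</plist>

Load it once with launchctl load ~/Library/LaunchAgents/com.hutson.happy-daemon.plist.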

Troubleshooting

# Check if daemon is running
pgrep -f "happy.*daemon"

# Check launchd status
launchctl list | grep happy

# List active sessions
happy daemon list

# Restart daemon
happy daemon stop && happy daemon start

# If Tailscale is disconnected
/Applications/Tailscale.app/Contents/MacOS/Tailscale up

Agent and Tool Guidelines

Background Agents

  • Always spin up background agents when doing multiple independent tasks
  • Background agents allow parallel execution of tasks that don't depend on each other
  • This improves efficiency and reduces total execution time
  • Use background agents for tasks like running tests, builds, or searches simultaneously

MCP Tools for Web Searches

ref.tools - Documentation Lookups

  • mcp__Ref__ref_search_documentation: Search through documentation for specific topics
  • mcp__Ref__ref_read_url: Read and parse content from documentation URLs

Exa MCP - General Web and Code Searches

  • mcp__exa__web_search_exa: General web searches for current information
  • mcp__exa__get_code_context_exa: Code-related searches and repository lookups

MCP Tools Reference Table

Tool Name | Provider | Purpose | Use Case
mcp__Ref__ref_search_documentation | ref.tools | Search documentation | Finding specific topics in official docs
mcp__Ref__ref_read_url | ref.tools | Read documentation URLs | Parsing and extracting content from doc pages
mcp__exa__web_search_exa | Exa MCP | General web search | Current events, general information lookup
mcp__exa__get_code_context_exa | Exa MCP | Code-specific search | Finding code examples, repository searches

Reverse Proxy Architecture (Traefik)

Overview

There are TWO separate Traefik instances handling different services:

Instance | Location | IP | Purpose | Manages
Traefik-Primary | CT 202 | 10.10.10.250 | General services | All non-Saltbox services
Traefik-Saltbox | VM 101 (Docker) | 10.10.10.100 | Saltbox services | Plex, *arr apps, media stack

⚠️ CRITICAL RULE: Which Traefik to Use

When adding ANY new service:

  • Use Traefik-Primary (10.10.10.250) - Unless service lives inside Saltbox VM
  • DO NOT touch Traefik-Saltbox - It manages Saltbox services with their own certificates

Why this matters:

  • Traefik-Saltbox has complex Saltbox-managed configs
  • Messing with it breaks Plex, Sonarr, Radarr, and all media services
  • Each Traefik has its own Let's Encrypt certificates
  • Mixing them causes certificate conflicts

Traefik-Primary (CT 202) - For New Services

Location: /etc/traefik/ on Container 202
Config: /etc/traefik/traefik.yaml
Dynamic Configs: /etc/traefik/conf.d/*.yaml

Services using Traefik-Primary (10.10.10.250):

  • excalidraw.htsn.io → 10.10.10.206:8080 (docker-host)
  • findshyt.htsn.io → 10.10.10.205 (CT 205)
  • gitea (git.htsn.io) → 10.10.10.220:3000
  • homeassistant → 10.10.10.110
  • lmdev → 10.10.10.111
  • pihole → 10.10.10.200
  • truenas → 10.10.10.200
  • proxmox → 10.10.10.120
  • copyparty → 10.10.10.201
  • aitrade → trading server
  • pulse.htsn.io → 10.10.10.206:7655 (Pulse monitoring)

Access Traefik config:

# From Mac Mini:
ssh pve 'pct exec 202 -- cat /etc/traefik/traefik.yaml'
ssh pve 'pct exec 202 -- ls /etc/traefik/conf.d/'

# Edit a service config:
ssh pve 'pct exec 202 -- vi /etc/traefik/conf.d/myservice.yaml'

Traefik-Saltbox (VM 101) - DO NOT MODIFY

Location: /opt/traefik/ inside the Saltbox VM
Managed by: Saltbox Ansible playbooks
Mounts: Docker bind mount from /opt/traefik to /etc/traefik in the container

Services using Traefik-Saltbox (10.10.10.100):

  • Plex (plex.htsn.io)
  • Sonarr, Radarr, Lidarr
  • SABnzbd, NZBGet, qBittorrent
  • Overseerr, Tautulli, Organizr
  • Jackett, NZBHydra2
  • Authelia (SSO)
  • All other Saltbox-managed containers

View Saltbox Traefik (read-only):

ssh pve 'qm guest exec 101 -- bash -c "docker exec traefik cat /etc/traefik/traefik.yml"'

Adding a New Public Service - Complete Workflow

Follow these steps to deploy a new service and make it publicly accessible at servicename.htsn.io.

Step 0. Deploy Your Service

First, deploy your service on the appropriate host:

Option A: Docker on docker-host (10.10.10.206)

ssh hutson@10.10.10.206
sudo mkdir -p /opt/myservice
cat > /opt/myservice/docker-compose.yml << 'EOF'
version: "3.8"
services:
  myservice:
    image: myimage:latest
    ports:
      - "8080:80"
    restart: unless-stopped
EOF
cd /opt/myservice && sudo docker-compose up -d

Option B: New LXC Container on PVE

ssh pve 'pct create CTID local:vztmpl/ubuntu-22.04-standard_22.04-1_amd64.tar.zst \
  --hostname myservice --memory 2048 --cores 2 \
  --net0 name=eth0,bridge=vmbr0,ip=10.10.10.XXX/24,gw=10.10.10.1 \
  --rootfs local-zfs:8 --unprivileged 1 --start 1'

Option C: New VM on PVE

ssh pve 'qm create VMID --name myservice --memory 2048 --cores 2 \
  --net0 virtio,bridge=vmbr0 --scsihw virtio-scsi-pci'

Step 1. Create Traefik Config File

Use this template for new services on Traefik-Primary (CT 202):

# /etc/traefik/conf.d/myservice.yaml
http:
  routers:
    # HTTPS router
    myservice-secure:
      entryPoints:
        - websecure
      rule: "Host(`myservice.htsn.io`)"
      service: myservice
      tls:
        certResolver: cloudflare  # Use 'cloudflare' for proxied domains, 'letsencrypt' for DNS-only
      priority: 50

    # HTTP → HTTPS redirect
    myservice-redirect:
      entryPoints:
        - web
      rule: "Host(`myservice.htsn.io`)"
      middlewares:
        - myservice-https-redirect
      service: myservice
      priority: 50

  services:
    myservice:
      loadBalancer:
        servers:
          - url: "http://10.10.10.XXX:PORT"

  middlewares:
    myservice-https-redirect:
      redirectScheme:
        scheme: https
        permanent: true

SSL Certificates

Traefik has two certificate resolvers configured:

Resolver | Use When | Challenge Type | Notes
letsencrypt | Cloudflare DNS-only (gray cloud) | HTTP-01 | Requires port 80 reachable
cloudflare | Cloudflare Proxied (orange cloud) | DNS-01 | Works with Cloudflare proxy

⚠️ Important: If Cloudflare proxy is enabled (orange cloud), HTTP challenge fails because Cloudflare redirects HTTP→HTTPS. Use cloudflare resolver instead.

Cloudflare API credentials are configured in /etc/systemd/system/traefik.service:

Environment="CF_API_EMAIL=cloudflare@htsn.io"
Environment="CF_API_KEY=849ebefd163d2ccdec25e49b3e1b3fe2cdadc"

Certificate storage:

  • HTTP challenge certs: /etc/traefik/acme.json
  • DNS challenge certs: /etc/traefik/acme-cf.json
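
In the static config, the two resolvers map onto those storage files roughly like this (an illustrative excerpt; the real /etc/traefik/traefik.yaml on CT 202 may differ):

certificatesResolvers:
  letsencrypt:
    acme:
      email: cloudflare@htsn.io
      storage: /etc/traefik/acme.json
      httpChallenge:
        entryPoint: web
  cloudflare:
    acme:
      email: cloudflare@htsn.io
      storage: /etc/traefik/acme-cf.json
      dnsChallenge:
        provider: cloudflare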

Deploy the config:

# Create file on CT 202
ssh pve 'pct exec 202 -- bash -c "cat > /etc/traefik/conf.d/myservice.yaml << '\''EOF'\''
<paste config here>
EOF"'

# Traefik auto-reloads (watches conf.d directory)
# Check logs:
ssh pve 'pct exec 202 -- tail -f /var/log/traefik/traefik.log'

Step 2. Add Cloudflare DNS Entry

Cloudflare Credentials:

  • Email: cloudflare@htsn.io
  • API Key: 849ebefd163d2ccdec25e49b3e1b3fe2cdadc

Manual method (via Cloudflare Dashboard):

  1. Go to https://dash.cloudflare.com/
  2. Select htsn.io domain
  3. DNS → Add Record
  4. Type: A, Name: myservice, IPv4: 70.237.94.174, Proxied: ☑️

Automated method (CLI script):

Save this as ~/bin/add-cloudflare-dns.sh:

#!/bin/bash
# Add DNS record to Cloudflare for htsn.io

SUBDOMAIN="$1"
CF_EMAIL="cloudflare@htsn.io"
CF_API_KEY="849ebefd163d2ccdec25e49b3e1b3fe2cdadc"
ZONE_ID="c0f5a80448c608af35d39aa820a5f3af"  # htsn.io zone
PUBLIC_IP="70.237.94.174"  # Update if IP changes: curl -s ifconfig.me

if [ -z "$SUBDOMAIN" ]; then
  echo "Usage: $0 <subdomain>"
  echo "Example: $0 myservice  # Creates myservice.htsn.io"
  exit 1
fi

curl -X POST "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records" \
  -H "X-Auth-Email: $CF_EMAIL" \
  -H "X-Auth-Key: $CF_API_KEY" \
  -H "Content-Type: application/json" \
  --data "{
    \"type\":\"A\",
    \"name\":\"$SUBDOMAIN\",
    \"content\":\"$PUBLIC_IP\",
    \"ttl\":1,
    \"proxied\":true
  }" | jq .

Usage:

chmod +x ~/bin/add-cloudflare-dns.sh
~/bin/add-cloudflare-dns.sh excalidraw  # Creates excalidraw.htsn.io

Step 3. Testing

# Check if DNS resolves
dig myservice.htsn.io

# Test HTTP redirect
curl -I http://myservice.htsn.io

# Test HTTPS
curl -I https://myservice.htsn.io

# Check Traefik dashboard (if enabled)
# Access: http://10.10.10.250:8080/dashboard/

Step 4. Update Documentation

After deploying, update these files:

  1. IP-ASSIGNMENTS.md - Add to Services & Reverse Proxy Mapping table
  2. CLAUDE.md - Add to "Services using Traefik-Primary" list (line ~495)

Quick Reference - One-Liner Commands

# === DEPLOY SERVICE (example: myservice on docker-host port 8080) ===

# 1. Create Traefik config
ssh pve 'pct exec 202 -- bash -c "cat > /etc/traefik/conf.d/myservice.yaml << EOF
http:
  routers:
    myservice-secure:
      entryPoints: [websecure]
      rule: Host(\\\`myservice.htsn.io\\\`)
      service: myservice
      tls: {certResolver: letsencrypt}
  services:
    myservice:
      loadBalancer:
        servers:
          - url: http://10.10.10.206:8080
EOF"'

# 2. Add Cloudflare DNS
curl -s -X POST "https://api.cloudflare.com/client/v4/zones/c0f5a80448c608af35d39aa820a5f3af/dns_records" \
  -H "X-Auth-Email: cloudflare@htsn.io" \
  -H "X-Auth-Key: 849ebefd163d2ccdec25e49b3e1b3fe2cdadc" \
  -H "Content-Type: application/json" \
  --data '{"type":"A","name":"myservice","content":"70.237.94.174","proxied":true}'

# 3. Test (wait a few seconds for DNS propagation)
curl -I https://myservice.htsn.io

Traefik Troubleshooting

# View Traefik logs (CT 202)
ssh pve 'pct exec 202 -- tail -f /var/log/traefik/traefik.log'

# Check if config is valid
ssh pve 'pct exec 202 -- cat /etc/traefik/conf.d/myservice.yaml'

# List all dynamic configs
ssh pve 'pct exec 202 -- ls -la /etc/traefik/conf.d/'

# Check certificate
ssh pve 'pct exec 202 -- cat /etc/traefik/acme.json | jq'

# Restart Traefik (if needed)
ssh pve 'pct exec 202 -- systemctl restart traefik'
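
To see which hostnames already have certificates, the acme files can be queried with jq (assuming the Traefik v2 acme.json layout, where the top-level key matches the resolver name):

ssh pve 'pct exec 202 -- cat /etc/traefik/acme.json' | jq -r '.letsencrypt.Certificates[].domain.main'
ssh pve 'pct exec 202 -- cat /etc/traefik/acme-cf.json' | jq -r '.cloudflare.Certificates[].domain.main'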

Certificate Management

Let's Encrypt certificates are automatically managed by Traefik.

Certificate storage:

  • Traefik-Primary: /etc/traefik/acme.json on CT 202
  • Traefik-Saltbox: /opt/traefik/acme.json on VM 101

Certificate renewal:

  • Automatic via HTTP-01 challenge
  • Traefik checks every 24h
  • Renews 30 days before expiry

If certificates fail:

# Check acme.json permissions (must be 600)
ssh pve 'pct exec 202 -- ls -la /etc/traefik/acme.json'

# Check Traefik can reach Let's Encrypt
ssh pve 'pct exec 202 -- curl -I https://acme-v02.api.letsencrypt.org/directory'

# Delete bad certificate (Traefik will re-request)
ssh pve 'pct exec 202 -- rm /etc/traefik/acme.json'
ssh pve 'pct exec 202 -- touch /etc/traefik/acme.json'
ssh pve 'pct exec 202 -- chmod 600 /etc/traefik/acme.json'
ssh pve 'pct exec 202 -- systemctl restart traefik'

Docker Service with Traefik Labels (Alternative)

If deploying a service via Docker on docker-host (VM 206), you can use Traefik labels instead of config files:

# docker-compose.yml
services:
  myservice:
    image: myimage:latest
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.myservice.rule=Host(`myservice.htsn.io`)"
      - "traefik.http.routers.myservice.entrypoints=websecure"
      - "traefik.http.routers.myservice.tls.certresolver=letsencrypt"
      - "traefik.http.services.myservice.loadbalancer.server.port=8080"
    networks:
      - traefik

networks:
  traefik:
    external: true

Note: This requires Traefik to have access to the Docker socket and to be on the same Docker network.

Cloudflare API Access

Credentials (stored in Saltbox config):

  • Email: cloudflare@htsn.io
  • API Key: 849ebefd163d2ccdec25e49b3e1b3fe2cdadc
  • Domain: htsn.io

Retrieve from Saltbox:

ssh pve 'qm guest exec 101 -- bash -c "cat /srv/git/saltbox/accounts.yml | grep -A2 cloudflare"'

Cloudflare API Documentation:

Common API operations:

# Set credentials
CF_EMAIL="cloudflare@htsn.io"
CF_API_KEY="849ebefd163d2ccdec25e49b3e1b3fe2cdadc"
ZONE_ID="c0f5a80448c608af35d39aa820a5f3af"

# List all DNS records
curl -X GET "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records" \
  -H "X-Auth-Email: $CF_EMAIL" \
  -H "X-Auth-Key: $CF_API_KEY" | jq

# Add A record
curl -X POST "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records" \
  -H "X-Auth-Email: $CF_EMAIL" \
  -H "X-Auth-Key: $CF_API_KEY" \
  -H "Content-Type: application/json" \
  --data '{"type":"A","name":"subdomain","content":"IP","proxied":true}'

# Delete record
curl -X DELETE "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records/$RECORD_ID" \
  -H "X-Auth-Email: $CF_EMAIL" \
  -H "X-Auth-Key: $CF_API_KEY"

Git Repository

This documentation is stored at:

# Clone
git clone git@git.htsn.io:hutson/homelab-docs.git

# Push changes
cd ~/Projects/homelab
git add -A && git commit -m "Update docs" && git push

Repository files:

File | Description
EMC-ENCLOSURE.md | EMC storage enclosure (SES commands, LCC troubleshooting, maintenance)
HOMEASSISTANT.md | Home Assistant API access, automations, integrations
NETWORK.md | Network bridges, VLANs, which bridge to use for new VMs
IP-ASSIGNMENTS.md | Complete IP address assignments for all devices and services
SYNCTHING.md | Syncthing setup, API access, device list, troubleshooting
SHELL-ALIASES.md | ZSH aliases for Claude Code (chomelab, ctrading, etc.)
configs/ | Symlinks to shared shell configs

Backlog

Future improvements and maintenance tasks:

Priority | Task | Notes
Medium | Re-IP all devices | Current IP scheme is inconsistent. Plan: VMs 10.10.10.100-199, LXCs 10.10.10.200-249, Services 10.10.10.250-254
Low | Install SSH on HomeAssistant | Currently only accessible via QEMU agent
Low | Set up SSH key for router | Currently requires expect/password

Changelog

2024-12-20

Git Repository Setup

  • Created homelab-docs repo on Gitea (git.htsn.io/hutson/homelab-docs)
  • Set up SSH key authentication for git@git.htsn.io
  • Created symlink from ~/Notes/05_Homelab → ~/Projects/homelab
  • Added Gitea API token for future automation

SSH Key Deployment - All Systems

  • Added SSH keys to ALL VMs and LXCs (13 total hosts now accessible via key)
  • Updated ~/.ssh/config with complete host aliases
  • Fixed permissions: FindShyt LXC .ssh ownership, enabled PermitRootLogin on LXCs
  • Hosts now accessible: pve, pve2, truenas, saltbox, lmdev1, docker-host, fs-dev, copyparty, gitea-vm, trading-vm, pihole, traefik, findshyt

Documentation Updates

  • Rewrote SSH Access section with complete host table
  • Added Password Auth section for router/Windows/HomeAssistant
  • Added Backlog section with re-IP task
  • Added Git Repository section with clone/push instructions

2024-12-19

EMC Storage Enclosure - LCC B Failure

  • Diagnosed loud fan issue (speed code 5 → 4160 RPM)
  • Root cause: Faulty LCC B controller causing false readings
  • Resolution: Switched SAS cable to LCC A, fans now quiet (speed code 3 → 2670 RPM)
  • Replacement ordered: EMC 303-108-000E ($14.95 eBay)
  • Created EMC-ENCLOSURE.md with full documentation

SSH Key Consolidation

  • Renamed ~/.ssh/ai_trading_ed25519 → ~/.ssh/homelab
  • Updated ~/.ssh/config on MacBook with all homelab hosts
  • SSH key auth now works for: pve, pve2, docker-host, fs-dev, copyparty, lmdev1, gitea-vm, trading-vm
  • No more sshpass needed for PVE servers

QEMU Guest Agent Deployment

  • Installed on: docker-host (206), fs-dev (105), copyparty (201)
  • All PVE VMs now have agent except homeassistant (110)
  • Can now use qm guest exec for remote commands

VM Configuration Updates

  • docker-host: Fixed SSH key in cloud-init
  • fs-dev: Fixed .ssh directory ownership (1000 → 1001)
  • copyparty: Changed from DHCP to static IP (10.10.10.201)

Documentation Updates

  • Updated CLAUDE.md SSH section (removed sshpass examples)
  • Added QEMU Agent column to VM tables
  • Added storage enclosure troubleshooting to runbooks