# Homelab Infrastructure
## Quick Reference - Common Tasks
| Task | Section | Quick Command |
|------|---------|---------------|
| **Add new public service** | [Reverse Proxy](#reverse-proxy-architecture-traefik) | Create Traefik config + Cloudflare DNS |
| **Add Cloudflare DNS** | [Cloudflare API](#cloudflare-api-access) | `curl -X POST cloudflare.com/...` |
| **Check server temps** | [Temperature Check](#server-temperature-check) | `ssh pve 'grep Tctl ...'` |
| **Syncthing issues** | [Troubleshooting](#troubleshooting-runbooks) | Check API connections |
| **SSL cert issues** | [Traefik DNS Challenge](#ssl-certificates) | Use `cloudflare` resolver |
**Key Credentials (see sections for full details):**
- Cloudflare: `cloudflare@htsn.io` / API Key in [Cloudflare API](#cloudflare-api-access)
- SSH Password: `GrilledCh33s3#`
- Traefik: CT 202 @ 10.10.10.250
---
## Role
You are the **Homelab Assistant** - a Claude Code session dedicated to managing and maintaining Hutson's home infrastructure. Your responsibilities include:
- **Infrastructure Management**: Proxmox servers, VMs, containers, networking
- **File Sync**: Syncthing configuration across all devices (Mac Mini, MacBook, Windows PC, TrueNAS, Android)
- **Network Administration**: Router config, SSH access, Tailscale, device management
- **Power Optimization**: CPU governors, GPU power states, service tuning
- **Documentation**: Keep CLAUDE.md, SYNCTHING.md, and SHELL-ALIASES.md up to date
- **Automation**: Shell aliases, startup scripts, scheduled tasks
You have full access to all homelab devices via SSH and APIs. Use this context to help troubleshoot, configure, and optimize the infrastructure.
### Proactive Behaviors
When the user mentions issues or asks questions, proactively:
- **"sync not working"** → Check Syncthing status on ALL devices, identify which is offline
- **"device offline"** → Ping both local and Tailscale IPs, check if service is running
- **"slow"** → Check CPU usage, running processes, Syncthing rescan activity
- **"check status"** → Run full health check across all systems
- **"something's wrong"** → Run diagnostics on likely culprits based on context
### Quick Health Checks
Run these to get a quick overview of the homelab:
```bash
# === FULL HEALTH CHECK ===
# Syncthing connections (Mac Mini)
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" "http://127.0.0.1:8384/rest/system/connections" | python3 -c "import sys,json; d=json.load(sys.stdin)['connections']; [print(f\"{v.get('name',k[:7])}: {'UP' if v['connected'] else 'DOWN'}\") for k,v in d.items()]"
# Proxmox VMs
ssh pve 'qm list' 2>/dev/null || echo "PVE: unreachable"
ssh pve2 'qm list' 2>/dev/null || echo "PVE2: unreachable"
# Ping critical devices
ping -c 1 -W 1 10.10.10.200 >/dev/null && echo "TrueNAS: UP" || echo "TrueNAS: DOWN"
ping -c 1 -W 1 10.10.10.1 >/dev/null && echo "Router: UP" || echo "Router: DOWN"
# Check Windows PC Syncthing (often goes offline)
nc -zw1 10.10.10.150 22000 && echo "Windows Syncthing: UP" || echo "Windows Syncthing: DOWN"
```
### Troubleshooting Runbooks
| Symptom | Check | Fix |
|---------|-------|-----|
| Device not syncing | `curl Syncthing API → connections` | Check if device online, restart Syncthing |
| Windows PC offline | `ping 10.10.10.150` then `nc -z 22000` | SSH in, `Start-ScheduledTask -TaskName "Syncthing"` |
| Phone not syncing | Phone Syncthing app in background? | User must open app, keep screen on |
| High CPU on TrueNAS | Syncthing rescan? KSM? | Check rescan intervals, disable KSM |
| VM won't start | Storage available? RAM free? | `ssh pve 'qm start VMID'`, check logs |
| Tailscale offline | `tailscale status` | `tailscale up` or restart service |
| Tailscale no subnet access | Check subnet routers | Verify pve or ucg-fiber advertising routes |
| Sync stuck at X% | Folder errors? Conflicts? | Check `rest/folder/errors?folder=NAME` (example below) |
| Server running hot | Check KSM, check CPU processes | Disable KSM, identify runaway process |
| Storage enclosure loud | Check fan speed via SES | See [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md) |
| Drives not detected | Check SAS link, LCC status | Switch LCC, rescan SCSI hosts |
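**Query folder errors (for the "Sync stuck" row):** a minimal sketch using the Mac Mini key from the API table; the folder ID `documents` is illustrative:
```bash
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
  "http://127.0.0.1:8384/rest/folder/errors?folder=documents" | python3 -m json.tool
```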
### Server Temperature Check
```bash
# Check temps on both servers (Threadripper PRO max safe: 90°C Tctl)
ssh pve 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE Tctl: $(($(cat $f)/1000))°C"; fi; done'
ssh pve2 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE2 Tctl: $(($(cat $f)/1000))°C"; fi; done'
```
**Healthy temps**: 70-80°C under load. **Warning**: >85°C. **Throttle**: 90°C.
### Service Dependencies
```
TrueNAS (10.10.10.200)
├── Central Syncthing hub - if down, sync breaks between devices
├── NFS/SMB shares for VMs
└── Media storage for Plex
PiHole (CT 200)
└── DNS for entire network - if down, name resolution fails
Traefik (CT 202)
└── Reverse proxy - if down, external access to services fails
Router (10.10.10.1)
└── Everything - gateway for all traffic
```
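A quick probe of these dependencies, in order of blast radius (sketch; IPs from the tables in this document):
```bash
ping -c 1 -W 1 10.10.10.1 >/dev/null && echo "Router: UP" || echo "Router: DOWN"
dig +short +time=2 @10.10.10.10 htsn.io >/dev/null && echo "PiHole DNS: UP" || echo "PiHole DNS: DOWN"
ping -c 1 -W 1 10.10.10.200 >/dev/null && echo "TrueNAS: UP" || echo "TrueNAS: DOWN"
nc -zw1 10.10.10.250 443 && echo "Traefik: UP" || echo "Traefik: DOWN"
```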
### API Quick Reference
| Service | Device | Endpoint | Auth |
|---------|--------|----------|------|
| Syncthing | Mac Mini | `http://127.0.0.1:8384/rest/` | `X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5` |
| Syncthing | MacBook | `http://127.0.0.1:8384/rest/` (via SSH) | `X-API-Key: qYkNdVLwy9qZZZ6MqnJr7tHX7KKdxGMJ` |
| Syncthing | Phone | `https://10.10.10.54:8384/rest/` | `X-API-Key: Xxz3jDT4akUJe6psfwZsbZwG2LhfZuDM` |
| Proxmox | PVE | `https://10.10.10.120:8006/api2/json/` | SSH key auth |
| Proxmox | PVE2 | `https://10.10.10.102:8006/api2/json/` | SSH key auth |
### Common Maintenance Tasks
When the user asks for maintenance or you notice issues, work through this list (a combined sweep sketch follows):
1. **Check Syncthing sync status** - Any folders behind? Errors?
2. **Verify all devices connected** - Run connection check
3. **Check disk space** - `ssh pve 'df -h'`, `ssh pve2 'df -h'`
4. **Review ZFS pool health** - `ssh pve 'zpool status'`
5. **Check for stuck processes** - High CPU? Memory pressure?
6. **Verify backups** - Are critical folders syncing?
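A minimal sweep combining these checks (sketch):
```bash
# === MAINTENANCE SWEEP ===
ssh pve 'df -h /; zpool status -x'       # disk space + pool health (-x prints only problems)
ssh pve2 'df -h /; zpool status -x'
ssh pve 'ps aux --sort=-%cpu | head -5'  # runaway processes
```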
### Emergency Commands
```bash
# Restart VM on Proxmox
ssh pve 'qm stop VMID && qm start VMID'
# Check what's using CPU
ssh pve 'ps aux --sort=-%cpu | head -10'
# Check ZFS pool status (via QEMU agent)
ssh pve 'qm guest exec 100 -- bash -c "zpool status vault"'
# Check EMC enclosure fans
ssh pve 'qm guest exec 100 -- bash -c "sg_ses --index=coo,-1 --get=speed_code /dev/sg15"'
# Force Syncthing rescan
curl -X POST "http://127.0.0.1:8384/rest/db/scan?folder=FOLDER" -H "X-API-Key: API_KEY"
# Restart Syncthing on Windows (when stuck)
sshpass -p 'GrilledCh33s3#' ssh claude@10.10.10.150 'Stop-Process -Name syncthing -Force; Start-ScheduledTask -TaskName "Syncthing"'
# Get all device IPs from router
expect -c 'spawn ssh root@10.10.10.1 "cat /proc/net/arp"; expect "Password:"; send "GrilledCh33s3#\r"; expect eof'
```
## Overview
Two Proxmox servers running various VMs and containers for home infrastructure, media, development, and AI workloads.
## Servers
### PVE (10.10.10.120) - Primary
- **CPU**: AMD Ryzen Threadripper PRO 3975WX (32-core, 64 threads, 280W TDP)
- **RAM**: 128 GB
- **Storage**:
  - `nvme-mirror1`: 2x Sabrent Rocket Q NVMe (3.6TB usable)
  - `nvme-mirror2`: 2x Kingston SFYRD 2TB (1.8TB usable)
  - `rpool`: 2x Samsung 870 QVO 4TB SSD mirror (3.6TB usable)
- **GPUs**:
  - NVIDIA Quadro P2000 (75W TDP) - Plex transcoding
  - NVIDIA TITAN RTX (280W TDP) - AI workloads, passed to saltbox/lmdev1
- **Role**: Primary VM host, TrueNAS, media services
### PVE2 (10.10.10.102) - Secondary
- **CPU**: AMD Ryzen Threadripper PRO 3975WX (32-core, 64 threads, 280W TDP)
- **RAM**: 128 GB
- **Storage**:
  - `nvme-mirror3`: 2x NVMe mirror
  - `local-zfs2`: 2x WD Red 6TB HDD mirror
- **GPUs**:
  - NVIDIA RTX A6000 (300W TDP) - passed to trading-vm
- **Role**: Trading platform, development
## SSH Access
### SSH Key Authentication (All Hosts)
SSH keys are configured in `~/.ssh/config` on both Mac Mini and MacBook. Use the `~/.ssh/homelab` key.
| Host Alias | IP | User | Type | Notes |
|------------|-----|------|------|-------|
| `pve` | 10.10.10.120 | root | Proxmox | Primary server |
| `pve2` | 10.10.10.102 | root | Proxmox | Secondary server |
| `truenas` | 10.10.10.200 | root | VM | NAS/storage |
| `saltbox` | 10.10.10.100 | hutson | VM | Media automation |
| `lmdev1` | 10.10.10.111 | hutson | VM | AI/LLM development |
| `docker-host` | 10.10.10.206 | hutson | VM | Docker services |
| `fs-dev` | 10.10.10.5 | hutson | VM | Development |
| `copyparty` | 10.10.10.201 | hutson | VM | File sharing |
| `gitea-vm` | 10.10.10.220 | hutson | VM | Git server |
| `trading-vm` | 10.10.10.221 | hutson | VM | AI trading platform |
| `pihole` | 10.10.10.10 | root | LXC | DNS/Ad blocking |
| `traefik` | 10.10.10.250 | root | LXC | Reverse proxy |
| `findshyt` | 10.10.10.8 | root | LXC | Custom app |
**Usage examples:**
```bash
ssh pve 'qm list' # List VMs
ssh truenas 'zpool status vault' # Check ZFS pool
ssh saltbox 'docker ps' # List containers
ssh pihole 'pihole status' # Check Pi-hole
```
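Each alias maps to a stanza like this in `~/.ssh/config` (illustrative sketch; the actual file may differ):
```
Host pve
    HostName 10.10.10.120
    User root
    IdentityFile ~/.ssh/homelab
```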
### Password Auth (Special Cases)
| Device | IP | User | Auth Method | Notes |
|--------|-----|------|-------------|-------|
| UniFi Router | 10.10.10.1 | root | expect (keyboard-interactive) | Gateway |
| Windows PC | 10.10.10.150 | claude | sshpass | PowerShell, use `;` not `&&` |
| HomeAssistant | 10.10.10.110 | - | QEMU agent only | No SSH server |
**Router access (requires expect):**
```bash
# Run command on router
expect -c 'spawn ssh root@10.10.10.1 "hostname"; expect "Password:"; send "GrilledCh33s3#\r"; expect eof'
# Get ARP table (all device IPs)
expect -c 'spawn ssh root@10.10.10.1 "cat /proc/net/arp"; expect "Password:"; send "GrilledCh33s3#\r"; expect eof'
```
**Windows PC access:**
```bash
sshpass -p 'GrilledCh33s3#' ssh claude@10.10.10.150 'Get-Process | Select -First 5'
```
**HomeAssistant (no SSH, use QEMU agent):**
```bash
ssh pve 'qm guest exec 110 -- bash -c "ha core info"'
```
## VMs and Containers
### PVE (10.10.10.120)
| VMID | Name | vCPUs | RAM | Purpose | GPU/Passthrough | QEMU Agent |
|------|------|-------|-----|---------|-----------------|------------|
| 100 | truenas | 8 | 32GB | NAS, storage | LSI SAS2308 HBA, Samsung NVMe | Yes |
| 101 | saltbox | 16 | 16GB | Media automation | TITAN RTX | Yes |
| 105 | fs-dev | 10 | 8GB | Development | - | Yes |
| 110 | homeassistant | 2 | 2GB | Home automation | - | No |
| 111 | lmdev1 | 8 | 32GB | AI/LLM development | TITAN RTX | Yes |
| 201 | copyparty | 2 | 2GB | File sharing | - | Yes |
| 206 | docker-host | 2 | 4GB | Docker services | - | Yes |
| 200 | pihole (CT) | - | - | DNS/Ad blocking | - | N/A |
| 202 | traefik (CT) | - | - | Reverse proxy | - | N/A |
| 205 | findshyt (CT) | - | - | Custom app | - | N/A |
### PVE2 (10.10.10.102)
| VMID | Name | vCPUs | RAM | Purpose | GPU/Passthrough | QEMU Agent |
|------|------|-------|-----|---------|-----------------|------------|
| 300 | gitea-vm | 2 | 4GB | Git server | - | Yes |
| 301 | trading-vm | 16 | 32GB | AI trading platform | RTX A6000 | Yes |
### QEMU Guest Agent
VMs with QEMU agent can be managed via `qm guest exec`:
```bash
# Execute command in VM
ssh pve 'qm guest exec 100 -- bash -c "zpool status vault"'
# Get VM IP addresses
ssh pve 'qm guest exec 100 -- bash -c "ip addr"'
```
Only VM 110 (homeassistant) lacks QEMU agent - use its web UI instead.
## Power Management
### Estimated Power Draw
- **PVE**: 500-750W (CPU + TITAN RTX + P2000 + storage + HBAs)
- **PVE2**: 450-600W (CPU + RTX A6000 + storage)
- **Combined**: ~1000-1350W under load
### Optimizations Applied
1. **KSMD Disabled** (updated 2024-12-17)
   - Was consuming 44-57% CPU on PVE with negative `general_profit` (no dedup benefit)
   - Caused CPU temp to rise from 74°C to 83°C
   - Savings: ~7-10W + significant temp reduction
   - Made permanent via:
     - systemd service: `/etc/systemd/system/disable-ksm.service`
     - **ksmtuned masked**: `systemctl mask ksmtuned` (prevents re-enabling)
   - **Note**: KSM can get re-enabled by Proxmox updates. If CPU is hot, check:
   ```bash
   cat /sys/kernel/mm/ksm/run   # Should be 0
   ps aux | grep ksmd           # Should show 0% CPU
   # If KSM is running (run=1), disable it:
   echo 0 > /sys/kernel/mm/ksm/run
   systemctl mask ksmtuned
   ```
2. **Syncthing Rescan Intervals** (2024-12-16)
   - Changed aggressive 60s rescans to 3600s for large folders
   - Affected: downloads (38GB), documents (11GB), desktop (7.2GB), movies, pictures, notes, config
   - Savings: ~60-80W (TrueNAS VM was at constant 86% CPU)
3. **CPU Governor Optimization** (2024-12-16)
   - PVE: `powersave` governor + `balance_power` EPP (amd-pstate-epp driver)
   - PVE2: `schedutil` governor (acpi-cpufreq driver)
   - Made permanent via systemd service: `/etc/systemd/system/cpu-powersave.service`
   - Savings: ~60-120W combined (CPUs now idle at 1.7-2.2GHz vs 4GHz)
4. **GPU Power States** (2024-12-16) - Verified optimal
   - RTX A6000: 11W idle (P8 state)
   - TITAN RTX: 2-3W idle (P8 state)
   - Quadro P2000: 25W (P0 - Plex keeps it active)
5. **ksmtuned Disabled** (2024-12-16)
   - KSM tuning daemon was still running after KSMD was disabled
   - Stopped and disabled on both servers
   - Savings: ~2-5W
6. **HDD Spindown on PVE2** (2024-12-16)
   - local-zfs2 pool (2x WD Red 6TB) had only 768KB used but drives were spinning 24/7
   - Set 30-minute spindown via `hdparm -S 241`
   - Persistent via udev rule: `/etc/udev/rules.d/69-hdd-spindown.rules` (sketch below)
   - Savings: ~10-16W when spun down
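The udev rule looks roughly like this (a sketch; the match conditions are an assumption, check the actual file on PVE2):
```bash
# /etc/udev/rules.d/69-hdd-spindown.rules (illustrative)
# Set 30-min spindown (-S 241) on rotational disks as they appear
ACTION=="add", SUBSYSTEM=="block", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="1", RUN+="/usr/sbin/hdparm -S 241 /dev/%k"
```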
### Potential Optimizations
- [ ] PCIe ASPM power management
- [ ] NMI watchdog disable
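Before enabling either, the current state can be inspected (sketch):
```bash
ssh pve 'cat /sys/module/pcie_aspm/parameters/policy'  # bracketed entry = active ASPM policy
ssh pve 'cat /proc/sys/kernel/nmi_watchdog'            # 1 = enabled, 0 = disabled
```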
## Memory Configuration
- Ballooning enabled on most VMs but not actively used
- No memory overcommit (98GB allocated on 128GB physical for PVE)
- KSMD was wasting CPU with no benefit (negative general_profit)
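To confirm a VM's ballooning setting (sketch; VMID 100 as an example):
```bash
ssh pve 'qm config 100 | grep -i balloon'  # balloon: <MB> target; 0 = disabled
```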
## Network
See [NETWORK.md](NETWORK.md) for full details.
### Network Ranges
| Network | Range | Purpose |
|---------|-------|---------|
| LAN | 10.10.10.0/24 | Primary network, all external access |
| Internal | 10.10.20.0/24 | Inter-VM only (storage, NFS/iSCSI) |
### PVE Bridges (10.10.10.120)
| Bridge | NIC | Speed | Purpose | Use For |
|--------|-----|-------|---------|---------|
| vmbr0 | enp1s0 | 1 Gb | Management | General VMs/CTs |
| vmbr1 | enp35s0f0 | 10 Gb | High-speed LXC | Bandwidth-heavy containers |
| vmbr2 | enp35s0f1 | 10 Gb | High-speed VM | TrueNAS, Saltbox, storage VMs |
| vmbr3 | (none) | Virtual | Internal only | NFS/iSCSI traffic, no internet |
### Quick Reference
```bash
# Add VM to standard network (1Gb)
qm set VMID --net0 virtio,bridge=vmbr0
# Add VM to high-speed network (10Gb)
qm set VMID --net0 virtio,bridge=vmbr2
# Add secondary NIC for internal storage network
qm set VMID --net1 virtio,bridge=vmbr3
```
### MTU 9000 (Jumbo Frames)
Jumbo frames are enabled across the network for improved throughput on large transfers.
| Device | Interface | MTU | Persistent |
|--------|-----------|-----|------------|
| Mac Mini | en0 | 9000 | Yes (networksetup) |
| PVE | vmbr0, enp1s0 | 9000 | Yes (/etc/network/interfaces) |
| PVE2 | vmbr0, nic1 | 9000 | Yes (/etc/network/interfaces) |
| TrueNAS | enp6s18, enp6s19 | 9000 | Yes |
| UCG-Fiber | br0 | 9216 | Yes (default) |
**Verify MTU:**
```bash
# Mac Mini
ifconfig en0 | grep mtu
# PVE/PVE2
ssh pve 'ip link show vmbr0 | grep mtu'
ssh pve2 'ip link show vmbr0 | grep mtu'
# Test jumbo frames
ping -c 1 -D -s 8000 10.10.10.120 # 8000 data + 8 ICMP header + 20 IP header = 8028 bytes; fails on a 1500 MTU path
```
**Important:** When setting MTU on Proxmox bridges, ensure BOTH the bridge (vmbr0) AND the underlying physical interface (enp1s0/nic1) have the same MTU, otherwise packets will be dropped.
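On PVE this means both stanzas in `/etc/network/interfaces` carry `mtu 9000`; a sketch of the relevant part (addresses from the tables above; the actual file may differ):
```bash
auto enp1s0
iface enp1s0 inet manual
    mtu 9000

auto vmbr0
iface vmbr0 inet static
    address 10.10.10.120/24
    gateway 10.10.10.1
    bridge-ports enp1s0
    mtu 9000
```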
### Tailscale VPN
Tailscale provides secure remote access to the homelab from anywhere.
**Subnet Routers (HA Failover)**
Two devices advertise the `10.10.10.0/24` subnet for redundancy:
| Device | Tailscale IP | Role | Notes |
|--------|--------------|------|-------|
| pve | 100.113.177.80 | Primary | Proxmox host |
| ucg-fiber | 100.94.246.32 | Failover | UniFi router (always on) |
If Proxmox goes down, Tailscale automatically fails over to the router (~10-30 sec).
**Router Tailscale Setup (UCG-Fiber)**
- Installed via: `curl -fsSL https://tailscale.com/install.sh | sh`
- Config: `tailscale up --advertise-routes=10.10.10.0/24 --accept-routes`
- Survives reboots (systemd service)
- Routes must be approved in [Tailscale Admin Console](https://login.tailscale.com/admin/machines)
**Tailscale IPs Quick Reference**
| Device | Tailscale IP | Local IP |
|--------|--------------|----------|
| Mac Mini | 100.108.89.58 | 10.10.10.125 |
| PVE | 100.113.177.80 | 10.10.10.120 |
| UCG-Fiber | 100.94.246.32 | 10.10.10.1 |
| TrueNAS | 100.100.94.71 | 10.10.10.200 |
| Pi-hole | 100.112.59.128 | 10.10.10.10 |
**Check Tailscale Status**
```bash
# From Mac Mini
/Applications/Tailscale.app/Contents/MacOS/Tailscale status
# From router
expect -c 'spawn ssh root@10.10.10.1 "tailscale status"; expect "Password:"; send "GrilledCh33s3#\r"; expect eof'
```
## Common Commands
```bash
# Check VM status
ssh pve 'qm list'
ssh pve2 'qm list'
# Check container status
ssh pve 'pct list'
# Monitor CPU/power
ssh pve 'top -bn1 | head -20'
# Check ZFS pools
ssh pve 'zpool status'
# Check GPU (if nvidia-smi installed in VM)
ssh pve 'lspci | grep -i nvidia'
```
## Remote Claude Code Sessions (Mac Mini)
### Overview
The Mac Mini (`hutson-mac-mini.local`) runs the Happy Coder daemon, enabling on-demand Claude Code sessions accessible from anywhere via the Happy Coder mobile app. Sessions are created when you need them - no persistent tmux sessions required.
### Architecture
```
Mac Mini (100.108.89.58 via Tailscale)
├── launchd (auto-starts on boot)
│ └── com.hutson.happy-daemon.plist (starts Happy daemon)
├── Happy Coder daemon (manages remote sessions)
└── Tailscale (secure remote access)
```
### How It Works
1. Happy daemon runs on Mac Mini (auto-starts on boot)
2. Open Happy Coder app on phone/tablet
3. Start a new Claude session from the app
4. Session runs in any working directory you choose
5. Session ends when you're done - no cleanup needed
### Quick Commands
```bash
# Check daemon status
happy daemon list
# Start a new session manually (from Mac Mini terminal)
cd ~/Projects/homelab && happy claude
# Check active sessions
happy daemon list
```
### Mobile Access Setup (One-time)
1. Download Happy Coder app:
   - iOS: https://apps.apple.com/us/app/happy-claude-code-client/id6748571505
   - Android: https://play.google.com/store/apps/details?id=com.ex3ndr.happy
2. On Mac Mini, run: `happy auth` and scan QR code with the app
3. Daemon auto-starts on boot via launchd
### Daemon Management
```bash
happy daemon start # Start daemon
happy daemon stop # Stop daemon
happy daemon status # Check status
happy daemon list # List active sessions
```
### Remote Access via SSH + Tailscale
From any device on Tailscale network:
```bash
# SSH to Mac Mini
ssh hutson@100.108.89.58
# Or via hostname
ssh hutson@mac-mini
# Start Claude in desired directory
cd ~/Projects/homelab && happy claude
```
### Files & Configuration
| File | Purpose |
|------|---------|
| `~/Library/LaunchAgents/com.hutson.happy-daemon.plist` | launchd auto-start Happy daemon |
| `~/.happy/` | Happy Coder config and logs |
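After editing the plist, reload it (sketch):
```bash
launchctl unload ~/Library/LaunchAgents/com.hutson.happy-daemon.plist
launchctl load ~/Library/LaunchAgents/com.hutson.happy-daemon.plist
```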
### Troubleshooting
```bash
# Check if daemon is running
pgrep -f "happy.*daemon"
# Check launchd status
launchctl list | grep happy
# List active sessions
happy daemon list
# Restart daemon
happy daemon stop && happy daemon start
# If Tailscale is disconnected
/Applications/Tailscale.app/Contents/MacOS/Tailscale up
```
## Agent and Tool Guidelines
### Background Agents
- **Always spin up background agents when doing multiple independent tasks**
- Background agents allow parallel execution of tasks that don't depend on each other
- This improves efficiency and reduces total execution time
- Use background agents for tasks like running tests, builds, or searches simultaneously
### MCP Tools for Web Searches
#### ref.tools - Documentation Lookups
- **`mcp__Ref__ref_search_documentation`**: Search through documentation for specific topics
- **`mcp__Ref__ref_read_url`**: Read and parse content from documentation URLs
#### Exa MCP - General Web and Code Searches
- **`mcp__exa__web_search_exa`**: General web searches for current information
- **`mcp__exa__get_code_context_exa`**: Code-related searches and repository lookups
### MCP Tools Reference Table
| Tool Name | Provider | Purpose | Use Case |
|-----------|----------|---------|----------|
| `mcp__Ref__ref_search_documentation` | ref.tools | Search documentation | Finding specific topics in official docs |
| `mcp__Ref__ref_read_url` | ref.tools | Read documentation URLs | Parsing and extracting content from doc pages |
| `mcp__exa__web_search_exa` | Exa MCP | General web search | Current events, general information lookup |
| `mcp__exa__get_code_context_exa` | Exa MCP | Code-specific search | Finding code examples, repository searches |
## Reverse Proxy Architecture (Traefik)
### Overview
There are **TWO separate Traefik instances** handling different services:
| Instance | Location | IP | Purpose | Manages |
|----------|----------|-----|---------|---------|
| **Traefik-Primary** | CT 202 | **10.10.10.250** | General services | All non-Saltbox services |
| **Traefik-Saltbox** | VM 101 (Docker) | **10.10.10.100** | Saltbox services | Plex, *arr apps, media stack |
### ⚠️ CRITICAL RULE: Which Traefik to Use
**When adding ANY new service:**
- ✅ **Use Traefik-Primary (10.10.10.250)** - Unless service lives inside Saltbox VM
- ❌ **DO NOT touch Traefik-Saltbox** - It manages Saltbox services with their own certificates
**Why this matters:**
- Traefik-Saltbox has complex Saltbox-managed configs
- Messing with it breaks Plex, Sonarr, Radarr, and all media services
- Each Traefik has its own Let's Encrypt certificates
- Mixing them causes certificate conflicts
### Traefik-Primary (CT 202) - For New Services
**Location**: `/etc/traefik/` on Container 202
**Config**: `/etc/traefik/traefik.yaml`
**Dynamic Configs**: `/etc/traefik/conf.d/*.yaml`
**Services using Traefik-Primary (10.10.10.250):**
- excalidraw.htsn.io → 10.10.10.206:8080 (docker-host)
- findshyt.htsn.io → 10.10.10.205 (CT 205)
- gitea (git.htsn.io) → 10.10.10.220:3000
- homeassistant → 10.10.10.110
- lmdev → 10.10.10.111
- pihole → 10.10.10.200
- truenas → 10.10.10.200
- proxmox → 10.10.10.120
- copyparty → 10.10.10.201
- aitrade → trading server
- pulse.htsn.io → 10.10.10.206:7655 (Pulse monitoring)
**Access Traefik config:**
```bash
# From Mac Mini:
ssh pve 'pct exec 202 -- cat /etc/traefik/traefik.yaml'
ssh pve 'pct exec 202 -- ls /etc/traefik/conf.d/'
# Edit a service config:
ssh pve 'pct exec 202 -- vi /etc/traefik/conf.d/myservice.yaml'
```
### Traefik-Saltbox (VM 101) - DO NOT MODIFY
**Location**: `/opt/traefik/` inside Saltbox VM
**Managed by**: Saltbox Ansible playbooks
**Mounts**: Docker bind mount from `/opt/traefik` → `/etc/traefik` in container
**Services using Traefik-Saltbox (10.10.10.100):**
- Plex (plex.htsn.io)
- Sonarr, Radarr, Lidarr
- SABnzbd, NZBGet, qBittorrent
- Overseerr, Tautulli, Organizr
- Jackett, NZBHydra2
- Authelia (SSO)
- All other Saltbox-managed containers
**View Saltbox Traefik (read-only):**
```bash
ssh pve 'qm guest exec 101 -- bash -c "docker exec traefik cat /etc/traefik/traefik.yml"'
```
### Adding a New Public Service - Complete Workflow
Follow these steps to deploy a new service and make it publicly accessible at `servicename.htsn.io`.
#### Step 0. Deploy Your Service
First, deploy your service on the appropriate host:
**Option A: Docker on docker-host (10.10.10.206)**
```bash
ssh hutson@10.10.10.206
sudo mkdir -p /opt/myservice
cat > /opt/myservice/docker-compose.yml << 'EOF'
version: "3.8"
services:
  myservice:
    image: myimage:latest
    ports:
      - "8080:80"
    restart: unless-stopped
EOF
cd /opt/myservice && sudo docker-compose up -d
```
**Option B: New LXC Container on PVE**
```bash
ssh pve 'pct create CTID local:vztmpl/ubuntu-22.04-standard_22.04-1_amd64.tar.zst \
--hostname myservice --memory 2048 --cores 2 \
--net0 name=eth0,bridge=vmbr0,ip=10.10.10.XXX/24,gw=10.10.10.1 \
--rootfs local-zfs:8 --unprivileged 1 --start 1'
```
**Option C: New VM on PVE**
```bash
ssh pve 'qm create VMID --name myservice --memory 2048 --cores 2 \
--net0 virtio,bridge=vmbr0 --scsihw virtio-scsi-pci'
```
#### Step 1. Create Traefik Config File
Use this template for new services on **Traefik-Primary (CT 202)**:
```yaml
# /etc/traefik/conf.d/myservice.yaml
http:
  routers:
    # HTTPS router
    myservice-secure:
      entryPoints:
        - websecure
      rule: "Host(`myservice.htsn.io`)"
      service: myservice
      tls:
        certResolver: cloudflare # Use 'cloudflare' for proxied domains, 'letsencrypt' for DNS-only
      priority: 50
    # HTTP → HTTPS redirect
    myservice-redirect:
      entryPoints:
        - web
      rule: "Host(`myservice.htsn.io`)"
      middlewares:
        - myservice-https-redirect
      service: myservice
      priority: 50
  services:
    myservice:
      loadBalancer:
        servers:
          - url: "http://10.10.10.XXX:PORT"
  middlewares:
    myservice-https-redirect:
      redirectScheme:
        scheme: https
        permanent: true
### SSL Certificates
Traefik has **two certificate resolvers** configured:
| Resolver | Use When | Challenge Type | Notes |
|----------|----------|----------------|-------|
| `letsencrypt` | Cloudflare DNS-only (gray cloud) | HTTP-01 | Requires port 80 reachable |
| `cloudflare` | Cloudflare Proxied (orange cloud) | DNS-01 | Works with Cloudflare proxy |
**⚠️ Important:** If Cloudflare proxy is enabled (orange cloud), HTTP challenge fails because Cloudflare redirects HTTP→HTTPS. Use `cloudflare` resolver instead.
**Cloudflare API credentials** are configured in `/etc/systemd/system/traefik.service`:
```bash
Environment="CF_API_EMAIL=cloudflare@htsn.io"
Environment="CF_API_KEY=849ebefd163d2ccdec25e49b3e1b3fe2cdadc"
```
**Certificate storage:**
- HTTP challenge certs: `/etc/traefik/acme.json`
- DNS challenge certs: `/etc/traefik/acme-cf.json`
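The resolver definitions in `/etc/traefik/traefik.yaml` look roughly like this (a sketch matching the table above; the actual file may differ):
```yaml
certificatesResolvers:
  letsencrypt:
    acme:
      email: cloudflare@htsn.io
      storage: /etc/traefik/acme.json
      httpChallenge:
        entryPoint: web
  cloudflare:
    acme:
      email: cloudflare@htsn.io
      storage: /etc/traefik/acme-cf.json
      dnsChallenge:
        provider: cloudflare
```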
**Deploy the config:**
```bash
# Create file on CT 202
ssh pve 'pct exec 202 -- bash -c "cat > /etc/traefik/conf.d/myservice.yaml << '\''EOF'\''
<paste config here>
EOF"'
# Traefik auto-reloads (watches conf.d directory)
# Check logs:
ssh pve 'pct exec 202 -- tail -f /var/log/traefik/traefik.log'
```
#### Step 2. Add Cloudflare DNS Entry
**Cloudflare Credentials:**
- Email: `cloudflare@htsn.io`
- API Key: `849ebefd163d2ccdec25e49b3e1b3fe2cdadc`
**Manual method (via Cloudflare Dashboard):**
1. Go to https://dash.cloudflare.com/
2. Select `htsn.io` domain
3. DNS → Add Record
4. Type: `A`, Name: `myservice`, IPv4: `70.237.94.174`, Proxied: ☑️
**Automated method (CLI script):**
Save this as `~/bin/add-cloudflare-dns.sh`:
```bash
#!/bin/bash
# Add DNS record to Cloudflare for htsn.io
SUBDOMAIN="$1"
CF_EMAIL="cloudflare@htsn.io"
CF_API_KEY="849ebefd163d2ccdec25e49b3e1b3fe2cdadc"
ZONE_ID="c0f5a80448c608af35d39aa820a5f3af" # htsn.io zone
PUBLIC_IP="70.237.94.174" # Update if IP changes: curl -s ifconfig.me

if [ -z "$SUBDOMAIN" ]; then
  echo "Usage: $0 <subdomain>"
  echo "Example: $0 myservice  # Creates myservice.htsn.io"
  exit 1
fi

curl -X POST "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records" \
  -H "X-Auth-Email: $CF_EMAIL" \
  -H "X-Auth-Key: $CF_API_KEY" \
  -H "Content-Type: application/json" \
  --data "{
    \"type\":\"A\",
    \"name\":\"$SUBDOMAIN\",
    \"content\":\"$PUBLIC_IP\",
    \"ttl\":1,
    \"proxied\":true
  }" | jq .
```
**Usage:**
```bash
chmod +x ~/bin/add-cloudflare-dns.sh
~/bin/add-cloudflare-dns.sh excalidraw # Creates excalidraw.htsn.io
```
#### Step 3. Testing
```bash
# Check if DNS resolves
dig myservice.htsn.io
# Test HTTP redirect
curl -I http://myservice.htsn.io
# Test HTTPS
curl -I https://myservice.htsn.io
# Check Traefik dashboard (if enabled)
# Access: http://10.10.10.250:8080/dashboard/
```
#### Step 4. Update Documentation
After deploying, update these files:
1. **IP-ASSIGNMENTS.md** - Add to Services & Reverse Proxy Mapping table
2. **CLAUDE.md** - Add to "Services using Traefik-Primary" list (line ~495)
### Quick Reference - One-Liner Commands
```bash
# === DEPLOY SERVICE (example: myservice on docker-host port 8080) ===
# 1. Create Traefik config
ssh pve 'pct exec 202 -- bash -c "cat > /etc/traefik/conf.d/myservice.yaml << EOF
http:
  routers:
    myservice-secure:
      entryPoints: [websecure]
      rule: Host(\\\`myservice.htsn.io\\\`)
      service: myservice
      tls: {certResolver: letsencrypt}
  services:
    myservice:
      loadBalancer:
        servers:
          - url: http://10.10.10.206:8080
EOF"'
# 2. Add Cloudflare DNS
curl -s -X POST "https://api.cloudflare.com/client/v4/zones/c0f5a80448c608af35d39aa820a5f3af/dns_records" \
-H "X-Auth-Email: cloudflare@htsn.io" \
-H "X-Auth-Key: 849ebefd163d2ccdec25e49b3e1b3fe2cdadc" \
-H "Content-Type: application/json" \
--data '{"type":"A","name":"myservice","content":"70.237.94.174","proxied":true}'
# 3. Test (wait a few seconds for DNS propagation)
curl -I https://myservice.htsn.io
```
### Traefik Troubleshooting
```bash
# View Traefik logs (CT 202)
ssh pve 'pct exec 202 -- tail -f /var/log/traefik/traefik.log'
# Check if config is valid
ssh pve 'pct exec 202 -- cat /etc/traefik/conf.d/myservice.yaml'
# List all dynamic configs
ssh pve 'pct exec 202 -- ls -la /etc/traefik/conf.d/'
# Check certificate
ssh pve 'pct exec 202 -- cat /etc/traefik/acme.json | jq'
# Restart Traefik (if needed)
ssh pve 'pct exec 202 -- systemctl restart traefik'
```
### Certificate Management
**Let's Encrypt certificates** are automatically managed by Traefik.
**Certificate storage:**
- Traefik-Primary: `/etc/traefik/acme.json` on CT 202
- Traefik-Saltbox: `/opt/traefik/acme.json` on VM 101
**Certificate renewal:**
- Automatic via HTTP-01 challenge
- Traefik checks every 24h
- Renews 30 days before expiry
**If certificates fail:**
```bash
# Check acme.json permissions (must be 600)
ssh pve 'pct exec 202 -- ls -la /etc/traefik/acme.json'
# Check Traefik can reach Let's Encrypt
ssh pve 'pct exec 202 -- curl -I https://acme-v02.api.letsencrypt.org/directory'
# Delete bad certificate (Traefik will re-request)
ssh pve 'pct exec 202 -- rm /etc/traefik/acme.json'
ssh pve 'pct exec 202 -- touch /etc/traefik/acme.json'
ssh pve 'pct exec 202 -- chmod 600 /etc/traefik/acme.json'
ssh pve 'pct exec 202 -- systemctl restart traefik'
```
### Docker Service with Traefik Labels (Alternative)
If deploying a service via Docker on `docker-host` (VM 206), you can use Traefik labels instead of config files:
```yaml
# docker-compose.yml
services:
  myservice:
    image: myimage:latest
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.myservice.rule=Host(`myservice.htsn.io`)"
      - "traefik.http.routers.myservice.entrypoints=websecure"
      - "traefik.http.routers.myservice.tls.certresolver=letsencrypt"
      - "traefik.http.services.myservice.loadbalancer.server.port=8080"
    networks:
      - traefik

networks:
  traefik:
    external: true
```
**Note**: This requires Traefik to have access to the Docker socket and to be on the same Docker network.
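If the `traefik` network doesn't exist yet on docker-host, create it once (sketch):
```bash
ssh hutson@10.10.10.206 'sudo docker network create traefik'
```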
## Cloudflare API Access
**Credentials** (stored in Saltbox config):
- Email: `cloudflare@htsn.io`
- API Key: `849ebefd163d2ccdec25e49b3e1b3fe2cdadc`
- Domain: `htsn.io`
**Retrieve from Saltbox:**
```bash
ssh pve 'qm guest exec 101 -- bash -c "cat /srv/git/saltbox/accounts.yml | grep -A2 cloudflare"'
```
**Cloudflare API Documentation:**
- API Docs: https://developers.cloudflare.com/api/
- DNS Records: https://developers.cloudflare.com/api/operations/dns-records-for-a-zone-create-dns-record
**Common API operations:**
```bash
# Set credentials
CF_EMAIL="cloudflare@htsn.io"
CF_API_KEY="849ebefd163d2ccdec25e49b3e1b3fe2cdadc"
ZONE_ID="c0f5a80448c608af35d39aa820a5f3af"
# List all DNS records
curl -X GET "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records" \
-H "X-Auth-Email: $CF_EMAIL" \
-H "X-Auth-Key: $CF_API_KEY" | jq
# Add A record
curl -X POST "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records" \
-H "X-Auth-Email: $CF_EMAIL" \
-H "X-Auth-Key: $CF_API_KEY" \
-H "Content-Type: application/json" \
--data '{"type":"A","name":"subdomain","content":"IP","proxied":true}'
# Delete record
curl -X DELETE "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records/$RECORD_ID" \
-H "X-Auth-Email: $CF_EMAIL" \
-H "X-Auth-Key: $CF_API_KEY"
```
## Git Repository
This documentation is stored at:
- **Gitea**: https://git.htsn.io/hutson/homelab-docs
- **Local**: `~/Projects/homelab`
- **Notes**: `~/Notes/05_Homelab` (symlink)
```bash
# Clone
git clone git@git.htsn.io:hutson/homelab-docs.git
# Push changes
cd ~/Projects/homelab
git add -A && git commit -m "Update docs" && git push
```
## Related Documentation
| File | Description |
|------|-------------|
| [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md) | EMC storage enclosure (SES commands, LCC troubleshooting, maintenance) |
| [HOMEASSISTANT.md](HOMEASSISTANT.md) | Home Assistant API access, automations, integrations |
| [NETWORK.md](NETWORK.md) | Network bridges, VLANs, which bridge to use for new VMs |
| [IP-ASSIGNMENTS.md](IP-ASSIGNMENTS.md) | Complete IP address assignments for all devices and services |
| [SYNCTHING.md](SYNCTHING.md) | Syncthing setup, API access, device list, troubleshooting |
| [SHELL-ALIASES.md](SHELL-ALIASES.md) | ZSH aliases for Claude Code (`chomelab`, `ctrading`, etc.) |
| [configs/](configs/) | Symlinks to shared shell configs |
---
## Backlog
Future improvements and maintenance tasks:
| Priority | Task | Notes |
|----------|------|-------|
| Medium | **Re-IP all devices** | Current IP scheme is inconsistent. Plan: VMs 10.10.10.100-199, LXCs 10.10.10.200-249, Services 10.10.10.250-254 |
| Low | Install SSH on HomeAssistant | Currently only accessible via QEMU agent |
| Low | Set up SSH key for router | Currently requires expect/password |
---
## Changelog
### 2024-12-20
**Git Repository Setup**
- Created homelab-docs repo on Gitea (git.htsn.io/hutson/homelab-docs)
- Set up SSH key authentication for git@git.htsn.io
- Created symlink from ~/Notes/05_Homelab → ~/Projects/homelab
- Added Gitea API token for future automation
**SSH Key Deployment - All Systems**
- Added SSH keys to ALL VMs and LXCs (13 total hosts now accessible via key)
- Updated `~/.ssh/config` with complete host aliases
- Fixed permissions: FindShyt LXC `.ssh` ownership, enabled PermitRootLogin on LXCs
- Hosts now accessible: pve, pve2, truenas, saltbox, lmdev1, docker-host, fs-dev, copyparty, gitea-vm, trading-vm, pihole, traefik, findshyt
**Documentation Updates**
- Rewrote SSH Access section with complete host table
- Added Password Auth section for router/Windows/HomeAssistant
- Added Backlog section with re-IP task
- Added Git Repository section with clone/push instructions
### 2024-12-19
**EMC Storage Enclosure - LCC B Failure**
- Diagnosed loud fan issue (speed code 5 → 4160 RPM)
- Root cause: Faulty LCC B controller causing false readings
- Resolution: Switched SAS cable to LCC A, fans now quiet (speed code 3 → 2670 RPM)
- Replacement ordered: EMC 303-108-000E ($14.95 eBay)
- Created [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md) with full documentation
**SSH Key Consolidation**
- Renamed `~/.ssh/ai_trading_ed25519` → `~/.ssh/homelab`
- Updated `~/.ssh/config` on MacBook with all homelab hosts
- SSH key auth now works for: pve, pve2, docker-host, fs-dev, copyparty, lmdev1, gitea-vm, trading-vm
- No more sshpass needed for PVE servers
**QEMU Guest Agent Deployment**
- Installed on: docker-host (206), fs-dev (105), copyparty (201)
- All PVE VMs now have agent except homeassistant (110)
- Can now use `qm guest exec` for remote commands
**VM Configuration Updates**
- docker-host: Fixed SSH key in cloud-init
- fs-dev: Fixed `.ssh` directory ownership (1000 → 1001)
- copyparty: Changed from DHCP to static IP (10.10.10.201)
**Documentation Updates**
- Updated CLAUDE.md SSH section (removed sshpass examples)
- Added QEMU Agent column to VM tables
- Added storage enclosure troubleshooting to runbooks