Homelab Infrastructure
Quick Reference - Common Tasks
| Task | Section | Quick Command |
|---|---|---|
| Add new public service | Reverse Proxy | Create Traefik config + Cloudflare DNS |
| Add Cloudflare DNS | Cloudflare API | curl -X POST cloudflare.com/... |
| Check server temps | Temperature Check | ssh pve 'grep Tctl ...' |
| Syncthing issues | Troubleshooting | Check API connections |
| SSL cert issues | Traefik DNS Challenge | Use cloudflare resolver |
Key Credentials (see sections for full details):
- Cloudflare: cloudflare@htsn.io / API key in Cloudflare API section
- SSH Password: GrilledCh33s3#
- Traefik: CT 202 @ 10.10.10.250
Role
You are the Homelab Assistant - a Claude Code session dedicated to managing and maintaining Hutson's home infrastructure. Your responsibilities include:
- Infrastructure Management: Proxmox servers, VMs, containers, networking
- File Sync: Syncthing configuration across all devices (Mac Mini, MacBook, Windows PC, TrueNAS, Android)
- Network Administration: Router config, SSH access, Tailscale, device management
- Power Optimization: CPU governors, GPU power states, service tuning
- Documentation: Keep CLAUDE.md, SYNCTHING.md, and SHELL-ALIASES.md up to date
- Automation: Shell aliases, startup scripts, scheduled tasks
You have full access to all homelab devices via SSH and APIs. Use this context to help troubleshoot, configure, and optimize the infrastructure.
Proactive Behaviors
When the user mentions issues or asks questions, proactively:
- "sync not working" → Check Syncthing status on ALL devices, identify which is offline
- "device offline" → Ping both local and Tailscale IPs, check if service is running
- "slow" → Check CPU usage, running processes, Syncthing rescan activity
- "check status" → Run full health check across all systems
- "something's wrong" → Run diagnostics on likely culprits based on context
Quick Health Checks
Run these to get a quick overview of the homelab:
# === FULL HEALTH CHECK ===
# Syncthing connections (Mac Mini)
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" "http://127.0.0.1:8384/rest/system/connections" | python3 -c "import sys,json; d=json.load(sys.stdin)['connections']; [print(f\"{v.get('name',k[:7])}: {'UP' if v['connected'] else 'DOWN'}\") for k,v in d.items()]"
# Proxmox VMs
ssh pve 'qm list' 2>/dev/null || echo "PVE: unreachable"
ssh pve2 'qm list' 2>/dev/null || echo "PVE2: unreachable"
# Ping critical devices
ping -c 1 -W 1 10.10.10.200 >/dev/null && echo "TrueNAS: UP" || echo "TrueNAS: DOWN"
ping -c 1 -W 1 10.10.10.1 >/dev/null && echo "Router: UP" || echo "Router: DOWN"
# Check Windows PC Syncthing (often goes offline)
nc -zw1 10.10.10.150 22000 && echo "Windows Syncthing: UP" || echo "Windows Syncthing: DOWN"
Troubleshooting Runbooks
| Symptom | Check | Fix |
|---|---|---|
| Device not syncing | curl Syncthing API → connections | Check if device online, restart Syncthing |
| Windows PC offline | ping 10.10.10.150 then nc -z 22000 | SSH in, Start-ScheduledTask -TaskName "Syncthing" |
| Phone not syncing | Phone Syncthing app in background? | User must open app, keep screen on |
| High CPU on TrueNAS | Syncthing rescan? KSM? | Check rescan intervals, disable KSM |
| VM won't start | Storage available? RAM free? | ssh pve 'qm start VMID', check logs |
| Tailscale offline | tailscale status | tailscale up or restart service |
| Tailscale no subnet access | Check subnet routers | Verify pve or ucg-fiber advertising routes |
| Sync stuck at X% | Folder errors? Conflicts? | Check rest/folder/errors?folder=NAME |
| Server running hot | Check KSM, check CPU processes | Disable KSM, identify runaway process |
| Storage enclosure loud | Check fan speed via SES | See EMC-ENCLOSURE.md |
| Drives not detected | Check SAS link, LCC status | Switch LCC, rescan SCSI hosts |
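For "Sync stuck at X%", a quick sketch that pulls folder errors straight from the local Syncthing API (Mac Mini key from the API Quick Reference; FOLDER is a placeholder for the folder ID):
# List sync errors for a folder
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
  "http://127.0.0.1:8384/rest/folder/errors?folder=FOLDER" \
  | python3 -c "import sys,json; [print(e['path'],'-',e['error']) for e in json.load(sys.stdin).get('errors') or []]"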
Server Temperature Check
# Check temps on both servers (Threadripper PRO max safe: 90°C Tctl)
ssh pve 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE Tctl: $(($(cat $f)/1000))°C"; fi; done'
ssh pve2 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE2 Tctl: $(($(cat $f)/1000))°C"; fi; done'
Healthy temps: 70-80°C under load. Warning: >85°C. Throttle: 90°C.
Service Dependencies
TrueNAS (10.10.10.200)
├── Central Syncthing hub - if down, sync breaks between devices
├── NFS/SMB shares for VMs
└── Media storage for Plex
PiHole (CT 200)
└── DNS for entire network - if down, name resolution fails
Traefik (CT 202)
└── Reverse proxy - if down, external access to services fails
Router (10.10.10.1)
└── Everything - gateway for all traffic
API Quick Reference
| Service | Device | Endpoint | Auth |
|---|---|---|---|
| Syncthing | Mac Mini | http://127.0.0.1:8384/rest/ | X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5 |
| Syncthing | MacBook | http://127.0.0.1:8384/rest/ (via SSH) | X-API-Key: qYkNdVLwy9qZZZ6MqnJr7tHX7KKdxGMJ |
| Syncthing | Phone | https://10.10.10.54:8384/rest/ | X-API-Key: Xxz3jDT4akUJe6psfwZsbZwG2LhfZuDM |
| Proxmox | PVE | https://10.10.10.120:8006/api2/json/ | SSH key auth |
| Proxmox | PVE2 | https://10.10.10.102:8006/api2/json/ | SSH key auth |
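Every Syncthing endpoint follows the same pattern; a minimal liveness check against the Mac Mini instance (a sketch using the key from the table above):
# Print Syncthing uptime in seconds
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" "http://127.0.0.1:8384/rest/system/status" \
  | python3 -c "import sys,json; print(json.load(sys.stdin)['uptime'])"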
Common Maintenance Tasks
When user asks for maintenance or you notice issues:
- Check Syncthing sync status - Any folders behind? Errors?
- Verify all devices connected - Run connection check
- Check disk space - ssh pve 'df -h' and ssh pve2 'df -h'
- Review ZFS pool health - ssh pve 'zpool status'
- Check for stuck processes - High CPU? Memory pressure?
- Verify backups - Are critical folders syncing?
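A sketch that rolls the disk and pool checks into one pass (assumes the ssh host aliases from the SSH Access section):
# Disk space and pool health on both hosts
for h in pve pve2; do
  echo "=== $h ==="
  ssh $h 'df -h /; zpool status -x'
done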
Emergency Commands
# Restart VM on Proxmox
ssh pve 'qm stop VMID && qm start VMID'
# Check what's using CPU
ssh pve 'ps aux --sort=-%cpu | head -10'
# Check ZFS pool status (via QEMU agent)
ssh pve 'qm guest exec 100 -- bash -c "zpool status vault"'
# Check EMC enclosure fans
ssh pve 'qm guest exec 100 -- bash -c "sg_ses --index=coo,-1 --get=speed_code /dev/sg15"'
# Force Syncthing rescan
curl -X POST "http://127.0.0.1:8384/rest/db/scan?folder=FOLDER" -H "X-API-Key: API_KEY"
# Restart Syncthing on Windows (when stuck)
sshpass -p 'GrilledCh33s3#' ssh claude@10.10.10.150 'Stop-Process -Name syncthing -Force; Start-ScheduledTask -TaskName "Syncthing"'
# Get all device IPs from router
expect -c 'spawn ssh root@10.10.10.1 "cat /proc/net/arp"; expect "Password:"; send "GrilledCh33s3#\r"; expect eof'
Overview
Two Proxmox servers running various VMs and containers for home infrastructure, media, development, and AI workloads.
Servers
PVE (10.10.10.120) - Primary
- CPU: AMD Ryzen Threadripper PRO 3975WX (32-core, 64 threads, 280W TDP)
- RAM: 128 GB
- Storage:
  - nvme-mirror1: 2x Sabrent Rocket Q NVMe (3.6TB usable)
  - nvme-mirror2: 2x Kingston SFYRD 2TB (1.8TB usable)
  - rpool: 2x Samsung 870 QVO 4TB SSD mirror (3.6TB usable)
- GPUs:
- NVIDIA Quadro P2000 (75W TDP) - Plex transcoding
- NVIDIA TITAN RTX (280W TDP) - AI workloads, passed to saltbox/lmdev1
- Role: Primary VM host, TrueNAS, media services
PVE2 (10.10.10.102) - Secondary
- CPU: AMD Ryzen Threadripper PRO 3975WX (32-core, 64 threads, 280W TDP)
- RAM: 128 GB
- Storage:
  - nvme-mirror3: 2x NVMe mirror
  - local-zfs2: 2x WD Red 6TB HDD mirror
- GPUs:
- NVIDIA RTX A6000 (300W TDP) - passed to trading-vm
- Role: Trading platform, development
SSH Access
SSH Key Authentication (All Hosts)
SSH keys are configured in ~/.ssh/config on both Mac Mini and MacBook. Use the ~/.ssh/homelab key.
| Host Alias | IP | User | Type | Notes |
|---|---|---|---|---|
| pve | 10.10.10.120 | root | Proxmox | Primary server |
| pve2 | 10.10.10.102 | root | Proxmox | Secondary server |
| truenas | 10.10.10.200 | root | VM | NAS/storage |
| saltbox | 10.10.10.100 | hutson | VM | Media automation |
| lmdev1 | 10.10.10.111 | hutson | VM | AI/LLM development |
| docker-host | 10.10.10.206 | hutson | VM | Docker services |
| fs-dev | 10.10.10.5 | hutson | VM | Development |
| copyparty | 10.10.10.201 | hutson | VM | File sharing |
| gitea-vm | 10.10.10.220 | hutson | VM | Git server |
| trading-vm | 10.10.10.221 | hutson | VM | AI trading platform |
| pihole | 10.10.10.10 | root | LXC | DNS/Ad blocking |
| traefik | 10.10.10.250 | root | LXC | Reverse proxy |
| findshyt | 10.10.10.8 | root | LXC | Custom app |
Usage examples:
ssh pve 'qm list' # List VMs
ssh truenas 'zpool status vault' # Check ZFS pool
ssh saltbox 'docker ps' # List containers
ssh pihole 'pihole status' # Check Pi-hole
Password Auth (Special Cases)
| Device | IP | User | Auth Method | Notes |
|---|---|---|---|---|
| UniFi Router | 10.10.10.1 | root | expect (keyboard-interactive) | Gateway |
| Windows PC | 10.10.10.150 | claude | sshpass | PowerShell, use ; not && |
| HomeAssistant | 10.10.10.110 | - | QEMU agent only | No SSH server |
Router access (requires expect):
# Run command on router
expect -c 'spawn ssh root@10.10.10.1 "hostname"; expect "Password:"; send "GrilledCh33s3#\r"; expect eof'
# Get ARP table (all device IPs)
expect -c 'spawn ssh root@10.10.10.1 "cat /proc/net/arp"; expect "Password:"; send "GrilledCh33s3#\r"; expect eof'
Windows PC access:
sshpass -p 'GrilledCh33s3#' ssh claude@10.10.10.150 'Get-Process | Select -First 5'
HomeAssistant (no SSH, use QEMU agent):
ssh pve 'qm guest exec 110 -- bash -c "ha core info"'
VMs and Containers
PVE (10.10.10.120)
| VMID | Name | vCPUs | RAM | Purpose | GPU/Passthrough | QEMU Agent |
|---|---|---|---|---|---|---|
| 100 | truenas | 8 | 32GB | NAS, storage | LSI SAS2308 HBA, Samsung NVMe | Yes |
| 101 | saltbox | 16 | 16GB | Media automation | TITAN RTX | Yes |
| 105 | fs-dev | 10 | 8GB | Development | - | Yes |
| 110 | homeassistant | 2 | 2GB | Home automation | - | No |
| 111 | lmdev1 | 8 | 32GB | AI/LLM development | TITAN RTX | Yes |
| 201 | copyparty | 2 | 2GB | File sharing | - | Yes |
| 206 | docker-host | 2 | 4GB | Docker services | - | Yes |
| 200 | pihole (CT) | - | - | DNS/Ad blocking | - | N/A |
| 202 | traefik (CT) | - | - | Reverse proxy | - | N/A |
| 205 | findshyt (CT) | - | - | Custom app | - | N/A |
PVE2 (10.10.10.102)
| VMID | Name | vCPUs | RAM | Purpose | GPU/Passthrough | QEMU Agent |
|---|---|---|---|---|---|---|
| 300 | gitea-vm | 2 | 4GB | Git server | - | Yes |
| 301 | trading-vm | 16 | 32GB | AI trading platform | RTX A6000 | Yes |
QEMU Guest Agent
VMs with QEMU agent can be managed via qm guest exec:
# Execute command in VM
ssh pve 'qm guest exec 100 -- bash -c "zpool status vault"'
# Get VM IP addresses
ssh pve 'qm guest exec 100 -- bash -c "ip addr"'
Only VM 110 (homeassistant) lacks QEMU agent - use its web UI instead.
Power Management
Estimated Power Draw
- PVE: 500-750W (CPU + TITAN RTX + P2000 + storage + HBAs)
- PVE2: 450-600W (CPU + RTX A6000 + storage)
- Combined: ~1000-1350W under load
Optimizations Applied
- KSMD Disabled (2024-12-17 updated)
  - Was consuming 44-57% CPU on PVE with negative general_profit
  - Caused CPU temp to rise from 74°C to 83°C
  - Savings: ~7-10W + significant temp reduction
  - Made permanent via:
    - systemd service: /etc/systemd/system/disable-ksm.service
    - ksmtuned masked: systemctl mask ksmtuned (prevents re-enabling)
  - Note: KSM can get re-enabled by Proxmox updates. If CPU is hot, check:
    cat /sys/kernel/mm/ksm/run   # Should be 0
    ps aux | grep ksmd           # Should show 0% CPU
    # If KSM is running (run=1), disable it:
    echo 0 > /sys/kernel/mm/ksm/run
    systemctl mask ksmtuned
- Syncthing Rescan Intervals (2024-12-16)
  - Changed aggressive 60s rescans to 3600s for large folders
  - Affected: downloads (38GB), documents (11GB), desktop (7.2GB), movies, pictures, notes, config
  - Savings: ~60-80W (TrueNAS VM was at constant 86% CPU)
- CPU Governor Optimization (2024-12-16)
  - PVE: powersave governor + balance_power EPP (amd-pstate-epp driver)
  - PVE2: schedutil governor (acpi-cpufreq driver)
  - Made permanent via systemd service: /etc/systemd/system/cpu-powersave.service
  - Savings: ~60-120W combined (CPUs now idle at 1.7-2.2GHz vs 4GHz)
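To verify the governors survived a reboot (a sketch; standard cpufreq sysfs paths, the EPP file only exists under the amd-pstate-epp driver):
# Governor + EPP on PVE (amd-pstate-epp)
ssh pve 'cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor /sys/devices/system/cpu/cpu0/cpufreq/energy_performance_preference'
# Governor on PVE2 (acpi-cpufreq)
ssh pve2 'cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor'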
- GPU Power States (2024-12-16) - Verified optimal
  - RTX A6000: 11W idle (P8 state)
  - TITAN RTX: 2-3W idle (P8 state)
  - Quadro P2000: 25W (P0 - Plex keeps it active)
- ksmtuned Disabled (2024-12-16)
  - KSM tuning daemon was still running after KSMD disabled
  - Stopped and disabled on both servers
  - Savings: ~2-5W
- HDD Spindown on PVE2 (2024-12-16)
  - local-zfs2 pool (2x WD Red 6TB) had only 768KB used but drives spinning 24/7
  - Set 30-minute spindown via hdparm -S 241
  - Persistent via udev rule: /etc/udev/rules.d/69-hdd-spindown.rules
  - Savings: ~10-16W when spun down
Potential Optimizations
- PCIe ASPM power management
- NMI watchdog disable
Memory Configuration
- Ballooning enabled on most VMs but not actively used
- No memory overcommit (98GB allocated on 128GB physical for PVE)
- KSMD was wasting CPU with no benefit (negative general_profit)
Network
See NETWORK.md for full details.
Network Ranges
| Network | Range | Purpose |
|---|---|---|
| LAN | 10.10.10.0/24 | Primary network, all external access |
| Internal | 10.10.20.0/24 | Inter-VM only (storage, NFS/iSCSI) |
PVE Bridges (10.10.10.120)
| Bridge | NIC | Speed | Purpose | Use For |
|---|---|---|---|---|
| vmbr0 | enp1s0 | 1 Gb | Management | General VMs/CTs |
| vmbr1 | enp35s0f0 | 10 Gb | High-speed LXC | Bandwidth-heavy containers |
| vmbr2 | enp35s0f1 | 10 Gb | High-speed VM | TrueNAS, Saltbox, storage VMs |
| vmbr3 | (none) | Virtual | Internal only | NFS/iSCSI traffic, no internet |
Quick Reference
# Add VM to standard network (1Gb)
qm set VMID --net0 virtio,bridge=vmbr0
# Add VM to high-speed network (10Gb)
qm set VMID --net0 virtio,bridge=vmbr2
# Add secondary NIC for internal storage network
qm set VMID --net1 virtio,bridge=vmbr3
MTU 9000 (Jumbo Frames)
Jumbo frames are enabled across the network for improved throughput on large transfers.
| Device | Interface | MTU | Persistent |
|---|---|---|---|
| Mac Mini | en0 | 9000 | Yes (networksetup) |
| PVE | vmbr0, enp1s0 | 9000 | Yes (/etc/network/interfaces) |
| PVE2 | vmbr0, nic1 | 9000 | Yes (/etc/network/interfaces) |
| TrueNAS | enp6s18, enp6s19 | 9000 | Yes |
| UCG-Fiber | br0 | 9216 | Yes (default) |
Verify MTU:
# Mac Mini
ifconfig en0 | grep mtu
# PVE/PVE2
ssh pve 'ip link show vmbr0 | grep mtu'
ssh pve2 'ip link show vmbr0 | grep mtu'
# Test jumbo frames
ping -c 1 -D -s 8000 10.10.10.120 # 8000 + 8 byte header = 8008 bytes
Important: When setting MTU on Proxmox bridges, ensure BOTH the bridge (vmbr0) AND the underlying physical interface (enp1s0/nic1) have the same MTU, otherwise packets will be dropped.
Tailscale VPN
Tailscale provides secure remote access to the homelab from anywhere.
Subnet Routers (HA Failover)
Two devices advertise the 10.10.10.0/24 subnet for redundancy:
| Device | Tailscale IP | Role | Notes |
|---|---|---|---|
| pve | 100.113.177.80 | Primary | Proxmox host |
| ucg-fiber | 100.94.246.32 | Failover | UniFi router (always on) |
If Proxmox goes down, Tailscale automatically fails over to the router (~10-30 sec).
Router Tailscale Setup (UCG-Fiber)
- Installed via: curl -fsSL https://tailscale.com/install.sh | sh
- Config: tailscale up --advertise-routes=10.10.10.0/24 --accept-routes
- Survives reboots (systemd service)
- Routes must be approved in Tailscale Admin Console
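A minimal failover check from any remote device on the tailnet (a sketch; assumes the tailscale CLI is on PATH - on macOS use the app-bundle path shown under Check Tailscale Status):
# Confirm at least one subnet router is online
tailscale status | grep -E 'pve|ucg-fiber'
# Confirm the advertised subnet is reachable end-to-end
ping -c 1 10.10.10.200 >/dev/null && echo "Subnet route: UP" || echo "Subnet route: DOWN"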
Tailscale IPs Quick Reference
| Device | Tailscale IP | Local IP |
|---|---|---|
| Mac Mini | 100.108.89.58 | 10.10.10.125 |
| PVE | 100.113.177.80 | 10.10.10.120 |
| UCG-Fiber | 100.94.246.32 | 10.10.10.1 |
| TrueNAS | 100.100.94.71 | 10.10.10.200 |
| Pi-hole | 100.112.59.128 | 10.10.10.10 |
Check Tailscale Status
# From Mac Mini
/Applications/Tailscale.app/Contents/MacOS/Tailscale status
# From router
expect -c 'spawn ssh root@10.10.10.1 "tailscale status"; expect "Password:"; send "GrilledCh33s3#\r"; expect eof'
Common Commands
# Check VM status
ssh pve 'qm list'
ssh pve2 'qm list'
# Check container status
ssh pve 'pct list'
# Monitor CPU/power
ssh pve 'top -bn1 | head -20'
# Check ZFS pools
ssh pve 'zpool status'
# Check GPU (if nvidia-smi installed in VM)
ssh pve 'lspci | grep -i nvidia'
Remote Claude Code Sessions (Mac Mini)
Overview
The Mac Mini (hutson-mac-mini.local) runs the Happy Coder daemon, enabling on-demand Claude Code sessions accessible from anywhere via the Happy Coder mobile app. Sessions are created when you need them - no persistent tmux sessions required.
Architecture
Mac Mini (100.108.89.58 via Tailscale)
├── launchd (auto-starts on boot)
│ └── com.hutson.happy-daemon.plist (starts Happy daemon)
├── Happy Coder daemon (manages remote sessions)
└── Tailscale (secure remote access)
How It Works
- Happy daemon runs on Mac Mini (auto-starts on boot)
- Open Happy Coder app on phone/tablet
- Start a new Claude session from the app
- Session runs in any working directory you choose
- Session ends when you're done - no cleanup needed
Quick Commands
# Check daemon status
happy daemon status
# Start a new session manually (from Mac Mini terminal)
cd ~/Projects/homelab && happy claude
# Check active sessions
happy daemon list
Mobile Access Setup (One-time)
- Download the Happy Coder app on your phone/tablet
- On Mac Mini, ensure the self-hosted server is configured:
  echo 'export HAPPY_SERVER_URL="https://happy.htsn.io"' >> ~/.zshrc
  source ~/.zshrc
- Authenticate with the Happy server:
  happy auth login --force   # Opens browser, scan QR with app
- Connect Claude API access:
  happy connect claude       # Links your Anthropic API credentials
- Ensure Claude is logged in locally (critical for spawned sessions):
  claude     # Start Claude Code
  /login     # Authenticate if prompted
- Daemon auto-starts on login via launchd
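After setup, a quick sanity check (a sketch; run in a fresh shell so the ~/.zshrc export is picked up):
echo $HAPPY_SERVER_URL   # Should print https://happy.htsn.io
happy daemon status      # Daemon should report running
happy daemon list        # Should list sessions without auth errors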
Daemon Management
happy daemon start # Start daemon
happy daemon stop # Stop daemon
happy daemon status # Check status
happy daemon list # List active sessions
Remote Access via SSH + Tailscale
From any device on Tailscale network:
# SSH to Mac Mini
ssh hutson@100.108.89.58
# Or via hostname
ssh hutson@mac-mini
# Start Claude in desired directory
cd ~/Projects/homelab && happy claude
Files & Configuration
| File | Purpose |
|---|---|
| ~/Library/LaunchAgents/com.hutson.happy-daemon.plist | User LaunchAgent (starts at login) |
| ~/.happy/ | Happy Coder config, state, and logs |
| ~/.zshrc | Contains HAPPY_SERVER_URL export |
Server: https://happy.htsn.io (self-hosted Happy server on docker-host)
Troubleshooting
# Check if daemon is running
pgrep -f "happy.*daemon"
# Check launchd status
launchctl list | grep happy
# List active sessions
happy daemon list
# Restart daemon
happy daemon stop && happy daemon start
# If Tailscale is disconnected
/Applications/Tailscale.app/Contents/MacOS/Tailscale up
Common Issues:
| Issue | Cause | Fix |
|---|---|---|
| "Invalid API key" in spawned session | Claude not logged in locally | Run claude then /login on Mac Mini |
| "Failed to start daemon" | Stale lock file | rm -f ~/.happy/daemon.state.json.lock ~/.happy/daemon.state.json |
| Sessions not showing on phone | HAPPY_SERVER_URL not set | Add to ~/.zshrc: export HAPPY_SERVER_URL="https://happy.htsn.io" |
| Slow responses | Cloudflare proxy enabled | Disable proxy for happy.htsn.io subdomain |
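The stale-lock and daemon-start failures usually chain together; a recovery sketch (paths from the table above; stopping the daemon first is an assumption about safe ordering):
# Clear a stale lock and restart the daemon cleanly
happy daemon stop 2>/dev/null
rm -f ~/.happy/daemon.state.json.lock ~/.happy/daemon.state.json
happy daemon start
happy daemon status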
Happy Server (Self-Hosted Relay)
Self-hosted Happy Coder relay server for lower latency and no external dependencies.
Architecture
Phone App → https://happy.htsn.io → Traefik → docker-host:3002 → Happy Server
↓
PostgreSQL + Redis + MinIO (local)
Service Details
| Component | Location | Port | Notes |
|---|---|---|---|
| Happy Server | docker-host (10.10.10.206) | 3002 | Main relay service |
| PostgreSQL | docker-host | 5432 (internal) | User/session data |
| Redis | docker-host | 6379 (internal) | Real-time events |
| MinIO | docker-host | 9000 (internal) | File/image storage |
| Traefik | CT 202 | 443 | SSL termination |
Configuration
Docker Compose: /opt/happy-server/docker-compose.yml
Traefik Config: /etc/traefik/conf.d/happy.yaml (on CT 202)
DNS: happy.htsn.io → 70.237.94.174 (Cloudflare DNS-only, NOT proxied for WebSocket performance)
Credentials:
- Master Secret: 3ccbfd03a028d3c278da7d2cf36d99b94cd4b1fecabc49ab006e8e89bc7707ac
- MinIO: happyadmin / happyadmin123
- PostgreSQL: happy / happypass
Quick Commands
# Check status
ssh docker-host 'docker ps --filter "name=happy"'
# View logs
ssh docker-host 'docker logs -f happy-server'
# Restart stack
ssh docker-host 'cd /opt/happy-server && sudo docker-compose restart'
# Health check
curl https://happy.htsn.io/health
# Run migrations (if needed)
ssh docker-host 'docker exec happy-server npx prisma migrate deploy'
Connecting Devices
Phone (Happy App):
- Settings → Relay Server URL
- Enter: https://happy.htsn.io
- Save and reconnect
CLI (Mac/Linux):
export HAPPY_SERVER_URL="https://happy.htsn.io"
happy auth # Re-authenticate with new server
Maintenance
Backup data:
ssh docker-host 'docker exec happy-postgres pg_dump -U happy happy > /tmp/happy-backup.sql'
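Restore is the mirror image (an untested sketch; assumes the backup file from above and an existing happy database):
# Restore from backup
ssh docker-host 'docker exec -i happy-postgres psql -U happy happy < /tmp/happy-backup.sql'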
Update Happy Server:
ssh docker-host 'cd /opt/happy-server && git pull && sudo docker-compose build && sudo docker-compose up -d'
Agent and Tool Guidelines
Background Agents
- Always spin up background agents when doing multiple independent tasks
- Background agents allow parallel execution of tasks that don't depend on each other
- This improves efficiency and reduces total execution time
- Use background agents for tasks like running tests, builds, or searches simultaneously
MCP Tools for Web Searches
ref.tools - Documentation Lookups
- mcp__Ref__ref_search_documentation: Search through documentation for specific topics
- mcp__Ref__ref_read_url: Read and parse content from documentation URLs
Exa MCP - General Web and Code Searches
- mcp__exa__web_search_exa: General web searches for current information
- mcp__exa__get_code_context_exa: Code-related searches and repository lookups
MCP Tools Reference Table
| Tool Name | Provider | Purpose | Use Case |
|---|---|---|---|
| mcp__Ref__ref_search_documentation | ref.tools | Search documentation | Finding specific topics in official docs |
| mcp__Ref__ref_read_url | ref.tools | Read documentation URLs | Parsing and extracting content from doc pages |
| mcp__exa__web_search_exa | Exa MCP | General web search | Current events, general information lookup |
| mcp__exa__get_code_context_exa | Exa MCP | Code-specific search | Finding code examples, repository searches |
Reverse Proxy Architecture (Traefik)
Overview
There are TWO separate Traefik instances handling different services:
| Instance | Location | IP | Purpose | Manages |
|---|---|---|---|---|
| Traefik-Primary | CT 202 | 10.10.10.250 | General services | All non-Saltbox services |
| Traefik-Saltbox | VM 101 (Docker) | 10.10.10.100 | Saltbox services | Plex, *arr apps, media stack |
⚠️ CRITICAL RULE: Which Traefik to Use
When adding ANY new service:
- ✅ Use Traefik-Primary (10.10.10.250) - Unless service lives inside Saltbox VM
- ❌ DO NOT touch Traefik-Saltbox - It manages Saltbox services with their own certificates
Why this matters:
- Traefik-Saltbox has complex Saltbox-managed configs
- Messing with it breaks Plex, Sonarr, Radarr, and all media services
- Each Traefik has its own Let's Encrypt certificates
- Mixing them causes certificate conflicts
Traefik-Primary (CT 202) - For New Services
Location: /etc/traefik/ on Container 202
Config: /etc/traefik/traefik.yaml
Dynamic Configs: /etc/traefik/conf.d/*.yaml
Services using Traefik-Primary (10.10.10.250):
- excalidraw.htsn.io → 10.10.10.206:8080 (docker-host)
- findshyt.htsn.io → 10.10.10.205 (CT 205)
- gitea (git.htsn.io) → 10.10.10.220:3000
- homeassistant → 10.10.10.110
- lmdev → 10.10.10.111
- pihole → 10.10.10.10
- truenas → 10.10.10.200
- proxmox → 10.10.10.120
- copyparty → 10.10.10.201
- aitrade → trading server
- pulse.htsn.io → 10.10.10.206:7655 (Pulse monitoring)
- happy.htsn.io → 10.10.10.206:3002 (Happy Coder relay server)
Access Traefik config:
# From Mac Mini:
ssh pve 'pct exec 202 -- cat /etc/traefik/traefik.yaml'
ssh pve 'pct exec 202 -- ls /etc/traefik/conf.d/'
# Edit a service config:
ssh pve 'pct exec 202 -- vi /etc/traefik/conf.d/myservice.yaml'
Traefik-Saltbox (VM 101) - DO NOT MODIFY
Location: /opt/traefik/ inside Saltbox VM
Managed by: Saltbox Ansible playbooks
Mounts: Docker bind mount from /opt/traefik → /etc/traefik in container
Services using Traefik-Saltbox (10.10.10.100):
- Plex (plex.htsn.io)
- Sonarr, Radarr, Lidarr
- SABnzbd, NZBGet, qBittorrent
- Overseerr, Tautulli, Organizr
- Jackett, NZBHydra2
- Authelia (SSO)
- All other Saltbox-managed containers
View Saltbox Traefik (read-only):
ssh pve 'qm guest exec 101 -- bash -c "docker exec traefik cat /etc/traefik/traefik.yml"'
Adding a New Public Service - Complete Workflow
Follow these steps to deploy a new service and make it publicly accessible at servicename.htsn.io.
Step 0. Deploy Your Service
First, deploy your service on the appropriate host:
Option A: Docker on docker-host (10.10.10.206)
ssh hutson@10.10.10.206
sudo mkdir -p /opt/myservice
cat > /opt/myservice/docker-compose.yml << 'EOF'
version: "3.8"
services:
myservice:
image: myimage:latest
ports:
- "8080:80"
restart: unless-stopped
EOF
cd /opt/myservice && sudo docker-compose up -d
Option B: New LXC Container on PVE
ssh pve 'pct create CTID local:vztmpl/ubuntu-22.04-standard_22.04-1_amd64.tar.zst \
--hostname myservice --memory 2048 --cores 2 \
--net0 name=eth0,bridge=vmbr0,ip=10.10.10.XXX/24,gw=10.10.10.1 \
--rootfs local-zfs:8 --unprivileged 1 --start 1'
Option C: New VM on PVE
ssh pve 'qm create VMID --name myservice --memory 2048 --cores 2 \
--net0 virtio,bridge=vmbr0 --scsihw virtio-scsi-pci'
Step 1. Create Traefik Config File
Use this template for new services on Traefik-Primary (CT 202):
# /etc/traefik/conf.d/myservice.yaml
http:
routers:
# HTTPS router
myservice-secure:
entryPoints:
- websecure
rule: "Host(`myservice.htsn.io`)"
service: myservice
tls:
certResolver: cloudflare # Use 'cloudflare' for proxied domains, 'letsencrypt' for DNS-only
priority: 50
# HTTP → HTTPS redirect
myservice-redirect:
entryPoints:
- web
rule: "Host(`myservice.htsn.io`)"
middlewares:
- myservice-https-redirect
service: myservice
priority: 50
services:
myservice:
loadBalancer:
servers:
- url: "http://10.10.10.XXX:PORT"
middlewares:
myservice-https-redirect:
redirectScheme:
scheme: https
permanent: true
SSL Certificates
Traefik has two certificate resolvers configured:
| Resolver | Use When | Challenge Type | Notes |
|---|---|---|---|
| letsencrypt | Cloudflare DNS-only (gray cloud) | HTTP-01 | Requires port 80 reachable |
| cloudflare | Cloudflare Proxied (orange cloud) | DNS-01 | Works with Cloudflare proxy |
⚠️ Important: If Cloudflare proxy is enabled (orange cloud), HTTP challenge fails because Cloudflare redirects HTTP→HTTPS. Use cloudflare resolver instead.
Cloudflare API credentials are configured in /etc/systemd/system/traefik.service:
Environment="CF_API_EMAIL=cloudflare@htsn.io"
Environment="CF_API_KEY=849ebefd163d2ccdec25e49b3e1b3fe2cdadc"
Certificate storage:
- HTTP challenge certs: /etc/traefik/acme.json
- DNS challenge certs: /etc/traefik/acme-cf.json
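To see which hostnames already have certificates, a sketch (assumes jq is installed in CT 202 and the usual acme.json layout with domain.main entries):
# List issued certificate hostnames
ssh pve 'pct exec 202 -- jq -r "..|.main? // empty" /etc/traefik/acme.json'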
Deploy the config:
# Create file on CT 202
ssh pve 'pct exec 202 -- bash -c "cat > /etc/traefik/conf.d/myservice.yaml << '\''EOF'\''
<paste config here>
EOF"'
# Traefik auto-reloads (watches conf.d directory)
# Check logs:
ssh pve 'pct exec 202 -- tail -f /var/log/traefik/traefik.log'
Step 2. Add Cloudflare DNS Entry
Cloudflare Credentials:
- Email: cloudflare@htsn.io
- API Key: 849ebefd163d2ccdec25e49b3e1b3fe2cdadc
Manual method (via Cloudflare Dashboard):
- Go to https://dash.cloudflare.com/
- Select the htsn.io domain
- DNS → Add Record
- Type: A, Name: myservice, IPv4: 70.237.94.174, Proxied: ☑️
Automated method (CLI script):
Save this as ~/bin/add-cloudflare-dns.sh:
#!/bin/bash
# Add DNS record to Cloudflare for htsn.io
SUBDOMAIN="$1"
CF_EMAIL="cloudflare@htsn.io"
CF_API_KEY="849ebefd163d2ccdec25e49b3e1b3fe2cdadc"
ZONE_ID="c0f5a80448c608af35d39aa820a5f3af" # htsn.io zone
PUBLIC_IP="70.237.94.174" # Update if IP changes: curl -s ifconfig.me
if [ -z "$SUBDOMAIN" ]; then
echo "Usage: $0 <subdomain>"
echo "Example: $0 myservice # Creates myservice.htsn.io"
exit 1
fi
curl -X POST "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records" \
-H "X-Auth-Email: $CF_EMAIL" \
-H "X-Auth-Key: $CF_API_KEY" \
-H "Content-Type: application/json" \
--data "{
\"type\":\"A\",
\"name\":\"$SUBDOMAIN\",
\"content\":\"$PUBLIC_IP\",
\"ttl\":1,
\"proxied\":true
}" | jq .
Usage:
chmod +x ~/bin/add-cloudflare-dns.sh
~/bin/add-cloudflare-dns.sh excalidraw # Creates excalidraw.htsn.io
Step 3. Testing
# Check if DNS resolves
dig myservice.htsn.io
# Test HTTP redirect
curl -I http://myservice.htsn.io
# Test HTTPS
curl -I https://myservice.htsn.io
# Check Traefik dashboard (if enabled)
# Access: http://10.10.10.250:8080/dashboard/
Step 4. Update Documentation
After deploying, update these files:
- IP-ASSIGNMENTS.md - Add to Services & Reverse Proxy Mapping table
- CLAUDE.md - Add to "Services using Traefik-Primary" list (line ~495)
Quick Reference - One-Liner Commands
# === DEPLOY SERVICE (example: myservice on docker-host port 8080) ===
# 1. Create Traefik config
ssh pve 'pct exec 202 -- bash -c "cat > /etc/traefik/conf.d/myservice.yaml << EOF
http:
routers:
myservice-secure:
entryPoints: [websecure]
rule: Host(\\\`myservice.htsn.io\\\`)
service: myservice
tls: {certResolver: letsencrypt}
services:
myservice:
loadBalancer:
servers:
- url: http://10.10.10.206:8080
EOF"'
# 2. Add Cloudflare DNS
curl -s -X POST "https://api.cloudflare.com/client/v4/zones/c0f5a80448c608af35d39aa820a5f3af/dns_records" \
-H "X-Auth-Email: cloudflare@htsn.io" \
-H "X-Auth-Key: 849ebefd163d2ccdec25e49b3e1b3fe2cdadc" \
-H "Content-Type: application/json" \
--data '{"type":"A","name":"myservice","content":"70.237.94.174","proxied":true}'
# 3. Test (wait a few seconds for DNS propagation)
curl -I https://myservice.htsn.io
Traefik Troubleshooting
# View Traefik logs (CT 202)
ssh pve 'pct exec 202 -- tail -f /var/log/traefik/traefik.log'
# Check if config is valid
ssh pve 'pct exec 202 -- cat /etc/traefik/conf.d/myservice.yaml'
# List all dynamic configs
ssh pve 'pct exec 202 -- ls -la /etc/traefik/conf.d/'
# Check certificate
ssh pve 'pct exec 202 -- cat /etc/traefik/acme.json | jq'
# Restart Traefik (if needed)
ssh pve 'pct exec 202 -- systemctl restart traefik'
Certificate Management
Let's Encrypt certificates are automatically managed by Traefik.
Certificate storage:
- Traefik-Primary: /etc/traefik/acme.json on CT 202
- Traefik-Saltbox: /opt/traefik/acme.json on VM 101
Certificate renewal:
- Automatic via HTTP-01 challenge
- Traefik checks every 24h
- Renews 30 days before expiry
If certificates fail:
# Check acme.json permissions (must be 600)
ssh pve 'pct exec 202 -- ls -la /etc/traefik/acme.json'
# Check Traefik can reach Let's Encrypt
ssh pve 'pct exec 202 -- curl -I https://acme-v02.api.letsencrypt.org/directory'
# Delete bad certificate (Traefik will re-request)
ssh pve 'pct exec 202 -- rm /etc/traefik/acme.json'
ssh pve 'pct exec 202 -- touch /etc/traefik/acme.json'
ssh pve 'pct exec 202 -- chmod 600 /etc/traefik/acme.json'
ssh pve 'pct exec 202 -- systemctl restart traefik'
Docker Service with Traefik Labels (Alternative)
If deploying a service via Docker on docker-host (VM 206), you can use Traefik labels instead of config files:
# docker-compose.yml
services:
myservice:
image: myimage:latest
labels:
- "traefik.enable=true"
- "traefik.http.routers.myservice.rule=Host(`myservice.htsn.io`)"
- "traefik.http.routers.myservice.entrypoints=websecure"
- "traefik.http.routers.myservice.tls.certresolver=letsencrypt"
- "traefik.http.services.myservice.loadbalancer.server.port=8080"
networks:
- traefik
networks:
traefik:
external: true
Note: This requires Traefik to have access to Docker socket and be on same network.
Cloudflare API Access
Credentials (stored in Saltbox config):
- Email: cloudflare@htsn.io
- API Key: 849ebefd163d2ccdec25e49b3e1b3fe2cdadc
- Domain: htsn.io
Retrieve from Saltbox:
ssh pve 'qm guest exec 101 -- bash -c "cat /srv/git/saltbox/accounts.yml | grep -A2 cloudflare"'
Cloudflare API Documentation:
- API Docs: https://developers.cloudflare.com/api/
- DNS Records: https://developers.cloudflare.com/api/operations/dns-records-for-a-zone-create-dns-record
Common API operations:
# Set credentials
CF_EMAIL="cloudflare@htsn.io"
CF_API_KEY="849ebefd163d2ccdec25e49b3e1b3fe2cdadc"
ZONE_ID="c0f5a80448c608af35d39aa820a5f3af"
# List all DNS records
curl -X GET "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records" \
-H "X-Auth-Email: $CF_EMAIL" \
-H "X-Auth-Key: $CF_API_KEY" | jq
# Add A record
curl -X POST "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records" \
-H "X-Auth-Email: $CF_EMAIL" \
-H "X-Auth-Key: $CF_API_KEY" \
-H "Content-Type: application/json" \
--data '{"type":"A","name":"subdomain","content":"IP","proxied":true}'
# Delete record
curl -X DELETE "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records/$RECORD_ID" \
-H "X-Auth-Email: $CF_EMAIL" \
-H "X-Auth-Key: $CF_API_KEY"
Git Repository
This documentation is stored at:
- Gitea: https://git.htsn.io/hutson/homelab-docs
- Local: ~/Projects/homelab
- Notes: ~/Notes/05_Homelab (symlink)
# Clone
git clone git@git.htsn.io:hutson/homelab-docs.git
# Push changes
cd ~/Projects/homelab
git add -A && git commit -m "Update docs" && git push
Related Documentation
| File | Description |
|---|---|
| EMC-ENCLOSURE.md | EMC storage enclosure (SES commands, LCC troubleshooting, maintenance) |
| HOMEASSISTANT.md | Home Assistant API access, automations, integrations |
| NETWORK.md | Network bridges, VLANs, which bridge to use for new VMs |
| IP-ASSIGNMENTS.md | Complete IP address assignments for all devices and services |
| SYNCTHING.md | Syncthing setup, API access, device list, troubleshooting |
| SHELL-ALIASES.md | ZSH aliases for Claude Code (chomelab, ctrading, etc.) |
| configs/ | Symlinks to shared shell configs |
Backlog
Future improvements and maintenance tasks:
| Priority | Task | Notes |
|---|---|---|
| Medium | Re-IP all devices | Current IP scheme is inconsistent. Plan: VMs 10.10.10.100-199, LXCs 10.10.10.200-249, Services 10.10.10.250-254 |
| Low | Install SSH on HomeAssistant | Currently only accessible via QEMU agent |
| Low | Set up SSH key for router | Currently requires expect/password |
Changelog
2025-12-21
Happy Server Self-Hosted Relay
- Deployed self-hosted Happy Coder relay server on docker-host (10.10.10.206)
- Stack includes: Happy Server, PostgreSQL, Redis, MinIO (all containerized)
- Configured Traefik reverse proxy at https://happy.htsn.io
- Added Cloudflare DNS record (proxied)
- Fixed Dockerfile to include Prisma migrations on startup
Docker-host CPU Upgrade
- Changed VM 206 CPU from emulated to host passthrough
- Fixes x86-64-v2 compatibility issues with modern binaries (Sharp, MinIO)
- Requires: ssh pve 'qm set 206 -cpu host' + VM reboot
PVE Tailscale Routing Fix
- Fixed issue where PVE was unreachable via local network (10.10.10.120)
- Root cause: Tailscale routing table 52 was capturing local subnet traffic
- Fix: Added routing rule ip rule add from 10.10.10.120 table main priority 5200
- Made permanent in /etc/network/interfaces under vmbr0
2024-12-20
Git Repository Setup
- Created homelab-docs repo on Gitea (git.htsn.io/hutson/homelab-docs)
- Set up SSH key authentication for git@git.htsn.io
- Created symlink from ~/Notes/05_Homelab → ~/Projects/homelab
- Added Gitea API token for future automation
SSH Key Deployment - All Systems
- Added SSH keys to ALL VMs and LXCs (13 total hosts now accessible via key)
- Updated ~/.ssh/config with complete host aliases
- Fixed permissions: FindShyt LXC .ssh ownership, enabled PermitRootLogin on LXCs
Documentation Updates
- Rewrote SSH Access section with complete host table
- Added Password Auth section for router/Windows/HomeAssistant
- Added Backlog section with re-IP task
- Added Git Repository section with clone/push instructions
2024-12-19
EMC Storage Enclosure - LCC B Failure
- Diagnosed loud fan issue (speed code 5 → 4160 RPM)
- Root cause: Faulty LCC B controller causing false readings
- Resolution: Switched SAS cable to LCC A, fans now quiet (speed code 3 → 2670 RPM)
- Replacement ordered: EMC 303-108-000E ($14.95 eBay)
- Created EMC-ENCLOSURE.md with full documentation
SSH Key Consolidation
- Renamed ~/.ssh/ai_trading_ed25519 → ~/.ssh/homelab
- Updated ~/.ssh/config on MacBook with all homelab hosts
- No more sshpass needed for PVE servers
QEMU Guest Agent Deployment
- Installed on: docker-host (206), fs-dev (105), copyparty (201)
- All PVE VMs now have agent except homeassistant (110)
- Can now use qm guest exec for remote commands
VM Configuration Updates
- docker-host: Fixed SSH key in cloud-init
- fs-dev: Fixed .ssh directory ownership (1000 → 1001)
- copyparty: Changed from DHCP to static IP (10.10.10.201)
Documentation Updates
- Updated CLAUDE.md SSH section (removed sshpass examples)
- Added QEMU Agent column to VM tables
- Added storage enclosure troubleshooting to runbooks