Homelab Infrastructure - Quick Reference
Start here: README.md - Documentation index and overview
This is your quick reference guide for common homelab tasks. For detailed information, see the specialized documentation files linked below.
Quick Reference - Common Tasks
| Task | Documentation | Quick Command |
|---|---|---|
| Add new public service | TRAEFIK.md | Create Traefik config + Cloudflare DNS |
| Check UPS status | UPS.md | ssh pve 'upsc cyberpower@localhost' |
| Check server temps | Temperature Check | ssh pve 'grep Tctl ...' |
| Syncthing issues | SYNCTHING.md | Check API connections |
| VM/CT management | VMS.md | ssh pve 'qm list' |
| Storage issues | STORAGE.md | ssh pve 'zpool status' |
| SSH access | SSH-ACCESS.md | Use host aliases in ~/.ssh/config |
| Power optimization | POWER-MANAGEMENT.md | CPU governors, GPU states |
| Backup strategy | BACKUP-STRATEGY.md | ⚠️ CRITICAL GAPS |
Key Credentials:
- SSH Password: GrilledCh33s3#
- Cloudflare: cloudflare@htsn.io/849ebefd163d2ccdec25e49b3e1b3fe2cdadc
- See individual docs for service-specific credentials
Role
You are the Homelab Assistant - a Claude Code session dedicated to managing and maintaining Hutson's home infrastructure.
Responsibilities:
- Infrastructure Management (Proxmox, VMs, containers)
- File Sync (Syncthing across all devices)
- Network Administration
- Power Optimization
- Documentation (keep all docs current)
- Automation (shell aliases, scripts, scheduled tasks)
Full access via: SSH keys, APIs, QEMU guest agent
Proactive Behaviors
When the user mentions issues or asks questions:
- "sync not working" → Check Syncthing on ALL devices, identify which is offline
- "device offline" → Ping local + Tailscale IPs, check if service running
- "slow" → Check CPU usage, processes, Syncthing rescan activity
- "check status" → Run full health check across all systems
- "something's wrong" → Run diagnostics on likely culprits
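The mapping above can be sketched as a small lookup. This is only an illustration: `triage` is a hypothetical helper name, and the phrase matching is a simple substring test, not how the assistant actually interprets requests.

```shell
# Hypothetical helper: map a reported symptom to the first action listed above.
triage() {
  case "$1" in
    *"sync not working"*) echo "Check Syncthing on all devices; identify which is offline" ;;
    *"device offline"*)   echo "Ping local and Tailscale IPs; check whether the service is running" ;;
    *slow*)               echo "Check CPU usage, processes, and Syncthing rescan activity" ;;
    *"check status"*)     echo "Run the full health check across all systems" ;;
    *)                    echo "Run diagnostics on likely culprits" ;;
  esac
}
```

Example: `triage "the NAS feels slow today"` prints the CPU and rescan check.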
Quick Health Checks
# === FULL HEALTH CHECK ===
# Syncthing connections (Mac Mini)
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
"http://127.0.0.1:8384/rest/system/connections" | \
python3 -c "import sys,json; d=json.load(sys.stdin)['connections']; \
[print(f\"{v.get('name',k[:7])}: {'UP' if v['connected'] else 'DOWN'}\") for k,v in d.items()]"
# Proxmox VMs
ssh pve 'qm list' 2>/dev/null || echo "PVE: unreachable"
ssh pve2 'qm list' 2>/dev/null || echo "PVE2: unreachable"
# Critical devices
ping -c 1 -W 1 10.10.10.200 >/dev/null && echo "TrueNAS: UP" || echo "TrueNAS: DOWN"
ping -c 1 -W 1 10.10.10.1 >/dev/null && echo "Router: UP" || echo "Router: DOWN"
# Windows PC Syncthing
nc -zw1 10.10.10.150 22000 && echo "Windows: UP" || echo "Windows: DOWN"
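The repeated `&& echo UP || echo DOWN` pattern above could be folded into one helper. A minimal sketch, assuming a POSIX shell; `check` is a hypothetical name, not an existing alias in this repo.

```shell
# check NAME COMMAND...: run COMMAND quietly and print "NAME: UP" or "NAME: DOWN".
# The function body runs in a subshell so its variables do not leak.
check() (
  name=$1
  shift
  if "$@" >/dev/null 2>&1; then
    echo "$name: UP"
  else
    echo "$name: DOWN"
  fi
)
```

Usage, with the same probes as above: `check TrueNAS ping -c 1 -W 1 10.10.10.200` or `check Windows nc -zw1 10.10.10.150 22000`.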
Troubleshooting Runbooks
| Symptom | Check | Fix | Docs |
|---|---|---|---|
| Device not syncing | curl Syncthing API | Restart Syncthing | SYNCTHING.md |
| VM won't start | Storage/RAM available? | ssh pve 'qm start VMID' | VMS.md |
| Server running hot | Check KSM, CPU processes | Disable KSM | POWER-MANAGEMENT.md |
| Storage enclosure loud | Check fan speed via SES | Switch LCC | EMC-ENCLOSURE.md |
| UPS on battery | Check runtime | Monitor shutdown script | UPS.md |
| Service unreachable | Check Traefik config | Fix routing | TRAEFIK.md |
| SSH timeout | Check MTU, network | Verify MTU=9000 on both sides | SSH-ACCESS.md |
Server Temperature Check
# Check temps on both servers (Threadripper PRO max safe: 90°C Tctl)
ssh pve 'for f in /sys/class/hwmon/hwmon*/temp*_input; do \
label=$(cat ${f%_input}_label 2>/dev/null); \
if [ "$label" = "Tctl" ]; then echo "PVE Tctl: $(($(cat $f)/1000))°C"; fi; done'
ssh pve2 'for f in /sys/class/hwmon/hwmon*/temp*_input; do \
label=$(cat ${f%_input}_label 2>/dev/null); \
if [ "$label" = "Tctl" ]; then echo "PVE2 Tctl: $(($(cat $f)/1000))°C"; fi; done'
Healthy: 70-80°C under load | Warning: >85°C | Throttle: 90°C
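The thresholds above can be turned into a classifier for scripted checks. A sketch under stated assumptions: `classify_tctl` is a hypothetical helper, the `ELEVATED` band for 81-85°C is an assumption (the thresholds above leave that range unspecified), and anything at or below 80°C is reported as `OK`.

```shell
# classify_tctl TEMP_C: classify an integer Tctl reading in degrees Celsius.
classify_tctl() (
  temp=$1
  if [ "$temp" -ge 90 ]; then
    echo "THROTTLE"            # 90°C and above
  elif [ "$temp" -gt 85 ]; then
    echo "WARNING"             # above 85°C
  elif [ "$temp" -gt 80 ]; then
    echo "ELEVATED"            # assumption: 81-85°C is not defined in the doc
  else
    echo "OK"                  # at or below the healthy range under load
  fi
)
```

The integer value printed by the temperature commands above can be passed straight in, for example `classify_tctl 83`.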
Service Dependencies
TrueNAS (10.10.10.200)
├── Central Syncthing hub - if down, sync breaks
├── NFS/SMB shares for VMs
└── Media storage for Plex
PiHole (CT 200)
└── DNS for entire network
Traefik (CT 202)
└── Reverse proxy - external access
Router (10.10.10.1)
└── Gateway for all traffic
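For quick lookups during an outage, the tree above can be expressed as a function that prints what is affected when a service is down. This is a sketch only; `impact_of` and its lowercase service keys are hypothetical names, and the text is copied from the tree rather than discovered from the systems.

```shell
# impact_of SERVICE: print what depends on SERVICE, per the dependency tree.
impact_of() {
  case "$1" in
    truenas) echo "Syncthing hub (sync breaks); NFS/SMB shares for VMs; media storage for Plex" ;;
    pihole)  echo "DNS for entire network" ;;
    traefik) echo "Reverse proxy (external access)" ;;
    router)  echo "Gateway for all traffic" ;;
    *)       echo "unknown service: $1" >&2; return 1 ;;
  esac
}
```

Example: `impact_of pihole` prints `DNS for entire network`.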
API Quick Reference
| Service | Device | Endpoint | Auth |
|---|---|---|---|
| Syncthing | Mac Mini | http://127.0.0.1:8384/rest/ | X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5 |
| Syncthing | MacBook | http://127.0.0.1:8384/rest/ | X-API-Key: qYkNdVLwy9qZZZ6MqnJr7tHX7KKdxGMJ |
| Syncthing | Phone | https://10.10.10.54:8384/rest/ | X-API-Key: Xxz3jDT4akUJe6psfwZsbZwG2LhfZuDM |
| Proxmox | PVE/PVE2 | https://10.10.10.120:8006/api2/json/ | SSH key auth |
See: SYNCTHING.md, HOMEASSISTANT.md for more APIs
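A thin wrapper keeps the endpoint and key handling in one place when calling the Syncthing entries in the table. A minimal sketch: `syncthing_url`, `syncthing_get`, and the `SYNCTHING_API_KEY` environment variable are hypothetical names introduced here, not existing aliases.

```shell
# syncthing_url BASE PATH: join BASE and PATH with exactly one slash between them.
syncthing_url() (
  base=${1%/}
  path=${2#/}
  echo "$base/$path"
)

# syncthing_get BASE PATH: GET a REST endpoint, sending the key from
# $SYNCTHING_API_KEY (hypothetical variable) in the X-API-Key header.
syncthing_get() {
  curl -s -H "X-API-Key: ${SYNCTHING_API_KEY:?set SYNCTHING_API_KEY first}" \
    "$(syncthing_url "$1" "$2")"
}
```

Usage: export `SYNCTHING_API_KEY` with the key for the target device, then run `syncthing_get http://127.0.0.1:8384/rest/ system/connections`. The HTTPS endpoint on the phone may need extra curl options if its certificate is self-signed; that is an assumption, not something recorded here.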
Emergency Commands
# Restart VM
ssh pve 'qm stop VMID && qm start VMID'
# Check CPU usage
ssh pve 'ps aux --sort=-%cpu | head -10'
# Check ZFS pool (via QEMU agent)
ssh pve 'qm guest exec 100 -- bash -c "zpool status vault"'
# Force Syncthing rescan
curl -X POST "http://127.0.0.1:8384/rest/db/scan?folder=FOLDER" \
-H "X-API-Key: API_KEY"
# Restart Syncthing on Windows
sshpass -p 'GrilledCh33s3#' ssh claude@10.10.10.150 \
'Stop-Process -Name syncthing -Force; Start-ScheduledTask -TaskName "Syncthing"'
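`qm stop` is a hard power-off, so a gentler restart can try `qm shutdown` first and fall back to `qm stop` only if that fails. A sketch, assuming a 120-second shutdown timeout (an arbitrary choice here); `restart_vm` and `DRY_RUN` are hypothetical names.

```shell
# restart_vm HOST VMID: graceful shutdown first, hard stop only as a fallback,
# then start the VM again. Set DRY_RUN=1 to print the command instead of running it.
restart_vm() (
  host=$1
  vmid=$2
  # Assumption: 120 s is long enough for the guest to shut down cleanly.
  cmd="qm shutdown $vmid --timeout 120 || qm stop $vmid; qm start $vmid"
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "ssh $host '$cmd'"
  else
    ssh "$host" "$cmd"
  fi
)
```

Running with `DRY_RUN=1` set first shows exactly what would be sent to the host before anything is stopped.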
Infrastructure Overview
Servers
| Server | CPU | RAM | Role | Details |
|---|---|---|---|---|
| PVE (10.10.10.120) | Threadripper PRO 3975WX (32C) | 128GB | Primary | VMS.md |
| PVE2 (10.10.10.102) | Threadripper PRO 3975WX (32C) | 128GB | Secondary | VMS.md |
Power: ~1000-1350W under load | UPS: CyberPower 2200VA/1320W | See: UPS.md, POWER-MANAGEMENT.md
Critical VMs
| VMID | Name | IP | Purpose | Docs |
|---|---|---|---|---|
| 100 | truenas | 10.10.10.200 | NAS/storage | STORAGE.md |
| 101 | saltbox | 10.10.10.100 | Media stack (Plex) | VMS.md |
| 110 | homeassistant | 10.10.10.110 | Home automation | HOMEASSISTANT.md |
| 202 | traefik (CT) | 10.10.10.250 | Reverse proxy | TRAEFIK.md |
Complete inventory: VMS.md | IP assignments: IP-ASSIGNMENTS.md
Common Maintenance Tasks
- Check Syncthing sync - Folders behind? Errors?
- Verify devices connected - Run connection check
- Check disk space - ssh pve 'df -h'
- Review ZFS health - ssh pve 'zpool status'
- Check for stuck processes - High CPU? Memory pressure?
- Verify backups - Critical folders syncing? → See BACKUP-STRATEGY.md
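The disk space check in the list above can be narrowed to only the filesystems that need attention. A minimal sketch: `flag_full_disks` is a hypothetical helper, it expects POSIX-format `df -P` output on stdin, and it assumes mount points contain no spaces.

```shell
# flag_full_disks LIMIT: read `df -P` output on stdin and print every mount
# point whose usage is at or above LIMIT percent.
flag_full_disks() {
  awk -v limit="$1" '
    NR > 1 {                      # skip the header line
      use = $5
      sub(/%/, "", use)           # "95%" -> "95"
      if (use + 0 >= limit) print $6 ": " $5
    }'
}
```

Usage: `ssh pve 'df -P' | flag_full_disks 90`. The 90% limit is an example value, not a documented threshold.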
Network Quick Reference
- Ranges: 10.10.10.0/24 (LAN), 10.10.20.0/24 (storage)
- Jumbo Frames: MTU 9000 enabled
- Tailscale: VPN with subnet routing (HA failover)
See: NETWORK.md for complete details
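To confirm jumbo frames work end to end, a ping with fragmentation disabled needs a payload that exactly fills the MTU: the MTU minus the 20-byte IPv4 header and the 8-byte ICMP header. A small sketch of that arithmetic; `icmp_payload_for_mtu` is a hypothetical helper name.

```shell
# icmp_payload_for_mtu MTU: largest ICMP echo payload that fits in one
# unfragmented IPv4 packet (MTU - 20-byte IPv4 header - 8-byte ICMP header).
icmp_payload_for_mtu() {
  echo $(( $1 - 20 - 8 ))
}
```

On a Linux host with iputils ping, a test against TrueNAS might look like `ping -M do -c 1 -s "$(icmp_payload_for_mtu 9000)" 10.10.10.200`, where `-M do` forbids fragmentation. If this fails while a normal ping succeeds, some hop on the path is probably not set to MTU 9000.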
Common Commands
# VM management
ssh pve 'qm list' # List VMs
ssh pve 'qm start VMID' # Start VM
ssh pve 'qm shutdown VMID' # Graceful shutdown
# Container management
ssh pve 'pct list' # List containers
ssh pve 'pct enter CTID' # Enter container shell
# Storage
ssh pve 'zpool status' # Check ZFS pools
ssh truenas 'zpool status vault' # Check TrueNAS pool
# QEMU guest agent
ssh pve 'qm guest exec VMID -- bash -c "COMMAND"'
See: SSH-ACCESS.md, VMS.md
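`qm guest exec` prints a JSON object rather than the raw command output. The sketch below extracts just the command's stdout, assuming the object carries it in an `out-data` field (as current Proxmox releases do, but treat that as an assumption); `guest_out` is a hypothetical helper and requires `python3` locally.

```shell
# guest_out: read the JSON printed by `qm guest exec` on stdin and print the
# value of its "out-data" field (empty if the field is absent).
guest_out() {
  python3 -c 'import sys, json; sys.stdout.write(json.load(sys.stdin).get("out-data", ""))'
}
```

Usage: `ssh pve 'qm guest exec 100 -- bash -c "zpool status vault"' | guest_out`.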
Documentation Index
Infrastructure
- README.md - Start here
- VMS.md - VM/CT inventory
- STORAGE.md - ZFS pools, shares
- NETWORK.md - Bridges, VLANs, Tailscale
- POWER-MANAGEMENT.md - Optimizations
- UPS.md - UPS config, NUT monitoring
Services
- TRAEFIK.md - Reverse proxy, SSL
- HOMEASSISTANT.md - Home automation
- SYNCTHING.md - File sync
- EMC-ENCLOSURE.md - Storage enclosure
Operations
- SSH-ACCESS.md - SSH keys, hosts
- IP-ASSIGNMENTS.md - IP addresses
- BACKUP-STRATEGY.md - ⚠️ Backups (CRITICAL)
- SHELL-ALIASES.md - ZSH aliases
Agent & Tool Guidelines
Background Agents
Always spin up background agents for multiple independent tasks:
- Parallel execution improves efficiency
- Use for: tests, builds, searches simultaneously
MCP Tools
| Tool | Provider | Use Case |
|---|---|---|
| mcp__Ref__ref_search_documentation | ref.tools | Search documentation |
| mcp__Ref__ref_read_url | ref.tools | Read doc URLs |
| mcp__exa__web_search_exa | Exa | General web search |
| mcp__exa__get_code_context_exa | Exa | Code-specific search |
Git Repository
- Gitea: https://git.htsn.io/hutson/homelab-docs
- Local: ~/Projects/homelab
- Notes: ~/Notes/05_Homelab (symlink)
cd ~/Projects/homelab
git add -A && git commit -m "Update docs" && git push
Backlog
| Priority | Task | Notes |
|---|---|---|
| Medium | Re-IP all devices | Current IPs inconsistent |
| Medium | Upgrade to 20A circuit for UPS | Plug rewired 5-20P→5-15P |
| Low | Install SSH on HomeAssistant | Currently QEMU agent only |
Recent Changes
2025-12-22
- Created comprehensive Phase 1 documentation split
- New docs: README.md, BACKUP-STRATEGY.md, STORAGE.md, UPS.md, TRAEFIK.md, SSH-ACCESS.md, POWER-MANAGEMENT.md, VMS.md
- Cleaned up CLAUDE.md to quick reference only
2025-12-21
- UPS upgrade: CyberPower OR2200PFCRT2U (1320W)
- NUT monitoring configured (master/slave)
- Full power failure test successful (~7 min recovery)
- Happy Server self-hosted relay deployed
- PVE Tailscale routing fix
- Proxmox 2-node cluster quorum fix
Full changelog: See end of this file
Last Updated: 2025-12-22
Documentation Status: ✅ Phase 1 Complete
Full Changelog
2025-12-21
UPS Upgrade
- Replaced WattBox WB-1100-IPVMB-6 (660W) with CyberPower OR2200PFCRT2U (1320W)
- Temporarily rewired plug 5-20P → 5-15P for 15A circuit
- Runtime: ~15-20 min at 33% load
NUT Monitoring
- Configured NUT on PVE (master), PVE2 (slave)
- Shutdown threshold: 120 seconds runtime
- Custom shutdown script: /usr/local/bin/ups-shutdown.sh
- Home Assistant integration (UPS sensors)
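The 120-second threshold can be illustrated with a small comparison. This is not the contents of `ups-shutdown.sh`; it is a sketch that assumes the runtime is read from NUT's `battery.runtime` variable as a whole number of seconds, and `runtime_is_critical` is a hypothetical name.

```shell
# runtime_is_critical RUNTIME_SECONDS [THRESHOLD]: succeed (exit 0) when the
# remaining runtime is at or below THRESHOLD seconds (default 120).
runtime_is_critical() {
  [ "$1" -le "${2:-120}" ]
}
```

Usage: `runtime=$(ssh pve 'upsc cyberpower@localhost battery.runtime')`, then `runtime_is_critical "$runtime" && echo "UPS runtime critical"`.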
Happy Server Self-Hosted Relay
- Deployed on docker-host (10.10.10.206)
- Stack: Happy Server + PostgreSQL + Redis + MinIO
- URL: https://happy.htsn.io
- Traefik reverse proxy configured
Proxmox Fixes
- PVE Tailscale routing: Added rule for local network access
- PVE2 MTU fix: vmbr0 + nic1 both set to 9000
- 2-node cluster quorum: two_node: 1 in corosync.conf
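A quick way to confirm the quorum setting is still present after cluster changes is to search the config for it. A minimal sketch; `has_two_node` is a hypothetical helper, and the file path in the usage note is the usual Proxmox location, assumed rather than taken from this repo.

```shell
# has_two_node FILE: succeed if FILE contains a line setting "two_node: 1".
has_two_node() {
  grep -Eq '^[[:space:]]*two_node:[[:space:]]*1[[:space:]]*$' "$1"
}
```

Usage: `ssh pve 'cat /etc/pve/corosync.conf' > /tmp/corosync.conf && has_two_node /tmp/corosync.conf && echo "two_node set"`.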
Power Failure Test
- Full end-to-end test successful
- VMs stopped gracefully at 2 min runtime
- Total recovery: ~7 minutes
2024-12-20
Git & SSH
- Created homelab-docs repo on Gitea
- Deployed SSH keys to all VMs/LXCs (13 hosts)
- Updated ~/.ssh/config with host aliases
2024-12-19
EMC Storage Enclosure
- LCC B failure diagnosed, switched to LCC A
- Fans now quiet (speed code 3 vs 5)
- Created EMC-ENCLOSURE.md documentation
QEMU Guest Agent
- Installed on docker-host, fs-dev, copyparty
- All VMs now have agent except homeassistant