homelab-docs/CLAUDE.md
2026-01-11 16:17:56 -05:00

# Homelab Infrastructure - Quick Reference
**Start here**: [README.md](README.md) - Documentation index and overview

This is your **quick reference guide** for common homelab tasks. For detailed information, see the specialized documentation files linked below.

---
## Quick Reference - Common Tasks
| Task | Documentation | Quick Command |
|------|--------------|---------------|
| **Gateway issues** | [GATEWAY.md](GATEWAY.md) | `ssh ucg-fiber 'free -m'` |
| **Tailscale/VPN issues** | [TAILSCALE.md](TAILSCALE.md) | `tailscale status` |
| **Add new public service** | [TRAEFIK.md](TRAEFIK.md) | Create Traefik config + Cloudflare DNS |
| **Check UPS status** | [UPS.md](UPS.md) | `ssh pve 'upsc cyberpower@localhost'` |
| **Check server temps** | [Temperature Check](#server-temperature-check) | `ssh pve 'grep Tctl ...'` |
| **Syncthing issues** | [SYNCTHING.md](SYNCTHING.md) | Check API connections |
| **VM/CT management** | [VMS.md](VMS.md) | `ssh pve 'qm list'` |
| **Storage issues** | [STORAGE.md](STORAGE.md) | `ssh pve 'zpool status'` |
| **SSH access** | [SSH-ACCESS.md](SSH-ACCESS.md) | Use host aliases in `~/.ssh/config` |
| **Power optimization** | [POWER-MANAGEMENT.md](POWER-MANAGEMENT.md) | CPU governors, GPU states |
| **Backup strategy** | [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md) | ⚠️ CRITICAL GAPS |

**Key Credentials:**
- SSH Password: `GrilledCh33s3#`
- Cloudflare: `cloudflare@htsn.io` / `849ebefd163d2ccdec25e49b3e1b3fe2cdadc`
- See individual docs for service-specific credentials
---
## Role
You are the **Homelab Assistant** - a Claude Code session dedicated to managing and maintaining Hutson's home infrastructure.

**Responsibilities:**
- Infrastructure Management (Proxmox, VMs, containers)
- File Sync (Syncthing across all devices)
- Network Administration
- Power Optimization
- Documentation (keep all docs current)
- Automation (shell aliases, scripts, scheduled tasks)

**Full access via**: SSH keys, APIs, QEMU guest agent

---
## Proactive Behaviors
When the user mentions issues or asks questions:
- **"sync not working"** → Check Syncthing on ALL devices, identify which is offline
- **"device offline"** → Ping local + Tailscale IPs, check if service running
- **"slow"** → Check CPU usage, processes, Syncthing rescan activity
- **"check status"** → Run full health check across all systems
- **"something's wrong"** → Run diagnostics on likely culprits
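The "device offline" check can be sketched as a small helper. This is a minimal sketch, not an existing script: `check_device` is a hypothetical name, and the Tailscale IP must be supplied by the caller because this file does not list Tailscale addresses.

```bash
# check_device NAME LAN_IP TAILSCALE_IP (hypothetical helper)
# Pings both addresses once and reports each path separately, so a device
# that is reachable on the LAN but not over Tailscale (or vice versa) stands out.
check_device() {
  name="$1"; lan_ip="$2"; ts_ip="$3"
  lan="DOWN"; ts="DOWN"
  ping -c 1 -W 1 "$lan_ip" >/dev/null 2>&1 && lan="UP"
  ping -c 1 -W 1 "$ts_ip" >/dev/null 2>&1 && ts="UP"
  echo "$name: LAN=$lan Tailscale=$ts"
}

# Usage (the second address is a placeholder, not a real assignment):
# check_device TrueNAS 10.10.10.200 <tailscale-ip>
```

If both paths are up but the service still misbehaves, the next step is the service-level check (for example the Syncthing API call in Quick Health Checks).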
---
## Quick Health Checks
```bash
# === FULL HEALTH CHECK ===
# Syncthing connections (Mac Mini)
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
"http://127.0.0.1:8384/rest/system/connections" | \
python3 -c "import sys,json; d=json.load(sys.stdin)['connections']; \
[print(f\"{v.get('name',k[:7])}: {'UP' if v['connected'] else 'DOWN'}\") for k,v in d.items()]"
# Proxmox VMs
ssh pve 'qm list' 2>/dev/null || echo "PVE: unreachable"
ssh pve2 'qm list' 2>/dev/null || echo "PVE2: unreachable"
# Critical devices
ping -c 1 -W 1 10.10.10.200 >/dev/null && echo "TrueNAS: UP" || echo "TrueNAS: DOWN"
ping -c 1 -W 1 10.10.10.1 >/dev/null && echo "Router: UP" || echo "Router: DOWN"
# Windows PC Syncthing
nc -zw1 10.10.10.150 22000 && echo "Windows: UP" || echo "Windows: DOWN"
```
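The `cmd && echo UP || echo DOWN` pattern above repeats for each device. One possible way to factor it out is a wrapper like the following; `probe` is a hypothetical name, not something defined elsewhere in these docs.

```bash
# probe LABEL COMMAND [ARGS...] (hypothetical helper)
# Runs the command quietly and prints "LABEL: UP" or "LABEL: DOWN".
# Returns non-zero on DOWN so callers can count failures.
probe() {
  label="$1"; shift
  if "$@" >/dev/null 2>&1; then
    echo "$label: UP"
  else
    echo "$label: DOWN"
    return 1
  fi
}

# Usage, mirroring the checks above:
# probe "TrueNAS" ping -c 1 -W 1 10.10.10.200
# probe "Router"  ping -c 1 -W 1 10.10.10.1
# probe "Windows" nc -zw1 10.10.10.150 22000
```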
---
## Troubleshooting Runbooks
| Symptom | Check | Fix | Docs |
|---------|-------|-----|------|
| **Network down** | `ssh ucg-fiber 'free -m'` | Check memory; the watchdog auto-reboots the gateway on connectivity loss | [GATEWAY.md](GATEWAY.md) |
| **Tailscale DNS not working** | `tailscale status` | Check PVE online, subnet routing | [TAILSCALE.md](TAILSCALE.md) |
| **Subnet unreachable** | `ping 10.10.10.10` | Check `--accept-routes` on local devices | [TAILSCALE.md](TAILSCALE.md) |
| **Relay-only connections** | `tailscale ping <ip>` | Check for VPN conflicts, restart tailscaled | [TAILSCALE.md](TAILSCALE.md) |
| Device not syncing | `curl Syncthing API` | Restart Syncthing | [SYNCTHING.md](SYNCTHING.md) |
| VM won't start | Storage/RAM available? | `ssh pve 'qm start VMID'` | [VMS.md](VMS.md) |
| Server running hot | Check KSM, CPU processes | Disable KSM | [POWER-MANAGEMENT.md](POWER-MANAGEMENT.md) |
| Storage enclosure loud | Check fan speed via SES | Switch LCC | [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md) |
| UPS on battery | Check runtime | Monitor shutdown script | [UPS.md](UPS.md) |
| Service unreachable | Check Traefik config | Fix routing | [TRAEFIK.md](TRAEFIK.md) |
| SSH timeout | Check MTU, network | Verify MTU=9000 on both sides | [SSH-ACCESS.md](SSH-ACCESS.md) |
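For the SSH timeout row, one way to verify that MTU 9000 actually works end to end is a single ping with the Don't Fragment bit set. This is a sketch under assumptions: `icmp_payload` and `check_jumbo_path` are hypothetical names, and the `-M do` flag is the Linux iputils form (macOS `ping` uses `-D` instead).

```bash
# Largest ICMP payload that fits in one unfragmented IPv4 packet:
# MTU minus 20 bytes (IPv4 header) minus 8 bytes (ICMP header).
icmp_payload() {
  echo $(( $1 - 28 ))
}

# check_jumbo_path HOST [MTU] (hypothetical helper, Linux ping syntax)
# Sends one full-size packet with fragmentation prohibited.
check_jumbo_path() {
  host="$1"; mtu="${2:-9000}"
  if ping -c 1 -W 1 -M do -s "$(icmp_payload "$mtu")" "$host" >/dev/null 2>&1; then
    echo "$host: MTU $mtu path OK"
  else
    echo "$host: MTU $mtu path FAILED"
  fi
}

# Usage: check_jumbo_path 10.10.10.120
```

A failure here with a passing normal ping suggests an MTU mismatch somewhere on the path rather than a dead host.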
---
## Server Temperature Check
```bash
# Check temps on both servers (Threadripper PRO max safe: 90°C Tctl)
ssh pve 'for f in /sys/class/hwmon/hwmon*/temp*_input; do \
label=$(cat ${f%_input}_label 2>/dev/null); \
if [ "$label" = "Tctl" ]; then echo "PVE Tctl: $(($(cat $f)/1000))°C"; fi; done'
ssh pve2 'for f in /sys/class/hwmon/hwmon*/temp*_input; do \
label=$(cat ${f%_input}_label 2>/dev/null); \
if [ "$label" = "Tctl" ]; then echo "PVE2 Tctl: $(($(cat $f)/1000))°C"; fi; done'
```
**Healthy**: 70-80°C under load | **Warning**: >85°C | **Throttle**: 90°C
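The thresholds above can be applied mechanically. The following is a minimal sketch; `classify_tctl` is a hypothetical name, and treating everything at or below 85°C as OK is an assumption, since the thresholds do not say how to label 80-85°C.

```bash
# classify_tctl TEMP_C (hypothetical helper)
# Maps a whole-degree Tctl reading to a status label:
#   >= 90  THROTTLE
#   >  85  WARNING
#   else   OK   (assumption: the 80-85 range is not defined above)
classify_tctl() {
  temp="$1"
  if [ "$temp" -ge 90 ]; then
    echo "THROTTLE"
  elif [ "$temp" -gt 85 ]; then
    echo "WARNING"
  else
    echo "OK"
  fi
}

# Usage: classify_tctl 78
```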
---
## Service Dependencies
```
TrueNAS (10.10.10.200)
├── Central Syncthing hub - if down, sync breaks
├── NFS/SMB shares for VMs
└── Media storage for Plex
PiHole (CT 200)
└── DNS for entire network
Traefik (CT 202)
└── Reverse proxy - external access
Router (10.10.10.1)
└── Gateway for all traffic
```
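The same dependency tree can be expressed as a lookup for use in scripts. This is only a sketch: `impact_of` and its lowercase service keys are hypothetical, and the impact strings are copied from the tree above.

```bash
# impact_of SERVICE (hypothetical helper)
# Prints what breaks when the given service is down.
impact_of() {
  case "$1" in
    truenas) echo "Syncthing hub, NFS/SMB shares for VMs, Plex media storage" ;;
    pihole)  echo "DNS for entire network" ;;
    traefik) echo "Reverse proxy, external access" ;;
    router)  echo "Gateway for all traffic" ;;
    *)       echo "unknown service: $1" >&2; return 1 ;;
  esac
}

# Usage: impact_of truenas
```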
---
## API Quick Reference
| Service | Device | Endpoint | Auth |
|---------|--------|----------|------|
| Syncthing | Mac Mini | `http://127.0.0.1:8384/rest/` | `X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5` |
| Syncthing | MacBook | `http://127.0.0.1:8384/rest/` | `X-API-Key: qYkNdVLwy9qZZZ6MqnJr7tHX7KKdxGMJ` |
| Syncthing | Phone | `https://10.10.10.54:8384/rest/` | `X-API-Key: Xxz3jDT4akUJe6psfwZsbZwG2LhfZuDM` |
| Proxmox | PVE/PVE2 | `https://10.10.10.120:8006/api2/json/` | SSH key auth |
| MetaMCP | docker-host2 | `https://metamcp.htsn.io/` | Web UI login |
| n8n | docker-host2 | `http://10.10.10.207:5678/api/v1/` | `X-N8N-API-KEY` (see [N8N.md](N8N.md)) |

**See**: [SYNCTHING.md](SYNCTHING.md), [HOMEASSISTANT.md](HOMEASSISTANT.md), [N8N.md](N8N.md) for more APIs
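A Syncthing call against the endpoints in the table can be wrapped so the key comes from the environment rather than the command line. This is a sketch, not an existing alias: `st_get`, `SYNCTHING_API_KEY`, and `SYNCTHING_URL` are hypothetical names.

```bash
# st_get PATH (hypothetical helper)
# GETs a Syncthing REST path. The key is read from SYNCTHING_API_KEY and
# the base URL from SYNCTHING_URL (defaults to the local instance).
st_get() {
  : "${SYNCTHING_API_KEY:?set SYNCTHING_API_KEY first}"
  curl -s -H "X-API-Key: ${SYNCTHING_API_KEY}" \
    "${SYNCTHING_URL:-http://127.0.0.1:8384}/rest/$1"
}

# Usage:
# export SYNCTHING_API_KEY=...   # key for the target device
# st_get system/connections
```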
---
## Emergency Commands
```bash
# Restart VM
ssh pve 'qm stop VMID && qm start VMID'
# Check CPU usage
ssh pve 'ps aux --sort=-%cpu | head -10'
# Check ZFS pool (via QEMU agent)
ssh pve 'qm guest exec 100 -- bash -c "zpool status vault"'
# Force Syncthing rescan
curl -X POST "http://127.0.0.1:8384/rest/db/scan?folder=FOLDER" \
-H "X-API-Key: API_KEY"
# Restart Syncthing on Windows
sshpass -p 'GrilledCh33s3#' ssh claude@10.10.10.150 \
'Stop-Process -Name syncthing -Force; Start-ScheduledTask -TaskName "Syncthing"'
```
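`qm stop` is a hard stop. When the guest is still responsive, a gentler restart is to try `qm shutdown` with a timeout first and fall back to `qm stop` only if that fails. The wrapper below is a sketch: `restart_vm` is a hypothetical name, and the 120-second default is an assumed value.

```bash
# restart_vm HOST VMID [TIMEOUT_SECONDS] (hypothetical helper)
# Tries a clean guest shutdown, falls back to a hard stop, then starts the VM.
restart_vm() {
  host="$1"; vmid="$2"; secs="${3:-120}"
  case "$vmid" in
    ''|*[!0-9]*) echo "VMID must be numeric: '$vmid'" >&2; return 2 ;;
  esac
  ssh "$host" "qm shutdown $vmid --timeout $secs || qm stop $vmid; qm start $vmid"
}

# Usage: restart_vm pve 101
```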
---
## Infrastructure Overview
### Servers
| Server | CPU | RAM | Role | Details |
|--------|-----|-----|------|---------|
| **PVE** (10.10.10.120) | Threadripper PRO 3975WX (32C) | 128GB | Primary | [VMS.md](VMS.md) |
| **PVE2** (10.10.10.102) | Threadripper PRO 3975WX (32C) | 128GB | Secondary | [VMS.md](VMS.md) |

**Power**: ~1000-1350W under load | **UPS**: CyberPower 2200VA/1320W | **See**: [UPS.md](UPS.md), [POWER-MANAGEMENT.md](POWER-MANAGEMENT.md)
### Critical VMs
| VMID | Name | IP | Purpose | Docs |
|------|------|-----|---------|------|
| 100 | truenas | 10.10.10.200 | NAS/storage | [STORAGE.md](STORAGE.md) |
| 101 | saltbox | 10.10.10.100 | Media stack (Plex) | [VMS.md](VMS.md) |
| 110 | homeassistant | 10.10.10.110 | Home automation | [HOMEASSISTANT.md](HOMEASSISTANT.md) |
| 202 | traefik (CT) | 10.10.10.250 | Reverse proxy | [TRAEFIK.md](TRAEFIK.md) |
| 206 | docker-host | 10.10.10.206 | Monitoring stack (Grafana/Prometheus) | [VMS.md](VMS.md) |
| 302 | docker-host2 | 10.10.10.207 | MetaMCP, n8n, automation | [VMS.md](VMS.md) |

**Complete inventory**: [VMS.md](VMS.md) | **IP assignments**: [IP-ASSIGNMENTS.md](IP-ASSIGNMENTS.md)

---
## Common Maintenance Tasks
1. **Check Syncthing sync** - Folders behind? Errors?
2. **Verify devices connected** - Run connection check
3. **Check disk space** - `ssh pve 'df -h'`
4. **Review ZFS health** - `ssh pve 'zpool status'`
5. **Check for stuck processes** - High CPU? Memory pressure?
6. **Verify backups** - Critical folders syncing? → See [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md)
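The disk space check in step 3 can be narrowed to the filesystems that need attention. This is a sketch under assumptions: `flag_full_filesystems` is a hypothetical name, the 85% default is an assumed threshold, and mount points containing spaces are not handled.

```bash
# flag_full_filesystems [THRESHOLD] (hypothetical helper)
# Reads `df -P` output on stdin and prints every mount point whose
# capacity is at or above THRESHOLD percent (default 85).
flag_full_filesystems() {
  awk -v limit="${1:-85}" 'NR > 1 {
    use = $5
    sub(/%/, "", use)
    if (use + 0 >= limit + 0) print $6 " is " $5 " full"
  }'
}

# Usage: ssh pve 'df -P' | flag_full_filesystems 85
```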
---
## Network Quick Reference
- **Ranges**: 10.10.10.0/24 (LAN), 10.10.20.0/24 (storage)
- **Jumbo Frames**: MTU 9000 enabled
- **Tailscale**: VPN with subnet routing (HA failover)

**See**: [NETWORK.md](NETWORK.md) for complete details

---
## Common Commands
```bash
# VM management
ssh pve 'qm list' # List VMs
ssh pve 'qm start VMID' # Start VM
ssh pve 'qm shutdown VMID' # Graceful shutdown
# Container management
ssh pve 'pct list' # List containers
ssh pve 'pct enter CTID' # Enter container shell
# Storage
ssh pve 'zpool status' # Check ZFS pools
ssh truenas 'zpool status vault' # Check TrueNAS pool
# QEMU guest agent
ssh pve 'qm guest exec VMID -- bash -c "COMMAND"'
```
**See**: [SSH-ACCESS.md](SSH-ACCESS.md), [VMS.md](VMS.md)

---
## Documentation Index
### Infrastructure
- [README.md](README.md) - Start here
- [GATEWAY.md](GATEWAY.md) - UniFi gateway, monitoring services
- [TAILSCALE.md](TAILSCALE.md) - VPN, subnet routing, DNS
- [VMS.md](VMS.md) - VM/CT inventory
- [STORAGE.md](STORAGE.md) - ZFS pools, shares
- [NETWORK.md](NETWORK.md) - Bridges, VLANs, MTU
- [POWER-MANAGEMENT.md](POWER-MANAGEMENT.md) - Optimizations
- [UPS.md](UPS.md) - UPS config, NUT monitoring
### Services
- [TRAEFIK.md](TRAEFIK.md) - Reverse proxy, SSL
- [HOMEASSISTANT.md](HOMEASSISTANT.md) - Home automation
- [SYNCTHING.md](SYNCTHING.md) - File sync
- [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md) - Storage enclosure
- [MONITORING.md](MONITORING.md) - System monitoring
### Operations
- [SSH-ACCESS.md](SSH-ACCESS.md) - SSH keys, hosts
- [IP-ASSIGNMENTS.md](IP-ASSIGNMENTS.md) - IP addresses
- [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md) - ⚠️ Backups (CRITICAL)
- [SHELL-ALIASES.md](SHELL-ALIASES.md) - ZSH aliases
---
## Agent & Tool Guidelines
### Background Agents
**Always** spin up background agents for multiple independent tasks:
- Parallel execution improves efficiency
- Use for running tests, builds, and searches simultaneously
### MCP Tools
| Tool | Provider | Use Case |
|------|----------|----------|
| `mcp__Ref__ref_search_documentation` | ref.tools | Search documentation |
| `mcp__Ref__ref_read_url` | ref.tools | Read doc URLs |
| `mcp__exa__web_search_exa` | Exa | General web search |
| `mcp__exa__get_code_context_exa` | Exa | Code-specific search |
---
## Git Repository
- **Gitea**: https://git.htsn.io/hutson/homelab-docs
- **Local**: `~/Projects/homelab`
- **Notes**: `~/Notes/05_Homelab` (symlink)
```bash
cd ~/Projects/homelab
git add -A && git commit -m "Update docs" && git push
```
---
## Backlog
| Priority | Task | Notes |
|----------|------|-------|
| Medium | Re-IP all devices | Current IPs inconsistent |
| Medium | Upgrade to 20A circuit for UPS | Plug rewired 5-20P→5-15P |
| Low | Install SSH on HomeAssistant | Currently QEMU agent only |
---
## Recent Changes
### 2026-01-11
- **BlueMap web map** for Minecraft Hutworld server
- URL: https://map.htsn.io (password protected: hutworld / Suwanna123)
- BlueMap 5.15 plugin installed
- Port 8100 exposed in Crafty docker-compose
- Traefik routing with basicAuth middleware
- Fixed corrupted ViaVersion/ViaBackwards plugins
- Documented 1.21+ spawner give command syntax
- Fixed Docker file permission issues in Crafty container
### 2026-01-05
- Created [TAILSCALE.md](TAILSCALE.md) - comprehensive Tailscale VPN documentation
- **Fixed Tailscale subnet routing issues:**
  - Switched primary subnet router from UCG-Fiber to PVE (gateway had relay-only connections)
  - Disabled `--accept-routes` on UCG-Fiber and PiHole (devices on subnet must not accept subnet routes)
  - Fixed PiHole ProtonVPN from full-tunnel to split-tunnel (DNS-only via fwmark routing)
  - **Root cause:** Devices directly on 10.10.10.0/24 with `--accept-routes=true` were routing local traffic through Tailscale mesh instead of local interface
  - **Key lesson:** Any device directly connected to an advertised subnet MUST have `--accept-routes=false`
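
The key lesson can be encoded as a quick check before changing a device's Tailscale settings. This is a minimal sketch: `on_home_lan` and `accept_routes_for` are hypothetical names, and only 10.10.10.0/24 is treated as advertised because that is the only subnet named in the root cause.

```bash
# on_home_lan IP (hypothetical helper)
# Succeeds if IP falls inside the advertised 10.10.10.0/24 subnet.
on_home_lan() {
  case "$1" in
    10.10.10.*) return 0 ;;
    *)          return 1 ;;
  esac
}

# accept_routes_for IP (hypothetical helper)
# Prints the --accept-routes value a device with this local IP should use.
accept_routes_for() {
  if on_home_lan "$1"; then echo "false"; else echo "true"; fi
}

# Usage on the device itself:
# tailscale set --accept-routes="$(accept_routes_for 10.10.10.120)"
```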
### 2026-01-03
- Deployed **Crafty Controller 4** on docker-host2 for Minecraft server management
  - URL: https://mc.htsn.io (Web GUI)
  - Minecraft Java: 10.10.10.207:25565
  - Minecraft Bedrock (Geyser): 10.10.10.207:19132/udp
  - Admin: `admin` / password in `/crafty/app/config/default-creds.txt`
  - World data to be migrated from Windows PC (D:\Minecraft\mcss\servers\hutworld)
- Deployed **MetaMCP** on docker-host2 (10.10.10.207) for unified MCP server management
  - URL: https://metamcp.htsn.io
- Added docker-host2 to SSH config (`~/.ssh/config`)
- Updated IP-ASSIGNMENTS.md, SSH-ACCESS.md, TRAEFIK.md with docker-host2
### 2026-01-02
- Created [GATEWAY.md](GATEWAY.md) - UniFi gateway documentation
- Deployed internet-watchdog service (auto-reboot on connectivity loss)
- Deployed memory-monitor service (logs memory usage every 10 min)
- Configured SSH key auth for gateway (`ucg-fiber`/`gateway` aliases)
- Disabled UniFi Connect to free ~200MB RAM
- Updated [MONITORING.md](MONITORING.md) with gateway monitoring
- Updated [SSH-ACCESS.md](SSH-ACCESS.md) with key auth for router
### 2025-12-22
- Created comprehensive Phase 1 documentation split
- New docs: README.md, BACKUP-STRATEGY.md, STORAGE.md, UPS.md, TRAEFIK.md, SSH-ACCESS.md, POWER-MANAGEMENT.md, VMS.md
- Cleaned up CLAUDE.md to quick reference only
### 2025-12-21
- UPS upgrade: CyberPower OR2200PFCRT2U (1320W)
- NUT monitoring configured (master/slave)
- Full power failure test successful (~7 min recovery)
- Happy Server self-hosted relay deployed
- PVE Tailscale routing fix
- Proxmox 2-node cluster quorum fix
**Full changelog**: See end of this file

---

**Last Updated**: 2026-01-11

**Documentation Status**: ✅ Phase 1 Complete + Gateway Monitoring + MetaMCP + Tailscale

---
<details>
<summary><b>Full Changelog (Click to expand)</b></summary>

### 2025-12-21
**UPS Upgrade**
- Replaced WattBox WB-1100-IPVMB-6 (660W) with CyberPower OR2200PFCRT2U (1320W)
- Temporarily rewired plug 5-20P → 5-15P for 15A circuit
- Runtime: ~15-20 min at 33% load

**NUT Monitoring**
- Configured NUT on PVE (master), PVE2 (slave)
- Shutdown threshold: 120 seconds runtime
- Custom shutdown script: `/usr/local/bin/ups-shutdown.sh`
- Home Assistant integration (UPS sensors)

**Happy Server Self-Hosted Relay**
- Deployed on docker-host (10.10.10.206)
- Stack: Happy Server + PostgreSQL + Redis + MinIO
- URL: https://happy.htsn.io
- Traefik reverse proxy configured

**Proxmox Fixes**
- PVE Tailscale routing: Added rule for local network access
- PVE2 MTU fix: vmbr0 + nic1 both set to 9000
- 2-node cluster quorum: `two_node: 1` in corosync.conf

**Power Failure Test**
- Full end-to-end test successful
- VMs stopped gracefully at 2 min runtime
- Total recovery: ~7 minutes
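
The 120-second shutdown threshold can be illustrated with a small decision function. This is a sketch only and not the contents of `/usr/local/bin/ups-shutdown.sh`: `runtime_action` is a hypothetical name, and treating a runtime exactly equal to the threshold as "shutdown" is an assumption.

```bash
# runtime_action RUNTIME_SECONDS [THRESHOLD_SECONDS] (hypothetical helper)
# Prints "shutdown" when the remaining battery runtime is at or below
# the threshold (default 120 s), otherwise "wait".
runtime_action() {
  runtime="$1"; threshold="${2:-120}"
  if [ "$runtime" -le "$threshold" ]; then
    echo "shutdown"
  else
    echo "wait"
  fi
}

# Usage while on battery (battery.runtime is reported by NUT in seconds):
# runtime_action "$(ssh pve 'upsc cyberpower@localhost battery.runtime')"
```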
### 2024-12-20
**Git & SSH**
- Created homelab-docs repo on Gitea
- Deployed SSH keys to all VMs/LXCs (13 hosts)
- Updated ~/.ssh/config with host aliases
### 2024-12-19
**EMC Storage Enclosure**
- LCC B failure diagnosed, switched to LCC A
- Fans now quiet (speed code 3 vs 5)
- Created EMC-ENCLOSURE.md documentation

**QEMU Guest Agent**
- Installed on docker-host, fs-dev, copyparty
- All VMs now have agent except homeassistant

</details>