# Homelab Infrastructure - Quick Reference

**Start here**: [README.md](README.md) - Documentation index and overview

This is your **quick reference guide** for common homelab tasks. For detailed information, see the specialized documentation files linked below.

---

## Quick Reference - Common Tasks

| Task | Documentation | Quick Command |
|------|--------------|---------------|
| **Gateway issues** | [GATEWAY.md](GATEWAY.md) | `ssh ucg-fiber 'free -m'` |
| **Tailscale/VPN issues** | [TAILSCALE.md](TAILSCALE.md) | `tailscale status` |
| **Add new public service** | [TRAEFIK.md](TRAEFIK.md) | Create Traefik config + Cloudflare DNS |
| **Check UPS status** | [UPS.md](UPS.md) | `ssh pve 'upsc cyberpower@localhost'` |
| **Check server temps** | [Temperature Check](#server-temperature-check) | `ssh pve 'grep Tctl ...'` |
| **Syncthing issues** | [SYNCTHING.md](SYNCTHING.md) | Check API connections |
| **VM/CT management** | [VMS.md](VMS.md) | `ssh pve 'qm list'` |
| **Storage issues** | [STORAGE.md](STORAGE.md) | `ssh pve 'zpool status'` |
| **SSH access** | [SSH-ACCESS.md](SSH-ACCESS.md) | Use host aliases in `~/.ssh/config` |
| **Power optimization** | [POWER-MANAGEMENT.md](POWER-MANAGEMENT.md) | CPU governors, GPU states |
| **Backup strategy** | [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md) | ⚠️ CRITICAL GAPS |

**Key Credentials:**
- SSH Password: `GrilledCh33s3#`
- Cloudflare: `cloudflare@htsn.io` / `849ebefd163d2ccdec25e49b3e1b3fe2cdadc`
- See individual docs for service-specific credentials

---

## Role

You are the **Homelab Assistant** - a Claude Code session dedicated to managing and maintaining Hutson's home infrastructure.

**Responsibilities:**
- Infrastructure Management (Proxmox, VMs, containers)
- File Sync (Syncthing across all devices)
- Network Administration
- Power Optimization
- Documentation (keep all docs current)
- Automation (shell aliases, scripts, scheduled tasks)

**Full access via**: SSH keys, APIs, QEMU guest agent

---

## Proactive Behaviors

When the user mentions issues or asks questions:
- **"sync not working"** → Check Syncthing on ALL devices, identify which one is offline
- **"device offline"** → Ping its LAN and Tailscale IPs, check whether the service is running
- **"slow"** → Check CPU usage, processes, Syncthing rescan activity
- **"check status"** → Run the full health check across all systems
- **"something's wrong"** → Run diagnostics on the likely culprits

---

## Quick Health Checks

```bash
# === FULL HEALTH CHECK ===

# Syncthing connections (Mac Mini)
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
  "http://127.0.0.1:8384/rest/system/connections" | \
  python3 -c "import sys,json; d=json.load(sys.stdin)['connections']; \
  [print(f\"{v.get('name',k[:7])}: {'UP' if v['connected'] else 'DOWN'}\") for k,v in d.items()]"

# Proxmox VMs (ConnectTimeout stops an unreachable node from hanging the check)
ssh -o ConnectTimeout=5 pve 'qm list' 2>/dev/null || echo "PVE: unreachable"
ssh -o ConnectTimeout=5 pve2 'qm list' 2>/dev/null || echo "PVE2: unreachable"

# Critical devices (note: ping -W is in seconds on Linux but milliseconds on macOS)
ping -c 1 -W 1 10.10.10.200 >/dev/null && echo "TrueNAS: UP" || echo "TrueNAS: DOWN"
ping -c 1 -W 1 10.10.10.1 >/dev/null && echo "Router: UP" || echo "Router: DOWN"

# Windows PC Syncthing (TCP 22000 is the Syncthing sync port)
nc -zw1 10.10.10.150 22000 && echo "Windows: UP" || echo "Windows: DOWN"
```
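
The `ping -W 1` checks above use Linux semantics. If the health check runs on the Mac Mini, note that macOS `ping` treats `-W` as a per-packet wait in milliseconds and uses `-t` for an overall timeout in seconds. A minimal sketch of an OS-aware probe, assuming only `uname` and the system `ping`; `ping_timeout_flags` and `host_up` are hypothetical helper names, not existing scripts.

```bash
# Hypothetical helpers (not part of the existing setup).
# Print the ping flag that bounds the wait to N seconds on this OS:
#   Linux (iputils): -W <seconds>
#   macOS:           -t <seconds>   (macOS -W is in milliseconds)
ping_timeout_flags() {
  local os="${1:-$(uname -s)}" secs="${2:-1}"
  case "$os" in
    Darwin) printf '%s\n' "-t $secs" ;;
    *)      printf '%s\n' "-W $secs" ;;
  esac
}

# Print "<label>: UP" or "<label>: DOWN" for a single host.
host_up() {
  local label="$1" ip="$2"
  # $(ping_timeout_flags) is unquoted on purpose so it splits into flag + value.
  if ping -c 1 $(ping_timeout_flags) "$ip" >/dev/null 2>&1; then
    printf '%s: UP\n' "$label"
  else
    printf '%s: DOWN\n' "$label"
  fi
}

# Usage:
#   host_up TrueNAS 10.10.10.200
#   host_up Router 10.10.10.1
```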
---

## Troubleshooting Runbooks

| Symptom | Check | Fix | Docs |
|---------|-------|-----|------|
| **Network down** | `ssh ucg-fiber 'free -m'` | Check memory; internet-watchdog reboots the gateway automatically on connectivity loss | [GATEWAY.md](GATEWAY.md) |
| **Tailscale DNS not working** | `tailscale status` | Check that PVE (primary subnet router) is online, check subnet routing | [TAILSCALE.md](TAILSCALE.md) |
| **Subnet unreachable** | `ping 10.10.10.10` | Check `--accept-routes`: it must be `false` on devices inside 10.10.10.0/24 | [TAILSCALE.md](TAILSCALE.md) |
| **Relay-only connections** | `tailscale ping <ip>` | Check for VPN conflicts, restart tailscaled | [TAILSCALE.md](TAILSCALE.md) |
| **Device not syncing** | Syncthing API `/rest/system/connections` | Restart Syncthing | [SYNCTHING.md](SYNCTHING.md) |
| **VM won't start** | Storage/RAM available? | `ssh pve 'qm start VMID'` | [VMS.md](VMS.md) |
| **Server running hot** | Check KSM, CPU processes | Disable KSM | [POWER-MANAGEMENT.md](POWER-MANAGEMENT.md) |
| **Storage enclosure loud** | Check fan speed via SES | Switch LCC | [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md) |
| **UPS on battery** | Check runtime (`upsc`) | Monitor the shutdown script (NUT shuts down at 120 s runtime) | [UPS.md](UPS.md) |
| **Service unreachable** | Check Traefik config | Fix routing | [TRAEFIK.md](TRAEFIK.md) |
| **SSH timeout** | Check MTU, network | Verify MTU=9000 on both sides | [SSH-ACCESS.md](SSH-ACCESS.md) |
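
For the **SSH timeout** row, the quickest way to prove MTU 9000 works end to end is a ping that exactly fills one jumbo frame with fragmentation forbidden. The ICMP payload is the MTU minus the 20-byte IPv4 header and the 8-byte ICMP header, so 9000 - 28 = 8972 bytes. A minimal sketch for Linux hosts such as the Proxmox nodes (`-M do` and `-s` are iputils `ping` flags; macOS uses different ones); `jumbo_payload` and `check_jumbo` are hypothetical names.

```bash
# Hypothetical helpers for verifying jumbo frames from a Linux host.
# Largest ICMP payload that fits one IPv4 packet: MTU - 20 (IPv4) - 8 (ICMP).
jumbo_payload() {
  local mtu="${1:-9000}"
  echo $(( mtu - 28 ))
}

# Send one full-size ping with "don't fragment" set (-M do).
# It errors or times out if either end, or any hop between, has a smaller MTU.
check_jumbo() {
  local target="$1" mtu="${2:-9000}"
  ping -c 1 -M do -s "$(jumbo_payload "$mtu")" "$target"
}

# Usage, e.g. from PVE2 towards PVE (equivalent one-liner):
#   ssh pve2 'ping -c 1 -M do -s 8972 10.10.10.120'
```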
---

## Server Temperature Check

```bash
# Check temps on both servers (Threadripper PRO max safe: 90°C Tctl)
ssh pve 'for f in /sys/class/hwmon/hwmon*/temp*_input; do \
  label=$(cat ${f%_input}_label 2>/dev/null); \
  if [ "$label" = "Tctl" ]; then echo "PVE Tctl: $(($(cat $f)/1000))°C"; fi; done'

ssh pve2 'for f in /sys/class/hwmon/hwmon*/temp*_input; do \
  label=$(cat ${f%_input}_label 2>/dev/null); \
  if [ "$label" = "Tctl" ]; then echo "PVE2 Tctl: $(($(cat $f)/1000))°C"; fi; done'
```
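
The loops above print whole degrees. To label a raw `temp*_input` reading (millidegrees Celsius) with the Healthy/Warning/Throttle thresholds used on this page, a sketch like the following works; `classify_tctl` is a hypothetical helper, and treating 80-85°C as still `ok` is an assumption, since no band is defined for that range.

```bash
# Hypothetical helper: label a hwmon reading given in millidegrees Celsius.
# Thresholds follow this page: warning above 85°C, throttle at 90°C.
# ASSUMPTION: anything up to 85°C is reported as "ok".
classify_tctl() {
  local c=$(( $1 / 1000 ))   # integer division truncates, as the loops above do
  if   [ "$c" -ge 90 ]; then echo "${c}°C throttle"
  elif [ "$c" -gt 85 ]; then echo "${c}°C warning"
  else                       echo "${c}°C ok"
  fi
}

# Usage (PATH_TO_TCTL is a placeholder for the temp*_input file the loop finds):
#   classify_tctl "$(ssh pve 'cat PATH_TO_TCTL')"
```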
**Healthy**: 70-80°C under load | **Warning**: >85°C | **Throttle**: 90°C

---

## Service Dependencies

```text
TrueNAS (10.10.10.200)
├── Central Syncthing hub - if down, sync breaks
├── NFS/SMB shares for VMs
└── Media storage for Plex

PiHole (CT 200)
└── DNS for entire network

Traefik (CT 202)
└── Reverse proxy - external access

Router (10.10.10.1)
└── Gateway for all traffic
```

---

## API Quick Reference

| Service | Device | Endpoint | Auth |
|---------|--------|----------|------|
| Syncthing | Mac Mini | `http://127.0.0.1:8384/rest/` | `X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5` |
| Syncthing | MacBook | `http://127.0.0.1:8384/rest/` | `X-API-Key: qYkNdVLwy9qZZZ6MqnJr7tHX7KKdxGMJ` |
| Syncthing | Phone | `https://10.10.10.54:8384/rest/` | `X-API-Key: Xxz3jDT4akUJe6psfwZsbZwG2LhfZuDM` |
| Proxmox | PVE/PVE2 | `https://10.10.10.120:8006/api2/json/` | API ticket or token; day-to-day access is via SSH key (`ssh pve`) |
| MetaMCP | docker-host2 | `https://metamcp.htsn.io/` | Web UI login |
| n8n | docker-host2 | `http://10.10.10.207:5678/api/v1/` | `X-N8N-API-KEY` (see [N8N.md](N8N.md)) |

**See**: [SYNCTHING.md](SYNCTHING.md), [HOMEASSISTANT.md](HOMEASSISTANT.md), [N8N.md](N8N.md) for more APIs

---

## Emergency Commands

```bash
# Restart VM (hard stop + start; 'qm reboot VMID' is the graceful alternative)
ssh pve 'qm stop VMID && qm start VMID'

# Check CPU usage
ssh pve 'ps aux --sort=-%cpu | head -10'

# Check ZFS pool on TrueNAS (VMID 100, via QEMU agent)
ssh pve 'qm guest exec 100 -- bash -c "zpool status vault"'

# Force Syncthing rescan
curl -X POST "http://127.0.0.1:8384/rest/db/scan?folder=FOLDER" \
  -H "X-API-Key: API_KEY"

# Restart Syncthing on Windows
sshpass -p 'GrilledCh33s3#' ssh claude@10.10.10.150 \
  'Stop-Process -Name syncthing -Force; Start-ScheduledTask -TaskName "Syncthing"'
```
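
The rescan command above needs the API key pasted into each call. A minimal sketch of a wrapper that reads it from the environment instead; `ST_API_KEY`, `ST_URL` and `st_api` are hypothetical names, while `/rest/db/scan`, `/rest/db/completion` and `/rest/system/connections` are standard Syncthing REST endpoints.

```bash
# Hypothetical wrapper around the Syncthing REST API.
# ASSUMPTIONS: the key is exported as ST_API_KEY; ST_URL defaults to the
# local GUI address used elsewhere on this page.
st_api() {
  local method="$1" path="$2"
  local base="${ST_URL:-http://127.0.0.1:8384}"
  if [ -z "${ST_API_KEY:-}" ]; then
    echo "st_api: ST_API_KEY is not set" >&2
    return 1
  fi
  # -f: fail on HTTP errors; -sS: no progress bar, but still print errors
  curl -fsS -X "$method" -H "X-API-Key: ${ST_API_KEY}" "${base}/rest/${path}"
}

# Usage:
#   st_api POST "db/scan?folder=FOLDER"         # force a rescan
#   st_api GET  "db/completion?folder=FOLDER"   # sync completion for a folder
#   st_api GET  system/connections              # per-device connection state
```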
---

## Infrastructure Overview

### Servers

| Server | CPU | RAM | Role | Details |
|--------|-----|-----|------|---------|
| **PVE** (10.10.10.120) | Threadripper PRO 3975WX (32C) | 128GB | Primary | [VMS.md](VMS.md) |
| **PVE2** (10.10.10.102) | Threadripper PRO 3975WX (32C) | 128GB | Secondary | [VMS.md](VMS.md) |

**Power**: ~1000-1350W under load | **UPS**: CyberPower 2200VA/1320W | **See**: [UPS.md](UPS.md), [POWER-MANAGEMENT.md](POWER-MANAGEMENT.md)
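
NUT's shutdown decision (configured with a 120-second runtime threshold, see [UPS.md](UPS.md)) comes down to two `upsc` variables: `ups.status`, which contains `OB` while on battery, and `battery.runtime`, in seconds. A minimal sketch of that rule applied to `upsc cyberpower@localhost` output; it only illustrates the logic and is not the contents of `/usr/local/bin/ups-shutdown.sh`. `ups_should_shutdown` is a hypothetical name.

```bash
# Hypothetical helper: read `upsc` output ("key: value" lines) on stdin and
# exit 0 when the UPS is on battery (OB) with runtime <= threshold seconds.
# Illustration only; not the deployed ups-shutdown.sh.
ups_should_shutdown() {
  local threshold="${1:-120}"
  awk -F': ' -v t="$threshold" '
    $1 == "ups.status"      { status = $2 }
    $1 == "battery.runtime" { runtime = $2 + 0; have_runtime = 1 }
    END {
      n = split(status, flags, " ")   # e.g. "OB DISCHRG" -> OB, DISCHRG
      onbatt = 0
      for (i = 1; i <= n; i++) if (flags[i] == "OB") onbatt = 1
      exit !(onbatt && have_runtime && runtime <= t + 0)
    }'
}

# Usage:
#   ssh pve 'upsc cyberpower@localhost' | ups_should_shutdown && echo "on battery, below 120 s"
```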
### Critical VMs

| VMID/CTID | Name | IP | Purpose | Docs |
|-----------|------|-----|---------|------|
| 100 | truenas | 10.10.10.200 | NAS/storage | [STORAGE.md](STORAGE.md) |
| 101 | saltbox | 10.10.10.100 | Media stack (Plex) | [VMS.md](VMS.md) |
| 110 | homeassistant | 10.10.10.110 | Home automation | [HOMEASSISTANT.md](HOMEASSISTANT.md) |
| 202 | traefik (CT) | 10.10.10.250 | Reverse proxy | [TRAEFIK.md](TRAEFIK.md) |
| 206 | docker-host | 10.10.10.206 | Monitoring stack (Grafana/Prometheus) | [VMS.md](VMS.md) |
| 302 | docker-host2 | 10.10.10.207 | MetaMCP, n8n, automation | [VMS.md](VMS.md) |

**Complete inventory**: [VMS.md](VMS.md) | **IP assignments**: [IP-ASSIGNMENTS.md](IP-ASSIGNMENTS.md)

---

## Common Maintenance Tasks

1. **Check Syncthing sync** - Folders behind? Errors?
2. **Verify devices connected** - Run the Syncthing connection check from Quick Health Checks
3. **Check disk space** - `ssh pve 'df -h'`
4. **Review ZFS health** - `ssh pve 'zpool status'`
5. **Check for stuck processes** - High CPU? Memory pressure?
6. **Verify backups** - Critical folders syncing? → See [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md)
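
Steps 3 and 4 can be made pass/fail instead of read by eye. A minimal sketch, assuming POSIX `df -P` output (Use% in column 5, mount point in column 6) and the fixed message `all pools are healthy` that `zpool status -x` prints when nothing is wrong; `disk_over` and `zpool_healthy` are hypothetical helpers, and the 85% default is an arbitrary example threshold.

```bash
# Hypothetical helpers for maintenance steps 3 and 4.
# Read `df -P` output on stdin; print "<mount> <use%>" for filesystems at or
# above the threshold. ($5 + 0 turns "91%" into 91.)
disk_over() {
  local limit="${1:-85}"
  awk -v limit="$limit" 'NR > 1 && ($5 + 0) >= limit + 0 { print $6 " " $5 }'
}

# Read `zpool status -x` output on stdin; exit 0 only if ZFS reports no problems.
zpool_healthy() {
  grep -qx 'all pools are healthy'
}

# Usage:
#   ssh pve 'df -P' | disk_over 85
#   ssh pve 'zpool status -x' | zpool_healthy && echo "ZFS: OK" || echo "ZFS: CHECK"
```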
---

## Network Quick Reference

- **Ranges**: 10.10.10.0/24 (LAN), 10.10.20.0/24 (storage)
- **Jumbo Frames**: MTU 9000 enabled
- **Tailscale**: VPN with subnet routing (HA failover)

**See**: [NETWORK.md](NETWORK.md) for complete details

---

## Common Commands

```bash
# VM management
ssh pve 'qm list'                    # List VMs
ssh pve 'qm start VMID'              # Start VM
ssh pve 'qm shutdown VMID'           # Graceful shutdown

# Container management
ssh pve 'pct list'                   # List containers
ssh -t pve 'pct enter CTID'          # Enter container shell (-t allocates a TTY)

# Storage
ssh pve 'zpool status'               # Check ZFS pools
ssh truenas 'zpool status vault'     # Check TrueNAS pool

# QEMU guest agent
ssh pve 'qm guest exec VMID -- bash -c "COMMAND"'
```
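
`qm guest exec` does not print the guest command's output directly; it returns a JSON object with fields such as `exitcode`, `out-data` and `err-data`. A minimal sketch that unwraps it with `python3` (already used by the health check); `guest_out` is a hypothetical helper.

```bash
# Hypothetical helper: read `qm guest exec` JSON on stdin, print the guest's
# stdout/stderr, and return the guest command's exit code (1 if missing).
guest_out() {
  python3 -c '
import json, sys
r = json.load(sys.stdin)
sys.stdout.write(r.get("out-data", ""))
sys.stderr.write(r.get("err-data", ""))
sys.exit(r.get("exitcode", 1))
'
}

# Usage:
#   ssh pve 'qm guest exec 100 -- zpool status vault' | guest_out
```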
**See**: [SSH-ACCESS.md](SSH-ACCESS.md), [VMS.md](VMS.md)

---

## Documentation Index

### Infrastructure
- [README.md](README.md) - Start here
- [GATEWAY.md](GATEWAY.md) - UniFi gateway, monitoring services
- [TAILSCALE.md](TAILSCALE.md) - VPN, subnet routing, DNS
- [VMS.md](VMS.md) - VM/CT inventory
- [STORAGE.md](STORAGE.md) - ZFS pools, shares
- [NETWORK.md](NETWORK.md) - Bridges, VLANs, MTU
- [POWER-MANAGEMENT.md](POWER-MANAGEMENT.md) - Optimizations
- [UPS.md](UPS.md) - UPS config, NUT monitoring

### Services
- [TRAEFIK.md](TRAEFIK.md) - Reverse proxy, SSL
- [HOMEASSISTANT.md](HOMEASSISTANT.md) - Home automation
- [SYNCTHING.md](SYNCTHING.md) - File sync
- [N8N.md](N8N.md) - n8n automation, API access
- [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md) - Storage enclosure
- [MONITORING.md](MONITORING.md) - System monitoring

### Operations
- [SSH-ACCESS.md](SSH-ACCESS.md) - SSH keys, hosts
- [IP-ASSIGNMENTS.md](IP-ASSIGNMENTS.md) - IP addresses
- [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md) - ⚠️ Backups (CRITICAL)
- [SHELL-ALIASES.md](SHELL-ALIASES.md) - ZSH aliases

---

## Agent & Tool Guidelines

### Background Agents
**Always** spin up background agents for multiple independent tasks:
- Parallel execution improves efficiency
- Use them to run tests, builds, and searches simultaneously

### MCP Tools

| Tool | Provider | Use Case |
|------|----------|----------|
| `mcp__Ref__ref_search_documentation` | ref.tools | Search documentation |
| `mcp__Ref__ref_read_url` | ref.tools | Read doc URLs |
| `mcp__exa__web_search_exa` | Exa | General web search |
| `mcp__exa__get_code_context_exa` | Exa | Code-specific search |

---

## Git Repository

- **Gitea**: https://git.htsn.io/hutson/homelab-docs
- **Local**: `~/Projects/homelab`
- **Notes**: `~/Notes/05_Homelab` (symlink)

```bash
cd ~/Projects/homelab
git add -A && git commit -m "Update docs" && git push
```

---

## Backlog

| Priority | Task | Notes |
|----------|------|-------|
| Medium | Re-IP all devices | Current IPs inconsistent |
| Medium | Upgrade to 20A circuit for UPS | Plug temporarily rewired 5-20P→5-15P for the current 15A circuit |
| Low | Install SSH on HomeAssistant | Currently QEMU agent only |

---

## Recent Changes

### 2026-01-11
- **BlueMap web map** for Minecraft Hutworld server
- URL: https://map.htsn.io (password protected: hutworld / Suwanna123)
- BlueMap 5.15 plugin installed
- Port 8100 exposed in Crafty docker-compose
- Traefik routing with basicAuth middleware
- Fixed corrupted ViaVersion/ViaBackwards plugins
- Documented 1.21+ spawner give command syntax
- Fixed Docker file permission issues in Crafty container
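
Adding a public service, as done for BlueMap, means a Traefik dynamic-config entry plus a Cloudflare DNS record. A minimal sketch that prints the Traefik half for the file provider; the entry point `websecure` and certificate resolver `cloudflare` are assumed placeholder names, so check [TRAEFIK.md](TRAEFIK.md) for the real ones and for where the file lives on CT 202. `traefik_service_yaml` is hypothetical, and BlueMap's basicAuth middleware is left out.

```bash
# Hypothetical helper: print a Traefik file-provider config for one service.
# ASSUMPTIONS: entry point "websecure" and cert resolver "cloudflare" are
# placeholders; confirm them against the live Traefik config.
traefik_service_yaml() {
  local name="$1" host="$2" backend="$3"
  if [ -z "$name" ] || [ -z "$host" ] || [ -z "$backend" ]; then
    echo "usage: traefik_service_yaml NAME HOST BACKEND_URL" >&2
    return 1
  fi
  # Unquoted heredoc so ${vars} expand; \` emits a literal backtick.
  cat <<EOF
http:
  routers:
    ${name}:
      rule: "Host(\`${host}\`)"
      entryPoints:
        - websecure
      service: ${name}
      tls:
        certResolver: cloudflare
  services:
    ${name}:
      loadBalancer:
        servers:
          - url: "${backend}"
EOF
}

# Usage (BlueMap's port 8100 is exposed on docker-host2):
#   traefik_service_yaml bluemap map.htsn.io http://10.10.10.207:8100
```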
### 2026-01-05
- Created [TAILSCALE.md](TAILSCALE.md) - comprehensive Tailscale VPN documentation
- **Fixed Tailscale subnet routing issues:**
- Switched primary subnet router from UCG-Fiber to PVE (gateway had relay-only connections)
- Disabled `--accept-routes` on UCG-Fiber and PiHole (devices on the subnet must not accept subnet routes)
- Fixed PiHole ProtonVPN from full-tunnel to split-tunnel (DNS-only via fwmark routing)
- **Root cause:** Devices directly on 10.10.10.0/24 with `--accept-routes=true` were routing local traffic through the Tailscale mesh instead of the local interface
- **Key lesson:** Any device directly connected to an advertised subnet MUST have `--accept-routes=false`
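
The key lesson above can be turned into a per-device check, together with a way to spot relay-only links. A minimal sketch: `tailscale set --accept-routes=...` is the real CLI, while `accept_routes_for` and `tailscale_path` are hypothetical helpers. The prefix match is exact only because the advertised subnet is a /24, and the `via DERP(...)` wording is based on typical `tailscale ping` output, which may vary between versions.

```bash
# Hypothetical helpers (not existing scripts).
ADVERTISED_PREFIX="10.10.10."   # 10.10.10.0/24: a string prefix match is exact for a /24

# Print the --accept-routes value for a device with the given LAN IPv4 address.
# Devices inside the advertised subnet must not accept it, or local traffic
# detours through the Tailscale mesh.
accept_routes_for() {
  case "$1" in
    "$ADVERTISED_PREFIX"*) echo "false" ;;
    *)                     echo "true" ;;
  esac
}

# Classify one line of `tailscale ping` output as relay, direct or unknown.
tailscale_path() {
  case "$1" in
    *"via DERP("*) echo "relay" ;;
    *"pong from"*) echo "direct" ;;
    *)             echo "unknown" ;;
  esac
}

# Usage (LAN_IP and HOST are placeholders):
#   tailscale set --accept-routes="$(accept_routes_for "$LAN_IP")"
#   tailscale_path "$(tailscale ping -c 1 HOST | tail -n 1)"
```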
### 2026-01-03
- Deployed **Crafty Controller 4** on docker-host2 for Minecraft server management
- URL: https://mc.htsn.io (Web GUI)
- Minecraft Java: 10.10.10.207:25565
- Minecraft Bedrock (Geyser): 10.10.10.207:19132/udp
- Admin: `admin` / password in `/crafty/app/config/default-creds.txt`
- World data to be migrated from Windows PC (`D:\Minecraft\mcss\servers\hutworld`)
- Deployed **MetaMCP** on docker-host2 (10.10.10.207) for unified MCP server management
- URL: https://metamcp.htsn.io
- Added docker-host2 to SSH config (`~/.ssh/config`)
- Updated IP-ASSIGNMENTS.md, SSH-ACCESS.md, TRAEFIK.md with docker-host2

### 2026-01-02
- Created [GATEWAY.md](GATEWAY.md) - UniFi gateway documentation
- Deployed internet-watchdog service (auto-reboot on connectivity loss)
- Deployed memory-monitor service (logs memory usage every 10 min)
- Configured SSH key auth for gateway (`ucg-fiber`/`gateway` aliases)
- Disabled UniFi Connect to free ~200MB RAM
- Updated [MONITORING.md](MONITORING.md) with gateway monitoring
- Updated [SSH-ACCESS.md](SSH-ACCESS.md) with key auth for router
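
The memory-monitor entry above only says the service logs memory usage every 10 minutes. A minimal sketch of that idea, not the deployed service (see [GATEWAY.md](GATEWAY.md) for that); it assumes the procps `free -m` layout, where the 7th field of the `Mem:` line is `available`. `mem_available_mb`, `mem_log_loop` and the default log path are hypothetical.

```bash
# Hypothetical helper: read `free -m` on stdin, print available memory in MB.
# ASSUMPTION: procps layout, where field 7 of the "Mem:" line is "available".
mem_available_mb() {
  awk '/^Mem:/ { print $7 }'
}

# Illustrative 10-minute logging loop (not the deployed memory-monitor).
mem_log_loop() {
  local log="${1:-/tmp/memory-monitor.log}"   # hypothetical path
  while true; do
    printf '%s available_mb=%s\n' "$(date '+%F %T')" "$(free -m | mem_available_mb)" >> "$log"
    sleep 600   # 10 minutes
  done
}

# Usage from the workstation:
#   ssh ucg-fiber 'free -m' | mem_available_mb
```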
### 2025-12-22
- Created comprehensive Phase 1 documentation split
- New docs: README.md, BACKUP-STRATEGY.md, STORAGE.md, UPS.md, TRAEFIK.md, SSH-ACCESS.md, POWER-MANAGEMENT.md, VMS.md
- Cleaned up CLAUDE.md to quick reference only

### 2025-12-21
- UPS upgrade: CyberPower OR2200PFCRT2U (1320W)
- NUT monitoring configured (master/slave)
- Full power failure test successful (~7 min recovery)
- Happy Server self-hosted relay deployed
- PVE Tailscale routing fix
- Proxmox 2-node cluster quorum fix

**Full changelog**: See the collapsible section at the end of this file

---

**Last Updated**: 2026-01-11 | **Documentation Status**: ✅ Phase 1 Complete + Gateway Monitoring + MetaMCP + Tailscale

---

<details>
<summary><b>Full Changelog (Click to expand)</b></summary>

### 2025-12-21

**UPS Upgrade**
- Replaced WattBox WB-1100-IPVMB-6 (660W) with CyberPower OR2200PFCRT2U (1320W)
- Temporarily rewired plug 5-20P → 5-15P for 15A circuit
- Runtime: ~15-20 min at 33% load

**NUT Monitoring**
- Configured NUT on PVE (master), PVE2 (slave)
- Shutdown threshold: 120 seconds runtime
- Custom shutdown script: `/usr/local/bin/ups-shutdown.sh`
- Home Assistant integration (UPS sensors)

**Happy Server Self-Hosted Relay**
- Deployed on docker-host (10.10.10.206)
- Stack: Happy Server + PostgreSQL + Redis + MinIO
- URL: https://happy.htsn.io
- Traefik reverse proxy configured

**Proxmox Fixes**
- PVE Tailscale routing: Added rule for local network access
- PVE2 MTU fix: vmbr0 + nic1 both set to 9000
- 2-node cluster quorum: `two_node: 1` in corosync.conf

**Power Failure Test**
- Full end-to-end test successful
- VMs stopped gracefully at 2 min runtime
- Total recovery: ~7 minutes

### 2024-12-20

**Git & SSH**
- Created homelab-docs repo on Gitea
- Deployed SSH keys to all VMs/LXCs (13 hosts)
- Updated `~/.ssh/config` with host aliases

### 2024-12-19

**EMC Storage Enclosure**
- LCC B failure diagnosed, switched to LCC A
- Fans now quiet (speed code 3 vs 5)
- Created EMC-ENCLOSURE.md documentation

**QEMU Guest Agent**
- Installed on docker-host, fs-dev, copyparty
- All VMs now have agent except homeassistant

</details>