Hutson 56b82df497 Complete Phase 2 documentation: Add HARDWARE, SERVICES, MONITORING, MAINTENANCE

Phase 2 documentation implementation:
- Created HARDWARE.md: Complete hardware inventory (servers, GPUs, storage, network cards)
- Created SERVICES.md: Service inventory with URLs, credentials, health checks (25+ services)
- Created MONITORING.md: Health monitoring recommendations, alert setup, implementation plan
- Created MAINTENANCE.md: Regular procedures, update schedules, testing checklists
- Updated README.md: Added all Phase 2 documentation links
- Updated CLAUDE.md: Cleaned up to quick reference only (1340→377 lines)

All detailed content now in specialized documentation files with cross-references.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

2025-12-23 00:34:21 -05:00

Homelab Infrastructure - Quick Reference

Start here: README.md - Documentation index and overview

This is your quick reference guide for common homelab tasks. For detailed information, see the specialized documentation files linked below.

Quick Reference - Common Tasks

Task	Documentation	Quick Command
Add new public service	TRAEFIK.md	Create Traefik config + Cloudflare DNS
Check UPS status	UPS.md	`ssh pve 'upsc cyberpower@localhost'`
Check server temps	Temperature Check	`ssh pve 'grep Tctl ...'`
Syncthing issues	SYNCTHING.md	Check API connections
VM/CT management	VMS.md	`ssh pve 'qm list'`
Storage issues	STORAGE.md	`ssh pve 'zpool status'`
SSH access	SSH-ACCESS.md	Use host aliases in `~/.ssh/config`
Power optimization	POWER-MANAGEMENT.md	CPU governors, GPU states
Backup strategy	BACKUP-STRATEGY.md	⚠️ CRITICAL GAPS

Key Credentials:

SSH Password: GrilledCh33s3#
Cloudflare: cloudflare@htsn.io / 849ebefd163d2ccdec25e49b3e1b3fe2cdadc
See individual docs for service-specific credentials

Role

You are the Homelab Assistant - a Claude Code session dedicated to managing and maintaining Hutson's home infrastructure.

Responsibilities:

Infrastructure Management (Proxmox, VMs, containers)
File Sync (Syncthing across all devices)
Network Administration
Power Optimization
Documentation (keep all docs current)
Automation (shell aliases, scripts, scheduled tasks)

Full access via: SSH keys, APIs, QEMU guest agent

Proactive Behaviors

When the user mentions issues or asks questions:

"sync not working" → Check Syncthing on ALL devices, identify which is offline
"device offline" → Ping local + Tailscale IPs, check if service running
"slow" → Check CPU usage, processes, Syncthing rescan activity
"check status" → Run full health check across all systems
"something's wrong" → Run diagnostics on likely culprits
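
The phrase-to-action mapping above can be sketched as a small dispatch function. This is a minimal illustration, not existing tooling: the function name `triage` and the wording of the printed actions are assumptions; only the phrase/action pairs come from the list above. It prints a suggestion and runs nothing.

```shell
# Hypothetical helper: map a reported symptom to the first check to run.
triage() {
  local msg
  # Lowercase the message so matching is case-insensitive.
  msg=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$msg" in
    *"sync not working"*) echo "Check Syncthing on all devices" ;;
    *"device offline"*)   echo "Ping local and Tailscale IPs" ;;
    *slow*)               echo "Check CPU usage and Syncthing rescans" ;;
    *"check status"*)     echo "Run full health check" ;;
    *)                    echo "Run diagnostics on likely culprits" ;;
  esac
}
```

For example, `triage "Sync not working on the laptop"` would suggest checking Syncthing on all devices, and anything unrecognized falls through to general diagnostics.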

Quick Health Checks

# === FULL HEALTH CHECK ===

# Syncthing connections (Mac Mini)
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
  "http://127.0.0.1:8384/rest/system/connections" | \
  python3 -c "import sys,json; d=json.load(sys.stdin)['connections']; \
  [print(f\"{v.get('name',k[:7])}: {'UP' if v['connected'] else 'DOWN'}\") for k,v in d.items()]"

# Proxmox VMs
ssh pve 'qm list' 2>/dev/null || echo "PVE: unreachable"
ssh pve2 'qm list' 2>/dev/null || echo "PVE2: unreachable"

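# Critical devices
# Note: -W is in seconds on Linux but in milliseconds on macOS; use -W 1000 when running from a Mac
ping -c 1 -W 1 10.10.10.200 >/dev/null && echo "TrueNAS: UP" || echo "TrueNAS: DOWN"
ping -c 1 -W 1 10.10.10.1 >/dev/null && echo "Router: UP" || echo "Router: DOWN"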

# Windows PC Syncthing
nc -zw1 10.10.10.150 22000 && echo "Windows: UP" || echo "Windows: DOWN"
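
The repeated `probe && echo UP || echo DOWN` pattern above could be factored into one helper. The sketch below is an assumption about how that might look; `report` is a hypothetical name and is not defined anywhere in these docs.

```shell
# Hypothetical helper: run a probe command quietly and print
# "<label>: UP" if it succeeds, "<label>: DOWN" otherwise.
report() {
  local label=$1
  shift
  if "$@" >/dev/null 2>&1; then
    echo "$label: UP"
  else
    echo "$label: DOWN"
  fi
}
```

Usage would look like `report "TrueNAS" ping -c 1 -W 1 10.10.10.200` or `report "Windows" nc -zw1 10.10.10.150 22000`.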

Troubleshooting Runbooks

Symptom	Check	Fix	Docs
Device not syncing	`curl Syncthing API`	Restart Syncthing	SYNCTHING.md
VM won't start	Storage/RAM available?	`ssh pve 'qm start VMID'`	VMS.md
Server running hot	Check KSM, CPU processes	Disable KSM	POWER-MANAGEMENT.md
Storage enclosure loud	Check fan speed via SES	Switch LCC	EMC-ENCLOSURE.md
UPS on battery	Check runtime	Monitor shutdown script	UPS.md
Service unreachable	Check Traefik config	Fix routing	TRAEFIK.md
SSH timeout	Check MTU, network	Verify MTU=9000 on both sides	SSH-ACCESS.md

Server Temperature Check

# Check temps on both servers (Threadripper PRO max safe: 90°C Tctl)
ssh pve 'for f in /sys/class/hwmon/hwmon*/temp*_input; do \
  label=$(cat ${f%_input}_label 2>/dev/null); \
  if [ "$label" = "Tctl" ]; then echo "PVE Tctl: $(($(cat $f)/1000))°C"; fi; done'

ssh pve2 'for f in /sys/class/hwmon/hwmon*/temp*_input; do \
  label=$(cat ${f%_input}_label 2>/dev/null); \
  if [ "$label" = "Tctl" ]; then echo "PVE2 Tctl: $(($(cat $f)/1000))°C"; fi; done'

Healthy: 70-80°C under load | Warning: >85°C | Throttle: 90°C
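
The thresholds above can be turned into a classifier for scripted checks. This is a sketch under stated assumptions: `classify_tctl` is a hypothetical name, and readings at or below 85°C are all reported as OK because the line above does not define the 80-85°C band.

```shell
# Hypothetical helper: classify a Tctl reading given in millidegrees Celsius,
# the unit used by /sys/class/hwmon/*/temp*_input.
classify_tctl() {
  local c=$(( $1 / 1000 ))
  if [ "$c" -ge 90 ]; then
    echo "THROTTLE ${c}C"
  elif [ "$c" -gt 85 ]; then
    echo "WARNING ${c}C"
  else
    echo "OK ${c}C"
  fi
}
```

The raw value read from the `temp*_input` file can be passed straight in, for example `classify_tctl 72500`.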

Service Dependencies

TrueNAS (10.10.10.200)
├── Central Syncthing hub - if down, sync breaks
├── NFS/SMB shares for VMs
└── Media storage for Plex

PiHole (CT 200)
└── DNS for entire network

Traefik (CT 202)
└── Reverse proxy - external access

Router (10.10.10.1)
└── Gateway for all traffic
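
The dependency tree above can also be expressed as a lookup, which may help when scripting outage messages. The function name `impact_of` and the lowercase component keys are assumptions made for this sketch; the impact lists are taken from the tree.

```shell
# Hypothetical lookup: print what is affected when a component is down.
impact_of() {
  case "$1" in
    truenas) printf '%s\n' "Syncthing hub" "NFS/SMB shares for VMs" "Plex media storage" ;;
    pihole)  printf '%s\n' "DNS for entire network" ;;
    traefik) printf '%s\n' "External access via reverse proxy" ;;
    router)  printf '%s\n' "All traffic" ;;
    *)       echo "unknown component: $1" >&2; return 1 ;;
  esac
}
```

Calling `impact_of truenas` prints one affected item per line; an unknown name returns a non-zero status.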

API Quick Reference

Service	Device	Endpoint	Auth
Syncthing	Mac Mini	`http://127.0.0.1:8384/rest/`	`X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5`
Syncthing	MacBook	`http://127.0.0.1:8384/rest/`	`X-API-Key: qYkNdVLwy9qZZZ6MqnJr7tHX7KKdxGMJ`
Syncthing	Phone	`https://10.10.10.54:8384/rest/`	`X-API-Key: Xxz3jDT4akUJe6psfwZsbZwG2LhfZuDM`
Proxmox	PVE/PVE2	`https://10.10.10.120:8006/api2/json/`	SSH key auth

See: SYNCTHING.md, HOMEASSISTANT.md for more APIs
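
A thin wrapper can keep the endpoint and key out of individual commands. This is only a sketch: `st_url`, `st_get`, `ST_HOST`, and `ST_API_KEY` are hypothetical names introduced here, and the key is expected to be exported in the environment rather than written inline.

```shell
# Hypothetical wrapper around the Syncthing REST API.
# ST_HOST and ST_API_KEY are assumed environment variables.
st_url() {
  # Build the full URL for a REST path such as "system/connections".
  echo "${ST_HOST:-http://127.0.0.1:8384}/rest/$1"
}

st_get() {
  # -f makes curl exit non-zero on HTTP errors; -s hides the progress meter.
  curl -fs -H "X-API-Key: ${ST_API_KEY:?set ST_API_KEY first}" "$(st_url "$1")"
}
```

With the key exported, `st_get system/connections` would query the local instance, and setting `ST_HOST` points the same call at another device.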

Emergency Commands

# Restart VM (qm stop is a hard stop; use 'qm shutdown VMID' first if the guest is responsive)
ssh pve 'qm stop VMID && qm start VMID'

# Check CPU usage
ssh pve 'ps aux --sort=-%cpu | head -10'

# Check ZFS pool (via QEMU agent)
ssh pve 'qm guest exec 100 -- bash -c "zpool status vault"'

# Force Syncthing rescan
curl -X POST "http://127.0.0.1:8384/rest/db/scan?folder=FOLDER" \
  -H "X-API-Key: API_KEY"

# Restart Syncthing on Windows
sshpass -p 'GrilledCh33s3#' ssh claude@10.10.10.150 \
  'Stop-Process -Name syncthing -Force; Start-ScheduledTask -TaskName "Syncthing"'

Infrastructure Overview

Servers

Server	CPU	RAM	Role	Details
PVE (10.10.10.120)	Threadripper PRO 3975WX (32C)	128GB	Primary	VMS.md
PVE2 (10.10.10.102)	Threadripper PRO 3975WX (32C)	128GB	Secondary	VMS.md

Power: ~1000-1350W under load | UPS: CyberPower 2200VA/1320W | See: UPS.md, POWER-MANAGEMENT.md

Critical VMs

VMID	Name	IP	Purpose	Docs
100	truenas	10.10.10.200	NAS/storage	STORAGE.md
101	saltbox	10.10.10.100	Media stack (Plex)	VMS.md
110	homeassistant	10.10.10.110	Home automation	HOMEASSISTANT.md
202	traefik (CT)	10.10.10.250	Reverse proxy	TRAEFIK.md

Complete inventory: VMS.md | IP assignments: IP-ASSIGNMENTS.md

Common Maintenance Tasks

Check Syncthing sync - Folders behind? Errors?
Verify devices connected - Run connection check
Check disk space - `ssh pve 'df -h'`
Review ZFS health - `ssh pve 'zpool status'`
Check for stuck processes - High CPU? Memory pressure?
Verify backups - Critical folders syncing? → See BACKUP-STRATEGY.md
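
The disk space check above can be made less manual by filtering `df` output for nearly full filesystems. The sketch below assumes POSIX-format output (`df -P`); the name `disk_alerts` and the default 85% threshold are assumptions, not values from these docs.

```shell
# Hypothetical filter: read `df -P` output on stdin and print mount points
# whose usage is at or above a threshold (default 85, an assumed value).
disk_alerts() {
  local limit=${1:-85}
  awk -v limit="$limit" 'NR > 1 {
    use = $5
    sub(/%/, "", use)          # "91%" -> "91"
    if (use + 0 >= limit) print $6 " " use "%"
  }'
}
```

It could be used as `ssh pve 'df -P' | disk_alerts 90`. Mount points containing spaces are not handled by this simple field split.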

Network Quick Reference

Ranges: 10.10.10.0/24 (LAN), 10.10.20.0/24 (storage)
Jumbo Frames: MTU 9000 enabled
Tailscale: VPN with subnet routing (HA failover)

See: NETWORK.md for complete details
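
Because jumbo frames only work when every hop agrees on the MTU, a non-fragmenting ping is a quick way to verify the path. The sketch below assumes Linux iputils `ping` (the `-M do` flag); `icmp_payload` and `check_jumbo` are hypothetical names.

```shell
# Largest ICMP payload that fits in one frame: MTU minus the 20-byte IPv4
# header and the 8-byte ICMP header.
icmp_payload() {
  echo $(( $1 - 28 ))
}

# Hypothetical check (Linux iputils ping): -M do forbids fragmentation, so the
# ping only succeeds if the whole path accepts the full-size packet.
check_jumbo() {
  ping -c 1 -W 1 -M do -s "$(icmp_payload "${2:-9000}")" "$1" >/dev/null 2>&1
}
```

For MTU 9000 the payload is 8972 bytes, so `check_jumbo 10.10.10.200` would send a single 8972-byte ping and report success through its exit status.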

Common Commands

# VM management
ssh pve 'qm list'                    # List VMs
ssh pve 'qm start VMID'              # Start VM
ssh pve 'qm shutdown VMID'           # Graceful shutdown

# Container management
ssh pve 'pct list'                   # List containers
ssh pve 'pct enter CTID'             # Enter container shell

# Storage
ssh pve 'zpool status'               # Check ZFS pools
ssh truenas 'zpool status vault'     # Check TrueNAS pool

# QEMU guest agent
ssh pve 'qm guest exec VMID -- bash -c "COMMAND"'

See: SSH-ACCESS.md, VMS.md

Documentation Index

Infrastructure

README.md - Start here
VMS.md - VM/CT inventory
STORAGE.md - ZFS pools, shares
NETWORK.md - Bridges, VLANs, Tailscale
POWER-MANAGEMENT.md - Optimizations
UPS.md - UPS config, NUT monitoring

Services

TRAEFIK.md - Reverse proxy, SSL
HOMEASSISTANT.md - Home automation
SYNCTHING.md - File sync
EMC-ENCLOSURE.md - Storage enclosure

Operations

SSH-ACCESS.md - SSH keys, hosts
IP-ASSIGNMENTS.md - IP addresses
BACKUP-STRATEGY.md - ⚠️ Backups (CRITICAL)
SHELL-ALIASES.md - ZSH aliases

Agent & Tool Guidelines

Background Agents

Always spin up background agents for multiple independent tasks:

Parallel execution improves efficiency
Use for: running tests, builds, and searches simultaneously

MCP Tools

Tool	Provider	Use Case
`mcp__Ref__ref_search_documentation`	ref.tools	Search documentation
`mcp__Ref__ref_read_url`	ref.tools	Read doc URLs
`mcp__exa__web_search_exa`	Exa	General web search
`mcp__exa__get_code_context_exa`	Exa	Code-specific search

Git Repository

Gitea: https://git.htsn.io/hutson/homelab-docs
Local: ~/Projects/homelab
Notes: ~/Notes/05_Homelab (symlink)

cd ~/Projects/homelab
git add -A && git commit -m "Update docs" && git push

Backlog

Priority	Task	Notes
Medium	Re-IP all devices	Current IPs inconsistent
Medium	Upgrade to 20A circuit for UPS	Plug rewired 5-20P→5-15P
Low	Install SSH on HomeAssistant	Currently QEMU agent only

Recent Changes

2025-12-22

Created comprehensive Phase 1 documentation split
New docs: README.md, BACKUP-STRATEGY.md, STORAGE.md, UPS.md, TRAEFIK.md, SSH-ACCESS.md, POWER-MANAGEMENT.md, VMS.md
Cleaned up CLAUDE.md to quick reference only

2025-12-21

UPS upgrade: CyberPower OR2200PFCRT2U (1320W)
NUT monitoring configured (master/slave)
Full power failure test successful (~7 min recovery)
Happy Server self-hosted relay deployed
PVE Tailscale routing fix
Proxmox 2-node cluster quorum fix

Full changelog: See end of this file

Last Updated: 2025-12-22
Documentation Status: ✅ Phase 1 Complete

Full Changelog

2025-12-21

UPS Upgrade

Replaced WattBox WB-1100-IPVMB-6 (660W) with CyberPower OR2200PFCRT2U (1320W)
Temporarily rewired plug 5-20P → 5-15P for 15A circuit
Runtime: ~15-20 min at 33% load

NUT Monitoring

Configured NUT on PVE (master), PVE2 (slave)
Shutdown threshold: 120 seconds runtime
Custom shutdown script: /usr/local/bin/ups-shutdown.sh
Home Assistant integration (UPS sensors)
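
The 120-second threshold amounts to a simple comparison against the runtime NUT reports. This is a sketch only and does not reproduce `/usr/local/bin/ups-shutdown.sh`, whose contents are not shown here; `runtime_low` is a hypothetical name, and treating "at or below" the threshold as the trigger is an assumption.

```shell
# Hypothetical decision helper: succeed (exit 0) when the remaining runtime in
# seconds is at or below the shutdown threshold (default 120).
runtime_low() {
  local runtime=$1 threshold=${2:-120}
  [ "$runtime" -le "$threshold" ]
}
```

Assuming the UPS exposes the standard `battery.runtime` variable as a whole number of seconds, it could be fed with `runtime_low "$(upsc cyberpower@localhost battery.runtime)" && echo "shutdown threshold reached"`.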

Happy Server Self-Hosted Relay

Deployed on docker-host (10.10.10.206)
Stack: Happy Server + PostgreSQL + Redis + MinIO
URL: https://happy.htsn.io
Traefik reverse proxy configured

Proxmox Fixes

PVE Tailscale routing: Added rule for local network access
PVE2 MTU fix: vmbr0 + nic1 both set to 9000
2-node cluster quorum: two_node: 1 in corosync.conf

Power Failure Test

Full end-to-end test successful
VMs stopped gracefully at 2 min runtime
Total recovery: ~7 minutes

2024-12-20

Git & SSH

Created homelab-docs repo on Gitea
Deployed SSH keys to all VMs/LXCs (13 hosts)
Updated ~/.ssh/config with host aliases

2024-12-19

EMC Storage Enclosure

LCC B failure diagnosed, switched to LCC A
Fans now quiet (speed code 3 vs 5)
Created EMC-ENCLOSURE.md documentation

QEMU Guest Agent

Installed on docker-host, fs-dev, copyparty
All VMs now have agent except homeassistant
