Complete Phase 2 documentation: Add HARDWARE, SERVICES, MONITORING, MAINTENANCE

Phase 2 documentation implementation:
- Created HARDWARE.md: Complete hardware inventory (servers, GPUs, storage, network cards)
- Created SERVICES.md: Service inventory with URLs, credentials, health checks (25+ services)
- Created MONITORING.md: Health monitoring recommendations, alert setup, implementation plan
- Created MAINTENANCE.md: Regular procedures, update schedules, testing checklists
- Updated README.md: Added all Phase 2 documentation links
- Updated CLAUDE.md: Cleaned up to quick reference only (1340→377 lines)

All detailed content now in specialized documentation files with cross-references.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Hutson committed on 2025-12-23 00:34:21 -05:00
parent 23e9df68c9
commit 56b82df497
14 changed files with 6328 additions and 1036 deletions

MAINTENANCE.md (new file)
@@ -0,0 +1,618 @@
# Maintenance Procedures and Schedules
Regular maintenance procedures for homelab infrastructure to ensure reliability and performance.
## Overview
| Frequency | Tasks | Estimated Time |
|-----------|-------|----------------|
| **Daily** | Quick health check | 2-5 min |
| **Weekly** | Service status, logs review | 15-30 min |
| **Monthly** | Updates, backups verification | 1-2 hours |
| **Quarterly** | Full system audit, testing | 2-4 hours |
| **Annual** | Hardware maintenance, planning | 4-8 hours |
---
## Daily Maintenance (Automated)
### Quick Health Check Script
Save as `~/bin/homelab-health-check.sh`:
```bash
#!/bin/bash
# Daily homelab health check
echo "=== Homelab Health Check ==="
echo "Date: $(date)"
echo ""
echo "=== Server Status ==="
ssh pve 'uptime' 2>/dev/null || echo "PVE: UNREACHABLE"
ssh pve2 'uptime' 2>/dev/null || echo "PVE2: UNREACHABLE"
echo ""
echo "=== CPU Temperatures ==="
ssh pve 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE: $(($(cat $f)/1000))°C"; fi; done'
ssh pve2 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE2: $(($(cat $f)/1000))°C"; fi; done'
echo ""
echo "=== UPS Status ==="
ssh pve 'upsc cyberpower@localhost | grep -E "battery.charge:|battery.runtime:|ups.load:|ups.status:"'
echo ""
echo "=== ZFS Pools ==="
ssh pve 'zpool status -x' 2>/dev/null
ssh pve2 'zpool status -x' 2>/dev/null
ssh truenas 'zpool status -x vault'
echo ""
echo "=== Disk Space ==="
ssh pve 'df -h | grep -E "Filesystem|/dev/(nvme|sd)"'
ssh truenas 'df -h /mnt/vault'
echo ""
echo "=== VM Status ==="
ssh pve 'qm list | grep running | wc -l' | xargs echo "PVE VMs running:"
ssh pve2 'qm list | grep running | wc -l' | xargs echo "PVE2 VMs running:"
echo ""
echo "=== Syncthing Connections ==="
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
"http://127.0.0.1:8384/rest/system/connections" | \
python3 -c "import sys,json; d=json.load(sys.stdin)['connections']; \
[print(f\"{v.get('name',k[:7])}: {'UP' if v['connected'] else 'DOWN'}\") for k,v in d.items()]"
echo ""
echo "=== Check Complete ==="
```
**Run daily via cron**:
```bash
# Add to crontab
0 9 * * * ~/bin/homelab-health-check.sh | mail -s "Homelab Health Check" hutson@example.com
```
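If no local MTA is configured for `mail`, writing to a log file is a simpler target (sketch; the log path is an assumption):
```bash
# Alternative: append the daily output to a log file instead of mailing it
# (create ~/logs first)
0 9 * * * ~/bin/homelab-health-check.sh >> ~/logs/homelab-health.log 2>&1
```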
---
## Weekly Maintenance
### Service Status Review
**Check all critical services**:
```bash
# Proxmox services
ssh pve 'systemctl status pve-cluster pvedaemon pveproxy'
ssh pve2 'systemctl status pve-cluster pvedaemon pveproxy'
# NUT (UPS monitoring)
ssh pve 'systemctl status nut-server nut-monitor'
ssh pve2 'systemctl status nut-monitor'
# Container services
ssh pve 'pct exec 200 -- systemctl status pihole-FTL' # Pi-hole
ssh pve 'pct exec 202 -- systemctl status traefik' # Traefik
# VM services (via QEMU agent)
ssh pve 'qm guest exec 100 -- bash -c "systemctl status nfs-server smbd"' # TrueNAS
```
### Log Review
**Check for errors in critical logs**:
```bash
# Proxmox system logs
ssh pve 'journalctl -p err -b | tail -50'
ssh pve2 'journalctl -p err -b | tail -50'
# VM logs (if QEMU agent available)
ssh pve 'qm guest exec 100 -- bash -c "journalctl -p err --since today"'
# Traefik access logs
ssh pve 'pct exec 202 -- tail -100 /var/log/traefik/access.log'
```
### Syncthing Sync Status
**Check for sync errors**:
```bash
# Check all folder errors
for folder in documents downloads desktop movies pictures notes config; do
echo "=== $folder ==="
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
"http://127.0.0.1:8384/rest/folder/errors?folder=$folder" | jq
done
```
**See**: [SYNCTHING.md](SYNCTHING.md)
---
## Monthly Maintenance
### System Updates
#### Proxmox Updates
**Check for updates**:
```bash
ssh pve 'apt update && apt list --upgradable'
ssh pve2 'apt update && apt list --upgradable'
```
**Apply updates**:
```bash
# PVE
ssh pve 'apt update && apt dist-upgrade -y'
# PVE2
ssh pve2 'apt update && apt dist-upgrade -y'
# Reboot if kernel updated
ssh pve 'reboot'
ssh pve2 'reboot'
```
**⚠️ Important**:
- Check [Proxmox release notes](https://pve.proxmox.com/wiki/Roadmap) before major updates
- Test on PVE2 first if possible
- Ensure all VMs are backed up before updating
- Monitor VMs after reboot - some may need manual restart
#### Container Updates (LXC)
```bash
# Update all containers
ssh pve 'for ctid in 200 202 205; do pct exec $ctid -- bash -c "apt update && apt upgrade -y"; done'
```
#### VM Updates
**Update VMs individually via SSH**:
```bash
# Ubuntu/Debian VMs
ssh truenas 'apt update && apt upgrade -y'
ssh docker-host 'apt update && apt upgrade -y'
ssh fs-dev 'apt update && apt upgrade -y'
# Check if reboot required
ssh truenas '[ -f /var/run/reboot-required ] && echo "Reboot required"'
```
### ZFS Scrubs
**Schedule**: Run monthly on all pools
**PVE**:
```bash
# Start scrub on all pools
ssh pve 'zpool scrub nvme-mirror1'
ssh pve 'zpool scrub nvme-mirror2'
ssh pve 'zpool scrub rpool'
# Check scrub status
ssh pve 'zpool status | grep -A2 scrub'
```
**PVE2**:
```bash
ssh pve2 'zpool scrub nvme-mirror3'
ssh pve2 'zpool scrub local-zfs2'
ssh pve2 'zpool status | grep -A2 scrub'
```
**TrueNAS**:
```bash
# Scrub via TrueNAS web UI or SSH
ssh truenas 'zpool scrub vault'
ssh truenas 'zpool status vault | grep -A2 scrub'
```
**Automate scrubs**:
```bash
# Add to crontab (run on 1st of month at 2 AM)
0 2 1 * * /sbin/zpool scrub nvme-mirror1
0 2 1 * * /sbin/zpool scrub nvme-mirror2
0 2 1 * * /sbin/zpool scrub rpool
```
**See**: [STORAGE.md](STORAGE.md) for pool details
### SMART Tests
**Run extended SMART tests monthly**:
```bash
# TrueNAS drives (via QEMU agent)
ssh pve 'qm guest exec 100 -- bash -c "smartctl --scan | while read dev type; do smartctl -t long \$dev; done"'
# Check results after 4-8 hours
ssh pve 'qm guest exec 100 -- bash -c "smartctl --scan | while read dev type; do echo \"=== \$dev ===\"; smartctl -a \$dev | grep -E \"Model|Serial|test result|Reallocated|Current_Pending\"; done"'
# PVE drives
ssh pve 'for dev in /dev/nvme0 /dev/nvme1 /dev/sda /dev/sdb; do [ -e "$dev" ] && smartctl -t long $dev; done'
# PVE2 drives
ssh pve2 'for dev in /dev/nvme0 /dev/nvme1 /dev/sda /dev/sdb; do [ -e "$dev" ] && smartctl -t long $dev; done'
```
**Automate SMART tests**:
```bash
# Add to crontab (run on 15th of month at 3 AM)
0 3 15 * * /usr/sbin/smartctl -t long /dev/nvme0
0 3 15 * * /usr/sbin/smartctl -t long /dev/sda
```
### Certificate Renewal Verification
**Check SSL certificate expiry**:
```bash
# Check Traefik certificates
ssh pve 'pct exec 202 -- cat /etc/traefik/acme.json | jq ".letsencrypt.Certificates[] | {domain: .domain.main, expires: .Dates.NotAfter}"'
# Check specific service
echo | openssl s_client -servername git.htsn.io -connect git.htsn.io:443 2>/dev/null | openssl x509 -noout -dates
```
**Certificates should auto-renew 30 days before expiry via Traefik**
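To catch renewal failures early, `openssl x509 -checkend` can flag anything inside the 30-day window (a sketch; the domain list is an example, not exhaustive):
```bash
# Warn if a public certificate expires within 30 days
for host in git.htsn.io plex.htsn.io; do
  if echo | openssl s_client -servername "$host" -connect "$host:443" 2>/dev/null \
       | openssl x509 -noout -checkend $((30*24*3600)) >/dev/null; then
    echo "$host: OK (more than 30 days remaining)"
  else
    echo "$host: WARNING - expires within 30 days (or check failed)"
  fi
done
```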
**See**: [TRAEFIK.md](TRAEFIK.md) for certificate management
### Backup Verification
**⚠️ TODO**: No backup strategy currently in place
**See**: [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md) for implementation plan
---
## Quarterly Maintenance
### Full System Audit
**Check all systems comprehensively**:
1. **ZFS Pool Health**:
```bash
ssh pve 'zpool status -v'
ssh pve2 'zpool status -v'
ssh truenas 'zpool status -v vault'
```
Look for: errors, degraded vdevs, resilver operations
2. **SMART Health**:
```bash
# Run SMART health check script
~/bin/smart-health-check.sh
```
Look for: reallocated sectors, pending sectors, failures
3. **Disk Space Trends**:
```bash
# Check growth rate
ssh pve 'zpool list -o name,size,allocated,free,fragmentation'
ssh truenas 'df -h /mnt/vault'
```
Plan for expansion if >80% full (see the capacity sketch after this list)
4. **VM Resource Usage**:
```bash
# Check if VMs need more/less resources
ssh pve 'qm list'
ssh pve 'pvesh get /nodes/pve/status'
```
5. **Network Performance**:
```bash
# Test bandwidth between critical nodes
iperf3 -s # On one host
iperf3 -c 10.10.10.120 # From another
```
6. **Temperature Monitoring**:
```bash
# Check max temps over past quarter
# TODO: Set up Prometheus/Grafana for historical data
ssh pve 'sensors'
ssh pve2 'sensors'
```
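For item 3 (disk space trends), a small loop flags any pool crossing the 80% threshold (sketch; run on each node):
```bash
# Flag ZFS pools above 80% capacity
zpool list -H -o name,capacity | while read name cap; do
  pct=${cap%\%}
  [ "$pct" -gt 80 ] && echo "WARNING: pool $name is $cap full"
done
```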
### Service Dependency Testing
**Test critical paths**:
1. **Power failure recovery** (if safe to test):
- See [UPS.md](UPS.md) for full procedure
- Verify VM startup order works
- Confirm all services come back online
2. **Failover testing** (quick checks in the sketch after this list):
- Tailscale subnet routing (PVE → UCG-Fiber)
- NUT monitoring (PVE server → PVE2 client)
3. **Backup restoration** (when backups implemented):
- Test restoring a VM from backup
- Test restoring files from Syncthing versioning
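For the failover tests in item 2, two quick command-level sanity checks (a sketch; `cyberpower@pve` assumes the NUT server is reachable by that name from PVE2 - substitute its IP otherwise):
```bash
# Tailscale: confirm the subnet router on PVE is up and connected
ssh pve 'tailscale status | head -5'
# NUT: confirm the PVE2 client can still read UPS data from the PVE server
ssh pve2 'upsc cyberpower@pve ups.status'
```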
### Documentation Review
- [ ] Update IP assignments in [IP-ASSIGNMENTS.md](IP-ASSIGNMENTS.md)
- [ ] Review and update service URLs in [SERVICES.md](SERVICES.md)
- [ ] Check for missing hardware specs in [HARDWARE.md](HARDWARE.md)
- [ ] Update any changed procedures in this document
---
## Annual Maintenance
### Hardware Maintenance
**Physical cleaning**:
```bash
# Shut down servers (coordinate with users)
ssh pve 'shutdown -h now'
ssh pve2 'shutdown -h now'
# Clean dust from:
# - CPU heatsinks
# - GPU fans
# - Case fans
# - PSU vents
# - Storage enclosure fans
# Check for:
# - Bulging capacitors on PSU/motherboard
# - Loose cables
# - Fan noise/vibration
```
**Thermal paste inspection** (every 2-3 years):
- Check CPU temps vs baseline
- If temps >85°C under load, consider reapplying paste
- Threadripper PRO: Tctl max safe = 90°C
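A quick way to compare against baseline is to sample Tctl during a short synthetic load (sketch; assumes `stress-ng` is installed on the node):
```bash
# Load all cores for ~2 minutes and sample Tctl partway through
ssh pve 'stress-ng --cpu $(nproc) --timeout 120s & sleep 60; sensors | grep Tctl; wait'
```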
**See**: [HARDWARE.md](HARDWARE.md) for component details
### UPS Battery Test
**Runtime test**:
```bash
# Check battery health
ssh pve 'upsc cyberpower@localhost | grep battery'
# Perform runtime test (coordinate power loss)
# 1. Note current runtime estimate
# 2. Unplug UPS from wall
# 3. Let battery drain to 20%
# 4. Note actual runtime vs estimate
# 5. Plug back in before shutdown triggers
# Battery replacement if:
# - Runtime < 10 min at typical load
# - Battery age > 3-5 years
# - Battery charge < 100% when on AC for 24h
```
**See**: [UPS.md](UPS.md) for full UPS details
### Drive Replacement Planning
**Check drive age and health**:
```bash
# Get drive hours and health
ssh truenas 'smartctl --scan | while read dev type; do
echo "=== $dev ===";
smartctl -a $dev | grep -E "Model|Serial|Power_On_Hours|Reallocated|Pending";
done'
```
**Replace drives if**:
- Reallocated sectors > 0
- Pending sectors > 0
- SMART pre-fail warnings
- Age > 5 years for HDDs (3-5 years for SSDs/NVMe)
- Hours > 50,000 for consumer drives
**Budget for replacements**:
- HDDs: WD Red 6TB (~$150/drive)
- NVMe: Samsung/Kingston 2TB (~$150-200/drive)
### Capacity Planning
**Review growth trends**:
```bash
# Storage growth (compare to last year)
ssh pve 'zpool list'
ssh truenas 'df -h /mnt/vault'
# Network bandwidth (if monitoring in place)
# Review Grafana dashboards
# Power consumption
ssh pve 'upsc cyberpower@localhost ups.load'
```
**Plan expansions**:
- Storage: Add drives if >70% full
- RAM: Check if VMs hitting limits
- Network: Upgrade if bandwidth saturation
- UPS: Upgrade if load >80%
### License and Subscription Review
**Proxmox subscription** (if applicable):
- Community (free) or Enterprise subscription?
- Check for updates to pricing/features
**Service subscriptions**:
- Domain registration (htsn.io)
- Cloudflare plan (currently free)
- Let's Encrypt (free, no action needed)
---
## Update Schedules
### Proxmox
| Component | Frequency | Notes |
|-----------|-----------|-------|
| Security patches | Weekly | Via `apt upgrade` |
| Minor updates | Monthly | Test on PVE2 first |
| Major versions | Quarterly | Read release notes, plan downtime |
| Kernel updates | Monthly | Requires reboot |
**Update procedure**:
1. Check [Proxmox release notes](https://pve.proxmox.com/wiki/Roadmap)
2. Back up VM configs (`/etc/pve/qemu-server/`, `/etc/pve/lxc/`), or run `vzdump <vmid> --dumpdir /tmp` for a full VM backup
3. Update: `apt update && apt dist-upgrade`
4. Reboot if kernel changed: `reboot`
5. Verify VMs auto-started: `qm list`
### Containers (LXC)
| Container | Update Frequency | Package Manager |
|-----------|------------------|-----------------|
| Pi-hole (200) | Weekly | `apt` |
| Traefik (202) | Monthly | `apt` |
| FindShyt (205) | As needed | `apt` |
**Update command**:
```bash
ssh pve 'pct exec CTID -- bash -c "apt update && apt upgrade -y"'
```
### VMs
| VM | Update Frequency | Notes |
|----|------------------|-------|
| TrueNAS | Monthly | Via web UI or `apt` |
| Saltbox | Weekly | Managed by Saltbox updates |
| HomeAssistant | Monthly | Via HA supervisor |
| Docker-host | Weekly | `apt` + Docker images |
| Trading-VM | As needed | Via SSH |
| Gitea-VM | Monthly | Via web UI + `apt` |
**Docker image updates**:
```bash
ssh docker-host 'docker-compose pull && docker-compose up -d'
```
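Old image layers accumulate after repeated pulls; pruning afterwards keeps disk usage in check:
```bash
# Remove dangling image layers left behind by the update
ssh docker-host 'docker image prune -f'
```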
### Firmware Updates
| Component | Check Frequency | Update Method |
|-----------|----------------|---------------|
| Motherboard BIOS | Annually | Manual flash (high risk) |
| GPU firmware | Rarely | `nvidia-smi` or manual |
| SSD/NVMe firmware | Quarterly | Vendor tools |
| HBA firmware | Annually | LSI tools |
| UPS firmware | Annually | PowerPanel or manual |
**⚠️ Warning**: BIOS/firmware updates carry risk. Only update if:
- Critical security issue
- Needed for hardware compatibility
- Fixing known bug affecting you
---
## Testing Checklists
### Pre-Update Checklist
Before ANY system update:
- [ ] Check current system state: `uptime`, `qm list`, `zpool status` (see the snapshot sketch after this list)
- [ ] Verify backups are current (when backup system in place)
- [ ] Check for critical VMs/services that can't have downtime
- [ ] Review update changelog/release notes
- [ ] Test on non-critical system first (PVE2 or test VM)
- [ ] Plan rollback strategy if update fails
- [ ] Notify users if downtime expected
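To capture the "current system state" from the first item in a form that is easy to diff later, a small snapshot can be taken before every update (sketch; the output directory is an assumption):
```bash
# Save a pre-update state snapshot per node for post-update comparison
STAMP=$(date +%Y%m%d-%H%M)
mkdir -p ~/pre-update-state
for node in pve pve2; do
  ssh "$node" 'uptime; echo; qm list; echo; pct list; echo; zpool status -x' \
    > ~/pre-update-state/"$node-$STAMP.txt"
done
```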
### Post-Update Checklist
After system update:
- [ ] Verify system booted correctly: `uptime`
- [ ] Check all VMs/CTs started: `qm list`, `pct list`
- [ ] Test critical services:
- [ ] Pi-hole DNS: `nslookup google.com 10.10.10.10`
- [ ] Traefik routing: `curl -I https://plex.htsn.io`
- [ ] NFS/SMB shares: Test mount from VM
- [ ] Syncthing sync: Check all devices connected
- [ ] Review logs for errors: `journalctl -p err -b`
- [ ] Check temperatures: `sensors`
- [ ] Verify UPS monitoring: `upsc cyberpower@localhost`
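The commands above can be collected into a single pass (sketch; IPs and URLs are the ones used in this checklist):
```bash
# One-shot post-update verification
ssh pve 'uptime; qm list; pct list'
nslookup google.com 10.10.10.10                  # Pi-hole DNS
curl -sI https://plex.htsn.io | head -1          # Traefik routing
ssh pve 'journalctl -p err -b | tail -20'        # recent errors
ssh pve 'sensors | grep Tctl'                    # temperatures
ssh pve 'upsc cyberpower@localhost ups.status'   # UPS monitoring
```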
### Disaster Recovery Test
**Quarterly test** (when backup system in place):
- [ ] Simulate VM failure: Restore from backup
- [ ] Simulate storage failure: Import pool on different system
- [ ] Simulate network failure: Verify Tailscale failover
- [ ] Simulate power failure: Test UPS shutdown procedure (if safe)
- [ ] Document recovery time and issues
---
## Log Rotation
**System logs** are automatically rotated by systemd-journald and logrotate.
**Check log sizes**:
```bash
# Journalctl size
ssh pve 'journalctl --disk-usage'
# Traefik logs
ssh pve 'pct exec 202 -- du -sh /var/log/traefik/'
```
**Configure retention**:
```bash
# Limit journald to 500MB
ssh pve 'echo "SystemMaxUse=500M" >> /etc/systemd/journald.conf'
ssh pve 'systemctl restart systemd-journald'
```
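The new limit is enforced going forward; to reclaim existing journal space immediately, a vacuum can be run as well:
```bash
# Trim archived journals down to the new 500MB cap right away
ssh pve 'journalctl --vacuum-size=500M'
```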
**Traefik log rotation** (already configured):
```bash
# /etc/logrotate.d/traefik on CT 202
/var/log/traefik/*.log {
daily
rotate 7
compress
delaycompress
missingok
notifempty
}
```
---
## Monitoring Integration
**TODO**: Set up automated monitoring for these procedures
**When monitoring is implemented** (see [MONITORING.md](MONITORING.md)):
- ZFS scrub completion/errors
- SMART test failures
- Certificate expiry warnings (<30 days)
- Update availability notifications
- Disk space thresholds (>80%)
- Temperature warnings (>85°C)
---
## Related Documentation
- [MONITORING.md](MONITORING.md) - Automated health checks and alerts
- [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md) - Backup implementation plan
- [UPS.md](UPS.md) - Power failure procedures
- [STORAGE.md](STORAGE.md) - ZFS pool management
- [HARDWARE.md](HARDWARE.md) - Hardware specifications
- [SERVICES.md](SERVICES.md) - Service inventory
---
**Last Updated**: 2025-12-22
**Status**: ⚠️ Manual procedures only - monitoring automation needed