Complete Phase 2 documentation: Add HARDWARE, SERVICES, MONITORING, MAINTENANCE

Phase 2 documentation implementation:
- Created HARDWARE.md: Complete hardware inventory (servers, GPUs, storage, network cards)
- Created SERVICES.md: Service inventory with URLs, credentials, health checks (25+ services)
- Created MONITORING.md: Health monitoring recommendations, alert setup, implementation plan
- Created MAINTENANCE.md: Regular procedures, update schedules, testing checklists
- Updated README.md: Added all Phase 2 documentation links
- Updated CLAUDE.md: Cleaned up to quick reference only (1340→377 lines)

All detailed content now in specialized documentation files with cross-references.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Hutson committed on 2025-12-23 00:34:21 -05:00
parent 23e9df68c9
commit 56b82df497
14 changed files with 6328 additions and 1036 deletions

MAINTENANCE.md (new file)
@@ -0,0 +1,618 @@
# Maintenance Procedures and Schedules
Regular maintenance procedures for homelab infrastructure to ensure reliability and performance.
## Overview
| Frequency | Tasks | Estimated Time |
|-----------|-------|----------------|
| **Daily** | Quick health check | 2-5 min |
| **Weekly** | Service status, logs review | 15-30 min |
| **Monthly** | Updates, backups verification | 1-2 hours |
| **Quarterly** | Full system audit, testing | 2-4 hours |
| **Annual** | Hardware maintenance, planning | 4-8 hours |
---
## Daily Maintenance (Automated)
### Quick Health Check Script
Save as `~/bin/homelab-health-check.sh`:
```bash
#!/bin/bash
# Daily homelab health check
echo "=== Homelab Health Check ==="
echo "Date: $(date)"
echo ""
echo "=== Server Status ==="
ssh pve 'uptime' 2>/dev/null || echo "PVE: UNREACHABLE"
ssh pve2 'uptime' 2>/dev/null || echo "PVE2: UNREACHABLE"
echo ""
echo "=== CPU Temperatures ==="
ssh pve 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE: $(($(cat $f)/1000))°C"; fi; done'
ssh pve2 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE2: $(($(cat $f)/1000))°C"; fi; done'
echo ""
echo "=== UPS Status ==="
ssh pve 'upsc cyberpower@localhost | grep -E "battery.charge:|battery.runtime:|ups.load:|ups.status:"'
echo ""
echo "=== ZFS Pools ==="
ssh pve 'zpool status -x' 2>/dev/null
ssh pve2 'zpool status -x' 2>/dev/null
ssh truenas 'zpool status -x vault'
echo ""
echo "=== Disk Space ==="
ssh pve 'df -h | grep -E "Filesystem|/dev/(nvme|sd)"'
ssh truenas 'df -h /mnt/vault'
echo ""
echo "=== VM Status ==="
ssh pve 'qm list | grep running | wc -l' | xargs echo "PVE VMs running:"
ssh pve2 'qm list | grep running | wc -l' | xargs echo "PVE2 VMs running:"
echo ""
echo "=== Syncthing Connections ==="
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
"http://127.0.0.1:8384/rest/system/connections" | \
python3 -c "import sys,json; d=json.load(sys.stdin)['connections']; \
[print(f\"{v.get('name',k[:7])}: {'UP' if v['connected'] else 'DOWN'}\") for k,v in d.items()]"
echo ""
echo "=== Check Complete ==="
```
**Run daily via cron**:
```bash
# Add to crontab
0 9 * * * ~/bin/homelab-health-check.sh | mail -s "Homelab Health Check" hutson@example.com
```
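If no local MTA is configured for `mail`, writing to a log file is a simpler target (sketch; the log path is an assumption):
```bash
# Alternative: append the daily output to a log file instead of mailing it
# (create ~/logs first)
0 9 * * * ~/bin/homelab-health-check.sh >> ~/logs/homelab-health.log 2>&1
```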
---
## Weekly Maintenance
### Service Status Review
**Check all critical services**:
```bash
# Proxmox services
ssh pve 'systemctl status pve-cluster pvedaemon pveproxy'
ssh pve2 'systemctl status pve-cluster pvedaemon pveproxy'
# NUT (UPS monitoring)
ssh pve 'systemctl status nut-server nut-monitor'
ssh pve2 'systemctl status nut-monitor'
# Container services
ssh pve 'pct exec 200 -- systemctl status pihole-FTL' # Pi-hole
ssh pve 'pct exec 202 -- systemctl status traefik' # Traefik
# VM services (via QEMU agent)
ssh pve 'qm guest exec 100 -- bash -c "systemctl status nfs-server smbd"' # TrueNAS
```
### Log Review
**Check for errors in critical logs**:
```bash
# Proxmox system logs
ssh pve 'journalctl -p err -b | tail -50'
ssh pve2 'journalctl -p err -b | tail -50'
# VM logs (if QEMU agent available)
ssh pve 'qm guest exec 100 -- bash -c "journalctl -p err --since today"'
# Traefik access logs
ssh pve 'pct exec 202 -- tail -100 /var/log/traefik/access.log'
```
### Syncthing Sync Status
**Check for sync errors**:
```bash
# Check all folder errors
for folder in documents downloads desktop movies pictures notes config; do
echo "=== $folder ==="
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
"http://127.0.0.1:8384/rest/folder/errors?folder=$folder" | jq
done
```
**See**: [SYNCTHING.md](SYNCTHING.md)
---
## Monthly Maintenance
### System Updates
#### Proxmox Updates
**Check for updates**:
```bash
ssh pve 'apt update && apt list --upgradable'
ssh pve2 'apt update && apt list --upgradable'
```
**Apply updates**:
```bash
# PVE
ssh pve 'apt update && apt dist-upgrade -y'
# PVE2
ssh pve2 'apt update && apt dist-upgrade -y'
# Reboot if kernel updated
ssh pve 'reboot'
ssh pve2 'reboot'
```
**⚠️ Important**:
- Check [Proxmox release notes](https://pve.proxmox.com/wiki/Roadmap) before major updates
- Test on PVE2 first if possible
- Ensure all VMs are backed up before updating
- Monitor VMs after reboot - some may need manual restart
#### Container Updates (LXC)
```bash
# Update all containers
ssh pve 'for ctid in 200 202 205; do pct exec $ctid -- bash -c "apt update && apt upgrade -y"; done'
```
#### VM Updates
**Update VMs individually via SSH**:
```bash
# Ubuntu/Debian VMs
ssh truenas 'apt update && apt upgrade -y'
ssh docker-host 'apt update && apt upgrade -y'
ssh fs-dev 'apt update && apt upgrade -y'
# Check if reboot required
ssh truenas '[ -f /var/run/reboot-required ] && echo "Reboot required"'
```
### ZFS Scrubs
**Schedule**: Run monthly on all pools
**PVE**:
```bash
# Start scrub on all pools
ssh pve 'zpool scrub nvme-mirror1'
ssh pve 'zpool scrub nvme-mirror2'
ssh pve 'zpool scrub rpool'
# Check scrub status
ssh pve 'zpool status | grep -A2 scrub'
```
**PVE2**:
```bash
ssh pve2 'zpool scrub nvme-mirror3'
ssh pve2 'zpool scrub local-zfs2'
ssh pve2 'zpool status | grep -A2 scrub'
```
**TrueNAS**:
```bash
# Scrub via TrueNAS web UI or SSH
ssh truenas 'zpool scrub vault'
ssh truenas 'zpool status vault | grep -A2 scrub'
```
**Automate scrubs**:
```bash
# Add to crontab (run on 1st of month at 2 AM)
0 2 1 * * /sbin/zpool scrub nvme-mirror1
0 2 1 * * /sbin/zpool scrub nvme-mirror2
0 2 1 * * /sbin/zpool scrub rpool
```
**See**: [STORAGE.md](STORAGE.md) for pool details
### SMART Tests
**Run extended SMART tests monthly**:
```bash
# TrueNAS drives (via QEMU agent)
ssh pve 'qm guest exec 100 -- bash -c "smartctl --scan | while read dev type; do smartctl -t long \$dev; done"'
# Check results after 4-8 hours
ssh pve 'qm guest exec 100 -- bash -c "smartctl --scan | while read dev type; do echo \"=== \$dev ===\"; smartctl -a \$dev | grep -E \"Model|Serial|test result|Reallocated|Current_Pending\"; done"'
# PVE drives
ssh pve 'for dev in /dev/nvme0 /dev/nvme1 /dev/sda /dev/sdb; do [ -e "$dev" ] && smartctl -t long $dev; done'
# PVE2 drives
ssh pve2 'for dev in /dev/nvme0 /dev/nvme1 /dev/sda /dev/sdb; do [ -e "$dev" ] && smartctl -t long $dev; done'
```
**Automate SMART tests**:
```bash
# Add to crontab (run on 15th of month at 3 AM)
0 3 15 * * /usr/sbin/smartctl -t long /dev/nvme0
0 3 15 * * /usr/sbin/smartctl -t long /dev/sda
```
### Certificate Renewal Verification
**Check SSL certificate expiry**:
```bash
# Check Traefik certificates
ssh pve 'pct exec 202 -- cat /etc/traefik/acme.json | jq ".letsencrypt.Certificates[] | {domain: .domain.main, expires: .Dates.NotAfter}"'
# Check specific service
echo | openssl s_client -servername git.htsn.io -connect git.htsn.io:443 2>/dev/null | openssl x509 -noout -dates
```
**Certificates should auto-renew 30 days before expiry via Traefik**
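To catch renewal failures early, `openssl x509 -checkend` can flag anything inside the 30-day window (a sketch; the domain list is an example, not exhaustive):
```bash
# Warn if a public certificate expires within 30 days
for host in git.htsn.io plex.htsn.io; do
  if echo | openssl s_client -servername "$host" -connect "$host:443" 2>/dev/null \
       | openssl x509 -noout -checkend $((30*24*3600)) >/dev/null; then
    echo "$host: OK (more than 30 days remaining)"
  else
    echo "$host: WARNING - expires within 30 days (or check failed)"
  fi
done
```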
**See**: [TRAEFIK.md](TRAEFIK.md) for certificate management
### Backup Verification
**⚠️ TODO**: No backup strategy currently in place
**See**: [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md) for implementation plan
---
## Quarterly Maintenance
### Full System Audit
**Check all systems comprehensively**:
1. **ZFS Pool Health**:
```bash
ssh pve 'zpool status -v'
ssh pve2 'zpool status -v'
ssh truenas 'zpool status -v vault'
```
Look for: errors, degraded vdevs, resilver operations
2. **SMART Health**:
```bash
# Run SMART health check script
~/bin/smart-health-check.sh
```
Look for: reallocated sectors, pending sectors, failures
3. **Disk Space Trends**:
```bash
# Check growth rate
ssh pve 'zpool list -o name,size,allocated,free,fragmentation'
ssh truenas 'df -h /mnt/vault'
```
Plan for expansion if >80% full (see the capacity sketch after this list)
4. **VM Resource Usage**:
```bash
# Check if VMs need more/less resources
ssh pve 'qm list'
ssh pve 'pvesh get /nodes/pve/status'
```
5. **Network Performance**:
```bash
# Test bandwidth between critical nodes
iperf3 -s # On one host
iperf3 -c 10.10.10.120 # From another
```
6. **Temperature Monitoring**:
```bash
# Check max temps over past quarter
# TODO: Set up Prometheus/Grafana for historical data
ssh pve 'sensors'
ssh pve2 'sensors'
```
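For item 3 (disk space trends), a small loop flags any pool crossing the 80% threshold (sketch; run on each node):
```bash
# Flag ZFS pools above 80% capacity
zpool list -H -o name,capacity | while read name cap; do
  pct=${cap%\%}
  [ "$pct" -gt 80 ] && echo "WARNING: pool $name is $cap full"
done
```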
### Service Dependency Testing
**Test critical paths**:
1. **Power failure recovery** (if safe to test):
- See [UPS.md](UPS.md) for full procedure
- Verify VM startup order works
- Confirm all services come back online
2. **Failover testing** (quick checks in the sketch after this list):
- Tailscale subnet routing (PVE → UCG-Fiber)
- NUT monitoring (PVE server → PVE2 client)
3. **Backup restoration** (when backups implemented):
- Test restoring a VM from backup
- Test restoring files from Syncthing versioning
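For the failover tests in item 2, two quick command-level sanity checks (a sketch; `cyberpower@pve` assumes the NUT server is reachable by that name from PVE2 - substitute its IP otherwise):
```bash
# Tailscale: confirm the subnet router on PVE is up and connected
ssh pve 'tailscale status | head -5'
# NUT: confirm the PVE2 client can still read UPS data from the PVE server
ssh pve2 'upsc cyberpower@pve ups.status'
```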
### Documentation Review
- [ ] Update IP assignments in [IP-ASSIGNMENTS.md](IP-ASSIGNMENTS.md)
- [ ] Review and update service URLs in [SERVICES.md](SERVICES.md)
- [ ] Check for missing hardware specs in [HARDWARE.md](HARDWARE.md)
- [ ] Update any changed procedures in this document
---
## Annual Maintenance
### Hardware Maintenance
**Physical cleaning**:
```bash
# Shut down servers (coordinate with users)
ssh pve 'shutdown -h now'
ssh pve2 'shutdown -h now'
# Clean dust from:
# - CPU heatsinks
# - GPU fans
# - Case fans
# - PSU vents
# - Storage enclosure fans
# Check for:
# - Bulging capacitors on PSU/motherboard
# - Loose cables
# - Fan noise/vibration
```
**Thermal paste inspection** (every 2-3 years):
- Check CPU temps vs baseline
- If temps >85°C under load, consider reapplying paste
- Threadripper PRO: Tctl max safe = 90°C
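A quick way to compare against baseline is to sample Tctl during a short synthetic load (sketch; assumes `stress-ng` is installed on the node):
```bash
# Load all cores for ~2 minutes and sample Tctl partway through
ssh pve 'stress-ng --cpu $(nproc) --timeout 120s & sleep 60; sensors | grep Tctl; wait'
```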
**See**: [HARDWARE.md](HARDWARE.md) for component details
### UPS Battery Test
**Runtime test**:
```bash
# Check battery health
ssh pve 'upsc cyberpower@localhost | grep battery'
# Perform runtime test (coordinate power loss)
# 1. Note current runtime estimate
# 2. Unplug UPS from wall
# 3. Let battery drain to 20%
# 4. Note actual runtime vs estimate
# 5. Plug back in before shutdown triggers
# Battery replacement if:
# - Runtime < 10 min at typical load
# - Battery age > 3-5 years
# - Battery charge < 100% when on AC for 24h
```
**See**: [UPS.md](UPS.md) for full UPS details
### Drive Replacement Planning
**Check drive age and health**:
```bash
# Get drive hours and health
ssh truenas 'smartctl --scan | while read dev type; do
echo "=== $dev ===";
smartctl -a $dev | grep -E "Model|Serial|Power_On_Hours|Reallocated|Pending";
done'
```
**Replace drives if**:
- Reallocated sectors > 0
- Pending sectors > 0
- SMART pre-fail warnings
- Age > 5 years for HDDs (3-5 years for SSDs/NVMe)
- Hours > 50,000 for consumer drives
**Budget for replacements**:
- HDDs: WD Red 6TB (~$150/drive)
- NVMe: Samsung/Kingston 2TB (~$150-200/drive)
### Capacity Planning
**Review growth trends**:
```bash
# Storage growth (compare to last year)
ssh pve 'zpool list'
ssh truenas 'df -h /mnt/vault'
# Network bandwidth (if monitoring in place)
# Review Grafana dashboards
# Power consumption
ssh pve 'upsc cyberpower@localhost ups.load'
```
**Plan expansions**:
- Storage: Add drives if >70% full
- RAM: Check if VMs hitting limits
- Network: Upgrade if bandwidth saturation
- UPS: Upgrade if load >80%
### License and Subscription Review
**Proxmox subscription** (if applicable):
- Community (free) or Enterprise subscription?
- Check for updates to pricing/features
**Service subscriptions**:
- Domain registration (htsn.io)
- Cloudflare plan (currently free)
- Let's Encrypt (free, no action needed)
---
## Update Schedules
### Proxmox
| Component | Frequency | Notes |
|-----------|-----------|-------|
| Security patches | Weekly | Via `apt upgrade` |
| Minor updates | Monthly | Test on PVE2 first |
| Major versions | Quarterly | Read release notes, plan downtime |
| Kernel updates | Monthly | Requires reboot |
**Update procedure**:
1. Check [Proxmox release notes](https://pve.proxmox.com/wiki/Roadmap)
2. Back up VM configs (`/etc/pve/qemu-server/`, `/etc/pve/lxc/`), or run `vzdump <vmid> --dumpdir /tmp` for a full VM backup
3. Update: `apt update && apt dist-upgrade`
4. Reboot if kernel changed: `reboot`
5. Verify VMs auto-started: `qm list`
### Containers (LXC)
| Container | Update Frequency | Package Manager |
|-----------|------------------|-----------------|
| Pi-hole (200) | Weekly | `apt` |
| Traefik (202) | Monthly | `apt` |
| FindShyt (205) | As needed | `apt` |
**Update command**:
```bash
ssh pve 'pct exec CTID -- bash -c "apt update && apt upgrade -y"'
```
### VMs
| VM | Update Frequency | Notes |
|----|------------------|-------|
| TrueNAS | Monthly | Via web UI or `apt` |
| Saltbox | Weekly | Managed by Saltbox updates |
| HomeAssistant | Monthly | Via HA supervisor |
| Docker-host | Weekly | `apt` + Docker images |
| Trading-VM | As needed | Via SSH |
| Gitea-VM | Monthly | Via web UI + `apt` |
**Docker image updates**:
```bash
ssh docker-host 'docker-compose pull && docker-compose up -d'
```
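Old image layers accumulate after repeated pulls; pruning afterwards keeps disk usage in check:
```bash
# Remove dangling image layers left behind by the update
ssh docker-host 'docker image prune -f'
```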
### Firmware Updates
| Component | Check Frequency | Update Method |
|-----------|----------------|---------------|
| Motherboard BIOS | Annually | Manual flash (high risk) |
| GPU firmware | Rarely | `nvidia-smi` or manual |
| SSD/NVMe firmware | Quarterly | Vendor tools |
| HBA firmware | Annually | LSI tools |
| UPS firmware | Annually | PowerPanel or manual |
**⚠️ Warning**: BIOS/firmware updates carry risk. Only update if:
- Critical security issue
- Needed for hardware compatibility
- Fixing known bug affecting you
---
## Testing Checklists
### Pre-Update Checklist
Before ANY system update:
- [ ] Check current system state: `uptime`, `qm list`, `zpool status` (see the snapshot sketch after this list)
- [ ] Verify backups are current (when backup system in place)
- [ ] Check for critical VMs/services that can't have downtime
- [ ] Review update changelog/release notes
- [ ] Test on non-critical system first (PVE2 or test VM)
- [ ] Plan rollback strategy if update fails
- [ ] Notify users if downtime expected
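To capture the "current system state" from the first item in a form that is easy to diff later, a small snapshot can be taken before every update (sketch; the output directory is an assumption):
```bash
# Save a pre-update state snapshot per node for post-update comparison
STAMP=$(date +%Y%m%d-%H%M)
mkdir -p ~/pre-update-state
for node in pve pve2; do
  ssh "$node" 'uptime; echo; qm list; echo; pct list; echo; zpool status -x' \
    > ~/pre-update-state/"$node-$STAMP.txt"
done
```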
### Post-Update Checklist
After system update:
- [ ] Verify system booted correctly: `uptime`
- [ ] Check all VMs/CTs started: `qm list`, `pct list`
- [ ] Test critical services:
- [ ] Pi-hole DNS: `nslookup google.com 10.10.10.10`
- [ ] Traefik routing: `curl -I https://plex.htsn.io`
- [ ] NFS/SMB shares: Test mount from VM
- [ ] Syncthing sync: Check all devices connected
- [ ] Review logs for errors: `journalctl -p err -b`
- [ ] Check temperatures: `sensors`
- [ ] Verify UPS monitoring: `upsc cyberpower@localhost`
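The commands above can be collected into a single pass (sketch; IPs and URLs are the ones used in this checklist):
```bash
# One-shot post-update verification
ssh pve 'uptime; qm list; pct list'
nslookup google.com 10.10.10.10                  # Pi-hole DNS
curl -sI https://plex.htsn.io | head -1          # Traefik routing
ssh pve 'journalctl -p err -b | tail -20'        # recent errors
ssh pve 'sensors | grep Tctl'                    # temperatures
ssh pve 'upsc cyberpower@localhost ups.status'   # UPS monitoring
```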
### Disaster Recovery Test
**Quarterly test** (when backup system in place):
- [ ] Simulate VM failure: Restore from backup
- [ ] Simulate storage failure: Import pool on different system
- [ ] Simulate network failure: Verify Tailscale failover
- [ ] Simulate power failure: Test UPS shutdown procedure (if safe)
- [ ] Document recovery time and issues
---
## Log Rotation
**System logs** are automatically rotated by systemd-journald and logrotate.
**Check log sizes**:
```bash
# Journalctl size
ssh pve 'journalctl --disk-usage'
# Traefik logs
ssh pve 'pct exec 202 -- du -sh /var/log/traefik/'
```
**Configure retention**:
```bash
# Limit journald to 500MB
ssh pve 'echo "SystemMaxUse=500M" >> /etc/systemd/journald.conf'
ssh pve 'systemctl restart systemd-journald'
```
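The new limit is enforced going forward; to reclaim existing journal space immediately, a vacuum can be run as well:
```bash
# Trim archived journals down to the new 500MB cap right away
ssh pve 'journalctl --vacuum-size=500M'
```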
**Traefik log rotation** (already configured):
```bash
# /etc/logrotate.d/traefik on CT 202
/var/log/traefik/*.log {
daily
rotate 7
compress
delaycompress
missingok
notifempty
}
```
---
## Monitoring Integration
**TODO**: Set up automated monitoring for these procedures
**When monitoring is implemented** (see [MONITORING.md](MONITORING.md)):
- ZFS scrub completion/errors
- SMART test failures
- Certificate expiry warnings (<30 days)
- Update availability notifications
- Disk space thresholds (>80%)
- Temperature warnings (>85°C)
---
## Related Documentation
- [MONITORING.md](MONITORING.md) - Automated health checks and alerts
- [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md) - Backup implementation plan
- [UPS.md](UPS.md) - Power failure procedures
- [STORAGE.md](STORAGE.md) - ZFS pool management
- [HARDWARE.md](HARDWARE.md) - Hardware specifications
- [SERVICES.md](SERVICES.md) - Service inventory
---
**Last Updated**: 2025-12-22
**Status**: ⚠️ Manual procedures only - monitoring automation needed