# Maintenance Procedures and Schedules

Regular maintenance procedures for the homelab infrastructure, to ensure reliability and performance.

## Overview
| Frequency | Tasks | Estimated Time |
|---|---|---|
| Daily | Quick health check | 2-5 min |
| Weekly | Service status, logs review | 15-30 min |
| Monthly | Updates, backups verification | 1-2 hours |
| Quarterly | Full system audit, testing | 2-4 hours |
| Annual | Hardware maintenance, planning | 4-8 hours |
## Daily Maintenance (Automated)

### Quick Health Check Script

Save as `~/bin/homelab-health-check.sh`:
```bash
#!/bin/bash
# Daily homelab health check

echo "=== Homelab Health Check ==="
echo "Date: $(date)"
echo ""

echo "=== Server Status ==="
ssh pve 'uptime' 2>/dev/null || echo "PVE: UNREACHABLE"
ssh pve2 'uptime' 2>/dev/null || echo "PVE2: UNREACHABLE"
echo ""

echo "=== CPU Temperatures ==="
ssh pve 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE: $(($(cat $f)/1000))°C"; fi; done'
ssh pve2 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE2: $(($(cat $f)/1000))°C"; fi; done'
echo ""

echo "=== UPS Status ==="
ssh pve 'upsc cyberpower@localhost | grep -E "battery.charge:|battery.runtime:|ups.load:|ups.status:"'
echo ""

echo "=== ZFS Pools ==="
ssh pve 'zpool status -x' 2>/dev/null
ssh pve2 'zpool status -x' 2>/dev/null
ssh truenas 'zpool status -x vault'
echo ""

echo "=== Disk Space ==="
ssh pve 'df -h | grep -E "Filesystem|/dev/(nvme|sd)"'
ssh truenas 'df -h /mnt/vault'
echo ""

echo "=== VM Status ==="
ssh pve 'qm list | grep running | wc -l' | xargs echo "PVE VMs running:"
ssh pve2 'qm list | grep running | wc -l' | xargs echo "PVE2 VMs running:"
echo ""

echo "=== Syncthing Connections ==="
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
  "http://127.0.0.1:8384/rest/system/connections" | \
  python3 -c "import sys,json; d=json.load(sys.stdin)['connections']; \
[print(f\"{v.get('name',k[:7])}: {'UP' if v['connected'] else 'DOWN'}\") for k,v in d.items()]"
echo ""

echo "=== Check Complete ==="
```
Run daily via cron:

```bash
# Add to crontab
0 9 * * * ~/bin/homelab-health-check.sh | mail -s "Homelab Health Check" hutson@example.com
```
## Weekly Maintenance

### Service Status Review

Check all critical services:
```bash
# Proxmox services
ssh pve 'systemctl status pve-cluster pvedaemon pveproxy'
ssh pve2 'systemctl status pve-cluster pvedaemon pveproxy'

# NUT (UPS monitoring)
ssh pve 'systemctl status nut-server nut-monitor'
ssh pve2 'systemctl status nut-monitor'

# Container services
ssh pve 'pct exec 200 -- systemctl status pihole-FTL'   # Pi-hole
ssh pve 'pct exec 202 -- systemctl status traefik'      # Traefik

# VM services (via QEMU agent)
ssh pve 'qm guest exec 100 -- bash -c "systemctl status nfs-server smbd"'   # TrueNAS
```
### Log Review

Check for errors in critical logs:
```bash
# Proxmox system logs
ssh pve 'journalctl -p err -b | tail -50'
ssh pve2 'journalctl -p err -b | tail -50'

# VM logs (if QEMU agent available)
ssh pve 'qm guest exec 100 -- bash -c "journalctl -p err --since today"'

# Traefik access logs
ssh pve 'pct exec 202 -- tail -100 /var/log/traefik/access.log'
```
### Syncthing Sync Status

Check for sync errors:
```bash
# Check all folder errors
for folder in documents downloads desktop movies pictures notes config; do
  echo "=== $folder ==="
  curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
    "http://127.0.0.1:8384/rest/folder/errors?folder=$folder" | jq
done
```
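Folder state is also worth a glance alongside the error list. A minimal sketch against the `/rest/db/status` endpoint, assuming the same local instance and API key as above (`state` and `needFiles` are per-folder fields the Syncthing REST API reports):

```bash
# Summarize sync state and outstanding files per folder.
for folder in documents downloads desktop movies pictures notes config; do
  curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
    "http://127.0.0.1:8384/rest/db/status?folder=$folder" | \
    jq -r --arg f "$folder" '"\($f): \(.state), \(.needFiles) files needed"'
done
```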
See: SYNCTHING.md
## Monthly Maintenance

### System Updates

#### Proxmox Updates

Check for updates:
```bash
ssh pve 'apt update && apt list --upgradable'
ssh pve2 'apt update && apt list --upgradable'
```
Apply updates:
```bash
# PVE
ssh pve 'apt update && apt dist-upgrade -y'

# PVE2
ssh pve2 'apt update && apt dist-upgrade -y'

# Reboot if kernel updated
ssh pve 'reboot'
ssh pve2 'reboot'
```
⚠️ **Important:**

- Check Proxmox release notes before major updates (a pre-flight sketch follows this list)
- Test on PVE2 first if possible
- Ensure all VMs are backed up before updating
- Monitor VMs after reboot; some may need a manual restart
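Several of these checks can be scripted as a pre-flight gate before the upgrade runs. A minimal sketch, assuming the "all pools are healthy" wording of `zpool status -x` and a clustered node (the quorum check is skipped on hosts without `pvecm`):

```bash
#!/bin/bash
# Pre-flight gate: refuse to upgrade if storage or cluster quorum looks wrong.
set -e
if ! zpool status -x | grep -q "all pools are healthy"; then
  echo "ABORT: ZFS pool problem detected"
  zpool status -x
  exit 1
fi
if command -v pvecm >/dev/null && ! pvecm status | grep -q "Quorate:.*Yes"; then
  echo "ABORT: cluster is not quorate"
  exit 1
fi
apt update && apt dist-upgrade -y
```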
#### Container Updates (LXC)

```bash
# Update all containers
ssh pve 'for ctid in 200 202 205; do pct exec $ctid -- bash -c "apt update && apt upgrade -y"; done'
```
#### VM Updates

Update VMs individually via SSH:
```bash
# Ubuntu/Debian VMs
ssh truenas 'apt update && apt upgrade -y'
ssh docker-host 'apt update && apt upgrade -y'
ssh fs-dev 'apt update && apt upgrade -y'

# Check if reboot required
ssh truenas '[ -f /var/run/reboot-required ] && echo "Reboot required"'
```
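To sweep the reboot check across every Debian-based VM in one pass, a small loop works; the host list here is an assumption taken from the commands above, so extend it as VMs are added:

```bash
# Report which VMs still need a reboot after updates.
for host in truenas docker-host fs-dev; do
  ssh "$host" '[ -f /var/run/reboot-required ] && echo "$(hostname): reboot required"' 2>/dev/null
done
```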
### ZFS Scrubs

**Schedule:** Run monthly on all pools.

**PVE:**
```bash
# Start scrub on all pools
ssh pve 'zpool scrub nvme-mirror1'
ssh pve 'zpool scrub nvme-mirror2'
ssh pve 'zpool scrub rpool'

# Check scrub status
ssh pve 'zpool status | grep -A2 scrub'
```
**PVE2:**
```bash
ssh pve2 'zpool scrub nvme-mirror3'
ssh pve2 'zpool scrub local-zfs2'
ssh pve2 'zpool status | grep -A2 scrub'
```
**TrueNAS:**
```bash
# Scrub via TrueNAS web UI or SSH
ssh truenas 'zpool scrub vault'
ssh truenas 'zpool status vault | grep -A2 scrub'
```
Automate scrubs:
```bash
# Add to crontab (run on 1st of month at 2 AM)
0 2 1 * * /sbin/zpool scrub nvme-mirror1
0 2 1 * * /sbin/zpool scrub nvme-mirror2
0 2 1 * * /sbin/zpool scrub rpool
```
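Scrub results still have to be read by someone. A small checker, scheduled a day or two after the scrub cron, can mail only when something is wrong; this sketch assumes the "all pools are healthy" wording of `zpool status -x` and reuses the example address from the daily check:

```bash
#!/bin/bash
# Mail an alert only if zpool reports a problem on this host.
STATUS=$(zpool status -x)
if [ "$STATUS" != "all pools are healthy" ]; then
  echo "$STATUS" | mail -s "ZFS ALERT on $(hostname)" hutson@example.com
fi
```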
See: STORAGE.md for pool details
### SMART Tests

Run extended SMART tests monthly:
```bash
# TrueNAS drives (via QEMU agent)
ssh pve 'qm guest exec 100 -- bash -c "smartctl --scan | while read dev type; do smartctl -t long \$dev; done"'

# Check results after 4-8 hours
ssh pve 'qm guest exec 100 -- bash -c "smartctl --scan | while read dev type; do echo \"=== \$dev ===\"; smartctl -a \$dev | grep -E \"Model|Serial|test result|Reallocated|Current_Pending\"; done"'

# PVE drives
ssh pve 'for dev in /dev/nvme0 /dev/nvme1 /dev/sda /dev/sdb; do [ -e "$dev" ] && smartctl -t long $dev; done'

# PVE2 drives
ssh pve2 'for dev in /dev/nvme0 /dev/nvme1 /dev/sda /dev/sdb; do [ -e "$dev" ] && smartctl -t long $dev; done'
```
Automate SMART tests:
```bash
# Add to crontab (run on 15th of month at 3 AM)
0 3 15 * * /usr/sbin/smartctl -t long /dev/nvme0
0 3 15 * * /usr/sbin/smartctl -t long /dev/sda
```
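Checking results by hand is easy to forget. A sketch that flags any drive whose overall SMART assessment is not healthy (device discovery via `smartctl --scan`; run as root on each host):

```bash
#!/bin/bash
# Flag drives whose SMART overall-health assessment is not PASSED/OK.
smartctl --scan | while read -r dev _; do
  result=$(smartctl -H "$dev" | grep -iE "test result|health status")
  case "$result" in
    *PASSED*|*OK*) : ;;                         # healthy
    "") echo "$dev: no health result found" ;;
    *)  echo "ALERT $dev: $result" ;;
  esac
done
```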
### Certificate Renewal Verification

Check SSL certificate expiry:
```bash
# Check Traefik certificates
ssh pve 'pct exec 202 -- cat /etc/traefik/acme.json | jq ".letsencrypt.Certificates[] | {domain: .domain.main, expires: .Dates.NotAfter}"'

# Check specific service
echo | openssl s_client -servername git.htsn.io -connect git.htsn.io:443 2>/dev/null | openssl x509 -noout -dates
```
Traefik should renew certificates automatically, starting 30 days before expiry.
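Expiry can also be verified in bulk from any machine with `openssl`. A sketch that warns inside the 30-day renewal window (GNU `date` assumed); the domain list is an assumption, so match it to SERVICES.md:

```bash
#!/bin/bash
# Warn when a public certificate is within 30 days of expiry.
DOMAINS="git.htsn.io plex.htsn.io"
for d in $DOMAINS; do
  expiry=$(echo | openssl s_client -servername "$d" -connect "$d:443" 2>/dev/null \
    | openssl x509 -noout -enddate | cut -d= -f2)
  days=$(( ($(date -d "$expiry" +%s) - $(date +%s)) / 86400 ))
  echo "$d: expires in $days days"
  if [ "$days" -lt 30 ]; then
    echo "WARNING: $d renewal overdue -- check Traefik logs"
  fi
done
```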
See: TRAEFIK.md for certificate management
### Backup Verification

⚠️ **TODO:** No backup strategy currently in place.

See: BACKUP-STRATEGY.md for the implementation plan
## Quarterly Maintenance

### Full System Audit

Check all systems comprehensively:
1. **ZFS Pool Health:**

   ```bash
   ssh pve 'zpool status -v'
   ssh pve2 'zpool status -v'
   ssh truenas 'zpool status -v vault'
   ```

   Look for: errors, degraded vdevs, resilver operations.

2. **SMART Health:**

   ```bash
   # Run SMART health check script
   ~/bin/smart-health-check.sh
   ```

   Look for: reallocated sectors, pending sectors, failures.

3. **Disk Space Trends** (a trend-logging sketch follows this list):

   ```bash
   # Check growth rate
   ssh pve 'zpool list -o name,size,allocated,free,fragmentation'
   ssh truenas 'df -h /mnt/vault'
   ```

   Plan for expansion if >80% full.

4. **VM Resource Usage:**

   ```bash
   # Check if VMs need more/less resources
   ssh pve 'qm list'
   ssh pve 'pvesh get /nodes/pve/status'
   ```

5. **Network Performance:**

   ```bash
   # Test bandwidth between critical nodes
   iperf3 -s                # On one host
   iperf3 -c 10.10.10.120   # From another
   ```

6. **Temperature Monitoring:**

   ```bash
   # Check max temps over past quarter
   # TODO: Set up Prometheus/Grafana for historical data
   ssh pve 'sensors'
   ssh pve2 'sensors'
   ```
### Service Dependency Testing

Test critical paths:
1. **Power failure recovery** (if safe to test):
   - See UPS.md for full procedure
   - Verify VM startup order works
   - Confirm all services come back online

2. **Failover testing:**
   - Tailscale subnet routing (PVE → UCG-Fiber)
   - NUT monitoring (PVE server → PVE2 client)

3. **Backup restoration** (when backups implemented):
   - Test restoring a VM from backup
   - Test restoring files from Syncthing versioning
### Documentation Review
- Update IP assignments in IP-ASSIGNMENTS.md
- Review and update service URLs in SERVICES.md
- Check for missing hardware specs in HARDWARE.md
- Update any changed procedures in this document
## Annual Maintenance

### Hardware Maintenance

Physical cleaning:
```bash
# Shut down servers (coordinate with users)
ssh pve 'shutdown -h now'
ssh pve2 'shutdown -h now'

# Clean dust from:
# - CPU heatsinks
# - GPU fans
# - Case fans
# - PSU vents
# - Storage enclosure fans

# Check for:
# - Bulging capacitors on PSU/motherboard
# - Loose cables
# - Fan noise/vibration
```
Thermal paste inspection (every 2-3 years):
- Check CPU temps vs baseline
- If temps >85°C under load, consider reapplying paste
- Threadripper PRO: Tctl max safe = 90°C
See: HARDWARE.md for component details
### UPS Battery Test

Runtime test:
```bash
# Check battery health
ssh pve 'upsc cyberpower@localhost | grep battery'

# Perform runtime test (coordinate power loss)
# 1. Note current runtime estimate
# 2. Unplug UPS from wall
# 3. Let battery drain to 20%
# 4. Note actual runtime vs estimate
# 5. Plug back in before shutdown triggers

# Battery replacement if:
# - Runtime < 10 min at typical load
# - Battery age > 3-5 years
# - Battery charge < 100% when on AC for 24h
```
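The replacement criteria lend themselves to a quick scripted check; this sketch assumes the `battery.charge` (percent) and `battery.runtime` (seconds) variables that NUT exposes for this UPS:

```bash
#!/bin/bash
# Quick battery sanity check against the replacement criteria above.
charge=$(ssh pve 'upsc cyberpower@localhost battery.charge' 2>/dev/null)
runtime=$(ssh pve 'upsc cyberpower@localhost battery.runtime' 2>/dev/null)
echo "Charge: ${charge}%  Runtime: ${runtime}s"
if [ "${runtime:-0}" -lt 600 ]; then
  echo "WARNING: runtime under 10 minutes -- plan battery replacement"
fi
```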
See: UPS.md for full UPS details
### Drive Replacement Planning

Check drive age and health:
```bash
# Get drive hours and health
ssh truenas 'smartctl --scan | while read dev type; do
  echo "=== $dev ===";
  smartctl -a $dev | grep -E "Model|Serial|Power_On_Hours|Reallocated|Pending";
done'
```
Replace drives if:
- Reallocated sectors > 0
- Pending sectors > 0
- SMART pre-fail warnings
- Age > 5 years for HDDs (3-5 years for SSDs/NVMe)
- Hours > 50,000 for consumer drives
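These criteria can be checked mechanically. A sketch assuming smartctl's standard ATA attribute table (field 10 is the raw value; raw formats vary by vendor, so treat the parsing as a starting point):

```bash
#!/bin/bash
# Flag drives meeting the replacement criteria above. Run as root.
smartctl --scan | while read -r dev _; do
  attrs=$(smartctl -A "$dev")
  realloc=$(echo "$attrs" | awk '$2=="Reallocated_Sector_Ct"{print $10}')
  pending=$(echo "$attrs" | awk '$2=="Current_Pending_Sector"{print $10}')
  hours=$(echo "$attrs" | awk '$2=="Power_On_Hours"{print $10}')
  hours=${hours%%[!0-9]*}   # strip vendor suffixes like "12345h+32m"
  [ "${realloc:-0}" -gt 0 ]   && echo "$dev: $realloc reallocated sectors"
  [ "${pending:-0}" -gt 0 ]   && echo "$dev: $pending pending sectors"
  [ "${hours:-0}" -gt 50000 ] && echo "$dev: $hours power-on hours"
done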
Budget for replacements:
- HDDs: WD Red 6TB (~$150/drive)
- NVMe: Samsung/Kingston 2TB (~$150-200/drive)
### Capacity Planning

Review growth trends:
```bash
# Storage growth (compare to last year)
ssh pve 'zpool list'
ssh truenas 'df -h /mnt/vault'

# Network bandwidth (if monitoring in place)
# Review Grafana dashboards

# Power consumption
ssh pve 'upsc cyberpower@localhost ups.load'
```
Plan expansions:
- Storage: Add drives if >70% full
- RAM: Check if VMs hitting limits
- Network: Upgrade if bandwidth saturation
- UPS: Upgrade if load >80%
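The storage threshold above is easy to watch with a short loop per host:

```bash
# Warn when any pool crosses the 70% expansion-planning threshold.
zpool list -H -o name,cap | while read -r name cap; do
  pct=${cap%\%}
  if [ "$pct" -ge 70 ]; then
    echo "WARNING: pool $name is ${pct}% full -- plan expansion"
  fi
done
```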
### License and Subscription Review

Proxmox subscription (if applicable):

- Community (free) or Enterprise subscription?
- Check for updates to pricing/features

Service subscriptions:

- Domain registration (htsn.io)
- Cloudflare plan (currently free)
- Let's Encrypt (free, no action needed)
## Update Schedules

### Proxmox
| Component | Frequency | Notes |
|---|---|---|
| Security patches | Weekly | Via apt upgrade |
| Minor updates | Monthly | Test on PVE2 first |
| Major versions | Quarterly | Read release notes, plan downtime |
| Kernel updates | Monthly | Requires reboot |
Update procedure:

1. Check Proxmox release notes
2. Backup VM configs: `vzdump --dumpdir /tmp`
3. Update: `apt update && apt dist-upgrade`
4. Reboot if kernel changed: `reboot`
5. Verify VMs auto-started: `qm list`
### Containers (LXC)
| Container | Update Frequency | Package Manager |
|---|---|---|
| Pi-hole (200) | Weekly | apt |
| Traefik (202) | Monthly | apt |
| FindShyt (205) | As needed | apt |
Update command:

```bash
ssh pve 'pct exec CTID -- bash -c "apt update && apt upgrade -y"'
```
### VMs
| VM | Update Frequency | Notes |
|---|---|---|
| TrueNAS | Monthly | Via web UI or apt |
| Saltbox | Weekly | Managed by Saltbox updates |
| HomeAssistant | Monthly | Via HA supervisor |
| Docker-host | Weekly | apt + Docker images |
| Trading-VM | As needed | Via SSH |
| Gitea-VM | Monthly | Via web UI + apt |
Docker image updates:

```bash
ssh docker-host 'docker-compose pull && docker-compose up -d'
```
### Firmware Updates
| Component | Check Frequency | Update Method |
|---|---|---|
| Motherboard BIOS | Annually | Manual flash (high risk) |
| GPU firmware | Rarely | nvidia-smi or manual |
| SSD/NVMe firmware | Quarterly | Vendor tools |
| HBA firmware | Annually | LSI tools |
| UPS firmware | Annually | PowerPanel or manual |
⚠️ **Warning:** BIOS/firmware updates carry risk. Only update if:
- Critical security issue
- Needed for hardware compatibility
- Fixing known bug affecting you
## Testing Checklists

### Pre-Update Checklist

Before ANY system update:
- Check current system state: `uptime`, `qm list`, `zpool status` (a snapshot sketch follows this list)
- Verify backups are current (when backup system in place)
- Check for critical VMs/services that can't have downtime
- Review update changelog/release notes
- Test on non-critical system first (PVE2 or test VM)
- Plan rollback strategy if update fails
- Notify users if downtime expected
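A minimal state-snapshot sketch for the first item, so post-update comparisons have a baseline (the output path is a placeholder; run on each node being updated):

```bash
#!/bin/bash
# Capture pre-update state for later comparison.
OUT="/root/pre-update-$(date +%F-%H%M).txt"
{
  echo "=== uptime ===";   uptime
  echo "=== qm list ===";  qm list
  echo "=== pct list ==="; pct list
  echo "=== zpool ===";    zpool status -x
} > "$OUT"
echo "State saved to $OUT"
```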
### Post-Update Checklist

After system update:
- Verify system booted correctly: `uptime`
- Check all VMs/CTs started: `qm list`, `pct list`
- Test critical services (see the verification sketch after this list):
  - Pi-hole DNS: `nslookup google.com 10.10.10.10`
  - Traefik routing: `curl -I https://plex.htsn.io`
  - NFS/SMB shares: Test mount from VM
  - Syncthing sync: Check all devices connected
- Review logs for errors: `journalctl -p err -b`
- Check temperatures: `sensors`
- Verify UPS monitoring: `upsc cyberpower@localhost`
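A verification sketch that runs the service checks above in one pass from a workstation; the hosts, Pi-hole address, and test URL are taken from this document:

```bash
#!/bin/bash
# Post-update verification: DNS, reverse proxy, logs, UPS.
echo "--- DNS (Pi-hole) ---"
nslookup google.com 10.10.10.10 >/dev/null && echo OK || echo FAIL

echo "--- Traefik routing ---"
curl -fsI https://plex.htsn.io >/dev/null && echo OK || echo FAIL

echo "--- Error lines this boot ---"
for host in pve pve2; do
  echo "$host: $(ssh "$host" 'journalctl -p err -b --no-pager | wc -l')"
done

echo "--- UPS ---"
ssh pve 'upsc cyberpower@localhost ups.status'
```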
### Disaster Recovery Test

Quarterly test (when backup system in place):
- Simulate VM failure: Restore from backup
- Simulate storage failure: Import pool on different system
- Simulate network failure: Verify Tailscale failover
- Simulate power failure: Test UPS shutdown procedure (if safe)
- Document recovery time and issues
## Log Rotation

System logs are automatically rotated by systemd-journald and logrotate.

Check log sizes:
```bash
# Journalctl size
ssh pve 'journalctl --disk-usage'

# Traefik logs
ssh pve 'pct exec 202 -- du -sh /var/log/traefik/'
```
Configure retention:
```bash
# Limit journald to 500MB
ssh pve 'echo "SystemMaxUse=500M" >> /etc/systemd/journald.conf'
ssh pve 'systemctl restart systemd-journald'
```
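Note that re-running the `>>` append adds a duplicate line each time. A drop-in file is idempotent and keeps the main config untouched (journald reads `/etc/systemd/journald.conf.d/` on the systemd versions Proxmox ships):

```bash
# Idempotent alternative: configure retention via a drop-in.
ssh pve 'mkdir -p /etc/systemd/journald.conf.d'
ssh pve 'printf "[Journal]\nSystemMaxUse=500M\n" > /etc/systemd/journald.conf.d/retention.conf'
ssh pve 'systemctl restart systemd-journald'
```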
Traefik log rotation (already configured):

```
# /etc/logrotate.d/traefik on CT 202
/var/log/traefik/*.log {
    daily
    rotate 7
    compress
    delaycompress
    missingok
    notifempty
}
```
## Monitoring Integration

**TODO:** Set up automated monitoring for these procedures.

When monitoring is implemented (see MONITORING.md):
- ZFS scrub completion/errors
- SMART test failures
- Certificate expiry warnings (<30 days)
- Update availability notifications
- Disk space thresholds (>80%)
- Temperature warnings (>85°C)
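Until that exists, a stopgap sketch can approximate two of the thresholds above locally, reusing the `Tctl` hwmon approach from the daily script:

```bash
#!/bin/bash
# Interim checks: pool capacity >80% and CPU temperature >85°C.
zpool list -H -o name,cap | while read -r name cap; do
  if [ "${cap%\%}" -ge 80 ]; then
    echo "ALERT: pool $name at $cap"
  fi
done
for f in /sys/class/hwmon/hwmon*/temp*_input; do
  label=$(cat "${f%_input}_label" 2>/dev/null)
  if [ "$label" = "Tctl" ]; then
    t=$(( $(cat "$f") / 1000 ))
    if [ "$t" -gt 85 ]; then
      echo "ALERT: CPU at ${t}°C"
    fi
  fi
done
```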
## Related Documentation
- MONITORING.md - Automated health checks and alerts
- BACKUP-STRATEGY.md - Backup implementation plan
- UPS.md - Power failure procedures
- STORAGE.md - ZFS pool management
- HARDWARE.md - Hardware specifications
- SERVICES.md - Service inventory
---

**Last Updated:** 2025-12-22

**Status:** ⚠️ Manual procedures only; monitoring automation needed