# Maintenance Procedures and Schedules

Regular maintenance procedures for homelab infrastructure to ensure reliability and performance.

## Overview

| Frequency | Tasks | Estimated Time |
|-----------|-------|----------------|
| **Daily** | Quick health check | 2-5 min |
| **Weekly** | Service status, logs review | 15-30 min |
| **Monthly** | Updates, backup verification | 1-2 hours |
| **Quarterly** | Full system audit, testing | 2-4 hours |
| **Annual** | Hardware maintenance, planning | 4-8 hours |

---

## Daily Maintenance (Automated)

### Quick Health Check Script

Save as `~/bin/homelab-health-check.sh`:

```bash
#!/bin/bash
# Daily homelab health check

echo "=== Homelab Health Check ==="
echo "Date: $(date)"
echo ""

echo "=== Server Status ==="
ssh pve 'uptime' 2>/dev/null || echo "PVE: UNREACHABLE"
ssh pve2 'uptime' 2>/dev/null || echo "PVE2: UNREACHABLE"
echo ""

echo "=== CPU Temperatures ==="
ssh pve 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE: $(($(cat $f)/1000))°C"; fi; done'
ssh pve2 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE2: $(($(cat $f)/1000))°C"; fi; done'
echo ""

echo "=== UPS Status ==="
ssh pve 'upsc cyberpower@localhost | grep -E "battery.charge:|battery.runtime:|ups.load:|ups.status:"'
echo ""

echo "=== ZFS Pools ==="
ssh pve 'zpool status -x' 2>/dev/null
ssh pve2 'zpool status -x' 2>/dev/null
ssh truenas 'zpool status -x vault'
echo ""

echo "=== Disk Space ==="
ssh pve 'df -h | grep -E "Filesystem|/dev/(nvme|sd)"'
ssh truenas 'df -h /mnt/vault'
echo ""

echo "=== VM Status ==="
ssh pve 'qm list | grep running | wc -l' | xargs echo "PVE VMs running:"
ssh pve2 'qm list | grep running | wc -l' | xargs echo "PVE2 VMs running:"
echo ""

echo "=== Syncthing Connections ==="
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
  "http://127.0.0.1:8384/rest/system/connections" | \
  python3 -c "import sys,json; d=json.load(sys.stdin)['connections']; \
[print(f\"{v.get('name',k[:7])}: {'UP' if v['connected'] else 'DOWN'}\") for k,v in d.items()]"
echo ""

echo "=== Check Complete ==="
```

**Run daily via cron**:

```bash
# Add to crontab
0 9 * * * ~/bin/homelab-health-check.sh | mail -s "Homelab Health Check" hutson@example.com
```

---

## Weekly Maintenance

### Service Status Review

**Check all critical services**:

```bash
# Proxmox services
ssh pve 'systemctl status pve-cluster pvedaemon pveproxy'
ssh pve2 'systemctl status pve-cluster pvedaemon pveproxy'

# NUT (UPS monitoring)
ssh pve 'systemctl status nut-server nut-monitor'
ssh pve2 'systemctl status nut-monitor'

# Container services
ssh pve 'pct exec 200 -- systemctl status pihole-FTL'   # Pi-hole
ssh pve 'pct exec 202 -- systemctl status traefik'      # Traefik

# VM services (via QEMU agent)
ssh pve 'qm guest exec 100 -- bash -c "systemctl status nfs-server smbd"'   # TrueNAS
```

### Log Review

**Check for errors in critical logs**:

```bash
# Proxmox system logs
ssh pve 'journalctl -p err -b | tail -50'
ssh pve2 'journalctl -p err -b | tail -50'

# VM logs (if QEMU agent available)
ssh pve 'qm guest exec 100 -- bash -c "journalctl -p err --since today"'

# Traefik access logs
ssh pve 'pct exec 202 -- tail -100 /var/log/traefik/access.log'
```
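
For a quicker weekly pass, an error count per host can flag which machine deserves a closer look before reading full logs. A minimal sketch, assuming the same `pve`/`pve2` SSH aliases used throughout this document:

```bash
#!/bin/bash
# Weekly error summary: count error-level journal entries since last boot on each host.
# Host aliases (pve, pve2) are the SSH shortcuts used in the health check script above.
for host in pve pve2; do
  count=$(ssh "$host" 'journalctl -p err -b --no-pager | wc -l' 2>/dev/null) || count="unreachable"
  echo "$host: $count error lines this boot"
done
```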
### Syncthing Sync Status

**Check for sync errors**:

```bash
# Check all folder errors
for folder in documents downloads desktop movies pictures notes config; do
  echo "=== $folder ==="
  curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
    "http://127.0.0.1:8384/rest/folder/errors?folder=$folder" | jq
done
```

**See**: [SYNCTHING.md](SYNCTHING.md)

---

## Monthly Maintenance

### System Updates

#### Proxmox Updates

**Check for updates**:

```bash
ssh pve 'apt update && apt list --upgradable'
ssh pve2 'apt update && apt list --upgradable'
```

**Apply updates**:

```bash
# PVE
ssh pve 'apt update && apt dist-upgrade -y'

# PVE2
ssh pve2 'apt update && apt dist-upgrade -y'

# Reboot if kernel updated
ssh pve 'reboot'
ssh pve2 'reboot'
```

**⚠️ Important**:
- Check [Proxmox release notes](https://pve.proxmox.com/wiki/Roadmap) before major updates
- Test on PVE2 first if possible
- Ensure all VMs are backed up before updating
- Monitor VMs after reboot - some may need manual restart

#### Container Updates (LXC)

```bash
# Update all containers
ssh pve 'for ctid in 200 202 205; do pct exec $ctid -- bash -c "apt update && apt upgrade -y"; done'
```

#### VM Updates

**Update VMs individually via SSH**:

```bash
# Ubuntu/Debian VMs
ssh truenas 'apt update && apt upgrade -y'
ssh docker-host 'apt update && apt upgrade -y'
ssh fs-dev 'apt update && apt upgrade -y'

# Check if reboot required
ssh truenas '[ -f /var/run/reboot-required ] && echo "Reboot required"'
```

### ZFS Scrubs

**Schedule**: Run monthly on all pools

**PVE**:

```bash
# Start scrub on all pools
ssh pve 'zpool scrub nvme-mirror1'
ssh pve 'zpool scrub nvme-mirror2'
ssh pve 'zpool scrub rpool'

# Check scrub status
ssh pve 'zpool status | grep -A2 scrub'
```

**PVE2**:

```bash
ssh pve2 'zpool scrub nvme-mirror3'
ssh pve2 'zpool scrub local-zfs2'
ssh pve2 'zpool status | grep -A2 scrub'
```

**TrueNAS**:

```bash
# Scrub via TrueNAS web UI or SSH
ssh truenas 'zpool scrub vault'
ssh truenas 'zpool status vault | grep -A2 scrub'
```

**Automate scrubs**:

```bash
# Add to crontab (run on 1st of month at 2 AM)
0 2 1 * * /sbin/zpool scrub nvme-mirror1
0 2 1 * * /sbin/zpool scrub nvme-mirror2
0 2 1 * * /sbin/zpool scrub rpool
```

**See**: [STORAGE.md](STORAGE.md) for pool details

### SMART Tests

**Run extended SMART tests monthly**:

```bash
# TrueNAS drives (via QEMU agent)
ssh pve 'qm guest exec 100 -- bash -c "smartctl --scan | while read dev type; do smartctl -t long \$dev; done"'

# Check results after 4-8 hours
ssh pve 'qm guest exec 100 -- bash -c "smartctl --scan | while read dev type; do echo \"=== \$dev ===\"; smartctl -a \$dev | grep -E \"Model|Serial|test result|Reallocated|Current_Pending\"; done"'

# PVE drives
ssh pve 'for dev in /dev/nvme0 /dev/nvme1 /dev/sda /dev/sdb; do [ -e "$dev" ] && smartctl -t long $dev; done'

# PVE2 drives
ssh pve2 'for dev in /dev/nvme0 /dev/nvme1 /dev/sda /dev/sdb; do [ -e "$dev" ] && smartctl -t long $dev; done'
```

**Automate SMART tests**:

```bash
# Add to crontab (run on 15th of month at 3 AM)
0 3 15 * * /usr/sbin/smartctl -t long /dev/nvme0
0 3 15 * * /usr/sbin/smartctl -t long /dev/sda
```

### Certificate Renewal Verification

**Check SSL certificate expiry**:

```bash
# Check Traefik certificates
ssh pve 'pct exec 202 -- cat /etc/traefik/acme.json | jq ".letsencrypt.Certificates[] | {domain: .domain.main, expires: .Dates.NotAfter}"'

# Check specific service
echo | openssl s_client -servername git.htsn.io -connect git.htsn.io:443 2>/dev/null | openssl x509 -noout -dates
```

**Certificates should auto-renew 30 days before expiry via Traefik**

**See**: [TRAEFIK.md](TRAEFIK.md) for certificate management
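
To turn the spot check above into a warning, the same `openssl` probe can be looped over each public hostname and compared against the 30-day renewal window. A sketch only; the domain list is an assumption and should be adjusted to the services actually published through Traefik:

```bash
#!/bin/bash
# Warn when a certificate is inside the Traefik renewal window (30 days).
# Domain list is illustrative; git.htsn.io and plex.htsn.io are services named in this document.
DOMAINS="git.htsn.io plex.htsn.io"
for d in $DOMAINS; do
  expiry=$(echo | openssl s_client -servername "$d" -connect "$d:443" 2>/dev/null \
    | openssl x509 -noout -enddate | cut -d= -f2)
  expiry_epoch=$(date -d "$expiry" +%s)
  days_left=$(( (expiry_epoch - $(date +%s)) / 86400 ))
  if [ "$days_left" -lt 30 ]; then
    echo "WARNING: $d certificate expires in $days_left days"
  else
    echo "OK: $d ($days_left days left)"
  fi
done
```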
### Backup Verification

**⚠️ TODO**: No backup strategy currently in place

**See**: [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md) for implementation plan

---

## Quarterly Maintenance

### Full System Audit

**Check all systems comprehensively**:

1. **ZFS Pool Health**:

   ```bash
   ssh pve 'zpool status -v'
   ssh pve2 'zpool status -v'
   ssh truenas 'zpool status -v vault'
   ```

   Look for: errors, degraded vdevs, resilver operations

2. **SMART Health** (a sketch of this script follows the list):

   ```bash
   # Run SMART health check script
   ~/bin/smart-health-check.sh
   ```

   Look for: reallocated sectors, pending sectors, failures

3. **Disk Space Trends**:

   ```bash
   # Check growth rate
   ssh pve 'zpool list -o name,size,allocated,free,fragmentation'
   ssh truenas 'df -h /mnt/vault'
   ```

   Plan for expansion if >80% full

4. **VM Resource Usage**:

   ```bash
   # Check if VMs need more/less resources
   ssh pve 'qm list'
   ssh pve 'pvesh get /nodes/pve/status'
   ```

5. **Network Performance**:

   ```bash
   # Test bandwidth between critical nodes
   iperf3 -s                # On one host
   iperf3 -c 10.10.10.120   # From another
   ```

6. **Temperature Monitoring**:

   ```bash
   # Check max temps over past quarter
   # TODO: Set up Prometheus/Grafana for historical data
   ssh pve 'sensors'
   ssh pve2 'sensors'
   ```
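
The `~/bin/smart-health-check.sh` script referenced in item 2 is not documented elsewhere in this section. A minimal sketch of what it could contain, assuming the same hosts and device paths as the monthly SMART test commands (not the actual script):

```bash
#!/bin/bash
# smart-health-check.sh - minimal sketch only, not the script referenced above.
# Hosts and device lists mirror the monthly SMART test commands and are assumptions.
for host in pve pve2; do
  echo "=== $host ==="
  ssh "$host" 'for dev in /dev/nvme0 /dev/nvme1 /dev/sda /dev/sdb; do
    [ -e "$dev" ] || continue
    echo "--- $dev ---"
    smartctl -H "$dev" | grep -iE "overall-health|health status"
    smartctl -A "$dev" | grep -E "Reallocated_Sector|Current_Pending|Media and Data Integrity"
  done'
done
```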
### Service Dependency Testing

**Test critical paths**:

1. **Power failure recovery** (if safe to test):
   - See [UPS.md](UPS.md) for full procedure
   - Verify VM startup order works
   - Confirm all services come back online

2. **Failover testing**:
   - Tailscale subnet routing (PVE → UCG-Fiber)
   - NUT monitoring (PVE server → PVE2 client)

3. **Backup restoration** (when backups implemented):
   - Test restoring a VM from backup
   - Test restoring files from Syncthing versioning

### Documentation Review

- [ ] Update IP assignments in [IP-ASSIGNMENTS.md](IP-ASSIGNMENTS.md)
- [ ] Review and update service URLs in [SERVICES.md](SERVICES.md)
- [ ] Check for missing hardware specs in [HARDWARE.md](HARDWARE.md)
- [ ] Update any changed procedures in this document

---

## Annual Maintenance

### Hardware Maintenance

**Physical cleaning**:

```bash
# Shut down servers (coordinate with users)
ssh pve 'shutdown -h now'
ssh pve2 'shutdown -h now'

# Clean dust from:
# - CPU heatsinks
# - GPU fans
# - Case fans
# - PSU vents
# - Storage enclosure fans

# Check for:
# - Bulging capacitors on PSU/motherboard
# - Loose cables
# - Fan noise/vibration
```

**Thermal paste inspection** (every 2-3 years):
- Check CPU temps vs baseline
- If temps >85°C under load, consider reapplying paste
- Threadripper PRO: Tctl max safe = 90°C

**See**: [HARDWARE.md](HARDWARE.md) for component details

### UPS Battery Test

**Runtime test**:

```bash
# Check battery health
ssh pve 'upsc cyberpower@localhost | grep battery'

# Perform runtime test (coordinate power loss)
# 1. Note current runtime estimate
# 2. Unplug UPS from wall
# 3. Let battery drain to 20%
# 4. Note actual runtime vs estimate
# 5. Plug back in before shutdown triggers

# Battery replacement if:
# - Runtime < 10 min at typical load
# - Battery age > 3-5 years
# - Battery charge < 100% when on AC for 24h
```

**See**: [UPS.md](UPS.md) for full UPS details

### Drive Replacement Planning

**Check drive age and health**:

```bash
# Get drive hours and health
ssh truenas 'smartctl --scan | while read dev type; do echo "=== $dev ==="; smartctl -a $dev | grep -E "Model|Serial|Power_On_Hours|Reallocated|Pending"; done'
```

**Replace drives if**:
- Reallocated sectors > 0
- Pending sectors > 0
- SMART pre-fail warnings
- Age > 5 years for HDDs (3-5 years for SSDs/NVMe)
- Hours > 50,000 for consumer drives

**Budget for replacements**:
- HDDs: WD Red 6TB (~$150/drive)
- NVMe: Samsung/Kingston 2TB (~$150-200/drive)

### Capacity Planning

**Review growth trends**:

```bash
# Storage growth (compare to last year)
ssh pve 'zpool list'
ssh truenas 'df -h /mnt/vault'

# Network bandwidth (if monitoring in place)
# Review Grafana dashboards

# Power consumption
ssh pve 'upsc cyberpower@localhost ups.load'
```

**Plan expansions**:
- Storage: Add drives if >70% full
- RAM: Check if VMs hitting limits
- Network: Upgrade if bandwidth saturation
- UPS: Upgrade if load >80%

### License and Subscription Review

**Proxmox subscription** (if applicable):
- Community (free) or Enterprise subscription?
- Check for updates to pricing/features

**Service subscriptions**:
- Domain registration (htsn.io)
- Cloudflare plan (currently free)
- Let's Encrypt (free, no action needed)

---

## Update Schedules

### Proxmox

| Component | Frequency | Notes |
|-----------|-----------|-------|
| Security patches | Weekly | Via `apt upgrade` |
| Minor updates | Monthly | Test on PVE2 first |
| Major versions | Quarterly | Read release notes, plan downtime |
| Kernel updates | Monthly | Requires reboot |

**Update procedure**:
1. Check [Proxmox release notes](https://pve.proxmox.com/wiki/Roadmap)
2. Backup VM configs: `vzdump --dumpdir /tmp`
3. Update: `apt update && apt dist-upgrade`
4. Reboot if kernel changed: `reboot`
5. Verify VMs auto-started: `qm list`

### Containers (LXC)

| Container | Update Frequency | Package Manager |
|-----------|------------------|-----------------|
| Pi-hole (200) | Weekly | `apt` |
| Traefik (202) | Monthly | `apt` |
| FindShyt (205) | As needed | `apt` |

**Update command**:

```bash
ssh pve 'pct exec CTID -- bash -c "apt update && apt upgrade -y"'
```

### VMs

| VM | Update Frequency | Notes |
|----|------------------|-------|
| TrueNAS | Monthly | Via web UI or `apt` |
| Saltbox | Weekly | Managed by Saltbox updates |
| HomeAssistant | Monthly | Via HA supervisor |
| Docker-host | Weekly | `apt` + Docker images |
| Trading-VM | As needed | Via SSH |
| Gitea-VM | Monthly | Via web UI + `apt` |

**Docker image updates**:

```bash
ssh docker-host 'docker-compose pull && docker-compose up -d'
```

### Firmware Updates

| Component | Check Frequency | Update Method |
|-----------|-----------------|---------------|
| Motherboard BIOS | Annually | Manual flash (high risk) |
| GPU firmware | Rarely | `nvidia-smi` or manual |
| SSD/NVMe firmware | Quarterly | Vendor tools |
| HBA firmware | Annually | LSI tools |
| UPS firmware | Annually | PowerPanel or manual |

**⚠️ Warning**: BIOS/firmware updates carry risk. Only update if:
- Critical security issue
- Needed for hardware compatibility
- Fixing known bug affecting you

---

## Testing Checklists

### Pre-Update Checklist

Before ANY system update:

- [ ] Check current system state: `uptime`, `qm list`, `zpool status`
- [ ] Verify backups are current (when backup system in place)
- [ ] Check for critical VMs/services that can't have downtime
- [ ] Review update changelog/release notes
- [ ] Test on non-critical system first (PVE2 or test VM)
- [ ] Plan rollback strategy if update fails
- [ ] Notify users if downtime expected

### Post-Update Checklist

After system update:

- [ ] Verify system booted correctly: `uptime`
- [ ] Check all VMs/CTs started: `qm list`, `pct list`
- [ ] Test critical services:
  - [ ] Pi-hole DNS: `nslookup google.com 10.10.10.10`
  - [ ] Traefik routing: `curl -I https://plex.htsn.io`
  - [ ] NFS/SMB shares: Test mount from VM
  - [ ] Syncthing sync: Check all devices connected
- [ ] Review logs for errors: `journalctl -p err -b`
- [ ] Check temperatures: `sensors`
- [ ] Verify UPS monitoring: `upsc cyberpower@localhost`
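
Much of the checklist above can be run in one pass from the management machine. A sketch that chains the commands already listed; host aliases and service URLs are the ones used earlier in this document:

```bash
#!/bin/bash
# Post-update spot check - sketch combining the checklist commands above.
echo "=== Uptime / VM / CT status ==="
ssh pve 'uptime; qm list; pct list'

echo "=== Pi-hole DNS ==="
nslookup google.com 10.10.10.10 | tail -2

echo "=== Traefik routing ==="
curl -sI https://plex.htsn.io | head -1

echo "=== Recent errors ==="
ssh pve 'journalctl -p err -b --no-pager | tail -20'

echo "=== Temperatures ==="
ssh pve 'sensors | grep Tctl'

echo "=== UPS ==="
ssh pve 'upsc cyberpower@localhost ups.status'
```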
### Disaster Recovery Test

**Quarterly test** (when backup system in place):

- [ ] Simulate VM failure: Restore from backup
- [ ] Simulate storage failure: Import pool on different system
- [ ] Simulate network failure: Verify Tailscale failover
- [ ] Simulate power failure: Test UPS shutdown procedure (if safe)
- [ ] Document recovery time and issues

---

## Log Rotation

**System logs** are automatically rotated by systemd-journald and logrotate.

**Check log sizes**:

```bash
# Journalctl size
ssh pve 'journalctl --disk-usage'

# Traefik logs
ssh pve 'pct exec 202 -- du -sh /var/log/traefik/'
```

**Configure retention**:

```bash
# Limit journald to 500MB
ssh pve 'echo "SystemMaxUse=500M" >> /etc/systemd/journald.conf'
ssh pve 'systemctl restart systemd-journald'
```

**Traefik log rotation** (already configured):

```bash
# /etc/logrotate.d/traefik on CT 202
/var/log/traefik/*.log {
    daily
    rotate 7
    compress
    delaycompress
    missingok
    notifempty
}
```

---

## Monitoring Integration

**TODO**: Set up automated monitoring for these procedures

**When monitoring is implemented** (see [MONITORING.md](MONITORING.md)):
- ZFS scrub completion/errors
- SMART test failures
- Certificate expiry warnings (<30 days)
- Update availability notifications
- Disk space thresholds (>80%)
- Temperature warnings (>85°C)

---

## Related Documentation

- [MONITORING.md](MONITORING.md) - Automated health checks and alerts
- [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md) - Backup implementation plan
- [UPS.md](UPS.md) - Power failure procedures
- [STORAGE.md](STORAGE.md) - ZFS pool management
- [HARDWARE.md](HARDWARE.md) - Hardware specifications
- [SERVICES.md](SERVICES.md) - Service inventory

---

**Last Updated**: 2025-12-22

**Status**: ⚠️ Manual procedures only - monitoring automation needed