
Maintenance Procedures and Schedules

Regular maintenance procedures for homelab infrastructure to ensure reliability and performance.

Overview

Frequency | Tasks                          | Estimated Time
Daily     | Quick health check             | 2-5 min
Weekly    | Service status, logs review    | 15-30 min
Monthly   | Updates, backups verification  | 1-2 hours
Quarterly | Full system audit, testing     | 2-4 hours
Annual    | Hardware maintenance, planning | 4-8 hours

Daily Maintenance (Automated)

Quick Health Check Script

Save as ~/bin/homelab-health-check.sh:

#!/bin/bash
# Daily homelab health check

echo "=== Homelab Health Check ==="
echo "Date: $(date)"
echo ""

echo "=== Server Status ==="
ssh pve 'uptime' 2>/dev/null || echo "PVE: UNREACHABLE"
ssh pve2 'uptime' 2>/dev/null || echo "PVE2: UNREACHABLE"
echo ""

echo "=== CPU Temperatures ==="
ssh pve 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE: $(($(cat $f)/1000))°C"; fi; done'
ssh pve2 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE2: $(($(cat $f)/1000))°C"; fi; done'
echo ""

echo "=== UPS Status ==="
ssh pve 'upsc cyberpower@localhost | grep -E "battery.charge:|battery.runtime:|ups.load:|ups.status:"'
echo ""

echo "=== ZFS Pools ==="
ssh pve 'zpool status -x' 2>/dev/null
ssh pve2 'zpool status -x' 2>/dev/null
ssh truenas 'zpool status -x vault'
echo ""

echo "=== Disk Space ==="
ssh pve 'df -h | grep -E "Filesystem|/dev/(nvme|sd)"'
ssh truenas 'df -h /mnt/vault'
echo ""

echo "=== VM Status ==="
ssh pve 'qm list | grep running | wc -l' | xargs echo "PVE VMs running:"
ssh pve2 'qm list | grep running | wc -l' | xargs echo "PVE2 VMs running:"
echo ""

echo "=== Syncthing Connections ==="
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
  "http://127.0.0.1:8384/rest/system/connections" | \
  python3 -c "import sys,json; d=json.load(sys.stdin)['connections']; \
  [print(f\"{v.get('name',k[:7])}: {'UP' if v['connected'] else 'DOWN'}\") for k,v in d.items()]"
echo ""

echo "=== Check Complete ==="

Run daily via cron:

# Add to crontab
0 9 * * * ~/bin/homelab-health-check.sh | mail -s "Homelab Health Check" hutson@example.com

Weekly Maintenance

Service Status Review

Check all critical services:

# Proxmox services
ssh pve 'systemctl status pve-cluster pvedaemon pveproxy'
ssh pve2 'systemctl status pve-cluster pvedaemon pveproxy'

# NUT (UPS monitoring)
ssh pve 'systemctl status nut-server nut-monitor'
ssh pve2 'systemctl status nut-monitor'

# Container services
ssh pve 'pct exec 200 -- systemctl status pihole-FTL'  # Pi-hole
ssh pve 'pct exec 202 -- systemctl status traefik'     # Traefik

# VM services (via QEMU agent)
ssh pve 'qm guest exec 100 -- bash -c "systemctl status nfs-server smbd"'  # TrueNAS

Log Review

Check for errors in critical logs:

# Proxmox system logs
ssh pve 'journalctl -p err -b | tail -50'
ssh pve2 'journalctl -p err -b | tail -50'

# VM logs (if QEMU agent available)
ssh pve 'qm guest exec 100 -- bash -c "journalctl -p err --since today"'

# Traefik access logs
ssh pve 'pct exec 202 -- tail -100 /var/log/traefik/access.log'

Syncthing Sync Status

Check for sync errors:

# Check all folder errors
for folder in documents downloads desktop movies pictures notes config; do
  echo "=== $folder ==="
  curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
    "http://127.0.0.1:8384/rest/folder/errors?folder=$folder" | jq
done

See: SYNCTHING.md


Monthly Maintenance

System Updates

Proxmox Updates

Check for updates:

ssh pve 'apt update && apt list --upgradable'
ssh pve2 'apt update && apt list --upgradable'

Apply updates:

# PVE
ssh pve 'apt update && apt dist-upgrade -y'

# PVE2
ssh pve2 'apt update && apt dist-upgrade -y'

# Reboot if kernel updated
ssh pve 'reboot'
ssh pve2 'reboot'

⚠️ Important:

  • Check Proxmox release notes before major updates
  • Test on PVE2 first if possible
  • Ensure all VMs are backed up before updating (see the vzdump sketch below)
  • Monitor VMs after reboot - some may need manual restart
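
The backup step above can use Proxmox's built-in vzdump; a minimal sketch (the VMIDs, target storage, and compression are assumptions, adjust to the actual environment):

# Back up selected VMs in snapshot mode before updating (VMIDs are examples)
ssh pve 'vzdump 100 101 --mode snapshot --storage local --compress zstd'

# Or back up every guest on the node
ssh pve 'vzdump --all --mode snapshot --storage local --compress zstd'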

Container Updates (LXC)

# Update all containers
ssh pve 'for ctid in 200 202 205; do pct exec $ctid -- bash -c "apt update && apt upgrade -y"; done'

VM Updates

Update VMs individually via SSH:

# Ubuntu/Debian VMs
ssh truenas 'apt update && apt upgrade -y'
ssh docker-host 'apt update && apt upgrade -y'
ssh fs-dev 'apt update && apt upgrade -y'

# Check if reboot required
ssh truenas '[ -f /var/run/reboot-required ] && echo "Reboot required"'

ZFS Scrubs

Schedule: Run monthly on all pools

PVE:

# Start scrub on all pools
ssh pve 'zpool scrub nvme-mirror1'
ssh pve 'zpool scrub nvme-mirror2'
ssh pve 'zpool scrub rpool'

# Check scrub status
ssh pve 'zpool status | grep -A2 scrub'

PVE2:

ssh pve2 'zpool scrub nvme-mirror3'
ssh pve2 'zpool scrub local-zfs2'
ssh pve2 'zpool status | grep -A2 scrub'

TrueNAS:

# Scrub via TrueNAS web UI or SSH
ssh truenas 'zpool scrub vault'
ssh truenas 'zpool status vault | grep -A2 scrub'

Automate scrubs:

# Add to crontab (run on 1st of month at 2 AM)
0 2 1 * * /sbin/zpool scrub nvme-mirror1
0 2 1 * * /sbin/zpool scrub nvme-mirror2
0 2 1 * * /sbin/zpool scrub rpool

See: STORAGE.md for pool details

SMART Tests

Run extended SMART tests monthly:

# TrueNAS drives (via QEMU agent)
ssh pve 'qm guest exec 100 -- bash -c "smartctl --scan | while read dev type; do smartctl -t long \$dev; done"'

# Check results after 4-8 hours
ssh pve 'qm guest exec 100 -- bash -c "smartctl --scan | while read dev type; do echo \"=== \$dev ===\"; smartctl -a \$dev | grep -E \"Model|Serial|test result|Reallocated|Current_Pending\"; done"'

# PVE drives
ssh pve 'for dev in /dev/nvme0 /dev/nvme1 /dev/sda /dev/sdb; do [ -e "$dev" ] && smartctl -t long $dev; done'

# PVE2 drives
ssh pve2 'for dev in /dev/nvme0 /dev/nvme1 /dev/sda /dev/sdb; do [ -e "$dev" ] && smartctl -t long $dev; done'

Automate SMART tests:

# Add to crontab (run on 15th of month at 3 AM)
0 3 15 * * /usr/sbin/smartctl -t long /dev/nvme0
0 3 15 * * /usr/sbin/smartctl -t long /dev/sda

Certificate Renewal Verification

Check SSL certificate expiry:

# Check Traefik certificates
ssh pve 'pct exec 202 -- cat /etc/traefik/acme.json | jq ".letsencrypt.Certificates[] | {domain: .domain.main, expires: .Dates.NotAfter}"'

# Check specific service
echo | openssl s_client -servername git.htsn.io -connect git.htsn.io:443 2>/dev/null | openssl x509 -noout -dates

Certificates should auto-renew 30 days before expiry via Traefik
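
A small sketch that turns the openssl check above into a warning when fewer than 30 days remain (the domain list is an example; extend it to every externally exposed service):

#!/bin/bash
# Warn when a certificate expires within 30 days
for host in git.htsn.io plex.htsn.io; do
  if ! echo | openssl s_client -servername "$host" -connect "$host:443" 2>/dev/null \
      | openssl x509 -noout -checkend $((30*24*3600)) >/dev/null; then
    echo "WARNING: certificate for $host expires within 30 days"
  fi
done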

See: TRAEFIK.md for certificate management

Backup Verification

⚠️ TODO: No backup strategy currently in place

See: BACKUP-STRATEGY.md for implementation plan


Quarterly Maintenance

Full System Audit

Check all systems comprehensively:

  1. ZFS Pool Health:

    ssh pve 'zpool status -v'
    ssh pve2 'zpool status -v'
    ssh truenas 'zpool status -v vault'
    

    Look for: errors, degraded vdevs, resilver operations

  2. SMART Health:

    # Run SMART health check script (a sketch of this script follows this list)
    ~/bin/smart-health-check.sh
    

    Look for: reallocated sectors, pending sectors, failures

  3. Disk Space Trends:

    # Check growth rate
    ssh pve 'zpool list -o name,size,allocated,free,fragmentation'
    ssh truenas 'df -h /mnt/vault'
    

    Plan for expansion if >80% full

  4. VM Resource Usage:

    # Check if VMs need more/less resources
    ssh pve 'qm list'
    ssh pve 'pvesh get /nodes/pve/status'
    
  5. Network Performance:

    # Test bandwidth between critical nodes
    iperf3 -s  # On one host
    iperf3 -c 10.10.10.120  # From another
    
  6. Temperature Monitoring:

    # Check max temps over past quarter
    # TODO: Set up Prometheus/Grafana for historical data
    ssh pve 'sensors'
    ssh pve2 'sensors'
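
The ~/bin/smart-health-check.sh referenced in step 2 is not defined in this file; a minimal sketch, assuming the same SSH access to pve, pve2, and truenas used throughout this document:

#!/bin/bash
# ~/bin/smart-health-check.sh - summarize SMART health across hosts
for host in pve pve2 truenas; do
  echo "=== $host ==="
  ssh "$host" 'smartctl --scan | while read dev type extra; do
    echo "--- $dev ---"
    smartctl -H $dev | grep -E "overall-health|SMART Health Status"
    smartctl -A $dev | grep -E "Reallocated_Sector|Current_Pending|Offline_Uncorrectable"
  done' || echo "$host: UNREACHABLE"
done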
    

Service Dependency Testing

Test critical paths:

  1. Power failure recovery (if safe to test):

    • See UPS.md for full procedure
    • Verify VM startup order works
    • Confirm all services come back online
  2. Failover testing (see the check commands after this list):

    • Tailscale subnet routing (PVE → UCG-Fiber)
    • NUT monitoring (PVE server → PVE2 client)
  3. Backup restoration (when backups implemented):

    • Test restoring a VM from backup
    • Test restoring files from Syncthing versioning
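
A few commands to exercise the failover paths above (the UPS name matches the NUT setup used earlier; the PVE address and the off-LAN test device are assumptions):

# NUT failover: confirm PVE2 can query the UPS that PVE serves
ssh pve2 'upsc cyberpower@<pve-ip> ups.status'

# Tailscale subnet routing: from a device OUTSIDE the LAN, confirm the
# advertised route still reaches an internal address (Pi-hole here)
tailscale status
ping -c 3 10.10.10.10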

Documentation Review

Review this documentation set (HARDWARE.md, STORAGE.md, UPS.md, TRAEFIK.md, SYNCTHING.md, MONITORING.md, BACKUP-STRATEGY.md) against the running configuration: update anything that has drifted, remove references to retired services, and verify cross-references still resolve.

Annual Maintenance

Hardware Maintenance

Physical cleaning:

# Shut down servers (coordinate with users)
ssh pve 'shutdown -h now'
ssh pve2 'shutdown -h now'

# Clean dust from:
# - CPU heatsinks
# - GPU fans
# - Case fans
# - PSU vents
# - Storage enclosure fans

# Check for:
# - Bulging capacitors on PSU/motherboard
# - Loose cables
# - Fan noise/vibration

Thermal paste inspection (every 2-3 years):

  • Check CPU temps vs baseline
  • If temps >85°C under load, consider reapplying paste
  • Threadripper PRO: Tctl max safe = 90°C

See: HARDWARE.md for component details

UPS Battery Test

Runtime test:

# Check battery health
ssh pve 'upsc cyberpower@localhost | grep battery'

# Perform runtime test (coordinate power loss)
# 1. Note current runtime estimate
# 2. Unplug UPS from wall
# 3. Let battery drain to 20%
# 4. Note actual runtime vs estimate
# 5. Plug back in before shutdown triggers

# Battery replacement if:
# - Runtime < 10 min at typical load
# - Battery age > 3-5 years
# - Battery charge < 100% when on AC for 24h

See: UPS.md for full UPS details

Drive Replacement Planning

Check drive age and health:

# Get drive hours and health
ssh truenas 'smartctl --scan | while read dev type; do
  echo "=== $dev ===";
  smartctl -a $dev | grep -E "Model|Serial|Power_On_Hours|Reallocated|Pending";
done'

Replace drives if:

  • Reallocated sectors > 0
  • Pending sectors > 0
  • SMART pre-fail warnings
  • Age > 5 years for HDDs (3-5 years for SSDs/NVMe)
  • Hours > 50,000 for consumer drives
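
A sketch that applies the first three thresholds automatically (assumes the ATA SMART attribute table; NVMe drives report health differently and are not covered here):

#!/bin/bash
# Flag drives with reallocated/pending sectors or more than 50,000 power-on hours
smartctl --scan | while read dev type extra; do
  hours=$(smartctl -A $dev | awk '/Power_On_Hours/ {print $10}')
  hours=${hours%%[^0-9]*}   # some drives append extra text to the raw value
  realloc=$(smartctl -A $dev | awk '/Reallocated_Sector_Ct/ {print $10}')
  pending=$(smartctl -A $dev | awk '/Current_Pending_Sector/ {print $10}')
  [ "${realloc:-0}" -gt 0 ] && echo "$dev: $realloc reallocated sectors - plan replacement"
  [ "${pending:-0}" -gt 0 ] && echo "$dev: $pending pending sectors - plan replacement"
  [ "${hours:-0}" -gt 50000 ] && echo "$dev: $hours power-on hours - consider replacement"
done

Run it on each host, e.g. ssh truenas 'bash -s' < drive-age-check.sh (the filename is arbitrary).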

Budget for replacements:

  • HDDs: WD Red 6TB (~$150/drive)
  • NVMe: Samsung/Kingston 2TB (~$150-200/drive)

Capacity Planning

Review growth trends:

# Storage growth (compare to last year)
ssh pve 'zpool list'
ssh truenas 'df -h /mnt/vault'

# Network bandwidth (if monitoring in place)
# Review Grafana dashboards

# Power consumption
ssh pve 'upsc cyberpower@localhost ups.load'

Plan expansions:

  • Storage: Add drives if >70% full
  • RAM: Check if VMs hitting limits
  • Network: Upgrade if bandwidth saturation
  • UPS: Upgrade if load >80%

License and Subscription Review

Proxmox subscription (if applicable):

  • Community (free) or Enterprise subscription?
  • Check for updates to pricing/features

Service subscriptions:

  • Domain registration (htsn.io)
  • Cloudflare plan (currently free)
  • Let's Encrypt (free, no action needed)

Update Schedules

Proxmox

Component        | Frequency | Notes
Security patches | Weekly    | Via apt upgrade
Minor updates    | Monthly   | Test on PVE2 first
Major versions   | Quarterly | Read release notes, plan downtime
Kernel updates   | Monthly   | Requires reboot

Update procedure:

  1. Check Proxmox release notes
  2. Back up VMs: vzdump <vmid> --dumpdir /tmp (or vzdump --all --dumpdir /tmp)
  3. Update: apt update && apt dist-upgrade
  4. Reboot if kernel changed: reboot
  5. Verify VMs auto-started: qm list

Containers (LXC)

Container      | Update Frequency | Package Manager
Pi-hole (200)  | Weekly           | apt
Traefik (202)  | Monthly          | apt
FindShyt (205) | As needed        | apt

Update command:

ssh pve 'pct exec CTID -- bash -c "apt update && apt upgrade -y"'

VMs

VM            | Update Frequency | Notes
TrueNAS       | Monthly          | Via web UI or apt
Saltbox       | Weekly           | Managed by Saltbox updates
HomeAssistant | Monthly          | Via HA supervisor
Docker-host   | Weekly           | apt + Docker images
Trading-VM    | As needed        | Via SSH
Gitea-VM      | Monthly          | Via web UI + apt

Docker image updates:

ssh docker-host 'docker-compose pull && docker-compose up -d'

Firmware Updates

Component         | Check Frequency | Update Method
Motherboard BIOS  | Annually        | Manual flash (high risk)
GPU firmware      | Rarely          | nvidia-smi or manual
SSD/NVMe firmware | Quarterly       | Vendor tools
HBA firmware      | Annually        | LSI tools
UPS firmware      | Annually        | PowerPanel or manual

⚠️ Warning: BIOS/firmware updates carry risk. Only update if:

  • Critical security issue
  • Needed for hardware compatibility
  • Fixing known bug affecting you

Testing Checklists

Pre-Update Checklist

Before ANY system update:

  • Check current system state: uptime, qm list, zpool status
  • Verify backups are current (when backup system in place)
  • Check for critical VMs/services that can't have downtime
  • Review update changelog/release notes
  • Test on non-critical system first (PVE2 or test VM)
  • Plan rollback strategy if update fails
  • Notify users if downtime expected
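
To make the rollback decision easier, a sketch that snapshots the pre-update state into a timestamped file (the host and output path are assumptions):

#!/bin/bash
# Capture system state before an update, for comparison afterwards
host=pve
out=~/pre-update-$(date +%Y%m%d-%H%M%S)-$host.txt
{
  ssh $host 'uptime'
  ssh $host 'qm list'
  ssh $host 'pct list'
  ssh $host 'zpool status'
  ssh $host 'pveversion -v'
} > "$out"
echo "Saved pre-update state to $out"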

Post-Update Checklist

After system update:

  • Verify system booted correctly: uptime
  • Check all VMs/CTs started: qm list, pct list
  • Test critical services:
    • Pi-hole DNS: nslookup google.com 10.10.10.10
    • Traefik routing: curl -I https://plex.htsn.io
    • NFS/SMB shares: Test mount from VM
    • Syncthing sync: Check all devices connected
  • Review logs for errors: journalctl -p err -b
  • Check temperatures: sensors
  • Verify UPS monitoring: upsc cyberpower@localhost

Disaster Recovery Test

Quarterly test (when backup system in place):

  • Simulate VM failure: Restore from backup (see the restore sketch below)
  • Simulate storage failure: Import pool on different system
  • Simulate network failure: Verify Tailscale failover
  • Simulate power failure: Test UPS shutdown procedure (if safe)
  • Document recovery time and issues
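
When backups exist, the VM restore test might look like this sketch (archive name, test VMID, and target storage are placeholders):

# Restore a vzdump archive to an unused VMID so the original VM is untouched
ssh pve 'qmrestore /var/lib/vz/dump/vzdump-qemu-100-<timestamp>.vma.zst 9100 --storage <target-storage>'

# Boot it, verify services, then remove the test VM
ssh pve 'qm start 9100'
ssh pve 'qm stop 9100 && qm destroy 9100'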

Log Rotation

System logs are automatically rotated by systemd-journald and logrotate.

Check log sizes:

# Journalctl size
ssh pve 'journalctl --disk-usage'

# Traefik logs
ssh pve 'pct exec 202 -- du -sh /var/log/traefik/'

Configure retention:

# Limit journald to 500MB
ssh pve 'echo "SystemMaxUse=500M" >> /etc/systemd/journald.conf'
ssh pve 'systemctl restart systemd-journald'

Traefik log rotation (already configured):

# /etc/logrotate.d/traefik on CT 202
/var/log/traefik/*.log {
    daily
    rotate 7
    compress
    delaycompress
    missingok
    notifempty
}

Monitoring Integration

TODO: Set up automated monitoring for these procedures

When monitoring is implemented (see MONITORING.md):

  • ZFS scrub completion/errors
  • SMART test failures
  • Certificate expiry warnings (<30 days)
  • Update availability notifications
  • Disk space thresholds (>80%)
  • Temperature warnings (>85°C)
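
Until that monitoring exists, a stopgap cron-able sketch covering the disk-space and temperature thresholds above (the hosts, the Tctl sensor label, and the mail recipient are assumptions):

#!/bin/bash
# Stopgap alerts: pools over 80% full or CPU (Tctl) over 85°C
alerts=""
for host in pve pve2; do
  full=$(ssh $host 'zpool list -H -o name,capacity' | awk '{gsub("%","",$2); if ($2+0 > 80) print $1" at "$2"%"}')
  [ -n "$full" ] && alerts+="$host pools over 80%: $full"$'\n'

  temp=$(ssh $host 'sensors 2>/dev/null' | awk '/Tctl/ {print int($2); exit}')
  [ -n "$temp" ] && [ "$temp" -gt 85 ] && alerts+="$host Tctl at ${temp}C"$'\n'
done
[ -n "$alerts" ] && echo "$alerts" | mail -s "Homelab threshold alert" hutson@example.com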


Last Updated: 2025-12-22
Status: ⚠️ Manual procedures only - monitoring automation needed