
Maintenance Procedures and Schedules

Regular maintenance procedures for homelab infrastructure to ensure reliability and performance.

Overview

Frequency | Tasks                          | Estimated Time
Daily     | Quick health check             | 2-5 min
Weekly    | Service status, logs review    | 15-30 min
Monthly   | Updates, backups verification  | 1-2 hours
Quarterly | Full system audit, testing     | 2-4 hours
Annual    | Hardware maintenance, planning | 4-8 hours

Daily Maintenance (Automated)

Quick Health Check Script

Save as ~/bin/homelab-health-check.sh:

#!/bin/bash
# Daily homelab health check

echo "=== Homelab Health Check ==="
echo "Date: $(date)"
echo ""

echo "=== Server Status ==="
ssh pve 'uptime' 2>/dev/null || echo "PVE: UNREACHABLE"
ssh pve2 'uptime' 2>/dev/null || echo "PVE2: UNREACHABLE"
echo ""

echo "=== CPU Temperatures ==="
ssh pve 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE: $(($(cat $f)/1000))°C"; fi; done'
ssh pve2 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE2: $(($(cat $f)/1000))°C"; fi; done'
echo ""

echo "=== UPS Status ==="
ssh pve 'upsc cyberpower@localhost | grep -E "battery.charge:|battery.runtime:|ups.load:|ups.status:"'
echo ""

echo "=== ZFS Pools ==="
ssh pve 'zpool status -x' 2>/dev/null
ssh pve2 'zpool status -x' 2>/dev/null
ssh truenas 'zpool status -x vault'
echo ""

echo "=== Disk Space ==="
ssh pve 'df -h | grep -E "Filesystem|/dev/(nvme|sd)"'
ssh truenas 'df -h /mnt/vault'
echo ""

echo "=== VM Status ==="
ssh pve 'qm list | grep running | wc -l' | xargs echo "PVE VMs running:"
ssh pve2 'qm list | grep running | wc -l' | xargs echo "PVE2 VMs running:"
echo ""

echo "=== Syncthing Connections ==="
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
  "http://127.0.0.1:8384/rest/system/connections" | \
  python3 -c "import sys,json; d=json.load(sys.stdin)['connections']; \
  [print(f\"{v.get('name',k[:7])}: {'UP' if v['connected'] else 'DOWN'}\") for k,v in d.items()]"
echo ""

echo "=== Check Complete ==="

Run daily via cron:

# Add to crontab
0 9 * * * ~/bin/homelab-health-check.sh | mail -s "Homelab Health Check" hutson@example.com

Weekly Maintenance

Service Status Review

Check all critical services:

# Proxmox services
ssh pve 'systemctl status pve-cluster pvedaemon pveproxy'
ssh pve2 'systemctl status pve-cluster pvedaemon pveproxy'

# NUT (UPS monitoring)
ssh pve 'systemctl status nut-server nut-monitor'
ssh pve2 'systemctl status nut-monitor'

# Container services
ssh pve 'pct exec 200 -- systemctl status pihole-FTL'  # Pi-hole
ssh pve 'pct exec 202 -- systemctl status traefik'     # Traefik

# VM services (via QEMU agent)
ssh pve 'qm guest exec 100 -- bash -c "systemctl status nfs-server smbd"'  # TrueNAS

Log Review

Check for errors in critical logs:

# Proxmox system logs
ssh pve 'journalctl -p err -b | tail -50'
ssh pve2 'journalctl -p err -b | tail -50'

# VM logs (if QEMU agent available)
ssh pve 'qm guest exec 100 -- bash -c "journalctl -p err --since today"'

# Traefik access logs
ssh pve 'pct exec 202 -- tail -100 /var/log/traefik/access.log'

Syncthing Sync Status

Check for sync errors:

# Check all folder errors
for folder in documents downloads desktop movies pictures notes config; do
  echo "=== $folder ==="
  curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
    "http://127.0.0.1:8384/rest/folder/errors?folder=$folder" | jq
done

See: SYNCTHING.md


Monthly Maintenance

System Updates

Proxmox Updates

Check for updates:

ssh pve 'apt update && apt list --upgradable'
ssh pve2 'apt update && apt list --upgradable'

Apply updates:

# PVE
ssh pve 'apt update && apt dist-upgrade -y'

# PVE2
ssh pve2 'apt update && apt dist-upgrade -y'

# Reboot if kernel updated
ssh pve 'reboot'
ssh pve2 'reboot'

⚠️ Important:

  • Check Proxmox release notes before major updates
  • Test on PVE2 first if possible
  • Ensure all VMs are backed up before updating (see the vzdump sketch below)
  • Monitor VMs after reboot - some may need manual restart
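
The backup step above can use Proxmox's built-in vzdump; a minimal sketch (the VMIDs, target storage, and compression are assumptions, adjust to the actual environment):

# Back up selected VMs in snapshot mode before updating (VMIDs are examples)
ssh pve 'vzdump 100 101 --mode snapshot --storage local --compress zstd'

# Or back up every guest on the node
ssh pve 'vzdump --all --mode snapshot --storage local --compress zstd'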

Container Updates (LXC)

# Update all containers
ssh pve 'for ctid in 200 202 205; do pct exec $ctid -- bash -c "apt update && apt upgrade -y"; done'

VM Updates

Update VMs individually via SSH:

# Ubuntu/Debian VMs
ssh truenas 'apt update && apt upgrade -y'
ssh docker-host 'apt update && apt upgrade -y'
ssh fs-dev 'apt update && apt upgrade -y'

# Check if reboot required
ssh truenas '[ -f /var/run/reboot-required ] && echo "Reboot required"'

ZFS Scrubs

Schedule: Run monthly on all pools

PVE:

# Start scrub on all pools
ssh pve 'zpool scrub nvme-mirror1'
ssh pve 'zpool scrub nvme-mirror2'
ssh pve 'zpool scrub rpool'

# Check scrub status
ssh pve 'zpool status | grep -A2 scrub'

PVE2:

ssh pve2 'zpool scrub nvme-mirror3'
ssh pve2 'zpool scrub local-zfs2'
ssh pve2 'zpool status | grep -A2 scrub'

TrueNAS:

# Scrub via TrueNAS web UI or SSH
ssh truenas 'zpool scrub vault'
ssh truenas 'zpool status vault | grep -A2 scrub'

Automate scrubs:

# Add to crontab (run on 1st of month at 2 AM)
0 2 1 * * /sbin/zpool scrub nvme-mirror1
0 2 1 * * /sbin/zpool scrub nvme-mirror2
0 2 1 * * /sbin/zpool scrub rpool

See: STORAGE.md for pool details

SMART Tests

Run extended SMART tests monthly:

# TrueNAS drives (via QEMU agent)
ssh pve 'qm guest exec 100 -- bash -c "smartctl --scan | while read dev type; do smartctl -t long \$dev; done"'

# Check results after 4-8 hours
ssh pve 'qm guest exec 100 -- bash -c "smartctl --scan | while read dev type; do echo \"=== \$dev ===\"; smartctl -a \$dev | grep -E \"Model|Serial|test result|Reallocated|Current_Pending\"; done"'

# PVE drives
ssh pve 'for dev in /dev/nvme0 /dev/nvme1 /dev/sda /dev/sdb; do [ -e "$dev" ] && smartctl -t long $dev; done'

# PVE2 drives
ssh pve2 'for dev in /dev/nvme0 /dev/nvme1 /dev/sda /dev/sdb; do [ -e "$dev" ] && smartctl -t long $dev; done'

Automate SMART tests:

# Add to crontab (run on 15th of month at 3 AM)
0 3 15 * * /usr/sbin/smartctl -t long /dev/nvme0
0 3 15 * * /usr/sbin/smartctl -t long /dev/sda

Certificate Renewal Verification

Check SSL certificate expiry:

# Check Traefik certificates
ssh pve 'pct exec 202 -- cat /etc/traefik/acme.json | jq ".letsencrypt.Certificates[] | {domain: .domain.main, expires: .Dates.NotAfter}"'

# Check specific service
echo | openssl s_client -servername git.htsn.io -connect git.htsn.io:443 2>/dev/null | openssl x509 -noout -dates

Certificates should auto-renew 30 days before expiry via Traefik
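
A small sketch that turns the openssl check above into a warning when fewer than 30 days remain (the domain list is an example; extend it to every externally exposed service):

#!/bin/bash
# Warn when a certificate expires within 30 days
for host in git.htsn.io plex.htsn.io; do
  if ! echo | openssl s_client -servername "$host" -connect "$host:443" 2>/dev/null \
      | openssl x509 -noout -checkend $((30*24*3600)) >/dev/null; then
    echo "WARNING: certificate for $host expires within 30 days"
  fi
done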

See: TRAEFIK.md for certificate management

Backup Verification

⚠️ TODO: No backup strategy currently in place

See: BACKUP-STRATEGY.md for implementation plan


Quarterly Maintenance

Full System Audit

Check all systems comprehensively:

  1. ZFS Pool Health:

    ssh pve 'zpool status -v'
    ssh pve2 'zpool status -v'
    ssh truenas 'zpool status -v vault'
    

    Look for: errors, degraded vdevs, resilver operations

  2. SMART Health:

    # Run SMART health check script (a sketch of this script follows this list)
    ~/bin/smart-health-check.sh
    

    Look for: reallocated sectors, pending sectors, failures

  3. Disk Space Trends:

    # Check growth rate
    ssh pve 'zpool list -o name,size,allocated,free,fragmentation'
    ssh truenas 'df -h /mnt/vault'
    

    Plan for expansion if >80% full

  4. VM Resource Usage:

    # Check if VMs need more/less resources
    ssh pve 'qm list'
    ssh pve 'pvesh get /nodes/pve/status'
    
  5. Network Performance:

    # Test bandwidth between critical nodes
    iperf3 -s  # On one host
    iperf3 -c 10.10.10.120  # From another
    
  6. Temperature Monitoring:

    # Check max temps over past quarter
    # TODO: Set up Prometheus/Grafana for historical data
    ssh pve 'sensors'
    ssh pve2 'sensors'
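
The ~/bin/smart-health-check.sh referenced in step 2 is not defined in this file; a minimal sketch, assuming the same SSH access to pve, pve2, and truenas used throughout this document:

#!/bin/bash
# ~/bin/smart-health-check.sh - summarize SMART health across hosts
for host in pve pve2 truenas; do
  echo "=== $host ==="
  ssh "$host" 'smartctl --scan | while read dev type extra; do
    echo "--- $dev ---"
    smartctl -H $dev | grep -E "overall-health|SMART Health Status"
    smartctl -A $dev | grep -E "Reallocated_Sector|Current_Pending|Offline_Uncorrectable"
  done' || echo "$host: UNREACHABLE"
done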
    

Service Dependency Testing

Test critical paths:

  1. Power failure recovery (if safe to test):

    • See UPS.md for full procedure
    • Verify VM startup order works
    • Confirm all services come back online
  2. Failover testing (see the check commands after this list):

    • Tailscale subnet routing (PVE → UCG-Fiber)
    • NUT monitoring (PVE server → PVE2 client)
  3. Backup restoration (when backups implemented):

    • Test restoring a VM from backup
    • Test restoring files from Syncthing versioning
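
A few commands to exercise the failover paths above (the UPS name matches the NUT setup used earlier; the PVE address and the off-LAN test device are assumptions):

# NUT failover: confirm PVE2 can query the UPS that PVE serves
ssh pve2 'upsc cyberpower@<pve-ip> ups.status'

# Tailscale subnet routing: from a device OUTSIDE the LAN, confirm the
# advertised route still reaches an internal address (Pi-hole here)
tailscale status
ping -c 3 10.10.10.10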

Documentation Review

Review this documentation set (HARDWARE.md, STORAGE.md, UPS.md, TRAEFIK.md, SYNCTHING.md, MONITORING.md, BACKUP-STRATEGY.md) against the running configuration: update anything that has drifted, remove references to retired services, and verify cross-references still resolve.

Annual Maintenance

Hardware Maintenance

Physical cleaning:

# Shut down servers (coordinate with users)
ssh pve 'shutdown -h now'
ssh pve2 'shutdown -h now'

# Clean dust from:
# - CPU heatsinks
# - GPU fans
# - Case fans
# - PSU vents
# - Storage enclosure fans

# Check for:
# - Bulging capacitors on PSU/motherboard
# - Loose cables
# - Fan noise/vibration

Thermal paste inspection (every 2-3 years):

  • Check CPU temps vs baseline
  • If temps >85°C under load, consider reapplying paste
  • Threadripper PRO: Tctl max safe = 90°C

See: HARDWARE.md for component details

UPS Battery Test

Runtime test:

# Check battery health
ssh pve 'upsc cyberpower@localhost | grep battery'

# Perform runtime test (coordinate power loss)
# 1. Note current runtime estimate
# 2. Unplug UPS from wall
# 3. Let battery drain to 20%
# 4. Note actual runtime vs estimate
# 5. Plug back in before shutdown triggers

# Battery replacement if:
# - Runtime < 10 min at typical load
# - Battery age > 3-5 years
# - Battery charge < 100% when on AC for 24h

See: UPS.md for full UPS details

Drive Replacement Planning

Check drive age and health:

# Get drive hours and health
ssh truenas 'smartctl --scan | while read dev type; do
  echo "=== $dev ===";
  smartctl -a $dev | grep -E "Model|Serial|Power_On_Hours|Reallocated|Pending";
done'

Replace drives if:

  • Reallocated sectors > 0
  • Pending sectors > 0
  • SMART pre-fail warnings
  • Age > 5 years for HDDs (3-5 years for SSDs/NVMe)
  • Hours > 50,000 for consumer drives
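
A sketch that applies the first three thresholds automatically (assumes the ATA SMART attribute table; NVMe drives report health differently and are not covered here):

#!/bin/bash
# Flag drives with reallocated/pending sectors or more than 50,000 power-on hours
smartctl --scan | while read dev type extra; do
  hours=$(smartctl -A $dev | awk '/Power_On_Hours/ {print $10}')
  hours=${hours%%[^0-9]*}   # some drives append extra text to the raw value
  realloc=$(smartctl -A $dev | awk '/Reallocated_Sector_Ct/ {print $10}')
  pending=$(smartctl -A $dev | awk '/Current_Pending_Sector/ {print $10}')
  [ "${realloc:-0}" -gt 0 ] && echo "$dev: $realloc reallocated sectors - plan replacement"
  [ "${pending:-0}" -gt 0 ] && echo "$dev: $pending pending sectors - plan replacement"
  [ "${hours:-0}" -gt 50000 ] && echo "$dev: $hours power-on hours - consider replacement"
done

Run it on each host, e.g. ssh truenas 'bash -s' < drive-age-check.sh (the filename is arbitrary).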

Budget for replacements:

  • HDDs: WD Red 6TB (~$150/drive)
  • NVMe: Samsung/Kingston 2TB (~$150-200/drive)

Capacity Planning

Review growth trends:

# Storage growth (compare to last year)
ssh pve 'zpool list'
ssh truenas 'df -h /mnt/vault'

# Network bandwidth (if monitoring in place)
# Review Grafana dashboards

# Power consumption
ssh pve 'upsc cyberpower@localhost ups.load'

Plan expansions:

  • Storage: Add drives if >70% full
  • RAM: Check if VMs hitting limits
  • Network: Upgrade if bandwidth saturation
  • UPS: Upgrade if load >80%

License and Subscription Review

Proxmox subscription (if applicable):

  • Community (free) or Enterprise subscription?
  • Check for updates to pricing/features

Service subscriptions:

  • Domain registration (htsn.io)
  • Cloudflare plan (currently free)
  • Let's Encrypt (free, no action needed)

Update Schedules

Proxmox

Component        | Frequency | Notes
Security patches | Weekly    | Via apt upgrade
Minor updates    | Monthly   | Test on PVE2 first
Major versions   | Quarterly | Read release notes, plan downtime
Kernel updates   | Monthly   | Requires reboot

Update procedure:

  1. Check Proxmox release notes
  2. Back up VMs: vzdump <vmid> --dumpdir /tmp (or vzdump --all --dumpdir /tmp)
  3. Update: apt update && apt dist-upgrade
  4. Reboot if kernel changed: reboot
  5. Verify VMs auto-started: qm list

Containers (LXC)

Container      | Update Frequency | Package Manager
Pi-hole (200)  | Weekly           | apt
Traefik (202)  | Monthly          | apt
FindShyt (205) | As needed        | apt

Update command:

ssh pve 'pct exec CTID -- bash -c "apt update && apt upgrade -y"'

VMs

VM            | Update Frequency | Notes
TrueNAS       | Monthly          | Via web UI or apt
Saltbox       | Weekly           | Managed by Saltbox updates
HomeAssistant | Monthly          | Via HA supervisor
Docker-host   | Weekly           | apt + Docker images
Trading-VM    | As needed        | Via SSH
Gitea-VM      | Monthly          | Via web UI + apt

Docker image updates:

ssh docker-host 'docker-compose pull && docker-compose up -d'

Firmware Updates

Component         | Check Frequency | Update Method
Motherboard BIOS  | Annually        | Manual flash (high risk)
GPU firmware      | Rarely          | nvidia-smi or manual
SSD/NVMe firmware | Quarterly       | Vendor tools
HBA firmware      | Annually        | LSI tools
UPS firmware      | Annually        | PowerPanel or manual

⚠️ Warning: BIOS/firmware updates carry risk. Only update if:

  • Critical security issue
  • Needed for hardware compatibility
  • Fixing known bug affecting you

Testing Checklists

Pre-Update Checklist

Before ANY system update:

  • Check current system state: uptime, qm list, zpool status
  • Verify backups are current (when backup system in place)
  • Check for critical VMs/services that can't have downtime
  • Review update changelog/release notes
  • Test on non-critical system first (PVE2 or test VM)
  • Plan rollback strategy if update fails
  • Notify users if downtime expected
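
To make the rollback decision easier, a sketch that snapshots the pre-update state into a timestamped file (the host and output path are assumptions):

#!/bin/bash
# Capture system state before an update, for comparison afterwards
host=pve
out=~/pre-update-$(date +%Y%m%d-%H%M%S)-$host.txt
{
  ssh $host 'uptime'
  ssh $host 'qm list'
  ssh $host 'pct list'
  ssh $host 'zpool status'
  ssh $host 'pveversion -v'
} > "$out"
echo "Saved pre-update state to $out"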

Post-Update Checklist

After system update:

  • Verify system booted correctly: uptime
  • Check all VMs/CTs started: qm list, pct list
  • Test critical services:
    • Pi-hole DNS: nslookup google.com 10.10.10.10
    • Traefik routing: curl -I https://plex.htsn.io
    • NFS/SMB shares: Test mount from VM
    • Syncthing sync: Check all devices connected
  • Review logs for errors: journalctl -p err -b
  • Check temperatures: sensors
  • Verify UPS monitoring: upsc cyberpower@localhost

Disaster Recovery Test

Quarterly test (when backup system in place):

  • Simulate VM failure: Restore from backup (see the restore sketch below)
  • Simulate storage failure: Import pool on different system
  • Simulate network failure: Verify Tailscale failover
  • Simulate power failure: Test UPS shutdown procedure (if safe)
  • Document recovery time and issues
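
When backups exist, the VM restore test might look like this sketch (archive name, test VMID, and target storage are placeholders):

# Restore a vzdump archive to an unused VMID so the original VM is untouched
ssh pve 'qmrestore /var/lib/vz/dump/vzdump-qemu-100-<timestamp>.vma.zst 9100 --storage <target-storage>'

# Boot it, verify services, then remove the test VM
ssh pve 'qm start 9100'
ssh pve 'qm stop 9100 && qm destroy 9100'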

Log Rotation

System logs are automatically rotated by systemd-journald and logrotate.

Check log sizes:

# Journalctl size
ssh pve 'journalctl --disk-usage'

# Traefik logs
ssh pve 'pct exec 202 -- du -sh /var/log/traefik/'

Configure retention:

# Limit journald to 500MB
ssh pve 'echo "SystemMaxUse=500M" >> /etc/systemd/journald.conf'
ssh pve 'systemctl restart systemd-journald'

Traefik log rotation (already configured):

# /etc/logrotate.d/traefik on CT 202
/var/log/traefik/*.log {
    daily
    rotate 7
    compress
    delaycompress
    missingok
    notifempty
}

Monitoring Integration

TODO: Set up automated monitoring for these procedures

When monitoring is implemented (see MONITORING.md):

  • ZFS scrub completion/errors
  • SMART test failures
  • Certificate expiry warnings (<30 days)
  • Update availability notifications
  • Disk space thresholds (>80%)
  • Temperature warnings (>85°C)
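
Until that monitoring exists, a stopgap cron-able sketch covering the disk-space and temperature thresholds above (the hosts, the Tctl sensor label, and the mail recipient are assumptions):

#!/bin/bash
# Stopgap alerts: pools over 80% full or CPU (Tctl) over 85°C
alerts=""
for host in pve pve2; do
  full=$(ssh $host 'zpool list -H -o name,capacity' | awk '{gsub("%","",$2); if ($2+0 > 80) print $1" at "$2"%"}')
  [ -n "$full" ] && alerts+="$host pools over 80%: $full"$'\n'

  temp=$(ssh $host 'sensors 2>/dev/null' | awk '/Tctl/ {print int($2); exit}')
  [ -n "$temp" ] && [ "$temp" -gt 85 ] && alerts+="$host Tctl at ${temp}C"$'\n'
done
[ -n "$alerts" ] && echo "$alerts" | mail -s "Homelab threshold alert" hutson@example.com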


Last Updated: 2025-12-22
Status: ⚠️ Manual procedures only - monitoring automation needed