# Maintenance Procedures and Schedules

Regular maintenance procedures for homelab infrastructure to ensure reliability and performance.

## Overview

| Frequency | Tasks | Estimated Time |
|-----------|-------|----------------|
| **Daily** | Quick health check | 2-5 min |
| **Weekly** | Service status, logs review | 15-30 min |
| **Monthly** | Updates, backup verification | 1-2 hours |
| **Quarterly** | Full system audit, testing | 2-4 hours |
| **Annual** | Hardware maintenance, planning | 4-8 hours |

---

## Daily Maintenance (Automated)

### Quick Health Check Script

Save as `~/bin/homelab-health-check.sh`:

```bash
#!/bin/bash
# Daily homelab health check

echo "=== Homelab Health Check ==="
echo "Date: $(date)"
echo ""

echo "=== Server Status ==="
ssh pve 'uptime' 2>/dev/null || echo "PVE: UNREACHABLE"
ssh pve2 'uptime' 2>/dev/null || echo "PVE2: UNREACHABLE"
echo ""

echo "=== CPU Temperatures ==="
ssh pve 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE: $(($(cat $f)/1000))°C"; fi; done'
ssh pve2 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE2: $(($(cat $f)/1000))°C"; fi; done'
echo ""

echo "=== UPS Status ==="
ssh pve 'upsc cyberpower@localhost | grep -E "battery.charge:|battery.runtime:|ups.load:|ups.status:"'
echo ""

echo "=== ZFS Pools ==="
ssh pve 'zpool status -x' 2>/dev/null
ssh pve2 'zpool status -x' 2>/dev/null
ssh truenas 'zpool status -x vault'
echo ""

echo "=== Disk Space ==="
ssh pve 'df -h | grep -E "Filesystem|/dev/(nvme|sd)"'
ssh truenas 'df -h /mnt/vault'
echo ""

echo "=== VM Status ==="
ssh pve 'qm list | grep running | wc -l' | xargs echo "PVE VMs running:"
ssh pve2 'qm list | grep running | wc -l' | xargs echo "PVE2 VMs running:"
echo ""

echo "=== Syncthing Connections ==="
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
  "http://127.0.0.1:8384/rest/system/connections" | \
  python3 -c "import sys,json; d=json.load(sys.stdin)['connections']; \
[print(f\"{v.get('name',k[:7])}: {'UP' if v['connected'] else 'DOWN'}\") for k,v in d.items()]"
echo ""

echo "=== Check Complete ==="
```

**Run daily via cron**:

```bash
# Add to crontab
0 9 * * * ~/bin/homelab-health-check.sh | mail -s "Homelab Health Check" hutson@example.com
```

---

## Weekly Maintenance

### Service Status Review

**Check all critical services**:

```bash
# Proxmox services
ssh pve 'systemctl status pve-cluster pvedaemon pveproxy'
ssh pve2 'systemctl status pve-cluster pvedaemon pveproxy'

# NUT (UPS monitoring)
ssh pve 'systemctl status nut-server nut-monitor'
ssh pve2 'systemctl status nut-monitor'

# Container services
ssh pve 'pct exec 200 -- systemctl status pihole-FTL'   # Pi-hole
ssh pve 'pct exec 202 -- systemctl status traefik'      # Traefik

# VM services (via QEMU agent)
ssh pve 'qm guest exec 100 -- bash -c "systemctl status nfs-server smbd"'   # TrueNAS
```

### Log Review

**Check for errors in critical logs**:

```bash
# Proxmox system logs
ssh pve 'journalctl -p err -b | tail -50'
ssh pve2 'journalctl -p err -b | tail -50'

# VM logs (if QEMU agent available)
ssh pve 'qm guest exec 100 -- bash -c "journalctl -p err --since today"'

# Traefik access logs
ssh pve 'pct exec 202 -- tail -100 /var/log/traefik/access.log'
```
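
For a quicker weekly pass, an error count per host can flag which machine deserves a closer look before reading full logs. A minimal sketch, assuming the same `pve`/`pve2` SSH aliases used throughout this document:

```bash
#!/bin/bash
# Weekly error summary: count error-level journal entries since last boot on each host.
# Host aliases (pve, pve2) are the SSH shortcuts used in the health check script above.
for host in pve pve2; do
  count=$(ssh "$host" 'journalctl -p err -b --no-pager | wc -l' 2>/dev/null) || count="unreachable"
  echo "$host: $count error lines this boot"
done
```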
### Syncthing Sync Status

**Check for sync errors**:

```bash
# Check all folder errors
for folder in documents downloads desktop movies pictures notes config; do
  echo "=== $folder ==="
  curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
    "http://127.0.0.1:8384/rest/folder/errors?folder=$folder" | jq
done
```

**See**: [SYNCTHING.md](SYNCTHING.md)

---

## Monthly Maintenance

### System Updates

#### Proxmox Updates

**Check for updates**:

```bash
ssh pve 'apt update && apt list --upgradable'
ssh pve2 'apt update && apt list --upgradable'
```

**Apply updates**:

```bash
# PVE
ssh pve 'apt update && apt dist-upgrade -y'

# PVE2
ssh pve2 'apt update && apt dist-upgrade -y'

# Reboot if kernel updated
ssh pve 'reboot'
ssh pve2 'reboot'
```

**⚠️ Important**:
- Check [Proxmox release notes](https://pve.proxmox.com/wiki/Roadmap) before major updates
- Test on PVE2 first if possible
- Ensure all VMs are backed up before updating
- Monitor VMs after reboot - some may need manual restart

#### Container Updates (LXC)

```bash
# Update all containers
ssh pve 'for ctid in 200 202 205; do pct exec $ctid -- bash -c "apt update && apt upgrade -y"; done'
```

#### VM Updates

**Update VMs individually via SSH**:

```bash
# Ubuntu/Debian VMs
ssh truenas 'apt update && apt upgrade -y'
ssh docker-host 'apt update && apt upgrade -y'
ssh fs-dev 'apt update && apt upgrade -y'

# Check if reboot required
ssh truenas '[ -f /var/run/reboot-required ] && echo "Reboot required"'
```

### ZFS Scrubs

**Schedule**: Run monthly on all pools

**PVE**:

```bash
# Start scrub on all pools
ssh pve 'zpool scrub nvme-mirror1'
ssh pve 'zpool scrub nvme-mirror2'
ssh pve 'zpool scrub rpool'

# Check scrub status
ssh pve 'zpool status | grep -A2 scrub'
```

**PVE2**:

```bash
ssh pve2 'zpool scrub nvme-mirror3'
ssh pve2 'zpool scrub local-zfs2'
ssh pve2 'zpool status | grep -A2 scrub'
```

**TrueNAS**:

```bash
# Scrub via TrueNAS web UI or SSH
ssh truenas 'zpool scrub vault'
ssh truenas 'zpool status vault | grep -A2 scrub'
```

**Automate scrubs**:

```bash
# Add to crontab (run on 1st of month at 2 AM)
0 2 1 * * /sbin/zpool scrub nvme-mirror1
0 2 1 * * /sbin/zpool scrub nvme-mirror2
0 2 1 * * /sbin/zpool scrub rpool
```

**See**: [STORAGE.md](STORAGE.md) for pool details

### SMART Tests

**Run extended SMART tests monthly**:

```bash
# TrueNAS drives (via QEMU agent)
ssh pve 'qm guest exec 100 -- bash -c "smartctl --scan | while read dev type; do smartctl -t long \$dev; done"'

# Check results after 4-8 hours
ssh pve 'qm guest exec 100 -- bash -c "smartctl --scan | while read dev type; do echo \"=== \$dev ===\"; smartctl -a \$dev | grep -E \"Model|Serial|test result|Reallocated|Current_Pending\"; done"'

# PVE drives
ssh pve 'for dev in /dev/nvme0 /dev/nvme1 /dev/sda /dev/sdb; do [ -e "$dev" ] && smartctl -t long $dev; done'

# PVE2 drives
ssh pve2 'for dev in /dev/nvme0 /dev/nvme1 /dev/sda /dev/sdb; do [ -e "$dev" ] && smartctl -t long $dev; done'
```

**Automate SMART tests**:

```bash
# Add to crontab (run on 15th of month at 3 AM)
0 3 15 * * /usr/sbin/smartctl -t long /dev/nvme0
0 3 15 * * /usr/sbin/smartctl -t long /dev/sda
```

### Certificate Renewal Verification

**Check SSL certificate expiry**:

```bash
# Check Traefik certificates
ssh pve 'pct exec 202 -- cat /etc/traefik/acme.json | jq ".letsencrypt.Certificates[] | {domain: .domain.main, expires: .Dates.NotAfter}"'

# Check specific service
echo | openssl s_client -servername git.htsn.io -connect git.htsn.io:443 2>/dev/null | openssl x509 -noout -dates
```

**Certificates should auto-renew 30 days before expiry via Traefik**

**See**: [TRAEFIK.md](TRAEFIK.md) for certificate management
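
To turn the spot check above into a warning, the same `openssl` probe can be looped over each public hostname and compared against the 30-day renewal window. A sketch only; the domain list is an assumption and should be adjusted to the services actually published through Traefik:

```bash
#!/bin/bash
# Warn when a certificate is inside the Traefik renewal window (30 days).
# Domain list is illustrative; git.htsn.io and plex.htsn.io are services named in this document.
DOMAINS="git.htsn.io plex.htsn.io"
for d in $DOMAINS; do
  expiry=$(echo | openssl s_client -servername "$d" -connect "$d:443" 2>/dev/null \
    | openssl x509 -noout -enddate | cut -d= -f2)
  expiry_epoch=$(date -d "$expiry" +%s)
  days_left=$(( (expiry_epoch - $(date +%s)) / 86400 ))
  if [ "$days_left" -lt 30 ]; then
    echo "WARNING: $d certificate expires in $days_left days"
  else
    echo "OK: $d ($days_left days left)"
  fi
done
```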
### Backup Verification

**⚠️ TODO**: No backup strategy currently in place

**See**: [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md) for implementation plan

---

## Quarterly Maintenance

### Full System Audit

**Check all systems comprehensively**:

1. **ZFS Pool Health**:

   ```bash
   ssh pve 'zpool status -v'
   ssh pve2 'zpool status -v'
   ssh truenas 'zpool status -v vault'
   ```

   Look for: errors, degraded vdevs, resilver operations

2. **SMART Health** (a sketch of this script follows the list):

   ```bash
   # Run SMART health check script
   ~/bin/smart-health-check.sh
   ```

   Look for: reallocated sectors, pending sectors, failures

3. **Disk Space Trends**:

   ```bash
   # Check growth rate
   ssh pve 'zpool list -o name,size,allocated,free,fragmentation'
   ssh truenas 'df -h /mnt/vault'
   ```

   Plan for expansion if >80% full

4. **VM Resource Usage**:

   ```bash
   # Check if VMs need more/less resources
   ssh pve 'qm list'
   ssh pve 'pvesh get /nodes/pve/status'
   ```

5. **Network Performance**:

   ```bash
   # Test bandwidth between critical nodes
   iperf3 -s                # On one host
   iperf3 -c 10.10.10.120   # From another
   ```

6. **Temperature Monitoring**:

   ```bash
   # Check max temps over past quarter
   # TODO: Set up Prometheus/Grafana for historical data
   ssh pve 'sensors'
   ssh pve2 'sensors'
   ```
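
The `~/bin/smart-health-check.sh` script referenced in item 2 is not documented elsewhere in this section. A minimal sketch of what it could contain, assuming the same hosts and device paths as the monthly SMART test commands (not the actual script):

```bash
#!/bin/bash
# smart-health-check.sh - minimal sketch only, not the script referenced above.
# Hosts and device lists mirror the monthly SMART test commands and are assumptions.
for host in pve pve2; do
  echo "=== $host ==="
  ssh "$host" 'for dev in /dev/nvme0 /dev/nvme1 /dev/sda /dev/sdb; do
    [ -e "$dev" ] || continue
    echo "--- $dev ---"
    smartctl -H "$dev" | grep -iE "overall-health|health status"
    smartctl -A "$dev" | grep -E "Reallocated_Sector|Current_Pending|Media and Data Integrity"
  done'
done
```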
### Service Dependency Testing

**Test critical paths**:

1. **Power failure recovery** (if safe to test):
   - See [UPS.md](UPS.md) for full procedure
   - Verify VM startup order works
   - Confirm all services come back online

2. **Failover testing**:
   - Tailscale subnet routing (PVE → UCG-Fiber)
   - NUT monitoring (PVE server → PVE2 client)

3. **Backup restoration** (when backups implemented):
   - Test restoring a VM from backup
   - Test restoring files from Syncthing versioning

### Documentation Review

- [ ] Update IP assignments in [IP-ASSIGNMENTS.md](IP-ASSIGNMENTS.md)
- [ ] Review and update service URLs in [SERVICES.md](SERVICES.md)
- [ ] Check for missing hardware specs in [HARDWARE.md](HARDWARE.md)
- [ ] Update any changed procedures in this document

---

## Annual Maintenance

### Hardware Maintenance

**Physical cleaning**:

```bash
# Shut down servers (coordinate with users)
ssh pve 'shutdown -h now'
ssh pve2 'shutdown -h now'

# Clean dust from:
# - CPU heatsinks
# - GPU fans
# - Case fans
# - PSU vents
# - Storage enclosure fans

# Check for:
# - Bulging capacitors on PSU/motherboard
# - Loose cables
# - Fan noise/vibration
```

**Thermal paste inspection** (every 2-3 years):
- Check CPU temps vs baseline
- If temps >85°C under load, consider reapplying paste
- Threadripper PRO: Tctl max safe = 90°C

**See**: [HARDWARE.md](HARDWARE.md) for component details

### UPS Battery Test

**Runtime test**:

```bash
# Check battery health
ssh pve 'upsc cyberpower@localhost | grep battery'

# Perform runtime test (coordinate power loss)
# 1. Note current runtime estimate
# 2. Unplug UPS from wall
# 3. Let battery drain to 20%
# 4. Note actual runtime vs estimate
# 5. Plug back in before shutdown triggers

# Battery replacement if:
# - Runtime < 10 min at typical load
# - Battery age > 3-5 years
# - Battery charge < 100% when on AC for 24h
```

**See**: [UPS.md](UPS.md) for full UPS details

### Drive Replacement Planning

**Check drive age and health**:

```bash
# Get drive hours and health
ssh truenas 'smartctl --scan | while read dev type; do echo "=== $dev ==="; smartctl -a $dev | grep -E "Model|Serial|Power_On_Hours|Reallocated|Pending"; done'
```

**Replace drives if**:
- Reallocated sectors > 0
- Pending sectors > 0
- SMART pre-fail warnings
- Age > 5 years for HDDs (3-5 years for SSDs/NVMe)
- Hours > 50,000 for consumer drives

**Budget for replacements**:
- HDDs: WD Red 6TB (~$150/drive)
- NVMe: Samsung/Kingston 2TB (~$150-200/drive)

### Capacity Planning

**Review growth trends**:

```bash
# Storage growth (compare to last year)
ssh pve 'zpool list'
ssh truenas 'df -h /mnt/vault'

# Network bandwidth (if monitoring in place)
# Review Grafana dashboards

# Power consumption
ssh pve 'upsc cyberpower@localhost ups.load'
```

**Plan expansions**:
- Storage: Add drives if >70% full
- RAM: Check if VMs hitting limits
- Network: Upgrade if bandwidth saturation
- UPS: Upgrade if load >80%

### License and Subscription Review

**Proxmox subscription** (if applicable):
- Community (free) or Enterprise subscription?
- Check for updates to pricing/features

**Service subscriptions**:
- Domain registration (htsn.io)
- Cloudflare plan (currently free)
- Let's Encrypt (free, no action needed)

---

## Update Schedules

### Proxmox

| Component | Frequency | Notes |
|-----------|-----------|-------|
| Security patches | Weekly | Via `apt upgrade` |
| Minor updates | Monthly | Test on PVE2 first |
| Major versions | Quarterly | Read release notes, plan downtime |
| Kernel updates | Monthly | Requires reboot |

**Update procedure**:
1. Check [Proxmox release notes](https://pve.proxmox.com/wiki/Roadmap)
2. Backup VM configs: `vzdump --dumpdir /tmp`
3. Update: `apt update && apt dist-upgrade`
4. Reboot if kernel changed: `reboot`
5. Verify VMs auto-started: `qm list`

### Containers (LXC)

| Container | Update Frequency | Package Manager |
|-----------|------------------|-----------------|
| Pi-hole (200) | Weekly | `apt` |
| Traefik (202) | Monthly | `apt` |
| FindShyt (205) | As needed | `apt` |

**Update command**:

```bash
ssh pve 'pct exec CTID -- bash -c "apt update && apt upgrade -y"'
```

### VMs

| VM | Update Frequency | Notes |
|----|------------------|-------|
| TrueNAS | Monthly | Via web UI or `apt` |
| Saltbox | Weekly | Managed by Saltbox updates |
| HomeAssistant | Monthly | Via HA supervisor |
| Docker-host | Weekly | `apt` + Docker images |
| Trading-VM | As needed | Via SSH |
| Gitea-VM | Monthly | Via web UI + `apt` |

**Docker image updates**:

```bash
ssh docker-host 'docker-compose pull && docker-compose up -d'
```

### Firmware Updates

| Component | Check Frequency | Update Method |
|-----------|-----------------|---------------|
| Motherboard BIOS | Annually | Manual flash (high risk) |
| GPU firmware | Rarely | `nvidia-smi` or manual |
| SSD/NVMe firmware | Quarterly | Vendor tools |
| HBA firmware | Annually | LSI tools |
| UPS firmware | Annually | PowerPanel or manual |

**⚠️ Warning**: BIOS/firmware updates carry risk. Only update if:
- Critical security issue
- Needed for hardware compatibility
- Fixing known bug affecting you

---

## Testing Checklists

### Pre-Update Checklist

Before ANY system update:

- [ ] Check current system state: `uptime`, `qm list`, `zpool status`
- [ ] Verify backups are current (when backup system in place)
- [ ] Check for critical VMs/services that can't have downtime
- [ ] Review update changelog/release notes
- [ ] Test on non-critical system first (PVE2 or test VM)
- [ ] Plan rollback strategy if update fails
- [ ] Notify users if downtime expected

### Post-Update Checklist

After system update:

- [ ] Verify system booted correctly: `uptime`
- [ ] Check all VMs/CTs started: `qm list`, `pct list`
- [ ] Test critical services:
  - [ ] Pi-hole DNS: `nslookup google.com 10.10.10.10`
  - [ ] Traefik routing: `curl -I https://plex.htsn.io`
  - [ ] NFS/SMB shares: Test mount from VM
  - [ ] Syncthing sync: Check all devices connected
- [ ] Review logs for errors: `journalctl -p err -b`
- [ ] Check temperatures: `sensors`
- [ ] Verify UPS monitoring: `upsc cyberpower@localhost`
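
Much of the checklist above can be run in one pass from the management machine. A sketch that chains the commands already listed; host aliases and service URLs are the ones used earlier in this document:

```bash
#!/bin/bash
# Post-update spot check - sketch combining the checklist commands above.
echo "=== Uptime / VM / CT status ==="
ssh pve 'uptime; qm list; pct list'

echo "=== Pi-hole DNS ==="
nslookup google.com 10.10.10.10 | tail -2

echo "=== Traefik routing ==="
curl -sI https://plex.htsn.io | head -1

echo "=== Recent errors ==="
ssh pve 'journalctl -p err -b --no-pager | tail -20'

echo "=== Temperatures ==="
ssh pve 'sensors | grep Tctl'

echo "=== UPS ==="
ssh pve 'upsc cyberpower@localhost ups.status'
```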
### Disaster Recovery Test

**Quarterly test** (when backup system in place):

- [ ] Simulate VM failure: Restore from backup
- [ ] Simulate storage failure: Import pool on different system
- [ ] Simulate network failure: Verify Tailscale failover
- [ ] Simulate power failure: Test UPS shutdown procedure (if safe)
- [ ] Document recovery time and issues

---

## Log Rotation

**System logs** are automatically rotated by systemd-journald and logrotate.

**Check log sizes**:

```bash
# Journalctl size
ssh pve 'journalctl --disk-usage'

# Traefik logs
ssh pve 'pct exec 202 -- du -sh /var/log/traefik/'
```

**Configure retention**:

```bash
# Limit journald to 500MB
ssh pve 'echo "SystemMaxUse=500M" >> /etc/systemd/journald.conf'
ssh pve 'systemctl restart systemd-journald'
```

**Traefik log rotation** (already configured):

```bash
# /etc/logrotate.d/traefik on CT 202
/var/log/traefik/*.log {
    daily
    rotate 7
    compress
    delaycompress
    missingok
    notifempty
}
```

---

## Monitoring Integration

**TODO**: Set up automated monitoring for these procedures

**When monitoring is implemented** (see [MONITORING.md](MONITORING.md)):
- ZFS scrub completion/errors
- SMART test failures
- Certificate expiry warnings (<30 days)
- Update availability notifications
- Disk space thresholds (>80%)
- Temperature warnings (>85°C)

---

## Related Documentation

- [MONITORING.md](MONITORING.md) - Automated health checks and alerts
- [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md) - Backup implementation plan
- [UPS.md](UPS.md) - Power failure procedures
- [STORAGE.md](STORAGE.md) - ZFS pool management
- [HARDWARE.md](HARDWARE.md) - Hardware specifications
- [SERVICES.md](SERVICES.md) - Service inventory

---

**Last Updated**: 2025-12-22

**Status**: ⚠️ Manual procedures only - monitoring automation needed