Complete Phase 2 documentation: Add HARDWARE, SERVICES, MONITORING, MAINTENANCE
Phase 2 documentation implementation:

- Created HARDWARE.md: Complete hardware inventory (servers, GPUs, storage, network cards)
- Created SERVICES.md: Service inventory with URLs, credentials, health checks (25+ services)
- Created MONITORING.md: Health monitoring recommendations, alert setup, implementation plan
- Created MAINTENANCE.md: Regular procedures, update schedules, testing checklists
- Updated README.md: Added all Phase 2 documentation links
- Updated CLAUDE.md: Cleaned up to quick reference only (1340→377 lines)

All detailed content now in specialized documentation files with cross-references.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
MAINTENANCE.md (new file, 618 lines)
# Maintenance Procedures and Schedules

Regular maintenance procedures for homelab infrastructure to ensure reliability and performance.

## Overview

| Frequency | Tasks | Estimated Time |
|-----------|-------|----------------|
| **Daily** | Quick health check | 2-5 min |
| **Weekly** | Service status, logs review | 15-30 min |
| **Monthly** | Updates, backups verification | 1-2 hours |
| **Quarterly** | Full system audit, testing | 2-4 hours |
| **Annual** | Hardware maintenance, planning | 4-8 hours |

---

## Daily Maintenance (Automated)

### Quick Health Check Script

Save as `~/bin/homelab-health-check.sh`:

```bash
#!/bin/bash
# Daily homelab health check

echo "=== Homelab Health Check ==="
echo "Date: $(date)"
echo ""

echo "=== Server Status ==="
ssh pve 'uptime' 2>/dev/null || echo "PVE: UNREACHABLE"
ssh pve2 'uptime' 2>/dev/null || echo "PVE2: UNREACHABLE"
echo ""

echo "=== CPU Temperatures ==="
ssh pve 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE: $(($(cat $f)/1000))°C"; fi; done'
ssh pve2 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE2: $(($(cat $f)/1000))°C"; fi; done'
echo ""

echo "=== UPS Status ==="
ssh pve 'upsc cyberpower@localhost | grep -E "battery.charge:|battery.runtime:|ups.load:|ups.status:"'
echo ""

echo "=== ZFS Pools ==="
ssh pve 'zpool status -x' 2>/dev/null
ssh pve2 'zpool status -x' 2>/dev/null
ssh truenas 'zpool status -x vault'
echo ""

echo "=== Disk Space ==="
ssh pve 'df -h | grep -E "Filesystem|/dev/(nvme|sd)"'
ssh truenas 'df -h /mnt/vault'
echo ""

echo "=== VM Status ==="
ssh pve 'qm list | grep running | wc -l' | xargs echo "PVE VMs running:"
ssh pve2 'qm list | grep running | wc -l' | xargs echo "PVE2 VMs running:"
echo ""

echo "=== Syncthing Connections ==="
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
  "http://127.0.0.1:8384/rest/system/connections" | \
  python3 -c "import sys,json; d=json.load(sys.stdin)['connections']; \
  [print(f\"{v.get('name',k[:7])}: {'UP' if v['connected'] else 'DOWN'}\") for k,v in d.items()]"
echo ""

echo "=== Check Complete ==="
```

**Run daily via cron**:
```bash
# Add to crontab
0 9 * * * ~/bin/homelab-health-check.sh | mail -s "Homelab Health Check" hutson@example.com
```
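
Before relying on the cron entry, it helps to run the script once by hand; note that `mail` only delivers if the host has a working MTA configured. A minimal setup sketch (paths and address as used above):

```bash
# One-time setup sketch - assumes ~/bin exists and `mail` can deliver to your address
chmod +x ~/bin/homelab-health-check.sh
~/bin/homelab-health-check.sh        # run interactively first to verify output
(crontab -l 2>/dev/null; echo '0 9 * * * ~/bin/homelab-health-check.sh | mail -s "Homelab Health Check" hutson@example.com') | crontab -
crontab -l | grep homelab-health-check   # confirm the entry is installed
```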

---

## Weekly Maintenance

### Service Status Review

**Check all critical services**:
```bash
# Proxmox services
ssh pve 'systemctl status pve-cluster pvedaemon pveproxy'
ssh pve2 'systemctl status pve-cluster pvedaemon pveproxy'

# NUT (UPS monitoring)
ssh pve 'systemctl status nut-server nut-monitor'
ssh pve2 'systemctl status nut-monitor'

# Container services
ssh pve 'pct exec 200 -- systemctl status pihole-FTL'   # Pi-hole
ssh pve 'pct exec 202 -- systemctl status traefik'      # Traefik

# VM services (via QEMU agent)
ssh pve 'qm guest exec 100 -- bash -c "systemctl status nfs-server smbd"'   # TrueNAS
```
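
For a quicker weekly pass, a condensed sketch using `systemctl is-active` prints one line per service instead of full status output (unit names as listed above):

```bash
# Condensed sketch: one line per service (active/inactive/failed)
for svc in pve-cluster pvedaemon pveproxy nut-server nut-monitor; do
  printf 'pve   %-12s %s\n' "$svc" "$(ssh pve systemctl is-active "$svc")"
done
for svc in pve-cluster pvedaemon pveproxy nut-monitor; do
  printf 'pve2  %-12s %s\n' "$svc" "$(ssh pve2 systemctl is-active "$svc")"
done
```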

### Log Review

**Check for errors in critical logs**:
```bash
# Proxmox system logs
ssh pve 'journalctl -p err -b | tail -50'
ssh pve2 'journalctl -p err -b | tail -50'

# VM logs (if QEMU agent available)
ssh pve 'qm guest exec 100 -- bash -c "journalctl -p err --since today"'

# Traefik access logs
ssh pve 'pct exec 202 -- tail -100 /var/log/traefik/access.log'
```

### Syncthing Sync Status

**Check for sync errors**:
```bash
# Check all folder errors
for folder in documents downloads desktop movies pictures notes config; do
  echo "=== $folder ==="
  curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
    "http://127.0.0.1:8384/rest/folder/errors?folder=$folder" | jq
done
```
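
To surface only folders that actually report problems (rather than reading raw `jq` output), a sketch assuming the `/rest/folder/errors` response exposes an `errors` array:

```bash
# Sketch: print only folders with a non-zero error count
API_KEY="oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5"
for folder in documents downloads desktop movies pictures notes config; do
  count=$(curl -s -H "X-API-Key: $API_KEY" \
    "http://127.0.0.1:8384/rest/folder/errors?folder=$folder" | jq '.errors | length')
  [ "$count" != "0" ] && echo "$folder: $count sync errors"
done
```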

**See**: [SYNCTHING.md](SYNCTHING.md)

---

## Monthly Maintenance

### System Updates

#### Proxmox Updates

**Check for updates**:
```bash
ssh pve 'apt update && apt list --upgradable'
ssh pve2 'apt update && apt list --upgradable'
```

**Apply updates**:
```bash
# PVE
ssh pve 'apt update && apt dist-upgrade -y'

# PVE2
ssh pve2 'apt update && apt dist-upgrade -y'

# Reboot if kernel updated
ssh pve 'reboot'
ssh pve2 'reboot'
```

**⚠️ Important**:
- Check [Proxmox release notes](https://pve.proxmox.com/wiki/Roadmap) before major updates
- Test on PVE2 first if possible
- Ensure all VMs are backed up before updating
- Monitor VMs after reboot - some may need manual restart

#### Container Updates (LXC)

```bash
# Update all containers
ssh pve 'for ctid in 200 202 205; do pct exec $ctid -- bash -c "apt update && apt upgrade -y"; done'
```

#### VM Updates

**Update VMs individually via SSH**:
```bash
# Ubuntu/Debian VMs
ssh truenas 'apt update && apt upgrade -y'
ssh docker-host 'apt update && apt upgrade -y'
ssh fs-dev 'apt update && apt upgrade -y'

# Check if reboot required
ssh truenas '[ -f /var/run/reboot-required ] && echo "Reboot required"'
```
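
To run the reboot check across every Debian/Ubuntu VM in one pass (host list is illustrative, matching the hosts above):

```bash
# Sketch: flag any VM with a pending reboot after updates
for host in truenas docker-host fs-dev; do
  if ssh "$host" test -f /var/run/reboot-required; then
    echo "$host: reboot required"
  fi
done
```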

### ZFS Scrubs

**Schedule**: Run monthly on all pools

**PVE**:
```bash
# Start scrub on all pools
ssh pve 'zpool scrub nvme-mirror1'
ssh pve 'zpool scrub nvme-mirror2'
ssh pve 'zpool scrub rpool'

# Check scrub status
ssh pve 'zpool status | grep -A2 scrub'
```

**PVE2**:
```bash
ssh pve2 'zpool scrub nvme-mirror3'
ssh pve2 'zpool scrub local-zfs2'
ssh pve2 'zpool status | grep -A2 scrub'
```

**TrueNAS**:
```bash
# Scrub via TrueNAS web UI or SSH
ssh truenas 'zpool scrub vault'
ssh truenas 'zpool status vault | grep -A2 scrub'
```

**Automate scrubs**:
```bash
# Add to crontab (run on 1st of month at 2 AM)
0 2 1 * * /sbin/zpool scrub nvme-mirror1
0 2 1 * * /sbin/zpool scrub nvme-mirror2
0 2 1 * * /sbin/zpool scrub rpool
```
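
Scrubs launched from cron run silently; a companion entry (a sketch, relies on cron's `MAILTO` being set) produces output, and therefore mail, only when a pool reports a problem:

```bash
# Sketch: weekly pool health summary - output (and mail) only on problems
0 8 * * 1 /sbin/zpool status -x | grep -qx 'all pools are healthy' || /sbin/zpool status -v
```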

**See**: [STORAGE.md](STORAGE.md) for pool details

### SMART Tests

**Run extended SMART tests monthly**:

```bash
# TrueNAS drives (via QEMU agent)
ssh pve 'qm guest exec 100 -- bash -c "smartctl --scan | while read dev type; do smartctl -t long \$dev; done"'

# Check results after 4-8 hours
ssh pve 'qm guest exec 100 -- bash -c "smartctl --scan | while read dev type; do echo \"=== \$dev ===\"; smartctl -a \$dev | grep -E \"Model|Serial|test result|Reallocated|Current_Pending\"; done"'

# PVE drives
ssh pve 'for dev in /dev/nvme0 /dev/nvme1 /dev/sda /dev/sdb; do [ -e "$dev" ] && smartctl -t long $dev; done'

# PVE2 drives
ssh pve2 'for dev in /dev/nvme0 /dev/nvme1 /dev/sda /dev/sdb; do [ -e "$dev" ] && smartctl -t long $dev; done'
```

**Automate SMART tests**:
```bash
# Add to crontab (run on 15th of month at 3 AM)
0 3 15 * * /usr/sbin/smartctl -t long /dev/nvme0
0 3 15 * * /usr/sbin/smartctl -t long /dev/sda
```

### Certificate Renewal Verification

**Check SSL certificate expiry**:
```bash
# Check Traefik certificates
ssh pve 'pct exec 202 -- cat /etc/traefik/acme.json | jq ".letsencrypt.Certificates[] | {domain: .domain.main, expires: .Dates.NotAfter}"'

# Check specific service
echo | openssl s_client -servername git.htsn.io -connect git.htsn.io:443 2>/dev/null | openssl x509 -noout -dates
```
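
For an automated warning instead of reading dates by eye, `openssl x509 -checkend` can flag anything expiring soon (hostnames below are examples):

```bash
# Sketch: warn if a certificate expires within 21 days (1814400 seconds)
for host in git.htsn.io plex.htsn.io; do
  if ! echo | openssl s_client -servername "$host" -connect "$host:443" 2>/dev/null \
      | openssl x509 -noout -checkend 1814400 >/dev/null; then
    echo "WARNING: $host certificate expires within 21 days"
  fi
done
```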

**Certificates should auto-renew 30 days before expiry via Traefik**

**See**: [TRAEFIK.md](TRAEFIK.md) for certificate management

### Backup Verification

**⚠️ TODO**: No backup strategy currently in place

**See**: [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md) for implementation plan

---

## Quarterly Maintenance

### Full System Audit

**Check all systems comprehensively**:

1. **ZFS Pool Health**:
   ```bash
   ssh pve 'zpool status -v'
   ssh pve2 'zpool status -v'
   ssh truenas 'zpool status -v vault'
   ```
   Look for: errors, degraded vdevs, resilver operations

2. **SMART Health**:
   ```bash
   # Run SMART health check script
   ~/bin/smart-health-check.sh
   ```
   Look for: reallocated sectors, pending sectors, failures

3. **Disk Space Trends**:
   ```bash
   # Check growth rate
   ssh pve 'zpool list -o name,size,allocated,free,fragmentation'
   ssh truenas 'df -h /mnt/vault'
   ```
   Plan for expansion if >80% full

4. **VM Resource Usage**:
   ```bash
   # Check if VMs need more/less resources
   ssh pve 'qm list'
   ssh pve 'pvesh get /nodes/pve/status'
   ```

5. **Network Performance**:
   ```bash
   # Test bandwidth between critical nodes
   iperf3 -s                # On one host
   iperf3 -c 10.10.10.120   # From another
   ```

6. **Temperature Monitoring**:
   ```bash
   # Check max temps over past quarter
   # TODO: Set up Prometheus/Grafana for historical data
   ssh pve 'sensors'
   ssh pve2 'sensors'
   ```
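
Item 3 above calls for expansion planning at >80% usage; a minimal threshold check to run on each host (adjust the 80 as needed):

```bash
# Sketch: flag any local pool above 80% capacity
zpool list -H -o name,capacity | awk '{gsub(/%/, "", $2); if ($2 + 0 > 80) print $1 " is " $2 "% full"}'
```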

### Service Dependency Testing

**Test critical paths**:

1. **Power failure recovery** (if safe to test):
   - See [UPS.md](UPS.md) for full procedure
   - Verify VM startup order works
   - Confirm all services come back online

2. **Failover testing**:
   - Tailscale subnet routing (PVE → UCG-Fiber)
   - NUT monitoring (PVE server → PVE2 client)

3. **Backup restoration** (when backups implemented):
   - Test restoring a VM from backup
   - Test restoring files from Syncthing versioning
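
After any of the recovery tests above, a quick reachability pass helps confirm the critical paths are back (URLs are examples; see [SERVICES.md](SERVICES.md) for the full list):

```bash
# Sketch: spot-check key endpoints after a recovery test
for url in https://plex.htsn.io https://git.htsn.io http://10.10.10.10/admin/; do
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$url")
  echo "$url -> HTTP $code"
done
```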

### Documentation Review

- [ ] Update IP assignments in [IP-ASSIGNMENTS.md](IP-ASSIGNMENTS.md)
- [ ] Review and update service URLs in [SERVICES.md](SERVICES.md)
- [ ] Check for missing hardware specs in [HARDWARE.md](HARDWARE.md)
- [ ] Update any changed procedures in this document

---

## Annual Maintenance

### Hardware Maintenance

**Physical cleaning**:
```bash
# Shut down servers (coordinate with users)
ssh pve 'shutdown -h now'
ssh pve2 'shutdown -h now'

# Clean dust from:
# - CPU heatsinks
# - GPU fans
# - Case fans
# - PSU vents
# - Storage enclosure fans

# Check for:
# - Bulging capacitors on PSU/motherboard
# - Loose cables
# - Fan noise/vibration
```

**Thermal paste inspection** (every 2-3 years):
- Check CPU temps vs baseline
- If temps >85°C under load, consider reapplying paste
- Threadripper PRO: Tctl max safe = 90°C

**See**: [HARDWARE.md](HARDWARE.md) for component details

### UPS Battery Test

**Runtime test**:
```bash
# Check battery health
ssh pve 'upsc cyberpower@localhost | grep battery'

# Perform runtime test (coordinate power loss)
# 1. Note current runtime estimate
# 2. Unplug UPS from wall
# 3. Let battery drain to 20%
# 4. Note actual runtime vs estimate
# 5. Plug back in before shutdown triggers

# Battery replacement if:
# - Runtime < 10 min at typical load
# - Battery age > 3-5 years
# - Battery charge < 100% when on AC for 24h
```
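
A small sketch that turns the replacement criteria above into a check, using the same NUT variables exposed by `upsc`:

```bash
# Sketch: warn when estimated runtime drops below 10 minutes
runtime=$(ssh pve 'upsc cyberpower@localhost battery.runtime' 2>/dev/null)
charge=$(ssh pve 'upsc cyberpower@localhost battery.charge' 2>/dev/null)
runtime=${runtime%%.*}   # strip any decimal part
echo "Battery: ${charge:-?}% charge, ${runtime:-?}s estimated runtime"
if [ "${runtime:-0}" -lt 600 ]; then
  echo "WARNING: runtime below 10 minutes - plan a battery replacement"
fi
```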

**See**: [UPS.md](UPS.md) for full UPS details

### Drive Replacement Planning

**Check drive age and health**:
```bash
# Get drive hours and health
ssh truenas 'smartctl --scan | while read dev type; do
  echo "=== $dev ===";
  smartctl -a $dev | grep -E "Model|Serial|Power_On_Hours|Reallocated|Pending";
done'
```

**Replace drives if**:
- Reallocated sectors > 0
- Pending sectors > 0
- SMART pre-fail warnings
- Age > 5 years for HDDs (3-5 years for SSDs/NVMe)
- Hours > 50,000 for consumer drives
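
To relate the age criteria above to actual drive hours, a sketch to run directly on TrueNAS (field positions assume ATA-style SMART output; NVMe drives report hours in a different format):

```bash
# Sketch: approximate drive age from Power_On_Hours (ATA drives)
for dev in /dev/sd?; do
  hours=$(smartctl -A "$dev" | awk '/Power_On_Hours/ {print $10}')
  [ -n "$hours" ] && echo "$dev: $hours hours (~$(( hours / 8760 )) years)"
done
```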

**Budget for replacements**:
- HDDs: WD Red 6TB (~$150/drive)
- NVMe: Samsung/Kingston 2TB (~$150-200/drive)

### Capacity Planning

**Review growth trends**:
```bash
# Storage growth (compare to last year)
ssh pve 'zpool list'
ssh truenas 'df -h /mnt/vault'

# Network bandwidth (if monitoring in place)
# Review Grafana dashboards

# Power consumption
ssh pve 'upsc cyberpower@localhost ups.load'
```

**Plan expansions**:
- Storage: Add drives if >70% full
- RAM: Check if VMs hitting limits
- Network: Upgrade if bandwidth saturation
- UPS: Upgrade if load >80%

### License and Subscription Review

**Proxmox subscription** (if applicable):
- Community (free) or Enterprise subscription?
- Check for updates to pricing/features

**Service subscriptions**:
- Domain registration (htsn.io)
- Cloudflare plan (currently free)
- Let's Encrypt (free, no action needed)

---

## Update Schedules

### Proxmox

| Component | Frequency | Notes |
|-----------|-----------|-------|
| Security patches | Weekly | Via `apt upgrade` |
| Minor updates | Monthly | Test on PVE2 first |
| Major versions | Quarterly | Read release notes, plan downtime |
| Kernel updates | Monthly | Requires reboot |

**Update procedure**:
1. Check [Proxmox release notes](https://pve.proxmox.com/wiki/Roadmap)
2. Backup VM configs: `vzdump --dumpdir /tmp`
3. Update: `apt update && apt dist-upgrade`
4. Reboot if kernel changed: `reboot`
5. Verify VMs auto-started: `qm list`

### Containers (LXC)

| Container | Update Frequency | Package Manager |
|-----------|------------------|-----------------|
| Pi-hole (200) | Weekly | `apt` |
| Traefik (202) | Monthly | `apt` |
| FindShyt (205) | As needed | `apt` |

**Update command**:
```bash
ssh pve 'pct exec CTID -- bash -c "apt update && apt upgrade -y"'
```

### VMs

| VM | Update Frequency | Notes |
|----|------------------|-------|
| TrueNAS | Monthly | Via web UI or `apt` |
| Saltbox | Weekly | Managed by Saltbox updates |
| HomeAssistant | Monthly | Via HA supervisor |
| Docker-host | Weekly | `apt` + Docker images |
| Trading-VM | As needed | Via SSH |
| Gitea-VM | Monthly | Via web UI + `apt` |

**Docker image updates**:
```bash
ssh docker-host 'docker-compose pull && docker-compose up -d'
```
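
After pulling new images, superseded layers accumulate on disk; an optional follow-up that removes only dangling images:

```bash
# Sketch: reclaim space from superseded images after a pull
ssh docker-host 'docker image prune -f'
```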

### Firmware Updates

| Component | Check Frequency | Update Method |
|-----------|-----------------|---------------|
| Motherboard BIOS | Annually | Manual flash (high risk) |
| GPU firmware | Rarely | `nvidia-smi` or manual |
| SSD/NVMe firmware | Quarterly | Vendor tools |
| HBA firmware | Annually | LSI tools |
| UPS firmware | Annually | PowerPanel or manual |

**⚠️ Warning**: BIOS/firmware updates carry risk. Only update if:
- Critical security issue
- Needed for hardware compatibility
- Fixing known bug affecting you

---

## Testing Checklists

### Pre-Update Checklist

Before ANY system update:
- [ ] Check current system state: `uptime`, `qm list`, `zpool status`
- [ ] Verify backups are current (when backup system in place)
- [ ] Check for critical VMs/services that can't have downtime
- [ ] Review update changelog/release notes
- [ ] Test on non-critical system first (PVE2 or test VM)
- [ ] Plan rollback strategy if update fails
- [ ] Notify users if downtime expected

### Post-Update Checklist

After system update:
- [ ] Verify system booted correctly: `uptime`
- [ ] Check all VMs/CTs started: `qm list`, `pct list`
- [ ] Test critical services:
  - [ ] Pi-hole DNS: `nslookup google.com 10.10.10.10`
  - [ ] Traefik routing: `curl -I https://plex.htsn.io`
  - [ ] NFS/SMB shares: Test mount from VM
  - [ ] Syncthing sync: Check all devices connected
- [ ] Review logs for errors: `journalctl -p err -b`
- [ ] Check temperatures: `sensors`
- [ ] Verify UPS monitoring: `upsc cyberpower@localhost`
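
The checklist above can be roughly bundled into one pass for convenience (a sketch; hosts, IPs, and URLs mirror the examples in the checklist):

```bash
# Sketch: quick post-update verification pass
ssh pve 'uptime && qm list && pct list'
nslookup google.com 10.10.10.10
curl -I --max-time 10 https://plex.htsn.io
ssh pve 'journalctl -p err -b --no-pager | tail -20'
ssh pve 'sensors; upsc cyberpower@localhost ups.status'
```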

### Disaster Recovery Test

**Quarterly test** (when backup system in place):
- [ ] Simulate VM failure: Restore from backup
- [ ] Simulate storage failure: Import pool on different system
- [ ] Simulate network failure: Verify Tailscale failover
- [ ] Simulate power failure: Test UPS shutdown procedure (if safe)
- [ ] Document recovery time and issues

---

## Log Rotation

**System logs** are automatically rotated by systemd-journald and logrotate.

**Check log sizes**:
```bash
# Journalctl size
ssh pve 'journalctl --disk-usage'

# Traefik logs
ssh pve 'pct exec 202 -- du -sh /var/log/traefik/'
```

**Configure retention**:
```bash
# Limit journald to 500MB
ssh pve 'echo "SystemMaxUse=500M" >> /etc/systemd/journald.conf'
ssh pve 'systemctl restart systemd-journald'
```
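
If the journals are already larger than the new cap, they can also be trimmed immediately with `journalctl --vacuum-size` (a one-off sketch):

```bash
# Sketch: shrink existing journals to the 500MB cap right away
ssh pve 'journalctl --vacuum-size=500M'
ssh pve2 'journalctl --vacuum-size=500M'
```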

**Traefik log rotation** (already configured):
```bash
# /etc/logrotate.d/traefik on CT 202
/var/log/traefik/*.log {
    daily
    rotate 7
    compress
    delaycompress
    missingok
    notifempty
}
```

---

## Monitoring Integration

**TODO**: Set up automated monitoring for these procedures

**When monitoring is implemented** (see [MONITORING.md](MONITORING.md)):
- ZFS scrub completion/errors
- SMART test failures
- Certificate expiry warnings (<30 days)
- Update availability notifications
- Disk space thresholds (>80%)
- Temperature warnings (>85°C)

---

## Related Documentation

- [MONITORING.md](MONITORING.md) - Automated health checks and alerts
- [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md) - Backup implementation plan
- [UPS.md](UPS.md) - Power failure procedures
- [STORAGE.md](STORAGE.md) - ZFS pool management
- [HARDWARE.md](HARDWARE.md) - Hardware specifications
- [SERVICES.md](SERVICES.md) - Service inventory

---

**Last Updated**: 2025-12-22

**Status**: ⚠️ Manual procedures only - monitoring automation needed