Complete Phase 2 documentation: Add HARDWARE, SERVICES, MONITORING, MAINTENANCE
Phase 2 documentation implementation:

- Created HARDWARE.md: Complete hardware inventory (servers, GPUs, storage, network cards)
- Created SERVICES.md: Service inventory with URLs, credentials, health checks (25+ services)
- Created MONITORING.md: Health monitoring recommendations, alert setup, implementation plan
- Created MAINTENANCE.md: Regular procedures, update schedules, testing checklists
- Updated README.md: Added all Phase 2 documentation links
- Updated CLAUDE.md: Cleaned up to quick reference only (1340→377 lines)

All detailed content now in specialized documentation files with cross-references.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
546
MONITORING.md
Normal file
@@ -0,0 +1,546 @@
# Monitoring and Alerting

Documentation for system monitoring, health checks, and alerting across the homelab.

## Current Monitoring Status

| Component | Monitored? | Method | Alerts | Notes |
|-----------|------------|--------|--------|-------|
| **UPS** | ✅ Yes | NUT + Home Assistant | ❌ No | Battery, load, runtime tracked |
| **Syncthing** | ⚠️ Partial | API (manual checks) | ❌ No | Connection status available |
| **Server temps** | ⚠️ Partial | Manual checks | ❌ No | Via `sensors` command |
| **VM status** | ⚠️ Partial | Proxmox UI | ❌ No | Manual monitoring |
| **ZFS health** | ❌ No | Manual `zpool status` | ❌ No | No automated checks |
| **Disk health (SMART)** | ❌ No | Manual `smartctl` | ❌ No | No automated checks |
| **Network** | ❌ No | - | ❌ No | No uptime monitoring |
| **Services** | ❌ No | - | ❌ No | No health checks |
| **Backups** | ❌ No | - | ❌ No | No verification |

**Overall Status**: ⚠️ **MINIMAL** - Most monitoring is manual, no automated alerts

---

## Existing Monitoring

### UPS Monitoring (NUT)

**Status**: ✅ **Active and working**

**What's monitored**:
- Battery charge percentage
- Runtime remaining (seconds)
- Load percentage
- Input/output voltage
- UPS status (OL/OB/LB)

**Access**:
```bash
# Full UPS status
ssh pve 'upsc cyberpower@localhost'

# Key metrics
ssh pve 'upsc cyberpower@localhost | grep -E "battery.charge:|battery.runtime:|ups.load:|ups.status:"'
```

**Home Assistant Integration**:
- Sensors: `sensor.cyberpower_*`
- Can be used for automation/alerts
- Currently: No alerts configured

**See**: [UPS.md](UPS.md)

---

### Syncthing Monitoring

**Status**: ⚠️ **Partial** - API available, no automated monitoring

**What's available**:
- Device connection status
- Folder sync status
- Sync errors
- Bandwidth usage

**Manual Checks**:
```bash
# Check connections (Mac Mini)
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
  "http://127.0.0.1:8384/rest/system/connections" | \
  python3 -c "import sys,json; d=json.load(sys.stdin)['connections']; \
  [print(f\"{v.get('name',k[:7])}: {'UP' if v['connected'] else 'DOWN'}\") for k,v in d.items()]"

# Check folder status
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
  "http://127.0.0.1:8384/rest/db/status?folder=documents" | jq

# Check errors
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
  "http://127.0.0.1:8384/rest/folder/errors?folder=documents" | jq
```

**Needs**: Automated monitoring script + alerts

**See**: [SYNCTHING.md](SYNCTHING.md)

---

### Temperature Monitoring

**Status**: ⚠️ **Manual only**

**Current Method**:
```bash
# CPU temperature (Threadripper Tctl)
ssh pve 'for f in /sys/class/hwmon/hwmon*/temp*_input; do \
  label=$(cat ${f%_input}_label 2>/dev/null); \
  if [ "$label" = "Tctl" ]; then echo "PVE Tctl: $(($(cat $f)/1000))°C"; fi; done'

ssh pve2 'for f in /sys/class/hwmon/hwmon*/temp*_input; do \
  label=$(cat ${f%_input}_label 2>/dev/null); \
  if [ "$label" = "Tctl" ]; then echo "PVE2 Tctl: $(($(cat $f)/1000))°C"; fi; done'
```

**Thresholds**:
- Healthy: 70-80°C under load
- Warning: >85°C
- Critical: >90°C (throttling)

**Needs**: Automated monitoring + alert if >85°C
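
As a starting point for that automation, the thresholds above could be turned into a small classifier. This is only a sketch: the function name `classify_temp` is hypothetical, it expects whole degrees Celsius, and the `hwmon2` path in the usage comment is an assumption that will differ per host.

```bash
#!/bin/bash
# classify_temp: map a CPU temperature (whole degrees C) to a severity label.
# Thresholds follow the list above: >85 is a warning, >90 is critical.
classify_temp() {
  if [ "$1" -gt 90 ]; then
    echo "CRITICAL"
  elif [ "$1" -gt 85 ]; then
    echo "WARNING"
  else
    echo "OK"
  fi
}

# Hypothetical usage (the hwmon index is an assumption, check it on each host):
# raw=$(ssh pve 'cat /sys/class/hwmon/hwmon2/temp1_input')
# classify_temp "$((raw / 1000))"
```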

---

### Proxmox VM Monitoring

**Status**: ⚠️ **Manual only**

**Current Access**:
- Proxmox Web UI: Node → Summary
- CLI: `ssh pve 'qm list'`

**Metrics Available** (via Proxmox):
- CPU usage per VM
- RAM usage per VM
- Disk I/O
- Network I/O
- VM uptime

**Needs**: API-based monitoring + alerts for VM down
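
Until API-based monitoring exists, the `qm list` output can be filtered for VMs that are not running. The sketch below assumes the usual column order of `qm list` (VMID, NAME, STATUS, ...). The function name `list_stopped_vms` is hypothetical. It reports every stopped VM, including ones that are stopped on purpose, so a real check would compare against a list of critical VMs.

```bash
#!/bin/bash
# list_stopped_vms: read `qm list` output on stdin and print "VMID NAME"
# for every VM whose STATUS column is anything other than "running".
list_stopped_vms() {
  # NR > 1 skips the header row; $3 is the STATUS column.
  awk 'NR > 1 && $3 != "running" { print $1, $2 }'
}

# Hypothetical usage against both nodes:
# for host in pve pve2; do
#   ssh "$host" 'qm list' | list_stopped_vms
# done
```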

---

## Recommended Monitoring Stack

### Option 1: Prometheus + Grafana (Recommended)

**Why**:
- Industry standard
- Extensive integrations
- Beautiful dashboards
- Flexible alerting

**Architecture**:
```
Grafana (dashboard) → Prometheus (metrics DB) → Exporters (data collection)
                              ↓
                      Alertmanager (alerts)
```

**Required Exporters**:

| Exporter | Monitors | Install On |
|----------|----------|------------|
| node_exporter | CPU, RAM, disk, network | PVE, PVE2, TrueNAS, all VMs |
| zfs_exporter | ZFS pool health | PVE, PVE2, TrueNAS |
| smartmon_exporter | Drive SMART data | PVE, PVE2, TrueNAS |
| nut_exporter | UPS metrics | PVE |
| proxmox_exporter | VM/CT stats | PVE, PVE2 |
| cadvisor | Docker containers | Saltbox, docker-host |

**Deployment**:
```bash
# Create monitoring VM
ssh pve 'qm create 210 --name monitoring --memory 4096 --cores 2 \
  --net0 virtio,bridge=vmbr0'

# Install Prometheus + Grafana (via Docker)
# /opt/monitoring/docker-compose.yml
```

**Estimated Setup Time**: 4-6 hours

---

### Option 2: Uptime Kuma (Simpler Alternative)

**Why**:
- Lightweight
- Easy to set up
- Web-based dashboard
- Built-in alerts (email, Slack, etc.)

**What it monitors**:
- HTTP/HTTPS endpoints
- Ping (ICMP)
- Ports (TCP)
- Docker containers

**Deployment**:
```bash
# Write the compose file on docker-host (the heredoc is piped over SSH,
# so the file is created remotely rather than on the local machine)
ssh docker-host 'mkdir -p /opt/uptime-kuma && cat > /opt/uptime-kuma/docker-compose.yml' << 'EOF'
version: "3.8"
services:
  uptime-kuma:
    image: louislam/uptime-kuma:latest
    ports:
      - "3001:3001"
    volumes:
      - ./data:/app/data
    restart: unless-stopped
EOF

# Start the container (use `docker-compose` on hosts without the compose plugin)
ssh docker-host 'cd /opt/uptime-kuma && docker compose up -d'

# Access: http://10.10.10.206:3001
# Add Traefik config for uptime.htsn.io
```

**Estimated Setup Time**: 1-2 hours

---

### Option 3: Netdata (Real-time Monitoring)

**Why**:
- Real-time metrics (1-second granularity)
- Auto-discovers services
- Low overhead
- Beautiful web UI

**Deployment**:
```bash
# Install on each server
ssh pve 'bash <(curl -Ss https://my-netdata.io/kickstart.sh)'
ssh pve2 'bash <(curl -Ss https://my-netdata.io/kickstart.sh)'

# Access:
# http://10.10.10.120:19999 (PVE)
# http://10.10.10.102:19999 (PVE2)
```

**Parent-Child Setup** (optional):
- Configure PVE as parent
- Stream metrics from PVE2 → PVE
- Single dashboard for both servers

**Estimated Setup Time**: 1 hour

---

## Critical Metrics to Monitor

### Server Health

| Metric | Threshold | Action |
|--------|-----------|--------|
| **CPU usage** | >90% for 5 min | Alert |
| **CPU temp** | >85°C | Alert |
| **CPU temp** | >90°C | Critical alert |
| **RAM usage** | >95% | Alert |
| **Disk space** | >80% | Warning |
| **Disk space** | >90% | Alert |
| **Load average** | >CPU count | Alert |
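
The disk space rows above translate directly into a filter over `df -P` output. The following is a minimal sketch, not a finished check: `classify_disk_usage` is a hypothetical name, and it assumes mount points contain no spaces, since it reads the sixth whitespace-separated column.

```bash
#!/bin/bash
# classify_disk_usage: read `df -P` output on stdin and print one line for each
# filesystem above the warning level (>80% warning, >90% alert).
classify_disk_usage() {
  awk 'NR > 1 {
    pct = $5; sub(/%/, "", pct); pct += 0   # "85%" -> 85
    if (pct > 90)      print "ALERT", $6, pct "%"
    else if (pct > 80) print "WARNING", $6, pct "%"
  }'
}

# Hypothetical usage:
# ssh pve 'df -P' | classify_disk_usage
```

Filesystems at or below 80% produce no output, so an empty result means nothing needs attention.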

### Storage Health

| Metric | Threshold | Action |
|--------|-----------|--------|
| **ZFS pool errors** | >0 | Alert immediately |
| **ZFS pool degraded** | Any degraded vdev | Critical alert |
| **ZFS scrub failed** | Last scrub error | Alert |
| **SMART reallocated sectors** | >0 | Warning |
| **SMART pending sectors** | >0 | Alert |
| **SMART failure** | Pre-fail | Critical - replace drive |
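
For the ZFS rows, `zpool status -x` already condenses pool state into a single line when everything is fine, so a check can simply test for that line. This sketch assumes the healthy messages read "all pools are healthy" or "pool '<name>' is healthy". The function name `zfs_pools_healthy` is hypothetical, and SMART attributes are not covered here.

```bash
#!/bin/bash
# zfs_pools_healthy: read `zpool status -x` output on stdin and succeed (exit 0)
# only when it reports that all pools, or the one named pool, are healthy.
zfs_pools_healthy() {
  grep -Eq '^(all pools are healthy|pool .* is healthy)$'
}

# Hypothetical usage:
# if ! ssh truenas 'zpool status -x vault' | zfs_pools_healthy; then
#   echo "ALERT: ZFS pool problem on truenas"
# fi
```

Any other output, such as a `state: DEGRADED` report, makes the function return non-zero.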

### UPS

| Metric | Threshold | Action |
|--------|-----------|--------|
| **Battery charge** | <20% | Warning |
| **Battery charge** | <10% | Alert |
| **On battery** | >5 min | Alert |
| **Runtime** | <5 min | Critical |
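
The charge and runtime rows can be evaluated from the `upsc` output that NUT already provides. A rough sketch follows, with a hypothetical function name `classify_ups`. It treats `battery.runtime` as seconds, so 5 minutes is 300. It does not cover the "on battery for more than 5 min" row, which needs state tracked across runs. Missing values count as 0, so they are reported as critical.

```bash
#!/bin/bash
# classify_ups: read `upsc` output on stdin and print the most severe level
# that applies under the thresholds in the table above.
classify_ups() {
  awk -F': ' '
    $1 == "battery.charge"  { charge = $2 + 0 }
    $1 == "battery.runtime" { runtime = $2 + 0 }   # seconds
    END {
      if (runtime < 300)    print "CRITICAL"       # under 5 minutes left
      else if (charge < 10) print "ALERT"
      else if (charge < 20) print "WARNING"
      else                  print "OK"
    }'
}

# Hypothetical usage:
# ssh pve 'upsc cyberpower@localhost' | classify_ups
```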

### Network

| Metric | Threshold | Action |
|--------|-----------|--------|
| **Device unreachable** | >2 min down | Alert |
| **High packet loss** | >5% | Warning |
| **Bandwidth saturation** | >90% | Warning |
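
The packet loss row could be checked by parsing the summary line that `ping` prints. The sketch below assumes a summary containing "N% packet loss", as printed by common Linux and macOS `ping` implementations. Both function names are hypothetical, and the reachability and bandwidth rows are not covered.

```bash
#!/bin/bash
# packet_loss_pct: read ping output on stdin and print the loss percentage
# from the summary line, or nothing if no summary line is found.
packet_loss_pct() {
  sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}

# classify_packet_loss: apply the >5% warning threshold from the table above.
classify_packet_loss() {
  loss=$(packet_loss_pct)
  # awk is used so fractional percentages such as 0.0 compare correctly.
  awk -v loss="$loss" 'BEGIN {
    if (loss == "")        print "UNKNOWN"
    else if (loss + 0 > 5) print "WARNING"
    else                   print "OK"
  }'
}

# Hypothetical usage:
# ping -c 5 10.10.10.120 | classify_packet_loss
```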

### VMs/Services

| Metric | Threshold | Action |
|--------|-----------|--------|
| **VM stopped** | Critical VM down | Alert immediately |
| **Service unreachable** | HTTP 5xx or timeout | Alert |
| **Backup failed** | Any backup failure | Alert |
| **Certificate expiry** | <30 days | Warning |
| **Certificate expiry** | <7 days | Alert |
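
The two certificate expiry rows reduce to a days-remaining comparison. The sketch below keeps that logic in two small functions with hypothetical names, `classify_cert_days` and `days_between`. The commented usage assumes `openssl` and GNU `date` are available and uses `uptime.htsn.io` only as an example host.

```bash
#!/bin/bash
# classify_cert_days: map days until expiry to a level (<7 alert, <30 warning).
classify_cert_days() {
  if [ "$1" -lt 7 ]; then
    echo "ALERT"
  elif [ "$1" -lt 30 ]; then
    echo "WARNING"
  else
    echo "OK"
  fi
}

# days_between: whole days from epoch second $1 to epoch second $2.
days_between() {
  echo $(( ($2 - $1) / 86400 ))
}

# Hypothetical usage (requires openssl and GNU date):
# end=$(echo | openssl s_client -connect uptime.htsn.io:443 \
#         -servername uptime.htsn.io 2>/dev/null \
#         | openssl x509 -noout -enddate | cut -d= -f2)
# classify_cert_days "$(days_between "$(date +%s)" "$(date -d "$end" +%s)")"
```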

---

## Alert Destinations

### Email Alerts

**Recommended**: Set up SMTP relay for email alerts

**Options**:
1. Gmail SMTP (free, rate-limited)
2. SendGrid (free tier: 100 emails/day)
3. Mailgun (free tier available)
4. Self-hosted mail server (complex)

**Configuration Example** (Prometheus Alertmanager):
```yaml
# /etc/alertmanager/alertmanager.yml
receivers:
  - name: 'email'
    email_configs:
      - to: 'hutson@example.com'
        from: 'alerts@htsn.io'
        smarthost: 'smtp.gmail.com:587'
        auth_username: 'alerts@htsn.io'
        auth_password: 'app-password-here'
```

---

### Push Notifications

**Options**:
- **Pushover**: $5 one-time, reliable
- **Pushbullet**: Free tier available
- **Telegram Bot**: Free
- **Discord Webhook**: Free
- **Slack**: Free tier available

**Recommended**: Pushover or Telegram for mobile alerts

---

### Home Assistant Alerts

Since Home Assistant is already running, use it for alerts:

**Automation Example**:
```yaml
automation:
  - alias: "UPS Low Battery Alert"
    trigger:
      - platform: numeric_state
        entity_id: sensor.cyberpower_battery_charge
        below: 20
    action:
      # Replace with the actual notify.mobile_app_<device_name> service
      - service: notify.mobile_app
        data:
          message: "⚠️ UPS battery at {{ states('sensor.cyberpower_battery_charge') }}%"

  - alias: "Server High Temperature"
    trigger:
      - platform: template
        # Read the state via states() and convert it to a number before comparing
        value_template: "{{ states('sensor.pve_cpu_temp') | float(0) > 85 }}"
    action:
      - service: notify.mobile_app
        data:
          message: "🔥 PVE CPU temperature: {{ states('sensor.pve_cpu_temp') }}°C"
```

**Needs**: Sensors for CPU temp, disk space, etc. in Home Assistant

---

## Monitoring Scripts

### Daily Health Check

Save as `~/bin/homelab-health-check.sh`:

```bash
#!/bin/bash
# Daily homelab health check

echo "=== Homelab Health Check ==="
echo "Date: $(date)"
echo ""

echo "=== Server Status ==="
ssh pve 'uptime' 2>/dev/null || echo "PVE: UNREACHABLE"
ssh pve2 'uptime' 2>/dev/null || echo "PVE2: UNREACHABLE"
echo ""

echo "=== CPU Temperatures ==="
ssh pve 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE: $(($(cat $f)/1000))°C"; fi; done'
ssh pve2 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE2: $(($(cat $f)/1000))°C"; fi; done'
echo ""

echo "=== UPS Status ==="
ssh pve 'upsc cyberpower@localhost | grep -E "battery.charge:|battery.runtime:|ups.load:|ups.status:"'
echo ""

echo "=== ZFS Pools ==="
ssh pve 'zpool status -x' 2>/dev/null
ssh pve2 'zpool status -x' 2>/dev/null
ssh truenas 'zpool status -x vault'
echo ""

echo "=== Disk Space ==="
ssh pve 'df -h | grep -E "Filesystem|/dev/(nvme|sd)"'
ssh truenas 'df -h /mnt/vault'
echo ""

echo "=== VM Status ==="
ssh pve 'qm list | grep running | wc -l' | xargs echo "PVE VMs running:"
ssh pve2 'qm list | grep running | wc -l' | xargs echo "PVE2 VMs running:"
echo ""

echo "=== Syncthing Connections ==="
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
  "http://127.0.0.1:8384/rest/system/connections" | \
  python3 -c "import sys,json; d=json.load(sys.stdin)['connections']; \
  [print(f\"{v.get('name',k[:7])}: {'UP' if v['connected'] else 'DOWN'}\") for k,v in d.items()]"
echo ""

echo "=== Check Complete ==="
```

**Run daily**:
```cron
0 9 * * * ~/bin/homelab-health-check.sh | mail -s "Homelab Health Check" hutson@example.com
```

---

### ZFS Scrub Checker

```bash
#!/bin/bash
# Check last ZFS scrub status

echo "=== ZFS Scrub Status ==="

for host in pve pve2; do
  echo "--- $host ---"
  ssh $host 'zpool status | grep -A1 scrub'
  echo ""
done

echo "--- TrueNAS ---"
ssh truenas 'zpool status vault | grep -A1 scrub'
```

---

### SMART Health Checker

```bash
#!/bin/bash
# Check SMART health on all drives

echo "=== SMART Health Check ==="

echo "--- TrueNAS Drives ---"
ssh truenas 'smartctl --scan | while read dev type; do
  echo "=== $dev ===";
  smartctl -H "$dev" | grep -E "SMART overall|PASSED|FAILED";
done'

echo "--- PVE Drives ---"
# Use smartctl --scan here as well, so only whole devices are checked
# (a /dev/sd* glob would also match partitions such as /dev/sda1)
ssh pve 'smartctl --scan | while read dev type; do
  echo "=== $dev ===";
  smartctl -H "$dev" | grep -E "SMART|PASSED|FAILED";
done'
```

---

## Dashboard Recommendations

### Grafana Dashboard Layout

**Page 1: Overview**
- Server uptime
- CPU usage (all servers)
- RAM usage (all servers)
- Disk space (all pools)
- Network traffic
- UPS status

**Page 2: Storage**
- ZFS pool health
- SMART status for all drives
- I/O latency
- Scrub progress
- Disk temperatures

**Page 3: VMs**
- VM status (up/down)
- VM resource usage
- VM disk I/O
- VM network traffic

**Page 4: Services**
- Service health checks
- HTTP response times
- Certificate expiry dates
- Syncthing sync status

---

## Implementation Plan

### Phase 1: Basic Monitoring (Week 1)

- [ ] Install Uptime Kuma or Netdata
- [ ] Add HTTP checks for all services
- [ ] Configure UPS alerts in Home Assistant
- [ ] Set up daily health check email

**Estimated Time**: 4-6 hours

---

### Phase 2: Advanced Monitoring (Week 2-3)

- [ ] Install Prometheus + Grafana
- [ ] Deploy node_exporter on all servers
- [ ] Deploy zfs_exporter
- [ ] Deploy smartmon_exporter
- [ ] Create Grafana dashboards

**Estimated Time**: 8-12 hours

---

### Phase 3: Alerting (Week 4)

- [ ] Configure Alertmanager
- [ ] Set up email/push notifications
- [ ] Create alert rules for all critical metrics
- [ ] Test all alert paths
- [ ] Document alert procedures

**Estimated Time**: 4-6 hours

---

## Related Documentation

- [UPS.md](UPS.md) - UPS monitoring details
- [STORAGE.md](STORAGE.md) - ZFS health checks
- [SERVICES.md](SERVICES.md) - Service inventory
- [HOMEASSISTANT.md](HOMEASSISTANT.md) - Home Assistant automations
- [MAINTENANCE.md](MAINTENANCE.md) - Regular maintenance checks

---

**Last Updated**: 2025-12-22
**Status**: ⚠️ **Minimal monitoring currently in place - implementation needed**