# Monitoring and Alerting

Documentation for system monitoring, health checks, and alerting across the homelab.

## Current Monitoring Status

| Component | Monitored? | Method | Alerts | Notes |
|-----------|------------|--------|--------|-------|
| **Gateway** | ✅ Yes | Custom services | ✅ Auto-reboot | Internet watchdog + memory monitor |
| **UPS** | ✅ Yes | NUT + Home Assistant | ❌ No | Battery, load, runtime tracked |
| **Syncthing** | ⚠️ Partial | API (manual checks) | ❌ No | Connection status available |
| **Server temps** | ⚠️ Partial | Manual checks | ❌ No | Via `sensors` command |
| **VM status** | ⚠️ Partial | Proxmox UI | ❌ No | Manual monitoring |
| **ZFS health** | ❌ No | Manual `zpool status` | ❌ No | No automated checks |
| **Disk health (SMART)** | ❌ No | Manual `smartctl` | ❌ No | No automated checks |
| **Network** | ⚠️ Partial | Gateway watchdog | ✅ Auto-reboot | Connectivity check every 60s |
| **Services** | ❌ No | - | ❌ No | No health checks |
| **Backups** | ❌ No | - | ❌ No | No verification |

**Overall Status**: ⚠️ **PARTIAL** - Gateway monitoring is active; most other systems are checked manually

---

## Existing Monitoring

### UPS Monitoring (NUT)

**Status**: ✅ **Active and working**

**What's monitored**:
- Battery charge percentage
- Runtime remaining (seconds)
- Load percentage
- Input/output voltage
- UPS status (OL/OB/LB)

**Access**:
```bash
# Full UPS status
ssh pve 'upsc cyberpower@localhost'

# Key metrics
ssh pve 'upsc cyberpower@localhost | grep -E "battery.charge:|battery.runtime:|ups.load:|ups.status:"'
```

**Home Assistant Integration**:
- Sensors: `sensor.cyberpower_*`
- Can be used for automation/alerts
- Currently: No alerts configured

**See**: [UPS.md](UPS.md)

---

### Gateway Monitoring

**Status**: ✅ **Active with auto-recovery**

Two custom systemd services monitor the UCG-Fiber gateway (10.10.10.1):

**1. Internet Watchdog** (`internet-watchdog.service`)
- Pings external DNS (1.1.1.1, 8.8.8.8, 208.67.222.222) every 60 seconds
- Auto-reboots the gateway after 5 consecutive failures (~5 minutes)
- Logs to `/var/log/internet-watchdog.log`

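
The watchdog's core loop can be sketched roughly like this (an illustrative sketch of the assumed logic, not the deployed service script):

```shell
#!/bin/sh
# Sketch of the internet-watchdog loop (assumed logic; the deployed
# script may differ in detail).
FAILS=0

# Succeed if any one of the resolvers answers a single ping.
check_internet() {
  for host in 1.1.1.1 8.8.8.8 208.67.222.222; do
    ping -c 1 -W 2 "$host" >/dev/null 2>&1 && return 0
  done
  return 1
}

# Guarded so the file can be sourced without starting the loop.
if [ "${1:-}" = "--run" ]; then
  while true; do
    if check_internet; then
      FAILS=0
    else
      FAILS=$((FAILS + 1))
      echo "$(date) connectivity check failed ($FAILS/5)" >> /var/log/internet-watchdog.log
      if [ "$FAILS" -ge 5 ]; then
        echo "$(date) rebooting gateway" >> /var/log/internet-watchdog.log
        reboot
      fi
    fi
    sleep 60
  done
fi
```
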
**2. Memory Monitor** (`memory-monitor.service`)
- Logs memory usage and top processes every 10 minutes
- Logs to `/data/logs/memory-history.log`
- Auto-rotates when the log exceeds 10MB

**Quick Commands**:
```bash
# Check service status
ssh ucg-fiber 'systemctl status internet-watchdog memory-monitor'

# View watchdog activity
ssh ucg-fiber 'tail -20 /var/log/internet-watchdog.log'

# View memory history
ssh ucg-fiber 'tail -100 /data/logs/memory-history.log'

# Current memory usage
ssh ucg-fiber 'free -m && ps -eo pid,rss,comm --sort=-rss | head -12'
```

**See**: [GATEWAY.md](GATEWAY.md)

---

### Syncthing Monitoring

**Status**: ⚠️ **Partial** - API available, no automated monitoring

**What's available**:
- Device connection status
- Folder sync status
- Sync errors
- Bandwidth usage

**Manual Checks**:
```bash
# Check connections (Mac Mini)
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
  "http://127.0.0.1:8384/rest/system/connections" | \
  python3 -c "import sys,json; d=json.load(sys.stdin)['connections']; \
[print(f\"{v.get('name',k[:7])}: {'UP' if v['connected'] else 'DOWN'}\") for k,v in d.items()]"

# Check folder status
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
  "http://127.0.0.1:8384/rest/db/status?folder=documents" | jq

# Check errors
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
  "http://127.0.0.1:8384/rest/folder/errors?folder=documents" | jq
```

**Needs**: Automated monitoring script + alerts

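
A small cron script could close that gap by polling the connections endpoint and exiting non-zero when any device is down (a sketch reusing the API key from the manual checks above; the alerting hook is left to cron or a notify command):

```shell
#!/bin/sh
# Sketch: exit non-zero when any Syncthing device is disconnected.
API_KEY="oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5"

# Read connections JSON on stdin, print names of disconnected devices.
list_down() {
  python3 -c "
import sys, json
conns = json.load(sys.stdin)['connections']
for dev_id, v in conns.items():
    if not v['connected']:
        print(v.get('name', dev_id[:7]))"
}

check_syncthing() {
  down=$(curl -s -H "X-API-Key: $API_KEY" \
    "http://127.0.0.1:8384/rest/system/connections" | list_down)
  if [ -n "$down" ]; then
    echo "Syncthing devices down: $down"
    return 1
  fi
  return 0
}

# Wire the exit status to any notifier, e.g.:
#   */15 * * * * ~/bin/syncthing-check.sh || <notify command>
check_syncthing
```
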
**See**: [SYNCTHING.md](SYNCTHING.md)

---

### Temperature Monitoring

**Status**: ⚠️ **Manual only**

**Current Method**:
```bash
# CPU temperature (Threadripper Tctl)
ssh pve 'for f in /sys/class/hwmon/hwmon*/temp*_input; do \
  label=$(cat ${f%_input}_label 2>/dev/null); \
  if [ "$label" = "Tctl" ]; then echo "PVE Tctl: $(($(cat $f)/1000))°C"; fi; done'

ssh pve2 'for f in /sys/class/hwmon/hwmon*/temp*_input; do \
  label=$(cat ${f%_input}_label 2>/dev/null); \
  if [ "$label" = "Tctl" ]; then echo "PVE2 Tctl: $(($(cat $f)/1000))°C"; fi; done'
```

**Thresholds**:
- Healthy: 70-80°C under load
- Warning: >85°C
- Critical: >90°C (throttling)

**Needs**: Automated monitoring + alert if >85°C

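
Pending a full monitoring stack, a cron script can enforce that threshold (a sketch; the mail address is a placeholder, and the Tctl lookup reuses the command above):

```shell
#!/bin/sh
# Sketch: alert when Tctl exceeds 85°C on either Proxmox node.
THRESHOLD=85

# check_host <host>: print an alert line if the host's Tctl is too high.
check_host() {
  temp=$(ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" \
    'for f in /sys/class/hwmon/hwmon*/temp*_input; do
       label=$(cat ${f%_input}_label 2>/dev/null)
       if [ "$label" = "Tctl" ]; then echo $(($(cat $f)/1000)); fi
     done')
  if [ -n "$temp" ] && [ "$temp" -gt "$THRESHOLD" ]; then
    echo "$1: Tctl ${temp}°C exceeds ${THRESHOLD}°C"
  fi
}

alerts=$(for h in pve pve2; do check_host "$h"; done)
if [ -n "$alerts" ]; then
  printf '%s\n' "$alerts" | mail -s "Homelab temperature alert" hutson@example.com
fi
```
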
---

### Proxmox VM Monitoring

**Status**: ⚠️ **Manual only**

**Current Access**:
- Proxmox Web UI: Node → Summary
- CLI: `ssh pve 'qm list'`

**Metrics Available** (via Proxmox):
- CPU usage per VM
- RAM usage per VM
- Disk I/O
- Network I/O
- VM uptime

**Needs**: API-based monitoring + alerts for VM down

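
Until a full stack is in place, a cron sketch using Proxmox's `pvesh` CLI can cover the "VM down" case (the critical VM names below are placeholders to adjust):

```shell
#!/bin/sh
# Sketch: alert when named critical VMs are not running.
CRITICAL_VMS="haos docker-host"   # placeholder names - adjust

# Read `pvesh get /cluster/resources` JSON on stdin, print any critical
# VM whose status is not "running".
stopped_vms() {
  python3 -c "
import sys, json
want = set(sys.argv[1].split())
for vm in json.load(sys.stdin):
    if vm.get('name') in want and vm.get('status') != 'running':
        print(vm['name'])" "$CRITICAL_VMS"
}

check_vms() {
  down=$(ssh -o BatchMode=yes pve \
    'pvesh get /cluster/resources --type vm --output-format json' | stopped_vms)
  if [ -n "$down" ]; then
    echo "Critical VMs down: $down"
    return 1
  fi
  return 0
}

check_vms
```
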
---

## Recommended Monitoring Stack

### Option 1: Prometheus + Grafana (Recommended)

**Why**:
- Industry standard
- Extensive integrations
- Beautiful dashboards
- Flexible alerting

**Architecture**:
```
Grafana (dashboard) → Prometheus (metrics DB) → Exporters (data collection)
                              ↓
                     Alertmanager (alerts)
```

**Required Exporters**:

| Exporter | Monitors | Install On |
|----------|----------|------------|
| node_exporter | CPU, RAM, disk, network | PVE, PVE2, TrueNAS, all VMs |
| zfs_exporter | ZFS pool health | PVE, PVE2, TrueNAS |
| smartmon_exporter | Drive SMART data | PVE, PVE2, TrueNAS |
| nut_exporter | UPS metrics | PVE |
| proxmox_exporter | VM/CT stats | PVE, PVE2 |
| cadvisor | Docker containers | Saltbox, docker-host |

**Deployment**:
```bash
# Create monitoring VM
ssh pve 'qm create 210 --name monitoring --memory 4096 --cores 2 \
  --net0 virtio,bridge=vmbr0'

# Install Prometheus + Grafana (via Docker)
# /opt/monitoring/docker-compose.yml
```
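
The referenced compose file might start out like this (a sketch using upstream default images and ports; volumes and image tags are assumptions to adjust):

```yaml
# /opt/monitoring/docker-compose.yml (sketch)
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prom-data:/prometheus
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana
    restart: unless-stopped

  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    restart: unless-stopped

volumes:
  prom-data:
  grafana-data:
```
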
**Estimated Setup Time**: 4-6 hours

---

### Option 2: Uptime Kuma (Simpler Alternative)

**Why**:
- Lightweight
- Easy to set up
- Web-based dashboard
- Built-in alerts (email, Slack, etc.)

**What it monitors**:
- HTTP/HTTPS endpoints
- Ping (ICMP)
- Ports (TCP)
- Docker containers

**Deployment**:
```bash
ssh docker-host 'mkdir -p /opt/uptime-kuma'
ssh docker-host 'cat > /opt/uptime-kuma/docker-compose.yml' << 'EOF'
version: "3.8"
services:
  uptime-kuma:
    image: louislam/uptime-kuma:latest
    ports:
      - "3001:3001"
    volumes:
      - ./data:/app/data
    restart: unless-stopped
EOF
ssh docker-host 'cd /opt/uptime-kuma && docker compose up -d'

# Access: http://10.10.10.206:3001
# Add Traefik config for uptime.htsn.io
```

**Estimated Setup Time**: 1-2 hours

---

### Option 3: Netdata (Real-time Monitoring)

**Why**:
- Real-time metrics (1-second granularity)
- Auto-discovers services
- Low overhead
- Beautiful web UI

**Deployment**:
```bash
# Install on each server
ssh pve 'bash <(curl -Ss https://my-netdata.io/kickstart.sh)'
ssh pve2 'bash <(curl -Ss https://my-netdata.io/kickstart.sh)'

# Access:
# http://10.10.10.120:19999 (PVE)
# http://10.10.10.102:19999 (PVE2)
```

**Parent-Child Setup** (optional):
- Configure PVE as parent
- Stream metrics from PVE2 → PVE
- Single dashboard for both servers

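
Streaming is configured in `stream.conf` on both nodes; a minimal sketch (the shared API key is any UUID you generate, shown here as a placeholder):

```ini
# On PVE2 (child) - /etc/netdata/stream.conf
[stream]
    enabled = yes
    destination = 10.10.10.120:19999
    api key = REPLACE-WITH-SHARED-UUID

# On PVE (parent) - /etc/netdata/stream.conf
[REPLACE-WITH-SHARED-UUID]
    enabled = yes
```
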
**Estimated Setup Time**: 1 hour

---

## Critical Metrics to Monitor

### Server Health

| Metric | Threshold | Action |
|--------|-----------|--------|
| **CPU usage** | >90% for 5 min | Alert |
| **CPU temp** | >85°C | Alert |
| **CPU temp** | >90°C | Critical alert |
| **RAM usage** | >95% | Alert |
| **Disk space** | >80% | Warning |
| **Disk space** | >90% | Alert |
| **Load average** | >CPU count | Alert |

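
With Prometheus in place, rows like these translate into alert rules along the following lines (a sketch; metric and label names are node_exporter defaults and may need adjusting for the actual hardware):

```yaml
# alert-rules.yml (sketch)
groups:
  - name: server-health
    rules:
      - alert: HighCPUTemperature
        expr: node_hwmon_temp_celsius{chip=~".*k10temp.*"} > 85
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "CPU temperature over 85°C on {{ $labels.instance }}"

      - alert: DiskSpaceCritical
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.10
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Over 90% disk usage on {{ $labels.instance }}"
```
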
### Storage Health

| Metric | Threshold | Action |
|--------|-----------|--------|
| **ZFS pool errors** | >0 | Alert immediately |
| **ZFS pool degraded** | Any degraded vdev | Critical alert |
| **ZFS scrub failed** | Last scrub error | Alert |
| **SMART reallocated sectors** | >0 | Warning |
| **SMART pending sectors** | >0 | Alert |
| **SMART failure** | Pre-fail | Critical - replace drive |

### UPS

| Metric | Threshold | Action |
|--------|-----------|--------|
| **Battery charge** | <20% | Warning |
| **Battery charge** | <10% | Alert |
| **On battery** | >5 min | Alert |
| **Runtime** | <5 min | Critical |

### Network

| Metric | Threshold | Action |
|--------|-----------|--------|
| **Device unreachable** | >2 min down | Alert |
| **High packet loss** | >5% | Warning |
| **Bandwidth saturation** | >90% | Warning |

### VMs/Services

| Metric | Threshold | Action |
|--------|-----------|--------|
| **VM stopped** | Critical VM down | Alert immediately |
| **Service unreachable** | HTTP 5xx or timeout | Alert |
| **Backup failed** | Any backup failure | Alert |
| **Certificate expiry** | <30 days | Warning |
| **Certificate expiry** | <7 days | Alert |

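
The certificate rows can already be checked today with a small openssl script (a sketch; assumes GNU `date`, and the host list should be the actual public endpoints):

```shell
#!/bin/sh
# Sketch: warn when a TLS certificate expires within 30 days.
WARN_DAYS=30

# days_left <host:port>: print days until the served certificate expires.
days_left() {
  end=$(echo | openssl s_client -connect "$1" -servername "${1%%:*}" 2>/dev/null |
    openssl x509 -noout -enddate | cut -d= -f2)
  echo $(( ($(date -d "$end" +%s) - $(date +%s)) / 86400 ))
}

# Example (cron daily); list each public endpoint:
#   for host in uptime.htsn.io:443; do
#     d=$(days_left "$host")
#     [ "$d" -lt "$WARN_DAYS" ] && echo "WARN: $host cert expires in $d days"
#   done
```
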
---

## Alert Destinations

### Email Alerts

**Recommended**: Set up SMTP relay for email alerts

**Options**:
1. Gmail SMTP (free, rate-limited)
2. SendGrid (free tier: 100 emails/day)
3. Mailgun (free tier available)
4. Self-hosted mail server (complex)

**Configuration Example** (Prometheus Alertmanager):
```yaml
# /etc/alertmanager/alertmanager.yml
receivers:
  - name: 'email'
    email_configs:
      - to: 'hutson@example.com'
        from: 'alerts@htsn.io'
        smarthost: 'smtp.gmail.com:587'
        auth_username: 'alerts@htsn.io'
        auth_password: 'app-password-here'
```

---

### Push Notifications

**Options**:
- **Pushover**: $5 one-time, reliable
- **Pushbullet**: Free tier available
- **Telegram Bot**: Free
- **Discord Webhook**: Free
- **Slack**: Free tier available

**Recommended**: Pushover or Telegram for mobile alerts

---

### Home Assistant Alerts

Since Home Assistant is already running, use it for alerts:

**Automation Example**:
```yaml
automation:
  - alias: "UPS Low Battery Alert"
    trigger:
      - platform: numeric_state
        entity_id: sensor.cyberpower_battery_charge
        below: 20
    action:
      - service: notify.mobile_app
        data:
          message: "⚠️ UPS battery at {{ states('sensor.cyberpower_battery_charge') }}%"

  - alias: "Server High Temperature"
    trigger:
      - platform: numeric_state
        entity_id: sensor.pve_cpu_temp
        above: 85
    action:
      - service: notify.mobile_app
        data:
          message: "🔥 PVE CPU temperature: {{ states('sensor.pve_cpu_temp') }}°C"
```

**Needs**: Sensors for CPU temp, disk space, etc. in Home Assistant

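
One way to create those sensors is Home Assistant's `command_line` integration, polling the node over SSH (a sketch; it assumes the Home Assistant user has an SSH key authorized on PVE):

```yaml
# configuration.yaml - sketch of a pve_cpu_temp sensor
command_line:
  - sensor:
      name: pve_cpu_temp
      unit_of_measurement: "°C"
      scan_interval: 60
      command: >
        ssh -o BatchMode=yes pve 'for f in /sys/class/hwmon/hwmon*/temp*_input; do
        label=$(cat ${f%_input}_label 2>/dev/null);
        if [ "$label" = "Tctl" ]; then echo $(($(cat $f)/1000)); fi; done'
```
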
---

## Monitoring Scripts

### Daily Health Check

Save as `~/bin/homelab-health-check.sh`:

```bash
#!/bin/bash
# Daily homelab health check

echo "=== Homelab Health Check ==="
echo "Date: $(date)"
echo ""

echo "=== Server Status ==="
ssh pve 'uptime' 2>/dev/null || echo "PVE: UNREACHABLE"
ssh pve2 'uptime' 2>/dev/null || echo "PVE2: UNREACHABLE"
echo ""

echo "=== CPU Temperatures ==="
ssh pve 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE: $(($(cat $f)/1000))°C"; fi; done'
ssh pve2 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE2: $(($(cat $f)/1000))°C"; fi; done'
echo ""

echo "=== UPS Status ==="
ssh pve 'upsc cyberpower@localhost | grep -E "battery.charge:|battery.runtime:|ups.load:|ups.status:"'
echo ""

echo "=== ZFS Pools ==="
ssh pve 'zpool status -x' 2>/dev/null
ssh pve2 'zpool status -x' 2>/dev/null
ssh truenas 'zpool status -x vault'
echo ""

echo "=== Disk Space ==="
ssh pve 'df -h | grep -E "Filesystem|/dev/(nvme|sd)"'
ssh truenas 'df -h /mnt/vault'
echo ""

echo "=== VM Status ==="
ssh pve 'qm list | grep running | wc -l' | xargs echo "PVE VMs running:"
ssh pve2 'qm list | grep running | wc -l' | xargs echo "PVE2 VMs running:"
echo ""

echo "=== Syncthing Connections ==="
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
  "http://127.0.0.1:8384/rest/system/connections" | \
  python3 -c "import sys,json; d=json.load(sys.stdin)['connections']; \
[print(f\"{v.get('name',k[:7])}: {'UP' if v['connected'] else 'DOWN'}\") for k,v in d.items()]"
echo ""

echo "=== Check Complete ==="
```

**Run daily**:
```cron
0 9 * * * ~/bin/homelab-health-check.sh | mail -s "Homelab Health Check" hutson@example.com
```

---

### ZFS Scrub Checker

```bash
#!/bin/bash
# Check last ZFS scrub status

echo "=== ZFS Scrub Status ==="

for host in pve pve2; do
  echo "--- $host ---"
  ssh $host 'zpool status | grep -A1 scrub'
  echo ""
done

echo "--- TrueNAS ---"
ssh truenas 'zpool status vault | grep -A1 scrub'
```

---

### SMART Health Checker

```bash
#!/bin/bash
# Check SMART health on all drives

echo "=== SMART Health Check ==="

echo "--- TrueNAS Drives ---"
ssh truenas 'smartctl --scan | while read dev type; do
  echo "=== $dev ===";
  smartctl -H $dev | grep -E "SMART overall|PASSED|FAILED";
done'

echo "--- PVE Drives ---"
# Whole-disk nodes only, so partitions are not scanned twice
ssh pve 'for dev in /dev/nvme?n1 /dev/sd[a-z]; do
  [ -e "$dev" ] && echo "=== $dev ===" && smartctl -H $dev | grep -E "SMART|PASSED|FAILED";
done'
```
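
For continuous coverage rather than on-demand checks, `smartd` can watch every drive and mail on failures; a sketch for `/etc/smartd.conf` (schedule and address are placeholders):

```
# /etc/smartd.conf (sketch)
# Scan all drives, monitor health and attributes, run a short self-test
# every Sunday at 3am, and mail on any failure.
DEVICESCAN -a -o on -S on -s (S/../../7/03) -m hutson@example.com
```
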
---

## Dashboard Recommendations

### Grafana Dashboard Layout

**Page 1: Overview**
- Server uptime
- CPU usage (all servers)
- RAM usage (all servers)
- Disk space (all pools)
- Network traffic
- UPS status

**Page 2: Storage**
- ZFS pool health
- SMART status for all drives
- I/O latency
- Scrub progress
- Disk temperatures

**Page 3: VMs**
- VM status (up/down)
- VM resource usage
- VM disk I/O
- VM network traffic

**Page 4: Services**
- Service health checks
- HTTP response times
- Certificate expiry dates
- Syncthing sync status

---

## Implementation Plan

### Phase 1: Basic Monitoring (Week 1)

- [ ] Install Uptime Kuma or Netdata
- [ ] Add HTTP checks for all services
- [ ] Configure UPS alerts in Home Assistant
- [ ] Set up daily health check email

**Estimated Time**: 4-6 hours

---

### Phase 2: Advanced Monitoring (Week 2-3)

- [ ] Install Prometheus + Grafana
- [ ] Deploy node_exporter on all servers
- [ ] Deploy zfs_exporter
- [ ] Deploy smartmon_exporter
- [ ] Create Grafana dashboards

**Estimated Time**: 8-12 hours

---

### Phase 3: Alerting (Week 4)

- [ ] Configure Alertmanager
- [ ] Set up email/push notifications
- [ ] Create alert rules for all critical metrics
- [ ] Test all alert paths
- [ ] Document alert procedures

**Estimated Time**: 4-6 hours

---

## Related Documentation

- [GATEWAY.md](GATEWAY.md) - Gateway monitoring and troubleshooting
- [UPS.md](UPS.md) - UPS monitoring details
- [STORAGE.md](STORAGE.md) - ZFS health checks
- [SERVICES.md](SERVICES.md) - Service inventory
- [HOMEASSISTANT.md](HOMEASSISTANT.md) - Home Assistant automations
- [MAINTENANCE.md](MAINTENANCE.md) - Regular maintenance checks

---

**Last Updated**: 2026-01-02
**Status**: ⚠️ **Partial monitoring - Gateway active, other systems need implementation**