# Monitoring and Alerting

Documentation for system monitoring, health checks, and alerting across the homelab.

## Current Monitoring Status

| Component | Monitored? | Method | Alerts | Notes |
|-----------|------------|--------|--------|-------|
| **Gateway** | ✅ Yes | Custom services | ✅ Auto-reboot | Internet watchdog + memory monitor |
| **UPS** | ✅ Yes | NUT + Home Assistant | ❌ No | Battery, load, runtime tracked |
| **Syncthing** | ✅ Partial | API (manual checks) | ❌ No | Connection status available |
| **Server temps** | ✅ Partial | Manual checks | ❌ No | Via `sensors` command |
| **VM status** | ✅ Partial | Proxmox UI | ❌ No | Manual monitoring |
| **ZFS health** | ❌ No | Manual `zpool status` | ❌ No | No automated checks |
| **Disk health (SMART)** | ❌ No | Manual `smartctl` | ❌ No | No automated checks |
| **Network** | ✅ Partial | Gateway watchdog | ✅ Auto-reboot | Connectivity check every 60s |
| **Services** | ❌ No | - | ❌ No | No health checks |
| **Backups** | ❌ No | - | ❌ No | No verification |
| **Claude Code** | ✅ Yes | Prometheus + Grafana | ✅ Yes | Token usage, burn rate, cost tracking |

**Overall Status**: ⚠️ **PARTIAL** - Gateway and Claude Code monitoring are active; most other components are checked manually or not at all.

---

## Existing Monitoring

### UPS Monitoring (NUT)

**Status**: ✅ **Active and working**

**What's monitored**:
- Battery charge percentage
- Runtime remaining (seconds)
- Load percentage
- Input/output voltage
- UPS status (OL/OB/LB)

**Access**:
```bash
# Full UPS status
ssh pve 'upsc cyberpower@localhost'

# Key metrics
ssh pve 'upsc cyberpower@localhost | grep -E "battery.charge:|battery.runtime:|ups.load:|ups.status:"'
```

**Home Assistant Integration**:
- Sensors: `sensor.cyberpower_*`
- Can be used for automation/alerts
- Currently: No alerts configured

**See**: [UPS.md](UPS.md)

---

### Gateway Monitoring

**Status**: ✅ **Active with auto-recovery**

Two custom systemd services monitor the UCG-Fiber gateway (10.10.10.1):

**1. Internet Watchdog** (`internet-watchdog.service`)
- Pings external DNS (1.1.1.1, 8.8.8.8, 208.67.222.222) every 60 seconds
- Auto-reboots gateway after 5 consecutive failures (~5 minutes)
- Logs to `/var/log/internet-watchdog.log`

**2. Memory Monitor** (`memory-monitor.service`)
- Logs memory usage and top processes every 10 minutes
- Logs to `/data/logs/memory-history.log`
- Auto-rotates when log exceeds 10MB

**Quick Commands**:
```bash
# Check service status
ssh ucg-fiber 'systemctl status internet-watchdog memory-monitor'

# View watchdog activity
ssh ucg-fiber 'tail -20 /var/log/internet-watchdog.log'

# View memory history
ssh ucg-fiber 'tail -100 /data/logs/memory-history.log'

# Current memory usage
ssh ucg-fiber 'free -m && ps -eo pid,rss,comm --sort=-rss | head -12'
```

**See**: [GATEWAY.md](GATEWAY.md)

---

### Claude Code Token Monitoring

**Status**: ✅ **Active with alerts**

Monitors Claude Code token usage across all machines to track subscription consumption and prevent hitting weekly limits.

**Architecture**:
```
Claude Code (MacBook/Mac Mini)
        │
        ▼ (OTLP HTTP push)
OTEL Collector (docker-host:4318)
        │
        ▼ (Remote Write)
Prometheus (docker-host:9090)
        │
        ├──► Grafana Dashboard
        │
        └──► Alertmanager (burn rate alerts)
```
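The compose file for this stack isn't reproduced in this doc; the sketch below shows roughly how the three containers could be wired together under `/opt/monitoring` on docker-host (matching the config paths listed further down). The service names, image tags, and volumes are assumptions, and Prometheus needs `--web.enable-remote-write-receiver` so the collector's remote-write exporter can push into it.

```bash
# Hypothetical /opt/monitoring/docker-compose.yml on docker-host -- a sketch, not the deployed file
mkdir -p /opt/monitoring && cd /opt/monitoring
cat > docker-compose.yml << 'EOF'
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otelcol/config.yaml"]
    volumes:
      - ./otel-collector/config.yaml:/etc/otelcol/config.yaml:ro
    ports:
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP (Claude Code pushes here)
    restart: unless-stopped

  prometheus:
    image: prom/prometheus:latest
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--web.enable-remote-write-receiver"   # required for the collector's remote-write exporter
    volumes:
      - ./prometheus:/etc/prometheus            # expects prometheus.yml + rules/ here
      - prom-data:/prometheus
    ports:
      - "9090:9090"
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana
    restart: unless-stopped

volumes:
  prom-data:
  grafana-data:
EOF
```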
**Monitored Devices**: All Claude Code sessions on any device automatically push metrics via OTLP.

**What's monitored**:
- Token usage (input/output/cache) over time
- Burn rate (tokens/hour)
- Cost tracking (USD)
- Usage by model (Opus, Sonnet, Haiku)
- Session count
- Per-device breakdown

**Dashboard**: https://grafana.htsn.io/d/claude-code-usage/claude-code-token-usage

**Alerts Configured**:

| Alert | Threshold | Severity |
|-------|-----------|----------|
| High Burn Rate | >100k tokens/hour for 15min | Warning |
| Weekly Limit Risk | Projected >5M tokens/week | Critical |
| No Metrics | Scrape fails for 5min | Info |

**Configuration Files**:
- Claude settings: `~/.claude/settings.json` (on each Mac - synced via Syncthing)
- OTEL Collector: `/opt/monitoring/otel-collector/config.yaml` (docker-host)
- Alert rules: `/opt/monitoring/prometheus/rules/claude-code.yml` (docker-host)

**Claude Code Settings** (in `~/.claude/settings.json`):
```json
{
  "env": {
    "CLAUDE_CODE_ENABLE_TELEMETRY": "1",
    "OTEL_METRICS_EXPORTER": "otlp",
    "OTEL_EXPORTER_OTLP_ENDPOINT": "http://10.10.10.206:4318",
    "OTEL_EXPORTER_OTLP_PROTOCOL": "http/protobuf",
    "OTEL_METRIC_EXPORT_INTERVAL": "60000"
  }
}
```

**OTEL Collector Config** (`/opt/monitoring/otel-collector/config.yaml`):
```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 10s

exporters:
  prometheusremotewrite:
    endpoint: "http://prometheus:9090/api/v1/write"

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite]
```

**Useful PromQL Queries**:
```promql
# Total tokens this session
sum(claude_code_token_usage_total)

# Burn rate (tokens/hour)
sum(rate(claude_code_token_usage_total[1h])) * 3600

# Usage by device
sum(claude_code_token_usage_total) by (device)

# Projected weekly usage
sum(increase(claude_code_token_usage_total[24h])) * 7
```

**Important Notes**:
- Claude Code must be restarted after changing telemetry settings
- Metrics only flow while Claude Code is running
- Weekly subscription resets Monday 1am (America/New_York)

**Added**: 2026-01-16

---

### Syncthing Monitoring

**Status**: ⚠️ **Partial** - API available, no automated monitoring

**What's available**:
- Device connection status
- Folder sync status
- Sync errors
- Bandwidth usage

**Manual Checks**:
```bash
# Check connections (Mac Mini)
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
  "http://127.0.0.1:8384/rest/system/connections" | \
  python3 -c "import sys,json; d=json.load(sys.stdin)['connections']; \
[print(f\"{v.get('name',k[:7])}: {'UP' if v['connected'] else 'DOWN'}\") for k,v in d.items()]"

# Check folder status
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
  "http://127.0.0.1:8384/rest/db/status?folder=documents" | jq

# Check errors
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
  "http://127.0.0.1:8384/rest/folder/errors?folder=documents" | jq
```

**Needs**: Automated monitoring script + alerts

**See**: [SYNCTHING.md](SYNCTHING.md)

---

### Temperature Monitoring

**Status**: ⚠️ **Manual only**

**Current Method**:
```bash
# CPU temperature (Threadripper Tctl)
ssh pve 'for f in /sys/class/hwmon/hwmon*/temp*_input; do \
  label=$(cat ${f%_input}_label 2>/dev/null); \
  if [ "$label" = "Tctl" ]; then echo "PVE Tctl: $(($(cat $f)/1000))°C"; fi; done'

ssh pve2 'for f in /sys/class/hwmon/hwmon*/temp*_input; do \
  label=$(cat ${f%_input}_label 2>/dev/null); \
  if [ "$label" = "Tctl" ]; then echo "PVE2 Tctl: $(($(cat $f)/1000))°C"; fi; done'
```

**Thresholds**:
- Healthy: 70-80°C under load
- Warning: >85°C
- Critical: >90°C (throttling)

**Needs**: Automated monitoring + alert if >85°C (see the sketch below)
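A minimal sketch of that check, reusing the Tctl read above. It assumes passwordless SSH to `pve`/`pve2` (already used throughout this doc); `notify` is a placeholder until one of the alert destinations further down is wired up.

```bash
#!/bin/bash
# Sketch: cron-able CPU temperature check for pve/pve2.
# notify() is a placeholder -- point it at mail, Pushover, Telegram, etc. once chosen.
THRESHOLD=85

notify() {
    echo "ALERT: $1"   # placeholder alert destination
}

for host in pve pve2; do
    temp=$(ssh "$host" 'for f in /sys/class/hwmon/hwmon*/temp*_input; do
        label=$(cat ${f%_input}_label 2>/dev/null)
        if [ "$label" = "Tctl" ]; then echo $(($(cat $f)/1000)); fi
    done' 2>/dev/null | head -1)

    if [ -z "$temp" ]; then
        notify "$host: could not read CPU temperature"
    elif [ "$temp" -gt "$THRESHOLD" ]; then
        notify "$host: CPU temperature ${temp}°C exceeds ${THRESHOLD}°C"
    fi
done
```

Run it every few minutes from cron, e.g. `*/5 * * * * ~/bin/cpu-temp-check.sh` (script name is illustrative).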
---

### Proxmox VM Monitoring

**Status**: ⚠️ **Manual only**

**Current Access**:
- Proxmox Web UI: Node → Summary
- CLI: `ssh pve 'qm list'`

**Metrics Available** (via Proxmox):
- CPU usage per VM
- RAM usage per VM
- Disk I/O
- Network I/O
- VM uptime

**Needs**: API-based monitoring + alerts for VM down

---

## Recommended Monitoring Stack

### Option 1: Prometheus + Grafana (Recommended)

**Why**:
- Industry standard
- Extensive integrations
- Beautiful dashboards
- Flexible alerting

**Architecture**:
```
Grafana (dashboard) → Prometheus (metrics DB) → Exporters (data collection)
                              ↓
                      Alertmanager (alerts)
```

**Required Exporters**:

| Exporter | Monitors | Install On |
|----------|----------|------------|
| node_exporter | CPU, RAM, disk, network | PVE, PVE2, TrueNAS, all VMs |
| zfs_exporter | ZFS pool health | PVE, PVE2, TrueNAS |
| smartmon_exporter | Drive SMART data | PVE, PVE2, TrueNAS |
| nut_exporter | UPS metrics | PVE |
| proxmox_exporter | VM/CT stats | PVE, PVE2 |
| cadvisor | Docker containers | Saltbox, docker-host |

**Deployment**:
```bash
# Create monitoring VM
ssh pve 'qm create 210 --name monitoring --memory 4096 --cores 2 \
  --net0 virtio,bridge=vmbr0'

# Install Prometheus + Grafana (via Docker)
# /opt/monitoring/docker-compose.yml
```

**Estimated Setup Time**: 4-6 hours

---

### Option 2: Uptime Kuma (Simpler Alternative)

**Why**:
- Lightweight
- Easy to set up
- Web-based dashboard
- Built-in alerts (email, Slack, etc.)

**What it monitors**:
- HTTP/HTTPS endpoints
- Ping (ICMP)
- Ports (TCP)
- Docker containers

**Deployment**:
```bash
ssh docker-host 'mkdir -p /opt/uptime-kuma'

# Write the compose file on docker-host and start the container
ssh docker-host 'cat > /opt/uptime-kuma/docker-compose.yml' << 'EOF'
version: "3.8"
services:
  uptime-kuma:
    image: louislam/uptime-kuma:latest
    ports:
      - "3001:3001"
    volumes:
      - ./data:/app/data
    restart: unless-stopped
EOF
ssh docker-host 'cd /opt/uptime-kuma && docker compose up -d'

# Access: http://10.10.10.206:3001
# Add Traefik config for uptime.htsn.io
```

**Estimated Setup Time**: 1-2 hours

---

### Option 3: Netdata (Real-time Monitoring)

**Why**:
- Real-time metrics (1-second granularity)
- Auto-discovers services
- Low overhead
- Beautiful web UI

**Deployment**:
```bash
# Install on each server
ssh pve 'bash <(curl -Ss https://my-netdata.io/kickstart.sh)'
ssh pve2 'bash <(curl -Ss https://my-netdata.io/kickstart.sh)'

# Access:
# http://10.10.10.120:19999 (PVE)
# http://10.10.10.102:19999 (PVE2)
```

**Parent-Child Setup** (optional):
- Configure PVE as parent
- Stream metrics from PVE2 → PVE
- Single dashboard for both servers

**Estimated Setup Time**: 1 hour

---

## Critical Metrics to Monitor

### Server Health

| Metric | Threshold | Action |
|--------|-----------|--------|
| **CPU usage** | >90% for 5 min | Alert |
| **CPU temp** | >85°C | Alert |
| **CPU temp** | >90°C | Critical alert |
| **RAM usage** | >95% | Alert |
| **Disk space** | >80% | Warning |
| **Disk space** | >90% | Alert |
| **Load average** | >CPU count | Alert |

### Storage Health

| Metric | Threshold | Action |
|--------|-----------|--------|
| **ZFS pool errors** | >0 | Alert immediately |
| **ZFS pool degraded** | Any degraded vdev | Critical alert |
| **ZFS scrub failed** | Last scrub error | Alert |
| **SMART reallocated sectors** | >0 | Warning |
| **SMART pending sectors** | >0 | Alert |
| **SMART failure** | Pre-fail | Critical - replace drive |

### UPS

| Metric | Threshold | Action |
|--------|-----------|--------|
| **Battery charge** | <20% | Warning |
| **Battery charge** | <10% | Alert |
| **On battery** | >5 min | Alert |
| **Runtime** | <5 min | Critical |
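The UPS thresholds above can be checked today with the NUT data already exposed on PVE, no new exporters required. A minimal cron-able sketch follows; `notify` is again a placeholder, and the ">5 min on battery" rule would need a small state file that this sketch omits.

```bash
#!/bin/bash
# Sketch: check the UPS thresholds in the table above via NUT on pve.
# notify() is a placeholder alert destination.
notify() { echo "UPS ALERT: $1"; }

# upsc accepts a single variable name and prints just its value
ups_get() { ssh pve "upsc cyberpower@localhost $1" 2>/dev/null; }

charge=$(ups_get battery.charge)
runtime=$(ups_get battery.runtime)   # seconds
status=$(ups_get ups.status)         # OL / OB / LB ...

# Strip any decimals so the integer comparisons below work
charge=${charge%%.*}
runtime=${runtime%%.*}

case "$status" in
    *OB*) notify "running on battery (status: $status)" ;;
esac

[ -n "$charge" ]  && [ "$charge" -lt 10 ]   && notify "battery charge critical: ${charge}%"
[ -n "$charge" ]  && [ "$charge" -lt 20 ]   && notify "battery charge low: ${charge}%"
[ -n "$runtime" ] && [ "$runtime" -lt 300 ] && notify "runtime remaining only ${runtime}s"
```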
### Network

| Metric | Threshold | Action |
|--------|-----------|--------|
| **Device unreachable** | >2 min down | Alert |
| **High packet loss** | >5% | Warning |
| **Bandwidth saturation** | >90% | Warning |

### VMs/Services

| Metric | Threshold | Action |
|--------|-----------|--------|
| **VM stopped** | Critical VM down | Alert immediately |
| **Service unreachable** | HTTP 5xx or timeout | Alert |
| **Backup failed** | Any backup failure | Alert |
| **Certificate expiry** | <30 days | Warning |
| **Certificate expiry** | <7 days | Alert |

---

## Alert Destinations

### Email Alerts

**Recommended**: Set up an SMTP relay for email alerts

**Options**:
1. Gmail SMTP (free, rate-limited)
2. SendGrid (free tier: 100 emails/day)
3. Mailgun (free tier available)
4. Self-hosted mail server (complex)

**Configuration Example** (Prometheus Alertmanager):
```yaml
# /etc/alertmanager/alertmanager.yml
receivers:
  - name: 'email'
    email_configs:
      - to: 'hutson@example.com'
        from: 'alerts@htsn.io'
        smarthost: 'smtp.gmail.com:587'
        auth_username: 'alerts@htsn.io'
        auth_password: 'app-password-here'
```

---

### Push Notifications

**Options**:
- **Pushover**: $5 one-time, reliable
- **Pushbullet**: Free tier available
- **Telegram Bot**: Free
- **Discord Webhook**: Free
- **Slack**: Free tier available

**Recommended**: Pushover or Telegram for mobile alerts

---

### Home Assistant Alerts

Since Home Assistant is already running, use it for alerts:

**Automation Example**:
```yaml
automation:
  - alias: "UPS Low Battery Alert"
    trigger:
      - platform: numeric_state
        entity_id: sensor.cyberpower_battery_charge
        below: 20
    action:
      - service: notify.mobile_app
        data:
          message: "⚠️ UPS battery at {{ states('sensor.cyberpower_battery_charge') }}%"

  - alias: "Server High Temperature"
    trigger:
      - platform: template
        value_template: "{{ states('sensor.pve_cpu_temp') | float(0) > 85 }}"
    action:
      - service: notify.mobile_app
        data:
          message: "🔥 PVE CPU temperature: {{ states('sensor.pve_cpu_temp') }}°C"
```

**Needs**: Sensors for CPU temp, disk space, etc. in Home Assistant
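One possible way to get the `sensor.pve_cpu_temp` entity referenced above is a `command_line` sensor that shells out to PVE. This is only a sketch: it assumes an SSH client and a key at `/config/.ssh/id_ed25519` are available inside the Home Assistant environment (neither is confirmed here), and relies on `lm-sensors` being on pve (it is, since temps are already read with `sensors`).

```bash
# Sketch: expose PVE CPU temperature to Home Assistant as sensor.pve_cpu_temp
# via the command_line integration. Key path and SSH availability are assumptions.
cat >> /config/configuration.yaml << 'EOF'

command_line:
  - sensor:
      name: pve_cpu_temp
      unit_of_measurement: "°C"
      scan_interval: 300
      # `sensors` runs on pve; the pipeline extracting the Tctl value runs locally
      command: "ssh -i /config/.ssh/id_ed25519 -o StrictHostKeyChecking=accept-new root@10.10.10.120 sensors | grep Tctl | head -1 | tr -dc 0-9. | cut -d. -f1"
EOF
```

If a `command_line:` block already exists in `configuration.yaml`, merge the sensor into it instead of appending a second block. A disk-space sensor could follow the same pattern with `df` on the relevant host.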
---

## Monitoring Scripts

### Daily Health Check

Save as `~/bin/homelab-health-check.sh`:

```bash
#!/bin/bash
# Daily homelab health check

echo "=== Homelab Health Check ==="
echo "Date: $(date)"
echo ""

echo "=== Server Status ==="
ssh pve 'uptime' 2>/dev/null || echo "PVE: UNREACHABLE"
ssh pve2 'uptime' 2>/dev/null || echo "PVE2: UNREACHABLE"
echo ""

echo "=== CPU Temperatures ==="
ssh pve 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE: $(($(cat $f)/1000))°C"; fi; done'
ssh pve2 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE2: $(($(cat $f)/1000))°C"; fi; done'
echo ""

echo "=== UPS Status ==="
ssh pve 'upsc cyberpower@localhost | grep -E "battery.charge:|battery.runtime:|ups.load:|ups.status:"'
echo ""

echo "=== ZFS Pools ==="
ssh pve 'zpool status -x' 2>/dev/null
ssh pve2 'zpool status -x' 2>/dev/null
ssh truenas 'zpool status -x vault'
echo ""

echo "=== Disk Space ==="
ssh pve 'df -h | grep -E "Filesystem|/dev/(nvme|sd)"'
ssh truenas 'df -h /mnt/vault'
echo ""

echo "=== VM Status ==="
ssh pve 'qm list | grep running | wc -l' | xargs echo "PVE VMs running:"
ssh pve2 'qm list | grep running | wc -l' | xargs echo "PVE2 VMs running:"
echo ""

echo "=== Syncthing Connections ==="
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
  "http://127.0.0.1:8384/rest/system/connections" | \
  python3 -c "import sys,json; d=json.load(sys.stdin)['connections']; \
[print(f\"{v.get('name',k[:7])}: {'UP' if v['connected'] else 'DOWN'}\") for k,v in d.items()]"
echo ""

echo "=== Check Complete ==="
```

**Run daily**:
```cron
0 9 * * * ~/bin/homelab-health-check.sh | mail -s "Homelab Health Check" hutson@example.com
```

---

### ZFS Scrub Checker

```bash
#!/bin/bash
# Check last ZFS scrub status

echo "=== ZFS Scrub Status ==="
for host in pve pve2; do
    echo "--- $host ---"
    ssh $host 'zpool status | grep -A1 scrub'
    echo ""
done

echo "--- TrueNAS ---"
ssh truenas 'zpool status vault | grep -A1 scrub'
```

---

### SMART Health Checker

```bash
#!/bin/bash
# Check SMART health on all drives

echo "=== SMART Health Check ==="

echo "--- TrueNAS Drives ---"
ssh truenas 'smartctl --scan | while read dev type; do echo "=== $dev ==="; smartctl -H $dev | grep -E "SMART overall|PASSED|FAILED"; done'

echo "--- PVE Drives ---"
ssh pve 'for dev in /dev/nvme* /dev/sd*; do [ -e "$dev" ] && echo "=== $dev ===" && smartctl -H $dev | grep -E "SMART|PASSED|FAILED"; done'
```

---

## Dashboard Recommendations

### Grafana Dashboard Layout

**Page 1: Overview**
- Server uptime
- CPU usage (all servers)
- RAM usage (all servers)
- Disk space (all pools)
- Network traffic
- UPS status

**Page 2: Storage**
- ZFS pool health
- SMART status for all drives
- I/O latency
- Scrub progress
- Disk temperatures

**Page 3: VMs**
- VM status (up/down)
- VM resource usage
- VM disk I/O
- VM network traffic

**Page 4: Services**
- Service health checks
- HTTP response times
- Certificate expiry dates
- Syncthing sync status

---

## Implementation Plan

### Phase 1: Basic Monitoring (Week 1)

- [ ] Install Uptime Kuma or Netdata
- [ ] Add HTTP checks for all services
- [ ] Configure UPS alerts in Home Assistant
- [ ] Set up daily health check email

**Estimated Time**: 4-6 hours

---

### Phase 2: Advanced Monitoring (Week 2-3)

- [ ] Install Prometheus + Grafana
- [ ] Deploy node_exporter on all servers (see the sketch after this list)
- [ ] Deploy zfs_exporter
- [ ] Deploy smartmon_exporter
- [ ] Create Grafana dashboards

**Estimated Time**: 8-12 hours
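For the node_exporter item above, the rollout on each Debian-based host (PVE, PVE2, the VMs) could look roughly like the sketch below; the release version, install paths, and systemd unit are assumptions, not an existing script.

```bash
#!/bin/bash
# Sketch: install node_exporter on one Debian-based host (run once per server).
# Version is an assumption -- check the current release before using.
set -euo pipefail
VERSION=1.8.2
ARCH=amd64

cd /tmp
curl -sSL -o node_exporter.tar.gz \
  "https://github.com/prometheus/node_exporter/releases/download/v${VERSION}/node_exporter-${VERSION}.linux-${ARCH}.tar.gz"
tar xzf node_exporter.tar.gz
install -m 0755 "node_exporter-${VERSION}.linux-${ARCH}/node_exporter" /usr/local/bin/node_exporter

# Dedicated unprivileged user for the exporter
useradd --system --no-create-home --shell /usr/sbin/nologin node_exporter 2>/dev/null || true

cat > /etc/systemd/system/node_exporter.service << 'EOF'
[Unit]
Description=Prometheus Node Exporter
After=network-online.target

[Service]
User=node_exporter
ExecStart=/usr/local/bin/node_exporter --web.listen-address=:9100

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable --now node_exporter
# Verify: curl -s http://localhost:9100/metrics | head
```

Prometheus then needs a scrape job pointing at each host's `:9100`. TrueNAS may need a different approach, since installing extra packages on it is limited.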
---

### Phase 3: Alerting (Week 4)

- [ ] Configure Alertmanager
- [ ] Set up email/push notifications
- [ ] Create alert rules for all critical metrics
- [ ] Test all alert paths
- [ ] Document alert procedures

**Estimated Time**: 4-6 hours

---

## Related Documentation

- [GATEWAY.md](GATEWAY.md) - Gateway monitoring and troubleshooting
- [UPS.md](UPS.md) - UPS monitoring details
- [STORAGE.md](STORAGE.md) - ZFS health checks
- [SERVICES.md](SERVICES.md) - Service inventory
- [HOMEASSISTANT.md](HOMEASSISTANT.md) - Home Assistant automations
- [MAINTENANCE.md](MAINTENANCE.md) - Regular maintenance checks

---

**Last Updated**: 2026-01-02
**Status**: ⚠️ **Partial monitoring - Gateway and Claude Code active, other systems need implementation**