
Monitoring and Alerting

Documentation for system monitoring, health checks, and alerting across the homelab.

Current Monitoring Status

| Component | Monitored? | Method | Alerts | Notes |
|---|---|---|---|---|
| Gateway | Yes | Custom services | Auto-reboot | Internet watchdog + memory monitor |
| UPS | Yes | NUT + Home Assistant | No | Battery, load, runtime tracked |
| Syncthing | Partial | API (manual checks) | No | Connection status available |
| Server temps | Partial | Manual checks | No | Via sensors command |
| VM status | Partial | Proxmox UI | No | Manual monitoring |
| ZFS health | No | Manual zpool status | No | No automated checks |
| Disk health (SMART) | No | Manual smartctl | No | No automated checks |
| Network | Partial | Gateway watchdog | Auto-reboot | Connectivity check every 60s |
| Services | No | - | No | No health checks |
| Backups | No | - | No | No verification |
| Claude Code | Yes | Prometheus + Grafana | Yes | Token usage, burn rate, cost tracking |

Overall Status: ⚠️ PARTIAL - Gateway and Claude Code monitoring are active; most other components are checked manually


Existing Monitoring

UPS Monitoring (NUT)

Status: Active and working

What's monitored:

  • Battery charge percentage
  • Runtime remaining (seconds)
  • Load percentage
  • Input/output voltage
  • UPS status (OL/OB/LB)

Access:

# Full UPS status
ssh pve 'upsc cyberpower@localhost'

# Key metrics
ssh pve 'upsc cyberpower@localhost | grep -E "battery.charge:|battery.runtime:|ups.load:|ups.status:"'

Home Assistant Integration:

  • Sensors: sensor.cyberpower_*
  • Can be used for automation/alerts
  • Currently: No alerts configured

See: UPS.md


Gateway Monitoring

Status: Active with auto-recovery

Two custom systemd services monitor the UCG-Fiber gateway (10.10.10.1):

1. Internet Watchdog (internet-watchdog.service)

  • Pings external DNS (1.1.1.1, 8.8.8.8, 208.67.222.222) every 60 seconds
  • Auto-reboots gateway after 5 consecutive failures (~5 minutes)
  • Logs to /var/log/internet-watchdog.log

2. Memory Monitor (memory-monitor.service)

  • Logs memory usage and top processes every 10 minutes
  • Logs to /data/logs/memory-history.log
  • Auto-rotates when log exceeds 10MB

Quick Commands:

# Check service status
ssh ucg-fiber 'systemctl status internet-watchdog memory-monitor'

# View watchdog activity
ssh ucg-fiber 'tail -20 /var/log/internet-watchdog.log'

# View memory history
ssh ucg-fiber 'tail -100 /data/logs/memory-history.log'

# Current memory usage
ssh ucg-fiber 'free -m && ps -eo pid,rss,comm --sort=-rss | head -12'
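
For reference, the check-and-reboot behavior of internet-watchdog.service boils down to a loop like this (a simplified sketch, not the actual script deployed on the gateway):

# Simplified sketch of the watchdog loop (the real script lives on ucg-fiber)
FAILS=0
while true; do
  if ping -c1 -W2 1.1.1.1 >/dev/null 2>&1 || ping -c1 -W2 8.8.8.8 >/dev/null 2>&1 || \
     ping -c1 -W2 208.67.222.222 >/dev/null 2>&1; then
    FAILS=0
  else
    FAILS=$((FAILS+1))
    echo "$(date) ping failed ($FAILS/5)" >> /var/log/internet-watchdog.log
    if [ "$FAILS" -ge 5 ]; then
      echo "$(date) 5 consecutive failures - rebooting" >> /var/log/internet-watchdog.log
      reboot
    fi
  fi
  sleep 60
done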

See: GATEWAY.md


Claude Code Token Monitoring

Status: Active with alerts

Monitors Claude Code token usage across all machines to track subscription consumption and prevent hitting weekly limits.

Architecture:

Claude Code (MacBook/Mac Mini)
      │
      ▼ (OTLP HTTP push every 60s)
      │
OTEL Collector (docker-host:4318)
      │
      ▼ (Prometheus exporter on :8889)
      │
Prometheus (docker-host:9090) ─── scrapes ───► otel-collector:8889
      │
      ├──► Grafana Dashboard
      │
      └──► Alertmanager (burn rate alerts)

Note: The collector exposes these metrics via the Prometheus exporter instead of Remote Write because Claude Code sends delta-temporality metrics, which the Prometheus Remote Write exporter doesn't support.

Monitored Devices: All Claude Code sessions on any device automatically push metrics via OTLP.

What's monitored:

  • Token usage (input/output/cache) over time
  • Burn rate (tokens/hour)
  • Cost tracking (USD)
  • Usage by model (Opus, Sonnet, Haiku)
  • Session count
  • Per-device breakdown

Dashboard: https://grafana.htsn.io/d/claude-code-usage/claude-code-token-usage

Alerts Configured:

| Alert | Threshold | Severity |
|---|---|---|
| High Burn Rate | >100k tokens/hour for 15 min | Warning |
| Weekly Limit Risk | Projected >5M tokens/week | Critical |
| No Metrics | Scrape fails for 5 min | Info |
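
The burn-rate alert in claude-code.yml has roughly this shape (a sketch only - the deployed rule names, labels, and annotations may differ):

# Sketch of a burn-rate rule, not a copy of the deployed file
groups:
  - name: claude-code
    rules:
      - alert: ClaudeCodeHighBurnRate
        expr: sum(rate(claude_code_token_usage_tokens_total[15m])) * 3600 > 100000
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Claude Code burn rate above 100k tokens/hour"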

Configuration Files:

  • Shell config: ~/.zshrc (on each Mac - synced via Syncthing)
  • OTEL Collector: /opt/monitoring/otel-collector/config.yaml (docker-host)
  • Alert rules: /opt/monitoring/prometheus/rules/claude-code.yml (docker-host)

Shell Environment Setup (in ~/.zshrc):

# Claude Code OpenTelemetry Metrics (push to OTEL Collector)
export CLAUDE_CODE_ENABLE_TELEMETRY=1
export OTEL_METRICS_EXPORTER=otlp
export OTEL_EXPORTER_OTLP_ENDPOINT="http://10.10.10.206:4318"
export OTEL_EXPORTER_OTLP_PROTOCOL="http/protobuf"
export OTEL_METRIC_EXPORT_INTERVAL=60000

Note: These can be set either in shell environment (~/.zshrc) or in ~/.claude/settings.json under the env block. Both methods work.
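
The settings.json equivalent is an env block like this (a sketch - merge it into any existing settings rather than overwriting the file):

{
  "env": {
    "CLAUDE_CODE_ENABLE_TELEMETRY": "1",
    "OTEL_METRICS_EXPORTER": "otlp",
    "OTEL_EXPORTER_OTLP_ENDPOINT": "http://10.10.10.206:4318",
    "OTEL_EXPORTER_OTLP_PROTOCOL": "http/protobuf",
    "OTEL_METRIC_EXPORT_INTERVAL": "60000"
  }
}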

OTEL Collector Config (/opt/monitoring/otel-collector/config.yaml):

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 10s

exporters:
  prometheus:
    endpoint: 0.0.0.0:8889
    resource_to_telemetry_conversion:
      enabled: true

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]

Prometheus Scrape Config (add to /opt/monitoring/prometheus/prometheus.yml):

  - job_name: "claude-code"
    static_configs:
      - targets: ["otel-collector:8889"]
        labels:
          group: "claude-code"

Useful PromQL Queries:

# Total tokens by model
sum(claude_code_token_usage_tokens_total) by (model)

# Burn rate (tokens/hour)
sum(rate(claude_code_token_usage_tokens_total[1h])) * 3600

# Total cost by model
sum(claude_code_cost_usage_USD_total) by (model)

# Usage by type (input, output, cacheRead, cacheCreation)
sum(claude_code_token_usage_tokens_total) by (type)

# Projected weekly usage (rough estimate)
sum(increase(claude_code_token_usage_tokens_total[24h])) * 7

Important Notes:

  • After changing ~/.zshrc, start a new terminal/shell session before running Claude Code
  • Metrics only flow while Claude Code is running
  • Weekly subscription resets Monday 1am (America/New_York)
  • Verify env vars are set: env | grep OTEL

Added: 2026-01-16


Syncthing Monitoring

Status: ⚠️ Partial - API available, no automated monitoring

What's available:

  • Device connection status
  • Folder sync status
  • Sync errors
  • Bandwidth usage

Manual Checks:

# Check connections (Mac Mini)
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
  "http://127.0.0.1:8384/rest/system/connections" | \
  python3 -c "import sys,json; d=json.load(sys.stdin)['connections']; \
  [print(f\"{v.get('name',k[:7])}: {'UP' if v['connected'] else 'DOWN'}\") for k,v in d.items()]"

# Check folder status
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
  "http://127.0.0.1:8384/rest/db/status?folder=documents" | jq

# Check errors
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
  "http://127.0.0.1:8384/rest/folder/errors?folder=documents" | jq

Needs: Automated monitoring script + alerts
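
A minimal cron-able version of that check could look like this (a sketch - the echo at the end is a placeholder to swap for mail/Pushover/etc. once an alert channel exists):

#!/bin/bash
# Sketch: report any Syncthing device that is not currently connected (run on the Mac Mini)
API_KEY="oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5"
DOWN=$(curl -s -H "X-API-Key: $API_KEY" "http://127.0.0.1:8384/rest/system/connections" | \
  python3 -c "import sys,json; d=json.load(sys.stdin)['connections']; \
  print('\n'.join(v.get('name',k[:7]) for k,v in d.items() if not v['connected']))")
if [ -n "$DOWN" ]; then
  echo "Syncthing devices down: $DOWN"   # placeholder notification
fi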

See: SYNCTHING.md


Temperature Monitoring

Status: ⚠️ Manual only

Current Method:

# CPU temperature (Threadripper Tctl)
ssh pve 'for f in /sys/class/hwmon/hwmon*/temp*_input; do \
  label=$(cat ${f%_input}_label 2>/dev/null); \
  if [ "$label" = "Tctl" ]; then echo "PVE Tctl: $(($(cat $f)/1000))°C"; fi; done'

ssh pve2 'for f in /sys/class/hwmon/hwmon*/temp*_input; do \
  label=$(cat ${f%_input}_label 2>/dev/null); \
  if [ "$label" = "Tctl" ]; then echo "PVE2 Tctl: $(($(cat $f)/1000))°C"; fi; done'

Thresholds:

  • Healthy: 70-80°C under load
  • Warning: >85°C
  • Critical: >90°C (throttling)

Needs: Automated monitoring + alert if >85°C
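
A cron-able check against that 85°C threshold could be as simple as this (a sketch - the echo is a placeholder for whatever alert channel gets set up):

#!/bin/bash
# Sketch: warn if either node's Tctl reading exceeds 85°C
for host in pve pve2; do
  temp=$(ssh $host 'for f in /sys/class/hwmon/hwmon*/temp*_input; do
    label=$(cat ${f%_input}_label 2>/dev/null)
    [ "$label" = "Tctl" ] && echo $(($(cat $f)/1000))
  done' | head -1)
  if [ -n "$temp" ] && [ "$temp" -gt 85 ]; then
    echo "$host CPU temperature is ${temp}°C"   # placeholder notification
  fi
done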


Proxmox VM Monitoring

Status: ⚠️ Manual only

Current Access:

  • Proxmox Web UI: Node → Summary
  • CLI: ssh pve 'qm list'

Metrics Available (via Proxmox):

  • CPU usage per VM
  • RAM usage per VM
  • Disk I/O
  • Network I/O
  • VM uptime

Needs: API-based monitoring + alerts for VM down
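
Until API-based monitoring exists, a quick scripted check for stopped VMs could look like this (a sketch - it assumes every VM listed by qm should normally be running):

#!/bin/bash
# Sketch: report any VM that qm does not show as "running"
for host in pve pve2; do
  stopped=$(ssh $host "qm list | awk 'NR>1 && \$3 != \"running\" {print \$1, \$2}'")
  if [ -n "$stopped" ]; then
    echo "$host has non-running VMs:"
    echo "$stopped"
  fi
done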


Option 1: Prometheus + Grafana (Recommended)

Why:

  • Industry standard
  • Extensive integrations
  • Beautiful dashboards
  • Flexible alerting

Architecture:

Grafana (dashboard) → Prometheus (metrics DB) → Exporters (data collection)
                              ↓
                          Alertmanager (alerts)

Required Exporters:

| Exporter | Monitors | Install On |
|---|---|---|
| node_exporter | CPU, RAM, disk, network | PVE, PVE2, TrueNAS, all VMs |
| zfs_exporter | ZFS pool health | PVE, PVE2, TrueNAS |
| smartmon_exporter | Drive SMART data | PVE, PVE2, TrueNAS |
| nut_exporter | UPS metrics | PVE |
| proxmox_exporter | VM/CT stats | PVE, PVE2 |
| cadvisor | Docker containers | Saltbox, docker-host |

Deployment:

# Create monitoring VM
ssh pve 'qm create 210 --name monitoring --memory 4096 --cores 2 \
  --net0 virtio,bridge=vmbr0'

# Install Prometheus + Grafana (via Docker)
# /opt/monitoring/docker-compose.yml
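
A starting point for that compose file might be the following (a sketch - image tags, ports, and volume layout are assumptions to adjust):

# /opt/monitoring/docker-compose.yml (sketch)
version: "3.8"
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus:/etc/prometheus
      - prometheus-data:/prometheus
    restart: unless-stopped
  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager:/etc/alertmanager
    restart: unless-stopped
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana
    restart: unless-stopped
volumes:
  prometheus-data:
  grafana-data: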

Estimated Setup Time: 4-6 hours


Option 2: Uptime Kuma (Simpler Alternative)

Why:

  • Lightweight
  • Easy to set up
  • Web-based dashboard
  • Built-in alerts (email, Slack, etc.)

What it monitors:

  • HTTP/HTTPS endpoints
  • Ping (ICMP)
  • Ports (TCP)
  • Docker containers

Deployment:

# Create directory and compose file on docker-host
ssh docker-host 'mkdir -p /opt/uptime-kuma'
ssh docker-host 'cat > /opt/uptime-kuma/docker-compose.yml' << 'EOF'
version: "3.8"
services:
  uptime-kuma:
    image: louislam/uptime-kuma:latest
    ports:
      - "3001:3001"
    volumes:
      - ./data:/app/data
    restart: unless-stopped
EOF
ssh docker-host 'cd /opt/uptime-kuma && docker compose up -d'

# Access: http://10.10.10.206:3001
# Add Traefik config for uptime.htsn.io
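
If the existing Traefik instance uses a file provider for dynamic config, the uptime.htsn.io route would look roughly like this (a sketch - router/service names, entrypoint, and cert resolver are assumptions that must match the real Traefik setup):

# Sketch of a Traefik dynamic-config entry for Uptime Kuma
http:
  routers:
    uptime-kuma:
      rule: "Host(`uptime.htsn.io`)"
      entryPoints: ["websecure"]
      tls:
        certResolver: letsencrypt
      service: uptime-kuma
  services:
    uptime-kuma:
      loadBalancer:
        servers:
          - url: "http://10.10.10.206:3001"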

Estimated Setup Time: 1-2 hours


Option 3: Netdata (Real-time Monitoring)

Why:

  • Real-time metrics (1-second granularity)
  • Auto-discovers services
  • Low overhead
  • Beautiful web UI

Deployment:

# Install on each server
ssh pve 'bash <(curl -Ss https://my-netdata.io/kickstart.sh)'
ssh pve2 'bash <(curl -Ss https://my-netdata.io/kickstart.sh)'

# Access:
# http://10.10.10.120:19999 (PVE)
# http://10.10.10.102:19999 (PVE2)

Parent-Child Setup (optional):

  • Configure PVE as parent
  • Stream metrics from PVE2 → PVE
  • Single dashboard for both servers
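
That streaming relationship is set up in stream.conf on both nodes, roughly like this (a sketch - the API key is a placeholder UUID you generate yourself):

# On PVE2 (child) - /etc/netdata/stream.conf
[stream]
    enabled = yes
    destination = 10.10.10.120:19999
    api key = 11111111-2222-3333-4444-555555555555

# On PVE (parent) - /etc/netdata/stream.conf
[11111111-2222-3333-4444-555555555555]
    enabled = yes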

Estimated Setup Time: 1 hour


Critical Metrics to Monitor

Server Health

| Metric | Threshold | Action |
|---|---|---|
| CPU usage | >90% for 5 min | Alert |
| CPU temp | >85°C | Alert |
| CPU temp | >90°C | Critical alert |
| RAM usage | >95% | Alert |
| Disk space | >80% | Warning |
| Disk space | >90% | Alert |
| Load average | >CPU count | Alert |

Storage Health

| Metric | Threshold | Action |
|---|---|---|
| ZFS pool errors | >0 | Alert immediately |
| ZFS pool degraded | Any degraded vdev | Critical alert |
| ZFS scrub failed | Last scrub reported errors | Alert |
| SMART reallocated sectors | >0 | Warning |
| SMART pending sectors | >0 | Alert |
| SMART failure | Pre-fail | Critical - replace drive |

UPS

| Metric | Threshold | Action |
|---|---|---|
| Battery charge | <20% | Warning |
| Battery charge | <10% | Alert |
| On battery | >5 min | Alert |
| Runtime | <5 min | Critical |

Network

| Metric | Threshold | Action |
|---|---|---|
| Device unreachable | >2 min down | Alert |
| High packet loss | >5% | Warning |
| Bandwidth saturation | >90% | Warning |

VMs/Services

| Metric | Threshold | Action |
|---|---|---|
| VM stopped | Critical VM down | Alert immediately |
| Service unreachable | HTTP 5xx or timeout | Alert |
| Backup failed | Any backup failure | Alert |
| Certificate expiry | <30 days | Warning |
| Certificate expiry | <7 days | Alert |

Alert Destinations

Email Alerts

Recommended: Set up SMTP relay for email alerts

Options:

  1. Gmail SMTP (free, rate-limited)
  2. SendGrid (free tier: 100 emails/day)
  3. Mailgun (free tier available)
  4. Self-hosted mail server (complex)

Configuration Example (Prometheus Alertmanager):

# /etc/alertmanager/alertmanager.yml
receivers:
  - name: 'email'
    email_configs:
      - to: 'hutson@example.com'
        from: 'alerts@htsn.io'
        smarthost: 'smtp.gmail.com:587'
        auth_username: 'alerts@htsn.io'
        auth_password: 'app-password-here'
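
A receiver alone does nothing; Alertmanager also needs a route pointing at it, roughly (a minimal sketch):

# Minimal route sending everything to the email receiver (sketch)
route:
  receiver: 'email'
  group_by: ['alertname']
  group_wait: 30s
  repeat_interval: 4h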

Push Notifications

Options:

  • Pushover: $5 one-time, reliable
  • Pushbullet: Free tier available
  • Telegram Bot: Free
  • Discord Webhook: Free
  • Slack: Free tier available

Recommended: Pushover or Telegram for mobile alerts
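
If Pushover ends up being the pick, the Alertmanager side is a small receiver block (a sketch - both keys are placeholders from the Pushover app):

# Sketch of an Alertmanager Pushover receiver
receivers:
  - name: 'pushover'
    pushover_configs:
      - user_key: 'pushover-user-key-here'
        token: 'pushover-app-token-here'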


Home Assistant Alerts

Since Home Assistant is already running, use it for alerts:

Automation Example:

automation:
  - alias: "UPS Low Battery Alert"
    trigger:
      - platform: numeric_state
        entity_id: sensor.cyberpower_battery_charge
        below: 20
    action:
      - service: notify.mobile_app
        data:
          message: "⚠️ UPS battery at {{ states('sensor.cyberpower_battery_charge') }}%"

  - alias: "Server High Temperature"
    trigger:
      - platform: template
        value_template: "{{ states('sensor.pve_cpu_temp') | float(0) > 85 }}"
    action:
      - service: notify.mobile_app
        data:
          message: "🔥 PVE CPU temperature: {{ states('sensor.pve_cpu_temp') }}°C"

Needs: Sensors for CPU temp, disk space, etc. in Home Assistant
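
One way to feed those in is the command_line integration; a CPU-temp sensor could look roughly like this (a sketch using the newer command_line syntax - it assumes the Home Assistant host has key-based SSH access to pve, and the hwmon path is a placeholder to replace with the Tctl lookup from the Temperature Monitoring section):

# configuration.yaml (sketch)
command_line:
  - sensor:
      name: pve_cpu_temp
      command: "ssh root@10.10.10.120 'cat /sys/class/hwmon/hwmon0/temp1_input'"
      value_template: "{{ (value | int / 1000) | round(1) }}"
      unit_of_measurement: "°C"
      scan_interval: 120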


Monitoring Scripts

Daily Health Check

Save as ~/bin/homelab-health-check.sh:

#!/bin/bash
# Daily homelab health check

echo "=== Homelab Health Check ==="
echo "Date: $(date)"
echo ""

echo "=== Server Status ==="
ssh pve 'uptime' 2>/dev/null || echo "PVE: UNREACHABLE"
ssh pve2 'uptime' 2>/dev/null || echo "PVE2: UNREACHABLE"
echo ""

echo "=== CPU Temperatures ==="
ssh pve 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE: $(($(cat $f)/1000))°C"; fi; done'
ssh pve2 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE2: $(($(cat $f)/1000))°C"; fi; done'
echo ""

echo "=== UPS Status ==="
ssh pve 'upsc cyberpower@localhost | grep -E "battery.charge:|battery.runtime:|ups.load:|ups.status:"'
echo ""

echo "=== ZFS Pools ==="
ssh pve 'zpool status -x' 2>/dev/null
ssh pve2 'zpool status -x' 2>/dev/null
ssh truenas 'zpool status -x vault'
echo ""

echo "=== Disk Space ==="
ssh pve 'df -h | grep -E "Filesystem|/dev/(nvme|sd)"'
ssh truenas 'df -h /mnt/vault'
echo ""

echo "=== VM Status ==="
ssh pve 'qm list | grep running | wc -l' | xargs echo "PVE VMs running:"
ssh pve2 'qm list | grep running | wc -l' | xargs echo "PVE2 VMs running:"
echo ""

echo "=== Syncthing Connections ==="
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
  "http://127.0.0.1:8384/rest/system/connections" | \
  python3 -c "import sys,json; d=json.load(sys.stdin)['connections']; \
  [print(f\"{v.get('name',k[:7])}: {'UP' if v['connected'] else 'DOWN'}\") for k,v in d.items()]"
echo ""

echo "=== Check Complete ==="

Run daily:

0 9 * * * ~/bin/homelab-health-check.sh | mail -s "Homelab Health Check" hutson@example.com

ZFS Scrub Checker

#!/bin/bash
# Check last ZFS scrub status

echo "=== ZFS Scrub Status ==="

for host in pve pve2; do
  echo "--- $host ---"
  ssh $host 'zpool status | grep -A1 scrub'
  echo ""
done

echo "--- TrueNAS ---"
ssh truenas 'zpool status vault | grep -A1 scrub'

SMART Health Checker

#!/bin/bash
# Check SMART health on all drives

echo "=== SMART Health Check ==="

echo "--- TrueNAS Drives ---"
ssh truenas 'smartctl --scan | while read dev type; do
  echo "=== $dev ===";
  smartctl -H $dev | grep -E "SMART overall|PASSED|FAILED";
done'

echo "--- PVE Drives ---"
ssh pve 'for dev in /dev/nvme[0-9]n1 /dev/sd[a-z]; do
  [ -e "$dev" ] && echo "=== $dev ===" && smartctl -H $dev | grep -E "SMART|PASSED|FAILED";
done'

Dashboard Recommendations

Grafana Dashboard Layout

Page 1: Overview

  • Server uptime
  • CPU usage (all servers)
  • RAM usage (all servers)
  • Disk space (all pools)
  • Network traffic
  • UPS status

Page 2: Storage

  • ZFS pool health
  • SMART status for all drives
  • I/O latency
  • Scrub progress
  • Disk temperatures

Page 3: VMs

  • VM status (up/down)
  • VM resource usage
  • VM disk I/O
  • VM network traffic

Page 4: Services

  • Service health checks
  • HTTP response times
  • Certificate expiry dates
  • Syncthing sync status

Implementation Plan

Phase 1: Basic Monitoring (Week 1)

  • Install Uptime Kuma or Netdata
  • Add HTTP checks for all services
  • Configure UPS alerts in Home Assistant
  • Set up daily health check email

Estimated Time: 4-6 hours


Phase 2: Advanced Monitoring (Week 2-3)

  • Install Prometheus + Grafana
  • Deploy node_exporter on all servers
  • Deploy zfs_exporter
  • Deploy smartmon_exporter
  • Create Grafana dashboards
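
For the node_exporter step above, the Proxmox nodes are Debian-based, so the distro package is probably the quickest first pass (a sketch):

# Debian package listens on :9100 by default
ssh pve 'apt install -y prometheus-node-exporter'
ssh pve2 'apt install -y prometheus-node-exporter'

# Then add scrape targets to /opt/monitoring/prometheus/prometheus.yml, e.g.:
#   - job_name: "node"
#     static_configs:
#       - targets: ["10.10.10.120:9100", "10.10.10.102:9100"]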

Estimated Time: 8-12 hours


Phase 3: Alerting (Week 4)

  • Configure Alertmanager
  • Set up email/push notifications
  • Create alert rules for all critical metrics
  • Test all alert paths
  • Document alert procedures

Estimated Time: 4-6 hours



Last Updated: 2026-01-16
Status: ⚠️ Partial monitoring - Gateway and Claude Code active, other systems need implementation