
Monitoring and Alerting

Documentation for system monitoring, health checks, and alerting across the homelab.

Current Monitoring Status

| Component | Monitored? | Method | Alerts | Notes |
|---|---|---|---|---|
| Gateway | Yes | Custom services | Auto-reboot | Internet watchdog + memory monitor |
| UPS | Yes | NUT + Home Assistant | No | Battery, load, runtime tracked |
| Syncthing | Partial | API (manual checks) | No | Connection status available |
| Server temps | Partial | Manual checks | No | Via sensors command |
| VM status | Partial | Proxmox UI | No | Manual monitoring |
| ZFS health | No | Manual zpool status | No | No automated checks |
| Disk health (SMART) | No | Manual smartctl | No | No automated checks |
| Network | Partial | Gateway watchdog | Auto-reboot | Connectivity check every 60s |
| Services | No | - | No | No health checks |
| Backups | No | - | No | No verification |
| Claude Code | Yes | Prometheus + Grafana | Yes | Token usage, burn rate, cost tracking |

Overall Status: ⚠️ PARTIAL - Gateway and Claude Code monitoring are active; most other components are checked manually


Existing Monitoring

UPS Monitoring (NUT)

Status: Active and working

What's monitored:

  • Battery charge percentage
  • Runtime remaining (seconds)
  • Load percentage
  • Input/output voltage
  • UPS status (OL/OB/LB)

Access:

# Full UPS status
ssh pve 'upsc cyberpower@localhost'

# Key metrics
ssh pve 'upsc cyberpower@localhost | grep -E "battery.charge:|battery.runtime:|ups.load:|ups.status:"'

Home Assistant Integration:

  • Sensors: sensor.cyberpower_*
  • Can be used for automation/alerts
  • Currently: No alerts configured

See: UPS.md


Gateway Monitoring

Status: Active with auto-recovery

Two custom systemd services monitor the UCG-Fiber gateway (10.10.10.1):

1. Internet Watchdog (internet-watchdog.service)

  • Pings external DNS (1.1.1.1, 8.8.8.8, 208.67.222.222) every 60 seconds
  • Auto-reboots gateway after 5 consecutive failures (~5 minutes)
  • Logs to /var/log/internet-watchdog.log

2. Memory Monitor (memory-monitor.service)

  • Logs memory usage and top processes every 10 minutes
  • Logs to /data/logs/memory-history.log
  • Auto-rotates when log exceeds 10MB

Quick Commands:

# Check service status
ssh ucg-fiber 'systemctl status internet-watchdog memory-monitor'

# View watchdog activity
ssh ucg-fiber 'tail -20 /var/log/internet-watchdog.log'

# View memory history
ssh ucg-fiber 'tail -100 /data/logs/memory-history.log'

# Current memory usage
ssh ucg-fiber 'free -m && ps -eo pid,rss,comm --sort=-rss | head -12'
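
For reference, the check-and-reboot behavior of internet-watchdog.service boils down to a loop like this (a simplified sketch, not the actual script deployed on the gateway):

# Simplified sketch of the watchdog loop (the real script lives on ucg-fiber)
FAILS=0
while true; do
  if ping -c1 -W2 1.1.1.1 >/dev/null 2>&1 || ping -c1 -W2 8.8.8.8 >/dev/null 2>&1 || \
     ping -c1 -W2 208.67.222.222 >/dev/null 2>&1; then
    FAILS=0
  else
    FAILS=$((FAILS+1))
    echo "$(date) ping failed ($FAILS/5)" >> /var/log/internet-watchdog.log
    if [ "$FAILS" -ge 5 ]; then
      echo "$(date) 5 consecutive failures - rebooting" >> /var/log/internet-watchdog.log
      reboot
    fi
  fi
  sleep 60
done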

See: GATEWAY.md


Claude Code Token Monitoring

Status: Active with alerts

Monitors Claude Code token usage across all machines to track subscription consumption and prevent hitting weekly limits.

Architecture:

Claude Code (MacBook/Mac Mini)
      │
      ▼ (OTLP HTTP push every 60s)
      │
OTEL Collector (docker-host:4318)
      │
      ▼ (Prometheus exporter on :8889)
      │
Prometheus (docker-host:9090) ─── scrapes ───► otel-collector:8889
      │
      ├──► Grafana Dashboard
      │
      └──► Alertmanager (burn rate alerts)

Note: The collector exposes these metrics via the Prometheus exporter instead of Remote Write because Claude Code sends delta-temporality metrics, which the Prometheus Remote Write exporter doesn't support.

Monitored Devices: All Claude Code sessions on any device automatically push metrics via OTLP.

What's monitored:

  • Token usage (input/output/cache) over time
  • Burn rate (tokens/hour)
  • Cost tracking (USD)
  • Usage by model (Opus, Sonnet, Haiku)
  • Session count
  • Per-device breakdown

Dashboard: https://grafana.htsn.io/d/claude-code-usage/claude-code-token-usage

Alerts Configured:

| Alert | Threshold | Severity |
|---|---|---|
| High Burn Rate | >100k tokens/hour for 15 min | Warning |
| Weekly Limit Risk | Projected >5M tokens/week | Critical |
| No Metrics | Scrape fails for 5 min | Info |
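
The burn-rate alert in claude-code.yml has roughly this shape (a sketch only - the deployed rule names, labels, and annotations may differ):

# Sketch of a burn-rate rule, not a copy of the deployed file
groups:
  - name: claude-code
    rules:
      - alert: ClaudeCodeHighBurnRate
        expr: sum(rate(claude_code_token_usage_tokens_total[15m])) * 3600 > 100000
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Claude Code burn rate above 100k tokens/hour"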

Configuration Files:

  • Shell config: ~/.zshrc (on each Mac - synced via Syncthing)
  • OTEL Collector: /opt/monitoring/otel-collector/config.yaml (docker-host)
  • Alert rules: /opt/monitoring/prometheus/rules/claude-code.yml (docker-host)

Shell Environment Setup (in ~/.zshrc):

# Claude Code OpenTelemetry Metrics (push to OTEL Collector)
export CLAUDE_CODE_ENABLE_TELEMETRY=1
export OTEL_METRICS_EXPORTER=otlp
export OTEL_EXPORTER_OTLP_ENDPOINT="http://10.10.10.206:4318"
export OTEL_EXPORTER_OTLP_PROTOCOL="http/protobuf"
export OTEL_METRIC_EXPORT_INTERVAL=60000

Note: These can be set either in shell environment (~/.zshrc) or in ~/.claude/settings.json under the env block. Both methods work.
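
The settings.json equivalent is an env block like this (a sketch - merge it into any existing settings rather than overwriting the file):

{
  "env": {
    "CLAUDE_CODE_ENABLE_TELEMETRY": "1",
    "OTEL_METRICS_EXPORTER": "otlp",
    "OTEL_EXPORTER_OTLP_ENDPOINT": "http://10.10.10.206:4318",
    "OTEL_EXPORTER_OTLP_PROTOCOL": "http/protobuf",
    "OTEL_METRIC_EXPORT_INTERVAL": "60000"
  }
}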

OTEL Collector Config (/opt/monitoring/otel-collector/config.yaml):

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 10s

exporters:
  prometheus:
    endpoint: 0.0.0.0:8889
    resource_to_telemetry_conversion:
      enabled: true

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]

Prometheus Scrape Config (add to /opt/monitoring/prometheus/prometheus.yml):

  - job_name: "claude-code"
    static_configs:
      - targets: ["otel-collector:8889"]
        labels:
          group: "claude-code"

Useful PromQL Queries:

# Total tokens by model
sum(claude_code_token_usage_tokens_total) by (model)

# Burn rate (tokens/hour)
sum(rate(claude_code_token_usage_tokens_total[1h])) * 3600

# Total cost by model
sum(claude_code_cost_usage_USD_total) by (model)

# Usage by type (input, output, cacheRead, cacheCreation)
sum(claude_code_token_usage_tokens_total) by (type)

# Projected weekly usage (rough estimate)
sum(increase(claude_code_token_usage_tokens_total[24h])) * 7

Important Notes:

  • After changing ~/.zshrc, start a new terminal/shell session before running Claude Code
  • Metrics only flow while Claude Code is running
  • Weekly subscription resets Monday 1am (America/New_York)
  • Verify env vars are set: env | grep OTEL

Added: 2026-01-16


Syncthing Monitoring

Status: ⚠️ Partial - API available, no automated monitoring

What's available:

  • Device connection status
  • Folder sync status
  • Sync errors
  • Bandwidth usage

Manual Checks:

# Check connections (Mac Mini)
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
  "http://127.0.0.1:8384/rest/system/connections" | \
  python3 -c "import sys,json; d=json.load(sys.stdin)['connections']; \
  [print(f\"{v.get('name',k[:7])}: {'UP' if v['connected'] else 'DOWN'}\") for k,v in d.items()]"

# Check folder status
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
  "http://127.0.0.1:8384/rest/db/status?folder=documents" | jq

# Check errors
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
  "http://127.0.0.1:8384/rest/folder/errors?folder=documents" | jq

Needs: Automated monitoring script + alerts
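
A minimal cron-able version of that check could look like this (a sketch - the echo at the end is a placeholder to swap for mail/Pushover/etc. once an alert channel exists):

#!/bin/bash
# Sketch: report any Syncthing device that is not currently connected (run on the Mac Mini)
API_KEY="oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5"
DOWN=$(curl -s -H "X-API-Key: $API_KEY" "http://127.0.0.1:8384/rest/system/connections" | \
  python3 -c "import sys,json; d=json.load(sys.stdin)['connections']; \
  print('\n'.join(v.get('name',k[:7]) for k,v in d.items() if not v['connected']))")
if [ -n "$DOWN" ]; then
  echo "Syncthing devices down: $DOWN"   # placeholder notification
fi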

See: SYNCTHING.md


Temperature Monitoring

Status: ⚠️ Manual only

Current Method:

# CPU temperature (Threadripper Tctl)
ssh pve 'for f in /sys/class/hwmon/hwmon*/temp*_input; do \
  label=$(cat ${f%_input}_label 2>/dev/null); \
  if [ "$label" = "Tctl" ]; then echo "PVE Tctl: $(($(cat $f)/1000))°C"; fi; done'

ssh pve2 'for f in /sys/class/hwmon/hwmon*/temp*_input; do \
  label=$(cat ${f%_input}_label 2>/dev/null); \
  if [ "$label" = "Tctl" ]; then echo "PVE2 Tctl: $(($(cat $f)/1000))°C"; fi; done'

Thresholds:

  • Healthy: 70-80°C under load
  • Warning: >85°C
  • Critical: >90°C (throttling)

Needs: Automated monitoring + alert if >85°C
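
A cron-able check against that 85°C threshold could be as simple as this (a sketch - the echo is a placeholder for whatever alert channel gets set up):

#!/bin/bash
# Sketch: warn if either node's Tctl reading exceeds 85°C
for host in pve pve2; do
  temp=$(ssh $host 'for f in /sys/class/hwmon/hwmon*/temp*_input; do
    label=$(cat ${f%_input}_label 2>/dev/null)
    [ "$label" = "Tctl" ] && echo $(($(cat $f)/1000))
  done' | head -1)
  if [ -n "$temp" ] && [ "$temp" -gt 85 ]; then
    echo "$host CPU temperature is ${temp}°C"   # placeholder notification
  fi
done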


Proxmox VM Monitoring

Status: ⚠️ Manual only

Current Access:

  • Proxmox Web UI: Node → Summary
  • CLI: ssh pve 'qm list'

Metrics Available (via Proxmox):

  • CPU usage per VM
  • RAM usage per VM
  • Disk I/O
  • Network I/O
  • VM uptime

Needs: API-based monitoring + alerts for VM down
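
Until API-based monitoring exists, a quick scripted check for stopped VMs could look like this (a sketch - it assumes every VM listed by qm should normally be running):

#!/bin/bash
# Sketch: report any VM that qm does not show as "running"
for host in pve pve2; do
  stopped=$(ssh $host "qm list | awk 'NR>1 && \$3 != \"running\" {print \$1, \$2}'")
  if [ -n "$stopped" ]; then
    echo "$host has non-running VMs:"
    echo "$stopped"
  fi
done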


Option 1: Prometheus + Grafana (Recommended)

Why:

  • Industry standard
  • Extensive integrations
  • Beautiful dashboards
  • Flexible alerting

Architecture:

Grafana (dashboard) → Prometheus (metrics DB) → Exporters (data collection)
                              ↓
                          Alertmanager (alerts)

Required Exporters:

| Exporter | Monitors | Install On |
|---|---|---|
| node_exporter | CPU, RAM, disk, network | PVE, PVE2, TrueNAS, all VMs |
| zfs_exporter | ZFS pool health | PVE, PVE2, TrueNAS |
| smartmon_exporter | Drive SMART data | PVE, PVE2, TrueNAS |
| nut_exporter | UPS metrics | PVE |
| proxmox_exporter | VM/CT stats | PVE, PVE2 |
| cadvisor | Docker containers | Saltbox, docker-host |

Deployment:

# Create monitoring VM
ssh pve 'qm create 210 --name monitoring --memory 4096 --cores 2 \
  --net0 virtio,bridge=vmbr0'

# Install Prometheus + Grafana (via Docker)
# /opt/monitoring/docker-compose.yml
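
A starting point for that compose file might be the following (a sketch - image tags, ports, and volume layout are assumptions to adjust):

# /opt/monitoring/docker-compose.yml (sketch)
version: "3.8"
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus:/etc/prometheus
      - prometheus-data:/prometheus
    restart: unless-stopped
  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager:/etc/alertmanager
    restart: unless-stopped
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana
    restart: unless-stopped
volumes:
  prometheus-data:
  grafana-data: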

Estimated Setup Time: 4-6 hours


Option 2: Uptime Kuma (Simpler Alternative)

Why:

  • Lightweight
  • Easy to set up
  • Web-based dashboard
  • Built-in alerts (email, Slack, etc.)

What it monitors:

  • HTTP/HTTPS endpoints
  • Ping (ICMP)
  • Ports (TCP)
  • Docker containers

Deployment:

# Create directory and compose file on docker-host
ssh docker-host 'mkdir -p /opt/uptime-kuma'
ssh docker-host 'cat > /opt/uptime-kuma/docker-compose.yml' << 'EOF'
version: "3.8"
services:
  uptime-kuma:
    image: louislam/uptime-kuma:latest
    ports:
      - "3001:3001"
    volumes:
      - ./data:/app/data
    restart: unless-stopped
EOF
ssh docker-host 'cd /opt/uptime-kuma && docker compose up -d'

# Access: http://10.10.10.206:3001
# Add Traefik config for uptime.htsn.io
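
If the existing Traefik instance uses a file provider for dynamic config, the uptime.htsn.io route would look roughly like this (a sketch - router/service names, entrypoint, and cert resolver are assumptions that must match the real Traefik setup):

# Sketch of a Traefik dynamic-config entry for Uptime Kuma
http:
  routers:
    uptime-kuma:
      rule: "Host(`uptime.htsn.io`)"
      entryPoints: ["websecure"]
      tls:
        certResolver: letsencrypt
      service: uptime-kuma
  services:
    uptime-kuma:
      loadBalancer:
        servers:
          - url: "http://10.10.10.206:3001"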

Estimated Setup Time: 1-2 hours


Option 3: Netdata (Real-time Monitoring)

Why:

  • Real-time metrics (1-second granularity)
  • Auto-discovers services
  • Low overhead
  • Beautiful web UI

Deployment:

# Install on each server
ssh pve 'bash <(curl -Ss https://my-netdata.io/kickstart.sh)'
ssh pve2 'bash <(curl -Ss https://my-netdata.io/kickstart.sh)'

# Access:
# http://10.10.10.120:19999 (PVE)
# http://10.10.10.102:19999 (PVE2)

Parent-Child Setup (optional):

  • Configure PVE as parent
  • Stream metrics from PVE2 → PVE
  • Single dashboard for both servers
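
That streaming relationship is set up in stream.conf on both nodes, roughly like this (a sketch - the API key is a placeholder UUID you generate yourself):

# On PVE2 (child) - /etc/netdata/stream.conf
[stream]
    enabled = yes
    destination = 10.10.10.120:19999
    api key = 11111111-2222-3333-4444-555555555555

# On PVE (parent) - /etc/netdata/stream.conf
[11111111-2222-3333-4444-555555555555]
    enabled = yes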

Estimated Setup Time: 1 hour


Critical Metrics to Monitor

Server Health

| Metric | Threshold | Action |
|---|---|---|
| CPU usage | >90% for 5 min | Alert |
| CPU temp | >85°C | Alert |
| CPU temp | >90°C | Critical alert |
| RAM usage | >95% | Alert |
| Disk space | >80% | Warning |
| Disk space | >90% | Alert |
| Load average | >CPU count | Alert |

Storage Health

| Metric | Threshold | Action |
|---|---|---|
| ZFS pool errors | >0 | Alert immediately |
| ZFS pool degraded | Any degraded vdev | Critical alert |
| ZFS scrub failed | Last scrub reported errors | Alert |
| SMART reallocated sectors | >0 | Warning |
| SMART pending sectors | >0 | Alert |
| SMART failure | Pre-fail | Critical - replace drive |

UPS

| Metric | Threshold | Action |
|---|---|---|
| Battery charge | <20% | Warning |
| Battery charge | <10% | Alert |
| On battery | >5 min | Alert |
| Runtime | <5 min | Critical |

Network

| Metric | Threshold | Action |
|---|---|---|
| Device unreachable | >2 min down | Alert |
| High packet loss | >5% | Warning |
| Bandwidth saturation | >90% | Warning |

VMs/Services

| Metric | Threshold | Action |
|---|---|---|
| VM stopped | Critical VM down | Alert immediately |
| Service unreachable | HTTP 5xx or timeout | Alert |
| Backup failed | Any backup failure | Alert |
| Certificate expiry | <30 days | Warning |
| Certificate expiry | <7 days | Alert |

Alert Destinations

Email Alerts

Recommended: Set up SMTP relay for email alerts

Options:

  1. Gmail SMTP (free, rate-limited)
  2. SendGrid (free tier: 100 emails/day)
  3. Mailgun (free tier available)
  4. Self-hosted mail server (complex)

Configuration Example (Prometheus Alertmanager):

# /etc/alertmanager/alertmanager.yml
receivers:
  - name: 'email'
    email_configs:
      - to: 'hutson@example.com'
        from: 'alerts@htsn.io'
        smarthost: 'smtp.gmail.com:587'
        auth_username: 'alerts@htsn.io'
        auth_password: 'app-password-here'
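
A receiver alone does nothing; Alertmanager also needs a route pointing at it, roughly (a minimal sketch):

# Minimal route sending everything to the email receiver (sketch)
route:
  receiver: 'email'
  group_by: ['alertname']
  group_wait: 30s
  repeat_interval: 4h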

Push Notifications

Options:

  • Pushover: $5 one-time, reliable
  • Pushbullet: Free tier available
  • Telegram Bot: Free
  • Discord Webhook: Free
  • Slack: Free tier available

Recommended: Pushover or Telegram for mobile alerts
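
If Pushover ends up being the pick, the Alertmanager side is a small receiver block (a sketch - both keys are placeholders from the Pushover app):

# Sketch of an Alertmanager Pushover receiver
receivers:
  - name: 'pushover'
    pushover_configs:
      - user_key: 'pushover-user-key-here'
        token: 'pushover-app-token-here'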


Home Assistant Alerts

Since Home Assistant is already running, use it for alerts:

Automation Example:

automation:
  - alias: "UPS Low Battery Alert"
    trigger:
      - platform: numeric_state
        entity_id: sensor.cyberpower_battery_charge
        below: 20
    action:
      - service: notify.mobile_app
        data:
          message: "⚠️ UPS battery at {{ states('sensor.cyberpower_battery_charge') }}%"

  - alias: "Server High Temperature"
    trigger:
      - platform: template
        value_template: "{{ states('sensor.pve_cpu_temp') | float(0) > 85 }}"
    action:
      - service: notify.mobile_app
        data:
          message: "🔥 PVE CPU temperature: {{ states('sensor.pve_cpu_temp') }}°C"

Needs: Sensors for CPU temp, disk space, etc. in Home Assistant
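
One way to feed those in is the command_line integration; a CPU-temp sensor could look roughly like this (a sketch using the newer command_line syntax - it assumes the Home Assistant host has key-based SSH access to pve, and the hwmon path is a placeholder to replace with the Tctl lookup from the Temperature Monitoring section):

# configuration.yaml (sketch)
command_line:
  - sensor:
      name: pve_cpu_temp
      command: "ssh root@10.10.10.120 'cat /sys/class/hwmon/hwmon0/temp1_input'"
      value_template: "{{ (value | int / 1000) | round(1) }}"
      unit_of_measurement: "°C"
      scan_interval: 120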


Monitoring Scripts

Daily Health Check

Save as ~/bin/homelab-health-check.sh:

#!/bin/bash
# Daily homelab health check

echo "=== Homelab Health Check ==="
echo "Date: $(date)"
echo ""

echo "=== Server Status ==="
ssh pve 'uptime' 2>/dev/null || echo "PVE: UNREACHABLE"
ssh pve2 'uptime' 2>/dev/null || echo "PVE2: UNREACHABLE"
echo ""

echo "=== CPU Temperatures ==="
ssh pve 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE: $(($(cat $f)/1000))°C"; fi; done'
ssh pve2 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE2: $(($(cat $f)/1000))°C"; fi; done'
echo ""

echo "=== UPS Status ==="
ssh pve 'upsc cyberpower@localhost | grep -E "battery.charge:|battery.runtime:|ups.load:|ups.status:"'
echo ""

echo "=== ZFS Pools ==="
ssh pve 'zpool status -x' 2>/dev/null
ssh pve2 'zpool status -x' 2>/dev/null
ssh truenas 'zpool status -x vault'
echo ""

echo "=== Disk Space ==="
ssh pve 'df -h | grep -E "Filesystem|/dev/(nvme|sd)"'
ssh truenas 'df -h /mnt/vault'
echo ""

echo "=== VM Status ==="
ssh pve 'qm list | grep running | wc -l' | xargs echo "PVE VMs running:"
ssh pve2 'qm list | grep running | wc -l' | xargs echo "PVE2 VMs running:"
echo ""

echo "=== Syncthing Connections ==="
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
  "http://127.0.0.1:8384/rest/system/connections" | \
  python3 -c "import sys,json; d=json.load(sys.stdin)['connections']; \
  [print(f\"{v.get('name',k[:7])}: {'UP' if v['connected'] else 'DOWN'}\") for k,v in d.items()]"
echo ""

echo "=== Check Complete ==="

Run daily:

0 9 * * * ~/bin/homelab-health-check.sh | mail -s "Homelab Health Check" hutson@example.com

ZFS Scrub Checker

#!/bin/bash
# Check last ZFS scrub status

echo "=== ZFS Scrub Status ==="

for host in pve pve2; do
  echo "--- $host ---"
  ssh $host 'zpool status | grep -A1 scrub'
  echo ""
done

echo "--- TrueNAS ---"
ssh truenas 'zpool status vault | grep -A1 scrub'

SMART Health Checker

#!/bin/bash
# Check SMART health on all drives

echo "=== SMART Health Check ==="

echo "--- TrueNAS Drives ---"
ssh truenas 'smartctl --scan | while read dev type; do
  echo "=== $dev ===";
  smartctl -H $dev | grep -E "SMART overall|PASSED|FAILED";
done'

echo "--- PVE Drives ---"
ssh pve 'for dev in /dev/nvme[0-9]n1 /dev/sd[a-z]; do
  [ -e "$dev" ] && echo "=== $dev ===" && smartctl -H $dev | grep -E "SMART|PASSED|FAILED";
done'

Dashboard Recommendations

Grafana Dashboard Layout

Page 1: Overview

  • Server uptime
  • CPU usage (all servers)
  • RAM usage (all servers)
  • Disk space (all pools)
  • Network traffic
  • UPS status

Page 2: Storage

  • ZFS pool health
  • SMART status for all drives
  • I/O latency
  • Scrub progress
  • Disk temperatures

Page 3: VMs

  • VM status (up/down)
  • VM resource usage
  • VM disk I/O
  • VM network traffic

Page 4: Services

  • Service health checks
  • HTTP response times
  • Certificate expiry dates
  • Syncthing sync status

Implementation Plan

Phase 1: Basic Monitoring (Week 1)

  • Install Uptime Kuma or Netdata
  • Add HTTP checks for all services
  • Configure UPS alerts in Home Assistant
  • Set up daily health check email

Estimated Time: 4-6 hours


Phase 2: Advanced Monitoring (Week 2-3)

  • Install Prometheus + Grafana
  • Deploy node_exporter on all servers
  • Deploy zfs_exporter
  • Deploy smartmon_exporter
  • Create Grafana dashboards
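
For the node_exporter step above, the Proxmox nodes are Debian-based, so the distro package is probably the quickest first pass (a sketch):

# Debian package listens on :9100 by default
ssh pve 'apt install -y prometheus-node-exporter'
ssh pve2 'apt install -y prometheus-node-exporter'

# Then add scrape targets to /opt/monitoring/prometheus/prometheus.yml, e.g.:
#   - job_name: "node"
#     static_configs:
#       - targets: ["10.10.10.120:9100", "10.10.10.102:9100"]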

Estimated Time: 8-12 hours


Phase 3: Alerting (Week 4)

  • Configure Alertmanager
  • Set up email/push notifications
  • Create alert rules for all critical metrics
  • Test all alert paths
  • Document alert procedures

Estimated Time: 4-6 hours



Last Updated: 2026-01-16
Status: ⚠️ Partial monitoring - Gateway and Claude Code active, other systems need implementation