Monitoring and Alerting
Documentation for system monitoring, health checks, and alerting across the homelab.
Current Monitoring Status
| Component | Monitored? | Method | Alerts | Notes |
|---|---|---|---|---|
| Gateway | ✅ Yes | Custom services | ✅ Auto-reboot | Internet watchdog + memory monitor |
| UPS | ✅ Yes | NUT + Home Assistant | ❌ No | Battery, load, runtime tracked |
| Syncthing | ✅ Partial | API (manual checks) | ❌ No | Connection status available |
| Server temps | ✅ Partial | Manual checks | ❌ No | Via sensors command |
| VM status | ✅ Partial | Proxmox UI | ❌ No | Manual monitoring |
| ZFS health | ❌ No | Manual zpool status | ❌ No | No automated checks |
| Disk health (SMART) | ❌ No | Manual smartctl | ❌ No | No automated checks |
| Network | ✅ Partial | Gateway watchdog | ✅ Auto-reboot | Connectivity check every 60s |
| Services | ❌ No | - | ❌ No | No health checks |
| Backups | ❌ No | - | ❌ No | No verification |
| Claude Code | ✅ Yes | Prometheus + Grafana | ✅ Yes | Token usage, burn rate, cost tracking |
Overall Status: ⚠️ PARTIAL - Gateway and Claude Code monitoring are active; most other components are checked manually
Existing Monitoring
UPS Monitoring (NUT)
Status: ✅ Active and working
What's monitored:
- Battery charge percentage
- Runtime remaining (seconds)
- Load percentage
- Input/output voltage
- UPS status (OL = online, OB = on battery, LB = low battery)
Access:
# Full UPS status
ssh pve 'upsc cyberpower@localhost'
# Key metrics
ssh pve 'upsc cyberpower@localhost | grep -E "battery.charge:|battery.runtime:|ups.load:|ups.status:"'
Home Assistant Integration:
- Sensors: sensor.cyberpower_* - can be used for automation/alerts
- Currently: No alerts configured
See: UPS.md
Gateway Monitoring
Status: ✅ Active with auto-recovery
Two custom systemd services monitor the UCG-Fiber gateway (10.10.10.1):
1. Internet Watchdog (internet-watchdog.service)
- Pings external DNS (1.1.1.1, 8.8.8.8, 208.67.222.222) every 60 seconds
- Auto-reboots gateway after 5 consecutive failures (~5 minutes)
- Logs to /var/log/internet-watchdog.log
2. Memory Monitor (memory-monitor.service)
- Logs memory usage and top processes every 10 minutes
- Logs to /data/logs/memory-history.log
- Auto-rotates when the log exceeds 10MB
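For reference, the internet watchdog's core behavior (item 1 above) amounts to a loop roughly like the following - a minimal sketch, not the exact script deployed on the gateway:

#!/bin/bash
# Minimal sketch of the internet watchdog loop (illustrative; the deployed script may differ)
HOSTS="1.1.1.1 8.8.8.8 208.67.222.222"
LOG=/var/log/internet-watchdog.log
FAILS=0
while true; do
  ok=0
  for h in $HOSTS; do
    ping -c1 -W2 "$h" >/dev/null 2>&1 && ok=1 && break
  done
  if [ "$ok" -eq 1 ]; then
    FAILS=0
  else
    FAILS=$((FAILS + 1))
    echo "$(date -Is) ping failed ($FAILS/5)" >> "$LOG"
    if [ "$FAILS" -ge 5 ]; then
      echo "$(date -Is) rebooting gateway" >> "$LOG"
      reboot
    fi
  fi
  sleep 60
done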
Quick Commands:
# Check service status
ssh ucg-fiber 'systemctl status internet-watchdog memory-monitor'
# View watchdog activity
ssh ucg-fiber 'tail -20 /var/log/internet-watchdog.log'
# View memory history
ssh ucg-fiber 'tail -100 /data/logs/memory-history.log'
# Current memory usage
ssh ucg-fiber 'free -m && ps -eo pid,rss,comm --sort=-rss | head -12'
See: GATEWAY.md
Claude Code Token Monitoring
Status: ✅ Active with alerts
Monitors Claude Code token usage across all machines to track subscription consumption and prevent hitting weekly limits.
Architecture:
Claude Code (MacBook/Mac Mini)
        │ (OTLP HTTP push every 60s)
        ▼
OTEL Collector (docker-host:4318)
        │ (Prometheus exporter on :8889)
        ▼
Prometheus (docker-host:9090) ─── scrapes ───► otel-collector:8889
        │
        ├──► Grafana Dashboard
        │
        └──► Alertmanager (burn rate alerts)
Note: Uses Prometheus exporter instead of Remote Write because Claude Code sends Delta temporality metrics, which Remote Write doesn't support.
Monitored Devices: All Claude Code sessions on any device automatically push metrics via OTLP.
What's monitored:
- Token usage (input/output/cache) over time
- Burn rate (tokens/hour)
- Cost tracking (USD)
- Usage by model (Opus, Sonnet, Haiku)
- Session count
- Per-device breakdown
Dashboard: https://grafana.htsn.io/d/claude-code-usage/claude-code-token-usage
Alerts Configured:
| Alert | Threshold | Severity |
|---|---|---|
| High Burn Rate | >100k tokens/hour for 15min | Warning |
| Weekly Limit Risk | Projected >5M tokens/week | Critical |
| No Metrics | Scrape fails for 5min | Info |
Configuration Files:
- Shell config: ~/.zshrc (on each Mac - synced via Syncthing)
- OTEL Collector: /opt/monitoring/otel-collector/config.yaml (docker-host)
- Alert rules: /opt/monitoring/prometheus/rules/claude-code.yml (docker-host)
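The rules file itself isn't reproduced in this doc; a sketch consistent with the alert table above and the PromQL queries below might look like:

# /opt/monitoring/prometheus/rules/claude-code.yml (illustrative sketch)
groups:
  - name: claude-code
    rules:
      - alert: ClaudeCodeHighBurnRate
        expr: sum(rate(claude_code_token_usage_tokens_total[1h])) * 3600 > 100000
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Claude Code burn rate above 100k tokens/hour"
      - alert: ClaudeCodeWeeklyLimitRisk
        expr: sum(increase(claude_code_token_usage_tokens_total[24h])) * 7 > 5000000
        labels:
          severity: critical
        annotations:
          summary: "Projected weekly Claude Code usage above 5M tokens"
      - alert: ClaudeCodeNoMetrics
        expr: up{job="claude-code"} == 0
        for: 5m
        labels:
          severity: info
        annotations:
          summary: "Prometheus cannot scrape otel-collector:8889"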
Shell Environment Setup (in ~/.zshrc):
# Claude Code OpenTelemetry Metrics (push to OTEL Collector)
export CLAUDE_CODE_ENABLE_TELEMETRY=1
export OTEL_METRICS_EXPORTER=otlp
export OTEL_EXPORTER_OTLP_ENDPOINT="http://10.10.10.206:4318"
export OTEL_EXPORTER_OTLP_PROTOCOL="http/protobuf"
export OTEL_METRIC_EXPORT_INTERVAL=60000
Note: These can be set either in shell environment (~/.zshrc) or in ~/.claude/settings.json under the env block. Both methods work.
OTEL Collector Config (/opt/monitoring/otel-collector/config.yaml):
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 10s

exporters:
  prometheus:
    endpoint: 0.0.0.0:8889
    resource_to_telemetry_conversion:
      enabled: true

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
Prometheus Scrape Config (add to /opt/monitoring/prometheus/prometheus.yml):
  - job_name: "claude-code"
    static_configs:
      - targets: ["otel-collector:8889"]
        labels:
          group: "claude-code"
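To sanity-check the pipeline end to end, you can query the collector's exporter and the Prometheus API directly. This assumes port 8889 is published on docker-host; if it is only exposed on the Docker network, run the first check from inside a container instead:

# Collector should expose claude_code_* series once a session has pushed metrics
curl -s http://localhost:8889/metrics | grep -m5 '^claude_code'
# Prometheus (docker-host:9090 per the architecture above) should report the target as up (1)
curl -s -G http://10.10.10.206:9090/api/v1/query --data-urlencode 'query=up{job="claude-code"}'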
Useful PromQL Queries:
# Total tokens by model
sum(claude_code_token_usage_tokens_total) by (model)
# Burn rate (tokens/hour)
sum(rate(claude_code_token_usage_tokens_total[1h])) * 3600
# Total cost by model
sum(claude_code_cost_usage_USD_total) by (model)
# Usage by type (input, output, cacheRead, cacheCreation)
sum(claude_code_token_usage_tokens_total) by (type)
# Projected weekly usage (rough estimate)
sum(increase(claude_code_token_usage_tokens_total[24h])) * 7
Important Notes:
- After changing ~/.zshrc, start a new terminal/shell session before running Claude Code
- Metrics only flow while Claude Code is running
- Weekly subscription resets Monday 1am (America/New_York)
- Verify env vars are set: env | grep OTEL
Added: 2026-01-16
Syncthing Monitoring
Status: ⚠️ Partial - API available, no automated monitoring
What's available:
- Device connection status
- Folder sync status
- Sync errors
- Bandwidth usage
Manual Checks:
# Check connections (Mac Mini)
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
"http://127.0.0.1:8384/rest/system/connections" | \
python3 -c "import sys,json; d=json.load(sys.stdin)['connections']; \
[print(f\"{v.get('name',k[:7])}: {'UP' if v['connected'] else 'DOWN'}\") for k,v in d.items()]"
# Check folder status
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
"http://127.0.0.1:8384/rest/db/status?folder=documents" | jq
# Check errors
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
"http://127.0.0.1:8384/rest/folder/errors?folder=documents" | jq
Needs: Automated monitoring script + alerts
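A minimal automated check could reuse the connections endpoint above and exit non-zero when any device is down - a sketch, with the alert action left as a placeholder:

#!/bin/bash
# Sketch: alert when any Syncthing device is disconnected (run from the Mac Mini)
API_KEY="oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5"
down=$(curl -s -H "X-API-Key: $API_KEY" \
  "http://127.0.0.1:8384/rest/system/connections" | \
  python3 -c "import sys,json; d=json.load(sys.stdin)['connections']; \
print(', '.join(v.get('name',k[:7]) for k,v in d.items() if not v['connected']))")
if [ -n "$down" ]; then
  echo "Syncthing devices down: $down"
  # send a notification here (mail, Pushover, Home Assistant webhook, ...)
  exit 1
fi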
See: SYNCTHING.md
Temperature Monitoring
Status: ⚠️ Manual only
Current Method:
# CPU temperature (Threadripper Tctl)
ssh pve 'for f in /sys/class/hwmon/hwmon*/temp*_input; do \
label=$(cat ${f%_input}_label 2>/dev/null); \
if [ "$label" = "Tctl" ]; then echo "PVE Tctl: $(($(cat $f)/1000))°C"; fi; done'
ssh pve2 'for f in /sys/class/hwmon/hwmon*/temp*_input; do \
label=$(cat ${f%_input}_label 2>/dev/null); \
if [ "$label" = "Tctl" ]; then echo "PVE2 Tctl: $(($(cat $f)/1000))°C"; fi; done'
Thresholds:
- Healthy: 70-80°C under load
- Warning: >85°C
- Critical: >90°C (throttling)
Needs: Automated monitoring + alert if >85°C
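Until proper exporters are in place, a small cron-able script can reuse the Tctl lookup above and complain past the warning threshold - a sketch, with alert delivery left as a placeholder:

#!/bin/bash
# Sketch: warn when either Threadripper's Tctl exceeds 85°C
THRESHOLD=85
for host in pve pve2; do
  t=$(ssh "$host" 'for f in /sys/class/hwmon/hwmon*/temp*_input; do
    [ "$(cat ${f%_input}_label 2>/dev/null)" = Tctl ] && echo $(($(cat $f)/1000)) && break
  done')
  if [ -n "$t" ] && [ "$t" -gt "$THRESHOLD" ]; then
    echo "$host Tctl at ${t}°C (threshold ${THRESHOLD}°C)"
    # send a notification here
  fi
done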
Proxmox VM Monitoring
Status: ⚠️ Manual only
Current Access:
- Proxmox Web UI: Node → Summary
- CLI: ssh pve 'qm list'
Metrics Available (via Proxmox):
- CPU usage per VM
- RAM usage per VM
- Disk I/O
- Network I/O
- VM uptime
Needs: API-based monitoring + alerts for VM down
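Pending a proper exporter, a basic up/down check against the Proxmox CLI might look like this - the VM IDs below are hypothetical placeholders, substitute the IDs you consider critical:

#!/bin/bash
# Sketch: verify that critical VMs report "running" (IDs are placeholders)
CRITICAL_PVE_VMS="100 101"
for id in $CRITICAL_PVE_VMS; do
  state=$(ssh pve "qm status $id 2>/dev/null" | awk '{print $2}')
  if [ "$state" != "running" ]; then
    echo "VM $id on pve is ${state:-unknown} (expected: running)"
    # send a notification here
  fi
done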
Recommended Monitoring Stack
Option 1: Prometheus + Grafana (Recommended)
Why:
- Industry standard
- Extensive integrations
- Beautiful dashboards
- Flexible alerting
Architecture:
Grafana (dashboard) → Prometheus (metrics DB) → Exporters (data collection)
                            │
                            ▼
                    Alertmanager (alerts)
Required Exporters:
| Exporter | Monitors | Install On |
|---|---|---|
| node_exporter | CPU, RAM, disk, network | PVE, PVE2, TrueNAS, all VMs |
| zfs_exporter | ZFS pool health | PVE, PVE2, TrueNAS |
| smartmon_exporter | Drive SMART data | PVE, PVE2, TrueNAS |
| nut_exporter | UPS metrics | PVE |
| proxmox_exporter | VM/CT stats | PVE, PVE2 |
| cadvisor | Docker containers | Saltbox, docker-host |
Deployment:
# Create monitoring VM
ssh pve 'qm create 210 --name monitoring --memory 4096 --cores 2 \
--net0 virtio,bridge=vmbr0'
# Install Prometheus + Grafana (via Docker)
# /opt/monitoring/docker-compose.yml
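The compose file referenced above isn't written yet; a minimal sketch (image tags, ports, and volume layout are assumptions to adjust) could be:

# /opt/monitoring/docker-compose.yml - minimal sketch, not a tuned deployment
services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus:/etc/prometheus
      - prometheus-data:/prometheus
    ports:
      - "9090:9090"
    restart: unless-stopped
  alertmanager:
    image: prom/alertmanager:latest
    volumes:
      - ./alertmanager:/etc/alertmanager
    ports:
      - "9093:9093"
    restart: unless-stopped
  grafana:
    image: grafana/grafana:latest
    volumes:
      - grafana-data:/var/lib/grafana
    ports:
      - "3000:3000"
    restart: unless-stopped
volumes:
  prometheus-data:
  grafana-data: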
Estimated Setup Time: 4-6 hours
Option 2: Uptime Kuma (Simpler Alternative)
Why:
- Lightweight
- Easy to set up
- Web-based dashboard
- Built-in alerts (email, Slack, etc.)
What it monitors:
- HTTP/HTTPS endpoints
- Ping (ICMP)
- Ports (TCP)
- Docker containers
Deployment:
ssh docker-host 'mkdir -p /opt/uptime-kuma'
# On docker-host, create /opt/uptime-kuma/docker-compose.yml:
cat > /opt/uptime-kuma/docker-compose.yml << 'EOF'
version: "3.8"
services:
uptime-kuma:
image: louislam/uptime-kuma:latest
ports:
- "3001:3001"
volumes:
- ./data:/app/data
restart: unless-stopped
EOF
cd /opt/uptime-kuma && docker compose up -d   # or 'docker-compose up -d' on older installs
# Access: http://10.10.10.206:3001
# Add Traefik config for uptime.htsn.io (sketch below)
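For the Traefik piece, a hypothetical dynamic-config snippet (file provider) could look like the following; the entrypoint and certresolver names are assumptions and need to match the existing Traefik setup:

# Hypothetical Traefik dynamic config for uptime.htsn.io (adjust names to your setup)
http:
  routers:
    uptime-kuma:
      rule: "Host(`uptime.htsn.io`)"
      entryPoints:
        - websecure            # assumed entrypoint name
      service: uptime-kuma
      tls:
        certResolver: letsencrypt   # assumed resolver name
  services:
    uptime-kuma:
      loadBalancer:
        servers:
          - url: "http://10.10.10.206:3001"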
Estimated Setup Time: 1-2 hours
Option 3: Netdata (Real-time Monitoring)
Why:
- Real-time metrics (1-second granularity)
- Auto-discovers services
- Low overhead
- Beautiful web UI
Deployment:
# Install on each server
ssh pve 'bash <(curl -Ss https://my-netdata.io/kickstart.sh)'
ssh pve2 'bash <(curl -Ss https://my-netdata.io/kickstart.sh)'
# Access:
# http://10.10.10.120:19999 (PVE)
# http://10.10.10.102:19999 (PVE2)
Parent-Child Setup (optional):
- Configure PVE as parent
- Stream metrics from PVE2 → PVE
- Single dashboard for both servers
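Streaming as described above is configured in /etc/netdata/stream.conf on both ends; a sketch (the API key is just an example UUID - generate your own with uuidgen):

# On PVE2 (child) - /etc/netdata/stream.conf
[stream]
    enabled = yes
    destination = 10.10.10.120:19999
    api key = 11111111-2222-3333-4444-555555555555

# On PVE (parent) - /etc/netdata/stream.conf
[11111111-2222-3333-4444-555555555555]
    enabled = yes

Restart netdata on both nodes afterwards (systemctl restart netdata).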
Estimated Setup Time: 1 hour
Critical Metrics to Monitor
Server Health
| Metric | Threshold | Action |
|---|---|---|
| CPU usage | >90% for 5 min | Alert |
| CPU temp | >85°C | Alert |
| CPU temp | >90°C | Critical alert |
| RAM usage | >95% | Alert |
| Disk space | >80% | Warning |
| Disk space | >90% | Alert |
| Load average | >CPU count | Alert |
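Once node_exporter is deployed (Phase 2 of the plan below), a couple of these thresholds translate into Prometheus rules along these lines - illustrative only, metric names assume node_exporter defaults:

groups:
  - name: server-health
    rules:
      - alert: HighCpuUsage
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
        for: 5m
        labels:
          severity: warning
      - alert: DiskSpaceLow
        expr: (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes) * 100 > 90
        labels:
          severity: warning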
Storage Health
| Metric | Threshold | Action |
|---|---|---|
| ZFS pool errors | >0 | Alert immediately |
| ZFS pool degraded | Any degraded vdev | Critical alert |
| ZFS scrub failed | Last scrub error | Alert |
| SMART reallocated sectors | >0 | Warning |
| SMART pending sectors | >0 | Alert |
| SMART failure | Pre-fail | Critical - replace drive |
UPS
| Metric | Threshold | Action |
|---|---|---|
| Battery charge | <20% | Warning |
| Battery charge | <10% | Alert |
| On battery | >5 min | Alert |
| Runtime | <5 min | Critical |
Network
| Metric | Threshold | Action |
|---|---|---|
| Device unreachable | >2 min down | Alert |
| High packet loss | >5% | Warning |
| Bandwidth saturation | >90% | Warning |
VMs/Services
| Metric | Threshold | Action |
|---|---|---|
| VM stopped | Critical VM down | Alert immediately |
| Service unreachable | HTTP 5xx or timeout | Alert |
| Backup failed | Any backup failure | Alert |
| Certificate expiry | <30 days | Warning |
| Certificate expiry | <7 days | Alert |
Alert Destinations
Email Alerts
Recommended: Set up SMTP relay for email alerts
Options:
- Gmail SMTP (free, rate-limited)
- SendGrid (free tier: 100 emails/day)
- Mailgun (free tier available)
- Self-hosted mail server (complex)
Configuration Example (Prometheus Alertmanager):
# /etc/alertmanager/alertmanager.yml
route:
  receiver: 'email'   # default route: send everything to the email receiver

receivers:
  - name: 'email'
    email_configs:
      - to: 'hutson@example.com'
        from: 'alerts@htsn.io'
        smarthost: 'smtp.gmail.com:587'
        auth_username: 'alerts@htsn.io'
        auth_password: 'app-password-here'
Push Notifications
Options:
- Pushover: $5 one-time, reliable
- Pushbullet: Free tier available
- Telegram Bot: Free
- Discord Webhook: Free
- Slack: Free tier available
Recommended: Pushover or Telegram for mobile alerts
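For reference, sending a Pushover message is a single HTTP POST; the token and user key below are placeholders from your Pushover application and account:

# Pushover test notification (APP_TOKEN / USER_KEY are placeholders)
curl -s \
  -F "token=APP_TOKEN" \
  -F "user=USER_KEY" \
  -F "title=Homelab alert" \
  -F "message=PVE CPU temperature above 85°C" \
  https://api.pushover.net/1/messages.json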
Home Assistant Alerts
Since Home Assistant is already running, use it for alerts:
Automation Example:
automation:
  - alias: "UPS Low Battery Alert"
    trigger:
      - platform: numeric_state
        entity_id: sensor.cyberpower_battery_charge
        below: 20
    action:
      - service: notify.mobile_app
        data:
          message: "⚠️ UPS battery at {{ states('sensor.cyberpower_battery_charge') }}%"

  - alias: "Server High Temperature"
    trigger:
      - platform: template
        value_template: "{{ states('sensor.pve_cpu_temp') | float(0) > 85 }}"
    action:
      - service: notify.mobile_app
        data:
          message: "🔥 PVE CPU temperature: {{ states('sensor.pve_cpu_temp') }}°C"
Needs: Sensors for CPU temp, disk space, etc. in Home Assistant
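One way to get a CPU-temp sensor into Home Assistant without a full exporter stack is a command_line sensor that shells out to PVE. A sketch using recent command_line syntax, assuming the Home Assistant install can run ssh and has a key at /config/.ssh/id_ed25519 with access to the PVE host:

# configuration.yaml - sketch of a CPU temperature sensor for PVE
command_line:
  - sensor:
      name: PVE CPU Temp
      unit_of_measurement: "°C"
      scan_interval: 300
      command: >
        ssh -i /config/.ssh/id_ed25519 -o StrictHostKeyChecking=accept-new root@10.10.10.120
        'for f in /sys/class/hwmon/hwmon*/temp*_input; do
        [ "$(cat ${f%_input}_label 2>/dev/null)" = Tctl ] && cat $f && break; done'
      value_template: "{{ ((value | int) / 1000) | round(1) }}"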
Monitoring Scripts
Daily Health Check
Save as ~/bin/homelab-health-check.sh:
#!/bin/bash
# Daily homelab health check
echo "=== Homelab Health Check ==="
echo "Date: $(date)"
echo ""
echo "=== Server Status ==="
ssh pve 'uptime' 2>/dev/null || echo "PVE: UNREACHABLE"
ssh pve2 'uptime' 2>/dev/null || echo "PVE2: UNREACHABLE"
echo ""
echo "=== CPU Temperatures ==="
ssh pve 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE: $(($(cat $f)/1000))°C"; fi; done'
ssh pve2 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE2: $(($(cat $f)/1000))°C"; fi; done'
echo ""
echo "=== UPS Status ==="
ssh pve 'upsc cyberpower@localhost | grep -E "battery.charge:|battery.runtime:|ups.load:|ups.status:"'
echo ""
echo "=== ZFS Pools ==="
ssh pve 'zpool status -x' 2>/dev/null
ssh pve2 'zpool status -x' 2>/dev/null
ssh truenas 'zpool status -x vault'
echo ""
echo "=== Disk Space ==="
ssh pve 'df -h | grep -E "Filesystem|/dev/(nvme|sd)"'
ssh truenas 'df -h /mnt/vault'
echo ""
echo "=== VM Status ==="
ssh pve 'qm list | grep running | wc -l' | xargs echo "PVE VMs running:"
ssh pve2 'qm list | grep running | wc -l' | xargs echo "PVE2 VMs running:"
echo ""
echo "=== Syncthing Connections ==="
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
"http://127.0.0.1:8384/rest/system/connections" | \
python3 -c "import sys,json; d=json.load(sys.stdin)['connections']; \
[print(f\"{v.get('name',k[:7])}: {'UP' if v['connected'] else 'DOWN'}\") for k,v in d.items()]"
echo ""
echo "=== Check Complete ==="
Run daily:
0 9 * * * ~/bin/homelab-health-check.sh | mail -s "Homelab Health Check" hutson@example.com
ZFS Scrub Checker
#!/bin/bash
# Check last ZFS scrub status
echo "=== ZFS Scrub Status ==="
for host in pve pve2; do
echo "--- $host ---"
ssh $host 'zpool status | grep -A1 scrub'
echo ""
done
echo "--- TrueNAS ---"
ssh truenas 'zpool status vault | grep -A1 scrub'
SMART Health Checker
#!/bin/bash
# Check SMART health on all drives
echo "=== SMART Health Check ==="
echo "--- TrueNAS Drives ---"
ssh truenas 'smartctl --scan | while read dev type; do
echo "=== $dev ===";
smartctl -H $dev | grep -E "SMART overall|PASSED|FAILED";
done'
echo "--- PVE Drives ---"
ssh pve 'for dev in /dev/nvme[0-9]n1 /dev/sd[a-z]; do   # whole disks only, skip partitions
[ -e "$dev" ] && echo "=== $dev ===" && smartctl -H $dev | grep -E "SMART|PASSED|FAILED";
done'
Dashboard Recommendations
Grafana Dashboard Layout
Page 1: Overview
- Server uptime
- CPU usage (all servers)
- RAM usage (all servers)
- Disk space (all pools)
- Network traffic
- UPS status
Page 2: Storage
- ZFS pool health
- SMART status for all drives
- I/O latency
- Scrub progress
- Disk temperatures
Page 3: VMs
- VM status (up/down)
- VM resource usage
- VM disk I/O
- VM network traffic
Page 4: Services
- Service health checks
- HTTP response times
- Certificate expiry dates
- Syncthing sync status
Implementation Plan
Phase 1: Basic Monitoring (Week 1)
- Install Uptime Kuma or Netdata
- Add HTTP checks for all services
- Configure UPS alerts in Home Assistant
- Set up daily health check email
Estimated Time: 4-6 hours
Phase 2: Advanced Monitoring (Week 2-3)
- Install Prometheus + Grafana
- Deploy node_exporter on all servers
- Deploy zfs_exporter
- Deploy smartmon_exporter
- Create Grafana dashboards
Estimated Time: 8-12 hours
Phase 3: Alerting (Week 4)
- Configure Alertmanager
- Set up email/push notifications
- Create alert rules for all critical metrics
- Test all alert paths
- Document alert procedures
Estimated Time: 4-6 hours
Related Documentation
- GATEWAY.md - Gateway monitoring and troubleshooting
- UPS.md - UPS monitoring details
- STORAGE.md - ZFS health checks
- SERVICES.md - Service inventory
- HOMEASSISTANT.md - Home Assistant automations
- MAINTENANCE.md - Regular maintenance checks
Last Updated: 2026-01-02
Status: ⚠️ Partial monitoring - Gateway and Claude Code active, other systems need implementation