Monitoring and Alerting

Documentation for system monitoring, health checks, and alerting across the homelab.

Current Monitoring Status

| Component | Monitored? | Method | Alerts | Notes |
|---|---|---|---|---|
| UPS | Yes | NUT + Home Assistant | No | Battery, load, runtime tracked |
| Syncthing | Partial | API (manual checks) | No | Connection status available |
| Server temps | Partial | Manual checks | No | Via sensors command |
| VM status | Partial | Proxmox UI | No | Manual monitoring |
| ZFS health | No | Manual zpool status | No | No automated checks |
| Disk health (SMART) | No | Manual smartctl | No | No automated checks |
| Network | No | - | No | No uptime monitoring |
| Services | No | - | No | No health checks |
| Backups | No | - | No | No verification |

Overall Status: ⚠️ MINIMAL - Most monitoring is manual; no automated alerts are configured


Existing Monitoring

UPS Monitoring (NUT)

Status: ✅ Active and working

What's monitored:

  • Battery charge percentage
  • Runtime remaining (seconds)
  • Load percentage
  • Input/output voltage
  • UPS status (OL/OB/LB)

Access:

# Full UPS status
ssh pve 'upsc cyberpower@localhost'

# Key metrics
ssh pve 'upsc cyberpower@localhost | grep -E "battery.charge:|battery.runtime:|ups.load:|ups.status:"'

Home Assistant Integration:

  • Sensors: sensor.cyberpower_*
  • Can be used for automation/alerts
  • Currently: No alerts configured

See: UPS.md


Syncthing Monitoring

Status: ⚠️ Partial - API available, no automated monitoring

What's available:

  • Device connection status
  • Folder sync status
  • Sync errors
  • Bandwidth usage

Manual Checks:

# Check connections (Mac Mini)
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
  "http://127.0.0.1:8384/rest/system/connections" | \
  python3 -c "import sys,json; d=json.load(sys.stdin)['connections']; \
  [print(f\"{v.get('name',k[:7])}: {'UP' if v['connected'] else 'DOWN'}\") for k,v in d.items()]"

# Check folder status
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
  "http://127.0.0.1:8384/rest/db/status?folder=documents" | jq

# Check errors
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
  "http://127.0.0.1:8384/rest/folder/errors?folder=documents" | jq

Needs: Automated monitoring script + alerts
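
A cron-able sketch of that check, reusing the same local endpoint and API key as above (the echo is a placeholder for whatever alert channel gets wired up):

#!/bin/bash
# Alert when any Syncthing device is disconnected (sketch)
API_KEY="oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5"
DOWN=$(curl -s -H "X-API-Key: $API_KEY" \
  "http://127.0.0.1:8384/rest/system/connections" | \
  python3 -c "import sys,json; d=json.load(sys.stdin)['connections']; \
  print('\n'.join(v.get('name',k[:7]) for k,v in d.items() if not v['connected']))")

if [ -n "$DOWN" ]; then
  echo "Syncthing devices down: $DOWN"  # replace with mail/push notification
fi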

See: SYNCTHING.md


Temperature Monitoring

Status: ⚠️ Manual only

Current Method:

# CPU temperature (Threadripper Tctl)
ssh pve 'for f in /sys/class/hwmon/hwmon*/temp*_input; do \
  label=$(cat ${f%_input}_label 2>/dev/null); \
  if [ "$label" = "Tctl" ]; then echo "PVE Tctl: $(($(cat $f)/1000))°C"; fi; done'

ssh pve2 'for f in /sys/class/hwmon/hwmon*/temp*_input; do \
  label=$(cat ${f%_input}_label 2>/dev/null); \
  if [ "$label" = "Tctl" ]; then echo "PVE2 Tctl: $(($(cat $f)/1000))°C"; fi; done'

Thresholds:

  • Healthy: 70-80°C under load
  • Warning: >85°C
  • Critical: >90°C (throttling)

Needs: Automated monitoring + alert if >85°C
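
A cron-able sketch of that check, reusing the Tctl loop above (the echo is a placeholder for the alert action):

#!/bin/bash
# Alert when CPU temperature exceeds threshold (sketch)
THRESHOLD=85
for host in pve pve2; do
  temp=$(ssh $host 'for f in /sys/class/hwmon/hwmon*/temp*_input; do
    label=$(cat ${f%_input}_label 2>/dev/null);
    if [ "$label" = "Tctl" ]; then echo $(($(cat $f)/1000)); fi; done')
  if [ -n "$temp" ] && [ "$temp" -gt "$THRESHOLD" ]; then
    echo "$host CPU at ${temp}°C (threshold ${THRESHOLD}°C)"  # replace with alert
  fi
done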


Proxmox VM Monitoring

Status: ⚠️ Manual only

Current Access:

  • Proxmox Web UI: Node → Summary
  • CLI: ssh pve 'qm list'

Metrics Available (via Proxmox):

  • CPU usage per VM
  • RAM usage per VM
  • Disk I/O
  • Network I/O
  • VM uptime

Needs: API-based monitoring + alerts for VM down
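
Until that exists, a cron-able sketch that flags stopped VMs (the VM IDs are placeholders; list the ones that matter):

#!/bin/bash
# Flag critical VMs that are not running (sketch; IDs are placeholders)
CRITICAL_VMS="100 101"
for id in $CRITICAL_VMS; do
  status=$(ssh pve "qm status $id" 2>/dev/null | awk '{print $2}')
  if [ "$status" != "running" ]; then
    echo "VM $id on PVE is ${status:-unknown}"  # replace with alert
  fi
done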


Recommended Monitoring Solutions

Option 1: Prometheus + Grafana

Why:

  • Industry standard
  • Extensive integrations
  • Beautiful dashboards
  • Flexible alerting

Architecture:

Grafana (dashboard) → Prometheus (metrics DB) → Exporters (data collection)
                              ↓
                          Alertmanager (alerts)

Required Exporters:

| Exporter | Monitors | Install On |
|---|---|---|
| node_exporter | CPU, RAM, disk, network | PVE, PVE2, TrueNAS, all VMs |
| zfs_exporter | ZFS pool health | PVE, PVE2, TrueNAS |
| smartmon_exporter | Drive SMART data | PVE, PVE2, TrueNAS |
| nut_exporter | UPS metrics | PVE |
| proxmox_exporter | VM/CT stats | PVE, PVE2 |
| cadvisor | Docker containers | Saltbox, docker-host |
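
A minimal prometheus.yml scrape sketch for the node_exporter targets (IPs are the PVE/PVE2 addresses used elsewhere in this doc; 9100 is node_exporter's default port):

# prometheus.yml (sketch)
scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets:
          - '10.10.10.120:9100'  # PVE
          - '10.10.10.102:9100'  # PVE2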

Deployment:

# Create monitoring VM
ssh pve 'qm create 210 --name monitoring --memory 4096 --cores 2 \
  --net0 virtio,bridge=vmbr0'

# Install Prometheus + Grafana (via Docker)
# /opt/monitoring/docker-compose.yml
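
A minimal sketch of that compose file (official images; ports and volumes are illustrative):

# /opt/monitoring/docker-compose.yml (sketch)
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    restart: unless-stopped
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    volumes:
      - ./grafana-data:/var/lib/grafana
    restart: unless-stopped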

Estimated Setup Time: 4-6 hours


Option 2: Uptime Kuma (Simpler Alternative)

Why:

  • Lightweight
  • Easy to set up
  • Web-based dashboard
  • Built-in alerts (email, Slack, etc.)

What it monitors:

  • HTTP/HTTPS endpoints
  • Ping (ICMP)
  • Ports (TCP)
  • Docker containers

Deployment:

# Create directory and compose file on docker-host
ssh docker-host 'mkdir -p /opt/uptime-kuma'
ssh docker-host 'cat > /opt/uptime-kuma/docker-compose.yml' << 'EOF'
version: "3.8"
services:
  uptime-kuma:
    image: louislam/uptime-kuma:latest
    ports:
      - "3001:3001"
    volumes:
      - ./data:/app/data
    restart: unless-stopped
EOF

# Start it
ssh docker-host 'cd /opt/uptime-kuma && docker compose up -d'

# Access: http://10.10.10.206:3001
# Add Traefik config for uptime.htsn.io
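
For the Traefik piece, a file-provider sketch (router/service names and the backend URL are assumptions; match your existing Traefik entrypoints and TLS setup):

# traefik dynamic config (sketch)
http:
  routers:
    uptime:
      rule: "Host(`uptime.htsn.io`)"
      service: uptime
  services:
    uptime:
      loadBalancer:
        servers:
          - url: "http://10.10.10.206:3001"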

Estimated Setup Time: 1-2 hours


Option 3: Netdata (Real-time Monitoring)

Why:

  • Real-time metrics (1-second granularity)
  • Auto-discovers services
  • Low overhead
  • Beautiful web UI

Deployment:

# Install on each server
ssh pve 'bash <(curl -Ss https://my-netdata.io/kickstart.sh)'
ssh pve2 'bash <(curl -Ss https://my-netdata.io/kickstart.sh)'

# Access:
# http://10.10.10.120:19999 (PVE)
# http://10.10.10.102:19999 (PVE2)

Parent-Child Setup (optional):

  • Configure PVE as parent
  • Stream metrics from PVE2 → PVE
  • Single dashboard for both servers
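
A sketch of that setup in /etc/netdata/stream.conf on each node (the API key is any UUID you generate; it just has to match on both sides):

# On PVE2 (child) - /etc/netdata/stream.conf
[stream]
    enabled = yes
    destination = 10.10.10.120:19999
    api key = 11111111-2222-3333-4444-555555555555

# On PVE (parent) - same file, accept that key
[11111111-2222-3333-4444-555555555555]
    enabled = yes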

Estimated Setup Time: 1 hour


Critical Metrics to Monitor

Server Health

| Metric | Threshold | Action |
|---|---|---|
| CPU usage | >90% for 5 min | Alert |
| CPU temp | >85°C | Alert |
| CPU temp | >90°C | Critical alert |
| RAM usage | >95% | Alert |
| Disk space | >80% | Warning |
| Disk space | >90% | Alert |
| Load average | >CPU count | Alert |

Storage Health

| Metric | Threshold | Action |
|---|---|---|
| ZFS pool errors | >0 | Alert immediately |
| ZFS pool degraded | Any degraded vdev | Critical alert |
| ZFS scrub failed | Last scrub error | Alert |
| SMART reallocated sectors | >0 | Warning |
| SMART pending sectors | >0 | Alert |
| SMART failure | Pre-fail | Critical - replace drive |

UPS

| Metric | Threshold | Action |
|---|---|---|
| Battery charge | <20% | Warning |
| Battery charge | <10% | Alert |
| On battery | >5 min | Alert |
| Runtime | <5 min | Critical |

Network

| Metric | Threshold | Action |
|---|---|---|
| Device unreachable | >2 min down | Alert |
| High packet loss | >5% | Warning |
| Bandwidth saturation | >90% | Warning |

VMs/Services

| Metric | Threshold | Action |
|---|---|---|
| VM stopped | Critical VM down | Alert immediately |
| Service unreachable | HTTP 5xx or timeout | Alert |
| Backup failed | Any backup failure | Alert |
| Certificate expiry | <30 days | Warning |
| Certificate expiry | <7 days | Alert |
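
With the Prometheus stack from Option 1, a couple of these thresholds translate into alert rules like the sketch below (metric names are node_exporter's standard ones):

# alert-rules.yml (sketch)
groups:
  - name: homelab
    rules:
      - alert: DiskSpaceCritical
        expr: (1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) > 0.90
        for: 5m
        labels:
          severity: critical
      - alert: HostDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical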

Alert Destinations

Email Alerts

Recommended: Set up SMTP relay for email alerts

Options:

  1. Gmail SMTP (free, rate-limited)
  2. SendGrid (free tier: 100 emails/day)
  3. Mailgun (free tier available)
  4. Self-hosted mail server (complex)

Configuration Example (Prometheus Alertmanager):

# /etc/alertmanager/alertmanager.yml
route:
  receiver: 'email'   # default route; Alertmanager requires a top-level route

receivers:
  - name: 'email'
    email_configs:
      - to: 'hutson@example.com'
        from: 'alerts@htsn.io'
        smarthost: 'smtp.gmail.com:587'
        auth_username: 'alerts@htsn.io'
        auth_password: 'app-password-here'

Push Notifications

Options:

  • Pushover: $5 one-time, reliable
  • Pushbullet: Free tier available
  • Telegram Bot: Free
  • Discord Webhook: Free
  • Slack: Free tier available

Recommended: Pushover or Telegram for mobile alerts
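
Telegram is a single HTTP call once a bot exists (token and chat ID come from @BotFather and are placeholders here):

# Telegram push (sketch; <TOKEN> and <CHAT_ID> are placeholders)
curl -s "https://api.telegram.org/bot<TOKEN>/sendMessage" \
  -d chat_id=<CHAT_ID> \
  -d text="Homelab alert: UPS on battery"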


Home Assistant Alerts

Since Home Assistant is already running, use it for alerts:

Automation Example:

automation:
  - alias: "UPS Low Battery Alert"
    trigger:
      - platform: numeric_state
        entity_id: sensor.cyberpower_battery_charge
        below: 20
    action:
      - service: notify.mobile_app
        data:
          message: "⚠️ UPS battery at {{ states('sensor.cyberpower_battery_charge') }}%"

  - alias: "Server High Temperature"
    trigger:
      - platform: numeric_state
        entity_id: sensor.pve_cpu_temp
        above: 85
    action:
      - service: notify.mobile_app
        data:
          message: "🔥 PVE CPU temperature: {{ states('sensor.pve_cpu_temp') }}°C"

Needs: Sensors for CPU temp, disk space, etc. in Home Assistant
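
One way to feed those in is Home Assistant's command_line integration (a sketch; assumes key-based SSH from the HA host to pve):

# configuration.yaml (sketch)
command_line:
  - sensor:
      name: pve_cpu_temp
      command: >-
        ssh pve 'for f in /sys/class/hwmon/hwmon*/temp*_input; do
        label=$(cat ${f%_input}_label 2>/dev/null);
        if [ "$label" = "Tctl" ]; then echo $(($(cat $f)/1000)); fi; done'
      unit_of_measurement: "°C"
      scan_interval: 300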


Monitoring Scripts

Daily Health Check

Save as ~/bin/homelab-health-check.sh:

#!/bin/bash
# Daily homelab health check

echo "=== Homelab Health Check ==="
echo "Date: $(date)"
echo ""

echo "=== Server Status ==="
ssh pve 'uptime' 2>/dev/null || echo "PVE: UNREACHABLE"
ssh pve2 'uptime' 2>/dev/null || echo "PVE2: UNREACHABLE"
echo ""

echo "=== CPU Temperatures ==="
ssh pve 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE: $(($(cat $f)/1000))°C"; fi; done'
ssh pve2 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE2: $(($(cat $f)/1000))°C"; fi; done'
echo ""

echo "=== UPS Status ==="
ssh pve 'upsc cyberpower@localhost | grep -E "battery.charge:|battery.runtime:|ups.load:|ups.status:"'
echo ""

echo "=== ZFS Pools ==="
ssh pve 'zpool status -x' 2>/dev/null
ssh pve2 'zpool status -x' 2>/dev/null
ssh truenas 'zpool status -x vault'
echo ""

echo "=== Disk Space ==="
ssh pve 'df -h | grep -E "Filesystem|/dev/(nvme|sd)"'
ssh truenas 'df -h /mnt/vault'
echo ""

echo "=== VM Status ==="
ssh pve 'qm list | grep running | wc -l' | xargs echo "PVE VMs running:"
ssh pve2 'qm list | grep running | wc -l' | xargs echo "PVE2 VMs running:"
echo ""

echo "=== Syncthing Connections ==="
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
  "http://127.0.0.1:8384/rest/system/connections" | \
  python3 -c "import sys,json; d=json.load(sys.stdin)['connections']; \
  [print(f\"{v.get('name',k[:7])}: {'UP' if v['connected'] else 'DOWN'}\") for k,v in d.items()]"
echo ""

echo "=== Check Complete ==="

Run daily:

0 9 * * * ~/bin/homelab-health-check.sh | mail -s "Homelab Health Check" hutson@example.com

ZFS Scrub Checker

#!/bin/bash
# Check last ZFS scrub status

echo "=== ZFS Scrub Status ==="

for host in pve pve2; do
  echo "--- $host ---"
  ssh $host 'zpool status | grep -A1 scrub'
  echo ""
done

echo "--- TrueNAS ---"
ssh truenas 'zpool status vault | grep -A1 scrub'

SMART Health Checker

#!/bin/bash
# Check SMART health on all drives

echo "=== SMART Health Check ==="

echo "--- TrueNAS Drives ---"
ssh truenas 'smartctl --scan | while read dev type; do
  echo "=== $dev ===";
  smartctl -H $dev | grep -E "SMART overall|PASSED|FAILED";
done'

echo "--- PVE Drives ---"
# Glob matches whole disks only (skips nvme controller nodes and partitions)
ssh pve 'for dev in /dev/nvme?n1 /dev/sd[a-z]; do
  [ -e "$dev" ] && echo "=== $dev ===" && smartctl -H $dev | grep -E "SMART|PASSED|FAILED";
done'

Dashboard Recommendations

Grafana Dashboard Layout

Page 1: Overview

  • Server uptime
  • CPU usage (all servers)
  • RAM usage (all servers)
  • Disk space (all pools)
  • Network traffic
  • UPS status

Page 2: Storage

  • ZFS pool health
  • SMART status for all drives
  • I/O latency
  • Scrub progress
  • Disk temperatures

Page 3: VMs

  • VM status (up/down)
  • VM resource usage
  • VM disk I/O
  • VM network traffic

Page 4: Services

  • Service health checks
  • HTTP response times
  • Certificate expiry dates
  • Syncthing sync status

Implementation Plan

Phase 1: Basic Monitoring (Week 1)

  • Install Uptime Kuma or Netdata
  • Add HTTP checks for all services
  • Configure UPS alerts in Home Assistant
  • Set up daily health check email

Estimated Time: 4-6 hours


Phase 2: Advanced Monitoring (Week 2-3)

  • Install Prometheus + Grafana
  • Deploy node_exporter on all servers
  • Deploy zfs_exporter
  • Deploy smartmon_exporter
  • Create Grafana dashboards

Estimated Time: 8-12 hours


Phase 3: Alerting (Week 4)

  • Configure Alertmanager
  • Set up email/push notifications
  • Create alert rules for all critical metrics
  • Test all alert paths
  • Document alert procedures

Estimated Time: 4-6 hours



Last Updated: 2025-12-22
Status: ⚠️ Minimal monitoring currently in place - implementation needed