Monitoring and Alerting

Documentation for system monitoring, health checks, and alerting across the homelab.

Current Monitoring Status

| Component | Monitored? | Method | Alerts | Notes |
|---|---|---|---|---|
| UPS | Yes | NUT + Home Assistant | No | Battery, load, runtime tracked |
| Syncthing | Partial | API (manual checks) | No | Connection status available |
| Server temps | Partial | Manual checks | No | Via sensors command |
| VM status | Partial | Proxmox UI | No | Manual monitoring |
| ZFS health | No | Manual zpool status | No | No automated checks |
| Disk health (SMART) | No | Manual smartctl | No | No automated checks |
| Network | No | - | No | No uptime monitoring |
| Services | No | - | No | No health checks |
| Backups | No | - | No | No verification |

Overall Status: ⚠️ MINIMAL - Most monitoring is manual; no automated alerts are configured


Existing Monitoring

UPS Monitoring (NUT)

Status: ✅ Active and working

What's monitored:

  • Battery charge percentage
  • Runtime remaining (seconds)
  • Load percentage
  • Input/output voltage
  • UPS status (OL/OB/LB)

Access:

# Full UPS status
ssh pve 'upsc cyberpower@localhost'

# Key metrics
ssh pve 'upsc cyberpower@localhost | grep -E "battery.charge:|battery.runtime:|ups.load:|ups.status:"'

Home Assistant Integration:

  • Sensors: sensor.cyberpower_*
  • Can be used for automation/alerts
  • Currently: No alerts configured

See: UPS.md


Syncthing Monitoring

Status: ⚠️ Partial - API available, no automated monitoring

What's available:

  • Device connection status
  • Folder sync status
  • Sync errors
  • Bandwidth usage

Manual Checks:

# Check connections (Mac Mini)
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
  "http://127.0.0.1:8384/rest/system/connections" | \
  python3 -c "import sys,json; d=json.load(sys.stdin)['connections']; \
  [print(f\"{v.get('name',k[:7])}: {'UP' if v['connected'] else 'DOWN'}\") for k,v in d.items()]"

# Check folder status
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
  "http://127.0.0.1:8384/rest/db/status?folder=documents" | jq

# Check errors
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
  "http://127.0.0.1:8384/rest/folder/errors?folder=documents" | jq

Needs: Automated monitoring script + alerts
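
A cron-able sketch of that check, reusing the same local endpoint and API key as above (the echo is a placeholder for whatever alert channel gets wired up):

#!/bin/bash
# Alert when any Syncthing device is disconnected (sketch)
API_KEY="oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5"
DOWN=$(curl -s -H "X-API-Key: $API_KEY" \
  "http://127.0.0.1:8384/rest/system/connections" | \
  python3 -c "import sys,json; d=json.load(sys.stdin)['connections']; \
  print('\n'.join(v.get('name',k[:7]) for k,v in d.items() if not v['connected']))")

if [ -n "$DOWN" ]; then
  echo "Syncthing devices down: $DOWN"  # replace with mail/push notification
fi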

See: SYNCTHING.md


Temperature Monitoring

Status: ⚠️ Manual only

Current Method:

# CPU temperature (Threadripper Tctl)
ssh pve 'for f in /sys/class/hwmon/hwmon*/temp*_input; do \
  label=$(cat ${f%_input}_label 2>/dev/null); \
  if [ "$label" = "Tctl" ]; then echo "PVE Tctl: $(($(cat $f)/1000))°C"; fi; done'

ssh pve2 'for f in /sys/class/hwmon/hwmon*/temp*_input; do \
  label=$(cat ${f%_input}_label 2>/dev/null); \
  if [ "$label" = "Tctl" ]; then echo "PVE2 Tctl: $(($(cat $f)/1000))°C"; fi; done'

Thresholds:

  • Healthy: 70-80°C under load
  • Warning: >85°C
  • Critical: >90°C (throttling)

Needs: Automated monitoring + alert if >85°C
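
A cron-able sketch of that check, reusing the Tctl loop above (the echo is a placeholder for the alert action):

#!/bin/bash
# Alert when CPU temperature exceeds threshold (sketch)
THRESHOLD=85
for host in pve pve2; do
  temp=$(ssh $host 'for f in /sys/class/hwmon/hwmon*/temp*_input; do
    label=$(cat ${f%_input}_label 2>/dev/null);
    if [ "$label" = "Tctl" ]; then echo $(($(cat $f)/1000)); fi; done')
  if [ -n "$temp" ] && [ "$temp" -gt "$THRESHOLD" ]; then
    echo "$host CPU at ${temp}°C (threshold ${THRESHOLD}°C)"  # replace with alert
  fi
done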


Proxmox VM Monitoring

Status: ⚠️ Manual only

Current Access:

  • Proxmox Web UI: Node → Summary
  • CLI: ssh pve 'qm list'

Metrics Available (via Proxmox):

  • CPU usage per VM
  • RAM usage per VM
  • Disk I/O
  • Network I/O
  • VM uptime

Needs: API-based monitoring + alerts for VM down
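
Until that exists, a cron-able sketch that flags stopped VMs (the VM IDs are placeholders; list the ones that matter):

#!/bin/bash
# Flag critical VMs that are not running (sketch; IDs are placeholders)
CRITICAL_VMS="100 101"
for id in $CRITICAL_VMS; do
  status=$(ssh pve "qm status $id" 2>/dev/null | awk '{print $2}')
  if [ "$status" != "running" ]; then
    echo "VM $id on PVE is ${status:-unknown}"  # replace with alert
  fi
done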


Recommended Monitoring Solutions

Option 1: Prometheus + Grafana

Why:

  • Industry standard
  • Extensive integrations
  • Beautiful dashboards
  • Flexible alerting

Architecture:

Grafana (dashboard) → Prometheus (metrics DB) → Exporters (data collection)
                              ↓
                          Alertmanager (alerts)

Required Exporters:

| Exporter | Monitors | Install On |
|---|---|---|
| node_exporter | CPU, RAM, disk, network | PVE, PVE2, TrueNAS, all VMs |
| zfs_exporter | ZFS pool health | PVE, PVE2, TrueNAS |
| smartmon_exporter | Drive SMART data | PVE, PVE2, TrueNAS |
| nut_exporter | UPS metrics | PVE |
| proxmox_exporter | VM/CT stats | PVE, PVE2 |
| cadvisor | Docker containers | Saltbox, docker-host |
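
A minimal prometheus.yml scrape sketch for the node_exporter targets (IPs are the PVE/PVE2 addresses used elsewhere in this doc; 9100 is node_exporter's default port):

# prometheus.yml (sketch)
scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets:
          - '10.10.10.120:9100'  # PVE
          - '10.10.10.102:9100'  # PVE2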

Deployment:

# Create monitoring VM
ssh pve 'qm create 210 --name monitoring --memory 4096 --cores 2 \
  --net0 virtio,bridge=vmbr0'

# Install Prometheus + Grafana (via Docker)
# /opt/monitoring/docker-compose.yml
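
A minimal sketch of that compose file (official images; ports and volumes are illustrative):

# /opt/monitoring/docker-compose.yml (sketch)
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    restart: unless-stopped
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    volumes:
      - ./grafana-data:/var/lib/grafana
    restart: unless-stopped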

Estimated Setup Time: 4-6 hours


Option 2: Uptime Kuma (Simpler Alternative)

Why:

  • Lightweight
  • Easy to set up
  • Web-based dashboard
  • Built-in alerts (email, Slack, etc.)

What it monitors:

  • HTTP/HTTPS endpoints
  • Ping (ICMP)
  • Ports (TCP)
  • Docker containers

Deployment:

# Create directory and compose file on docker-host
ssh docker-host 'mkdir -p /opt/uptime-kuma'
ssh docker-host 'cat > /opt/uptime-kuma/docker-compose.yml' << 'EOF'
version: "3.8"
services:
  uptime-kuma:
    image: louislam/uptime-kuma:latest
    ports:
      - "3001:3001"
    volumes:
      - ./data:/app/data
    restart: unless-stopped
EOF

# Start it
ssh docker-host 'cd /opt/uptime-kuma && docker compose up -d'

# Access: http://10.10.10.206:3001
# Add Traefik config for uptime.htsn.io
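
For the Traefik piece, a file-provider sketch (router/service names and the backend URL are assumptions; match your existing Traefik entrypoints and TLS setup):

# traefik dynamic config (sketch)
http:
  routers:
    uptime:
      rule: "Host(`uptime.htsn.io`)"
      service: uptime
  services:
    uptime:
      loadBalancer:
        servers:
          - url: "http://10.10.10.206:3001"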

Estimated Setup Time: 1-2 hours


Option 3: Netdata (Real-time Monitoring)

Why:

  • Real-time metrics (1-second granularity)
  • Auto-discovers services
  • Low overhead
  • Beautiful web UI

Deployment:

# Install on each server
ssh pve 'bash <(curl -Ss https://my-netdata.io/kickstart.sh)'
ssh pve2 'bash <(curl -Ss https://my-netdata.io/kickstart.sh)'

# Access:
# http://10.10.10.120:19999 (PVE)
# http://10.10.10.102:19999 (PVE2)

Parent-Child Setup (optional):

  • Configure PVE as parent
  • Stream metrics from PVE2 → PVE
  • Single dashboard for both servers
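
A sketch of that setup in /etc/netdata/stream.conf on each node (the API key is any UUID you generate; it just has to match on both sides):

# On PVE2 (child) - /etc/netdata/stream.conf
[stream]
    enabled = yes
    destination = 10.10.10.120:19999
    api key = 11111111-2222-3333-4444-555555555555

# On PVE (parent) - same file, accept that key
[11111111-2222-3333-4444-555555555555]
    enabled = yes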

Estimated Setup Time: 1 hour


Critical Metrics to Monitor

Server Health

| Metric | Threshold | Action |
|---|---|---|
| CPU usage | >90% for 5 min | Alert |
| CPU temp | >85°C | Alert |
| CPU temp | >90°C | Critical alert |
| RAM usage | >95% | Alert |
| Disk space | >80% | Warning |
| Disk space | >90% | Alert |
| Load average | >CPU count | Alert |

Storage Health

| Metric | Threshold | Action |
|---|---|---|
| ZFS pool errors | >0 | Alert immediately |
| ZFS pool degraded | Any degraded vdev | Critical alert |
| ZFS scrub failed | Last scrub error | Alert |
| SMART reallocated sectors | >0 | Warning |
| SMART pending sectors | >0 | Alert |
| SMART failure | Pre-fail | Critical - replace drive |

UPS

| Metric | Threshold | Action |
|---|---|---|
| Battery charge | <20% | Warning |
| Battery charge | <10% | Alert |
| On battery | >5 min | Alert |
| Runtime | <5 min | Critical |

Network

| Metric | Threshold | Action |
|---|---|---|
| Device unreachable | >2 min down | Alert |
| High packet loss | >5% | Warning |
| Bandwidth saturation | >90% | Warning |

VMs/Services

| Metric | Threshold | Action |
|---|---|---|
| VM stopped | Critical VM down | Alert immediately |
| Service unreachable | HTTP 5xx or timeout | Alert |
| Backup failed | Any backup failure | Alert |
| Certificate expiry | <30 days | Warning |
| Certificate expiry | <7 days | Alert |
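
With the Prometheus stack from Option 1, a couple of these thresholds translate into alert rules like the sketch below (metric names are node_exporter's standard ones):

# alert-rules.yml (sketch)
groups:
  - name: homelab
    rules:
      - alert: DiskSpaceCritical
        expr: (1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) > 0.90
        for: 5m
        labels:
          severity: critical
      - alert: HostDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical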

Alert Destinations

Email Alerts

Recommended: Set up SMTP relay for email alerts

Options:

  1. Gmail SMTP (free, rate-limited)
  2. SendGrid (free tier: 100 emails/day)
  3. Mailgun (free tier available)
  4. Self-hosted mail server (complex)

Configuration Example (Prometheus Alertmanager):

# /etc/alertmanager/alertmanager.yml
route:
  receiver: 'email'   # default route; Alertmanager requires a top-level route

receivers:
  - name: 'email'
    email_configs:
      - to: 'hutson@example.com'
        from: 'alerts@htsn.io'
        smarthost: 'smtp.gmail.com:587'
        auth_username: 'alerts@htsn.io'
        auth_password: 'app-password-here'

Push Notifications

Options:

  • Pushover: $5 one-time, reliable
  • Pushbullet: Free tier available
  • Telegram Bot: Free
  • Discord Webhook: Free
  • Slack: Free tier available

Recommended: Pushover or Telegram for mobile alerts
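
Telegram is a single HTTP call once a bot exists (token and chat ID come from @BotFather and are placeholders here):

# Telegram push (sketch; <TOKEN> and <CHAT_ID> are placeholders)
curl -s "https://api.telegram.org/bot<TOKEN>/sendMessage" \
  -d chat_id=<CHAT_ID> \
  -d text="Homelab alert: UPS on battery"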


Home Assistant Alerts

Since Home Assistant is already running, use it for alerts:

Automation Example:

automation:
  - alias: "UPS Low Battery Alert"
    trigger:
      - platform: numeric_state
        entity_id: sensor.cyberpower_battery_charge
        below: 20
    action:
      - service: notify.mobile_app
        data:
          message: "⚠️ UPS battery at {{ states('sensor.cyberpower_battery_charge') }}%"

  - alias: "Server High Temperature"
    trigger:
      - platform: numeric_state
        entity_id: sensor.pve_cpu_temp
        above: 85
    action:
      - service: notify.mobile_app
        data:
          message: "🔥 PVE CPU temperature: {{ states('sensor.pve_cpu_temp') }}°C"

Needs: Sensors for CPU temp, disk space, etc. in Home Assistant
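
One way to feed those in is Home Assistant's command_line integration (a sketch; assumes key-based SSH from the HA host to pve):

# configuration.yaml (sketch)
command_line:
  - sensor:
      name: pve_cpu_temp
      command: >-
        ssh pve 'for f in /sys/class/hwmon/hwmon*/temp*_input; do
        label=$(cat ${f%_input}_label 2>/dev/null);
        if [ "$label" = "Tctl" ]; then echo $(($(cat $f)/1000)); fi; done'
      unit_of_measurement: "°C"
      scan_interval: 300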


Monitoring Scripts

Daily Health Check

Save as ~/bin/homelab-health-check.sh:

#!/bin/bash
# Daily homelab health check

echo "=== Homelab Health Check ==="
echo "Date: $(date)"
echo ""

echo "=== Server Status ==="
ssh pve 'uptime' 2>/dev/null || echo "PVE: UNREACHABLE"
ssh pve2 'uptime' 2>/dev/null || echo "PVE2: UNREACHABLE"
echo ""

echo "=== CPU Temperatures ==="
ssh pve 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE: $(($(cat $f)/1000))°C"; fi; done'
ssh pve2 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE2: $(($(cat $f)/1000))°C"; fi; done'
echo ""

echo "=== UPS Status ==="
ssh pve 'upsc cyberpower@localhost | grep -E "battery.charge:|battery.runtime:|ups.load:|ups.status:"'
echo ""

echo "=== ZFS Pools ==="
ssh pve 'zpool status -x' 2>/dev/null
ssh pve2 'zpool status -x' 2>/dev/null
ssh truenas 'zpool status -x vault'
echo ""

echo "=== Disk Space ==="
ssh pve 'df -h | grep -E "Filesystem|/dev/(nvme|sd)"'
ssh truenas 'df -h /mnt/vault'
echo ""

echo "=== VM Status ==="
ssh pve 'qm list | grep running | wc -l' | xargs echo "PVE VMs running:"
ssh pve2 'qm list | grep running | wc -l' | xargs echo "PVE2 VMs running:"
echo ""

echo "=== Syncthing Connections ==="
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
  "http://127.0.0.1:8384/rest/system/connections" | \
  python3 -c "import sys,json; d=json.load(sys.stdin)['connections']; \
  [print(f\"{v.get('name',k[:7])}: {'UP' if v['connected'] else 'DOWN'}\") for k,v in d.items()]"
echo ""

echo "=== Check Complete ==="

Run daily:

0 9 * * * ~/bin/homelab-health-check.sh | mail -s "Homelab Health Check" hutson@example.com

ZFS Scrub Checker

#!/bin/bash
# Check last ZFS scrub status

echo "=== ZFS Scrub Status ==="

for host in pve pve2; do
  echo "--- $host ---"
  ssh $host 'zpool status | grep -A1 scrub'
  echo ""
done

echo "--- TrueNAS ---"
ssh truenas 'zpool status vault | grep -A1 scrub'

SMART Health Checker

#!/bin/bash
# Check SMART health on all drives

echo "=== SMART Health Check ==="

echo "--- TrueNAS Drives ---"
ssh truenas 'smartctl --scan | while read dev type; do
  echo "=== $dev ===";
  smartctl -H $dev | grep -E "SMART overall|PASSED|FAILED";
done'

echo "--- PVE Drives ---"
# Glob matches whole disks only (skips nvme controller nodes and partitions)
ssh pve 'for dev in /dev/nvme?n1 /dev/sd[a-z]; do
  [ -e "$dev" ] && echo "=== $dev ===" && smartctl -H $dev | grep -E "SMART|PASSED|FAILED";
done'

Dashboard Recommendations

Grafana Dashboard Layout

Page 1: Overview

  • Server uptime
  • CPU usage (all servers)
  • RAM usage (all servers)
  • Disk space (all pools)
  • Network traffic
  • UPS status

Page 2: Storage

  • ZFS pool health
  • SMART status for all drives
  • I/O latency
  • Scrub progress
  • Disk temperatures

Page 3: VMs

  • VM status (up/down)
  • VM resource usage
  • VM disk I/O
  • VM network traffic

Page 4: Services

  • Service health checks
  • HTTP response times
  • Certificate expiry dates
  • Syncthing sync status

Implementation Plan

Phase 1: Basic Monitoring (Week 1)

  • Install Uptime Kuma or Netdata
  • Add HTTP checks for all services
  • Configure UPS alerts in Home Assistant
  • Set up daily health check email

Estimated Time: 4-6 hours


Phase 2: Advanced Monitoring (Week 2-3)

  • Install Prometheus + Grafana
  • Deploy node_exporter on all servers
  • Deploy zfs_exporter
  • Deploy smartmon_exporter
  • Create Grafana dashboards

Estimated Time: 8-12 hours


Phase 3: Alerting (Week 4)

  • Configure Alertmanager
  • Set up email/push notifications
  • Create alert rules for all critical metrics
  • Test all alert paths
  • Document alert procedures

Estimated Time: 4-6 hours



Last Updated: 2025-12-22
Status: ⚠️ Minimal monitoring currently in place - implementation needed