Complete Phase 2 documentation: Add HARDWARE, SERVICES, MONITORING, MAINTENANCE

Phase 2 documentation implementation:

- Created HARDWARE.md: Complete hardware inventory (servers, GPUs, storage, network cards)
- Created SERVICES.md: Service inventory with URLs, credentials, health checks (25+ services)
- Created MONITORING.md: Health monitoring recommendations, alert setup, implementation plan
- Created MAINTENANCE.md: Regular procedures, update schedules, testing checklists
- Updated README.md: Added all Phase 2 documentation links
- Updated CLAUDE.md: Cleaned up to quick reference only (1340→377 lines)

All detailed content now in specialized documentation files with cross-references.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-23 00:34:21 -05:00
parent 23e9df68c9
commit 56b82df497
14 changed files with 6328 additions and 1036 deletions
--- a/MONITORING.md
+++ b/MONITORING.md
@@ -0,0 +1,546 @@
+# Monitoring and Alerting
+
+Documentation for system monitoring, health checks, and alerting across the homelab.
+
+## Current Monitoring Status
+
+| Component | Monitored? | Method | Alerts | Notes |
+|-----------|------------|--------|--------|-------|
+| **UPS** | ✅ Yes | NUT + Home Assistant | ❌ No | Battery, load, runtime tracked |
+| **Syncthing** | ⚠️ Partial | API (manual checks) | ❌ No | Connection status available |
+| **Server temps** | ⚠️ Partial | Manual checks | ❌ No | Via `sensors` command |
+| **VM status** | ⚠️ Partial | Proxmox UI | ❌ No | Manual monitoring |
+| **ZFS health** | ❌ No | Manual `zpool status` | ❌ No | No automated checks |
+| **Disk health (SMART)** | ❌ No | Manual `smartctl` | ❌ No | No automated checks |
+| **Network** | ❌ No | - | ❌ No | No uptime monitoring |
+| **Services** | ❌ No | - | ❌ No | No health checks |
+| **Backups** | ❌ No | - | ❌ No | No verification |
+
+**Overall Status**: ⚠️ **MINIMAL** - Most monitoring is manual, no automated alerts
+
+---
+
+## Existing Monitoring
+
+### UPS Monitoring (NUT)
+
+**Status**: ✅ **Active and working**
+
+**What's monitored**:
+- Battery charge percentage
+- Runtime remaining (seconds)
+- Load percentage
+- Input/output voltage
+- UPS status (OL/OB/LB)
+
+**Access**:
+```bash
+# Full UPS status
+ssh pve 'upsc cyberpower@localhost'
+
+# Key metrics
+ssh pve 'upsc cyberpower@localhost | grep -E "battery.charge:|battery.runtime:|ups.load:|ups.status:"'
+```
+
+**Home Assistant Integration**:
+- Sensors: `sensor.cyberpower_*`
+- Can be used for automation/alerts
+- Currently: No alerts configured
+
+**See**: [UPS.md](UPS.md)
+
+---
+
+### Syncthing Monitoring
+
+**Status**: ⚠️ **Partial** - API available, no automated monitoring
+
+**What's available**:
+- Device connection status
+- Folder sync status
+- Sync errors
+- Bandwidth usage
+
+**Manual Checks**:
+```bash
+# Check connections (Mac Mini)
+curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
+  "http://127.0.0.1:8384/rest/system/connections" | \
+  python3 -c "import sys,json; d=json.load(sys.stdin)['connections']; \
+  [print(f\"{v.get('name',k[:7])}: {'UP' if v['connected'] else 'DOWN'}\") for k,v in d.items()]"
+
+# Check folder status
+curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
+  "http://127.0.0.1:8384/rest/db/status?folder=documents" | jq
+
+# Check errors
+curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
+  "http://127.0.0.1:8384/rest/folder/errors?folder=documents" | jq
+```
+
+**Needs**: Automated monitoring script + alerts
+
+**See**: [SYNCTHING.md](SYNCTHING.md)
+
+---
+
+### Temperature Monitoring
+
+**Status**: ⚠️ **Manual only**
+
+**Current Method**:
+```bash
+# CPU temperature (Threadripper Tctl)
+ssh pve 'for f in /sys/class/hwmon/hwmon*/temp*_input; do \
+  label=$(cat ${f%_input}_label 2>/dev/null); \
+  if [ "$label" = "Tctl" ]; then echo "PVE Tctl: $(($(cat $f)/1000))°C"; fi; done'
+
+ssh pve2 'for f in /sys/class/hwmon/hwmon*/temp*_input; do \
+  label=$(cat ${f%_input}_label 2>/dev/null); \
+  if [ "$label" = "Tctl" ]; then echo "PVE2 Tctl: $(($(cat $f)/1000))°C"; fi; done'
+```
+
+**Thresholds**:
+- Healthy: 70-80°C under load
+- Warning: >85°C
+- Critical: >90°C (throttling)
+
+**Needs**: Automated monitoring + alert if >85°C
+
+---
+
+### Proxmox VM Monitoring
+
+**Status**: ⚠️ **Manual only**
+
+**Current Access**:
+- Proxmox Web UI: Node → Summary
+- CLI: `ssh pve 'qm list'`
+
+**Metrics Available** (via Proxmox):
+- CPU usage per VM
+- RAM usage per VM
+- Disk I/O
+- Network I/O
+- VM uptime
+
+**Needs**: API-based monitoring + alerts for VM down
+
+---
+
+## Recommended Monitoring Stack
+
+### Option 1: Prometheus + Grafana (Recommended)
+
+**Why**:
+- Industry standard
+- Extensive integrations
+- Beautiful dashboards
+- Flexible alerting
+
+**Architecture**:
+```
+Grafana (dashboard) → Prometheus (metrics DB) → Exporters (data collection)
+                              ↓
+                          Alertmanager (alerts)
+```
+
+**Required Exporters**:
+| Exporter | Monitors | Install On |
+|----------|----------|------------|
+| node_exporter | CPU, RAM, disk, network | PVE, PVE2, TrueNAS, all VMs |
+| zfs_exporter | ZFS pool health | PVE, PVE2, TrueNAS |
+| smartmon_exporter | Drive SMART data | PVE, PVE2, TrueNAS |
+| nut_exporter | UPS metrics | PVE |
+| proxmox_exporter | VM/CT stats | PVE, PVE2 |
+| cadvisor | Docker containers | Saltbox, docker-host |
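+
+A minimal scrape configuration for the node_exporter row above might look like the following sketch. It assumes node_exporter listens on its default port 9100 and reuses the PVE and PVE2 addresses listed under the Netdata option; the file path, scrape interval, and job name are illustrative assumptions, not taken from a deployed setup.
+
+```yaml
+# /opt/monitoring/prometheus.yml (assumed path)
+global:
+  scrape_interval: 30s            # assumed interval
+
+scrape_configs:
+  - job_name: "node"              # hypothetical job name
+    static_configs:
+      - targets:
+          - "10.10.10.120:9100"   # PVE (node_exporter default port)
+          - "10.10.10.102:9100"   # PVE2
+```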
+
+**Deployment**:
+```bash
+# Create monitoring VM
+ssh pve 'qm create 210 --name monitoring --memory 4096 --cores 2 \
+  --net0 virtio,bridge=vmbr0'
+
+# Install Prometheus + Grafana (via Docker)
+# /opt/monitoring/docker-compose.yml
+```
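+
+The compose file referenced above is not shown in this document. One possible shape, using the official `prom/prometheus`, `prom/alertmanager`, and `grafana/grafana` images on their default ports, is sketched below. It assumes `prometheus.yml` and `alertmanager.yml` sit beside the compose file; the volume names are hypothetical.
+
+```yaml
+# /opt/monitoring/docker-compose.yml (sketch, not a tested deployment)
+services:
+  prometheus:
+    image: prom/prometheus:latest
+    ports:
+      - "9090:9090"
+    volumes:
+      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
+      - prometheus-data:/prometheus        # metrics storage
+    restart: unless-stopped
+
+  alertmanager:
+    image: prom/alertmanager:latest
+    ports:
+      - "9093:9093"
+    volumes:
+      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
+    restart: unless-stopped
+
+  grafana:
+    image: grafana/grafana:latest
+    ports:
+      - "3000:3000"
+    volumes:
+      - grafana-data:/var/lib/grafana      # dashboards and settings
+    restart: unless-stopped
+
+volumes:
+  prometheus-data:
+  grafana-data:
+```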
+
+**Estimated Setup Time**: 4-6 hours
+
+---
+
+### Option 2: Uptime Kuma (Simpler Alternative)
+
+**Why**:
+- Lightweight
+- Easy to set up
+- Web-based dashboard
+- Built-in alerts (email, Slack, etc.)
+
+**What it monitors**:
+- HTTP/HTTPS endpoints
+- Ping (ICMP)
+- Ports (TCP)
+- Docker containers
+
+**Deployment**:
+```bash
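+# Write the compose file on docker-host (the heredoc is sent over SSH stdin)
+ssh docker-host 'mkdir -p /opt/uptime-kuma && cat > /opt/uptime-kuma/docker-compose.yml' << 'EOF'
+version: "3.8"
+services:
+  uptime-kuma:
+    image: louislam/uptime-kuma:latest
+    ports:
+      - "3001:3001"
+    volumes:
+      - ./data:/app/data
+    restart: unless-stopped
+EOF
+
+# Start the container (use docker-compose on older Docker installs)
+ssh docker-host 'cd /opt/uptime-kuma && docker compose up -d'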
+
+# Access: http://10.10.10.206:3001
+# Add Traefik config for uptime.htsn.io
+```
+
+**Estimated Setup Time**: 1-2 hours
+
+---
+
+### Option 3: Netdata (Real-time Monitoring)
+
+**Why**:
+- Real-time metrics (1-second granularity)
+- Auto-discovers services
+- Low overhead
+- Beautiful web UI
+
+**Deployment**:
+```bash
+# Install on each server
+ssh pve 'bash <(curl -Ss https://my-netdata.io/kickstart.sh)'
+ssh pve2 'bash <(curl -Ss https://my-netdata.io/kickstart.sh)'
+
+# Access:
+# http://10.10.10.120:19999 (PVE)
+# http://10.10.10.102:19999 (PVE2)
+```
+
+**Parent-Child Setup** (optional):
+- Configure PVE as parent
+- Stream metrics from PVE2 → PVE
+- Single dashboard for both servers
+
+**Estimated Setup Time**: 1 hour
+
+---
+
+## Critical Metrics to Monitor
+
+### Server Health
+
+| Metric | Threshold | Action |
+|--------|-----------|--------|
+| **CPU usage** | >90% for 5 min | Alert |
+| **CPU temp** | >85°C | Alert |
+| **CPU temp** | >90°C | Critical alert |
+| **RAM usage** | >95% | Alert |
+| **Disk space** | >80% | Warning |
+| **Disk space** | >90% | Alert |
+| **Load average** | >CPU count | Alert |
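+
+If the Prometheus option is chosen, the thresholds in this table could be expressed as alerting rules roughly as follows. This is a sketch assuming node_exporter metrics; the group name, alert names, severity labels, and the `for` durations other than the 5-minute CPU window are assumptions. The temperature rule takes the hottest hwmon sensor per host rather than Tctl specifically.
+
+```yaml
+groups:
+  - name: server-health            # hypothetical group name
+    rules:
+      - alert: HighCpuUsage
+        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
+        for: 5m
+        labels:
+          severity: warning
+      - alert: HighCpuTemperature
+        expr: max by (instance) (node_hwmon_temp_celsius) > 85
+        for: 2m                    # assumed debounce
+        labels:
+          severity: warning
+      - alert: CriticalCpuTemperature
+        expr: max by (instance) (node_hwmon_temp_celsius) > 90
+        labels:
+          severity: critical
+      - alert: HighMemoryUsage
+        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 95
+        for: 5m                    # assumed debounce
+        labels:
+          severity: warning
+      - alert: DiskSpaceLow
+        expr: (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) * 100 > 90
+        labels:
+          severity: warning
+      - alert: HighLoadAverage
+        expr: node_load5 > on(instance) count by (instance) (node_cpu_seconds_total{mode="idle"})
+        for: 5m                    # assumed debounce
+        labels:
+          severity: warning
+```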
+
+### Storage Health
+
+| Metric | Threshold | Action |
+|--------|-----------|--------|
+| **ZFS pool errors** | >0 | Alert immediately |
+| **ZFS pool degraded** | Any degraded vdev | Critical alert |
+| **ZFS scrub failed** | Last scrub error | Alert |
+| **SMART reallocated sectors** | >0 | Warning |
+| **SMART pending sectors** | >0 | Alert |
+| **SMART failure** | Pre-fail | Critical - replace drive |
+
+### UPS
+
+| Metric | Threshold | Action |
+|--------|-----------|--------|
+| **Battery charge** | <20% | Warning |
+| **Battery charge** | <10% | Alert |
+| **On battery** | >5 min | Alert |
+| **Runtime** | <5 min | Critical |
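+
+The "on battery for more than 5 minutes" row could be covered in Home Assistant with a template trigger that uses `for`. The sketch below assumes the NUT integration exposes the raw status flags as `sensor.cyberpower_status_data` (the entity name is a guess based on the `sensor.cyberpower_*` pattern) and that `OB` appears in that value while on battery, as in the OL/OB/LB codes listed earlier.
+
+```yaml
+automation:
+  - alias: "UPS On Battery for 5 Minutes"
+    trigger:
+      - platform: template
+        # Entity name is an assumption; check the actual NUT sensors
+        value_template: "{{ 'OB' in states('sensor.cyberpower_status_data') }}"
+        for: "00:05:00"
+    action:
+      - service: notify.mobile_app
+        data:
+          message: "UPS has been on battery for more than 5 minutes"
+```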
+
+### Network
+
+| Metric | Threshold | Action |
+|--------|-----------|--------|
+| **Device unreachable** | >2 min down | Alert |
+| **High packet loss** | >5% | Warning |
+| **Bandwidth saturation** | >90% | Warning |
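+
+For hosts that Prometheus already scrapes, the "unreachable for more than 2 minutes" row can be approximated with the built-in `up` metric, as in this sketch. It only covers scrape targets; plain network devices, packet loss, and bandwidth would need a separate probe or exporter that this document does not specify. The group and alert names are hypothetical.
+
+```yaml
+groups:
+  - name: network                  # hypothetical group name
+    rules:
+      - alert: TargetDown
+        expr: up == 0              # scrape failed for this target
+        for: 2m
+        labels:
+          severity: critical
+```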
+
+### VMs/Services
+
+| Metric | Threshold | Action |
+|--------|-----------|--------|
+| **VM stopped** | Critical VM down | Alert immediately |
+| **Service unreachable** | HTTP 5xx or timeout | Alert |
+| **Backup failed** | Any backup failure | Alert |
+| **Certificate expiry** | <30 days | Warning |
+| **Certificate expiry** | <7 days | Alert |
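+
+The service and certificate rows map naturally onto blackbox_exporter probes. That exporter is not in the exporter list above, so treat the following as a sketch under that assumption. `probe_success` and `probe_ssl_earliest_cert_expiry` are standard blackbox_exporter metrics; the group name, alert names, and the 2-minute `for` are illustrative.
+
+```yaml
+groups:
+  - name: services                 # hypothetical group name
+    rules:
+      - alert: ServiceUnreachable
+        expr: probe_success == 0   # HTTP probe failed (timeout or non-2xx by default)
+        for: 2m                    # assumed debounce
+        labels:
+          severity: critical
+      - alert: CertificateExpiresIn30Days
+        expr: probe_ssl_earliest_cert_expiry - time() < 30 * 86400
+        labels:
+          severity: warning
+      - alert: CertificateExpiresIn7Days
+        expr: probe_ssl_earliest_cert_expiry - time() < 7 * 86400
+        labels:
+          severity: critical
+```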
+
+---
+
+## Alert Destinations
+
+### Email Alerts
+
+**Recommended**: Set up SMTP relay for email alerts
+
+**Options**:
+1. Gmail SMTP (free, rate-limited)
+2. SendGrid (free tier: 100 emails/day)
+3. Mailgun (free tier available)
+4. Self-hosted mail server (complex)
+
+**Configuration Example** (Prometheus Alertmanager):
+```yaml
+# /etc/alertmanager/alertmanager.yml
+route:
+  receiver: 'email'   # default receiver; Alertmanager requires a top-level route
+
+receivers:
+  - name: 'email'
+    email_configs:
+      - to: 'hutson@example.com'
+        from: 'alerts@htsn.io'
+        smarthost: 'smtp.gmail.com:587'
+        auth_username: 'alerts@htsn.io'
+        auth_password: 'app-password-here'
+```
+
+---
+
+### Push Notifications
+
+**Options**:
+- **Pushover**: $5 one-time, reliable
+- **Pushbullet**: Free tier available
+- **Telegram Bot**: Free
+- **Discord Webhook**: Free
+- **Slack**: Free tier available
+
+**Recommended**: Pushover or Telegram for mobile alerts
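+
+If Alertmanager ends up handling notifications, both recommended options have built-in receiver types. The fragment below is a sketch with placeholder credentials. `telegram_configs` needs a reasonably recent Alertmanager release (0.24 or later, to the best of our knowledge), and the receiver names are arbitrary.
+
+```yaml
+receivers:
+  - name: 'pushover'
+    pushover_configs:
+      - user_key: 'your-user-key'      # placeholder
+        token: 'your-app-token'        # placeholder
+  - name: 'telegram'
+    telegram_configs:
+      - bot_token: 'your-bot-token'    # placeholder
+        chat_id: 123456789             # placeholder numeric chat ID
+```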
+
+---
+
+### Home Assistant Alerts
+
+Since Home Assistant is already running, use it for alerts:
+
+**Automation Example**:
+```yaml
+automation:
+  - alias: "UPS Low Battery Alert"
+    trigger:
+      - platform: numeric_state
+        entity_id: sensor.cyberpower_battery_charge
+        below: 20
+    action:
+      - service: notify.mobile_app
+        data:
+          message: "⚠️ UPS battery at {{ states('sensor.cyberpower_battery_charge') }}%"
+
+  - alias: "Server High Temperature"
+    trigger:
+      - platform: template
+        value_template: "{{ states('sensor.pve_cpu_temp') | float(0) > 85 }}"
+    action:
+      - service: notify.mobile_app
+        data:
+          message: "🔥 PVE CPU temperature: {{ states('sensor.pve_cpu_temp') }}°C"
+```
+
+**Needs**: Sensors for CPU temp, disk space, etc. in Home Assistant
+
+---
+
+## Monitoring Scripts
+
+### Daily Health Check
+
+Save as `~/bin/homelab-health-check.sh`:
+
+```bash
+#!/bin/bash
+# Daily homelab health check
+
+echo "=== Homelab Health Check ==="
+echo "Date: $(date)"
+echo ""
+
+echo "=== Server Status ==="
+ssh pve 'uptime' 2>/dev/null || echo "PVE: UNREACHABLE"
+ssh pve2 'uptime' 2>/dev/null || echo "PVE2: UNREACHABLE"
+echo ""
+
+echo "=== CPU Temperatures ==="
+ssh pve 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE: $(($(cat $f)/1000))°C"; fi; done'
+ssh pve2 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE2: $(($(cat $f)/1000))°C"; fi; done'
+echo ""
+
+echo "=== UPS Status ==="
+ssh pve 'upsc cyberpower@localhost | grep -E "battery.charge:|battery.runtime:|ups.load:|ups.status:"'
+echo ""
+
+echo "=== ZFS Pools ==="
+ssh pve 'zpool status -x' 2>/dev/null
+ssh pve2 'zpool status -x' 2>/dev/null
+ssh truenas 'zpool status -x vault'
+echo ""
+
+echo "=== Disk Space ==="
+ssh pve 'df -h | grep -E "Filesystem|/dev/(nvme|sd)"'
+ssh truenas 'df -h /mnt/vault'
+echo ""
+
+echo "=== VM Status ==="
+ssh pve 'qm list | grep running | wc -l' | xargs echo "PVE VMs running:"
+ssh pve2 'qm list | grep running | wc -l' | xargs echo "PVE2 VMs running:"
+echo ""
+
+echo "=== Syncthing Connections ==="
+curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
+  "http://127.0.0.1:8384/rest/system/connections" | \
+  python3 -c "import sys,json; d=json.load(sys.stdin)['connections']; \
+  [print(f\"{v.get('name',k[:7])}: {'UP' if v['connected'] else 'DOWN'}\") for k,v in d.items()]"
+echo ""
+
+echo "=== Check Complete ==="
+```
+
+**Run daily**:
+```cron
+0 9 * * * ~/bin/homelab-health-check.sh | mail -s "Homelab Health Check" hutson@example.com
+```
+
+---
+
+### ZFS Scrub Checker
+
+```bash
+#!/bin/bash
+# Check last ZFS scrub status
+
+echo "=== ZFS Scrub Status ==="
+
+for host in pve pve2; do
+  echo "--- $host ---"
+  ssh $host 'zpool status | grep -A1 scrub'
+  echo ""
+done
+
+echo "--- TrueNAS ---"
+ssh truenas 'zpool status vault | grep -A1 scrub'
+```
+
+---
+
+### SMART Health Checker
+
+```bash
+#!/bin/bash
+# Check SMART health on all drives
+
+echo "=== SMART Health Check ==="
+
+echo "--- TrueNAS Drives ---"
+ssh truenas 'smartctl --scan | while read dev type; do
+  echo "=== $dev ===";
+  smartctl -H $dev | grep -E "SMART overall|PASSED|FAILED";
+done'
+
+echo "--- PVE Drives ---"
+# Whole disks only (skips partitions and NVMe controller nodes)
+ssh pve 'for dev in /dev/nvme[0-9]n[0-9] /dev/sd[a-z]; do
+  [ -e "$dev" ] && echo "=== $dev ===" && smartctl -H "$dev" | grep -E "SMART|PASSED|FAILED";
+done'
+```
+
+---
+
+## Dashboard Recommendations
+
+### Grafana Dashboard Layout
+
+**Page 1: Overview**
+- Server uptime
+- CPU usage (all servers)
+- RAM usage (all servers)
+- Disk space (all pools)
+- Network traffic
+- UPS status
+
+**Page 2: Storage**
+- ZFS pool health
+- SMART status for all drives
+- I/O latency
+- Scrub progress
+- Disk temperatures
+
+**Page 3: VMs**
+- VM status (up/down)
+- VM resource usage
+- VM disk I/O
+- VM network traffic
+
+**Page 4: Services**
+- Service health checks
+- HTTP response times
+- Certificate expiry dates
+- Syncthing sync status
+
+---
+
+## Implementation Plan
+
+### Phase 1: Basic Monitoring (Week 1)
+
+- [ ] Install Uptime Kuma or Netdata
+- [ ] Add HTTP checks for all services
+- [ ] Configure UPS alerts in Home Assistant
+- [ ] Set up daily health check email
+
+**Estimated Time**: 4-6 hours
+
+---
+
+### Phase 2: Advanced Monitoring (Week 2-3)
+
+- [ ] Install Prometheus + Grafana
+- [ ] Deploy node_exporter on all servers
+- [ ] Deploy zfs_exporter
+- [ ] Deploy smartmon_exporter
+- [ ] Create Grafana dashboards
+
+**Estimated Time**: 8-12 hours
+
+---
+
+### Phase 3: Alerting (Week 4)
+
+- [ ] Configure Alertmanager
+- [ ] Set up email/push notifications
+- [ ] Create alert rules for all critical metrics
+- [ ] Test all alert paths
+- [ ] Document alert procedures
+
+**Estimated Time**: 4-6 hours
+
+---
+
+## Related Documentation
+
+- [UPS.md](UPS.md) - UPS monitoring details
+- [STORAGE.md](STORAGE.md) - ZFS health checks
+- [SERVICES.md](SERVICES.md) - Service inventory
+- [HOMEASSISTANT.md](HOMEASSISTANT.md) - Home Assistant automations
+- [MAINTENANCE.md](MAINTENANCE.md) - Regular maintenance checks
+
+---
+
+**Last Updated**: 2025-12-22
+**Status**: ⚠️ **Minimal monitoring currently in place - implementation needed**