Auto-sync: 20260116-150510

2026-01-16 15:05:12 -05:00
parent d38de8bfb1
commit 8c1cbf3dac
1 changed files with 98 additions and 1 deletions
--- a/MONITORING.md
+++ b/MONITORING.md
@@ -16,8 +16,9 @@ Documentation for system monitoring, health checks, and alerting across the home
 | **Network** | ✅ Partial | Gateway watchdog | ✅ Auto-reboot | Connectivity check every 60s |
 | **Services** | ❌ No | - | ❌ No | No health checks |
 | **Backups** | ❌ No | - | ❌ No | No verification |
 | **Claude Code** | ✅ Yes | Prometheus + Grafana | ✅ Yes | Token usage, burn rate, cost tracking |
-**Overall Status**: ⚠️ **PARTIAL** - Gateway monitoring active, most else is manual
+**Overall Status**: ⚠️ **PARTIAL** - Gateway and Claude Code monitoring active; everything else is still manual
 ---
@@ -87,6 +88,102 @@ ssh ucg-fiber 'free -m && ps -eo pid,rss,comm --sort=-rss | head -12'
 ---
 ### Claude Code Token Monitoring
 **Status**: ✅ **Active with alerts**
 Monitors Claude Code token usage on both Macs to track subscription consumption and warn before the weekly limit is reached.
 **Architecture**:
 ```
 Claude Code (MacBook/Mac Mini)
      │
      ▼ (OpenTelemetry Prometheus exporter :9464)
      │
 Prometheus (docker-host:9090)
      │
      ├──► Grafana Dashboard
      │
      └──► Alertmanager (burn rate alerts)
 ```
 **Monitored Devices**:
 | Device | IP Address | Metrics Port |
 |--------|------------|--------------|
 | MacBook | 10.10.10.147 | 9464 |
 | Mac Mini | 10.10.10.123 | 9464 |
 **What's monitored**:
 - Token usage (input/output/cache) over time
 - Burn rate (tokens/hour)
 - Cost tracking (USD)
 - Usage by model (Opus, Sonnet, Haiku)
 - Session count
 - Per-device breakdown
 **Dashboard**: https://grafana.htsn.io/d/claude-code-usage/claude-code-token-usage
 **Alerts Configured**:
 | Alert | Threshold | Severity |
 |-------|-----------|----------|
 | High Burn Rate | >100k tokens/hour for 15min | Warning |
 | Weekly Limit Risk | Projected >5M tokens/week | Critical |
 | No Metrics | Scrape fails for 5min | Info |
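The first two thresholds in the table can be read as the logic below. This is an illustrative Python sketch, not the contents of the actual rules file: the function names and the input format (one tokens/hour reading per minute) are assumptions, and the weekly projection reuses the "last 24h times 7" approach from the PromQL queries in this section.

```python
from typing import List, Sequence

HIGH_BURN_TOKENS_PER_HOUR = 100_000  # "High Burn Rate" threshold from the table
HIGH_BURN_WINDOW_MIN = 15            # condition must hold for 15 minutes
WEEKLY_LIMIT_TOKENS = 5_000_000      # "Weekly Limit Risk" threshold from the table


def high_burn_rate(rates_per_min: Sequence[float]) -> bool:
    """True if each of the last 15 one-minute readings exceeds 100k tokens/hour."""
    window = rates_per_min[-HIGH_BURN_WINDOW_MIN:]
    return (len(window) == HIGH_BURN_WINDOW_MIN
            and all(rate > HIGH_BURN_TOKENS_PER_HOUR for rate in window))


def weekly_limit_risk(tokens_last_24h: float) -> bool:
    """True if the last 24h of usage, extrapolated to 7 days, exceeds 5M tokens."""
    return tokens_last_24h * 7 > WEEKLY_LIMIT_TOKENS


def firing_alerts(rates_per_min: Sequence[float], tokens_last_24h: float) -> List[str]:
    """Return the names of the alerts that would fire for the given inputs."""
    alerts = []
    if high_burn_rate(rates_per_min):
        alerts.append("High Burn Rate")
    if weekly_limit_risk(tokens_last_24h):
        alerts.append("Weekly Limit Risk")
    return alerts
```

For example, 800k tokens in the last 24 hours projects to 5.6M per week, which crosses the critical threshold even if the hourly burn rate is currently low.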
 **Configuration Files**:
 - Claude settings: `~/.claude/settings.json` (on each Mac)
 - Prometheus scrape: `/opt/monitoring/prometheus/prometheus.yml` (docker-host)
 - Alert rules: `/opt/monitoring/prometheus/rules/claude-code.yml` (docker-host)
 **Claude Code Settings** (in `~/.claude/settings.json`):
 ```json
 {
   "env": {
     "CLAUDE_CODE_ENABLE_TELEMETRY": "1",
     "OTEL_METRICS_EXPORTER": "prometheus",
     "OTEL_EXPORTER_PROMETHEUS_PORT": "9464",
     "OTEL_METRIC_EXPORT_INTERVAL": "60000"
   }
 }
 ```
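Since a typo in these keys silently disables the exporter, a small check can help after editing the file. The following is a minimal sketch, assuming the four values shown above are the intended ones; `telemetry_problems` and `check_file` are hypothetical helper names, not part of Claude Code.

```python
import json
from pathlib import Path

# Expected values, copied from the settings block in this document.
REQUIRED_ENV = {
    "CLAUDE_CODE_ENABLE_TELEMETRY": "1",
    "OTEL_METRICS_EXPORTER": "prometheus",
    "OTEL_EXPORTER_PROMETHEUS_PORT": "9464",
    "OTEL_METRIC_EXPORT_INTERVAL": "60000",
}


def telemetry_problems(settings_text: str) -> list:
    """Return one message per env key that is missing or has an unexpected value."""
    env = json.loads(settings_text).get("env", {})
    return [
        f"{key}: expected {want!r}, found {env.get(key)!r}"
        for key, want in REQUIRED_ENV.items()
        if env.get(key) != want
    ]


def check_file(path: Path = Path.home() / ".claude" / "settings.json") -> list:
    """Run the check against the settings file on this machine."""
    return telemetry_problems(path.read_text(encoding="utf-8"))
```

Calling `check_file()` on each Mac returns an empty list when all four keys match.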
 **Prometheus Scrape Config**:
 ```yaml
 - job_name: "claude-code"
   scrape_interval: 60s
   static_configs:
     - targets: ["10.10.10.147:9464"]
       labels:
         device: "macbook"
     - targets: ["10.10.10.123:9464"]
       labels:
         device: "mac-mini"
 ```
 **Useful PromQL Queries**:
 ```promql
 # Total tokens across all scraped series (all devices, not a single session)
 sum(claude_code_token_usage_total)
 # Burn rate (tokens/hour)
 sum(rate(claude_code_token_usage_total[1h])) * 3600
 # Usage by device
 sum(claude_code_token_usage_total) by (device)
 # Projected weekly usage
 sum(increase(claude_code_token_usage_total[24h])) * 7
 ```
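The same queries can be run from a script through the Prometheus HTTP API (`/api/v1/query`). The sketch below is illustrative: the `http://docker-host:9090` base URL is assumed from the architecture diagram (scheme and name resolution are guesses), the helper names are hypothetical, and the metric name is taken as-is from the queries above.

```python
import json
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://docker-host:9090"  # assumption: host:port from the diagram

# Queries copied from this document.
QUERIES = {
    "burn_rate": "sum(rate(claude_code_token_usage_total[1h])) * 3600",
    "by_device": "sum(claude_code_token_usage_total) by (device)",
    "weekly_projection": "sum(increase(claude_code_token_usage_total[24h])) * 7",
}


def query_url(promql: str, base: str = PROMETHEUS_URL) -> str:
    """Build the instant-query URL for a PromQL expression."""
    return f"{base}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def parse_vector(body: str) -> dict:
    """Map each sample's `device` label ("" if absent) to its float value."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {payload.get('error')}")
    return {
        sample["metric"].get("device", ""): float(sample["value"][1])
        for sample in payload["data"]["result"]
    }


def run_query(promql: str) -> dict:
    """Execute the query against Prometheus and return the parsed result."""
    with urllib.request.urlopen(query_url(promql), timeout=10) as resp:
        return parse_vector(resp.read().decode("utf-8"))
```

`run_query(QUERIES["by_device"])` would return one entry per device label. Aggregated queries such as `burn_rate` have no `device` label, so their single value appears under the empty-string key.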
 **Important Notes**:
 - Claude Code must be restarted after changing telemetry settings
 - Metrics only flow while Claude Code is running
 - The weekly subscription limit resets every Monday at 1:00 AM (America/New_York)
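To relate current usage to the reset schedule noted above, the next reset time can be computed from the current time. This is a minimal sketch assuming the reset is exactly Monday 01:00 in America/New_York as stated; `next_weekly_reset` is a hypothetical helper name.

```python
from datetime import datetime, time, timedelta
from zoneinfo import ZoneInfo

RESET_TZ = ZoneInfo("America/New_York")
RESET_WEEKDAY = 0        # Monday (datetime.weekday() convention)
RESET_TIME = time(1, 0)  # 01:00 local time


def next_weekly_reset(now: datetime) -> datetime:
    """Return the first Monday 01:00 (America/New_York) strictly after `now`.

    `now` must be timezone-aware; it is converted to the reset timezone first.
    """
    local = now.astimezone(RESET_TZ)
    days_ahead = (RESET_WEEKDAY - local.weekday()) % 7
    candidate = datetime.combine(
        local.date() + timedelta(days=days_ahead), RESET_TIME, tzinfo=RESET_TZ
    )
    if candidate <= local:
        # Already past this week's reset (Monday at or after 01:00): use next week.
        candidate = datetime.combine(
            candidate.date() + timedelta(days=7), RESET_TIME, tzinfo=RESET_TZ
        )
    return candidate
```

Calling `next_weekly_reset(datetime.now(RESET_TZ))` gives the upcoming reset. Subtracting the current time from it yields the window that remains in the current week.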
 **Added**: 2026-01-16
 ---
 ### Syncthing Monitoring
 **Status**: ⚠️ **Partial** - API available, no automated monitoring