diff --git a/MONITORING.md b/MONITORING.md
index a1eb47f..1f7a94a 100644
--- a/MONITORING.md
+++ b/MONITORING.md
@@ -16,8 +16,9 @@ Documentation for system monitoring, health checks, and alerting across the home
 | **Network** | ✅ Partial | Gateway watchdog | ✅ Auto-reboot | Connectivity check every 60s |
 | **Services** | ❌ No | - | ❌ No | No health checks |
 | **Backups** | ❌ No | - | ❌ No | No verification |
+| **Claude Code** | ✅ Yes | Prometheus + Grafana | ✅ Yes | Token usage, burn rate, cost tracking |
 
-**Overall Status**: ⚠️ **PARTIAL** - Gateway monitoring active, most else is manual
+**Overall Status**: ⚠️ **PARTIAL** - Gateway monitoring active, Claude Code active, most else is manual
 
 ---
 
@@ -87,6 +88,102 @@ ssh ucg-fiber 'free -m && ps -eo pid,rss,comm --sort=-rss | head -12'
 
 ---
 
+### Claude Code Token Monitoring
+
+**Status**: ✅ **Active with alerts**
+
+Monitors Claude Code token usage across all machines to track subscription consumption and prevent hitting weekly limits.
+
+**Architecture**:
+```
+Claude Code (MacBook/Mac Mini)
+       │
+       ▼ (OpenTelemetry Prometheus exporter :9464)
+       │
+Prometheus (docker-host:9090)
+       │
+       ├──► Grafana Dashboard
+       │
+       └──► Alertmanager (burn rate alerts)
+```
+
+**Monitored Devices**:
+| Device | IP Address | Metrics Port |
+|--------|------------|--------------|
+| MacBook | 10.10.10.147 | 9464 |
+| Mac Mini | 10.10.10.123 | 9464 |
+
+**What's monitored**:
+- Token usage (input/output/cache) over time
+- Burn rate (tokens/hour)
+- Cost tracking (USD)
+- Usage by model (Opus, Sonnet, Haiku)
+- Session count
+- Per-device breakdown
+
+**Dashboard**: https://grafana.htsn.io/d/claude-code-usage/claude-code-token-usage
+
+**Alerts Configured**:
+| Alert | Threshold | Severity |
+|-------|-----------|----------|
+| High Burn Rate | >100k tokens/hour for 15min | Warning |
+| Weekly Limit Risk | Projected >5M tokens/week | Critical |
+| No Metrics | Scrape fails for 5min | Info |
+
+**Configuration Files**:
+- Claude settings: `~/.claude/settings.json` (on each Mac)
+- Prometheus scrape: `/opt/monitoring/prometheus/prometheus.yml` (docker-host)
+- Alert rules: `/opt/monitoring/prometheus/rules/claude-code.yml` (docker-host)
+
+**Claude Code Settings** (in `~/.claude/settings.json`):
+```json
+{
+  "env": {
+    "CLAUDE_CODE_ENABLE_TELEMETRY": "1",
+    "OTEL_METRICS_EXPORTER": "prometheus",
+    "OTEL_EXPORTER_PROMETHEUS_PORT": "9464",
+    "OTEL_METRIC_EXPORT_INTERVAL": "60000"
+  }
+}
+```
+
+**Prometheus Scrape Config**:
+```yaml
+- job_name: "claude-code"
+  scrape_interval: 60s
+  static_configs:
+    - targets: ["10.10.10.147:9464"]
+      labels:
+        device: "macbook"
+    - targets: ["10.10.10.123:9464"]
+      labels:
+        device: "mac-mini"
+```
+
+**Useful PromQL Queries**:
+```promql
+# Total tokens this session
+sum(claude_code_token_usage_total)
+
+# Burn rate (tokens/hour)
+sum(rate(claude_code_token_usage_total[1h])) * 3600
+
+# Usage by device
+sum(claude_code_token_usage_total) by (device)
+
+# Projected weekly usage
+sum(increase(claude_code_token_usage_total[24h])) * 7
+```
+
+**Important Notes**:
+- Claude Code must be restarted after changing telemetry settings
+- Metrics only flow while Claude Code is running
+- Weekly subscription resets Monday 1am (America/New_York)
+
+**Added**: 2026-01-16
+
+---
+
 ### Syncthing Monitoring
 
 **Status**: ⚠️ **Partial** - API available, no automated monitoring
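
**Review note (alert rules)**: The diff lists `/opt/monitoring/prometheus/rules/claude-code.yml` and an **Alerts Configured** table, but the rules file itself is not part of this change. As a reference point, here is a minimal sketch of rules that would match that table. It reuses the PromQL expressions and thresholds documented above; the group name, alert names, and annotation text are hypothetical, and the weekly-limit rule has no `for:` clause only because the table does not give one.

```yaml
# Hypothetical sketch of rules/claude-code.yml; not the file from the repo.
# Thresholds and the metric name come from the documentation in the diff;
# group name, alert names, and annotations are assumptions.
groups:
  - name: claude-code
    rules:
      - alert: ClaudeCodeHighBurnRate
        # Same expression as the documented burn-rate query (tokens/hour)
        expr: sum(rate(claude_code_token_usage_total[1h])) * 3600 > 100000
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Claude Code burn rate above 100k tokens/hour"

      - alert: ClaudeCodeWeeklyLimitRisk
        # Same expression as the documented weekly projection (last 24h x 7)
        expr: sum(increase(claude_code_token_usage_total[24h])) * 7 > 5000000
        labels:
          severity: critical
        annotations:
          summary: "Projected Claude Code usage above 5M tokens/week"

      - alert: ClaudeCodeNoMetrics
        # `up` is recorded by Prometheus per target; 0 means the scrape failed
        expr: up{job="claude-code"} == 0
        for: 5m
        labels:
          severity: info
        annotations:
          summary: "Claude Code metrics endpoint unreachable on {{ $labels.device }}"
```

A file like this can be checked with `promtool check rules` before reloading Prometheus. Because metrics only flow while Claude Code is running, a rule based on `up` would fire whenever a Mac is asleep or Claude Code is closed, which fits the Info severity in the table.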