Compare commits: 65b7c48348 ... eddd98c57f

3 commits: eddd98c57f, 56b82df497, 23e9df68c9
.stfolder/syncthing-folder-8be0b5.txt (new file, 5 lines)

```
@@ -0,0 +1,5 @@
# This directory is a Syncthing folder marker.
# Do not delete.

folderID: homelab
created: 2025-12-23T00:39:52-05:00
```
BACKUP-STRATEGY.md (new file, 358 lines)

@@ -0,0 +1,358 @@

# Backup Strategy

## 🚨 Current Status: CRITICAL GAPS IDENTIFIED

This document outlines the backup strategy for the homelab infrastructure. **As of 2025-12-22, there are significant gaps in backup coverage that need to be addressed.**

## Executive Summary

### What We Have ✅
- **Syncthing**: File synchronization across 5+ devices
- **ZFS on TrueNAS**: Copy-on-write filesystem with snapshot capability (not yet configured)
- **Proxmox**: Built-in backup capabilities (not yet configured)

### What We DON'T Have 🚨
- ❌ No documented VM/CT backups
- ❌ No ZFS snapshot schedule
- ❌ No offsite backups
- ❌ No disaster recovery plan
- ❌ No tested restore procedures
- ❌ No configuration backups

**Risk Level**: HIGH - A catastrophic failure could result in significant data loss.

---

## Current State Analysis

### Syncthing (File Synchronization)

**What it is**: Real-time file sync across devices
**What it is NOT**: A backup solution

| Folder | Devices | Size | Protected? |
|--------|---------|------|------------|
| documents | Mac Mini, MacBook, TrueNAS, Windows PC, Phone | 11 GB | ⚠️ Sync only |
| downloads | Mac Mini, TrueNAS | 38 GB | ⚠️ Sync only |
| pictures | Mac Mini, MacBook, TrueNAS, Phone | Unknown | ⚠️ Sync only |
| notes | Mac Mini, MacBook, TrueNAS, Phone | Unknown | ⚠️ Sync only |
| config | Mac Mini, MacBook, TrueNAS | Unknown | ⚠️ Sync only |

**Limitations**:
- ❌ Accidental deletion → deleted everywhere
- ❌ Ransomware/corruption → spreads everywhere
- ❌ No point-in-time recovery
- ❌ No version history (unless file versioning enabled - not documented)

**Verdict**: Syncthing provides redundancy and availability, NOT backup protection.

### ZFS on TrueNAS (Potential Backup Target)

**Current Status**: ❓ Unknown - snapshots may or may not be configured

**Needs Investigation**:
```bash
# Check if snapshots exist
ssh truenas 'zfs list -t snapshot'

# Check if automated snapshots are configured
ssh truenas 'cat /etc/cron.d/zfs-auto-snapshot' || echo "Not configured"

# Check snapshot schedule via TrueNAS API/UI
```

**If configured**, ZFS snapshots provide:
- ✅ Point-in-time recovery
- ✅ Protection against accidental deletion
- ✅ Fast rollback capability
- ⚠️ Still single location (no offsite protection)

### Proxmox VM/CT Backups

**Current Status**: ❓ Unknown - no backup jobs documented

**Needs Investigation**:
```bash
# Check backup configuration
ssh pve 'pvesh get /cluster/backup'

# Check if any backups exist
ssh pve 'ls -lh /var/lib/vz/dump/'
ssh pve2 'ls -lh /var/lib/vz/dump/'
```

**Critical VMs Needing Backup**:

| VM/CT | VMID | Priority | Notes |
|-------|------|----------|-------|
| TrueNAS | 100 | 🔴 CRITICAL | All storage lives here |
| Saltbox | 101 | 🟡 HIGH | Media stack, complex config |
| homeassistant | 110 | 🟡 HIGH | Home automation config |
| gitea-vm | 300 | 🟡 HIGH | Git repositories |
| pihole | 200 | 🟢 MEDIUM | DNS config (easy to rebuild) |
| traefik | 202 | 🟢 MEDIUM | Reverse proxy config |
| trading-vm | 301 | 🟡 HIGH | AI trading platform |
| lmdev1 | 111 | 🟢 LOW | Development (ephemeral) |

---

## Recommended Backup Strategy

### Tier 1: Local Snapshots (IMPLEMENT IMMEDIATELY)

**ZFS Snapshots on TrueNAS**

Schedule automatic snapshots for all datasets:

| Dataset | Frequency | Retention |
|---------|-----------|-----------|
| vault/documents | Every 15 min | 1 hour |
| vault/documents | Hourly | 24 hours |
| vault/documents | Daily | 30 days |
| vault/documents | Weekly | 12 weeks |
| vault/documents | Monthly | 12 months |

**Implementation**:
```bash
# Via TrueNAS UI: Storage → Snapshots → Add
# Or via CLI:
ssh truenas 'zfs snapshot vault/documents@daily-$(date +%Y%m%d)'
```
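
The one-liner above creates a single snapshot; the retention tiers in the table still need a scheduler. TrueNAS's built-in Periodic Snapshot Tasks are the supported way to do this; purely as an illustration of what the daily tier amounts to, here is a minimal cron-style sketch (the `daily-` naming and keep-30 prune policy are assumptions, and the prune is left as a dry run):

```bash
# Crontab sketch on TrueNAS (illustrative; prefer Periodic Snapshot Tasks)
# 10 0 * * * zfs snapshot vault/documents@daily-$(date +\%Y\%m\%d)

# Prune: keep the 30 newest daily snapshots; prints the destroy commands
# instead of running them (remove the `echo` to actually prune)
zfs list -H -t snapshot -o name -s creation vault/documents \
  | grep '@daily-' \
  | awk -v keep=30 '{a[NR]=$0} END {for (i=1; i<=NR-keep; i++) print a[i]}' \
  | while read -r snap; do echo zfs destroy "$snap"; done
```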

**Proxmox VM Backups**

Configure weekly backups to local storage:

```bash
# Create backup job via Proxmox UI:
# Datacenter → Backup → Add
# - Schedule: Weekly (Sunday 2 AM)
# - Storage: local-zfs or nvme-mirror1
# - Mode: Snapshot (fast)
# - Retention: 4 backups
```

**Or via CLI**:
```bash
ssh pve 'pvesh create /cluster/backup --schedule "sun 02:00" --storage local-zfs --mode snapshot --prune-backups keep-last=4'
```

### Tier 2: Offsite Backups (CRITICAL GAP)

**Option A: Cloud Storage (Recommended)**

Use **rclone** or **restic** to sync critical data to cloud:

| Provider | Cost | Pros | Cons |
|----------|------|------|------|
| Backblaze B2 | $6/TB/mo | Cheap, reliable | Egress fees |
| AWS S3 Glacier | $4/TB/mo | Very cheap storage | Slow retrieval |
| Wasabi | $6.99/TB/mo | No egress fees | Minimum 90-day retention |

**Implementation Example (Backblaze B2)**:
```bash
# Install on TrueNAS
ssh truenas 'pkg install rclone restic'

# Configure B2
rclone config  # Follow prompts for B2

# Daily backup of critical folders (crontab entry)
0 3 * * * rclone sync /mnt/vault/documents b2:homelab-backup/documents --transfers 4
```

**Option B: Offsite TrueNAS Replication**

- Set up second TrueNAS at friend/family member's house
- Use ZFS replication to sync snapshots (see the sketch after this list)
- Requires: Static IP or Tailscale, trust
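
For Option B, replication itself is a pipe of `zfs send` into `zfs receive`. A minimal sketch, assuming the remote box is reachable as `truenas-offsite` (hypothetical alias) with a pool also named `vault`, and that snapshot names follow the daily scheme above:

```bash
# First run: full send of the most recent snapshot (names illustrative)
zfs send vault/documents@daily-20251222 \
  | ssh truenas-offsite 'zfs receive -u vault/documents'

# Later runs: incremental send of everything between the last snapshot
# both sides share and the newest local one (-I includes intermediates)
zfs send -I @daily-20251222 vault/documents@daily-20251229 \
  | ssh truenas-offsite 'zfs receive -u vault/documents'
```

TrueNAS can also drive this as a Replication Task in the UI, which handles the incremental bookkeeping and works over SSH or Tailscale.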
**Option C: USB Drive Rotation**

- Weekly backup to external USB drive
- Rotate 2-3 drives (one always offsite)
- Manual but simple (see the restic sketch after this list)
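
A minimal restic sketch for the rotation option, assuming the current drive mounts at `/mnt/usb-backup` (the path and repo location are illustrative):

```bash
# One-time per drive: initialize a restic repository on it
restic init --repo /mnt/usb-backup/restic

# Weekly run: back up, then thin old snapshots with a simple policy
# (restic prompts for the repo password unless RESTIC_PASSWORD is set)
export RESTIC_REPOSITORY=/mnt/usb-backup/restic
restic backup /mnt/vault/documents
restic forget --keep-weekly 8 --keep-monthly 12 --prune
```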
### Tier 3: Configuration Backups

**Proxmox Configuration**

```bash
# Backup /etc/pve (configs are already in the cluster filesystem)
# But also back up to an external location:
ssh pve 'tar czf /tmp/pve-config-$(date +%Y%m%d).tar.gz /etc/pve /etc/network/interfaces /etc/systemd/system/*.service'

# Copy to a safe location
scp pve:/tmp/pve-config-*.tar.gz ~/Backups/proxmox/
```

**VM-Specific Configs**

- Traefik configs: `/etc/traefik/` on CT 202
- Saltbox configs: `/srv/git/saltbox/` on VM 101
- Home Assistant: `/config/` on VM 110

**Script to backup all configs**:
```bash
#!/bin/bash
# Save as ~/bin/backup-homelab-configs.sh

DATE=$(date +%Y%m%d)
BACKUP_DIR=~/Backups/homelab-configs/$DATE

mkdir -p "$BACKUP_DIR"

# Proxmox configs
ssh pve 'tar czf - /etc/pve /etc/network' > "$BACKUP_DIR/pve-config.tar.gz"
ssh pve2 'tar czf - /etc/pve /etc/network' > "$BACKUP_DIR/pve2-config.tar.gz"

# Traefik (CT 202)
ssh pve 'pct exec 202 -- tar czf - /etc/traefik' > "$BACKUP_DIR/traefik-config.tar.gz"

# Saltbox
ssh saltbox 'tar czf - /srv/git/saltbox' > "$BACKUP_DIR/saltbox-config.tar.gz"

# Home Assistant - note: `qm guest exec` JSON-wraps command output, so it
# cannot stream a tarball; fetch over SSH instead (assumes an SSH alias
# `homeassistant` for VM 110 with SSH enabled)
ssh homeassistant 'tar czf - /config' > "$BACKUP_DIR/homeassistant-config.tar.gz"

echo "Configs backed up to $BACKUP_DIR"
```

---

## Disaster Recovery Scenarios

### Scenario 1: Single VM Failure

**Impact**: Medium
**Recovery Time**: 30-60 minutes

1. Restore from Proxmox backup:
   ```bash
   ssh pve 'qmrestore /path/to/backup.vma.zst VMID'
   ```
2. Start VM and verify
3. Update IP if needed

### Scenario 2: TrueNAS Failure

**Impact**: CATASTROPHIC (all storage lost)
**Recovery Time**: Unknown - NO PLAN

**Current State**: 🚨 NO RECOVERY PLAN
**Needed**:
- Offsite backup of critical datasets
- Documented ZFS pool creation steps
- Share configuration export

### Scenario 3: Complete PVE Server Failure

**Impact**: SEVERE
**Recovery Time**: 4-8 hours

**Current State**: ⚠️ PARTIALLY RECOVERABLE
**Needed**:
- VM backups stored on TrueNAS or PVE2
- Proxmox reinstall procedure
- Network config documentation

### Scenario 4: Complete Site Disaster (Fire/Flood)

**Impact**: TOTAL LOSS
**Recovery Time**: Unknown

**Current State**: 🚨 NO RECOVERY PLAN
**Needed**:
- Offsite backups (cloud or physical)
- Critical data prioritization
- Restore procedures

---

## Action Plan

### Immediate (Next 7 Days)

- [ ] **Audit existing backups**: Check if ZFS snapshots or Proxmox backups exist
  ```bash
  ssh truenas 'zfs list -t snapshot'
  ssh pve 'ls -lh /var/lib/vz/dump/'
  ```

- [ ] **Enable ZFS snapshots**: Configure via TrueNAS UI for critical datasets

- [ ] **Configure Proxmox backup jobs**: Weekly backups of critical VMs (100, 101, 110, 300)

- [ ] **Test restore**: Pick one VM, back it up, restore it to verify the process works (see the sketch below)
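
For the test-restore item, one full round trip looks roughly like this (pihole/VMID 200 is picked here only because the table above marks it easy to rebuild; the sketch assumes it is a container, and 999 is a spare VMID):

```bash
# One-off backup of CT 200 to the local dump directory
ssh pve 'vzdump 200 --dumpdir /var/lib/vz/dump --mode snapshot --compress zstd'

# Restore the dump under a spare VMID so the original is untouched
# (use qmrestore instead of pct restore for full VMs)
ssh pve 'pct restore 999 /var/lib/vz/dump/vzdump-lxc-200-*.tar.zst --storage local-zfs'

# Boot it, verify, then throw it away
ssh pve 'pct start 999 && pct status 999'
ssh pve 'pct stop 999 && pct destroy 999'
```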
### Short-term (Next 30 Days)

- [ ] **Set up offsite backup**: Choose provider (Backblaze B2 recommended)

- [ ] **Install backup tools**: rclone or restic on TrueNAS

- [ ] **Configure daily cloud sync**: Critical folders to cloud storage

- [ ] **Document restore procedures**: Step-by-step guides for each scenario

### Long-term (Next 90 Days)

- [ ] **Implement monitoring**: Alerts for backup failures

- [ ] **Quarterly restore test**: Verify backups actually work

- [ ] **Backup rotation policy**: Automate old backup cleanup

- [ ] **Configuration backup automation**: Weekly cron job

---

## Monitoring & Validation

### Backup Health Checks

```bash
# Check last ZFS snapshot
ssh truenas 'zfs list -t snapshot -o name,creation -s creation | tail -5'

# Check Proxmox backup status
ssh pve 'pvesh get /cluster/backup-info/not-backed-up'

# Check cloud sync status (if using rclone)
ssh truenas 'rclone ls b2:homelab-backup | wc -l'
```

### Alerts to Set Up

- Email alert if no snapshot created in 24 hours (a check sketch follows this list)
- Email alert if Proxmox backup fails
- Email alert if cloud sync fails
- Weekly backup status report
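
A minimal sketch of the snapshot-age alert (assumes GNU `date` can parse `zfs list`'s creation column and that local `mail` delivery works; the address is a placeholder):

```bash
#!/bin/bash
# Alert if the newest snapshot of vault/documents is older than 24 hours
newest=$(ssh truenas 'zfs list -H -t snapshot -o creation -s creation vault/documents | tail -1')
age=$(( $(date +%s) - $(date -d "$newest" +%s) ))
if [ "$age" -gt 86400 ]; then
  echo "Newest vault/documents snapshot is $((age / 3600))h old" \
    | mail -s "ZFS snapshot alert" admin@example.com
fi
```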
---

## Cost Estimate

**Monthly Backup Costs**:

| Component | Cost | Notes |
|-----------|------|-------|
| Local storage (already owned) | $0 | Using existing TrueNAS |
| Proxmox backups (local) | $0 | Using existing storage |
| Cloud backup (1 TB) | $6-10/mo | Backblaze B2 or Wasabi |
| **Total** | **~$10/mo** | Minimal cost for peace of mind |

**One-time**:
- External USB drives (3x 4TB): ~$300 (optional, for rotation backups)

---

## Related Documentation

- [STORAGE.md](STORAGE.md) - ZFS pool layouts and capacity
- [VMS.md](VMS.md) - VM inventory and prioritization
- [DISASTER-RECOVERY.md](#) - Recovery procedures (coming soon)

---

**Last Updated**: 2025-12-22
**Status**: 🚨 CRITICAL GAPS - IMMEDIATE ACTION REQUIRED
CHANGELOG.md (14 lines changed)

````diff
@@ -36,12 +36,12 @@ Investigated UPS power limit issues across both Proxmox servers.
 [Unit]
 Description=Disable KSM (Kernel Same-page Merging)
 After=multi-user.target
 
 [Service]
 Type=oneshot
 ExecStart=/bin/sh -c "echo 0 > /sys/kernel/mm/ksm/run"
 RemainAfterExit=yes
 
 [Install]
 WantedBy=multi-user.target
 ```
@@ -108,12 +108,12 @@ curl -X POST -H "X-API-Key: xxx" http://localhost:20910/rest/system/restart
 [Unit]
 Description=Set CPU governor to powersave with balance_power EPP
 After=multi-user.target
 
 [Service]
 Type=oneshot
 ExecStart=/bin/bash -c 'for gov in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo powersave > "$gov"; done; for epp in /sys/devices/system/cpu/cpu*/cpufreq/energy_performance_preference; do echo balance_power > "$epp"; done'
 RemainAfterExit=yes
 
 [Install]
 WantedBy=multi-user.target
 ```
@@ -127,12 +127,12 @@ curl -X POST -H "X-API-Key: xxx" http://localhost:20910/rest/system/restart
 [Unit]
 Description=Set CPU governor to schedutil for power savings
 After=multi-user.target
 
 [Service]
 Type=oneshot
 ExecStart=/bin/bash -c 'for gov in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo schedutil > "$gov"; done'
 RemainAfterExit=yes
 
 [Install]
 WantedBy=multi-user.target
 ```
@@ -194,4 +194,4 @@ Not useful when:
 - `general_profit` is negative
 
 ### What is Memory Ballooning?
 Guest-cooperative memory management. Hypervisor can request VMs to give back unused RAM. Independent from KSMD. Both are Proxmox/KVM memory optimization features but serve different purposes.
````
GATEWAY.md (new file, 339 lines)

@@ -0,0 +1,339 @@

# UniFi Gateway (UCG-Fiber)

Documentation for the UniFi Cloud Gateway Fiber (10.10.10.1) - the primary network gateway and router.

## Overview

| Property | Value |
|----------|-------|
| **Device** | UniFi Cloud Gateway Fiber (UCG-Fiber) |
| **IP Address** | 10.10.10.1 |
| **SSH User** | root |
| **SSH Auth** | SSH key (`~/.ssh/id_ed25519`) |
| **Host Aliases** | `ucg-fiber`, `gateway` |
| **Firmware** | v4.4.9 (as of 2026-01-02) |
| **UniFi Core** | 4.4.19 |
| **RAM** | 2.9 GB (shared with UniFi apps) |

---

## SSH Access

SSH key authentication is configured. Use host aliases:

```bash
# Quick access
ssh ucg-fiber 'hostname'
ssh gateway 'free -m'

# Or use IP directly
ssh root@10.10.10.1 'uptime'
```

**Note**: The SSH key may need re-deployment after firmware updates if UniFi clears authorized_keys.

---

## Monitoring Services

Two custom monitoring services run on the gateway to prevent and diagnose issues.

### Internet Watchdog Service

**Purpose**: Auto-reboots the gateway if internet connectivity is lost for 5+ minutes

**Location**: `/data/scripts/internet-watchdog.sh`

**How it works** (a sketch of the loop follows this list):
1. Pings 1.1.1.1, 8.8.8.8, 208.67.222.222 every 60 seconds
2. If all three fail, increments a failure counter
3. After 5 consecutive failures (~5 minutes), triggers a reboot
4. Logs all activity to `/var/log/internet-watchdog.log`
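
The deployed script itself isn't shown in this commit range; a minimal reconstruction of the logic above (the log lines match the format samples below, everything else is an assumption):

```bash
#!/bin/sh
# Sketch of /data/scripts/internet-watchdog.sh (reconstruction, not the
# deployed script): reboot after 5 consecutive failed connectivity checks.
LOG=/var/log/internet-watchdog.log
FAILS=0
echo "$(date '+%F %T') - Watchdog started" >> "$LOG"
while true; do
    ok=0
    for host in 1.1.1.1 8.8.8.8 208.67.222.222; do
        ping -c1 -W2 "$host" >/dev/null 2>&1 && ok=1 && break
    done
    if [ "$ok" = 1 ]; then
        [ "$FAILS" -gt 0 ] && echo "$(date '+%F %T') - Internet restored after $FAILS failures" >> "$LOG"
        FAILS=0
    else
        FAILS=$((FAILS + 1))
        echo "$(date '+%F %T') - Internet check failed ($FAILS/5)" >> "$LOG"
        [ "$FAILS" -ge 5 ] && { echo "$(date '+%F %T') - Rebooting" >> "$LOG"; reboot; }
    fi
    sleep 60
done
```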
**Commands**:
```bash
# Check service status
ssh ucg-fiber 'systemctl status internet-watchdog'

# View recent logs
ssh ucg-fiber 'tail -50 /var/log/internet-watchdog.log'

# Stop temporarily (if troubleshooting)
ssh ucg-fiber 'systemctl stop internet-watchdog'

# Restart
ssh ucg-fiber 'systemctl restart internet-watchdog'
```

**Log Format**:
```
2026-01-02 22:45:01 - Watchdog started
2026-01-02 22:46:01 - Internet check failed (1/5)
2026-01-02 22:47:01 - Internet restored after 1 failures
```

---

### Memory Monitor Service

**Purpose**: Logs memory usage and top processes every 10 minutes for diagnostics

**Location**: `/data/scripts/memory-monitor.sh`

**Log File**: `/data/logs/memory-history.log`

**How it works** (a sketch follows this list):
1. Every 10 minutes, logs current memory usage (`free -m`)
2. Logs the top 12 memory-consuming processes
3. Auto-rotates the log when it exceeds 10MB (keeps one .old file)
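
As with the watchdog, the deployed script isn't shown; a minimal reconstruction of the described behavior:

```bash
#!/bin/sh
# Sketch of /data/scripts/memory-monitor.sh (reconstruction, not the
# deployed script): snapshot memory state every 10 minutes, rotate at 10MB.
LOG=/data/logs/memory-history.log
while true; do
    # Rotate if the log has grown past 10MB, keeping one .old copy
    if [ -f "$LOG" ] && [ "$(wc -c < "$LOG")" -gt 10485760 ]; then
        mv "$LOG" "$LOG.old"
    fi
    {
        echo "========== $(date '+%F %T') =========="
        echo "--- MEMORY ---"
        free -m
        echo "--- TOP MEMORY PROCESSES ---"
        ps -eo pid,rss,comm --sort=-rss | head -13   # header + 12 processes
    } >> "$LOG"
    sleep 600
done
```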
**Commands**:
```bash
# Check service status
ssh ucg-fiber 'systemctl status memory-monitor'

# View recent memory history
ssh ucg-fiber 'tail -100 /data/logs/memory-history.log'

# Check current memory usage
ssh ucg-fiber 'free -m'

# See top memory consumers right now
ssh ucg-fiber 'ps -eo pid,rss,comm --sort=-rss | head -12'
```

**Log Format**:
```
========== 2026-01-02 22:30:00 ==========
--- MEMORY ---
              total        used        free      shared  buff/cache   available
Mem:           2892        1890         102         456         899        1002
Swap:           512          88         424
--- TOP MEMORY PROCESSES ---
 PID    RSS COMMAND
1234 327456 unifi-protect
2345 252108 mongod
3456 236544 java
...
```

---

## Known Memory Consumers

| Process | Typical Memory | Purpose |
|---------|----------------|---------|
| unifi-protect | ~320 MB | Camera/NVR management |
| mongod | ~250 MB | UniFi configuration database |
| java (controller) | ~230 MB | UniFi Network controller |
| postgres | ~180 MB | PostgreSQL database |
| unifi-core | ~150 MB | UniFi OS core |
| tailscaled | ~80 MB | Tailscale VPN |

**Total available**: ~2.9 GB
**Typical usage**: ~1.8-2.0 GB (leaves ~1 GB free)
**Warning threshold**: <500 MB free
**Critical**: <200 MB free or swap >50% used

---

## Disabled Services

The following services were disabled to reduce memory usage:

| Service | Memory Saved | Reason Disabled |
|---------|--------------|-----------------|
| UniFi Connect | ~200 MB | Not needed (cameras use Protect) |

To re-enable if needed:
```bash
ssh ucg-fiber 'systemctl enable unifi-connect && systemctl start unifi-connect'
```

---

## Common Issues

### Gateway Freeze / Network Loss

**Symptoms**:
- All devices lose internet
- Cannot ping 10.10.10.1
- Physical reboot required

**Root Cause**: Memory exhaustion causing soft lockup

**Prevention**:
1. Internet watchdog auto-reboots after 5 min outage
2. Memory monitor logs help identify runaway processes
3. UniFi Connect disabled to free ~200 MB

**Post-Incident Analysis**:
```bash
# Check memory history for spike before freeze
ssh ucg-fiber 'grep -B5 "Swap:" /data/logs/memory-history.log | tail -50'

# Check watchdog logs
ssh ucg-fiber 'cat /var/log/internet-watchdog.log'

# Check system logs for errors
ssh ucg-fiber 'dmesg | tail -100'
ssh ucg-fiber 'journalctl -p err --since "1 hour ago"'
```

---

### High Memory Usage

**Check current state**:
```bash
ssh ucg-fiber 'free -m && echo "---" && ps -eo pid,rss,comm --sort=-rss | head -15'
```

**If swap is heavily used**:
```bash
# Check swap usage
ssh ucg-fiber 'cat /proc/swaps'

# See what's in swap
ssh ucg-fiber 'for pid in $(ls /proc | grep -E "^[0-9]+$"); do
  swap=$(grep VmSwap /proc/$pid/status 2>/dev/null | awk "{print \$2}");
  [ "$swap" -gt 10000 ] 2>/dev/null && echo "$pid: ${swap}kB - $(cat /proc/$pid/comm)";
done | sort -t: -k2 -rn | head -10'
```

**Consider reboot if**:
- Available memory <200 MB
- Swap usage >300 MB
- System becoming unresponsive

---

### Tailscale Issues

**Check Tailscale status**:
```bash
ssh ucg-fiber 'tailscale status'
```

**Common errors and fixes**:

| Error | Fix |
|-------|-----|
| `DNS resolution failed` | Check upstream DNS (Pi-hole at 10.10.10.10) |
| `TLS handshake failed` | Usually temporary; Tailscale auto-reconnects |
| `Not connected` | `ssh ucg-fiber 'tailscale up'` |

---

## Firmware Updates

**Check current version**:
```bash
ssh ucg-fiber 'ubnt-systool version'
```

**Update process**:
1. Check UniFi site for latest stable firmware
2. Download via UI or CLI
3. Schedule update during low-usage time

**After update**:
- Verify SSH key still works
- Check custom services still running
- Verify Tailscale reconnects

**Re-deploy SSH key if needed**:
```bash
ssh-copy-id -i ~/.ssh/id_ed25519 root@10.10.10.1
```

---

## Service Locations

| File | Purpose |
|------|---------|
| `/data/scripts/internet-watchdog.sh` | Watchdog script |
| `/data/scripts/memory-monitor.sh` | Memory monitor script |
| `/etc/systemd/system/internet-watchdog.service` | Watchdog systemd unit |
| `/etc/systemd/system/memory-monitor.service` | Memory monitor systemd unit |
| `/var/log/internet-watchdog.log` | Watchdog log |
| `/data/logs/memory-history.log` | Memory history log |

**Note**: `/data/` persists across firmware updates. `/var/log/` may not.

---

## Quick Reference Commands

```bash
# System status
ssh ucg-fiber 'uptime && free -m'

# Check both monitoring services
ssh ucg-fiber 'systemctl status internet-watchdog memory-monitor'

# Memory history (last hour)
ssh ucg-fiber 'tail -60 /data/logs/memory-history.log'

# Watchdog activity
ssh ucg-fiber 'tail -20 /var/log/internet-watchdog.log'

# Network devices (ARP table)
ssh ucg-fiber 'cat /proc/net/arp'

# Tailscale status
ssh ucg-fiber 'tailscale status'

# System logs
ssh ucg-fiber 'journalctl -p warning --since "1 hour ago" | head -50'
```

---

## Backup Considerations

Custom services in `/data/scripts/` persist across firmware updates but may need:
- Systemd services re-enabled after major updates
- Script permissions re-applied if wiped

**Backup critical files**:
```bash
# Copy scripts locally for reference
scp ucg-fiber:/data/scripts/*.sh ~/Projects/homelab/data/scripts/
```

---

## Related Documentation

- [SSH-ACCESS.md](SSH-ACCESS.md) - SSH configuration and host aliases
- [NETWORK.md](NETWORK.md) - Network architecture
- [MONITORING.md](MONITORING.md) - Overall monitoring strategy
- [HOMEASSISTANT.md](HOMEASSISTANT.md) - Home Assistant integration

---

## Incident History

### 2025-12-27 to 2025-12-29: Gateway Freeze

**Timeline**:
- Dec 7: Firmware update to v4.4.9
- Dec 24: Last healthy system logs
- Dec 27-29: "No internet detected" errors in logs
- Dec 29+: Complete silence (gateway frozen)
- Jan 2: Physical reboot restored access

**Root Cause**: Memory exhaustion causing soft lockup (no crash dump saved)

**Resolution**:
- Deployed internet-watchdog service
- Deployed memory-monitor service
- Disabled UniFi Connect (~200 MB saved)
- Configured SSH key auth

---

**Last Updated**: 2026-01-02
HARDWARE.md (new file, 455 lines)

@@ -0,0 +1,455 @@

# Hardware Inventory

Complete hardware specifications for all homelab equipment.

## Servers

### PVE (10.10.10.120) - Primary Proxmox Server

#### CPU
- **Model**: AMD Ryzen Threadripper PRO 3975WX
- **Cores**: 32 cores / 64 threads
- **Base Clock**: 3.5 GHz
- **Boost Clock**: 4.2 GHz
- **TDP**: 280W
- **Architecture**: Zen 2 (7nm)
- **Socket**: sTRX4
- **Features**: ECC support, PCIe 4.0

#### RAM
- **Capacity**: 128 GB
- **Type**: DDR4 ECC Registered
- **Speed**: Unknown (needs investigation)
- **Channels**: 8-channel
- **Idle Power**: ~30-40W

#### Storage

**OS/VM Storage:**

| Pool | Devices | Type | Capacity | Purpose |
|------|---------|------|----------|---------|
| `nvme-mirror1` | 2x Sabrent Rocket Q NVMe | ZFS Mirror | 3.6 TB usable | High-performance VM storage |
| `nvme-mirror2` | 2x Kingston SFYRD 2TB NVMe | ZFS Mirror | 1.8 TB usable | Additional fast VM storage |
| `rpool` | 2x Samsung 870 QVO 4TB SSD | ZFS Mirror | 3.6 TB usable | Proxmox OS, containers, backups |

**Total Storage**: ~9 TB usable

#### GPUs

| Model | Slot | VRAM | TDP | Purpose | Passed To |
|-------|------|------|-----|---------|-----------|
| NVIDIA Quadro P2000 | PCIe slot 1 | 5 GB GDDR5 | 75W | Plex transcoding | Host |
| NVIDIA TITAN RTX | PCIe slot 2 | 24 GB GDDR6 | 280W | AI workloads | Saltbox (101), lmdev1 (111) |

**Total GPU Power**: 75W + 280W = 355W (under load)

#### Network Cards

| Interface | Model | Speed | Purpose | Bridge |
|-----------|-------|-------|---------|--------|
| enp1s0 | Intel I210 (onboard) | 1 Gb | Management | vmbr0 |
| enp35s0f0 | Intel X520 (dual-port SFP+) | 10 Gb | High-speed LXC | vmbr1 |
| enp35s0f1 | Intel X520 (dual-port SFP+) | 10 Gb | High-speed VM | vmbr2 |

**10Gb Transceivers**: Intel FTLX8571D3BCV (SFP+ 10GBASE-SR, 850nm, multimode)

#### Storage Controllers

| Model | Interface | Purpose |
|-------|-----------|---------|
| LSI SAS2308 HBA | PCIe 3.0 x8 | Passed to TrueNAS VM for EMC enclosure |
| Samsung NVMe controller | PCIe | Passed to TrueNAS VM for ZFS caching |

#### Motherboard
- **Model**: Unknown - needs investigation
- **Chipset**: AMD TRX40
- **Form Factor**: ATX/EATX
- **PCIe Slots**: Multiple PCIe 4.0 slots
- **Features**: IOMMU support, ECC memory

#### Power Supply
- **Model**: Unknown
- **Wattage**: Likely 1000W+ (needs investigation)
- **Type**: ATX, 80+ certification unknown

#### Cooling
- **CPU Cooler**: Unknown - likely large tower or AIO
- **Case Fans**: Unknown quantity
- **Note**: CPU temps 70-80°C under load (healthy)

---

### PVE2 (10.10.10.102) - Secondary Proxmox Server

#### CPU
- **Model**: AMD Ryzen Threadripper PRO 3975WX
- **Specs**: Same as PVE (32C/64T, 280W TDP)

#### RAM
- **Capacity**: 128 GB DDR4 ECC
- **Same specs as PVE**

#### Storage

| Pool | Devices | Type | Capacity | Purpose |
|------|---------|------|----------|---------|
| `nvme-mirror3` | 2x NVMe (model unknown) | ZFS Mirror | Unknown | High-performance VM storage |
| `local-zfs2` | 2x WD Red 6TB HDD | ZFS Mirror | ~6 TB usable | Bulk/archival storage (spins down) |

**HDD Spindown**: Configured for 30-min idle spindown, saving ~10-16W (see the hdparm sketch below)
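
The doc doesn't record how the spindown is implemented; a typical `hdparm` sketch (device names illustrative):

```bash
# Spin down after 30 minutes of idle: in hdparm's -S encoding,
# values 241-251 mean (n-240) * 30 minutes, so 241 = 30 min.
hdparm -S 241 /dev/sda
hdparm -S 241 /dev/sdb

# Check whether a drive is currently active or in standby
hdparm -C /dev/sda
```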
#### GPUs

| Model | Slot | VRAM | TDP | Purpose | Passed To |
|-------|------|------|-----|---------|-----------|
| NVIDIA RTX A6000 | PCIe slot 1 | 48 GB GDDR6 | 300W | AI trading workloads | trading-vm (301) |

#### Network Cards

| Interface | Model | Speed | Purpose |
|-----------|-------|-------|---------|
| nic1 | Unknown (onboard) | 1 Gb | Management |

**Note**: MTU set to 9000 for jumbo frames

#### Motherboard
- **Model**: Unknown
- **Chipset**: AMD TRX40
- **Similar to PVE**

---

## Network Equipment

### UniFi Cloud Gateway Fiber (UCG-Fiber)

- **Model**: UniFi Cloud Gateway Fiber
- **IP**: 10.10.10.1
- **Ports**: Multiple 1Gb + SFP+ uplink
- **Features**: Router, firewall, VPN, IDS/IPS
- **MTU**: 9216 (supports jumbo frames)
- **Tailscale**: Installed for VPN failover

### Switches

**Details needed** - investigate current switch setup:
- 10Gb switch for high-speed connections?
- 1Gb switch for general devices?
- PoE capabilities?

```bash
# Check what's connected to 10Gb interfaces
ssh pve 'ip link show enp35s0f0'
ssh pve 'ip link show enp35s0f1'
```

---

## Storage Hardware

### EMC Storage Enclosure

**See [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md) for complete details**

- **Model**: EMC KTN-STL4 (or similar)
- **Form Factor**: 4U rackmount
- **Drive Bays**: 25x 3.5" SAS/SATA
- **Controllers**: Dual LCC (Link Control Cards)
- **Connection**: SAS via LSI SAS2308 HBA
- **Passed to**: TrueNAS VM (VMID 100)

**Current Status**:
- LCC A: Active (working)
- LCC B: Failed (replacement ordered)

**Drive Inventory**: Unknown - needs audit

```bash
# Get drive list from TrueNAS
ssh truenas 'smartctl --scan'
ssh truenas 'lsblk'
```

### NVMe Drives

| Model | Quantity | Capacity | Location | Pool |
|-------|----------|----------|----------|------|
| Sabrent Rocket Q | 2 | Unknown | PVE | nvme-mirror1 |
| Kingston SFYRD | 2 | 2 TB each | PVE | nvme-mirror2 |
| Unknown model | 2 | Unknown | PVE2 | nvme-mirror3 |
| Samsung (model unknown) | 1 | Unknown | TrueNAS (passed) | ZFS cache |

### SSDs

| Model | Quantity | Capacity | Location | Pool |
|-------|----------|----------|----------|------|
| Samsung 870 QVO | 2 | 4 TB each | PVE | rpool |

### HDDs

| Model | Quantity | Capacity | Location | Pool |
|-------|----------|----------|----------|------|
| WD Red | 2 | 6 TB each | PVE2 | local-zfs2 |
| Unknown (in EMC) | Unknown | Unknown | TrueNAS | vault |

---

## UPS

### Current UPS

| Specification | Value |
|---------------|-------|
| **Model** | CyberPower OR2200PFCRT2U |
| **Capacity** | 2200VA / 1320W |
| **Form Factor** | 2U rackmount |
| **Input** | NEMA 5-15P (rewired from 5-20P) |
| **Outlets** | 2x 5-20R + 6x 5-15R |
| **Output** | PFC Sinewave |
| **Runtime** | ~15-20 min @ 33% load |
| **Interface** | USB (connected to PVE) |

**See [UPS.md](UPS.md) for configuration details**

---

## Client Devices

### Mac Mini (Hutson's Workstation)

- **Model**: Unknown generation
- **CPU**: Unknown
- **RAM**: Unknown
- **Storage**: Unknown
- **Network**: 1Gb Ethernet (en0) - MTU 9000
- **Tailscale IP**: 100.108.89.58
- **Local IP**: 10.10.10.125 (static)
- **Purpose**: Primary workstation, Happy Coder daemon host

### MacBook (Mobile)

- **Model**: Unknown
- **Network**: Wi-Fi + Ethernet adapter
- **Tailscale IP**: Unknown
- **Purpose**: Mobile work, development

### Windows PC

- **Model**: Unknown
- **CPU**: Unknown
- **Network**: 1Gb Ethernet
- **IP**: 10.10.10.150
- **Purpose**: Gaming, Windows development, Syncthing node

### Phone (Android)

- **Model**: Unknown
- **IP**: 10.10.10.54 (when on Wi-Fi)
- **Purpose**: Syncthing mobile node, Happy Coder client

---

## Rack Layout (If Applicable)

**Needs documentation** - Current rack configuration unknown

Suggested format:
```
U42: Blank panel
U41: UPS (CyberPower 2U)
U40: UPS (CyberPower 2U)
U39: Switch (10Gb)
U38-U35: EMC Storage Enclosure (4U)
U34: PVE Server
U33: PVE2 Server
...
```

---

## Power Consumption

### Measured Power Draw

| Component | Idle | Typical | Peak | Notes |
|-----------|------|---------|------|-------|
| PVE Server | 250-350W | 500W | 750W | CPU + GPUs + storage |
| PVE2 Server | 200-300W | 400W | 600W | CPU + GPU + storage |
| Network Gear | ~50W | ~50W | ~50W | Router + switches |
| **Total** | **500-700W** | **~950W** | **~1400W** | Exceeds UPS under peak load |

**UPS Capacity**: 1320W
**Typical Load**: 33-50% (safe margin)
**Peak Load**: Can exceed UPS capacity temporarily (acceptable)

### Power Optimizations Applied

**See [POWER-MANAGEMENT.md](POWER-MANAGEMENT.md) for details**

- KSMD disabled: ~60-80W saved
- CPU governors: ~60-120W saved
- Syncthing rescans: ~60-80W saved
- HDD spindown: ~10-16W saved when idle
- **Total savings**: ~150-300W

---

## Thermal Management

### CPU Cooling

**PVE & PVE2**:
- CPU cooler: Unknown model
- Thermal paste: Unknown, likely needs refresh if temps >85°C
- Target temp: 70-80°C under load
- Max safe: 90°C Tctl (Threadripper PRO spec)

### GPU Cooling

All GPUs are passively managed (stock coolers):
- TITAN RTX: 2-3W idle, 280W load
- RTX A6000: 11W idle, 300W load
- Quadro P2000: 25W constant (Plex active)

### Case Airflow

**Unknown** - needs investigation:
- Case model?
- Fan configuration?
- Positive or negative pressure?

---

## Cable Management

### Network Cables

| Connection | Type | Length | Speed |
|------------|------|--------|-------|
| PVE → Switch (10Gb) | OM3 fiber | Unknown | 10Gb |
| PVE2 → Router | Cat6 | Unknown | 1Gb |
| Mac Mini → Switch | Cat6 | Unknown | 1Gb |
| TrueNAS → EMC | SAS cable | Unknown | 6Gb/s |

### Power Cables

**Critical**: All servers on UPS battery-backed outlets

---

## Maintenance Schedule

### Annual Maintenance

- [ ] Clean dust from servers (every 6-12 months)
- [ ] Check thermal paste on CPUs (every 2-3 years)
- [ ] Test UPS battery runtime (annually)
- [ ] Verify all fans operational
- [ ] Check for bulging capacitors on PSUs

### Drive Health

```bash
# Check SMART status on all drives
ssh pve 'smartctl -a /dev/nvme0'
ssh pve2 'smartctl -a /dev/sda'
ssh truenas 'smartctl --scan | while read dev type; do echo "=== $dev ==="; smartctl -a $dev | grep -E "Model|Serial|Health|Reallocated|Current_Pending"; done'
```

### Temperature Monitoring

```bash
# Check all temps (needs lm-sensors installed)
ssh pve 'sensors'
ssh pve2 'sensors'
```

---

## Warranty & Purchase Info

**Needs documentation**:
- When were servers purchased?
- Where were components bought?
- Any warranties still active?
- Replacement part sources?

---

## Upgrade Path

### Short-term Upgrades (< 6 months)

- [ ] 20A circuit for UPS (restore original 5-20P plug)
- [ ] Document missing hardware specs
- [ ] Label all cables
- [ ] Create rack diagram

### Medium-term Upgrades (6-12 months)

- [ ] Additional 10Gb NIC for PVE2?
- [ ] More NVMe storage?
- [ ] Upgrade network switches?
- [ ] Replace EMC enclosure with newer model?

### Long-term Upgrades (1-2 years)

- [ ] CPU upgrade to newer Threadripper?
- [ ] RAM expansion to 256GB?
- [ ] Additional GPU for AI workloads?
- [ ] Migrate to PCIe 5.0 storage?

---

## Investigation Needed

High-priority items to document:

- [ ] Get exact motherboard model (both servers)
- [ ] Get PSU model and wattage
- [ ] CPU cooler models
- [ ] Network switch models and configuration
- [ ] Complete drive inventory in EMC enclosure
- [ ] RAM speed and timings
- [ ] Case models
- [ ] Exact NVMe models for all drives

**Commands to gather info**:

```bash
# Motherboard
ssh pve 'dmidecode -t baseboard'

# CPU details
ssh pve 'lscpu'

# RAM details
ssh pve 'dmidecode -t memory | grep -E "Size|Speed|Manufacturer"'

# Storage devices
ssh pve 'lsblk -o NAME,SIZE,TYPE,TRAN,MODEL'

# Network cards
ssh pve 'lspci | grep -i network'

# GPU details
ssh pve 'lspci | grep -i vga'
ssh pve 'nvidia-smi -L'  # If nvidia-smi available
```

---

## Related Documentation

- [VMS.md](VMS.md) - VM resource allocation
- [STORAGE.md](STORAGE.md) - Storage pools and usage
- [POWER-MANAGEMENT.md](POWER-MANAGEMENT.md) - Power optimizations
- [UPS.md](UPS.md) - UPS configuration
- [NETWORK.md](NETWORK.md) - Network configuration
- [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md) - Storage enclosure details

---

**Last Updated**: 2025-12-22
**Status**: ⚠️ Incomplete - many specs need investigation
HOMEASSISTANT.md

````diff
@@ -131,6 +131,42 @@ curl -s -H "Authorization: Bearer $HA_TOKEN" \
 - **Philips Hue** - Lights
 - **Sonos** - Speakers
 - **Motion Sensors** - Various locations
+- **NUT (Network UPS Tools)** - UPS monitoring (added 2025-12-21)
+
+### NUT / UPS Integration
+
+Monitors the CyberPower OR2200PFCRT2U UPS connected to PVE.
+
+**Connection:**
+- Host: 10.10.10.120
+- Port: 3493
+- Username: upsmon
+- Password: upsmon123
+
+**Entities:**
+| Entity ID | Description |
+|-----------|-------------|
+| `sensor.cyberpower_battery_charge` | Battery percentage |
+| `sensor.cyberpower_load` | Current load % |
+| `sensor.cyberpower_input_voltage` | Input voltage |
+| `sensor.cyberpower_output_voltage` | Output voltage |
+| `sensor.cyberpower_status` | Status (Online, On Battery, etc.) |
+| `sensor.cyberpower_status_data` | Raw status (OL, OB, LB, CHRG) |
+
+**Dashboard Card Example:**
+```yaml
+type: entities
+title: UPS Status
+entities:
+  - entity: sensor.cyberpower_status
+    name: Status
+  - entity: sensor.cyberpower_battery_charge
+    name: Battery
+  - entity: sensor.cyberpower_load
+    name: Load
+  - entity: sensor.cyberpower_input_voltage
+    name: Input Voltage
+```
+
 ## Automations
````
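
The NUT connection above can be sanity-checked from any LAN machine with `upsc` (nut-client); the UPS name `cyberpower` matches the health-check script in MAINTENANCE.md:

```bash
# Query the NUT server on PVE directly
upsc cyberpower@10.10.10.120 ups.status
upsc cyberpower@10.10.10.120 battery.charge

# Or dump every variable the UPS exposes
upsc cyberpower@10.10.10.120
```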
IP address inventory

````diff
@@ -45,6 +45,7 @@ This document tracks all IP addresses in the homelab infrastructure.
 |------|------|------------|---------|--------|
 | 300 | gitea-vm | 10.10.10.220 | Git server | Running |
 | 301 | trading-vm | 10.10.10.221 | AI trading platform (RTX A6000) | Running |
+| 302 | docker-host2 | 10.10.10.207 | Docker services (n8n, future apps) | Running |
 
 ## Workstations & Personal Devices
 
@@ -69,6 +70,9 @@ This document tracks all IP addresses in the homelab infrastructure.
 | CopyParty | cp.htsn.io | 10.10.10.201:3923 | Traefik-Primary |
 | LMDev | lmdev.htsn.io | 10.10.10.111 | Traefik-Primary |
 | Excalidraw | excalidraw.htsn.io | 10.10.10.206:8080 | Traefik-Primary |
+| MetaMCP | metamcp.htsn.io | 10.10.10.207:12008 | Traefik-Primary |
+| n8n | n8n.htsn.io | 10.10.10.207:5678 | Traefik-Primary |
+| Crafty Controller | mc.htsn.io | 10.10.10.207:8443 | Traefik-Primary |
 | Plex | plex.htsn.io | 10.10.10.100:32400 | Traefik-Saltbox |
 | Sonarr | sonarr.htsn.io | 10.10.10.100:8989 | Traefik-Saltbox |
 | Radarr | radarr.htsn.io | 10.10.10.100:7878 | Traefik-Saltbox |
@@ -92,6 +96,7 @@ This document tracks all IP addresses in the homelab infrastructure.
 - .200 - TrueNAS
 - .201 - CopyParty
 - .206 - Docker-host
+- .207 - Docker-host2
 - .220 - Gitea
 - .221 - Trading VM
 - .250 - Traefik-Primary
@@ -110,7 +115,7 @@ This document tracks all IP addresses in the homelab infrastructure.
 - 10.10.10.148 - 10.10.10.149 (2 IPs)
 - 10.10.10.151 - 10.10.10.199 (49 IPs)
 - 10.10.10.202 - 10.10.10.205 (4 IPs)
-- 10.10.10.207 - 10.10.10.219 (13 IPs)
+- 10.10.10.208 - 10.10.10.219 (12 IPs)
 - 10.10.10.222 - 10.10.10.249 (28 IPs)
 - 10.10.10.251 - 10.10.10.254 (4 IPs)
 
@@ -123,6 +128,18 @@ This document tracks all IP addresses in the homelab infrastructure.
 | Portainer Agent | 9001 | Remote management from other Portainer |
 | Gotenberg | 3000 | PDF generation API |
 
+## Docker Host 2 Services (10.10.10.207) - PVE2
+
+| Service | Port | Purpose |
+|---------|------|---------|
+| MetaMCP | 12008 | MCP Aggregator/Gateway (metamcp.htsn.io) |
+| n8n | 5678 | Workflow automation |
+| Crafty Controller | 8443 | Minecraft server management (mc.htsn.io) |
+| Minecraft Java | 25565 | Minecraft Java Edition server |
+| Minecraft Bedrock | 19132/udp | Minecraft Bedrock Edition (Geyser) |
+| Trading Redis | 6379 | Redis for trading platform |
+| Trading TimescaleDB | 5433 | TimescaleDB for trading platform |
+
 ## Syncthing API Endpoints
 
 | Device | IP | Port | API Key |
````
618
MAINTENANCE.md
Normal file
618
MAINTENANCE.md
Normal file
@@ -0,0 +1,618 @@
|
|||||||
|
# Maintenance Procedures and Schedules
|
||||||
|
|
||||||
|
Regular maintenance procedures for homelab infrastructure to ensure reliability and performance.
|
||||||
|
|
||||||
|
## Overview
|
||||||
|
|
||||||
|
| Frequency | Tasks | Estimated Time |
|
||||||
|
|-----------|-------|----------------|
|
||||||
|
| **Daily** | Quick health check | 2-5 min |
|
||||||
|
| **Weekly** | Service status, logs review | 15-30 min |
|
||||||
|
| **Monthly** | Updates, backups verification | 1-2 hours |
|
||||||
|
| **Quarterly** | Full system audit, testing | 2-4 hours |
|
||||||
|
| **Annual** | Hardware maintenance, planning | 4-8 hours |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Daily Maintenance (Automated)
|
||||||
|
|
||||||
|
### Quick Health Check Script
|
||||||
|
|
||||||
|
Save as `~/bin/homelab-health-check.sh`:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
#!/bin/bash
|
||||||
|
# Daily homelab health check
|
||||||
|
|
||||||
|
echo "=== Homelab Health Check ==="
|
||||||
|
echo "Date: $(date)"
|
||||||
|
echo ""
|
||||||
|
|
||||||
|
echo "=== Server Status ==="
|
||||||
|
ssh pve 'uptime' 2>/dev/null || echo "PVE: UNREACHABLE"
|
||||||
|
ssh pve2 'uptime' 2>/dev/null || echo "PVE2: UNREACHABLE"
|
||||||
|
echo ""
|
||||||
|
|
||||||
|
echo "=== CPU Temperatures ==="
|
||||||
|
ssh pve 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE: $(($(cat $f)/1000))°C"; fi; done'
|
||||||
|
ssh pve2 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE2: $(($(cat $f)/1000))°C"; fi; done'
|
||||||
|
echo ""
|
||||||
|
|
||||||
|
echo "=== UPS Status ==="
|
||||||
|
ssh pve 'upsc cyberpower@localhost | grep -E "battery.charge:|battery.runtime:|ups.load:|ups.status:"'
|
||||||
|
echo ""
|
||||||
|
|
||||||
|
echo "=== ZFS Pools ==="
|
||||||
|
ssh pve 'zpool status -x' 2>/dev/null
|
||||||
|
ssh pve2 'zpool status -x' 2>/dev/null
|
||||||
|
ssh truenas 'zpool status -x vault'
|
||||||
|
echo ""
|
||||||
|
|
||||||
|
echo "=== Disk Space ==="
|
||||||
|
ssh pve 'df -h | grep -E "Filesystem|/dev/(nvme|sd)"'
|
||||||
|
ssh truenas 'df -h /mnt/vault'
|
||||||
|
echo ""
|
||||||
|
|
||||||
|
echo "=== VM Status ==="
|
||||||
|
ssh pve 'qm list | grep running | wc -l' | xargs echo "PVE VMs running:"
|
||||||
|
ssh pve2 'qm list | grep running | wc -l' | xargs echo "PVE2 VMs running:"
|
||||||
|
echo ""
|
||||||
|
|
||||||
|
echo "=== Syncthing Connections ==="
|
||||||
|
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
|
||||||
|
"http://127.0.0.1:8384/rest/system/connections" | \
|
||||||
|
python3 -c "import sys,json; d=json.load(sys.stdin)['connections']; \
|
||||||
|
[print(f\"{v.get('name',k[:7])}: {'UP' if v['connected'] else 'DOWN'}\") for k,v in d.items()]"
|
||||||
|
echo ""
|
||||||
|
|
||||||
|
echo "=== Check Complete ==="
|
||||||
|
```
|
||||||
|
|
||||||
|
**Run daily via cron**:
|
||||||
|
```bash
|
||||||
|
# Add to crontab
|
||||||
|
0 9 * * * ~/bin/homelab-health-check.sh | mail -s "Homelab Health Check" hutson@example.com
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Weekly Maintenance
|
||||||
|
|
||||||
|
### Service Status Review
|
||||||
|
|
||||||
|
**Check all critical services**:
|
||||||
|
```bash
|
||||||
|
# Proxmox services
|
||||||
|
ssh pve 'systemctl status pve-cluster pvedaemon pveproxy'
|
||||||
|
ssh pve2 'systemctl status pve-cluster pvedaemon pveproxy'
|
||||||
|
|
||||||
|
# NUT (UPS monitoring)
|
||||||
|
ssh pve 'systemctl status nut-server nut-monitor'
|
||||||
|
ssh pve2 'systemctl status nut-monitor'
|
||||||
|
|
||||||
|
# Container services
|
||||||
|
ssh pve 'pct exec 200 -- systemctl status pihole-FTL' # Pi-hole
|
||||||
|
ssh pve 'pct exec 202 -- systemctl status traefik' # Traefik
|
||||||
|
|
||||||
|
# VM services (via QEMU agent)
|
||||||
|
ssh pve 'qm guest exec 100 -- bash -c "systemctl status nfs-server smbd"' # TrueNAS
|
||||||
|
```
|
||||||
|
|
||||||
|
### Log Review
|
||||||
|
|
||||||
|
**Check for errors in critical logs**:
|
||||||
|
```bash
|
||||||
|
# Proxmox system logs
|
||||||
|
ssh pve 'journalctl -p err -b | tail -50'
|
||||||
|
ssh pve2 'journalctl -p err -b | tail -50'
|
||||||
|
|
||||||
|
# VM logs (if QEMU agent available)
|
||||||
|
ssh pve 'qm guest exec 100 -- bash -c "journalctl -p err --since today"'
|
||||||
|
|
||||||
|
# Traefik access logs
|
||||||
|
ssh pve 'pct exec 202 -- tail -100 /var/log/traefik/access.log'
|
||||||
|
```
|
||||||
|
|
||||||
|
### Syncthing Sync Status
|
||||||
|
|
||||||
|
**Check for sync errors**:
|
||||||
|
```bash
|
||||||
|
# Check all folder errors
|
||||||
|
for folder in documents downloads desktop movies pictures notes config; do
|
||||||
|
echo "=== $folder ==="
|
||||||
|
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
|
||||||
|
"http://127.0.0.1:8384/rest/folder/errors?folder=$folder" | jq
|
||||||
|
done
|
||||||
|
```
|
||||||
|
|
||||||
|
**See**: [SYNCTHING.md](SYNCTHING.md)
|
||||||
|
|
||||||
|
---

## Monthly Maintenance

### System Updates

#### Proxmox Updates

**Check for updates**:
```bash
ssh pve 'apt update && apt list --upgradable'
ssh pve2 'apt update && apt list --upgradable'
```

**Apply updates**:
```bash
# PVE
ssh pve 'apt update && apt dist-upgrade -y'

# PVE2
ssh pve2 'apt update && apt dist-upgrade -y'

# Reboot if kernel updated
ssh pve 'reboot'
ssh pve2 'reboot'
```

**⚠️ Important**:
- Check [Proxmox release notes](https://pve.proxmox.com/wiki/Roadmap) before major updates
- Test on PVE2 first if possible
- Ensure all VMs are backed up before updating
- Monitor VMs after reboot - some may need manual restart

#### Container Updates (LXC)

```bash
# Update all containers
ssh pve 'for ctid in 200 202 205; do pct exec $ctid -- bash -c "apt update && apt upgrade -y"; done'
```

#### VM Updates

**Update VMs individually via SSH**:
```bash
# Ubuntu/Debian VMs
ssh truenas 'apt update && apt upgrade -y'
ssh docker-host 'apt update && apt upgrade -y'
ssh fs-dev 'apt update && apt upgrade -y'

# Check if reboot required
ssh truenas '[ -f /var/run/reboot-required ] && echo "Reboot required"'
```
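
The same reboot check can be swept across every Debian-based VM in one loop; a sketch (host list assumed from the commands above):

```bash
# Sketch: check the reboot-required flag on each VM
for host in truenas docker-host fs-dev; do
  ssh "$host" '[ -f /var/run/reboot-required ] && echo "$(hostname): reboot required"'
done
```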

### ZFS Scrubs

**Schedule**: Run monthly on all pools

**PVE**:
```bash
# Start scrub on all pools
ssh pve 'zpool scrub nvme-mirror1'
ssh pve 'zpool scrub nvme-mirror2'
ssh pve 'zpool scrub rpool'

# Check scrub status
ssh pve 'zpool status | grep -A2 scrub'
```

**PVE2**:
```bash
ssh pve2 'zpool scrub nvme-mirror3'
ssh pve2 'zpool scrub local-zfs2'
ssh pve2 'zpool status | grep -A2 scrub'
```

**TrueNAS**:
```bash
# Scrub via TrueNAS web UI or SSH
ssh truenas 'zpool scrub vault'
ssh truenas 'zpool status vault | grep -A2 scrub'
```

**Automate scrubs**:
```bash
# Add to crontab (run on 1st of month at 2 AM)
0 2 1 * * /sbin/zpool scrub nvme-mirror1
0 2 1 * * /sbin/zpool scrub nvme-mirror2
0 2 1 * * /sbin/zpool scrub rpool
```

**See**: [STORAGE.md](STORAGE.md) for pool details

### SMART Tests

**Run extended SMART tests monthly**:

```bash
# TrueNAS drives (via QEMU agent)
ssh pve 'qm guest exec 100 -- bash -c "smartctl --scan | while read dev type; do smartctl -t long \$dev; done"'

# Check results after 4-8 hours
ssh pve 'qm guest exec 100 -- bash -c "smartctl --scan | while read dev type; do echo \"=== \$dev ===\"; smartctl -a \$dev | grep -E \"Model|Serial|test result|Reallocated|Current_Pending\"; done"'

# PVE drives
ssh pve 'for dev in /dev/nvme0 /dev/nvme1 /dev/sda /dev/sdb; do [ -e "$dev" ] && smartctl -t long $dev; done'

# PVE2 drives
ssh pve2 'for dev in /dev/nvme0 /dev/nvme1 /dev/sda /dev/sdb; do [ -e "$dev" ] && smartctl -t long $dev; done'
```

**Automate SMART tests**:
```bash
# Add to crontab (run on 15th of month at 3 AM)
0 3 15 * * /usr/sbin/smartctl -t long /dev/nvme0
0 3 15 * * /usr/sbin/smartctl -t long /dev/sda
```

### Certificate Renewal Verification

**Check SSL certificate expiry**:
```bash
# Check Traefik certificates
ssh pve 'pct exec 202 -- cat /etc/traefik/acme.json | jq ".letsencrypt.Certificates[] | {domain: .domain.main, expires: .Dates.NotAfter}"'

# Check specific service
echo | openssl s_client -servername git.htsn.io -connect git.htsn.io:443 2>/dev/null | openssl x509 -noout -dates
```

**Certificates should auto-renew 30 days before expiry via Traefik**
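
To turn the manual check into a pass/fail threshold, the expiry date can be compared against a 30-day window. A minimal sketch (assumes GNU `date`; domain reused from the example above):

```bash
# Sketch: warn when a certificate is within 30 days of expiry
expiry=$(echo | openssl s_client -servername git.htsn.io -connect git.htsn.io:443 2>/dev/null \
  | openssl x509 -noout -enddate | cut -d= -f2)
days_left=$(( ($(date -d "$expiry" +%s) - $(date +%s)) / 86400 ))
[ "$days_left" -lt 30 ] && echo "WARN: git.htsn.io cert expires in $days_left days"
```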

**See**: [TRAEFIK.md](TRAEFIK.md) for certificate management

### Backup Verification

**⚠️ TODO**: No backup strategy currently in place

**See**: [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md) for implementation plan

---

## Quarterly Maintenance

### Full System Audit

**Check all systems comprehensively**:

1. **ZFS Pool Health**:
   ```bash
   ssh pve 'zpool status -v'
   ssh pve2 'zpool status -v'
   ssh truenas 'zpool status -v vault'
   ```
   Look for: errors, degraded vdevs, resilver operations

2. **SMART Health**:
   ```bash
   # Run SMART health check script
   ~/bin/smart-health-check.sh
   ```
   Look for: reallocated sectors, pending sectors, failures

3. **Disk Space Trends**:
   ```bash
   # Check growth rate
   ssh pve 'zpool list -o name,size,allocated,free,fragmentation'
   ssh truenas 'df -h /mnt/vault'
   ```
   Plan for expansion if >80% full (scripted check in the sketch after this list)

4. **VM Resource Usage**:
   ```bash
   # Check if VMs need more/less resources
   ssh pve 'qm list'
   ssh pve 'pvesh get /nodes/pve/status'
   ```

5. **Network Performance**:
   ```bash
   # Test bandwidth between critical nodes
   iperf3 -s                # On one host
   iperf3 -c 10.10.10.120   # From another
   ```

6. **Temperature Monitoring**:
   ```bash
   # Check max temps over past quarter
   # TODO: Set up Prometheus/Grafana for historical data
   ssh pve 'sensors'
   ssh pve2 'sensors'
   ```
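
The >80% threshold from item 3 can be scripted straight off `zpool list`; a sketch for one host:

```bash
# Sketch: flag any ZFS pool above 80% capacity
ssh pve 'zpool list -H -o name,capacity' | while read -r pool cap; do
  [ "${cap%\%}" -gt 80 ] && echo "WARN: $pool at $cap"
done
```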

### Service Dependency Testing

**Test critical paths**:

1. **Power failure recovery** (if safe to test):
   - See [UPS.md](UPS.md) for full procedure
   - Verify VM startup order works
   - Confirm all services come back online

2. **Failover testing**:
   - Tailscale subnet routing (PVE → UCG-Fiber)
   - NUT monitoring (PVE server → PVE2 client)

3. **Backup restoration** (when backups implemented):
   - Test restoring a VM from backup
   - Test restoring files from Syncthing versioning

### Documentation Review

- [ ] Update IP assignments in [IP-ASSIGNMENTS.md](IP-ASSIGNMENTS.md)
- [ ] Review and update service URLs in [SERVICES.md](SERVICES.md)
- [ ] Check for missing hardware specs in [HARDWARE.md](HARDWARE.md)
- [ ] Update any changed procedures in this document

---

## Annual Maintenance

### Hardware Maintenance

**Physical cleaning**:
```bash
# Shut down servers (coordinate with users)
ssh pve 'shutdown -h now'
ssh pve2 'shutdown -h now'

# Clean dust from:
# - CPU heatsinks
# - GPU fans
# - Case fans
# - PSU vents
# - Storage enclosure fans

# Check for:
# - Bulging capacitors on PSU/motherboard
# - Loose cables
# - Fan noise/vibration
```

**Thermal paste inspection** (every 2-3 years):
- Check CPU temps vs baseline
- If temps >85°C under load, consider reapplying paste
- Threadripper PRO: Tctl max safe = 90°C

**See**: [HARDWARE.md](HARDWARE.md) for component details

### UPS Battery Test

**Runtime test**:
```bash
# Check battery health
ssh pve 'upsc cyberpower@localhost | grep battery'

# Perform runtime test (coordinate power loss)
# 1. Note current runtime estimate
# 2. Unplug UPS from wall
# 3. Let battery drain to 20%
# 4. Note actual runtime vs estimate
# 5. Plug back in before shutdown triggers

# Battery replacement if:
# - Runtime < 10 min at typical load
# - Battery age > 3-5 years
# - Battery charge < 100% when on AC for 24h
```

**See**: [UPS.md](UPS.md) for full UPS details

### Drive Replacement Planning

**Check drive age and health**:
```bash
# Get drive hours and health
ssh truenas 'smartctl --scan | while read dev type; do
  echo "=== $dev ===";
  smartctl -a $dev | grep -E "Model|Serial|Power_On_Hours|Reallocated|Pending";
done'
```

**Replace drives if**:
- Reallocated sectors > 0
- Pending sectors > 0
- SMART pre-fail warnings
- Age > 5 years for HDDs (3-5 years for SSDs/NVMe)
- Hours > 50,000 for consumer drives

**Budget for replacements**:
- HDDs: WD Red 6TB (~$150/drive)
- NVMe: Samsung/Kingston 2TB (~$150-200/drive)

### Capacity Planning

**Review growth trends**:
```bash
# Storage growth (compare to last year)
ssh pve 'zpool list'
ssh truenas 'df -h /mnt/vault'

# Network bandwidth (if monitoring in place)
# Review Grafana dashboards

# Power consumption
ssh pve 'upsc cyberpower@localhost ups.load'
```

**Plan expansions**:
- Storage: Add drives if >70% full
- RAM: Check if VMs hitting limits
- Network: Upgrade if bandwidth saturation
- UPS: Upgrade if load >80%

### License and Subscription Review

**Proxmox subscription** (if applicable):
- Community (free) or Enterprise subscription?
- Check for updates to pricing/features

**Service subscriptions**:
- Domain registration (htsn.io)
- Cloudflare plan (currently free)
- Let's Encrypt (free, no action needed)

---

## Update Schedules

### Proxmox

| Component | Frequency | Notes |
|-----------|-----------|-------|
| Security patches | Weekly | Via `apt upgrade` |
| Minor updates | Monthly | Test on PVE2 first |
| Major versions | Quarterly | Read release notes, plan downtime |
| Kernel updates | Monthly | Requires reboot |

**Update procedure**:
1. Check [Proxmox release notes](https://pve.proxmox.com/wiki/Roadmap)
2. Backup VM configs: `vzdump --dumpdir /tmp`
3. Update: `apt update && apt dist-upgrade`
4. Reboot if kernel changed: `reboot`
5. Verify VMs auto-started: `qm list`

### Containers (LXC)

| Container | Update Frequency | Package Manager |
|-----------|------------------|-----------------|
| Pi-hole (200) | Weekly | `apt` |
| Traefik (202) | Monthly | `apt` |
| FindShyt (205) | As needed | `apt` |

**Update command**:
```bash
ssh pve 'pct exec CTID -- bash -c "apt update && apt upgrade -y"'
```

### VMs

| VM | Update Frequency | Notes |
|----|------------------|-------|
| TrueNAS | Monthly | Via web UI or `apt` |
| Saltbox | Weekly | Managed by Saltbox updates |
| HomeAssistant | Monthly | Via HA supervisor |
| Docker-host | Weekly | `apt` + Docker images |
| Trading-VM | As needed | Via SSH |
| Gitea-VM | Monthly | Via web UI + `apt` |

**Docker image updates**:
```bash
ssh docker-host 'docker-compose pull && docker-compose up -d'
```

### Firmware Updates

| Component | Check Frequency | Update Method |
|-----------|----------------|---------------|
| Motherboard BIOS | Annually | Manual flash (high risk) |
| GPU firmware | Rarely | `nvidia-smi` or manual |
| SSD/NVMe firmware | Quarterly | Vendor tools |
| HBA firmware | Annually | LSI tools |
| UPS firmware | Annually | PowerPanel or manual |

**⚠️ Warning**: BIOS/firmware updates carry risk. Only update if:
- Critical security issue
- Needed for hardware compatibility
- Fixing known bug affecting you

---

## Testing Checklists

### Pre-Update Checklist

Before ANY system update:
- [ ] Check current system state: `uptime`, `qm list`, `zpool status`
- [ ] Verify backups are current (when backup system in place)
- [ ] Check for critical VMs/services that can't have downtime
- [ ] Review update changelog/release notes
- [ ] Test on non-critical system first (PVE2 or test VM)
- [ ] Plan rollback strategy if update fails
- [ ] Notify users if downtime expected

### Post-Update Checklist

After system update:
- [ ] Verify system booted correctly: `uptime`
- [ ] Check all VMs/CTs started: `qm list`, `pct list`
- [ ] Test critical services:
  - [ ] Pi-hole DNS: `nslookup google.com 10.10.10.10`
  - [ ] Traefik routing: `curl -I https://plex.htsn.io`
  - [ ] NFS/SMB shares: Test mount from VM
  - [ ] Syncthing sync: Check all devices connected
- [ ] Review logs for errors: `journalctl -p err -b`
- [ ] Check temperatures: `sensors`
- [ ] Verify UPS monitoring: `upsc cyberpower@localhost`
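
Most of this checklist collapses into one smoke-test script; a sketch reusing the commands above:

```bash
#!/bin/bash
# Sketch: post-update smoke test (run from a workstation with SSH access)
nslookup google.com 10.10.10.10 > /dev/null && echo "DNS: OK" || echo "DNS: FAIL"
curl -fsI https://plex.htsn.io > /dev/null && echo "Traefik: OK" || echo "Traefik: FAIL"
ssh pve 'qm list | grep -c running' | xargs echo "PVE VMs running:"
ssh pve 'journalctl -p err -b --no-pager | tail -5'
ssh pve 'upsc cyberpower@localhost ups.status'
```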

### Disaster Recovery Test

**Quarterly test** (when backup system in place):
- [ ] Simulate VM failure: Restore from backup
- [ ] Simulate storage failure: Import pool on different system
- [ ] Simulate network failure: Verify Tailscale failover
- [ ] Simulate power failure: Test UPS shutdown procedure (if safe)
- [ ] Document recovery time and issues

---

## Log Rotation

**System logs** are automatically rotated by systemd-journald and logrotate.

**Check log sizes**:
```bash
# Journalctl size
ssh pve 'journalctl --disk-usage'

# Traefik logs
ssh pve 'pct exec 202 -- du -sh /var/log/traefik/'
```

**Configure retention**:
```bash
# Limit journald to 500MB
ssh pve 'echo "SystemMaxUse=500M" >> /etc/systemd/journald.conf'
ssh pve 'systemctl restart systemd-journald'
```
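
The `SystemMaxUse` cap applies to new writes; to trim journals that already exceed it, `journalctl` can vacuum in place:

```bash
# One-shot trim of existing journal files to the same 500MB cap
ssh pve 'journalctl --vacuum-size=500M'
```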

**Traefik log rotation** (already configured):
```bash
# /etc/logrotate.d/traefik on CT 202
/var/log/traefik/*.log {
    daily
    rotate 7
    compress
    delaycompress
    missingok
    notifempty
}
```

---

## Monitoring Integration

**TODO**: Set up automated monitoring for these procedures

**When monitoring is implemented** (see [MONITORING.md](MONITORING.md)):
- ZFS scrub completion/errors
- SMART test failures
- Certificate expiry warnings (<30 days)
- Update availability notifications
- Disk space thresholds (>80%)
- Temperature warnings (>85°C)

---

## Related Documentation

- [MONITORING.md](MONITORING.md) - Automated health checks and alerts
- [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md) - Backup implementation plan
- [UPS.md](UPS.md) - Power failure procedures
- [STORAGE.md](STORAGE.md) - ZFS pool management
- [HARDWARE.md](HARDWARE.md) - Hardware specifications
- [SERVICES.md](SERVICES.md) - Service inventory

---

**Last Updated**: 2025-12-22
**Status**: ⚠️ Manual procedures only - monitoring automation needed
478
MINECRAFT.md
Normal file
@@ -0,0 +1,478 @@

# Minecraft Server - Hutworld

Minecraft server running on docker-host2 via Crafty Controller 4.

---

## Quick Reference

| Setting | Value |
|---------|-------|
| **Web GUI** | https://mc.htsn.io |
| **Game Server (Java)** | hutworld.htsn.io:25565 |
| **Game Server (Bedrock)** | hutworld.htsn.io:19132 |
| **Host** | docker-host2 (10.10.10.207) |
| **Server Type** | Paper 1.21.11 |
| **World Name** | hutworld |
| **Memory** | 2GB min / 4GB max |

---

## Crafty Controller Access

| Setting | Value |
|---------|-------|
| **URL** | https://mc.htsn.io |
| **Username** | admin |
| **Password** | See `/crafty/data/config/default-creds.txt` on docker-host2 |

**Get password:**
```bash
ssh docker-host2 'cat ~/crafty/data/config/default-creds.txt'
```

---

## Current Status

### Completed

- [x] Crafty Controller 4.4.7 deployed on docker-host2
- [x] Traefik reverse proxy configured (mc.htsn.io → 10.10.10.207:8443)
- [x] DNS A record created for hutworld.htsn.io (non-proxied, points to public IP)
- [x] Port forwarding configured via UniFi API:
  - TCP/UDP 25565 → 10.10.10.207 (Java Edition)
  - UDP 19132 → 10.10.10.207 (Bedrock via Geyser)
- [x] Server files transferred from Windows PC (D:\Minecraft\mcss\servers\hutworld)
- [x] Server imported into Crafty and running
- [x] Paper upgraded from 1.21.5 to 1.21.11
- [x] Plugins updated (GSit 3.1.1, LuckPerms 5.5.22)
- [x] Orphaned plugin data cleaned up
- [x] LuckPerms database restored with original permissions
- [x] Automated backups to TrueNAS configured (every 6 hours)

### Pending

- [ ] Change Crafty admin password to something memorable
- [ ] Test external connectivity from outside network

---

## Import Instructions

To import the hutworld server in Crafty:

1. Go to **Servers** → Click **+ Create New Server**
2. Select **Import Server** tab
3. Fill in:
   - **Server Name:** `Hutworld`
   - **Import Path:** `/crafty/import/hutworld`
   - **Server JAR:** `paper.jar`
   - **Min RAM:** `2048` (2GB)
   - **Max RAM:** `6144` (6GB)
   - **Server Port:** `25565`
4. Click **Import Server**
5. Go to server → Click **Start**

---

## Server Configuration

### World Data

| World | Description |
|-------|-------------|
| hutworld | Main overworld |
| hutworld_nether | Nether dimension |
| hutworld_the_end | End dimension |

### Installed Plugins

| Plugin | Version | Purpose |
|--------|---------|---------|
| EssentialsX | 2.20.1 | Core server commands |
| EssentialsXChat | 2.20.1 | Chat formatting |
| EssentialsXSpawn | 2.20.1 | Spawn management |
| Geyser-Spigot | Latest | Bedrock Edition support |
| floodgate | Latest | Bedrock authentication |
| GSit | 3.1.1 | Sit/lay/crawl animations |
| LuckPerms | 5.5.22 | Permissions management |
| PluginPortal | 2.2.2 | Plugin management |
| Vault | 1.7.3 | Economy/permissions API |
| ViaVersion | Latest | Multi-version support |
| ViaBackwards | Latest | Older client support |
| randomtp | Latest | Random teleportation |

**Removed plugins** (cleaned up 2026-01-03):
- GriefPrevention, Multiverse-Core, Multiverse-Portals, ProtocolLib, WorldEdit, WorldGuard (disabled/orphaned)

---

## Docker Configuration

**Location:** `~/crafty/docker-compose.yml` on docker-host2

```yaml
services:
  crafty:
    image: registry.gitlab.com/crafty-controller/crafty-4:4.4.7
    container_name: crafty
    restart: unless-stopped
    environment:
      - TZ=America/New_York
    ports:
      - "8443:8443"        # Web GUI (HTTPS)
      - "8123:8123"        # Dynmap (if used)
      - "25565:25565"      # Minecraft Java
      - "25566:25566"      # Additional server
      - "19132:19132/udp"  # Minecraft Bedrock (Geyser)
    volumes:
      - ./data/backups:/crafty/backups
      - ./data/logs:/crafty/logs
      - ./data/servers:/crafty/servers
      - ./data/config:/crafty/app/config
      - ./data/import:/crafty/import
```

---

## Traefik Configuration

**File:** `/etc/traefik/conf.d/crafty.yaml` on CT 202 (10.10.10.250)

```yaml
http:
  routers:
    crafty-secure:
      entryPoints:
        - websecure
      rule: "Host(`mc.htsn.io`)"
      service: crafty
      tls:
        certResolver: cloudflare
      priority: 50

  services:
    crafty:
      loadBalancer:
        servers:
          - url: "https://10.10.10.207:8443"
        serversTransport: crafty-transport@file

  serversTransports:
    crafty-transport:
      insecureSkipVerify: true
```

---

## Port Forwarding (UniFi)

Configured via UniFi API on UCG-Fiber (10.10.10.1):

| Rule Name | Port | Protocol | Destination |
|-----------|------|----------|-------------|
| Minecraft Java | 25565 | TCP/UDP | 10.10.10.207:25565 |
| Minecraft Bedrock | 19132 | UDP | 10.10.10.207:19132 |

---

## DNS Records (Cloudflare)

| Record | Type | Value | Proxied |
|--------|------|-------|---------|
| mc.htsn.io | CNAME | htsn.io | Yes (for web GUI) |
| hutworld.htsn.io | A | 70.237.94.174 | No (direct for game traffic) |

**Note:** Game traffic (25565, 19132) cannot be proxied through Cloudflare - only HTTP/HTTPS works with Cloudflare proxy.

---

## LuckPerms Web Editor

After server is running:

1. Open Crafty console for Hutworld server
2. Run command: `/lp editor`
3. A unique URL will be generated (cloud-hosted by LuckPerms)
4. Open the URL in browser to manage permissions

The editor is hosted by LuckPerms, so no additional port forwarding is needed.

---

## Backup Configuration

### Automated Backups to TrueNAS

Backups run automatically every 6 hours and are stored on TrueNAS.

| Setting | Value |
|---------|-------|
| **Destination** | TrueNAS (10.10.10.200) |
| **Path** | `/mnt/vault/users/backups/minecraft/` |
| **Frequency** | Every 6 hours (12am, 6am, 12pm, 6pm) |
| **Retention** | 14 backups (~3.5 days of history) |
| **Size** | ~2.3 GB per backup |
| **Script** | `/home/hutson/minecraft-backup.sh` on docker-host2 |
| **Log** | `/home/hutson/minecraft-backup.log` on docker-host2 |

### Backup Script

**Location:** `~/minecraft-backup.sh` on docker-host2

```bash
#!/bin/bash
# Minecraft Server Backup Script
# Backs up Crafty server data to TrueNAS

BACKUP_SRC="$HOME/crafty/data/servers/19f604a9-f037-442d-9283-0761c73cfd60"
BACKUP_DEST="hutson@10.10.10.200:/mnt/vault/users/backups/minecraft"
DATE=$(date +%Y-%m-%d_%H%M)
BACKUP_NAME="hutworld-$DATE.tar.gz"
LOCAL_BACKUP="/tmp/$BACKUP_NAME"

# Create compressed backup (exclude large unnecessary files)
tar -czf "$LOCAL_BACKUP" \
  --exclude="*.jar" \
  --exclude="cache" \
  --exclude="libraries" \
  --exclude=".paper-remapped" \
  -C "$HOME/crafty/data/servers" \
  19f604a9-f037-442d-9283-0761c73cfd60

# Transfer to TrueNAS
sshpass -p 'GrilledCh33s3#' scp -o StrictHostKeyChecking=no "$LOCAL_BACKUP" "$BACKUP_DEST/"

# Clean up local temp file
rm -f "$LOCAL_BACKUP"

# Keep only last 14 backups on TrueNAS
sshpass -p 'GrilledCh33s3#' ssh -o StrictHostKeyChecking=no hutson@10.10.10.200 '
  cd /mnt/vault/users/backups/minecraft
  ls -t hutworld-*.tar.gz 2>/dev/null | tail -n +15 | xargs -r rm -f
'
```

### Cron Schedule

```bash
# View current schedule
ssh docker-host2 'crontab -l | grep minecraft'

# Output: 0 */6 * * * /home/hutson/minecraft-backup.sh >> /home/hutson/minecraft-backup.log 2>&1
```

### Manual Backup Commands

```bash
# Run backup manually
ssh docker-host2 '~/minecraft-backup.sh'

# Check backup log
ssh docker-host2 'tail -20 ~/minecraft-backup.log'

# List backups on TrueNAS
sshpass -p 'GrilledCh33s3#' ssh -o StrictHostKeyChecking=no hutson@10.10.10.200 \
  'ls -lh /mnt/vault/users/backups/minecraft/'
```
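
A quick way to confirm the cron job is actually producing backups is to check the newest file's age; a sketch (assumes GNU `stat` on TrueNAS SCALE, paths from the table above):

```bash
# Sketch: warn if the newest Minecraft backup is older than ~7 hours
sshpass -p 'GrilledCh33s3#' ssh -o StrictHostKeyChecking=no hutson@10.10.10.200 '
  latest=$(ls -t /mnt/vault/users/backups/minecraft/hutworld-*.tar.gz 2>/dev/null | head -1)
  if [ -z "$latest" ]; then echo "WARN: no backups found"; exit 1; fi
  age_h=$(( ($(date +%s) - $(stat -c %Y "$latest")) / 3600 ))
  [ "$age_h" -gt 7 ] && echo "WARN: newest backup is ${age_h}h old: $latest"
'
```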

### Restore from Backup

```bash
# 1. Stop the server in Crafty web UI

# 2. Copy backup from TrueNAS (run on docker-host2 so the file lands in its /tmp)
sshpass -p 'GrilledCh33s3#' scp -o StrictHostKeyChecking=no \
  hutson@10.10.10.200:/mnt/vault/users/backups/minecraft/hutworld-YYYY-MM-DD_HHMM.tar.gz \
  /tmp/

# 3. Extract to server directory (backup existing first)
ssh docker-host2 'cd ~/crafty/data/servers && \
  mv 19f604a9-f037-442d-9283-0761c73cfd60 19f604a9-f037-442d-9283-0761c73cfd60.old && \
  tar -xzf /tmp/hutworld-YYYY-MM-DD_HHMM.tar.gz'

# 4. Start server in Crafty web UI
```

---

## Common Tasks

### Start/Stop Server

Via Crafty web UI at https://mc.htsn.io, or:

```bash
# Check Crafty container status
ssh docker-host2 'docker ps | grep crafty'

# Restart Crafty container
ssh docker-host2 'cd ~/crafty && docker compose restart'

# View Crafty logs
ssh docker-host2 'docker logs -f crafty'
```

### Backup Server

See [Backup Configuration](#backup-configuration) for full details.

```bash
# Run backup manually
ssh docker-host2 '~/minecraft-backup.sh'

# Check recent backups
sshpass -p 'GrilledCh33s3#' ssh -o StrictHostKeyChecking=no hutson@10.10.10.200 \
  'ls -lht /mnt/vault/users/backups/minecraft/ | head -5'
```

### Update Plugins

1. Download new plugin JAR
2. Upload via Crafty Files tab, or:
   ```bash
   scp plugin.jar docker-host2:~/crafty/data/servers/hutworld/plugins/
   ```
3. Restart server in Crafty

### Check Server Logs

Via Crafty web UI (Logs tab), or:
```bash
ssh docker-host2 'tail -f ~/crafty/data/servers/hutworld/logs/latest.log'
```

---

## Troubleshooting

### Server won't start

```bash
# Check Crafty container logs
ssh docker-host2 'docker logs crafty --tail 50'

# Check server logs
ssh docker-host2 'cat ~/crafty/data/servers/hutworld/logs/latest.log | tail -100'

# Check Java version in container
ssh docker-host2 'docker exec crafty java -version'
```

### Can't connect externally

1. Verify port forwarding is active:
   ```bash
   ssh root@10.10.10.1 'iptables -t nat -L -n | grep 25565'
   ```

2. Test from external network:
   ```bash
   nc -zv hutworld.htsn.io 25565
   ```

3. Check if server is listening:
   ```bash
   ssh docker-host2 'netstat -tlnp | grep 25565'
   ```

### Bedrock players can't connect

1. Verify Geyser plugin is installed and enabled
2. Check Geyser config: `~/crafty/data/servers/hutworld/plugins/Geyser-Spigot/config.yml`
3. Ensure UDP 19132 is forwarded and not blocked

### LuckPerms missing users/permissions

If LuckPerms shows a fresh database (missing users like Suwan):

1. **Check if original database exists:**
   ```bash
   ssh docker-host2 'ls -la ~/crafty/data/import/hutworld/plugins/LuckPerms/*.db'
   ```

2. **Restore from import backup:**
   ```bash
   # Stop server in Crafty UI first
   ssh docker-host2 'cp ~/crafty/data/import/hutworld/plugins/LuckPerms/luckperms-h2-v2.mv.db \
     ~/crafty/data/servers/19f604a9-f037-442d-9283-0761c73cfd60/plugins/LuckPerms/'
   ```

3. **Or restore from TrueNAS backup:**
   ```bash
   # List available backups
   sshpass -p 'GrilledCh33s3#' ssh -o StrictHostKeyChecking=no hutson@10.10.10.200 \
     'ls -lt /mnt/vault/users/backups/minecraft/'

   # Extract LuckPerms database from backup
   sshpass -p 'GrilledCh33s3#' scp hutson@10.10.10.200:/mnt/vault/users/backups/minecraft/hutworld-YYYY-MM-DD_HHMM.tar.gz /tmp/
   tar -xzf /tmp/hutworld-*.tar.gz -C /tmp --strip-components=2 \
     '*/plugins/LuckPerms/luckperms-h2-v2.mv.db'
   ```

4. **Restart server in Crafty UI**

---

## Migration History

### 2026-01-04: Backup System

- Configured automated backups to TrueNAS every 6 hours
- Set 14-backup retention (~3.5 days of recovery points)
- Created backup script with compression and cleanup
- Storage: `/mnt/vault/users/backups/minecraft/`

### 2026-01-03: Server Fixes & Updates

**Updates:**
- Upgraded Paper from 1.21.5 to 1.21.11 (build 69)
- Updated GSit from 2.3.2 to 3.1.1
- Fixed corrupted LuckPerms JAR (re-downloaded 5.5.22)
- Restored original LuckPerms database with user permissions

**Cleanup:**
- Removed disabled plugins: Dynmap, Graves
- Removed orphaned data folders: GriefPreventionData, SilkSpawners_v2, Graves, ViaRewind

**Fixes:**
- Fixed memory allocation (was attempting 2TB, set to 2GB min / 4GB max)
- Fixed file permissions for Docker container access

### 2026-01-03: Initial Migration

**Source:** Windows PC (10.10.10.150) - D:\Minecraft\mcss\servers\hutworld

**Steps completed:**
1. Compressed hutworld folder on Windows (2.4GB zip)
2. Transferred via SCP to docker-host2
3. Unzipped to ~/crafty/data/import/hutworld
4. Downloaded Paper 1.21.5 JAR (later upgraded to 1.21.11)
5. Imported server into Crafty Controller
6. Configured port forwarding (updated existing 25565 rule, added 19132)
7. Created DNS record for hutworld.htsn.io

**Original MCSS config preserved:** `mcss_server_config.json`

---

## Related Documentation

- [IP Assignments](IP-ASSIGNMENTS.md) - Network configuration
- [Traefik](TRAEFIK.md) - Reverse proxy setup
- [VMs](VMS.md) - docker-host2 details
- [Gateway](GATEWAY.md) - UCG-Fiber configuration

---

## Resources

- [Crafty Controller Docs](https://docs.craftycontrol.com/)
- [Paper MC](https://papermc.io/)
- [Geyser MC](https://geysermc.org/)
- [LuckPerms](https://luckperms.net/)

---

**Last Updated:** 2026-01-04
583
MONITORING.md
Normal file
@@ -0,0 +1,583 @@

# Monitoring and Alerting

Documentation for system monitoring, health checks, and alerting across the homelab.

## Current Monitoring Status

| Component | Monitored? | Method | Alerts | Notes |
|-----------|------------|--------|--------|-------|
| **Gateway** | ✅ Yes | Custom services | ✅ Auto-reboot | Internet watchdog + memory monitor |
| **UPS** | ✅ Yes | NUT + Home Assistant | ❌ No | Battery, load, runtime tracked |
| **Syncthing** | ✅ Partial | API (manual checks) | ❌ No | Connection status available |
| **Server temps** | ✅ Partial | Manual checks | ❌ No | Via `sensors` command |
| **VM status** | ✅ Partial | Proxmox UI | ❌ No | Manual monitoring |
| **ZFS health** | ❌ No | Manual `zpool status` | ❌ No | No automated checks |
| **Disk health (SMART)** | ❌ No | Manual `smartctl` | ❌ No | No automated checks |
| **Network** | ✅ Partial | Gateway watchdog | ✅ Auto-reboot | Connectivity check every 60s |
| **Services** | ❌ No | - | ❌ No | No health checks |
| **Backups** | ❌ No | - | ❌ No | No verification |

**Overall Status**: ⚠️ **PARTIAL** - Gateway monitoring is active; most of the rest is manual

---

## Existing Monitoring

### UPS Monitoring (NUT)

**Status**: ✅ **Active and working**

**What's monitored**:
- Battery charge percentage
- Runtime remaining (seconds)
- Load percentage
- Input/output voltage
- UPS status (OL/OB/LB)

**Access**:
```bash
# Full UPS status
ssh pve 'upsc cyberpower@localhost'

# Key metrics
ssh pve 'upsc cyberpower@localhost | grep -E "battery.charge:|battery.runtime:|ups.load:|ups.status:"'
```

**Home Assistant Integration**:
- Sensors: `sensor.cyberpower_*`
- Can be used for automation/alerts
- Currently: No alerts configured

**See**: [UPS.md](UPS.md)

---

### Gateway Monitoring

**Status**: ✅ **Active with auto-recovery**

Two custom systemd services monitor the UCG-Fiber gateway (10.10.10.1):

**1. Internet Watchdog** (`internet-watchdog.service`)
- Pings external DNS (1.1.1.1, 8.8.8.8, 208.67.222.222) every 60 seconds
- Auto-reboots gateway after 5 consecutive failures (~5 minutes)
- Logs to `/var/log/internet-watchdog.log`

**2. Memory Monitor** (`memory-monitor.service`)
- Logs memory usage and top processes every 10 minutes
- Logs to `/data/logs/memory-history.log`
- Auto-rotates when log exceeds 10MB
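
The watchdog's core loop is simple; below is a minimal sketch of the logic described above (the log path matches, but the exact script running on the gateway may differ):

```bash
#!/bin/bash
# Minimal internet-watchdog sketch: ping rotation, reboot after 5 straight failures
FAILS=0
while true; do
  if ping -c1 -W2 1.1.1.1 >/dev/null 2>&1 \
     || ping -c1 -W2 8.8.8.8 >/dev/null 2>&1 \
     || ping -c1 -W2 208.67.222.222 >/dev/null 2>&1; then
    FAILS=0
  else
    FAILS=$((FAILS + 1))
    echo "$(date): connectivity check failed ($FAILS/5)" >> /var/log/internet-watchdog.log
    if [ "$FAILS" -ge 5 ]; then
      echo "$(date): rebooting gateway" >> /var/log/internet-watchdog.log
      reboot
    fi
  fi
  sleep 60
done
```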

**Quick Commands**:
```bash
# Check service status
ssh ucg-fiber 'systemctl status internet-watchdog memory-monitor'

# View watchdog activity
ssh ucg-fiber 'tail -20 /var/log/internet-watchdog.log'

# View memory history
ssh ucg-fiber 'tail -100 /data/logs/memory-history.log'

# Current memory usage
ssh ucg-fiber 'free -m && ps -eo pid,rss,comm --sort=-rss | head -12'
```

**See**: [GATEWAY.md](GATEWAY.md)

---

### Syncthing Monitoring

**Status**: ⚠️ **Partial** - API available, no automated monitoring

**What's available**:
- Device connection status
- Folder sync status
- Sync errors
- Bandwidth usage

**Manual Checks**:
```bash
# Check connections (Mac Mini)
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
  "http://127.0.0.1:8384/rest/system/connections" | \
  python3 -c "import sys,json; d=json.load(sys.stdin)['connections']; \
[print(f\"{v.get('name',k[:7])}: {'UP' if v['connected'] else 'DOWN'}\") for k,v in d.items()]"

# Check folder status
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
  "http://127.0.0.1:8384/rest/db/status?folder=documents" | jq

# Check errors
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
  "http://127.0.0.1:8384/rest/folder/errors?folder=documents" | jq
```

**Needs**: Automated monitoring script + alerts
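
Until proper alerting exists, a cron-able sketch that exits nonzero (and prints the offenders) when any device is disconnected:

```bash
#!/bin/bash
# Sketch: fail when any Syncthing device is disconnected
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
  "http://127.0.0.1:8384/rest/system/connections" | python3 -c '
import sys, json
conns = json.load(sys.stdin)["connections"]
down = [v.get("name", k[:7]) for k, v in conns.items() if not v["connected"]]
if down:
    print("Syncthing DOWN:", ", ".join(down))
    sys.exit(1)
'
```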

**See**: [SYNCTHING.md](SYNCTHING.md)

---

### Temperature Monitoring

**Status**: ⚠️ **Manual only**

**Current Method**:
```bash
# CPU temperature (Threadripper Tctl)
ssh pve 'for f in /sys/class/hwmon/hwmon*/temp*_input; do \
  label=$(cat ${f%_input}_label 2>/dev/null); \
  if [ "$label" = "Tctl" ]; then echo "PVE Tctl: $(($(cat $f)/1000))°C"; fi; done'

ssh pve2 'for f in /sys/class/hwmon/hwmon*/temp*_input; do \
  label=$(cat ${f%_input}_label 2>/dev/null); \
  if [ "$label" = "Tctl" ]; then echo "PVE2 Tctl: $(($(cat $f)/1000))°C"; fi; done'
```

**Thresholds**:
- Healthy: 70-80°C under load
- Warning: >85°C
- Critical: >90°C (throttling)

**Needs**: Automated monitoring + alert if >85°C
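
A sketch of that alert as a cron job (prints a warning line; wire the output to mail or Telegram as preferred):

```bash
#!/bin/bash
# Sketch: warn when Tctl exceeds 85°C on either node
for host in pve pve2; do
  t=$(ssh "$host" 'for f in /sys/class/hwmon/hwmon*/temp*_input; do
        [ "$(cat ${f%_input}_label 2>/dev/null)" = "Tctl" ] && echo $(($(cat $f)/1000))
      done' | head -1)
  [ -n "$t" ] && [ "$t" -gt 85 ] && echo "WARN: $host Tctl at ${t}°C"
done
```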

---

### Proxmox VM Monitoring

**Status**: ⚠️ **Manual only**

**Current Access**:
- Proxmox Web UI: Node → Summary
- CLI: `ssh pve 'qm list'`

**Metrics Available** (via Proxmox):
- CPU usage per VM
- RAM usage per VM
- Disk I/O
- Network I/O
- VM uptime

**Needs**: API-based monitoring + alerts for VM down
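
As a stopgap, `qm list` output can be parsed for anything not running; a sketch:

```bash
# Sketch: flag VMs that are not in the "running" state on each node
for host in pve pve2; do
  ssh "$host" 'qm list' | awk -v h="$host" \
    'NR>1 && $3 != "running" {print "WARN: " h " VM " $1 " (" $2 ") is " $3}'
done
```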

---

## Recommended Monitoring Stack

### Option 1: Prometheus + Grafana (Recommended)

**Why**:
- Industry standard
- Extensive integrations
- Beautiful dashboards
- Flexible alerting

**Architecture**:
```
Grafana (dashboard) → Prometheus (metrics DB) → Exporters (data collection)
                                 ↓
                      Alertmanager (alerts)
```

**Required Exporters**:
| Exporter | Monitors | Install On |
|----------|----------|------------|
| node_exporter | CPU, RAM, disk, network | PVE, PVE2, TrueNAS, all VMs |
| zfs_exporter | ZFS pool health | PVE, PVE2, TrueNAS |
| smartmon_exporter | Drive SMART data | PVE, PVE2, TrueNAS |
| nut_exporter | UPS metrics | PVE |
| proxmox_exporter | VM/CT stats | PVE, PVE2 |
| cadvisor | Docker containers | Saltbox, docker-host |

**Deployment**:
```bash
# Create monitoring VM
ssh pve 'qm create 210 --name monitoring --memory 4096 --cores 2 \
  --net0 virtio,bridge=vmbr0'

# Install Prometheus + Grafana (via Docker)
# /opt/monitoring/docker-compose.yml
```
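
A starting point for that compose file might look like the sketch below; image tags, host ports, and volume layout are assumptions to adapt, and the scrape config (`prometheus.yml`) still needs to be written separately:

```yaml
# /opt/monitoring/docker-compose.yml (sketch)
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml  # scrape targets go here
      - prom-data:/prometheus
    restart: unless-stopped

  grafana:
    image: grafana/grafana-oss:latest
    ports:
      - "3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana
    restart: unless-stopped

volumes:
  prom-data:
  grafana-data:
```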

**Estimated Setup Time**: 4-6 hours

---

### Option 2: Uptime Kuma (Simpler Alternative)

**Why**:
- Lightweight
- Easy to set up
- Web-based dashboard
- Built-in alerts (email, Slack, etc.)

**What it monitors**:
- HTTP/HTTPS endpoints
- Ping (ICMP)
- Ports (TCP)
- Docker containers

**Deployment**:
```bash
ssh docker-host 'mkdir -p /opt/uptime-kuma'
ssh docker-host 'cat > /opt/uptime-kuma/docker-compose.yml' << 'EOF'
version: "3.8"
services:
  uptime-kuma:
    image: louislam/uptime-kuma:latest
    ports:
      - "3001:3001"
    volumes:
      - ./data:/app/data
    restart: unless-stopped
EOF
ssh docker-host 'cd /opt/uptime-kuma && docker compose up -d'

# Access: http://10.10.10.206:3001
# Add Traefik config for uptime.htsn.io
```

**Estimated Setup Time**: 1-2 hours

---

### Option 3: Netdata (Real-time Monitoring)

**Why**:
- Real-time metrics (1-second granularity)
- Auto-discovers services
- Low overhead
- Beautiful web UI

**Deployment**:
```bash
# Install on each server
ssh pve 'bash <(curl -Ss https://my-netdata.io/kickstart.sh)'
ssh pve2 'bash <(curl -Ss https://my-netdata.io/kickstart.sh)'

# Access:
# http://10.10.10.120:19999 (PVE)
# http://10.10.10.102:19999 (PVE2)
```

**Parent-Child Setup** (optional):
- Configure PVE as parent
- Stream metrics from PVE2 → PVE
- Single dashboard for both servers

**Estimated Setup Time**: 1 hour

---

## Critical Metrics to Monitor

### Server Health

| Metric | Threshold | Action |
|--------|-----------|--------|
| **CPU usage** | >90% for 5 min | Alert |
| **CPU temp** | >85°C | Alert |
| **CPU temp** | >90°C | Critical alert |
| **RAM usage** | >95% | Alert |
| **Disk space** | >80% | Warning |
| **Disk space** | >90% | Alert |
| **Load average** | >CPU count | Alert |

### Storage Health

| Metric | Threshold | Action |
|--------|-----------|--------|
| **ZFS pool errors** | >0 | Alert immediately |
| **ZFS pool degraded** | Any degraded vdev | Critical alert |
| **ZFS scrub failed** | Last scrub error | Alert |
| **SMART reallocated sectors** | >0 | Warning |
| **SMART pending sectors** | >0 | Alert |
| **SMART failure** | Pre-fail | Critical - replace drive |

### UPS

| Metric | Threshold | Action |
|--------|-----------|--------|
| **Battery charge** | <20% | Warning |
| **Battery charge** | <10% | Alert |
| **On battery** | >5 min | Alert |
| **Runtime** | <5 min | Critical |

### Network

| Metric | Threshold | Action |
|--------|-----------|--------|
| **Device unreachable** | >2 min down | Alert |
| **High packet loss** | >5% | Warning |
| **Bandwidth saturation** | >90% | Warning |

### VMs/Services

| Metric | Threshold | Action |
|--------|-----------|--------|
| **VM stopped** | Critical VM down | Alert immediately |
| **Service unreachable** | HTTP 5xx or timeout | Alert |
| **Backup failed** | Any backup failure | Alert |
| **Certificate expiry** | <30 days | Warning |
| **Certificate expiry** | <7 days | Alert |

---

## Alert Destinations

### Email Alerts

**Recommended**: Set up SMTP relay for email alerts

**Options**:
1. Gmail SMTP (free, rate-limited)
2. SendGrid (free tier: 100 emails/day)
3. Mailgun (free tier available)
4. Self-hosted mail server (complex)

**Configuration Example** (Prometheus Alertmanager):
```yaml
# /etc/alertmanager/alertmanager.yml
receivers:
  - name: 'email'
    email_configs:
      - to: 'hutson@example.com'
        from: 'alerts@htsn.io'
        smarthost: 'smtp.gmail.com:587'
        auth_username: 'alerts@htsn.io'
        auth_password: 'app-password-here'
```

---

### Push Notifications

**Options**:
- **Pushover**: $5 one-time, reliable
- **Pushbullet**: Free tier available
- **Telegram Bot**: Free
- **Discord Webhook**: Free
- **Slack**: Free tier available

**Recommended**: Pushover or Telegram for mobile alerts
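
With a Telegram bot, an alert is a single HTTP call, which makes it easy to bolt onto any of the scripts in this document; a sketch (token and chat ID are placeholders):

```bash
# Sketch: send a push alert via a Telegram bot
curl -s "https://api.telegram.org/bot<TOKEN>/sendMessage" \
  -d chat_id=<CHAT_ID> \
  -d text="⚠️ Homelab alert: check the health report"
```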

---

### Home Assistant Alerts

Since Home Assistant is already running, use it for alerts:

**Automation Example**:
```yaml
automation:
  - alias: "UPS Low Battery Alert"
    trigger:
      - platform: numeric_state
        entity_id: sensor.cyberpower_battery_charge
        below: 20
    action:
      - service: notify.mobile_app
        data:
          message: "⚠️ UPS battery at {{ states('sensor.cyberpower_battery_charge') }}%"

  - alias: "Server High Temperature"
    trigger:
      - platform: template
        value_template: "{{ states('sensor.pve_cpu_temp') | float(0) > 85 }}"
    action:
      - service: notify.mobile_app
        data:
          message: "🔥 PVE CPU temperature: {{ states('sensor.pve_cpu_temp') }}°C"
```

**Needs**: Sensors for CPU temp, disk space, etc. in Home Assistant

---

## Monitoring Scripts

### Daily Health Check

Save as `~/bin/homelab-health-check.sh`:

```bash
#!/bin/bash
# Daily homelab health check

echo "=== Homelab Health Check ==="
echo "Date: $(date)"
echo ""

echo "=== Server Status ==="
ssh pve 'uptime' 2>/dev/null || echo "PVE: UNREACHABLE"
ssh pve2 'uptime' 2>/dev/null || echo "PVE2: UNREACHABLE"
echo ""

echo "=== CPU Temperatures ==="
ssh pve 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE: $(($(cat $f)/1000))°C"; fi; done'
ssh pve2 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE2: $(($(cat $f)/1000))°C"; fi; done'
echo ""

echo "=== UPS Status ==="
ssh pve 'upsc cyberpower@localhost | grep -E "battery.charge:|battery.runtime:|ups.load:|ups.status:"'
echo ""

echo "=== ZFS Pools ==="
ssh pve 'zpool status -x' 2>/dev/null
ssh pve2 'zpool status -x' 2>/dev/null
ssh truenas 'zpool status -x vault'
echo ""

echo "=== Disk Space ==="
ssh pve 'df -h | grep -E "Filesystem|/dev/(nvme|sd)"'
ssh truenas 'df -h /mnt/vault'
echo ""

echo "=== VM Status ==="
ssh pve 'qm list | grep running | wc -l' | xargs echo "PVE VMs running:"
ssh pve2 'qm list | grep running | wc -l' | xargs echo "PVE2 VMs running:"
echo ""

echo "=== Syncthing Connections ==="
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
  "http://127.0.0.1:8384/rest/system/connections" | \
  python3 -c "import sys,json; d=json.load(sys.stdin)['connections']; \
[print(f\"{v.get('name',k[:7])}: {'UP' if v['connected'] else 'DOWN'}\") for k,v in d.items()]"
echo ""

echo "=== Check Complete ==="
```

**Run daily**:
```cron
0 9 * * * ~/bin/homelab-health-check.sh | mail -s "Homelab Health Check" hutson@example.com
```

---

### ZFS Scrub Checker

```bash
#!/bin/bash
# Check last ZFS scrub status

echo "=== ZFS Scrub Status ==="

for host in pve pve2; do
  echo "--- $host ---"
  ssh $host 'zpool status | grep -A1 scrub'
  echo ""
done

echo "--- TrueNAS ---"
ssh truenas 'zpool status vault | grep -A1 scrub'
```

---

### SMART Health Checker

```bash
#!/bin/bash
# Check SMART health on all drives

echo "=== SMART Health Check ==="

echo "--- TrueNAS Drives ---"
ssh truenas 'smartctl --scan | while read dev type; do
  echo "=== $dev ===";
  smartctl -H $dev | grep -E "SMART overall|PASSED|FAILED";
done'

echo "--- PVE Drives ---"
ssh pve 'for dev in /dev/nvme* /dev/sd*; do
  [ -e "$dev" ] && echo "=== $dev ===" && smartctl -H $dev | grep -E "SMART|PASSED|FAILED";
done'
```

---

## Dashboard Recommendations

### Grafana Dashboard Layout

**Page 1: Overview**
- Server uptime
- CPU usage (all servers)
- RAM usage (all servers)
- Disk space (all pools)
- Network traffic
- UPS status

**Page 2: Storage**
- ZFS pool health
- SMART status for all drives
- I/O latency
- Scrub progress
- Disk temperatures

**Page 3: VMs**
- VM status (up/down)
- VM resource usage
- VM disk I/O
- VM network traffic

**Page 4: Services**
- Service health checks
- HTTP response times
- Certificate expiry dates
- Syncthing sync status

---

## Implementation Plan

### Phase 1: Basic Monitoring (Week 1)

- [ ] Install Uptime Kuma or Netdata
- [ ] Add HTTP checks for all services
- [ ] Configure UPS alerts in Home Assistant
- [ ] Set up daily health check email

**Estimated Time**: 4-6 hours

---

### Phase 2: Advanced Monitoring (Week 2-3)

- [ ] Install Prometheus + Grafana
- [ ] Deploy node_exporter on all servers
- [ ] Deploy zfs_exporter
- [ ] Deploy smartmon_exporter
- [ ] Create Grafana dashboards

**Estimated Time**: 8-12 hours

---

### Phase 3: Alerting (Week 4)

- [ ] Configure Alertmanager
- [ ] Set up email/push notifications
- [ ] Create alert rules for all critical metrics
- [ ] Test all alert paths
- [ ] Document alert procedures

**Estimated Time**: 4-6 hours

---

## Related Documentation

- [GATEWAY.md](GATEWAY.md) - Gateway monitoring and troubleshooting
- [UPS.md](UPS.md) - UPS monitoring details
- [STORAGE.md](STORAGE.md) - ZFS health checks
- [SERVICES.md](SERVICES.md) - Service inventory
- [HOMEASSISTANT.md](HOMEASSISTANT.md) - Home Assistant automations
- [MAINTENANCE.md](MAINTENANCE.md) - Regular maintenance checks

---

**Last Updated**: 2026-01-02
**Status**: ⚠️ **Partial monitoring - Gateway active, other systems need implementation**
382
N8N-INTEGRATIONS.md
Normal file
@@ -0,0 +1,382 @@
|
|||||||
|
# n8n Homelab Integrations - Quick Start Guide

n8n is running on your homelab network (10.10.10.207) and can access all local services. This guide sets up useful automations.

---

## Network Access Verified

n8n can connect to:
- ✅ **Home Assistant** (10.10.10.110:8123)
- ✅ **Prometheus** (10.10.10.206:9090)
- ✅ **Grafana** (10.10.10.206:3001)
- ✅ **Syncthing** (10.10.10.200:8384)
- ✅ **PiHole** (10.10.10.10)
- ✅ **Gitea** (10.10.10.220:3000)
- ✅ **Proxmox** (10.10.10.120:8006, 10.10.10.102:8006)
- ✅ **TrueNAS** (10.10.10.200)
- ✅ **All external APIs** (via internet)

---

## Initial Setup (First-Time)

1. Open **https://n8n.htsn.io**
2. Complete the setup wizard:
   - **Owner Email:** hutson@htsn.io
   - **Owner Name:** Hutson
   - **Password:** (choose secure password)
3. Skip data sharing (optional)

---

## Credentials to Add in n8n

Go to **Settings → Credentials** and add:

### 1. Home Assistant

| Field | Value |
|-------|-------|
| **Credential Type** | Home Assistant API |
| **Host** | `http://10.10.10.110:8123` |
| **Access Token** | (get from Home Assistant) |

**Get Token:** Home Assistant → Profile → Long-Lived Access Tokens → Create Token

---

### 2. Prometheus

| Field | Value |
|-------|-------|
| **Credential Type** | HTTP Request (Generic) |
| **URL** | `http://10.10.10.206:9090` |
| **Authentication** | None |

---

### 3. Grafana

| Field | Value |
|-------|-------|
| **Credential Type** | Grafana API |
| **URL** | `http://10.10.10.206:3001` |
| **API Key** | (create in Grafana) |

**Get API Key:** Grafana → Administration → Service Accounts → Create → Add Token

---

### 4. Syncthing

| Field | Value |
|-------|-------|
| **Credential Type** | HTTP Request (Generic) |
| **URL** | `http://10.10.10.200:8384` |
| **Header Name** | `X-API-Key` |
| **Header Value** | `VFJ7XZPJoWvkYj6fKzpQxc9u3XC8KUBs` |

---

### 5. Telegram Bot

| Field | Value |
|-------|-------|
| **Credential Type** | Telegram API |
| **Access Token** | `8450212653:AAHoVBlNUuA0vtrVPMNUfSgJh_gmFMxlrBg` |

**Your Chat ID:** `1004084736`

---
### 6. Proxmox

| Field | Value |
|-------|-------|
| **Credential Type** | HTTP Request (Generic) |
| **URL** | `https://10.10.10.120:8006` |
| **Authentication** | API Token |
| **Token** | (use monitoring@pve token if needed) |

---

## Starter Workflows

### Workflow 1: Homelab Health Check (Every Hour)

**Nodes:**
1. **Schedule Trigger** (every hour)
2. **HTTP Request** → Prometheus query for down hosts
   - URL: `http://10.10.10.206:9090/api/v1/query`
   - Query param: `query=up{job=~"node.*"} == 0`
3. **If** → Check if any hosts are down
4. **Telegram** → Send alert if hosts down

**PromQL Query:**
```
up{job=~"node.*"} == 0
```
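
You can sanity-check the query from any machine before wiring it into the HTTP Request node (a quick sketch; assumes `jq` is installed, otherwise drop the pipe):

```bash
# Ask Prometheus for node targets that are currently down
curl -sG 'http://10.10.10.206:9090/api/v1/query' \
  --data-urlencode 'query=up{job=~"node.*"} == 0' | jq '.data.result'
# An empty array ([]) means every node target is up
```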

---

### Workflow 2: Daily Backup Status

**Nodes:**
1. **Schedule Trigger** (8am daily)
2. **HTTP Request** → Query Syncthing sync status
   - URL: `http://10.10.10.200:8384/rest/db/status?folder=backup`
   - Header: `X-API-Key: VFJ7XZPJoWvkYj6fKzpQxc9u3XC8KUBs`
3. **Function** → Check if folder is syncing
4. **Telegram** → Send daily status report
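
The same request is easy to test by hand (assuming the `backup` folder ID exists in Syncthing):

```bash
# Query Syncthing's per-folder status; "state" should be "idle" when fully synced
curl -s -H "X-API-Key: VFJ7XZPJoWvkYj6fKzpQxc9u3XC8KUBs" \
  "http://10.10.10.200:8384/rest/db/status?folder=backup"
```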

---

### Workflow 3: High CPU Alert

**Nodes:**
1. **Schedule Trigger** (every 5 minutes)
2. **HTTP Request** → Prometheus CPU query
   - URL: `http://10.10.10.206:9090/api/v1/query`
   - Query: `100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)`
3. **If** → CPU > 90%
4. **Telegram** → Send alert

---

### Workflow 4: UPS Power Event

**Webhook Trigger Setup:**
1. Create webhook trigger in n8n
2. Get webhook URL: `https://n8n.htsn.io/webhook/ups-alert`
3. Configure NUT to call webhook on power events

**Nodes:**
1. **Webhook Trigger** → Receive UPS event
2. **Switch** → Route by event type (on battery, low battery, online)
3. **Telegram** → Send appropriate alert
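
One way to wire NUT to the webhook is a small notify script; a sketch only - the script path and pointing upsmon's NOTIFYCMD at it are assumptions, not current config (see [UPS.md](UPS.md) for the actual NUT setup):

```bash
#!/bin/sh
# Hypothetical /usr/local/bin/ups-webhook.sh, called by upsmon's NOTIFYCMD.
# upsmon exports NOTIFYTYPE (ONBATT, LOWBATT, ONLINE, ...) and UPSNAME,
# and passes the human-readable message as $1.
curl -s -X POST "https://n8n.htsn.io/webhook/ups-alert" \
  -H "Content-Type: application/json" \
  -d "{\"event\": \"${NOTIFYTYPE:-UNKNOWN}\", \"ups\": \"${UPSNAME:-ups}\", \"message\": \"$1\"}"
```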

---

### Workflow 5: Gitea → Deploy on Push

**Nodes:**
1. **Webhook Trigger** → Gitea push event
2. **If** → Check if branch is `main`
3. **SSH** → Connect to target server
4. **Execute Command** → `git pull && docker-compose up -d`
5. **Telegram** → Notify deployment complete

---

### Workflow 6: Syncthing Folder Behind Alert

**Nodes:**
1. **Schedule Trigger** (every 30 minutes)
2. **HTTP Request** → Get all folder statuses
   - URL: `http://10.10.10.200:8384/rest/stats/folder`
3. **Function** → Check if any folder has errors or is significantly behind
4. **If** → Errors found
5. **Telegram** → Alert with folder name and status

---

### Workflow 7: Grafana Alert Forwarder

**Purpose:** Forward Grafana alerts to Telegram

**Nodes:**
1. **Webhook Trigger** → Grafana webhook
2. **Function** → Parse alert data
3. **Telegram** → Format and send alert

**Grafana Setup:**
- Contact Point → Add webhook: `https://n8n.htsn.io/webhook/grafana-alerts`

---

### Workflow 8: Daily Homelab Summary

**Nodes:**
1. **Schedule Trigger** (9am daily)
2. **Multiple HTTP Requests in parallel:**
   - Prometheus: System uptime
   - Prometheus: Average CPU usage (24h)
   - Prometheus: Disk usage
   - Syncthing: Sync status (all folders)
   - PiHole: Queries blocked (24h)
3. **Function** → Format data as summary
4. **Telegram** → Send daily report

**Example Output:**
```
🏠 Homelab Daily Summary

✅ All systems operational
⏱️ Uptime: 14 days
📊 Avg CPU: 12%
💾 Disk: 45% used
🔄 Syncthing: All folders in sync
🛡️ PiHole: 2,341 queries blocked

Last updated: 2025-12-27 09:00
```

---

### Workflow 9: VM State Change Monitor

**Nodes:**
1. **Schedule Trigger** (every 1 minute)
2. **HTTP Request** → Query Proxmox API for VM list
3. **Function** → Compare with previous state (use Set node)
4. **If** → VM state changed
5. **Telegram** → Notify VM started/stopped

---

### Workflow 10: Internet Speed Test Alert

**Nodes:**
1. **Schedule Trigger** (every 6 hours)
2. **HTTP Request** → Prometheus speedtest exporter
3. **If** → Download speed < 500 Mbps
4. **Telegram** → Alert about slow internet

---

## Advanced Integration Ideas

### Home Assistant Automations
- Turn on lights when server room temperature > 80°F
- Trigger workflows from HA button press
- Send sensor data to external services

### Proxmox Automation
- Auto-snapshot VMs before updates
- Clone VMs for testing
- Monitor resource usage and rebalance

### Media Management
- Notify when new Plex content added
- Auto-organize downloads
- Send weekly watch statistics

### Backup Monitoring
- Verify all Syncthing folders synced
- Alert on ZFS scrub errors
- Monitor snapshot ages

### Security
- Alert on failed SSH attempts (from logs)
- Monitor SSL certificate expiration
- Track unusual network traffic patterns

---

## n8n Best Practices

1. **Error Handling:** Always add error workflows to catch failures
2. **Rate Limiting:** Don't query APIs too frequently
3. **Credentials:** Never hardcode - always use credential store
4. **Testing:** Use manual trigger during development
5. **Logging:** Add Set nodes to track workflow state
6. **Backups:** Export workflows regularly (Settings → Export)

---

## Useful PromQL Queries for n8n

**CPU Usage:**
```promql
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```

**Memory Usage:**
```promql
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
```

**Disk Usage:**
```promql
(node_filesystem_size_bytes{mountpoint="/"} - node_filesystem_avail_bytes{mountpoint="/"}) / node_filesystem_size_bytes{mountpoint="/"} * 100
```

**Hosts Down:**
```promql
up{job=~"node.*"} == 0
```

**Syncthing Disconnected:**
```promql
up{job=~"syncthing.*"} == 0
```

---

## Webhook URLs

After creating webhooks in n8n, you'll get URLs like:
- `https://n8n.htsn.io/webhook/your-webhook-name`

These can be called from:
- Grafana alerts
- Home Assistant automations
- Gitea webhooks
- Custom scripts
- UPS monitoring (NUT)

---

## Testing Credentials

Test each credential after adding:
1. Create simple workflow with manual trigger
2. Add HTTP Request node with credential
3. Execute and check response
4. Verify data returned correctly

---

## Troubleshooting

**Can't reach local service:**
- Verify service IP and port
- Check if service requires HTTPS
- Test with `curl` from docker-host2 first

**Webhook not triggering:**
- Check n8n is accessible: `curl https://n8n.htsn.io/webhook/test`
- Verify webhook URL in external service
- Check n8n execution logs

**Workflow fails silently:**
- Enable "Execute on Error" workflow
- Check workflow execution list
- Add Function nodes to log data

**API authentication fails:**
- Verify credential is saved
- Check API token hasn't expired
- Test with curl manually first

---

## Next Steps

1. **Add Credentials** - Start with Telegram and Prometheus
2. **Create Test Workflow** - Simple hourly health check
3. **Test Telegram** - Verify messages arrive
4. **Build Gradually** - Add one workflow at a time
5. **Export Backups** - Save workflows regularly

---

## Resources

- **n8n Docs:** https://docs.n8n.io
- **Community Workflows:** https://n8n.io/workflows
- **Your n8n:** https://n8n.htsn.io
- **Your API Docs:** [N8N.md](N8N.md)

**Last Updated:** 2025-12-27
308
N8N.md
Normal file
@@ -0,0 +1,308 @@
# n8n - Workflow Automation

n8n is an extendable workflow automation tool deployed on docker-host2 for automating tasks across your homelab and external services.

---

## Quick Reference

| Setting | Value |
|---------|-------|
| **URL** | https://n8n.htsn.io |
| **Local IP** | 10.10.10.207:5678 |
| **Server** | docker-host2 (PVE2 VMID 302) |
| **Database** | PostgreSQL (containerized) |
| **API Endpoint** | http://10.10.10.207:5678/api/v1/ |

---

## Claude Code Integration (MCP)

### n8n-MCP Server

The n8n-MCP server gives Claude Code deep knowledge of all 545+ n8n nodes, enabling it to build complete workflows from natural language descriptions.

**Installation:** Already configured in `~/Library/Application Support/Claude/claude_desktop_config.json`

```json
{
  "mcpServers": {
    "n8n-nodes": {
      "command": "npx",
      "args": ["-y", "@czlonkowski/n8n-mcp"]
    }
  }
}
```

**What This Enables:**
- ✅ Build n8n workflows from natural language
- ✅ Get detailed help with node parameters and options
- ✅ Best practices for n8n node usage
- ✅ Debug workflow issues with full node context

**Example Prompts:**
```
"Create an n8n workflow to monitor Prometheus and send Telegram alerts"
"Build a workflow that triggers when Syncthing has errors"
"What's the best n8n node to parse JSON responses?"
```

**How It Works:**
- MCP server provides offline documentation for all n8n nodes
- No connection to your n8n instance required
- Claude builds workflows that you can then import into https://n8n.htsn.io

**Resources:**
- [n8n-MCP GitHub](https://github.com/czlonkowski/n8n-mcp)
- [MCP Documentation](https://docs.n8n.io/advanced-ai/accessing-n8n-mcp-server/)

---

## API Access

### API Key

```
X-N8N-API-KEY: eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiI3NTdiMDA5YS1hMjM2LTQ5MzUtODkwNS0xZDY1MjYzZWE2OWYiLCJpc3MiOiJuOG4iLCJhdWQiOiJwdWJsaWMtYXBpIiwiaWF0IjoxNzY2ODEwMzA3fQ.RIZAbpDa7LiUPWk48qOscJ9-d9gRAA0afMDX_V3oSVo
```

### API Examples

**List Workflows:**
```bash
curl -H "X-N8N-API-KEY: eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiI3NTdiMDA5YS1hMjM2LTQ5MzUtODkwNS0xZDY1MjYzZWE2OWYiLCJpc3MiOiJuOG4iLCJhdWQiOiJwdWJsaWMtYXBpIiwiaWF0IjoxNzY2ODEwMzA3fQ.RIZAbpDa7LiUPWk48qOscJ9-d9gRAA0afMDX_V3oSVo" \
  http://10.10.10.207:5678/api/v1/workflows
```

**Get Workflow by ID:**
```bash
curl -H "X-N8N-API-KEY: YOUR_API_KEY" \
  http://10.10.10.207:5678/api/v1/workflows/{id}
```

**Trigger Workflow:**
```bash
curl -X POST \
  -H "X-N8N-API-KEY: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"data": {"key": "value"}}' \
  http://10.10.10.207:5678/api/v1/workflows/{id}/execute
```

**API Documentation:** https://docs.n8n.io/api/

---

## Deployment Details

### Docker Compose

**Location:** `/opt/n8n/docker-compose.yml` on docker-host2

**Services:**
- `n8n` - Main application (port 5678)
- `postgres` - Database backend

**Volumes:**
- `n8n_data` - Workflow data, credentials, settings
- `postgres_data` - Database storage

### Environment Configuration

```yaml
N8N_HOST: n8n.htsn.io
N8N_PORT: 5678
N8N_PROTOCOL: https
NODE_ENV: production
WEBHOOK_URL: https://n8n.htsn.io/
GENERIC_TIMEZONE: America/Los_Angeles
DB_TYPE: postgresdb
DB_POSTGRESDB_HOST: postgres
DB_POSTGRESDB_DATABASE: n8n
DB_POSTGRESDB_USER: n8n
DB_POSTGRESDB_PASSWORD: n8n_secure_password_2024
```

### Resource Limits

- **Memory**: 512MB-1GB (soft/hard)
- **CPU**: Shared (4 vCPUs on host)

---

## Common Tasks

### Restart n8n

```bash
ssh docker-host2 'cd /opt/n8n && docker compose restart n8n'
```

### View Logs

```bash
ssh docker-host2 'docker logs -f n8n'
```

### Backup Workflows

Workflows are stored in PostgreSQL. To backup:

```bash
ssh docker-host2 'docker exec n8n-postgres pg_dump -U n8n n8n > /tmp/n8n-backup-$(date +%Y%m%d).sql'
```
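
Restoring from such a dump is the mirror image (a sketch, assuming a dump file produced by the command above; untested here):

```bash
# Feed a plain-SQL dump back into the containerized PostgreSQL
# (replace YYYYMMDD with the actual date in the filename)
ssh docker-host2 'docker exec -i n8n-postgres psql -U n8n n8n < /tmp/n8n-backup-YYYYMMDD.sql'
```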

### Update n8n

```bash
ssh docker-host2 'cd /opt/n8n && docker compose pull n8n && docker compose up -d n8n'
```

---

## Traefik Configuration

**File:** `/etc/traefik/conf.d/n8n.yaml` on CT 202

```yaml
http:
  routers:
    n8n-secure:
      entryPoints:
        - websecure
      rule: "Host(`n8n.htsn.io`)"
      service: n8n
      tls:
        certResolver: cloudflare
      priority: 50

    n8n-redirect:
      entryPoints:
        - web
      rule: "Host(`n8n.htsn.io`)"
      middlewares:
        - n8n-https-redirect
      service: n8n
      priority: 50

  services:
    n8n:
      loadBalancer:
        servers:
          - url: "http://10.10.10.207:5678"

  middlewares:
    n8n-https-redirect:
      redirectScheme:
        scheme: https
        permanent: true
```

---

## Monitoring

### Prometheus

n8n exposes metrics at `http://10.10.10.207:5678/metrics` (if enabled)
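
To check whether the endpoint is live (if it 404s, enabling it is typically done via n8n's `N8N_METRICS=true` environment variable - treat that variable as an assumption to verify against the n8n docs):

```bash
# Probe the metrics endpoint; 200 plus Prometheus text format means it's enabled
curl -s -o /dev/null -w '%{http_code}\n' http://10.10.10.207:5678/metrics
curl -s http://10.10.10.207:5678/metrics | head -5
```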

### Grafana

n8n metrics can be visualized in Grafana dashboards

### Uptime Monitoring

Add to Pulse: https://pulse.htsn.io
- Monitor: https://n8n.htsn.io
- Check interval: 60s

---

## Troubleshooting

### n8n won't start

```bash
ssh docker-host2 'docker logs n8n | tail -50'
ssh docker-host2 'docker logs n8n-postgres | tail -50'
```

### Database connection issues

```bash
# Check postgres health
ssh docker-host2 'docker exec n8n-postgres pg_isready -U n8n'

# Restart postgres
ssh docker-host2 'cd /opt/n8n && docker compose restart postgres'
```

### SSL/HTTPS issues

```bash
# Check Traefik config
ssh root@10.10.10.250 'cat /etc/traefik/conf.d/n8n.yaml'

# Reload Traefik
ssh root@10.10.10.250 'systemctl reload traefik'
```

### API not responding

```bash
# Test API locally
curl -H "X-N8N-API-KEY: YOUR_KEY" http://10.10.10.207:5678/api/v1/workflows

# Check if n8n container is healthy
ssh docker-host2 'docker ps | grep n8n'
```

---

## Integration Examples

### Homelab Automation Ideas

1. **Backup Notifications** - Send Telegram alerts when backups complete
2. **Server Monitoring** - Query Prometheus and alert on high CPU/memory
3. **Media Management** - Trigger Sonarr/Radarr downloads
4. **Home Assistant Integration** - Automate smart home workflows
5. **Git Webhooks** - Deploy changes from Gitea automatically
6. **Syncthing Monitoring** - Alert when sync folders get behind
7. **UPS Alerts** - Notify on power events from NUT

---

## Security Notes

- API key provides full access to all workflows and data
- Store API key securely (added to this doc for homelab reference)
- n8n credentials are encrypted at rest in PostgreSQL
- HTTPS enforced via Traefik
- No public internet exposure (only via Tailscale)

---

## Quick Start

**New to n8n?** Start here: **[N8N-INTEGRATIONS.md](N8N-INTEGRATIONS.md)** ⭐

This guide includes:
- ✅ Network access verification
- ✅ Credential setup for all homelab services
- ✅ 10 ready-to-use starter workflows
- ✅ Home Assistant, Prometheus, Syncthing, Telegram integrations
- ✅ Troubleshooting tips

---

## Related Documentation

- [n8n Homelab Integrations Guide](N8N-INTEGRATIONS.md) - **START HERE**
- [docker-host2 VM details](VMS.md)
- [Traefik reverse proxy](TRAEFIK.md)
- [IP Assignments](IP-ASSIGNMENTS.md)
- [Pulse Setup](PULSE-SETUP.md)

**Last Updated:** 2025-12-26
509
POWER-MANAGEMENT.md
Normal file
@@ -0,0 +1,509 @@
# Power Management and Optimization

Documentation of power optimizations applied to reduce idle power consumption and heat generation.

## Overview

Combined estimated power draw: **~1615-1880W under load**, **~530-730W idle** (totals from the estimates below)

Through various optimizations, we've reduced idle power consumption by approximately **150-300W** compared to default settings.

---

## Power Draw Estimates

### PVE (10.10.10.120)

| Component | Idle | Load | TDP |
|-----------|------|------|-----|
| Threadripper PRO 3975WX | 150-200W | 400-500W | 280W |
| NVIDIA TITAN RTX | 2-3W | 250W | 280W |
| NVIDIA Quadro P2000 | 25W | 70W | 75W |
| RAM (128 GB DDR4) | 30-40W | 30-40W | - |
| Storage (NVMe + SSD) | 20-30W | 40-50W | - |
| HBAs, fans, misc | 20-30W | 20-30W | - |
| **Total** | **250-350W** | **800-940W** | - |

### PVE2 (10.10.10.102)

| Component | Idle | Load | TDP |
|-----------|------|------|-----|
| Threadripper PRO 3975WX | 150-200W | 400-500W | 280W |
| NVIDIA RTX A6000 | 11W | 280W | 300W |
| RAM (128 GB DDR4) | 30-40W | 30-40W | - |
| Storage (NVMe + HDD) | 20-30W | 40-50W | - |
| Fans, misc | 15-20W | 15-20W | - |
| **Total** | **226-330W** | **765-890W** | - |

### Combined

| Metric | Idle | Load |
|--------|------|------|
| Servers | 476-680W | 1565-1830W |
| Network gear | ~50W | ~50W |
| **Total** | **~530-730W** | **~1615-1880W** |
| **UPS Load** | 40-55% | 120-140% ⚠️ |

**Note**: UPS capacity is 1320W. Under heavy load, servers can exceed UPS capacity, which is acceptable since high load is rare.

---

## Optimizations Applied

### 1. KSMD Disabled (2024-12-17)

**KSM** (Kernel Same-page Merging) scans memory to deduplicate identical pages across VMs.

**Problem**:
- KSMD was consuming 44-57% CPU continuously on PVE
- Caused CPU temp to rise from 74°C to 83°C
- **Negative profit**: More power spent scanning than saved from deduplication

**Solution**: Disabled KSM permanently

**Configuration**:

**Systemd service**: `/etc/systemd/system/disable-ksm.service`
```ini
[Unit]
Description=Disable KSM (Kernel Same-page Merging)
After=multi-user.target

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'echo 0 > /sys/kernel/mm/ksm/run'
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
```

**Enable and start**:
```bash
systemctl daemon-reload
systemctl enable --now disable-ksm
systemctl mask ksmtuned  # Prevent re-enabling
```

**Verify**:
```bash
# KSM should be disabled (run=0)
cat /sys/kernel/mm/ksm/run  # Should output: 0

# ksmd should show 0% CPU
ps aux | grep ksmd
```

**Savings**: ~60-80W, plus it prevents the CPU temperature climb from 74°C to 83°C

**⚠️ Important**: Proxmox updates sometimes re-enable KSM. If CPU is unexpectedly hot, check:
```bash
cat /sys/kernel/mm/ksm/run
# If 1, disable it:
echo 0 > /sys/kernel/mm/ksm/run
systemctl mask ksmtuned
```

---

### 2. CPU Governor Optimization (2024-12-16)

Default CPU governor keeps cores at max frequency even when idle, wasting power.

#### PVE: `amd-pstate-epp` Driver

**Driver**: `amd-pstate-epp` (modern AMD P-state driver)
**Governor**: `powersave`
**EPP**: `balance_power`

**Configuration**:

**Systemd service**: `/etc/systemd/system/cpu-powersave.service`
```ini
[Unit]
Description=Set CPU governor to powersave with balance_power EPP
After=multi-user.target

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'for cpu in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo powersave > $cpu; done'
ExecStart=/bin/sh -c 'for cpu in /sys/devices/system/cpu/cpu*/cpufreq/energy_performance_preference; do echo balance_power > $cpu; done'
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
```

**Enable**:
```bash
systemctl daemon-reload
systemctl enable --now cpu-powersave
```

**Verify**:
```bash
# Check governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# Output: powersave

# Check EPP
cat /sys/devices/system/cpu/cpu0/cpufreq/energy_performance_preference
# Output: balance_power

# Check current frequency (should be low when idle)
grep MHz /proc/cpuinfo | head -5
# Should show ~1700-2200 MHz idle, up to 4000 MHz under load
```

#### PVE2: `acpi-cpufreq` Driver

**Driver**: `acpi-cpufreq` (older ACPI driver)
**Governor**: `schedutil` (adaptive, better than powersave for this driver)

**Configuration**:

**Systemd service**: `/etc/systemd/system/cpu-powersave.service`
```ini
[Unit]
Description=Set CPU governor to schedutil
After=multi-user.target

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'for cpu in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo schedutil > $cpu; done'
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
```

**Enable**:
```bash
systemctl daemon-reload
systemctl enable --now cpu-powersave
```

**Verify**:
```bash
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# Output: schedutil

grep MHz /proc/cpuinfo | head -5
# Should show ~1700-2200 MHz idle
```

**Savings**: ~60-120W combined (CPUs now idle at 1.7-2.2 GHz instead of 4 GHz)

**Performance impact**: Minimal - CPU still boosts to max frequency under load

---

### 3. GPU Power States (2024-12-16)

GPUs automatically enter low-power states when idle. Verified optimal.

| GPU | Location | Idle Power | P-State | Notes |
|-----|----------|------------|---------|-------|
| RTX A6000 | PVE2 | 11W | P8 | Excellent idle power |
| TITAN RTX | PVE | 2-3W | P8 | Excellent idle power |
| Quadro P2000 | PVE | 25W | P0 | Plex keeps it active |

**Check GPU power state**:
```bash
# Via nvidia-smi (if installed in VM)
ssh lmdev1 'nvidia-smi --query-gpu=name,power.draw,pstate --format=csv'

# Expected output:
# name, power.draw [W], pstate
# NVIDIA TITAN RTX, 2.50 W, P8

# Via lspci (from Proxmox host - shows link speed, not power)
ssh pve 'lspci | grep -i nvidia'
```

**P-States**:
- **P0**: Maximum performance
- **P8**: Minimum power (idle)

**No action needed** - GPUs automatically manage power states.

**Savings**: N/A (already optimal)

---

### 4. Syncthing Rescan Intervals (2024-12-16)

Aggressive 60-second rescans were keeping TrueNAS VM at 86% CPU constantly.

**Changed**:
- Large folders: 60s → **3600s** (1 hour)
- Affected: downloads (38GB), documents (11GB), desktop (7.2GB), movies, pictures, notes, config

**Configuration**: Via Syncthing UI on each device
- Settings → Folders → [Folder Name] → Advanced → Rescan Interval

**Savings**: ~60-80W (TrueNAS CPU usage dropped from 86% to <10%)

**Trade-off**: Changes take up to 1 hour to detect instead of 1 minute
- Still acceptable for most use cases
- Manual rescan available if needed: `curl -X POST "http://localhost:8384/rest/db/scan?folder=FOLDER" -H "X-API-Key: API_KEY"`
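
If you'd rather script the interval change than click through the UI, Syncthing's config REST API can patch a folder; a sketch, assuming a reasonably recent Syncthing with the `/rest/config` endpoints and a valid folder ID:

```bash
# Set a folder's rescan interval to 1 hour via the config API
# ("downloads" is an illustrative folder ID - substitute your own)
curl -s -X PATCH "http://localhost:8384/rest/config/folders/downloads" \
  -H "X-API-Key: API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"rescanIntervalS": 3600}'
```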

---

### 5. ksmtuned Disabled (2024-12-16)

**ksmtuned** is the daemon that tunes KSM parameters. Even with KSM disabled, the tuning daemon was still running.

**Solution**: Stopped and disabled on both servers

```bash
systemctl stop ksmtuned
systemctl disable ksmtuned
systemctl mask ksmtuned  # Prevent re-enabling
```

**Savings**: ~2-5W

---

### 6. HDD Spindown on PVE2 (2024-12-16)

**Problem**: `local-zfs2` pool (2x WD Red 6TB HDD) had only 768 KB used but drives spinning 24/7

**Solution**: Configure 30-minute spindown timeout

**Udev rule**: `/etc/udev/rules.d/69-hdd-spindown.rules`
```udev
# Spin down WD Red 6TB drives after 30 minutes idle
ACTION=="add|change", KERNEL=="sd[a-z]", ATTRS{model}=="WDC WD60EFRX-68L*", RUN+="/sbin/hdparm -S 241 /dev/%k"
```

**hdparm value**: 241 = 30 minutes
- For values 1-240, the timeout is `value * 5 seconds` (e.g. 120 = 10 minutes)
- For values 241-251, the timeout is `(value - 240) * 30 minutes`, so 241 = 30 minutes

**Apply rule**:
```bash
udevadm control --reload-rules
udevadm trigger

# Verify drives have spindown set
hdparm -I /dev/sda | grep -i standby
hdparm -I /dev/sdb | grep -i standby
```

**Check if drives are spun down**:
```bash
hdparm -C /dev/sda
# Output: drive state is: standby (spun down)
# or: drive state is: active/idle (spinning)
```

**Savings**: ~10-16W when spun down (8W per drive)

**Trade-off**: 5-10 second delay when accessing pool after spindown

---

## Potential Optimizations (Not Yet Applied)

### PCIe ASPM (Active State Power Management)

**Benefit**: Reduce power of idle PCIe devices
**Risk**: May cause stability issues with some devices
**Estimated savings**: 5-15W

**Test**:
```bash
# Check current ASPM state
lspci -vv | grep -i aspm

# Enable ASPM (test first)
# Add to kernel cmdline: pcie_aspm=force
# Edit /etc/default/grub:
GRUB_CMDLINE_LINUX_DEFAULT="quiet pcie_aspm=force"

# Update grub
update-grub
reboot
```

### NMI Watchdog Disable

**Benefit**: Reduce CPU wakeups
**Risk**: Harder to debug kernel hangs
**Estimated savings**: 1-3W

**Test**:
```bash
# Disable NMI watchdog
echo 0 > /proc/sys/kernel/nmi_watchdog

# Make permanent (add to kernel cmdline)
# Edit /etc/default/grub:
GRUB_CMDLINE_LINUX_DEFAULT="quiet nmi_watchdog=0"

update-grub
reboot
```

---

## Monitoring

### CPU Frequency

```bash
# Current frequency on all cores
ssh pve 'grep MHz /proc/cpuinfo | head -10'

# Governor
ssh pve 'cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor'

# Available governors
ssh pve 'cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors'
```

### CPU Temperature

```bash
# PVE
ssh pve 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE Tctl: $(($(cat $f)/1000))°C"; fi; done'

# PVE2
ssh pve2 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE2 Tctl: $(($(cat $f)/1000))°C"; fi; done'
```

**Healthy temps**: 70-80°C under load
**Warning**: >85°C
**Throttle**: 90°C (Tctl max for Threadripper PRO)

### GPU Power Draw

```bash
# If nvidia-smi installed in VM
ssh lmdev1 'nvidia-smi --query-gpu=name,power.draw,power.limit,pstate --format=csv'

# Sample output:
# name, power.draw [W], power.limit [W], pstate
# NVIDIA TITAN RTX, 2.50 W, 280.00 W, P8
```

### Power Consumption (UPS)

```bash
# Check UPS load percentage
ssh pve 'upsc cyberpower@localhost ups.load'

# Battery runtime (seconds)
ssh pve 'upsc cyberpower@localhost battery.runtime'

# Full UPS status
ssh pve 'upsc cyberpower@localhost'
```

See [UPS.md](UPS.md) for more UPS monitoring details.

### ZFS ARC Memory Usage

```bash
# PVE
ssh pve 'arc_summary | grep -A5 "ARC size"'

# TrueNAS
ssh truenas 'arc_summary | grep -A5 "ARC size"'
```

**ARC** (Adaptive Replacement Cache) uses RAM for ZFS caching. Adjust if needed:

```bash
# Limit ARC to 32 GB (example)
# Edit /etc/modprobe.d/zfs.conf:
options zfs zfs_arc_max=34359738368

# Apply (reboot required)
update-initramfs -u
reboot
```

---

## Troubleshooting

### CPU Not Downclocking

```bash
# Check current governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

# Should be: powersave (PVE) or schedutil (PVE2)
# If not, systemd service may have failed

# Check service status
systemctl status cpu-powersave

# Manually set governor (temporary)
echo powersave | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

# Check frequency
grep MHz /proc/cpuinfo | head -5
```

### High Idle Power After Update

**Common causes**:
1. **KSM re-enabled** after Proxmox update
   - Check: `cat /sys/kernel/mm/ksm/run`
   - Fix: `echo 0 > /sys/kernel/mm/ksm/run && systemctl mask ksmtuned`

2. **CPU governor reset** to default
   - Check: `cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor`
   - Fix: `systemctl restart cpu-powersave`

3. **GPU stuck in high-performance mode**
   - Check: `nvidia-smi --query-gpu=pstate --format=csv`
   - Fix: Restart VM or power cycle GPU

### HDDs Won't Spin Down

```bash
# Check spindown setting
hdparm -I /dev/sda | grep -i standby

# Set spindown manually (temporary)
hdparm -S 241 /dev/sda

# Check if drive is idle (ZFS may keep it active)
zpool iostat -v 1 5  # Watch for activity

# Check what's accessing the drive
lsof | grep /mnt/pool
```

---

## Power Optimization Summary

| Optimization | Savings | Applied | Notes |
|--------------|---------|---------|-------|
| **KSMD disabled** | 60-80W | ✅ | Also reduces CPU temp significantly |
| **CPU governor** | 60-120W | ✅ | PVE: powersave+balance_power, PVE2: schedutil |
| **GPU power states** | 0W | ✅ | Already optimal (automatic) |
| **Syncthing rescans** | 60-80W | ✅ | Reduced TrueNAS CPU usage |
| **ksmtuned disabled** | 2-5W | ✅ | Minor but easy win |
| **HDD spindown** | 10-16W | ✅ | Only when drives idle |
| PCIe ASPM | 5-15W | ❌ | Not yet tested |
| NMI watchdog | 1-3W | ❌ | Not yet tested |
| **Total savings** | **~150-300W** | - | Significant reduction |

---

## Related Documentation

- [UPS.md](UPS.md) - UPS capacity and power monitoring
- [STORAGE.md](STORAGE.md) - HDD spindown configuration
- [VMS.md](VMS.md) - VM resource allocation

---

**Last Updated**: 2025-12-22
69
PULSE-SETUP.md
Normal file
@@ -0,0 +1,69 @@
# Add n8n and docker-host2 to Pulse Monitoring

Pulse automatically monitors based on Prometheus targets, but you can also add custom HTTP monitors.

## Quick Steps

1. Open **https://pulse.htsn.io** in your browser
2. Login if required
3. Click **"+ Add Monitor"** or **"New Monitor"**

---

## Monitor: n8n

| Field | Value |
|-------|-------|
| **Name** | n8n Workflow Automation |
| **URL** | https://n8n.htsn.io |
| **Check Interval** | 60 seconds |
| **Monitor Type** | HTTP/HTTPS |
| **Expected Status** | 200 |
| **Timeout** | 10 seconds |
| **Alert After** | 2 failed checks |

---

## Monitor: docker-host2

| Field | Value |
|-------|-------|
| **Name** | docker-host2 (node_exporter) |
| **URL** | http://10.10.10.207:9100/metrics |
| **Check Interval** | 60 seconds |
| **Monitor Type** | HTTP |
| **Expected Status** | 200 |
| **Expected Content** | `node_exporter` |
| **Timeout** | 5 seconds |
| **Alert After** | 2 failed checks |

---

## Optional: docker-host2 SSH

| Field | Value |
|-------|-------|
| **Name** | docker-host2 SSH |
| **Host** | 10.10.10.207 |
| **Port** | 22 |
| **Monitor Type** | TCP Port |
| **Check Interval** | 60 seconds |
| **Timeout** | 5 seconds |

---

## Verification

After adding monitors, you should see:
- ✅ Green status for both monitors
- Response time graphs
- Uptime percentage
- Alert history (should be empty)
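
If a monitor comes up red, it helps to check the targets by hand before digging into Pulse itself (a quick sketch; `nc` assumes netcat is installed):

```bash
# n8n should answer 200 over HTTPS
curl -sI https://n8n.htsn.io | head -n1

# node_exporter should serve metrics containing its own name
curl -s http://10.10.10.207:9100/metrics | grep -m1 node_exporter

# SSH port should accept a TCP connection
nc -zv 10.10.10.207 22
```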

Access Pulse dashboard: **https://pulse.htsn.io**

---

**Note:** Pulse may already be monitoring these services via Prometheus integration. Check existing monitors before adding duplicates.

**Last Updated:** 2025-12-27
149
README.md
Normal file
@@ -0,0 +1,149 @@
# Homelab Documentation

Documentation for Hutson's home infrastructure - two Proxmox servers running VMs and containers for home automation, media, development, and AI workloads.

## 🚀 Quick Start

**New to this homelab?** Start here:
1. [CLAUDE.md](CLAUDE.md) - Quick reference guide for common tasks
2. [SSH-ACCESS.md](SSH-ACCESS.md) - How to connect to all systems
3. [IP-ASSIGNMENTS.md](IP-ASSIGNMENTS.md) - What's at what IP address
4. [SERVICES.md](SERVICES.md) - What services are running

**Claude Code Session?** Read [CLAUDE.md](CLAUDE.md) first - it's your command center.

## 📚 Documentation Index

### Infrastructure

| Document | Description |
|----------|-------------|
| [GATEWAY.md](GATEWAY.md) | UniFi gateway monitoring, watchdog services, troubleshooting |
| [VMS.md](VMS.md) | Complete VM/LXC inventory, specs, GPU passthrough |
| [HARDWARE.md](HARDWARE.md) | Server specs, GPUs, network cards, HBAs |
| [STORAGE.md](STORAGE.md) | ZFS pools, NFS/SMB shares, capacity planning |
| [NETWORK.md](NETWORK.md) | Bridges, VLANs, MTU config, Tailscale VPN |
| [POWER-MANAGEMENT.md](POWER-MANAGEMENT.md) | CPU governors, GPU power states, optimizations |
| [UPS.md](UPS.md) | UPS configuration, NUT monitoring, power failure handling |

### Services & Applications

| Document | Description |
|----------|-------------|
| [SERVICES.md](SERVICES.md) | Complete service inventory with URLs and credentials |
| [TRAEFIK.md](TRAEFIK.md) | Reverse proxy setup, adding services, SSL certificates |
| [HOMEASSISTANT.md](HOMEASSISTANT.md) | Home Assistant API, automations, integrations |
| [SYNCTHING.md](SYNCTHING.md) | File sync across all devices, API access, troubleshooting |
| [SALTBOX.md](#) | Media automation stack (Plex, *arr apps) (coming soon) |

### Access & Security

| Document | Description |
|----------|-------------|
| [SSH-ACCESS.md](SSH-ACCESS.md) | SSH keys, host aliases, password auth, QEMU agent |
| [IP-ASSIGNMENTS.md](IP-ASSIGNMENTS.md) | Complete IP address assignments for all devices |
| [SECURITY.md](#) | Firewall, access control, certificates (coming soon) |

### Operations

| Document | Description |
|----------|-------------|
| [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md) | 🚨 Backup strategy, disaster recovery (CRITICAL) |
| [MAINTENANCE.md](MAINTENANCE.md) | Regular procedures, update schedules, testing checklists |
| [MONITORING.md](MONITORING.md) | Health monitoring, alerts, dashboard recommendations |
| [DISASTER-RECOVERY.md](#) | Recovery procedures (coming soon) |

### Reference

| Document | Description |
|----------|-------------|
| [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md) | Storage enclosure SES commands, LCC troubleshooting |
| [SHELL-ALIASES.md](SHELL-ALIASES.md) | ZSH aliases for Claude Code sessions |

## 🖥️ System Overview

### Servers

- **PVE** (10.10.10.120) - Primary Proxmox server
  - AMD Threadripper PRO 3975WX (32-core)
  - 128 GB RAM
  - NVIDIA Quadro P2000 + TITAN RTX

- **PVE2** (10.10.10.102) - Secondary Proxmox server
  - AMD Threadripper PRO 3975WX (32-core)
  - 128 GB RAM
  - NVIDIA RTX A6000

### Key Services

| Service | Location | URL |
|---------|----------|-----|
| **Proxmox** | PVE | https://pve.htsn.io |
| **TrueNAS** | VM 100 | https://truenas.htsn.io |
| **Plex** | Saltbox VM | https://plex.htsn.io |
| **Home Assistant** | VM 110 | https://homeassistant.htsn.io |
| **Gitea** | VM 300 | https://git.htsn.io |
| **Pi-hole** | CT 200 | http://10.10.10.10/admin |
| **Traefik** | CT 202 | http://10.10.10.250:8080 |

[See IP-ASSIGNMENTS.md for complete list](IP-ASSIGNMENTS.md)

## 🔥 Emergency Procedures

### Power Failure
1. UPS provides ~15 min runtime at typical load
2. At 2 min remaining, NUT triggers graceful VM shutdown
3. When power returns, servers auto-boot and start VMs in order

See [UPS.md](UPS.md) for details.

### Service Down

```bash
# Quick health check (run from Mac Mini)
ssh pve 'qm list'    # Check VMs on PVE
ssh pve2 'qm list'   # Check VMs on PVE2
ssh pve 'pct list'   # Check containers

# Syncthing status
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
  "http://127.0.0.1:8384/rest/system/connections"

# Restart a VM
ssh pve 'qm stop VMID && qm start VMID'
```

See [CLAUDE.md](CLAUDE.md) for complete troubleshooting runbooks.

## 📞 Getting Help

**Claude Code Assistant**: Start a session in this directory - all context is available in CLAUDE.md

**Key Contacts**:
- Homelab Owner: Hutson
- Git Repo: https://git.htsn.io/hutson/homelab-docs
- Local Path: `~/Projects/homelab`

## 🔄 Recent Changes

See [CHANGELOG.md](#) (coming soon) or the Changelog section in [CLAUDE.md](CLAUDE.md).

## 📝 Contributing

When updating docs:
1. Keep CLAUDE.md as quick reference only
2. Move detailed content to specialized docs
3. Update cross-references
4. Test all commands before committing
5. Add entries to changelog

```bash
cd ~/Projects/homelab
git add -A
git commit -m "Update documentation: <description>"
git push
```

---

**Last Updated**: 2026-01-02
591
SERVICES.md
Normal file
@@ -0,0 +1,591 @@
# Services Inventory

Complete inventory of all services running across the homelab infrastructure.

## Overview

| Category | Services | Location | Access |
|----------|----------|----------|--------|
| **Infrastructure** | Proxmox, TrueNAS, Pi-hole, Traefik | VMs/CTs | Web UI + SSH |
| **Media** | Plex, *arr apps, downloaders | Saltbox VM | Web UI |
| **Development** | Gitea, Docker services | VMs | Web UI |
| **Home Automation** | Home Assistant, Happy Coder | VMs | Web UI + API |
| **Monitoring** | UPS (NUT), Syncthing, Pulse | Various | API |

**Total Services**: 25+ running services

---

## Service URLs Quick Reference

| Service | URL | Authentication | Purpose |
|---------|-----|----------------|---------|
| **Proxmox** | https://pve.htsn.io:8006 | Username + 2FA | VM management |
| **TrueNAS** | https://truenas.htsn.io | Username/password | NAS management |
| **Plex** | https://plex.htsn.io | Plex account | Media streaming |
| **Home Assistant** | https://homeassistant.htsn.io | Username/password | Home automation |
| **Gitea** | https://git.htsn.io | Username/password | Git repositories |
| **Excalidraw** | https://excalidraw.htsn.io | None (public) | Whiteboard |
| **Happy Coder** | https://happy.htsn.io | QR code auth | Remote Claude sessions |
| **Pi-hole** | http://10.10.10.10/admin | Password | DNS/ad blocking |
| **Traefik** | http://10.10.10.250:8080 | None (internal) | Reverse proxy dashboard |
| **Pulse** | https://pulse.htsn.io | Unknown | Monitoring dashboard |
| **Copyparty** | https://copyparty.htsn.io | Unknown | File sharing |
| **FindShyt** | https://findshyt.htsn.io | Unknown | Custom app |

---

## Infrastructure Services

### Proxmox VE (PVE & PVE2)

**Purpose**: Virtualization platform, VM/CT host
**Location**: Physical servers (10.10.10.120, 10.10.10.102)
**Access**: https://pve.htsn.io:8006, SSH
**Version**: Unknown (check: `pveversion`)

**Key Features**:
- Web-based management
- VM and LXC container support
- ZFS storage pools
- Clustering (2-node)
- API access

**Common Operations**:
```bash
# List VMs
ssh pve 'qm list'

# Create VM
ssh pve 'qm create VMID --name myvm ...'

# Backup VM
ssh pve 'vzdump VMID --dumpdir /var/lib/vz/dump'
```

**See**: [VMS.md](VMS.md)

---

### TrueNAS SCALE (VM 100)

**Purpose**: Central file storage, NFS/SMB shares
**Location**: VM on PVE (10.10.10.200)
**Access**: https://truenas.htsn.io, SSH
**Version**: TrueNAS SCALE (check version in UI)

**Key Features**:
- ZFS storage management
- NFS exports
- SMB shares
- Syncthing hub
- Snapshot management

**Storage Pools**:
- `vault`: Main data pool on EMC enclosure
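
A quick pool health check from the CLI (a sketch; assumes the `truenas` SSH alias from [SSH-ACCESS.md](SSH-ACCESS.md) and that the pool is named `vault` as above):

```bash
# "all pools are healthy" is the output you want from -x
ssh truenas 'zpool status -x'

# Capacity and fragmentation at a glance
ssh truenas 'zpool list vault'
```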

**Shares** (needs documentation):
- NFS exports for Saltbox media
- SMB shares for Windows access
- Syncthing sync folders

**See**: [STORAGE.md](STORAGE.md)

---

### Pi-hole (CT 200)

**Purpose**: Network-wide DNS server and ad blocker
**Location**: LXC on PVE (10.10.10.10)
**Access**: http://10.10.10.10/admin
**Version**: Unknown

**Configuration**:
- **Upstream DNS**: Cloudflare (1.1.1.1)
- **Blocklists**: Unknown count
- **Queries**: All network DNS traffic
- **DHCP**: Disabled (router handles DHCP)

**Stats** (example):
```bash
ssh pihole 'pihole -c -e'   # Stats
ssh pihole 'pihole status'  # Status
```

**Common Tasks**:
- Update blocklists: `ssh pihole 'pihole -g'`
- Whitelist domain: `ssh pihole 'pihole -w example.com'`
- View logs: `ssh pihole 'pihole -t'`

---

### Traefik (CT 202)

**Purpose**: Reverse proxy for all public-facing services
**Location**: LXC on PVE (10.10.10.250)
**Access**: http://10.10.10.250:8080/dashboard/
**Version**: Unknown (check: `traefik version`)

**Managed Services**:
- All *.htsn.io domains (except Saltbox services)
- SSL/TLS certificates via Let's Encrypt
- HTTP → HTTPS redirects

**See**: [TRAEFIK.md](TRAEFIK.md) for complete configuration
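
A quick liveness check (a sketch; assumes Traefik runs as a systemd service inside CT 202, which matches the `systemctl reload traefik` usage elsewhere in these docs):

```bash
# Service state and recent log lines
ssh root@10.10.10.250 'systemctl status traefik --no-pager | head -8'

# Dashboard should answer on the internal port
curl -sI http://10.10.10.250:8080/dashboard/ | head -n1
```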

---

## Media Services (Saltbox VM)

All media services run in Docker on the Saltbox VM (10.10.10.100).

### Plex Media Server

**Purpose**: Media streaming platform
**URL**: https://plex.htsn.io
**Access**: Plex account

**Features**:
- Hardware transcoding (TITAN RTX)
- Libraries: Movies, TV, Music
- Remote access enabled
- Managed by Saltbox

**Media Storage**:
- Source: TrueNAS NFS mounts
- Location: `/mnt/unionfs/`

**Common Tasks**:
```bash
# View Plex status
ssh saltbox 'docker logs -f plex'

# Restart Plex
ssh saltbox 'docker restart plex'

# Scan library
# (via Plex UI: Settings → Library → Scan)
```

---

### *arr Apps (Media Automation)

Running on Saltbox VM, managed via Traefik-Saltbox.

| Service | Purpose | URL | Notes |
|---------|---------|-----|-------|
| **Sonarr** | TV show automation | sonarr.htsn.io | Monitors, downloads, organizes TV |
| **Radarr** | Movie automation | radarr.htsn.io | Monitors, downloads, organizes movies |
| **Lidarr** | Music automation | lidarr.htsn.io | Monitors, downloads, organizes music |
| **Overseerr** | Request management | overseerr.htsn.io | User requests for media |
| **Bazarr** | Subtitle management | bazarr.htsn.io | Downloads subtitles |

**Downloaders**:
| Service | Purpose | URL |
|---------|---------|-----|
| **SABnzbd** | Usenet downloader | sabnzbd.htsn.io |
| **NZBGet** | Usenet downloader | nzbget.htsn.io |
| **qBittorrent** | Torrent client | qbittorrent.htsn.io |

**Indexers**:
| Service | Purpose | URL |
|---------|---------|-----|
| **Jackett** | Torrent indexer proxy | jackett.htsn.io |
| **NZBHydra2** | Usenet indexer proxy | nzbhydra2.htsn.io |

---

### Supporting Media Services

| Service | Purpose | URL |
|---------|---------|-----|
| **Tautulli** | Plex statistics | tautulli.htsn.io |
| **Organizr** | Service dashboard | organizr.htsn.io |
| **Authelia** | SSO authentication | auth.htsn.io |

---

## Development Services

### Gitea (VM 300)

**Purpose**: Self-hosted Git server
**Location**: VM on PVE2 (10.10.10.220)
**URL**: https://git.htsn.io
**Access**: Username/password

**Repositories**:
- homelab-docs (this documentation)
- Personal projects
- Private repos

**Common Tasks**:
```bash
# SSH to Gitea VM
ssh gitea-vm

# View logs
ssh gitea-vm 'journalctl -u gitea -f'

# Backup
ssh gitea-vm 'gitea dump -c /etc/gitea/app.ini'
```

**See**: Gitea documentation for API usage

---

### Docker Services (docker-host VM)

Running on VM 206 (10.10.10.206).

| Service | URL | Purpose | Port |
|---------|-----|---------|------|
| **Excalidraw** | https://excalidraw.htsn.io | Whiteboard/diagramming | 8080 |
| **Happy Server** | https://happy.htsn.io | Happy Coder relay | 3002 |
| **Pulse** | https://pulse.htsn.io | Monitoring dashboard | 7655 |

**Docker Compose files**: `/opt/{excalidraw,happy-server,pulse}/docker-compose.yml`

**Managing services**:
```bash
ssh docker-host 'docker ps'
ssh docker-host 'cd /opt/excalidraw && sudo docker-compose logs -f'
ssh docker-host 'cd /opt/excalidraw && sudo docker-compose restart'
```

---

## Home Automation

### Home Assistant (VM 110)

**Purpose**: Smart home automation platform
**Location**: VM on PVE (10.10.10.110)
**URL**: https://homeassistant.htsn.io
**Access**: Username/password

**Integrations**:
- UPS monitoring (NUT sensors)
- Unknown other integrations (needs documentation)

**Sensors**:
- `sensor.cyberpower_battery_charge`
- `sensor.cyberpower_load`
- `sensor.cyberpower_battery_runtime`
- `sensor.cyberpower_status`
|
||||||
|
|
||||||
|
**See**: [HOMEASSISTANT.md](HOMEASSISTANT.md)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Happy Coder Relay (docker-host)
|
||||||
|
|
||||||
|
**Purpose**: Self-hosted relay server for Happy Coder mobile app
|
||||||
|
**Location**: docker-host (10.10.10.206)
|
||||||
|
**URL**: https://happy.htsn.io
|
||||||
|
**Access**: QR code authentication
|
||||||
|
|
||||||
|
**Stack**:
|
||||||
|
- Happy Server (Node.js)
|
||||||
|
- PostgreSQL (user/session data)
|
||||||
|
- Redis (real-time events)
|
||||||
|
- MinIO (file/image storage)
|
||||||
|
|
||||||
|
**Clients**:
|
||||||
|
- Mac Mini (Happy daemon)
|
||||||
|
- Mobile app (iOS/Android)
|
||||||
|
|
||||||
|
**Credentials**:
|
||||||
|
- Master Secret: `3ccbfd03a028d3c278da7d2cf36d99b94cd4b1fecabc49ab006e8e89bc7707ac`
|
||||||
|
- PostgreSQL: `happy` / `happypass`
|
||||||
|
- MinIO: `happyadmin` / `happyadmin123`
|
||||||
|
|
||||||
|
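To spot-check the relay end to end, something like this works (a hedged sketch; the container name filter is an assumption based on the stack above):

```bash
# List Happy-related containers and their state (names assumed to contain "happy")
ssh docker-host 'docker ps --filter name=happy --format "{{.Names}}: {{.Status}}"'

# Hit the public health endpoint through Traefik
curl -Is https://happy.htsn.io/health | head -1
```
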
---

## File Sync & Storage

### Syncthing

**Purpose**: File synchronization across all devices

**Devices**:
- Mac Mini (10.10.10.125) - Hub
- MacBook - Mobile sync
- TrueNAS (10.10.10.200) - Central storage
- Windows PC (10.10.10.150) - Windows sync
- Phone (10.10.10.54) - Mobile sync

**API Keys**:
- Mac Mini: `oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5`
- MacBook: `qYkNdVLwy9qZZZ6MqnJr7tHX7KKdxGMJ`
- Phone: `Xxz3jDT4akUJe6psfwZsbZwG2LhfZuDM`

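These keys drive the Syncthing REST API; for example, listing the Mac Mini's active device connections (a minimal sketch run locally on the Mac Mini; assumes `jq` is installed):

```bash
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
  "http://127.0.0.1:8384/rest/system/connections" | jq '.connections | keys'
```
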
**Synced Folders**:
- documents (~11 GB)
- downloads (~38 GB)
- pictures
- notes
- desktop (~7.2 GB)
- config
- movies

**See**: [SYNCTHING.md](SYNCTHING.md)

---

### Copyparty (VM 201)

**Purpose**: Simple HTTP file sharing
**Location**: VM on PVE (10.10.10.201)
**URL**: https://copyparty.htsn.io
**Access**: Unknown

**Features**:
- Web-based file upload/download
- Lightweight

---

## Trading & AI Services

### AI Trading Platform (trading-vm)

**Purpose**: Algorithmic trading with AI models
**Location**: VM 301 on PVE2 (10.10.10.221)
**URL**: https://aitrade.htsn.io (if accessible)
**GPU**: RTX A6000 (48GB VRAM)

**Components**:
- Trading algorithms
- AI models for market prediction
- Real-time data feeds
- Backtesting infrastructure

**Access**: SSH only (no web UI documented)

---

### LM Dev (lmdev1)

**Purpose**: AI/LLM development environment
**Location**: VM 111 on PVE (10.10.10.111)
**URL**: https://lmdev.htsn.io (if accessible)
**GPU**: TITAN RTX (shared with Saltbox)

**Installed**:
- CUDA toolkit
- Python 3.11+
- PyTorch, TensorFlow
- Hugging Face transformers

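To confirm the shared TITAN RTX is visible inside the VM (assumes the NVIDIA driver is present, which the CUDA toolkit implies):

```bash
ssh lmdev1 'nvidia-smi'   # Should list the TITAN RTX and driver/CUDA versions
```
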
---

## Monitoring & Utilities

### UPS Monitoring (NUT)

**Purpose**: Monitor UPS status and trigger shutdowns
**Location**: PVE (master), PVE2 (slave)
**Access**: Command-line (`upsc`)

**Key Commands**:
```bash
ssh pve 'upsc cyberpower@localhost'
ssh pve 'upsc cyberpower@localhost ups.load'
ssh pve 'upsc cyberpower@localhost battery.runtime'
```

**Home Assistant Integration**: UPS sensors exposed

**See**: [UPS.md](UPS.md)

---

### Pulse Monitoring

**Purpose**: Unknown monitoring dashboard
**Location**: docker-host (10.10.10.206:7655)
**URL**: https://pulse.htsn.io
**Access**: Unknown

**Needs documentation**:
- What does it monitor?
- How to configure?
- Authentication?

---

### Tailscale VPN

**Purpose**: Secure remote access to homelab

**Subnet Routers**:
- PVE (100.113.177.80) - Primary
- UCG-Fiber (100.94.246.32) - Failover

**Devices on Tailscale**:
- Mac Mini: 100.108.89.58
- PVE: 100.113.177.80
- TrueNAS: 100.100.94.71
- Pi-hole: 100.112.59.128

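Quick connectivity checks from the primary subnet router (standard Tailscale CLI; the ping target is the TrueNAS tailnet IP from the list above):

```bash
ssh pve 'tailscale status'               # Peers, routes, and connection state
ssh pve 'tailscale ping 100.100.94.71'   # Verify TrueNAS is reachable over the tailnet
```
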
**See**: [NETWORK.md](NETWORK.md)

---

## Custom Applications

### FindShyt (CT 205)

**Purpose**: Unknown custom application
**Location**: LXC on PVE (10.10.10.8)
**URL**: https://findshyt.htsn.io
**Access**: Unknown

**Needs documentation**:
- What is this app?
- How to use it?
- Tech stack?

---

## Service Dependencies

### Critical Dependencies

```
TrueNAS
├── Plex (media files via NFS)
├── *arr apps (downloads via NFS)
├── Syncthing (central storage hub)
└── Backups (if configured)

Traefik (CT 202)
├── All *.htsn.io services
└── SSL certificate management

Pi-hole
└── DNS for entire network

Router
└── Gateway for all services
```

### Startup Order

**See [VMS.md](VMS.md)** for VM boot order configuration (example sketch after this list):
1. TrueNAS (storage first)
2. Saltbox (depends on TrueNAS NFS)
3. Other VMs
4. Containers

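Boot order is configured per VM in Proxmox; a minimal sketch (the VMID matches TrueNAS, but the 60-second delay is illustrative):

```bash
# Start TrueNAS (VM 100) first at boot, then wait 60s before starting dependents
ssh pve 'qm set 100 --startup order=1,up=60'
```
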
---

## Service Port Reference

### Well-Known Ports

| Port | Service | Protocol | Purpose |
|------|---------|----------|---------|
| 22 | SSH | TCP | Remote access |
| 53 | Pi-hole | UDP | DNS queries |
| 80 | Traefik | TCP | HTTP (redirects to 443) |
| 443 | Traefik | TCP | HTTPS |
| 3000 | Gitea | TCP | Git HTTP/S |
| 8006 | Proxmox | TCP | Web UI |
| 32400 | Plex | TCP | Plex Media Server |
| 8384 | Syncthing | TCP | Web UI |
| 22000 | Syncthing | TCP | Sync protocol |

### Internal Ports

| Port | Service | Purpose |
|------|---------|---------|
| 3002 | Happy Server | Relay backend |
| 5432 | PostgreSQL | Happy Server DB |
| 6379 | Redis | Happy Server cache |
| 7655 | Pulse | Monitoring |
| 8080 | Excalidraw | Whiteboard |
| 8080 | Traefik | Dashboard |
| 9000 | MinIO | Object storage |

---

## Service Health Checks

### Quick Health Check Script

```bash
#!/bin/bash
# Check all critical services

echo "=== Infrastructure ==="
curl -Is https://pve.htsn.io:8006 | head -1
curl -Is https://truenas.htsn.io | head -1
curl -I http://10.10.10.10/admin 2>/dev/null | head -1
echo ""

echo "=== Media Services ==="
curl -Is https://plex.htsn.io | head -1
curl -Is https://sonarr.htsn.io | head -1
curl -Is https://radarr.htsn.io | head -1
echo ""

echo "=== Development ==="
curl -Is https://git.htsn.io | head -1
curl -Is https://excalidraw.htsn.io | head -1
echo ""

echo "=== Home Automation ==="
curl -Is https://homeassistant.htsn.io | head -1
curl -Is https://happy.htsn.io/health | head -1
```

### Service-Specific Checks

```bash
# Proxmox VMs
ssh pve 'qm list | grep running'

# Docker services
ssh docker-host 'docker ps --format "{{.Names}}: {{.Status}}"'

# Syncthing
curl -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
  "http://127.0.0.1:8384/rest/system/status"

# UPS
ssh pve 'upsc cyberpower@localhost ups.status'
```

---

## Service Credentials

**Location**: See individual service documentation

| Service | Credentials Location | Notes |
|---------|---------------------|-------|
| Proxmox | Proxmox UI | Username + 2FA |
| TrueNAS | TrueNAS UI | Root password |
| Plex | Plex account | Managed externally |
| Gitea | Gitea DB | Self-managed |
| Pi-hole | `/etc/pihole/setupVars.conf` | Admin password |
| Happy Server | [CLAUDE.md](CLAUDE.md) | Master secret, DB passwords |

**⚠️ Security Note**: Never commit credentials to Git. Use proper secrets management.

---

## Related Documentation

- [VMS.md](VMS.md) - VM/service locations
- [TRAEFIK.md](TRAEFIK.md) - Reverse proxy config
- [IP-ASSIGNMENTS.md](IP-ASSIGNMENTS.md) - Service IP addresses
- [NETWORK.md](NETWORK.md) - Network configuration
- [MONITORING.md](MONITORING.md) - Monitoring setup (coming soon)

---

**Last Updated**: 2025-12-22
**Status**: ⚠️ Incomplete - many services need documentation (passwords, features, usage)
475
SSH-ACCESS.md
Normal file
@@ -0,0 +1,475 @@
# SSH Access

Documentation for SSH access to all homelab systems, including key authentication, password authentication for special cases, and QEMU guest agent usage.

## Overview

Most systems use **SSH key authentication** with the `~/.ssh/homelab` key. A few special cases require **password authentication** due to platform limitations: currently the Windows PC (the UniFi router previously did too, but now accepts key auth; see below).

**SSH Password**: `GrilledCh33s3#` (for systems without key auth)

---

## SSH Key Authentication (Primary Method)

### SSH Key Configuration

SSH keys are configured in `~/.ssh/config` on both Mac Mini and MacBook.

**Key file**: `~/.ssh/homelab` (Ed25519 key)

**Key deployed to**: All Proxmox hosts, VMs, and LXCs (13 total hosts)

### Host Aliases

Use these convenient aliases instead of IP addresses:

| Host Alias | IP | User | Type | Notes |
|------------|-----|------|------|-------|
| `ucg-fiber` / `gateway` | 10.10.10.1 | root | UniFi Gateway | Router/firewall |
| `pve` | 10.10.10.120 | root | Proxmox | Primary server |
| `pve2` | 10.10.10.102 | root | Proxmox | Secondary server |
| `truenas` | 10.10.10.200 | root | VM | NAS/storage |
| `saltbox` | 10.10.10.100 | hutson | VM | Media automation |
| `lmdev1` | 10.10.10.111 | hutson | VM | AI/LLM development |
| `docker-host` | 10.10.10.206 | hutson | VM | Docker services (PVE) |
| `docker-host2` | 10.10.10.207 | hutson | VM | Docker services (PVE2) - MetaMCP, n8n |
| `fs-dev` | 10.10.10.5 | hutson | VM | Development |
| `copyparty` | 10.10.10.201 | hutson | VM | File sharing |
| `gitea-vm` | 10.10.10.220 | hutson | VM | Git server |
| `trading-vm` | 10.10.10.221 | hutson | VM | AI trading platform |
| `pihole` | 10.10.10.10 | root | LXC | DNS/Ad blocking |
| `traefik` | 10.10.10.250 | root | LXC | Reverse proxy |
| `findshyt` | 10.10.10.8 | root | LXC | Custom app |

### Usage Examples

```bash
# List VMs on PVE
ssh pve 'qm list'

# Check ZFS pool on TrueNAS
ssh truenas 'zpool status vault'

# List Docker containers on Saltbox
ssh saltbox 'docker ps'

# Check Pi-hole status
ssh pihole 'pihole status'

# View Traefik config
ssh pve 'pct exec 202 -- cat /etc/traefik/traefik.yaml'
```

### SSH Config File

**Location**: `~/.ssh/config`

**Example entries**:

```sshconfig
# Proxmox Servers
Host pve
    HostName 10.10.10.120
    User root
    IdentityFile ~/.ssh/homelab

Host pve2
    HostName 10.10.10.102
    User root
    IdentityFile ~/.ssh/homelab
    # Post-quantum KEX causes MTU issues - use classic
    KexAlgorithms curve25519-sha256

# VMs
Host truenas
    HostName 10.10.10.200
    User root
    IdentityFile ~/.ssh/homelab

Host saltbox
    HostName 10.10.10.100
    User hutson
    IdentityFile ~/.ssh/homelab

Host lmdev1
    HostName 10.10.10.111
    User hutson
    IdentityFile ~/.ssh/homelab

Host docker-host
    HostName 10.10.10.206
    User hutson
    IdentityFile ~/.ssh/homelab

Host docker-host2
    HostName 10.10.10.207
    User hutson
    IdentityFile ~/.ssh/homelab

Host fs-dev
    HostName 10.10.10.5
    User hutson
    IdentityFile ~/.ssh/homelab

Host copyparty
    HostName 10.10.10.201
    User hutson
    IdentityFile ~/.ssh/homelab

Host gitea-vm
    HostName 10.10.10.220
    User hutson
    IdentityFile ~/.ssh/homelab

Host trading-vm
    HostName 10.10.10.221
    User hutson
    IdentityFile ~/.ssh/homelab

# LXC Containers
Host pihole
    HostName 10.10.10.10
    User root
    IdentityFile ~/.ssh/homelab

Host traefik
    HostName 10.10.10.250
    User root
    IdentityFile ~/.ssh/homelab

Host findshyt
    HostName 10.10.10.8
    User root
    IdentityFile ~/.ssh/homelab
```

---

## Password Authentication (Special Cases)

Some systems don't support SSH key auth or have other limitations.

### UniFi Router (10.10.10.1) - NOW USES KEY AUTH

**Host alias**: `ucg-fiber` or `gateway`

**Status**: SSH key authentication now works (as of 2026-01-02)

**Commands**:

```bash
# Run command on router (using SSH key)
ssh ucg-fiber 'hostname'

# Get ARP table (all device IPs)
ssh ucg-fiber 'cat /proc/net/arp'

# Check Tailscale status
ssh ucg-fiber 'tailscale status'

# Check memory usage
ssh ucg-fiber 'free -m'
```

**Note**: Key may need to be re-deployed after firmware updates if UniFi clears authorized_keys.

### Windows PC (10.10.10.150)

**OS**: Windows with OpenSSH server
**User**: `claude`
**Password**: `GrilledCh33s3#`
**Shell**: PowerShell (not bash)

**Commands**:

```bash
# Run PowerShell command
sshpass -p 'GrilledCh33s3#' ssh claude@10.10.10.150 'Get-Process | Select -First 5'

# Check Syncthing status
sshpass -p 'GrilledCh33s3#' ssh claude@10.10.10.150 'Get-Process -Name syncthing -ErrorAction SilentlyContinue'

# Restart Syncthing
sshpass -p 'GrilledCh33s3#' ssh claude@10.10.10.150 'Stop-Process -Name syncthing -Force; Start-ScheduledTask -TaskName "Syncthing"'
```

**⚠️ Important**: Use `;` (semicolon) to chain PowerShell commands, NOT `&&` (bash syntax).

**Why not key auth?**: Could be configured, but password auth works and is simpler for Windows.

---

## QEMU Guest Agent

Most VMs have the QEMU guest agent installed, allowing command execution without SSH.

### VMs with QEMU Agent

| VMID | VM Name | Use Case |
|------|---------|----------|
| 100 | truenas | Execute commands, check ZFS |
| 101 | saltbox | Execute commands, Docker mgmt |
| 105 | fs-dev | Execute commands |
| 111 | lmdev1 | Execute commands |
| 201 | copyparty | Execute commands |
| 206 | docker-host | Execute commands |
| 300 | gitea-vm | Execute commands |
| 301 | trading-vm | Execute commands |

### VM WITHOUT QEMU Agent

**VMID 110 (homeassistant)**: No QEMU agent installed
- Access via web UI only
- Or install SSH server manually if needed

### Usage Examples

**Basic syntax**:
```bash
ssh pve 'qm guest exec VMID -- bash -c "COMMAND"'
```

**Examples**:

```bash
# Check ZFS pool on TrueNAS (without SSH)
ssh pve 'qm guest exec 100 -- bash -c "zpool status vault"'

# Get VM IP addresses
ssh pve 'qm guest exec 100 -- bash -c "ip addr"'

# Check Docker containers on Saltbox
ssh pve 'qm guest exec 101 -- bash -c "docker ps"'

# Run multi-line command
ssh pve 'qm guest exec 100 -- bash -c "df -h; free -h; uptime"'
```

**When to use QEMU agent vs SSH**:
- ✅ Use **SSH** for interactive sessions, file editing, complex tasks
- ✅ Use **QEMU agent** for one-off commands, when SSH is down, or VM has no network
- ⚠️ QEMU agent is slower for multiple commands (use SSH instead)

---

## Troubleshooting SSH Issues

### Connection Refused

```bash
# Check if SSH service is running
ssh pve 'systemctl status sshd'

# Check if port 22 is open
nc -zv 10.10.10.XXX 22

# Check firewall
ssh pve 'iptables -L -n | grep 22'
```

### Permission Denied (Public Key)

```bash
# Verify key file exists
ls -la ~/.ssh/homelab

# Check key permissions (should be 600)
chmod 600 ~/.ssh/homelab

# Test SSH key auth verbosely
ssh -vvv -i ~/.ssh/homelab root@10.10.10.120

# Check authorized_keys on remote (via QEMU agent if SSH broken)
ssh pve 'qm guest exec VMID -- bash -c "cat ~/.ssh/authorized_keys"'
```

### Slow SSH Connection (PVE2 Issue)

**Problem**: SSH to PVE2 hangs for 30+ seconds before connecting
**Cause**: MTU mismatch (vmbr0=9000, nic1=1500) causing post-quantum KEX packet fragmentation
**Fix**: Use classic KEX algorithm instead

**In `~/.ssh/config`**:
```sshconfig
Host pve2
    HostName 10.10.10.102
    User root
    IdentityFile ~/.ssh/homelab
    # Avoid post-quantum mlkem768x25519-sha256
    KexAlgorithms curve25519-sha256
```

**Permanent fix**: Set `nic1` MTU to 9000 in `/etc/network/interfaces` on PVE2

---

## Adding SSH Keys to New Systems

### Linux (VMs/LXCs)

```bash
# Copy public key to new host
ssh-copy-id -i ~/.ssh/homelab user@hostname

# Or manually:
ssh user@hostname 'mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys' < ~/.ssh/homelab.pub
ssh user@hostname 'chmod 700 ~/.ssh && chmod 600 ~/.ssh/authorized_keys'
```

### LXC Containers (Root User)

```bash
# Via pct exec from Proxmox host
ssh pve 'pct exec CTID -- bash -c "mkdir -p /root/.ssh"'
ssh pve 'pct exec CTID -- bash -c "echo \"$(cat ~/.ssh/homelab.pub)\" >> /root/.ssh/authorized_keys"'
ssh pve 'pct exec CTID -- bash -c "chmod 700 /root/.ssh && chmod 600 /root/.ssh/authorized_keys"'

# Also enable PermitRootLogin in sshd_config
ssh pve 'pct exec CTID -- bash -c "sed -i \"s/^#*PermitRootLogin.*/PermitRootLogin prohibit-password/\" /etc/ssh/sshd_config"'
ssh pve 'pct exec CTID -- bash -c "systemctl restart sshd"'
```

### VMs (via QEMU Agent)

```bash
# Add key via QEMU agent (if SSH not working)
ssh pve 'qm guest exec VMID -- bash -c "mkdir -p ~/.ssh"'
ssh pve 'qm guest exec VMID -- bash -c "echo \"$(cat ~/.ssh/homelab.pub)\" >> ~/.ssh/authorized_keys"'
ssh pve 'qm guest exec VMID -- bash -c "chmod 700 ~/.ssh && chmod 600 ~/.ssh/authorized_keys"'
```

---

## SSH Key Management

### Rotate SSH Keys (Future)

When rotating SSH keys:

1. Generate new key pair:
```bash
ssh-keygen -t ed25519 -f ~/.ssh/homelab-new -C "homelab-new"
```

2. Deploy new key to all hosts (keep old key for now):
```bash
for host in pve pve2 truenas saltbox lmdev1 docker-host fs-dev copyparty gitea-vm trading-vm pihole traefik findshyt; do
  ssh-copy-id -i ~/.ssh/homelab-new $host
done
```

3. Update `~/.ssh/config` to use new key:
```sshconfig
IdentityFile ~/.ssh/homelab-new
```

4. Test all connections:
```bash
for host in pve pve2 truenas saltbox lmdev1 docker-host fs-dev copyparty gitea-vm trading-vm pihole traefik findshyt; do
  echo "Testing $host..."
  ssh $host 'hostname'
done
```

5. Remove old key from all hosts once confirmed working

---

## Quick Reference

### Common SSH Operations

```bash
# Execute command on remote host
ssh host 'command'

# Execute multiple commands
ssh host 'command1 && command2'

# Copy file to remote
scp file host:/path/

# Copy file from remote
scp host:/path/file ./

# Execute command on Proxmox VM (via QEMU agent)
ssh pve 'qm guest exec VMID -- bash -c "command"'

# Execute command on LXC
ssh pve 'pct exec CTID -- command'

# Interactive shell
ssh host

# SSH with X11 forwarding
ssh -X host
```

### Troubleshooting Commands

```bash
# Test SSH with verbose output
ssh -vvv host

# Check SSH service status (remote)
ssh host 'systemctl status sshd'

# Check SSH config (local)
ssh -G host

# Test port connectivity
nc -zv hostname 22
```

---

## Security Best Practices

### Current Security Posture

✅ **Good**:
- SSH keys used instead of passwords (where possible)
- Keys use Ed25519 (modern, secure algorithm)
- Root login disabled on VMs (use sudo instead)
- SSH keys have proper permissions (600)

⚠️ **Could Improve**:
- [ ] Disable password authentication on all hosts (force key-only)
- [ ] Use SSH certificate authority instead of individual keys
- [ ] Set up SSH bastion host (jump server)
- [ ] Enable 2FA for SSH (via PAM + Google Authenticator)
- [ ] Implement SSH key rotation policy (annually)

### Hardening SSH (Future)

For additional security, consider:

```sshconfig
# /etc/ssh/sshd_config (on remote hosts)
# Note: sshd does not allow trailing comments after values
# No root password login
PermitRootLogin prohibit-password
# Disable password auth entirely
PasswordAuthentication no
# Only allow key auth
PubkeyAuthentication yes
AuthorizedKeysFile .ssh/authorized_keys
# Limit auth attempts
MaxAuthTries 3
# Limit concurrent sessions
MaxSessions 10
# Timeout idle sessions
ClientAliveInterval 300
# Drop after 2 missed keepalives
ClientAliveCountMax 2
```

**Apply after editing**:
```bash
systemctl restart sshd
```

---

## Related Documentation

- [VMS.md](VMS.md) - Complete VM/CT inventory
- [NETWORK.md](NETWORK.md) - Network configuration
- [IP-ASSIGNMENTS.md](IP-ASSIGNMENTS.md) - IP addresses for all hosts
- [SECURITY.md](#) - Security policies (coming soon)

---

**Last Updated**: 2025-12-22
510
STORAGE.md
Normal file
@@ -0,0 +1,510 @@
# Storage Architecture

Documentation of all storage pools, datasets, shares, and capacity planning across the homelab.

## Overview

### Storage Distribution

| Location | Type | Capacity | Purpose |
|----------|------|----------|---------|
| **PVE** | NVMe + SSD mirrors | ~9 TB usable | VM storage, fast IO |
| **PVE2** | NVMe + HDD mirrors | ~6+ TB usable | VM storage, bulk data |
| **TrueNAS** | ZFS pool + EMC enclosure | ~12+ TB usable | Central file storage, NFS/SMB |

---

## PVE (10.10.10.120) Storage Pools

### nvme-mirror1 (Primary Fast Storage)
- **Type**: ZFS mirror
- **Devices**: 2x Sabrent Rocket Q NVMe
- **Capacity**: 3.6 TB usable
- **Purpose**: High-performance VM storage
- **Used By**:
  - Critical VMs requiring fast IO
  - Database workloads
  - Development environments

**Check status**:
```bash
ssh pve 'zpool status nvme-mirror1'
ssh pve 'zpool list nvme-mirror1'
```

### nvme-mirror2 (Secondary Fast Storage)
- **Type**: ZFS mirror
- **Devices**: 2x Kingston SFYRD 2TB NVMe
- **Capacity**: 1.8 TB usable
- **Purpose**: Additional fast VM storage
- **Used By**: TBD

**Check status**:
```bash
ssh pve 'zpool status nvme-mirror2'
ssh pve 'zpool list nvme-mirror2'
```

### rpool (Root Pool)
- **Type**: ZFS mirror
- **Devices**: 2x Samsung 870 QVO 4TB SSD
- **Capacity**: 3.6 TB usable
- **Purpose**: Proxmox OS, container storage, VM backups
- **Used By**:
  - Proxmox root filesystem
  - LXC containers
  - Local VM backups

**Check status**:
```bash
ssh pve 'zpool status rpool'
ssh pve 'df -h /var/lib/vz'
```

### Storage Pool Usage Summary (PVE)

**Get current usage**:
```bash
ssh pve 'zpool list'
ssh pve 'pvesm status'
```

---

## PVE2 (10.10.10.102) Storage Pools

### nvme-mirror3 (Fast Storage)
- **Type**: ZFS mirror
- **Devices**: 2x NVMe (model unknown)
- **Capacity**: Unknown (needs investigation)
- **Purpose**: High-performance VM storage
- **Used By**: Trading VM (301), other VMs

**Check status**:
```bash
ssh pve2 'zpool status nvme-mirror3'
ssh pve2 'zpool list nvme-mirror3'
```

### local-zfs2 (Bulk Storage)
- **Type**: ZFS mirror
- **Devices**: 2x WD Red 6TB HDD
- **Capacity**: ~6 TB usable
- **Purpose**: Bulk/archival storage
- **Power Management**: 30-minute spindown configured
  - Saves ~10-16W when idle
  - Udev rule: `/etc/udev/rules.d/69-hdd-spindown.rules`
  - Command: `hdparm -S 241` (30 min)

**Notes**:
- Pool had only 768 KB used as of 2024-12-16
- Drives configured to spin down after 30 min idle
- Good for archival, NOT for active workloads

**Check status**:
```bash
ssh pve2 'zpool status local-zfs2'
ssh pve2 'zpool list local-zfs2'

# Check if drives are spun down
ssh pve2 'hdparm -C /dev/sdX'   # Shows active/standby
```

---

## TrueNAS (VM 100 @ 10.10.10.200) - Central Storage

### ZFS Pool: vault

**Primary storage pool** for all shared data.

**Devices**: ❓ Needs investigation
- EMC storage enclosure with multiple drives
- SAS connection via LSI SAS2308 HBA (passed through to VM)

**Capacity**: ❓ Needs investigation

**Check pool status**:
```bash
ssh truenas 'zpool status vault'
ssh truenas 'zpool list vault'

# Get detailed capacity
ssh truenas 'zfs list -o name,used,avail,refer,mountpoint'
```

### Datasets (Known)

Based on Syncthing configuration, likely datasets:

| Dataset | Purpose | Synced Devices | Notes |
|---------|---------|----------------|-------|
| vault/documents | Personal documents | Mac Mini, MacBook, Windows PC, Phone | ~11 GB |
| vault/downloads | Downloads folder | Mac Mini, TrueNAS | ~38 GB |
| vault/pictures | Photos | Mac Mini, MacBook, Phone | Unknown size |
| vault/notes | Note files | Mac Mini, MacBook, Phone | Unknown size |
| vault/desktop | Desktop sync | Unknown | 7.2 GB |
| vault/movies | Movie library | Unknown | Unknown size |
| vault/config | Config files | Mac Mini, MacBook | Unknown size |

**Get complete dataset list**:
```bash
ssh truenas 'zfs list -r vault'
```

### NFS/SMB Shares

**Status**: ❓ Not documented

**Needs investigation**:
```bash
# List NFS exports
ssh truenas 'showmount -e localhost'

# List SMB shares
ssh truenas 'smbclient -L localhost -N'

# Via TrueNAS API/UI
# Sharing → Unix Shares (NFS)
# Sharing → Windows Shares (SMB)
```

**Expected shares**:
- Media libraries for Plex (on Saltbox VM)
- Document storage
- VM backups?
- ISO storage?

### EMC Storage Enclosure

**Model**: EMC KTN-STL4 (or similar)
**Connection**: SAS via LSI SAS2308 HBA (passthrough to TrueNAS VM)
**Drives**: ❓ Unknown count and capacity

**See [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md)** for:
- SES commands
- Fan control
- LCC (Link Control Card) troubleshooting
- Maintenance procedures

**Check enclosure status**:
```bash
ssh truenas 'sg_ses --page=0x02 /dev/sgX'    # Enclosure status page
ssh truenas 'smartctl --scan'                # List all drives
```

---

## Storage Network Architecture

### Internal Storage Network (10.10.20.0/24)

**Purpose**: Dedicated network for NFS/iSCSI traffic to reduce congestion on the main network.

**Bridge**: vmbr3 on PVE (virtual bridge, no physical NIC)
**Subnet**: 10.10.20.0/24
**DHCP**: No
**Gateway**: No (internal only, no internet)

**Connected VMs**:
- TrueNAS VM (secondary NIC)
- Saltbox VM (secondary NIC) - for NFS mounts
- Other VMs needing storage access

**Configuration**:
```bash
# On TrueNAS VM - check second NIC
ssh truenas 'ip addr show enp6s19'

# On Saltbox - check NFS mounts
ssh saltbox 'mount | grep nfs'
```

**Benefits**:
- Separates storage traffic from general network
- Prevents NFS/SMB from saturating main network
- Better performance for storage-heavy workloads

---

## Storage Capacity Planning

### Current Usage (Estimate)

**Needs actual audit**:
```bash
# PVE pools
ssh pve 'zpool list -o name,size,alloc,free'

# PVE2 pools
ssh pve2 'zpool list -o name,size,alloc,free'

# TrueNAS vault pool
ssh truenas 'zpool list vault'

# Get detailed breakdown
ssh truenas 'zfs list -r vault -o name,used,avail'
```

### Growth Rate

**Needs tracking** - recommend monthly snapshots of capacity:

```bash
#!/bin/bash
# Save as ~/bin/storage-capacity-report.sh

DATE=$(date +%Y-%m-%d)
REPORT=~/Backups/storage-reports/capacity-$DATE.txt

mkdir -p ~/Backups/storage-reports

echo "Storage Capacity Report - $DATE" > $REPORT
echo "================================" >> $REPORT
echo "" >> $REPORT

echo "PVE Pools:" >> $REPORT
ssh pve 'zpool list' >> $REPORT
echo "" >> $REPORT

echo "PVE2 Pools:" >> $REPORT
ssh pve2 'zpool list' >> $REPORT
echo "" >> $REPORT

echo "TrueNAS Pools:" >> $REPORT
ssh truenas 'zpool list' >> $REPORT
echo "" >> $REPORT

echo "TrueNAS Datasets:" >> $REPORT
ssh truenas 'zfs list -r vault -o name,used,avail' >> $REPORT

echo "Report saved to $REPORT"
```

**Run monthly via cron**:
```cron
0 9 1 * * ~/bin/storage-capacity-report.sh
```

### Expansion Planning

**When to expand**:
- Pool reaches 80% capacity
- Performance degrades
- New workloads require more space

**Expansion options**:
1. Add drives to existing pools (if mirrors, add a mirror vdev - sketch after this list)
2. Add new NVMe drives to PVE/PVE2
3. Expand EMC enclosure (add more drives)
4. Add second EMC enclosure
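A minimal sketch of option 1, assuming a mirrored pool and two new disks (pool and device paths are placeholders):

```bash
# Grow a mirrored pool by adding another two-disk mirror vdev
zpool add vault mirror /dev/disk/by-id/NEW-DISK-1 /dev/disk/by-id/NEW-DISK-2
```
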

**Cost estimates**: TBD

---

## ZFS Health Monitoring

### Daily Health Checks

```bash
# Check for errors on all pools
ssh pve 'zpool status -x'   # Shows only unhealthy pools
ssh pve2 'zpool status -x'
ssh truenas 'zpool status -x'

# Check scrub status
ssh pve 'zpool status | grep scrub'
ssh pve2 'zpool status | grep scrub'
ssh truenas 'zpool status | grep scrub'
```

### Scrub Schedule

**Recommended**: Monthly scrub on all pools

**Configure scrub**:
```bash
# Via Proxmox UI: Node → Disks → ZFS → Select pool → Scrub
# Or via cron:
0 2 1 * * /sbin/zpool scrub nvme-mirror1
0 2 1 * * /sbin/zpool scrub rpool
```

**On TrueNAS**:
- Configure via UI: Storage → Pools → Scrub Tasks
- Recommended: 1st of every month at 2 AM

### SMART Monitoring

**Check drive health**:
```bash
# PVE
ssh pve 'smartctl -a /dev/nvme0'
ssh pve 'smartctl -a /dev/sda'

# TrueNAS
ssh truenas 'smartctl --scan'
ssh truenas 'smartctl -a /dev/sdX'   # For each drive
```

**Configure SMART tests**:
- TrueNAS UI: Tasks → S.M.A.R.T. Tests
- Recommended: Weekly short test, monthly long test

### Alerts

**Set up email alerts for** (see the ZED sketch after this list):
- ZFS pool errors
- SMART test failures
- Pool capacity > 80%
- Scrub failures
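On PVE/PVE2 the stock mechanism for these mails is the ZFS Event Daemon; a minimal sketch of `/etc/zfs/zed.d/zed.rc` (the address is a placeholder; TrueNAS alerting is configured in its own UI instead):

```bash
# /etc/zfs/zed.d/zed.rc - ZED mails on pool errors, scrub results, etc.
ZED_EMAIL_ADDR="admin@example.com"   # placeholder destination address
ZED_NOTIFY_VERBOSE=1                 # also mail when scrubs finish healthy

# Apply with: systemctl restart zfs-zed
```
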
---

## Storage Performance Tuning

### ZFS ARC (Cache)

**Check ARC usage**:
```bash
ssh pve 'arc_summary'
ssh truenas 'arc_summary'
```

**Tuning** (if needed):
- PVE/PVE2: Set max ARC in `/etc/modprobe.d/zfs.conf` (sketch below)
- TrueNAS: Configure via UI (System → Advanced → Tunables)
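A minimal sketch for the PVE side, assuming a 16 GiB cap is wanted (the value is illustrative):

```bash
# Cap ARC at 16 GiB (value in bytes); takes effect after initramfs rebuild + reboot
echo "options zfs zfs_arc_max=17179869184" > /etc/modprobe.d/zfs.conf
update-initramfs -u
```
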

### NFS Performance

**Mount options** (on clients like Saltbox):
```
rsize=131072,wsize=131072,hard,timeo=600,retrans=2,vers=3
```

**Verify NFS mounts**:
```bash
ssh saltbox 'mount | grep nfs'
```

### Record Size Optimization

**Different workloads need different record sizes**:
- VMs: 64K (good for VM disks; note the ZFS dataset default is 128K)
- Databases: 8K or 16K
- Media files: 1M (large sequential reads)

**Set record size** (on TrueNAS datasets):
```bash
ssh truenas 'zfs set recordsize=1M vault/movies'
```

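To confirm what a dataset is currently using:

```bash
ssh truenas 'zfs get recordsize vault/movies'
```
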
---

## Disaster Recovery

### Pool Recovery

**If a pool fails to import**:
```bash
# Try importing with different name
zpool import -f -N poolname newpoolname

# Check pool with readonly
zpool import -f -o readonly=on poolname

# Force import (last resort)
zpool import -f -F poolname
```

### Drive Replacement

**When a drive fails**:
```bash
# Identify failed drive
zpool status poolname

# Replace drive
zpool replace poolname old-device new-device

# Monitor resilver
watch zpool status poolname
```

### Data Recovery

**If pool is completely lost**:
1. Restore from offsite backup (see [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md))
2. Recreate pool structure
3. Restore data

**Critical**: This is why we need offsite backups!

---

## Quick Reference

### Common Commands

```bash
# Pool status
zpool status [poolname]
zpool list

# Dataset usage
zfs list
zfs list -r vault

# Check pool health (only unhealthy)
zpool status -x

# Scrub pool
zpool scrub poolname

# Get pool IO stats
zpool iostat -v 1

# Snapshot management
zfs snapshot poolname/dataset@snapname
zfs list -t snapshot
zfs rollback poolname/dataset@snapname
zfs destroy poolname/dataset@snapname
```

### Storage Locations by Use Case

| Use Case | Recommended Storage | Why |
|----------|---------------------|-----|
| VM OS disk | nvme-mirror1 (PVE) | Fastest IO |
| Database | nvme-mirror1/2 | Low latency |
| Media files | TrueNAS vault | Large capacity |
| Development | nvme-mirror2 | Fast, mid-tier |
| Containers | rpool | Good performance |
| Backups | TrueNAS or rpool | Large capacity |
| Archive | local-zfs2 (PVE2) | Cheap, can spin down |

---

## Investigation Needed

- [ ] Get complete TrueNAS dataset list
- [ ] Document NFS/SMB share configuration
- [ ] Inventory EMC enclosure drives (count, capacity, model)
- [ ] Document current pool usage percentages
- [ ] Set up monthly capacity reports
- [ ] Configure ZFS scrub schedules
- [ ] Set up storage health alerts

---

## Related Documentation

- [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md) - Backup and snapshot strategy
- [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md) - Storage enclosure maintenance
- [VMS.md](VMS.md) - VM storage assignments
- [NETWORK.md](NETWORK.md) - Storage network configuration

---

**Last Updated**: 2025-12-22
673
TRAEFIK.md
Normal file
@@ -0,0 +1,673 @@
# Traefik Reverse Proxy

Documentation for Traefik reverse proxy setup, SSL certificates, and deploying new public services.

## Overview

There are **TWO separate Traefik instances** handling different services. Understanding which one to use is critical.

| Instance | Location | IP | Purpose | Managed By |
|----------|----------|-----|---------|------------|
| **Traefik-Primary** | CT 202 | **10.10.10.250** | General services | Manual config files |
| **Traefik-Saltbox** | VM 101 (Docker) | **10.10.10.100** | Saltbox services only | Saltbox Ansible |

---

## ⚠️ CRITICAL RULE: Which Traefik to Use

### When Adding ANY New Service:

✅ **USE Traefik-Primary (CT 202 @ 10.10.10.250)** - For ALL new services
❌ **DO NOT touch Traefik-Saltbox** - Unless you're modifying Saltbox itself

### Why This Matters:

- **Traefik-Saltbox** has complex Saltbox-managed configs (Ansible-generated)
- Messing with it breaks Plex, Sonarr, Radarr, and all media services
- Each Traefik has its own Let's Encrypt certificates
- Mixing them causes certificate conflicts and routing issues

---

## Traefik-Primary (CT 202) - For New Services

### Configuration

**Location**: Container 202 on PVE (10.10.10.250)
**Config Directory**: `/etc/traefik/`
**Main Config**: `/etc/traefik/traefik.yaml`
**Dynamic Configs**: `/etc/traefik/conf.d/*.yaml`

### Access Traefik Config

```bash
# From Mac Mini:
ssh pve 'pct exec 202 -- cat /etc/traefik/traefik.yaml'
ssh pve 'pct exec 202 -- ls /etc/traefik/conf.d/'

# Edit a service config:
ssh pve 'pct exec 202 -- vi /etc/traefik/conf.d/myservice.yaml'

# View logs:
ssh pve 'pct exec 202 -- tail -f /var/log/traefik/traefik.log'
```

### Services Using Traefik-Primary

| Service | Domain | Backend |
|---------|--------|---------|
| Excalidraw | excalidraw.htsn.io | 10.10.10.206:8080 (docker-host) |
| FindShyt | findshyt.htsn.io | 10.10.10.8 (CT 205) |
| Gitea | git.htsn.io | 10.10.10.220:3000 |
| Home Assistant | homeassistant.htsn.io | 10.10.10.110 |
| LM Dev | lmdev.htsn.io | 10.10.10.111 |
| MetaMCP | metamcp.htsn.io | 10.10.10.207:12008 (docker-host2) |
| Pi-hole | pihole.htsn.io | 10.10.10.10 |
| TrueNAS | truenas.htsn.io | 10.10.10.200 |
| Proxmox | pve.htsn.io | 10.10.10.120 |
| Copyparty | copyparty.htsn.io | 10.10.10.201 |
| AI Trade | aitrade.htsn.io | (trading server) |
| Pulse | pulse.htsn.io | 10.10.10.206:7655 (monitoring) |
| Happy | happy.htsn.io | 10.10.10.206:3002 (Happy Coder relay) |

---

## Traefik-Saltbox (VM 101) - DO NOT MODIFY

### Configuration

**Location**: `/opt/traefik/` inside Saltbox VM
**Managed By**: Saltbox Ansible playbooks (automatic)
**Docker Mount**: `/opt/traefik` → `/etc/traefik` in container

### Services Using Traefik-Saltbox

- Plex (plex.htsn.io)
- Sonarr, Radarr, Lidarr
- SABnzbd, NZBGet, qBittorrent
- Overseerr, Tautulli, Organizr
- Jackett, NZBHydra2
- Authelia (SSO authentication)
- All other Saltbox-managed containers

### View Saltbox Traefik (Read-Only)

```bash
# View config (don't edit!)
ssh pve 'qm guest exec 101 -- bash -c "docker exec traefik cat /etc/traefik/traefik.yml"'

# View logs
ssh saltbox 'docker logs -f traefik'
```

**⚠️ WARNING**: Editing Saltbox Traefik configs manually will be overwritten by Ansible and may break media services.

---

## Adding a New Public Service - Complete Workflow

Follow these steps to deploy a new service and make it accessible at `servicename.htsn.io`.

### Step 0: Deploy Your Service

First, deploy your service on the appropriate host.

#### Option A: Docker on docker-host (10.10.10.206)

```bash
ssh hutson@10.10.10.206
sudo mkdir -p /opt/myservice
cat > /opt/myservice/docker-compose.yml << 'EOF'
version: "3.8"
services:
  myservice:
    image: myimage:latest
    ports:
      - "8080:80"
    restart: unless-stopped
EOF
cd /opt/myservice && sudo docker-compose up -d
```

#### Option B: New LXC Container on PVE

```bash
ssh pve 'pct create CTID local:vztmpl/ubuntu-22.04-standard_22.04-1_amd64.tar.zst \
  --hostname myservice --memory 2048 --cores 2 \
  --net0 name=eth0,bridge=vmbr0,ip=10.10.10.XXX/24,gw=10.10.10.1 \
  --rootfs local-zfs:8 --unprivileged 1 --start 1'
```

#### Option C: New VM on PVE

```bash
ssh pve 'qm create VMID --name myservice --memory 2048 --cores 2 \
  --net0 virtio,bridge=vmbr0 --scsihw virtio-scsi-pci'
```

### Step 1: Create Traefik Config File

Use this template for new services on **Traefik-Primary (CT 202)**:

#### Basic Template

```yaml
# /etc/traefik/conf.d/myservice.yaml
http:
  routers:
    # HTTPS router
    myservice-secure:
      entryPoints:
        - websecure
      rule: "Host(`myservice.htsn.io`)"
      service: myservice
      tls:
        certResolver: cloudflare  # Use 'cloudflare' for proxied domains, 'letsencrypt' for DNS-only
      priority: 50

    # HTTP → HTTPS redirect
    myservice-redirect:
      entryPoints:
        - web
      rule: "Host(`myservice.htsn.io`)"
      middlewares:
        - myservice-https-redirect
      service: myservice
      priority: 50

  services:
    myservice:
      loadBalancer:
        servers:
          - url: "http://10.10.10.XXX:PORT"

  middlewares:
    myservice-https-redirect:
      redirectScheme:
        scheme: https
        permanent: true
```

#### Deploy the Config

```bash
# Create file on CT 202
ssh pve 'pct exec 202 -- bash -c "cat > /etc/traefik/conf.d/myservice.yaml << '\''EOF'\''
<paste config here>
EOF"'

# Traefik auto-reloads (watches conf.d directory)
# Check logs:
ssh pve 'pct exec 202 -- tail -f /var/log/traefik/traefik.log'
```

### Step 2: Add Cloudflare DNS Entry

#### Cloudflare Credentials

| Field | Value |
|-------|-------|
| Email | cloudflare@htsn.io |
| API Key | 849ebefd163d2ccdec25e49b3e1b3fe2cdadc |
| Zone ID (htsn.io) | c0f5a80448c608af35d39aa820a5f3af |
| Public IP | 70.237.94.174 |

#### Method 1: Manual (Cloudflare Dashboard)

1. Go to https://dash.cloudflare.com/
2. Select `htsn.io` domain
3. DNS → Add Record
4. Type: `A`, Name: `myservice`, IPv4: `70.237.94.174`, Proxied: ☑️

#### Method 2: Automated (CLI)

Save this as `~/bin/add-cloudflare-dns.sh`:

```bash
#!/bin/bash
# Add DNS record to Cloudflare for htsn.io

SUBDOMAIN="$1"
CF_EMAIL="cloudflare@htsn.io"
CF_API_KEY="849ebefd163d2ccdec25e49b3e1b3fe2cdadc"
ZONE_ID="c0f5a80448c608af35d39aa820a5f3af"
PUBLIC_IP="70.237.94.174"

if [ -z "$SUBDOMAIN" ]; then
  echo "Usage: $0 <subdomain>"
  echo "Example: $0 myservice   # Creates myservice.htsn.io"
  exit 1
fi

curl -X POST "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records" \
  -H "X-Auth-Email: $CF_EMAIL" \
  -H "X-Auth-Key: $CF_API_KEY" \
  -H "Content-Type: application/json" \
  --data "{
    \"type\":\"A\",
    \"name\":\"$SUBDOMAIN\",
    \"content\":\"$PUBLIC_IP\",
    \"ttl\":1,
    \"proxied\":true
  }" | jq .
```

**Usage**:
```bash
chmod +x ~/bin/add-cloudflare-dns.sh
~/bin/add-cloudflare-dns.sh myservice   # Creates myservice.htsn.io
```

### Step 3: Testing

```bash
# Check if DNS resolves
dig myservice.htsn.io

# Should return: 70.237.94.174 (or Cloudflare IPs if proxied)

# Test HTTP redirect
curl -I http://myservice.htsn.io

# Expected: 301 redirect to https://

# Test HTTPS
curl -I https://myservice.htsn.io

# Expected: 200 OK

# Check Traefik dashboard (if enabled)
# http://10.10.10.250:8080/dashboard/
```

### Step 4: Update Documentation

After deploying, update:

1. **IP-ASSIGNMENTS.md** - Add to Services & Reverse Proxy Mapping table
2. **This file (TRAEFIK.md)** - Add to "Services Using Traefik-Primary" list
3. **CLAUDE.md** - Update quick reference if needed

---

## SSL Certificates

Traefik has **two certificate resolvers** configured:

| Resolver | Use When | Challenge Type | Notes |
|----------|----------|----------------|-------|
| `letsencrypt` | Cloudflare DNS-only (gray cloud ☁️) | HTTP-01 | Requires port 80 reachable |
| `cloudflare` | Cloudflare Proxied (orange cloud 🟠) | DNS-01 | Works with Cloudflare proxy |

### ⚠️ Important: HTTP Challenge vs DNS Challenge

**If Cloudflare proxy is enabled** (orange cloud), HTTP challenge **FAILS** because Cloudflare redirects HTTP→HTTPS before the challenge reaches your server.

**Solution**: Use `cloudflare` resolver (DNS-01 challenge) instead.

### Certificate Resolver Configuration

**Cloudflare API credentials** are configured in `/etc/systemd/system/traefik.service`:

```ini
Environment="CF_API_EMAIL=cloudflare@htsn.io"
Environment="CF_API_KEY=849ebefd163d2ccdec25e49b3e1b3fe2cdadc"
```

### Certificate Storage

| Resolver | Storage File |
|----------|--------------|
| HTTP challenge (`letsencrypt`) | `/etc/traefik/acme.json` |
| DNS challenge (`cloudflare`) | `/etc/traefik/acme-cf.json` |

**Permissions**: Must be `600` (read/write owner only)

```bash
# Check permissions
ssh pve 'pct exec 202 -- ls -la /etc/traefik/acme*.json'

# Fix if needed
ssh pve 'pct exec 202 -- chmod 600 /etc/traefik/acme.json'
ssh pve 'pct exec 202 -- chmod 600 /etc/traefik/acme-cf.json'
```
|
||||||
|
### Certificate Renewal
|
||||||
|
|
||||||
|
- **Automatic** via Traefik
|
||||||
|
- Checks every 24 hours
|
||||||
|
- Renews 30 days before expiry
|
||||||
|
- No manual intervention needed

### Troubleshooting Certificates

#### Certificate Fails to Issue

```bash
# Check Traefik logs
ssh pve 'pct exec 202 -- tail -f /var/log/traefik/traefik.log | grep -i error'

# Verify Cloudflare API access (the global API key authenticates against /user;
# /user/tokens/verify only works with Bearer API tokens)
curl -X GET "https://api.cloudflare.com/client/v4/user" \
    -H "X-Auth-Email: cloudflare@htsn.io" \
    -H "X-Auth-Key: 849ebefd163d2ccdec25e49b3e1b3fe2cdadc"

# Check acme.json permissions
ssh pve 'pct exec 202 -- ls -la /etc/traefik/acme*.json'
```

#### Force Certificate Renewal

```bash
# Delete certificate (Traefik will re-request)
ssh pve 'pct exec 202 -- rm /etc/traefik/acme-cf.json'
ssh pve 'pct exec 202 -- touch /etc/traefik/acme-cf.json'
ssh pve 'pct exec 202 -- chmod 600 /etc/traefik/acme-cf.json'
ssh pve 'pct exec 202 -- systemctl restart traefik'

# Watch logs
ssh pve 'pct exec 202 -- tail -f /var/log/traefik/traefik.log'
```

---

## Quick Deployment - One-Liner

For fast deployment, use this all-in-one command:

```bash
# === DEPLOY SERVICE (example: myservice on docker-host port 8080) ===

# 1. Create Traefik config
ssh pve 'pct exec 202 -- bash -c "cat > /etc/traefik/conf.d/myservice.yaml << EOF
http:
  routers:
    myservice-secure:
      entryPoints: [websecure]
      rule: Host(\\\`myservice.htsn.io\\\`)
      service: myservice
      tls: {certResolver: cloudflare}
  services:
    myservice:
      loadBalancer:
        servers:
          - url: http://10.10.10.206:8080
EOF"'

# 2. Add Cloudflare DNS
curl -s -X POST "https://api.cloudflare.com/client/v4/zones/c0f5a80448c608af35d39aa820a5f3af/dns_records" \
    -H "X-Auth-Email: cloudflare@htsn.io" \
    -H "X-Auth-Key: 849ebefd163d2ccdec25e49b3e1b3fe2cdadc" \
    -H "Content-Type: application/json" \
    --data '{"type":"A","name":"myservice","content":"70.237.94.174","proxied":true}'

# 3. Test (wait a few seconds for DNS propagation)
curl -I https://myservice.htsn.io
```

---

## Docker Service with Traefik Labels (Alternative)

If deploying a service via Docker on `docker-host` (VM 206), you can use Traefik labels instead of config files.

**Requirements**:
- Traefik must have access to the Docker socket
- Service must be on the same Docker network as Traefik

**Example docker-compose.yml**:

```yaml
version: "3.8"

services:
  myservice:
    image: myimage:latest
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.myservice.rule=Host(`myservice.htsn.io`)"
      - "traefik.http.routers.myservice.entrypoints=websecure"
      - "traefik.http.routers.myservice.tls.certresolver=letsencrypt"
      - "traefik.http.services.myservice.loadbalancer.server.port=8080"
    networks:
      - traefik

networks:
  traefik:
    external: true
```
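
If this label-based approach is ever adopted, bringing the stack up is the standard Compose flow (a sketch; it assumes the shared `traefik` network name matches whatever network the Traefik container itself is attached to):

```bash
# One-time: create the shared network the compose file declares as external
docker network create traefik

# Start the service; Traefik discovers the labels automatically
docker compose up -d
```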

**Note**: This method is NOT currently used on Traefik-Primary (CT 202), as it doesn't have Docker socket access. Config files are preferred.

---

## Cloudflare API Reference

### API Credentials

| Field | Value |
|-------|-------|
| Email | cloudflare@htsn.io |
| API Key | 849ebefd163d2ccdec25e49b3e1b3fe2cdadc |
| Zone ID | c0f5a80448c608af35d39aa820a5f3af |

### Common API Operations

Set credentials:
```bash
CF_EMAIL="cloudflare@htsn.io"
CF_API_KEY="849ebefd163d2ccdec25e49b3e1b3fe2cdadc"
ZONE_ID="c0f5a80448c608af35d39aa820a5f3af"
```

**List all DNS records**:
```bash
curl -X GET "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records" \
    -H "X-Auth-Email: $CF_EMAIL" \
    -H "X-Auth-Key: $CF_API_KEY" | jq
```
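
The delete and update calls below need a `$RECORD_ID`; one way to grab it from the listing (a sketch, assuming `jq` is installed; the `?name=` filter takes the full FQDN):

```bash
# Look up the record ID for a given hostname (hypothetical helper, not in the repo)
RECORD_ID=$(curl -s -X GET "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records?name=myservice.htsn.io" \
    -H "X-Auth-Email: $CF_EMAIL" \
    -H "X-Auth-Key: $CF_API_KEY" | jq -r '.result[0].id')
echo "$RECORD_ID"
```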

**Add A record**:
```bash
curl -X POST "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records" \
    -H "X-Auth-Email: $CF_EMAIL" \
    -H "X-Auth-Key: $CF_API_KEY" \
    -H "Content-Type: application/json" \
    --data '{
        "type":"A",
        "name":"subdomain",
        "content":"70.237.94.174",
        "proxied":true
    }'
```

**Delete record**:
```bash
curl -X DELETE "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records/$RECORD_ID" \
    -H "X-Auth-Email: $CF_EMAIL" \
    -H "X-Auth-Key: $CF_API_KEY"
```

**Update record** (toggle proxy):
```bash
curl -X PATCH "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records/$RECORD_ID" \
    -H "X-Auth-Email: $CF_EMAIL" \
    -H "X-Auth-Key: $CF_API_KEY" \
    -H "Content-Type: application/json" \
    --data '{"proxied":false}'
```

---

## Troubleshooting

### Service Not Accessible

```bash
# 1. Check if DNS resolves
dig myservice.htsn.io

# 2. Check if backend is reachable
curl -I http://10.10.10.XXX:PORT

# 3. Check Traefik logs
ssh pve 'pct exec 202 -- tail -f /var/log/traefik/traefik.log'

# 4. Check Traefik config is valid
ssh pve 'pct exec 202 -- cat /etc/traefik/conf.d/myservice.yaml'

# 5. Restart Traefik (if needed)
ssh pve 'pct exec 202 -- systemctl restart traefik'
```

### Certificate Issues

```bash
# Check certificate status in acme.json
ssh pve 'pct exec 202 -- cat /etc/traefik/acme-cf.json | jq'

# Check certificate expiry
echo | openssl s_client -servername myservice.htsn.io -connect myservice.htsn.io:443 2>/dev/null | openssl x509 -noout -dates
```

### 502 Bad Gateway

**Cause**: Backend service is down or unreachable

```bash
# Check if backend is running
ssh backend-host 'systemctl status myservice'

# Check if port is open
nc -zv 10.10.10.XXX PORT

# Check firewall
ssh backend-host 'iptables -L -n | grep PORT'
```

### 404 Not Found

**Cause**: Traefik can't match the request to a router

```bash
# Check router rule matches domain
ssh pve 'pct exec 202 -- cat /etc/traefik/conf.d/myservice.yaml | grep rule'

# Should be: rule: "Host(`myservice.htsn.io`)"

# Check DNS is pointing to correct IP
dig myservice.htsn.io

# Restart Traefik to reload config
ssh pve 'pct exec 202 -- systemctl restart traefik'
```

---

## Advanced Configuration Examples

### WebSocket Support

For services that use WebSockets (like Home Assistant):

```yaml
http:
  routers:
    myservice-secure:
      entryPoints:
        - websecure
      rule: "Host(`myservice.htsn.io`)"
      service: myservice
      tls:
        certResolver: cloudflare

  services:
    myservice:
      loadBalancer:
        servers:
          - url: "http://10.10.10.XXX:PORT"
# No special config needed - WebSockets work by default in Traefik v2+
```

### Custom Headers

Add custom headers (e.g., security headers):

```yaml
http:
  routers:
    myservice-secure:
      middlewares:
        - myservice-headers

  middlewares:
    myservice-headers:
      headers:
        customResponseHeaders:
          X-Frame-Options: "DENY"
          X-Content-Type-Options: "nosniff"
          Referrer-Policy: "strict-origin-when-cross-origin"
```

### Basic Authentication

Protect a service with basic auth:

```yaml
http:
  routers:
    myservice-secure:
      middlewares:
        - myservice-auth

  middlewares:
    myservice-auth:
      basicAuth:
        users:
          - "user:$apr1$..."  # Generate with: htpasswd -nb user password
```
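
Generating the hash on a workstation (assumes `htpasswd` from apache2-utils is available):

```bash
# Produce a user:hash pair for the basicAuth users list
htpasswd -nb user 'S3cretPassw0rd'
# Output looks like: user:$apr1$...  - paste the whole line into the YAML above
```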

---

## Maintenance

### Monthly Checks

```bash
# Check Traefik status
ssh pve 'pct exec 202 -- systemctl status traefik'

# Review logs for errors
ssh pve 'pct exec 202 -- grep -i error /var/log/traefik/traefik.log | tail -20'

# List domains with issued certificates (check expiry with the openssl
# one-liner under "Certificate Issues" above; acme.json stores the cert
# blob, not a readable expiry date)
ssh pve 'pct exec 202 -- cat /etc/traefik/acme-cf.json | jq -r ".cloudflare.Certificates[].domain.main"'

# Verify all services responding
for domain in plex.htsn.io git.htsn.io truenas.htsn.io; do
    echo "Testing $domain..."
    curl -sI https://$domain | head -1
done
```

### Backup Traefik Config

```bash
# Backup all configs
ssh pve 'pct exec 202 -- tar czf /tmp/traefik-backup-$(date +%Y%m%d).tar.gz /etc/traefik'

# Pull the archive out of the container to the PVE host, then to a safe location
ssh pve "pct pull 202 /tmp/traefik-backup-$(date +%Y%m%d).tar.gz /tmp/traefik-backup-$(date +%Y%m%d).tar.gz"
scp "pve:/tmp/traefik-backup-*.tar.gz" ~/Backups/traefik/
```
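
To take the archive step on a schedule rather than by hand, a cron entry along these lines could work (a sketch; the monthly cadence is an assumption, and `%` must be escaped in crontab syntax; the copy-out still follows the `pct pull`/`scp` steps above):

```bash
# In `crontab -e` on PVE: archive Traefik config at 03:00 on the 1st of each month
0 3 1 * * pct exec 202 -- tar czf /tmp/traefik-backup-$(date +\%Y\%m\%d).tar.gz /etc/traefik
```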

---

## Related Documentation

- [IP-ASSIGNMENTS.md](IP-ASSIGNMENTS.md) - Service IP addresses
- [CLOUDFLARE.md](#) - Cloudflare DNS management (coming soon)
- [SERVICES.md](#) - Complete service inventory (coming soon)

---

**Last Updated**: 2025-12-22
605
UPS.md
Normal file
@@ -0,0 +1,605 @@
# UPS and Power Management

Documentation for UPS (Uninterruptible Power Supply) configuration, NUT (Network UPS Tools) monitoring, and power failure procedures.

## Hardware

### Current UPS

| Specification | Value |
|---------------|-------|
| **Model** | CyberPower OR2200PFCRT2U |
| **Capacity** | 2200VA / 1320W |
| **Form Factor** | 2U rackmount |
| **Output** | PFC Sinewave (compatible with active PFC PSUs) |
| **Outlets** | 2x NEMA 5-20R + 6x NEMA 5-15R (all battery + surge) |
| **Input Plug** | ⚠️ Originally NEMA 5-20P (20A), **rewired to 5-15P (15A)** |
| **Runtime** | ~15-20 min at typical load (~33% / 440W) |
| **Installed** | 2025-12-21 |
| **Status** | Active |

### ⚠️ Temporary Wiring Modification

**Issue**: UPS came with a NEMA 5-20P plug (20A) but the server rack is on a 15A circuit
**Solution**: Temporarily rewired the plug from 5-20P → 5-15P for compatibility
**Risk**: The UPS is designed for a 20A branch circuit; on the 15A circuit, total draw must stay below 1800W (15A × 120V)
**Current draw**: ~1000-1350W total (safe margin)
**Backlog**: Upgrade to a 20A circuit, restore the original 5-20P plug

### Previous UPS

| Model | Capacity | Issue | Replaced |
|-------|----------|-------|----------|
| WattBox WB-1100-IPVMB-6 | 1100VA / 660W | Insufficient for dual Threadripper setup | 2025-12-21 |

**Why replaced**: Combined server load of 1000-1350W exceeded the 660W capacity.

---

## Power Draw Estimates

### Typical Load

| Component | Idle | Load | Notes |
|-----------|------|------|-------|
| PVE Server | 250-350W | 500-750W | CPU + TITAN RTX + P2000 + storage |
| PVE2 Server | 200-300W | 450-600W | CPU + RTX A6000 + storage |
| Network gear | ~50W | ~50W | Router, switches |
| **Total** | **500-700W** | **1000-1400W** | Varies by workload |

**UPS Load**: ~33-50% typical, 70-80% under heavy load

### Runtime Calculation

At **440W load** (33%): ~15-20 min runtime (tested 2025-12-21)
At **660W load** (50%): ~10-12 min estimated
At **1000W load** (75%): ~6-8 min estimated

**NUT shutdown trigger**: 120 seconds (2 min) remaining runtime
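
A back-of-the-envelope way to interpolate these estimates from the tested data point (a sketch only; it ignores inverter efficiency curves, which make real runtime sub-linear at higher loads):

```bash
# Tested: 440 W held for ~16 min, so runtime_min ≈ (440 * 16) / load_watts
LOAD=660
echo "$(( (440 * 16) / LOAD )) min"   # → 10 min, in line with the 50% estimate above
```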

---

## NUT (Network UPS Tools) Configuration

### Architecture

```
UPS (USB) ──> PVE (NUT Server/Master) ──> PVE2 (NUT Client/Slave)
                   │
                   └──> Home Assistant (monitoring only)
```

**Master**: PVE (10.10.10.120) - UPS connected via USB, runs the NUT server
**Slave**: PVE2 (10.10.10.102) - Monitors PVE's NUT server, shuts down when triggered

### NUT Server Configuration (PVE)

#### 1. UPS Driver Config: `/etc/nut/ups.conf`

```ini
[cyberpower]
    driver = usbhid-ups
    port = auto
    desc = "CyberPower OR2200PFCRT2U"
    override.battery.charge.low = 20
    override.battery.runtime.low = 120
```

**Key settings**:
- `driver = usbhid-ups`: USB HID UPS driver (generic for CyberPower)
- `port = auto`: Auto-detect USB device
- `override.battery.runtime.low = 120`: Trigger shutdown at 120 seconds (2 min) remaining

#### 2. NUT Server Config: `/etc/nut/upsd.conf`

```ini
LISTEN 127.0.0.1 3493
LISTEN 10.10.10.120 3493
```

**Listens on**:
- Localhost (for local monitoring)
- LAN IP (for PVE2 to connect)

#### 3. User Config: `/etc/nut/upsd.users`

```ini
[admin]
    password = upsadmin123
    actions = SET
    instcmds = ALL

[upsmon]
    password = upsmon123
    upsmon master
```

**Users**:
- `admin`: Full control, can run commands
- `upsmon`: Monitoring only (used by PVE2)

#### 4. Monitor Config: `/etc/nut/upsmon.conf`

```ini
MONITOR cyberpower@localhost 1 upsmon upsmon123 master

MINSUPPLIES 1
SHUTDOWNCMD "/usr/local/bin/ups-shutdown.sh"
NOTIFYCMD /usr/sbin/upssched
POLLFREQ 5
POLLFREQALERT 5
HOSTSYNC 15
DEADTIME 15
POWERDOWNFLAG /etc/killpower

NOTIFYMSG ONLINE "UPS %s on line power"
NOTIFYMSG ONBATT "UPS %s on battery"
NOTIFYMSG LOWBATT "UPS %s battery is low"
NOTIFYMSG FSD "UPS %s: forced shutdown in progress"
NOTIFYMSG COMMOK "Communications with UPS %s established"
NOTIFYMSG COMMBAD "Communications with UPS %s lost"
NOTIFYMSG SHUTDOWN "Auto logout and shutdown proceeding"
NOTIFYMSG REPLBATT "UPS %s battery needs to be replaced"
NOTIFYMSG NOCOMM "UPS %s is unavailable"
NOTIFYMSG NOPARENT "upsmon parent process died - shutdown impossible"

NOTIFYFLAG ONLINE SYSLOG+WALL
NOTIFYFLAG ONBATT SYSLOG+WALL
NOTIFYFLAG LOWBATT SYSLOG+WALL
NOTIFYFLAG FSD SYSLOG+WALL
NOTIFYFLAG COMMOK SYSLOG+WALL
NOTIFYFLAG COMMBAD SYSLOG+WALL
NOTIFYFLAG SHUTDOWN SYSLOG+WALL
NOTIFYFLAG REPLBATT SYSLOG+WALL
NOTIFYFLAG NOCOMM SYSLOG+WALL
NOTIFYFLAG NOPARENT SYSLOG
```

**Key settings**:
- `MONITOR cyberpower@localhost 1 upsmon upsmon123 master`: Monitor local UPS
- `SHUTDOWNCMD "/usr/local/bin/ups-shutdown.sh"`: Custom shutdown script
- `POLLFREQ 5`: Check UPS every 5 seconds
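
`NOTIFYCMD /usr/sbin/upssched` implies an `/etc/nut/upssched.conf` also exists, though it isn't captured in this repo. A minimal sketch of what one typically looks like (the timer values and CMDSCRIPT path here are assumptions, not taken from this setup):

```ini
# /etc/nut/upssched.conf (illustrative; the actual file is not documented here)
CMDSCRIPT /usr/local/bin/upssched-cmd

PIPEFN /run/nut/upssched.pipe
LOCKFN /run/nut/upssched.lock

# Fire "onbatt" 30s after going on battery; cancel if power returns first
AT ONBATT * START-TIMER onbatt 30
AT ONLINE * CANCEL-TIMER onbatt
```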

#### 5. USB Permissions: `/etc/udev/rules.d/99-nut-ups.rules`

```udev
SUBSYSTEM=="usb", ATTR{idVendor}=="0764", ATTR{idProduct}=="0501", MODE="0660", GROUP="nut"
```

**Purpose**: Ensure NUT can access the USB UPS device

**Apply rule**:
```bash
udevadm control --reload-rules
udevadm trigger
```
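
A quick check that the rule took effect (the bus/device numbers below come from the `lsusb` output and will differ per boot):

```bash
# Find the UPS device
ssh pve 'lsusb -d 0764:0501'
# e.g. "Bus 001 Device 003: ..." - that node should now be group "nut"
ssh pve 'ls -l /dev/bus/usb/001/003'
```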

### NUT Client Configuration (PVE2)

#### Monitor Config: `/etc/nut/upsmon.conf`

```ini
MONITOR cyberpower@10.10.10.120 1 upsmon upsmon123 slave

MINSUPPLIES 1
SHUTDOWNCMD "/usr/local/bin/ups-shutdown.sh"
POLLFREQ 5
POLLFREQALERT 5
HOSTSYNC 15
DEADTIME 15
POWERDOWNFLAG /etc/killpower

# Same NOTIFYMSG and NOTIFYFLAG as PVE
```

**Key difference**: `slave` instead of `master` - monitors the remote UPS on PVE
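
After editing, it's worth confirming the slave can actually reach the master before relying on it:

```bash
# From PVE2, query the UPS through PVE's NUT server
ssh pve2 'upsc cyberpower@10.10.10.120 ups.status'
# Expect: OL (online)
```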

---

## Custom Shutdown Script

### `/usr/local/bin/ups-shutdown.sh` (Same on both PVE and PVE2)

```bash
#!/bin/bash
# Graceful VM/CT shutdown when UPS battery low

LOG="/var/log/ups-shutdown.log"

log() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') - $1" | tee -a "$LOG"
}

log "=== UPS Shutdown Triggered ==="
log "Battery low - initiating graceful shutdown of VMs/CTs"

# Get list of running VMs (skip TrueNAS for now)
VMS=$(qm list | awk '$3=="running" && $1!=100 {print $1}')
for VMID in $VMS; do
    log "Stopping VM $VMID..."
    qm shutdown $VMID
done

# Get list of running containers
CTS=$(pct list | awk '$2=="running" {print $1}')
for CTID in $CTS; do
    log "Stopping CT $CTID..."
    pct shutdown $CTID
done

# Wait for VMs/CTs to stop
log "Waiting 60 seconds for VMs/CTs to shut down..."
sleep 60

# Now stop TrueNAS (storage - must be last)
if qm status 100 | grep -q running; then
    log "Stopping TrueNAS (VM 100) last..."
    qm shutdown 100
    sleep 30
fi

log "All VMs/CTs stopped. Host will remain running until UPS dies."
log "=== UPS Shutdown Complete ==="
```

**Make executable**:
```bash
chmod +x /usr/local/bin/ups-shutdown.sh
```

**Script behavior**:
1. Stops all VMs (except TrueNAS)
2. Stops all containers
3. Waits 60 seconds
4. Stops TrueNAS last (storage must be cleanly unmounted)
5. **Does NOT shut down Proxmox hosts** - intentionally left running

**Why not shut down hosts?**
- BIOS configured to "Restore on AC Power Loss"
- When power returns, servers auto-boot and start VMs in order
- Avoids need for manual intervention

---

## Power Failure Behavior

### When Power Fails

1. **UPS switches to battery** (`OB DISCHRG` status)
2. **NUT monitors runtime** - polls every 5 seconds
3. **At 120 seconds (2 min) remaining**:
   - NUT triggers `/usr/local/bin/ups-shutdown.sh` on both servers
   - Script gracefully stops all VMs/CTs
   - TrueNAS stopped last (storage integrity)
4. **Hosts remain running** until UPS battery depletes
5. **UPS battery dies** → Hosts lose power (ungraceful but safe - VMs already stopped)

### When Power Returns

1. **UPS charges battery**, power returns to servers
2. **BIOS "Restore on AC Power Loss"** boots both servers
3. **Proxmox starts** and auto-starts VMs in configured order:

| Order | Wait | VMs/CTs | Reason |
|-------|------|---------|--------|
| 1 | 30s | TrueNAS (VM 100) | Storage must start first |
| 2 | 60s | Saltbox (VM 101) | Depends on TrueNAS NFS |
| 3 | 10s | fs-dev, homeassistant, lmdev1, copyparty, docker-host | General VMs |
| 4 | 5s | pihole, traefik, findshyt | Containers |

PVE2 VMs: order=1, wait=10s

**Total recovery time**: ~7 minutes from power restoration to fully operational (tested 2025-12-21)

---

## UPS Status Codes

| Code | Meaning | Action |
|------|---------|--------|
| `OL` | Online (AC power) | Normal operation |
| `OB` | On Battery | Power outage - monitor runtime |
| `LB` | Low Battery | <2 min remaining - shutdown imminent |
| `CHRG` | Charging | Battery charging after power restored |
| `DISCHRG` | Discharging | On battery, draining |
| `FSD` | Forced Shutdown | NUT triggered shutdown |

---

## Monitoring & Commands

### Check UPS Status

```bash
# Full status
ssh pve 'upsc cyberpower@localhost'

# Key metrics only
ssh pve 'upsc cyberpower@localhost | grep -E "battery.charge:|battery.runtime:|ups.load:|ups.status:"'

# Example output:
# battery.charge: 100
# battery.runtime: 1234   (seconds remaining)
# ups.load: 33            (% load)
# ups.status: OL          (online)
```

### Control UPS Beeper

```bash
# Mute beeper (temporary - until next power event)
ssh pve 'upscmd -u admin -p upsadmin123 cyberpower@localhost beeper.mute'

# Disable beeper (permanent)
ssh pve 'upscmd -u admin -p upsadmin123 cyberpower@localhost beeper.disable'

# Enable beeper
ssh pve 'upscmd -u admin -p upsadmin123 cyberpower@localhost beeper.enable'
```

### Test Shutdown Procedure

**Simulate low battery** (careful - this will shut down VMs!):

```bash
# Set a very high low-battery threshold to trigger shutdown
ssh pve 'upsrw -s battery.runtime.low=300 -u admin -p upsadmin123 cyberpower@localhost'

# Watch it trigger (when runtime drops below 300 seconds)
ssh pve 'tail -f /var/log/ups-shutdown.log'

# Reset to normal
ssh pve 'upsrw -s battery.runtime.low=120 -u admin -p upsadmin123 cyberpower@localhost'
```

**Better test**: Run the shutdown script manually without actually triggering NUT:
```bash
ssh pve '/usr/local/bin/ups-shutdown.sh'
```

---

## Home Assistant Integration

UPS metrics are exposed to Home Assistant via the NUT integration.

### Available Sensors

| Entity ID | Description |
|-----------|-------------|
| `sensor.cyberpower_battery_charge` | Battery % (0-100) |
| `sensor.cyberpower_battery_runtime` | Seconds remaining on battery |
| `sensor.cyberpower_load` | Load % (0-100) |
| `sensor.cyberpower_input_voltage` | Input voltage (V AC) |
| `sensor.cyberpower_output_voltage` | Output voltage (V AC) |
| `sensor.cyberpower_status` | Status text (OL, OB, LB, etc.) |

### Configuration

**Home Assistant**: See [HOMEASSISTANT.md](HOMEASSISTANT.md) for integration setup.

### Example Automations

**Send notification when on battery**:
```yaml
automation:
  - alias: "UPS On Battery Alert"
    trigger:
      - platform: state
        entity_id: sensor.cyberpower_status
        to: "OB"
    action:
      - service: notify.mobile_app
        data:
          message: "⚠️ Power outage! UPS on battery. Runtime: {{ states('sensor.cyberpower_battery_runtime') }}s"
```

**Alert when battery low**:
```yaml
automation:
  - alias: "UPS Low Battery Alert"
    trigger:
      - platform: numeric_state
        entity_id: sensor.cyberpower_battery_runtime
        below: 300
    action:
      - service: notify.mobile_app
        data:
          message: "🚨 UPS battery low! {{ states('sensor.cyberpower_battery_runtime') }}s remaining"
```

---

## Testing Results

### Full Power Failure Test (2025-12-21)

Complete end-to-end test of power failure and recovery:

| Event | Time | Duration | Notes |
|-------|------|----------|-------|
| **Power pulled** | 22:30 | - | UPS on battery, ~15 min runtime at 33% load |
| **Low battery trigger** | 22:40:38 | +10:38 | Runtime < 120s, shutdown script ran |
| **All VMs stopped** | 22:41:36 | +0:58 | Graceful shutdown completed |
| **UPS died** | 22:46:29 | +4:53 | Hosts lost power at 0% battery |
| **Power restored** | ~22:47 | - | Plugged back in |
| **PVE online** | 22:49:11 | +2:11 | BIOS boot, Proxmox started |
| **PVE2 online** | 22:50:47 | +3:47 | BIOS boot, Proxmox started |
| **All VMs running** | 22:53:39 | +6:39 | Auto-started in correct order |
| **Total recovery** | - | **~7 min** | From power return to fully operational |

**Results**:
✅ VMs shut down gracefully
✅ Hosts remained running until UPS died (as intended)
✅ Auto-boot on power restoration worked
✅ VMs started in correct order with appropriate delays
✅ No data corruption or issues

**Runtime calculation**:
- Load: ~33% (440W estimated)
- Total runtime on battery: ~16 minutes (22:30 → 22:46:29)
- Matches manufacturer estimate for 33% load

---

## Proxmox Cluster Quorum Fix

### Problem

With a 2-node cluster, if one node goes down, the other loses quorum and can't manage VMs.

During UPS testing, this would prevent the remaining node from starting VMs after power restoration.

### Solution

Modified `/etc/pve/corosync.conf` to enable 2-node mode:

```
quorum {
  provider: corosync_votequorum
  two_node: 1
}
```

**Effect**:
- Either node can operate independently if the other is down
- No more waiting for quorum when one server is offline
- Both nodes visible in a single Proxmox interface when both are up

**Applied**: 2025-12-21
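
A quick way to confirm the flag is active (standard `pvecm` tooling; the exact flag string can vary slightly by corosync version):

```bash
# Run on either node after restarting corosync
ssh pve 'pvecm status | grep -i -A4 votequorum'
# Look for "Flags: Quorate 2Node" in the output
```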

---

## Maintenance

### Monthly Checks

```bash
# Check UPS status
ssh pve 'upsc cyberpower@localhost'

# Check NUT server running
ssh pve 'systemctl status nut-server'
ssh pve 'systemctl status nut-monitor'

# Check NUT client running (PVE2)
ssh pve2 'systemctl status nut-monitor'

# Verify PVE2 can see UPS
ssh pve2 'upsc cyberpower@10.10.10.120'

# Check logs for errors
ssh pve 'journalctl -u nut-server -n 50'
ssh pve 'journalctl -u nut-monitor -n 50'
```

### Battery Health

**Check battery stats**:
```bash
ssh pve 'upsc cyberpower@localhost | grep battery'

# Key metrics:
# battery.charge: 100     (should be near 100 when on AC)
# battery.runtime: 1200+  (seconds at current load)
# battery.voltage: ~24V   (normal for 24V battery system)
```

**Battery replacement**: When runtime significantly decreases or the UPS reports `REPLBATT`:
```bash
ssh pve 'upsc cyberpower@localhost | grep battery.mfr.date'
```

CyberPower batteries typically last 3-5 years.

### Firmware Updates

Check the CyberPower website for firmware updates:
https://www.cyberpowersystems.com/support/firmware/

---

## Troubleshooting

### UPS Not Detected

```bash
# Check USB connection
ssh pve 'lsusb | grep Cyber'

# Expected:
# Bus 001 Device 003: ID 0764:0501 Cyber Power System, Inc. CP1500 AVR UPS

# Restart NUT driver
ssh pve 'systemctl restart nut-driver'
ssh pve 'systemctl status nut-driver'
```

### PVE2 Can't Connect

```bash
# Verify NUT server listening
ssh pve 'netstat -tuln | grep 3493'

# Should show:
# tcp   0   0 10.10.10.120:3493   0.0.0.0:*   LISTEN

# Test connection from PVE2
ssh pve2 'telnet 10.10.10.120 3493'

# Check firewall (should allow port 3493)
ssh pve 'iptables -L -n | grep 3493'
```

### Shutdown Script Not Running

```bash
# Check script permissions
ssh pve 'ls -la /usr/local/bin/ups-shutdown.sh'

# Should be: -rwxr-xr-x (executable)

# Check logs
ssh pve 'cat /var/log/ups-shutdown.log'

# Test script manually
ssh pve '/usr/local/bin/ups-shutdown.sh'
```

### UPS Status Shows UNKNOWN

```bash
# Driver may not be compatible
ssh pve 'upsc cyberpower@localhost ups.status'

# Try a different driver (in /etc/nut/ups.conf)
# driver = usbhid-ups
# or
# driver = blazer_usb

# Restart after change
ssh pve 'systemctl restart nut-driver nut-server'
```

---

## Future Improvements

- [ ] Add email alerts for UPS events (power fail, low battery)
- [ ] Log runtime statistics to track battery degradation
- [ ] Set up Grafana dashboard for UPS metrics
- [ ] Test battery runtime at different load levels
- [ ] Upgrade to 20A circuit, restore original 5-20P plug
- [ ] Consider adding network management card for out-of-band UPS access

---

## Related Documentation

- [POWER-MANAGEMENT.md](POWER-MANAGEMENT.md) - Overall power optimization
- [VMS.md](VMS.md) - VM startup order configuration
- [HOMEASSISTANT.md](HOMEASSISTANT.md) - UPS sensor integration

---

**Last Updated**: 2025-12-22
580
VMS.md
Normal file
@@ -0,0 +1,580 @@
# VMs and Containers

Complete inventory of all virtual machines and LXC containers across both Proxmox servers.

## Overview

| Server | VMs | LXCs | Total |
|--------|-----|------|-------|
| **PVE** (10.10.10.120) | 7 | 3 | 10 |
| **PVE2** (10.10.10.102) | 3 | 0 | 3 |
| **Total** | **10** | **3** | **13** |

---

## PVE (10.10.10.120) - Primary Server

### Virtual Machines

| VMID | Name | IP | vCPUs | RAM | Storage | Purpose | GPU/Passthrough | QEMU Agent |
|------|------|-----|-------|-----|---------|---------|-----------------|------------|
| **100** | truenas | 10.10.10.200 | 8 | 32GB | nvme-mirror1 | NAS, central file storage | LSI SAS2308 HBA, Samsung NVMe | ✅ Yes |
| **101** | saltbox | 10.10.10.100 | 16 | 16GB | nvme-mirror1 | Media automation (Plex, *arr) | TITAN RTX | ✅ Yes |
| **105** | fs-dev | 10.10.10.5 | 10 | 8GB | rpool | Development environment | - | ✅ Yes |
| **110** | homeassistant | 10.10.10.110 | 2 | 2GB | rpool | Home automation platform | - | ❌ No |
| **111** | lmdev1 | 10.10.10.111 | 8 | 32GB | nvme-mirror1 | AI/LLM development | TITAN RTX | ✅ Yes |
| **201** | copyparty | 10.10.10.201 | 2 | 2GB | rpool | File sharing service | - | ✅ Yes |
| **206** | docker-host | 10.10.10.206 | 2 | 4GB | rpool | Docker services (Excalidraw, Happy, Pulse) | - | ✅ Yes |

### LXC Containers

| CTID | Name | IP | RAM | Storage | Purpose |
|------|------|-----|-----|---------|---------|
| **200** | pihole | 10.10.10.10 | - | rpool | DNS, ad blocking |
| **202** | traefik | 10.10.10.250 | - | rpool | Reverse proxy (primary) |
| **205** | findshyt | 10.10.10.8 | - | rpool | Custom app |

---

## PVE2 (10.10.10.102) - Secondary Server

### Virtual Machines

| VMID | Name | IP | vCPUs | RAM | Storage | Purpose | GPU/Passthrough | QEMU Agent |
|------|------|-----|-------|-----|---------|---------|-----------------|------------|
| **300** | gitea-vm | 10.10.10.220 | 2 | 4GB | nvme-mirror3 | Git server (Gitea) | - | ✅ Yes |
| **301** | trading-vm | 10.10.10.221 | 16 | 32GB | nvme-mirror3 | AI trading platform | RTX A6000 | ✅ Yes |
| **302** | docker-host2 | 10.10.10.207 | 4 | 8GB | nvme-mirror3 | Docker host (n8n, automation) | - | ✅ Yes |

### LXC Containers

None on PVE2.

---

## VM Details

### 100 - TrueNAS (Storage Server)

**Purpose**: Central NAS for all file storage, NFS/SMB shares, and media libraries

**Specs**:
- **OS**: TrueNAS SCALE
- **vCPUs**: 8
- **RAM**: 32 GB
- **Storage**: nvme-mirror1 (OS), EMC storage enclosure (data pool via HBA passthrough)
- **Network**:
  - Primary: 10 Gb (vmbr2)
  - Secondary: Internal storage network (vmbr3 @ 10.10.20.x)

**Hardware Passthrough**:
- LSI SAS2308 HBA (for EMC enclosure drives)
- Samsung NVMe (for ZFS caching)

**ZFS Pools**:
- `vault`: Main storage pool on EMC drives
- Boot pool on passed-through NVMe

**See**: [STORAGE.md](STORAGE.md), [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md)

---

### 101 - Saltbox (Media Automation)

**Purpose**: Media server stack - Plex, Sonarr, Radarr, SABnzbd, Overseerr, etc.

**Specs**:
- **OS**: Ubuntu 22.04
- **vCPUs**: 16
- **RAM**: 16 GB
- **Storage**: nvme-mirror1
- **Network**: 10 Gb (vmbr2)

**GPU Passthrough**:
- NVIDIA TITAN RTX (for Plex hardware transcoding)

**Services**:
- Plex Media Server (plex.htsn.io)
- Sonarr, Radarr, Lidarr (TV/movie/music automation)
- SABnzbd, NZBGet (downloaders)
- Overseerr (request management)
- Tautulli (Plex stats)
- Organizr (dashboard)
- Authelia (SSO authentication)
- Traefik (reverse proxy - separate from CT 202)

**Managed By**: Saltbox Ansible playbooks
**See**: [SALTBOX.md](#) (coming soon)

---

### 105 - fs-dev (Development Environment)

**Purpose**: General development work, testing, prototyping

**Specs**:
- **OS**: Ubuntu 22.04
- **vCPUs**: 10
- **RAM**: 8 GB
- **Storage**: rpool
- **Network**: 1 Gb (vmbr0)

---

### 110 - Home Assistant (Home Automation)

**Purpose**: Smart home automation platform

**Specs**:
- **OS**: Home Assistant OS
- **vCPUs**: 2
- **RAM**: 2 GB
- **Storage**: rpool
- **Network**: 1 Gb (vmbr0)

**Access**:
- Web UI: https://homeassistant.htsn.io
- API: See [HOMEASSISTANT.md](HOMEASSISTANT.md)

**Special Notes**:
- ❌ No QEMU agent (Home Assistant OS doesn't support it)
- No SSH server by default (access via web terminal)

---

### 111 - lmdev1 (AI/LLM Development)

**Purpose**: AI model development, fine-tuning, inference

**Specs**:
- **OS**: Ubuntu 22.04
- **vCPUs**: 8
- **RAM**: 32 GB
- **Storage**: nvme-mirror1
- **Network**: 1 Gb (vmbr0)

**GPU Passthrough**:
- NVIDIA TITAN RTX (shared with Saltbox, but can be dedicated if needed)

**Installed**:
- CUDA toolkit
- Python 3.11+
- PyTorch, TensorFlow
- Hugging Face transformers

---

### 201 - Copyparty (File Sharing)

**Purpose**: Simple HTTP file sharing server

**Specs**:
- **OS**: Ubuntu 22.04
- **vCPUs**: 2
- **RAM**: 2 GB
- **Storage**: rpool
- **Network**: 1 Gb (vmbr0)

**Access**: https://copyparty.htsn.io

---

### 206 - docker-host (Docker Services)

**Purpose**: General-purpose Docker host for miscellaneous services

**Specs**:
- **OS**: Ubuntu 22.04
- **vCPUs**: 2
- **RAM**: 4 GB
- **Storage**: rpool
- **Network**: 1 Gb (vmbr0)
- **CPU**: `host` passthrough (for x86-64-v3 support)

**Services Running**:
- Excalidraw (excalidraw.htsn.io) - Whiteboard
- Happy Coder relay server (happy.htsn.io) - Self-hosted relay for Happy Coder mobile app
- Pulse (pulse.htsn.io) - Monitoring dashboard

**Docker Compose Files**: `/opt/*/docker-compose.yml`

---

### 300 - gitea-vm (Git Server)

**Purpose**: Self-hosted Git server

**Specs**:
- **OS**: Ubuntu 22.04
- **vCPUs**: 2
- **RAM**: 4 GB
- **Storage**: nvme-mirror3 (PVE2)
- **Network**: 1 Gb (vmbr0)

**Access**: https://git.htsn.io

**Repositories**:
- homelab-docs (this documentation)
- Personal projects
- Private repos

---

### 301 - trading-vm (AI Trading Platform)

**Purpose**: Algorithmic trading system with AI models

**Specs**:
- **OS**: Ubuntu 22.04
- **vCPUs**: 16
- **RAM**: 32 GB
- **Storage**: nvme-mirror3 (PVE2)
- **Network**: 1 Gb (vmbr0)

**GPU Passthrough**:
- NVIDIA RTX A6000 (300W TDP, 48GB VRAM)

**Software**:
- Trading algorithms
- AI models for market prediction
- Real-time data feeds
- Backtesting infrastructure

---

## LXC Container Details

### 200 - Pi-hole (DNS & Ad Blocking)

**Purpose**: Network-wide DNS server and ad blocker

**Type**: LXC (unprivileged)
**OS**: Ubuntu 22.04
**IP**: 10.10.10.10
**Storage**: rpool

**Access**:
- Web UI: http://10.10.10.10/admin
- Public URL: https://pihole.htsn.io

**Configuration**:
- Upstream DNS: Cloudflare (1.1.1.1)
- DHCP: Disabled (router handles DHCP)
- Interface: All interfaces

**Usage**: Set router DNS to 10.10.10.10 for network-wide ad blocking

---

### 202 - Traefik (Reverse Proxy)

**Purpose**: Primary reverse proxy for all public-facing services

**Type**: LXC (unprivileged)
**OS**: Ubuntu 22.04
**IP**: 10.10.10.250
**Storage**: rpool

**Configuration**: `/etc/traefik/`
**Dynamic Configs**: `/etc/traefik/conf.d/*.yaml`

**See**: [TRAEFIK.md](TRAEFIK.md) for complete documentation

**⚠️ Important**: This is the PRIMARY Traefik instance. Do NOT confuse with Saltbox's Traefik (VM 101).

---

### 205 - FindShyt (Custom App)

**Purpose**: Custom application (details TBD)

**Type**: LXC (unprivileged)
**OS**: Ubuntu 22.04
**IP**: 10.10.10.8
**Storage**: rpool

**Access**: https://findshyt.htsn.io

---

## VM Startup Order & Dependencies

### Power-On Sequence

When servers boot (after power failure or restart), VMs/CTs start in this order:

#### PVE (10.10.10.120)

| Order | Wait | VMID | Name | Reason |
|-------|------|------|------|--------|
| **1** | 30s | 100 | TrueNAS | ⚠️ Storage must start first - other VMs depend on NFS |
| **2** | 60s | 101 | Saltbox | Depends on TrueNAS NFS mounts for media |
| **3** | 10s | 105, 110, 111, 201, 206 | Other VMs | General VMs, no critical dependencies |
| **4** | 5s | 200, 202, 205 | Containers | Lightweight, start quickly |

**Configure startup order** (already set):
```bash
# View current config
ssh pve 'qm config 100 | grep -E "startup|onboot"'

# Set startup order (example)
ssh pve 'qm set 100 --onboot 1 --startup order=1,up=30'
ssh pve 'qm set 101 --onboot 1 --startup order=2,up=60'
```

#### PVE2 (10.10.10.102)

| Order | Wait | VMID | Name |
|-------|------|------|------|
| **1** | 10s | 300, 301, 302 | All VMs |

**Less critical** - no dependencies between PVE2 VMs; the equivalent configuration commands are shown below.
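
A sketch of the matching `qm set` calls for PVE2 (same flags as the PVE examples above; it assumes all three VMs share order 1):

```bash
for vmid in 300 301 302; do
    ssh pve2 "qm set $vmid --onboot 1 --startup order=1,up=10"
done
```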

---

## Resource Allocation Summary

### Total Allocated (PVE)

| Resource | Allocated | Physical | % Used |
|----------|-----------|----------|--------|
| **vCPUs** | 56 | 64 (32 cores × 2 threads) | 88% |
| **RAM** | 98 GB | 128 GB | 77% |

**Note**: vCPU overcommit is acceptable (VMs rarely use all cores simultaneously)

### Total Allocated (PVE2)

| Resource | Allocated | Physical | % Used |
|----------|-----------|----------|--------|
| **vCPUs** | 22 | 64 | 34% |
| **RAM** | 44 GB | 128 GB | 34% |

**PVE2** has significant headroom for additional VMs.

---

## Adding a New VM

### Quick Template

```bash
# Create VM
ssh pve 'qm create VMID \
  --name myvm \
  --memory 4096 \
  --cores 2 \
  --net0 virtio,bridge=vmbr0 \
  --scsihw virtio-scsi-pci \
  --scsi0 nvme-mirror1:32 \
  --boot order=scsi0 \
  --ostype l26 \
  --agent enabled=1'

# Attach ISO for installation
ssh pve 'qm set VMID --ide2 local:iso/ubuntu-22.04.iso,media=cdrom'

# Start VM
ssh pve 'qm start VMID'

# Access console
ssh pve 'qm vncproxy VMID'   # Then connect with VNC client
# Or via Proxmox web UI
```

### Cloud-Init Template (Faster)

Use cloud-init for automated VM deployment:

```bash
# Download cloud image
ssh pve 'wget https://cloud-images.ubuntu.com/releases/22.04/release/ubuntu-22.04-server-cloudimg-amd64.img -O /var/lib/vz/template/iso/ubuntu-22.04-cloud.img'

# Create VM
ssh pve 'qm create VMID --name myvm --memory 4096 --cores 2 --net0 virtio,bridge=vmbr0'

# Import disk
ssh pve 'qm importdisk VMID /var/lib/vz/template/iso/ubuntu-22.04-cloud.img nvme-mirror1'

# Attach disk
ssh pve 'qm set VMID --scsi0 nvme-mirror1:vm-VMID-disk-0'

# Add cloud-init drive
ssh pve 'qm set VMID --ide2 nvme-mirror1:cloudinit'

# Set boot disk
ssh pve 'qm set VMID --boot order=scsi0'

# Configure cloud-init (user, SSH key, network)
ssh pve 'qm set VMID --ciuser hutson --sshkeys ~/.ssh/homelab.pub --ipconfig0 ip=10.10.10.XXX/24,gw=10.10.10.1'

# Enable QEMU agent
ssh pve 'qm set VMID --agent enabled=1'

# Resize disk (cloud images are small by default)
ssh pve 'qm resize VMID scsi0 +30G'

# Start VM
ssh pve 'qm start VMID'
```

**Cloud-init VMs boot ready-to-use** with SSH keys, static IP, and user configured.
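
Once one cloud-init VM is dialed in, it can be frozen as a reusable template so future VMs are a clone away (standard `qm` workflow; VMID/NEW_VMID are placeholders):

```bash
# Convert the configured VM into a template (it can no longer be started directly)
ssh pve 'qm template VMID'

# Full-clone new VMs from it, then adjust cloud-init per clone
ssh pve 'qm clone VMID NEW_VMID --name myvm2 --full'
ssh pve 'qm set NEW_VMID --ipconfig0 ip=10.10.10.YYY/24,gw=10.10.10.1'
```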

---

## Adding a New LXC Container

```bash
# Download template (if not already downloaded)
ssh pve 'pveam update'
ssh pve 'pveam available | grep ubuntu'
ssh pve 'pveam download local ubuntu-22.04-standard_22.04-1_amd64.tar.zst'

# Create container
ssh pve 'pct create CTID local:vztmpl/ubuntu-22.04-standard_22.04-1_amd64.tar.zst \
  --hostname mycontainer \
  --memory 2048 \
  --cores 2 \
  --net0 name=eth0,bridge=vmbr0,ip=10.10.10.XXX/24,gw=10.10.10.1 \
  --rootfs local-zfs:8 \
  --unprivileged 1 \
  --features nesting=1 \
  --start 1'

# Set root password
ssh pve 'pct exec CTID -- passwd'

# Add SSH key
ssh pve 'pct exec CTID -- mkdir -p /root/.ssh'
ssh pve 'pct exec CTID -- bash -c "echo \"$(cat ~/.ssh/homelab.pub)\" >> /root/.ssh/authorized_keys"'
ssh pve 'pct exec CTID -- chmod 700 /root/.ssh && chmod 600 /root/.ssh/authorized_keys'
```

---

## GPU Passthrough Configuration

### Current GPU Assignments

| GPU | Location | Passed To | VMID | Purpose |
|-----|----------|-----------|------|---------|
| **NVIDIA Quadro P2000** | PVE | - | - | Proxmox host (Plex transcoding via driver) |
| **NVIDIA TITAN RTX** | PVE | saltbox, lmdev1 | 101, 111 | Media transcoding + AI dev (shared) |
| **NVIDIA RTX A6000** | PVE2 | trading-vm | 301 | AI trading (dedicated) |

### How to Pass GPU to VM

1. **Identify GPU PCI ID**:
```bash
ssh pve 'lspci | grep -i nvidia'
# Example output:
# 81:00.0 VGA compatible controller: NVIDIA Corporation TU102 [TITAN RTX] (rev a1)
# 81:00.1 Audio device: NVIDIA Corporation TU102 High Definition Audio Controller (rev a1)
```

2. **Pass GPU to VM** (include both VGA and Audio):
```bash
ssh pve 'qm set VMID -hostpci0 81:00.0,pcie=1'
# If multi-function device (GPU + Audio), use:
ssh pve 'qm set VMID -hostpci0 81:00,pcie=1'
```

3. **Configure VM for GPU**:
```bash
# Set machine type to q35
ssh pve 'qm set VMID --machine q35'

# Set BIOS to OVMF (UEFI)
ssh pve 'qm set VMID --bios ovmf'

# Add EFI disk
ssh pve 'qm set VMID --efidisk0 nvme-mirror1:1,format=raw,efitype=4m,pre-enrolled-keys=1'
```

4. **Reboot VM** and install NVIDIA drivers inside the VM
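
Two quick sanity checks around this procedure (standard tooling, nothing homelab-specific): IOMMU must be active on the host before step 2 works, and `nvidia-smi` inside the guest confirms the card arrived:

```bash
# On the host: confirm IOMMU/VT-d is enabled (kernel cmdline + BIOS setting)
ssh pve 'dmesg | grep -e DMAR -e IOMMU | head'

# Inside the VM after driver install: the GPU should enumerate
nvidia-smi
```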

**See**: [GPU-PASSTHROUGH.md](#) (coming soon) for detailed guide

---

## Backup Priority

See [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md) for complete backup plan.

### Critical VMs (Must Backup)

| Priority | VMID | Name | Reason |
|----------|------|------|--------|
| 🔴 **CRITICAL** | 100 | truenas | All storage lives here - catastrophic if lost |
| 🟡 **HIGH** | 101 | saltbox | Complex media stack config |
| 🟡 **HIGH** | 110 | homeassistant | Home automation config |
| 🟡 **HIGH** | 300 | gitea-vm | Git repositories (code, docs) |
| 🟡 **HIGH** | 301 | trading-vm | Trading algorithms and AI models |

### Medium Priority

| VMID | Name | Notes |
|------|------|-------|
| 200 | pihole | Easy to rebuild, but DNS config valuable |
| 202 | traefik | Config files backed up separately |

### Low Priority (Ephemeral/Rebuildable)

| VMID | Name | Notes |
|------|------|-------|
| 105 | fs-dev | Development - code is in Git |
| 111 | lmdev1 | Ephemeral development |
| 201 | copyparty | Simple app, easy to redeploy |
| 206 | docker-host | Docker Compose files backed up separately |

---

## Quick Reference Commands

```bash
# List all VMs
ssh pve 'qm list'
ssh pve2 'qm list'

# List all containers
ssh pve 'pct list'

# Start/stop VM
ssh pve 'qm start VMID'
ssh pve 'qm stop VMID'
ssh pve 'qm shutdown VMID'   # Graceful

# Start/stop container
ssh pve 'pct start CTID'
ssh pve 'pct stop CTID'
ssh pve 'pct shutdown CTID'   # Graceful

# VM console
ssh pve 'qm terminal VMID'

# Container console
ssh pve 'pct enter CTID'

# Clone VM
ssh pve 'qm clone VMID NEW_VMID --name newvm'

# Delete VM
ssh pve 'qm destroy VMID'

# Delete container
ssh pve 'pct destroy CTID'
```

---

## Related Documentation

- [STORAGE.md](STORAGE.md) - Storage pool assignments
- [SSH-ACCESS.md](SSH-ACCESS.md) - How to access VMs
- [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md) - VM backup strategy
- [POWER-MANAGEMENT.md](POWER-MANAGEMENT.md) - VM resource optimization
- [NETWORK.md](NETWORK.md) - Which bridge to use for new VMs

---

**Last Updated**: 2025-12-22
@@ -0,0 +1 @@
{"web":{"client_id":"693027753314-hdjfnvfnarlcnehba6u8plbehv78rfh9.apps.googleusercontent.com","project_id":"spheric-method-482514-f8","auth_uri":"https://accounts.google.com/o/oauth2/auth","token_uri":"https://oauth2.googleapis.com/token","auth_provider_x509_cert_url":"https://www.googleapis.com/oauth2/v1/certs","client_secret":"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","redirect_uris":["https://my.home-assistant.io/redirect/oauth"]}}
41
data/scripts/internet-watchdog.sh
Normal file
@@ -0,0 +1,41 @@
#!/bin/bash
# Internet Watchdog - Reboots if internet is unreachable for 5 minutes
LOG_FILE="/var/log/internet-watchdog.log"
FAIL_COUNT=0
MAX_FAILS=5
CHECK_INTERVAL=60

log() {
    echo "$(date "+%Y-%m-%d %H:%M:%S") - $1" >> "$LOG_FILE"
}

check_internet() {
    for endpoint in 1.1.1.1 8.8.8.8 208.67.222.222; do
        if ping -c 1 -W 5 "$endpoint" > /dev/null 2>&1; then
            return 0
        fi
    done
    return 1
}

log "Watchdog started"

while true; do
    if check_internet; then
        if [ $FAIL_COUNT -gt 0 ]; then
            log "Internet restored after $FAIL_COUNT failures"
        fi
        FAIL_COUNT=0
    else
        FAIL_COUNT=$((FAIL_COUNT + 1))
        log "Internet check failed ($FAIL_COUNT/$MAX_FAILS)"

        if [ $FAIL_COUNT -ge $MAX_FAILS ]; then
            log "CRITICAL: $MAX_FAILS consecutive failures - REBOOTING"
            sync
            sleep 2
            reboot
        fi
    fi
    sleep $CHECK_INTERVAL
done
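#
# Deployment note: this script loops forever, so it needs a supervisor to
# start it at boot. A minimal systemd unit could look like the following
# (the unit name and target are assumptions; the diff does not show how
# the script is actually launched):
#
#   # /etc/systemd/system/internet-watchdog.service
#   [Unit]
#   Description=Internet watchdog - reboot after 5 min without connectivity
#   After=network-online.target
#
#   [Service]
#   ExecStart=/data/scripts/internet-watchdog.sh
#   Restart=always
#
#   [Install]
#   WantedBy=multi-user.target
#
# Enable with: systemctl enable --now internet-watchdog
# The same pattern applies to memory-monitor.sh below.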
23
data/scripts/memory-monitor.sh
Normal file
@@ -0,0 +1,23 @@
#!/bin/bash
LOG_DIR="/data/logs"
LOG_FILE="$LOG_DIR/memory-history.log"
mkdir -p "$LOG_DIR"

while true; do
    # Rotate if over 10MB
    if [ -f "$LOG_FILE" ]; then
        SIZE=$(wc -c < "$LOG_FILE" 2>/dev/null || echo 0)
        if [ "$SIZE" -gt 10485760 ]; then
            mv "$LOG_FILE" "$LOG_FILE.old"
        fi
    fi

    echo "========== $(date +%Y-%m-%d\ %H:%M:%S) ==========" >> "$LOG_FILE"
    echo "--- MEMORY ---" >> "$LOG_FILE"
    free -m >> "$LOG_FILE"
    echo "--- TOP MEMORY PROCESSES ---" >> "$LOG_FILE"
    ps -eo pid,rss,comm --sort=-rss | head -12 >> "$LOG_FILE"
    echo "" >> "$LOG_FILE"

    sleep 600
done