Complete Phase 2 documentation: Add HARDWARE, SERVICES, MONITORING, MAINTENANCE

Phase 2 documentation implementation:
- Created HARDWARE.md: Complete hardware inventory (servers, GPUs, storage, network cards)
- Created SERVICES.md: Service inventory with URLs, credentials, health checks (25+ services)
- Created MONITORING.md: Health monitoring recommendations, alert setup, implementation plan
- Created MAINTENANCE.md: Regular procedures, update schedules, testing checklists
- Updated README.md: Added all Phase 2 documentation links
- Updated CLAUDE.md: Cleaned up to quick reference only (1340→377 lines)

All detailed content now in specialized documentation files with cross-references.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Author: Hutson
Date: 2025-12-23 00:34:21 -05:00
parent 23e9df68c9
commit 56b82df497
14 changed files with 6328 additions and 1036 deletions

BACKUP-STRATEGY.md Normal file

@@ -0,0 +1,358 @@
# Backup Strategy
## 🚨 Current Status: CRITICAL GAPS IDENTIFIED
This document outlines the backup strategy for the homelab infrastructure. **As of 2025-12-22, there are significant gaps in backup coverage that need to be addressed.**
## Executive Summary
### What We Have ✅
- **Syncthing**: File synchronization across 5+ devices
- **ZFS on TrueNAS**: Copy-on-write filesystem with snapshot capability (not yet configured)
- **Proxmox**: Built-in backup capabilities (not yet configured)
### What We DON'T Have 🚨
- ❌ No documented VM/CT backups
- ❌ No ZFS snapshot schedule
- ❌ No offsite backups
- ❌ No disaster recovery plan
- ❌ No tested restore procedures
- ❌ No configuration backups
**Risk Level**: HIGH - A catastrophic failure could result in significant data loss.
---
## Current State Analysis
### Syncthing (File Synchronization)
**What it is**: Real-time file sync across devices
**What it is NOT**: A backup solution
| Folder | Devices | Size | Protected? |
|--------|---------|------|------------|
| documents | Mac Mini, MacBook, TrueNAS, Windows PC, Phone | 11 GB | ⚠️ Sync only |
| downloads | Mac Mini, TrueNAS | 38 GB | ⚠️ Sync only |
| pictures | Mac Mini, MacBook, TrueNAS, Phone | Unknown | ⚠️ Sync only |
| notes | Mac Mini, MacBook, TrueNAS, Phone | Unknown | ⚠️ Sync only |
| config | Mac Mini, MacBook, TrueNAS | Unknown | ⚠️ Sync only |
**Limitations**:
- ❌ Accidental deletion → deleted everywhere
- ❌ Ransomware/corruption → spreads everywhere
- ❌ No point-in-time recovery
- ❌ No version history (unless file versioning enabled - not documented)
**Verdict**: Syncthing provides redundancy and availability, NOT backup protection.
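**Check versioning status** (a quick sketch using the Syncthing REST API; assumes the same API key and local GUI address used elsewhere in this repo):
```bash
# Print each folder's versioning type (an empty value means no versioning is configured)
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
  "http://127.0.0.1:8384/rest/config/folders" | \
  jq -r '.[] | "\(.label): \(.versioning.type)"'
```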
### ZFS on TrueNAS (Potential Backup Target)
**Current Status**: ❓ Unknown - snapshots may or may not be configured
**Needs Investigation**:
```bash
# Check if snapshots exist
ssh truenas 'zfs list -t snapshot'
# Check if automated snapshots are configured
ssh truenas 'cat /etc/cron.d/zfs-auto-snapshot' || echo "Not configured"
# Check snapshot schedule via TrueNAS API/UI
```
**If configured**, ZFS snapshots provide:
- ✅ Point-in-time recovery
- ✅ Protection against accidental deletion
- ✅ Fast rollback capability
- ⚠️ Still single location (no offsite protection)
### Proxmox VM/CT Backups
**Current Status**: ❓ Unknown - no backup jobs documented
**Needs Investigation**:
```bash
# Check backup configuration
ssh pve 'pvesh get /cluster/backup'
# Check if any backups exist
ssh pve 'ls -lh /var/lib/vz/dump/'
ssh pve2 'ls -lh /var/lib/vz/dump/'
```
**Critical VMs Needing Backup**:
| VM/CT | VMID | Priority | Notes |
|-------|------|----------|-------|
| TrueNAS | 100 | 🔴 CRITICAL | All storage lives here |
| Saltbox | 101 | 🟡 HIGH | Media stack, complex config |
| homeassistant | 110 | 🟡 HIGH | Home automation config |
| gitea-vm | 300 | 🟡 HIGH | Git repositories |
| pihole | 200 | 🟢 MEDIUM | DNS config (easy to rebuild) |
| traefik | 202 | 🟢 MEDIUM | Reverse proxy config |
| trading-vm | 301 | 🟡 HIGH | AI trading platform |
| lmdev1 | 111 | 🟢 LOW | Development (ephemeral) |
---
## Recommended Backup Strategy
### Tier 1: Local Snapshots (IMPLEMENT IMMEDIATELY)
**ZFS Snapshots on TrueNAS**
Schedule automatic snapshots for all datasets:
| Dataset | Frequency | Retention |
|---------|-----------|-----------|
| vault/documents | Every 15 min | 1 hour |
| vault/documents | Hourly | 24 hours |
| vault/documents | Daily | 30 days |
| vault/documents | Weekly | 12 weeks |
| vault/documents | Monthly | 12 months |
**Implementation**:
```bash
# Via TrueNAS UI: Storage → Snapshots → Add
# Or via CLI:
ssh truenas 'zfs snapshot vault/documents@daily-$(date +%Y%m%d)'
```
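For a scripted alternative, a minimal daily-snapshot job might look like the sketch below (TrueNAS's built-in Periodic Snapshot Tasks are the more robust option; this assumes TrueNAS SCALE's GNU userland and should be adapted to the retention table above):
```bash
#!/bin/sh
# Sketch: create a daily snapshot and prune daily snapshots beyond the retention window
DATASET=vault/documents
KEEP=30

zfs snapshot "${DATASET}@daily-$(date +%Y%m%d)"

# List daily snapshots oldest-first and destroy everything except the newest $KEEP
zfs list -H -t snapshot -o name -s creation | \
  grep "^${DATASET}@daily-" | \
  head -n -"$KEEP" | \
  xargs -r -n1 zfs destroy
```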
**Proxmox VM Backups**
Configure weekly backups to local storage:
```bash
# Create backup job via Proxmox UI:
# Datacenter → Backup → Add
# - Schedule: Weekly (Sunday 2 AM)
# - Storage: local-zfs or nvme-mirror1
# - Mode: Snapshot (fast)
# - Retention: 4 backups
```
**Or via CLI**:
```bash
ssh pve 'pvesh create /cluster/backup --schedule "sun 02:00" --storage local-zfs --mode snapshot --prune-backups keep-last=4 --vmid 100,101,110,300'
```
### Tier 2: Offsite Backups (CRITICAL GAP)
**Option A: Cloud Storage (Recommended)**
Use **rclone** or **restic** to sync critical data to cloud:
| Provider | Cost | Pros | Cons |
|----------|------|------|------|
| Backblaze B2 | $6/TB/mo | Cheap, reliable | Egress fees |
| AWS S3 Glacier | $4/TB/mo | Very cheap storage | Slow retrieval |
| Wasabi | $6.99/TB/mo | No egress fees | Minimum 90-day retention |
**Implementation Example (Backblaze B2)**:
```bash
# Install on TrueNAS
ssh truenas 'pkg install rclone restic'
# Configure B2
rclone config # Follow prompts for B2
# Daily backup critical folders
0 3 * * * rclone sync /mnt/vault/documents b2:homelab-backup/documents --transfers 4
```
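If you prefer **restic** over a plain `rclone sync`, it adds encryption, deduplication, and point-in-time restore on top of B2. A minimal sketch (bucket name, key IDs, and repository password are placeholders):
```bash
# Hypothetical restic setup against a B2 bucket
export B2_ACCOUNT_ID="<b2-key-id>"
export B2_ACCOUNT_KEY="<b2-application-key>"
export RESTIC_REPOSITORY="b2:homelab-backup:restic"
export RESTIC_PASSWORD="<repository-password>"

# One-time: initialize the repository
restic init

# Daily (e.g. from cron): back up and prune old snapshots
restic backup /mnt/vault/documents
restic forget --keep-daily 30 --keep-weekly 12 --keep-monthly 12 --prune
```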
**Option B: Offsite TrueNAS Replication**
- Set up second TrueNAS at friend/family member's house
- Use ZFS replication to sync snapshots
- Requires: Static IP or Tailscale, trust
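Under the hood this is `zfs send`/`zfs receive`; TrueNAS replication tasks wrap it, but a manual sketch looks like this (the `offsite-truenas` host, snapshot names, and remote `backup` pool are placeholders):
```bash
# Initial full replication of one dataset to the offsite box
ssh truenas 'zfs snapshot -r vault/documents@offsite-20251222'
ssh truenas 'zfs send -R vault/documents@offsite-20251222 | ssh offsite-truenas zfs receive -F backup/documents'

# Later runs send only the delta between two snapshots
ssh truenas 'zfs send -R -i vault/documents@offsite-20251222 vault/documents@offsite-20260101 | ssh offsite-truenas zfs receive -F backup/documents'
```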
**Option C: USB Drive Rotation**
- Weekly backup to external USB drive
- Rotate 2-3 drives (one always offsite)
- Manual but simple
### Tier 3: Configuration Backups
**Proxmox Configuration**
```bash
# Backup /etc/pve (configs are already in cluster filesystem)
# But also backup to external location:
ssh pve 'tar czf /tmp/pve-config-$(date +%Y%m%d).tar.gz /etc/pve /etc/network/interfaces /etc/systemd/system/*.service'
# Copy to safe location
scp pve:/tmp/pve-config-*.tar.gz ~/Backups/proxmox/
```
**VM-Specific Configs**
- Traefik configs: `/etc/traefik/` on CT 202
- Saltbox configs: `/srv/git/saltbox/` on VM 101
- Home Assistant: `/config/` on VM 110
**Script to backup all configs**:
```bash
#!/bin/bash
# Save as ~/bin/backup-homelab-configs.sh
DATE=$(date +%Y%m%d)
BACKUP_DIR=~/Backups/homelab-configs/$DATE
mkdir -p "$BACKUP_DIR"
# Proxmox configs
ssh pve 'tar czf - /etc/pve /etc/network' > "$BACKUP_DIR/pve-config.tar.gz"
ssh pve2 'tar czf - /etc/pve /etc/network' > "$BACKUP_DIR/pve2-config.tar.gz"
# Traefik
ssh pve 'pct exec 202 -- tar czf - /etc/traefik' > "$BACKUP_DIR/traefik-config.tar.gz"
# Saltbox
ssh saltbox 'tar czf - /srv/git/saltbox' > "$BACKUP_DIR/saltbox-config.tar.gz"
# Home Assistant (note: qm guest exec wraps command output in JSON and limits its size,
# so streaming a tarball this way is unreliable; prefer Home Assistant's own backup feature
# or SSH into the VM directly)
ssh pve 'qm guest exec 110 -- tar czf -' /config > "$BACKUP_DIR/homeassistant-config.tar.gz"
echo "Configs backed up to $BACKUP_DIR"
```
---
## Disaster Recovery Scenarios
### Scenario 1: Single VM Failure
**Impact**: Medium
**Recovery Time**: 30-60 minutes
1. Restore from Proxmox backup:
```bash
ssh pve 'qmrestore /path/to/backup.vma.zst VMID'
```
2. Start VM and verify
3. Update IP if needed
### Scenario 2: TrueNAS Failure
**Impact**: CATASTROPHIC (all storage lost)
**Recovery Time**: Unknown - NO PLAN
**Current State**: 🚨 NO RECOVERY PLAN
**Needed**:
- Offsite backup of critical datasets
- Documented ZFS pool creation steps
- Share configuration export
### Scenario 3: Complete PVE Server Failure
**Impact**: SEVERE
**Recovery Time**: 4-8 hours
**Current State**: ⚠️ PARTIALLY RECOVERABLE
**Needed**:
- VM backups stored on TrueNAS or PVE2
- Proxmox reinstall procedure
- Network config documentation
### Scenario 4: Complete Site Disaster (Fire/Flood)
**Impact**: TOTAL LOSS
**Recovery Time**: Unknown
**Current State**: 🚨 NO RECOVERY PLAN
**Needed**:
- Offsite backups (cloud or physical)
- Critical data prioritization
- Restore procedures
---
## Action Plan
### Immediate (Next 7 Days)
- [ ] **Audit existing backups**: Check if ZFS snapshots or Proxmox backups exist
```bash
ssh truenas 'zfs list -t snapshot'
ssh pve 'ls -lh /var/lib/vz/dump/'
```
- [ ] **Enable ZFS snapshots**: Configure via TrueNAS UI for critical datasets
- [ ] **Configure Proxmox backup jobs**: Weekly backups of critical VMs (100, 101, 110, 300)
- [ ] **Test restore**: Pick one VM, back it up, restore it to verify process works
### Short-term (Next 30 Days)
- [ ] **Set up offsite backup**: Choose provider (Backblaze B2 recommended)
- [ ] **Install backup tools**: rclone or restic on TrueNAS
- [ ] **Configure daily cloud sync**: Critical folders to cloud storage
- [ ] **Document restore procedures**: Step-by-step guides for each scenario
### Long-term (Next 90 Days)
- [ ] **Implement monitoring**: Alerts for backup failures
- [ ] **Quarterly restore test**: Verify backups actually work
- [ ] **Backup rotation policy**: Automate old backup cleanup
- [ ] **Configuration backup automation**: Weekly cron job
---
## Monitoring & Validation
### Backup Health Checks
```bash
# Check last ZFS snapshot
ssh truenas 'zfs list -t snapshot -o name,creation -s creation | tail -5'
# Check Proxmox backup status
ssh pve 'pvesh get /cluster/backup-info/not-backed-up'
# Check cloud sync status (if using rclone)
ssh truenas 'rclone ls b2:homelab-backup | wc -l'
```
### Alerts to Set Up
- Email alert if no snapshot created in 24 hours
- Email alert if Proxmox backup fails
- Email alert if cloud sync fails
- Weekly backup status report
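A minimal cron-able sketch for the first alert (assumes a working local `mail` command; the address matches the examples elsewhere in this repo):
```bash
#!/bin/sh
# Sketch: alert if the newest ZFS snapshot on TrueNAS is older than 24 hours
NEWEST_EPOCH=$(ssh truenas 'zfs list -Hp -t snapshot -o creation -s creation | tail -1')

if [ -z "$NEWEST_EPOCH" ]; then
  echo "No snapshots found on TrueNAS" | mail -s "ALERT: no ZFS snapshots" hutson@example.com
  exit 0
fi

AGE_HOURS=$(( ( $(date +%s) - NEWEST_EPOCH ) / 3600 ))
if [ "$AGE_HOURS" -gt 24 ]; then
  echo "Newest ZFS snapshot is ${AGE_HOURS}h old" | \
    mail -s "ALERT: stale ZFS snapshots" hutson@example.com
fi
```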
---
## Cost Estimate
**Monthly Backup Costs**:
| Component | Cost | Notes |
|-----------|------|-------|
| Local storage (already owned) | $0 | Using existing TrueNAS |
| Proxmox backups (local) | $0 | Using existing storage |
| Cloud backup (1 TB) | $6-10/mo | Backblaze B2 or Wasabi |
| **Total** | **~$10/mo** | Minimal cost for peace of mind |
**One-time**:
- External USB drives (3x 4TB): ~$300 (optional, for rotation backups)
---
## Related Documentation
- [STORAGE.md](STORAGE.md) - ZFS pool layouts and capacity
- [VMS.md](VMS.md) - VM inventory and prioritization
- [DISASTER-RECOVERY.md](#) - Recovery procedures (coming soon)
---
**Last Updated**: 2025-12-22
**Status**: 🚨 CRITICAL GAPS - IMMEDIATE ACTION REQUIRED

CLAUDE.md

File diff suppressed because it is too large

HARDWARE.md Normal file

@@ -0,0 +1,455 @@
# Hardware Inventory
Complete hardware specifications for all homelab equipment.
## Servers
### PVE (10.10.10.120) - Primary Proxmox Server
#### CPU
- **Model**: AMD Ryzen Threadripper PRO 3975WX
- **Cores**: 32 cores / 64 threads
- **Base Clock**: 3.5 GHz
- **Boost Clock**: 4.2 GHz
- **TDP**: 280W
- **Architecture**: Zen 2 (7nm)
- **Socket**: sWRX8 (Threadripper PRO)
- **Features**: ECC support, PCIe 4.0
#### RAM
- **Capacity**: 128 GB
- **Type**: DDR4 ECC Registered
- **Speed**: Unknown (needs investigation)
- **Channels**: 8-channel
- **Idle Power**: ~30-40W
#### Storage
**OS/VM Storage:**
| Pool | Devices | Type | Capacity | Purpose |
|------|---------|------|----------|---------|
| `nvme-mirror1` | 2x Sabrent Rocket Q NVMe | ZFS Mirror | 3.6 TB usable | High-performance VM storage |
| `nvme-mirror2` | 2x Kingston SFYRD 2TB NVMe | ZFS Mirror | 1.8 TB usable | Additional fast VM storage |
| `rpool` | 2x Samsung 870 QVO 4TB SSD | ZFS Mirror | 3.6 TB usable | Proxmox OS, containers, backups |
**Total Storage**: ~9 TB usable
#### GPUs
| Model | Slot | VRAM | TDP | Purpose | Passed To |
|-------|------|------|-----|---------|-----------|
| NVIDIA Quadro P2000 | PCIe slot 1 | 5 GB GDDR5 | 75W | Plex transcoding | Host |
| NVIDIA TITAN RTX | PCIe slot 2 | 24 GB GDDR6 | 280W | AI workloads | Saltbox (101), lmdev1 (111) |
**Total GPU Power**: 75W + 280W = 355W (under load)
#### Network Cards
| Interface | Model | Speed | Purpose | Bridge |
|-----------|-------|-------|---------|--------|
| enp1s0 | Intel I210 (onboard) | 1 Gb | Management | vmbr0 |
| enp35s0f0 | Intel X520 (dual-port SFP+) | 10 Gb | High-speed LXC | vmbr1 |
| enp35s0f1 | Intel X520 (dual-port SFP+) | 10 Gb | High-speed VM | vmbr2 |
**10Gb Transceivers**: Intel FTLX8571D3BCV (SFP+ 10GBASE-SR, 850nm, multimode)
#### Storage Controllers
| Model | Interface | Purpose |
|-------|-----------|---------|
| LSI SAS2308 HBA | PCIe 3.0 x8 | Passed to TrueNAS VM for EMC enclosure |
| Samsung NVMe controller | PCIe | Passed to TrueNAS VM for ZFS caching |
#### Motherboard
- **Model**: Unknown - needs investigation
- **Chipset**: AMD WRX80 (Threadripper PRO platform)
- **Form Factor**: ATX/EATX
- **PCIe Slots**: Multiple PCIe 4.0 slots
- **Features**: IOMMU support, ECC memory
#### Power Supply
- **Model**: Unknown
- **Wattage**: Likely 1000W+ (needs investigation)
- **Type**: ATX, 80+ certification unknown
#### Cooling
- **CPU Cooler**: Unknown - likely large tower or AIO
- **Case Fans**: Unknown quantity
- **Note**: CPU temps 70-80°C under load (healthy)
---
### PVE2 (10.10.10.102) - Secondary Proxmox Server
#### CPU
- **Model**: AMD Ryzen Threadripper PRO 3975WX
- **Specs**: Same as PVE (32C/64T, 280W TDP)
#### RAM
- **Capacity**: 128 GB DDR4 ECC
- **Same specs as PVE**
#### Storage
| Pool | Devices | Type | Capacity | Purpose |
|------|---------|------|----------|---------|
| `nvme-mirror3` | 2x NVMe (model unknown) | ZFS Mirror | Unknown | High-performance VM storage |
| `local-zfs2` | 2x WD Red 6TB HDD | ZFS Mirror | ~6 TB usable | Bulk/archival storage (spins down) |
**HDD Spindown**: Configured for 30-min idle spindown (saves ~10-16W)
#### GPUs
| Model | Slot | VRAM | TDP | Purpose | Passed To |
|-------|------|------|-----|---------|-----------|
| NVIDIA RTX A6000 | PCIe slot 1 | 48 GB GDDR6 | 300W | AI trading workloads | trading-vm (301) |
#### Network Cards
| Interface | Model | Speed | Purpose |
|-----------|-------|-------|---------|
| nic1 | Unknown (onboard) | 1 Gb | Management |
**Note**: MTU set to 9000 for jumbo frames
#### Motherboard
- **Model**: Unknown
- **Chipset**: AMD WRX80 (Threadripper PRO platform)
- **Similar to PVE**
---
## Network Equipment
### UniFi Dream Machine Pro (UCG-Fiber)
- **Model**: UniFi Cloud Gateway Fiber
- **IP**: 10.10.10.1
- **Ports**: Multiple 1Gb + SFP+ uplink
- **Features**: Router, firewall, VPN, IDS/IPS
- **MTU**: 9216 (supports jumbo frames)
- **Tailscale**: Installed for VPN failover
### Switches
**Details needed** - investigate current switch setup:
- 10Gb switch for high-speed connections?
- 1Gb switch for general devices?
- PoE capabilities?
```bash
# Check what's connected to 10Gb interfaces
ssh pve 'ip link show enp35s0f0'
ssh pve 'ip link show enp35s0f1'
```
---
## Storage Hardware
### EMC Storage Enclosure
**See [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md) for complete details**
- **Model**: EMC KTN-STL4 (or similar)
- **Form Factor**: 4U rackmount
- **Drive Bays**: 25x 3.5" SAS/SATA
- **Controllers**: Dual LCC (Link Control Cards)
- **Connection**: SAS via LSI SAS2308 HBA
- **Passed to**: TrueNAS VM (VMID 100)
**Current Status**:
- LCC A: Active (working)
- LCC B: Failed (replacement ordered)
**Drive Inventory**: Unknown - needs audit
```bash
# Get drive list from TrueNAS
ssh truenas 'smartctl --scan'
ssh truenas 'lsblk'
```
### NVMe Drives
| Model | Quantity | Capacity | Location | Pool |
|-------|----------|----------|----------|------|
| Sabrent Rocket Q | 2 | Unknown | PVE | nvme-mirror1 |
| Kingston SFYRD | 2 | 2 TB each | PVE | nvme-mirror2 |
| Unknown model | 2 | Unknown | PVE2 | nvme-mirror3 |
| Samsung (model unknown) | 1 | Unknown | TrueNAS (passed) | ZFS cache |
### SSDs
| Model | Quantity | Capacity | Location | Pool |
|-------|----------|----------|----------|------|
| Samsung 870 QVO | 2 | 4 TB each | PVE | rpool |
### HDDs
| Model | Quantity | Capacity | Location | Pool |
|-------|----------|----------|----------|------|
| WD Red | 2 | 6 TB each | PVE2 | local-zfs2 |
| Unknown (in EMC) | Unknown | Unknown | TrueNAS | vault |
---
## UPS
### Current UPS
| Specification | Value |
|---------------|-------|
| **Model** | CyberPower OR2200PFCRT2U |
| **Capacity** | 2200VA / 1320W |
| **Form Factor** | 2U rackmount |
| **Input** | NEMA 5-15P (rewired from 5-20P) |
| **Outlets** | 2x 5-20R + 6x 5-15R |
| **Output** | PFC Sinewave |
| **Runtime** | ~15-20 min @ 33% load |
| **Interface** | USB (connected to PVE) |
**See [UPS.md](UPS.md) for configuration details**
---
## Client Devices
### Mac Mini (Hutson's Workstation)
- **Model**: Unknown generation
- **CPU**: Unknown
- **RAM**: Unknown
- **Storage**: Unknown
- **Network**: 1Gb Ethernet (en0) - MTU 9000
- **Tailscale IP**: 100.108.89.58
- **Local IP**: 10.10.10.125 (static)
- **Purpose**: Primary workstation, Happy Coder daemon host
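Most of these unknowns can be filled in from the Mac itself (run locally, not over SSH):
```bash
# Model, chip, and memory
system_profiler SPHardwareDataType

# Storage and network interfaces
system_profiler SPStorageDataType
networksetup -listallhardwareports
```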
### MacBook (Mobile)
- **Model**: Unknown
- **Network**: Wi-Fi + Ethernet adapter
- **Tailscale IP**: Unknown
- **Purpose**: Mobile work, development
### Windows PC
- **Model**: Unknown
- **CPU**: Unknown
- **Network**: 1Gb Ethernet
- **IP**: 10.10.10.150
- **Purpose**: Gaming, Windows development, Syncthing node
### Phone (Android)
- **Model**: Unknown
- **IP**: 10.10.10.54 (when on Wi-Fi)
- **Purpose**: Syncthing mobile node, Happy Coder client
---
## Rack Layout (If Applicable)
**Needs documentation** - Current rack configuration unknown
Suggested format:
```
U42: Blank panel
U41: UPS (CyberPower 2U)
U40: UPS (CyberPower 2U)
U39: Switch (10Gb)
U38-U35: EMC Storage Enclosure (4U)
U34: PVE Server
U33: PVE2 Server
...
```
---
## Power Consumption
### Measured Power Draw
| Component | Idle | Typical | Peak | Notes |
|-----------|------|---------|------|-------|
| PVE Server | 250-350W | 500W | 750W | CPU + GPUs + storage |
| PVE2 Server | 200-300W | 400W | 600W | CPU + GPU + storage |
| Network Gear | ~50W | ~50W | ~50W | Router + switches |
| **Total** | **500-700W** | **~950W** | **~1400W** | Exceeds UPS under peak load |
**UPS Capacity**: 1320W
**Typical Load**: 33-50% (safe margin)
**Peak Load**: Can exceed UPS capacity temporarily (acceptable)
### Power Optimizations Applied
**See [POWER-MANAGEMENT.md](POWER-MANAGEMENT.md) for details**
- KSMD disabled: ~60-80W saved
- CPU governors: ~60-120W saved
- Syncthing rescans: ~60-80W saved
- HDD spindown: ~10-16W saved when idle
- **Total savings**: ~150-300W
---
## Thermal Management
### CPU Cooling
**PVE & PVE2**:
- CPU cooler: Unknown model
- Thermal paste: Unknown, likely needs refresh if temps >85°C
- Target temp: 70-80°C under load
- Max safe: 90°C Tctl (Threadripper PRO spec)
### GPU Cooling
All GPUs rely on their stock coolers and manage their own fan curves:
- TITAN RTX: 2-3W idle, 280W load
- RTX A6000: 11W idle, 300W load
- Quadro P2000: 25W constant (Plex active)
### Case Airflow
**Unknown** - needs investigation:
- Case model?
- Fan configuration?
- Positive or negative pressure?
---
## Cable Management
### Network Cables
| Connection | Type | Length | Speed |
|------------|------|--------|-------|
| PVE → Switch (10Gb) | OM3 fiber | Unknown | 10Gb |
| PVE2 → Router | Cat6 | Unknown | 1Gb |
| Mac Mini → Switch | Cat6 | Unknown | 1Gb |
| TrueNAS → EMC | SAS cable | Unknown | 6Gb/s |
### Power Cables
**Critical**: All servers on UPS battery-backed outlets
---
## Maintenance Schedule
### Annual Maintenance
- [ ] Clean dust from servers (every 6-12 months)
- [ ] Check thermal paste on CPUs (every 2-3 years)
- [ ] Test UPS battery runtime (annually)
- [ ] Verify all fans operational
- [ ] Check for bulging capacitors on PSUs
### Drive Health
```bash
# Check SMART status on all drives
ssh pve 'smartctl -a /dev/nvme0'
ssh pve2 'smartctl -a /dev/sda'
ssh truenas 'smartctl --scan | while read dev type; do echo "=== $dev ==="; smartctl -a $dev | grep -E "Model|Serial|Health|Reallocated|Current_Pending"; done'
```
### Temperature Monitoring
```bash
# Check all temps (needs lm-sensors installed)
ssh pve 'sensors'
ssh pve2 'sensors'
```
---
## Warranty & Purchase Info
**Needs documentation**:
- When were servers purchased?
- Where were components bought?
- Any warranties still active?
- Replacement part sources?
---
## Upgrade Path
### Short-term Upgrades (< 6 months)
- [ ] 20A circuit for UPS (restore original 5-20P plug)
- [ ] Document missing hardware specs
- [ ] Label all cables
- [ ] Create rack diagram
### Medium-term Upgrades (6-12 months)
- [ ] Additional 10Gb NIC for PVE2?
- [ ] More NVMe storage?
- [ ] Upgrade network switches?
- [ ] Replace EMC enclosure with newer model?
### Long-term Upgrades (1-2 years)
- [ ] CPU upgrade to newer Threadripper?
- [ ] RAM expansion to 256GB?
- [ ] Additional GPU for AI workloads?
- [ ] Migrate to PCIe 5.0 storage?
---
## Investigation Needed
High-priority items to document:
- [ ] Get exact motherboard model (both servers)
- [ ] Get PSU model and wattage
- [ ] CPU cooler models
- [ ] Network switch models and configuration
- [ ] Complete drive inventory in EMC enclosure
- [ ] RAM speed and timings
- [ ] Case models
- [ ] Exact NVMe models for all drives
**Commands to gather info**:
```bash
# Motherboard
ssh pve 'dmidecode -t baseboard'
# CPU details
ssh pve 'lscpu'
# RAM details
ssh pve 'dmidecode -t memory | grep -E "Size|Speed|Manufacturer"'
# Storage devices
ssh pve 'lsblk -o NAME,SIZE,TYPE,TRAN,MODEL'
# Network cards
ssh pve 'lspci | grep -i network'
# GPU details
ssh pve 'lspci | grep -i vga'
ssh pve 'nvidia-smi -L' # If nvidia-smi available
```
---
## Related Documentation
- [VMS.md](VMS.md) - VM resource allocation
- [STORAGE.md](STORAGE.md) - Storage pools and usage
- [POWER-MANAGEMENT.md](POWER-MANAGEMENT.md) - Power optimizations
- [UPS.md](UPS.md) - UPS configuration
- [NETWORK.md](NETWORK.md) - Network configuration
- [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md) - Storage enclosure details
---
**Last Updated**: 2025-12-22
**Status**: ⚠️ Incomplete - many specs need investigation


@@ -131,6 +131,42 @@ curl -s -H "Authorization: Bearer $HA_TOKEN" \
- **Philips Hue** - Lights
- **Sonos** - Speakers
- **Motion Sensors** - Various locations
- **NUT (Network UPS Tools)** - UPS monitoring (added 2025-12-21)
### NUT / UPS Integration
Monitors the CyberPower OR2200PFCRT2U UPS connected to PVE.
**Connection:**
- Host: 10.10.10.120
- Port: 3493
- Username: upsmon
- Password: upsmon123
**Entities:**
| Entity ID | Description |
|-----------|-------------|
| `sensor.cyberpower_battery_charge` | Battery percentage |
| `sensor.cyberpower_load` | Current load % |
| `sensor.cyberpower_input_voltage` | Input voltage |
| `sensor.cyberpower_output_voltage` | Output voltage |
| `sensor.cyberpower_status` | Status (Online, On Battery, etc.) |
| `sensor.cyberpower_status_data` | Raw status (OL, OB, LB, CHRG) |
**Dashboard Card Example:**
```yaml
type: entities
title: UPS Status
entities:
  - entity: sensor.cyberpower_status
    name: Status
  - entity: sensor.cyberpower_battery_charge
    name: Battery
  - entity: sensor.cyberpower_load
    name: Load
  - entity: sensor.cyberpower_input_voltage
    name: Input Voltage
```
## Automations

MAINTENANCE.md Normal file

@@ -0,0 +1,618 @@
# Maintenance Procedures and Schedules
Regular maintenance procedures for homelab infrastructure to ensure reliability and performance.
## Overview
| Frequency | Tasks | Estimated Time |
|-----------|-------|----------------|
| **Daily** | Quick health check | 2-5 min |
| **Weekly** | Service status, logs review | 15-30 min |
| **Monthly** | Updates, backups verification | 1-2 hours |
| **Quarterly** | Full system audit, testing | 2-4 hours |
| **Annual** | Hardware maintenance, planning | 4-8 hours |
---
## Daily Maintenance (Automated)
### Quick Health Check Script
Save as `~/bin/homelab-health-check.sh`:
```bash
#!/bin/bash
# Daily homelab health check
echo "=== Homelab Health Check ==="
echo "Date: $(date)"
echo ""
echo "=== Server Status ==="
ssh pve 'uptime' 2>/dev/null || echo "PVE: UNREACHABLE"
ssh pve2 'uptime' 2>/dev/null || echo "PVE2: UNREACHABLE"
echo ""
echo "=== CPU Temperatures ==="
ssh pve 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE: $(($(cat $f)/1000))°C"; fi; done'
ssh pve2 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE2: $(($(cat $f)/1000))°C"; fi; done'
echo ""
echo "=== UPS Status ==="
ssh pve 'upsc cyberpower@localhost | grep -E "battery.charge:|battery.runtime:|ups.load:|ups.status:"'
echo ""
echo "=== ZFS Pools ==="
ssh pve 'zpool status -x' 2>/dev/null
ssh pve2 'zpool status -x' 2>/dev/null
ssh truenas 'zpool status -x vault'
echo ""
echo "=== Disk Space ==="
ssh pve 'df -h | grep -E "Filesystem|/dev/(nvme|sd)"'
ssh truenas 'df -h /mnt/vault'
echo ""
echo "=== VM Status ==="
ssh pve 'qm list | grep running | wc -l' | xargs echo "PVE VMs running:"
ssh pve2 'qm list | grep running | wc -l' | xargs echo "PVE2 VMs running:"
echo ""
echo "=== Syncthing Connections ==="
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
"http://127.0.0.1:8384/rest/system/connections" | \
python3 -c "import sys,json; d=json.load(sys.stdin)['connections']; \
[print(f\"{v.get('name',k[:7])}: {'UP' if v['connected'] else 'DOWN'}\") for k,v in d.items()]"
echo ""
echo "=== Check Complete ==="
```
**Run daily via cron**:
```bash
# Add to crontab
0 9 * * * ~/bin/homelab-health-check.sh | mail -s "Homelab Health Check" hutson@example.com
```
---
## Weekly Maintenance
### Service Status Review
**Check all critical services**:
```bash
# Proxmox services
ssh pve 'systemctl status pve-cluster pvedaemon pveproxy'
ssh pve2 'systemctl status pve-cluster pvedaemon pveproxy'
# NUT (UPS monitoring)
ssh pve 'systemctl status nut-server nut-monitor'
ssh pve2 'systemctl status nut-monitor'
# Container services
ssh pve 'pct exec 200 -- systemctl status pihole-FTL' # Pi-hole
ssh pve 'pct exec 202 -- systemctl status traefik' # Traefik
# VM services (via QEMU agent)
ssh pve 'qm guest exec 100 -- bash -c "systemctl status nfs-server smbd"' # TrueNAS
```
### Log Review
**Check for errors in critical logs**:
```bash
# Proxmox system logs
ssh pve 'journalctl -p err -b | tail -50'
ssh pve2 'journalctl -p err -b | tail -50'
# VM logs (if QEMU agent available)
ssh pve 'qm guest exec 100 -- bash -c "journalctl -p err --since today"'
# Traefik access logs
ssh pve 'pct exec 202 -- tail -100 /var/log/traefik/access.log'
```
### Syncthing Sync Status
**Check for sync errors**:
```bash
# Check all folder errors
for folder in documents downloads desktop movies pictures notes config; do
echo "=== $folder ==="
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
"http://127.0.0.1:8384/rest/folder/errors?folder=$folder" | jq
done
```
**See**: [SYNCTHING.md](SYNCTHING.md)
---
## Monthly Maintenance
### System Updates
#### Proxmox Updates
**Check for updates**:
```bash
ssh pve 'apt update && apt list --upgradable'
ssh pve2 'apt update && apt list --upgradable'
```
**Apply updates**:
```bash
# PVE
ssh pve 'apt update && apt dist-upgrade -y'
# PVE2
ssh pve2 'apt update && apt dist-upgrade -y'
# Reboot if kernel updated
ssh pve 'reboot'
ssh pve2 'reboot'
```
**⚠️ Important**:
- Check [Proxmox release notes](https://pve.proxmox.com/wiki/Roadmap) before major updates
- Test on PVE2 first if possible
- Ensure all VMs are backed up before updating
- Monitor VMs after reboot - some may need manual restart
#### Container Updates (LXC)
```bash
# Update all containers
ssh pve 'for ctid in 200 202 205; do pct exec $ctid -- bash -c "apt update && apt upgrade -y"; done'
```
#### VM Updates
**Update VMs individually via SSH**:
```bash
# Ubuntu/Debian VMs
ssh truenas 'apt update && apt upgrade -y'
ssh docker-host 'apt update && apt upgrade -y'
ssh fs-dev 'apt update && apt upgrade -y'
# Check if reboot required
ssh truenas '[ -f /var/run/reboot-required ] && echo "Reboot required"'
```
### ZFS Scrubs
**Schedule**: Run monthly on all pools
**PVE**:
```bash
# Start scrub on all pools
ssh pve 'zpool scrub nvme-mirror1'
ssh pve 'zpool scrub nvme-mirror2'
ssh pve 'zpool scrub rpool'
# Check scrub status
ssh pve 'zpool status | grep -A2 scrub'
```
**PVE2**:
```bash
ssh pve2 'zpool scrub nvme-mirror3'
ssh pve2 'zpool scrub local-zfs2'
ssh pve2 'zpool status | grep -A2 scrub'
```
**TrueNAS**:
```bash
# Scrub via TrueNAS web UI or SSH
ssh truenas 'zpool scrub vault'
ssh truenas 'zpool status vault | grep -A2 scrub'
```
**Automate scrubs**:
```bash
# Add to crontab (run on 1st of month at 2 AM)
0 2 1 * * /sbin/zpool scrub nvme-mirror1
0 2 1 * * /sbin/zpool scrub nvme-mirror2
0 2 1 * * /sbin/zpool scrub rpool
```
**See**: [STORAGE.md](STORAGE.md) for pool details
### SMART Tests
**Run extended SMART tests monthly**:
```bash
# TrueNAS drives (via QEMU agent)
ssh pve 'qm guest exec 100 -- bash -c "smartctl --scan | while read dev type; do smartctl -t long \$dev; done"'
# Check results after 4-8 hours
ssh pve 'qm guest exec 100 -- bash -c "smartctl --scan | while read dev type; do echo \"=== \$dev ===\"; smartctl -a \$dev | grep -E \"Model|Serial|test result|Reallocated|Current_Pending\"; done"'
# PVE drives
ssh pve 'for dev in /dev/nvme0 /dev/nvme1 /dev/sda /dev/sdb; do [ -e "$dev" ] && smartctl -t long $dev; done'
# PVE2 drives
ssh pve2 'for dev in /dev/nvme0 /dev/nvme1 /dev/sda /dev/sdb; do [ -e "$dev" ] && smartctl -t long $dev; done'
```
**Automate SMART tests**:
```bash
# Add to crontab (run on 15th of month at 3 AM)
0 3 15 * * /usr/sbin/smartctl -t long /dev/nvme0
0 3 15 * * /usr/sbin/smartctl -t long /dev/sda
```
### Certificate Renewal Verification
**Check SSL certificate expiry**:
```bash
# Check Traefik certificates
ssh pve 'pct exec 202 -- cat /etc/traefik/acme.json | jq ".letsencrypt.Certificates[] | {domain: .domain.main, expires: .Dates.NotAfter}"'
# Check specific service
echo | openssl s_client -servername git.htsn.io -connect git.htsn.io:443 2>/dev/null | openssl x509 -noout -dates
```
**Certificates should auto-renew 30 days before expiry via Traefik**
**See**: [TRAEFIK.md](TRAEFIK.md) for certificate management
### Backup Verification
**⚠️ TODO**: No backup strategy currently in place
**See**: [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md) for implementation plan
---
## Quarterly Maintenance
### Full System Audit
**Check all systems comprehensively**:
1. **ZFS Pool Health**:
```bash
ssh pve 'zpool status -v'
ssh pve2 'zpool status -v'
ssh truenas 'zpool status -v vault'
```
Look for: errors, degraded vdevs, resilver operations
2. **SMART Health**:
```bash
# Run SMART health check script
~/bin/smart-health-check.sh
```
Look for: reallocated sectors, pending sectors, failures
3. **Disk Space Trends**:
```bash
# Check growth rate
ssh pve 'zpool list -o name,size,allocated,free,fragmentation'
ssh truenas 'df -h /mnt/vault'
```
Plan for expansion if >80% full
4. **VM Resource Usage**:
```bash
# Check if VMs need more/less resources
ssh pve 'qm list'
ssh pve 'pvesh get /nodes/pve/status'
```
5. **Network Performance**:
```bash
# Test bandwidth between critical nodes
iperf3 -s # On one host
iperf3 -c 10.10.10.120 # From another
```
6. **Temperature Monitoring**:
```bash
# Check max temps over past quarter
# TODO: Set up Prometheus/Grafana for historical data
ssh pve 'sensors'
ssh pve2 'sensors'
```
### Service Dependency Testing
**Test critical paths**:
1. **Power failure recovery** (if safe to test):
- See [UPS.md](UPS.md) for full procedure
- Verify VM startup order works
- Confirm all services come back online
2. **Failover testing**:
- Tailscale subnet routing (PVE → UCG-Fiber)
- NUT monitoring (PVE server → PVE2 client)
3. **Backup restoration** (when backups implemented):
- Test restoring a VM from backup
- Test restoring files from Syncthing versioning
### Documentation Review
- [ ] Update IP assignments in [IP-ASSIGNMENTS.md](IP-ASSIGNMENTS.md)
- [ ] Review and update service URLs in [SERVICES.md](SERVICES.md)
- [ ] Check for missing hardware specs in [HARDWARE.md](HARDWARE.md)
- [ ] Update any changed procedures in this document
---
## Annual Maintenance
### Hardware Maintenance
**Physical cleaning**:
```bash
# Shut down servers (coordinate with users)
ssh pve 'shutdown -h now'
ssh pve2 'shutdown -h now'
# Clean dust from:
# - CPU heatsinks
# - GPU fans
# - Case fans
# - PSU vents
# - Storage enclosure fans
# Check for:
# - Bulging capacitors on PSU/motherboard
# - Loose cables
# - Fan noise/vibration
```
**Thermal paste inspection** (every 2-3 years):
- Check CPU temps vs baseline
- If temps >85°C under load, consider reapplying paste
- Threadripper PRO: Tctl max safe = 90°C
**See**: [HARDWARE.md](HARDWARE.md) for component details
### UPS Battery Test
**Runtime test**:
```bash
# Check battery health
ssh pve 'upsc cyberpower@localhost | grep battery'
# Perform runtime test (coordinate power loss)
# 1. Note current runtime estimate
# 2. Unplug UPS from wall
# 3. Let battery drain to 20%
# 4. Note actual runtime vs estimate
# 5. Plug back in before shutdown triggers
# Battery replacement if:
# - Runtime < 10 min at typical load
# - Battery age > 3-5 years
# - Battery charge < 100% when on AC for 24h
```
**See**: [UPS.md](UPS.md) for full UPS details
### Drive Replacement Planning
**Check drive age and health**:
```bash
# Get drive hours and health
ssh truenas 'smartctl --scan | while read dev type; do
echo "=== $dev ===";
smartctl -a $dev | grep -E "Model|Serial|Power_On_Hours|Reallocated|Pending";
done'
```
**Replace drives if**:
- Reallocated sectors > 0
- Pending sectors > 0
- SMART pre-fail warnings
- Age > 5 years for HDDs (3-5 years for SSDs/NVMe)
- Hours > 50,000 for consumer drives
**Budget for replacements**:
- HDDs: WD Red 6TB (~$150/drive)
- NVMe: Samsung/Kingston 2TB (~$150-200/drive)
### Capacity Planning
**Review growth trends**:
```bash
# Storage growth (compare to last year)
ssh pve 'zpool list'
ssh truenas 'df -h /mnt/vault'
# Network bandwidth (if monitoring in place)
# Review Grafana dashboards
# Power consumption
ssh pve 'upsc cyberpower@localhost ups.load'
```
**Plan expansions**:
- Storage: Add drives if >70% full
- RAM: Check if VMs hitting limits
- Network: Upgrade if bandwidth saturation
- UPS: Upgrade if load >80%
### License and Subscription Review
**Proxmox subscription** (if applicable):
- Community (free) or Enterprise subscription?
- Check for updates to pricing/features
**Service subscriptions**:
- Domain registration (htsn.io)
- Cloudflare plan (currently free)
- Let's Encrypt (free, no action needed)
---
## Update Schedules
### Proxmox
| Component | Frequency | Notes |
|-----------|-----------|-------|
| Security patches | Weekly | Via `apt upgrade` |
| Minor updates | Monthly | Test on PVE2 first |
| Major versions | Quarterly | Read release notes, plan downtime |
| Kernel updates | Monthly | Requires reboot |
**Update procedure**:
1. Check [Proxmox release notes](https://pve.proxmox.com/wiki/Roadmap)
2. Backup VM configs: `vzdump --all --dumpdir /tmp` (or list specific VMIDs)
3. Update: `apt update && apt dist-upgrade`
4. Reboot if kernel changed: `reboot`
5. Verify VMs auto-started: `qm list`
### Containers (LXC)
| Container | Update Frequency | Package Manager |
|-----------|------------------|-----------------|
| Pi-hole (200) | Weekly | `apt` |
| Traefik (202) | Monthly | `apt` |
| FindShyt (205) | As needed | `apt` |
**Update command**:
```bash
ssh pve 'pct exec CTID -- bash -c "apt update && apt upgrade -y"'
```
### VMs
| VM | Update Frequency | Notes |
|----|------------------|-------|
| TrueNAS | Monthly | Via web UI or `apt` |
| Saltbox | Weekly | Managed by Saltbox updates |
| HomeAssistant | Monthly | Via HA supervisor |
| Docker-host | Weekly | `apt` + Docker images |
| Trading-VM | As needed | Via SSH |
| Gitea-VM | Monthly | Via web UI + `apt` |
**Docker image updates**:
```bash
ssh docker-host 'docker-compose pull && docker-compose up -d'
```
### Firmware Updates
| Component | Check Frequency | Update Method |
|-----------|----------------|---------------|
| Motherboard BIOS | Annually | Manual flash (high risk) |
| GPU firmware | Rarely | `nvidia-smi` or manual |
| SSD/NVMe firmware | Quarterly | Vendor tools |
| HBA firmware | Annually | LSI tools |
| UPS firmware | Annually | PowerPanel or manual |
**⚠️ Warning**: BIOS/firmware updates carry risk. Only update if:
- Critical security issue
- Needed for hardware compatibility
- Fixing known bug affecting you
---
## Testing Checklists
### Pre-Update Checklist
Before ANY system update:
- [ ] Check current system state: `uptime`, `qm list`, `zpool status`
- [ ] Verify backups are current (when backup system in place)
- [ ] Check for critical VMs/services that can't have downtime
- [ ] Review update changelog/release notes
- [ ] Test on non-critical system first (PVE2 or test VM)
- [ ] Plan rollback strategy if update fails
- [ ] Notify users if downtime expected
### Post-Update Checklist
After system update:
- [ ] Verify system booted correctly: `uptime`
- [ ] Check all VMs/CTs started: `qm list`, `pct list`
- [ ] Test critical services:
- [ ] Pi-hole DNS: `nslookup google.com 10.10.10.10`
- [ ] Traefik routing: `curl -I https://plex.htsn.io`
- [ ] NFS/SMB shares: Test mount from VM
- [ ] Syncthing sync: Check all devices connected
- [ ] Review logs for errors: `journalctl -p err -b`
- [ ] Check temperatures: `sensors`
- [ ] Verify UPS monitoring: `upsc cyberpower@localhost`
### Disaster Recovery Test
**Quarterly test** (when backup system in place):
- [ ] Simulate VM failure: Restore from backup
- [ ] Simulate storage failure: Import pool on different system
- [ ] Simulate network failure: Verify Tailscale failover
- [ ] Simulate power failure: Test UPS shutdown procedure (if safe)
- [ ] Document recovery time and issues
---
## Log Rotation
**System logs** are automatically rotated by systemd-journald and logrotate.
**Check log sizes**:
```bash
# Journalctl size
ssh pve 'journalctl --disk-usage'
# Traefik logs
ssh pve 'pct exec 202 -- du -sh /var/log/traefik/'
```
**Configure retention**:
```bash
# Limit journald to 500MB
ssh pve 'echo "SystemMaxUse=500M" >> /etc/systemd/journald.conf'
ssh pve 'systemctl restart systemd-journald'
```
**Traefik log rotation** (already configured):
```bash
# /etc/logrotate.d/traefik on CT 202
/var/log/traefik/*.log {
daily
rotate 7
compress
delaycompress
missingok
notifempty
}
```
---
## Monitoring Integration
**TODO**: Set up automated monitoring for these procedures
**When monitoring is implemented** (see [MONITORING.md](MONITORING.md)):
- ZFS scrub completion/errors
- SMART test failures
- Certificate expiry warnings (<30 days)
- Update availability notifications
- Disk space thresholds (>80%)
- Temperature warnings (>85°C)
---
## Related Documentation
- [MONITORING.md](MONITORING.md) - Automated health checks and alerts
- [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md) - Backup implementation plan
- [UPS.md](UPS.md) - Power failure procedures
- [STORAGE.md](STORAGE.md) - ZFS pool management
- [HARDWARE.md](HARDWARE.md) - Hardware specifications
- [SERVICES.md](SERVICES.md) - Service inventory
---
**Last Updated**: 2025-12-22
**Status**: ⚠️ Manual procedures only - monitoring automation needed

MONITORING.md Normal file

@@ -0,0 +1,546 @@
# Monitoring and Alerting
Documentation for system monitoring, health checks, and alerting across the homelab.
## Current Monitoring Status
| Component | Monitored? | Method | Alerts | Notes |
|-----------|------------|--------|--------|-------|
| **UPS** | ✅ Yes | NUT + Home Assistant | ❌ No | Battery, load, runtime tracked |
| **Syncthing** | ✅ Partial | API (manual checks) | ❌ No | Connection status available |
| **Server temps** | ✅ Partial | Manual checks | ❌ No | Via `sensors` command |
| **VM status** | ✅ Partial | Proxmox UI | ❌ No | Manual monitoring |
| **ZFS health** | ❌ No | Manual `zpool status` | ❌ No | No automated checks |
| **Disk health (SMART)** | ❌ No | Manual `smartctl` | ❌ No | No automated checks |
| **Network** | ❌ No | - | ❌ No | No uptime monitoring |
| **Services** | ❌ No | - | ❌ No | No health checks |
| **Backups** | ❌ No | - | ❌ No | No verification |
**Overall Status**: ⚠️ **MINIMAL** - Most monitoring is manual, no automated alerts
---
## Existing Monitoring
### UPS Monitoring (NUT)
**Status**: ✅ **Active and working**
**What's monitored**:
- Battery charge percentage
- Runtime remaining (seconds)
- Load percentage
- Input/output voltage
- UPS status (OL/OB/LB)
**Access**:
```bash
# Full UPS status
ssh pve 'upsc cyberpower@localhost'
# Key metrics
ssh pve 'upsc cyberpower@localhost | grep -E "battery.charge:|battery.runtime:|ups.load:|ups.status:"'
```
**Home Assistant Integration**:
- Sensors: `sensor.cyberpower_*`
- Can be used for automation/alerts
- Currently: No alerts configured
**See**: [UPS.md](UPS.md)
---
### Syncthing Monitoring
**Status**: ⚠️ **Partial** - API available, no automated monitoring
**What's available**:
- Device connection status
- Folder sync status
- Sync errors
- Bandwidth usage
**Manual Checks**:
```bash
# Check connections (Mac Mini)
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
"http://127.0.0.1:8384/rest/system/connections" | \
python3 -c "import sys,json; d=json.load(sys.stdin)['connections']; \
[print(f\"{v.get('name',k[:7])}: {'UP' if v['connected'] else 'DOWN'}\") for k,v in d.items()]"
# Check folder status
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
"http://127.0.0.1:8384/rest/db/status?folder=documents" | jq
# Check errors
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
"http://127.0.0.1:8384/rest/folder/errors?folder=documents" | jq
```
**Needs**: Automated monitoring script + alerts
**See**: [SYNCTHING.md](SYNCTHING.md)
---
### Temperature Monitoring
**Status**: ⚠️ **Manual only**
**Current Method**:
```bash
# CPU temperature (Threadripper Tctl)
ssh pve 'for f in /sys/class/hwmon/hwmon*/temp*_input; do \
label=$(cat ${f%_input}_label 2>/dev/null); \
if [ "$label" = "Tctl" ]; then echo "PVE Tctl: $(($(cat $f)/1000))°C"; fi; done'
ssh pve2 'for f in /sys/class/hwmon/hwmon*/temp*_input; do \
label=$(cat ${f%_input}_label 2>/dev/null); \
if [ "$label" = "Tctl" ]; then echo "PVE2 Tctl: $(($(cat $f)/1000))°C"; fi; done'
```
**Thresholds**:
- Healthy: 70-80°C under load
- Warning: >85°C
- Critical: >90°C (throttling)
**Needs**: Automated monitoring + alert if >85°C
---
### Proxmox VM Monitoring
**Status**: ⚠️ **Manual only**
**Current Access**:
- Proxmox Web UI: Node → Summary
- CLI: `ssh pve 'qm list'`
**Metrics Available** (via Proxmox):
- CPU usage per VM
- RAM usage per VM
- Disk I/O
- Network I/O
- VM uptime
**Needs**: API-based monitoring + alerts for VM down
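A minimal sketch of such a check (using `qm status`; the VMID list mirrors the critical VMs in [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md), and the mail address is a placeholder):
```bash
#!/bin/bash
# Sketch: alert if any critical VM on PVE is not running
CRITICAL_VMIDS="100 101 110 300"

for vmid in $CRITICAL_VMIDS; do
  status=$(ssh pve "qm status $vmid" 2>/dev/null | awk '{print $2}')
  if [ "$status" != "running" ]; then
    echo "VM $vmid is ${status:-unreachable}" | mail -s "ALERT: VM $vmid not running" hutson@example.com
  fi
done
```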
---
## Recommended Monitoring Stack
### Option 1: Prometheus + Grafana (Recommended)
**Why**:
- Industry standard
- Extensive integrations
- Beautiful dashboards
- Flexible alerting
**Architecture**:
```
Grafana (dashboard) → Prometheus (metrics DB) → Exporters (data collection)
Alertmanager (alerts)
```
**Required Exporters**:
| Exporter | Monitors | Install On |
|----------|----------|------------|
| node_exporter | CPU, RAM, disk, network | PVE, PVE2, TrueNAS, all VMs |
| zfs_exporter | ZFS pool health | PVE, PVE2, TrueNAS |
| smartmon_exporter | Drive SMART data | PVE, PVE2, TrueNAS |
| nut_exporter | UPS metrics | PVE |
| proxmox_exporter | VM/CT stats | PVE, PVE2 |
| cadvisor | Docker containers | Saltbox, docker-host |
**Deployment**:
```bash
# Create monitoring VM
ssh pve 'qm create 210 --name monitoring --memory 4096 --cores 2 \
--net0 virtio,bridge=vmbr0'
# Install Prometheus + Grafana (via Docker)
# /opt/monitoring/docker-compose.yml
```
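A minimal compose sketch for that VM (images, paths, and scrape targets are illustrative; add the exporters from the table above as they are deployed):
```bash
# On the monitoring VM (hypothetical layout under /opt/monitoring)
mkdir -p /opt/monitoring && cd /opt/monitoring

# Minimal Prometheus scrape config (node_exporter default port is 9100)
cat > prometheus.yml << 'EOF'
global:
  scrape_interval: 30s
scrape_configs:
  - job_name: node
    static_configs:
      - targets: ['10.10.10.120:9100', '10.10.10.102:9100']
EOF

cat > docker-compose.yml << 'EOF'
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prom-data:/prometheus
    restart: unless-stopped
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana
    restart: unless-stopped
volumes:
  prom-data:
  grafana-data:
EOF

docker compose up -d
```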
**Estimated Setup Time**: 4-6 hours
---
### Option 2: Uptime Kuma (Simpler Alternative)
**Why**:
- Lightweight
- Easy to set up
- Web-based dashboard
- Built-in alerts (email, Slack, etc.)
**What it monitors**:
- HTTP/HTTPS endpoints
- Ping (ICMP)
- Ports (TCP)
- Docker containers
**Deployment**:
```bash
ssh docker-host 'mkdir -p /opt/uptime-kuma'
# Create the compose file on docker-host and start the container
ssh docker-host 'cat > /opt/uptime-kuma/docker-compose.yml << "EOF"
version: "3.8"
services:
  uptime-kuma:
    image: louislam/uptime-kuma:latest
    ports:
      - "3001:3001"
    volumes:
      - ./data:/app/data
    restart: unless-stopped
EOF'
ssh docker-host 'cd /opt/uptime-kuma && docker compose up -d'
# Access: http://10.10.10.206:3001
# Add Traefik config for uptime.htsn.io
```
**Estimated Setup Time**: 1-2 hours
---
### Option 3: Netdata (Real-time Monitoring)
**Why**:
- Real-time metrics (1-second granularity)
- Auto-discovers services
- Low overhead
- Beautiful web UI
**Deployment**:
```bash
# Install on each server
ssh pve 'bash <(curl -Ss https://my-netdata.io/kickstart.sh)'
ssh pve2 'bash <(curl -Ss https://my-netdata.io/kickstart.sh)'
# Access:
# http://10.10.10.120:19999 (PVE)
# http://10.10.10.102:19999 (PVE2)
```
**Parent-Child Setup** (optional):
- Configure PVE as parent
- Stream metrics from PVE2 → PVE
- Single dashboard for both servers
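A sketch of the streaming config (the API key is a placeholder UUID; config paths may differ depending on how the kickstart script installed Netdata):
```bash
API_KEY="11111111-2222-3333-4444-555555555555"

# On PVE2 (child): stream metrics to PVE
ssh pve2 "cat > /etc/netdata/stream.conf << EOF
[stream]
    enabled = yes
    destination = 10.10.10.120:19999
    api key = $API_KEY
EOF
systemctl restart netdata"

# On PVE (parent): accept streams for that API key
ssh pve "cat > /etc/netdata/stream.conf << EOF
[$API_KEY]
    enabled = yes
EOF
systemctl restart netdata"
```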
**Estimated Setup Time**: 1 hour
---
## Critical Metrics to Monitor
### Server Health
| Metric | Threshold | Action |
|--------|-----------|--------|
| **CPU usage** | >90% for 5 min | Alert |
| **CPU temp** | >85°C | Alert |
| **CPU temp** | >90°C | Critical alert |
| **RAM usage** | >95% | Alert |
| **Disk space** | >80% | Warning |
| **Disk space** | >90% | Alert |
| **Load average** | >CPU count | Alert |
### Storage Health
| Metric | Threshold | Action |
|--------|-----------|--------|
| **ZFS pool errors** | >0 | Alert immediately |
| **ZFS pool degraded** | Any degraded vdev | Critical alert |
| **ZFS scrub failed** | Last scrub error | Alert |
| **SMART reallocated sectors** | >0 | Warning |
| **SMART pending sectors** | >0 | Alert |
| **SMART failure** | Pre-fail | Critical - replace drive |
### UPS
| Metric | Threshold | Action |
|--------|-----------|--------|
| **Battery charge** | <20% | Warning |
| **Battery charge** | <10% | Alert |
| **On battery** | >5 min | Alert |
| **Runtime** | <5 min | Critical |
### Network
| Metric | Threshold | Action |
|--------|-----------|--------|
| **Device unreachable** | >2 min down | Alert |
| **High packet loss** | >5% | Warning |
| **Bandwidth saturation** | >90% | Warning |
### VMs/Services
| Metric | Threshold | Action |
|--------|-----------|--------|
| **VM stopped** | Critical VM down | Alert immediately |
| **Service unreachable** | HTTP 5xx or timeout | Alert |
| **Backup failed** | Any backup failure | Alert |
| **Certificate expiry** | <30 days | Warning |
| **Certificate expiry** | <7 days | Alert |
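With the Prometheus stack from Option 1, these thresholds translate into alert rules; one example for the disk-space rows (standard node_exporter metric names; reference the file under `rule_files:` in `prometheus.yml`):
```bash
cat > /opt/monitoring/alert-rules.yml << 'EOF'
groups:
  - name: homelab
    rules:
      - alert: DiskSpaceHigh
        expr: (1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 > 80
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Filesystem on {{ $labels.instance }} is over 80% full"
EOF
```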
---
## Alert Destinations
### Email Alerts
**Recommended**: Set up SMTP relay for email alerts
**Options**:
1. Gmail SMTP (free, rate-limited)
2. SendGrid (free tier: 100 emails/day)
3. Mailgun (free tier available)
4. Self-hosted mail server (complex)
**Configuration Example** (Prometheus Alertmanager):
```yaml
# /etc/alertmanager/alertmanager.yml
receivers:
  - name: 'email'
    email_configs:
      - to: 'hutson@example.com'
        from: 'alerts@htsn.io'
        smarthost: 'smtp.gmail.com:587'
        auth_username: 'alerts@htsn.io'
        auth_password: 'app-password-here'
```
---
### Push Notifications
**Options**:
- **Pushover**: $5 one-time, reliable
- **Pushbullet**: Free tier available
- **Telegram Bot**: Free
- **Discord Webhook**: Free
- **Slack**: Free tier available
**Recommended**: Pushover or Telegram for mobile alerts
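For example, a Telegram alert is a single HTTP call once a bot is created via @BotFather (token and chat ID below are placeholders):
```bash
TELEGRAM_TOKEN="123456789:replace-with-bot-token"
TELEGRAM_CHAT_ID="123456789"

curl -s "https://api.telegram.org/bot${TELEGRAM_TOKEN}/sendMessage" \
  -d chat_id="${TELEGRAM_CHAT_ID}" \
  -d text="⚠️ Homelab alert: test message"
```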
---
### Home Assistant Alerts
Since Home Assistant is already running, use it for alerts:
**Automation Example**:
```yaml
automation:
  - alias: "UPS Low Battery Alert"
    trigger:
      - platform: numeric_state
        entity_id: sensor.cyberpower_battery_charge
        below: 20
    action:
      - service: notify.mobile_app
        data:
          message: "⚠️ UPS battery at {{ states('sensor.cyberpower_battery_charge') }}%"
  - alias: "Server High Temperature"
    trigger:
      - platform: template
        value_template: "{{ states('sensor.pve_cpu_temp') | float(0) > 85 }}"
    action:
      - service: notify.mobile_app
        data:
          message: "🔥 PVE CPU temperature: {{ states('sensor.pve_cpu_temp') }}°C"
```
**Needs**: Sensors for CPU temp, disk space, etc. in Home Assistant
---
## Monitoring Scripts
### Daily Health Check
Save as `~/bin/homelab-health-check.sh`:
```bash
#!/bin/bash
# Daily homelab health check
echo "=== Homelab Health Check ==="
echo "Date: $(date)"
echo ""
echo "=== Server Status ==="
ssh pve 'uptime' 2>/dev/null || echo "PVE: UNREACHABLE"
ssh pve2 'uptime' 2>/dev/null || echo "PVE2: UNREACHABLE"
echo ""
echo "=== CPU Temperatures ==="
ssh pve 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE: $(($(cat $f)/1000))°C"; fi; done'
ssh pve2 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE2: $(($(cat $f)/1000))°C"; fi; done'
echo ""
echo "=== UPS Status ==="
ssh pve 'upsc cyberpower@localhost | grep -E "battery.charge:|battery.runtime:|ups.load:|ups.status:"'
echo ""
echo "=== ZFS Pools ==="
ssh pve 'zpool status -x' 2>/dev/null
ssh pve2 'zpool status -x' 2>/dev/null
ssh truenas 'zpool status -x vault'
echo ""
echo "=== Disk Space ==="
ssh pve 'df -h | grep -E "Filesystem|/dev/(nvme|sd)"'
ssh truenas 'df -h /mnt/vault'
echo ""
echo "=== VM Status ==="
ssh pve 'qm list | grep running | wc -l' | xargs echo "PVE VMs running:"
ssh pve2 'qm list | grep running | wc -l' | xargs echo "PVE2 VMs running:"
echo ""
echo "=== Syncthing Connections ==="
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
"http://127.0.0.1:8384/rest/system/connections" | \
python3 -c "import sys,json; d=json.load(sys.stdin)['connections']; \
[print(f\"{v.get('name',k[:7])}: {'UP' if v['connected'] else 'DOWN'}\") for k,v in d.items()]"
echo ""
echo "=== Check Complete ==="
```
**Run daily**:
```cron
0 9 * * * ~/bin/homelab-health-check.sh | mail -s "Homelab Health Check" hutson@example.com
```
---
### ZFS Scrub Checker
```bash
#!/bin/bash
# Check last ZFS scrub status
echo "=== ZFS Scrub Status ==="
for host in pve pve2; do
echo "--- $host ---"
ssh $host 'zpool status | grep -A1 scrub'
echo ""
done
echo "--- TrueNAS ---"
ssh truenas 'zpool status vault | grep -A1 scrub'
```
---
### SMART Health Checker
```bash
#!/bin/bash
# Check SMART health on all drives
echo "=== SMART Health Check ==="
echo "--- TrueNAS Drives ---"
ssh truenas 'smartctl --scan | while read dev type; do
echo "=== $dev ===";
smartctl -H $dev | grep -E "SMART overall|PASSED|FAILED";
done'
echo "--- PVE Drives ---"
ssh pve 'for dev in /dev/nvme* /dev/sd*; do
[ -e "$dev" ] && echo "=== $dev ===" && smartctl -H $dev | grep -E "SMART|PASSED|FAILED";
done'
```
---
## Dashboard Recommendations
### Grafana Dashboard Layout
**Page 1: Overview**
- Server uptime
- CPU usage (all servers)
- RAM usage (all servers)
- Disk space (all pools)
- Network traffic
- UPS status
**Page 2: Storage**
- ZFS pool health
- SMART status for all drives
- I/O latency
- Scrub progress
- Disk temperatures
**Page 3: VMs**
- VM status (up/down)
- VM resource usage
- VM disk I/O
- VM network traffic
**Page 4: Services**
- Service health checks
- HTTP response times
- Certificate expiry dates
- Syncthing sync status
---
## Implementation Plan
### Phase 1: Basic Monitoring (Week 1)
- [ ] Install Uptime Kuma or Netdata
- [ ] Add HTTP checks for all services
- [ ] Configure UPS alerts in Home Assistant
- [ ] Set up daily health check email
**Estimated Time**: 4-6 hours
---
### Phase 2: Advanced Monitoring (Week 2-3)
- [ ] Install Prometheus + Grafana
- [ ] Deploy node_exporter on all servers
- [ ] Deploy zfs_exporter
- [ ] Deploy smartmon_exporter
- [ ] Create Grafana dashboards
**Estimated Time**: 8-12 hours
---
### Phase 3: Alerting (Week 4)
- [ ] Configure Alertmanager
- [ ] Set up email/push notifications
- [ ] Create alert rules for all critical metrics
- [ ] Test all alert paths
- [ ] Document alert procedures
**Estimated Time**: 4-6 hours
---
## Related Documentation
- [UPS.md](UPS.md) - UPS monitoring details
- [STORAGE.md](STORAGE.md) - ZFS health checks
- [SERVICES.md](SERVICES.md) - Service inventory
- [HOMEASSISTANT.md](HOMEASSISTANT.md) - Home Assistant automations
- [MAINTENANCE.md](MAINTENANCE.md) - Regular maintenance checks
---
**Last Updated**: 2025-12-22
**Status**: ⚠️ **Minimal monitoring currently in place - implementation needed**

POWER-MANAGEMENT.md Normal file

@@ -0,0 +1,509 @@
# Power Management and Optimization
Documentation of power optimizations applied to reduce idle power consumption and heat generation.
## Overview
Combined estimated power draw: **~500-700W idle**, **~1000-1350W under typical load** (theoretical peak is higher; see the per-server tables below)
Through various optimizations, we've reduced idle power consumption by approximately **150-250W** compared to default settings.
---
## Power Draw Estimates
### PVE (10.10.10.120)
| Component | Idle | Load | TDP |
|-----------|------|------|-----|
| Threadripper PRO 3975WX | 150-200W | 400-500W | 280W |
| NVIDIA TITAN RTX | 2-3W | 250W | 280W |
| NVIDIA Quadro P2000 | 25W | 70W | 75W |
| RAM (128 GB DDR4) | 30-40W | 30-40W | - |
| Storage (NVMe + SSD) | 20-30W | 40-50W | - |
| HBAs, fans, misc | 20-30W | 20-30W | - |
| **Total** | **250-350W** | **800-940W** | - |
### PVE2 (10.10.10.102)
| Component | Idle | Load | TDP |
|-----------|------|------|-----|
| Threadripper PRO 3975WX | 150-200W | 400-500W | 280W |
| NVIDIA RTX A6000 | 11W | 280W | 300W |
| RAM (128 GB DDR4) | 30-40W | 30-40W | - |
| Storage (NVMe + HDD) | 20-30W | 40-50W | - |
| Fans, misc | 15-20W | 15-20W | - |
| **Total** | **226-330W** | **765-890W** | - |
### Combined
| Metric | Idle | Load |
|--------|------|------|
| Servers | 476-680W | 1565-1830W |
| Network gear | ~50W | ~50W |
| **Total** | **~530-730W** | **~1615-1880W** |
| **UPS Load** | 40-55% | 120-140% ⚠️ |
**Note**: UPS capacity is 1320W. Under heavy load, servers can exceed UPS capacity, which is acceptable since high load is rare.
---
## Optimizations Applied
### 1. KSMD Disabled (2024-12-17)
**KSM** (Kernel Same-page Merging) scans memory to deduplicate identical pages across VMs.
**Problem**:
- KSMD was consuming 44-57% CPU continuously on PVE
- Caused CPU temp to rise from 74°C to 83°C
- **Net loss**: more power was spent scanning than was saved through deduplication
**Solution**: Disabled KSM permanently
**Configuration**:
**Systemd service**: `/etc/systemd/system/disable-ksm.service`
```ini
[Unit]
Description=Disable KSM (Kernel Same-page Merging)
After=multi-user.target
[Service]
Type=oneshot
ExecStart=/bin/sh -c 'echo 0 > /sys/kernel/mm/ksm/run'
RemainAfterExit=yes
[Install]
WantedBy=multi-user.target
```
**Enable and start**:
```bash
systemctl daemon-reload
systemctl enable --now disable-ksm
systemctl mask ksmtuned # Prevent re-enabling
```
**Verify**:
```bash
# KSM should be disabled (run=0)
cat /sys/kernel/mm/ksm/run # Should output: 0
# ksmd should show 0% CPU
ps aux | grep ksmd
```
**Savings**: ~60-80W, plus avoiding the KSM-induced temperature rise (74°C → 83°C)
**⚠️ Important**: Proxmox updates sometimes re-enable KSM. If CPU is unexpectedly hot, check:
```bash
cat /sys/kernel/mm/ksm/run
# If 1, disable it:
echo 0 > /sys/kernel/mm/ksm/run
systemctl mask ksmtuned
```
---
### 2. CPU Governor Optimization (2024-12-16)
Default CPU governor keeps cores at max frequency even when idle, wasting power.
#### PVE: `amd-pstate-epp` Driver
**Driver**: `amd-pstate-epp` (modern AMD P-state driver)
**Governor**: `powersave`
**EPP**: `balance_power`
**Configuration**:
**Systemd service**: `/etc/systemd/system/cpu-powersave.service`
```ini
[Unit]
Description=Set CPU governor to powersave with balance_power EPP
After=multi-user.target
[Service]
Type=oneshot
ExecStart=/bin/sh -c 'for cpu in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo powersave > $cpu; done'
ExecStart=/bin/sh -c 'for cpu in /sys/devices/system/cpu/cpu*/cpufreq/energy_performance_preference; do echo balance_power > $cpu; done'
RemainAfterExit=yes
[Install]
WantedBy=multi-user.target
```
**Enable**:
```bash
systemctl daemon-reload
systemctl enable --now cpu-powersave
```
**Verify**:
```bash
# Check governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# Output: powersave
# Check EPP
cat /sys/devices/system/cpu/cpu0/cpufreq/energy_performance_preference
# Output: balance_power
# Check current frequency (should be low when idle)
grep MHz /proc/cpuinfo | head -5
# Should show ~1700-2200 MHz idle, up to 4000 MHz under load
```
#### PVE2: `acpi-cpufreq` Driver
**Driver**: `acpi-cpufreq` (older ACPI driver)
**Governor**: `schedutil` (adaptive, better than powersave for this driver)
**Configuration**:
**Systemd service**: `/etc/systemd/system/cpu-powersave.service`
```ini
[Unit]
Description=Set CPU governor to schedutil
After=multi-user.target
[Service]
Type=oneshot
ExecStart=/bin/sh -c 'for cpu in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo schedutil > $cpu; done'
RemainAfterExit=yes
[Install]
WantedBy=multi-user.target
```
**Enable**:
```bash
systemctl daemon-reload
systemctl enable --now cpu-powersave
```
**Verify**:
```bash
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# Output: schedutil
grep MHz /proc/cpuinfo | head -5
# Should show ~1700-2200 MHz idle
```
**Savings**: ~60-120W combined (CPUs now idle at 1.7-2.2 GHz instead of 4 GHz)
**Performance impact**: Minimal - CPU still boosts to max frequency under load
---
### 3. GPU Power States (2024-12-16)
GPUs automatically enter low-power states when idle. Verified optimal.
| GPU | Location | Idle Power | P-State | Notes |
|-----|----------|------------|---------|-------|
| RTX A6000 | PVE2 | 11W | P8 | Excellent idle power |
| TITAN RTX | PVE | 2-3W | P8 | Excellent idle power |
| Quadro P2000 | PVE | 25W | P0 | Plex keeps it active |
**Check GPU power state**:
```bash
# Via nvidia-smi (if installed in VM)
ssh lmdev1 'nvidia-smi --query-gpu=name,power.draw,pstate --format=csv'
# Expected output:
# name, power.draw [W], pstate
# NVIDIA TITAN RTX, 2.50 W, P8
# Via lspci (from the Proxmox host - only confirms the GPUs are present; use nvidia-smi for power)
ssh pve 'lspci | grep -i nvidia'
```
**P-States**:
- **P0**: Maximum performance
- **P8**: Minimum power (idle)
**No action needed** - GPUs automatically manage power states.
**Savings**: N/A (already optimal)
---
### 4. Syncthing Rescan Intervals (2024-12-16)
Aggressive 60-second rescans were keeping TrueNAS VM at 86% CPU constantly.
**Changed**:
- Large folders: 60s → **3600s** (1 hour)
- Affected: downloads (38GB), documents (11GB), desktop (7.2GB), movies, pictures, notes, config
**Configuration**: Via Syncthing UI on each device
- Settings → Folders → [Folder Name] → Advanced → Rescan Interval
**Savings**: ~60-80W (TrueNAS CPU usage dropped from 86% to <10%)
**Trade-off**: Changes take up to 1 hour to detect instead of 1 minute
- Still acceptable for most use cases
- Manual rescan available if needed: `curl -X POST "http://localhost:8384/rest/db/scan?folder=FOLDER" -H "X-API-Key: API_KEY"`
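If you prefer scripting the interval change over clicking through the UI, the folder config can also be updated through the same REST API (a hedged sketch — assumes a Syncthing version with the `/rest/config/folders` endpoints; `FOLDER_ID` and `API_KEY` are placeholders, same as the manual-rescan example above):
```bash
# Set a folder's rescan interval to 1 hour via the Syncthing config API
curl -X PATCH "http://localhost:8384/rest/config/folders/FOLDER_ID" \
  -H "X-API-Key: API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"rescanIntervalS": 3600}'
```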
---
### 5. ksmtuned Disabled (2024-12-16)
**ksmtuned** is the daemon that tunes KSM parameters. Even with KSM disabled, the tuning daemon was still running.
**Solution**: Stopped and disabled on both servers
```bash
systemctl stop ksmtuned
systemctl disable ksmtuned
systemctl mask ksmtuned # Prevent re-enabling
```
**Savings**: ~2-5W
---
### 6. HDD Spindown on PVE2 (2024-12-16)
**Problem**: `local-zfs2` pool (2x WD Red 6TB HDD) had only 768 KB used but drives spinning 24/7
**Solution**: Configure 30-minute spindown timeout
**Udev rule**: `/etc/udev/rules.d/69-hdd-spindown.rules`
```udev
# Spin down WD Red 6TB drives after 30 minutes idle
ACTION=="add|change", KERNEL=="sd[a-z]", ATTRS{model}=="WDC WD60EFRX-68L*", RUN+="/sbin/hdparm -S 241 /dev/%k"
```
**hdparm value**: 241 = 30 minutes
- Values 1-240: `value * 5 seconds = timeout` (e.g. 120 = 10 minutes)
- Values 241-251: `(value - 240) * 30 minutes = timeout`, so 241 = 30 minutes, 242 = 1 hour
**Apply rule**:
```bash
udevadm control --reload-rules
udevadm trigger
# Verify drives have spindown set
hdparm -I /dev/sda | grep -i standby
hdparm -I /dev/sdb | grep -i standby
```
**Check if drives are spun down**:
```bash
hdparm -C /dev/sda
# Output: drive state is: standby (spun down)
# or: drive state is: active/idle (spinning)
```
**Savings**: ~10-16W when spun down (8W per drive)
**Trade-off**: 5-10 second delay when accessing pool after spindown
---
## Potential Optimizations (Not Yet Applied)
### PCIe ASPM (Active State Power Management)
**Benefit**: Reduce power of idle PCIe devices
**Risk**: May cause stability issues with some devices
**Estimated savings**: 5-15W
**Test**:
```bash
# Check current ASPM state
lspci -vv | grep -i aspm
# Enable ASPM (test first)
# Add to kernel cmdline: pcie_aspm=force
# Edit /etc/default/grub:
GRUB_CMDLINE_LINUX_DEFAULT="quiet pcie_aspm=force"
# Update grub
update-grub
reboot
```
### NMI Watchdog Disable
**Benefit**: Reduce CPU wakeups
**Risk**: Harder to debug kernel hangs
**Estimated savings**: 1-3W
**Test**:
```bash
# Disable NMI watchdog
echo 0 > /proc/sys/kernel/nmi_watchdog
# Make permanent (add to kernel cmdline)
# Edit /etc/default/grub:
GRUB_CMDLINE_LINUX_DEFAULT="quiet nmi_watchdog=0"
update-grub
reboot
```
---
## Monitoring
### CPU Frequency
```bash
# Current frequency on all cores
ssh pve 'grep MHz /proc/cpuinfo | head -10'
# Governor
ssh pve 'cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor'
# Available governors
ssh pve 'cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors'
```
### CPU Temperature
```bash
# PVE
ssh pve 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE Tctl: $(($(cat $f)/1000))°C"; fi; done'
# PVE2
ssh pve2 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE2 Tctl: $(($(cat $f)/1000))°C"; fi; done'
```
**Healthy temps**: 70-80°C under load
**Warning**: >85°C
**Throttle**: 90°C (Tctl max for Threadripper PRO)
### GPU Power Draw
```bash
# If nvidia-smi installed in VM
ssh lmdev1 'nvidia-smi --query-gpu=name,power.draw,power.limit,pstate --format=csv'
# Sample output:
# name, power.draw [W], power.limit [W], pstate
# NVIDIA TITAN RTX, 2.50 W, 280.00 W, P8
```
### Power Consumption (UPS)
```bash
# Check UPS load percentage
ssh pve 'upsc cyberpower@localhost ups.load'
# Battery runtime (seconds)
ssh pve 'upsc cyberpower@localhost battery.runtime'
# Full UPS status
ssh pve 'upsc cyberpower@localhost'
```
See [UPS.md](UPS.md) for more UPS monitoring details.
### ZFS ARC Memory Usage
```bash
# PVE
ssh pve 'arc_summary | grep -A5 "ARC size"'
# TrueNAS
ssh truenas 'arc_summary | grep -A5 "ARC size"'
```
**ARC** (Adaptive Replacement Cache) uses RAM for ZFS caching. Adjust if needed:
```bash
# Limit ARC to 32 GB (example)
# Edit /etc/modprobe.d/zfs.conf:
options zfs zfs_arc_max=34359738368
# Apply (reboot required)
update-initramfs -u
reboot
```
---
## Troubleshooting
### CPU Not Downclocking
```bash
# Check current governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# Should be: powersave (PVE) or schedutil (PVE2)
# If not, systemd service may have failed
# Check service status
systemctl status cpu-powersave
# Manually set governor (temporary)
echo powersave | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
# Check frequency
grep MHz /proc/cpuinfo | head -5
```
### High Idle Power After Update
**Common causes**:
1. **KSM re-enabled** after Proxmox update
- Check: `cat /sys/kernel/mm/ksm/run`
- Fix: `echo 0 > /sys/kernel/mm/ksm/run && systemctl mask ksmtuned`
2. **CPU governor reset** to default
- Check: `cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor`
- Fix: `systemctl restart cpu-powersave`
3. **GPU stuck in high-performance mode**
- Check: `nvidia-smi --query-gpu=pstate --format=csv`
- Fix: Restart VM or power cycle GPU
### HDDs Won't Spin Down
```bash
# Check spindown setting
hdparm -I /dev/sda | grep -i standby
# Set spindown manually (temporary)
hdparm -S 241 /dev/sda
# Check if drive is idle (ZFS may keep it active)
zpool iostat -v 1 5 # Watch for activity
# Check what's accessing the drive
lsof | grep /mnt/pool
```
---
## Power Optimization Summary
| Optimization | Savings | Applied | Notes |
|--------------|---------|---------|-------|
| **KSMD disabled** | 60-80W | ✅ | Also reduces CPU temp significantly |
| **CPU governor** | 60-120W | ✅ | PVE: powersave+balance_power, PVE2: schedutil |
| **GPU power states** | 0W | ✅ | Already optimal (automatic) |
| **Syncthing rescans** | 60-80W | ✅ | Reduced TrueNAS CPU usage |
| **ksmtuned disabled** | 2-5W | ✅ | Minor but easy win |
| **HDD spindown** | 10-16W | ✅ | Only when drives idle |
| PCIe ASPM | 5-15W | ❌ | Not yet tested |
| NMI watchdog | 1-3W | ❌ | Not yet tested |
| **Total savings** | **~150-300W** | - | Significant reduction |
---
## Related Documentation
- [UPS.md](UPS.md) - UPS capacity and power monitoring
- [STORAGE.md](STORAGE.md) - HDD spindown configuration
- [VMS.md](VMS.md) - VM resource allocation
---
**Last Updated**: 2025-12-22

148
README.md Normal file
View File

@@ -0,0 +1,148 @@
# Homelab Documentation
Documentation for Hutson's home infrastructure - two Proxmox servers running VMs and containers for home automation, media, development, and AI workloads.
## 🚀 Quick Start
**New to this homelab?** Start here:
1. [CLAUDE.md](CLAUDE.md) - Quick reference guide for common tasks
2. [SSH-ACCESS.md](SSH-ACCESS.md) - How to connect to all systems
3. [IP-ASSIGNMENTS.md](IP-ASSIGNMENTS.md) - What's at what IP address
4. [SERVICES.md](SERVICES.md) - What services are running
**Claude Code Session?** Read [CLAUDE.md](CLAUDE.md) first - it's your command center.
## 📚 Documentation Index
### Infrastructure
| Document | Description |
|----------|-------------|
| [VMS.md](VMS.md) | Complete VM/LXC inventory, specs, GPU passthrough |
| [HARDWARE.md](HARDWARE.md) | Server specs, GPUs, network cards, HBAs |
| [STORAGE.md](STORAGE.md) | ZFS pools, NFS/SMB shares, capacity planning |
| [NETWORK.md](NETWORK.md) | Bridges, VLANs, MTU config, Tailscale VPN |
| [POWER-MANAGEMENT.md](POWER-MANAGEMENT.md) | CPU governors, GPU power states, optimizations |
| [UPS.md](UPS.md) | UPS configuration, NUT monitoring, power failure handling |
### Services & Applications
| Document | Description |
|----------|-------------|
| [SERVICES.md](SERVICES.md) | Complete service inventory with URLs and credentials |
| [TRAEFIK.md](TRAEFIK.md) | Reverse proxy setup, adding services, SSL certificates |
| [HOMEASSISTANT.md](HOMEASSISTANT.md) | Home Assistant API, automations, integrations |
| [SYNCTHING.md](SYNCTHING.md) | File sync across all devices, API access, troubleshooting |
| [SALTBOX.md](#) | Media automation stack (Plex, *arr apps) (coming soon) |
### Access & Security
| Document | Description |
|----------|-------------|
| [SSH-ACCESS.md](SSH-ACCESS.md) | SSH keys, host aliases, password auth, QEMU agent |
| [IP-ASSIGNMENTS.md](IP-ASSIGNMENTS.md) | Complete IP address assignments for all devices |
| [SECURITY.md](#) | Firewall, access control, certificates (coming soon) |
### Operations
| Document | Description |
|----------|-------------|
| [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md) | 🚨 Backup strategy, disaster recovery (CRITICAL) |
| [MAINTENANCE.md](MAINTENANCE.md) | Regular procedures, update schedules, testing checklists |
| [MONITORING.md](MONITORING.md) | Health monitoring, alerts, dashboard recommendations |
| [DISASTER-RECOVERY.md](#) | Recovery procedures (coming soon) |
### Reference
| Document | Description |
|----------|-------------|
| [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md) | Storage enclosure SES commands, LCC troubleshooting |
| [SHELL-ALIASES.md](SHELL-ALIASES.md) | ZSH aliases for Claude Code sessions |
## 🖥️ System Overview
### Servers
- **PVE** (10.10.10.120) - Primary Proxmox server
- AMD Threadripper PRO 3975WX (32-core)
- 128 GB RAM
- NVIDIA Quadro P2000 + TITAN RTX
- **PVE2** (10.10.10.102) - Secondary Proxmox server
- AMD Threadripper PRO 3975WX (32-core)
- 128 GB RAM
- NVIDIA RTX A6000
### Key Services
| Service | Location | URL |
|---------|----------|-----|
| **Proxmox** | PVE | https://pve.htsn.io |
| **TrueNAS** | VM 100 | https://truenas.htsn.io |
| **Plex** | Saltbox VM | https://plex.htsn.io |
| **Home Assistant** | VM 110 | https://homeassistant.htsn.io |
| **Gitea** | VM 300 | https://git.htsn.io |
| **Pi-hole** | CT 200 | http://10.10.10.10/admin |
| **Traefik** | CT 202 | http://10.10.10.250:8080 |
[See IP-ASSIGNMENTS.md for complete list](IP-ASSIGNMENTS.md)
## 🔥 Emergency Procedures
### Power Failure
1. UPS provides ~15 min runtime at typical load
2. At 2 min remaining, NUT triggers graceful VM shutdown
3. When power returns, servers auto-boot and start VMs in order
See [UPS.md](UPS.md) for details.
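Quick status check during an outage (same `upsc` commands documented in UPS.md):
```bash
# UPS status and estimated runtime remaining (seconds)
ssh pve 'upsc cyberpower@localhost ups.status'
ssh pve 'upsc cyberpower@localhost battery.runtime'
```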
### Service Down
```bash
# Quick health check (run from Mac Mini)
ssh pve 'qm list' # Check VMs on PVE
ssh pve2 'qm list' # Check VMs on PVE2
ssh pve 'pct list' # Check containers
# Syncthing status
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
"http://127.0.0.1:8384/rest/system/connections"
# Restart a VM
ssh pve 'qm stop VMID && qm start VMID'
```
See [CLAUDE.md](CLAUDE.md) for complete troubleshooting runbooks.
## 📞 Getting Help
**Claude Code Assistant**: Start a session in this directory - all context is available in CLAUDE.md
**Key Contacts**:
- Homelab Owner: Hutson
- Git Repo: https://git.htsn.io/hutson/homelab-docs
- Local Path: `~/Projects/homelab`
## 🔄 Recent Changes
See [CHANGELOG.md](#) (coming soon) or the Changelog section in [CLAUDE.md](CLAUDE.md).
## 📝 Contributing
When updating docs:
1. Keep CLAUDE.md as quick reference only
2. Move detailed content to specialized docs
3. Update cross-references
4. Test all commands before committing
5. Add entries to changelog
```bash
cd ~/Projects/homelab
git add -A
git commit -m "Update documentation: <description>"
git push
```
---
**Last Updated**: 2025-12-22

591
SERVICES.md Normal file
View File

@@ -0,0 +1,591 @@
# Services Inventory
Complete inventory of all services running across the homelab infrastructure.
## Overview
| Category | Services | Location | Access |
|----------|----------|----------|--------|
| **Infrastructure** | Proxmox, TrueNAS, Pi-hole, Traefik | VMs/CTs | Web UI + SSH |
| **Media** | Plex, *arr apps, downloaders | Saltbox VM | Web UI |
| **Development** | Gitea, Docker services | VMs | Web UI |
| **Home Automation** | Home Assistant, Happy Coder | VMs | Web UI + API |
| **Monitoring** | UPS (NUT), Syncthing, Pulse | Various | API |
**Total Services**: 25+ running services
---
## Service URLs Quick Reference
| Service | URL | Authentication | Purpose |
|---------|-----|----------------|---------|
| **Proxmox** | https://pve.htsn.io:8006 | Username + 2FA | VM management |
| **TrueNAS** | https://truenas.htsn.io | Username/password | NAS management |
| **Plex** | https://plex.htsn.io | Plex account | Media streaming |
| **Home Assistant** | https://homeassistant.htsn.io | Username/password | Home automation |
| **Gitea** | https://git.htsn.io | Username/password | Git repositories |
| **Excalidraw** | https://excalidraw.htsn.io | None (public) | Whiteboard |
| **Happy Coder** | https://happy.htsn.io | QR code auth | Remote Claude sessions |
| **Pi-hole** | http://10.10.10.10/admin | Password | DNS/ad blocking |
| **Traefik** | http://10.10.10.250:8080 | None (internal) | Reverse proxy dashboard |
| **Pulse** | https://pulse.htsn.io | Unknown | Monitoring dashboard |
| **Copyparty** | https://copyparty.htsn.io | Unknown | File sharing |
| **FindShyt** | https://findshyt.htsn.io | Unknown | Custom app |
---
## Infrastructure Services
### Proxmox VE (PVE & PVE2)
**Purpose**: Virtualization platform, VM/CT host
**Location**: Physical servers (10.10.10.120, 10.10.10.102)
**Access**: https://pve.htsn.io:8006, SSH
**Version**: Unknown (check: `pveversion`)
**Key Features**:
- Web-based management
- VM and LXC container support
- ZFS storage pools
- Clustering (2-node)
- API access
**Common Operations**:
```bash
# List VMs
ssh pve 'qm list'
# Create VM
ssh pve 'qm create VMID --name myvm ...'
# Backup VM
ssh pve 'vzdump VMID --dumpdir /var/lib/vz/dump'
```
**See**: [VMS.md](VMS.md)
---
### TrueNAS SCALE (VM 100)
**Purpose**: Central file storage, NFS/SMB shares
**Location**: VM on PVE (10.10.10.200)
**Access**: https://truenas.htsn.io, SSH
**Version**: TrueNAS SCALE (check version in UI)
**Key Features**:
- ZFS storage management
- NFS exports
- SMB shares
- Syncthing hub
- Snapshot management
**Storage Pools**:
- `vault`: Main data pool on EMC enclosure
**Shares** (needs documentation):
- NFS exports for Saltbox media
- SMB shares for Windows access
- Syncthing sync folders
**See**: [STORAGE.md](STORAGE.md)
---
### Pi-hole (CT 200)
**Purpose**: Network-wide DNS server and ad blocker
**Location**: LXC on PVE (10.10.10.10)
**Access**: http://10.10.10.10/admin
**Version**: Unknown
**Configuration**:
- **Upstream DNS**: Cloudflare (1.1.1.1)
- **Blocklists**: Unknown count
- **Queries**: All network DNS traffic
- **DHCP**: Disabled (router handles DHCP)
**Stats** (example):
```bash
ssh pihole 'pihole -c -e' # Stats
ssh pihole 'pihole status' # Status
```
**Common Tasks**:
- Update blocklists: `ssh pihole 'pihole -g'`
- Whitelist domain: `ssh pihole 'pihole -w example.com'`
- View logs: `ssh pihole 'pihole -t'`
---
### Traefik (CT 202)
**Purpose**: Reverse proxy for all public-facing services
**Location**: LXC on PVE (10.10.10.250)
**Access**: http://10.10.10.250:8080/dashboard/
**Version**: Unknown (check: `traefik version`)
**Managed Services**:
- All *.htsn.io domains (except Saltbox services)
- SSL/TLS certificates via Let's Encrypt
- HTTP → HTTPS redirects
**See**: [TRAEFIK.md](TRAEFIK.md) for complete configuration
---
## Media Services (Saltbox VM)
All media services run in Docker on the Saltbox VM (10.10.10.100).
### Plex Media Server
**Purpose**: Media streaming platform
**URL**: https://plex.htsn.io
**Access**: Plex account
**Features**:
- Hardware transcoding (TITAN RTX)
- Libraries: Movies, TV, Music
- Remote access enabled
- Managed by Saltbox
**Media Storage**:
- Source: TrueNAS NFS mounts
- Location: `/mnt/unionfs/`
**Common Tasks**:
```bash
# View Plex status
ssh saltbox 'docker logs -f plex'
# Restart Plex
ssh saltbox 'docker restart plex'
# Scan library
# (via Plex UI: Settings → Library → Scan)
```
---
### *arr Apps (Media Automation)
Running on Saltbox VM, managed via Traefik-Saltbox.
| Service | Purpose | URL | Notes |
|---------|---------|-----|-------|
| **Sonarr** | TV show automation | sonarr.htsn.io | Monitors, downloads, organizes TV |
| **Radarr** | Movie automation | radarr.htsn.io | Monitors, downloads, organizes movies |
| **Lidarr** | Music automation | lidarr.htsn.io | Monitors, downloads, organizes music |
| **Overseerr** | Request management | overseerr.htsn.io | User requests for media |
| **Bazarr** | Subtitle management | bazarr.htsn.io | Downloads subtitles |
**Downloaders**:
| Service | Purpose | URL |
|---------|---------|-----|
| **SABnzbd** | Usenet downloader | sabnzbd.htsn.io |
| **NZBGet** | Usenet downloader | nzbget.htsn.io |
| **qBittorrent** | Torrent client | qbittorrent.htsn.io |
**Indexers**:
| Service | Purpose | URL |
|---------|---------|-----|
| **Jackett** | Torrent indexer proxy | jackett.htsn.io |
| **NZBHydra2** | Usenet indexer proxy | nzbhydra2.htsn.io |
---
### Supporting Media Services
| Service | Purpose | URL |
|---------|---------|-----|
| **Tautulli** | Plex statistics | tautulli.htsn.io |
| **Organizr** | Service dashboard | organizr.htsn.io |
| **Authelia** | SSO authentication | auth.htsn.io |
---
## Development Services
### Gitea (VM 300)
**Purpose**: Self-hosted Git server
**Location**: VM on PVE2 (10.10.10.220)
**URL**: https://git.htsn.io
**Access**: Username/password
**Repositories**:
- homelab-docs (this documentation)
- Personal projects
- Private repos
**Common Tasks**:
```bash
# SSH to Gitea VM
ssh gitea-vm
# View logs
ssh gitea-vm 'journalctl -u gitea -f'
# Backup
ssh gitea-vm 'gitea dump -c /etc/gitea/app.ini'
```
**See**: Gitea documentation for API usage
---
### Docker Services (docker-host VM)
Running on VM 206 (10.10.10.206).
| Service | URL | Purpose | Port |
|---------|-----|---------|------|
| **Excalidraw** | https://excalidraw.htsn.io | Whiteboard/diagramming | 8080 |
| **Happy Server** | https://happy.htsn.io | Happy Coder relay | 3002 |
| **Pulse** | https://pulse.htsn.io | Monitoring dashboard | 7655 |
**Docker Compose files**: `/opt/{excalidraw,happy-server,pulse}/docker-compose.yml`
**Managing services**:
```bash
ssh docker-host 'docker ps'
ssh docker-host 'cd /opt/excalidraw && sudo docker-compose logs -f'
ssh docker-host 'cd /opt/excalidraw && sudo docker-compose restart'
```
---
## Home Automation
### Home Assistant (VM 110)
**Purpose**: Smart home automation platform
**Location**: VM on PVE (10.10.10.110)
**URL**: https://homeassistant.htsn.io
**Access**: Username/password
**Integrations**:
- UPS monitoring (NUT sensors)
- Unknown other integrations (needs documentation)
**Sensors**:
- `sensor.cyberpower_battery_charge`
- `sensor.cyberpower_load`
- `sensor.cyberpower_battery_runtime`
- `sensor.cyberpower_status`
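These sensors can also be read programmatically through the Home Assistant REST API (a minimal sketch — the long-lived access token is an assumption; generate one under your HA user profile):
```bash
# Read the UPS load sensor via the Home Assistant REST API
TOKEN="<long-lived-access-token>"   # placeholder - create in HA user profile
curl -s -H "Authorization: Bearer $TOKEN" \
  "https://homeassistant.htsn.io/api/states/sensor.cyberpower_load"
```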
**See**: [HOMEASSISTANT.md](HOMEASSISTANT.md)
---
### Happy Coder Relay (docker-host)
**Purpose**: Self-hosted relay server for Happy Coder mobile app
**Location**: docker-host (10.10.10.206)
**URL**: https://happy.htsn.io
**Access**: QR code authentication
**Stack**:
- Happy Server (Node.js)
- PostgreSQL (user/session data)
- Redis (real-time events)
- MinIO (file/image storage)
**Clients**:
- Mac Mini (Happy daemon)
- Mobile app (iOS/Android)
**Credentials**:
- Master Secret: `3ccbfd03a028d3c278da7d2cf36d99b94cd4b1fecabc49ab006e8e89bc7707ac`
- PostgreSQL: `happy` / `happypass`
- MinIO: `happyadmin` / `happyadmin123`
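To verify the whole stack is up (a quick sketch; uses the `/opt/happy-server` compose path noted in the Docker Services section):
```bash
# List the Happy Server stack containers and their status
ssh docker-host 'cd /opt/happy-server && sudo docker-compose ps'
```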
---
## File Sync & Storage
### Syncthing
**Purpose**: File synchronization across all devices
**Devices**:
- Mac Mini (10.10.10.125) - Hub
- MacBook - Mobile sync
- TrueNAS (10.10.10.200) - Central storage
- Windows PC (10.10.10.150) - Windows sync
- Phone (10.10.10.54) - Mobile sync
**API Keys**:
- Mac Mini: `oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5`
- MacBook: `qYkNdVLwy9qZZZ6MqnJr7tHX7KKdxGMJ`
- Phone: `Xxz3jDT4akUJe6psfwZsbZwG2LhfZuDM`
**Synced Folders**:
- documents (~11 GB)
- downloads (~38 GB)
- pictures
- notes
- desktop (~7.2 GB)
- config
- movies
**See**: [SYNCTHING.md](SYNCTHING.md)
---
### Copyparty (VM 201)
**Purpose**: Simple HTTP file sharing
**Location**: VM on PVE (10.10.10.201)
**URL**: https://copyparty.htsn.io
**Access**: Unknown
**Features**:
- Web-based file upload/download
- Lightweight
---
## Trading & AI Services
### AI Trading Platform (trading-vm)
**Purpose**: Algorithmic trading with AI models
**Location**: VM 301 on PVE2 (10.10.10.221)
**URL**: https://aitrade.htsn.io (if accessible)
**GPU**: RTX A6000 (48GB VRAM)
**Components**:
- Trading algorithms
- AI models for market prediction
- Real-time data feeds
- Backtesting infrastructure
**Access**: SSH only (no web UI documented)
---
### LM Dev (lmdev1)
**Purpose**: AI/LLM development environment
**Location**: VM 111 on PVE (10.10.10.111)
**URL**: https://lmdev.htsn.io (if accessible)
**GPU**: TITAN RTX (shared with Saltbox)
**Installed**:
- CUDA toolkit
- Python 3.11+
- PyTorch, TensorFlow
- Hugging Face transformers
---
## Monitoring & Utilities
### UPS Monitoring (NUT)
**Purpose**: Monitor UPS status and trigger shutdowns
**Location**: PVE (master), PVE2 (slave)
**Access**: Command-line (`upsc`)
**Key Commands**:
```bash
ssh pve 'upsc cyberpower@localhost'
ssh pve 'upsc cyberpower@localhost ups.load'
ssh pve 'upsc cyberpower@localhost battery.runtime'
```
**Home Assistant Integration**: UPS sensors exposed
**See**: [UPS.md](UPS.md)
---
### Pulse Monitoring
**Purpose**: Unknown monitoring dashboard
**Location**: docker-host (10.10.10.206:7655)
**URL**: https://pulse.htsn.io
**Access**: Unknown
**Needs documentation**:
- What does it monitor?
- How to configure?
- Authentication?
---
### Tailscale VPN
**Purpose**: Secure remote access to homelab
**Subnet Routers**:
- PVE (100.113.177.80) - Primary
- UCG-Fiber (100.94.246.32) - Failover
**Devices on Tailscale**:
- Mac Mini: 100.108.89.58
- PVE: 100.113.177.80
- TrueNAS: 100.100.94.71
- Pi-hole: 100.112.59.128
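Quick checks from the primary subnet router (a sketch; assumes the `tailscale` CLI is installed on PVE, which its subnet-router role implies):
```bash
# Peer list and advertised routes as seen from PVE
ssh pve 'tailscale status'
# Tailscale IPv4 address of PVE itself
ssh pve 'tailscale ip -4'
```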
**See**: [NETWORK.md](NETWORK.md)
---
## Custom Applications
### FindShyt (CT 205)
**Purpose**: Unknown custom application
**Location**: LXC on PVE (10.10.10.8)
**URL**: https://findshyt.htsn.io
**Access**: Unknown
**Needs documentation**:
- What is this app?
- How to use it?
- Tech stack?
---
## Service Dependencies
### Critical Dependencies
```
TrueNAS
├── Plex (media files via NFS)
├── *arr apps (downloads via NFS)
├── Syncthing (central storage hub)
└── Backups (if configured)
Traefik (CT 202)
├── All *.htsn.io services
└── SSL certificate management
Pi-hole
└── DNS for entire network
Router
└── Gateway for all services
```
### Startup Order
**See [VMS.md](VMS.md)** for VM boot order configuration:
1. TrueNAS (storage first)
2. Saltbox (depends on TrueNAS NFS)
3. Other VMs
4. Containers
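A sketch of how that order is expressed in Proxmox (the `order`/`up` values below are illustrative only — the authoritative settings are in [VMS.md](VMS.md)):
```bash
# Illustrative boot-order settings (check VMS.md for the real values)
ssh pve 'qm set 100 --startup order=1,up=60'   # TrueNAS first, wait 60s for storage
ssh pve 'qm set 101 --startup order=2,up=30'   # Saltbox after TrueNAS NFS is available
ssh pve 'pct set 202 --startup order=3'        # Traefik container
# Inspect what is currently configured
ssh pve 'grep -H startup /etc/pve/qemu-server/*.conf /etc/pve/lxc/*.conf'
```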
---
## Service Port Reference
### Well-Known Ports
| Port | Service | Protocol | Purpose |
|------|---------|----------|---------|
| 22 | SSH | TCP | Remote access |
| 53 | Pi-hole | UDP | DNS queries |
| 80 | Traefik | TCP | HTTP (redirects to 443) |
| 443 | Traefik | TCP | HTTPS |
| 3000 | Gitea | TCP | Git HTTP/S |
| 8006 | Proxmox | TCP | Web UI |
| 32400 | Plex | TCP | Plex Media Server |
| 8384 | Syncthing | TCP | Web UI |
| 22000 | Syncthing | TCP | Sync protocol |
### Internal Ports
| Port | Service | Purpose |
|------|---------|---------|
| 3002 | Happy Server | Relay backend |
| 5432 | PostgreSQL | Happy Server DB |
| 6379 | Redis | Happy Server cache |
| 7655 | Pulse | Monitoring |
| 8080 | Excalidraw | Whiteboard |
| 8080 | Traefik | Dashboard |
| 9000 | MinIO | Object storage |
---
## Service Health Checks
### Quick Health Check Script
```bash
#!/bin/bash
# Check all critical services
echo "=== Infrastructure ==="
curl -Is https://pve.htsn.io:8006 | head -1
curl -Is https://truenas.htsn.io | head -1
curl -I http://10.10.10.10/admin 2>/dev/null | head -1
echo ""
echo "=== Media Services ==="
curl -Is https://plex.htsn.io | head -1
curl -Is https://sonarr.htsn.io | head -1
curl -Is https://radarr.htsn.io | head -1
echo ""
echo "=== Development ==="
curl -Is https://git.htsn.io | head -1
curl -Is https://excalidraw.htsn.io | head -1
echo ""
echo "=== Home Automation ==="
curl -Is https://homeassistant.htsn.io | head -1
curl -Is https://happy.htsn.io/health | head -1
```
### Service-Specific Checks
```bash
# Proxmox VMs
ssh pve 'qm list | grep running'
# Docker services
ssh docker-host 'docker ps --format "{{.Names}}: {{.Status}}"'
# Syncthing
curl -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
"http://127.0.0.1:8384/rest/system/status"
# UPS
ssh pve 'upsc cyberpower@localhost ups.status'
```
---
## Service Credentials
**Location**: See individual service documentation
| Service | Credentials Location | Notes |
|---------|---------------------|-------|
| Proxmox | Proxmox UI | Username + 2FA |
| TrueNAS | TrueNAS UI | Root password |
| Plex | Plex account | Managed externally |
| Gitea | Gitea DB | Self-managed |
| Pi-hole | `/etc/pihole/setupVars.conf` | Admin password |
| Happy Server | [CLAUDE.md](CLAUDE.md) | Master secret, DB passwords |
**⚠️ Security Note**: Never commit credentials to Git. Use proper secrets management.
---
## Related Documentation
- [VMS.md](VMS.md) - VM/service locations
- [TRAEFIK.md](TRAEFIK.md) - Reverse proxy config
- [IP-ASSIGNMENTS.md](IP-ASSIGNMENTS.md) - Service IP addresses
- [NETWORK.md](NETWORK.md) - Network configuration
- [MONITORING.md](MONITORING.md) - Monitoring setup (coming soon)
---
**Last Updated**: 2025-12-22
**Status**: ⚠️ Incomplete - many services need documentation (passwords, features, usage)

464
SSH-ACCESS.md Normal file
View File

@@ -0,0 +1,464 @@
# SSH Access
Documentation for SSH access to all homelab systems, including key authentication, password authentication for special cases, and QEMU guest agent usage.
## Overview
Most systems use **SSH key authentication** with the `~/.ssh/homelab` key. A few special cases require **password authentication** (router, Windows PC) due to platform limitations.
**SSH Password**: `GrilledCh33s3#` (for systems without key auth)
---
## SSH Key Authentication (Primary Method)
### SSH Key Configuration
SSH keys are configured in `~/.ssh/config` on both Mac Mini and MacBook.
**Key file**: `~/.ssh/homelab` (Ed25519 key)
**Key deployed to**: All Proxmox hosts, VMs, and LXCs (13 total hosts)
### Host Aliases
Use these convenient aliases instead of IP addresses:
| Host Alias | IP | User | Type | Notes |
|------------|-----|------|------|-------|
| `pve` | 10.10.10.120 | root | Proxmox | Primary server |
| `pve2` | 10.10.10.102 | root | Proxmox | Secondary server |
| `truenas` | 10.10.10.200 | root | VM | NAS/storage |
| `saltbox` | 10.10.10.100 | hutson | VM | Media automation |
| `lmdev1` | 10.10.10.111 | hutson | VM | AI/LLM development |
| `docker-host` | 10.10.10.206 | hutson | VM | Docker services |
| `fs-dev` | 10.10.10.5 | hutson | VM | Development |
| `copyparty` | 10.10.10.201 | hutson | VM | File sharing |
| `gitea-vm` | 10.10.10.220 | hutson | VM | Git server |
| `trading-vm` | 10.10.10.221 | hutson | VM | AI trading platform |
| `pihole` | 10.10.10.10 | root | LXC | DNS/Ad blocking |
| `traefik` | 10.10.10.250 | root | LXC | Reverse proxy |
| `findshyt` | 10.10.10.8 | root | LXC | Custom app |
### Usage Examples
```bash
# List VMs on PVE
ssh pve 'qm list'
# Check ZFS pool on TrueNAS
ssh truenas 'zpool status vault'
# List Docker containers on Saltbox
ssh saltbox 'docker ps'
# Check Pi-hole status
ssh pihole 'pihole status'
# View Traefik config
ssh pve 'pct exec 202 -- cat /etc/traefik/traefik.yaml'
```
### SSH Config File
**Location**: `~/.ssh/config`
**Example entries**:
```sshconfig
# Proxmox Servers
Host pve
HostName 10.10.10.120
User root
IdentityFile ~/.ssh/homelab
Host pve2
HostName 10.10.10.102
User root
IdentityFile ~/.ssh/homelab
# Post-quantum KEX causes MTU issues - use classic
KexAlgorithms curve25519-sha256
# VMs
Host truenas
HostName 10.10.10.200
User root
IdentityFile ~/.ssh/homelab
Host saltbox
HostName 10.10.10.100
User hutson
IdentityFile ~/.ssh/homelab
Host lmdev1
HostName 10.10.10.111
User hutson
IdentityFile ~/.ssh/homelab
Host docker-host
HostName 10.10.10.206
User hutson
IdentityFile ~/.ssh/homelab
Host fs-dev
HostName 10.10.10.5
User hutson
IdentityFile ~/.ssh/homelab
Host copyparty
HostName 10.10.10.201
User hutson
IdentityFile ~/.ssh/homelab
Host gitea-vm
HostName 10.10.10.220
User hutson
IdentityFile ~/.ssh/homelab
Host trading-vm
HostName 10.10.10.221
User hutson
IdentityFile ~/.ssh/homelab
# LXC Containers
Host pihole
HostName 10.10.10.10
User root
IdentityFile ~/.ssh/homelab
Host traefik
HostName 10.10.10.250
User root
IdentityFile ~/.ssh/homelab
Host findshyt
HostName 10.10.10.8
User root
IdentityFile ~/.ssh/homelab
```
---
## Password Authentication (Special Cases)
Some systems don't support SSH key auth or have other limitations.
### UniFi Router (10.10.10.1)
**Issue**: Uses `keyboard-interactive` auth method, incompatible with `sshpass`
**Solution**: Use `expect` to automate password entry
**Commands**:
```bash
# Run command on router
expect -c 'spawn ssh root@10.10.10.1 "hostname"; expect "Password:"; send "GrilledCh33s3#\r"; expect eof'
# Get ARP table (all device IPs)
expect -c 'spawn ssh root@10.10.10.1 "cat /proc/net/arp"; expect "Password:"; send "GrilledCh33s3#\r"; expect eof'
# Check Tailscale status
expect -c 'spawn ssh root@10.10.10.1 "tailscale status"; expect "Password:"; send "GrilledCh33s3#\r"; expect eof'
```
**Why not key auth?**: UniFi router firmware doesn't persist SSH keys across reboots.
### Windows PC (10.10.10.150)
**OS**: Windows with OpenSSH server
**User**: `claude`
**Password**: `GrilledCh33s3#`
**Shell**: PowerShell (not bash)
**Commands**:
```bash
# Run PowerShell command
sshpass -p 'GrilledCh33s3#' ssh claude@10.10.10.150 'Get-Process | Select -First 5'
# Check Syncthing status
sshpass -p 'GrilledCh33s3#' ssh claude@10.10.10.150 'Get-Process -Name syncthing -ErrorAction SilentlyContinue'
# Restart Syncthing
sshpass -p 'GrilledCh33s3#' ssh claude@10.10.10.150 'Stop-Process -Name syncthing -Force; Start-ScheduledTask -TaskName "Syncthing"'
```
**⚠️ Important**: Use `;` (semicolon) to chain PowerShell commands, NOT `&&` (bash syntax).
**Why not key auth?**: Could be configured, but password auth works and is simpler for Windows.
---
## QEMU Guest Agent
Most VMs have the QEMU guest agent installed, allowing command execution without SSH.
### VMs with QEMU Agent
| VMID | VM Name | Use Case |
|------|---------|----------|
| 100 | truenas | Execute commands, check ZFS |
| 101 | saltbox | Execute commands, Docker mgmt |
| 105 | fs-dev | Execute commands |
| 111 | lmdev1 | Execute commands |
| 201 | copyparty | Execute commands |
| 206 | docker-host | Execute commands |
| 300 | gitea-vm | Execute commands |
| 301 | trading-vm | Execute commands |
### VM WITHOUT QEMU Agent
**VMID 110 (homeassistant)**: No QEMU agent installed
- Access via web UI only
- Or install SSH server manually if needed
### Usage Examples
**Basic syntax**:
```bash
ssh pve 'qm guest exec VMID -- bash -c "COMMAND"'
```
**Examples**:
```bash
# Check ZFS pool on TrueNAS (without SSH)
ssh pve 'qm guest exec 100 -- bash -c "zpool status vault"'
# Get VM IP addresses
ssh pve 'qm guest exec 100 -- bash -c "ip addr"'
# Check Docker containers on Saltbox
ssh pve 'qm guest exec 101 -- bash -c "docker ps"'
# Run multi-line command
ssh pve 'qm guest exec 100 -- bash -c "df -h; free -h; uptime"'
```
**When to use QEMU agent vs SSH**:
- ✅ Use **SSH** for interactive sessions, file editing, complex tasks
- ✅ Use **QEMU agent** for one-off commands, when SSH is down, or VM has no network
- ⚠️ QEMU agent is slower for multiple commands (use SSH instead)
---
## Troubleshooting SSH Issues
### Connection Refused
```bash
# Check if SSH service is running
ssh pve 'systemctl status sshd'
# Check if port 22 is open
nc -zv 10.10.10.XXX 22
# Check firewall
ssh pve 'iptables -L -n | grep 22'
```
### Permission Denied (Public Key)
```bash
# Verify key file exists
ls -la ~/.ssh/homelab
# Check key permissions (should be 600)
chmod 600 ~/.ssh/homelab
# Test SSH key auth verbosely
ssh -vvv -i ~/.ssh/homelab root@10.10.10.120
# Check authorized_keys on remote (via QEMU agent if SSH broken)
ssh pve 'qm guest exec VMID -- bash -c "cat ~/.ssh/authorized_keys"'
```
### Slow SSH Connection (PVE2 Issue)
**Problem**: SSH to PVE2 hangs for 30+ seconds before connecting
**Cause**: MTU mismatch (vmbr0=9000, nic1=1500) causing post-quantum KEX packet fragmentation
**Fix**: Use classic KEX algorithm instead
**In `~/.ssh/config`**:
```sshconfig
Host pve2
HostName 10.10.10.102
User root
IdentityFile ~/.ssh/homelab
KexAlgorithms curve25519-sha256 # Avoid mlkem768x25519-sha256
```
**Permanent fix**: Set `nic1` MTU to 9000 in `/etc/network/interfaces` on PVE2
---
## Adding SSH Keys to New Systems
### Linux (VMs/LXCs)
```bash
# Copy public key to new host
ssh-copy-id -i ~/.ssh/homelab user@hostname
# Or manually:
ssh user@hostname 'mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys' < ~/.ssh/homelab.pub
ssh user@hostname 'chmod 700 ~/.ssh && chmod 600 ~/.ssh/authorized_keys'
```
### LXC Containers (Root User)
```bash
# Via pct exec from Proxmox host
ssh pve 'pct exec CTID -- bash -c "mkdir -p /root/.ssh"'
ssh pve 'pct exec CTID -- bash -c "echo \"$(cat ~/.ssh/homelab.pub)\" >> /root/.ssh/authorized_keys"'
ssh pve 'pct exec CTID -- bash -c "chmod 700 /root/.ssh && chmod 600 /root/.ssh/authorized_keys"'
# Also enable PermitRootLogin in sshd_config
ssh pve 'pct exec CTID -- bash -c "sed -i \"s/^#*PermitRootLogin.*/PermitRootLogin prohibit-password/\" /etc/ssh/sshd_config"'
ssh pve 'pct exec CTID -- bash -c "systemctl restart sshd"'
```
### VMs (via QEMU Agent)
```bash
# Add key via QEMU agent (if SSH not working)
ssh pve 'qm guest exec VMID -- bash -c "mkdir -p ~/.ssh"'
ssh pve 'qm guest exec VMID -- bash -c "echo \"$(cat ~/.ssh/homelab.pub)\" >> ~/.ssh/authorized_keys"'
ssh pve 'qm guest exec VMID -- bash -c "chmod 700 ~/.ssh && chmod 600 ~/.ssh/authorized_keys"'
```
---
## SSH Key Management
### Rotate SSH Keys (Future)
When rotating SSH keys:
1. Generate new key pair:
```bash
ssh-keygen -t ed25519 -f ~/.ssh/homelab-new -C "homelab-new"
```
2. Deploy new key to all hosts (keep old key for now):
```bash
for host in pve pve2 truenas saltbox lmdev1 docker-host fs-dev copyparty gitea-vm trading-vm pihole traefik findshyt; do
ssh-copy-id -i ~/.ssh/homelab-new $host
done
```
3. Update `~/.ssh/config` to use new key:
```sshconfig
IdentityFile ~/.ssh/homelab-new
```
4. Test all connections:
```bash
for host in pve pve2 truenas saltbox lmdev1 docker-host fs-dev copyparty gitea-vm trading-vm pihole traefik findshyt; do
echo "Testing $host..."
ssh $host 'hostname'
done
```
5. Remove old key from all hosts once confirmed working
---
## Quick Reference
### Common SSH Operations
```bash
# Execute command on remote host
ssh host 'command'
# Execute multiple commands
ssh host 'command1 && command2'
# Copy file to remote
scp file host:/path/
# Copy file from remote
scp host:/path/file ./
# Execute command on Proxmox VM (via QEMU agent)
ssh pve 'qm guest exec VMID -- bash -c "command"'
# Execute command on LXC
ssh pve 'pct exec CTID -- command'
# Interactive shell
ssh host
# SSH with X11 forwarding
ssh -X host
```
### Troubleshooting Commands
```bash
# Test SSH with verbose output
ssh -vvv host
# Check SSH service status (remote)
ssh host 'systemctl status sshd'
# Check SSH config (local)
ssh -G host
# Test port connectivity
nc -zv hostname 22
```
---
## Security Best Practices
### Current Security Posture
✅ **Good**:
- SSH keys used instead of passwords (where possible)
- Keys use Ed25519 (modern, secure algorithm)
- Root login disabled on VMs (use sudo instead)
- SSH keys have proper permissions (600)
⚠️ **Could Improve**:
- [ ] Disable password authentication on all hosts (force key-only)
- [ ] Use SSH certificate authority instead of individual keys
- [ ] Set up SSH bastion host (jump server)
- [ ] Enable 2FA for SSH (via PAM + Google Authenticator)
- [ ] Implement SSH key rotation policy (annually)
### Hardening SSH (Future)
For additional security, consider:
```sshconfig
# /etc/ssh/sshd_config (on remote hosts)
PermitRootLogin prohibit-password # No root password login
PasswordAuthentication no # Disable password auth entirely
PubkeyAuthentication yes # Only allow key auth
AuthorizedKeysFile .ssh/authorized_keys
MaxAuthTries 3 # Limit auth attempts
MaxSessions 10 # Limit concurrent sessions
ClientAliveInterval 300 # Timeout idle sessions
ClientAliveCountMax 2 # Drop after 2 keepalives
```
**Apply after editing**:
```bash
systemctl restart sshd
```
---
## Related Documentation
- [VMS.md](VMS.md) - Complete VM/CT inventory
- [NETWORK.md](NETWORK.md) - Network configuration
- [IP-ASSIGNMENTS.md](IP-ASSIGNMENTS.md) - IP addresses for all hosts
- [SECURITY.md](#) - Security policies (coming soon)
---
**Last Updated**: 2025-12-22

510
STORAGE.md Normal file
View File

@@ -0,0 +1,510 @@
# Storage Architecture
Documentation of all storage pools, datasets, shares, and capacity planning across the homelab.
## Overview
### Storage Distribution
| Location | Type | Capacity | Purpose |
|----------|------|----------|---------|
| **PVE** | NVMe + SSD mirrors | ~9 TB usable | VM storage, fast IO |
| **PVE2** | NVMe + HDD mirrors | ~6+ TB usable | VM storage, bulk data |
| **TrueNAS** | ZFS pool + EMC enclosure | ~12+ TB usable | Central file storage, NFS/SMB |
---
## PVE (10.10.10.120) Storage Pools
### nvme-mirror1 (Primary Fast Storage)
- **Type**: ZFS mirror
- **Devices**: 2x Sabrent Rocket Q NVMe
- **Capacity**: 3.6 TB usable
- **Purpose**: High-performance VM storage
- **Used By**:
- Critical VMs requiring fast IO
- Database workloads
- Development environments
**Check status**:
```bash
ssh pve 'zpool status nvme-mirror1'
ssh pve 'zpool list nvme-mirror1'
```
### nvme-mirror2 (Secondary Fast Storage)
- **Type**: ZFS mirror
- **Devices**: 2x Kingston SFYRD 2TB NVMe
- **Capacity**: 1.8 TB usable
- **Purpose**: Additional fast VM storage
- **Used By**: TBD
**Check status**:
```bash
ssh pve 'zpool status nvme-mirror2'
ssh pve 'zpool list nvme-mirror2'
```
### rpool (Root Pool)
- **Type**: ZFS mirror
- **Devices**: 2x Samsung 870 QVO 4TB SSD
- **Capacity**: 3.6 TB usable
- **Purpose**: Proxmox OS, container storage, VM backups
- **Used By**:
- Proxmox root filesystem
- LXC containers
- Local VM backups
**Check status**:
```bash
ssh pve 'zpool status rpool'
ssh pve 'df -h /var/lib/vz'
```
### Storage Pool Usage Summary (PVE)
**Get current usage**:
```bash
ssh pve 'zpool list'
ssh pve 'pvesm status'
```
---
## PVE2 (10.10.10.102) Storage Pools
### nvme-mirror3 (Fast Storage)
- **Type**: ZFS mirror
- **Devices**: 2x NVMe (model unknown)
- **Capacity**: Unknown (needs investigation)
- **Purpose**: High-performance VM storage
- **Used By**: Trading VM (301), other VMs
**Check status**:
```bash
ssh pve2 'zpool status nvme-mirror3'
ssh pve2 'zpool list nvme-mirror3'
```
### local-zfs2 (Bulk Storage)
- **Type**: ZFS mirror
- **Devices**: 2x WD Red 6TB HDD
- **Capacity**: ~6 TB usable
- **Purpose**: Bulk/archival storage
- **Power Management**: 30-minute spindown configured
- Saves ~10-16W when idle
- Udev rule: `/etc/udev/rules.d/69-hdd-spindown.rules`
- Command: `hdparm -S 241` (30 min)
**Notes**:
- Pool had only 768 KB used as of 2024-12-16
- Drives configured to spin down after 30 min idle
- Good for archival, NOT for active workloads
**Check status**:
```bash
ssh pve2 'zpool status local-zfs2'
ssh pve2 'zpool list local-zfs2'
# Check if drives are spun down
ssh pve2 'hdparm -C /dev/sdX' # Shows active/standby
```
---
## TrueNAS (VM 100 @ 10.10.10.200) - Central Storage
### ZFS Pool: vault
**Primary storage pool** for all shared data.
**Devices**: ❓ Needs investigation
- EMC storage enclosure with multiple drives
- SAS connection via LSI SAS2308 HBA (passed through to VM)
**Capacity**: ❓ Needs investigation
**Check pool status**:
```bash
ssh truenas 'zpool status vault'
ssh truenas 'zpool list vault'
# Get detailed capacity
ssh truenas 'zfs list -o name,used,avail,refer,mountpoint'
```
### Datasets (Known)
Based on Syncthing configuration, likely datasets:
| Dataset | Purpose | Synced Devices | Notes |
|---------|---------|----------------|-------|
| vault/documents | Personal documents | Mac Mini, MacBook, Windows PC, Phone | ~11 GB |
| vault/downloads | Downloads folder | Mac Mini, TrueNAS | ~38 GB |
| vault/pictures | Photos | Mac Mini, MacBook, Phone | Unknown size |
| vault/notes | Note files | Mac Mini, MacBook, Phone | Unknown size |
| vault/desktop | Desktop sync | Unknown | 7.2 GB |
| vault/movies | Movie library | Unknown | Unknown size |
| vault/config | Config files | Mac Mini, MacBook | Unknown size |
**Get complete dataset list**:
```bash
ssh truenas 'zfs list -r vault'
```
### NFS/SMB Shares
**Status**: ❓ Not documented
**Needs investigation**:
```bash
# List NFS exports
ssh truenas 'showmount -e localhost'
# List SMB shares
ssh truenas 'smbclient -L localhost -N'
# Via TrueNAS API/UI
# Sharing → Unix Shares (NFS)
# Sharing → Windows Shares (SMB)
```
**Expected shares**:
- Media libraries for Plex (on Saltbox VM)
- Document storage
- VM backups?
- ISO storage?
### EMC Storage Enclosure
**Model**: EMC KTN-STL4 (or similar)
**Connection**: SAS via LSI SAS2308 HBA (passthrough to TrueNAS VM)
**Drives**: ❓ Unknown count and capacity
**See [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md)** for:
- SES commands
- Fan control
- LCC (Link Control Card) troubleshooting
- Maintenance procedures
**Check enclosure status**:
```bash
ssh truenas 'sg_ses --page=0x02 /dev/sgX' # Element descriptor
ssh truenas 'smartctl --scan' # List all drives
```
---
## Storage Network Architecture
### Internal Storage Network (10.10.20.0/24)
**Purpose**: Dedicated network for NFS/iSCSI traffic to reduce congestion on main network.
**Bridge**: vmbr3 on PVE (virtual bridge, no physical NIC)
**Subnet**: 10.10.20.0/24
**DHCP**: No
**Gateway**: No (internal only, no internet)
**Connected VMs**:
- TrueNAS VM (secondary NIC)
- Saltbox VM (secondary NIC) - for NFS mounts
- Other VMs needing storage access
**Configuration**:
```bash
# On TrueNAS VM - check second NIC
ssh truenas 'ip addr show enp6s19'
# On Saltbox - check NFS mounts
ssh saltbox 'mount | grep nfs'
```
**Benefits**:
- Separates storage traffic from general network
- Prevents NFS/SMB from saturating main network
- Better performance for storage-heavy workloads
---
## Storage Capacity Planning
### Current Usage (Estimate)
**Needs actual audit**:
```bash
# PVE pools
ssh pve 'zpool list -o name,size,alloc,free'
# PVE2 pools
ssh pve2 'zpool list -o name,size,alloc,free'
# TrueNAS vault pool
ssh truenas 'zpool list vault'
# Get detailed breakdown
ssh truenas 'zfs list -r vault -o name,used,avail'
```
### Growth Rate
**Needs tracking** - recommend monthly snapshots of capacity:
```bash
#!/bin/bash
# Save as ~/bin/storage-capacity-report.sh
DATE=$(date +%Y-%m-%d)
REPORT=~/Backups/storage-reports/capacity-$DATE.txt
mkdir -p ~/Backups/storage-reports
echo "Storage Capacity Report - $DATE" > $REPORT
echo "================================" >> $REPORT
echo "" >> $REPORT
echo "PVE Pools:" >> $REPORT
ssh pve 'zpool list' >> $REPORT
echo "" >> $REPORT
echo "PVE2 Pools:" >> $REPORT
ssh pve2 'zpool list' >> $REPORT
echo "" >> $REPORT
echo "TrueNAS Pools:" >> $REPORT
ssh truenas 'zpool list' >> $REPORT
echo "" >> $REPORT
echo "TrueNAS Datasets:" >> $REPORT
ssh truenas 'zfs list -r vault -o name,used,avail' >> $REPORT
echo "Report saved to $REPORT"
```
**Run monthly via cron**:
```cron
0 9 1 * * ~/bin/storage-capacity-report.sh
```
### Expansion Planning
**When to expand**:
- Pool reaches 80% capacity
- Performance degrades
- New workloads require more space
**Expansion options**:
1. Add drives to existing pools (if mirrors, add mirror vdev)
2. Add new NVMe drives to PVE/PVE2
3. Expand EMC enclosure (add more drives)
4. Add second EMC enclosure
**Cost estimates**: TBD
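For option 1, adding a mirror vdev to an existing pool generally looks like this (a sketch with placeholder device names — match the drive size and ashift of the existing vdevs before running anything like it):
```bash
# Add a new mirrored vdev to an existing pool (placeholder device names)
zpool add vault mirror /dev/disk/by-id/NEW_DISK_1 /dev/disk/by-id/NEW_DISK_2
# Confirm the new vdev appears and capacity increased
zpool status vault
zpool list vault
```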
---
## ZFS Health Monitoring
### Daily Health Checks
```bash
# Check for errors on all pools
ssh pve 'zpool status -x' # Shows only unhealthy pools
ssh pve2 'zpool status -x'
ssh truenas 'zpool status -x'
# Check scrub status
ssh pve 'zpool status | grep scrub'
ssh pve2 'zpool status | grep scrub'
ssh truenas 'zpool status | grep scrub'
```
### Scrub Schedule
**Recommended**: Monthly scrub on all pools
**Configure scrub**:
```bash
# Via Proxmox UI: Node → Disks → ZFS → Select pool → Scrub
# Or via cron:
0 2 1 * * /sbin/zpool scrub nvme-mirror1
0 2 1 * * /sbin/zpool scrub rpool
```
**On TrueNAS**:
- Configure via UI: Storage → Pools → Scrub Tasks
- Recommended: 1st of every month at 2 AM
### SMART Monitoring
**Check drive health**:
```bash
# PVE
ssh pve 'smartctl -a /dev/nvme0'
ssh pve 'smartctl -a /dev/sda'
# TrueNAS
ssh truenas 'smartctl --scan'
ssh truenas 'smartctl -a /dev/sdX' # For each drive
```
**Configure SMART tests**:
- TrueNAS UI: Tasks → S.M.A.R.T. Tests
- Recommended: Weekly short test, monthly long test
### Alerts
**Set up email alerts for**:
- ZFS pool errors
- SMART test failures
- Pool capacity > 80%
- Scrub failures
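Until proper alerting exists (see [MONITORING.md](MONITORING.md)), a minimal cron-able sketch like the one below covers the pool-error and capacity checks — it assumes a working `mail` command on the host, and `ADMIN_EMAIL` is a placeholder:
```bash
#!/bin/bash
# Minimal ZFS health/capacity alert (sketch; assumes a configured mail MTA)
ADMIN_EMAIL="you@example.com"   # placeholder
PROBLEMS=""

# Pool errors (zpool status -x prints "all pools are healthy" when OK)
STATUS=$(zpool status -x)
echo "$STATUS" | grep -q "all pools are healthy" || PROBLEMS+="ZFS errors:\n$STATUS\n\n"

# Capacity >= 80%
while read -r name cap; do
  [ "${cap%\%}" -ge 80 ] && PROBLEMS+="Pool $name at $cap capacity\n"
done < <(zpool list -H -o name,capacity)

if [ -n "$PROBLEMS" ]; then
  echo -e "$PROBLEMS" | mail -s "Storage alert on $(hostname)" "$ADMIN_EMAIL"
fi
```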
---
## Storage Performance Tuning
### ZFS ARC (Cache)
**Check ARC usage**:
```bash
ssh pve 'arc_summary'
ssh truenas 'arc_summary'
```
**Tuning** (if needed):
- PVE/PVE2: Set max ARC in `/etc/modprobe.d/zfs.conf`
- TrueNAS: Configure via UI (System → Advanced → Tunables)
### NFS Performance
**Mount options** (on clients like Saltbox):
```
rsize=131072,wsize=131072,hard,timeo=600,retrans=2,vers=3
```
**Verify NFS mounts**:
```bash
ssh saltbox 'mount | grep nfs'
```
### Record Size Optimization
**Different workloads need different record sizes**:
- VMs: 64K (default, good for VMs)
- Databases: 8K or 16K
- Media files: 1M (large sequential reads)
**Set record size** (on TrueNAS datasets):
```bash
ssh truenas 'zfs set recordsize=1M vault/movies'
```
---
## Disaster Recovery
### Pool Recovery
**If a pool fails to import**:
```bash
# Try importing with different name
zpool import -f -N poolname newpoolname
# Check pool with readonly
zpool import -f -o readonly=on poolname
# Force import (last resort)
zpool import -f -F poolname
```
### Drive Replacement
**When a drive fails**:
```bash
# Identify failed drive
zpool status poolname
# Replace drive
zpool replace poolname old-device new-device
# Monitor resilver
watch zpool status poolname
```
### Data Recovery
**If pool is completely lost**:
1. Restore from offsite backup (see [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md))
2. Recreate pool structure
3. Restore data
**Critical**: This is why we need offsite backups!
---
## Quick Reference
### Common Commands
```bash
# Pool status
zpool status [poolname]
zpool list
# Dataset usage
zfs list
zfs list -r vault
# Check pool health (only unhealthy)
zpool status -x
# Scrub pool
zpool scrub poolname
# Get pool IO stats
zpool iostat -v 1
# Snapshot management
zfs snapshot poolname/dataset@snapname
zfs list -t snapshot
zfs rollback poolname/dataset@snapname
zfs destroy poolname/dataset@snapname
```
### Storage Locations by Use Case
| Use Case | Recommended Storage | Why |
|----------|---------------------|-----|
| VM OS disk | nvme-mirror1 (PVE) | Fastest IO |
| Database | nvme-mirror1/2 | Low latency |
| Media files | TrueNAS vault | Large capacity |
| Development | nvme-mirror2 | Fast, mid-tier |
| Containers | rpool | Good performance |
| Backups | TrueNAS or rpool | Large capacity |
| Archive | local-zfs2 (PVE2) | Cheap, can spin down |
---
## Investigation Needed
- [ ] Get complete TrueNAS dataset list
- [ ] Document NFS/SMB share configuration
- [ ] Inventory EMC enclosure drives (count, capacity, model)
- [ ] Document current pool usage percentages
- [ ] Set up monthly capacity reports
- [ ] Configure ZFS scrub schedules
- [ ] Set up storage health alerts
---
## Related Documentation
- [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md) - Backup and snapshot strategy
- [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md) - Storage enclosure maintenance
- [VMS.md](VMS.md) - VM storage assignments
- [NETWORK.md](NETWORK.md) - Storage network configuration
---
**Last Updated**: 2025-12-22

672
TRAEFIK.md Normal file
View File

@@ -0,0 +1,672 @@
# Traefik Reverse Proxy
Documentation for Traefik reverse proxy setup, SSL certificates, and deploying new public services.
## Overview
There are **TWO separate Traefik instances** handling different services. Understanding which one to use is critical.
| Instance | Location | IP | Purpose | Managed By |
|----------|----------|-----|---------|------------|
| **Traefik-Primary** | CT 202 | **10.10.10.250** | General services | Manual config files |
| **Traefik-Saltbox** | VM 101 (Docker) | **10.10.10.100** | Saltbox services only | Saltbox Ansible |
---
## ⚠️ CRITICAL RULE: Which Traefik to Use
### When Adding ANY New Service:
**USE Traefik-Primary (CT 202 @ 10.10.10.250)** - For ALL new services
**DO NOT touch Traefik-Saltbox** - Unless you're modifying Saltbox itself
### Why This Matters:
- **Traefik-Saltbox** has complex Saltbox-managed configs (Ansible-generated)
- Messing with it breaks Plex, Sonarr, Radarr, and all media services
- Each Traefik has its own Let's Encrypt certificates
- Mixing them causes certificate conflicts and routing issues
---
## Traefik-Primary (CT 202) - For New Services
### Configuration
**Location**: Container 202 on PVE (10.10.10.250)
**Config Directory**: `/etc/traefik/`
**Main Config**: `/etc/traefik/traefik.yaml`
**Dynamic Configs**: `/etc/traefik/conf.d/*.yaml`
### Access Traefik Config
```bash
# From Mac Mini:
ssh pve 'pct exec 202 -- cat /etc/traefik/traefik.yaml'
ssh pve 'pct exec 202 -- ls /etc/traefik/conf.d/'
# Edit a service config:
ssh pve 'pct exec 202 -- vi /etc/traefik/conf.d/myservice.yaml'
# View logs:
ssh pve 'pct exec 202 -- tail -f /var/log/traefik/traefik.log'
```
### Services Using Traefik-Primary
| Service | Domain | Backend |
|---------|--------|---------|
| Excalidraw | excalidraw.htsn.io | 10.10.10.206:8080 (docker-host) |
| FindShyt | findshyt.htsn.io | 10.10.10.8 (CT 205) |
| Gitea | git.htsn.io | 10.10.10.220:3000 |
| Home Assistant | homeassistant.htsn.io | 10.10.10.110 |
| LM Dev | lmdev.htsn.io | 10.10.10.111 |
| Pi-hole | pihole.htsn.io | 10.10.10.10 |
| TrueNAS | truenas.htsn.io | 10.10.10.200 |
| Proxmox | pve.htsn.io | 10.10.10.120 |
| Copyparty | copyparty.htsn.io | 10.10.10.201 |
| AI Trade | aitrade.htsn.io | (trading server) |
| Pulse | pulse.htsn.io | 10.10.10.206:7655 (monitoring) |
| Happy | happy.htsn.io | 10.10.10.206:3002 (Happy Coder relay) |
---
## Traefik-Saltbox (VM 101) - DO NOT MODIFY
### Configuration
**Location**: `/opt/traefik/` inside Saltbox VM
**Managed By**: Saltbox Ansible playbooks (automatic)
**Docker Mount**: `/opt/traefik` → `/etc/traefik` in container
### Services Using Traefik-Saltbox
- Plex (plex.htsn.io)
- Sonarr, Radarr, Lidarr
- SABnzbd, NZBGet, qBittorrent
- Overseerr, Tautulli, Organizr
- Jackett, NZBHydra2
- Authelia (SSO authentication)
- All other Saltbox-managed containers
### View Saltbox Traefik (Read-Only)
```bash
# View config (don't edit!)
ssh pve 'qm guest exec 101 -- bash -c "docker exec traefik cat /etc/traefik/traefik.yml"'
# View logs
ssh saltbox 'docker logs -f traefik'
```
**⚠️ WARNING**: Editing Saltbox Traefik configs manually will be overwritten by Ansible and may break media services.
---
## Adding a New Public Service - Complete Workflow
Follow these steps to deploy a new service and make it accessible at `servicename.htsn.io`.
### Step 0: Deploy Your Service
First, deploy your service on the appropriate host.
#### Option A: Docker on docker-host (10.10.10.206)
```bash
ssh hutson@10.10.10.206
sudo mkdir -p /opt/myservice
cat > /opt/myservice/docker-compose.yml << 'EOF'
version: "3.8"
services:
myservice:
image: myimage:latest
ports:
- "8080:80"
restart: unless-stopped
EOF
cd /opt/myservice && sudo docker-compose up -d
```
#### Option B: New LXC Container on PVE
```bash
ssh pve 'pct create CTID local:vztmpl/ubuntu-22.04-standard_22.04-1_amd64.tar.zst \
--hostname myservice --memory 2048 --cores 2 \
--net0 name=eth0,bridge=vmbr0,ip=10.10.10.XXX/24,gw=10.10.10.1 \
--rootfs local-zfs:8 --unprivileged 1 --start 1'
```
#### Option C: New VM on PVE
```bash
ssh pve 'qm create VMID --name myservice --memory 2048 --cores 2 \
--net0 virtio,bridge=vmbr0 --scsihw virtio-scsi-pci'
```
### Step 1: Create Traefik Config File
Use this template for new services on **Traefik-Primary (CT 202)**:
#### Basic Template
```yaml
# /etc/traefik/conf.d/myservice.yaml
http:
routers:
# HTTPS router
myservice-secure:
entryPoints:
- websecure
rule: "Host(`myservice.htsn.io`)"
service: myservice
tls:
certResolver: cloudflare # Use 'cloudflare' for proxied domains, 'letsencrypt' for DNS-only
priority: 50
# HTTP → HTTPS redirect
myservice-redirect:
entryPoints:
- web
rule: "Host(`myservice.htsn.io`)"
middlewares:
- myservice-https-redirect
service: myservice
priority: 50
services:
myservice:
loadBalancer:
servers:
- url: "http://10.10.10.XXX:PORT"
middlewares:
myservice-https-redirect:
redirectScheme:
scheme: https
permanent: true
```
#### Deploy the Config
```bash
# Create file on CT 202
ssh pve 'pct exec 202 -- bash -c "cat > /etc/traefik/conf.d/myservice.yaml << '\''EOF'\''
<paste config here>
EOF"'
# Traefik auto-reloads (watches conf.d directory)
# Check logs:
ssh pve 'pct exec 202 -- tail -f /var/log/traefik/traefik.log'
```
### Step 2: Add Cloudflare DNS Entry
#### Cloudflare Credentials
| Field | Value |
|-------|-------|
| Email | cloudflare@htsn.io |
| API Key | 849ebefd163d2ccdec25e49b3e1b3fe2cdadc |
| Zone ID (htsn.io) | c0f5a80448c608af35d39aa820a5f3af |
| Public IP | 70.237.94.174 |
#### Method 1: Manual (Cloudflare Dashboard)
1. Go to https://dash.cloudflare.com/
2. Select `htsn.io` domain
3. DNS → Add Record
4. Type: `A`, Name: `myservice`, IPv4: `70.237.94.174`, Proxied: ☑️
#### Method 2: Automated (CLI)
Save this as `~/bin/add-cloudflare-dns.sh`:
```bash
#!/bin/bash
# Add DNS record to Cloudflare for htsn.io
SUBDOMAIN="$1"
CF_EMAIL="cloudflare@htsn.io"
CF_API_KEY="849ebefd163d2ccdec25e49b3e1b3fe2cdadc"
ZONE_ID="c0f5a80448c608af35d39aa820a5f3af"
PUBLIC_IP="70.237.94.174"
if [ -z "$SUBDOMAIN" ]; then
echo "Usage: $0 <subdomain>"
echo "Example: $0 myservice # Creates myservice.htsn.io"
exit 1
fi
curl -X POST "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records" \
-H "X-Auth-Email: $CF_EMAIL" \
-H "X-Auth-Key: $CF_API_KEY" \
-H "Content-Type: application/json" \
--data "{
\"type\":\"A\",
\"name\":\"$SUBDOMAIN\",
\"content\":\"$PUBLIC_IP\",
\"ttl\":1,
\"proxied\":true
}" | jq .
```
**Usage**:
```bash
chmod +x ~/bin/add-cloudflare-dns.sh
~/bin/add-cloudflare-dns.sh myservice # Creates myservice.htsn.io
```
### Step 3: Testing
```bash
# Check if DNS resolves
dig myservice.htsn.io
# Should return: 70.237.94.174 (or Cloudflare IPs if proxied)
# Test HTTP redirect
curl -I http://myservice.htsn.io
# Expected: 301 redirect to https://
# Test HTTPS
curl -I https://myservice.htsn.io
# Expected: 200 OK
# Check Traefik dashboard (if enabled)
# http://10.10.10.250:8080/dashboard/
```
### Step 4: Update Documentation
After deploying, update:
1. **IP-ASSIGNMENTS.md** - Add to Services & Reverse Proxy Mapping table
2. **This file (TRAEFIK.md)** - Add to "Services Using Traefik-Primary" list
3. **CLAUDE.md** - Update quick reference if needed
---
## SSL Certificates
Traefik has **two certificate resolvers** configured:
| Resolver | Use When | Challenge Type | Notes |
|----------|----------|----------------|-------|
| `letsencrypt` | Cloudflare DNS-only (gray cloud ☁️) | HTTP-01 | Requires port 80 reachable |
| `cloudflare` | Cloudflare Proxied (orange cloud 🟠) | DNS-01 | Works with Cloudflare proxy |
### ⚠️ Important: HTTP Challenge vs DNS Challenge
**If Cloudflare proxy is enabled** (orange cloud), HTTP challenge **FAILS** because Cloudflare redirects HTTP→HTTPS before the challenge reaches your server.
**Solution**: Use `cloudflare` resolver (DNS-01 challenge) instead.
### Certificate Resolver Configuration
**Cloudflare API credentials** are configured in `/etc/systemd/system/traefik.service`:
```ini
Environment="CF_API_EMAIL=cloudflare@htsn.io"
Environment="CF_API_KEY=849ebefd163d2ccdec25e49b3e1b3fe2cdadc"
```
### Certificate Storage
| Resolver | Storage File |
|----------|--------------|
| HTTP challenge (`letsencrypt`) | `/etc/traefik/acme.json` |
| DNS challenge (`cloudflare`) | `/etc/traefik/acme-cf.json` |
**Permissions**: Must be `600` (read/write owner only)
```bash
# Check permissions
ssh pve 'pct exec 202 -- ls -la /etc/traefik/acme*.json'
# Fix if needed
ssh pve 'pct exec 202 -- chmod 600 /etc/traefik/acme.json'
ssh pve 'pct exec 202 -- chmod 600 /etc/traefik/acme-cf.json'
```
### Certificate Renewal
- **Automatic** via Traefik
- Checks every 24 hours
- Renews 30 days before expiry
- No manual intervention needed
### Troubleshooting Certificates
#### Certificate Fails to Issue
```bash
# Check Traefik logs
ssh pve 'pct exec 202 -- tail -f /var/log/traefik/traefik.log | grep -i error'
# Verify Cloudflare API access
curl -X GET "https://api.cloudflare.com/client/v4/user/tokens/verify" \
-H "X-Auth-Email: cloudflare@htsn.io" \
-H "X-Auth-Key: 849ebefd163d2ccdec25e49b3e1b3fe2cdadc"
# Check acme.json permissions
ssh pve 'pct exec 202 -- ls -la /etc/traefik/acme*.json'
```
#### Force Certificate Renewal
```bash
# Delete the certificate store for the resolver (Traefik will re-request ALL certificates on that resolver)
ssh pve 'pct exec 202 -- rm /etc/traefik/acme-cf.json'
ssh pve 'pct exec 202 -- touch /etc/traefik/acme-cf.json'
ssh pve 'pct exec 202 -- chmod 600 /etc/traefik/acme-cf.json'
ssh pve 'pct exec 202 -- systemctl restart traefik'
# Watch logs
ssh pve 'pct exec 202 -- tail -f /var/log/traefik/traefik.log'
```
---
## Quick Deployment - One-Liner
For fast deployment, use this all-in-one command:
```bash
# === DEPLOY SERVICE (example: myservice on docker-host port 8080) ===
# 1. Create Traefik config
ssh pve 'pct exec 202 -- bash -c "cat > /etc/traefik/conf.d/myservice.yaml << EOF
http:
routers:
myservice-secure:
entryPoints: [websecure]
rule: Host(\\\`myservice.htsn.io\\\`)
service: myservice
tls: {certResolver: cloudflare}
services:
myservice:
loadBalancer:
servers:
- url: http://10.10.10.206:8080
EOF"'
# 2. Add Cloudflare DNS
curl -s -X POST "https://api.cloudflare.com/client/v4/zones/c0f5a80448c608af35d39aa820a5f3af/dns_records" \
-H "X-Auth-Email: cloudflare@htsn.io" \
-H "X-Auth-Key: 849ebefd163d2ccdec25e49b3e1b3fe2cdadc" \
-H "Content-Type: application/json" \
--data '{"type":"A","name":"myservice","content":"70.237.94.174","proxied":true}'
# 3. Test (wait a few seconds for DNS propagation)
curl -I https://myservice.htsn.io
```
---
## Docker Service with Traefik Labels (Alternative)
If deploying a service via Docker on `docker-host` (VM 206), you can use Traefik labels instead of config files.
**Requirements**:
- Traefik must have access to Docker socket
- Service must be on same Docker network as Traefik
**Example docker-compose.yml**:
```yaml
version: "3.8"
services:
myservice:
image: myimage:latest
labels:
- "traefik.enable=true"
- "traefik.http.routers.myservice.rule=Host(`myservice.htsn.io`)"
- "traefik.http.routers.myservice.entrypoints=websecure"
- "traefik.http.routers.myservice.tls.certresolver=letsencrypt"
- "traefik.http.services.myservice.loadbalancer.server.port=8080"
networks:
- traefik
networks:
traefik:
external: true
```
**Note**: This method is NOT currently used on Traefik-Primary (CT 202), as it doesn't have Docker socket access. Config files are preferred.
---
## Cloudflare API Reference
### API Credentials
| Field | Value |
|-------|-------|
| Email | cloudflare@htsn.io |
| API Key | 849ebefd163d2ccdec25e49b3e1b3fe2cdadc |
| Zone ID | c0f5a80448c608af35d39aa820a5f3af |
### Common API Operations
Set credentials:
```bash
CF_EMAIL="cloudflare@htsn.io"
CF_API_KEY="849ebefd163d2ccdec25e49b3e1b3fe2cdadc"
ZONE_ID="c0f5a80448c608af35d39aa820a5f3af"
```
**List all DNS records**:
```bash
curl -X GET "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records" \
-H "X-Auth-Email: $CF_EMAIL" \
-H "X-Auth-Key: $CF_API_KEY" | jq
```
**Add A record**:
```bash
curl -X POST "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records" \
-H "X-Auth-Email: $CF_EMAIL" \
-H "X-Auth-Key: $CF_API_KEY" \
-H "Content-Type: application/json" \
--data '{
"type":"A",
"name":"subdomain",
"content":"70.237.94.174",
"proxied":true
}'
```
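**Find a record ID** (needed by the delete and update calls below; the `name` filter expects the full hostname):
```bash
# Look up the record ID for an existing record by name
RECORD_ID=$(curl -s -X GET "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records?name=myservice.htsn.io" \
  -H "X-Auth-Email: $CF_EMAIL" \
  -H "X-Auth-Key: $CF_API_KEY" | jq -r '.result[0].id')
echo "$RECORD_ID"
```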
**Delete record**:
```bash
curl -X DELETE "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records/$RECORD_ID" \
-H "X-Auth-Email: $CF_EMAIL" \
-H "X-Auth-Key: $CF_API_KEY"
```
**Update record** (toggle proxy):
```bash
curl -X PATCH "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records/$RECORD_ID" \
-H "X-Auth-Email: $CF_EMAIL" \
-H "X-Auth-Key: $CF_API_KEY" \
-H "Content-Type: application/json" \
--data '{"proxied":false}'
```
---
## Troubleshooting
### Service Not Accessible
```bash
# 1. Check if DNS resolves
dig myservice.htsn.io
# 2. Check if backend is reachable
curl -I http://10.10.10.XXX:PORT
# 3. Check Traefik logs
ssh pve 'pct exec 202 -- tail -f /var/log/traefik/traefik.log'
# 4. Check Traefik config is valid
ssh pve 'pct exec 202 -- cat /etc/traefik/conf.d/myservice.yaml'
# 5. Restart Traefik (if needed)
ssh pve 'pct exec 202 -- systemctl restart traefik'
```
### Certificate Issues
```bash
# Check certificate status in acme.json
ssh pve 'pct exec 202 -- cat /etc/traefik/acme-cf.json | jq'
# Check certificate expiry
echo | openssl s_client -servername myservice.htsn.io -connect myservice.htsn.io:443 2>/dev/null | openssl x509 -noout -dates
```
### 502 Bad Gateway
**Cause**: Backend service is down or unreachable
```bash
# Check if backend is running
ssh backend-host 'systemctl status myservice'
# Check if port is open
nc -zv 10.10.10.XXX PORT
# Check firewall
ssh backend-host 'iptables -L -n | grep PORT'
```
### 404 Not Found
**Cause**: Traefik can't match the request to a router
```bash
# Check router rule matches domain
ssh pve 'pct exec 202 -- cat /etc/traefik/conf.d/myservice.yaml | grep rule'
# Should be: rule: "Host(`myservice.htsn.io`)"
# Check DNS is pointing to correct IP
dig myservice.htsn.io
# Restart Traefik to reload config
ssh pve 'pct exec 202 -- systemctl restart traefik'
```
---
## Advanced Configuration Examples
### WebSocket Support
For services that use WebSockets (like Home Assistant):
```yaml
http:
routers:
myservice-secure:
entryPoints:
- websecure
rule: "Host(`myservice.htsn.io`)"
service: myservice
tls:
certResolver: cloudflare
services:
myservice:
loadBalancer:
servers:
- url: "http://10.10.10.XXX:PORT"
# No special config needed - WebSockets work by default in Traefik v2+
```
### Custom Headers
Add custom headers (e.g., security headers):
```yaml
http:
routers:
myservice-secure:
middlewares:
- myservice-headers
middlewares:
myservice-headers:
headers:
customResponseHeaders:
X-Frame-Options: "DENY"
X-Content-Type-Options: "nosniff"
Referrer-Policy: "strict-origin-when-cross-origin"
```
### Basic Authentication
Protect a service with basic auth:
```yaml
http:
routers:
myservice-secure:
middlewares:
- myservice-auth
middlewares:
myservice-auth:
basicAuth:
users:
- "user:$apr1$..." # Generate with: htpasswd -nb user password
```
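To generate the hash referenced above (assuming `apache2-utils` provides `htpasswd` on CT 202):
```bash
# Install the htpasswd utility if missing (Debian/Ubuntu)
ssh pve 'pct exec 202 -- bash -c "apt update && apt install -y apache2-utils"'
# Print a user:hash pair to paste into the users list above
ssh pve 'pct exec 202 -- htpasswd -nb admin "MySecretPassword"'
```
In the file provider the hash can be pasted as-is; doubling `$` to `$$` is only needed when using docker-compose labels.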
---
## Maintenance
### Monthly Checks
```bash
# Check Traefik status
ssh pve 'pct exec 202 -- systemctl status traefik'
# Review logs for errors
ssh pve 'pct exec 202 -- grep -i error /var/log/traefik/traefik.log | tail -20'
# List certificate domains (acme-cf.json stores the certificate blob, not a readable expiry field)
ssh pve 'pct exec 202 -- cat /etc/traefik/acme-cf.json | jq -r ".cloudflare.Certificates[].domain.main"'
# Spot-check expiry of a live certificate (repeat per domain as needed)
echo | openssl s_client -servername plex.htsn.io -connect plex.htsn.io:443 2>/dev/null | openssl x509 -noout -dates
# Verify all services responding
for domain in plex.htsn.io git.htsn.io truenas.htsn.io; do
echo "Testing $domain..."
curl -sI https://$domain | head -1
done
```
### Backup Traefik Config
```bash
# Backup all configs
ssh pve 'pct exec 202 -- tar czf /tmp/traefik-backup-$(date +%Y%m%d).tar.gz /etc/traefik'
# Pull the archive from the container to the PVE host, then copy it locally
ssh pve "pct pull 202 /tmp/traefik-backup-$(date +%Y%m%d).tar.gz /tmp/traefik-backup-$(date +%Y%m%d).tar.gz"
scp "pve:/tmp/traefik-backup-$(date +%Y%m%d).tar.gz" ~/Backups/traefik/
```
---
## Related Documentation
- [IP-ASSIGNMENTS.md](IP-ASSIGNMENTS.md) - Service IP addresses
- [CLOUDFLARE.md](#) - Cloudflare DNS management (coming soon)
- [SERVICES.md](#) - Complete service inventory (coming soon)
---
**Last Updated**: 2025-12-22

UPS.md Normal file

@@ -0,0 +1,605 @@
# UPS and Power Management
Documentation for UPS (Uninterruptible Power Supply) configuration, NUT (Network UPS Tools) monitoring, and power failure procedures.
## Hardware
### Current UPS
| Specification | Value |
|---------------|-------|
| **Model** | CyberPower OR2200PFCRT2U |
| **Capacity** | 2200VA / 1320W |
| **Form Factor** | 2U rackmount |
| **Output** | PFC Sinewave (compatible with active PFC PSUs) |
| **Outlets** | 2x NEMA 5-20R + 6x NEMA 5-15R (all battery + surge) |
| **Input Plug** | ⚠️ Originally NEMA 5-20P (20A), **rewired to 5-15P (15A)** |
| **Runtime** | ~15-20 min at typical load (~33% / 440W) |
| **Installed** | 2025-12-21 |
| **Status** | Active |
### ⚠️ Temporary Wiring Modification
**Issue**: UPS came with NEMA 5-20P plug (20A) but server rack is on 15A circuit
**Solution**: Temporarily rewired plug from 5-20P → 5-15P for compatibility
**Risk**: The UPS input was designed for a 20A circuit; total draw (server load plus battery recharge) must stay within the 15A circuit/plug limit (1800W peak, ~1440W continuous)
**Current draw**: ~1000-1350W total (safe margin)
**Backlog**: Upgrade to 20A circuit, restore original 5-20P plug
### Previous UPS
| Model | Capacity | Issue | Replaced |
|-------|----------|-------|----------|
| WattBox WB-1100-IPVMB-6 | 1100VA / 660W | Insufficient for dual Threadripper setup | 2025-12-21 |
**Why replaced**: Combined server load of 1000-1350W exceeded 660W capacity.
---
## Power Draw Estimates
### Typical Load
| Component | Idle | Load | Notes |
|-----------|------|------|-------|
| PVE Server | 250-350W | 500-750W | CPU + TITAN RTX + P2000 + storage |
| PVE2 Server | 200-300W | 450-600W | CPU + RTX A6000 + storage |
| Network gear | ~50W | ~50W | Router, switches |
| **Total** | **500-700W** | **1000-1400W** | Varies by workload |
**UPS Load**: ~33-50% typical, 70-80% under heavy load
### Runtime Calculation
- At **440W load** (33%): ~15-20 min runtime (tested 2025-12-21)
- At **660W load** (50%): ~10-12 min estimated
- At **1000W load** (75%): ~6-8 min estimated
**NUT shutdown trigger**: 120 seconds (2 min) remaining runtime
---
## NUT (Network UPS Tools) Configuration
### Architecture
```
UPS (USB) ──> PVE (NUT Server/Master) ──> PVE2 (NUT Client/Slave)
└──> Home Assistant (monitoring only)
```
**Master**: PVE (10.10.10.120) - UPS connected via USB, runs NUT server
**Slave**: PVE2 (10.10.10.102) - Monitors PVE's NUT server, shuts down when triggered
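A quick way to confirm the topology is wired up (assuming NUT is installed and running on both nodes):
```bash
# On the master: list UPSes the local NUT server knows about
ssh pve 'upsc -l'
# From the slave: query the master's NUT server over the network
ssh pve2 'upsc cyberpower@10.10.10.120 ups.status'
```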
### NUT Server Configuration (PVE)
#### 1. UPS Driver Config: `/etc/nut/ups.conf`
```ini
[cyberpower]
driver = usbhid-ups
port = auto
desc = "CyberPower OR2200PFCRT2U"
override.battery.charge.low = 20
override.battery.runtime.low = 120
```
**Key settings**:
- `driver = usbhid-ups`: USB HID UPS driver (generic for CyberPower)
- `port = auto`: Auto-detect USB device
- `override.battery.runtime.low = 120`: Trigger shutdown at 120 seconds (2 min) remaining
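After (re)starting the driver, the overrides can be checked against the live values reported by the UPS:
```bash
# Expected: battery.charge.low: 20, battery.runtime.low: 120
ssh pve 'upsc cyberpower@localhost | grep -E "battery\.charge\.low|battery\.runtime\.low"'
```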
#### 2. NUT Server Config: `/etc/nut/upsd.conf`
```ini
LISTEN 127.0.0.1 3493
LISTEN 10.10.10.120 3493
```
**Listens on**:
- Localhost (for local monitoring)
- LAN IP (for PVE2 to connect)
#### 3. User Config: `/etc/nut/upsd.users`
```ini
[admin]
password = upsadmin123
actions = SET
instcmds = ALL
[upsmon]
password = upsmon123
upsmon master
```
**Users**:
- `admin`: Full control, can run commands
- `upsmon`: Monitoring only (used by the upsmon processes on both PVE and PVE2)
#### 4. Monitor Config: `/etc/nut/upsmon.conf`
```ini
MONITOR cyberpower@localhost 1 upsmon upsmon123 master
MINSUPPLIES 1
SHUTDOWNCMD "/usr/local/bin/ups-shutdown.sh"
NOTIFYCMD /usr/sbin/upssched
POLLFREQ 5
POLLFREQALERT 5
HOSTSYNC 15
DEADTIME 15
POWERDOWNFLAG /etc/killpower
NOTIFYMSG ONLINE "UPS %s on line power"
NOTIFYMSG ONBATT "UPS %s on battery"
NOTIFYMSG LOWBATT "UPS %s battery is low"
NOTIFYMSG FSD "UPS %s: forced shutdown in progress"
NOTIFYMSG COMMOK "Communications with UPS %s established"
NOTIFYMSG COMMBAD "Communications with UPS %s lost"
NOTIFYMSG SHUTDOWN "Auto logout and shutdown proceeding"
NOTIFYMSG REPLBATT "UPS %s battery needs to be replaced"
NOTIFYMSG NOCOMM "UPS %s is unavailable"
NOTIFYMSG NOPARENT "upsmon parent process died - shutdown impossible"
NOTIFYFLAG ONLINE SYSLOG+WALL
NOTIFYFLAG ONBATT SYSLOG+WALL
NOTIFYFLAG LOWBATT SYSLOG+WALL
NOTIFYFLAG FSD SYSLOG+WALL
NOTIFYFLAG COMMOK SYSLOG+WALL
NOTIFYFLAG COMMBAD SYSLOG+WALL
NOTIFYFLAG SHUTDOWN SYSLOG+WALL
NOTIFYFLAG REPLBATT SYSLOG+WALL
NOTIFYFLAG NOCOMM SYSLOG+WALL
NOTIFYFLAG NOPARENT SYSLOG
```
**Key settings**:
- `MONITOR cyberpower@localhost 1 upsmon upsmon123 master`: Monitor local UPS
- `SHUTDOWNCMD "/usr/local/bin/ups-shutdown.sh"`: Custom shutdown script
- `POLLFREQ 5`: Check UPS every 5 seconds
#### 5. USB Permissions: `/etc/udev/rules.d/99-nut-ups.rules`
```udev
SUBSYSTEM=="usb", ATTR{idVendor}=="0764", ATTR{idProduct}=="0501", MODE="0660", GROUP="nut"
```
**Purpose**: Ensure NUT can access USB UPS device
**Apply rule**:
```bash
udevadm control --reload-rules
udevadm trigger
```
### NUT Client Configuration (PVE2)
#### Monitor Config: `/etc/nut/upsmon.conf`
```ini
MONITOR cyberpower@10.10.10.120 1 upsmon upsmon123 slave
MINSUPPLIES 1
SHUTDOWNCMD "/usr/local/bin/ups-shutdown.sh"
POLLFREQ 5
POLLFREQALERT 5
HOSTSYNC 15
DEADTIME 15
POWERDOWNFLAG /etc/killpower
# Same NOTIFYMSG and NOTIFYFLAG as PVE
```
**Key difference**: `slave` instead of `master` - monitors remote UPS on PVE
---
## Custom Shutdown Script
### `/usr/local/bin/ups-shutdown.sh` (Same on both PVE and PVE2)
```bash
#!/bin/bash
# Graceful VM/CT shutdown when UPS battery low
LOG="/var/log/ups-shutdown.log"
log() {
echo "$(date '+%Y-%m-%d %H:%M:%S') - $1" | tee -a "$LOG"
}
log "=== UPS Shutdown Triggered ==="
log "Battery low - initiating graceful shutdown of VMs/CTs"
# Get list of running VMs (skip TrueNAS for now)
VMS=$(qm list | awk '$3=="running" && $1!=100 {print $1}')
for VMID in $VMS; do
log "Stopping VM $VMID..."
qm shutdown $VMID
done
# Get list of running containers
CTS=$(pct list | awk '$2=="running" {print $1}')
for CTID in $CTS; do
log "Stopping CT $CTID..."
pct shutdown $CTID
done
# Wait for VMs/CTs to stop
log "Waiting 60 seconds for VMs/CTs to shut down..."
sleep 60
# Now stop TrueNAS (storage - must be last)
if qm status 100 | grep -q running; then
log "Stopping TrueNAS (VM 100) last..."
qm shutdown 100
sleep 30
fi
log "All VMs/CTs stopped. Host will remain running until UPS dies."
log "=== UPS Shutdown Complete ==="
```
**Make executable**:
```bash
chmod +x /usr/local/bin/ups-shutdown.sh
```
**Script behavior**:
1. Stops all VMs (except TrueNAS)
2. Stops all containers
3. Waits 60 seconds
4. Stops TrueNAS last (storage must be cleanly unmounted)
5. **Does NOT shut down Proxmox hosts** - intentionally left running
**Why not shut down hosts?**
- BIOS configured to "Restore on AC Power Loss"
- When power returns, servers auto-boot and start VMs in order
- Avoids need for manual intervention
---
## Power Failure Behavior
### When Power Fails
1. **UPS switches to battery** (`OB DISCHRG` status)
2. **NUT monitors runtime** - polls every 5 seconds
3. **At 120 seconds (2 min) remaining**:
- NUT triggers `/usr/local/bin/ups-shutdown.sh` on both servers
- Script gracefully stops all VMs/CTs
- TrueNAS stopped last (storage integrity)
4. **Hosts remain running** until UPS battery depletes
5. **UPS battery dies** → Hosts lose power (ungraceful but safe - VMs already stopped)
### When Power Returns
1. **UPS charges battery**, power returns to servers
2. **BIOS "Restore on AC Power Loss"** boots both servers
3. **Proxmox starts** and auto-starts VMs in configured order:
| Order | Wait | VMs/CTs | Reason |
|-------|------|---------|--------|
| 1 | 30s | TrueNAS (VM 100) | Storage must start first |
| 2 | 60s | Saltbox (VM 101) | Depends on TrueNAS NFS |
| 3 | 10s | fs-dev, homeassistant, lmdev1, copyparty, docker-host | General VMs |
| 4 | 5s | pihole, traefik, findshyt | Containers |
PVE2 VMs: order=1, wait=10s
**Total recovery time**: ~7 minutes from power restoration to fully operational (tested 2025-12-21)
---
## UPS Status Codes
| Code | Meaning | Action |
|------|---------|--------|
| `OL` | Online (AC power) | Normal operation |
| `OB` | On Battery | Power outage - monitor runtime |
| `LB` | Low Battery | <2 min remaining - shutdown imminent |
| `CHRG` | Charging | Battery charging after power restored |
| `DISCHRG` | Discharging | On battery, draining |
| `FSD` | Forced Shutdown | NUT triggered shutdown |
---
## Monitoring & Commands
### Check UPS Status
```bash
# Full status
ssh pve 'upsc cyberpower@localhost'
# Key metrics only
ssh pve 'upsc cyberpower@localhost | grep -E "battery.charge:|battery.runtime:|ups.load:|ups.status:"'
# Example output:
# battery.charge: 100
# battery.runtime: 1234 (seconds remaining)
# ups.load: 33 (% load)
# ups.status: OL (online)
```
### Control UPS Beeper
```bash
# Mute beeper (temporary - until next power event)
ssh pve 'upscmd -u admin -p upsadmin123 cyberpower@localhost beeper.mute'
# Disable beeper (permanent)
ssh pve 'upscmd -u admin -p upsadmin123 cyberpower@localhost beeper.disable'
# Enable beeper
ssh pve 'upscmd -u admin -p upsadmin123 cyberpower@localhost beeper.enable'
```
### Test Shutdown Procedure
**Simulate low battery** (careful - this will shut down VMs!):
```bash
# Set a very high low battery threshold to trigger shutdown
ssh pve 'upsrw -s battery.runtime.low=300 -u admin -p upsadmin123 cyberpower@localhost'
# Watch it trigger (when runtime drops below 300 seconds)
ssh pve 'tail -f /var/log/ups-shutdown.log'
# Reset to normal
ssh pve 'upsrw -s battery.runtime.low=120 -u admin -p upsadmin123 cyberpower@localhost'
```
**Better test**: Run shutdown script manually without actually triggering NUT:
```bash
ssh pve '/usr/local/bin/ups-shutdown.sh'
```
---
## Home Assistant Integration
UPS metrics are exposed to Home Assistant via NUT integration.
### Available Sensors
| Entity ID | Description |
|-----------|-------------|
| `sensor.cyberpower_battery_charge` | Battery % (0-100) |
| `sensor.cyberpower_battery_runtime` | Seconds remaining on battery |
| `sensor.cyberpower_load` | Load % (0-100) |
| `sensor.cyberpower_input_voltage` | Input voltage (V AC) |
| `sensor.cyberpower_output_voltage` | Output voltage (V AC) |
| `sensor.cyberpower_status` | Status text (OL, OB, LB, etc.) |
### Configuration
**Home Assistant**: See [HOMEASSISTANT.md](HOMEASSISTANT.md) for integration setup.
### Example Automations
**Send notification when on battery**:
```yaml
automation:
- alias: "UPS On Battery Alert"
trigger:
- platform: state
entity_id: sensor.cyberpower_status
to: "OB"
action:
- service: notify.mobile_app
data:
message: "⚠️ Power outage! UPS on battery. Runtime: {{ states('sensor.cyberpower_battery_runtime') }}s"
```
**Alert when battery low**:
```yaml
automation:
- alias: "UPS Low Battery Alert"
trigger:
- platform: numeric_state
entity_id: sensor.cyberpower_battery_runtime
below: 300
action:
- service: notify.mobile_app
data:
message: "🚨 UPS battery low! {{ states('sensor.cyberpower_battery_runtime') }}s remaining"
```
---
## Testing Results
### Full Power Failure Test (2025-12-21)
Complete end-to-end test of power failure and recovery:
| Event | Time | Duration | Notes |
|-------|------|----------|-------|
| **Power pulled** | 22:30 | - | UPS on battery, ~15 min runtime at 33% load |
| **Low battery trigger** | 22:40:38 | +10:38 | Runtime < 120s, shutdown script ran |
| **All VMs stopped** | 22:41:36 | +0:58 | Graceful shutdown completed |
| **UPS died** | 22:46:29 | +4:53 | Hosts lost power at 0% battery |
| **Power restored** | ~22:47 | - | Plugged back in |
| **PVE online** | 22:49:11 | +2:11 | BIOS boot, Proxmox started |
| **PVE2 online** | 22:50:47 | +3:47 | BIOS boot, Proxmox started |
| **All VMs running** | 22:53:39 | +6:39 | Auto-started in correct order |
| **Total recovery** | - | **~7 min** | From power return to fully operational |
**Results**:
✅ VMs shut down gracefully
✅ Hosts remained running until UPS died (as intended)
✅ Auto-boot on power restoration worked
✅ VMs started in correct order with appropriate delays
✅ No data corruption or issues
**Runtime calculation**:
- Load: ~33% (440W estimated)
- Total runtime on battery: ~16 minutes (22:30 → 22:46:29)
- Matches manufacturer estimate for 33% load
---
## Proxmox Cluster Quorum Fix
### Problem
With a 2-node cluster, if one node goes down, the other loses quorum and can't manage VMs.
During UPS testing, this would prevent the remaining node from starting VMs after power restoration.
### Solution
Modified `/etc/pve/corosync.conf` to enable 2-node mode:
```
quorum {
provider: corosync_votequorum
two_node: 1
}
```
**Effect**:
- Either node can operate independently if the other is down
- No more waiting for quorum when one server is offline
- Both nodes visible in single Proxmox interface when both up
**Applied**: 2025-12-21
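To confirm the setting is active, check the votequorum flags (a quick check; exact output layout varies by corosync version):
```bash
# Look for "2Node" in the Flags line and "Quorate" status
ssh pve 'pvecm status | grep -E "Flags|Quorum:"'
```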
---
## Maintenance
### Monthly Checks
```bash
# Check UPS status
ssh pve 'upsc cyberpower@localhost'
# Check NUT server running
ssh pve 'systemctl status nut-server'
ssh pve 'systemctl status nut-monitor'
# Check NUT client running (PVE2)
ssh pve2 'systemctl status nut-monitor'
# Verify PVE2 can see UPS
ssh pve2 'upsc cyberpower@10.10.10.120'
# Check logs for errors
ssh pve 'journalctl -u nut-server -n 50'
ssh pve 'journalctl -u nut-monitor -n 50'
```
### Battery Health
**Check battery stats**:
```bash
ssh pve 'upsc cyberpower@localhost | grep battery'
# Key metrics:
# battery.charge: 100 (should be near 100 when on AC)
# battery.runtime: 1200+ (seconds at current load)
# battery.voltage: ~24V (normal for 24V battery system)
```
**Battery replacement**: When runtime significantly decreases or UPS reports `REPLBATT`:
```bash
ssh pve 'upsc cyberpower@localhost | grep battery.mfr.date'
```
CyberPower batteries typically last 3-5 years.
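One way to catch gradual degradation early is to log runtime over time and watch the trend. A minimal sketch, run from cron or a systemd timer on PVE (the log path is an arbitrary choice):
```bash
# Append a timestamped charge/runtime sample for later trend review
echo "$(date '+%F %T') charge=$(upsc cyberpower@localhost battery.charge) runtime=$(upsc cyberpower@localhost battery.runtime)" >> /var/log/ups-battery-history.log
```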
### Firmware Updates
Check CyberPower website for firmware updates:
https://www.cyberpowersystems.com/support/firmware/
---
## Troubleshooting
### UPS Not Detected
```bash
# Check USB connection
ssh pve 'lsusb | grep Cyber'
# Expected:
# Bus 001 Device 003: ID 0764:0501 Cyber Power System, Inc. CP1500 AVR UPS
# Restart NUT driver
ssh pve 'systemctl restart nut-driver'
ssh pve 'systemctl status nut-driver'
```
### PVE2 Can't Connect
```bash
# Verify NUT server listening
ssh pve 'netstat -tuln | grep 3493'
# Should show:
# tcp 0 0 10.10.10.120:3493 0.0.0.0:* LISTEN
# Test connection from PVE2
ssh pve2 'telnet 10.10.10.120 3493'
# Check firewall (should allow port 3493)
ssh pve 'iptables -L -n | grep 3493'
```
### Shutdown Script Not Running
```bash
# Check script permissions
ssh pve 'ls -la /usr/local/bin/ups-shutdown.sh'
# Should be: -rwxr-xr-x (executable)
# Check logs
ssh pve 'cat /var/log/ups-shutdown.log'
# Test script manually
ssh pve '/usr/local/bin/ups-shutdown.sh'
```
### UPS Status Shows UNKNOWN
```bash
# Driver may not be compatible
ssh pve 'upsc cyberpower@localhost ups.status'
# Try different driver (in /etc/nut/ups.conf)
# driver = usbhid-ups
# or
# driver = blazer_usb
# Restart after change
ssh pve 'systemctl restart nut-driver nut-server'
```
---
## Future Improvements
- [ ] Add email alerts for UPS events (power fail, low battery)
- [ ] Log runtime statistics to track battery degradation
- [ ] Set up Grafana dashboard for UPS metrics
- [ ] Test battery runtime at different load levels
- [ ] Upgrade to 20A circuit, restore original 5-20P plug
- [ ] Consider adding network management card for out-of-band UPS access
---
## Related Documentation
- [POWER-MANAGEMENT.md](POWER-MANAGEMENT.md) - Overall power optimization
- [VMS.md](VMS.md) - VM startup order configuration
- [HOMEASSISTANT.md](HOMEASSISTANT.md) - UPS sensor integration
---
**Last Updated**: 2025-12-22

VMS.md Normal file

@@ -0,0 +1,579 @@
# VMs and Containers
Complete inventory of all virtual machines and LXC containers across both Proxmox servers.
## Overview
| Server | VMs | LXCs | Total |
|--------|-----|------|-------|
| **PVE** (10.10.10.120) | 6 | 3 | 9 |
| **PVE2** (10.10.10.102) | 2 | 0 | 2 |
| **Total** | **8** | **3** | **11** |
---
## PVE (10.10.10.120) - Primary Server
### Virtual Machines
| VMID | Name | IP | vCPUs | RAM | Storage | Purpose | GPU/Passthrough | QEMU Agent |
|------|------|-----|-------|-----|---------|---------|-----------------|------------|
| **100** | truenas | 10.10.10.200 | 8 | 32GB | nvme-mirror1 | NAS, central file storage | LSI SAS2308 HBA, Samsung NVMe | ✅ Yes |
| **101** | saltbox | 10.10.10.100 | 16 | 16GB | nvme-mirror1 | Media automation (Plex, *arr) | TITAN RTX | ✅ Yes |
| **105** | fs-dev | 10.10.10.5 | 10 | 8GB | rpool | Development environment | - | ✅ Yes |
| **110** | homeassistant | 10.10.10.110 | 2 | 2GB | rpool | Home automation platform | - | ❌ No |
| **111** | lmdev1 | 10.10.10.111 | 8 | 32GB | nvme-mirror1 | AI/LLM development | TITAN RTX | ✅ Yes |
| **201** | copyparty | 10.10.10.201 | 2 | 2GB | rpool | File sharing service | - | ✅ Yes |
| **206** | docker-host | 10.10.10.206 | 2 | 4GB | rpool | Docker services (Excalidraw, Happy, Pulse) | - | ✅ Yes |
### LXC Containers
| CTID | Name | IP | RAM | Storage | Purpose |
|------|------|-----|-----|---------|---------|
| **200** | pihole | 10.10.10.10 | - | rpool | DNS, ad blocking |
| **202** | traefik | 10.10.10.250 | - | rpool | Reverse proxy (primary) |
| **205** | findshyt | 10.10.10.8 | - | rpool | Custom app |
---
## PVE2 (10.10.10.102) - Secondary Server
### Virtual Machines
| VMID | Name | IP | vCPUs | RAM | Storage | Purpose | GPU/Passthrough | QEMU Agent |
|------|------|-----|-------|-----|---------|---------|-----------------|------------|
| **300** | gitea-vm | 10.10.10.220 | 2 | 4GB | nvme-mirror3 | Git server (Gitea) | - | ✅ Yes |
| **301** | trading-vm | 10.10.10.221 | 16 | 32GB | nvme-mirror3 | AI trading platform | RTX A6000 | ✅ Yes |
### LXC Containers
None on PVE2.
---
## VM Details
### 100 - TrueNAS (Storage Server)
**Purpose**: Central NAS for all file storage, NFS/SMB shares, and media libraries
**Specs**:
- **OS**: TrueNAS SCALE
- **vCPUs**: 8
- **RAM**: 32 GB
- **Storage**: nvme-mirror1 (OS), EMC storage enclosure (data pool via HBA passthrough)
- **Network**:
- Primary: 10 Gb (vmbr2)
- Secondary: Internal storage network (vmbr3 @ 10.10.20.x)
**Hardware Passthrough**:
- LSI SAS2308 HBA (for EMC enclosure drives)
- Samsung NVMe (for ZFS caching)
**ZFS Pools**:
- `vault`: Main storage pool on EMC drives
- Boot pool on passed-through NVMe
**See**: [STORAGE.md](STORAGE.md), [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md)
---
### 101 - Saltbox (Media Automation)
**Purpose**: Media server stack - Plex, Sonarr, Radarr, SABnzbd, Overseerr, etc.
**Specs**:
- **OS**: Ubuntu 22.04
- **vCPUs**: 16
- **RAM**: 16 GB
- **Storage**: nvme-mirror1
- **Network**: 10 Gb (vmbr2)
**GPU Passthrough**:
- NVIDIA TITAN RTX (for Plex hardware transcoding)
**Services**:
- Plex Media Server (plex.htsn.io)
- Sonarr, Radarr, Lidarr (TV/movie/music automation)
- SABnzbd, NZBGet (downloaders)
- Overseerr (request management)
- Tautulli (Plex stats)
- Organizr (dashboard)
- Authelia (SSO authentication)
- Traefik (reverse proxy - separate from CT 202)
**Managed By**: Saltbox Ansible playbooks
**See**: [SALTBOX.md](#) (coming soon)
---
### 105 - fs-dev (Development Environment)
**Purpose**: General development work, testing, prototyping
**Specs**:
- **OS**: Ubuntu 22.04
- **vCPUs**: 10
- **RAM**: 8 GB
- **Storage**: rpool
- **Network**: 1 Gb (vmbr0)
---
### 110 - Home Assistant (Home Automation)
**Purpose**: Smart home automation platform
**Specs**:
- **OS**: Home Assistant OS
- **vCPUs**: 2
- **RAM**: 2 GB
- **Storage**: rpool
- **Network**: 1 Gb (vmbr0)
**Access**:
- Web UI: https://homeassistant.htsn.io
- API: See [HOMEASSISTANT.md](HOMEASSISTANT.md)
**Special Notes**:
- ❌ No QEMU agent (Home Assistant OS doesn't support it)
- No SSH server by default (access via web terminal)
---
### 111 - lmdev1 (AI/LLM Development)
**Purpose**: AI model development, fine-tuning, inference
**Specs**:
- **OS**: Ubuntu 22.04
- **vCPUs**: 8
- **RAM**: 32 GB
- **Storage**: nvme-mirror1
- **Network**: 1 Gb (vmbr0)
**GPU Passthrough**:
- NVIDIA TITAN RTX (shared with Saltbox, but can be dedicated if needed)
**Installed**:
- CUDA toolkit
- Python 3.11+
- PyTorch, TensorFlow
- Hugging Face transformers
---
### 201 - Copyparty (File Sharing)
**Purpose**: Simple HTTP file sharing server
**Specs**:
- **OS**: Ubuntu 22.04
- **vCPUs**: 2
- **RAM**: 2 GB
- **Storage**: rpool
- **Network**: 1 Gb (vmbr0)
**Access**: https://copyparty.htsn.io
---
### 206 - docker-host (Docker Services)
**Purpose**: General-purpose Docker host for miscellaneous services
**Specs**:
- **OS**: Ubuntu 22.04
- **vCPUs**: 2
- **RAM**: 4 GB
- **Storage**: rpool
- **Network**: 1 Gb (vmbr0)
- **CPU**: `host` passthrough (for x86-64-v3 support)
**Services Running**:
- Excalidraw (excalidraw.htsn.io) - Whiteboard
- Happy Coder relay server (happy.htsn.io) - Self-hosted relay for Happy Coder mobile app
- Pulse (pulse.htsn.io) - Monitoring dashboard
**Docker Compose Files**: `/opt/*/docker-compose.yml`
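To see at a glance what is deployed and running there (a minimal sketch, assuming an SSH alias `docker-host` and the Compose v2 plugin):
```bash
# Iterate over /opt/*/docker-compose.yml projects and show container status for each
ssh docker-host 'for d in /opt/*/; do
  [ -f "$d/docker-compose.yml" ] || continue
  echo "== $d"
  (cd "$d" && docker compose ps)
done'
```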
---
### 300 - gitea-vm (Git Server)
**Purpose**: Self-hosted Git server
**Specs**:
- **OS**: Ubuntu 22.04
- **vCPUs**: 2
- **RAM**: 4 GB
- **Storage**: nvme-mirror3 (PVE2)
- **Network**: 1 Gb (vmbr0)
**Access**: https://git.htsn.io
**Repositories**:
- homelab-docs (this documentation)
- Personal projects
- Private repos
---
### 301 - trading-vm (AI Trading Platform)
**Purpose**: Algorithmic trading system with AI models
**Specs**:
- **OS**: Ubuntu 22.04
- **vCPUs**: 16
- **RAM**: 32 GB
- **Storage**: nvme-mirror3 (PVE2)
- **Network**: 1 Gb (vmbr0)
**GPU Passthrough**:
- NVIDIA RTX A6000 (300W TDP, 48GB VRAM)
**Software**:
- Trading algorithms
- AI models for market prediction
- Real-time data feeds
- Backtesting infrastructure
---
## LXC Container Details
### 200 - Pi-hole (DNS & Ad Blocking)
**Purpose**: Network-wide DNS server and ad blocker
**Type**: LXC (unprivileged)
**OS**: Ubuntu 22.04
**IP**: 10.10.10.10
**Storage**: rpool
**Access**:
- Web UI: http://10.10.10.10/admin
- Public URL: https://pihole.htsn.io
**Configuration**:
- Upstream DNS: Cloudflare (1.1.1.1)
- DHCP: Disabled (router handles DHCP)
- Interface: All interfaces
**Usage**: Set router DNS to 10.10.10.10 for network-wide ad blocking
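A quick functional check from any LAN machine (the blocked-domain response depends on Pi-hole's blocking mode, typically `0.0.0.0`):
```bash
# Normal resolution through Pi-hole
dig @10.10.10.10 example.com +short
# A known ad/tracking domain should return the block address (or no answer)
dig @10.10.10.10 doubleclick.net +short
```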
---
### 202 - Traefik (Reverse Proxy)
**Purpose**: Primary reverse proxy for all public-facing services
**Type**: LXC (unprivileged)
**OS**: Ubuntu 22.04
**IP**: 10.10.10.250
**Storage**: rpool
**Configuration**: `/etc/traefik/`
**Dynamic Configs**: `/etc/traefik/conf.d/*.yaml`
**See**: [TRAEFIK.md](TRAEFIK.md) for complete documentation
**⚠️ Important**: This is the PRIMARY Traefik instance. Do NOT confuse with Saltbox's Traefik (VM 101).
---
### 205 - FindShyt (Custom App)
**Purpose**: Custom application (details TBD)
**Type**: LXC (unprivileged)
**OS**: Ubuntu 22.04
**IP**: 10.10.10.8
**Storage**: rpool
**Access**: https://findshyt.htsn.io
---
## VM Startup Order & Dependencies
### Power-On Sequence
When servers boot (after power failure or restart), VMs/CTs start in this order:
#### PVE (10.10.10.120)
| Order | Wait | VMID | Name | Reason |
|-------|------|------|------|--------|
| **1** | 30s | 100 | TrueNAS | ⚠️ Storage must start first - other VMs depend on NFS |
| **2** | 60s | 101 | Saltbox | Depends on TrueNAS NFS mounts for media |
| **3** | 10s | 105, 110, 111, 201, 206 | Other VMs | General VMs, no critical dependencies |
| **4** | 5s | 200, 202, 205 | Containers | Lightweight, start quickly |
**Configure startup order** (already set):
```bash
# View current config
ssh pve 'qm config 100 | grep -E "startup|onboot"'
# Set startup order (example)
ssh pve 'qm set 100 --onboot 1 --startup order=1,up=30'
ssh pve 'qm set 101 --onboot 1 --startup order=2,up=60'
```
#### PVE2 (10.10.10.102)
| Order | Wait | VMID | Name |
|-------|------|------|------|
| **1** | 10s | 300, 301 | All VMs |
**Less critical** - no dependencies between PVE2 VMs.
---
## Resource Allocation Summary
### Total Allocated (PVE)
| Resource | Allocated | Physical | % Used |
|----------|-----------|----------|--------|
| **vCPUs** | 56 | 64 (32 cores × 2 threads) | 88% |
| **RAM** | 98 GB | 128 GB | 77% |
**Note**: vCPU overcommit is acceptable (VMs rarely use all cores simultaneously)
### Total Allocated (PVE2)
| Resource | Allocated | Physical | % Used |
|----------|-----------|----------|--------|
| **vCPUs** | 18 | 64 | 28% |
| **RAM** | 36 GB | 128 GB | 28% |
**PVE2** has significant headroom for additional VMs.
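These totals can be re-derived from the VM configs when allocations change. A minimal sketch (VMs only, containers excluded; run directly on the Proxmox host; assumes single-socket VMs so vCPUs = cores):
```bash
# Sum allocated cores and RAM (MB) across all VMs on this node
for id in $(qm list | awk 'NR>1 {print $1}'); do
  qm config "$id" | awk -F': ' 'BEGIN{c=1; m=512} /^cores:/ {c=$2} /^memory:/ {m=$2} END {print c, m}'
done | awk '{cores+=$1; mem+=$2} END {printf "Allocated: %d vCPUs, %.1f GB RAM\n", cores, mem/1024}'
```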
---
## Adding a New VM
### Quick Template
```bash
# Create VM
ssh pve 'qm create VMID \
--name myvm \
--memory 4096 \
--cores 2 \
--net0 virtio,bridge=vmbr0 \
--scsihw virtio-scsi-pci \
--scsi0 nvme-mirror1:32 \
--boot order=scsi0 \
--ostype l26 \
--agent enabled=1'
# Attach ISO for installation
ssh pve 'qm set VMID --ide2 local:iso/ubuntu-22.04.iso,media=cdrom'
# Start VM
ssh pve 'qm start VMID'
# Access console
ssh pve 'qm vncproxy VMID' # Then connect with VNC client
# Or via Proxmox web UI
```
### Cloud-Init Template (Faster)
Use cloud-init for automated VM deployment:
```bash
# Download cloud image
ssh pve 'wget https://cloud-images.ubuntu.com/releases/22.04/release/ubuntu-22.04-server-cloudimg-amd64.img -O /var/lib/vz/template/iso/ubuntu-22.04-cloud.img'
# Create VM
ssh pve 'qm create VMID --name myvm --memory 4096 --cores 2 --net0 virtio,bridge=vmbr0'
# Import disk
ssh pve 'qm importdisk VMID /var/lib/vz/template/iso/ubuntu-22.04-cloud.img nvme-mirror1'
# Attach disk
ssh pve 'qm set VMID --scsi0 nvme-mirror1:vm-VMID-disk-0'
# Add cloud-init drive
ssh pve 'qm set VMID --ide2 nvme-mirror1:cloudinit'
# Set boot disk
ssh pve 'qm set VMID --boot order=scsi0'
# Configure cloud-init (user, SSH key, network)
ssh pve 'qm set VMID --ciuser hutson --sshkeys ~/.ssh/homelab.pub --ipconfig0 ip=10.10.10.XXX/24,gw=10.10.10.1'
# Enable QEMU agent
ssh pve 'qm set VMID --agent enabled=1'
# Resize disk (cloud images are small by default)
ssh pve 'qm resize VMID scsi0 +30G'
# Start VM
ssh pve 'qm start VMID'
```
**Cloud-init VMs boot ready-to-use** with SSH keys, static IP, and user configured.
---
## Adding a New LXC Container
```bash
# Download template (if not already downloaded)
ssh pve 'pveam update'
ssh pve 'pveam available | grep ubuntu'
ssh pve 'pveam download local ubuntu-22.04-standard_22.04-1_amd64.tar.zst'
# Create container
ssh pve 'pct create CTID local:vztmpl/ubuntu-22.04-standard_22.04-1_amd64.tar.zst \
--hostname mycontainer \
--memory 2048 \
--cores 2 \
--net0 name=eth0,bridge=vmbr0,ip=10.10.10.XXX/24,gw=10.10.10.1 \
--rootfs local-zfs:8 \
--unprivileged 1 \
--features nesting=1 \
--start 1'
# Set root password
ssh pve 'pct exec CTID -- passwd'
# Add SSH key
ssh pve 'pct exec CTID -- mkdir -p /root/.ssh'
ssh pve 'pct exec CTID -- bash -c "echo \"$(cat ~/.ssh/homelab.pub)\" >> /root/.ssh/authorized_keys"'
ssh pve 'pct exec CTID -- bash -c "chmod 700 /root/.ssh && chmod 600 /root/.ssh/authorized_keys"'
```
---
## GPU Passthrough Configuration
### Current GPU Assignments
| GPU | Location | Passed To | VMID | Purpose |
|-----|----------|-----------|------|---------|
| **NVIDIA Quadro P2000** | PVE | - | - | Proxmox host (Plex transcoding via driver) |
| **NVIDIA TITAN RTX** | PVE | saltbox, lmdev1 | 101, 111 | Media transcoding + AI dev (shared) |
| **NVIDIA RTX A6000** | PVE2 | trading-vm | 301 | AI trading (dedicated) |
### How to Pass GPU to VM
1. **Identify GPU PCI ID**:
```bash
ssh pve 'lspci | grep -i nvidia'
# Example output:
# 81:00.0 VGA compatible controller: NVIDIA Corporation TU102 [TITAN RTX] (rev a1)
# 81:00.1 Audio device: NVIDIA Corporation TU102 High Definition Audio Controller (rev a1)
```
2. **Pass GPU to VM** (include both VGA and Audio):
```bash
ssh pve 'qm set VMID -hostpci0 81:00.0,pcie=1'
# If multi-function device (GPU + Audio), use:
ssh pve 'qm set VMID -hostpci0 81:00,pcie=1'
```
3. **Configure VM for GPU**:
```bash
# Set machine type to q35
ssh pve 'qm set VMID --machine q35'
# Set BIOS to OVMF (UEFI)
ssh pve 'qm set VMID --bios ovmf'
# Add EFI disk
ssh pve 'qm set VMID --efidisk0 nvme-mirror1:1,format=raw,efitype=4m,pre-enrolled-keys=1'
```
4. **Reboot VM** and install NVIDIA drivers inside the VM
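A quick way to verify the passthrough after the reboot (adjust the PCI address to match your GPU):
```bash
# Inside the VM: the GPU should appear on the PCI bus and the driver should load
lspci | grep -i nvidia
nvidia-smi
# On the host: while the VM runs, the device should be bound to vfio-pci, not nvidia/nouveau
ssh pve 'lspci -nnk -s 81:00.0 | grep "Kernel driver in use"'
```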
**See**: [GPU-PASSTHROUGH.md](#) (coming soon) for detailed guide
---
## Backup Priority
See [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md) for complete backup plan.
### Critical VMs (Must Backup)
| Priority | VMID | Name | Reason |
|----------|------|------|--------|
| 🔴 **CRITICAL** | 100 | truenas | All storage lives here - catastrophic if lost |
| 🟡 **HIGH** | 101 | saltbox | Complex media stack config |
| 🟡 **HIGH** | 110 | homeassistant | Home automation config |
| 🟡 **HIGH** | 300 | gitea-vm | Git repositories (code, docs) |
| 🟡 **HIGH** | 301 | trading-vm | Trading algorithms and AI models |
### Medium Priority
| VMID | Name | Notes |
|------|------|-------|
| 200 | pihole | Easy to rebuild, but DNS config valuable |
| 202 | traefik | Config files backed up separately |
### Low Priority (Ephemeral/Rebuildable)
| VMID | Name | Notes |
|------|------|-------|
| 105 | fs-dev | Development - code is in Git |
| 111 | lmdev1 | Ephemeral development |
| 201 | copyparty | Simple app, easy to redeploy |
| 206 | docker-host | Docker Compose files backed up separately |
---
## Quick Reference Commands
```bash
# List all VMs
ssh pve 'qm list'
ssh pve2 'qm list'
# List all containers
ssh pve 'pct list'
# Start/stop VM
ssh pve 'qm start VMID'
ssh pve 'qm stop VMID'
ssh pve 'qm shutdown VMID' # Graceful
# Start/stop container
ssh pve 'pct start CTID'
ssh pve 'pct stop CTID'
ssh pve 'pct shutdown CTID' # Graceful
# VM console
ssh pve 'qm terminal VMID'
# Container console
ssh pve 'pct enter CTID'
# Clone VM
ssh pve 'qm clone VMID NEW_VMID --name newvm'
# Delete VM
ssh pve 'qm destroy VMID'
# Delete container
ssh pve 'pct destroy CTID'
```
---
## Related Documentation
- [STORAGE.md](STORAGE.md) - Storage pool assignments
- [SSH-ACCESS.md](SSH-ACCESS.md) - How to access VMs
- [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md) - VM backup strategy
- [POWER-MANAGEMENT.md](POWER-MANAGEMENT.md) - VM resource optimization
- [NETWORK.md](NETWORK.md) - Which bridge to use for new VMs
---
**Last Updated**: 2025-12-22