From 56b82df4972240bb2471df6cdef0e959c98bbba9 Mon Sep 17 00:00:00 2001 From: Hutson Date: Tue, 23 Dec 2025 00:34:21 -0500 Subject: [PATCH] Complete Phase 2 documentation: Add HARDWARE, SERVICES, MONITORING, MAINTENANCE MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Phase 2 documentation implementation: - Created HARDWARE.md: Complete hardware inventory (servers, GPUs, storage, network cards) - Created SERVICES.md: Service inventory with URLs, credentials, health checks (25+ services) - Created MONITORING.md: Health monitoring recommendations, alert setup, implementation plan - Created MAINTENANCE.md: Regular procedures, update schedules, testing checklists - Updated README.md: Added all Phase 2 documentation links - Updated CLAUDE.md: Cleaned up to quick reference only (1340→377 lines) All detailed content now in specialized documentation files with cross-references. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 --- BACKUP-STRATEGY.md | 358 ++++++++++++ CLAUDE.md | 1273 ++++++++----------------------------------- HARDWARE.md | 455 ++++++++++++++++ HOMEASSISTANT.md | 36 ++ MAINTENANCE.md | 618 +++++++++++++++++++++ MONITORING.md | 546 +++++++++++++++++++ POWER-MANAGEMENT.md | 509 +++++++++++++++++ README.md | 148 +++++ SERVICES.md | 591 ++++++++++++++++++++ SSH-ACCESS.md | 464 ++++++++++++++++ STORAGE.md | 510 +++++++++++++++++ TRAEFIK.md | 672 +++++++++++++++++++++++ UPS.md | 605 ++++++++++++++++++++ VMS.md | 579 ++++++++++++++++++++ 14 files changed, 6328 insertions(+), 1036 deletions(-) create mode 100644 BACKUP-STRATEGY.md create mode 100644 HARDWARE.md create mode 100644 MAINTENANCE.md create mode 100644 MONITORING.md create mode 100644 POWER-MANAGEMENT.md create mode 100644 README.md create mode 100644 SERVICES.md create mode 100644 SSH-ACCESS.md create mode 100644 STORAGE.md create mode 100644 TRAEFIK.md create mode 100644 UPS.md create mode 100644 VMS.md diff --git a/BACKUP-STRATEGY.md b/BACKUP-STRATEGY.md new file mode 100644 index 0000000..d1d93b8 --- /dev/null +++ b/BACKUP-STRATEGY.md @@ -0,0 +1,358 @@ +# Backup Strategy + +## 🚨 Current Status: CRITICAL GAPS IDENTIFIED + +This document outlines the backup strategy for the homelab infrastructure. **As of 2025-12-22, there are significant gaps in backup coverage that need to be addressed.** + +## Executive Summary + +### What We Have ✅ +- **Syncthing**: File synchronization across 5+ devices +- **ZFS on TrueNAS**: Copy-on-write filesystem with snapshot capability (not yet configured) +- **Proxmox**: Built-in backup capabilities (not yet configured) + +### What We DON'T Have 🚨 +- ❌ No documented VM/CT backups +- ❌ No ZFS snapshot schedule +- ❌ No offsite backups +- ❌ No disaster recovery plan +- ❌ No tested restore procedures +- ❌ No configuration backups + +**Risk Level**: HIGH - A catastrophic failure could result in significant data loss. + +--- + +## Current State Analysis + +### Syncthing (File Synchronization) + +**What it is**: Real-time file sync across devices +**What it is NOT**: A backup solution + +| Folder | Devices | Size | Protected? 
| +|--------|---------|------|------------| +| documents | Mac Mini, MacBook, TrueNAS, Windows PC, Phone | 11 GB | ⚠️ Sync only | +| downloads | Mac Mini, TrueNAS | 38 GB | ⚠️ Sync only | +| pictures | Mac Mini, MacBook, TrueNAS, Phone | Unknown | ⚠️ Sync only | +| notes | Mac Mini, MacBook, TrueNAS, Phone | Unknown | ⚠️ Sync only | +| config | Mac Mini, MacBook, TrueNAS | Unknown | ⚠️ Sync only | + +**Limitations**: +- ❌ Accidental deletion → deleted everywhere +- ❌ Ransomware/corruption → spreads everywhere +- ❌ No point-in-time recovery +- ❌ No version history (unless file versioning enabled - not documented) + +**Verdict**: Syncthing provides redundancy and availability, NOT backup protection. + +### ZFS on TrueNAS (Potential Backup Target) + +**Current Status**: ❓ Unknown - snapshots may or may not be configured + +**Needs Investigation**: +```bash +# Check if snapshots exist +ssh truenas 'zfs list -t snapshot' + +# Check if automated snapshots are configured +ssh truenas 'cat /etc/cron.d/zfs-auto-snapshot' || echo "Not configured" + +# Check snapshot schedule via TrueNAS API/UI +``` + +**If configured**, ZFS snapshots provide: +- ✅ Point-in-time recovery +- ✅ Protection against accidental deletion +- ✅ Fast rollback capability +- ⚠️ Still single location (no offsite protection) + +### Proxmox VM/CT Backups + +**Current Status**: ❓ Unknown - no backup jobs documented + +**Needs Investigation**: +```bash +# Check backup configuration +ssh pve 'pvesh get /cluster/backup' + +# Check if any backups exist +ssh pve 'ls -lh /var/lib/vz/dump/' +ssh pve2 'ls -lh /var/lib/vz/dump/' +``` + +**Critical VMs Needing Backup**: +| VM/CT | VMID | Priority | Notes | +|-------|------|----------|-------| +| TrueNAS | 100 | 🔴 CRITICAL | All storage lives here | +| Saltbox | 101 | 🟡 HIGH | Media stack, complex config | +| homeassistant | 110 | 🟡 HIGH | Home automation config | +| gitea-vm | 300 | 🟡 HIGH | Git repositories | +| pihole | 200 | 🟢 MEDIUM | DNS config (easy to rebuild) | +| traefik | 202 | 🟢 MEDIUM | Reverse proxy config | +| trading-vm | 301 | 🟡 HIGH | AI trading platform | +| lmdev1 | 111 | 🟢 LOW | Development (ephemeral) | + +--- + +## Recommended Backup Strategy + +### Tier 1: Local Snapshots (IMPLEMENT IMMEDIATELY) + +**ZFS Snapshots on TrueNAS** + +Schedule automatic snapshots for all datasets: + +| Dataset | Frequency | Retention | +|---------|-----------|-----------| +| vault/documents | Every 15 min | 1 hour | +| vault/documents | Hourly | 24 hours | +| vault/documents | Daily | 30 days | +| vault/documents | Weekly | 12 weeks | +| vault/documents | Monthly | 12 months | + +**Implementation**: +```bash +# Via TrueNAS UI: Storage → Snapshots → Add +# Or via CLI: +ssh truenas 'zfs snapshot vault/documents@daily-$(date +%Y%m%d)' +``` + +**Proxmox VM Backups** + +Configure weekly backups to local storage: + +```bash +# Create backup job via Proxmox UI: +# Datacenter → Backup → Add +# - Schedule: Weekly (Sunday 2 AM) +# - Storage: local-zfs or nvme-mirror1 +# - Mode: Snapshot (fast) +# - Retention: 4 backups +``` + +**Or via CLI**: +```bash +ssh pve 'pvesh create /cluster/backup --schedule "sun 02:00" --storage local-zfs --mode snapshot --prune-backups keep-last=4' +``` + +### Tier 2: Offsite Backups (CRITICAL GAP) + +**Option A: Cloud Storage (Recommended)** + +Use **rclone** or **restic** to sync critical data to cloud: + +| Provider | Cost | Pros | Cons | +|----------|------|------|------| +| Backblaze B2 | $6/TB/mo | Cheap, reliable | Egress fees | +| AWS S3 Glacier | $4/TB/mo | Very 
cheap storage | Slow retrieval | +| Wasabi | $6.99/TB/mo | No egress fees | Minimum 90-day retention | + +**Implementation Example (Backblaze B2)**: +```bash +# Install on TrueNAS +ssh truenas 'pkg install rclone restic' + +# Configure B2 +rclone config # Follow prompts for B2 + +# Daily backup critical folders +0 3 * * * rclone sync /mnt/vault/documents b2:homelab-backup/documents --transfers 4 +``` + +**Option B: Offsite TrueNAS Replication** + +- Set up second TrueNAS at friend/family member's house +- Use ZFS replication to sync snapshots +- Requires: Static IP or Tailscale, trust + +**Option C: USB Drive Rotation** + +- Weekly backup to external USB drive +- Rotate 2-3 drives (one always offsite) +- Manual but simple + +### Tier 3: Configuration Backups + +**Proxmox Configuration** + +```bash +# Backup /etc/pve (configs are already in cluster filesystem) +# But also backup to external location: +ssh pve 'tar czf /tmp/pve-config-$(date +%Y%m%d).tar.gz /etc/pve /etc/network/interfaces /etc/systemd/system/*.service' + +# Copy to safe location +scp pve:/tmp/pve-config-*.tar.gz ~/Backups/proxmox/ +``` + +**VM-Specific Configs** + +- Traefik configs: `/etc/traefik/` on CT 202 +- Saltbox configs: `/srv/git/saltbox/` on VM 101 +- Home Assistant: `/config/` on VM 110 + +**Script to backup all configs**: +```bash +#!/bin/bash +# Save as ~/bin/backup-homelab-configs.sh + +DATE=$(date +%Y%m%d) +BACKUP_DIR=~/Backups/homelab-configs/$DATE + +mkdir -p $BACKUP_DIR + +# Proxmox configs +ssh pve 'tar czf -' /etc/pve /etc/network > $BACKUP_DIR/pve-config.tar.gz +ssh pve2 'tar czf -' /etc/pve /etc/network > $BACKUP_DIR/pve2-config.tar.gz + +# Traefik +ssh pve 'pct exec 202 -- tar czf -' /etc/traefik > $BACKUP_DIR/traefik-config.tar.gz + +# Saltbox +ssh saltbox 'tar czf -' /srv/git/saltbox > $BACKUP_DIR/saltbox-config.tar.gz + +# Home Assistant +ssh pve 'qm guest exec 110 -- tar czf -' /config > $BACKUP_DIR/homeassistant-config.tar.gz + +echo "Configs backed up to $BACKUP_DIR" +``` + +--- + +## Disaster Recovery Scenarios + +### Scenario 1: Single VM Failure + +**Impact**: Medium +**Recovery Time**: 30-60 minutes + +1. Restore from Proxmox backup: + ```bash + ssh pve 'qmrestore /path/to/backup.vma.zst VMID' + ``` +2. Start VM and verify +3. 
Update IP if needed + +### Scenario 2: TrueNAS Failure + +**Impact**: CATASTROPHIC (all storage lost) +**Recovery Time**: Unknown - NO PLAN + +**Current State**: 🚨 NO RECOVERY PLAN +**Needed**: +- Offsite backup of critical datasets +- Documented ZFS pool creation steps +- Share configuration export + +### Scenario 3: Complete PVE Server Failure + +**Impact**: SEVERE +**Recovery Time**: 4-8 hours + +**Current State**: ⚠️ PARTIALLY RECOVERABLE +**Needed**: +- VM backups stored on TrueNAS or PVE2 +- Proxmox reinstall procedure +- Network config documentation + +### Scenario 4: Complete Site Disaster (Fire/Flood) + +**Impact**: TOTAL LOSS +**Recovery Time**: Unknown + +**Current State**: 🚨 NO RECOVERY PLAN +**Needed**: +- Offsite backups (cloud or physical) +- Critical data prioritization +- Restore procedures + +--- + +## Action Plan + +### Immediate (Next 7 Days) + +- [ ] **Audit existing backups**: Check if ZFS snapshots or Proxmox backups exist + ```bash + ssh truenas 'zfs list -t snapshot' + ssh pve 'ls -lh /var/lib/vz/dump/' + ``` + +- [ ] **Enable ZFS snapshots**: Configure via TrueNAS UI for critical datasets + +- [ ] **Configure Proxmox backup jobs**: Weekly backups of critical VMs (100, 101, 110, 300) + +- [ ] **Test restore**: Pick one VM, back it up, restore it to verify process works + +### Short-term (Next 30 Days) + +- [ ] **Set up offsite backup**: Choose provider (Backblaze B2 recommended) + +- [ ] **Install backup tools**: rclone or restic on TrueNAS + +- [ ] **Configure daily cloud sync**: Critical folders to cloud storage + +- [ ] **Document restore procedures**: Step-by-step guides for each scenario + +### Long-term (Next 90 Days) + +- [ ] **Implement monitoring**: Alerts for backup failures + +- [ ] **Quarterly restore test**: Verify backups actually work + +- [ ] **Backup rotation policy**: Automate old backup cleanup + +- [ ] **Configuration backup automation**: Weekly cron job + +--- + +## Monitoring & Validation + +### Backup Health Checks + +```bash +# Check last ZFS snapshot +ssh truenas 'zfs list -t snapshot -o name,creation -s creation | tail -5' + +# Check Proxmox backup status +ssh pve 'pvesh get /cluster/backup-info/not-backed-up' + +# Check cloud sync status (if using rclone) +ssh truenas 'rclone ls b2:homelab-backup | wc -l' +``` + +### Alerts to Set Up + +- Email alert if no snapshot created in 24 hours +- Email alert if Proxmox backup fails +- Email alert if cloud sync fails +- Weekly backup status report + +--- + +## Cost Estimate + +**Monthly Backup Costs**: + +| Component | Cost | Notes | +|-----------|------|-------| +| Local storage (already owned) | $0 | Using existing TrueNAS | +| Proxmox backups (local) | $0 | Using existing storage | +| Cloud backup (1 TB) | $6-10/mo | Backblaze B2 or Wasabi | +| **Total** | **~$10/mo** | Minimal cost for peace of mind | + +**One-time**: +- External USB drives (3x 4TB) | ~$300 | Optional, for rotation backup + +--- + +## Related Documentation + +- [STORAGE.md](STORAGE.md) - ZFS pool layouts and capacity +- [VMS.md](VMS.md) - VM inventory and prioritization +- [DISASTER-RECOVERY.md](#) - Recovery procedures (coming soon) + +--- + +**Last Updated**: 2025-12-22 +**Status**: 🚨 CRITICAL GAPS - IMMEDIATE ACTION REQUIRED diff --git a/CLAUDE.md b/CLAUDE.md index dcc8ba9..7c805c5 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -1,1176 +1,377 @@ -# Homelab Infrastructure +# Homelab Infrastructure - Quick Reference + +**Start here**: [README.md](README.md) - Documentation index and overview + +This is your **quick reference 
guide** for common homelab tasks. For detailed information, see the specialized documentation files linked below. + +--- ## Quick Reference - Common Tasks -| Task | Section | Quick Command | -|------|---------|---------------| -| **Add new public service** | [Reverse Proxy](#reverse-proxy-architecture-traefik) | Create Traefik config + Cloudflare DNS | -| **Add Cloudflare DNS** | [Cloudflare API](#cloudflare-api-access) | `curl -X POST cloudflare.com/...` | +| Task | Documentation | Quick Command | +|------|--------------|---------------| +| **Add new public service** | [TRAEFIK.md](TRAEFIK.md) | Create Traefik config + Cloudflare DNS | +| **Check UPS status** | [UPS.md](UPS.md) | `ssh pve 'upsc cyberpower@localhost'` | | **Check server temps** | [Temperature Check](#server-temperature-check) | `ssh pve 'grep Tctl ...'` | -| **Syncthing issues** | [Troubleshooting](#troubleshooting-runbooks) | Check API connections | -| **SSL cert issues** | [Traefik DNS Challenge](#ssl-certificates) | Use `cloudflare` resolver | +| **Syncthing issues** | [SYNCTHING.md](SYNCTHING.md) | Check API connections | +| **VM/CT management** | [VMS.md](VMS.md) | `ssh pve 'qm list'` | +| **Storage issues** | [STORAGE.md](STORAGE.md) | `ssh pve 'zpool status'` | +| **SSH access** | [SSH-ACCESS.md](SSH-ACCESS.md) | Use host aliases in `~/.ssh/config` | +| **Power optimization** | [POWER-MANAGEMENT.md](POWER-MANAGEMENT.md) | CPU governors, GPU states | +| **Backup strategy** | [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md) | ⚠️ CRITICAL GAPS | -**Key Credentials (see sections for full details):** -- Cloudflare: `cloudflare@htsn.io` / API Key in [Cloudflare API](#cloudflare-api-access) +**Key Credentials:** - SSH Password: `GrilledCh33s3#` -- Traefik: CT 202 @ 10.10.10.250 +- Cloudflare: `cloudflare@htsn.io` / `849ebefd163d2ccdec25e49b3e1b3fe2cdadc` +- See individual docs for service-specific credentials --- ## Role -You are the **Homelab Assistant** - a Claude Code session dedicated to managing and maintaining Hutson's home infrastructure. Your responsibilities include: +You are the **Homelab Assistant** - a Claude Code session dedicated to managing and maintaining Hutson's home infrastructure. -- **Infrastructure Management**: Proxmox servers, VMs, containers, networking -- **File Sync**: Syncthing configuration across all devices (Mac Mini, MacBook, Windows PC, TrueNAS, Android) -- **Network Administration**: Router config, SSH access, Tailscale, device management -- **Power Optimization**: CPU governors, GPU power states, service tuning -- **Documentation**: Keep CLAUDE.md, SYNCTHING.md, and SHELL-ALIASES.md up to date -- **Automation**: Shell aliases, startup scripts, scheduled tasks +**Responsibilities:** +- Infrastructure Management (Proxmox, VMs, containers) +- File Sync (Syncthing across all devices) +- Network Administration +- Power Optimization +- Documentation (keep all docs current) +- Automation (shell aliases, scripts, scheduled tasks) -You have full access to all homelab devices via SSH and APIs. Use this context to help troubleshoot, configure, and optimize the infrastructure. 
+**Full access via**: SSH keys, APIs, QEMU guest agent -### Proactive Behaviors +--- -When the user mentions issues or asks questions, proactively: -- **"sync not working"** → Check Syncthing status on ALL devices, identify which is offline -- **"device offline"** → Ping both local and Tailscale IPs, check if service is running -- **"slow"** → Check CPU usage, running processes, Syncthing rescan activity +## Proactive Behaviors + +When the user mentions issues or asks questions: +- **"sync not working"** → Check Syncthing on ALL devices, identify which is offline +- **"device offline"** → Ping local + Tailscale IPs, check if service running +- **"slow"** → Check CPU usage, processes, Syncthing rescan activity - **"check status"** → Run full health check across all systems -- **"something's wrong"** → Run diagnostics on likely culprits based on context +- **"something's wrong"** → Run diagnostics on likely culprits -### Quick Health Checks +--- -Run these to get a quick overview of the homelab: +## Quick Health Checks ```bash # === FULL HEALTH CHECK === + # Syncthing connections (Mac Mini) -curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" "http://127.0.0.1:8384/rest/system/connections" | python3 -c "import sys,json; d=json.load(sys.stdin)['connections']; [print(f\"{v.get('name',k[:7])}: {'UP' if v['connected'] else 'DOWN'}\") for k,v in d.items()]" +curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \ + "http://127.0.0.1:8384/rest/system/connections" | \ + python3 -c "import sys,json; d=json.load(sys.stdin)['connections']; \ + [print(f\"{v.get('name',k[:7])}: {'UP' if v['connected'] else 'DOWN'}\") for k,v in d.items()]" # Proxmox VMs ssh pve 'qm list' 2>/dev/null || echo "PVE: unreachable" ssh pve2 'qm list' 2>/dev/null || echo "PVE2: unreachable" -# Ping critical devices +# Critical devices ping -c 1 -W 1 10.10.10.200 >/dev/null && echo "TrueNAS: UP" || echo "TrueNAS: DOWN" ping -c 1 -W 1 10.10.10.1 >/dev/null && echo "Router: UP" || echo "Router: DOWN" -# Check Windows PC Syncthing (often goes offline) -nc -zw1 10.10.10.150 22000 && echo "Windows Syncthing: UP" || echo "Windows Syncthing: DOWN" +# Windows PC Syncthing +nc -zw1 10.10.10.150 22000 && echo "Windows: UP" || echo "Windows: DOWN" ``` -### Troubleshooting Runbooks +--- -| Symptom | Check | Fix | -|---------|-------|-----| -| Device not syncing | `curl Syncthing API → connections` | Check if device online, restart Syncthing | -| Windows PC offline | `ping 10.10.10.150` then `nc -z 22000` | SSH in, `Start-ScheduledTask -TaskName "Syncthing"` | -| Phone not syncing | Phone Syncthing app in background? | User must open app, keep screen on | -| High CPU on TrueNAS | Syncthing rescan? KSM? | Check rescan intervals, disable KSM | -| VM won't start | Storage available? RAM free? | `ssh pve 'qm start VMID'`, check logs | -| Tailscale offline | `tailscale status` | `tailscale up` or restart service | -| Tailscale no subnet access | Check subnet routers | Verify pve or ucg-fiber advertising routes | -| Sync stuck at X% | Folder errors? Conflicts? 
| Check `rest/folder/errors?folder=NAME` | -| Server running hot | Check KSM, check CPU processes | Disable KSM, identify runaway process | -| Storage enclosure loud | Check fan speed via SES | See [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md) | -| Drives not detected | Check SAS link, LCC status | Switch LCC, rescan SCSI hosts | +## Troubleshooting Runbooks + +| Symptom | Check | Fix | Docs | +|---------|-------|-----|------| +| Device not syncing | `curl Syncthing API` | Restart Syncthing | [SYNCTHING.md](SYNCTHING.md) | +| VM won't start | Storage/RAM available? | `ssh pve 'qm start VMID'` | [VMS.md](VMS.md) | +| Server running hot | Check KSM, CPU processes | Disable KSM | [POWER-MANAGEMENT.md](POWER-MANAGEMENT.md) | +| Storage enclosure loud | Check fan speed via SES | Switch LCC | [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md) | +| UPS on battery | Check runtime | Monitor shutdown script | [UPS.md](UPS.md) | +| Service unreachable | Check Traefik config | Fix routing | [TRAEFIK.md](TRAEFIK.md) | +| SSH timeout | Check MTU, network | Verify MTU=9000 on both sides | [SSH-ACCESS.md](SSH-ACCESS.md) | + +--- + +## Server Temperature Check -### Server Temperature Check ```bash # Check temps on both servers (Threadripper PRO max safe: 90°C Tctl) -ssh pve 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE Tctl: $(($(cat $f)/1000))°C"; fi; done' -ssh pve2 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE2 Tctl: $(($(cat $f)/1000))°C"; fi; done' -``` -**Healthy temps**: 70-80°C under load. **Warning**: >85°C. **Throttle**: 90°C. +ssh pve 'for f in /sys/class/hwmon/hwmon*/temp*_input; do \ + label=$(cat ${f%_input}_label 2>/dev/null); \ + if [ "$label" = "Tctl" ]; then echo "PVE Tctl: $(($(cat $f)/1000))°C"; fi; done' -### Service Dependencies +ssh pve2 'for f in /sys/class/hwmon/hwmon*/temp*_input; do \ + label=$(cat ${f%_input}_label 2>/dev/null); \ + if [ "$label" = "Tctl" ]; then echo "PVE2 Tctl: $(($(cat $f)/1000))°C"; fi; done' +``` + +**Healthy**: 70-80°C under load | **Warning**: >85°C | **Throttle**: 90°C + +--- + +## Service Dependencies ``` TrueNAS (10.10.10.200) -├── Central Syncthing hub - if down, sync breaks between devices +├── Central Syncthing hub - if down, sync breaks ├── NFS/SMB shares for VMs └── Media storage for Plex PiHole (CT 200) -└── DNS for entire network - if down, name resolution fails +└── DNS for entire network Traefik (CT 202) -└── Reverse proxy - if down, external access to services fails +└── Reverse proxy - external access Router (10.10.10.1) -└── Everything - gateway for all traffic +└── Gateway for all traffic ``` -### API Quick Reference +--- + +## API Quick Reference | Service | Device | Endpoint | Auth | |---------|--------|----------|------| | Syncthing | Mac Mini | `http://127.0.0.1:8384/rest/` | `X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5` | -| Syncthing | MacBook | `http://127.0.0.1:8384/rest/` (via SSH) | `X-API-Key: qYkNdVLwy9qZZZ6MqnJr7tHX7KKdxGMJ` | +| Syncthing | MacBook | `http://127.0.0.1:8384/rest/` | `X-API-Key: qYkNdVLwy9qZZZ6MqnJr7tHX7KKdxGMJ` | | Syncthing | Phone | `https://10.10.10.54:8384/rest/` | `X-API-Key: Xxz3jDT4akUJe6psfwZsbZwG2LhfZuDM` | -| Proxmox | PVE | `https://10.10.10.120:8006/api2/json/` | SSH key auth | -| Proxmox | PVE2 | `https://10.10.10.102:8006/api2/json/` | SSH key auth | +| Proxmox | PVE/PVE2 | `https://10.10.10.120:8006/api2/json/` | SSH key auth | -### Common 
Maintenance Tasks +**See**: [SYNCTHING.md](SYNCTHING.md), [HOMEASSISTANT.md](HOMEASSISTANT.md) for more APIs -When user asks for maintenance or you notice issues: +--- -1. **Check Syncthing sync status** - Any folders behind? Errors? -2. **Verify all devices connected** - Run connection check -3. **Check disk space** - `ssh pve 'df -h'`, `ssh pve2 'df -h'` -4. **Review ZFS pool health** - `ssh pve 'zpool status'` -5. **Check for stuck processes** - High CPU? Memory pressure? -6. **Verify backups** - Are critical folders syncing? - -### Emergency Commands +## Emergency Commands ```bash -# Restart VM on Proxmox +# Restart VM ssh pve 'qm stop VMID && qm start VMID' -# Check what's using CPU +# Check CPU usage ssh pve 'ps aux --sort=-%cpu | head -10' -# Check ZFS pool status (via QEMU agent) +# Check ZFS pool (via QEMU agent) ssh pve 'qm guest exec 100 -- bash -c "zpool status vault"' -# Check EMC enclosure fans -ssh pve 'qm guest exec 100 -- bash -c "sg_ses --index=coo,-1 --get=speed_code /dev/sg15"' - # Force Syncthing rescan -curl -X POST "http://127.0.0.1:8384/rest/db/scan?folder=FOLDER" -H "X-API-Key: API_KEY" +curl -X POST "http://127.0.0.1:8384/rest/db/scan?folder=FOLDER" \ + -H "X-API-Key: API_KEY" -# Restart Syncthing on Windows (when stuck) -sshpass -p 'GrilledCh33s3#' ssh claude@10.10.10.150 'Stop-Process -Name syncthing -Force; Start-ScheduledTask -TaskName "Syncthing"' - -# Get all device IPs from router -expect -c 'spawn ssh root@10.10.10.1 "cat /proc/net/arp"; expect "Password:"; send "GrilledCh33s3#\r"; expect eof' +# Restart Syncthing on Windows +sshpass -p 'GrilledCh33s3#' ssh claude@10.10.10.150 \ + 'Stop-Process -Name syncthing -Force; Start-ScheduledTask -TaskName "Syncthing"' ``` -## Overview +--- -Two Proxmox servers running various VMs and containers for home infrastructure, media, development, and AI workloads. 
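+
+### Example: Combined Status Sweep
+
+A minimal sketch (hypothetical helper script, not part of the repo) that chains the health checks and emergency commands above into one pass, assuming the `pve`/`pve2` SSH aliases and the device IPs listed in this file:
+
+```bash
+#!/bin/bash
+# Hypothetical ~/bin/homelab-sweep.sh - adjust hosts/IPs as needed
+for host in pve pve2; do
+  echo "== $host =="
+  ssh "$host" 'uptime -p; qm list; zpool status -x' 2>/dev/null || echo "$host: unreachable"
+done
+
+# TrueNAS, Router, Traefik, Pi-hole
+for ip in 10.10.10.200 10.10.10.1 10.10.10.250 10.10.10.10; do
+  ping -c 1 -W 1 "$ip" >/dev/null 2>&1 && echo "$ip: UP" || echo "$ip: DOWN"
+done
+```
+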
+## Infrastructure Overview -## Servers +### Servers -### PVE (10.10.10.120) - Primary -- **CPU**: AMD Ryzen Threadripper PRO 3975WX (32-core, 64 threads, 280W TDP) -- **RAM**: 128 GB -- **Storage**: - - `nvme-mirror1`: 2x Sabrent Rocket Q NVMe (3.6TB usable) - - `nvme-mirror2`: 2x Kingston SFYRD 2TB (1.8TB usable) - - `rpool`: 2x Samsung 870 QVO 4TB SSD mirror (3.6TB usable) -- **GPUs**: - - NVIDIA Quadro P2000 (75W TDP) - Plex transcoding - - NVIDIA TITAN RTX (280W TDP) - AI workloads, passed to saltbox/lmdev1 -- **Role**: Primary VM host, TrueNAS, media services +| Server | CPU | RAM | Role | Details | +|--------|-----|-----|------|---------| +| **PVE** (10.10.10.120) | Threadripper PRO 3975WX (32C) | 128GB | Primary | [VMS.md](VMS.md) | +| **PVE2** (10.10.10.102) | Threadripper PRO 3975WX (32C) | 128GB | Secondary | [VMS.md](VMS.md) | -### PVE2 (10.10.10.102) - Secondary -- **CPU**: AMD Ryzen Threadripper PRO 3975WX (32-core, 64 threads, 280W TDP) -- **RAM**: 128 GB -- **Storage**: - - `nvme-mirror3`: 2x NVMe mirror - - `local-zfs2`: 2x WD Red 6TB HDD mirror -- **GPUs**: - - NVIDIA RTX A6000 (300W TDP) - passed to trading-vm -- **Role**: Trading platform, development +**Power**: ~1000-1350W under load | **UPS**: CyberPower 2200VA/1320W | **See**: [UPS.md](UPS.md), [POWER-MANAGEMENT.md](POWER-MANAGEMENT.md) -## SSH Access +### Critical VMs -### SSH Key Authentication (All Hosts) +| VMID | Name | IP | Purpose | Docs | +|------|------|-----|---------|------| +| 100 | truenas | 10.10.10.200 | NAS/storage | [STORAGE.md](STORAGE.md) | +| 101 | saltbox | 10.10.10.100 | Media stack (Plex) | [VMS.md](VMS.md) | +| 110 | homeassistant | 10.10.10.110 | Home automation | [HOMEASSISTANT.md](HOMEASSISTANT.md) | +| 202 | traefik (CT) | 10.10.10.250 | Reverse proxy | [TRAEFIK.md](TRAEFIK.md) | -SSH keys are configured in `~/.ssh/config` on both Mac Mini and MacBook. Use the `~/.ssh/homelab` key. +**Complete inventory**: [VMS.md](VMS.md) | **IP assignments**: [IP-ASSIGNMENTS.md](IP-ASSIGNMENTS.md) -| Host Alias | IP | User | Type | Notes | -|------------|-----|------|------|-------| -| `pve` | 10.10.10.120 | root | Proxmox | Primary server | -| `pve2` | 10.10.10.102 | root | Proxmox | Secondary server | -| `truenas` | 10.10.10.200 | root | VM | NAS/storage | -| `saltbox` | 10.10.10.100 | hutson | VM | Media automation | -| `lmdev1` | 10.10.10.111 | hutson | VM | AI/LLM development | -| `docker-host` | 10.10.10.206 | hutson | VM | Docker services | -| `fs-dev` | 10.10.10.5 | hutson | VM | Development | -| `copyparty` | 10.10.10.201 | hutson | VM | File sharing | -| `gitea-vm` | 10.10.10.220 | hutson | VM | Git server | -| `trading-vm` | 10.10.10.221 | hutson | VM | AI trading platform | -| `pihole` | 10.10.10.10 | root | LXC | DNS/Ad blocking | -| `traefik` | 10.10.10.250 | root | LXC | Reverse proxy | -| `findshyt` | 10.10.10.8 | root | LXC | Custom app | +--- -**Usage examples:** -```bash -ssh pve 'qm list' # List VMs -ssh truenas 'zpool status vault' # Check ZFS pool -ssh saltbox 'docker ps' # List containers -ssh pihole 'pihole status' # Check Pi-hole -``` +## Common Maintenance Tasks -### Password Auth (Special Cases) +1. **Check Syncthing sync** - Folders behind? Errors? +2. **Verify devices connected** - Run connection check +3. **Check disk space** - `ssh pve 'df -h'` +4. **Review ZFS health** - `ssh pve 'zpool status'` +5. **Check for stuck processes** - High CPU? Memory pressure? +6. **Verify backups** - Critical folders syncing? 
→ See [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md) -| Device | IP | User | Auth Method | Notes | -|--------|-----|------|-------------|-------| -| UniFi Router | 10.10.10.1 | root | expect (keyboard-interactive) | Gateway | -| Windows PC | 10.10.10.150 | claude | sshpass | PowerShell, use `;` not `&&` | -| HomeAssistant | 10.10.10.110 | - | QEMU agent only | No SSH server | +--- -**Router access (requires expect):** -```bash -# Run command on router -expect -c 'spawn ssh root@10.10.10.1 "hostname"; expect "Password:"; send "GrilledCh33s3#\r"; expect eof' +## Network Quick Reference -# Get ARP table (all device IPs) -expect -c 'spawn ssh root@10.10.10.1 "cat /proc/net/arp"; expect "Password:"; send "GrilledCh33s3#\r"; expect eof' -``` +**Ranges**: 10.10.10.0/24 (LAN), 10.10.20.0/24 (storage) +**Jumbo Frames**: MTU 9000 enabled +**Tailscale**: VPN with subnet routing (HA failover) -**Windows PC access:** -```bash -sshpass -p 'GrilledCh33s3#' ssh claude@10.10.10.150 'Get-Process | Select -First 5' -``` +**See**: [NETWORK.md](NETWORK.md) for complete details -**HomeAssistant (no SSH, use QEMU agent):** -```bash -ssh pve 'qm guest exec 110 -- bash -c "ha core info"' -``` - -## VMs and Containers - -### PVE (10.10.10.120) -| VMID | Name | vCPUs | RAM | Purpose | GPU/Passthrough | QEMU Agent | -|------|------|-------|-----|---------|-----------------|------------| -| 100 | truenas | 8 | 32GB | NAS, storage | LSI SAS2308 HBA, Samsung NVMe | Yes | -| 101 | saltbox | 16 | 16GB | Media automation | TITAN RTX | Yes | -| 105 | fs-dev | 10 | 8GB | Development | - | Yes | -| 110 | homeassistant | 2 | 2GB | Home automation | - | No | -| 111 | lmdev1 | 8 | 32GB | AI/LLM development | TITAN RTX | Yes | -| 201 | copyparty | 2 | 2GB | File sharing | - | Yes | -| 206 | docker-host | 2 | 4GB | Docker services | - | Yes | -| 200 | pihole (CT) | - | - | DNS/Ad blocking | - | N/A | -| 202 | traefik (CT) | - | - | Reverse proxy | - | N/A | -| 205 | findshyt (CT) | - | - | Custom app | - | N/A | - -### PVE2 (10.10.10.102) -| VMID | Name | vCPUs | RAM | Purpose | GPU/Passthrough | QEMU Agent | -|------|------|-------|-----|---------|-----------------|------------| -| 300 | gitea-vm | 2 | 4GB | Git server | - | Yes | -| 301 | trading-vm | 16 | 32GB | AI trading platform | RTX A6000 | Yes | - -### QEMU Guest Agent -VMs with QEMU agent can be managed via `qm guest exec`: -```bash -# Execute command in VM -ssh pve 'qm guest exec 100 -- bash -c "zpool status vault"' - -# Get VM IP addresses -ssh pve 'qm guest exec 100 -- bash -c "ip addr"' -``` -Only VM 110 (homeassistant) lacks QEMU agent - use its web UI instead. - -## Power Management - -### Estimated Power Draw -- **PVE**: 500-750W (CPU + TITAN RTX + P2000 + storage + HBAs) -- **PVE2**: 450-600W (CPU + RTX A6000 + storage) -- **Combined**: ~1000-1350W under load - -### Optimizations Applied -1. **KSMD Disabled** (2024-12-17 updated) - - Was consuming 44-57% CPU on PVE with negative profit - - Caused CPU temp to rise from 74°C to 83°C - - Savings: ~7-10W + significant temp reduction - - Made permanent via: - - systemd service: `/etc/systemd/system/disable-ksm.service` - - **ksmtuned masked**: `systemctl mask ksmtuned` (prevents re-enabling) - - **Note**: KSM can get re-enabled by Proxmox updates. If CPU is hot, check: - ```bash - cat /sys/kernel/mm/ksm/run # Should be 0 - ps aux | grep ksmd # Should show 0% CPU - # If KSM is running (run=1), disable it: - echo 0 > /sys/kernel/mm/ksm/run - systemctl mask ksmtuned - ``` - -2. 
**Syncthing Rescan Intervals** (2024-12-16) - - Changed aggressive 60s rescans to 3600s for large folders - - Affected: downloads (38GB), documents (11GB), desktop (7.2GB), movies, pictures, notes, config - - Savings: ~60-80W (TrueNAS VM was at constant 86% CPU) - -3. **CPU Governor Optimization** (2024-12-16) - - PVE: `powersave` governor + `balance_power` EPP (amd-pstate-epp driver) - - PVE2: `schedutil` governor (acpi-cpufreq driver) - - Made permanent via systemd service: `/etc/systemd/system/cpu-powersave.service` - - Savings: ~60-120W combined (CPUs now idle at 1.7-2.2GHz vs 4GHz) - -4. **GPU Power States** (2024-12-16) - Verified optimal - - RTX A6000: 11W idle (P8 state) - - TITAN RTX: 2-3W idle (P8 state) - - Quadro P2000: 25W (P0 - Plex keeps it active) - -5. **ksmtuned Disabled** (2024-12-16) - - KSM tuning daemon was still running after KSMD disabled - - Stopped and disabled on both servers - - Savings: ~2-5W - -6. **HDD Spindown on PVE2** (2024-12-16) - - local-zfs2 pool (2x WD Red 6TB) had only 768KB used but drives spinning 24/7 - - Set 30-minute spindown via `hdparm -S 241` - - Persistent via udev rule: `/etc/udev/rules.d/69-hdd-spindown.rules` - - Savings: ~10-16W when spun down - -### Potential Optimizations -- [ ] PCIe ASPM power management -- [ ] NMI watchdog disable - -## Memory Configuration -- Ballooning enabled on most VMs but not actively used -- No memory overcommit (98GB allocated on 128GB physical for PVE) -- KSMD was wasting CPU with no benefit (negative general_profit) - -## Network - -See [NETWORK.md](NETWORK.md) for full details. - -### Network Ranges -| Network | Range | Purpose | -|---------|-------|---------| -| LAN | 10.10.10.0/24 | Primary network, all external access | -| Internal | 10.10.20.0/24 | Inter-VM only (storage, NFS/iSCSI) | - -### PVE Bridges (10.10.10.120) -| Bridge | NIC | Speed | Purpose | Use For | -|--------|-----|-------|---------|---------| -| vmbr0 | enp1s0 | 1 Gb | Management | General VMs/CTs | -| vmbr1 | enp35s0f0 | 10 Gb | High-speed LXC | Bandwidth-heavy containers | -| vmbr2 | enp35s0f1 | 10 Gb | High-speed VM | TrueNAS, Saltbox, storage VMs | -| vmbr3 | (none) | Virtual | Internal only | NFS/iSCSI traffic, no internet | - -### Quick Reference -```bash -# Add VM to standard network (1Gb) -qm set VMID --net0 virtio,bridge=vmbr0 - -# Add VM to high-speed network (10Gb) -qm set VMID --net0 virtio,bridge=vmbr2 - -# Add secondary NIC for internal storage network -qm set VMID --net1 virtio,bridge=vmbr3 -``` - -### MTU 9000 (Jumbo Frames) - -Jumbo frames are enabled across the network for improved throughput on large transfers. - -| Device | Interface | MTU | Persistent | -|--------|-----------|-----|------------| -| Mac Mini | en0 | 9000 | Yes (networksetup) | -| PVE | vmbr0, enp1s0 | 9000 | Yes (/etc/network/interfaces) | -| PVE2 | vmbr0, nic1 | 9000 | Yes (/etc/network/interfaces) | -| TrueNAS | enp6s18, enp6s19 | 9000 | Yes | -| UCG-Fiber | br0 | 9216 | Yes (default) | - -**Verify MTU:** -```bash -# Mac Mini -ifconfig en0 | grep mtu - -# PVE/PVE2 -ssh pve 'ip link show vmbr0 | grep mtu' -ssh pve2 'ip link show vmbr0 | grep mtu' - -# Test jumbo frames -ping -c 1 -D -s 8000 10.10.10.120 # 8000 + 8 byte header = 8008 bytes -``` - -**Important:** When setting MTU on Proxmox bridges, ensure BOTH the bridge (vmbr0) AND the underlying physical interface (enp1s0/nic1) have the same MTU, otherwise packets will be dropped. - -### Tailscale VPN - -Tailscale provides secure remote access to the homelab from anywhere. 
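+
+A minimal failover check (a sketch assuming the Mac Mini's Tailscale CLI path and the `pve`/`ucg-fiber` subnet routers covered in NETWORK.md):
+
+```bash
+# List the subnet routers (pve primary, ucg-fiber failover) and their state
+/Applications/Tailscale.app/Contents/MacOS/Tailscale status | grep -E 'pve|ucg-fiber'
+
+# Confirm the LAN gateway is reachable over the advertised 10.10.10.0/24 route
+ping -c 1 -W 2 10.10.10.1 >/dev/null && echo "Subnet route: UP" || echo "Subnet route: DOWN"
+```
+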
- -**Subnet Routers (HA Failover)** - -Two devices advertise the `10.10.10.0/24` subnet for redundancy: - -| Device | Tailscale IP | Role | Notes | -|--------|--------------|------|-------| -| pve | 100.113.177.80 | Primary | Proxmox host | -| ucg-fiber | 100.94.246.32 | Failover | UniFi router (always on) | - -If Proxmox goes down, Tailscale automatically fails over to the router (~10-30 sec). - -**Router Tailscale Setup (UCG-Fiber)** -- Installed via: `curl -fsSL https://tailscale.com/install.sh | sh` -- Config: `tailscale up --advertise-routes=10.10.10.0/24 --accept-routes` -- Survives reboots (systemd service) -- Routes must be approved in [Tailscale Admin Console](https://login.tailscale.com/admin/machines) - -**Tailscale IPs Quick Reference** - -| Device | Tailscale IP | Local IP | -|--------|--------------|----------| -| Mac Mini | 100.108.89.58 | 10.10.10.125 | -| PVE | 100.113.177.80 | 10.10.10.120 | -| UCG-Fiber | 100.94.246.32 | 10.10.10.1 | -| TrueNAS | 100.100.94.71 | 10.10.10.200 | -| Pi-hole | 100.112.59.128 | 10.10.10.10 | - -**Check Tailscale Status** -```bash -# From Mac Mini -/Applications/Tailscale.app/Contents/MacOS/Tailscale status - -# From router -expect -c 'spawn ssh root@10.10.10.1 "tailscale status"; expect "Password:"; send "GrilledCh33s3#\r"; expect eof' -``` +--- ## Common Commands + ```bash -# Check VM status -ssh pve 'qm list' -ssh pve2 'qm list' +# VM management +ssh pve 'qm list' # List VMs +ssh pve 'qm start VMID' # Start VM +ssh pve 'qm shutdown VMID' # Graceful shutdown -# Check container status -ssh pve 'pct list' +# Container management +ssh pve 'pct list' # List containers +ssh pve 'pct enter CTID' # Enter container shell -# Monitor CPU/power -ssh pve 'top -bn1 | head -20' +# Storage +ssh pve 'zpool status' # Check ZFS pools +ssh truenas 'zpool status vault' # Check TrueNAS pool -# Check ZFS pools -ssh pve 'zpool status' - -# Check GPU (if nvidia-smi installed in VM) -ssh pve 'lspci | grep -i nvidia' +# QEMU guest agent +ssh pve 'qm guest exec VMID -- bash -c "COMMAND"' ``` -## Remote Claude Code Sessions (Mac Mini) +**See**: [SSH-ACCESS.md](SSH-ACCESS.md), [VMS.md](VMS.md) -### Overview -The Mac Mini (`hutson-mac-mini.local`) runs the Happy Coder daemon, enabling on-demand Claude Code sessions accessible from anywhere via the Happy Coder mobile app. Sessions are created when you need them - no persistent tmux sessions required. +--- -### Architecture -``` -Mac Mini (100.108.89.58 via Tailscale) -├── launchd (auto-starts on boot) -│ └── com.hutson.happy-daemon.plist (starts Happy daemon) -├── Happy Coder daemon (manages remote sessions) -└── Tailscale (secure remote access) -``` +## Documentation Index -### How It Works -1. Happy daemon runs on Mac Mini (auto-starts on boot) -2. Open Happy Coder app on phone/tablet -3. Start a new Claude session from the app -4. Session runs in any working directory you choose -5. 
Session ends when you're done - no cleanup needed +### Infrastructure +- [README.md](README.md) - Start here +- [VMS.md](VMS.md) - VM/CT inventory +- [STORAGE.md](STORAGE.md) - ZFS pools, shares +- [NETWORK.md](NETWORK.md) - Bridges, VLANs, Tailscale +- [POWER-MANAGEMENT.md](POWER-MANAGEMENT.md) - Optimizations +- [UPS.md](UPS.md) - UPS config, NUT monitoring -### Quick Commands -```bash -# Check daemon status -happy daemon list +### Services +- [TRAEFIK.md](TRAEFIK.md) - Reverse proxy, SSL +- [HOMEASSISTANT.md](HOMEASSISTANT.md) - Home automation +- [SYNCTHING.md](SYNCTHING.md) - File sync +- [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md) - Storage enclosure -# Start a new session manually (from Mac Mini terminal) -cd ~/Projects/homelab && happy claude +### Operations +- [SSH-ACCESS.md](SSH-ACCESS.md) - SSH keys, hosts +- [IP-ASSIGNMENTS.md](IP-ASSIGNMENTS.md) - IP addresses +- [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md) - ⚠️ Backups (CRITICAL) +- [SHELL-ALIASES.md](SHELL-ALIASES.md) - ZSH aliases -# Check active sessions -happy daemon list -``` +--- -### Mobile Access Setup (One-time) -1. Download Happy Coder app: - - iOS: https://apps.apple.com/us/app/happy-claude-code-client/id6748571505 - - Android: https://play.google.com/store/apps/details?id=com.ex3ndr.happy -2. On Mac Mini, ensure self-hosted server is configured: - ```bash - echo 'export HAPPY_SERVER_URL="https://happy.htsn.io"' >> ~/.zshrc - source ~/.zshrc - ``` -3. Authenticate with the Happy server: - ```bash - happy auth login --force # Opens browser, scan QR with app - ``` -4. Connect Claude API access: - ```bash - happy connect claude # Links your Anthropic API credentials - ``` -5. Ensure Claude is logged in locally (critical for spawned sessions): - ```bash - claude # Start Claude Code - /login # Authenticate if prompted - ``` -6. 
Daemon auto-starts on login via launchd - -### Daemon Management -```bash -happy daemon start # Start daemon -happy daemon stop # Stop daemon -happy daemon status # Check status -happy daemon list # List active sessions -``` - -### Remote Access via SSH + Tailscale -From any device on Tailscale network: -```bash -# SSH to Mac Mini -ssh hutson@100.108.89.58 - -# Or via hostname -ssh hutson@mac-mini - -# Start Claude in desired directory -cd ~/Projects/homelab && happy claude -``` - -### Files & Configuration -| File | Purpose | -|------|---------| -| `~/Library/LaunchAgents/com.hutson.happy-daemon.plist` | User LaunchAgent (starts at login) | -| `~/.happy/` | Happy Coder config, state, and logs | -| `~/.zshrc` | Contains `HAPPY_SERVER_URL` export | - -**Server:** `https://happy.htsn.io` (self-hosted Happy server on docker-host) - -### Troubleshooting -```bash -# Check if daemon is running -pgrep -f "happy.*daemon" - -# Check launchd status -launchctl list | grep happy - -# List active sessions -happy daemon list - -# Restart daemon -happy daemon stop && happy daemon start - -# If Tailscale is disconnected -/Applications/Tailscale.app/Contents/MacOS/Tailscale up -``` - -**Common Issues:** - -| Issue | Cause | Fix | -|-------|-------|-----| -| "Invalid API key" in spawned session | Claude not logged in locally | Run `claude` then `/login` on Mac Mini | -| "Failed to start daemon" | Stale lock file | `rm -f ~/.happy/daemon.state.json.lock ~/.happy/daemon.state.json` | -| Sessions not showing on phone | HAPPY_SERVER_URL not set | Add to `~/.zshrc`: `export HAPPY_SERVER_URL="https://happy.htsn.io"` | -| Slow responses | Cloudflare proxy enabled | Disable proxy for happy.htsn.io subdomain | - -## Happy Server (Self-Hosted Relay) - -Self-hosted Happy Coder relay server for lower latency and no external dependencies. - -### Architecture -``` -Phone App → https://happy.htsn.io → Traefik → docker-host:3002 → Happy Server - ↓ - PostgreSQL + Redis + MinIO (local) -``` - -### Service Details - -| Component | Location | Port | Notes | -|-----------|----------|------|-------| -| Happy Server | docker-host (10.10.10.206) | 3002 | Main relay service | -| PostgreSQL | docker-host | 5432 (internal) | User/session data | -| Redis | docker-host | 6379 (internal) | Real-time events | -| MinIO | docker-host | 9000 (internal) | File/image storage | -| Traefik | CT 202 | 443 | SSL termination | - -### Configuration - -**Docker Compose**: `/opt/happy-server/docker-compose.yml` -**Traefik Config**: `/etc/traefik/conf.d/happy.yaml` (on CT 202) -**DNS**: happy.htsn.io → 70.237.94.174 (Cloudflare DNS-only, NOT proxied for WebSocket performance) - -**Credentials**: -- Master Secret: `3ccbfd03a028d3c278da7d2cf36d99b94cd4b1fecabc49ab006e8e89bc7707ac` -- MinIO: `happyadmin` / `happyadmin123` -- PostgreSQL: `happy` / `happypass` - -### Quick Commands -```bash -# Check status -ssh docker-host 'docker ps --filter "name=happy"' - -# View logs -ssh docker-host 'docker logs -f happy-server' - -# Restart stack -ssh docker-host 'cd /opt/happy-server && sudo docker-compose restart' - -# Health check -curl https://happy.htsn.io/health - -# Run migrations (if needed) -ssh docker-host 'docker exec happy-server npx prisma migrate deploy' -``` - -### Connecting Devices - -**Phone (Happy App)**: -1. Settings → Relay Server URL -2. Enter: `https://happy.htsn.io` -3. 
Save and reconnect - -**CLI (Mac/Linux)**: -```bash -export HAPPY_SERVER_URL="https://happy.htsn.io" -happy auth # Re-authenticate with new server -``` - -### Maintenance - -**Backup data**: -```bash -ssh docker-host 'docker exec happy-postgres pg_dump -U happy happy > /tmp/happy-backup.sql' -``` - -**Update Happy Server**: -```bash -ssh docker-host 'cd /opt/happy-server && git pull && sudo docker-compose build && sudo docker-compose up -d' -``` - -## Agent and Tool Guidelines +## Agent & Tool Guidelines ### Background Agents -- **Always spin up background agents when doing multiple independent tasks** -- Background agents allow parallel execution of tasks that don't depend on each other -- This improves efficiency and reduces total execution time -- Use background agents for tasks like running tests, builds, or searches simultaneously +**Always** spin up background agents for multiple independent tasks: +- Parallel execution improves efficiency +- Use for: tests, builds, searches simultaneously -### MCP Tools for Web Searches +### MCP Tools -#### ref.tools - Documentation Lookups -- **`mcp__Ref__ref_search_documentation`**: Search through documentation for specific topics -- **`mcp__Ref__ref_read_url`**: Read and parse content from documentation URLs +| Tool | Provider | Use Case | +|------|----------|----------| +| `mcp__Ref__ref_search_documentation` | ref.tools | Search documentation | +| `mcp__Ref__ref_read_url` | ref.tools | Read doc URLs | +| `mcp__exa__web_search_exa` | Exa | General web search | +| `mcp__exa__get_code_context_exa` | Exa | Code-specific search | -#### Exa MCP - General Web and Code Searches -- **`mcp__exa__web_search_exa`**: General web searches for current information -- **`mcp__exa__get_code_context_exa`**: Code-related searches and repository lookups - -### MCP Tools Reference Table - -| Tool Name | Provider | Purpose | Use Case | -|-----------|----------|---------|----------| -| `mcp__Ref__ref_search_documentation` | ref.tools | Search documentation | Finding specific topics in official docs | -| `mcp__Ref__ref_read_url` | ref.tools | Read documentation URLs | Parsing and extracting content from doc pages | -| `mcp__exa__web_search_exa` | Exa MCP | General web search | Current events, general information lookup | -| `mcp__exa__get_code_context_exa` | Exa MCP | Code-specific search | Finding code examples, repository searches | - -## Reverse Proxy Architecture (Traefik) - -### Overview -There are **TWO separate Traefik instances** handling different services: - -| Instance | Location | IP | Purpose | Manages | -|----------|----------|-----|---------|---------| -| **Traefik-Primary** | CT 202 | **10.10.10.250** | General services | All non-Saltbox services | -| **Traefik-Saltbox** | VM 101 (Docker) | **10.10.10.100** | Saltbox services | Plex, *arr apps, media stack | - -### ⚠️ CRITICAL RULE: Which Traefik to Use - -**When adding ANY new service:** -- ✅ **Use Traefik-Primary (10.10.10.250)** - Unless service lives inside Saltbox VM -- ❌ **DO NOT touch Traefik-Saltbox** - It manages Saltbox services with their own certificates - -**Why this matters:** -- Traefik-Saltbox has complex Saltbox-managed configs -- Messing with it breaks Plex, Sonarr, Radarr, and all media services -- Each Traefik has its own Let's Encrypt certificates -- Mixing them causes certificate conflicts - -### Traefik-Primary (CT 202) - For New Services - -**Location**: `/etc/traefik/` on Container 202 -**Config**: `/etc/traefik/traefik.yaml` -**Dynamic Configs**: `/etc/traefik/conf.d/*.yaml` - 
-**Services using Traefik-Primary (10.10.10.250):** -- excalidraw.htsn.io → 10.10.10.206:8080 (docker-host) -- findshyt.htsn.io → 10.10.10.205 (CT 205) -- gitea (git.htsn.io) → 10.10.10.220:3000 -- homeassistant → 10.10.10.110 -- lmdev → 10.10.10.111 -- pihole → 10.10.10.200 -- truenas → 10.10.10.200 -- proxmox → 10.10.10.120 -- copyparty → 10.10.10.201 -- aitrade → trading server -- pulse.htsn.io → 10.10.10.206:7655 (Pulse monitoring) -- happy.htsn.io → 10.10.10.206:3002 (Happy Coder relay server) - -**Access Traefik config:** -```bash -# From Mac Mini: -ssh pve 'pct exec 202 -- cat /etc/traefik/traefik.yaml' -ssh pve 'pct exec 202 -- ls /etc/traefik/conf.d/' - -# Edit a service config: -ssh pve 'pct exec 202 -- vi /etc/traefik/conf.d/myservice.yaml' -``` - -### Traefik-Saltbox (VM 101) - DO NOT MODIFY - -**Location**: `/opt/traefik/` inside Saltbox VM -**Managed by**: Saltbox Ansible playbooks -**Mounts**: Docker bind mount from `/opt/traefik` → `/etc/traefik` in container - -**Services using Traefik-Saltbox (10.10.10.100):** -- Plex (plex.htsn.io) -- Sonarr, Radarr, Lidarr -- SABnzbd, NZBGet, qBittorrent -- Overseerr, Tautulli, Organizr -- Jackett, NZBHydra2 -- Authelia (SSO) -- All other Saltbox-managed containers - -**View Saltbox Traefik (read-only):** -```bash -ssh pve 'qm guest exec 101 -- bash -c "docker exec traefik cat /etc/traefik/traefik.yml"' -``` - -### Adding a New Public Service - Complete Workflow - -Follow these steps to deploy a new service and make it publicly accessible at `servicename.htsn.io`. - -#### Step 0. Deploy Your Service - -First, deploy your service on the appropriate host: - -**Option A: Docker on docker-host (10.10.10.206)** -```bash -ssh hutson@10.10.10.206 -sudo mkdir -p /opt/myservice -cat > /opt/myservice/docker-compose.yml << 'EOF' -version: "3.8" -services: - myservice: - image: myimage:latest - ports: - - "8080:80" - restart: unless-stopped -EOF -cd /opt/myservice && sudo docker-compose up -d -``` - -**Option B: New LXC Container on PVE** -```bash -ssh pve 'pct create CTID local:vztmpl/ubuntu-22.04-standard_22.04-1_amd64.tar.zst \ - --hostname myservice --memory 2048 --cores 2 \ - --net0 name=eth0,bridge=vmbr0,ip=10.10.10.XXX/24,gw=10.10.10.1 \ - --rootfs local-zfs:8 --unprivileged 1 --start 1' -``` - -**Option C: New VM on PVE** -```bash -ssh pve 'qm create VMID --name myservice --memory 2048 --cores 2 \ - --net0 virtio,bridge=vmbr0 --scsihw virtio-scsi-pci' -``` - -#### Step 1. 
Create Traefik Config File - -Use this template for new services on **Traefik-Primary (CT 202)**: - -```yaml -# /etc/traefik/conf.d/myservice.yaml -http: - routers: - # HTTPS router - myservice-secure: - entryPoints: - - websecure - rule: "Host(`myservice.htsn.io`)" - service: myservice - tls: - certResolver: cloudflare # Use 'cloudflare' for proxied domains, 'letsencrypt' for DNS-only - priority: 50 - - # HTTP → HTTPS redirect - myservice-redirect: - entryPoints: - - web - rule: "Host(`myservice.htsn.io`)" - middlewares: - - myservice-https-redirect - service: myservice - priority: 50 - - services: - myservice: - loadBalancer: - servers: - - url: "http://10.10.10.XXX:PORT" - - middlewares: - myservice-https-redirect: - redirectScheme: - scheme: https - permanent: true -``` - -### SSL Certificates - -Traefik has **two certificate resolvers** configured: - -| Resolver | Use When | Challenge Type | Notes | -|----------|----------|----------------|-------| -| `letsencrypt` | Cloudflare DNS-only (gray cloud) | HTTP-01 | Requires port 80 reachable | -| `cloudflare` | Cloudflare Proxied (orange cloud) | DNS-01 | Works with Cloudflare proxy | - -**⚠️ Important:** If Cloudflare proxy is enabled (orange cloud), HTTP challenge fails because Cloudflare redirects HTTP→HTTPS. Use `cloudflare` resolver instead. - -**Cloudflare API credentials** are configured in `/etc/systemd/system/traefik.service`: -```bash -Environment="CF_API_EMAIL=cloudflare@htsn.io" -Environment="CF_API_KEY=849ebefd163d2ccdec25e49b3e1b3fe2cdadc" -``` - -**Certificate storage:** -- HTTP challenge certs: `/etc/traefik/acme.json` -- DNS challenge certs: `/etc/traefik/acme-cf.json` - -**Deploy the config:** -```bash -# Create file on CT 202 -ssh pve 'pct exec 202 -- bash -c "cat > /etc/traefik/conf.d/myservice.yaml << '\''EOF'\'' - -EOF"' - -# Traefik auto-reloads (watches conf.d directory) -# Check logs: -ssh pve 'pct exec 202 -- tail -f /var/log/traefik/traefik.log' -``` - -#### 2. Add Cloudflare DNS Entry - -**Cloudflare Credentials:** -- Email: `cloudflare@htsn.io` -- API Key: `849ebefd163d2ccdec25e49b3e1b3fe2cdadc` - -**Manual method (via Cloudflare Dashboard):** -1. Go to https://dash.cloudflare.com/ -2. Select `htsn.io` domain -3. DNS → Add Record -4. Type: `A`, Name: `myservice`, IPv4: `70.237.94.174`, Proxied: ☑️ - -**Automated method (CLI script):** - -Save this as `~/bin/add-cloudflare-dns.sh`: -```bash -#!/bin/bash -# Add DNS record to Cloudflare for htsn.io - -SUBDOMAIN="$1" -CF_EMAIL="cloudflare@htsn.io" -CF_API_KEY="849ebefd163d2ccdec25e49b3e1b3fe2cdadc" -ZONE_ID="c0f5a80448c608af35d39aa820a5f3af" # htsn.io zone -PUBLIC_IP="70.237.94.174" # Update if IP changes: curl -s ifconfig.me - -if [ -z "$SUBDOMAIN" ]; then - echo "Usage: $0 " - echo "Example: $0 myservice # Creates myservice.htsn.io" - exit 1 -fi - -curl -X POST "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records" \ - -H "X-Auth-Email: $CF_EMAIL" \ - -H "X-Auth-Key: $CF_API_KEY" \ - -H "Content-Type: application/json" \ - --data "{ - \"type\":\"A\", - \"name\":\"$SUBDOMAIN\", - \"content\":\"$PUBLIC_IP\", - \"ttl\":1, - \"proxied\":true - }" | jq . -``` - -**Usage:** -```bash -chmod +x ~/bin/add-cloudflare-dns.sh -~/bin/add-cloudflare-dns.sh excalidraw # Creates excalidraw.htsn.io -``` - -#### 3. 
Testing - -```bash -# Check if DNS resolves -dig myservice.htsn.io - -# Test HTTP redirect -curl -I http://myservice.htsn.io - -# Test HTTPS -curl -I https://myservice.htsn.io - -# Check Traefik dashboard (if enabled) -# Access: http://10.10.10.250:8080/dashboard/ -``` - -#### Step 4. Update Documentation - -After deploying, update these files: - -1. **IP-ASSIGNMENTS.md** - Add to Services & Reverse Proxy Mapping table -2. **CLAUDE.md** - Add to "Services using Traefik-Primary" list (line ~495) - -### Quick Reference - One-Liner Commands - -```bash -# === DEPLOY SERVICE (example: myservice on docker-host port 8080) === - -# 1. Create Traefik config -ssh pve 'pct exec 202 -- bash -c "cat > /etc/traefik/conf.d/myservice.yaml << EOF -http: - routers: - myservice-secure: - entryPoints: [websecure] - rule: Host(\\\`myservice.htsn.io\\\`) - service: myservice - tls: {certResolver: letsencrypt} - services: - myservice: - loadBalancer: - servers: - - url: http://10.10.10.206:8080 -EOF"' - -# 2. Add Cloudflare DNS -curl -s -X POST "https://api.cloudflare.com/client/v4/zones/c0f5a80448c608af35d39aa820a5f3af/dns_records" \ - -H "X-Auth-Email: cloudflare@htsn.io" \ - -H "X-Auth-Key: 849ebefd163d2ccdec25e49b3e1b3fe2cdadc" \ - -H "Content-Type: application/json" \ - --data '{"type":"A","name":"myservice","content":"70.237.94.174","proxied":true}' - -# 3. Test (wait a few seconds for DNS propagation) -curl -I https://myservice.htsn.io -``` - -### Traefik Troubleshooting - -```bash -# View Traefik logs (CT 202) -ssh pve 'pct exec 202 -- tail -f /var/log/traefik/traefik.log' - -# Check if config is valid -ssh pve 'pct exec 202 -- cat /etc/traefik/conf.d/myservice.yaml' - -# List all dynamic configs -ssh pve 'pct exec 202 -- ls -la /etc/traefik/conf.d/' - -# Check certificate -ssh pve 'pct exec 202 -- cat /etc/traefik/acme.json | jq' - -# Restart Traefik (if needed) -ssh pve 'pct exec 202 -- systemctl restart traefik' -``` - -### Certificate Management - -**Let's Encrypt certificates** are automatically managed by Traefik. 
- -**Certificate storage:** -- Traefik-Primary: `/etc/traefik/acme.json` on CT 202 -- Traefik-Saltbox: `/opt/traefik/acme.json` on VM 101 - -**Certificate renewal:** -- Automatic via HTTP-01 challenge -- Traefik checks every 24h -- Renews 30 days before expiry - -**If certificates fail:** -```bash -# Check acme.json permissions (must be 600) -ssh pve 'pct exec 202 -- ls -la /etc/traefik/acme.json' - -# Check Traefik can reach Let's Encrypt -ssh pve 'pct exec 202 -- curl -I https://acme-v02.api.letsencrypt.org/directory' - -# Delete bad certificate (Traefik will re-request) -ssh pve 'pct exec 202 -- rm /etc/traefik/acme.json' -ssh pve 'pct exec 202 -- touch /etc/traefik/acme.json' -ssh pve 'pct exec 202 -- chmod 600 /etc/traefik/acme.json' -ssh pve 'pct exec 202 -- systemctl restart traefik' -``` - -### Docker Service with Traefik Labels (Alternative) - -If deploying a service via Docker on `docker-host` (VM 206), you can use Traefik labels instead of config files: - -```yaml -# docker-compose.yml -services: - myservice: - image: myimage:latest - labels: - - "traefik.enable=true" - - "traefik.http.routers.myservice.rule=Host(`myservice.htsn.io`)" - - "traefik.http.routers.myservice.entrypoints=websecure" - - "traefik.http.routers.myservice.tls.certresolver=letsencrypt" - - "traefik.http.services.myservice.loadbalancer.server.port=8080" - networks: - - traefik - -networks: - traefik: - external: true -``` - -**Note**: This requires Traefik to have access to Docker socket and be on same network. - -## Cloudflare API Access - -**Credentials** (stored in Saltbox config): -- Email: `cloudflare@htsn.io` -- API Key: `849ebefd163d2ccdec25e49b3e1b3fe2cdadc` -- Domain: `htsn.io` - -**Retrieve from Saltbox:** -```bash -ssh pve 'qm guest exec 101 -- bash -c "cat /srv/git/saltbox/accounts.yml | grep -A2 cloudflare"' -``` - -**Cloudflare API Documentation:** -- API Docs: https://developers.cloudflare.com/api/ -- DNS Records: https://developers.cloudflare.com/api/operations/dns-records-for-a-zone-create-dns-record - -**Common API operations:** - -```bash -# Set credentials -CF_EMAIL="cloudflare@htsn.io" -CF_API_KEY="849ebefd163d2ccdec25e49b3e1b3fe2cdadc" -ZONE_ID="c0f5a80448c608af35d39aa820a5f3af" - -# List all DNS records -curl -X GET "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records" \ - -H "X-Auth-Email: $CF_EMAIL" \ - -H "X-Auth-Key: $CF_API_KEY" | jq - -# Add A record -curl -X POST "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records" \ - -H "X-Auth-Email: $CF_EMAIL" \ - -H "X-Auth-Key: $CF_API_KEY" \ - -H "Content-Type: application/json" \ - --data '{"type":"A","name":"subdomain","content":"IP","proxied":true}' - -# Delete record -curl -X DELETE "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records/$RECORD_ID" \ - -H "X-Auth-Email: $CF_EMAIL" \ - -H "X-Auth-Key: $CF_API_KEY" -``` +--- ## Git Repository -This documentation is stored at: - **Gitea**: https://git.htsn.io/hutson/homelab-docs - **Local**: `~/Projects/homelab` - **Notes**: `~/Notes/05_Homelab` (symlink) ```bash -# Clone -git clone git@git.htsn.io:hutson/homelab-docs.git - -# Push changes cd ~/Projects/homelab git add -A && git commit -m "Update docs" && git push ``` -## Related Documentation - -| File | Description | -|------|-------------| -| [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md) | EMC storage enclosure (SES commands, LCC troubleshooting, maintenance) | -| [HOMEASSISTANT.md](HOMEASSISTANT.md) | Home Assistant API access, automations, integrations | -| [NETWORK.md](NETWORK.md) | Network bridges, VLANs, 
which bridge to use for new VMs | -| [IP-ASSIGNMENTS.md](IP-ASSIGNMENTS.md) | Complete IP address assignments for all devices and services | -| [SYNCTHING.md](SYNCTHING.md) | Syncthing setup, API access, device list, troubleshooting | -| [SHELL-ALIASES.md](SHELL-ALIASES.md) | ZSH aliases for Claude Code (`chomelab`, `ctrading`, etc.) | -| [configs/](configs/) | Symlinks to shared shell configs | - --- ## Backlog -Future improvements and maintenance tasks: - | Priority | Task | Notes | |----------|------|-------| -| Medium | **Re-IP all devices** | Current IP scheme is inconsistent. Plan: VMs 10.10.10.100-199, LXCs 10.10.10.200-249, Services 10.10.10.250-254 | -| Low | Install SSH on HomeAssistant | Currently only accessible via QEMU agent | -| Low | Set up SSH key for router | Currently requires expect/password | +| Medium | Re-IP all devices | Current IPs inconsistent | +| Medium | Upgrade to 20A circuit for UPS | Plug rewired 5-20P→5-15P | +| Low | Install SSH on HomeAssistant | Currently QEMU agent only | --- -## Changelog +## Recent Changes + +### 2025-12-22 +- Created comprehensive Phase 1 documentation split +- New docs: README.md, BACKUP-STRATEGY.md, STORAGE.md, UPS.md, TRAEFIK.md, SSH-ACCESS.md, POWER-MANAGEMENT.md, VMS.md +- Cleaned up CLAUDE.md to quick reference only + +### 2025-12-21 +- UPS upgrade: CyberPower OR2200PFCRT2U (1320W) +- NUT monitoring configured (master/slave) +- Full power failure test successful (~7 min recovery) +- Happy Server self-hosted relay deployed +- PVE Tailscale routing fix +- Proxmox 2-node cluster quorum fix + +**Full changelog**: See end of this file + +--- + +**Last Updated**: 2025-12-22 +**Documentation Status**: ✅ Phase 1 Complete + +--- + +
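+
+Quick spot-checks for the recent NUT, quorum, and MTU changes (a sketch — uses the host aliases and paths referenced elsewhere in these docs):
+
+```bash
+# NUT roles: PVE runs server + monitor, PVE2 runs the monitor only
+ssh pve 'systemctl is-active nut-server nut-monitor'
+ssh pve2 'systemctl is-active nut-monitor'
+
+# 2-node quorum workaround in place?
+ssh pve 'grep two_node /etc/pve/corosync.conf'
+
+# Jumbo frames on PVE2
+ssh pve2 'ip link show vmbr0 | grep -o "mtu [0-9]*"'
+```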
+Full Changelog (Click to expand) ### 2025-12-21 +**UPS Upgrade** +- Replaced WattBox WB-1100-IPVMB-6 (660W) with CyberPower OR2200PFCRT2U (1320W) +- Temporarily rewired plug 5-20P → 5-15P for 15A circuit +- Runtime: ~15-20 min at 33% load + +**NUT Monitoring** +- Configured NUT on PVE (master), PVE2 (slave) +- Shutdown threshold: 120 seconds runtime +- Custom shutdown script: `/usr/local/bin/ups-shutdown.sh` +- Home Assistant integration (UPS sensors) + **Happy Server Self-Hosted Relay** -- Deployed self-hosted Happy Coder relay server on docker-host (10.10.10.206) -- Stack includes: Happy Server, PostgreSQL, Redis, MinIO (all containerized) -- Configured Traefik reverse proxy at https://happy.htsn.io -- Added Cloudflare DNS record (proxied) -- Fixed Dockerfile to include Prisma migrations on startup +- Deployed on docker-host (10.10.10.206) +- Stack: Happy Server + PostgreSQL + Redis + MinIO +- URL: https://happy.htsn.io +- Traefik reverse proxy configured -**Docker-host CPU Upgrade** -- Changed VM 206 CPU from emulated to `host` passthrough -- Fixes x86-64-v2 compatibility issues with modern binaries (Sharp, MinIO) -- Requires: `ssh pve 'qm set 206 -cpu host'` + VM reboot +**Proxmox Fixes** +- PVE Tailscale routing: Added rule for local network access +- PVE2 MTU fix: vmbr0 + nic1 both set to 9000 +- 2-node cluster quorum: `two_node: 1` in corosync.conf -**PVE Tailscale Routing Fix** -- Fixed issue where PVE was unreachable via local network (10.10.10.120) -- Root cause: Tailscale routing table 52 was capturing local subnet traffic -- Fix: Added routing rule `ip rule add from 10.10.10.120 table main priority 5200` -- Made permanent in `/etc/network/interfaces` under vmbr0 +**Power Failure Test** +- Full end-to-end test successful +- VMs stopped gracefully at 2 min runtime +- Total recovery: ~7 minutes ### 2024-12-20 -**Git Repository Setup** -- Created homelab-docs repo on Gitea (git.htsn.io/hutson/homelab-docs) -- Set up SSH key authentication for git@git.htsn.io -- Created symlink from ~/Notes/05_Homelab → ~/Projects/homelab -- Added Gitea API token for future automation - -**SSH Key Deployment - All Systems** -- Added SSH keys to ALL VMs and LXCs (13 total hosts now accessible via key) -- Updated `~/.ssh/config` with complete host aliases -- Fixed permissions: FindShyt LXC `.ssh` ownership, enabled PermitRootLogin on LXCs -- Hosts now accessible: pve, pve2, truenas, saltbox, lmdev1, docker-host, fs-dev, copyparty, gitea-vm, trading-vm, pihole, traefik, findshyt - -**Documentation Updates** -- Rewrote SSH Access section with complete host table -- Added Password Auth section for router/Windows/HomeAssistant -- Added Backlog section with re-IP task -- Added Git Repository section with clone/push instructions +**Git & SSH** +- Created homelab-docs repo on Gitea +- Deployed SSH keys to all VMs/LXCs (13 hosts) +- Updated ~/.ssh/config with host aliases ### 2024-12-19 -**EMC Storage Enclosure - LCC B Failure** -- Diagnosed loud fan issue (speed code 5 → 4160 RPM) -- Root cause: Faulty LCC B controller causing false readings -- Resolution: Switched SAS cable to LCC A, fans now quiet (speed code 3 → 2670 RPM) -- Replacement ordered: EMC 303-108-000E ($14.95 eBay) -- Created [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md) with full documentation +**EMC Storage Enclosure** +- LCC B failure diagnosed, switched to LCC A +- Fans now quiet (speed code 3 vs 5) +- Created EMC-ENCLOSURE.md documentation -**SSH Key Consolidation** -- Renamed `~/.ssh/ai_trading_ed25519` → `~/.ssh/homelab` -- Updated 
`~/.ssh/config` on MacBook with all homelab hosts -- SSH key auth now works for: pve, pve2, docker-host, fs-dev, copyparty, lmdev1, gitea-vm, trading-vm -- No more sshpass needed for PVE servers +**QEMU Guest Agent** +- Installed on docker-host, fs-dev, copyparty +- All VMs now have agent except homeassistant -**QEMU Guest Agent Deployment** -- Installed on: docker-host (206), fs-dev (105), copyparty (201) -- All PVE VMs now have agent except homeassistant (110) -- Can now use `qm guest exec` for remote commands - -**VM Configuration Updates** -- docker-host: Fixed SSH key in cloud-init -- fs-dev: Fixed `.ssh` directory ownership (1000 → 1001) -- copyparty: Changed from DHCP to static IP (10.10.10.201) - -**Documentation Updates** -- Updated CLAUDE.md SSH section (removed sshpass examples) -- Added QEMU Agent column to VM tables -- Added storage enclosure troubleshooting to runbooks +
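+
+With the agent deployed, commands can be run inside these VMs without SSH — for example (a sketch; VMID 206 = docker-host per the VM tables):
+
+```bash
+# Confirm the agent responds, then run a command inside the guest
+ssh pve 'qm agent 206 ping'
+ssh pve 'qm guest exec 206 -- bash -c "uptime"'
+```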
diff --git a/HARDWARE.md b/HARDWARE.md new file mode 100644 index 0000000..e063500 --- /dev/null +++ b/HARDWARE.md @@ -0,0 +1,455 @@ +# Hardware Inventory + +Complete hardware specifications for all homelab equipment. + +## Servers + +### PVE (10.10.10.120) - Primary Proxmox Server + +#### CPU +- **Model**: AMD Ryzen Threadripper PRO 3975WX +- **Cores**: 32 cores / 64 threads +- **Base Clock**: 3.5 GHz +- **Boost Clock**: 4.2 GHz +- **TDP**: 280W +- **Architecture**: Zen 2 (7nm) +- **Socket**: sTRX4 +- **Features**: ECC support, PCIe 4.0 + +#### RAM +- **Capacity**: 128 GB +- **Type**: DDR4 ECC Registered +- **Speed**: Unknown (needs investigation) +- **Channels**: 8-channel (quad-channel per socket) +- **Idle Power**: ~30-40W + +#### Storage + +**OS/VM Storage:** + +| Pool | Devices | Type | Capacity | Purpose | +|------|---------|------|----------|---------| +| `nvme-mirror1` | 2x Sabrent Rocket Q NVMe | ZFS Mirror | 3.6 TB usable | High-performance VM storage | +| `nvme-mirror2` | 2x Kingston SFYRD 2TB NVMe | ZFS Mirror | 1.8 TB usable | Additional fast VM storage | +| `rpool` | 2x Samsung 870 QVO 4TB SSD | ZFS Mirror | 3.6 TB usable | Proxmox OS, containers, backups | + +**Total Storage**: ~9 TB usable + +#### GPUs + +| Model | Slot | VRAM | TDP | Purpose | Passed To | +|-------|------|------|-----|---------|-----------| +| NVIDIA Quadro P2000 | PCIe slot 1 | 5 GB GDDR5 | 75W | Plex transcoding | Host | +| NVIDIA TITAN RTX | PCIe slot 2 | 24 GB GDDR6 | 280W | AI workloads | Saltbox (101), lmdev1 (111) | + +**Total GPU Power**: 75W + 280W = 355W (under load) + +#### Network Cards + +| Interface | Model | Speed | Purpose | Bridge | +|-----------|-------|-------|---------|--------| +| enp1s0 | Intel I210 (onboard) | 1 Gb | Management | vmbr0 | +| enp35s0f0 | Intel X520 (dual-port SFP+) | 10 Gb | High-speed LXC | vmbr1 | +| enp35s0f1 | Intel X520 (dual-port SFP+) | 10 Gb | High-speed VM | vmbr2 | + +**10Gb Transceivers**: Intel FTLX8571D3BCV (SFP+ 10GBASE-SR, 850nm, multimode) + +#### Storage Controllers + +| Model | Interface | Purpose | +|-------|-----------|---------| +| LSI SAS2308 HBA | PCIe 3.0 x8 | Passed to TrueNAS VM for EMC enclosure | +| Samsung NVMe controller | PCIe | Passed to TrueNAS VM for ZFS caching | + +#### Motherboard +- **Model**: Unknown - needs investigation +- **Chipset**: AMD TRX40 +- **Form Factor**: ATX/EATX +- **PCIe Slots**: Multiple PCIe 4.0 slots +- **Features**: IOMMU support, ECC memory + +#### Power Supply +- **Model**: Unknown +- **Wattage**: Likely 1000W+ (needs investigation) +- **Type**: ATX, 80+ certification unknown + +#### Cooling +- **CPU Cooler**: Unknown - likely large tower or AIO +- **Case Fans**: Unknown quantity +- **Note**: CPU temps 70-80°C under load (healthy) + +--- + +### PVE2 (10.10.10.102) - Secondary Proxmox Server + +#### CPU +- **Model**: AMD Ryzen Threadripper PRO 3975WX +- **Specs**: Same as PVE (32C/64T, 280W TDP) + +#### RAM +- **Capacity**: 128 GB DDR4 ECC +- **Same specs as PVE** + +#### Storage + +| Pool | Devices | Type | Capacity | Purpose | +|------|---------|------|----------|---------| +| `nvme-mirror3` | 2x NVMe (model unknown) | ZFS Mirror | Unknown | High-performance VM storage | +| `local-zfs2` | 2x WD Red 6TB HDD | ZFS Mirror | ~6 TB usable | Bulk/archival storage (spins down) | + +**HDD Spindown**: Configured for 30-min idle spindown (saves ~10-16W) + +#### GPUs + +| Model | Slot | VRAM | TDP | Purpose | Passed To | +|-------|------|------|-----|---------|-----------| +| NVIDIA RTX A6000 | PCIe slot 1 | 48 GB 
GDDR6 | 300W | AI trading workloads | trading-vm (301) | + +#### Network Cards + +| Interface | Model | Speed | Purpose | +|-----------|-------|-------|---------| +| nic1 | Unknown (onboard) | 1 Gb | Management | + +**Note**: MTU set to 9000 for jumbo frames + +#### Motherboard +- **Model**: Unknown +- **Chipset**: AMD TRX40 +- **Similar to PVE** + +--- + +## Network Equipment + +### UniFi Dream Machine Pro (UCG-Fiber) + +- **Model**: UniFi Cloud Gateway Fiber +- **IP**: 10.10.10.1 +- **Ports**: Multiple 1Gb + SFP+ uplink +- **Features**: Router, firewall, VPN, IDS/IPS +- **MTU**: 9216 (supports jumbo frames) +- **Tailscale**: Installed for VPN failover + +### Switches + +**Details needed** - investigate current switch setup: +- 10Gb switch for high-speed connections? +- 1Gb switch for general devices? +- PoE capabilities? + +```bash +# Check what's connected to 10Gb interfaces +ssh pve 'ip link show enp35s0f0' +ssh pve 'ip link show enp35s0f1' +``` + +--- + +## Storage Hardware + +### EMC Storage Enclosure + +**See [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md) for complete details** + +- **Model**: EMC KTN-STL4 (or similar) +- **Form Factor**: 4U rackmount +- **Drive Bays**: 25x 3.5" SAS/SATA +- **Controllers**: Dual LCC (Link Control Cards) +- **Connection**: SAS via LSI SAS2308 HBA +- **Passed to**: TrueNAS VM (VMID 100) + +**Current Status**: +- LCC A: Active (working) +- LCC B: Failed (replacement ordered) + +**Drive Inventory**: Unknown - needs audit + +```bash +# Get drive list from TrueNAS +ssh truenas 'smartctl --scan' +ssh truenas 'lsblk' +``` + +### NVMe Drives + +| Model | Quantity | Capacity | Location | Pool | +|-------|----------|----------|----------|------| +| Sabrent Rocket Q | 2 | Unknown | PVE | nvme-mirror1 | +| Kingston SFYRD | 2 | 2 TB each | PVE | nvme-mirror2 | +| Unknown model | 2 | Unknown | PVE2 | nvme-mirror3 | +| Samsung (model unknown) | 1 | Unknown | TrueNAS (passed) | ZFS cache | + +### SSDs + +| Model | Quantity | Capacity | Location | Pool | +|-------|----------|----------|----------|------| +| Samsung 870 QVO | 2 | 4 TB each | PVE | rpool | + +### HDDs + +| Model | Quantity | Capacity | Location | Pool | +|-------|----------|----------|----------|------| +| WD Red | 2 | 6 TB each | PVE2 | local-zfs2 | +| Unknown (in EMC) | Unknown | Unknown | TrueNAS | vault | + +--- + +## UPS + +### Current UPS + +| Specification | Value | +|---------------|-------| +| **Model** | CyberPower OR2200PFCRT2U | +| **Capacity** | 2200VA / 1320W | +| **Form Factor** | 2U rackmount | +| **Input** | NEMA 5-15P (rewired from 5-20P) | +| **Outlets** | 2x 5-20R + 6x 5-15R | +| **Output** | PFC Sinewave | +| **Runtime** | ~15-20 min @ 33% load | +| **Interface** | USB (connected to PVE) | + +**See [UPS.md](UPS.md) for configuration details** + +--- + +## Client Devices + +### Mac Mini (Hutson's Workstation) + +- **Model**: Unknown generation +- **CPU**: Unknown +- **RAM**: Unknown +- **Storage**: Unknown +- **Network**: 1Gb Ethernet (en0) - MTU 9000 +- **Tailscale IP**: 100.108.89.58 +- **Local IP**: 10.10.10.125 (static) +- **Purpose**: Primary workstation, Happy Coder daemon host + +### MacBook (Mobile) + +- **Model**: Unknown +- **Network**: Wi-Fi + Ethernet adapter +- **Tailscale IP**: Unknown +- **Purpose**: Mobile work, development + +### Windows PC + +- **Model**: Unknown +- **CPU**: Unknown +- **Network**: 1Gb Ethernet +- **IP**: 10.10.10.150 +- **Purpose**: Gaming, Windows development, Syncthing node + +### Phone (Android) + +- **Model**: Unknown +- **IP**: 10.10.10.54 (when on 
Wi-Fi) +- **Purpose**: Syncthing mobile node, Happy Coder client + +--- + +## Rack Layout (If Applicable) + +**Needs documentation** - Current rack configuration unknown + +Suggested format: +``` +U42: Blank panel +U41: UPS (CyberPower 2U) +U40: UPS (CyberPower 2U) +U39: Switch (10Gb) +U38-U35: EMC Storage Enclosure (4U) +U34: PVE Server +U33: PVE2 Server +... +``` + +--- + +## Power Consumption + +### Measured Power Draw + +| Component | Idle | Typical | Peak | Notes | +|-----------|------|---------|------|-------| +| PVE Server | 250-350W | 500W | 750W | CPU + GPUs + storage | +| PVE2 Server | 200-300W | 400W | 600W | CPU + GPU + storage | +| Network Gear | ~50W | ~50W | ~50W | Router + switches | +| **Total** | **500-700W** | **~950W** | **~1400W** | Exceeds UPS under peak load | + +**UPS Capacity**: 1320W +**Typical Load**: 33-50% (safe margin) +**Peak Load**: Can exceed UPS capacity temporarily (acceptable) + +### Power Optimizations Applied + +**See [POWER-MANAGEMENT.md](POWER-MANAGEMENT.md) for details** + +- KSMD disabled: ~60-80W saved +- CPU governors: ~60-120W saved +- Syncthing rescans: ~60-80W saved +- HDD spindown: ~10-16W saved when idle +- **Total savings**: ~150-300W + +--- + +## Thermal Management + +### CPU Cooling + +**PVE & PVE2**: +- CPU cooler: Unknown model +- Thermal paste: Unknown, likely needs refresh if temps >85°C +- Target temp: 70-80°C under load +- Max safe: 90°C Tctl (Threadripper PRO spec) + +### GPU Cooling + +All GPUs are passively managed (stock coolers): +- TITAN RTX: 2-3W idle, 280W load +- RTX A6000: 11W idle, 300W load +- Quadro P2000: 25W constant (Plex active) + +### Case Airflow + +**Unknown** - needs investigation: +- Case model? +- Fan configuration? +- Positive or negative pressure? + +--- + +## Cable Management + +### Network Cables + +| Connection | Type | Length | Speed | +|------------|------|--------|-------| +| PVE → Switch (10Gb) | OM3 fiber | Unknown | 10Gb | +| PVE2 → Router | Cat6 | Unknown | 1Gb | +| Mac Mini → Switch | Cat6 | Unknown | 1Gb | +| TrueNAS → EMC | SAS cable | Unknown | 6Gb/s | + +### Power Cables + +**Critical**: All servers on UPS battery-backed outlets + +--- + +## Maintenance Schedule + +### Annual Maintenance + +- [ ] Clean dust from servers (every 6-12 months) +- [ ] Check thermal paste on CPUs (every 2-3 years) +- [ ] Test UPS battery runtime (annually) +- [ ] Verify all fans operational +- [ ] Check for bulging capacitors on PSUs + +### Drive Health + +```bash +# Check SMART status on all drives +ssh pve 'smartctl -a /dev/nvme0' +ssh pve2 'smartctl -a /dev/sda' +ssh truenas 'smartctl --scan | while read dev type; do echo "=== $dev ==="; smartctl -a $dev | grep -E "Model|Serial|Health|Reallocated|Current_Pending"; done' +``` + +### Temperature Monitoring + +```bash +# Check all temps (needs lm-sensors installed) +ssh pve 'sensors' +ssh pve2 'sensors' +``` + +--- + +## Warranty & Purchase Info + +**Needs documentation**: +- When were servers purchased? +- Where were components bought? +- Any warranties still active? +- Replacement part sources? + +--- + +## Upgrade Path + +### Short-term Upgrades (< 6 months) + +- [ ] 20A circuit for UPS (restore original 5-20P plug) +- [ ] Document missing hardware specs +- [ ] Label all cables +- [ ] Create rack diagram + +### Medium-term Upgrades (6-12 months) + +- [ ] Additional 10Gb NIC for PVE2? +- [ ] More NVMe storage? +- [ ] Upgrade network switches? +- [ ] Replace EMC enclosure with newer model? 
+ +### Long-term Upgrades (1-2 years) + +- [ ] CPU upgrade to newer Threadripper? +- [ ] RAM expansion to 256GB? +- [ ] Additional GPU for AI workloads? +- [ ] Migrate to PCIe 5.0 storage? + +--- + +## Investigation Needed + +High-priority items to document: + +- [ ] Get exact motherboard model (both servers) +- [ ] Get PSU model and wattage +- [ ] CPU cooler models +- [ ] Network switch models and configuration +- [ ] Complete drive inventory in EMC enclosure +- [ ] RAM speed and timings +- [ ] Case models +- [ ] Exact NVMe models for all drives + +**Commands to gather info**: + +```bash +# Motherboard +ssh pve 'dmidecode -t baseboard' + +# CPU details +ssh pve 'lscpu' + +# RAM details +ssh pve 'dmidecode -t memory | grep -E "Size|Speed|Manufacturer"' + +# Storage devices +ssh pve 'lsblk -o NAME,SIZE,TYPE,TRAN,MODEL' + +# Network cards +ssh pve 'lspci | grep -i network' + +# GPU details +ssh pve 'lspci | grep -i vga' +ssh pve 'nvidia-smi -L' # If nvidia-smi available +``` + +--- + +## Related Documentation + +- [VMS.md](VMS.md) - VM resource allocation +- [STORAGE.md](STORAGE.md) - Storage pools and usage +- [POWER-MANAGEMENT.md](POWER-MANAGEMENT.md) - Power optimizations +- [UPS.md](UPS.md) - UPS configuration +- [NETWORK.md](NETWORK.md) - Network configuration +- [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md) - Storage enclosure details + +--- + +**Last Updated**: 2025-12-22 +**Status**: ⚠️ Incomplete - many specs need investigation diff --git a/HOMEASSISTANT.md b/HOMEASSISTANT.md index 43a8784..3c23476 100644 --- a/HOMEASSISTANT.md +++ b/HOMEASSISTANT.md @@ -131,6 +131,42 @@ curl -s -H "Authorization: Bearer $HA_TOKEN" \ - **Philips Hue** - Lights - **Sonos** - Speakers - **Motion Sensors** - Various locations +- **NUT (Network UPS Tools)** - UPS monitoring (added 2025-12-21) + +### NUT / UPS Integration + +Monitors the CyberPower OR2200PFCRT2U UPS connected to PVE. + +**Connection:** +- Host: 10.10.10.120 +- Port: 3493 +- Username: upsmon +- Password: upsmon123 + +**Entities:** +| Entity ID | Description | +|-----------|-------------| +| `sensor.cyberpower_battery_charge` | Battery percentage | +| `sensor.cyberpower_load` | Current load % | +| `sensor.cyberpower_input_voltage` | Input voltage | +| `sensor.cyberpower_output_voltage` | Output voltage | +| `sensor.cyberpower_status` | Status (Online, On Battery, etc.) | +| `sensor.cyberpower_status_data` | Raw status (OL, OB, LB, CHRG) | + +**Dashboard Card Example:** +```yaml +type: entities +title: UPS Status +entities: + - entity: sensor.cyberpower_status + name: Status + - entity: sensor.cyberpower_battery_charge + name: Battery + - entity: sensor.cyberpower_load + name: Load + - entity: sensor.cyberpower_input_voltage + name: Input Voltage +``` ## Automations diff --git a/MAINTENANCE.md b/MAINTENANCE.md new file mode 100644 index 0000000..fef9648 --- /dev/null +++ b/MAINTENANCE.md @@ -0,0 +1,618 @@ +# Maintenance Procedures and Schedules + +Regular maintenance procedures for homelab infrastructure to ensure reliability and performance. 
+ +## Overview + +| Frequency | Tasks | Estimated Time | +|-----------|-------|----------------| +| **Daily** | Quick health check | 2-5 min | +| **Weekly** | Service status, logs review | 15-30 min | +| **Monthly** | Updates, backups verification | 1-2 hours | +| **Quarterly** | Full system audit, testing | 2-4 hours | +| **Annual** | Hardware maintenance, planning | 4-8 hours | + +--- + +## Daily Maintenance (Automated) + +### Quick Health Check Script + +Save as `~/bin/homelab-health-check.sh`: + +```bash +#!/bin/bash +# Daily homelab health check + +echo "=== Homelab Health Check ===" +echo "Date: $(date)" +echo "" + +echo "=== Server Status ===" +ssh pve 'uptime' 2>/dev/null || echo "PVE: UNREACHABLE" +ssh pve2 'uptime' 2>/dev/null || echo "PVE2: UNREACHABLE" +echo "" + +echo "=== CPU Temperatures ===" +ssh pve 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE: $(($(cat $f)/1000))°C"; fi; done' +ssh pve2 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE2: $(($(cat $f)/1000))°C"; fi; done' +echo "" + +echo "=== UPS Status ===" +ssh pve 'upsc cyberpower@localhost | grep -E "battery.charge:|battery.runtime:|ups.load:|ups.status:"' +echo "" + +echo "=== ZFS Pools ===" +ssh pve 'zpool status -x' 2>/dev/null +ssh pve2 'zpool status -x' 2>/dev/null +ssh truenas 'zpool status -x vault' +echo "" + +echo "=== Disk Space ===" +ssh pve 'df -h | grep -E "Filesystem|/dev/(nvme|sd)"' +ssh truenas 'df -h /mnt/vault' +echo "" + +echo "=== VM Status ===" +ssh pve 'qm list | grep running | wc -l' | xargs echo "PVE VMs running:" +ssh pve2 'qm list | grep running | wc -l' | xargs echo "PVE2 VMs running:" +echo "" + +echo "=== Syncthing Connections ===" +curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \ + "http://127.0.0.1:8384/rest/system/connections" | \ + python3 -c "import sys,json; d=json.load(sys.stdin)['connections']; \ + [print(f\"{v.get('name',k[:7])}: {'UP' if v['connected'] else 'DOWN'}\") for k,v in d.items()]" +echo "" + +echo "=== Check Complete ===" +``` + +**Run daily via cron**: +```bash +# Add to crontab +0 9 * * * ~/bin/homelab-health-check.sh | mail -s "Homelab Health Check" hutson@example.com +``` + +--- + +## Weekly Maintenance + +### Service Status Review + +**Check all critical services**: +```bash +# Proxmox services +ssh pve 'systemctl status pve-cluster pvedaemon pveproxy' +ssh pve2 'systemctl status pve-cluster pvedaemon pveproxy' + +# NUT (UPS monitoring) +ssh pve 'systemctl status nut-server nut-monitor' +ssh pve2 'systemctl status nut-monitor' + +# Container services +ssh pve 'pct exec 200 -- systemctl status pihole-FTL' # Pi-hole +ssh pve 'pct exec 202 -- systemctl status traefik' # Traefik + +# VM services (via QEMU agent) +ssh pve 'qm guest exec 100 -- bash -c "systemctl status nfs-server smbd"' # TrueNAS +``` + +### Log Review + +**Check for errors in critical logs**: +```bash +# Proxmox system logs +ssh pve 'journalctl -p err -b | tail -50' +ssh pve2 'journalctl -p err -b | tail -50' + +# VM logs (if QEMU agent available) +ssh pve 'qm guest exec 100 -- bash -c "journalctl -p err --since today"' + +# Traefik access logs +ssh pve 'pct exec 202 -- tail -100 /var/log/traefik/access.log' +``` + +### Syncthing Sync Status + +**Check for sync errors**: +```bash +# Check all folder errors +for folder in documents downloads desktop movies pictures notes config; do + echo "=== $folder ===" + curl -s -H 
"X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \ + "http://127.0.0.1:8384/rest/folder/errors?folder=$folder" | jq +done +``` + +**See**: [SYNCTHING.md](SYNCTHING.md) + +--- + +## Monthly Maintenance + +### System Updates + +#### Proxmox Updates + +**Check for updates**: +```bash +ssh pve 'apt update && apt list --upgradable' +ssh pve2 'apt update && apt list --upgradable' +``` + +**Apply updates**: +```bash +# PVE +ssh pve 'apt update && apt dist-upgrade -y' + +# PVE2 +ssh pve2 'apt update && apt dist-upgrade -y' + +# Reboot if kernel updated +ssh pve 'reboot' +ssh pve2 'reboot' +``` + +**⚠️ Important**: +- Check [Proxmox release notes](https://pve.proxmox.com/wiki/Roadmap) before major updates +- Test on PVE2 first if possible +- Ensure all VMs are backed up before updating +- Monitor VMs after reboot - some may need manual restart + +#### Container Updates (LXC) + +```bash +# Update all containers +ssh pve 'for ctid in 200 202 205; do pct exec $ctid -- bash -c "apt update && apt upgrade -y"; done' +``` + +#### VM Updates + +**Update VMs individually via SSH**: +```bash +# Ubuntu/Debian VMs +ssh truenas 'apt update && apt upgrade -y' +ssh docker-host 'apt update && apt upgrade -y' +ssh fs-dev 'apt update && apt upgrade -y' + +# Check if reboot required +ssh truenas '[ -f /var/run/reboot-required ] && echo "Reboot required"' +``` + +### ZFS Scrubs + +**Schedule**: Run monthly on all pools + +**PVE**: +```bash +# Start scrub on all pools +ssh pve 'zpool scrub nvme-mirror1' +ssh pve 'zpool scrub nvme-mirror2' +ssh pve 'zpool scrub rpool' + +# Check scrub status +ssh pve 'zpool status | grep -A2 scrub' +``` + +**PVE2**: +```bash +ssh pve2 'zpool scrub nvme-mirror3' +ssh pve2 'zpool scrub local-zfs2' +ssh pve2 'zpool status | grep -A2 scrub' +``` + +**TrueNAS**: +```bash +# Scrub via TrueNAS web UI or SSH +ssh truenas 'zpool scrub vault' +ssh truenas 'zpool status vault | grep -A2 scrub' +``` + +**Automate scrubs**: +```bash +# Add to crontab (run on 1st of month at 2 AM) +0 2 1 * * /sbin/zpool scrub nvme-mirror1 +0 2 1 * * /sbin/zpool scrub nvme-mirror2 +0 2 1 * * /sbin/zpool scrub rpool +``` + +**See**: [STORAGE.md](STORAGE.md) for pool details + +### SMART Tests + +**Run extended SMART tests monthly**: + +```bash +# TrueNAS drives (via QEMU agent) +ssh pve 'qm guest exec 100 -- bash -c "smartctl --scan | while read dev type; do smartctl -t long \$dev; done"' + +# Check results after 4-8 hours +ssh pve 'qm guest exec 100 -- bash -c "smartctl --scan | while read dev type; do echo \"=== \$dev ===\"; smartctl -a \$dev | grep -E \"Model|Serial|test result|Reallocated|Current_Pending\"; done"' + +# PVE drives +ssh pve 'for dev in /dev/nvme0 /dev/nvme1 /dev/sda /dev/sdb; do [ -e "$dev" ] && smartctl -t long $dev; done' + +# PVE2 drives +ssh pve2 'for dev in /dev/nvme0 /dev/nvme1 /dev/sda /dev/sdb; do [ -e "$dev" ] && smartctl -t long $dev; done' +``` + +**Automate SMART tests**: +```bash +# Add to crontab (run on 15th of month at 3 AM) +0 3 15 * * /usr/sbin/smartctl -t long /dev/nvme0 +0 3 15 * * /usr/sbin/smartctl -t long /dev/sda +``` + +### Certificate Renewal Verification + +**Check SSL certificate expiry**: +```bash +# Check Traefik certificates +ssh pve 'pct exec 202 -- cat /etc/traefik/acme.json | jq ".letsencrypt.Certificates[] | {domain: .domain.main, expires: .Dates.NotAfter}"' + +# Check specific service +echo | openssl s_client -servername git.htsn.io -connect git.htsn.io:443 2>/dev/null | openssl x509 -noout -dates +``` + +**Certificates should auto-renew 30 days before expiry via 
Traefik** + +**See**: [TRAEFIK.md](TRAEFIK.md) for certificate management + +### Backup Verification + +**⚠️ TODO**: No backup strategy currently in place + +**See**: [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md) for implementation plan + +--- + +## Quarterly Maintenance + +### Full System Audit + +**Check all systems comprehensively**: + +1. **ZFS Pool Health**: + ```bash + ssh pve 'zpool status -v' + ssh pve2 'zpool status -v' + ssh truenas 'zpool status -v vault' + ``` + Look for: errors, degraded vdevs, resilver operations + +2. **SMART Health**: + ```bash + # Run SMART health check script + ~/bin/smart-health-check.sh + ``` + Look for: reallocated sectors, pending sectors, failures + +3. **Disk Space Trends**: + ```bash + # Check growth rate + ssh pve 'zpool list -o name,size,allocated,free,fragmentation' + ssh truenas 'df -h /mnt/vault' + ``` + Plan for expansion if >80% full + +4. **VM Resource Usage**: + ```bash + # Check if VMs need more/less resources + ssh pve 'qm list' + ssh pve 'pvesh get /nodes/pve/status' + ``` + +5. **Network Performance**: + ```bash + # Test bandwidth between critical nodes + iperf3 -s # On one host + iperf3 -c 10.10.10.120 # From another + ``` + +6. **Temperature Monitoring**: + ```bash + # Check max temps over past quarter + # TODO: Set up Prometheus/Grafana for historical data + ssh pve 'sensors' + ssh pve2 'sensors' + ``` + +### Service Dependency Testing + +**Test critical paths**: + +1. **Power failure recovery** (if safe to test): + - See [UPS.md](UPS.md) for full procedure + - Verify VM startup order works + - Confirm all services come back online + +2. **Failover testing**: + - Tailscale subnet routing (PVE → UCG-Fiber) + - NUT monitoring (PVE server → PVE2 client) + +3. **Backup restoration** (when backups implemented): + - Test restoring a VM from backup + - Test restoring files from Syncthing versioning + +### Documentation Review + +- [ ] Update IP assignments in [IP-ASSIGNMENTS.md](IP-ASSIGNMENTS.md) +- [ ] Review and update service URLs in [SERVICES.md](SERVICES.md) +- [ ] Check for missing hardware specs in [HARDWARE.md](HARDWARE.md) +- [ ] Update any changed procedures in this document + +--- + +## Annual Maintenance + +### Hardware Maintenance + +**Physical cleaning**: +```bash +# Shut down servers (coordinate with users) +ssh pve 'shutdown -h now' +ssh pve2 'shutdown -h now' + +# Clean dust from: +# - CPU heatsinks +# - GPU fans +# - Case fans +# - PSU vents +# - Storage enclosure fans + +# Check for: +# - Bulging capacitors on PSU/motherboard +# - Loose cables +# - Fan noise/vibration +``` + +**Thermal paste inspection** (every 2-3 years): +- Check CPU temps vs baseline +- If temps >85°C under load, consider reapplying paste +- Threadripper PRO: Tctl max safe = 90°C + +**See**: [HARDWARE.md](HARDWARE.md) for component details + +### UPS Battery Test + +**Runtime test**: +```bash +# Check battery health +ssh pve 'upsc cyberpower@localhost | grep battery' + +# Perform runtime test (coordinate power loss) +# 1. Note current runtime estimate +# 2. Unplug UPS from wall +# 3. Let battery drain to 20% +# 4. Note actual runtime vs estimate +# 5. 
Plug back in before shutdown triggers + +# Battery replacement if: +# - Runtime < 10 min at typical load +# - Battery age > 3-5 years +# - Battery charge < 100% when on AC for 24h +``` + +**See**: [UPS.md](UPS.md) for full UPS details + +### Drive Replacement Planning + +**Check drive age and health**: +```bash +# Get drive hours and health +ssh truenas 'smartctl --scan | while read dev type; do + echo "=== $dev ==="; + smartctl -a $dev | grep -E "Model|Serial|Power_On_Hours|Reallocated|Pending"; +done' +``` + +**Replace drives if**: +- Reallocated sectors > 0 +- Pending sectors > 0 +- SMART pre-fail warnings +- Age > 5 years for HDDs (3-5 years for SSDs/NVMe) +- Hours > 50,000 for consumer drives + +**Budget for replacements**: +- HDDs: WD Red 6TB (~$150/drive) +- NVMe: Samsung/Kingston 2TB (~$150-200/drive) + +### Capacity Planning + +**Review growth trends**: +```bash +# Storage growth (compare to last year) +ssh pve 'zpool list' +ssh truenas 'df -h /mnt/vault' + +# Network bandwidth (if monitoring in place) +# Review Grafana dashboards + +# Power consumption +ssh pve 'upsc cyberpower@localhost ups.load' +``` + +**Plan expansions**: +- Storage: Add drives if >70% full +- RAM: Check if VMs hitting limits +- Network: Upgrade if bandwidth saturation +- UPS: Upgrade if load >80% + +### License and Subscription Review + +**Proxmox subscription** (if applicable): +- Community (free) or Enterprise subscription? +- Check for updates to pricing/features + +**Service subscriptions**: +- Domain registration (htsn.io) +- Cloudflare plan (currently free) +- Let's Encrypt (free, no action needed) + +--- + +## Update Schedules + +### Proxmox + +| Component | Frequency | Notes | +|-----------|-----------|-------| +| Security patches | Weekly | Via `apt upgrade` | +| Minor updates | Monthly | Test on PVE2 first | +| Major versions | Quarterly | Read release notes, plan downtime | +| Kernel updates | Monthly | Requires reboot | + +**Update procedure**: +1. Check [Proxmox release notes](https://pve.proxmox.com/wiki/Roadmap) +2. Backup VM configs: `vzdump --dumpdir /tmp` +3. Update: `apt update && apt dist-upgrade` +4. Reboot if kernel changed: `reboot` +5. Verify VMs auto-started: `qm list` + +### Containers (LXC) + +| Container | Update Frequency | Package Manager | +|-----------|------------------|-----------------| +| Pi-hole (200) | Weekly | `apt` | +| Traefik (202) | Monthly | `apt` | +| FindShyt (205) | As needed | `apt` | + +**Update command**: +```bash +ssh pve 'pct exec CTID -- bash -c "apt update && apt upgrade -y"' +``` + +### VMs + +| VM | Update Frequency | Notes | +|----|------------------|-------| +| TrueNAS | Monthly | Via web UI or `apt` | +| Saltbox | Weekly | Managed by Saltbox updates | +| HomeAssistant | Monthly | Via HA supervisor | +| Docker-host | Weekly | `apt` + Docker images | +| Trading-VM | As needed | Via SSH | +| Gitea-VM | Monthly | Via web UI + `apt` | + +**Docker image updates**: +```bash +ssh docker-host 'docker-compose pull && docker-compose up -d' +``` + +### Firmware Updates + +| Component | Check Frequency | Update Method | +|-----------|----------------|---------------| +| Motherboard BIOS | Annually | Manual flash (high risk) | +| GPU firmware | Rarely | `nvidia-smi` or manual | +| SSD/NVMe firmware | Quarterly | Vendor tools | +| HBA firmware | Annually | LSI tools | +| UPS firmware | Annually | PowerPanel or manual | + +**⚠️ Warning**: BIOS/firmware updates carry risk. 
Only update if: +- Critical security issue +- Needed for hardware compatibility +- Fixing known bug affecting you + +--- + +## Testing Checklists + +### Pre-Update Checklist + +Before ANY system update: +- [ ] Check current system state: `uptime`, `qm list`, `zpool status` +- [ ] Verify backups are current (when backup system in place) +- [ ] Check for critical VMs/services that can't have downtime +- [ ] Review update changelog/release notes +- [ ] Test on non-critical system first (PVE2 or test VM) +- [ ] Plan rollback strategy if update fails +- [ ] Notify users if downtime expected + +### Post-Update Checklist + +After system update: +- [ ] Verify system booted correctly: `uptime` +- [ ] Check all VMs/CTs started: `qm list`, `pct list` +- [ ] Test critical services: + - [ ] Pi-hole DNS: `nslookup google.com 10.10.10.10` + - [ ] Traefik routing: `curl -I https://plex.htsn.io` + - [ ] NFS/SMB shares: Test mount from VM + - [ ] Syncthing sync: Check all devices connected +- [ ] Review logs for errors: `journalctl -p err -b` +- [ ] Check temperatures: `sensors` +- [ ] Verify UPS monitoring: `upsc cyberpower@localhost` + +### Disaster Recovery Test + +**Quarterly test** (when backup system in place): +- [ ] Simulate VM failure: Restore from backup +- [ ] Simulate storage failure: Import pool on different system +- [ ] Simulate network failure: Verify Tailscale failover +- [ ] Simulate power failure: Test UPS shutdown procedure (if safe) +- [ ] Document recovery time and issues + +--- + +## Log Rotation + +**System logs** are automatically rotated by systemd-journald and logrotate. + +**Check log sizes**: +```bash +# Journalctl size +ssh pve 'journalctl --disk-usage' + +# Traefik logs +ssh pve 'pct exec 202 -- du -sh /var/log/traefik/' +``` + +**Configure retention**: +```bash +# Limit journald to 500MB +ssh pve 'echo "SystemMaxUse=500M" >> /etc/systemd/journald.conf' +ssh pve 'systemctl restart systemd-journald' +``` + +**Traefik log rotation** (already configured): +```bash +# /etc/logrotate.d/traefik on CT 202 +/var/log/traefik/*.log { + daily + rotate 7 + compress + delaycompress + missingok + notifempty +} +``` + +--- + +## Monitoring Integration + +**TODO**: Set up automated monitoring for these procedures + +**When monitoring is implemented** (see [MONITORING.md](MONITORING.md)): +- ZFS scrub completion/errors +- SMART test failures +- Certificate expiry warnings (<30 days) +- Update availability notifications +- Disk space thresholds (>80%) +- Temperature warnings (>85°C) + +--- + +## Related Documentation + +- [MONITORING.md](MONITORING.md) - Automated health checks and alerts +- [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md) - Backup implementation plan +- [UPS.md](UPS.md) - Power failure procedures +- [STORAGE.md](STORAGE.md) - ZFS pool management +- [HARDWARE.md](HARDWARE.md) - Hardware specifications +- [SERVICES.md](SERVICES.md) - Service inventory + +--- + +**Last Updated**: 2025-12-22 +**Status**: ⚠️ Manual procedures only - monitoring automation needed diff --git a/MONITORING.md b/MONITORING.md new file mode 100644 index 0000000..fad8286 --- /dev/null +++ b/MONITORING.md @@ -0,0 +1,546 @@ +# Monitoring and Alerting + +Documentation for system monitoring, health checks, and alerting across the homelab. + +## Current Monitoring Status + +| Component | Monitored? 
| Method | Alerts | Notes | +|-----------|------------|--------|--------|-------| +| **UPS** | ✅ Yes | NUT + Home Assistant | ❌ No | Battery, load, runtime tracked | +| **Syncthing** | ✅ Partial | API (manual checks) | ❌ No | Connection status available | +| **Server temps** | ✅ Partial | Manual checks | ❌ No | Via `sensors` command | +| **VM status** | ✅ Partial | Proxmox UI | ❌ No | Manual monitoring | +| **ZFS health** | ❌ No | Manual `zpool status` | ❌ No | No automated checks | +| **Disk health (SMART)** | ❌ No | Manual `smartctl` | ❌ No | No automated checks | +| **Network** | ❌ No | - | ❌ No | No uptime monitoring | +| **Services** | ❌ No | - | ❌ No | No health checks | +| **Backups** | ❌ No | - | ❌ No | No verification | + +**Overall Status**: ⚠️ **MINIMAL** - Most monitoring is manual, no automated alerts + +--- + +## Existing Monitoring + +### UPS Monitoring (NUT) + +**Status**: ✅ **Active and working** + +**What's monitored**: +- Battery charge percentage +- Runtime remaining (seconds) +- Load percentage +- Input/output voltage +- UPS status (OL/OB/LB) + +**Access**: +```bash +# Full UPS status +ssh pve 'upsc cyberpower@localhost' + +# Key metrics +ssh pve 'upsc cyberpower@localhost | grep -E "battery.charge:|battery.runtime:|ups.load:|ups.status:"' +``` + +**Home Assistant Integration**: +- Sensors: `sensor.cyberpower_*` +- Can be used for automation/alerts +- Currently: No alerts configured + +**See**: [UPS.md](UPS.md) + +--- + +### Syncthing Monitoring + +**Status**: ⚠️ **Partial** - API available, no automated monitoring + +**What's available**: +- Device connection status +- Folder sync status +- Sync errors +- Bandwidth usage + +**Manual Checks**: +```bash +# Check connections (Mac Mini) +curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \ + "http://127.0.0.1:8384/rest/system/connections" | \ + python3 -c "import sys,json; d=json.load(sys.stdin)['connections']; \ + [print(f\"{v.get('name',k[:7])}: {'UP' if v['connected'] else 'DOWN'}\") for k,v in d.items()]" + +# Check folder status +curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \ + "http://127.0.0.1:8384/rest/db/status?folder=documents" | jq + +# Check errors +curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \ + "http://127.0.0.1:8384/rest/folder/errors?folder=documents" | jq +``` + +**Needs**: Automated monitoring script + alerts + +**See**: [SYNCTHING.md](SYNCTHING.md) + +--- + +### Temperature Monitoring + +**Status**: ⚠️ **Manual only** + +**Current Method**: +```bash +# CPU temperature (Threadripper Tctl) +ssh pve 'for f in /sys/class/hwmon/hwmon*/temp*_input; do \ + label=$(cat ${f%_input}_label 2>/dev/null); \ + if [ "$label" = "Tctl" ]; then echo "PVE Tctl: $(($(cat $f)/1000))°C"; fi; done' + +ssh pve2 'for f in /sys/class/hwmon/hwmon*/temp*_input; do \ + label=$(cat ${f%_input}_label 2>/dev/null); \ + if [ "$label" = "Tctl" ]; then echo "PVE2 Tctl: $(($(cat $f)/1000))°C"; fi; done' +``` + +**Thresholds**: +- Healthy: 70-80°C under load +- Warning: >85°C +- Critical: >90°C (throttling) + +**Needs**: Automated monitoring + alert if >85°C + +--- + +### Proxmox VM Monitoring + +**Status**: ⚠️ **Manual only** + +**Current Access**: +- Proxmox Web UI: Node → Summary +- CLI: `ssh pve 'qm list'` + +**Metrics Available** (via Proxmox): +- CPU usage per VM +- RAM usage per VM +- Disk I/O +- Network I/O +- VM uptime + +**Needs**: API-based monitoring + alerts for VM down + +--- + +## Recommended Monitoring Stack + +### Option 1: Prometheus + Grafana (Recommended) + +**Why**: +- Industry standard 
+- Extensive integrations +- Beautiful dashboards +- Flexible alerting + +**Architecture**: +``` +Grafana (dashboard) → Prometheus (metrics DB) → Exporters (data collection) + ↓ + Alertmanager (alerts) +``` + +**Required Exporters**: +| Exporter | Monitors | Install On | +|----------|----------|------------| +| node_exporter | CPU, RAM, disk, network | PVE, PVE2, TrueNAS, all VMs | +| zfs_exporter | ZFS pool health | PVE, PVE2, TrueNAS | +| smartmon_exporter | Drive SMART data | PVE, PVE2, TrueNAS | +| nut_exporter | UPS metrics | PVE | +| proxmox_exporter | VM/CT stats | PVE, PVE2 | +| cadvisor | Docker containers | Saltbox, docker-host | + +**Deployment**: +```bash +# Create monitoring VM +ssh pve 'qm create 210 --name monitoring --memory 4096 --cores 2 \ + --net0 virtio,bridge=vmbr0' + +# Install Prometheus + Grafana (via Docker) +# /opt/monitoring/docker-compose.yml +``` + +**Estimated Setup Time**: 4-6 hours + +--- + +### Option 2: Uptime Kuma (Simpler Alternative) + +**Why**: +- Lightweight +- Easy to set up +- Web-based dashboard +- Built-in alerts (email, Slack, etc.) + +**What it monitors**: +- HTTP/HTTPS endpoints +- Ping (ICMP) +- Ports (TCP) +- Docker containers + +**Deployment**: +```bash +ssh docker-host 'mkdir -p /opt/uptime-kuma' +cat > docker-compose.yml << 'EOF' +version: "3.8" +services: + uptime-kuma: + image: louislam/uptime-kuma:latest + ports: + - "3001:3001" + volumes: + - ./data:/app/data + restart: unless-stopped +EOF + +# Access: http://10.10.10.206:3001 +# Add Traefik config for uptime.htsn.io +``` + +**Estimated Setup Time**: 1-2 hours + +--- + +### Option 3: Netdata (Real-time Monitoring) + +**Why**: +- Real-time metrics (1-second granularity) +- Auto-discovers services +- Low overhead +- Beautiful web UI + +**Deployment**: +```bash +# Install on each server +ssh pve 'bash <(curl -Ss https://my-netdata.io/kickstart.sh)' +ssh pve2 'bash <(curl -Ss https://my-netdata.io/kickstart.sh)' + +# Access: +# http://10.10.10.120:19999 (PVE) +# http://10.10.10.102:19999 (PVE2) +``` + +**Parent-Child Setup** (optional): +- Configure PVE as parent +- Stream metrics from PVE2 → PVE +- Single dashboard for both servers + +**Estimated Setup Time**: 1 hour + +--- + +## Critical Metrics to Monitor + +### Server Health + +| Metric | Threshold | Action | +|--------|-----------|--------| +| **CPU usage** | >90% for 5 min | Alert | +| **CPU temp** | >85°C | Alert | +| **CPU temp** | >90°C | Critical alert | +| **RAM usage** | >95% | Alert | +| **Disk space** | >80% | Warning | +| **Disk space** | >90% | Alert | +| **Load average** | >CPU count | Alert | + +### Storage Health + +| Metric | Threshold | Action | +|--------|-----------|--------| +| **ZFS pool errors** | >0 | Alert immediately | +| **ZFS pool degraded** | Any degraded vdev | Critical alert | +| **ZFS scrub failed** | Last scrub error | Alert | +| **SMART reallocated sectors** | >0 | Warning | +| **SMART pending sectors** | >0 | Alert | +| **SMART failure** | Pre-fail | Critical - replace drive | + +### UPS + +| Metric | Threshold | Action | +|--------|-----------|--------| +| **Battery charge** | <20% | Warning | +| **Battery charge** | <10% | Alert | +| **On battery** | >5 min | Alert | +| **Runtime** | <5 min | Critical | + +### Network + +| Metric | Threshold | Action | +|--------|-----------|--------| +| **Device unreachable** | >2 min down | Alert | +| **High packet loss** | >5% | Warning | +| **Bandwidth saturation** | >90% | Warning | + +### VMs/Services + +| Metric | Threshold | Action | 
+|--------|-----------|--------| +| **VM stopped** | Critical VM down | Alert immediately | +| **Service unreachable** | HTTP 5xx or timeout | Alert | +| **Backup failed** | Any backup failure | Alert | +| **Certificate expiry** | <30 days | Warning | +| **Certificate expiry** | <7 days | Alert | + +--- + +## Alert Destinations + +### Email Alerts + +**Recommended**: Set up SMTP relay for email alerts + +**Options**: +1. Gmail SMTP (free, rate-limited) +2. SendGrid (free tier: 100 emails/day) +3. Mailgun (free tier available) +4. Self-hosted mail server (complex) + +**Configuration Example** (Prometheus Alertmanager): +```yaml +# /etc/alertmanager/alertmanager.yml +receivers: + - name: 'email' + email_configs: + - to: 'hutson@example.com' + from: 'alerts@htsn.io' + smarthost: 'smtp.gmail.com:587' + auth_username: 'alerts@htsn.io' + auth_password: 'app-password-here' +``` + +--- + +### Push Notifications + +**Options**: +- **Pushover**: $5 one-time, reliable +- **Pushbullet**: Free tier available +- **Telegram Bot**: Free +- **Discord Webhook**: Free +- **Slack**: Free tier available + +**Recommended**: Pushover or Telegram for mobile alerts + +--- + +### Home Assistant Alerts + +Since Home Assistant is already running, use it for alerts: + +**Automation Example**: +```yaml +automation: + - alias: "UPS Low Battery Alert" + trigger: + - platform: numeric_state + entity_id: sensor.cyberpower_battery_charge + below: 20 + action: + - service: notify.mobile_app + data: + message: "⚠️ UPS battery at {{ states('sensor.cyberpower_battery_charge') }}%" + + - alias: "Server High Temperature" + trigger: + - platform: template + value_template: "{{ sensor.pve_cpu_temp > 85 }}" + action: + - service: notify.mobile_app + data: + message: "🔥 PVE CPU temperature: {{ states('sensor.pve_cpu_temp') }}°C" +``` + +**Needs**: Sensors for CPU temp, disk space, etc. 
in Home Assistant + +--- + +## Monitoring Scripts + +### Daily Health Check + +Save as `~/bin/homelab-health-check.sh`: + +```bash +#!/bin/bash +# Daily homelab health check + +echo "=== Homelab Health Check ===" +echo "Date: $(date)" +echo "" + +echo "=== Server Status ===" +ssh pve 'uptime' 2>/dev/null || echo "PVE: UNREACHABLE" +ssh pve2 'uptime' 2>/dev/null || echo "PVE2: UNREACHABLE" +echo "" + +echo "=== CPU Temperatures ===" +ssh pve 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE: $(($(cat $f)/1000))°C"; fi; done' +ssh pve2 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE2: $(($(cat $f)/1000))°C"; fi; done' +echo "" + +echo "=== UPS Status ===" +ssh pve 'upsc cyberpower@localhost | grep -E "battery.charge:|battery.runtime:|ups.load:|ups.status:"' +echo "" + +echo "=== ZFS Pools ===" +ssh pve 'zpool status -x' 2>/dev/null +ssh pve2 'zpool status -x' 2>/dev/null +ssh truenas 'zpool status -x vault' +echo "" + +echo "=== Disk Space ===" +ssh pve 'df -h | grep -E "Filesystem|/dev/(nvme|sd)"' +ssh truenas 'df -h /mnt/vault' +echo "" + +echo "=== VM Status ===" +ssh pve 'qm list | grep running | wc -l' | xargs echo "PVE VMs running:" +ssh pve2 'qm list | grep running | wc -l' | xargs echo "PVE2 VMs running:" +echo "" + +echo "=== Syncthing Connections ===" +curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \ + "http://127.0.0.1:8384/rest/system/connections" | \ + python3 -c "import sys,json; d=json.load(sys.stdin)['connections']; \ + [print(f\"{v.get('name',k[:7])}: {'UP' if v['connected'] else 'DOWN'}\") for k,v in d.items()]" +echo "" + +echo "=== Check Complete ===" +``` + +**Run daily**: +```cron +0 9 * * * ~/bin/homelab-health-check.sh | mail -s "Homelab Health Check" hutson@example.com +``` + +--- + +### ZFS Scrub Checker + +```bash +#!/bin/bash +# Check last ZFS scrub status + +echo "=== ZFS Scrub Status ===" + +for host in pve pve2; do + echo "--- $host ---" + ssh $host 'zpool status | grep -A1 scrub' + echo "" +done + +echo "--- TrueNAS ---" +ssh truenas 'zpool status vault | grep -A1 scrub' +``` + +--- + +### SMART Health Checker + +```bash +#!/bin/bash +# Check SMART health on all drives + +echo "=== SMART Health Check ===" + +echo "--- TrueNAS Drives ---" +ssh truenas 'smartctl --scan | while read dev type; do + echo "=== $dev ==="; + smartctl -H $dev | grep -E "SMART overall|PASSED|FAILED"; +done' + +echo "--- PVE Drives ---" +ssh pve 'for dev in /dev/nvme* /dev/sd*; do + [ -e "$dev" ] && echo "=== $dev ===" && smartctl -H $dev | grep -E "SMART|PASSED|FAILED"; +done' +``` + +--- + +## Dashboard Recommendations + +### Grafana Dashboard Layout + +**Page 1: Overview** +- Server uptime +- CPU usage (all servers) +- RAM usage (all servers) +- Disk space (all pools) +- Network traffic +- UPS status + +**Page 2: Storage** +- ZFS pool health +- SMART status for all drives +- I/O latency +- Scrub progress +- Disk temperatures + +**Page 3: VMs** +- VM status (up/down) +- VM resource usage +- VM disk I/O +- VM network traffic + +**Page 4: Services** +- Service health checks +- HTTP response times +- Certificate expiry dates +- Syncthing sync status + +--- + +## Implementation Plan + +### Phase 1: Basic Monitoring (Week 1) + +- [ ] Install Uptime Kuma or Netdata +- [ ] Add HTTP checks for all services +- [ ] Configure UPS alerts in Home Assistant +- [ ] Set up daily health check email + +**Estimated Time**: 
4-6 hours + +--- + +### Phase 2: Advanced Monitoring (Week 2-3) + +- [ ] Install Prometheus + Grafana +- [ ] Deploy node_exporter on all servers +- [ ] Deploy zfs_exporter +- [ ] Deploy smartmon_exporter +- [ ] Create Grafana dashboards + +**Estimated Time**: 8-12 hours + +--- + +### Phase 3: Alerting (Week 4) + +- [ ] Configure Alertmanager +- [ ] Set up email/push notifications +- [ ] Create alert rules for all critical metrics +- [ ] Test all alert paths +- [ ] Document alert procedures + +**Estimated Time**: 4-6 hours + +--- + +## Related Documentation + +- [UPS.md](UPS.md) - UPS monitoring details +- [STORAGE.md](STORAGE.md) - ZFS health checks +- [SERVICES.md](SERVICES.md) - Service inventory +- [HOMEASSISTANT.md](HOMEASSISTANT.md) - Home Assistant automations +- [MAINTENANCE.md](MAINTENANCE.md) - Regular maintenance checks + +--- + +**Last Updated**: 2025-12-22 +**Status**: ⚠️ **Minimal monitoring currently in place - implementation needed** diff --git a/POWER-MANAGEMENT.md b/POWER-MANAGEMENT.md new file mode 100644 index 0000000..dbc049f --- /dev/null +++ b/POWER-MANAGEMENT.md @@ -0,0 +1,509 @@ +# Power Management and Optimization + +Documentation of power optimizations applied to reduce idle power consumption and heat generation. + +## Overview + +Combined estimated power draw: **~1000-1350W under load**, **500-700W idle** + +Through various optimizations, we've reduced idle power consumption by approximately **150-250W** compared to default settings. + +--- + +## Power Draw Estimates + +### PVE (10.10.10.120) + +| Component | Idle | Load | TDP | +|-----------|------|------|-----| +| Threadripper PRO 3975WX | 150-200W | 400-500W | 280W | +| NVIDIA TITAN RTX | 2-3W | 250W | 280W | +| NVIDIA Quadro P2000 | 25W | 70W | 75W | +| RAM (128 GB DDR4) | 30-40W | 30-40W | - | +| Storage (NVMe + SSD) | 20-30W | 40-50W | - | +| HBAs, fans, misc | 20-30W | 20-30W | - | +| **Total** | **250-350W** | **800-940W** | - | + +### PVE2 (10.10.10.102) + +| Component | Idle | Load | TDP | +|-----------|------|------|-----| +| Threadripper PRO 3975WX | 150-200W | 400-500W | 280W | +| NVIDIA RTX A6000 | 11W | 280W | 300W | +| RAM (128 GB DDR4) | 30-40W | 30-40W | - | +| Storage (NVMe + HDD) | 20-30W | 40-50W | - | +| Fans, misc | 15-20W | 15-20W | - | +| **Total** | **226-330W** | **765-890W** | - | + +### Combined + +| Metric | Idle | Load | +|--------|------|------| +| Servers | 476-680W | 1565-1830W | +| Network gear | ~50W | ~50W | +| **Total** | **~530-730W** | **~1615-1880W** | +| **UPS Load** | 40-55% | 120-140% ⚠️ | + +**Note**: UPS capacity is 1320W. Under heavy load, servers can exceed UPS capacity, which is acceptable since high load is rare. + +--- + +## Optimizations Applied + +### 1. KSMD Disabled (2024-12-17) + +**KSM** (Kernel Same-page Merging) scans memory to deduplicate identical pages across VMs. 
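+
+To see how much memory KSM is actually sharing at any point — useful when judging whether the scan overhead is worth it, e.g. after an update re-enables it — the kernel exposes counters in sysfs (standard paths; a quick sketch):
+
+```bash
+# Pages currently deduplicated by KSM (4 KiB pages on x86-64)
+ssh pve 'echo "run:           $(cat /sys/kernel/mm/ksm/run)";
+         echo "pages_sharing: $(cat /sys/kernel/mm/ksm/pages_sharing)";
+         echo "approx saved:  $(( $(cat /sys/kernel/mm/ksm/pages_sharing) * 4 / 1024 )) MiB"'
+```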
+ +**Problem**: +- KSMD was consuming 44-57% CPU continuously on PVE +- Caused CPU temp to rise from 74°C to 83°C +- **Negative profit**: More power spent scanning than saved from deduplication + +**Solution**: Disabled KSM permanently + +**Configuration**: + +**Systemd service**: `/etc/systemd/system/disable-ksm.service` +```ini +[Unit] +Description=Disable KSM (Kernel Same-page Merging) +After=multi-user.target + +[Service] +Type=oneshot +ExecStart=/bin/sh -c 'echo 0 > /sys/kernel/mm/ksm/run' +RemainAfterExit=yes + +[Install] +WantedBy=multi-user.target +``` + +**Enable and start**: +```bash +systemctl daemon-reload +systemctl enable --now disable-ksm +systemctl mask ksmtuned # Prevent re-enabling +``` + +**Verify**: +```bash +# KSM should be disabled (run=0) +cat /sys/kernel/mm/ksm/run # Should output: 0 + +# ksmd should show 0% CPU +ps aux | grep ksmd +``` + +**Savings**: ~60-80W + significant temperature reduction (74°C → 83°C prevented) + +**⚠️ Important**: Proxmox updates sometimes re-enable KSM. If CPU is unexpectedly hot, check: +```bash +cat /sys/kernel/mm/ksm/run +# If 1, disable it: +echo 0 > /sys/kernel/mm/ksm/run +systemctl mask ksmtuned +``` + +--- + +### 2. CPU Governor Optimization (2024-12-16) + +Default CPU governor keeps cores at max frequency even when idle, wasting power. + +#### PVE: `amd-pstate-epp` Driver + +**Driver**: `amd-pstate-epp` (modern AMD P-state driver) +**Governor**: `powersave` +**EPP**: `balance_power` + +**Configuration**: + +**Systemd service**: `/etc/systemd/system/cpu-powersave.service` +```ini +[Unit] +Description=Set CPU governor to powersave with balance_power EPP +After=multi-user.target + +[Service] +Type=oneshot +ExecStart=/bin/sh -c 'for cpu in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo powersave > $cpu; done' +ExecStart=/bin/sh -c 'for cpu in /sys/devices/system/cpu/cpu*/cpufreq/energy_performance_preference; do echo balance_power > $cpu; done' +RemainAfterExit=yes + +[Install] +WantedBy=multi-user.target +``` + +**Enable**: +```bash +systemctl daemon-reload +systemctl enable --now cpu-powersave +``` + +**Verify**: +```bash +# Check governor +cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor +# Output: powersave + +# Check EPP +cat /sys/devices/system/cpu/cpu0/cpufreq/energy_performance_preference +# Output: balance_power + +# Check current frequency (should be low when idle) +grep MHz /proc/cpuinfo | head -5 +# Should show ~1700-2200 MHz idle, up to 4000 MHz under load +``` + +#### PVE2: `acpi-cpufreq` Driver + +**Driver**: `acpi-cpufreq` (older ACPI driver) +**Governor**: `schedutil` (adaptive, better than powersave for this driver) + +**Configuration**: + +**Systemd service**: `/etc/systemd/system/cpu-powersave.service` +```ini +[Unit] +Description=Set CPU governor to schedutil +After=multi-user.target + +[Service] +Type=oneshot +ExecStart=/bin/sh -c 'for cpu in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo schedutil > $cpu; done' +RemainAfterExit=yes + +[Install] +WantedBy=multi-user.target +``` + +**Enable**: +```bash +systemctl daemon-reload +systemctl enable --now cpu-powersave +``` + +**Verify**: +```bash +cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor +# Output: schedutil + +grep MHz /proc/cpuinfo | head -5 +# Should show ~1700-2200 MHz idle +``` + +**Savings**: ~60-120W combined (CPUs now idle at 1.7-2.2 GHz instead of 4 GHz) + +**Performance impact**: Minimal - CPU still boosts to max frequency under load + +--- + +### 3. 
GPU Power States (2024-12-16) + +GPUs automatically enter low-power states when idle. Verified optimal. + +| GPU | Location | Idle Power | P-State | Notes | +|-----|----------|------------|---------|-------| +| RTX A6000 | PVE2 | 11W | P8 | Excellent idle power | +| TITAN RTX | PVE | 2-3W | P8 | Excellent idle power | +| Quadro P2000 | PVE | 25W | P0 | Plex keeps it active | + +**Check GPU power state**: +```bash +# Via nvidia-smi (if installed in VM) +ssh lmdev1 'nvidia-smi --query-gpu=name,power.draw,pstate --format=csv' + +# Expected output: +# name, power.draw [W], pstate +# NVIDIA TITAN RTX, 2.50 W, P8 + +# Via lspci (from Proxmox host - shows link speed, not power) +ssh pve 'lspci | grep -i nvidia' +``` + +**P-States**: +- **P0**: Maximum performance +- **P8**: Minimum power (idle) + +**No action needed** - GPUs automatically manage power states. + +**Savings**: N/A (already optimal) + +--- + +### 4. Syncthing Rescan Intervals (2024-12-16) + +Aggressive 60-second rescans were keeping TrueNAS VM at 86% CPU constantly. + +**Changed**: +- Large folders: 60s → **3600s** (1 hour) +- Affected: downloads (38GB), documents (11GB), desktop (7.2GB), movies, pictures, notes, config + +**Configuration**: Via Syncthing UI on each device +- Settings → Folders → [Folder Name] → Advanced → Rescan Interval + +**Savings**: ~60-80W (TrueNAS CPU usage dropped from 86% to <10%) + +**Trade-off**: Changes take up to 1 hour to detect instead of 1 minute +- Still acceptable for most use cases +- Manual rescan available if needed: `curl -X POST "http://localhost:8384/rest/db/scan?folder=FOLDER" -H "X-API-Key: API_KEY"` + +--- + +### 5. ksmtuned Disabled (2024-12-16) + +**ksmtuned** is the daemon that tunes KSM parameters. Even with KSM disabled, the tuning daemon was still running. + +**Solution**: Stopped and disabled on both servers + +```bash +systemctl stop ksmtuned +systemctl disable ksmtuned +systemctl mask ksmtuned # Prevent re-enabling +``` + +**Savings**: ~2-5W + +--- + +### 6. 
HDD Spindown on PVE2 (2024-12-16)

**Problem**: `local-zfs2` pool (2x WD Red 6TB HDD) had only 768 KB used, but the drives were spinning 24/7

**Solution**: Configure 30-minute spindown timeout

**Udev rule**: `/etc/udev/rules.d/69-hdd-spindown.rules`
```udev
# Spin down WD Red 6TB drives after 30 minutes idle
ACTION=="add|change", KERNEL=="sd[a-z]", ATTRS{model}=="WDC WD60EFRX-68L*", RUN+="/sbin/hdparm -S 241 /dev/%k"
```

**hdparm value**: 241 = 30 minutes
- Formula: values 1-240 mean `value * 5 seconds`; values 241-251 mean `(value - 240) * 30 minutes`
- 241 → (241 - 240) * 30 minutes = 30 minutes

**Apply rule**:
```bash
udevadm control --reload-rules
udevadm trigger

# Verify drives have spindown set
hdparm -I /dev/sda | grep -i standby
hdparm -I /dev/sdb | grep -i standby
```

**Check if drives are spun down**:
```bash
hdparm -C /dev/sda
# Output: drive state is: standby (spun down)
# or: drive state is: active/idle (spinning)
```

**Savings**: ~10-16W when spun down (8W per drive)

**Trade-off**: 5-10 second delay when accessing pool after spindown

---

## Potential Optimizations (Not Yet Applied)

### PCIe ASPM (Active State Power Management)

**Benefit**: Reduce power of idle PCIe devices
**Risk**: May cause stability issues with some devices
**Estimated savings**: 5-15W

**Test**:
```bash
# Check current ASPM state
lspci -vv | grep -i aspm

# Enable ASPM (test first)
# Add to kernel cmdline: pcie_aspm=force
# Edit /etc/default/grub:
GRUB_CMDLINE_LINUX_DEFAULT="quiet pcie_aspm=force"

# Update grub
update-grub
reboot
```

### NMI Watchdog Disable

**Benefit**: Reduce CPU wakeups
**Risk**: Harder to debug kernel hangs
**Estimated savings**: 1-3W

**Test**:
```bash
# Disable NMI watchdog
echo 0 > /proc/sys/kernel/nmi_watchdog

# Make permanent (add to kernel cmdline)
# Edit /etc/default/grub:
GRUB_CMDLINE_LINUX_DEFAULT="quiet nmi_watchdog=0"

update-grub
reboot
```

---

## Monitoring

### CPU Frequency

```bash
# Current frequency on all cores
ssh pve 'grep MHz /proc/cpuinfo | head -10'

# Governor
ssh pve 'cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor'

# Available governors
ssh pve 'cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors'
```

### CPU Temperature

```bash
# PVE
ssh pve 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE Tctl: $(($(cat $f)/1000))°C"; fi; done'

# PVE2
ssh pve2 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE2 Tctl: $(($(cat $f)/1000))°C"; fi; done'
```

**Healthy temps**: 70-80°C under load
**Warning**: >85°C
**Throttle**: 90°C (Tctl max for Threadripper PRO)

### GPU Power Draw

```bash
# If nvidia-smi installed in VM
ssh lmdev1 'nvidia-smi --query-gpu=name,power.draw,power.limit,pstate --format=csv'

# Sample output:
# name, power.draw [W], power.limit [W], pstate
# NVIDIA TITAN RTX, 2.50 W, 280.00 W, P8
```

### Power Consumption (UPS)

```bash
# Check UPS load percentage
ssh pve 'upsc cyberpower@localhost ups.load'

# Battery runtime (seconds)
ssh pve 'upsc cyberpower@localhost battery.runtime'

# Full UPS status
ssh pve 'upsc cyberpower@localhost'
```

See [UPS.md](UPS.md) for more UPS monitoring details.
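
### Power Optimization Audit

To catch regressions after Proxmox updates (KSM re-enabling itself, CPU governor resets), a single script can re-check the applied optimizations in one pass. A minimal sketch, assuming the `pve`/`pve2` SSH aliases from [SSH-ACCESS.md](SSH-ACCESS.md) and that the WD Red drives on PVE2 are `/dev/sda` and `/dev/sdb`:

```bash
#!/bin/bash
# Re-check applied power optimizations on both hosts

for host in pve pve2; do
  echo "=== $host ==="
  # KSM should be 0 (disabled); Proxmox updates sometimes turn it back on
  echo -n "KSM run flag: "; ssh $host 'cat /sys/kernel/mm/ksm/run'
  # Expected: powersave on PVE, schedutil on PVE2
  echo -n "CPU governor: "; ssh $host 'cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor'
  echo ""
done

# HDD spindown state on PVE2 (adjust device names if they differ)
ssh pve2 'hdparm -C /dev/sda /dev/sdb'
```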
+ +### ZFS ARC Memory Usage + +```bash +# PVE +ssh pve 'arc_summary | grep -A5 "ARC size"' + +# TrueNAS +ssh truenas 'arc_summary | grep -A5 "ARC size"' +``` + +**ARC** (Adaptive Replacement Cache) uses RAM for ZFS caching. Adjust if needed: + +```bash +# Limit ARC to 32 GB (example) +# Edit /etc/modprobe.d/zfs.conf: +options zfs zfs_arc_max=34359738368 + +# Apply (reboot required) +update-initramfs -u +reboot +``` + +--- + +## Troubleshooting + +### CPU Not Downclocking + +```bash +# Check current governor +cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor + +# Should be: powersave (PVE) or schedutil (PVE2) +# If not, systemd service may have failed + +# Check service status +systemctl status cpu-powersave + +# Manually set governor (temporary) +echo powersave | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor + +# Check frequency +grep MHz /proc/cpuinfo | head -5 +``` + +### High Idle Power After Update + +**Common causes**: +1. **KSM re-enabled** after Proxmox update + - Check: `cat /sys/kernel/mm/ksm/run` + - Fix: `echo 0 > /sys/kernel/mm/ksm/run && systemctl mask ksmtuned` + +2. **CPU governor reset** to default + - Check: `cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor` + - Fix: `systemctl restart cpu-powersave` + +3. **GPU stuck in high-performance mode** + - Check: `nvidia-smi --query-gpu=pstate --format=csv` + - Fix: Restart VM or power cycle GPU + +### HDDs Won't Spin Down + +```bash +# Check spindown setting +hdparm -I /dev/sda | grep -i standby + +# Set spindown manually (temporary) +hdparm -S 241 /dev/sda + +# Check if drive is idle (ZFS may keep it active) +zpool iostat -v 1 5 # Watch for activity + +# Check what's accessing the drive +lsof | grep /mnt/pool +``` + +--- + +## Power Optimization Summary + +| Optimization | Savings | Applied | Notes | +|--------------|---------|---------|-------| +| **KSMD disabled** | 60-80W | ✅ | Also reduces CPU temp significantly | +| **CPU governor** | 60-120W | ✅ | PVE: powersave+balance_power, PVE2: schedutil | +| **GPU power states** | 0W | ✅ | Already optimal (automatic) | +| **Syncthing rescans** | 60-80W | ✅ | Reduced TrueNAS CPU usage | +| **ksmtuned disabled** | 2-5W | ✅ | Minor but easy win | +| **HDD spindown** | 10-16W | ✅ | Only when drives idle | +| PCIe ASPM | 5-15W | ❌ | Not yet tested | +| NMI watchdog | 1-3W | ❌ | Not yet tested | +| **Total savings** | **~150-300W** | - | Significant reduction | + +--- + +## Related Documentation + +- [UPS.md](UPS.md) - UPS capacity and power monitoring +- [STORAGE.md](STORAGE.md) - HDD spindown configuration +- [VMS.md](VMS.md) - VM resource allocation + +--- + +**Last Updated**: 2025-12-22 diff --git a/README.md b/README.md new file mode 100644 index 0000000..c10c8fc --- /dev/null +++ b/README.md @@ -0,0 +1,148 @@ +# Homelab Documentation + +Documentation for Hutson's home infrastructure - two Proxmox servers running VMs and containers for home automation, media, development, and AI workloads. + +## 🚀 Quick Start + +**New to this homelab?** Start here: +1. [CLAUDE.md](CLAUDE.md) - Quick reference guide for common tasks +2. [SSH-ACCESS.md](SSH-ACCESS.md) - How to connect to all systems +3. [IP-ASSIGNMENTS.md](IP-ASSIGNMENTS.md) - What's at what IP address +4. [SERVICES.md](SERVICES.md) - What services are running + +**Claude Code Session?** Read [CLAUDE.md](CLAUDE.md) first - it's your command center. 
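
To work with these docs locally, clone the repo to the path referenced throughout this documentation (see Getting Help below); adjust the destination if your checkout lives elsewhere:

```bash
# Clone the documentation repo to the expected local path
git clone https://git.htsn.io/hutson/homelab-docs ~/Projects/homelab
cd ~/Projects/homelab
```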
+ +## 📚 Documentation Index + +### Infrastructure + +| Document | Description | +|----------|-------------| +| [VMS.md](VMS.md) | Complete VM/LXC inventory, specs, GPU passthrough | +| [HARDWARE.md](HARDWARE.md) | Server specs, GPUs, network cards, HBAs | +| [STORAGE.md](STORAGE.md) | ZFS pools, NFS/SMB shares, capacity planning | +| [NETWORK.md](NETWORK.md) | Bridges, VLANs, MTU config, Tailscale VPN | +| [POWER-MANAGEMENT.md](POWER-MANAGEMENT.md) | CPU governors, GPU power states, optimizations | +| [UPS.md](UPS.md) | UPS configuration, NUT monitoring, power failure handling | + +### Services & Applications + +| Document | Description | +|----------|-------------| +| [SERVICES.md](SERVICES.md) | Complete service inventory with URLs and credentials | +| [TRAEFIK.md](TRAEFIK.md) | Reverse proxy setup, adding services, SSL certificates | +| [HOMEASSISTANT.md](HOMEASSISTANT.md) | Home Assistant API, automations, integrations | +| [SYNCTHING.md](SYNCTHING.md) | File sync across all devices, API access, troubleshooting | +| [SALTBOX.md](#) | Media automation stack (Plex, *arr apps) (coming soon) | + +### Access & Security + +| Document | Description | +|----------|-------------| +| [SSH-ACCESS.md](SSH-ACCESS.md) | SSH keys, host aliases, password auth, QEMU agent | +| [IP-ASSIGNMENTS.md](IP-ASSIGNMENTS.md) | Complete IP address assignments for all devices | +| [SECURITY.md](#) | Firewall, access control, certificates (coming soon) | + +### Operations + +| Document | Description | +|----------|-------------| +| [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md) | 🚨 Backup strategy, disaster recovery (CRITICAL) | +| [MAINTENANCE.md](MAINTENANCE.md) | Regular procedures, update schedules, testing checklists | +| [MONITORING.md](MONITORING.md) | Health monitoring, alerts, dashboard recommendations | +| [DISASTER-RECOVERY.md](#) | Recovery procedures (coming soon) | + +### Reference + +| Document | Description | +|----------|-------------| +| [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md) | Storage enclosure SES commands, LCC troubleshooting | +| [SHELL-ALIASES.md](SHELL-ALIASES.md) | ZSH aliases for Claude Code sessions | + +## 🖥️ System Overview + +### Servers + +- **PVE** (10.10.10.120) - Primary Proxmox server + - AMD Threadripper PRO 3975WX (32-core) + - 128 GB RAM + - NVIDIA Quadro P2000 + TITAN RTX + +- **PVE2** (10.10.10.102) - Secondary Proxmox server + - AMD Threadripper PRO 3975WX (32-core) + - 128 GB RAM + - NVIDIA RTX A6000 + +### Key Services + +| Service | Location | URL | +|---------|----------|-----| +| **Proxmox** | PVE | https://pve.htsn.io | +| **TrueNAS** | VM 100 | https://truenas.htsn.io | +| **Plex** | Saltbox VM | https://plex.htsn.io | +| **Home Assistant** | VM 110 | https://homeassistant.htsn.io | +| **Gitea** | VM 300 | https://git.htsn.io | +| **Pi-hole** | CT 200 | http://10.10.10.10/admin | +| **Traefik** | CT 202 | http://10.10.10.250:8080 | + +[See IP-ASSIGNMENTS.md for complete list](IP-ASSIGNMENTS.md) + +## 🔥 Emergency Procedures + +### Power Failure +1. UPS provides ~15 min runtime at typical load +2. At 2 min remaining, NUT triggers graceful VM shutdown +3. When power returns, servers auto-boot and start VMs in order + +See [UPS.md](UPS.md) for details. 
+ +### Service Down + +```bash +# Quick health check (run from Mac Mini) +ssh pve 'qm list' # Check VMs on PVE +ssh pve2 'qm list' # Check VMs on PVE2 +ssh pve 'pct list' # Check containers + +# Syncthing status +curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \ + "http://127.0.0.1:8384/rest/system/connections" + +# Restart a VM +ssh pve 'qm stop VMID && qm start VMID' +``` + +See [CLAUDE.md](CLAUDE.md) for complete troubleshooting runbooks. + +## 📞 Getting Help + +**Claude Code Assistant**: Start a session in this directory - all context is available in CLAUDE.md + +**Key Contacts**: +- Homelab Owner: Hutson +- Git Repo: https://git.htsn.io/hutson/homelab-docs +- Local Path: `~/Projects/homelab` + +## 🔄 Recent Changes + +See [CHANGELOG.md](#) (coming soon) or the Changelog section in [CLAUDE.md](CLAUDE.md). + +## 📝 Contributing + +When updating docs: +1. Keep CLAUDE.md as quick reference only +2. Move detailed content to specialized docs +3. Update cross-references +4. Test all commands before committing +5. Add entries to changelog + +```bash +cd ~/Projects/homelab +git add -A +git commit -m "Update documentation: " +git push +``` + +--- + +**Last Updated**: 2025-12-22 diff --git a/SERVICES.md b/SERVICES.md new file mode 100644 index 0000000..b8dd970 --- /dev/null +++ b/SERVICES.md @@ -0,0 +1,591 @@ +# Services Inventory + +Complete inventory of all services running across the homelab infrastructure. + +## Overview + +| Category | Services | Location | Access | +|----------|----------|----------|--------| +| **Infrastructure** | Proxmox, TrueNAS, Pi-hole, Traefik | VMs/CTs | Web UI + SSH | +| **Media** | Plex, *arr apps, downloaders | Saltbox VM | Web UI | +| **Development** | Gitea, Docker services | VMs | Web UI | +| **Home Automation** | Home Assistant, Happy Coder | VMs | Web UI + API | +| **Monitoring** | UPS (NUT), Syncthing, Pulse | Various | API | + +**Total Services**: 25+ running services + +--- + +## Service URLs Quick Reference + +| Service | URL | Authentication | Purpose | +|---------|-----|----------------|---------| +| **Proxmox** | https://pve.htsn.io:8006 | Username + 2FA | VM management | +| **TrueNAS** | https://truenas.htsn.io | Username/password | NAS management | +| **Plex** | https://plex.htsn.io | Plex account | Media streaming | +| **Home Assistant** | https://homeassistant.htsn.io | Username/password | Home automation | +| **Gitea** | https://git.htsn.io | Username/password | Git repositories | +| **Excalidraw** | https://excalidraw.htsn.io | None (public) | Whiteboard | +| **Happy Coder** | https://happy.htsn.io | QR code auth | Remote Claude sessions | +| **Pi-hole** | http://10.10.10.10/admin | Password | DNS/ad blocking | +| **Traefik** | http://10.10.10.250:8080 | None (internal) | Reverse proxy dashboard | +| **Pulse** | https://pulse.htsn.io | Unknown | Monitoring dashboard | +| **Copyparty** | https://copyparty.htsn.io | Unknown | File sharing | +| **FindShyt** | https://findshyt.htsn.io | Unknown | Custom app | + +--- + +## Infrastructure Services + +### Proxmox VE (PVE & PVE2) + +**Purpose**: Virtualization platform, VM/CT host +**Location**: Physical servers (10.10.10.120, 10.10.10.102) +**Access**: https://pve.htsn.io:8006, SSH +**Version**: Unknown (check: `pveversion`) + +**Key Features**: +- Web-based management +- VM and LXC container support +- ZFS storage pools +- Clustering (2-node) +- API access + +**Common Operations**: +```bash +# List VMs +ssh pve 'qm list' + +# Create VM +ssh pve 'qm create VMID --name myvm ...' 
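
# Fuller create example (illustrative values - adjust VMID, memory, cores, and bridge)
ssh pve 'qm create VMID --name myvm --memory 2048 --cores 2 --net0 virtio,bridge=vmbr0 --scsihw virtio-scsi-pci'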
+ +# Backup VM +ssh pve 'vzdump VMID --dumpdir /var/lib/vz/dump' +``` + +**See**: [VMS.md](VMS.md) + +--- + +### TrueNAS SCALE (VM 100) + +**Purpose**: Central file storage, NFS/SMB shares +**Location**: VM on PVE (10.10.10.200) +**Access**: https://truenas.htsn.io, SSH +**Version**: TrueNAS SCALE (check version in UI) + +**Key Features**: +- ZFS storage management +- NFS exports +- SMB shares +- Syncthing hub +- Snapshot management + +**Storage Pools**: +- `vault`: Main data pool on EMC enclosure + +**Shares** (needs documentation): +- NFS exports for Saltbox media +- SMB shares for Windows access +- Syncthing sync folders + +**See**: [STORAGE.md](STORAGE.md) + +--- + +### Pi-hole (CT 200) + +**Purpose**: Network-wide DNS server and ad blocker +**Location**: LXC on PVE (10.10.10.10) +**Access**: http://10.10.10.10/admin +**Version**: Unknown + +**Configuration**: +- **Upstream DNS**: Cloudflare (1.1.1.1) +- **Blocklists**: Unknown count +- **Queries**: All network DNS traffic +- **DHCP**: Disabled (router handles DHCP) + +**Stats** (example): +```bash +ssh pihole 'pihole -c -e' # Stats +ssh pihole 'pihole status' # Status +``` + +**Common Tasks**: +- Update blocklists: `ssh pihole 'pihole -g'` +- Whitelist domain: `ssh pihole 'pihole -w example.com'` +- View logs: `ssh pihole 'pihole -t'` + +--- + +### Traefik (CT 202) + +**Purpose**: Reverse proxy for all public-facing services +**Location**: LXC on PVE (10.10.10.250) +**Access**: http://10.10.10.250:8080/dashboard/ +**Version**: Unknown (check: `traefik version`) + +**Managed Services**: +- All *.htsn.io domains (except Saltbox services) +- SSL/TLS certificates via Let's Encrypt +- HTTP → HTTPS redirects + +**See**: [TRAEFIK.md](TRAEFIK.md) for complete configuration + +--- + +## Media Services (Saltbox VM) + +All media services run in Docker on the Saltbox VM (10.10.10.100). + +### Plex Media Server + +**Purpose**: Media streaming platform +**URL**: https://plex.htsn.io +**Access**: Plex account + +**Features**: +- Hardware transcoding (TITAN RTX) +- Libraries: Movies, TV, Music +- Remote access enabled +- Managed by Saltbox + +**Media Storage**: +- Source: TrueNAS NFS mounts +- Location: `/mnt/unionfs/` + +**Common Tasks**: +```bash +# View Plex status +ssh saltbox 'docker logs -f plex' + +# Restart Plex +ssh saltbox 'docker restart plex' + +# Scan library +# (via Plex UI: Settings → Library → Scan) +``` + +--- + +### *arr Apps (Media Automation) + +Running on Saltbox VM, managed via Traefik-Saltbox. 
+ +| Service | Purpose | URL | Notes | +|---------|---------|-----|-------| +| **Sonarr** | TV show automation | sonarr.htsn.io | Monitors, downloads, organizes TV | +| **Radarr** | Movie automation | radarr.htsn.io | Monitors, downloads, organizes movies | +| **Lidarr** | Music automation | lidarr.htsn.io | Monitors, downloads, organizes music | +| **Overseerr** | Request management | overseerr.htsn.io | User requests for media | +| **Bazarr** | Subtitle management | bazarr.htsn.io | Downloads subtitles | + +**Downloaders**: +| Service | Purpose | URL | +|---------|---------|-----| +| **SABnzbd** | Usenet downloader | sabnzbd.htsn.io | +| **NZBGet** | Usenet downloader | nzbget.htsn.io | +| **qBittorrent** | Torrent client | qbittorrent.htsn.io | + +**Indexers**: +| Service | Purpose | URL | +|---------|---------|-----| +| **Jackett** | Torrent indexer proxy | jackett.htsn.io | +| **NZBHydra2** | Usenet indexer proxy | nzbhydra2.htsn.io | + +--- + +### Supporting Media Services + +| Service | Purpose | URL | +|---------|---------|-----| +| **Tautulli** | Plex statistics | tautulli.htsn.io | +| **Organizr** | Service dashboard | organizr.htsn.io | +| **Authelia** | SSO authentication | auth.htsn.io | + +--- + +## Development Services + +### Gitea (VM 300) + +**Purpose**: Self-hosted Git server +**Location**: VM on PVE2 (10.10.10.220) +**URL**: https://git.htsn.io +**Access**: Username/password + +**Repositories**: +- homelab-docs (this documentation) +- Personal projects +- Private repos + +**Common Tasks**: +```bash +# SSH to Gitea VM +ssh gitea-vm + +# View logs +ssh gitea-vm 'journalctl -u gitea -f' + +# Backup +ssh gitea-vm 'gitea dump -c /etc/gitea/app.ini' +``` + +**See**: Gitea documentation for API usage + +--- + +### Docker Services (docker-host VM) + +Running on VM 206 (10.10.10.206). 
+ +| Service | URL | Purpose | Port | +|---------|-----|---------|------| +| **Excalidraw** | https://excalidraw.htsn.io | Whiteboard/diagramming | 8080 | +| **Happy Server** | https://happy.htsn.io | Happy Coder relay | 3002 | +| **Pulse** | https://pulse.htsn.io | Monitoring dashboard | 7655 | + +**Docker Compose files**: `/opt/{excalidraw,happy-server,pulse}/docker-compose.yml` + +**Managing services**: +```bash +ssh docker-host 'docker ps' +ssh docker-host 'cd /opt/excalidraw && sudo docker-compose logs -f' +ssh docker-host 'cd /opt/excalidraw && sudo docker-compose restart' +``` + +--- + +## Home Automation + +### Home Assistant (VM 110) + +**Purpose**: Smart home automation platform +**Location**: VM on PVE (10.10.10.110) +**URL**: https://homeassistant.htsn.io +**Access**: Username/password + +**Integrations**: +- UPS monitoring (NUT sensors) +- Unknown other integrations (needs documentation) + +**Sensors**: +- `sensor.cyberpower_battery_charge` +- `sensor.cyberpower_load` +- `sensor.cyberpower_battery_runtime` +- `sensor.cyberpower_status` + +**See**: [HOMEASSISTANT.md](HOMEASSISTANT.md) + +--- + +### Happy Coder Relay (docker-host) + +**Purpose**: Self-hosted relay server for Happy Coder mobile app +**Location**: docker-host (10.10.10.206) +**URL**: https://happy.htsn.io +**Access**: QR code authentication + +**Stack**: +- Happy Server (Node.js) +- PostgreSQL (user/session data) +- Redis (real-time events) +- MinIO (file/image storage) + +**Clients**: +- Mac Mini (Happy daemon) +- Mobile app (iOS/Android) + +**Credentials**: +- Master Secret: `3ccbfd03a028d3c278da7d2cf36d99b94cd4b1fecabc49ab006e8e89bc7707ac` +- PostgreSQL: `happy` / `happypass` +- MinIO: `happyadmin` / `happyadmin123` + +--- + +## File Sync & Storage + +### Syncthing + +**Purpose**: File synchronization across all devices +**Devices**: +- Mac Mini (10.10.10.125) - Hub +- MacBook - Mobile sync +- TrueNAS (10.10.10.200) - Central storage +- Windows PC (10.10.10.150) - Windows sync +- Phone (10.10.10.54) - Mobile sync + +**API Keys**: +- Mac Mini: `oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5` +- MacBook: `qYkNdVLwy9qZZZ6MqnJr7tHX7KKdxGMJ` +- Phone: `Xxz3jDT4akUJe6psfwZsbZwG2LhfZuDM` + +**Synced Folders**: +- documents (~11 GB) +- downloads (~38 GB) +- pictures +- notes +- desktop (~7.2 GB) +- config +- movies + +**See**: [SYNCTHING.md](SYNCTHING.md) + +--- + +### Copyparty (VM 201) + +**Purpose**: Simple HTTP file sharing +**Location**: VM on PVE (10.10.10.201) +**URL**: https://copyparty.htsn.io +**Access**: Unknown + +**Features**: +- Web-based file upload/download +- Lightweight + +--- + +## Trading & AI Services + +### AI Trading Platform (trading-vm) + +**Purpose**: Algorithmic trading with AI models +**Location**: VM 301 on PVE2 (10.10.10.221) +**URL**: https://aitrade.htsn.io (if accessible) +**GPU**: RTX A6000 (48GB VRAM) + +**Components**: +- Trading algorithms +- AI models for market prediction +- Real-time data feeds +- Backtesting infrastructure + +**Access**: SSH only (no web UI documented) + +--- + +### LM Dev (lmdev1) + +**Purpose**: AI/LLM development environment +**Location**: VM 111 on PVE (10.10.10.111) +**URL**: https://lmdev.htsn.io (if accessible) +**GPU**: TITAN RTX (shared with Saltbox) + +**Installed**: +- CUDA toolkit +- Python 3.11+ +- PyTorch, TensorFlow +- Hugging Face transformers + +--- + +## Monitoring & Utilities + +### UPS Monitoring (NUT) + +**Purpose**: Monitor UPS status and trigger shutdowns +**Location**: PVE (master), PVE2 (slave) +**Access**: Command-line (`upsc`) + +**Key Commands**: 
+```bash +ssh pve 'upsc cyberpower@localhost' +ssh pve 'upsc cyberpower@localhost ups.load' +ssh pve 'upsc cyberpower@localhost battery.runtime' +``` + +**Home Assistant Integration**: UPS sensors exposed + +**See**: [UPS.md](UPS.md) + +--- + +### Pulse Monitoring + +**Purpose**: Unknown monitoring dashboard +**Location**: docker-host (10.10.10.206:7655) +**URL**: https://pulse.htsn.io +**Access**: Unknown + +**Needs documentation**: +- What does it monitor? +- How to configure? +- Authentication? + +--- + +### Tailscale VPN + +**Purpose**: Secure remote access to homelab +**Subnet Routers**: +- PVE (100.113.177.80) - Primary +- UCG-Fiber (100.94.246.32) - Failover + +**Devices on Tailscale**: +- Mac Mini: 100.108.89.58 +- PVE: 100.113.177.80 +- TrueNAS: 100.100.94.71 +- Pi-hole: 100.112.59.128 + +**See**: [NETWORK.md](NETWORK.md) + +--- + +## Custom Applications + +### FindShyt (CT 205) + +**Purpose**: Unknown custom application +**Location**: LXC on PVE (10.10.10.8) +**URL**: https://findshyt.htsn.io +**Access**: Unknown + +**Needs documentation**: +- What is this app? +- How to use it? +- Tech stack? + +--- + +## Service Dependencies + +### Critical Dependencies + +``` +TrueNAS +├── Plex (media files via NFS) +├── *arr apps (downloads via NFS) +├── Syncthing (central storage hub) +└── Backups (if configured) + +Traefik (CT 202) +├── All *.htsn.io services +└── SSL certificate management + +Pi-hole +└── DNS for entire network + +Router +└── Gateway for all services +``` + +### Startup Order + +**See [VMS.md](VMS.md)** for VM boot order configuration: +1. TrueNAS (storage first) +2. Saltbox (depends on TrueNAS NFS) +3. Other VMs +4. Containers + +--- + +## Service Port Reference + +### Well-Known Ports + +| Port | Service | Protocol | Purpose | +|------|---------|----------|---------| +| 22 | SSH | TCP | Remote access | +| 53 | Pi-hole | UDP | DNS queries | +| 80 | Traefik | TCP | HTTP (redirects to 443) | +| 443 | Traefik | TCP | HTTPS | +| 3000 | Gitea | TCP | Git HTTP/S | +| 8006 | Proxmox | TCP | Web UI | +| 8096 | Plex | TCP | Plex Media Server | +| 8384 | Syncthing | TCP | Web UI | +| 22000 | Syncthing | TCP | Sync protocol | + +### Internal Ports + +| Port | Service | Purpose | +|------|---------|---------| +| 3002 | Happy Server | Relay backend | +| 5432 | PostgreSQL | Happy Server DB | +| 6379 | Redis | Happy Server cache | +| 7655 | Pulse | Monitoring | +| 8080 | Excalidraw | Whiteboard | +| 8080 | Traefik | Dashboard | +| 9000 | MinIO | Object storage | + +--- + +## Service Health Checks + +### Quick Health Check Script + +```bash +#!/bin/bash +# Check all critical services + +echo "=== Infrastructure ===" +curl -Is https://pve.htsn.io:8006 | head -1 +curl -Is https://truenas.htsn.io | head -1 +curl -I http://10.10.10.10/admin 2>/dev/null | head -1 +echo "" + +echo "=== Media Services ===" +curl -Is https://plex.htsn.io | head -1 +curl -Is https://sonarr.htsn.io | head -1 +curl -Is https://radarr.htsn.io | head -1 +echo "" + +echo "=== Development ===" +curl -Is https://git.htsn.io | head -1 +curl -Is https://excalidraw.htsn.io | head -1 +echo "" + +echo "=== Home Automation ===" +curl -Is https://homeassistant.htsn.io | head -1 +curl -Is https://happy.htsn.io/health | head -1 +``` + +### Service-Specific Checks + +```bash +# Proxmox VMs +ssh pve 'qm list | grep running' + +# Docker services +ssh docker-host 'docker ps --format "{{.Names}}: {{.Status}}"' + +# Syncthing +curl -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \ + "http://127.0.0.1:8384/rest/system/status" + +# UPS 
+ssh pve 'upsc cyberpower@localhost ups.status' +``` + +--- + +## Service Credentials + +**Location**: See individual service documentation + +| Service | Credentials Location | Notes | +|---------|---------------------|-------| +| Proxmox | Proxmox UI | Username + 2FA | +| TrueNAS | TrueNAS UI | Root password | +| Plex | Plex account | Managed externally | +| Gitea | Gitea DB | Self-managed | +| Pi-hole | `/etc/pihole/setupVars.conf` | Admin password | +| Happy Server | [CLAUDE.md](CLAUDE.md) | Master secret, DB passwords | + +**⚠️ Security Note**: Never commit credentials to Git. Use proper secrets management. + +--- + +## Related Documentation + +- [VMS.md](VMS.md) - VM/service locations +- [TRAEFIK.md](TRAEFIK.md) - Reverse proxy config +- [IP-ASSIGNMENTS.md](IP-ASSIGNMENTS.md) - Service IP addresses +- [NETWORK.md](NETWORK.md) - Network configuration +- [MONITORING.md](MONITORING.md) - Monitoring setup (coming soon) + +--- + +**Last Updated**: 2025-12-22 +**Status**: ⚠️ Incomplete - many services need documentation (passwords, features, usage) diff --git a/SSH-ACCESS.md b/SSH-ACCESS.md new file mode 100644 index 0000000..36507f1 --- /dev/null +++ b/SSH-ACCESS.md @@ -0,0 +1,464 @@ +# SSH Access + +Documentation for SSH access to all homelab systems, including key authentication, password authentication for special cases, and QEMU guest agent usage. + +## Overview + +Most systems use **SSH key authentication** with the `~/.ssh/homelab` key. A few special cases require **password authentication** (router, Windows PC) due to platform limitations. + +**SSH Password**: `GrilledCh33s3#` (for systems without key auth) + +--- + +## SSH Key Authentication (Primary Method) + +### SSH Key Configuration + +SSH keys are configured in `~/.ssh/config` on both Mac Mini and MacBook. 
+ +**Key file**: `~/.ssh/homelab` (Ed25519 key) + +**Key deployed to**: All Proxmox hosts, VMs, and LXCs (13 total hosts) + +### Host Aliases + +Use these convenient aliases instead of IP addresses: + +| Host Alias | IP | User | Type | Notes | +|------------|-----|------|------|-------| +| `pve` | 10.10.10.120 | root | Proxmox | Primary server | +| `pve2` | 10.10.10.102 | root | Proxmox | Secondary server | +| `truenas` | 10.10.10.200 | root | VM | NAS/storage | +| `saltbox` | 10.10.10.100 | hutson | VM | Media automation | +| `lmdev1` | 10.10.10.111 | hutson | VM | AI/LLM development | +| `docker-host` | 10.10.10.206 | hutson | VM | Docker services | +| `fs-dev` | 10.10.10.5 | hutson | VM | Development | +| `copyparty` | 10.10.10.201 | hutson | VM | File sharing | +| `gitea-vm` | 10.10.10.220 | hutson | VM | Git server | +| `trading-vm` | 10.10.10.221 | hutson | VM | AI trading platform | +| `pihole` | 10.10.10.10 | root | LXC | DNS/Ad blocking | +| `traefik` | 10.10.10.250 | root | LXC | Reverse proxy | +| `findshyt` | 10.10.10.8 | root | LXC | Custom app | + +### Usage Examples + +```bash +# List VMs on PVE +ssh pve 'qm list' + +# Check ZFS pool on TrueNAS +ssh truenas 'zpool status vault' + +# List Docker containers on Saltbox +ssh saltbox 'docker ps' + +# Check Pi-hole status +ssh pihole 'pihole status' + +# View Traefik config +ssh pve 'pct exec 202 -- cat /etc/traefik/traefik.yaml' +``` + +### SSH Config File + +**Location**: `~/.ssh/config` + +**Example entries**: + +```sshconfig +# Proxmox Servers +Host pve + HostName 10.10.10.120 + User root + IdentityFile ~/.ssh/homelab + +Host pve2 + HostName 10.10.10.102 + User root + IdentityFile ~/.ssh/homelab + # Post-quantum KEX causes MTU issues - use classic + KexAlgorithms curve25519-sha256 + +# VMs +Host truenas + HostName 10.10.10.200 + User root + IdentityFile ~/.ssh/homelab + +Host saltbox + HostName 10.10.10.100 + User hutson + IdentityFile ~/.ssh/homelab + +Host lmdev1 + HostName 10.10.10.111 + User hutson + IdentityFile ~/.ssh/homelab + +Host docker-host + HostName 10.10.10.206 + User hutson + IdentityFile ~/.ssh/homelab + +Host fs-dev + HostName 10.10.10.5 + User hutson + IdentityFile ~/.ssh/homelab + +Host copyparty + HostName 10.10.10.201 + User hutson + IdentityFile ~/.ssh/homelab + +Host gitea-vm + HostName 10.10.10.220 + User hutson + IdentityFile ~/.ssh/homelab + +Host trading-vm + HostName 10.10.10.221 + User hutson + IdentityFile ~/.ssh/homelab + +# LXC Containers +Host pihole + HostName 10.10.10.10 + User root + IdentityFile ~/.ssh/homelab + +Host traefik + HostName 10.10.10.250 + User root + IdentityFile ~/.ssh/homelab + +Host findshyt + HostName 10.10.10.8 + User root + IdentityFile ~/.ssh/homelab +``` + +--- + +## Password Authentication (Special Cases) + +Some systems don't support SSH key auth or have other limitations. 
+ +### UniFi Router (10.10.10.1) + +**Issue**: Uses `keyboard-interactive` auth method, incompatible with `sshpass` +**Solution**: Use `expect` to automate password entry + +**Commands**: + +```bash +# Run command on router +expect -c 'spawn ssh root@10.10.10.1 "hostname"; expect "Password:"; send "GrilledCh33s3#\r"; expect eof' + +# Get ARP table (all device IPs) +expect -c 'spawn ssh root@10.10.10.1 "cat /proc/net/arp"; expect "Password:"; send "GrilledCh33s3#\r"; expect eof' + +# Check Tailscale status +expect -c 'spawn ssh root@10.10.10.1 "tailscale status"; expect "Password:"; send "GrilledCh33s3#\r"; expect eof' +``` + +**Why not key auth?**: UniFi router firmware doesn't persist SSH keys across reboots. + +### Windows PC (10.10.10.150) + +**OS**: Windows with OpenSSH server +**User**: `claude` +**Password**: `GrilledCh33s3#` +**Shell**: PowerShell (not bash) + +**Commands**: + +```bash +# Run PowerShell command +sshpass -p 'GrilledCh33s3#' ssh claude@10.10.10.150 'Get-Process | Select -First 5' + +# Check Syncthing status +sshpass -p 'GrilledCh33s3#' ssh claude@10.10.10.150 'Get-Process -Name syncthing -ErrorAction SilentlyContinue' + +# Restart Syncthing +sshpass -p 'GrilledCh33s3#' ssh claude@10.10.10.150 'Stop-Process -Name syncthing -Force; Start-ScheduledTask -TaskName "Syncthing"' +``` + +**⚠️ Important**: Use `;` (semicolon) to chain PowerShell commands, NOT `&&` (bash syntax). + +**Why not key auth?**: Could be configured, but password auth works and is simpler for Windows. + +--- + +## QEMU Guest Agent + +Most VMs have the QEMU guest agent installed, allowing command execution without SSH. + +### VMs with QEMU Agent + +| VMID | VM Name | Use Case | +|------|---------|----------| +| 100 | truenas | Execute commands, check ZFS | +| 101 | saltbox | Execute commands, Docker mgmt | +| 105 | fs-dev | Execute commands | +| 111 | lmdev1 | Execute commands | +| 201 | copyparty | Execute commands | +| 206 | docker-host | Execute commands | +| 300 | gitea-vm | Execute commands | +| 301 | trading-vm | Execute commands | + +### VM WITHOUT QEMU Agent + +**VMID 110 (homeassistant)**: No QEMU agent installed +- Access via web UI only +- Or install SSH server manually if needed + +### Usage Examples + +**Basic syntax**: +```bash +ssh pve 'qm guest exec VMID -- bash -c "COMMAND"' +``` + +**Examples**: + +```bash +# Check ZFS pool on TrueNAS (without SSH) +ssh pve 'qm guest exec 100 -- bash -c "zpool status vault"' + +# Get VM IP addresses +ssh pve 'qm guest exec 100 -- bash -c "ip addr"' + +# Check Docker containers on Saltbox +ssh pve 'qm guest exec 101 -- bash -c "docker ps"' + +# Run multi-line command +ssh pve 'qm guest exec 100 -- bash -c "df -h; free -h; uptime"' +``` + +**When to use QEMU agent vs SSH**: +- ✅ Use **SSH** for interactive sessions, file editing, complex tasks +- ✅ Use **QEMU agent** for one-off commands, when SSH is down, or VM has no network +- ⚠️ QEMU agent is slower for multiple commands (use SSH instead) + +--- + +## Troubleshooting SSH Issues + +### Connection Refused + +```bash +# Check if SSH service is running +ssh pve 'systemctl status sshd' + +# Check if port 22 is open +nc -zv 10.10.10.XXX 22 + +# Check firewall +ssh pve 'iptables -L -n | grep 22' +``` + +### Permission Denied (Public Key) + +```bash +# Verify key file exists +ls -la ~/.ssh/homelab + +# Check key permissions (should be 600) +chmod 600 ~/.ssh/homelab + +# Test SSH key auth verbosely +ssh -vvv -i ~/.ssh/homelab root@10.10.10.120 + +# Check authorized_keys on remote (via QEMU agent if SSH 
broken) +ssh pve 'qm guest exec VMID -- bash -c "cat ~/.ssh/authorized_keys"' +``` + +### Slow SSH Connection (PVE2 Issue) + +**Problem**: SSH to PVE2 hangs for 30+ seconds before connecting +**Cause**: MTU mismatch (vmbr0=9000, nic1=1500) causing post-quantum KEX packet fragmentation +**Fix**: Use classic KEX algorithm instead + +**In `~/.ssh/config`**: +```sshconfig +Host pve2 + HostName 10.10.10.102 + User root + IdentityFile ~/.ssh/homelab + KexAlgorithms curve25519-sha256 # Avoid mlkem768x25519-sha256 +``` + +**Permanent fix**: Set `nic1` MTU to 9000 in `/etc/network/interfaces` on PVE2 + +--- + +## Adding SSH Keys to New Systems + +### Linux (VMs/LXCs) + +```bash +# Copy public key to new host +ssh-copy-id -i ~/.ssh/homelab user@hostname + +# Or manually: +ssh user@hostname 'mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys' < ~/.ssh/homelab.pub +ssh user@hostname 'chmod 700 ~/.ssh && chmod 600 ~/.ssh/authorized_keys' +``` + +### LXC Containers (Root User) + +```bash +# Via pct exec from Proxmox host +ssh pve 'pct exec CTID -- bash -c "mkdir -p /root/.ssh"' +ssh pve 'pct exec CTID -- bash -c "echo \"$(cat ~/.ssh/homelab.pub)\" >> /root/.ssh/authorized_keys"' +ssh pve 'pct exec CTID -- bash -c "chmod 700 /root/.ssh && chmod 600 /root/.ssh/authorized_keys"' + +# Also enable PermitRootLogin in sshd_config +ssh pve 'pct exec CTID -- bash -c "sed -i \"s/^#*PermitRootLogin.*/PermitRootLogin prohibit-password/\" /etc/ssh/sshd_config"' +ssh pve 'pct exec CTID -- bash -c "systemctl restart sshd"' +``` + +### VMs (via QEMU Agent) + +```bash +# Add key via QEMU agent (if SSH not working) +ssh pve 'qm guest exec VMID -- bash -c "mkdir -p ~/.ssh"' +ssh pve 'qm guest exec VMID -- bash -c "echo \"$(cat ~/.ssh/homelab.pub)\" >> ~/.ssh/authorized_keys"' +ssh pve 'qm guest exec VMID -- bash -c "chmod 700 ~/.ssh && chmod 600 ~/.ssh/authorized_keys"' +``` + +--- + +## SSH Key Management + +### Rotate SSH Keys (Future) + +When rotating SSH keys: + +1. Generate new key pair: + ```bash + ssh-keygen -t ed25519 -f ~/.ssh/homelab-new -C "homelab-new" + ``` + +2. Deploy new key to all hosts (keep old key for now): + ```bash + for host in pve pve2 truenas saltbox lmdev1 docker-host fs-dev copyparty gitea-vm trading-vm pihole traefik findshyt; do + ssh-copy-id -i ~/.ssh/homelab-new $host + done + ``` + +3. Update `~/.ssh/config` to use new key: + ```sshconfig + IdentityFile ~/.ssh/homelab-new + ``` + +4. Test all connections: + ```bash + for host in pve pve2 truenas saltbox lmdev1 docker-host fs-dev copyparty gitea-vm trading-vm pihole traefik findshyt; do + echo "Testing $host..." + ssh $host 'hostname' + done + ``` + +5. 
Remove old key from all hosts once confirmed working + +--- + +## Quick Reference + +### Common SSH Operations + +```bash +# Execute command on remote host +ssh host 'command' + +# Execute multiple commands +ssh host 'command1 && command2' + +# Copy file to remote +scp file host:/path/ + +# Copy file from remote +scp host:/path/file ./ + +# Execute command on Proxmox VM (via QEMU agent) +ssh pve 'qm guest exec VMID -- bash -c "command"' + +# Execute command on LXC +ssh pve 'pct exec CTID -- command' + +# Interactive shell +ssh host + +# SSH with X11 forwarding +ssh -X host +``` + +### Troubleshooting Commands + +```bash +# Test SSH with verbose output +ssh -vvv host + +# Check SSH service status (remote) +ssh host 'systemctl status sshd' + +# Check SSH config (local) +ssh -G host + +# Test port connectivity +nc -zv hostname 22 +``` + +--- + +## Security Best Practices + +### Current Security Posture + +✅ **Good**: +- SSH keys used instead of passwords (where possible) +- Keys use Ed25519 (modern, secure algorithm) +- Root login disabled on VMs (use sudo instead) +- SSH keys have proper permissions (600) + +⚠️ **Could Improve**: +- [ ] Disable password authentication on all hosts (force key-only) +- [ ] Use SSH certificate authority instead of individual keys +- [ ] Set up SSH bastion host (jump server) +- [ ] Enable 2FA for SSH (via PAM + Google Authenticator) +- [ ] Implement SSH key rotation policy (annually) + +### Hardening SSH (Future) + +For additional security, consider: + +```sshconfig +# /etc/ssh/sshd_config (on remote hosts) +PermitRootLogin prohibit-password # No root password login +PasswordAuthentication no # Disable password auth entirely +PubkeyAuthentication yes # Only allow key auth +AuthorizedKeysFile .ssh/authorized_keys +MaxAuthTries 3 # Limit auth attempts +MaxSessions 10 # Limit concurrent sessions +ClientAliveInterval 300 # Timeout idle sessions +ClientAliveCountMax 2 # Drop after 2 keepalives +``` + +**Apply after editing**: +```bash +systemctl restart sshd +``` + +--- + +## Related Documentation + +- [VMS.md](VMS.md) - Complete VM/CT inventory +- [NETWORK.md](NETWORK.md) - Network configuration +- [IP-ASSIGNMENTS.md](IP-ASSIGNMENTS.md) - IP addresses for all hosts +- [SECURITY.md](#) - Security policies (coming soon) + +--- + +**Last Updated**: 2025-12-22 diff --git a/STORAGE.md b/STORAGE.md new file mode 100644 index 0000000..3f3b702 --- /dev/null +++ b/STORAGE.md @@ -0,0 +1,510 @@ +# Storage Architecture + +Documentation of all storage pools, datasets, shares, and capacity planning across the homelab. 
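
For a quick at-a-glance check before digging into the details below, the pools on all three systems can be listed in one pass (assumes the SSH aliases from [SSH-ACCESS.md](SSH-ACCESS.md)):

```bash
# Pool health and capacity across PVE, PVE2, and TrueNAS
for host in pve pve2 truenas; do
  echo "=== $host ==="
  ssh $host 'zpool list'
done
```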
+ +## Overview + +### Storage Distribution + +| Location | Type | Capacity | Purpose | +|----------|------|----------|---------| +| **PVE** | NVMe + SSD mirrors | ~9 TB usable | VM storage, fast IO | +| **PVE2** | NVMe + HDD mirrors | ~6+ TB usable | VM storage, bulk data | +| **TrueNAS** | ZFS pool + EMC enclosure | ~12+ TB usable | Central file storage, NFS/SMB | + +--- + +## PVE (10.10.10.120) Storage Pools + +### nvme-mirror1 (Primary Fast Storage) +- **Type**: ZFS mirror +- **Devices**: 2x Sabrent Rocket Q NVMe +- **Capacity**: 3.6 TB usable +- **Purpose**: High-performance VM storage +- **Used By**: + - Critical VMs requiring fast IO + - Database workloads + - Development environments + +**Check status**: +```bash +ssh pve 'zpool status nvme-mirror1' +ssh pve 'zpool list nvme-mirror1' +``` + +### nvme-mirror2 (Secondary Fast Storage) +- **Type**: ZFS mirror +- **Devices**: 2x Kingston SFYRD 2TB NVMe +- **Capacity**: 1.8 TB usable +- **Purpose**: Additional fast VM storage +- **Used By**: TBD + +**Check status**: +```bash +ssh pve 'zpool status nvme-mirror2' +ssh pve 'zpool list nvme-mirror2' +``` + +### rpool (Root Pool) +- **Type**: ZFS mirror +- **Devices**: 2x Samsung 870 QVO 4TB SSD +- **Capacity**: 3.6 TB usable +- **Purpose**: Proxmox OS, container storage, VM backups +- **Used By**: + - Proxmox root filesystem + - LXC containers + - Local VM backups + +**Check status**: +```bash +ssh pve 'zpool status rpool' +ssh pve 'df -h /var/lib/vz' +``` + +### Storage Pool Usage Summary (PVE) + +**Get current usage**: +```bash +ssh pve 'zpool list' +ssh pve 'pvesm status' +``` + +--- + +## PVE2 (10.10.10.102) Storage Pools + +### nvme-mirror3 (Fast Storage) +- **Type**: ZFS mirror +- **Devices**: 2x NVMe (model unknown) +- **Capacity**: Unknown (needs investigation) +- **Purpose**: High-performance VM storage +- **Used By**: Trading VM (301), other VMs + +**Check status**: +```bash +ssh pve2 'zpool status nvme-mirror3' +ssh pve2 'zpool list nvme-mirror3' +``` + +### local-zfs2 (Bulk Storage) +- **Type**: ZFS mirror +- **Devices**: 2x WD Red 6TB HDD +- **Capacity**: ~6 TB usable +- **Purpose**: Bulk/archival storage +- **Power Management**: 30-minute spindown configured + - Saves ~10-16W when idle + - Udev rule: `/etc/udev/rules.d/69-hdd-spindown.rules` + - Command: `hdparm -S 241` (30 min) + +**Notes**: +- Pool had only 768 KB used as of 2024-12-16 +- Drives configured to spin down after 30 min idle +- Good for archival, NOT for active workloads + +**Check status**: +```bash +ssh pve2 'zpool status local-zfs2' +ssh pve2 'zpool list local-zfs2' + +# Check if drives are spun down +ssh pve2 'hdparm -C /dev/sdX' # Shows active/standby +``` + +--- + +## TrueNAS (VM 100 @ 10.10.10.200) - Central Storage + +### ZFS Pool: vault + +**Primary storage pool** for all shared data. 
+ +**Devices**: ❓ Needs investigation +- EMC storage enclosure with multiple drives +- SAS connection via LSI SAS2308 HBA (passed through to VM) + +**Capacity**: ❓ Needs investigation + +**Check pool status**: +```bash +ssh truenas 'zpool status vault' +ssh truenas 'zpool list vault' + +# Get detailed capacity +ssh truenas 'zfs list -o name,used,avail,refer,mountpoint' +``` + +### Datasets (Known) + +Based on Syncthing configuration, likely datasets: + +| Dataset | Purpose | Synced Devices | Notes | +|---------|---------|----------------|-------| +| vault/documents | Personal documents | Mac Mini, MacBook, Windows PC, Phone | ~11 GB | +| vault/downloads | Downloads folder | Mac Mini, TrueNAS | ~38 GB | +| vault/pictures | Photos | Mac Mini, MacBook, Phone | Unknown size | +| vault/notes | Note files | Mac Mini, MacBook, Phone | Unknown size | +| vault/desktop | Desktop sync | Unknown | 7.2 GB | +| vault/movies | Movie library | Unknown | Unknown size | +| vault/config | Config files | Mac Mini, MacBook | Unknown size | + +**Get complete dataset list**: +```bash +ssh truenas 'zfs list -r vault' +``` + +### NFS/SMB Shares + +**Status**: ❓ Not documented + +**Needs investigation**: +```bash +# List NFS exports +ssh truenas 'showmount -e localhost' + +# List SMB shares +ssh truenas 'smbclient -L localhost -N' + +# Via TrueNAS API/UI +# Sharing → Unix Shares (NFS) +# Sharing → Windows Shares (SMB) +``` + +**Expected shares**: +- Media libraries for Plex (on Saltbox VM) +- Document storage +- VM backups? +- ISO storage? + +### EMC Storage Enclosure + +**Model**: EMC KTN-STL4 (or similar) +**Connection**: SAS via LSI SAS2308 HBA (passthrough to TrueNAS VM) +**Drives**: ❓ Unknown count and capacity + +**See [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md)** for: +- SES commands +- Fan control +- LCC (Link Control Card) troubleshooting +- Maintenance procedures + +**Check enclosure status**: +```bash +ssh truenas 'sg_ses --page=0x02 /dev/sgX' # Element descriptor +ssh truenas 'smartctl --scan' # List all drives +``` + +--- + +## Storage Network Architecture + +### Internal Storage Network (10.10.10.20.0/24) + +**Purpose**: Dedicated network for NFS/iSCSI traffic to reduce congestion on main network. 
+ +**Bridge**: vmbr3 on PVE (virtual bridge, no physical NIC) +**Subnet**: 10.10.10.20.0/24 +**DHCP**: No +**Gateway**: No (internal only, no internet) + +**Connected VMs**: +- TrueNAS VM (secondary NIC) +- Saltbox VM (secondary NIC) - for NFS mounts +- Other VMs needing storage access + +**Configuration**: +```bash +# On TrueNAS VM - check second NIC +ssh truenas 'ip addr show enp6s19' + +# On Saltbox - check NFS mounts +ssh saltbox 'mount | grep nfs' +``` + +**Benefits**: +- Separates storage traffic from general network +- Prevents NFS/SMB from saturating main network +- Better performance for storage-heavy workloads + +--- + +## Storage Capacity Planning + +### Current Usage (Estimate) + +**Needs actual audit**: +```bash +# PVE pools +ssh pve 'zpool list -o name,size,alloc,free' + +# PVE2 pools +ssh pve2 'zpool list -o name,size,alloc,free' + +# TrueNAS vault pool +ssh truenas 'zpool list vault' + +# Get detailed breakdown +ssh truenas 'zfs list -r vault -o name,used,avail' +``` + +### Growth Rate + +**Needs tracking** - recommend monthly snapshots of capacity: + +```bash +#!/bin/bash +# Save as ~/bin/storage-capacity-report.sh + +DATE=$(date +%Y-%m-%d) +REPORT=~/Backups/storage-reports/capacity-$DATE.txt + +mkdir -p ~/Backups/storage-reports + +echo "Storage Capacity Report - $DATE" > $REPORT +echo "================================" >> $REPORT +echo "" >> $REPORT + +echo "PVE Pools:" >> $REPORT +ssh pve 'zpool list' >> $REPORT +echo "" >> $REPORT + +echo "PVE2 Pools:" >> $REPORT +ssh pve2 'zpool list' >> $REPORT +echo "" >> $REPORT + +echo "TrueNAS Pools:" >> $REPORT +ssh truenas 'zpool list' >> $REPORT +echo "" >> $REPORT + +echo "TrueNAS Datasets:" >> $REPORT +ssh truenas 'zfs list -r vault -o name,used,avail' >> $REPORT + +echo "Report saved to $REPORT" +``` + +**Run monthly via cron**: +```cron +0 9 1 * * ~/bin/storage-capacity-report.sh +``` + +### Expansion Planning + +**When to expand**: +- Pool reaches 80% capacity +- Performance degrades +- New workloads require more space + +**Expansion options**: +1. Add drives to existing pools (if mirrors, add mirror vdev) +2. Add new NVMe drives to PVE/PVE2 +3. Expand EMC enclosure (add more drives) +4. Add second EMC enclosure + +**Cost estimates**: TBD + +--- + +## ZFS Health Monitoring + +### Daily Health Checks + +```bash +# Check for errors on all pools +ssh pve 'zpool status -x' # Shows only unhealthy pools +ssh pve2 'zpool status -x' +ssh truenas 'zpool status -x' + +# Check scrub status +ssh pve 'zpool status | grep scrub' +ssh pve2 'zpool status | grep scrub' +ssh truenas 'zpool status | grep scrub' +``` + +### Scrub Schedule + +**Recommended**: Monthly scrub on all pools + +**Configure scrub**: +```bash +# Via Proxmox UI: Node → Disks → ZFS → Select pool → Scrub +# Or via cron: +0 2 1 * * /sbin/zpool scrub nvme-mirror1 +0 2 1 * * /sbin/zpool scrub rpool +``` + +**On TrueNAS**: +- Configure via UI: Storage → Pools → Scrub Tasks +- Recommended: 1st of every month at 2 AM + +### SMART Monitoring + +**Check drive health**: +```bash +# PVE +ssh pve 'smartctl -a /dev/nvme0' +ssh pve 'smartctl -a /dev/sda' + +# TrueNAS +ssh truenas 'smartctl --scan' +ssh truenas 'smartctl -a /dev/sdX' # For each drive +``` + +**Configure SMART tests**: +- TrueNAS UI: Tasks → S.M.A.R.T. 
Tests +- Recommended: Weekly short test, monthly long test + +### Alerts + +**Set up email alerts for**: +- ZFS pool errors +- SMART test failures +- Pool capacity > 80% +- Scrub failures + +--- + +## Storage Performance Tuning + +### ZFS ARC (Cache) + +**Check ARC usage**: +```bash +ssh pve 'arc_summary' +ssh truenas 'arc_summary' +``` + +**Tuning** (if needed): +- PVE/PVE2: Set max ARC in `/etc/modprobe.d/zfs.conf` +- TrueNAS: Configure via UI (System → Advanced → Tunables) + +### NFS Performance + +**Mount options** (on clients like Saltbox): +``` +rsize=131072,wsize=131072,hard,timeo=600,retrans=2,vers=3 +``` + +**Verify NFS mounts**: +```bash +ssh saltbox 'mount | grep nfs' +``` + +### Record Size Optimization + +**Different workloads need different record sizes**: +- VMs: 64K (default, good for VMs) +- Databases: 8K or 16K +- Media files: 1M (large sequential reads) + +**Set record size** (on TrueNAS datasets): +```bash +ssh truenas 'zfs set recordsize=1M vault/movies' +``` + +--- + +## Disaster Recovery + +### Pool Recovery + +**If a pool fails to import**: +```bash +# Try importing with different name +zpool import -f -N poolname newpoolname + +# Check pool with readonly +zpool import -f -o readonly=on poolname + +# Force import (last resort) +zpool import -f -F poolname +``` + +### Drive Replacement + +**When a drive fails**: +```bash +# Identify failed drive +zpool status poolname + +# Replace drive +zpool replace poolname old-device new-device + +# Monitor resilver +watch zpool status poolname +``` + +### Data Recovery + +**If pool is completely lost**: +1. Restore from offsite backup (see [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md)) +2. Recreate pool structure +3. Restore data + +**Critical**: This is why we need offsite backups! + +--- + +## Quick Reference + +### Common Commands + +```bash +# Pool status +zpool status [poolname] +zpool list + +# Dataset usage +zfs list +zfs list -r vault + +# Check pool health (only unhealthy) +zpool status -x + +# Scrub pool +zpool scrub poolname + +# Get pool IO stats +zpool iostat -v 1 + +# Snapshot management +zfs snapshot poolname/dataset@snapname +zfs list -t snapshot +zfs rollback poolname/dataset@snapname +zfs destroy poolname/dataset@snapname +``` + +### Storage Locations by Use Case + +| Use Case | Recommended Storage | Why | +|----------|---------------------|-----| +| VM OS disk | nvme-mirror1 (PVE) | Fastest IO | +| Database | nvme-mirror1/2 | Low latency | +| Media files | TrueNAS vault | Large capacity | +| Development | nvme-mirror2 | Fast, mid-tier | +| Containers | rpool | Good performance | +| Backups | TrueNAS or rpool | Large capacity | +| Archive | local-zfs2 (PVE2) | Cheap, can spin down | + +--- + +## Investigation Needed + +- [ ] Get complete TrueNAS dataset list +- [ ] Document NFS/SMB share configuration +- [ ] Inventory EMC enclosure drives (count, capacity, model) +- [ ] Document current pool usage percentages +- [ ] Set up monthly capacity reports +- [ ] Configure ZFS scrub schedules +- [ ] Set up storage health alerts + +--- + +## Related Documentation + +- [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md) - Backup and snapshot strategy +- [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md) - Storage enclosure maintenance +- [VMS.md](VMS.md) - VM storage assignments +- [NETWORK.md](NETWORK.md) - Storage network configuration + +--- + +**Last Updated**: 2025-12-22 diff --git a/TRAEFIK.md b/TRAEFIK.md new file mode 100644 index 0000000..77af1d9 --- /dev/null +++ b/TRAEFIK.md @@ -0,0 +1,672 @@ +# Traefik Reverse Proxy + +Documentation for 
Traefik reverse proxy setup, SSL certificates, and deploying new public services. + +## Overview + +There are **TWO separate Traefik instances** handling different services. Understanding which one to use is critical. + +| Instance | Location | IP | Purpose | Managed By | +|----------|----------|-----|---------|------------| +| **Traefik-Primary** | CT 202 | **10.10.10.250** | General services | Manual config files | +| **Traefik-Saltbox** | VM 101 (Docker) | **10.10.10.100** | Saltbox services only | Saltbox Ansible | + +--- + +## ⚠️ CRITICAL RULE: Which Traefik to Use + +### When Adding ANY New Service: + +✅ **USE Traefik-Primary (CT 202 @ 10.10.10.250)** - For ALL new services +❌ **DO NOT touch Traefik-Saltbox** - Unless you're modifying Saltbox itself + +### Why This Matters: + +- **Traefik-Saltbox** has complex Saltbox-managed configs (Ansible-generated) +- Messing with it breaks Plex, Sonarr, Radarr, and all media services +- Each Traefik has its own Let's Encrypt certificates +- Mixing them causes certificate conflicts and routing issues + +--- + +## Traefik-Primary (CT 202) - For New Services + +### Configuration + +**Location**: Container 202 on PVE (10.10.10.250) +**Config Directory**: `/etc/traefik/` +**Main Config**: `/etc/traefik/traefik.yaml` +**Dynamic Configs**: `/etc/traefik/conf.d/*.yaml` + +### Access Traefik Config + +```bash +# From Mac Mini: +ssh pve 'pct exec 202 -- cat /etc/traefik/traefik.yaml' +ssh pve 'pct exec 202 -- ls /etc/traefik/conf.d/' + +# Edit a service config: +ssh pve 'pct exec 202 -- vi /etc/traefik/conf.d/myservice.yaml' + +# View logs: +ssh pve 'pct exec 202 -- tail -f /var/log/traefik/traefik.log' +``` + +### Services Using Traefik-Primary + +| Service | Domain | Backend | +|---------|--------|---------| +| Excalidraw | excalidraw.htsn.io | 10.10.10.206:8080 (docker-host) | +| FindShyt | findshyt.htsn.io | 10.10.10.205 (CT 205) | +| Gitea | git.htsn.io | 10.10.10.220:3000 | +| Home Assistant | homeassistant.htsn.io | 10.10.10.110 | +| LM Dev | lmdev.htsn.io | 10.10.10.111 | +| Pi-hole | pihole.htsn.io | 10.10.10.200 | +| TrueNAS | truenas.htsn.io | 10.10.10.200 | +| Proxmox | pve.htsn.io | 10.10.10.120 | +| Copyparty | copyparty.htsn.io | 10.10.10.201 | +| AI Trade | aitrade.htsn.io | (trading server) | +| Pulse | pulse.htsn.io | 10.10.10.206:7655 (monitoring) | +| Happy | happy.htsn.io | 10.10.10.206:3002 (Happy Coder relay) | + +--- + +## Traefik-Saltbox (VM 101) - DO NOT MODIFY + +### Configuration + +**Location**: `/opt/traefik/` inside Saltbox VM +**Managed By**: Saltbox Ansible playbooks (automatic) +**Docker Mount**: `/opt/traefik` → `/etc/traefik` in container + +### Services Using Traefik-Saltbox + +- Plex (plex.htsn.io) +- Sonarr, Radarr, Lidarr +- SABnzbd, NZBGet, qBittorrent +- Overseerr, Tautulli, Organizr +- Jackett, NZBHydra2 +- Authelia (SSO authentication) +- All other Saltbox-managed containers + +### View Saltbox Traefik (Read-Only) + +```bash +# View config (don't edit!) +ssh pve 'qm guest exec 101 -- bash -c "docker exec traefik cat /etc/traefik/traefik.yml"' + +# View logs +ssh saltbox 'docker logs -f traefik' +``` + +**⚠️ WARNING**: Editing Saltbox Traefik configs manually will be overwritten by Ansible and may break media services. + +--- + +## Adding a New Public Service - Complete Workflow + +Follow these steps to deploy a new service and make it accessible at `servicename.htsn.io`. + +### Step 0: Deploy Your Service + +First, deploy your service on the appropriate host. 
+ +#### Option A: Docker on docker-host (10.10.10.206) + +```bash +ssh hutson@10.10.10.206 +sudo mkdir -p /opt/myservice +cat > /opt/myservice/docker-compose.yml << 'EOF' +version: "3.8" +services: + myservice: + image: myimage:latest + ports: + - "8080:80" + restart: unless-stopped +EOF +cd /opt/myservice && sudo docker-compose up -d +``` + +#### Option B: New LXC Container on PVE + +```bash +ssh pve 'pct create CTID local:vztmpl/ubuntu-22.04-standard_22.04-1_amd64.tar.zst \ + --hostname myservice --memory 2048 --cores 2 \ + --net0 name=eth0,bridge=vmbr0,ip=10.10.10.XXX/24,gw=10.10.10.1 \ + --rootfs local-zfs:8 --unprivileged 1 --start 1' +``` + +#### Option C: New VM on PVE + +```bash +ssh pve 'qm create VMID --name myservice --memory 2048 --cores 2 \ + --net0 virtio,bridge=vmbr0 --scsihw virtio-scsi-pci' +``` + +### Step 1: Create Traefik Config File + +Use this template for new services on **Traefik-Primary (CT 202)**: + +#### Basic Template + +```yaml +# /etc/traefik/conf.d/myservice.yaml +http: + routers: + # HTTPS router + myservice-secure: + entryPoints: + - websecure + rule: "Host(`myservice.htsn.io`)" + service: myservice + tls: + certResolver: cloudflare # Use 'cloudflare' for proxied domains, 'letsencrypt' for DNS-only + priority: 50 + + # HTTP → HTTPS redirect + myservice-redirect: + entryPoints: + - web + rule: "Host(`myservice.htsn.io`)" + middlewares: + - myservice-https-redirect + service: myservice + priority: 50 + + services: + myservice: + loadBalancer: + servers: + - url: "http://10.10.10.XXX:PORT" + + middlewares: + myservice-https-redirect: + redirectScheme: + scheme: https + permanent: true +``` + +#### Deploy the Config + +```bash +# Create file on CT 202 +ssh pve 'pct exec 202 -- bash -c "cat > /etc/traefik/conf.d/myservice.yaml << '\''EOF'\'' + +EOF"' + +# Traefik auto-reloads (watches conf.d directory) +# Check logs: +ssh pve 'pct exec 202 -- tail -f /var/log/traefik/traefik.log' +``` + +### Step 2: Add Cloudflare DNS Entry + +#### Cloudflare Credentials + +| Field | Value | +|-------|-------| +| Email | cloudflare@htsn.io | +| API Key | 849ebefd163d2ccdec25e49b3e1b3fe2cdadc | +| Zone ID (htsn.io) | c0f5a80448c608af35d39aa820a5f3af | +| Public IP | 70.237.94.174 | + +#### Method 1: Manual (Cloudflare Dashboard) + +1. Go to https://dash.cloudflare.com/ +2. Select `htsn.io` domain +3. DNS → Add Record +4. Type: `A`, Name: `myservice`, IPv4: `70.237.94.174`, Proxied: ☑️ + +#### Method 2: Automated (CLI) + +Save this as `~/bin/add-cloudflare-dns.sh`: + +```bash +#!/bin/bash +# Add DNS record to Cloudflare for htsn.io + +SUBDOMAIN="$1" +CF_EMAIL="cloudflare@htsn.io" +CF_API_KEY="849ebefd163d2ccdec25e49b3e1b3fe2cdadc" +ZONE_ID="c0f5a80448c608af35d39aa820a5f3af" +PUBLIC_IP="70.237.94.174" + +if [ -z "$SUBDOMAIN" ]; then + echo "Usage: $0 " + echo "Example: $0 myservice # Creates myservice.htsn.io" + exit 1 +fi + +curl -X POST "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records" \ + -H "X-Auth-Email: $CF_EMAIL" \ + -H "X-Auth-Key: $CF_API_KEY" \ + -H "Content-Type: application/json" \ + --data "{ + \"type\":\"A\", + \"name\":\"$SUBDOMAIN\", + \"content\":\"$PUBLIC_IP\", + \"ttl\":1, + \"proxied\":true + }" | jq . 
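
# Optional follow-up (assumption: the list-records endpoint's 'name' filter can be used to
# verify the new record; see the Cloudflare API Reference section below for the base call):
# curl -s -X GET "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records?name=$SUBDOMAIN.htsn.io" \
#   -H "X-Auth-Email: $CF_EMAIL" -H "X-Auth-Key: $CF_API_KEY" | jq '.result[] | {name, content, proxied}'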
+``` + +**Usage**: +```bash +chmod +x ~/bin/add-cloudflare-dns.sh +~/bin/add-cloudflare-dns.sh myservice # Creates myservice.htsn.io +``` + +### Step 3: Testing + +```bash +# Check if DNS resolves +dig myservice.htsn.io + +# Should return: 70.237.94.174 (or Cloudflare IPs if proxied) + +# Test HTTP redirect +curl -I http://myservice.htsn.io + +# Expected: 301 redirect to https:// + +# Test HTTPS +curl -I https://myservice.htsn.io + +# Expected: 200 OK + +# Check Traefik dashboard (if enabled) +# http://10.10.10.250:8080/dashboard/ +``` + +### Step 4: Update Documentation + +After deploying, update: + +1. **IP-ASSIGNMENTS.md** - Add to Services & Reverse Proxy Mapping table +2. **This file (TRAEFIK.md)** - Add to "Services Using Traefik-Primary" list +3. **CLAUDE.md** - Update quick reference if needed + +--- + +## SSL Certificates + +Traefik has **two certificate resolvers** configured: + +| Resolver | Use When | Challenge Type | Notes | +|----------|----------|----------------|-------| +| `letsencrypt` | Cloudflare DNS-only (gray cloud ☁️) | HTTP-01 | Requires port 80 reachable | +| `cloudflare` | Cloudflare Proxied (orange cloud 🟠) | DNS-01 | Works with Cloudflare proxy | + +### ⚠️ Important: HTTP Challenge vs DNS Challenge + +**If Cloudflare proxy is enabled** (orange cloud), HTTP challenge **FAILS** because Cloudflare redirects HTTP→HTTPS before the challenge reaches your server. + +**Solution**: Use `cloudflare` resolver (DNS-01 challenge) instead. + +### Certificate Resolver Configuration + +**Cloudflare API credentials** are configured in `/etc/systemd/system/traefik.service`: + +```ini +Environment="CF_API_EMAIL=cloudflare@htsn.io" +Environment="CF_API_KEY=849ebefd163d2ccdec25e49b3e1b3fe2cdadc" +``` + +### Certificate Storage + +| Resolver | Storage File | +|----------|--------------| +| HTTP challenge (`letsencrypt`) | `/etc/traefik/acme.json` | +| DNS challenge (`cloudflare`) | `/etc/traefik/acme-cf.json` | + +**Permissions**: Must be `600` (read/write owner only) + +```bash +# Check permissions +ssh pve 'pct exec 202 -- ls -la /etc/traefik/acme*.json' + +# Fix if needed +ssh pve 'pct exec 202 -- chmod 600 /etc/traefik/acme.json' +ssh pve 'pct exec 202 -- chmod 600 /etc/traefik/acme-cf.json' +``` + +### Certificate Renewal + +- **Automatic** via Traefik +- Checks every 24 hours +- Renews 30 days before expiry +- No manual intervention needed + +### Troubleshooting Certificates + +#### Certificate Fails to Issue + +```bash +# Check Traefik logs +ssh pve 'pct exec 202 -- tail -f /var/log/traefik/traefik.log | grep -i error' + +# Verify Cloudflare API access +curl -X GET "https://api.cloudflare.com/client/v4/user/tokens/verify" \ + -H "X-Auth-Email: cloudflare@htsn.io" \ + -H "X-Auth-Key: 849ebefd163d2ccdec25e49b3e1b3fe2cdadc" + +# Check acme.json permissions +ssh pve 'pct exec 202 -- ls -la /etc/traefik/acme*.json' +``` + +#### Force Certificate Renewal + +```bash +# Delete certificate (Traefik will re-request) +ssh pve 'pct exec 202 -- rm /etc/traefik/acme-cf.json' +ssh pve 'pct exec 202 -- touch /etc/traefik/acme-cf.json' +ssh pve 'pct exec 202 -- chmod 600 /etc/traefik/acme-cf.json' +ssh pve 'pct exec 202 -- systemctl restart traefik' + +# Watch logs +ssh pve 'pct exec 202 -- tail -f /var/log/traefik/traefik.log' +``` + +--- + +## Quick Deployment - One-Liner + +For fast deployment, use this all-in-one command: + +```bash +# === DEPLOY SERVICE (example: myservice on docker-host port 8080) === + +# 1. 
Create Traefik config +ssh pve 'pct exec 202 -- bash -c "cat > /etc/traefik/conf.d/myservice.yaml << EOF +http: + routers: + myservice-secure: + entryPoints: [websecure] + rule: Host(\\\`myservice.htsn.io\\\`) + service: myservice + tls: {certResolver: cloudflare} + services: + myservice: + loadBalancer: + servers: + - url: http://10.10.10.206:8080 +EOF"' + +# 2. Add Cloudflare DNS +curl -s -X POST "https://api.cloudflare.com/client/v4/zones/c0f5a80448c608af35d39aa820a5f3af/dns_records" \ + -H "X-Auth-Email: cloudflare@htsn.io" \ + -H "X-Auth-Key: 849ebefd163d2ccdec25e49b3e1b3fe2cdadc" \ + -H "Content-Type: application/json" \ + --data '{"type":"A","name":"myservice","content":"70.237.94.174","proxied":true}' + +# 3. Test (wait a few seconds for DNS propagation) +curl -I https://myservice.htsn.io +``` + +--- + +## Docker Service with Traefik Labels (Alternative) + +If deploying a service via Docker on `docker-host` (VM 206), you can use Traefik labels instead of config files. + +**Requirements**: +- Traefik must have access to Docker socket +- Service must be on same Docker network as Traefik + +**Example docker-compose.yml**: + +```yaml +version: "3.8" + +services: + myservice: + image: myimage:latest + labels: + - "traefik.enable=true" + - "traefik.http.routers.myservice.rule=Host(`myservice.htsn.io`)" + - "traefik.http.routers.myservice.entrypoints=websecure" + - "traefik.http.routers.myservice.tls.certresolver=letsencrypt" + - "traefik.http.services.myservice.loadbalancer.server.port=8080" + networks: + - traefik + +networks: + traefik: + external: true +``` + +**Note**: This method is NOT currently used on Traefik-Primary (CT 202), as it doesn't have Docker socket access. Config files are preferred. + +--- + +## Cloudflare API Reference + +### API Credentials + +| Field | Value | +|-------|-------| +| Email | cloudflare@htsn.io | +| API Key | 849ebefd163d2ccdec25e49b3e1b3fe2cdadc | +| Zone ID | c0f5a80448c608af35d39aa820a5f3af | + +### Common API Operations + +Set credentials: +```bash +CF_EMAIL="cloudflare@htsn.io" +CF_API_KEY="849ebefd163d2ccdec25e49b3e1b3fe2cdadc" +ZONE_ID="c0f5a80448c608af35d39aa820a5f3af" +``` + +**List all DNS records**: +```bash +curl -X GET "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records" \ + -H "X-Auth-Email: $CF_EMAIL" \ + -H "X-Auth-Key: $CF_API_KEY" | jq +``` + +**Add A record**: +```bash +curl -X POST "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records" \ + -H "X-Auth-Email: $CF_EMAIL" \ + -H "X-Auth-Key: $CF_API_KEY" \ + -H "Content-Type: application/json" \ + --data '{ + "type":"A", + "name":"subdomain", + "content":"70.237.94.174", + "proxied":true + }' +``` + +**Delete record**: +```bash +curl -X DELETE "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records/$RECORD_ID" \ + -H "X-Auth-Email: $CF_EMAIL" \ + -H "X-Auth-Key: $CF_API_KEY" +``` + +**Update record** (toggle proxy): +```bash +curl -X PATCH "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records/$RECORD_ID" \ + -H "X-Auth-Email: $CF_EMAIL" \ + -H "X-Auth-Key: $CF_API_KEY" \ + -H "Content-Type: application/json" \ + --data '{"proxied":false}' +``` + +--- + +## Troubleshooting + +### Service Not Accessible + +```bash +# 1. Check if DNS resolves +dig myservice.htsn.io + +# 2. Check if backend is reachable +curl -I http://10.10.10.XXX:PORT + +# 3. Check Traefik logs +ssh pve 'pct exec 202 -- tail -f /var/log/traefik/traefik.log' + +# 4. Check Traefik config is valid +ssh pve 'pct exec 202 -- cat /etc/traefik/conf.d/myservice.yaml' + +# 5. 
Restart Traefik (if needed) +ssh pve 'pct exec 202 -- systemctl restart traefik' +``` + +### Certificate Issues + +```bash +# Check certificate status in acme.json +ssh pve 'pct exec 202 -- cat /etc/traefik/acme-cf.json | jq' + +# Check certificate expiry +echo | openssl s_client -servername myservice.htsn.io -connect myservice.htsn.io:443 2>/dev/null | openssl x509 -noout -dates +``` + +### 502 Bad Gateway + +**Cause**: Backend service is down or unreachable + +```bash +# Check if backend is running +ssh backend-host 'systemctl status myservice' + +# Check if port is open +nc -zv 10.10.10.XXX PORT + +# Check firewall +ssh backend-host 'iptables -L -n | grep PORT' +``` + +### 404 Not Found + +**Cause**: Traefik can't match the request to a router + +```bash +# Check router rule matches domain +ssh pve 'pct exec 202 -- cat /etc/traefik/conf.d/myservice.yaml | grep rule' + +# Should be: rule: "Host(`myservice.htsn.io`)" + +# Check DNS is pointing to correct IP +dig myservice.htsn.io + +# Restart Traefik to reload config +ssh pve 'pct exec 202 -- systemctl restart traefik' +``` + +--- + +## Advanced Configuration Examples + +### WebSocket Support + +For services that use WebSockets (like Home Assistant): + +```yaml +http: + routers: + myservice-secure: + entryPoints: + - websecure + rule: "Host(`myservice.htsn.io`)" + service: myservice + tls: + certResolver: cloudflare + + services: + myservice: + loadBalancer: + servers: + - url: "http://10.10.10.XXX:PORT" + # No special config needed - WebSockets work by default in Traefik v2+ +``` + +### Custom Headers + +Add custom headers (e.g., security headers): + +```yaml +http: + routers: + myservice-secure: + middlewares: + - myservice-headers + + middlewares: + myservice-headers: + headers: + customResponseHeaders: + X-Frame-Options: "DENY" + X-Content-Type-Options: "nosniff" + Referrer-Policy: "strict-origin-when-cross-origin" +``` + +### Basic Authentication + +Protect a service with basic auth: + +```yaml +http: + routers: + myservice-secure: + middlewares: + - myservice-auth + + middlewares: + myservice-auth: + basicAuth: + users: + - "user:$apr1$..." # Generate with: htpasswd -nb user password +``` + +--- + +## Maintenance + +### Monthly Checks + +```bash +# Check Traefik status +ssh pve 'pct exec 202 -- systemctl status traefik' + +# Review logs for errors +ssh pve 'pct exec 202 -- grep -i error /var/log/traefik/traefik.log | tail -20' + +# Check certificate expiry dates +ssh pve 'pct exec 202 -- cat /etc/traefik/acme-cf.json | jq ".cloudflare.Certificates[] | {domain: .domain.main, expiry: .certificate}"' + +# Verify all services responding +for domain in plex.htsn.io git.htsn.io truenas.htsn.io; do + echo "Testing $domain..." 
+ curl -sI https://$domain | head -1 +done +``` + +### Backup Traefik Config + +```bash +# Backup all configs +ssh pve 'pct exec 202 -- tar czf /tmp/traefik-backup-$(date +%Y%m%d).tar.gz /etc/traefik' + +# Copy to safe location +scp "pve:/var/lib/lxc/202/rootfs/tmp/traefik-backup-*.tar.gz" ~/Backups/traefik/ +``` + +--- + +## Related Documentation + +- [IP-ASSIGNMENTS.md](IP-ASSIGNMENTS.md) - Service IP addresses +- [CLOUDFLARE.md](#) - Cloudflare DNS management (coming soon) +- [SERVICES.md](#) - Complete service inventory (coming soon) + +--- + +**Last Updated**: 2025-12-22 diff --git a/UPS.md b/UPS.md new file mode 100644 index 0000000..1aeb248 --- /dev/null +++ b/UPS.md @@ -0,0 +1,605 @@ +# UPS and Power Management + +Documentation for UPS (Uninterruptible Power Supply) configuration, NUT (Network UPS Tools) monitoring, and power failure procedures. + +## Hardware + +### Current UPS + +| Specification | Value | +|---------------|-------| +| **Model** | CyberPower OR2200PFCRT2U | +| **Capacity** | 2200VA / 1320W | +| **Form Factor** | 2U rackmount | +| **Output** | PFC Sinewave (compatible with active PFC PSUs) | +| **Outlets** | 2x NEMA 5-20R + 6x NEMA 5-15R (all battery + surge) | +| **Input Plug** | ⚠️ Originally NEMA 5-20P (20A), **rewired to 5-15P (15A)** | +| **Runtime** | ~15-20 min at typical load (~33% / 440W) | +| **Installed** | 2025-12-21 | +| **Status** | Active | + +### ⚠️ Temporary Wiring Modification + +**Issue**: UPS came with NEMA 5-20P plug (20A) but server rack is on 15A circuit +**Solution**: Temporarily rewired plug from 5-20P → 5-15P for compatibility +**Risk**: UPS can output 1320W but circuit limited to 1800W max (15A × 120V) +**Current draw**: ~1000-1350W total (safe margin) +**Backlog**: Upgrade to 20A circuit, restore original 5-20P plug + +### Previous UPS + +| Model | Capacity | Issue | Replaced | +|-------|----------|-------|----------| +| WattBox WB-1100-IPVMB-6 | 1100VA / 660W | Insufficient for dual Threadripper setup | 2025-12-21 | + +**Why replaced**: Combined server load of 1000-1350W exceeded 660W capacity. + +--- + +## Power Draw Estimates + +### Typical Load + +| Component | Idle | Load | Notes | +|-----------|------|------|-------| +| PVE Server | 250-350W | 500-750W | CPU + TITAN RTX + P2000 + storage | +| PVE2 Server | 200-300W | 450-600W | CPU + RTX A6000 + storage | +| Network gear | ~50W | ~50W | Router, switches | +| **Total** | **500-700W** | **1000-1400W** | Varies by workload | + +**UPS Load**: ~33-50% typical, 70-80% under heavy load + +### Runtime Calculation + +At **440W load** (33%): ~15-20 min runtime (tested 2025-12-21) +At **660W load** (50%): ~10-12 min estimated +At **1000W load** (75%): ~6-8 min estimated + +**NUT shutdown trigger**: 120 seconds (2 min) remaining runtime + +--- + +## NUT (Network UPS Tools) Configuration + +### Architecture + +``` +UPS (USB) ──> PVE (NUT Server/Master) ──> PVE2 (NUT Client/Slave) + │ + └──> Home Assistant (monitoring only) +``` + +**Master**: PVE (10.10.10.120) - UPS connected via USB, runs NUT server +**Slave**: PVE2 (10.10.10.102) - Monitors PVE's NUT server, shuts down when triggered + +### NUT Server Configuration (PVE) + +#### 1. 
UPS Driver Config: `/etc/nut/ups.conf` + +```ini +[cyberpower] + driver = usbhid-ups + port = auto + desc = "CyberPower OR2200PFCRT2U" + override.battery.charge.low = 20 + override.battery.runtime.low = 120 +``` + +**Key settings**: +- `driver = usbhid-ups`: USB HID UPS driver (generic for CyberPower) +- `port = auto`: Auto-detect USB device +- `override.battery.runtime.low = 120`: Trigger shutdown at 120 seconds (2 min) remaining + +#### 2. NUT Server Config: `/etc/nut/upsd.conf` + +```ini +LISTEN 127.0.0.1 3493 +LISTEN 10.10.10.120 3493 +``` + +**Listens on**: +- Localhost (for local monitoring) +- LAN IP (for PVE2 to connect) + +#### 3. User Config: `/etc/nut/upsd.users` + +```ini +[admin] + password = upsadmin123 + actions = SET + instcmds = ALL + +[upsmon] + password = upsmon123 + upsmon master +``` + +**Users**: +- `admin`: Full control, can run commands +- `upsmon`: Monitoring only (used by PVE2) + +#### 4. Monitor Config: `/etc/nut/upsmon.conf` + +```ini +MONITOR cyberpower@localhost 1 upsmon upsmon123 master + +MINSUPPLIES 1 +SHUTDOWNCMD "/usr/local/bin/ups-shutdown.sh" +NOTIFYCMD /usr/sbin/upssched +POLLFREQ 5 +POLLFREQALERT 5 +HOSTSYNC 15 +DEADTIME 15 +POWERDOWNFLAG /etc/killpower + +NOTIFYMSG ONLINE "UPS %s on line power" +NOTIFYMSG ONBATT "UPS %s on battery" +NOTIFYMSG LOWBATT "UPS %s battery is low" +NOTIFYMSG FSD "UPS %s: forced shutdown in progress" +NOTIFYMSG COMMOK "Communications with UPS %s established" +NOTIFYMSG COMMBAD "Communications with UPS %s lost" +NOTIFYMSG SHUTDOWN "Auto logout and shutdown proceeding" +NOTIFYMSG REPLBATT "UPS %s battery needs to be replaced" +NOTIFYMSG NOCOMM "UPS %s is unavailable" +NOTIFYMSG NOPARENT "upsmon parent process died - shutdown impossible" + +NOTIFYFLAG ONLINE SYSLOG+WALL +NOTIFYFLAG ONBATT SYSLOG+WALL +NOTIFYFLAG LOWBATT SYSLOG+WALL +NOTIFYFLAG FSD SYSLOG+WALL +NOTIFYFLAG COMMOK SYSLOG+WALL +NOTIFYFLAG COMMBAD SYSLOG+WALL +NOTIFYFLAG SHUTDOWN SYSLOG+WALL +NOTIFYFLAG REPLBATT SYSLOG+WALL +NOTIFYFLAG NOCOMM SYSLOG+WALL +NOTIFYFLAG NOPARENT SYSLOG +``` + +**Key settings**: +- `MONITOR cyberpower@localhost 1 upsmon upsmon123 master`: Monitor local UPS +- `SHUTDOWNCMD "/usr/local/bin/ups-shutdown.sh"`: Custom shutdown script +- `POLLFREQ 5`: Check UPS every 5 seconds + +#### 5. USB Permissions: `/etc/udev/rules.d/99-nut-ups.rules` + +```udev +SUBSYSTEM=="usb", ATTR{idVendor}=="0764", ATTR{idProduct}=="0501", MODE="0660", GROUP="nut" +``` + +**Purpose**: Ensure NUT can access USB UPS device + +**Apply rule**: +```bash +udevadm control --reload-rules +udevadm trigger +``` + +### NUT Client Configuration (PVE2) + +#### Monitor Config: `/etc/nut/upsmon.conf` + +```ini +MONITOR cyberpower@10.10.10.120 1 upsmon upsmon123 slave + +MINSUPPLIES 1 +SHUTDOWNCMD "/usr/local/bin/ups-shutdown.sh" +POLLFREQ 5 +POLLFREQALERT 5 +HOSTSYNC 15 +DEADTIME 15 +POWERDOWNFLAG /etc/killpower + +# Same NOTIFYMSG and NOTIFYFLAG as PVE +``` + +**Key difference**: `slave` instead of `master` - monitors remote UPS on PVE + +--- + +## Custom Shutdown Script + +### `/usr/local/bin/ups-shutdown.sh` (Same on both PVE and PVE2) + +```bash +#!/bin/bash +# Graceful VM/CT shutdown when UPS battery low + +LOG="/var/log/ups-shutdown.log" + +log() { + echo "$(date '+%Y-%m-%d %H:%M:%S') - $1" | tee -a "$LOG" +} + +log "=== UPS Shutdown Triggered ===" +log "Battery low - initiating graceful shutdown of VMs/CTs" + +# Get list of running VMs (skip TrueNAS for now) +VMS=$(qm list | awk '$3=="running" && $1!=100 {print $1}') +for VMID in $VMS; do + log "Stopping VM $VMID..." 
+ qm shutdown $VMID +done + +# Get list of running containers +CTS=$(pct list | awk '$2=="running" {print $1}') +for CTID in $CTS; do + log "Stopping CT $CTID..." + pct shutdown $CTID +done + +# Wait for VMs/CTs to stop +log "Waiting 60 seconds for VMs/CTs to shut down..." +sleep 60 + +# Now stop TrueNAS (storage - must be last) +if qm status 100 | grep -q running; then + log "Stopping TrueNAS (VM 100) last..." + qm shutdown 100 + sleep 30 +fi + +log "All VMs/CTs stopped. Host will remain running until UPS dies." +log "=== UPS Shutdown Complete ===" +``` + +**Make executable**: +```bash +chmod +x /usr/local/bin/ups-shutdown.sh +``` + +**Script behavior**: +1. Stops all VMs (except TrueNAS) +2. Stops all containers +3. Waits 60 seconds +4. Stops TrueNAS last (storage must be cleanly unmounted) +5. **Does NOT shut down Proxmox hosts** - intentionally left running + +**Why not shut down hosts?** +- BIOS configured to "Restore on AC Power Loss" +- When power returns, servers auto-boot and start VMs in order +- Avoids need for manual intervention + +--- + +## Power Failure Behavior + +### When Power Fails + +1. **UPS switches to battery** (`OB DISCHRG` status) +2. **NUT monitors runtime** - polls every 5 seconds +3. **At 120 seconds (2 min) remaining**: + - NUT triggers `/usr/local/bin/ups-shutdown.sh` on both servers + - Script gracefully stops all VMs/CTs + - TrueNAS stopped last (storage integrity) +4. **Hosts remain running** until UPS battery depletes +5. **UPS battery dies** → Hosts lose power (ungraceful but safe - VMs already stopped) + +### When Power Returns + +1. **UPS charges battery**, power returns to servers +2. **BIOS "Restore on AC Power Loss"** boots both servers +3. **Proxmox starts** and auto-starts VMs in configured order: + +| Order | Wait | VMs/CTs | Reason | +|-------|------|---------|--------| +| 1 | 30s | TrueNAS (VM 100) | Storage must start first | +| 2 | 60s | Saltbox (VM 101) | Depends on TrueNAS NFS | +| 3 | 10s | fs-dev, homeassistant, lmdev1, copyparty, docker-host | General VMs | +| 4 | 5s | pihole, traefik, findshyt | Containers | + +PVE2 VMs: order=1, wait=10s + +**Total recovery time**: ~7 minutes from power restoration to fully operational (tested 2025-12-21) + +--- + +## UPS Status Codes + +| Code | Meaning | Action | +|------|---------|--------| +| `OL` | Online (AC power) | Normal operation | +| `OB` | On Battery | Power outage - monitor runtime | +| `LB` | Low Battery | <2 min remaining - shutdown imminent | +| `CHRG` | Charging | Battery charging after power restored | +| `DISCHRG` | Discharging | On battery, draining | +| `FSD` | Forced Shutdown | NUT triggered shutdown | + +--- + +## Monitoring & Commands + +### Check UPS Status + +```bash +# Full status +ssh pve 'upsc cyberpower@localhost' + +# Key metrics only +ssh pve 'upsc cyberpower@localhost | grep -E "battery.charge:|battery.runtime:|ups.load:|ups.status:"' + +# Example output: +# battery.charge: 100 +# battery.runtime: 1234 (seconds remaining) +# ups.load: 33 (% load) +# ups.status: OL (online) +``` + +### Control UPS Beeper + +```bash +# Mute beeper (temporary - until next power event) +ssh pve 'upscmd -u admin -p upsadmin123 cyberpower@localhost beeper.mute' + +# Disable beeper (permanent) +ssh pve 'upscmd -u admin -p upsadmin123 cyberpower@localhost beeper.disable' + +# Enable beeper +ssh pve 'upscmd -u admin -p upsadmin123 cyberpower@localhost beeper.enable' +``` + +### Test Shutdown Procedure + +**Simulate low battery** (careful - this will shut down VMs!): + +```bash +# Set a very high 
low battery threshold to trigger shutdown +ssh pve 'upsrw -s battery.runtime.low=300 -u admin -p upsadmin123 cyberpower@localhost' + +# Watch it trigger (when runtime drops below 300 seconds) +ssh pve 'tail -f /var/log/ups-shutdown.log' + +# Reset to normal +ssh pve 'upsrw -s battery.runtime.low=120 -u admin -p upsadmin123 cyberpower@localhost' +``` + +**Better test**: Run shutdown script manually without actually triggering NUT: +```bash +ssh pve '/usr/local/bin/ups-shutdown.sh' +``` + +--- + +## Home Assistant Integration + +UPS metrics are exposed to Home Assistant via NUT integration. + +### Available Sensors + +| Entity ID | Description | +|-----------|-------------| +| `sensor.cyberpower_battery_charge` | Battery % (0-100) | +| `sensor.cyberpower_battery_runtime` | Seconds remaining on battery | +| `sensor.cyberpower_load` | Load % (0-100) | +| `sensor.cyberpower_input_voltage` | Input voltage (V AC) | +| `sensor.cyberpower_output_voltage` | Output voltage (V AC) | +| `sensor.cyberpower_status` | Status text (OL, OB, LB, etc.) | + +### Configuration + +**Home Assistant**: See [HOMEASSISTANT.md](HOMEASSISTANT.md) for integration setup. + +### Example Automations + +**Send notification when on battery**: +```yaml +automation: + - alias: "UPS On Battery Alert" + trigger: + - platform: state + entity_id: sensor.cyberpower_status + to: "OB" + action: + - service: notify.mobile_app + data: + message: "⚠️ Power outage! UPS on battery. Runtime: {{ states('sensor.cyberpower_battery_runtime') }}s" +``` + +**Alert when battery low**: +```yaml +automation: + - alias: "UPS Low Battery Alert" + trigger: + - platform: numeric_state + entity_id: sensor.cyberpower_battery_runtime + below: 300 + action: + - service: notify.mobile_app + data: + message: "🚨 UPS battery low! {{ states('sensor.cyberpower_battery_runtime') }}s remaining" +``` + +--- + +## Testing Results + +### Full Power Failure Test (2025-12-21) + +Complete end-to-end test of power failure and recovery: + +| Event | Time | Duration | Notes | +|-------|------|----------|-------| +| **Power pulled** | 22:30 | - | UPS on battery, ~15 min runtime at 33% load | +| **Low battery trigger** | 22:40:38 | +10:38 | Runtime < 120s, shutdown script ran | +| **All VMs stopped** | 22:41:36 | +0:58 | Graceful shutdown completed | +| **UPS died** | 22:46:29 | +4:53 | Hosts lost power at 0% battery | +| **Power restored** | ~22:47 | - | Plugged back in | +| **PVE online** | 22:49:11 | +2:11 | BIOS boot, Proxmox started | +| **PVE2 online** | 22:50:47 | +3:47 | BIOS boot, Proxmox started | +| **All VMs running** | 22:53:39 | +6:39 | Auto-started in correct order | +| **Total recovery** | - | **~7 min** | From power return to fully operational | + +**Results**: +✅ VMs shut down gracefully +✅ Hosts remained running until UPS died (as intended) +✅ Auto-boot on power restoration worked +✅ VMs started in correct order with appropriate delays +✅ No data corruption or issues + +**Runtime calculation**: +- Load: ~33% (440W estimated) +- Total runtime on battery: ~16 minutes (22:30 → 22:46:29) +- Matches manufacturer estimate for 33% load + +--- + +## Proxmox Cluster Quorum Fix + +### Problem + +With a 2-node cluster, if one node goes down, the other loses quorum and can't manage VMs. + +During UPS testing, this would prevent the remaining node from starting VMs after power restoration. 
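+
+A quick way to confirm the lost-quorum state during a test is to query the cluster from the surviving node. This is a minimal sketch using the standard `pvecm` CLI (not part of the original runbook; the grep'd field names come from typical `pvecm status` output and are worth verifying on your Proxmox version):
+
+```bash
+# On the node that stayed up, check whether the cluster is quorate
+ssh pve 'pvecm status | grep -E "Quorate|Expected votes|Total votes"'
+
+# "Quorate: No" means qm/pct operations are blocked until quorum returns.
+# Temporary emergency override (lowers expected votes to 1):
+ssh pve 'pvecm expected 1'
+```
+
+With the `two_node` setting below, this manual override is no longer needed.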
+ +### Solution + +Modified `/etc/pve/corosync.conf` to enable 2-node mode: + +``` +quorum { + provider: corosync_votequorum + two_node: 1 +} +``` + +**Effect**: +- Either node can operate independently if the other is down +- No more waiting for quorum when one server is offline +- Both nodes visible in single Proxmox interface when both up + +**Applied**: 2025-12-21 + +--- + +## Maintenance + +### Monthly Checks + +```bash +# Check UPS status +ssh pve 'upsc cyberpower@localhost' + +# Check NUT server running +ssh pve 'systemctl status nut-server' +ssh pve 'systemctl status nut-monitor' + +# Check NUT client running (PVE2) +ssh pve2 'systemctl status nut-monitor' + +# Verify PVE2 can see UPS +ssh pve2 'upsc cyberpower@10.10.10.120' + +# Check logs for errors +ssh pve 'journalctl -u nut-server -n 50' +ssh pve 'journalctl -u nut-monitor -n 50' +``` + +### Battery Health + +**Check battery stats**: +```bash +ssh pve 'upsc cyberpower@localhost | grep battery' + +# Key metrics: +# battery.charge: 100 (should be near 100 when on AC) +# battery.runtime: 1200+ (seconds at current load) +# battery.voltage: ~24V (normal for 24V battery system) +``` + +**Battery replacement**: When runtime significantly decreases or UPS reports `REPLBATT`: +```bash +ssh pve 'upsc cyberpower@localhost | grep battery.mfr.date' +``` + +CyberPower batteries typically last 3-5 years. + +### Firmware Updates + +Check CyberPower website for firmware updates: +https://www.cyberpowersystems.com/support/firmware/ + +--- + +## Troubleshooting + +### UPS Not Detected + +```bash +# Check USB connection +ssh pve 'lsusb | grep Cyber' + +# Expected: +# Bus 001 Device 003: ID 0764:0501 Cyber Power System, Inc. CP1500 AVR UPS + +# Restart NUT driver +ssh pve 'systemctl restart nut-driver' +ssh pve 'systemctl status nut-driver' +``` + +### PVE2 Can't Connect + +```bash +# Verify NUT server listening +ssh pve 'netstat -tuln | grep 3493' + +# Should show: +# tcp 0 0 10.10.10.120:3493 0.0.0.0:* LISTEN + +# Test connection from PVE2 +ssh pve2 'telnet 10.10.10.120 3493' + +# Check firewall (should allow port 3493) +ssh pve 'iptables -L -n | grep 3493' +``` + +### Shutdown Script Not Running + +```bash +# Check script permissions +ssh pve 'ls -la /usr/local/bin/ups-shutdown.sh' + +# Should be: -rwxr-xr-x (executable) + +# Check logs +ssh pve 'cat /var/log/ups-shutdown.log' + +# Test script manually +ssh pve '/usr/local/bin/ups-shutdown.sh' +``` + +### UPS Status Shows UNKNOWN + +```bash +# Driver may not be compatible +ssh pve 'upsc cyberpower@localhost ups.status' + +# Try different driver (in /etc/nut/ups.conf) +# driver = usbhid-ups +# or +# driver = blazer_usb + +# Restart after change +ssh pve 'systemctl restart nut-driver nut-server' +``` + +--- + +## Future Improvements + +- [ ] Add email alerts for UPS events (power fail, low battery) +- [ ] Log runtime statistics to track battery degradation +- [ ] Set up Grafana dashboard for UPS metrics +- [ ] Test battery runtime at different load levels +- [ ] Upgrade to 20A circuit, restore original 5-20P plug +- [ ] Consider adding network management card for out-of-band UPS access + +--- + +## Related Documentation + +- [POWER-MANAGEMENT.md](POWER-MANAGEMENT.md) - Overall power optimization +- [VMS.md](VMS.md) - VM startup order configuration +- [HOMEASSISTANT.md](HOMEASSISTANT.md) - UPS sensor integration + +--- + +**Last Updated**: 2025-12-22 diff --git a/VMS.md b/VMS.md new file mode 100644 index 0000000..b8e1d4c --- /dev/null +++ b/VMS.md @@ -0,0 +1,579 @@ +# VMs and Containers + 
+Complete inventory of all virtual machines and LXC containers across both Proxmox servers.
+
+## Overview
+
+| Server | VMs | LXCs | Total |
+|--------|-----|------|-------|
+| **PVE** (10.10.10.120) | 7 | 3 | 10 |
+| **PVE2** (10.10.10.102) | 2 | 0 | 2 |
+| **Total** | **9** | **3** | **12** |
+
+---
+
+## PVE (10.10.10.120) - Primary Server
+
+### Virtual Machines
+
+| VMID | Name | IP | vCPUs | RAM | Storage | Purpose | GPU/Passthrough | QEMU Agent |
+|------|------|-----|-------|-----|---------|---------|-----------------|------------|
+| **100** | truenas | 10.10.10.200 | 8 | 32GB | nvme-mirror1 | NAS, central file storage | LSI SAS2308 HBA, Samsung NVMe | ✅ Yes |
+| **101** | saltbox | 10.10.10.100 | 16 | 16GB | nvme-mirror1 | Media automation (Plex, *arr) | TITAN RTX | ✅ Yes |
+| **105** | fs-dev | 10.10.10.5 | 10 | 8GB | rpool | Development environment | - | ✅ Yes |
+| **110** | homeassistant | 10.10.10.110 | 2 | 2GB | rpool | Home automation platform | - | ❌ No |
+| **111** | lmdev1 | 10.10.10.111 | 8 | 32GB | nvme-mirror1 | AI/LLM development | TITAN RTX | ✅ Yes |
+| **201** | copyparty | 10.10.10.201 | 2 | 2GB | rpool | File sharing service | - | ✅ Yes |
+| **206** | docker-host | 10.10.10.206 | 2 | 4GB | rpool | Docker services (Excalidraw, Happy, Pulse) | - | ✅ Yes |
+
+### LXC Containers
+
+| CTID | Name | IP | RAM | Storage | Purpose |
+|------|------|-----|-----|---------|---------|
+| **200** | pihole | 10.10.10.10 | - | rpool | DNS, ad blocking |
+| **202** | traefik | 10.10.10.250 | - | rpool | Reverse proxy (primary) |
+| **205** | findshyt | 10.10.10.8 | - | rpool | Custom app |
+
+---
+
+## PVE2 (10.10.10.102) - Secondary Server
+
+### Virtual Machines
+
+| VMID | Name | IP | vCPUs | RAM | Storage | Purpose | GPU/Passthrough | QEMU Agent |
+|------|------|-----|-------|-----|---------|---------|-----------------|------------|
+| **300** | gitea-vm | 10.10.10.220 | 2 | 4GB | nvme-mirror3 | Git server (Gitea) | - | ✅ Yes |
+| **301** | trading-vm | 10.10.10.221 | 16 | 32GB | nvme-mirror3 | AI trading platform | RTX A6000 | ✅ Yes |
+
+### LXC Containers
+
+None on PVE2.
+
+---
+
+## VM Details
+
+### 100 - TrueNAS (Storage Server)
+
+**Purpose**: Central NAS for all file storage, NFS/SMB shares, and media libraries
+
+**Specs**:
+- **OS**: TrueNAS SCALE
+- **vCPUs**: 8
+- **RAM**: 32 GB
+- **Storage**: nvme-mirror1 (OS), EMC storage enclosure (data pool via HBA passthrough)
+- **Network**:
+  - Primary: 10 Gb (vmbr2)
+  - Secondary: Internal storage network (vmbr3 @ 10.10.20.x)
+
+**Hardware Passthrough**:
+- LSI SAS2308 HBA (for EMC enclosure drives)
+- Samsung NVMe (for ZFS caching)
+
+**ZFS Pools**:
+- `vault`: Main storage pool on EMC drives
+- Boot pool on passed-through NVMe
+
+**See**: [STORAGE.md](STORAGE.md), [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md)
+
+---
+
+### 101 - Saltbox (Media Automation)
+
+**Purpose**: Media server stack - Plex, Sonarr, Radarr, SABnzbd, Overseerr, etc.
+ +**Specs**: +- **OS**: Ubuntu 22.04 +- **vCPUs**: 16 +- **RAM**: 16 GB +- **Storage**: nvme-mirror1 +- **Network**: 10 Gb (vmbr2) + +**GPU Passthrough**: +- NVIDIA TITAN RTX (for Plex hardware transcoding) + +**Services**: +- Plex Media Server (plex.htsn.io) +- Sonarr, Radarr, Lidarr (TV/movie/music automation) +- SABnzbd, NZBGet (downloaders) +- Overseerr (request management) +- Tautulli (Plex stats) +- Organizr (dashboard) +- Authelia (SSO authentication) +- Traefik (reverse proxy - separate from CT 202) + +**Managed By**: Saltbox Ansible playbooks +**See**: [SALTBOX.md](#) (coming soon) + +--- + +### 105 - fs-dev (Development Environment) + +**Purpose**: General development work, testing, prototyping + +**Specs**: +- **OS**: Ubuntu 22.04 +- **vCPUs**: 10 +- **RAM**: 8 GB +- **Storage**: rpool +- **Network**: 1 Gb (vmbr0) + +--- + +### 110 - Home Assistant (Home Automation) + +**Purpose**: Smart home automation platform + +**Specs**: +- **OS**: Home Assistant OS +- **vCPUs**: 2 +- **RAM**: 2 GB +- **Storage**: rpool +- **Network**: 1 Gb (vmbr0) + +**Access**: +- Web UI: https://homeassistant.htsn.io +- API: See [HOMEASSISTANT.md](HOMEASSISTANT.md) + +**Special Notes**: +- ❌ No QEMU agent (Home Assistant OS doesn't support it) +- No SSH server by default (access via web terminal) + +--- + +### 111 - lmdev1 (AI/LLM Development) + +**Purpose**: AI model development, fine-tuning, inference + +**Specs**: +- **OS**: Ubuntu 22.04 +- **vCPUs**: 8 +- **RAM**: 32 GB +- **Storage**: nvme-mirror1 +- **Network**: 1 Gb (vmbr0) + +**GPU Passthrough**: +- NVIDIA TITAN RTX (shared with Saltbox, but can be dedicated if needed) + +**Installed**: +- CUDA toolkit +- Python 3.11+ +- PyTorch, TensorFlow +- Hugging Face transformers + +--- + +### 201 - Copyparty (File Sharing) + +**Purpose**: Simple HTTP file sharing server + +**Specs**: +- **OS**: Ubuntu 22.04 +- **vCPUs**: 2 +- **RAM**: 2 GB +- **Storage**: rpool +- **Network**: 1 Gb (vmbr0) + +**Access**: https://copyparty.htsn.io + +--- + +### 206 - docker-host (Docker Services) + +**Purpose**: General-purpose Docker host for miscellaneous services + +**Specs**: +- **OS**: Ubuntu 22.04 +- **vCPUs**: 2 +- **RAM**: 4 GB +- **Storage**: rpool +- **Network**: 1 Gb (vmbr0) +- **CPU**: `host` passthrough (for x86-64-v3 support) + +**Services Running**: +- Excalidraw (excalidraw.htsn.io) - Whiteboard +- Happy Coder relay server (happy.htsn.io) - Self-hosted relay for Happy Coder mobile app +- Pulse (pulse.htsn.io) - Monitoring dashboard + +**Docker Compose Files**: `/opt/*/docker-compose.yml` + +--- + +### 300 - gitea-vm (Git Server) + +**Purpose**: Self-hosted Git server + +**Specs**: +- **OS**: Ubuntu 22.04 +- **vCPUs**: 2 +- **RAM**: 4 GB +- **Storage**: nvme-mirror3 (PVE2) +- **Network**: 1 Gb (vmbr0) + +**Access**: https://git.htsn.io + +**Repositories**: +- homelab-docs (this documentation) +- Personal projects +- Private repos + +--- + +### 301 - trading-vm (AI Trading Platform) + +**Purpose**: Algorithmic trading system with AI models + +**Specs**: +- **OS**: Ubuntu 22.04 +- **vCPUs**: 16 +- **RAM**: 32 GB +- **Storage**: nvme-mirror3 (PVE2) +- **Network**: 1 Gb (vmbr0) + +**GPU Passthrough**: +- NVIDIA RTX A6000 (300W TDP, 48GB VRAM) + +**Software**: +- Trading algorithms +- AI models for market prediction +- Real-time data feeds +- Backtesting infrastructure + +--- + +## LXC Container Details + +### 200 - Pi-hole (DNS & Ad Blocking) + +**Purpose**: Network-wide DNS server and ad blocker + +**Type**: LXC (unprivileged) +**OS**: Ubuntu 22.04 +**IP**: 
10.10.10.10 +**Storage**: rpool + +**Access**: +- Web UI: http://10.10.10.10/admin +- Public URL: https://pihole.htsn.io + +**Configuration**: +- Upstream DNS: Cloudflare (1.1.1.1) +- DHCP: Disabled (router handles DHCP) +- Interface: All interfaces + +**Usage**: Set router DNS to 10.10.10.10 for network-wide ad blocking + +--- + +### 202 - Traefik (Reverse Proxy) + +**Purpose**: Primary reverse proxy for all public-facing services + +**Type**: LXC (unprivileged) +**OS**: Ubuntu 22.04 +**IP**: 10.10.10.250 +**Storage**: rpool + +**Configuration**: `/etc/traefik/` +**Dynamic Configs**: `/etc/traefik/conf.d/*.yaml` + +**See**: [TRAEFIK.md](TRAEFIK.md) for complete documentation + +**⚠️ Important**: This is the PRIMARY Traefik instance. Do NOT confuse with Saltbox's Traefik (VM 101). + +--- + +### 205 - FindShyt (Custom App) + +**Purpose**: Custom application (details TBD) + +**Type**: LXC (unprivileged) +**OS**: Ubuntu 22.04 +**IP**: 10.10.10.8 +**Storage**: rpool + +**Access**: https://findshyt.htsn.io + +--- + +## VM Startup Order & Dependencies + +### Power-On Sequence + +When servers boot (after power failure or restart), VMs/CTs start in this order: + +#### PVE (10.10.10.120) + +| Order | Wait | VMID | Name | Reason | +|-------|------|------|------|--------| +| **1** | 30s | 100 | TrueNAS | ⚠️ Storage must start first - other VMs depend on NFS | +| **2** | 60s | 101 | Saltbox | Depends on TrueNAS NFS mounts for media | +| **3** | 10s | 105, 110, 111, 201, 206 | Other VMs | General VMs, no critical dependencies | +| **4** | 5s | 200, 202, 205 | Containers | Lightweight, start quickly | + +**Configure startup order** (already set): +```bash +# View current config +ssh pve 'qm config 100 | grep -E "startup|onboot"' + +# Set startup order (example) +ssh pve 'qm set 100 --onboot 1 --startup order=1,up=30' +ssh pve 'qm set 101 --onboot 1 --startup order=2,up=60' +``` + +#### PVE2 (10.10.10.102) + +| Order | Wait | VMID | Name | +|-------|------|------|------| +| **1** | 10s | 300, 301 | All VMs | + +**Less critical** - no dependencies between PVE2 VMs. + +--- + +## Resource Allocation Summary + +### Total Allocated (PVE) + +| Resource | Allocated | Physical | % Used | +|----------|-----------|----------|--------| +| **vCPUs** | 56 | 64 (32 cores × 2 threads) | 88% | +| **RAM** | 98 GB | 128 GB | 77% | + +**Note**: vCPU overcommit is acceptable (VMs rarely use all cores simultaneously) + +### Total Allocated (PVE2) + +| Resource | Allocated | Physical | % Used | +|----------|-----------|----------|--------| +| **vCPUs** | 18 | 64 | 28% | +| **RAM** | 36 GB | 128 GB | 28% | + +**PVE2** has significant headroom for additional VMs. 
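+
+When VMs are added or resized, these totals can be re-derived directly from the VM configs instead of updating the table by hand. A minimal sketch (run on the host itself; assumes the standard `qm` CLI, counts QEMU VMs only, and ignores LXC containers and socket counts):
+
+```bash
+# Sum allocated vCPUs and RAM across all defined VMs on this host
+total_cores=0; total_mem=0
+for id in $(qm list | awk 'NR>1 {print $1}'); do
+  cores=$(qm config "$id" | awk '/^cores:/ {print $2}')
+  mem=$(qm config "$id" | awk '/^memory:/ {print $2}')   # "memory:" is in MB
+  total_cores=$((total_cores + ${cores:-1}))
+  total_mem=$((total_mem + ${mem:-0}))
+done
+echo "Allocated: ${total_cores} vCPUs, $((total_mem / 1024)) GB RAM"
+```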
+ +--- + +## Adding a New VM + +### Quick Template + +```bash +# Create VM +ssh pve 'qm create VMID \ + --name myvm \ + --memory 4096 \ + --cores 2 \ + --net0 virtio,bridge=vmbr0 \ + --scsihw virtio-scsi-pci \ + --scsi0 nvme-mirror1:32 \ + --boot order=scsi0 \ + --ostype l26 \ + --agent enabled=1' + +# Attach ISO for installation +ssh pve 'qm set VMID --ide2 local:iso/ubuntu-22.04.iso,media=cdrom' + +# Start VM +ssh pve 'qm start VMID' + +# Access console +ssh pve 'qm vncproxy VMID' # Then connect with VNC client +# Or via Proxmox web UI +``` + +### Cloud-Init Template (Faster) + +Use cloud-init for automated VM deployment: + +```bash +# Download cloud image +ssh pve 'wget https://cloud-images.ubuntu.com/releases/22.04/release/ubuntu-22.04-server-cloudimg-amd64.img -O /var/lib/vz/template/iso/ubuntu-22.04-cloud.img' + +# Create VM +ssh pve 'qm create VMID --name myvm --memory 4096 --cores 2 --net0 virtio,bridge=vmbr0' + +# Import disk +ssh pve 'qm importdisk VMID /var/lib/vz/template/iso/ubuntu-22.04-cloud.img nvme-mirror1' + +# Attach disk +ssh pve 'qm set VMID --scsi0 nvme-mirror1:vm-VMID-disk-0' + +# Add cloud-init drive +ssh pve 'qm set VMID --ide2 nvme-mirror1:cloudinit' + +# Set boot disk +ssh pve 'qm set VMID --boot order=scsi0' + +# Configure cloud-init (user, SSH key, network) +ssh pve 'qm set VMID --ciuser hutson --sshkeys ~/.ssh/homelab.pub --ipconfig0 ip=10.10.10.XXX/24,gw=10.10.10.1' + +# Enable QEMU agent +ssh pve 'qm set VMID --agent enabled=1' + +# Resize disk (cloud images are small by default) +ssh pve 'qm resize VMID scsi0 +30G' + +# Start VM +ssh pve 'qm start VMID' +``` + +**Cloud-init VMs boot ready-to-use** with SSH keys, static IP, and user configured. + +--- + +## Adding a New LXC Container + +```bash +# Download template (if not already downloaded) +ssh pve 'pveam update' +ssh pve 'pveam available | grep ubuntu' +ssh pve 'pveam download local ubuntu-22.04-standard_22.04-1_amd64.tar.zst' + +# Create container +ssh pve 'pct create CTID local:vztmpl/ubuntu-22.04-standard_22.04-1_amd64.tar.zst \ + --hostname mycontainer \ + --memory 2048 \ + --cores 2 \ + --net0 name=eth0,bridge=vmbr0,ip=10.10.10.XXX/24,gw=10.10.10.1 \ + --rootfs local-zfs:8 \ + --unprivileged 1 \ + --features nesting=1 \ + --start 1' + +# Set root password +ssh pve 'pct exec CTID -- passwd' + +# Add SSH key +ssh pve 'pct exec CTID -- mkdir -p /root/.ssh' +ssh pve 'pct exec CTID -- bash -c "echo \"$(cat ~/.ssh/homelab.pub)\" >> /root/.ssh/authorized_keys"' +ssh pve 'pct exec CTID -- chmod 700 /root/.ssh && chmod 600 /root/.ssh/authorized_keys' +``` + +--- + +## GPU Passthrough Configuration + +### Current GPU Assignments + +| GPU | Location | Passed To | VMID | Purpose | +|-----|----------|-----------|------|---------| +| **NVIDIA Quadro P2000** | PVE | - | - | Proxmox host (Plex transcoding via driver) | +| **NVIDIA TITAN RTX** | PVE | saltbox, lmdev1 | 101, 111 | Media transcoding + AI dev (shared) | +| **NVIDIA RTX A6000** | PVE2 | trading-vm | 301 | AI trading (dedicated) | + +### How to Pass GPU to VM + +1. **Identify GPU PCI ID**: + ```bash + ssh pve 'lspci | grep -i nvidia' + # Example output: + # 81:00.0 VGA compatible controller: NVIDIA Corporation TU102 [TITAN RTX] (rev a1) + # 81:00.1 Audio device: NVIDIA Corporation TU102 High Definition Audio Controller (rev a1) + ``` + +2. **Pass GPU to VM** (include both VGA and Audio): + ```bash + ssh pve 'qm set VMID -hostpci0 81:00.0,pcie=1' + # If multi-function device (GPU + Audio), use: + ssh pve 'qm set VMID -hostpci0 81:00,pcie=1' + ``` + +3. 
**Configure VM for GPU**: + ```bash + # Set machine type to q35 + ssh pve 'qm set VMID --machine q35' + + # Set BIOS to OVMF (UEFI) + ssh pve 'qm set VMID --bios ovmf' + + # Add EFI disk + ssh pve 'qm set VMID --efidisk0 nvme-mirror1:1,format=raw,efitype=4m,pre-enrolled-keys=1' + ``` + +4. **Reboot VM** and install NVIDIA drivers inside the VM + +**See**: [GPU-PASSTHROUGH.md](#) (coming soon) for detailed guide + +--- + +## Backup Priority + +See [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md) for complete backup plan. + +### Critical VMs (Must Backup) + +| Priority | VMID | Name | Reason | +|----------|------|------|--------| +| 🔴 **CRITICAL** | 100 | truenas | All storage lives here - catastrophic if lost | +| 🟡 **HIGH** | 101 | saltbox | Complex media stack config | +| 🟡 **HIGH** | 110 | homeassistant | Home automation config | +| 🟡 **HIGH** | 300 | gitea-vm | Git repositories (code, docs) | +| 🟡 **HIGH** | 301 | trading-vm | Trading algorithms and AI models | + +### Medium Priority + +| VMID | Name | Notes | +|------|------|-------| +| 200 | pihole | Easy to rebuild, but DNS config valuable | +| 202 | traefik | Config files backed up separately | + +### Low Priority (Ephemeral/Rebuildable) + +| VMID | Name | Notes | +|------|------|-------| +| 105 | fs-dev | Development - code is in Git | +| 111 | lmdev1 | Ephemeral development | +| 201 | copyparty | Simple app, easy to redeploy | +| 206 | docker-host | Docker Compose files backed up separately | + +--- + +## Quick Reference Commands + +```bash +# List all VMs +ssh pve 'qm list' +ssh pve2 'qm list' + +# List all containers +ssh pve 'pct list' + +# Start/stop VM +ssh pve 'qm start VMID' +ssh pve 'qm stop VMID' +ssh pve 'qm shutdown VMID' # Graceful + +# Start/stop container +ssh pve 'pct start CTID' +ssh pve 'pct stop CTID' +ssh pve 'pct shutdown CTID' # Graceful + +# VM console +ssh pve 'qm terminal VMID' + +# Container console +ssh pve 'pct enter CTID' + +# Clone VM +ssh pve 'qm clone VMID NEW_VMID --name newvm' + +# Delete VM +ssh pve 'qm destroy VMID' + +# Delete container +ssh pve 'pct destroy CTID' +``` + +--- + +## Related Documentation + +- [STORAGE.md](STORAGE.md) - Storage pool assignments +- [SSH-ACCESS.md](SSH-ACCESS.md) - How to access VMs +- [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md) - VM backup strategy +- [POWER-MANAGEMENT.md](POWER-MANAGEMENT.md) - VM resource optimization +- [NETWORK.md](NETWORK.md) - Which bridge to use for new VMs + +--- + +**Last Updated**: 2025-12-22