Complete Phase 2 documentation: Add HARDWARE, SERVICES, MONITORING, MAINTENANCE

Phase 2 documentation implementation:
- Created HARDWARE.md: Complete hardware inventory (servers, GPUs, storage, network cards)
- Created SERVICES.md: Service inventory with URLs, credentials, health checks (25+ services)
- Created MONITORING.md: Health monitoring recommendations, alert setup, implementation plan
- Created MAINTENANCE.md: Regular procedures, update schedules, testing checklists
- Updated README.md: Added all Phase 2 documentation links
- Updated CLAUDE.md: Cleaned up to quick reference only (1340→377 lines)

All detailed content now in specialized documentation files with cross-references.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Hutson committed 2025-12-23 00:34:21 -05:00
parent 23e9df68c9
commit 56b82df497
14 changed files with 6328 additions and 1036 deletions

VMS.md (new file, 579 lines)
# VMs and Containers
Complete inventory of all virtual machines and LXC containers across both Proxmox servers.
## Overview
| Server | VMs | LXCs | Total |
|--------|-----|------|-------|
| **PVE** (10.10.10.120) | 7 | 3 | 10 |
| **PVE2** (10.10.10.102) | 2 | 0 | 2 |
| **Total** | **9** | **3** | **12** |
---
## PVE (10.10.10.120) - Primary Server
### Virtual Machines
| VMID | Name | IP | vCPUs | RAM | Storage | Purpose | GPU/Passthrough | QEMU Agent |
|------|------|-----|-------|-----|---------|---------|-----------------|------------|
| **100** | truenas | 10.10.10.200 | 8 | 32GB | nvme-mirror1 | NAS, central file storage | LSI SAS2308 HBA, Samsung NVMe | ✅ Yes |
| **101** | saltbox | 10.10.10.100 | 16 | 16GB | nvme-mirror1 | Media automation (Plex, *arr) | TITAN RTX | ✅ Yes |
| **105** | fs-dev | 10.10.10.5 | 10 | 8GB | rpool | Development environment | - | ✅ Yes |
| **110** | homeassistant | 10.10.10.110 | 2 | 2GB | rpool | Home automation platform | - | ❌ No |
| **111** | lmdev1 | 10.10.10.111 | 8 | 32GB | nvme-mirror1 | AI/LLM development | TITAN RTX | ✅ Yes |
| **201** | copyparty | 10.10.10.201 | 2 | 2GB | rpool | File sharing service | - | ✅ Yes |
| **206** | docker-host | 10.10.10.206 | 2 | 4GB | rpool | Docker services (Excalidraw, Happy, Pulse) | - | ✅ Yes |
### LXC Containers
| CTID | Name | IP | RAM | Storage | Purpose |
|------|------|-----|-----|---------|---------|
| **200** | pihole | 10.10.10.10 | - | rpool | DNS, ad blocking |
| **202** | traefik | 10.10.10.250 | - | rpool | Reverse proxy (primary) |
| **205** | findshyt | 10.10.10.8 | - | rpool | Custom app |
---
## PVE2 (10.10.10.102) - Secondary Server
### Virtual Machines
| VMID | Name | IP | vCPUs | RAM | Storage | Purpose | GPU/Passthrough | QEMU Agent |
|------|------|-----|-------|-----|---------|---------|-----------------|------------|
| **300** | gitea-vm | 10.10.10.220 | 2 | 4GB | nvme-mirror3 | Git server (Gitea) | - | ✅ Yes |
| **301** | trading-vm | 10.10.10.221 | 16 | 32GB | nvme-mirror3 | AI trading platform | RTX A6000 | ✅ Yes |
### LXC Containers
None on PVE2.
---
## VM Details
### 100 - TrueNAS (Storage Server)
**Purpose**: Central NAS for all file storage, NFS/SMB shares, and media libraries
**Specs**:
- **OS**: TrueNAS SCALE
- **vCPUs**: 8
- **RAM**: 32 GB
- **Storage**: nvme-mirror1 (OS), EMC storage enclosure (data pool via HBA passthrough)
- **Network**:
- Primary: 10 Gb (vmbr2)
- Secondary: Internal storage network (vmbr3 @ 10.10.20.x)
**Hardware Passthrough**:
- LSI SAS2308 HBA (for EMC enclosure drives)
- Samsung NVMe (for ZFS caching)
**ZFS Pools**:
- `vault`: Main storage pool on EMC drives
- Boot pool on passed-through NVMe
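**Quick health check** (a sketch; assumes an SSH alias `truenas` for 10.10.10.200):
```bash
# With -x and a pool name, zpool prints "pool 'vault' is healthy" when all is well
ssh truenas 'zpool status -x vault'
ssh truenas 'zpool list'   # capacity and fragmentation at a glance
```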
**See**: [STORAGE.md](STORAGE.md), [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md)
---
### 101 - Saltbox (Media Automation)
**Purpose**: Media server stack - Plex, Sonarr, Radarr, SABnzbd, Overseerr, etc.
**Specs**:
- **OS**: Ubuntu 22.04
- **vCPUs**: 16
- **RAM**: 16 GB
- **Storage**: nvme-mirror1
- **Network**: 10 Gb (vmbr2)
**GPU Passthrough**:
- NVIDIA TITAN RTX (for Plex hardware transcoding)
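**Verify in-guest** (assumes an SSH alias `saltbox` for 10.10.10.100):
```bash
# The TITAN RTX should appear here; active Plex transcodes show up as processes
ssh saltbox 'nvidia-smi'
```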
**Services**:
- Plex Media Server (plex.htsn.io)
- Sonarr, Radarr, Lidarr (TV/movie/music automation)
- SABnzbd, NZBGet (downloaders)
- Overseerr (request management)
- Tautulli (Plex stats)
- Organizr (dashboard)
- Authelia (SSO authentication)
- Traefik (reverse proxy - separate from CT 202)
**Managed By**: Saltbox Ansible playbooks
**See**: [SALTBOX.md](#) (coming soon)
---
### 105 - fs-dev (Development Environment)
**Purpose**: General development work, testing, prototyping
**Specs**:
- **OS**: Ubuntu 22.04
- **vCPUs**: 10
- **RAM**: 8 GB
- **Storage**: rpool
- **Network**: 1 Gb (vmbr0)
---
### 110 - Home Assistant (Home Automation)
**Purpose**: Smart home automation platform
**Specs**:
- **OS**: Home Assistant OS
- **vCPUs**: 2
- **RAM**: 2 GB
- **Storage**: rpool
- **Network**: 1 Gb (vmbr0)
**Access**:
- Web UI: https://homeassistant.htsn.io
- API: See [HOMEASSISTANT.md](HOMEASSISTANT.md)
**Special Notes**:
- ❌ No QEMU agent (Home Assistant OS doesn't support it)
- No SSH server by default (access via web terminal)
---
### 111 - lmdev1 (AI/LLM Development)
**Purpose**: AI model development, fine-tuning, inference
**Specs**:
- **OS**: Ubuntu 22.04
- **vCPUs**: 8
- **RAM**: 32 GB
- **Storage**: nvme-mirror1
- **Network**: 1 Gb (vmbr0)
**GPU Passthrough**:
- NVIDIA TITAN RTX (also mapped to Saltbox; PCIe passthrough is exclusive, so only one of the two VMs can run with the GPU attached at a time)
**Installed**:
- CUDA toolkit
- Python 3.11+
- PyTorch, TensorFlow
- Hugging Face transformers
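**CUDA sanity check** (a sketch; assumes an SSH alias `lmdev1` and the stack listed above):
```bash
# GPU visible to the driver?
ssh lmdev1 'nvidia-smi --query-gpu=name,memory.total --format=csv'
# CUDA usable from PyTorch?
ssh lmdev1 'python3 -c "import torch; print(torch.cuda.is_available())"'
```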
---
### 201 - Copyparty (File Sharing)
**Purpose**: Simple HTTP file sharing server
**Specs**:
- **OS**: Ubuntu 22.04
- **vCPUs**: 2
- **RAM**: 2 GB
- **Storage**: rpool
- **Network**: 1 Gb (vmbr0)
**Access**: https://copyparty.htsn.io
---
### 206 - docker-host (Docker Services)
**Purpose**: General-purpose Docker host for miscellaneous services
**Specs**:
- **OS**: Ubuntu 22.04
- **vCPUs**: 2
- **RAM**: 4 GB
- **Storage**: rpool
- **Network**: 1 Gb (vmbr0)
- **CPU**: `host` passthrough (for x86-64-v3 support)
**Services Running**:
- Excalidraw (excalidraw.htsn.io) - Whiteboard
- Happy Coder relay server (happy.htsn.io) - Self-hosted relay for Happy Coder mobile app
- Pulse (pulse.htsn.io) - Monitoring dashboard
**Docker Compose Files**: `/opt/*/docker-compose.yml`
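**Routine operations** (a sketch; the `excalidraw` directory name is just one instance of the `/opt/*/` layout above):
```bash
# See what's running
ssh docker-host 'docker ps --format "table {{.Names}}\t{{.Status}}"'
# Restart one stack after editing its compose file
ssh docker-host 'cd /opt/excalidraw && docker compose up -d'
```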
---
### 300 - gitea-vm (Git Server)
**Purpose**: Self-hosted Git server
**Specs**:
- **OS**: Ubuntu 22.04
- **vCPUs**: 2
- **RAM**: 4 GB
- **Storage**: nvme-mirror3 (PVE2)
- **Network**: 1 Gb (vmbr0)
**Access**: https://git.htsn.io
**Repositories**:
- homelab-docs (this documentation)
- Personal projects
- Private repos
---
### 301 - trading-vm (AI Trading Platform)
**Purpose**: Algorithmic trading system with AI models
**Specs**:
- **OS**: Ubuntu 22.04
- **vCPUs**: 16
- **RAM**: 32 GB
- **Storage**: nvme-mirror3 (PVE2)
- **Network**: 1 Gb (vmbr0)
**GPU Passthrough**:
- NVIDIA RTX A6000 (300W TDP, 48GB VRAM)
**Software**:
- Trading algorithms
- AI models for market prediction
- Real-time data feeds
- Backtesting infrastructure
---
## LXC Container Details
### 200 - Pi-hole (DNS & Ad Blocking)
**Purpose**: Network-wide DNS server and ad blocker
**Type**: LXC (unprivileged)
**OS**: Ubuntu 22.04
**IP**: 10.10.10.10
**Storage**: rpool
**Access**:
- Web UI: http://10.10.10.10/admin
- Public URL: https://pihole.htsn.io
**Configuration**:
- Upstream DNS: Cloudflare (1.1.1.1)
- DHCP: Disabled (router handles DHCP)
- Interface: All interfaces
**Usage**: Set router DNS to 10.10.10.10 for network-wide ad blocking
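**Verify from any LAN host** (blocked domains resolve to 0.0.0.0 in Pi-hole's default blocking mode; the ad domain below is just an example):
```bash
# Normal lookup should return a real address
dig @10.10.10.10 example.com +short
# A blocked ad domain should return 0.0.0.0
dig @10.10.10.10 doubleclick.net +short
```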
---
### 202 - Traefik (Reverse Proxy)
**Purpose**: Primary reverse proxy for all public-facing services
**Type**: LXC (unprivileged)
**OS**: Ubuntu 22.04
**IP**: 10.10.10.250
**Storage**: rpool
**Configuration**: `/etc/traefik/`
**Dynamic Configs**: `/etc/traefik/conf.d/*.yaml`
**See**: [TRAEFIK.md](TRAEFIK.md) for complete documentation
**⚠️ Important**: This is the PRIMARY Traefik instance. Do NOT confuse with Saltbox's Traefik (VM 101).
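**Quick inspection** (a sketch; assumes Traefik runs as a systemd unit named `traefik` inside the CT):
```bash
# List dynamic config files and tail the service log for config errors
ssh traefik 'ls -l /etc/traefik/conf.d/'
ssh traefik 'journalctl -u traefik -n 50 --no-pager'   # unit name is an assumption
```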
---
### 205 - FindShyt (Custom App)
**Purpose**: Custom application (details TBD)
**Type**: LXC (unprivileged)
**OS**: Ubuntu 22.04
**IP**: 10.10.10.8
**Storage**: rpool
**Access**: https://findshyt.htsn.io
---
## VM Startup Order & Dependencies
### Power-On Sequence
When servers boot (after power failure or restart), VMs/CTs start in this order:
#### PVE (10.10.10.120)
| Order | Wait | VMID | Name | Reason |
|-------|------|------|------|--------|
| **1** | 30s | 100 | TrueNAS | ⚠️ Storage must start first - other VMs depend on NFS |
| **2** | 60s | 101 | Saltbox | Depends on TrueNAS NFS mounts for media |
| **3** | 10s | 105, 110, 111, 201, 206 | Other VMs | General VMs, no critical dependencies |
| **4** | 5s | 200, 202, 205 | Containers | Lightweight, start quickly |
**Configure startup order** (already set):
```bash
# View current config
ssh pve 'qm config 100 | grep -E "startup|onboot"'
# Set startup order (example)
ssh pve 'qm set 100 --onboot 1 --startup order=1,up=30'
ssh pve 'qm set 101 --onboot 1 --startup order=2,up=60'
```
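Containers take the same flags via `pct` (a sketch using CT 200's slot in the table above):
```bash
# Containers: same onboot/startup semantics, but with pct instead of qm
ssh pve 'pct set 200 --onboot 1 --startup order=4,up=5'
```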
#### PVE2 (10.10.10.102)
| Order | Wait | VMID | Name |
|-------|------|------|------|
| **1** | 10s | 300, 301 | All VMs |
**Less critical** - no dependencies between PVE2 VMs.
---
## Resource Allocation Summary
### Total Allocated (PVE)
| Resource | Allocated | Physical | % Used |
|----------|-----------|----------|--------|
| **vCPUs** | 56 | 64 (32 cores × 2 threads) | 88% |
| **RAM** | 98 GB | 128 GB | 77% |
**Note**: vCPU overcommit is acceptable (VMs rarely use all cores simultaneously). The VM tables above sum to 48 vCPUs / 96 GB; the allocated totals here presumably also count the LXC containers, whose per-CT figures are not itemized.
### Total Allocated (PVE2)
| Resource | Allocated | Physical | % Used |
|----------|-----------|----------|--------|
| **vCPUs** | 18 | 64 | 28% |
| **RAM** | 36 GB | 128 GB | 28% |
**PVE2** has significant headroom for additional VMs.
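**Re-audit after changes** (a sketch; totals still need summing by hand or with awk):
```bash
# Dump configured cores and memory for every VM on the node
ssh pve "for id in \$(qm list | awk 'NR>1 {print \$1}'); do echo \"== VM \$id\"; qm config \$id | grep -E '^(cores|memory):'; done"
```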
---
## Adding a New VM
### Quick Template
```bash
# Create VM
ssh pve 'qm create VMID \
--name myvm \
--memory 4096 \
--cores 2 \
--net0 virtio,bridge=vmbr0 \
--scsihw virtio-scsi-pci \
--scsi0 nvme-mirror1:32 \
--boot order=scsi0 \
--ostype l26 \
--agent enabled=1'
# Attach ISO for installation
ssh pve 'qm set VMID --ide2 local:iso/ubuntu-22.04.iso,media=cdrom'
# Start VM
ssh pve 'qm start VMID'
# Access console
ssh pve 'qm vncproxy VMID' # Then connect with VNC client
# Or via Proxmox web UI
```
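Once the OS install finishes and `qemu-guest-agent` is installed inside the guest, confirm the agent responds:
```bash
# Exits 0 (silently) when the guest agent is up
ssh pve 'qm agent VMID ping'
```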
### Cloud-Init Template (Faster)
Use cloud-init for automated VM deployment:
```bash
# Download cloud image
ssh pve 'wget https://cloud-images.ubuntu.com/releases/22.04/release/ubuntu-22.04-server-cloudimg-amd64.img -O /var/lib/vz/template/iso/ubuntu-22.04-cloud.img'
# Create VM
ssh pve 'qm create VMID --name myvm --memory 4096 --cores 2 --net0 virtio,bridge=vmbr0'
# Import disk
ssh pve 'qm importdisk VMID /var/lib/vz/template/iso/ubuntu-22.04-cloud.img nvme-mirror1'
# Attach disk
ssh pve 'qm set VMID --scsi0 nvme-mirror1:vm-VMID-disk-0'
# Add cloud-init drive
ssh pve 'qm set VMID --ide2 nvme-mirror1:cloudinit'
# Set boot disk
ssh pve 'qm set VMID --boot order=scsi0'
# Configure cloud-init (user, SSH key, network)
ssh pve 'qm set VMID --ciuser hutson --sshkeys ~/.ssh/homelab.pub --ipconfig0 ip=10.10.10.XXX/24,gw=10.10.10.1'
# Enable QEMU agent
ssh pve 'qm set VMID --agent enabled=1'
# Resize disk (cloud images are small by default)
ssh pve 'qm resize VMID scsi0 +30G'
# Start VM
ssh pve 'qm start VMID'
```
**Cloud-init VMs boot ready-to-use** with SSH keys, static IP, and user configured.
---
## Adding a New LXC Container
```bash
# Download template (if not already downloaded)
ssh pve 'pveam update'
ssh pve 'pveam available | grep ubuntu'
ssh pve 'pveam download local ubuntu-22.04-standard_22.04-1_amd64.tar.zst'
# Create container
ssh pve 'pct create CTID local:vztmpl/ubuntu-22.04-standard_22.04-1_amd64.tar.zst \
--hostname mycontainer \
--memory 2048 \
--cores 2 \
--net0 name=eth0,bridge=vmbr0,ip=10.10.10.XXX/24,gw=10.10.10.1 \
--rootfs local-zfs:8 \
--unprivileged 1 \
--features nesting=1 \
--start 1'
# Set root password
ssh pve 'pct exec CTID -- passwd'
# Add SSH key
ssh pve 'pct exec CTID -- mkdir -p /root/.ssh'
ssh pve 'pct exec CTID -- bash -c "echo \"$(cat ~/.ssh/homelab.pub)\" >> /root/.ssh/authorized_keys"'
ssh pve 'pct exec CTID -- sh -c "chmod 700 /root/.ssh && chmod 600 /root/.ssh/authorized_keys"'  # run both chmods inside the CT, not on the host
```
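**Post-create check** (substitute the real CTID/IP):
```bash
# Confirm the container is running and reachable
ssh pve 'pct status CTID'
ping -c 3 10.10.10.XXX
```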
---
## GPU Passthrough Configuration
### Current GPU Assignments
| GPU | Location | Passed To | VMID | Purpose |
|-----|----------|-----------|------|---------|
| **NVIDIA Quadro P2000** | PVE | - | - | Proxmox host (Plex transcoding via driver) |
| **NVIDIA TITAN RTX** | PVE | saltbox, lmdev1 | 101, 111 | Media transcoding + AI dev (time-shared; one VM at a time) |
| **NVIDIA RTX A6000** | PVE2 | trading-vm | 301 | AI trading (dedicated) |
### How to Pass GPU to VM
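**Prerequisite**: IOMMU must be enabled on the host (kernel cmdline plus the vfio modules). A quick check before starting:
```bash
# IOMMU groups should be populated and the kernel cmdline should carry the flag
ssh pve 'dmesg | grep -e DMAR -e IOMMU | head'
ssh pve 'cat /proc/cmdline'   # expect intel_iommu=on (Intel) or amd_iommu=on (AMD)
```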
1. **Identify GPU PCI ID**:
```bash
ssh pve 'lspci | grep -i nvidia'
# Example output:
# 81:00.0 VGA compatible controller: NVIDIA Corporation TU102 [TITAN RTX] (rev a1)
# 81:00.1 Audio device: NVIDIA Corporation TU102 High Definition Audio Controller (rev a1)
```
2. **Pass GPU to VM** (include both VGA and Audio):
```bash
ssh pve 'qm set VMID -hostpci0 81:00.0,pcie=1'
# If multi-function device (GPU + Audio), use:
ssh pve 'qm set VMID -hostpci0 81:00,pcie=1'
```
3. **Configure VM for GPU**:
```bash
# Set machine type to q35
ssh pve 'qm set VMID --machine q35'
# Set BIOS to OVMF (UEFI)
ssh pve 'qm set VMID --bios ovmf'
# Add EFI disk
ssh pve 'qm set VMID --efidisk0 nvme-mirror1:1,format=raw,efitype=4m,pre-enrolled-keys=1'  # pre-enrolled-keys=1 enables Secure Boot; use 0 if unsigned guest drivers refuse to load
```
4. **Reboot VM** and install NVIDIA drivers inside the VM
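Inside the guest after the reboot, a minimal driver-install sketch for Ubuntu 22.04 (the driver branch number is illustrative):
```bash
# Install the NVIDIA driver in the guest, reboot, then confirm the GPU is visible
sudo apt update && sudo apt install -y nvidia-driver-535
sudo reboot
# After the reboot:
nvidia-smi   # should list the passed-through GPU
```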
**See**: [GPU-PASSTHROUGH.md](#) (coming soon) for detailed guide
---
## Backup Priority
See [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md) for complete backup plan.
### Critical VMs (Must Backup)
| Priority | VMID | Name | Reason |
|----------|------|------|--------|
| 🔴 **CRITICAL** | 100 | truenas | All storage lives here - catastrophic if lost |
| 🟡 **HIGH** | 101 | saltbox | Complex media stack config |
| 🟡 **HIGH** | 110 | homeassistant | Home automation config |
| 🟡 **HIGH** | 300 | gitea-vm | Git repositories (code, docs) |
| 🟡 **HIGH** | 301 | trading-vm | Trading algorithms and AI models |
### Medium Priority
| VMID | Name | Notes |
|------|------|-------|
| 200 | pihole | Easy to rebuild, but DNS config valuable |
| 202 | traefik | Config files backed up separately |
### Low Priority (Ephemeral/Rebuildable)
| VMID | Name | Notes |
|------|------|-------|
| 105 | fs-dev | Development - code is in Git |
| 111 | lmdev1 | Ephemeral development |
| 201 | copyparty | Simple app, easy to redeploy |
| 206 | docker-host | Docker Compose files backed up separately |
---
## Quick Reference Commands
```bash
# List all VMs
ssh pve 'qm list'
ssh pve2 'qm list'
# List all containers
ssh pve 'pct list'
# Start/stop VM
ssh pve 'qm start VMID'
ssh pve 'qm stop VMID'
ssh pve 'qm shutdown VMID' # Graceful
# Start/stop container
ssh pve 'pct start CTID'
ssh pve 'pct stop CTID'
ssh pve 'pct shutdown CTID' # Graceful
# VM console
ssh pve 'qm terminal VMID'  # Requires a serial console configured on the VM
# Container console
ssh pve 'pct enter CTID'
# Clone VM
ssh pve 'qm clone VMID NEW_VMID --name newvm'
# Delete VM
ssh pve 'qm destroy VMID'
# Delete container
ssh pve 'pct destroy CTID'
```
---
## Related Documentation
- [STORAGE.md](STORAGE.md) - Storage pool assignments
- [SSH-ACCESS.md](SSH-ACCESS.md) - How to access VMs
- [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md) - VM backup strategy
- [POWER-MANAGEMENT.md](POWER-MANAGEMENT.md) - VM resource optimization
- [NETWORK.md](NETWORK.md) - Which bridge to use for new VMs
---
**Last Updated**: 2025-12-22