Complete Phase 2 documentation: Add HARDWARE, SERVICES, MONITORING, MAINTENANCE

Phase 2 documentation implementation:
- Created HARDWARE.md: Complete hardware inventory (servers, GPUs, storage, network cards)
- Created SERVICES.md: Service inventory with URLs, credentials, health checks (25+ services)
- Created MONITORING.md: Health monitoring recommendations, alert setup, implementation plan
- Created MAINTENANCE.md: Regular procedures, update schedules, testing checklists
- Updated README.md: Added all Phase 2 documentation links
- Updated CLAUDE.md: Cleaned up to quick reference only (1340→377 lines)

All detailed content now in specialized documentation files with cross-references.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Hutson committed 2025-12-23 00:34:21 -05:00
parent 23e9df68c9
commit 56b82df497
14 changed files with 6328 additions and 1036 deletions

VMS.md (new file, 579 lines)
# VMs and Containers
Complete inventory of all virtual machines and LXC containers across both Proxmox servers.
## Overview
| Server | VMs | LXCs | Total |
|--------|-----|------|-------|
| **PVE** (10.10.10.120) | 7 | 3 | 10 |
| **PVE2** (10.10.10.102) | 2 | 0 | 2 |
| **Total** | **9** | **3** | **12** |
---
## PVE (10.10.10.120) - Primary Server
### Virtual Machines
| VMID | Name | IP | vCPUs | RAM | Storage | Purpose | GPU/Passthrough | QEMU Agent |
|------|------|-----|-------|-----|---------|---------|-----------------|------------|
| **100** | truenas | 10.10.10.200 | 8 | 32GB | nvme-mirror1 | NAS, central file storage | LSI SAS2308 HBA, Samsung NVMe | ✅ Yes |
| **101** | saltbox | 10.10.10.100 | 16 | 16GB | nvme-mirror1 | Media automation (Plex, *arr) | TITAN RTX | ✅ Yes |
| **105** | fs-dev | 10.10.10.5 | 10 | 8GB | rpool | Development environment | - | ✅ Yes |
| **110** | homeassistant | 10.10.10.110 | 2 | 2GB | rpool | Home automation platform | - | ❌ No |
| **111** | lmdev1 | 10.10.10.111 | 8 | 32GB | nvme-mirror1 | AI/LLM development | TITAN RTX | ✅ Yes |
| **201** | copyparty | 10.10.10.201 | 2 | 2GB | rpool | File sharing service | - | ✅ Yes |
| **206** | docker-host | 10.10.10.206 | 2 | 4GB | rpool | Docker services (Excalidraw, Happy, Pulse) | - | ✅ Yes |
### LXC Containers
| CTID | Name | IP | RAM | Storage | Purpose |
|------|------|-----|-----|---------|---------|
| **200** | pihole | 10.10.10.10 | - | rpool | DNS, ad blocking |
| **202** | traefik | 10.10.10.250 | - | rpool | Reverse proxy (primary) |
| **205** | findshyt | 10.10.10.8 | - | rpool | Custom app |
---
## PVE2 (10.10.10.102) - Secondary Server
### Virtual Machines
| VMID | Name | IP | vCPUs | RAM | Storage | Purpose | GPU/Passthrough | QEMU Agent |
|------|------|-----|-------|-----|---------|---------|-----------------|------------|
| **300** | gitea-vm | 10.10.10.220 | 2 | 4GB | nvme-mirror3 | Git server (Gitea) | - | ✅ Yes |
| **301** | trading-vm | 10.10.10.221 | 16 | 32GB | nvme-mirror3 | AI trading platform | RTX A6000 | ✅ Yes |
### LXC Containers
None on PVE2.
---
## VM Details
### 100 - TrueNAS (Storage Server)
**Purpose**: Central NAS for all file storage, NFS/SMB shares, and media libraries
**Specs**:
- **OS**: TrueNAS SCALE
- **vCPUs**: 8
- **RAM**: 32 GB
- **Storage**: nvme-mirror1 (OS), EMC storage enclosure (data pool via HBA passthrough)
- **Network**:
- Primary: 10 Gb (vmbr2)
- Secondary: Internal storage network (vmbr3 @ 10.10.20.x)
**Hardware Passthrough**:
- LSI SAS2308 HBA (for EMC enclosure drives)
- Samsung NVMe (for ZFS caching)
**ZFS Pools**:
- `vault`: Main storage pool on EMC drives
- Boot pool on passed-through NVMe
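**Quick health check** (a sketch; assumes an SSH alias `truenas` for 10.10.10.200):
```bash
# With -x and a pool name, zpool prints "pool 'vault' is healthy" when all is well
ssh truenas 'zpool status -x vault'
ssh truenas 'zpool list'   # capacity and fragmentation at a glance
```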
**See**: [STORAGE.md](STORAGE.md), [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md)
---
### 101 - Saltbox (Media Automation)
**Purpose**: Media server stack - Plex, Sonarr, Radarr, SABnzbd, Overseerr, etc.
**Specs**:
- **OS**: Ubuntu 22.04
- **vCPUs**: 16
- **RAM**: 16 GB
- **Storage**: nvme-mirror1
- **Network**: 10 Gb (vmbr2)
**GPU Passthrough**:
- NVIDIA TITAN RTX (for Plex hardware transcoding)
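**Verify in-guest** (assumes an SSH alias `saltbox` for 10.10.10.100):
```bash
# The TITAN RTX should appear here; active Plex transcodes show up as processes
ssh saltbox 'nvidia-smi'
```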
**Services**:
- Plex Media Server (plex.htsn.io)
- Sonarr, Radarr, Lidarr (TV/movie/music automation)
- SABnzbd, NZBGet (downloaders)
- Overseerr (request management)
- Tautulli (Plex stats)
- Organizr (dashboard)
- Authelia (SSO authentication)
- Traefik (reverse proxy - separate from CT 202)
**Managed By**: Saltbox Ansible playbooks
**See**: [SALTBOX.md](#) (coming soon)
---
### 105 - fs-dev (Development Environment)
**Purpose**: General development work, testing, prototyping
**Specs**:
- **OS**: Ubuntu 22.04
- **vCPUs**: 10
- **RAM**: 8 GB
- **Storage**: rpool
- **Network**: 1 Gb (vmbr0)
---
### 110 - Home Assistant (Home Automation)
**Purpose**: Smart home automation platform
**Specs**:
- **OS**: Home Assistant OS
- **vCPUs**: 2
- **RAM**: 2 GB
- **Storage**: rpool
- **Network**: 1 Gb (vmbr0)
**Access**:
- Web UI: https://homeassistant.htsn.io
- API: See [HOMEASSISTANT.md](HOMEASSISTANT.md)
**Special Notes**:
- ❌ No QEMU agent (Home Assistant OS doesn't support it)
- No SSH server by default (access via web terminal)
---
### 111 - lmdev1 (AI/LLM Development)
**Purpose**: AI model development, fine-tuning, inference
**Specs**:
- **OS**: Ubuntu 22.04
- **vCPUs**: 8
- **RAM**: 32 GB
- **Storage**: nvme-mirror1
- **Network**: 1 Gb (vmbr0)
**GPU Passthrough**:
- NVIDIA TITAN RTX (also mapped to Saltbox; PCIe passthrough is exclusive, so only one of the two VMs can run with the GPU attached at a time)
**Installed**:
- CUDA toolkit
- Python 3.11+
- PyTorch, TensorFlow
- Hugging Face transformers
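**CUDA sanity check** (a sketch; assumes an SSH alias `lmdev1` and the stack listed above):
```bash
# GPU visible to the driver?
ssh lmdev1 'nvidia-smi --query-gpu=name,memory.total --format=csv'
# CUDA usable from PyTorch?
ssh lmdev1 'python3 -c "import torch; print(torch.cuda.is_available())"'
```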
---
### 201 - Copyparty (File Sharing)
**Purpose**: Simple HTTP file sharing server
**Specs**:
- **OS**: Ubuntu 22.04
- **vCPUs**: 2
- **RAM**: 2 GB
- **Storage**: rpool
- **Network**: 1 Gb (vmbr0)
**Access**: https://copyparty.htsn.io
---
### 206 - docker-host (Docker Services)
**Purpose**: General-purpose Docker host for miscellaneous services
**Specs**:
- **OS**: Ubuntu 22.04
- **vCPUs**: 2
- **RAM**: 4 GB
- **Storage**: rpool
- **Network**: 1 Gb (vmbr0)
- **CPU**: `host` passthrough (for x86-64-v3 support)
**Services Running**:
- Excalidraw (excalidraw.htsn.io) - Whiteboard
- Happy Coder relay server (happy.htsn.io) - Self-hosted relay for Happy Coder mobile app
- Pulse (pulse.htsn.io) - Monitoring dashboard
**Docker Compose Files**: `/opt/*/docker-compose.yml`
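**Routine operations** (a sketch; the `excalidraw` directory name is just one instance of the `/opt/*/` layout above):
```bash
# See what's running
ssh docker-host 'docker ps --format "table {{.Names}}\t{{.Status}}"'
# Restart one stack after editing its compose file
ssh docker-host 'cd /opt/excalidraw && docker compose up -d'
```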
---
### 300 - gitea-vm (Git Server)
**Purpose**: Self-hosted Git server
**Specs**:
- **OS**: Ubuntu 22.04
- **vCPUs**: 2
- **RAM**: 4 GB
- **Storage**: nvme-mirror3 (PVE2)
- **Network**: 1 Gb (vmbr0)
**Access**: https://git.htsn.io
**Repositories**:
- homelab-docs (this documentation)
- Personal projects
- Private repos
---
### 301 - trading-vm (AI Trading Platform)
**Purpose**: Algorithmic trading system with AI models
**Specs**:
- **OS**: Ubuntu 22.04
- **vCPUs**: 16
- **RAM**: 32 GB
- **Storage**: nvme-mirror3 (PVE2)
- **Network**: 1 Gb (vmbr0)
**GPU Passthrough**:
- NVIDIA RTX A6000 (300W TDP, 48GB VRAM)
**Software**:
- Trading algorithms
- AI models for market prediction
- Real-time data feeds
- Backtesting infrastructure
---
## LXC Container Details
### 200 - Pi-hole (DNS & Ad Blocking)
**Purpose**: Network-wide DNS server and ad blocker
**Type**: LXC (unprivileged)
**OS**: Ubuntu 22.04
**IP**: 10.10.10.10
**Storage**: rpool
**Access**:
- Web UI: http://10.10.10.10/admin
- Public URL: https://pihole.htsn.io
**Configuration**:
- Upstream DNS: Cloudflare (1.1.1.1)
- DHCP: Disabled (router handles DHCP)
- Interface: All interfaces
**Usage**: Set router DNS to 10.10.10.10 for network-wide ad blocking
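**Verify from any LAN host** (blocked domains resolve to 0.0.0.0 in Pi-hole's default blocking mode; the ad domain below is just an example):
```bash
# Normal lookup should return a real address
dig @10.10.10.10 example.com +short
# A blocked ad domain should return 0.0.0.0
dig @10.10.10.10 doubleclick.net +short
```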
---
### 202 - Traefik (Reverse Proxy)
**Purpose**: Primary reverse proxy for all public-facing services
**Type**: LXC (unprivileged)
**OS**: Ubuntu 22.04
**IP**: 10.10.10.250
**Storage**: rpool
**Configuration**: `/etc/traefik/`
**Dynamic Configs**: `/etc/traefik/conf.d/*.yaml`
**See**: [TRAEFIK.md](TRAEFIK.md) for complete documentation
**⚠️ Important**: This is the PRIMARY Traefik instance. Do NOT confuse with Saltbox's Traefik (VM 101).
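**Quick inspection** (a sketch; assumes Traefik runs as a systemd unit named `traefik` inside the CT):
```bash
# List dynamic config files and tail the service log for config errors
ssh traefik 'ls -l /etc/traefik/conf.d/'
ssh traefik 'journalctl -u traefik -n 50 --no-pager'   # unit name is an assumption
```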
---
### 205 - FindShyt (Custom App)
**Purpose**: Custom application (details TBD)
**Type**: LXC (unprivileged)
**OS**: Ubuntu 22.04
**IP**: 10.10.10.8
**Storage**: rpool
**Access**: https://findshyt.htsn.io
---
## VM Startup Order & Dependencies
### Power-On Sequence
When servers boot (after power failure or restart), VMs/CTs start in this order:
#### PVE (10.10.10.120)
| Order | Wait | VMID | Name | Reason |
|-------|------|------|------|--------|
| **1** | 30s | 100 | TrueNAS | ⚠️ Storage must start first - other VMs depend on NFS |
| **2** | 60s | 101 | Saltbox | Depends on TrueNAS NFS mounts for media |
| **3** | 10s | 105, 110, 111, 201, 206 | Other VMs | General VMs, no critical dependencies |
| **4** | 5s | 200, 202, 205 | Containers | Lightweight, start quickly |
**Configure startup order** (already set):
```bash
# View current config
ssh pve 'qm config 100 | grep -E "startup|onboot"'
# Set startup order (example)
ssh pve 'qm set 100 --onboot 1 --startup order=1,up=30'
ssh pve 'qm set 101 --onboot 1 --startup order=2,up=60'
```
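Containers take the same flags via `pct` (a sketch using CT 200's slot in the table above):
```bash
# Containers: same onboot/startup semantics, but with pct instead of qm
ssh pve 'pct set 200 --onboot 1 --startup order=4,up=5'
```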
#### PVE2 (10.10.10.102)
| Order | Wait | VMID | Name |
|-------|------|------|------|
| **1** | 10s | 300, 301 | All VMs |
**Less critical** - no dependencies between PVE2 VMs.
---
## Resource Allocation Summary
### Total Allocated (PVE)
| Resource | Allocated | Physical | % Used |
|----------|-----------|----------|--------|
| **vCPUs** | 56 | 64 (32 cores × 2 threads) | 88% |
| **RAM** | 98 GB | 128 GB | 77% |
**Note**: vCPU overcommit is acceptable (VMs rarely use all cores simultaneously). The VM tables above sum to 48 vCPUs / 96 GB; the allocated totals here presumably also count the LXC containers, whose per-CT figures are not itemized.
### Total Allocated (PVE2)
| Resource | Allocated | Physical | % Used |
|----------|-----------|----------|--------|
| **vCPUs** | 18 | 64 | 28% |
| **RAM** | 36 GB | 128 GB | 28% |
**PVE2** has significant headroom for additional VMs.
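**Re-audit after changes** (a sketch; totals still need summing by hand or with awk):
```bash
# Dump configured cores and memory for every VM on the node
ssh pve "for id in \$(qm list | awk 'NR>1 {print \$1}'); do echo \"== VM \$id\"; qm config \$id | grep -E '^(cores|memory):'; done"
```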
---
## Adding a New VM
### Quick Template
```bash
# Create VM
ssh pve 'qm create VMID \
--name myvm \
--memory 4096 \
--cores 2 \
--net0 virtio,bridge=vmbr0 \
--scsihw virtio-scsi-pci \
--scsi0 nvme-mirror1:32 \
--boot order=scsi0 \
--ostype l26 \
--agent enabled=1'
# Attach ISO for installation
ssh pve 'qm set VMID --ide2 local:iso/ubuntu-22.04.iso,media=cdrom'
# Start VM
ssh pve 'qm start VMID'
# Access console
ssh pve 'qm vncproxy VMID' # Then connect with VNC client
# Or via Proxmox web UI
```
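Once the OS install finishes and `qemu-guest-agent` is installed inside the guest, confirm the agent responds:
```bash
# Exits 0 (silently) when the guest agent is up
ssh pve 'qm agent VMID ping'
```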
### Cloud-Init Template (Faster)
Use cloud-init for automated VM deployment:
```bash
# Download cloud image
ssh pve 'wget https://cloud-images.ubuntu.com/releases/22.04/release/ubuntu-22.04-server-cloudimg-amd64.img -O /var/lib/vz/template/iso/ubuntu-22.04-cloud.img'
# Create VM
ssh pve 'qm create VMID --name myvm --memory 4096 --cores 2 --net0 virtio,bridge=vmbr0'
# Import disk
ssh pve 'qm importdisk VMID /var/lib/vz/template/iso/ubuntu-22.04-cloud.img nvme-mirror1'
# Attach disk
ssh pve 'qm set VMID --scsi0 nvme-mirror1:vm-VMID-disk-0'
# Add cloud-init drive
ssh pve 'qm set VMID --ide2 nvme-mirror1:cloudinit'
# Set boot disk
ssh pve 'qm set VMID --boot order=scsi0'
# Configure cloud-init (user, SSH key, network)
ssh pve 'qm set VMID --ciuser hutson --sshkeys ~/.ssh/homelab.pub --ipconfig0 ip=10.10.10.XXX/24,gw=10.10.10.1'
# Enable QEMU agent
ssh pve 'qm set VMID --agent enabled=1'
# Resize disk (cloud images are small by default)
ssh pve 'qm resize VMID scsi0 +30G'
# Start VM
ssh pve 'qm start VMID'
```
**Cloud-init VMs boot ready-to-use** with SSH keys, static IP, and user configured.
---
## Adding a New LXC Container
```bash
# Download template (if not already downloaded)
ssh pve 'pveam update'
ssh pve 'pveam available | grep ubuntu'
ssh pve 'pveam download local ubuntu-22.04-standard_22.04-1_amd64.tar.zst'
# Create container
ssh pve 'pct create CTID local:vztmpl/ubuntu-22.04-standard_22.04-1_amd64.tar.zst \
--hostname mycontainer \
--memory 2048 \
--cores 2 \
--net0 name=eth0,bridge=vmbr0,ip=10.10.10.XXX/24,gw=10.10.10.1 \
--rootfs local-zfs:8 \
--unprivileged 1 \
--features nesting=1 \
--start 1'
# Set root password
ssh pve 'pct exec CTID -- passwd'
# Add SSH key
ssh pve 'pct exec CTID -- mkdir -p /root/.ssh'
ssh pve 'pct exec CTID -- bash -c "echo \"$(cat ~/.ssh/homelab.pub)\" >> /root/.ssh/authorized_keys"'
ssh pve 'pct exec CTID -- sh -c "chmod 700 /root/.ssh && chmod 600 /root/.ssh/authorized_keys"'  # run both chmods inside the CT, not on the host
```
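**Post-create check** (substitute the real CTID/IP):
```bash
# Confirm the container is running and reachable
ssh pve 'pct status CTID'
ping -c 3 10.10.10.XXX
```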
---
## GPU Passthrough Configuration
### Current GPU Assignments
| GPU | Location | Passed To | VMID | Purpose |
|-----|----------|-----------|------|---------|
| **NVIDIA Quadro P2000** | PVE | - | - | Proxmox host (Plex transcoding via driver) |
| **NVIDIA TITAN RTX** | PVE | saltbox, lmdev1 | 101, 111 | Media transcoding + AI dev (time-shared; one VM at a time) |
| **NVIDIA RTX A6000** | PVE2 | trading-vm | 301 | AI trading (dedicated) |
### How to Pass GPU to VM
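**Prerequisite**: IOMMU must be enabled on the host (kernel cmdline plus the vfio modules). A quick check before starting:
```bash
# IOMMU groups should be populated and the kernel cmdline should carry the flag
ssh pve 'dmesg | grep -e DMAR -e IOMMU | head'
ssh pve 'cat /proc/cmdline'   # expect intel_iommu=on (Intel) or amd_iommu=on (AMD)
```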
1. **Identify GPU PCI ID**:
```bash
ssh pve 'lspci | grep -i nvidia'
# Example output:
# 81:00.0 VGA compatible controller: NVIDIA Corporation TU102 [TITAN RTX] (rev a1)
# 81:00.1 Audio device: NVIDIA Corporation TU102 High Definition Audio Controller (rev a1)
```
2. **Pass GPU to VM** (include both VGA and Audio):
```bash
ssh pve 'qm set VMID -hostpci0 81:00.0,pcie=1'
# If multi-function device (GPU + Audio), use:
ssh pve 'qm set VMID -hostpci0 81:00,pcie=1'
```
3. **Configure VM for GPU**:
```bash
# Set machine type to q35
ssh pve 'qm set VMID --machine q35'
# Set BIOS to OVMF (UEFI)
ssh pve 'qm set VMID --bios ovmf'
# Add EFI disk
ssh pve 'qm set VMID --efidisk0 nvme-mirror1:1,format=raw,efitype=4m,pre-enrolled-keys=1'  # pre-enrolled-keys=1 enables Secure Boot; use 0 if unsigned guest drivers refuse to load
```
4. **Reboot VM** and install NVIDIA drivers inside the VM
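Inside the guest after the reboot, a minimal driver-install sketch for Ubuntu 22.04 (the driver branch number is illustrative):
```bash
# Install the NVIDIA driver in the guest, reboot, then confirm the GPU is visible
sudo apt update && sudo apt install -y nvidia-driver-535
sudo reboot
# After the reboot:
nvidia-smi   # should list the passed-through GPU
```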
**See**: [GPU-PASSTHROUGH.md](#) (coming soon) for detailed guide
---
## Backup Priority
See [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md) for complete backup plan.
### Critical VMs (Must Backup)
| Priority | VMID | Name | Reason |
|----------|------|------|--------|
| 🔴 **CRITICAL** | 100 | truenas | All storage lives here - catastrophic if lost |
| 🟡 **HIGH** | 101 | saltbox | Complex media stack config |
| 🟡 **HIGH** | 110 | homeassistant | Home automation config |
| 🟡 **HIGH** | 300 | gitea-vm | Git repositories (code, docs) |
| 🟡 **HIGH** | 301 | trading-vm | Trading algorithms and AI models |
### Medium Priority
| VMID | Name | Notes |
|------|------|-------|
| 200 | pihole | Easy to rebuild, but DNS config valuable |
| 202 | traefik | Config files backed up separately |
### Low Priority (Ephemeral/Rebuildable)
| VMID | Name | Notes |
|------|------|-------|
| 105 | fs-dev | Development - code is in Git |
| 111 | lmdev1 | Ephemeral development |
| 201 | copyparty | Simple app, easy to redeploy |
| 206 | docker-host | Docker Compose files backed up separately |
---
## Quick Reference Commands
```bash
# List all VMs
ssh pve 'qm list'
ssh pve2 'qm list'
# List all containers
ssh pve 'pct list'
# Start/stop VM
ssh pve 'qm start VMID'
ssh pve 'qm stop VMID'
ssh pve 'qm shutdown VMID' # Graceful
# Start/stop container
ssh pve 'pct start CTID'
ssh pve 'pct stop CTID'
ssh pve 'pct shutdown CTID' # Graceful
# VM console
ssh pve 'qm terminal VMID'  # Requires a serial console configured on the VM
# Container console
ssh pve 'pct enter CTID'
# Clone VM
ssh pve 'qm clone VMID NEW_VMID --name newvm'
# Delete VM
ssh pve 'qm destroy VMID'
# Delete container
ssh pve 'pct destroy CTID'
```
---
## Related Documentation
- [STORAGE.md](STORAGE.md) - Storage pool assignments
- [SSH-ACCESS.md](SSH-ACCESS.md) - How to access VMs
- [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md) - VM backup strategy
- [POWER-MANAGEMENT.md](POWER-MANAGEMENT.md) - VM resource optimization
- [NETWORK.md](NETWORK.md) - Which bridge to use for new VMs
---
**Last Updated**: 2025-12-22