Complete Phase 2 documentation: Add HARDWARE, SERVICES, MONITORING, MAINTENANCE

Phase 2 documentation implementation:
- Created HARDWARE.md: Complete hardware inventory (servers, GPUs, storage, network cards)
- Created SERVICES.md: Service inventory with URLs, credentials, health checks (25+ services)
- Created MONITORING.md: Health monitoring recommendations, alert setup, implementation plan
- Created MAINTENANCE.md: Regular procedures, update schedules, testing checklists
- Updated README.md: Added all Phase 2 documentation links
- Updated CLAUDE.md: Cleaned up to quick reference only (1340→377 lines)

All detailed content now in specialized documentation files with cross-references.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Author: Hutson
Date: 2025-12-23 00:34:21 -05:00
parent 23e9df68c9
commit 56b82df497
14 changed files with 6328 additions and 1036 deletions

BACKUP-STRATEGY.md Normal file

@@ -0,0 +1,358 @@
# Backup Strategy
## 🚨 Current Status: CRITICAL GAPS IDENTIFIED
This document outlines the backup strategy for the homelab infrastructure. **As of 2025-12-22, there are significant gaps in backup coverage that need to be addressed.**
## Executive Summary
### What We Have ✅
- **Syncthing**: File synchronization across 5+ devices
- **ZFS on TrueNAS**: Copy-on-write filesystem with snapshot capability (not yet configured)
- **Proxmox**: Built-in backup capabilities (not yet configured)
### What We DON'T Have 🚨
- ❌ No documented VM/CT backups
- ❌ No ZFS snapshot schedule
- ❌ No offsite backups
- ❌ No disaster recovery plan
- ❌ No tested restore procedures
- ❌ No configuration backups
**Risk Level**: HIGH - A catastrophic failure could result in significant data loss.
---
## Current State Analysis
### Syncthing (File Synchronization)
**What it is**: Real-time file sync across devices
**What it is NOT**: A backup solution
| Folder | Devices | Size | Protected? |
|--------|---------|------|------------|
| documents | Mac Mini, MacBook, TrueNAS, Windows PC, Phone | 11 GB | ⚠️ Sync only |
| downloads | Mac Mini, TrueNAS | 38 GB | ⚠️ Sync only |
| pictures | Mac Mini, MacBook, TrueNAS, Phone | Unknown | ⚠️ Sync only |
| notes | Mac Mini, MacBook, TrueNAS, Phone | Unknown | ⚠️ Sync only |
| config | Mac Mini, MacBook, TrueNAS | Unknown | ⚠️ Sync only |
**Limitations**:
- ❌ Accidental deletion → deleted everywhere
- ❌ Ransomware/corruption → spreads everywhere
- ❌ No point-in-time recovery
- ❌ No version history (unless file versioning enabled - not documented)
**Verdict**: Syncthing provides redundancy and availability, NOT backup protection.
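**Check versioning status** (a quick sketch using the Syncthing REST API; assumes the same API key and local GUI address used elsewhere in this repo):
```bash
# Print each folder's versioning type (an empty value means no versioning is configured)
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
  "http://127.0.0.1:8384/rest/config/folders" | \
  jq -r '.[] | "\(.label): \(.versioning.type)"'
```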
### ZFS on TrueNAS (Potential Backup Target)
**Current Status**: ❓ Unknown - snapshots may or may not be configured
**Needs Investigation**:
```bash
# Check if snapshots exist
ssh truenas 'zfs list -t snapshot'
# Check if automated snapshots are configured
ssh truenas 'cat /etc/cron.d/zfs-auto-snapshot' || echo "Not configured"
# Check snapshot schedule via TrueNAS API/UI
```
**If configured**, ZFS snapshots provide:
- ✅ Point-in-time recovery
- ✅ Protection against accidental deletion
- ✅ Fast rollback capability
- ⚠️ Still single location (no offsite protection)
### Proxmox VM/CT Backups
**Current Status**: ❓ Unknown - no backup jobs documented
**Needs Investigation**:
```bash
# Check backup configuration
ssh pve 'pvesh get /cluster/backup'
# Check if any backups exist
ssh pve 'ls -lh /var/lib/vz/dump/'
ssh pve2 'ls -lh /var/lib/vz/dump/'
```
**Critical VMs Needing Backup**:
| VM/CT | VMID | Priority | Notes |
|-------|------|----------|-------|
| TrueNAS | 100 | 🔴 CRITICAL | All storage lives here |
| Saltbox | 101 | 🟡 HIGH | Media stack, complex config |
| homeassistant | 110 | 🟡 HIGH | Home automation config |
| gitea-vm | 300 | 🟡 HIGH | Git repositories |
| pihole | 200 | 🟢 MEDIUM | DNS config (easy to rebuild) |
| traefik | 202 | 🟢 MEDIUM | Reverse proxy config |
| trading-vm | 301 | 🟡 HIGH | AI trading platform |
| lmdev1 | 111 | 🟢 LOW | Development (ephemeral) |
---
## Recommended Backup Strategy
### Tier 1: Local Snapshots (IMPLEMENT IMMEDIATELY)
**ZFS Snapshots on TrueNAS**
Schedule automatic snapshots for all datasets:
| Dataset | Frequency | Retention |
|---------|-----------|-----------|
| vault/documents | Every 15 min | 1 hour |
| vault/documents | Hourly | 24 hours |
| vault/documents | Daily | 30 days |
| vault/documents | Weekly | 12 weeks |
| vault/documents | Monthly | 12 months |
**Implementation**:
```bash
# Via TrueNAS UI: Storage → Snapshots → Add
# Or via CLI:
ssh truenas 'zfs snapshot vault/documents@daily-$(date +%Y%m%d)'
```
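For a scripted alternative, a minimal daily-snapshot job might look like the sketch below (TrueNAS's built-in Periodic Snapshot Tasks are the more robust option; this assumes TrueNAS SCALE's GNU userland and should be adapted to the retention table above):
```bash
#!/bin/sh
# Sketch: create a daily snapshot and prune daily snapshots beyond the retention window
DATASET=vault/documents
KEEP=30

zfs snapshot "${DATASET}@daily-$(date +%Y%m%d)"

# List daily snapshots oldest-first and destroy everything except the newest $KEEP
zfs list -H -t snapshot -o name -s creation | \
  grep "^${DATASET}@daily-" | \
  head -n -"$KEEP" | \
  xargs -r -n1 zfs destroy
```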
**Proxmox VM Backups**
Configure weekly backups to local storage:
```bash
# Create backup job via Proxmox UI:
# Datacenter → Backup → Add
# - Schedule: Weekly (Sunday 2 AM)
# - Storage: local-zfs or nvme-mirror1
# - Mode: Snapshot (fast)
# - Retention: 4 backups
```
**Or via CLI**:
```bash
ssh pve 'pvesh create /cluster/backup --schedule "sun 02:00" --storage local-zfs --mode snapshot --prune-backups keep-last=4 --vmid 100,101,110,300'
```
### Tier 2: Offsite Backups (CRITICAL GAP)
**Option A: Cloud Storage (Recommended)**
Use **rclone** or **restic** to sync critical data to cloud:
| Provider | Cost | Pros | Cons |
|----------|------|------|------|
| Backblaze B2 | $6/TB/mo | Cheap, reliable | Egress fees |
| AWS S3 Glacier | $4/TB/mo | Very cheap storage | Slow retrieval |
| Wasabi | $6.99/TB/mo | No egress fees | Minimum 90-day retention |
**Implementation Example (Backblaze B2)**:
```bash
# Install on TrueNAS
ssh truenas 'pkg install rclone restic'
# Configure B2
rclone config # Follow prompts for B2
# Daily backup critical folders
0 3 * * * rclone sync /mnt/vault/documents b2:homelab-backup/documents --transfers 4
```
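If you prefer **restic** over a plain `rclone sync`, it adds encryption, deduplication, and point-in-time restore on top of B2. A minimal sketch (bucket name, key IDs, and repository password are placeholders):
```bash
# Hypothetical restic setup against a B2 bucket
export B2_ACCOUNT_ID="<b2-key-id>"
export B2_ACCOUNT_KEY="<b2-application-key>"
export RESTIC_REPOSITORY="b2:homelab-backup:restic"
export RESTIC_PASSWORD="<repository-password>"

# One-time: initialize the repository
restic init

# Daily (e.g. from cron): back up and prune old snapshots
restic backup /mnt/vault/documents
restic forget --keep-daily 30 --keep-weekly 12 --keep-monthly 12 --prune
```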
**Option B: Offsite TrueNAS Replication**
- Set up second TrueNAS at friend/family member's house
- Use ZFS replication to sync snapshots
- Requires: Static IP or Tailscale, trust
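Under the hood this is `zfs send`/`zfs receive`; TrueNAS replication tasks wrap it, but a manual sketch looks like this (the `offsite-truenas` host, snapshot names, and remote `backup` pool are placeholders):
```bash
# Initial full replication of one dataset to the offsite box
ssh truenas 'zfs snapshot -r vault/documents@offsite-20251222'
ssh truenas 'zfs send -R vault/documents@offsite-20251222 | ssh offsite-truenas zfs receive -F backup/documents'

# Later runs send only the delta between two snapshots
ssh truenas 'zfs send -R -i vault/documents@offsite-20251222 vault/documents@offsite-20260101 | ssh offsite-truenas zfs receive -F backup/documents'
```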
**Option C: USB Drive Rotation**
- Weekly backup to external USB drive
- Rotate 2-3 drives (one always offsite)
- Manual but simple
### Tier 3: Configuration Backups
**Proxmox Configuration**
```bash
# Backup /etc/pve (configs are already in cluster filesystem)
# But also backup to external location:
ssh pve 'tar czf /tmp/pve-config-$(date +%Y%m%d).tar.gz /etc/pve /etc/network/interfaces /etc/systemd/system/*.service'
# Copy to safe location
scp pve:/tmp/pve-config-*.tar.gz ~/Backups/proxmox/
```
**VM-Specific Configs**
- Traefik configs: `/etc/traefik/` on CT 202
- Saltbox configs: `/srv/git/saltbox/` on VM 101
- Home Assistant: `/config/` on VM 110
**Script to backup all configs**:
```bash
#!/bin/bash
# Save as ~/bin/backup-homelab-configs.sh
DATE=$(date +%Y%m%d)
BACKUP_DIR=~/Backups/homelab-configs/$DATE
mkdir -p "$BACKUP_DIR"
# Proxmox configs
ssh pve 'tar czf - /etc/pve /etc/network' > "$BACKUP_DIR/pve-config.tar.gz"
ssh pve2 'tar czf - /etc/pve /etc/network' > "$BACKUP_DIR/pve2-config.tar.gz"
# Traefik
ssh pve 'pct exec 202 -- tar czf - /etc/traefik' > "$BACKUP_DIR/traefik-config.tar.gz"
# Saltbox
ssh saltbox 'tar czf - /srv/git/saltbox' > "$BACKUP_DIR/saltbox-config.tar.gz"
# Home Assistant (note: qm guest exec wraps command output in JSON and limits its size,
# so streaming a tarball this way is unreliable; prefer Home Assistant's own backup feature
# or SSH into the VM directly)
ssh pve 'qm guest exec 110 -- tar czf -' /config > "$BACKUP_DIR/homeassistant-config.tar.gz"
echo "Configs backed up to $BACKUP_DIR"
```
---
## Disaster Recovery Scenarios
### Scenario 1: Single VM Failure
**Impact**: Medium
**Recovery Time**: 30-60 minutes
1. Restore from Proxmox backup:
```bash
ssh pve 'qmrestore /path/to/backup.vma.zst VMID'
```
2. Start VM and verify
3. Update IP if needed
### Scenario 2: TrueNAS Failure
**Impact**: CATASTROPHIC (all storage lost)
**Recovery Time**: Unknown - NO PLAN
**Current State**: 🚨 NO RECOVERY PLAN
**Needed**:
- Offsite backup of critical datasets
- Documented ZFS pool creation steps
- Share configuration export
### Scenario 3: Complete PVE Server Failure
**Impact**: SEVERE
**Recovery Time**: 4-8 hours
**Current State**: ⚠️ PARTIALLY RECOVERABLE
**Needed**:
- VM backups stored on TrueNAS or PVE2
- Proxmox reinstall procedure
- Network config documentation
### Scenario 4: Complete Site Disaster (Fire/Flood)
**Impact**: TOTAL LOSS
**Recovery Time**: Unknown
**Current State**: 🚨 NO RECOVERY PLAN
**Needed**:
- Offsite backups (cloud or physical)
- Critical data prioritization
- Restore procedures
---
## Action Plan
### Immediate (Next 7 Days)
- [ ] **Audit existing backups**: Check if ZFS snapshots or Proxmox backups exist
```bash
ssh truenas 'zfs list -t snapshot'
ssh pve 'ls -lh /var/lib/vz/dump/'
```
- [ ] **Enable ZFS snapshots**: Configure via TrueNAS UI for critical datasets
- [ ] **Configure Proxmox backup jobs**: Weekly backups of critical VMs (100, 101, 110, 300)
- [ ] **Test restore**: Pick one VM, back it up, restore it to verify process works
### Short-term (Next 30 Days)
- [ ] **Set up offsite backup**: Choose provider (Backblaze B2 recommended)
- [ ] **Install backup tools**: rclone or restic on TrueNAS
- [ ] **Configure daily cloud sync**: Critical folders to cloud storage
- [ ] **Document restore procedures**: Step-by-step guides for each scenario
### Long-term (Next 90 Days)
- [ ] **Implement monitoring**: Alerts for backup failures
- [ ] **Quarterly restore test**: Verify backups actually work
- [ ] **Backup rotation policy**: Automate old backup cleanup
- [ ] **Configuration backup automation**: Weekly cron job
---
## Monitoring & Validation
### Backup Health Checks
```bash
# Check last ZFS snapshot
ssh truenas 'zfs list -t snapshot -o name,creation -s creation | tail -5'
# Check Proxmox backup status
ssh pve 'pvesh get /cluster/backup-info/not-backed-up'
# Check cloud sync status (if using rclone)
ssh truenas 'rclone ls b2:homelab-backup | wc -l'
```
### Alerts to Set Up
- Email alert if no snapshot created in 24 hours
- Email alert if Proxmox backup fails
- Email alert if cloud sync fails
- Weekly backup status report
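A minimal cron-able sketch for the first alert (assumes a working local `mail` command; the address matches the examples elsewhere in this repo):
```bash
#!/bin/sh
# Sketch: alert if the newest ZFS snapshot on TrueNAS is older than 24 hours
NEWEST_EPOCH=$(ssh truenas 'zfs list -Hp -t snapshot -o creation -s creation | tail -1')

if [ -z "$NEWEST_EPOCH" ]; then
  echo "No snapshots found on TrueNAS" | mail -s "ALERT: no ZFS snapshots" hutson@example.com
  exit 0
fi

AGE_HOURS=$(( ( $(date +%s) - NEWEST_EPOCH ) / 3600 ))
if [ "$AGE_HOURS" -gt 24 ]; then
  echo "Newest ZFS snapshot is ${AGE_HOURS}h old" | \
    mail -s "ALERT: stale ZFS snapshots" hutson@example.com
fi
```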
---
## Cost Estimate
**Monthly Backup Costs**:
| Component | Cost | Notes |
|-----------|------|-------|
| Local storage (already owned) | $0 | Using existing TrueNAS |
| Proxmox backups (local) | $0 | Using existing storage |
| Cloud backup (1 TB) | $6-10/mo | Backblaze B2 or Wasabi |
| **Total** | **~$10/mo** | Minimal cost for peace of mind |
**One-time**:
- External USB drives (3x 4TB): ~$300 (optional, for rotation backups)
---
## Related Documentation
- [STORAGE.md](STORAGE.md) - ZFS pool layouts and capacity
- [VMS.md](VMS.md) - VM inventory and prioritization
- [DISASTER-RECOVERY.md](#) - Recovery procedures (coming soon)
---
**Last Updated**: 2025-12-22
**Status**: 🚨 CRITICAL GAPS - IMMEDIATE ACTION REQUIRED

CLAUDE.md

File diff suppressed because it is too large

HARDWARE.md Normal file

@@ -0,0 +1,455 @@
# Hardware Inventory
Complete hardware specifications for all homelab equipment.
## Servers
### PVE (10.10.10.120) - Primary Proxmox Server
#### CPU
- **Model**: AMD Ryzen Threadripper PRO 3975WX
- **Cores**: 32 cores / 64 threads
- **Base Clock**: 3.5 GHz
- **Boost Clock**: 4.2 GHz
- **TDP**: 280W
- **Architecture**: Zen 2 (7nm)
- **Socket**: sWRX8 (Threadripper PRO)
- **Features**: ECC support, PCIe 4.0
#### RAM
- **Capacity**: 128 GB
- **Type**: DDR4 ECC Registered
- **Speed**: Unknown (needs investigation)
- **Channels**: 8-channel
- **Idle Power**: ~30-40W
#### Storage
**OS/VM Storage:**
| Pool | Devices | Type | Capacity | Purpose |
|------|---------|------|----------|---------|
| `nvme-mirror1` | 2x Sabrent Rocket Q NVMe | ZFS Mirror | 3.6 TB usable | High-performance VM storage |
| `nvme-mirror2` | 2x Kingston SFYRD 2TB NVMe | ZFS Mirror | 1.8 TB usable | Additional fast VM storage |
| `rpool` | 2x Samsung 870 QVO 4TB SSD | ZFS Mirror | 3.6 TB usable | Proxmox OS, containers, backups |
**Total Storage**: ~9 TB usable
#### GPUs
| Model | Slot | VRAM | TDP | Purpose | Passed To |
|-------|------|------|-----|---------|-----------|
| NVIDIA Quadro P2000 | PCIe slot 1 | 5 GB GDDR5 | 75W | Plex transcoding | Host |
| NVIDIA TITAN RTX | PCIe slot 2 | 24 GB GDDR6 | 280W | AI workloads | Saltbox (101), lmdev1 (111) |
**Total GPU Power**: 75W + 280W = 355W (under load)
#### Network Cards
| Interface | Model | Speed | Purpose | Bridge |
|-----------|-------|-------|---------|--------|
| enp1s0 | Intel I210 (onboard) | 1 Gb | Management | vmbr0 |
| enp35s0f0 | Intel X520 (dual-port SFP+) | 10 Gb | High-speed LXC | vmbr1 |
| enp35s0f1 | Intel X520 (dual-port SFP+) | 10 Gb | High-speed VM | vmbr2 |
**10Gb Transceivers**: Intel FTLX8571D3BCV (SFP+ 10GBASE-SR, 850nm, multimode)
#### Storage Controllers
| Model | Interface | Purpose |
|-------|-----------|---------|
| LSI SAS2308 HBA | PCIe 3.0 x8 | Passed to TrueNAS VM for EMC enclosure |
| Samsung NVMe controller | PCIe | Passed to TrueNAS VM for ZFS caching |
#### Motherboard
- **Model**: Unknown - needs investigation
- **Chipset**: AMD WRX80 (Threadripper PRO platform)
- **Form Factor**: ATX/EATX
- **PCIe Slots**: Multiple PCIe 4.0 slots
- **Features**: IOMMU support, ECC memory
#### Power Supply
- **Model**: Unknown
- **Wattage**: Likely 1000W+ (needs investigation)
- **Type**: ATX, 80+ certification unknown
#### Cooling
- **CPU Cooler**: Unknown - likely large tower or AIO
- **Case Fans**: Unknown quantity
- **Note**: CPU temps 70-80°C under load (healthy)
---
### PVE2 (10.10.10.102) - Secondary Proxmox Server
#### CPU
- **Model**: AMD Ryzen Threadripper PRO 3975WX
- **Specs**: Same as PVE (32C/64T, 280W TDP)
#### RAM
- **Capacity**: 128 GB DDR4 ECC
- **Same specs as PVE**
#### Storage
| Pool | Devices | Type | Capacity | Purpose |
|------|---------|------|----------|---------|
| `nvme-mirror3` | 2x NVMe (model unknown) | ZFS Mirror | Unknown | High-performance VM storage |
| `local-zfs2` | 2x WD Red 6TB HDD | ZFS Mirror | ~6 TB usable | Bulk/archival storage (spins down) |
**HDD Spindown**: Configured for 30-min idle spindown (saves ~10-16W)
#### GPUs
| Model | Slot | VRAM | TDP | Purpose | Passed To |
|-------|------|------|-----|---------|-----------|
| NVIDIA RTX A6000 | PCIe slot 1 | 48 GB GDDR6 | 300W | AI trading workloads | trading-vm (301) |
#### Network Cards
| Interface | Model | Speed | Purpose |
|-----------|-------|-------|---------|
| nic1 | Unknown (onboard) | 1 Gb | Management |
**Note**: MTU set to 9000 for jumbo frames
#### Motherboard
- **Model**: Unknown
- **Chipset**: AMD WRX80 (Threadripper PRO platform)
- **Similar to PVE**
---
## Network Equipment
### UniFi Dream Machine Pro (UCG-Fiber)
- **Model**: UniFi Cloud Gateway Fiber
- **IP**: 10.10.10.1
- **Ports**: Multiple 1Gb + SFP+ uplink
- **Features**: Router, firewall, VPN, IDS/IPS
- **MTU**: 9216 (supports jumbo frames)
- **Tailscale**: Installed for VPN failover
### Switches
**Details needed** - investigate current switch setup:
- 10Gb switch for high-speed connections?
- 1Gb switch for general devices?
- PoE capabilities?
```bash
# Check what's connected to 10Gb interfaces
ssh pve 'ip link show enp35s0f0'
ssh pve 'ip link show enp35s0f1'
```
---
## Storage Hardware
### EMC Storage Enclosure
**See [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md) for complete details**
- **Model**: EMC KTN-STL4 (or similar)
- **Form Factor**: 4U rackmount
- **Drive Bays**: 25x 3.5" SAS/SATA
- **Controllers**: Dual LCC (Link Control Cards)
- **Connection**: SAS via LSI SAS2308 HBA
- **Passed to**: TrueNAS VM (VMID 100)
**Current Status**:
- LCC A: Active (working)
- LCC B: Failed (replacement ordered)
**Drive Inventory**: Unknown - needs audit
```bash
# Get drive list from TrueNAS
ssh truenas 'smartctl --scan'
ssh truenas 'lsblk'
```
### NVMe Drives
| Model | Quantity | Capacity | Location | Pool |
|-------|----------|----------|----------|------|
| Sabrent Rocket Q | 2 | Unknown | PVE | nvme-mirror1 |
| Kingston SFYRD | 2 | 2 TB each | PVE | nvme-mirror2 |
| Unknown model | 2 | Unknown | PVE2 | nvme-mirror3 |
| Samsung (model unknown) | 1 | Unknown | TrueNAS (passed) | ZFS cache |
### SSDs
| Model | Quantity | Capacity | Location | Pool |
|-------|----------|----------|----------|------|
| Samsung 870 QVO | 2 | 4 TB each | PVE | rpool |
### HDDs
| Model | Quantity | Capacity | Location | Pool |
|-------|----------|----------|----------|------|
| WD Red | 2 | 6 TB each | PVE2 | local-zfs2 |
| Unknown (in EMC) | Unknown | Unknown | TrueNAS | vault |
---
## UPS
### Current UPS
| Specification | Value |
|---------------|-------|
| **Model** | CyberPower OR2200PFCRT2U |
| **Capacity** | 2200VA / 1320W |
| **Form Factor** | 2U rackmount |
| **Input** | NEMA 5-15P (rewired from 5-20P) |
| **Outlets** | 2x 5-20R + 6x 5-15R |
| **Output** | PFC Sinewave |
| **Runtime** | ~15-20 min @ 33% load |
| **Interface** | USB (connected to PVE) |
**See [UPS.md](UPS.md) for configuration details**
---
## Client Devices
### Mac Mini (Hutson's Workstation)
- **Model**: Unknown generation
- **CPU**: Unknown
- **RAM**: Unknown
- **Storage**: Unknown
- **Network**: 1Gb Ethernet (en0) - MTU 9000
- **Tailscale IP**: 100.108.89.58
- **Local IP**: 10.10.10.125 (static)
- **Purpose**: Primary workstation, Happy Coder daemon host
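Most of these unknowns can be filled in from the Mac itself (run locally, not over SSH):
```bash
# Model, chip, and memory
system_profiler SPHardwareDataType

# Storage and network interfaces
system_profiler SPStorageDataType
networksetup -listallhardwareports
```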
### MacBook (Mobile)
- **Model**: Unknown
- **Network**: Wi-Fi + Ethernet adapter
- **Tailscale IP**: Unknown
- **Purpose**: Mobile work, development
### Windows PC
- **Model**: Unknown
- **CPU**: Unknown
- **Network**: 1Gb Ethernet
- **IP**: 10.10.10.150
- **Purpose**: Gaming, Windows development, Syncthing node
### Phone (Android)
- **Model**: Unknown
- **IP**: 10.10.10.54 (when on Wi-Fi)
- **Purpose**: Syncthing mobile node, Happy Coder client
---
## Rack Layout (If Applicable)
**Needs documentation** - Current rack configuration unknown
Suggested format:
```
U42: Blank panel
U41: UPS (CyberPower 2U)
U40: UPS (CyberPower 2U)
U39: Switch (10Gb)
U38-U35: EMC Storage Enclosure (4U)
U34: PVE Server
U33: PVE2 Server
...
```
---
## Power Consumption
### Measured Power Draw
| Component | Idle | Typical | Peak | Notes |
|-----------|------|---------|------|-------|
| PVE Server | 250-350W | 500W | 750W | CPU + GPUs + storage |
| PVE2 Server | 200-300W | 400W | 600W | CPU + GPU + storage |
| Network Gear | ~50W | ~50W | ~50W | Router + switches |
| **Total** | **500-700W** | **~950W** | **~1400W** | Exceeds UPS under peak load |
**UPS Capacity**: 1320W
**Typical Load**: 33-50% (safe margin)
**Peak Load**: Can exceed UPS capacity temporarily (acceptable)
### Power Optimizations Applied
**See [POWER-MANAGEMENT.md](POWER-MANAGEMENT.md) for details**
- KSMD disabled: ~60-80W saved
- CPU governors: ~60-120W saved
- Syncthing rescans: ~60-80W saved
- HDD spindown: ~10-16W saved when idle
- **Total savings**: ~150-300W
---
## Thermal Management
### CPU Cooling
**PVE & PVE2**:
- CPU cooler: Unknown model
- Thermal paste: Unknown, likely needs refresh if temps >85°C
- Target temp: 70-80°C under load
- Max safe: 90°C Tctl (Threadripper PRO spec)
### GPU Cooling
All GPUs rely on their stock coolers and manage their own fan curves:
- TITAN RTX: 2-3W idle, 280W load
- RTX A6000: 11W idle, 300W load
- Quadro P2000: 25W constant (Plex active)
### Case Airflow
**Unknown** - needs investigation:
- Case model?
- Fan configuration?
- Positive or negative pressure?
---
## Cable Management
### Network Cables
| Connection | Type | Length | Speed |
|------------|------|--------|-------|
| PVE → Switch (10Gb) | OM3 fiber | Unknown | 10Gb |
| PVE2 → Router | Cat6 | Unknown | 1Gb |
| Mac Mini → Switch | Cat6 | Unknown | 1Gb |
| TrueNAS → EMC | SAS cable | Unknown | 6Gb/s |
### Power Cables
**Critical**: All servers on UPS battery-backed outlets
---
## Maintenance Schedule
### Annual Maintenance
- [ ] Clean dust from servers (every 6-12 months)
- [ ] Check thermal paste on CPUs (every 2-3 years)
- [ ] Test UPS battery runtime (annually)
- [ ] Verify all fans operational
- [ ] Check for bulging capacitors on PSUs
### Drive Health
```bash
# Check SMART status on all drives
ssh pve 'smartctl -a /dev/nvme0'
ssh pve2 'smartctl -a /dev/sda'
ssh truenas 'smartctl --scan | while read dev type; do echo "=== $dev ==="; smartctl -a $dev | grep -E "Model|Serial|Health|Reallocated|Current_Pending"; done'
```
### Temperature Monitoring
```bash
# Check all temps (needs lm-sensors installed)
ssh pve 'sensors'
ssh pve2 'sensors'
```
---
## Warranty & Purchase Info
**Needs documentation**:
- When were servers purchased?
- Where were components bought?
- Any warranties still active?
- Replacement part sources?
---
## Upgrade Path
### Short-term Upgrades (< 6 months)
- [ ] 20A circuit for UPS (restore original 5-20P plug)
- [ ] Document missing hardware specs
- [ ] Label all cables
- [ ] Create rack diagram
### Medium-term Upgrades (6-12 months)
- [ ] Additional 10Gb NIC for PVE2?
- [ ] More NVMe storage?
- [ ] Upgrade network switches?
- [ ] Replace EMC enclosure with newer model?
### Long-term Upgrades (1-2 years)
- [ ] CPU upgrade to newer Threadripper?
- [ ] RAM expansion to 256GB?
- [ ] Additional GPU for AI workloads?
- [ ] Migrate to PCIe 5.0 storage?
---
## Investigation Needed
High-priority items to document:
- [ ] Get exact motherboard model (both servers)
- [ ] Get PSU model and wattage
- [ ] CPU cooler models
- [ ] Network switch models and configuration
- [ ] Complete drive inventory in EMC enclosure
- [ ] RAM speed and timings
- [ ] Case models
- [ ] Exact NVMe models for all drives
**Commands to gather info**:
```bash
# Motherboard
ssh pve 'dmidecode -t baseboard'
# CPU details
ssh pve 'lscpu'
# RAM details
ssh pve 'dmidecode -t memory | grep -E "Size|Speed|Manufacturer"'
# Storage devices
ssh pve 'lsblk -o NAME,SIZE,TYPE,TRAN,MODEL'
# Network cards
ssh pve 'lspci | grep -i network'
# GPU details
ssh pve 'lspci | grep -i vga'
ssh pve 'nvidia-smi -L' # If nvidia-smi available
```
---
## Related Documentation
- [VMS.md](VMS.md) - VM resource allocation
- [STORAGE.md](STORAGE.md) - Storage pools and usage
- [POWER-MANAGEMENT.md](POWER-MANAGEMENT.md) - Power optimizations
- [UPS.md](UPS.md) - UPS configuration
- [NETWORK.md](NETWORK.md) - Network configuration
- [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md) - Storage enclosure details
---
**Last Updated**: 2025-12-22
**Status**: ⚠️ Incomplete - many specs need investigation


@@ -131,6 +131,42 @@ curl -s -H "Authorization: Bearer $HA_TOKEN" \
- **Philips Hue** - Lights
- **Sonos** - Speakers
- **Motion Sensors** - Various locations
- **NUT (Network UPS Tools)** - UPS monitoring (added 2025-12-21)
### NUT / UPS Integration
Monitors the CyberPower OR2200PFCRT2U UPS connected to PVE.
**Connection:**
- Host: 10.10.10.120
- Port: 3493
- Username: upsmon
- Password: upsmon123
**Entities:**
| Entity ID | Description |
|-----------|-------------|
| `sensor.cyberpower_battery_charge` | Battery percentage |
| `sensor.cyberpower_load` | Current load % |
| `sensor.cyberpower_input_voltage` | Input voltage |
| `sensor.cyberpower_output_voltage` | Output voltage |
| `sensor.cyberpower_status` | Status (Online, On Battery, etc.) |
| `sensor.cyberpower_status_data` | Raw status (OL, OB, LB, CHRG) |
**Dashboard Card Example:**
```yaml
type: entities
title: UPS Status
entities:
  - entity: sensor.cyberpower_status
    name: Status
  - entity: sensor.cyberpower_battery_charge
    name: Battery
  - entity: sensor.cyberpower_load
    name: Load
  - entity: sensor.cyberpower_input_voltage
    name: Input Voltage
```
## Automations

MAINTENANCE.md Normal file

@@ -0,0 +1,618 @@
# Maintenance Procedures and Schedules
Regular maintenance procedures for homelab infrastructure to ensure reliability and performance.
## Overview
| Frequency | Tasks | Estimated Time |
|-----------|-------|----------------|
| **Daily** | Quick health check | 2-5 min |
| **Weekly** | Service status, logs review | 15-30 min |
| **Monthly** | Updates, backups verification | 1-2 hours |
| **Quarterly** | Full system audit, testing | 2-4 hours |
| **Annual** | Hardware maintenance, planning | 4-8 hours |
---
## Daily Maintenance (Automated)
### Quick Health Check Script
Save as `~/bin/homelab-health-check.sh`:
```bash
#!/bin/bash
# Daily homelab health check
echo "=== Homelab Health Check ==="
echo "Date: $(date)"
echo ""
echo "=== Server Status ==="
ssh pve 'uptime' 2>/dev/null || echo "PVE: UNREACHABLE"
ssh pve2 'uptime' 2>/dev/null || echo "PVE2: UNREACHABLE"
echo ""
echo "=== CPU Temperatures ==="
ssh pve 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE: $(($(cat $f)/1000))°C"; fi; done'
ssh pve2 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE2: $(($(cat $f)/1000))°C"; fi; done'
echo ""
echo "=== UPS Status ==="
ssh pve 'upsc cyberpower@localhost | grep -E "battery.charge:|battery.runtime:|ups.load:|ups.status:"'
echo ""
echo "=== ZFS Pools ==="
ssh pve 'zpool status -x' 2>/dev/null
ssh pve2 'zpool status -x' 2>/dev/null
ssh truenas 'zpool status -x vault'
echo ""
echo "=== Disk Space ==="
ssh pve 'df -h | grep -E "Filesystem|/dev/(nvme|sd)"'
ssh truenas 'df -h /mnt/vault'
echo ""
echo "=== VM Status ==="
ssh pve 'qm list | grep running | wc -l' | xargs echo "PVE VMs running:"
ssh pve2 'qm list | grep running | wc -l' | xargs echo "PVE2 VMs running:"
echo ""
echo "=== Syncthing Connections ==="
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
"http://127.0.0.1:8384/rest/system/connections" | \
python3 -c "import sys,json; d=json.load(sys.stdin)['connections']; \
[print(f\"{v.get('name',k[:7])}: {'UP' if v['connected'] else 'DOWN'}\") for k,v in d.items()]"
echo ""
echo "=== Check Complete ==="
```
**Run daily via cron**:
```bash
# Add to crontab
0 9 * * * ~/bin/homelab-health-check.sh | mail -s "Homelab Health Check" hutson@example.com
```
---
## Weekly Maintenance
### Service Status Review
**Check all critical services**:
```bash
# Proxmox services
ssh pve 'systemctl status pve-cluster pvedaemon pveproxy'
ssh pve2 'systemctl status pve-cluster pvedaemon pveproxy'
# NUT (UPS monitoring)
ssh pve 'systemctl status nut-server nut-monitor'
ssh pve2 'systemctl status nut-monitor'
# Container services
ssh pve 'pct exec 200 -- systemctl status pihole-FTL' # Pi-hole
ssh pve 'pct exec 202 -- systemctl status traefik' # Traefik
# VM services (via QEMU agent)
ssh pve 'qm guest exec 100 -- bash -c "systemctl status nfs-server smbd"' # TrueNAS
```
### Log Review
**Check for errors in critical logs**:
```bash
# Proxmox system logs
ssh pve 'journalctl -p err -b | tail -50'
ssh pve2 'journalctl -p err -b | tail -50'
# VM logs (if QEMU agent available)
ssh pve 'qm guest exec 100 -- bash -c "journalctl -p err --since today"'
# Traefik access logs
ssh pve 'pct exec 202 -- tail -100 /var/log/traefik/access.log'
```
### Syncthing Sync Status
**Check for sync errors**:
```bash
# Check all folder errors
for folder in documents downloads desktop movies pictures notes config; do
echo "=== $folder ==="
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
"http://127.0.0.1:8384/rest/folder/errors?folder=$folder" | jq
done
```
**See**: [SYNCTHING.md](SYNCTHING.md)
---
## Monthly Maintenance
### System Updates
#### Proxmox Updates
**Check for updates**:
```bash
ssh pve 'apt update && apt list --upgradable'
ssh pve2 'apt update && apt list --upgradable'
```
**Apply updates**:
```bash
# PVE
ssh pve 'apt update && apt dist-upgrade -y'
# PVE2
ssh pve2 'apt update && apt dist-upgrade -y'
# Reboot if kernel updated
ssh pve 'reboot'
ssh pve2 'reboot'
```
**⚠️ Important**:
- Check [Proxmox release notes](https://pve.proxmox.com/wiki/Roadmap) before major updates
- Test on PVE2 first if possible
- Ensure all VMs are backed up before updating
- Monitor VMs after reboot - some may need manual restart
#### Container Updates (LXC)
```bash
# Update all containers
ssh pve 'for ctid in 200 202 205; do pct exec $ctid -- bash -c "apt update && apt upgrade -y"; done'
```
#### VM Updates
**Update VMs individually via SSH**:
```bash
# Ubuntu/Debian VMs
ssh truenas 'apt update && apt upgrade -y'
ssh docker-host 'apt update && apt upgrade -y'
ssh fs-dev 'apt update && apt upgrade -y'
# Check if reboot required
ssh truenas '[ -f /var/run/reboot-required ] && echo "Reboot required"'
```
### ZFS Scrubs
**Schedule**: Run monthly on all pools
**PVE**:
```bash
# Start scrub on all pools
ssh pve 'zpool scrub nvme-mirror1'
ssh pve 'zpool scrub nvme-mirror2'
ssh pve 'zpool scrub rpool'
# Check scrub status
ssh pve 'zpool status | grep -A2 scrub'
```
**PVE2**:
```bash
ssh pve2 'zpool scrub nvme-mirror3'
ssh pve2 'zpool scrub local-zfs2'
ssh pve2 'zpool status | grep -A2 scrub'
```
**TrueNAS**:
```bash
# Scrub via TrueNAS web UI or SSH
ssh truenas 'zpool scrub vault'
ssh truenas 'zpool status vault | grep -A2 scrub'
```
**Automate scrubs**:
```bash
# Add to crontab (run on 1st of month at 2 AM)
0 2 1 * * /sbin/zpool scrub nvme-mirror1
0 2 1 * * /sbin/zpool scrub nvme-mirror2
0 2 1 * * /sbin/zpool scrub rpool
```
**See**: [STORAGE.md](STORAGE.md) for pool details
### SMART Tests
**Run extended SMART tests monthly**:
```bash
# TrueNAS drives (via QEMU agent)
ssh pve 'qm guest exec 100 -- bash -c "smartctl --scan | while read dev type; do smartctl -t long \$dev; done"'
# Check results after 4-8 hours
ssh pve 'qm guest exec 100 -- bash -c "smartctl --scan | while read dev type; do echo \"=== \$dev ===\"; smartctl -a \$dev | grep -E \"Model|Serial|test result|Reallocated|Current_Pending\"; done"'
# PVE drives
ssh pve 'for dev in /dev/nvme0 /dev/nvme1 /dev/sda /dev/sdb; do [ -e "$dev" ] && smartctl -t long $dev; done'
# PVE2 drives
ssh pve2 'for dev in /dev/nvme0 /dev/nvme1 /dev/sda /dev/sdb; do [ -e "$dev" ] && smartctl -t long $dev; done'
```
**Automate SMART tests**:
```bash
# Add to crontab (run on 15th of month at 3 AM)
0 3 15 * * /usr/sbin/smartctl -t long /dev/nvme0
0 3 15 * * /usr/sbin/smartctl -t long /dev/sda
```
### Certificate Renewal Verification
**Check SSL certificate expiry**:
```bash
# Check Traefik certificates
ssh pve 'pct exec 202 -- cat /etc/traefik/acme.json | jq ".letsencrypt.Certificates[] | {domain: .domain.main, expires: .Dates.NotAfter}"'
# Check specific service
echo | openssl s_client -servername git.htsn.io -connect git.htsn.io:443 2>/dev/null | openssl x509 -noout -dates
```
**Certificates should auto-renew 30 days before expiry via Traefik**
**See**: [TRAEFIK.md](TRAEFIK.md) for certificate management
### Backup Verification
**⚠️ TODO**: No backup strategy currently in place
**See**: [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md) for implementation plan
---
## Quarterly Maintenance
### Full System Audit
**Check all systems comprehensively**:
1. **ZFS Pool Health**:
```bash
ssh pve 'zpool status -v'
ssh pve2 'zpool status -v'
ssh truenas 'zpool status -v vault'
```
Look for: errors, degraded vdevs, resilver operations
2. **SMART Health**:
```bash
# Run SMART health check script
~/bin/smart-health-check.sh
```
Look for: reallocated sectors, pending sectors, failures
3. **Disk Space Trends**:
```bash
# Check growth rate
ssh pve 'zpool list -o name,size,allocated,free,fragmentation'
ssh truenas 'df -h /mnt/vault'
```
Plan for expansion if >80% full
4. **VM Resource Usage**:
```bash
# Check if VMs need more/less resources
ssh pve 'qm list'
ssh pve 'pvesh get /nodes/pve/status'
```
5. **Network Performance**:
```bash
# Test bandwidth between critical nodes
iperf3 -s # On one host
iperf3 -c 10.10.10.120 # From another
```
6. **Temperature Monitoring**:
```bash
# Check max temps over past quarter
# TODO: Set up Prometheus/Grafana for historical data
ssh pve 'sensors'
ssh pve2 'sensors'
```
### Service Dependency Testing
**Test critical paths**:
1. **Power failure recovery** (if safe to test):
- See [UPS.md](UPS.md) for full procedure
- Verify VM startup order works
- Confirm all services come back online
2. **Failover testing**:
- Tailscale subnet routing (PVE → UCG-Fiber)
- NUT monitoring (PVE server → PVE2 client)
3. **Backup restoration** (when backups implemented):
- Test restoring a VM from backup
- Test restoring files from Syncthing versioning
### Documentation Review
- [ ] Update IP assignments in [IP-ASSIGNMENTS.md](IP-ASSIGNMENTS.md)
- [ ] Review and update service URLs in [SERVICES.md](SERVICES.md)
- [ ] Check for missing hardware specs in [HARDWARE.md](HARDWARE.md)
- [ ] Update any changed procedures in this document
---
## Annual Maintenance
### Hardware Maintenance
**Physical cleaning**:
```bash
# Shut down servers (coordinate with users)
ssh pve 'shutdown -h now'
ssh pve2 'shutdown -h now'
# Clean dust from:
# - CPU heatsinks
# - GPU fans
# - Case fans
# - PSU vents
# - Storage enclosure fans
# Check for:
# - Bulging capacitors on PSU/motherboard
# - Loose cables
# - Fan noise/vibration
```
**Thermal paste inspection** (every 2-3 years):
- Check CPU temps vs baseline
- If temps >85°C under load, consider reapplying paste
- Threadripper PRO: Tctl max safe = 90°C
**See**: [HARDWARE.md](HARDWARE.md) for component details
### UPS Battery Test
**Runtime test**:
```bash
# Check battery health
ssh pve 'upsc cyberpower@localhost | grep battery'
# Perform runtime test (coordinate power loss)
# 1. Note current runtime estimate
# 2. Unplug UPS from wall
# 3. Let battery drain to 20%
# 4. Note actual runtime vs estimate
# 5. Plug back in before shutdown triggers
# Battery replacement if:
# - Runtime < 10 min at typical load
# - Battery age > 3-5 years
# - Battery charge < 100% when on AC for 24h
```
**See**: [UPS.md](UPS.md) for full UPS details
### Drive Replacement Planning
**Check drive age and health**:
```bash
# Get drive hours and health
ssh truenas 'smartctl --scan | while read dev type; do
echo "=== $dev ===";
smartctl -a $dev | grep -E "Model|Serial|Power_On_Hours|Reallocated|Pending";
done'
```
**Replace drives if**:
- Reallocated sectors > 0
- Pending sectors > 0
- SMART pre-fail warnings
- Age > 5 years for HDDs (3-5 years for SSDs/NVMe)
- Hours > 50,000 for consumer drives
**Budget for replacements**:
- HDDs: WD Red 6TB (~$150/drive)
- NVMe: Samsung/Kingston 2TB (~$150-200/drive)
### Capacity Planning
**Review growth trends**:
```bash
# Storage growth (compare to last year)
ssh pve 'zpool list'
ssh truenas 'df -h /mnt/vault'
# Network bandwidth (if monitoring in place)
# Review Grafana dashboards
# Power consumption
ssh pve 'upsc cyberpower@localhost ups.load'
```
**Plan expansions**:
- Storage: Add drives if >70% full
- RAM: Check if VMs hitting limits
- Network: Upgrade if bandwidth saturation
- UPS: Upgrade if load >80%
### License and Subscription Review
**Proxmox subscription** (if applicable):
- Community (free) or Enterprise subscription?
- Check for updates to pricing/features
**Service subscriptions**:
- Domain registration (htsn.io)
- Cloudflare plan (currently free)
- Let's Encrypt (free, no action needed)
---
## Update Schedules
### Proxmox
| Component | Frequency | Notes |
|-----------|-----------|-------|
| Security patches | Weekly | Via `apt upgrade` |
| Minor updates | Monthly | Test on PVE2 first |
| Major versions | Quarterly | Read release notes, plan downtime |
| Kernel updates | Monthly | Requires reboot |
**Update procedure**:
1. Check [Proxmox release notes](https://pve.proxmox.com/wiki/Roadmap)
2. Backup VM configs: `vzdump --all --dumpdir /tmp` (or list specific VMIDs)
3. Update: `apt update && apt dist-upgrade`
4. Reboot if kernel changed: `reboot`
5. Verify VMs auto-started: `qm list`
### Containers (LXC)
| Container | Update Frequency | Package Manager |
|-----------|------------------|-----------------|
| Pi-hole (200) | Weekly | `apt` |
| Traefik (202) | Monthly | `apt` |
| FindShyt (205) | As needed | `apt` |
**Update command**:
```bash
ssh pve 'pct exec CTID -- bash -c "apt update && apt upgrade -y"'
```
### VMs
| VM | Update Frequency | Notes |
|----|------------------|-------|
| TrueNAS | Monthly | Via web UI or `apt` |
| Saltbox | Weekly | Managed by Saltbox updates |
| HomeAssistant | Monthly | Via HA supervisor |
| Docker-host | Weekly | `apt` + Docker images |
| Trading-VM | As needed | Via SSH |
| Gitea-VM | Monthly | Via web UI + `apt` |
**Docker image updates**:
```bash
ssh docker-host 'docker-compose pull && docker-compose up -d'
```
### Firmware Updates
| Component | Check Frequency | Update Method |
|-----------|----------------|---------------|
| Motherboard BIOS | Annually | Manual flash (high risk) |
| GPU firmware | Rarely | `nvidia-smi` or manual |
| SSD/NVMe firmware | Quarterly | Vendor tools |
| HBA firmware | Annually | LSI tools |
| UPS firmware | Annually | PowerPanel or manual |
**⚠️ Warning**: BIOS/firmware updates carry risk. Only update if:
- Critical security issue
- Needed for hardware compatibility
- Fixing known bug affecting you
---
## Testing Checklists
### Pre-Update Checklist
Before ANY system update:
- [ ] Check current system state: `uptime`, `qm list`, `zpool status`
- [ ] Verify backups are current (when backup system in place)
- [ ] Check for critical VMs/services that can't have downtime
- [ ] Review update changelog/release notes
- [ ] Test on non-critical system first (PVE2 or test VM)
- [ ] Plan rollback strategy if update fails
- [ ] Notify users if downtime expected
### Post-Update Checklist
After system update:
- [ ] Verify system booted correctly: `uptime`
- [ ] Check all VMs/CTs started: `qm list`, `pct list`
- [ ] Test critical services:
- [ ] Pi-hole DNS: `nslookup google.com 10.10.10.10`
- [ ] Traefik routing: `curl -I https://plex.htsn.io`
- [ ] NFS/SMB shares: Test mount from VM
- [ ] Syncthing sync: Check all devices connected
- [ ] Review logs for errors: `journalctl -p err -b`
- [ ] Check temperatures: `sensors`
- [ ] Verify UPS monitoring: `upsc cyberpower@localhost`
### Disaster Recovery Test
**Quarterly test** (when backup system in place):
- [ ] Simulate VM failure: Restore from backup
- [ ] Simulate storage failure: Import pool on different system
- [ ] Simulate network failure: Verify Tailscale failover
- [ ] Simulate power failure: Test UPS shutdown procedure (if safe)
- [ ] Document recovery time and issues
---
## Log Rotation
**System logs** are automatically rotated by systemd-journald and logrotate.
**Check log sizes**:
```bash
# Journalctl size
ssh pve 'journalctl --disk-usage'
# Traefik logs
ssh pve 'pct exec 202 -- du -sh /var/log/traefik/'
```
**Configure retention**:
```bash
# Limit journald to 500MB
ssh pve 'echo "SystemMaxUse=500M" >> /etc/systemd/journald.conf'
ssh pve 'systemctl restart systemd-journald'
```
**Traefik log rotation** (already configured):
```bash
# /etc/logrotate.d/traefik on CT 202
/var/log/traefik/*.log {
daily
rotate 7
compress
delaycompress
missingok
notifempty
}
```
---
## Monitoring Integration
**TODO**: Set up automated monitoring for these procedures
**When monitoring is implemented** (see [MONITORING.md](MONITORING.md)):
- ZFS scrub completion/errors
- SMART test failures
- Certificate expiry warnings (<30 days)
- Update availability notifications
- Disk space thresholds (>80%)
- Temperature warnings (>85°C)
---
## Related Documentation
- [MONITORING.md](MONITORING.md) - Automated health checks and alerts
- [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md) - Backup implementation plan
- [UPS.md](UPS.md) - Power failure procedures
- [STORAGE.md](STORAGE.md) - ZFS pool management
- [HARDWARE.md](HARDWARE.md) - Hardware specifications
- [SERVICES.md](SERVICES.md) - Service inventory
---
**Last Updated**: 2025-12-22
**Status**: ⚠️ Manual procedures only - monitoring automation needed

MONITORING.md Normal file

@@ -0,0 +1,546 @@
# Monitoring and Alerting
Documentation for system monitoring, health checks, and alerting across the homelab.
## Current Monitoring Status
| Component | Monitored? | Method | Alerts | Notes |
|-----------|------------|--------|--------|-------|
| **UPS** | ✅ Yes | NUT + Home Assistant | ❌ No | Battery, load, runtime tracked |
| **Syncthing** | ✅ Partial | API (manual checks) | ❌ No | Connection status available |
| **Server temps** | ✅ Partial | Manual checks | ❌ No | Via `sensors` command |
| **VM status** | ✅ Partial | Proxmox UI | ❌ No | Manual monitoring |
| **ZFS health** | ❌ No | Manual `zpool status` | ❌ No | No automated checks |
| **Disk health (SMART)** | ❌ No | Manual `smartctl` | ❌ No | No automated checks |
| **Network** | ❌ No | - | ❌ No | No uptime monitoring |
| **Services** | ❌ No | - | ❌ No | No health checks |
| **Backups** | ❌ No | - | ❌ No | No verification |
**Overall Status**: ⚠️ **MINIMAL** - Most monitoring is manual, no automated alerts
---
## Existing Monitoring
### UPS Monitoring (NUT)
**Status**: ✅ **Active and working**
**What's monitored**:
- Battery charge percentage
- Runtime remaining (seconds)
- Load percentage
- Input/output voltage
- UPS status (OL/OB/LB)
**Access**:
```bash
# Full UPS status
ssh pve 'upsc cyberpower@localhost'
# Key metrics
ssh pve 'upsc cyberpower@localhost | grep -E "battery.charge:|battery.runtime:|ups.load:|ups.status:"'
```
**Home Assistant Integration**:
- Sensors: `sensor.cyberpower_*`
- Can be used for automation/alerts
- Currently: No alerts configured
**See**: [UPS.md](UPS.md)
---
### Syncthing Monitoring
**Status**: ⚠️ **Partial** - API available, no automated monitoring
**What's available**:
- Device connection status
- Folder sync status
- Sync errors
- Bandwidth usage
**Manual Checks**:
```bash
# Check connections (Mac Mini)
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
"http://127.0.0.1:8384/rest/system/connections" | \
python3 -c "import sys,json; d=json.load(sys.stdin)['connections']; \
[print(f\"{v.get('name',k[:7])}: {'UP' if v['connected'] else 'DOWN'}\") for k,v in d.items()]"
# Check folder status
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
"http://127.0.0.1:8384/rest/db/status?folder=documents" | jq
# Check errors
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
"http://127.0.0.1:8384/rest/folder/errors?folder=documents" | jq
```
**Needs**: Automated monitoring script + alerts
**See**: [SYNCTHING.md](SYNCTHING.md)
---
### Temperature Monitoring
**Status**: ⚠️ **Manual only**
**Current Method**:
```bash
# CPU temperature (Threadripper Tctl)
ssh pve 'for f in /sys/class/hwmon/hwmon*/temp*_input; do \
label=$(cat ${f%_input}_label 2>/dev/null); \
if [ "$label" = "Tctl" ]; then echo "PVE Tctl: $(($(cat $f)/1000))°C"; fi; done'
ssh pve2 'for f in /sys/class/hwmon/hwmon*/temp*_input; do \
label=$(cat ${f%_input}_label 2>/dev/null); \
if [ "$label" = "Tctl" ]; then echo "PVE2 Tctl: $(($(cat $f)/1000))°C"; fi; done'
```
**Thresholds**:
- Healthy: 70-80°C under load
- Warning: >85°C
- Critical: >90°C (throttling)
**Needs**: Automated monitoring + alert if >85°C
---
### Proxmox VM Monitoring
**Status**: ⚠️ **Manual only**
**Current Access**:
- Proxmox Web UI: Node → Summary
- CLI: `ssh pve 'qm list'`
**Metrics Available** (via Proxmox):
- CPU usage per VM
- RAM usage per VM
- Disk I/O
- Network I/O
- VM uptime
**Needs**: API-based monitoring + alerts for VM down
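A minimal sketch of such a check (using `qm status`; the VMID list mirrors the critical VMs in [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md), and the mail address is a placeholder):
```bash
#!/bin/bash
# Sketch: alert if any critical VM on PVE is not running
CRITICAL_VMIDS="100 101 110 300"

for vmid in $CRITICAL_VMIDS; do
  status=$(ssh pve "qm status $vmid" 2>/dev/null | awk '{print $2}')
  if [ "$status" != "running" ]; then
    echo "VM $vmid is ${status:-unreachable}" | mail -s "ALERT: VM $vmid not running" hutson@example.com
  fi
done
```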
---
## Recommended Monitoring Stack
### Option 1: Prometheus + Grafana (Recommended)
**Why**:
- Industry standard
- Extensive integrations
- Beautiful dashboards
- Flexible alerting
**Architecture**:
```
Grafana (dashboard) → Prometheus (metrics DB) → Exporters (data collection)
Alertmanager (alerts)
```
**Required Exporters**:
| Exporter | Monitors | Install On |
|----------|----------|------------|
| node_exporter | CPU, RAM, disk, network | PVE, PVE2, TrueNAS, all VMs |
| zfs_exporter | ZFS pool health | PVE, PVE2, TrueNAS |
| smartmon_exporter | Drive SMART data | PVE, PVE2, TrueNAS |
| nut_exporter | UPS metrics | PVE |
| proxmox_exporter | VM/CT stats | PVE, PVE2 |
| cadvisor | Docker containers | Saltbox, docker-host |
**Deployment**:
```bash
# Create monitoring VM
ssh pve 'qm create 210 --name monitoring --memory 4096 --cores 2 \
--net0 virtio,bridge=vmbr0'
# Install Prometheus + Grafana (via Docker)
# /opt/monitoring/docker-compose.yml
```
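A minimal compose sketch for that VM (images, paths, and scrape targets are illustrative; add the exporters from the table above as they are deployed):
```bash
# On the monitoring VM (hypothetical layout under /opt/monitoring)
mkdir -p /opt/monitoring && cd /opt/monitoring

# Minimal Prometheus scrape config (node_exporter default port is 9100)
cat > prometheus.yml << 'EOF'
global:
  scrape_interval: 30s
scrape_configs:
  - job_name: node
    static_configs:
      - targets: ['10.10.10.120:9100', '10.10.10.102:9100']
EOF

cat > docker-compose.yml << 'EOF'
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prom-data:/prometheus
    restart: unless-stopped
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana
    restart: unless-stopped
volumes:
  prom-data:
  grafana-data:
EOF

docker compose up -d
```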
**Estimated Setup Time**: 4-6 hours
---
### Option 2: Uptime Kuma (Simpler Alternative)
**Why**:
- Lightweight
- Easy to set up
- Web-based dashboard
- Built-in alerts (email, Slack, etc.)
**What it monitors**:
- HTTP/HTTPS endpoints
- Ping (ICMP)
- Ports (TCP)
- Docker containers
**Deployment**:
```bash
ssh docker-host 'mkdir -p /opt/uptime-kuma'
# Create the compose file on docker-host and start the container
ssh docker-host 'cat > /opt/uptime-kuma/docker-compose.yml << "EOF"
version: "3.8"
services:
  uptime-kuma:
    image: louislam/uptime-kuma:latest
    ports:
      - "3001:3001"
    volumes:
      - ./data:/app/data
    restart: unless-stopped
EOF'
ssh docker-host 'cd /opt/uptime-kuma && docker compose up -d'
# Access: http://10.10.10.206:3001
# Add Traefik config for uptime.htsn.io
```
**Estimated Setup Time**: 1-2 hours
---
### Option 3: Netdata (Real-time Monitoring)
**Why**:
- Real-time metrics (1-second granularity)
- Auto-discovers services
- Low overhead
- Beautiful web UI
**Deployment**:
```bash
# Install on each server
ssh pve 'bash <(curl -Ss https://my-netdata.io/kickstart.sh)'
ssh pve2 'bash <(curl -Ss https://my-netdata.io/kickstart.sh)'
# Access:
# http://10.10.10.120:19999 (PVE)
# http://10.10.10.102:19999 (PVE2)
```
**Parent-Child Setup** (optional):
- Configure PVE as parent
- Stream metrics from PVE2 → PVE
- Single dashboard for both servers
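A sketch of the streaming config (the API key is a placeholder UUID; config paths may differ depending on how the kickstart script installed Netdata):
```bash
API_KEY="11111111-2222-3333-4444-555555555555"

# On PVE2 (child): stream metrics to PVE
ssh pve2 "cat > /etc/netdata/stream.conf << EOF
[stream]
    enabled = yes
    destination = 10.10.10.120:19999
    api key = $API_KEY
EOF
systemctl restart netdata"

# On PVE (parent): accept streams for that API key
ssh pve "cat > /etc/netdata/stream.conf << EOF
[$API_KEY]
    enabled = yes
EOF
systemctl restart netdata"
```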
**Estimated Setup Time**: 1 hour
---
## Critical Metrics to Monitor
### Server Health
| Metric | Threshold | Action |
|--------|-----------|--------|
| **CPU usage** | >90% for 5 min | Alert |
| **CPU temp** | >85°C | Alert |
| **CPU temp** | >90°C | Critical alert |
| **RAM usage** | >95% | Alert |
| **Disk space** | >80% | Warning |
| **Disk space** | >90% | Alert |
| **Load average** | >CPU count | Alert |
### Storage Health
| Metric | Threshold | Action |
|--------|-----------|--------|
| **ZFS pool errors** | >0 | Alert immediately |
| **ZFS pool degraded** | Any degraded vdev | Critical alert |
| **ZFS scrub failed** | Last scrub error | Alert |
| **SMART reallocated sectors** | >0 | Warning |
| **SMART pending sectors** | >0 | Alert |
| **SMART failure** | Pre-fail | Critical - replace drive |
### UPS
| Metric | Threshold | Action |
|--------|-----------|--------|
| **Battery charge** | <20% | Warning |
| **Battery charge** | <10% | Alert |
| **On battery** | >5 min | Alert |
| **Runtime** | <5 min | Critical |
### Network
| Metric | Threshold | Action |
|--------|-----------|--------|
| **Device unreachable** | >2 min down | Alert |
| **High packet loss** | >5% | Warning |
| **Bandwidth saturation** | >90% | Warning |
### VMs/Services
| Metric | Threshold | Action |
|--------|-----------|--------|
| **VM stopped** | Critical VM down | Alert immediately |
| **Service unreachable** | HTTP 5xx or timeout | Alert |
| **Backup failed** | Any backup failure | Alert |
| **Certificate expiry** | <30 days | Warning |
| **Certificate expiry** | <7 days | Alert |
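With the Prometheus stack from Option 1, these thresholds translate into alert rules; one example for the disk-space rows (standard node_exporter metric names; reference the file under `rule_files:` in `prometheus.yml`):
```bash
cat > /opt/monitoring/alert-rules.yml << 'EOF'
groups:
  - name: homelab
    rules:
      - alert: DiskSpaceHigh
        expr: (1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 > 80
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Filesystem on {{ $labels.instance }} is over 80% full"
EOF
```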
---
## Alert Destinations
### Email Alerts
**Recommended**: Set up SMTP relay for email alerts
**Options**:
1. Gmail SMTP (free, rate-limited)
2. SendGrid (free tier: 100 emails/day)
3. Mailgun (free tier available)
4. Self-hosted mail server (complex)
**Configuration Example** (Prometheus Alertmanager):
```yaml
# /etc/alertmanager/alertmanager.yml
receivers:
  - name: 'email'
    email_configs:
      - to: 'hutson@example.com'
        from: 'alerts@htsn.io'
        smarthost: 'smtp.gmail.com:587'
        auth_username: 'alerts@htsn.io'
        auth_password: 'app-password-here'
```
---
### Push Notifications
**Options**:
- **Pushover**: $5 one-time, reliable
- **Pushbullet**: Free tier available
- **Telegram Bot**: Free
- **Discord Webhook**: Free
- **Slack**: Free tier available
**Recommended**: Pushover or Telegram for mobile alerts
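For example, a Telegram alert is a single HTTP call once a bot is created via @BotFather (token and chat ID below are placeholders):
```bash
TELEGRAM_TOKEN="123456789:replace-with-bot-token"
TELEGRAM_CHAT_ID="123456789"

curl -s "https://api.telegram.org/bot${TELEGRAM_TOKEN}/sendMessage" \
  -d chat_id="${TELEGRAM_CHAT_ID}" \
  -d text="⚠️ Homelab alert: test message"
```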
---
### Home Assistant Alerts
Since Home Assistant is already running, use it for alerts:
**Automation Example**:
```yaml
automation:
  - alias: "UPS Low Battery Alert"
    trigger:
      - platform: numeric_state
        entity_id: sensor.cyberpower_battery_charge
        below: 20
    action:
      - service: notify.mobile_app
        data:
          message: "⚠️ UPS battery at {{ states('sensor.cyberpower_battery_charge') }}%"
  - alias: "Server High Temperature"
    trigger:
      - platform: template
        value_template: "{{ states('sensor.pve_cpu_temp') | float(0) > 85 }}"
    action:
      - service: notify.mobile_app
        data:
          message: "🔥 PVE CPU temperature: {{ states('sensor.pve_cpu_temp') }}°C"
```
**Needs**: Sensors for CPU temp, disk space, etc. in Home Assistant
---
## Monitoring Scripts
### Daily Health Check
Save as `~/bin/homelab-health-check.sh`:
```bash
#!/bin/bash
# Daily homelab health check
echo "=== Homelab Health Check ==="
echo "Date: $(date)"
echo ""
echo "=== Server Status ==="
ssh pve 'uptime' 2>/dev/null || echo "PVE: UNREACHABLE"
ssh pve2 'uptime' 2>/dev/null || echo "PVE2: UNREACHABLE"
echo ""
echo "=== CPU Temperatures ==="
ssh pve 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE: $(($(cat $f)/1000))°C"; fi; done'
ssh pve2 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE2: $(($(cat $f)/1000))°C"; fi; done'
echo ""
echo "=== UPS Status ==="
ssh pve 'upsc cyberpower@localhost | grep -E "battery.charge:|battery.runtime:|ups.load:|ups.status:"'
echo ""
echo "=== ZFS Pools ==="
ssh pve 'zpool status -x' 2>/dev/null
ssh pve2 'zpool status -x' 2>/dev/null
ssh truenas 'zpool status -x vault'
echo ""
echo "=== Disk Space ==="
ssh pve 'df -h | grep -E "Filesystem|/dev/(nvme|sd)"'
ssh truenas 'df -h /mnt/vault'
echo ""
echo "=== VM Status ==="
ssh pve 'qm list | grep running | wc -l' | xargs echo "PVE VMs running:"
ssh pve2 'qm list | grep running | wc -l' | xargs echo "PVE2 VMs running:"
echo ""
echo "=== Syncthing Connections ==="
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
"http://127.0.0.1:8384/rest/system/connections" | \
python3 -c "import sys,json; d=json.load(sys.stdin)['connections']; \
[print(f\"{v.get('name',k[:7])}: {'UP' if v['connected'] else 'DOWN'}\") for k,v in d.items()]"
echo ""
echo "=== Check Complete ==="
```
**Run daily**:
```cron
0 9 * * * ~/bin/homelab-health-check.sh | mail -s "Homelab Health Check" hutson@example.com
```
---
### ZFS Scrub Checker
```bash
#!/bin/bash
# Check last ZFS scrub status
echo "=== ZFS Scrub Status ==="
for host in pve pve2; do
echo "--- $host ---"
ssh $host 'zpool status | grep -A1 scrub'
echo ""
done
echo "--- TrueNAS ---"
ssh truenas 'zpool status vault | grep -A1 scrub'
```
---
### SMART Health Checker
```bash
#!/bin/bash
# Check SMART health on all drives
echo "=== SMART Health Check ==="
echo "--- TrueNAS Drives ---"
ssh truenas 'smartctl --scan | while read dev type; do
echo "=== $dev ===";
smartctl -H $dev | grep -E "SMART overall|PASSED|FAILED";
done'
echo "--- PVE Drives ---"
ssh pve 'for dev in /dev/nvme* /dev/sd*; do
[ -e "$dev" ] && echo "=== $dev ===" && smartctl -H $dev | grep -E "SMART|PASSED|FAILED";
done'
```
---
## Dashboard Recommendations
### Grafana Dashboard Layout
**Page 1: Overview**
- Server uptime
- CPU usage (all servers)
- RAM usage (all servers)
- Disk space (all pools)
- Network traffic
- UPS status
**Page 2: Storage**
- ZFS pool health
- SMART status for all drives
- I/O latency
- Scrub progress
- Disk temperatures
**Page 3: VMs**
- VM status (up/down)
- VM resource usage
- VM disk I/O
- VM network traffic
**Page 4: Services**
- Service health checks
- HTTP response times
- Certificate expiry dates
- Syncthing sync status
---
## Implementation Plan
### Phase 1: Basic Monitoring (Week 1)
- [ ] Install Uptime Kuma or Netdata
- [ ] Add HTTP checks for all services
- [ ] Configure UPS alerts in Home Assistant
- [ ] Set up daily health check email
**Estimated Time**: 4-6 hours
---
### Phase 2: Advanced Monitoring (Week 2-3)
- [ ] Install Prometheus + Grafana
- [ ] Deploy node_exporter on all servers
- [ ] Deploy zfs_exporter
- [ ] Deploy smartmon_exporter
- [ ] Create Grafana dashboards
**Estimated Time**: 8-12 hours
---
### Phase 3: Alerting (Week 4)
- [ ] Configure Alertmanager
- [ ] Set up email/push notifications
- [ ] Create alert rules for all critical metrics
- [ ] Test all alert paths
- [ ] Document alert procedures
**Estimated Time**: 4-6 hours
---
## Related Documentation
- [UPS.md](UPS.md) - UPS monitoring details
- [STORAGE.md](STORAGE.md) - ZFS health checks
- [SERVICES.md](SERVICES.md) - Service inventory
- [HOMEASSISTANT.md](HOMEASSISTANT.md) - Home Assistant automations
- [MAINTENANCE.md](MAINTENANCE.md) - Regular maintenance checks
---
**Last Updated**: 2025-12-22
**Status**: ⚠️ **Minimal monitoring currently in place - implementation needed**

POWER-MANAGEMENT.md Normal file

@@ -0,0 +1,509 @@
# Power Management and Optimization
Documentation of power optimizations applied to reduce idle power consumption and heat generation.
## Overview
Combined estimated power draw: **~500-700W idle**, **~1000-1350W under typical load** (theoretical peak is higher; see the per-server tables below)
Through various optimizations, we've reduced idle power consumption by approximately **150-250W** compared to default settings.
---
## Power Draw Estimates
### PVE (10.10.10.120)
| Component | Idle | Load | TDP |
|-----------|------|------|-----|
| Threadripper PRO 3975WX | 150-200W | 400-500W | 280W |
| NVIDIA TITAN RTX | 2-3W | 250W | 280W |
| NVIDIA Quadro P2000 | 25W | 70W | 75W |
| RAM (128 GB DDR4) | 30-40W | 30-40W | - |
| Storage (NVMe + SSD) | 20-30W | 40-50W | - |
| HBAs, fans, misc | 20-30W | 20-30W | - |
| **Total** | **250-350W** | **800-940W** | - |
### PVE2 (10.10.10.102)
| Component | Idle | Load | TDP |
|-----------|------|------|-----|
| Threadripper PRO 3975WX | 150-200W | 400-500W | 280W |
| NVIDIA RTX A6000 | 11W | 280W | 300W |
| RAM (128 GB DDR4) | 30-40W | 30-40W | - |
| Storage (NVMe + HDD) | 20-30W | 40-50W | - |
| Fans, misc | 15-20W | 15-20W | - |
| **Total** | **226-330W** | **765-890W** | - |
### Combined
| Metric | Idle | Load |
|--------|------|------|
| Servers | 476-680W | 1565-1830W |
| Network gear | ~50W | ~50W |
| **Total** | **~530-730W** | **~1615-1880W** |
| **UPS Load** | 40-55% | 120-140% ⚠️ |
**Note**: UPS capacity is 1320W. Under heavy load, servers can exceed UPS capacity, which is acceptable since high load is rare.
---
## Optimizations Applied
### 1. KSMD Disabled (2024-12-17)
**KSM** (Kernel Same-page Merging) scans memory to deduplicate identical pages across VMs.
**Problem**:
- KSMD was consuming 44-57% CPU continuously on PVE
- Caused CPU temp to rise from 74°C to 83°C
- **Net loss**: more power was spent scanning than was saved through deduplication
**Solution**: Disabled KSM permanently
**Configuration**:
**Systemd service**: `/etc/systemd/system/disable-ksm.service`
```ini
[Unit]
Description=Disable KSM (Kernel Same-page Merging)
After=multi-user.target
[Service]
Type=oneshot
ExecStart=/bin/sh -c 'echo 0 > /sys/kernel/mm/ksm/run'
RemainAfterExit=yes
[Install]
WantedBy=multi-user.target
```
**Enable and start**:
```bash
systemctl daemon-reload
systemctl enable --now disable-ksm
systemctl mask ksmtuned # Prevent re-enabling
```
**Verify**:
```bash
# KSM should be disabled (run=0)
cat /sys/kernel/mm/ksm/run # Should output: 0
# ksmd should show 0% CPU
ps aux | grep ksmd
```
**Savings**: ~60-80W, plus avoiding the KSM-induced temperature rise (74°C → 83°C)
**⚠️ Important**: Proxmox updates sometimes re-enable KSM. If CPU is unexpectedly hot, check:
```bash
cat /sys/kernel/mm/ksm/run
# If 1, disable it:
echo 0 > /sys/kernel/mm/ksm/run
systemctl mask ksmtuned
```
---
### 2. CPU Governor Optimization (2024-12-16)
Default CPU governor keeps cores at max frequency even when idle, wasting power.
#### PVE: `amd-pstate-epp` Driver
**Driver**: `amd-pstate-epp` (modern AMD P-state driver)
**Governor**: `powersave`
**EPP**: `balance_power`
**Configuration**:
**Systemd service**: `/etc/systemd/system/cpu-powersave.service`
```ini
[Unit]
Description=Set CPU governor to powersave with balance_power EPP
After=multi-user.target
[Service]
Type=oneshot
ExecStart=/bin/sh -c 'for cpu in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo powersave > $cpu; done'
ExecStart=/bin/sh -c 'for cpu in /sys/devices/system/cpu/cpu*/cpufreq/energy_performance_preference; do echo balance_power > $cpu; done'
RemainAfterExit=yes
[Install]
WantedBy=multi-user.target
```
**Enable**:
```bash
systemctl daemon-reload
systemctl enable --now cpu-powersave
```
**Verify**:
```bash
# Check governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# Output: powersave
# Check EPP
cat /sys/devices/system/cpu/cpu0/cpufreq/energy_performance_preference
# Output: balance_power
# Check current frequency (should be low when idle)
grep MHz /proc/cpuinfo | head -5
# Should show ~1700-2200 MHz idle, up to 4000 MHz under load
```
#### PVE2: `acpi-cpufreq` Driver
**Driver**: `acpi-cpufreq` (older ACPI driver)
**Governor**: `schedutil` (adaptive, better than powersave for this driver)
**Configuration**:
**Systemd service**: `/etc/systemd/system/cpu-powersave.service`
```ini
[Unit]
Description=Set CPU governor to schedutil
After=multi-user.target
[Service]
Type=oneshot
ExecStart=/bin/sh -c 'for cpu in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo schedutil > $cpu; done'
RemainAfterExit=yes
[Install]
WantedBy=multi-user.target
```
**Enable**:
```bash
systemctl daemon-reload
systemctl enable --now cpu-powersave
```
**Verify**:
```bash
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# Output: schedutil
grep MHz /proc/cpuinfo | head -5
# Should show ~1700-2200 MHz idle
```
**Savings**: ~60-120W combined (CPUs now idle at 1.7-2.2 GHz instead of 4 GHz)
**Performance impact**: Minimal - CPU still boosts to max frequency under load
---
### 3. GPU Power States (2024-12-16)
GPUs automatically enter low-power states when idle. Verified optimal.
| GPU | Location | Idle Power | P-State | Notes |
|-----|----------|------------|---------|-------|
| RTX A6000 | PVE2 | 11W | P8 | Excellent idle power |
| TITAN RTX | PVE | 2-3W | P8 | Excellent idle power |
| Quadro P2000 | PVE | 25W | P0 | Plex keeps it active |
**Check GPU power state**:
```bash
# Via nvidia-smi (if installed in VM)
ssh lmdev1 'nvidia-smi --query-gpu=name,power.draw,pstate --format=csv'
# Expected output:
# name, power.draw [W], pstate
# NVIDIA TITAN RTX, 2.50 W, P8
# Via lspci (from the Proxmox host - only confirms the GPUs are present; use nvidia-smi for power)
ssh pve 'lspci | grep -i nvidia'
```
**P-States**:
- **P0**: Maximum performance
- **P8**: Minimum power (idle)
**No action needed** - GPUs automatically manage power states.
**Savings**: N/A (already optimal)
---
### 4. Syncthing Rescan Intervals (2024-12-16)
Aggressive 60-second rescans were keeping TrueNAS VM at 86% CPU constantly.
**Changed**:
- Large folders: 60s → **3600s** (1 hour)
- Affected: downloads (38GB), documents (11GB), desktop (7.2GB), movies, pictures, notes, config
**Configuration**: Via Syncthing UI on each device
- Settings → Folders → [Folder Name] → Advanced → Rescan Interval
**Savings**: ~60-80W (TrueNAS CPU usage dropped from 86% to <10%)
**Trade-off**: Changes take up to 1 hour to detect instead of 1 minute
- Still acceptable for most use cases
- Manual rescan available if needed: `curl -X POST "http://localhost:8384/rest/db/scan?folder=FOLDER" -H "X-API-Key: API_KEY"`
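If you prefer scripting the interval change over clicking through the UI, the folder config can also be updated through the same REST API (a hedged sketch — assumes a Syncthing version with the `/rest/config/folders` endpoints; `FOLDER_ID` and `API_KEY` are placeholders, same as the manual-rescan example above):
```bash
# Set a folder's rescan interval to 1 hour via the Syncthing config API
curl -X PATCH "http://localhost:8384/rest/config/folders/FOLDER_ID" \
  -H "X-API-Key: API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"rescanIntervalS": 3600}'
```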
---
### 5. ksmtuned Disabled (2024-12-16)
**ksmtuned** is the daemon that tunes KSM parameters. Even with KSM disabled, the tuning daemon was still running.
**Solution**: Stopped and disabled on both servers
```bash
systemctl stop ksmtuned
systemctl disable ksmtuned
systemctl mask ksmtuned # Prevent re-enabling
```
**Savings**: ~2-5W
---
### 6. HDD Spindown on PVE2 (2024-12-16)
**Problem**: `local-zfs2` pool (2x WD Red 6TB HDD) had only 768 KB used but drives spinning 24/7
**Solution**: Configure 30-minute spindown timeout
**Udev rule**: `/etc/udev/rules.d/69-hdd-spindown.rules`
```udev
# Spin down WD Red 6TB drives after 30 minutes idle
ACTION=="add|change", KERNEL=="sd[a-z]", ATTRS{model}=="WDC WD60EFRX-68L*", RUN+="/sbin/hdparm -S 241 /dev/%k"
```
**hdparm value**: 241 = 30 minutes
- Values 1-240: `value * 5 seconds = timeout` (e.g. 120 = 10 minutes)
- Values 241-251: `(value - 240) * 30 minutes = timeout`, so 241 = 30 minutes, 242 = 1 hour
**Apply rule**:
```bash
udevadm control --reload-rules
udevadm trigger
# Verify drives have spindown set
hdparm -I /dev/sda | grep -i standby
hdparm -I /dev/sdb | grep -i standby
```
**Check if drives are spun down**:
```bash
hdparm -C /dev/sda
# Output: drive state is: standby (spun down)
# or: drive state is: active/idle (spinning)
```
**Savings**: ~10-16W when spun down (8W per drive)
**Trade-off**: 5-10 second delay when accessing pool after spindown
---
## Potential Optimizations (Not Yet Applied)
### PCIe ASPM (Active State Power Management)
**Benefit**: Reduce power of idle PCIe devices
**Risk**: May cause stability issues with some devices
**Estimated savings**: 5-15W
**Test**:
```bash
# Check current ASPM state
lspci -vv | grep -i aspm
# Enable ASPM (test first)
# Add to kernel cmdline: pcie_aspm=force
# Edit /etc/default/grub:
GRUB_CMDLINE_LINUX_DEFAULT="quiet pcie_aspm=force"
# Update grub
update-grub
reboot
```
### NMI Watchdog Disable
**Benefit**: Reduce CPU wakeups
**Risk**: Harder to debug kernel hangs
**Estimated savings**: 1-3W
**Test**:
```bash
# Disable NMI watchdog
echo 0 > /proc/sys/kernel/nmi_watchdog
# Make permanent (add to kernel cmdline)
# Edit /etc/default/grub:
GRUB_CMDLINE_LINUX_DEFAULT="quiet nmi_watchdog=0"
update-grub
reboot
```
---
## Monitoring
### CPU Frequency
```bash
# Current frequency on all cores
ssh pve 'grep MHz /proc/cpuinfo | head -10'
# Governor
ssh pve 'cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor'
# Available governors
ssh pve 'cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors'
```
### CPU Temperature
```bash
# PVE
ssh pve 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE Tctl: $(($(cat $f)/1000))°C"; fi; done'
# PVE2
ssh pve2 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE2 Tctl: $(($(cat $f)/1000))°C"; fi; done'
```
**Healthy temps**: 70-80°C under load
**Warning**: >85°C
**Throttle**: 90°C (Tctl max for Threadripper PRO)
### GPU Power Draw
```bash
# If nvidia-smi installed in VM
ssh lmdev1 'nvidia-smi --query-gpu=name,power.draw,power.limit,pstate --format=csv'
# Sample output:
# name, power.draw [W], power.limit [W], pstate
# NVIDIA TITAN RTX, 2.50 W, 280.00 W, P8
```
### Power Consumption (UPS)
```bash
# Check UPS load percentage
ssh pve 'upsc cyberpower@localhost ups.load'
# Battery runtime (seconds)
ssh pve 'upsc cyberpower@localhost battery.runtime'
# Full UPS status
ssh pve 'upsc cyberpower@localhost'
```
See [UPS.md](UPS.md) for more UPS monitoring details.
### ZFS ARC Memory Usage
```bash
# PVE
ssh pve 'arc_summary | grep -A5 "ARC size"'
# TrueNAS
ssh truenas 'arc_summary | grep -A5 "ARC size"'
```
**ARC** (Adaptive Replacement Cache) uses RAM for ZFS caching. Adjust if needed:
```bash
# Limit ARC to 32 GB (example)
# Edit /etc/modprobe.d/zfs.conf:
options zfs zfs_arc_max=34359738368
# Apply (reboot required)
update-initramfs -u
reboot
```
---
## Troubleshooting
### CPU Not Downclocking
```bash
# Check current governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# Should be: powersave (PVE) or schedutil (PVE2)
# If not, systemd service may have failed
# Check service status
systemctl status cpu-powersave
# Manually set governor (temporary)
echo powersave | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
# Check frequency
grep MHz /proc/cpuinfo | head -5
```
### High Idle Power After Update
**Common causes**:
1. **KSM re-enabled** after Proxmox update
- Check: `cat /sys/kernel/mm/ksm/run`
- Fix: `echo 0 > /sys/kernel/mm/ksm/run && systemctl mask ksmtuned`
2. **CPU governor reset** to default
- Check: `cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor`
- Fix: `systemctl restart cpu-powersave`
3. **GPU stuck in high-performance mode**
- Check: `nvidia-smi --query-gpu=pstate --format=csv`
- Fix: Restart VM or power cycle GPU
### HDDs Won't Spin Down
```bash
# Check spindown setting
hdparm -I /dev/sda | grep -i standby
# Set spindown manually (temporary)
hdparm -S 241 /dev/sda
# Check if drive is idle (ZFS may keep it active)
zpool iostat -v 1 5 # Watch for activity
# Check what's accessing the drive
lsof | grep /mnt/pool
```
---
## Power Optimization Summary
| Optimization | Savings | Applied | Notes |
|--------------|---------|---------|-------|
| **KSMD disabled** | 60-80W | ✅ | Also reduces CPU temp significantly |
| **CPU governor** | 60-120W | ✅ | PVE: powersave+balance_power, PVE2: schedutil |
| **GPU power states** | 0W | ✅ | Already optimal (automatic) |
| **Syncthing rescans** | 60-80W | ✅ | Reduced TrueNAS CPU usage |
| **ksmtuned disabled** | 2-5W | ✅ | Minor but easy win |
| **HDD spindown** | 10-16W | ✅ | Only when drives idle |
| PCIe ASPM | 5-15W | ❌ | Not yet tested |
| NMI watchdog | 1-3W | ❌ | Not yet tested |
| **Total savings** | **~150-300W** | - | Significant reduction |
---
## Related Documentation
- [UPS.md](UPS.md) - UPS capacity and power monitoring
- [STORAGE.md](STORAGE.md) - HDD spindown configuration
- [VMS.md](VMS.md) - VM resource allocation
---
**Last Updated**: 2025-12-22

148
README.md Normal file
View File

@@ -0,0 +1,148 @@
# Homelab Documentation
Documentation for Hutson's home infrastructure - two Proxmox servers running VMs and containers for home automation, media, development, and AI workloads.
## 🚀 Quick Start
**New to this homelab?** Start here:
1. [CLAUDE.md](CLAUDE.md) - Quick reference guide for common tasks
2. [SSH-ACCESS.md](SSH-ACCESS.md) - How to connect to all systems
3. [IP-ASSIGNMENTS.md](IP-ASSIGNMENTS.md) - What's at what IP address
4. [SERVICES.md](SERVICES.md) - What services are running
**Claude Code Session?** Read [CLAUDE.md](CLAUDE.md) first - it's your command center.
## 📚 Documentation Index
### Infrastructure
| Document | Description |
|----------|-------------|
| [VMS.md](VMS.md) | Complete VM/LXC inventory, specs, GPU passthrough |
| [HARDWARE.md](HARDWARE.md) | Server specs, GPUs, network cards, HBAs |
| [STORAGE.md](STORAGE.md) | ZFS pools, NFS/SMB shares, capacity planning |
| [NETWORK.md](NETWORK.md) | Bridges, VLANs, MTU config, Tailscale VPN |
| [POWER-MANAGEMENT.md](POWER-MANAGEMENT.md) | CPU governors, GPU power states, optimizations |
| [UPS.md](UPS.md) | UPS configuration, NUT monitoring, power failure handling |
### Services & Applications
| Document | Description |
|----------|-------------|
| [SERVICES.md](SERVICES.md) | Complete service inventory with URLs and credentials |
| [TRAEFIK.md](TRAEFIK.md) | Reverse proxy setup, adding services, SSL certificates |
| [HOMEASSISTANT.md](HOMEASSISTANT.md) | Home Assistant API, automations, integrations |
| [SYNCTHING.md](SYNCTHING.md) | File sync across all devices, API access, troubleshooting |
| [SALTBOX.md](#) | Media automation stack (Plex, *arr apps) (coming soon) |
### Access & Security
| Document | Description |
|----------|-------------|
| [SSH-ACCESS.md](SSH-ACCESS.md) | SSH keys, host aliases, password auth, QEMU agent |
| [IP-ASSIGNMENTS.md](IP-ASSIGNMENTS.md) | Complete IP address assignments for all devices |
| [SECURITY.md](#) | Firewall, access control, certificates (coming soon) |
### Operations
| Document | Description |
|----------|-------------|
| [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md) | 🚨 Backup strategy, disaster recovery (CRITICAL) |
| [MAINTENANCE.md](MAINTENANCE.md) | Regular procedures, update schedules, testing checklists |
| [MONITORING.md](MONITORING.md) | Health monitoring, alerts, dashboard recommendations |
| [DISASTER-RECOVERY.md](#) | Recovery procedures (coming soon) |
### Reference
| Document | Description |
|----------|-------------|
| [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md) | Storage enclosure SES commands, LCC troubleshooting |
| [SHELL-ALIASES.md](SHELL-ALIASES.md) | ZSH aliases for Claude Code sessions |
## 🖥️ System Overview
### Servers
- **PVE** (10.10.10.120) - Primary Proxmox server
- AMD Threadripper PRO 3975WX (32-core)
- 128 GB RAM
- NVIDIA Quadro P2000 + TITAN RTX
- **PVE2** (10.10.10.102) - Secondary Proxmox server
- AMD Threadripper PRO 3975WX (32-core)
- 128 GB RAM
- NVIDIA RTX A6000
### Key Services
| Service | Location | URL |
|---------|----------|-----|
| **Proxmox** | PVE | https://pve.htsn.io |
| **TrueNAS** | VM 100 | https://truenas.htsn.io |
| **Plex** | Saltbox VM | https://plex.htsn.io |
| **Home Assistant** | VM 110 | https://homeassistant.htsn.io |
| **Gitea** | VM 300 | https://git.htsn.io |
| **Pi-hole** | CT 200 | http://10.10.10.10/admin |
| **Traefik** | CT 202 | http://10.10.10.250:8080 |
[See IP-ASSIGNMENTS.md for complete list](IP-ASSIGNMENTS.md)
## 🔥 Emergency Procedures
### Power Failure
1. UPS provides ~15 min runtime at typical load
2. At 2 min remaining, NUT triggers graceful VM shutdown
3. When power returns, servers auto-boot and start VMs in order
See [UPS.md](UPS.md) for details.
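Quick status check during an outage (same `upsc` commands documented in UPS.md):
```bash
# UPS status and estimated runtime remaining (seconds)
ssh pve 'upsc cyberpower@localhost ups.status'
ssh pve 'upsc cyberpower@localhost battery.runtime'
```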
### Service Down
```bash
# Quick health check (run from Mac Mini)
ssh pve 'qm list' # Check VMs on PVE
ssh pve2 'qm list' # Check VMs on PVE2
ssh pve 'pct list' # Check containers
# Syncthing status
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
"http://127.0.0.1:8384/rest/system/connections"
# Restart a VM
ssh pve 'qm stop VMID && qm start VMID'
```
See [CLAUDE.md](CLAUDE.md) for complete troubleshooting runbooks.
## 📞 Getting Help
**Claude Code Assistant**: Start a session in this directory - all context is available in CLAUDE.md
**Key Contacts**:
- Homelab Owner: Hutson
- Git Repo: https://git.htsn.io/hutson/homelab-docs
- Local Path: `~/Projects/homelab`
## 🔄 Recent Changes
See [CHANGELOG.md](#) (coming soon) or the Changelog section in [CLAUDE.md](CLAUDE.md).
## 📝 Contributing
When updating docs:
1. Keep CLAUDE.md as quick reference only
2. Move detailed content to specialized docs
3. Update cross-references
4. Test all commands before committing
5. Add entries to changelog
```bash
cd ~/Projects/homelab
git add -A
git commit -m "Update documentation: <description>"
git push
```
---
**Last Updated**: 2025-12-22

591
SERVICES.md Normal file
View File

@@ -0,0 +1,591 @@
# Services Inventory
Complete inventory of all services running across the homelab infrastructure.
## Overview
| Category | Services | Location | Access |
|----------|----------|----------|--------|
| **Infrastructure** | Proxmox, TrueNAS, Pi-hole, Traefik | VMs/CTs | Web UI + SSH |
| **Media** | Plex, *arr apps, downloaders | Saltbox VM | Web UI |
| **Development** | Gitea, Docker services | VMs | Web UI |
| **Home Automation** | Home Assistant, Happy Coder | VMs | Web UI + API |
| **Monitoring** | UPS (NUT), Syncthing, Pulse | Various | API |
**Total Services**: 25+ running services
---
## Service URLs Quick Reference
| Service | URL | Authentication | Purpose |
|---------|-----|----------------|---------|
| **Proxmox** | https://pve.htsn.io:8006 | Username + 2FA | VM management |
| **TrueNAS** | https://truenas.htsn.io | Username/password | NAS management |
| **Plex** | https://plex.htsn.io | Plex account | Media streaming |
| **Home Assistant** | https://homeassistant.htsn.io | Username/password | Home automation |
| **Gitea** | https://git.htsn.io | Username/password | Git repositories |
| **Excalidraw** | https://excalidraw.htsn.io | None (public) | Whiteboard |
| **Happy Coder** | https://happy.htsn.io | QR code auth | Remote Claude sessions |
| **Pi-hole** | http://10.10.10.10/admin | Password | DNS/ad blocking |
| **Traefik** | http://10.10.10.250:8080 | None (internal) | Reverse proxy dashboard |
| **Pulse** | https://pulse.htsn.io | Unknown | Monitoring dashboard |
| **Copyparty** | https://copyparty.htsn.io | Unknown | File sharing |
| **FindShyt** | https://findshyt.htsn.io | Unknown | Custom app |
---
## Infrastructure Services
### Proxmox VE (PVE & PVE2)
**Purpose**: Virtualization platform, VM/CT host
**Location**: Physical servers (10.10.10.120, 10.10.10.102)
**Access**: https://pve.htsn.io:8006, SSH
**Version**: Unknown (check: `pveversion`)
**Key Features**:
- Web-based management
- VM and LXC container support
- ZFS storage pools
- Clustering (2-node)
- API access
**Common Operations**:
```bash
# List VMs
ssh pve 'qm list'
# Create VM
ssh pve 'qm create VMID --name myvm ...'
# Backup VM
ssh pve 'vzdump VMID --dumpdir /var/lib/vz/dump'
```
**See**: [VMS.md](VMS.md)
---
### TrueNAS SCALE (VM 100)
**Purpose**: Central file storage, NFS/SMB shares
**Location**: VM on PVE (10.10.10.200)
**Access**: https://truenas.htsn.io, SSH
**Version**: TrueNAS SCALE (check version in UI)
**Key Features**:
- ZFS storage management
- NFS exports
- SMB shares
- Syncthing hub
- Snapshot management
**Storage Pools**:
- `vault`: Main data pool on EMC enclosure
**Shares** (needs documentation):
- NFS exports for Saltbox media
- SMB shares for Windows access
- Syncthing sync folders
**See**: [STORAGE.md](STORAGE.md)
---
### Pi-hole (CT 200)
**Purpose**: Network-wide DNS server and ad blocker
**Location**: LXC on PVE (10.10.10.10)
**Access**: http://10.10.10.10/admin
**Version**: Unknown
**Configuration**:
- **Upstream DNS**: Cloudflare (1.1.1.1)
- **Blocklists**: Unknown count
- **Queries**: All network DNS traffic
- **DHCP**: Disabled (router handles DHCP)
**Stats** (example):
```bash
ssh pihole 'pihole -c -e' # Stats
ssh pihole 'pihole status' # Status
```
**Common Tasks**:
- Update blocklists: `ssh pihole 'pihole -g'`
- Whitelist domain: `ssh pihole 'pihole -w example.com'`
- View logs: `ssh pihole 'pihole -t'`
---
### Traefik (CT 202)
**Purpose**: Reverse proxy for all public-facing services
**Location**: LXC on PVE (10.10.10.250)
**Access**: http://10.10.10.250:8080/dashboard/
**Version**: Unknown (check: `traefik version`)
**Managed Services**:
- All *.htsn.io domains (except Saltbox services)
- SSL/TLS certificates via Let's Encrypt
- HTTP → HTTPS redirects
**See**: [TRAEFIK.md](TRAEFIK.md) for complete configuration
---
## Media Services (Saltbox VM)
All media services run in Docker on the Saltbox VM (10.10.10.100).
### Plex Media Server
**Purpose**: Media streaming platform
**URL**: https://plex.htsn.io
**Access**: Plex account
**Features**:
- Hardware transcoding (TITAN RTX)
- Libraries: Movies, TV, Music
- Remote access enabled
- Managed by Saltbox
**Media Storage**:
- Source: TrueNAS NFS mounts
- Location: `/mnt/unionfs/`
**Common Tasks**:
```bash
# View Plex status
ssh saltbox 'docker logs -f plex'
# Restart Plex
ssh saltbox 'docker restart plex'
# Scan library
# (via Plex UI: Settings → Library → Scan)
```
---
### *arr Apps (Media Automation)
Running on Saltbox VM, managed via Traefik-Saltbox.
| Service | Purpose | URL | Notes |
|---------|---------|-----|-------|
| **Sonarr** | TV show automation | sonarr.htsn.io | Monitors, downloads, organizes TV |
| **Radarr** | Movie automation | radarr.htsn.io | Monitors, downloads, organizes movies |
| **Lidarr** | Music automation | lidarr.htsn.io | Monitors, downloads, organizes music |
| **Overseerr** | Request management | overseerr.htsn.io | User requests for media |
| **Bazarr** | Subtitle management | bazarr.htsn.io | Downloads subtitles |
**Downloaders**:
| Service | Purpose | URL |
|---------|---------|-----|
| **SABnzbd** | Usenet downloader | sabnzbd.htsn.io |
| **NZBGet** | Usenet downloader | nzbget.htsn.io |
| **qBittorrent** | Torrent client | qbittorrent.htsn.io |
**Indexers**:
| Service | Purpose | URL |
|---------|---------|-----|
| **Jackett** | Torrent indexer proxy | jackett.htsn.io |
| **NZBHydra2** | Usenet indexer proxy | nzbhydra2.htsn.io |
---
### Supporting Media Services
| Service | Purpose | URL |
|---------|---------|-----|
| **Tautulli** | Plex statistics | tautulli.htsn.io |
| **Organizr** | Service dashboard | organizr.htsn.io |
| **Authelia** | SSO authentication | auth.htsn.io |
---
## Development Services
### Gitea (VM 300)
**Purpose**: Self-hosted Git server
**Location**: VM on PVE2 (10.10.10.220)
**URL**: https://git.htsn.io
**Access**: Username/password
**Repositories**:
- homelab-docs (this documentation)
- Personal projects
- Private repos
**Common Tasks**:
```bash
# SSH to Gitea VM
ssh gitea-vm
# View logs
ssh gitea-vm 'journalctl -u gitea -f'
# Backup
ssh gitea-vm 'gitea dump -c /etc/gitea/app.ini'
```
**See**: Gitea documentation for API usage
---
### Docker Services (docker-host VM)
Running on VM 206 (10.10.10.206).
| Service | URL | Purpose | Port |
|---------|-----|---------|------|
| **Excalidraw** | https://excalidraw.htsn.io | Whiteboard/diagramming | 8080 |
| **Happy Server** | https://happy.htsn.io | Happy Coder relay | 3002 |
| **Pulse** | https://pulse.htsn.io | Monitoring dashboard | 7655 |
**Docker Compose files**: `/opt/{excalidraw,happy-server,pulse}/docker-compose.yml`
**Managing services**:
```bash
ssh docker-host 'docker ps'
ssh docker-host 'cd /opt/excalidraw && sudo docker-compose logs -f'
ssh docker-host 'cd /opt/excalidraw && sudo docker-compose restart'
```
---
## Home Automation
### Home Assistant (VM 110)
**Purpose**: Smart home automation platform
**Location**: VM on PVE (10.10.10.110)
**URL**: https://homeassistant.htsn.io
**Access**: Username/password
**Integrations**:
- UPS monitoring (NUT sensors)
- Unknown other integrations (needs documentation)
**Sensors**:
- `sensor.cyberpower_battery_charge`
- `sensor.cyberpower_load`
- `sensor.cyberpower_battery_runtime`
- `sensor.cyberpower_status`
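These sensors can also be read programmatically through the Home Assistant REST API (a minimal sketch — the long-lived access token is an assumption; generate one under your HA user profile):
```bash
# Read the UPS load sensor via the Home Assistant REST API
TOKEN="<long-lived-access-token>"   # placeholder - create in HA user profile
curl -s -H "Authorization: Bearer $TOKEN" \
  "https://homeassistant.htsn.io/api/states/sensor.cyberpower_load"
```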
**See**: [HOMEASSISTANT.md](HOMEASSISTANT.md)
---
### Happy Coder Relay (docker-host)
**Purpose**: Self-hosted relay server for Happy Coder mobile app
**Location**: docker-host (10.10.10.206)
**URL**: https://happy.htsn.io
**Access**: QR code authentication
**Stack**:
- Happy Server (Node.js)
- PostgreSQL (user/session data)
- Redis (real-time events)
- MinIO (file/image storage)
**Clients**:
- Mac Mini (Happy daemon)
- Mobile app (iOS/Android)
**Credentials**:
- Master Secret: `3ccbfd03a028d3c278da7d2cf36d99b94cd4b1fecabc49ab006e8e89bc7707ac`
- PostgreSQL: `happy` / `happypass`
- MinIO: `happyadmin` / `happyadmin123`
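To verify the whole stack is up (a quick sketch; uses the `/opt/happy-server` compose path noted in the Docker Services section):
```bash
# List the Happy Server stack containers and their status
ssh docker-host 'cd /opt/happy-server && sudo docker-compose ps'
```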
---
## File Sync & Storage
### Syncthing
**Purpose**: File synchronization across all devices
**Devices**:
- Mac Mini (10.10.10.125) - Hub
- MacBook - Mobile sync
- TrueNAS (10.10.10.200) - Central storage
- Windows PC (10.10.10.150) - Windows sync
- Phone (10.10.10.54) - Mobile sync
**API Keys**:
- Mac Mini: `oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5`
- MacBook: `qYkNdVLwy9qZZZ6MqnJr7tHX7KKdxGMJ`
- Phone: `Xxz3jDT4akUJe6psfwZsbZwG2LhfZuDM`
**Synced Folders**:
- documents (~11 GB)
- downloads (~38 GB)
- pictures
- notes
- desktop (~7.2 GB)
- config
- movies
**See**: [SYNCTHING.md](SYNCTHING.md)
---
### Copyparty (VM 201)
**Purpose**: Simple HTTP file sharing
**Location**: VM on PVE (10.10.10.201)
**URL**: https://copyparty.htsn.io
**Access**: Unknown
**Features**:
- Web-based file upload/download
- Lightweight
---
## Trading & AI Services
### AI Trading Platform (trading-vm)
**Purpose**: Algorithmic trading with AI models
**Location**: VM 301 on PVE2 (10.10.10.221)
**URL**: https://aitrade.htsn.io (if accessible)
**GPU**: RTX A6000 (48GB VRAM)
**Components**:
- Trading algorithms
- AI models for market prediction
- Real-time data feeds
- Backtesting infrastructure
**Access**: SSH only (no web UI documented)
---
### LM Dev (lmdev1)
**Purpose**: AI/LLM development environment
**Location**: VM 111 on PVE (10.10.10.111)
**URL**: https://lmdev.htsn.io (if accessible)
**GPU**: TITAN RTX (shared with Saltbox)
**Installed**:
- CUDA toolkit
- Python 3.11+
- PyTorch, TensorFlow
- Hugging Face transformers
---
## Monitoring & Utilities
### UPS Monitoring (NUT)
**Purpose**: Monitor UPS status and trigger shutdowns
**Location**: PVE (master), PVE2 (slave)
**Access**: Command-line (`upsc`)
**Key Commands**:
```bash
ssh pve 'upsc cyberpower@localhost'
ssh pve 'upsc cyberpower@localhost ups.load'
ssh pve 'upsc cyberpower@localhost battery.runtime'
```
**Home Assistant Integration**: UPS sensors exposed
**See**: [UPS.md](UPS.md)
---
### Pulse Monitoring
**Purpose**: Unknown monitoring dashboard
**Location**: docker-host (10.10.10.206:7655)
**URL**: https://pulse.htsn.io
**Access**: Unknown
**Needs documentation**:
- What does it monitor?
- How to configure?
- Authentication?
---
### Tailscale VPN
**Purpose**: Secure remote access to homelab
**Subnet Routers**:
- PVE (100.113.177.80) - Primary
- UCG-Fiber (100.94.246.32) - Failover
**Devices on Tailscale**:
- Mac Mini: 100.108.89.58
- PVE: 100.113.177.80
- TrueNAS: 100.100.94.71
- Pi-hole: 100.112.59.128
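Quick checks from the primary subnet router (a sketch; assumes the `tailscale` CLI is installed on PVE, which its subnet-router role implies):
```bash
# Peer list and advertised routes as seen from PVE
ssh pve 'tailscale status'
# Tailscale IPv4 address of PVE itself
ssh pve 'tailscale ip -4'
```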
**See**: [NETWORK.md](NETWORK.md)
---
## Custom Applications
### FindShyt (CT 205)
**Purpose**: Unknown custom application
**Location**: LXC on PVE (10.10.10.8)
**URL**: https://findshyt.htsn.io
**Access**: Unknown
**Needs documentation**:
- What is this app?
- How to use it?
- Tech stack?
---
## Service Dependencies
### Critical Dependencies
```
TrueNAS
├── Plex (media files via NFS)
├── *arr apps (downloads via NFS)
├── Syncthing (central storage hub)
└── Backups (if configured)
Traefik (CT 202)
├── All *.htsn.io services
└── SSL certificate management
Pi-hole
└── DNS for entire network
Router
└── Gateway for all services
```
### Startup Order
**See [VMS.md](VMS.md)** for VM boot order configuration:
1. TrueNAS (storage first)
2. Saltbox (depends on TrueNAS NFS)
3. Other VMs
4. Containers
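A sketch of how that order is expressed in Proxmox (the `order`/`up` values below are illustrative only — the authoritative settings are in [VMS.md](VMS.md)):
```bash
# Illustrative boot-order settings (check VMS.md for the real values)
ssh pve 'qm set 100 --startup order=1,up=60'   # TrueNAS first, wait 60s for storage
ssh pve 'qm set 101 --startup order=2,up=30'   # Saltbox after TrueNAS NFS is available
ssh pve 'pct set 202 --startup order=3'        # Traefik container
# Inspect what is currently configured
ssh pve 'grep -H startup /etc/pve/qemu-server/*.conf /etc/pve/lxc/*.conf'
```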
---
## Service Port Reference
### Well-Known Ports
| Port | Service | Protocol | Purpose |
|------|---------|----------|---------|
| 22 | SSH | TCP | Remote access |
| 53 | Pi-hole | UDP | DNS queries |
| 80 | Traefik | TCP | HTTP (redirects to 443) |
| 443 | Traefik | TCP | HTTPS |
| 3000 | Gitea | TCP | Git HTTP/S |
| 8006 | Proxmox | TCP | Web UI |
| 32400 | Plex | TCP | Plex Media Server |
| 8384 | Syncthing | TCP | Web UI |
| 22000 | Syncthing | TCP | Sync protocol |
### Internal Ports
| Port | Service | Purpose |
|------|---------|---------|
| 3002 | Happy Server | Relay backend |
| 5432 | PostgreSQL | Happy Server DB |
| 6379 | Redis | Happy Server cache |
| 7655 | Pulse | Monitoring |
| 8080 | Excalidraw | Whiteboard |
| 8080 | Traefik | Dashboard |
| 9000 | MinIO | Object storage |
---
## Service Health Checks
### Quick Health Check Script
```bash
#!/bin/bash
# Check all critical services
echo "=== Infrastructure ==="
curl -Is https://pve.htsn.io:8006 | head -1
curl -Is https://truenas.htsn.io | head -1
curl -I http://10.10.10.10/admin 2>/dev/null | head -1
echo ""
echo "=== Media Services ==="
curl -Is https://plex.htsn.io | head -1
curl -Is https://sonarr.htsn.io | head -1
curl -Is https://radarr.htsn.io | head -1
echo ""
echo "=== Development ==="
curl -Is https://git.htsn.io | head -1
curl -Is https://excalidraw.htsn.io | head -1
echo ""
echo "=== Home Automation ==="
curl -Is https://homeassistant.htsn.io | head -1
curl -Is https://happy.htsn.io/health | head -1
```
### Service-Specific Checks
```bash
# Proxmox VMs
ssh pve 'qm list | grep running'
# Docker services
ssh docker-host 'docker ps --format "{{.Names}}: {{.Status}}"'
# Syncthing
curl -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
"http://127.0.0.1:8384/rest/system/status"
# UPS
ssh pve 'upsc cyberpower@localhost ups.status'
```
---
## Service Credentials
**Location**: See individual service documentation
| Service | Credentials Location | Notes |
|---------|---------------------|-------|
| Proxmox | Proxmox UI | Username + 2FA |
| TrueNAS | TrueNAS UI | Root password |
| Plex | Plex account | Managed externally |
| Gitea | Gitea DB | Self-managed |
| Pi-hole | `/etc/pihole/setupVars.conf` | Admin password |
| Happy Server | [CLAUDE.md](CLAUDE.md) | Master secret, DB passwords |
**⚠️ Security Note**: Never commit credentials to Git. Use proper secrets management.
---
## Related Documentation
- [VMS.md](VMS.md) - VM/service locations
- [TRAEFIK.md](TRAEFIK.md) - Reverse proxy config
- [IP-ASSIGNMENTS.md](IP-ASSIGNMENTS.md) - Service IP addresses
- [NETWORK.md](NETWORK.md) - Network configuration
- [MONITORING.md](MONITORING.md) - Monitoring setup (coming soon)
---
**Last Updated**: 2025-12-22
**Status**: ⚠️ Incomplete - many services need documentation (passwords, features, usage)

464
SSH-ACCESS.md Normal file
View File

@@ -0,0 +1,464 @@
# SSH Access
Documentation for SSH access to all homelab systems, including key authentication, password authentication for special cases, and QEMU guest agent usage.
## Overview
Most systems use **SSH key authentication** with the `~/.ssh/homelab` key. A few special cases require **password authentication** (router, Windows PC) due to platform limitations.
**SSH Password**: `GrilledCh33s3#` (for systems without key auth)
---
## SSH Key Authentication (Primary Method)
### SSH Key Configuration
SSH keys are configured in `~/.ssh/config` on both Mac Mini and MacBook.
**Key file**: `~/.ssh/homelab` (Ed25519 key)
**Key deployed to**: All Proxmox hosts, VMs, and LXCs (13 total hosts)
### Host Aliases
Use these convenient aliases instead of IP addresses:
| Host Alias | IP | User | Type | Notes |
|------------|-----|------|------|-------|
| `pve` | 10.10.10.120 | root | Proxmox | Primary server |
| `pve2` | 10.10.10.102 | root | Proxmox | Secondary server |
| `truenas` | 10.10.10.200 | root | VM | NAS/storage |
| `saltbox` | 10.10.10.100 | hutson | VM | Media automation |
| `lmdev1` | 10.10.10.111 | hutson | VM | AI/LLM development |
| `docker-host` | 10.10.10.206 | hutson | VM | Docker services |
| `fs-dev` | 10.10.10.5 | hutson | VM | Development |
| `copyparty` | 10.10.10.201 | hutson | VM | File sharing |
| `gitea-vm` | 10.10.10.220 | hutson | VM | Git server |
| `trading-vm` | 10.10.10.221 | hutson | VM | AI trading platform |
| `pihole` | 10.10.10.10 | root | LXC | DNS/Ad blocking |
| `traefik` | 10.10.10.250 | root | LXC | Reverse proxy |
| `findshyt` | 10.10.10.8 | root | LXC | Custom app |
### Usage Examples
```bash
# List VMs on PVE
ssh pve 'qm list'
# Check ZFS pool on TrueNAS
ssh truenas 'zpool status vault'
# List Docker containers on Saltbox
ssh saltbox 'docker ps'
# Check Pi-hole status
ssh pihole 'pihole status'
# View Traefik config
ssh pve 'pct exec 202 -- cat /etc/traefik/traefik.yaml'
```
### SSH Config File
**Location**: `~/.ssh/config`
**Example entries**:
```sshconfig
# Proxmox Servers
Host pve
HostName 10.10.10.120
User root
IdentityFile ~/.ssh/homelab
Host pve2
HostName 10.10.10.102
User root
IdentityFile ~/.ssh/homelab
# Post-quantum KEX causes MTU issues - use classic
KexAlgorithms curve25519-sha256
# VMs
Host truenas
HostName 10.10.10.200
User root
IdentityFile ~/.ssh/homelab
Host saltbox
HostName 10.10.10.100
User hutson
IdentityFile ~/.ssh/homelab
Host lmdev1
HostName 10.10.10.111
User hutson
IdentityFile ~/.ssh/homelab
Host docker-host
HostName 10.10.10.206
User hutson
IdentityFile ~/.ssh/homelab
Host fs-dev
HostName 10.10.10.5
User hutson
IdentityFile ~/.ssh/homelab
Host copyparty
HostName 10.10.10.201
User hutson
IdentityFile ~/.ssh/homelab
Host gitea-vm
HostName 10.10.10.220
User hutson
IdentityFile ~/.ssh/homelab
Host trading-vm
HostName 10.10.10.221
User hutson
IdentityFile ~/.ssh/homelab
# LXC Containers
Host pihole
HostName 10.10.10.10
User root
IdentityFile ~/.ssh/homelab
Host traefik
HostName 10.10.10.250
User root
IdentityFile ~/.ssh/homelab
Host findshyt
HostName 10.10.10.8
User root
IdentityFile ~/.ssh/homelab
```
---
## Password Authentication (Special Cases)
Some systems don't support SSH key auth or have other limitations.
### UniFi Router (10.10.10.1)
**Issue**: Uses `keyboard-interactive` auth method, incompatible with `sshpass`
**Solution**: Use `expect` to automate password entry
**Commands**:
```bash
# Run command on router
expect -c 'spawn ssh root@10.10.10.1 "hostname"; expect "Password:"; send "GrilledCh33s3#\r"; expect eof'
# Get ARP table (all device IPs)
expect -c 'spawn ssh root@10.10.10.1 "cat /proc/net/arp"; expect "Password:"; send "GrilledCh33s3#\r"; expect eof'
# Check Tailscale status
expect -c 'spawn ssh root@10.10.10.1 "tailscale status"; expect "Password:"; send "GrilledCh33s3#\r"; expect eof'
```
**Why not key auth?**: UniFi router firmware doesn't persist SSH keys across reboots.
### Windows PC (10.10.10.150)
**OS**: Windows with OpenSSH server
**User**: `claude`
**Password**: `GrilledCh33s3#`
**Shell**: PowerShell (not bash)
**Commands**:
```bash
# Run PowerShell command
sshpass -p 'GrilledCh33s3#' ssh claude@10.10.10.150 'Get-Process | Select -First 5'
# Check Syncthing status
sshpass -p 'GrilledCh33s3#' ssh claude@10.10.10.150 'Get-Process -Name syncthing -ErrorAction SilentlyContinue'
# Restart Syncthing
sshpass -p 'GrilledCh33s3#' ssh claude@10.10.10.150 'Stop-Process -Name syncthing -Force; Start-ScheduledTask -TaskName "Syncthing"'
```
**⚠️ Important**: Use `;` (semicolon) to chain PowerShell commands, NOT `&&` (bash syntax).
**Why not key auth?**: Could be configured, but password auth works and is simpler for Windows.
---
## QEMU Guest Agent
Most VMs have the QEMU guest agent installed, allowing command execution without SSH.
### VMs with QEMU Agent
| VMID | VM Name | Use Case |
|------|---------|----------|
| 100 | truenas | Execute commands, check ZFS |
| 101 | saltbox | Execute commands, Docker mgmt |
| 105 | fs-dev | Execute commands |
| 111 | lmdev1 | Execute commands |
| 201 | copyparty | Execute commands |
| 206 | docker-host | Execute commands |
| 300 | gitea-vm | Execute commands |
| 301 | trading-vm | Execute commands |
### VM WITHOUT QEMU Agent
**VMID 110 (homeassistant)**: No QEMU agent installed
- Access via web UI only
- Or install SSH server manually if needed
### Usage Examples
**Basic syntax**:
```bash
ssh pve 'qm guest exec VMID -- bash -c "COMMAND"'
```
**Examples**:
```bash
# Check ZFS pool on TrueNAS (without SSH)
ssh pve 'qm guest exec 100 -- bash -c "zpool status vault"'
# Get VM IP addresses
ssh pve 'qm guest exec 100 -- bash -c "ip addr"'
# Check Docker containers on Saltbox
ssh pve 'qm guest exec 101 -- bash -c "docker ps"'
# Run multi-line command
ssh pve 'qm guest exec 100 -- bash -c "df -h; free -h; uptime"'
```
**When to use QEMU agent vs SSH**:
- ✅ Use **SSH** for interactive sessions, file editing, complex tasks
- ✅ Use **QEMU agent** for one-off commands, when SSH is down, or VM has no network
- ⚠️ QEMU agent is slower for multiple commands (use SSH instead)
---
## Troubleshooting SSH Issues
### Connection Refused
```bash
# Check if SSH service is running
ssh pve 'systemctl status sshd'
# Check if port 22 is open
nc -zv 10.10.10.XXX 22
# Check firewall
ssh pve 'iptables -L -n | grep 22'
```
### Permission Denied (Public Key)
```bash
# Verify key file exists
ls -la ~/.ssh/homelab
# Check key permissions (should be 600)
chmod 600 ~/.ssh/homelab
# Test SSH key auth verbosely
ssh -vvv -i ~/.ssh/homelab root@10.10.10.120
# Check authorized_keys on remote (via QEMU agent if SSH broken)
ssh pve 'qm guest exec VMID -- bash -c "cat ~/.ssh/authorized_keys"'
```
### Slow SSH Connection (PVE2 Issue)
**Problem**: SSH to PVE2 hangs for 30+ seconds before connecting
**Cause**: MTU mismatch (vmbr0=9000, nic1=1500) causing post-quantum KEX packet fragmentation
**Fix**: Use classic KEX algorithm instead
**In `~/.ssh/config`**:
```sshconfig
Host pve2
HostName 10.10.10.102
User root
IdentityFile ~/.ssh/homelab
KexAlgorithms curve25519-sha256 # Avoid mlkem768x25519-sha256
```
**Permanent fix**: Set `nic1` MTU to 9000 in `/etc/network/interfaces` on PVE2
---
## Adding SSH Keys to New Systems
### Linux (VMs/LXCs)
```bash
# Copy public key to new host
ssh-copy-id -i ~/.ssh/homelab user@hostname
# Or manually:
ssh user@hostname 'mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys' < ~/.ssh/homelab.pub
ssh user@hostname 'chmod 700 ~/.ssh && chmod 600 ~/.ssh/authorized_keys'
```
### LXC Containers (Root User)
```bash
# Via pct exec from Proxmox host
ssh pve 'pct exec CTID -- bash -c "mkdir -p /root/.ssh"'
ssh pve 'pct exec CTID -- bash -c "echo \"$(cat ~/.ssh/homelab.pub)\" >> /root/.ssh/authorized_keys"'
ssh pve 'pct exec CTID -- bash -c "chmod 700 /root/.ssh && chmod 600 /root/.ssh/authorized_keys"'
# Also enable PermitRootLogin in sshd_config
ssh pve 'pct exec CTID -- bash -c "sed -i \"s/^#*PermitRootLogin.*/PermitRootLogin prohibit-password/\" /etc/ssh/sshd_config"'
ssh pve 'pct exec CTID -- bash -c "systemctl restart sshd"'
```
### VMs (via QEMU Agent)
```bash
# Add key via QEMU agent (if SSH not working)
ssh pve 'qm guest exec VMID -- bash -c "mkdir -p ~/.ssh"'
ssh pve 'qm guest exec VMID -- bash -c "echo \"$(cat ~/.ssh/homelab.pub)\" >> ~/.ssh/authorized_keys"'
ssh pve 'qm guest exec VMID -- bash -c "chmod 700 ~/.ssh && chmod 600 ~/.ssh/authorized_keys"'
```
---
## SSH Key Management
### Rotate SSH Keys (Future)
When rotating SSH keys:
1. Generate new key pair:
```bash
ssh-keygen -t ed25519 -f ~/.ssh/homelab-new -C "homelab-new"
```
2. Deploy new key to all hosts (keep old key for now):
```bash
for host in pve pve2 truenas saltbox lmdev1 docker-host fs-dev copyparty gitea-vm trading-vm pihole traefik findshyt; do
ssh-copy-id -i ~/.ssh/homelab-new $host
done
```
3. Update `~/.ssh/config` to use new key:
```sshconfig
IdentityFile ~/.ssh/homelab-new
```
4. Test all connections:
```bash
for host in pve pve2 truenas saltbox lmdev1 docker-host fs-dev copyparty gitea-vm trading-vm pihole traefik findshyt; do
echo "Testing $host..."
ssh $host 'hostname'
done
```
5. Remove old key from all hosts once confirmed working
---
## Quick Reference
### Common SSH Operations
```bash
# Execute command on remote host
ssh host 'command'
# Execute multiple commands
ssh host 'command1 && command2'
# Copy file to remote
scp file host:/path/
# Copy file from remote
scp host:/path/file ./
# Execute command on Proxmox VM (via QEMU agent)
ssh pve 'qm guest exec VMID -- bash -c "command"'
# Execute command on LXC
ssh pve 'pct exec CTID -- command'
# Interactive shell
ssh host
# SSH with X11 forwarding
ssh -X host
```
### Troubleshooting Commands
```bash
# Test SSH with verbose output
ssh -vvv host
# Check SSH service status (remote)
ssh host 'systemctl status sshd'
# Check SSH config (local)
ssh -G host
# Test port connectivity
nc -zv hostname 22
```
---
## Security Best Practices
### Current Security Posture
✅ **Good**:
- SSH keys used instead of passwords (where possible)
- Keys use Ed25519 (modern, secure algorithm)
- Root login disabled on VMs (use sudo instead)
- SSH keys have proper permissions (600)
⚠️ **Could Improve**:
- [ ] Disable password authentication on all hosts (force key-only)
- [ ] Use SSH certificate authority instead of individual keys
- [ ] Set up SSH bastion host (jump server)
- [ ] Enable 2FA for SSH (via PAM + Google Authenticator)
- [ ] Implement SSH key rotation policy (annually)
### Hardening SSH (Future)
For additional security, consider:
```sshconfig
# /etc/ssh/sshd_config (on remote hosts)
PermitRootLogin prohibit-password # No root password login
PasswordAuthentication no # Disable password auth entirely
PubkeyAuthentication yes # Only allow key auth
AuthorizedKeysFile .ssh/authorized_keys
MaxAuthTries 3 # Limit auth attempts
MaxSessions 10 # Limit concurrent sessions
ClientAliveInterval 300 # Timeout idle sessions
ClientAliveCountMax 2 # Drop after 2 keepalives
```
**Apply after editing**:
```bash
systemctl restart sshd
```
---
## Related Documentation
- [VMS.md](VMS.md) - Complete VM/CT inventory
- [NETWORK.md](NETWORK.md) - Network configuration
- [IP-ASSIGNMENTS.md](IP-ASSIGNMENTS.md) - IP addresses for all hosts
- [SECURITY.md](#) - Security policies (coming soon)
---
**Last Updated**: 2025-12-22

510
STORAGE.md Normal file
View File

@@ -0,0 +1,510 @@
# Storage Architecture
Documentation of all storage pools, datasets, shares, and capacity planning across the homelab.
## Overview
### Storage Distribution
| Location | Type | Capacity | Purpose |
|----------|------|----------|---------|
| **PVE** | NVMe + SSD mirrors | ~9 TB usable | VM storage, fast IO |
| **PVE2** | NVMe + HDD mirrors | ~6+ TB usable | VM storage, bulk data |
| **TrueNAS** | ZFS pool + EMC enclosure | ~12+ TB usable | Central file storage, NFS/SMB |
---
## PVE (10.10.10.120) Storage Pools
### nvme-mirror1 (Primary Fast Storage)
- **Type**: ZFS mirror
- **Devices**: 2x Sabrent Rocket Q NVMe
- **Capacity**: 3.6 TB usable
- **Purpose**: High-performance VM storage
- **Used By**:
- Critical VMs requiring fast IO
- Database workloads
- Development environments
**Check status**:
```bash
ssh pve 'zpool status nvme-mirror1'
ssh pve 'zpool list nvme-mirror1'
```
### nvme-mirror2 (Secondary Fast Storage)
- **Type**: ZFS mirror
- **Devices**: 2x Kingston SFYRD 2TB NVMe
- **Capacity**: 1.8 TB usable
- **Purpose**: Additional fast VM storage
- **Used By**: TBD
**Check status**:
```bash
ssh pve 'zpool status nvme-mirror2'
ssh pve 'zpool list nvme-mirror2'
```
### rpool (Root Pool)
- **Type**: ZFS mirror
- **Devices**: 2x Samsung 870 QVO 4TB SSD
- **Capacity**: 3.6 TB usable
- **Purpose**: Proxmox OS, container storage, VM backups
- **Used By**:
- Proxmox root filesystem
- LXC containers
- Local VM backups
**Check status**:
```bash
ssh pve 'zpool status rpool'
ssh pve 'df -h /var/lib/vz'
```
### Storage Pool Usage Summary (PVE)
**Get current usage**:
```bash
ssh pve 'zpool list'
ssh pve 'pvesm status'
```
---
## PVE2 (10.10.10.102) Storage Pools
### nvme-mirror3 (Fast Storage)
- **Type**: ZFS mirror
- **Devices**: 2x NVMe (model unknown)
- **Capacity**: Unknown (needs investigation)
- **Purpose**: High-performance VM storage
- **Used By**: Trading VM (301), other VMs
**Check status**:
```bash
ssh pve2 'zpool status nvme-mirror3'
ssh pve2 'zpool list nvme-mirror3'
```
### local-zfs2 (Bulk Storage)
- **Type**: ZFS mirror
- **Devices**: 2x WD Red 6TB HDD
- **Capacity**: ~6 TB usable
- **Purpose**: Bulk/archival storage
- **Power Management**: 30-minute spindown configured
- Saves ~10-16W when idle
- Udev rule: `/etc/udev/rules.d/69-hdd-spindown.rules`
- Command: `hdparm -S 241` (30 min)
**Notes**:
- Pool had only 768 KB used as of 2024-12-16
- Drives configured to spin down after 30 min idle
- Good for archival, NOT for active workloads
**Check status**:
```bash
ssh pve2 'zpool status local-zfs2'
ssh pve2 'zpool list local-zfs2'
# Check if drives are spun down
ssh pve2 'hdparm -C /dev/sdX' # Shows active/standby
```
---
## TrueNAS (VM 100 @ 10.10.10.200) - Central Storage
### ZFS Pool: vault
**Primary storage pool** for all shared data.
**Devices**: ❓ Needs investigation
- EMC storage enclosure with multiple drives
- SAS connection via LSI SAS2308 HBA (passed through to VM)
**Capacity**: ❓ Needs investigation
**Check pool status**:
```bash
ssh truenas 'zpool status vault'
ssh truenas 'zpool list vault'
# Get detailed capacity
ssh truenas 'zfs list -o name,used,avail,refer,mountpoint'
```
### Datasets (Known)
Based on Syncthing configuration, likely datasets:
| Dataset | Purpose | Synced Devices | Notes |
|---------|---------|----------------|-------|
| vault/documents | Personal documents | Mac Mini, MacBook, Windows PC, Phone | ~11 GB |
| vault/downloads | Downloads folder | Mac Mini, TrueNAS | ~38 GB |
| vault/pictures | Photos | Mac Mini, MacBook, Phone | Unknown size |
| vault/notes | Note files | Mac Mini, MacBook, Phone | Unknown size |
| vault/desktop | Desktop sync | Unknown | 7.2 GB |
| vault/movies | Movie library | Unknown | Unknown size |
| vault/config | Config files | Mac Mini, MacBook | Unknown size |
**Get complete dataset list**:
```bash
ssh truenas 'zfs list -r vault'
```
### NFS/SMB Shares
**Status**: ❓ Not documented
**Needs investigation**:
```bash
# List NFS exports
ssh truenas 'showmount -e localhost'
# List SMB shares
ssh truenas 'smbclient -L localhost -N'
# Via TrueNAS API/UI
# Sharing → Unix Shares (NFS)
# Sharing → Windows Shares (SMB)
```
**Expected shares**:
- Media libraries for Plex (on Saltbox VM)
- Document storage
- VM backups?
- ISO storage?
### EMC Storage Enclosure
**Model**: EMC KTN-STL4 (or similar)
**Connection**: SAS via LSI SAS2308 HBA (passthrough to TrueNAS VM)
**Drives**: ❓ Unknown count and capacity
**See [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md)** for:
- SES commands
- Fan control
- LCC (Link Control Card) troubleshooting
- Maintenance procedures
**Check enclosure status**:
```bash
ssh truenas 'sg_ses --page=0x02 /dev/sgX' # Element descriptor
ssh truenas 'smartctl --scan' # List all drives
```
---
## Storage Network Architecture
### Internal Storage Network (10.10.20.0/24)
**Purpose**: Dedicated network for NFS/iSCSI traffic to reduce congestion on main network.
**Bridge**: vmbr3 on PVE (virtual bridge, no physical NIC)
**Subnet**: 10.10.20.0/24
**DHCP**: No
**Gateway**: No (internal only, no internet)
**Connected VMs**:
- TrueNAS VM (secondary NIC)
- Saltbox VM (secondary NIC) - for NFS mounts
- Other VMs needing storage access
**Configuration**:
```bash
# On TrueNAS VM - check second NIC
ssh truenas 'ip addr show enp6s19'
# On Saltbox - check NFS mounts
ssh saltbox 'mount | grep nfs'
```
**Benefits**:
- Separates storage traffic from general network
- Prevents NFS/SMB from saturating main network
- Better performance for storage-heavy workloads
---
## Storage Capacity Planning
### Current Usage (Estimate)
**Needs actual audit**:
```bash
# PVE pools
ssh pve 'zpool list -o name,size,alloc,free'
# PVE2 pools
ssh pve2 'zpool list -o name,size,alloc,free'
# TrueNAS vault pool
ssh truenas 'zpool list vault'
# Get detailed breakdown
ssh truenas 'zfs list -r vault -o name,used,avail'
```
### Growth Rate
**Needs tracking** - recommend monthly snapshots of capacity:
```bash
#!/bin/bash
# Save as ~/bin/storage-capacity-report.sh
DATE=$(date +%Y-%m-%d)
REPORT=~/Backups/storage-reports/capacity-$DATE.txt
mkdir -p ~/Backups/storage-reports
echo "Storage Capacity Report - $DATE" > $REPORT
echo "================================" >> $REPORT
echo "" >> $REPORT
echo "PVE Pools:" >> $REPORT
ssh pve 'zpool list' >> $REPORT
echo "" >> $REPORT
echo "PVE2 Pools:" >> $REPORT
ssh pve2 'zpool list' >> $REPORT
echo "" >> $REPORT
echo "TrueNAS Pools:" >> $REPORT
ssh truenas 'zpool list' >> $REPORT
echo "" >> $REPORT
echo "TrueNAS Datasets:" >> $REPORT
ssh truenas 'zfs list -r vault -o name,used,avail' >> $REPORT
echo "Report saved to $REPORT"
```
**Run monthly via cron**:
```cron
0 9 1 * * ~/bin/storage-capacity-report.sh
```
### Expansion Planning
**When to expand**:
- Pool reaches 80% capacity
- Performance degrades
- New workloads require more space
**Expansion options**:
1. Add drives to existing pools (if mirrors, add mirror vdev)
2. Add new NVMe drives to PVE/PVE2
3. Expand EMC enclosure (add more drives)
4. Add second EMC enclosure
**Cost estimates**: TBD
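For option 1, adding a mirror vdev to an existing pool generally looks like this (a sketch with placeholder device names — match the drive size and ashift of the existing vdevs before running anything like it):
```bash
# Add a new mirrored vdev to an existing pool (placeholder device names)
zpool add vault mirror /dev/disk/by-id/NEW_DISK_1 /dev/disk/by-id/NEW_DISK_2
# Confirm the new vdev appears and capacity increased
zpool status vault
zpool list vault
```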
---
## ZFS Health Monitoring
### Daily Health Checks
```bash
# Check for errors on all pools
ssh pve 'zpool status -x' # Shows only unhealthy pools
ssh pve2 'zpool status -x'
ssh truenas 'zpool status -x'
# Check scrub status
ssh pve 'zpool status | grep scrub'
ssh pve2 'zpool status | grep scrub'
ssh truenas 'zpool status | grep scrub'
```
### Scrub Schedule
**Recommended**: Monthly scrub on all pools
**Configure scrub**:
```bash
# Via Proxmox UI: Node → Disks → ZFS → Select pool → Scrub
# Or via cron:
0 2 1 * * /sbin/zpool scrub nvme-mirror1
0 2 1 * * /sbin/zpool scrub rpool
```
**On TrueNAS**:
- Configure via UI: Storage → Pools → Scrub Tasks
- Recommended: 1st of every month at 2 AM
### SMART Monitoring
**Check drive health**:
```bash
# PVE
ssh pve 'smartctl -a /dev/nvme0'
ssh pve 'smartctl -a /dev/sda'
# TrueNAS
ssh truenas 'smartctl --scan'
ssh truenas 'smartctl -a /dev/sdX' # For each drive
```
**Configure SMART tests**:
- TrueNAS UI: Tasks → S.M.A.R.T. Tests
- Recommended: Weekly short test, monthly long test
### Alerts
**Set up email alerts for**:
- ZFS pool errors
- SMART test failures
- Pool capacity > 80%
- Scrub failures
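Until proper alerting exists (see [MONITORING.md](MONITORING.md)), a minimal cron-able sketch like the one below covers the pool-error and capacity checks — it assumes a working `mail` command on the host, and `ADMIN_EMAIL` is a placeholder:
```bash
#!/bin/bash
# Minimal ZFS health/capacity alert (sketch; assumes a configured mail MTA)
ADMIN_EMAIL="you@example.com"   # placeholder
PROBLEMS=""

# Pool errors (zpool status -x prints "all pools are healthy" when OK)
STATUS=$(zpool status -x)
echo "$STATUS" | grep -q "all pools are healthy" || PROBLEMS+="ZFS errors:\n$STATUS\n\n"

# Capacity >= 80%
while read -r name cap; do
  [ "${cap%\%}" -ge 80 ] && PROBLEMS+="Pool $name at $cap capacity\n"
done < <(zpool list -H -o name,capacity)

if [ -n "$PROBLEMS" ]; then
  echo -e "$PROBLEMS" | mail -s "Storage alert on $(hostname)" "$ADMIN_EMAIL"
fi
```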
---
## Storage Performance Tuning
### ZFS ARC (Cache)
**Check ARC usage**:
```bash
ssh pve 'arc_summary'
ssh truenas 'arc_summary'
```
**Tuning** (if needed):
- PVE/PVE2: Set max ARC in `/etc/modprobe.d/zfs.conf`
- TrueNAS: Configure via UI (System → Advanced → Tunables)
### NFS Performance
**Mount options** (on clients like Saltbox):
```
rsize=131072,wsize=131072,hard,timeo=600,retrans=2,vers=3
```
**Verify NFS mounts**:
```bash
ssh saltbox 'mount | grep nfs'
```
### Record Size Optimization
**Different workloads need different record sizes**:
- VMs: 64K (default, good for VMs)
- Databases: 8K or 16K
- Media files: 1M (large sequential reads)
**Set record size** (on TrueNAS datasets):
```bash
ssh truenas 'zfs set recordsize=1M vault/movies'
```
---
## Disaster Recovery
### Pool Recovery
**If a pool fails to import**:
```bash
# Try importing with different name
zpool import -f -N poolname newpoolname
# Check pool with readonly
zpool import -f -o readonly=on poolname
# Force import (last resort)
zpool import -f -F poolname
```
### Drive Replacement
**When a drive fails**:
```bash
# Identify failed drive
zpool status poolname
# Replace drive
zpool replace poolname old-device new-device
# Monitor resilver
watch zpool status poolname
```
### Data Recovery
**If pool is completely lost**:
1. Restore from offsite backup (see [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md))
2. Recreate pool structure
3. Restore data
**Critical**: This is why we need offsite backups!
---
## Quick Reference
### Common Commands
```bash
# Pool status
zpool status [poolname]
zpool list
# Dataset usage
zfs list
zfs list -r vault
# Check pool health (only unhealthy)
zpool status -x
# Scrub pool
zpool scrub poolname
# Get pool IO stats
zpool iostat -v 1
# Snapshot management
zfs snapshot poolname/dataset@snapname
zfs list -t snapshot
zfs rollback poolname/dataset@snapname
zfs destroy poolname/dataset@snapname
```
### Storage Locations by Use Case
| Use Case | Recommended Storage | Why |
|----------|---------------------|-----|
| VM OS disk | nvme-mirror1 (PVE) | Fastest IO |
| Database | nvme-mirror1/2 | Low latency |
| Media files | TrueNAS vault | Large capacity |
| Development | nvme-mirror2 | Fast, mid-tier |
| Containers | rpool | Good performance |
| Backups | TrueNAS or rpool | Large capacity |
| Archive | local-zfs2 (PVE2) | Cheap, can spin down |
---
## Investigation Needed
- [ ] Get complete TrueNAS dataset list
- [ ] Document NFS/SMB share configuration
- [ ] Inventory EMC enclosure drives (count, capacity, model)
- [ ] Document current pool usage percentages
- [ ] Set up monthly capacity reports
- [ ] Configure ZFS scrub schedules
- [ ] Set up storage health alerts
---
## Related Documentation
- [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md) - Backup and snapshot strategy
- [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md) - Storage enclosure maintenance
- [VMS.md](VMS.md) - VM storage assignments
- [NETWORK.md](NETWORK.md) - Storage network configuration
---
**Last Updated**: 2025-12-22

672
TRAEFIK.md Normal file
View File

@@ -0,0 +1,672 @@
# Traefik Reverse Proxy
Documentation for Traefik reverse proxy setup, SSL certificates, and deploying new public services.
## Overview
There are **TWO separate Traefik instances** handling different services. Understanding which one to use is critical.
| Instance | Location | IP | Purpose | Managed By |
|----------|----------|-----|---------|------------|
| **Traefik-Primary** | CT 202 | **10.10.10.250** | General services | Manual config files |
| **Traefik-Saltbox** | VM 101 (Docker) | **10.10.10.100** | Saltbox services only | Saltbox Ansible |
---
## ⚠️ CRITICAL RULE: Which Traefik to Use
### When Adding ANY New Service:
**USE Traefik-Primary (CT 202 @ 10.10.10.250)** - For ALL new services
**DO NOT touch Traefik-Saltbox** - Unless you're modifying Saltbox itself
### Why This Matters:
- **Traefik-Saltbox** has complex Saltbox-managed configs (Ansible-generated)
- Messing with it breaks Plex, Sonarr, Radarr, and all media services
- Each Traefik has its own Let's Encrypt certificates
- Mixing them causes certificate conflicts and routing issues
---
## Traefik-Primary (CT 202) - For New Services
### Configuration
**Location**: Container 202 on PVE (10.10.10.250)
**Config Directory**: `/etc/traefik/`
**Main Config**: `/etc/traefik/traefik.yaml`
**Dynamic Configs**: `/etc/traefik/conf.d/*.yaml`
### Access Traefik Config
```bash
# From Mac Mini:
ssh pve 'pct exec 202 -- cat /etc/traefik/traefik.yaml'
ssh pve 'pct exec 202 -- ls /etc/traefik/conf.d/'
# Edit a service config:
ssh pve 'pct exec 202 -- vi /etc/traefik/conf.d/myservice.yaml'
# View logs:
ssh pve 'pct exec 202 -- tail -f /var/log/traefik/traefik.log'
```
### Services Using Traefik-Primary
| Service | Domain | Backend |
|---------|--------|---------|
| Excalidraw | excalidraw.htsn.io | 10.10.10.206:8080 (docker-host) |
| FindShyt | findshyt.htsn.io | 10.10.10.8 (CT 205) |
| Gitea | git.htsn.io | 10.10.10.220:3000 |
| Home Assistant | homeassistant.htsn.io | 10.10.10.110 |
| LM Dev | lmdev.htsn.io | 10.10.10.111 |
| Pi-hole | pihole.htsn.io | 10.10.10.10 |
| TrueNAS | truenas.htsn.io | 10.10.10.200 |
| Proxmox | pve.htsn.io | 10.10.10.120 |
| Copyparty | copyparty.htsn.io | 10.10.10.201 |
| AI Trade | aitrade.htsn.io | (trading server) |
| Pulse | pulse.htsn.io | 10.10.10.206:7655 (monitoring) |
| Happy | happy.htsn.io | 10.10.10.206:3002 (Happy Coder relay) |
---
## Traefik-Saltbox (VM 101) - DO NOT MODIFY
### Configuration
**Location**: `/opt/traefik/` inside Saltbox VM
**Managed By**: Saltbox Ansible playbooks (automatic)
**Docker Mount**: `/opt/traefik` → `/etc/traefik` in container
### Services Using Traefik-Saltbox
- Plex (plex.htsn.io)
- Sonarr, Radarr, Lidarr
- SABnzbd, NZBGet, qBittorrent
- Overseerr, Tautulli, Organizr
- Jackett, NZBHydra2
- Authelia (SSO authentication)
- All other Saltbox-managed containers
### View Saltbox Traefik (Read-Only)
```bash
# View config (don't edit!)
ssh pve 'qm guest exec 101 -- bash -c "docker exec traefik cat /etc/traefik/traefik.yml"'
# View logs
ssh saltbox 'docker logs -f traefik'
```
**⚠️ WARNING**: Editing Saltbox Traefik configs manually will be overwritten by Ansible and may break media services.
---
## Adding a New Public Service - Complete Workflow
Follow these steps to deploy a new service and make it accessible at `servicename.htsn.io`.
### Step 0: Deploy Your Service
First, deploy your service on the appropriate host.
#### Option A: Docker on docker-host (10.10.10.206)
```bash
ssh hutson@10.10.10.206
sudo mkdir -p /opt/myservice
cat > /opt/myservice/docker-compose.yml << 'EOF'
version: "3.8"
services:
myservice:
image: myimage:latest
ports:
- "8080:80"
restart: unless-stopped
EOF
cd /opt/myservice && sudo docker-compose up -d
```
#### Option B: New LXC Container on PVE
```bash
ssh pve 'pct create CTID local:vztmpl/ubuntu-22.04-standard_22.04-1_amd64.tar.zst \
--hostname myservice --memory 2048 --cores 2 \
--net0 name=eth0,bridge=vmbr0,ip=10.10.10.XXX/24,gw=10.10.10.1 \
--rootfs local-zfs:8 --unprivileged 1 --start 1'
```
#### Option C: New VM on PVE
```bash
ssh pve 'qm create VMID --name myservice --memory 2048 --cores 2 \
--net0 virtio,bridge=vmbr0 --scsihw virtio-scsi-pci'
```
### Step 1: Create Traefik Config File
Use this template for new services on **Traefik-Primary (CT 202)**:
#### Basic Template
```yaml
# /etc/traefik/conf.d/myservice.yaml
http:
routers:
# HTTPS router
myservice-secure:
entryPoints:
- websecure
rule: "Host(`myservice.htsn.io`)"
service: myservice
tls:
certResolver: cloudflare # Use 'cloudflare' for proxied domains, 'letsencrypt' for DNS-only
priority: 50
# HTTP → HTTPS redirect
myservice-redirect:
entryPoints:
- web
rule: "Host(`myservice.htsn.io`)"
middlewares:
- myservice-https-redirect
service: myservice
priority: 50
services:
myservice:
loadBalancer:
servers:
- url: "http://10.10.10.XXX:PORT"
middlewares:
myservice-https-redirect:
redirectScheme:
scheme: https
permanent: true
```
#### Deploy the Config
```bash
# Create file on CT 202
ssh pve 'pct exec 202 -- bash -c "cat > /etc/traefik/conf.d/myservice.yaml << '\''EOF'\''
<paste config here>
EOF"'
# Traefik auto-reloads (watches conf.d directory)
# Check logs:
ssh pve 'pct exec 202 -- tail -f /var/log/traefik/traefik.log'
```
### Step 2: Add Cloudflare DNS Entry
#### Cloudflare Credentials
| Field | Value |
|-------|-------|
| Email | cloudflare@htsn.io |
| API Key | 849ebefd163d2ccdec25e49b3e1b3fe2cdadc |
| Zone ID (htsn.io) | c0f5a80448c608af35d39aa820a5f3af |
| Public IP | 70.237.94.174 |
#### Method 1: Manual (Cloudflare Dashboard)
1. Go to https://dash.cloudflare.com/
2. Select `htsn.io` domain
3. DNS → Add Record
4. Type: `A`, Name: `myservice`, IPv4: `70.237.94.174`, Proxied: ☑️
#### Method 2: Automated (CLI)
Save this as `~/bin/add-cloudflare-dns.sh`:
```bash
#!/bin/bash
# Add DNS record to Cloudflare for htsn.io
SUBDOMAIN="$1"
CF_EMAIL="cloudflare@htsn.io"
CF_API_KEY="849ebefd163d2ccdec25e49b3e1b3fe2cdadc"
ZONE_ID="c0f5a80448c608af35d39aa820a5f3af"
PUBLIC_IP="70.237.94.174"
if [ -z "$SUBDOMAIN" ]; then
echo "Usage: $0 <subdomain>"
echo "Example: $0 myservice # Creates myservice.htsn.io"
exit 1
fi
curl -X POST "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records" \
-H "X-Auth-Email: $CF_EMAIL" \
-H "X-Auth-Key: $CF_API_KEY" \
-H "Content-Type: application/json" \
--data "{
\"type\":\"A\",
\"name\":\"$SUBDOMAIN\",
\"content\":\"$PUBLIC_IP\",
\"ttl\":1,
\"proxied\":true
}" | jq .
```
**Usage**:
```bash
chmod +x ~/bin/add-cloudflare-dns.sh
~/bin/add-cloudflare-dns.sh myservice # Creates myservice.htsn.io
```
### Step 3: Testing
```bash
# Check if DNS resolves
dig myservice.htsn.io
# Should return: 70.237.94.174 (or Cloudflare IPs if proxied)
# Test HTTP redirect
curl -I http://myservice.htsn.io
# Expected: 301 redirect to https://
# Test HTTPS
curl -I https://myservice.htsn.io
# Expected: 200 OK
# Check Traefik dashboard (if enabled)
# http://10.10.10.250:8080/dashboard/
```
### Step 4: Update Documentation
After deploying, update:
1. **IP-ASSIGNMENTS.md** - Add to Services & Reverse Proxy Mapping table
2. **This file (TRAEFIK.md)** - Add to "Services Using Traefik-Primary" list
3. **CLAUDE.md** - Update quick reference if needed
---
## SSL Certificates
Traefik has **two certificate resolvers** configured:
| Resolver | Use When | Challenge Type | Notes |
|----------|----------|----------------|-------|
| `letsencrypt` | Cloudflare DNS-only (gray cloud ☁️) | HTTP-01 | Requires port 80 reachable |
| `cloudflare` | Cloudflare Proxied (orange cloud 🟠) | DNS-01 | Works with Cloudflare proxy |
### ⚠️ Important: HTTP Challenge vs DNS Challenge
**If Cloudflare proxy is enabled** (orange cloud), HTTP challenge **FAILS** because Cloudflare redirects HTTP→HTTPS before the challenge reaches your server.
**Solution**: Use `cloudflare` resolver (DNS-01 challenge) instead.
### Certificate Resolver Configuration
**Cloudflare API credentials** are configured in `/etc/systemd/system/traefik.service`:
```ini
Environment="CF_API_EMAIL=cloudflare@htsn.io"
Environment="CF_API_KEY=849ebefd163d2ccdec25e49b3e1b3fe2cdadc"
```
### Certificate Storage
| Resolver | Storage File |
|----------|--------------|
| HTTP challenge (`letsencrypt`) | `/etc/traefik/acme.json` |
| DNS challenge (`cloudflare`) | `/etc/traefik/acme-cf.json` |
**Permissions**: Must be `600` (read/write owner only)
```bash
# Check permissions
ssh pve 'pct exec 202 -- ls -la /etc/traefik/acme*.json'
# Fix if needed
ssh pve 'pct exec 202 -- chmod 600 /etc/traefik/acme.json'
ssh pve 'pct exec 202 -- chmod 600 /etc/traefik/acme-cf.json'
```
### Certificate Renewal
- **Automatic** via Traefik
- Checks every 24 hours
- Renews 30 days before expiry
- No manual intervention needed
### Troubleshooting Certificates
#### Certificate Fails to Issue
```bash
# Check Traefik logs
ssh pve 'pct exec 202 -- tail -f /var/log/traefik/traefik.log | grep -i error'
# Verify Cloudflare API access
curl -X GET "https://api.cloudflare.com/client/v4/user/tokens/verify" \
-H "X-Auth-Email: cloudflare@htsn.io" \
-H "X-Auth-Key: 849ebefd163d2ccdec25e49b3e1b3fe2cdadc"
# Check acme.json permissions
ssh pve 'pct exec 202 -- ls -la /etc/traefik/acme*.json'
```
#### Force Certificate Renewal
```bash
# Delete the certificate store for the resolver (Traefik will re-request ALL certificates on that resolver)
ssh pve 'pct exec 202 -- rm /etc/traefik/acme-cf.json'
ssh pve 'pct exec 202 -- touch /etc/traefik/acme-cf.json'
ssh pve 'pct exec 202 -- chmod 600 /etc/traefik/acme-cf.json'
ssh pve 'pct exec 202 -- systemctl restart traefik'
# Watch logs
ssh pve 'pct exec 202 -- tail -f /var/log/traefik/traefik.log'
```
---
## Quick Deployment - One-Liner
For fast deployment, use this all-in-one command:
```bash
# === DEPLOY SERVICE (example: myservice on docker-host port 8080) ===
# 1. Create Traefik config
ssh pve 'pct exec 202 -- bash -c "cat > /etc/traefik/conf.d/myservice.yaml << EOF
http:
routers:
myservice-secure:
entryPoints: [websecure]
rule: Host(\\\`myservice.htsn.io\\\`)
service: myservice
tls: {certResolver: cloudflare}
services:
myservice:
loadBalancer:
servers:
- url: http://10.10.10.206:8080
EOF"'
# 2. Add Cloudflare DNS
curl -s -X POST "https://api.cloudflare.com/client/v4/zones/c0f5a80448c608af35d39aa820a5f3af/dns_records" \
-H "X-Auth-Email: cloudflare@htsn.io" \
-H "X-Auth-Key: 849ebefd163d2ccdec25e49b3e1b3fe2cdadc" \
-H "Content-Type: application/json" \
--data '{"type":"A","name":"myservice","content":"70.237.94.174","proxied":true}'
# 3. Test (wait a few seconds for DNS propagation)
curl -I https://myservice.htsn.io
```
---
## Docker Service with Traefik Labels (Alternative)
If deploying a service via Docker on `docker-host` (VM 206), you can use Traefik labels instead of config files.
**Requirements**:
- Traefik must have access to Docker socket
- Service must be on same Docker network as Traefik
**Example docker-compose.yml**:
```yaml
version: "3.8"
services:
myservice:
image: myimage:latest
labels:
- "traefik.enable=true"
- "traefik.http.routers.myservice.rule=Host(`myservice.htsn.io`)"
- "traefik.http.routers.myservice.entrypoints=websecure"
- "traefik.http.routers.myservice.tls.certresolver=letsencrypt"
- "traefik.http.services.myservice.loadbalancer.server.port=8080"
networks:
- traefik
networks:
traefik:
external: true
```
**Note**: This method is NOT currently used on Traefik-Primary (CT 202), as it doesn't have Docker socket access. Config files are preferred.
---
## Cloudflare API Reference
### API Credentials
| Field | Value |
|-------|-------|
| Email | cloudflare@htsn.io |
| API Key | 849ebefd163d2ccdec25e49b3e1b3fe2cdadc |
| Zone ID | c0f5a80448c608af35d39aa820a5f3af |
### Common API Operations
Set credentials:
```bash
CF_EMAIL="cloudflare@htsn.io"
CF_API_KEY="849ebefd163d2ccdec25e49b3e1b3fe2cdadc"
ZONE_ID="c0f5a80448c608af35d39aa820a5f3af"
```
**List all DNS records**:
```bash
curl -X GET "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records" \
-H "X-Auth-Email: $CF_EMAIL" \
-H "X-Auth-Key: $CF_API_KEY" | jq
```
**Add A record**:
```bash
curl -X POST "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records" \
-H "X-Auth-Email: $CF_EMAIL" \
-H "X-Auth-Key: $CF_API_KEY" \
-H "Content-Type: application/json" \
--data '{
"type":"A",
"name":"subdomain",
"content":"70.237.94.174",
"proxied":true
}'
```
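**Find a record ID** (needed by the delete and update calls below; the `name` filter expects the full hostname):
```bash
# Look up the record ID for an existing record by name
RECORD_ID=$(curl -s -X GET "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records?name=myservice.htsn.io" \
  -H "X-Auth-Email: $CF_EMAIL" \
  -H "X-Auth-Key: $CF_API_KEY" | jq -r '.result[0].id')
echo "$RECORD_ID"
```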
**Delete record**:
```bash
curl -X DELETE "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records/$RECORD_ID" \
-H "X-Auth-Email: $CF_EMAIL" \
-H "X-Auth-Key: $CF_API_KEY"
```
**Update record** (toggle proxy):
```bash
curl -X PATCH "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records/$RECORD_ID" \
-H "X-Auth-Email: $CF_EMAIL" \
-H "X-Auth-Key: $CF_API_KEY" \
-H "Content-Type: application/json" \
--data '{"proxied":false}'
```
---
## Troubleshooting
### Service Not Accessible
```bash
# 1. Check if DNS resolves
dig myservice.htsn.io
# 2. Check if backend is reachable
curl -I http://10.10.10.XXX:PORT
# 3. Check Traefik logs
ssh pve 'pct exec 202 -- tail -f /var/log/traefik/traefik.log'
# 4. Check Traefik config is valid
ssh pve 'pct exec 202 -- cat /etc/traefik/conf.d/myservice.yaml'
# 5. Restart Traefik (if needed)
ssh pve 'pct exec 202 -- systemctl restart traefik'
```
### Certificate Issues
```bash
# Check certificate status in acme.json
ssh pve 'pct exec 202 -- cat /etc/traefik/acme-cf.json | jq'
# Check certificate expiry
echo | openssl s_client -servername myservice.htsn.io -connect myservice.htsn.io:443 2>/dev/null | openssl x509 -noout -dates
```
### 502 Bad Gateway
**Cause**: Backend service is down or unreachable
```bash
# Check if backend is running
ssh backend-host 'systemctl status myservice'
# Check if port is open
nc -zv 10.10.10.XXX PORT
# Check firewall
ssh backend-host 'iptables -L -n | grep PORT'
```
### 404 Not Found
**Cause**: Traefik can't match the request to a router
```bash
# Check router rule matches domain
ssh pve 'pct exec 202 -- cat /etc/traefik/conf.d/myservice.yaml | grep rule'
# Should be: rule: "Host(`myservice.htsn.io`)"
# Check DNS is pointing to correct IP
dig myservice.htsn.io
# Restart Traefik to reload config
ssh pve 'pct exec 202 -- systemctl restart traefik'
```
---
## Advanced Configuration Examples
### WebSocket Support
For services that use WebSockets (like Home Assistant):
```yaml
http:
routers:
myservice-secure:
entryPoints:
- websecure
rule: "Host(`myservice.htsn.io`)"
service: myservice
tls:
certResolver: cloudflare
services:
myservice:
loadBalancer:
servers:
- url: "http://10.10.10.XXX:PORT"
# No special config needed - WebSockets work by default in Traefik v2+
```
### Custom Headers
Add custom headers (e.g., security headers):
```yaml
http:
routers:
myservice-secure:
middlewares:
- myservice-headers
middlewares:
myservice-headers:
headers:
customResponseHeaders:
X-Frame-Options: "DENY"
X-Content-Type-Options: "nosniff"
Referrer-Policy: "strict-origin-when-cross-origin"
```
### Basic Authentication
Protect a service with basic auth:
```yaml
http:
routers:
myservice-secure:
middlewares:
- myservice-auth
middlewares:
myservice-auth:
basicAuth:
users:
- "user:$apr1$..." # Generate with: htpasswd -nb user password
```
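To generate the hash referenced above (assuming `apache2-utils` provides `htpasswd` on CT 202):
```bash
# Install the htpasswd utility if missing (Debian/Ubuntu)
ssh pve 'pct exec 202 -- bash -c "apt update && apt install -y apache2-utils"'
# Print a user:hash pair to paste into the users list above
ssh pve 'pct exec 202 -- htpasswd -nb admin "MySecretPassword"'
```
In the file provider the hash can be pasted as-is; doubling `$` to `$$` is only needed when using docker-compose labels.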
---
## Maintenance
### Monthly Checks
```bash
# Check Traefik status
ssh pve 'pct exec 202 -- systemctl status traefik'
# Review logs for errors
ssh pve 'pct exec 202 -- grep -i error /var/log/traefik/traefik.log | tail -20'
# List certificate domains (acme-cf.json stores the certificate blob, not a readable expiry field)
ssh pve 'pct exec 202 -- cat /etc/traefik/acme-cf.json | jq -r ".cloudflare.Certificates[].domain.main"'
# Spot-check expiry of a live certificate (repeat per domain as needed)
echo | openssl s_client -servername plex.htsn.io -connect plex.htsn.io:443 2>/dev/null | openssl x509 -noout -dates
# Verify all services responding
for domain in plex.htsn.io git.htsn.io truenas.htsn.io; do
echo "Testing $domain..."
curl -sI https://$domain | head -1
done
```
### Backup Traefik Config
```bash
# Backup all configs
ssh pve 'pct exec 202 -- tar czf /tmp/traefik-backup-$(date +%Y%m%d).tar.gz /etc/traefik'
# Pull the archive from the container to the PVE host, then copy it locally
ssh pve "pct pull 202 /tmp/traefik-backup-$(date +%Y%m%d).tar.gz /tmp/traefik-backup-$(date +%Y%m%d).tar.gz"
scp "pve:/tmp/traefik-backup-$(date +%Y%m%d).tar.gz" ~/Backups/traefik/
```
---
## Related Documentation
- [IP-ASSIGNMENTS.md](IP-ASSIGNMENTS.md) - Service IP addresses
- [CLOUDFLARE.md](#) - Cloudflare DNS management (coming soon)
- [SERVICES.md](#) - Complete service inventory (coming soon)
---
**Last Updated**: 2025-12-22

UPS.md Normal file

@@ -0,0 +1,605 @@
# UPS and Power Management
Documentation for UPS (Uninterruptible Power Supply) configuration, NUT (Network UPS Tools) monitoring, and power failure procedures.
## Hardware
### Current UPS
| Specification | Value |
|---------------|-------|
| **Model** | CyberPower OR2200PFCRT2U |
| **Capacity** | 2200VA / 1320W |
| **Form Factor** | 2U rackmount |
| **Output** | PFC Sinewave (compatible with active PFC PSUs) |
| **Outlets** | 2x NEMA 5-20R + 6x NEMA 5-15R (all battery + surge) |
| **Input Plug** | ⚠️ Originally NEMA 5-20P (20A), **rewired to 5-15P (15A)** |
| **Runtime** | ~15-20 min at typical load (~33% / 440W) |
| **Installed** | 2025-12-21 |
| **Status** | Active |
### ⚠️ Temporary Wiring Modification
**Issue**: UPS came with NEMA 5-20P plug (20A) but server rack is on 15A circuit
**Solution**: Temporarily rewired plug from 5-20P → 5-15P for compatibility
**Risk**: The UPS input was designed for a 20A circuit; total draw (server load plus battery recharge) must stay within the 15A circuit/plug limit (1800W peak, ~1440W continuous)
**Current draw**: ~1000-1350W total (safe margin)
**Backlog**: Upgrade to 20A circuit, restore original 5-20P plug
### Previous UPS
| Model | Capacity | Issue | Replaced |
|-------|----------|-------|----------|
| WattBox WB-1100-IPVMB-6 | 1100VA / 660W | Insufficient for dual Threadripper setup | 2025-12-21 |
**Why replaced**: Combined server load of 1000-1350W exceeded 660W capacity.
---
## Power Draw Estimates
### Typical Load
| Component | Idle | Load | Notes |
|-----------|------|------|-------|
| PVE Server | 250-350W | 500-750W | CPU + TITAN RTX + P2000 + storage |
| PVE2 Server | 200-300W | 450-600W | CPU + RTX A6000 + storage |
| Network gear | ~50W | ~50W | Router, switches |
| **Total** | **500-700W** | **1000-1400W** | Varies by workload |
**UPS Load**: ~33-50% typical, 70-80% under heavy load
### Runtime Calculation
- At **440W load** (33%): ~15-20 min runtime (tested 2025-12-21)
- At **660W load** (50%): ~10-12 min estimated
- At **1000W load** (75%): ~6-8 min estimated
**NUT shutdown trigger**: 120 seconds (2 min) remaining runtime
---
## NUT (Network UPS Tools) Configuration
### Architecture
```
UPS (USB) ──> PVE (NUT Server/Master) ──> PVE2 (NUT Client/Slave)
└──> Home Assistant (monitoring only)
```
**Master**: PVE (10.10.10.120) - UPS connected via USB, runs NUT server
**Slave**: PVE2 (10.10.10.102) - Monitors PVE's NUT server, shuts down when triggered
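A quick way to confirm the topology is wired up (assuming NUT is installed and running on both nodes):
```bash
# On the master: list UPSes the local NUT server knows about
ssh pve 'upsc -l'
# From the slave: query the master's NUT server over the network
ssh pve2 'upsc cyberpower@10.10.10.120 ups.status'
```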
### NUT Server Configuration (PVE)
#### 1. UPS Driver Config: `/etc/nut/ups.conf`
```ini
[cyberpower]
driver = usbhid-ups
port = auto
desc = "CyberPower OR2200PFCRT2U"
override.battery.charge.low = 20
override.battery.runtime.low = 120
```
**Key settings**:
- `driver = usbhid-ups`: USB HID UPS driver (generic for CyberPower)
- `port = auto`: Auto-detect USB device
- `override.battery.runtime.low = 120`: Trigger shutdown at 120 seconds (2 min) remaining
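After (re)starting the driver, the overrides can be checked against the live values reported by the UPS:
```bash
# Expected: battery.charge.low: 20, battery.runtime.low: 120
ssh pve 'upsc cyberpower@localhost | grep -E "battery\.charge\.low|battery\.runtime\.low"'
```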
#### 2. NUT Server Config: `/etc/nut/upsd.conf`
```ini
LISTEN 127.0.0.1 3493
LISTEN 10.10.10.120 3493
```
**Listens on**:
- Localhost (for local monitoring)
- LAN IP (for PVE2 to connect)
#### 3. User Config: `/etc/nut/upsd.users`
```ini
[admin]
password = upsadmin123
actions = SET
instcmds = ALL
[upsmon]
password = upsmon123
upsmon master
```
**Users**:
- `admin`: Full control, can run commands
- `upsmon`: Monitoring only (used by the upsmon processes on both PVE and PVE2)
#### 4. Monitor Config: `/etc/nut/upsmon.conf`
```ini
MONITOR cyberpower@localhost 1 upsmon upsmon123 master
MINSUPPLIES 1
SHUTDOWNCMD "/usr/local/bin/ups-shutdown.sh"
NOTIFYCMD /usr/sbin/upssched
POLLFREQ 5
POLLFREQALERT 5
HOSTSYNC 15
DEADTIME 15
POWERDOWNFLAG /etc/killpower
NOTIFYMSG ONLINE "UPS %s on line power"
NOTIFYMSG ONBATT "UPS %s on battery"
NOTIFYMSG LOWBATT "UPS %s battery is low"
NOTIFYMSG FSD "UPS %s: forced shutdown in progress"
NOTIFYMSG COMMOK "Communications with UPS %s established"
NOTIFYMSG COMMBAD "Communications with UPS %s lost"
NOTIFYMSG SHUTDOWN "Auto logout and shutdown proceeding"
NOTIFYMSG REPLBATT "UPS %s battery needs to be replaced"
NOTIFYMSG NOCOMM "UPS %s is unavailable"
NOTIFYMSG NOPARENT "upsmon parent process died - shutdown impossible"
NOTIFYFLAG ONLINE SYSLOG+WALL
NOTIFYFLAG ONBATT SYSLOG+WALL
NOTIFYFLAG LOWBATT SYSLOG+WALL
NOTIFYFLAG FSD SYSLOG+WALL
NOTIFYFLAG COMMOK SYSLOG+WALL
NOTIFYFLAG COMMBAD SYSLOG+WALL
NOTIFYFLAG SHUTDOWN SYSLOG+WALL
NOTIFYFLAG REPLBATT SYSLOG+WALL
NOTIFYFLAG NOCOMM SYSLOG+WALL
NOTIFYFLAG NOPARENT SYSLOG
```
**Key settings**:
- `MONITOR cyberpower@localhost 1 upsmon upsmon123 master`: Monitor local UPS
- `SHUTDOWNCMD "/usr/local/bin/ups-shutdown.sh"`: Custom shutdown script
- `POLLFREQ 5`: Check UPS every 5 seconds
#### 5. USB Permissions: `/etc/udev/rules.d/99-nut-ups.rules`
```udev
SUBSYSTEM=="usb", ATTR{idVendor}=="0764", ATTR{idProduct}=="0501", MODE="0660", GROUP="nut"
```
**Purpose**: Ensure NUT can access USB UPS device
**Apply rule**:
```bash
udevadm control --reload-rules
udevadm trigger
```
### NUT Client Configuration (PVE2)
#### Monitor Config: `/etc/nut/upsmon.conf`
```ini
MONITOR cyberpower@10.10.10.120 1 upsmon upsmon123 slave
MINSUPPLIES 1
SHUTDOWNCMD "/usr/local/bin/ups-shutdown.sh"
POLLFREQ 5
POLLFREQALERT 5
HOSTSYNC 15
DEADTIME 15
POWERDOWNFLAG /etc/killpower
# Same NOTIFYMSG and NOTIFYFLAG as PVE
```
**Key difference**: `slave` instead of `master` - monitors remote UPS on PVE
---
## Custom Shutdown Script
### `/usr/local/bin/ups-shutdown.sh` (Same on both PVE and PVE2)
```bash
#!/bin/bash
# Graceful VM/CT shutdown when UPS battery low
LOG="/var/log/ups-shutdown.log"
log() {
echo "$(date '+%Y-%m-%d %H:%M:%S') - $1" | tee -a "$LOG"
}
log "=== UPS Shutdown Triggered ==="
log "Battery low - initiating graceful shutdown of VMs/CTs"
# Get list of running VMs (skip TrueNAS for now)
VMS=$(qm list | awk '$3=="running" && $1!=100 {print $1}')
for VMID in $VMS; do
log "Stopping VM $VMID..."
qm shutdown $VMID
done
# Get list of running containers
CTS=$(pct list | awk '$2=="running" {print $1}')
for CTID in $CTS; do
log "Stopping CT $CTID..."
pct shutdown $CTID
done
# Wait for VMs/CTs to stop
log "Waiting 60 seconds for VMs/CTs to shut down..."
sleep 60
# Now stop TrueNAS (storage - must be last)
if qm status 100 | grep -q running; then
log "Stopping TrueNAS (VM 100) last..."
qm shutdown 100
sleep 30
fi
log "All VMs/CTs stopped. Host will remain running until UPS dies."
log "=== UPS Shutdown Complete ==="
```
**Make executable**:
```bash
chmod +x /usr/local/bin/ups-shutdown.sh
```
**Script behavior**:
1. Stops all VMs (except TrueNAS)
2. Stops all containers
3. Waits 60 seconds
4. Stops TrueNAS last (storage must be cleanly unmounted)
5. **Does NOT shut down Proxmox hosts** - intentionally left running
**Why not shut down hosts?**
- BIOS configured to "Restore on AC Power Loss"
- When power returns, servers auto-boot and start VMs in order
- Avoids need for manual intervention
---
## Power Failure Behavior
### When Power Fails
1. **UPS switches to battery** (`OB DISCHRG` status)
2. **NUT monitors runtime** - polls every 5 seconds
3. **At 120 seconds (2 min) remaining**:
- NUT triggers `/usr/local/bin/ups-shutdown.sh` on both servers
- Script gracefully stops all VMs/CTs
- TrueNAS stopped last (storage integrity)
4. **Hosts remain running** until UPS battery depletes
5. **UPS battery dies** → Hosts lose power (ungraceful but safe - VMs already stopped)
### When Power Returns
1. **UPS charges battery**, power returns to servers
2. **BIOS "Restore on AC Power Loss"** boots both servers
3. **Proxmox starts** and auto-starts VMs in configured order:
| Order | Wait | VMs/CTs | Reason |
|-------|------|---------|--------|
| 1 | 30s | TrueNAS (VM 100) | Storage must start first |
| 2 | 60s | Saltbox (VM 101) | Depends on TrueNAS NFS |
| 3 | 10s | fs-dev, homeassistant, lmdev1, copyparty, docker-host | General VMs |
| 4 | 5s | pihole, traefik, findshyt | Containers |
PVE2 VMs: order=1, wait=10s
**Total recovery time**: ~7 minutes from power restoration to fully operational (tested 2025-12-21)
---
## UPS Status Codes
| Code | Meaning | Action |
|------|---------|--------|
| `OL` | Online (AC power) | Normal operation |
| `OB` | On Battery | Power outage - monitor runtime |
| `LB` | Low Battery | <2 min remaining - shutdown imminent |
| `CHRG` | Charging | Battery charging after power restored |
| `DISCHRG` | Discharging | On battery, draining |
| `FSD` | Forced Shutdown | NUT triggered shutdown |
---
## Monitoring & Commands
### Check UPS Status
```bash
# Full status
ssh pve 'upsc cyberpower@localhost'
# Key metrics only
ssh pve 'upsc cyberpower@localhost | grep -E "battery.charge:|battery.runtime:|ups.load:|ups.status:"'
# Example output:
# battery.charge: 100
# battery.runtime: 1234 (seconds remaining)
# ups.load: 33 (% load)
# ups.status: OL (online)
```
### Control UPS Beeper
```bash
# Mute beeper (temporary - until next power event)
ssh pve 'upscmd -u admin -p upsadmin123 cyberpower@localhost beeper.mute'
# Disable beeper (permanent)
ssh pve 'upscmd -u admin -p upsadmin123 cyberpower@localhost beeper.disable'
# Enable beeper
ssh pve 'upscmd -u admin -p upsadmin123 cyberpower@localhost beeper.enable'
```
### Test Shutdown Procedure
**Simulate low battery** (careful - this will shut down VMs!):
```bash
# Set a very high low battery threshold to trigger shutdown
ssh pve 'upsrw -s battery.runtime.low=300 -u admin -p upsadmin123 cyberpower@localhost'
# Watch it trigger (when runtime drops below 300 seconds)
ssh pve 'tail -f /var/log/ups-shutdown.log'
# Reset to normal
ssh pve 'upsrw -s battery.runtime.low=120 -u admin -p upsadmin123 cyberpower@localhost'
```
**Better test**: Run shutdown script manually without actually triggering NUT:
```bash
ssh pve '/usr/local/bin/ups-shutdown.sh'
```
---
## Home Assistant Integration
UPS metrics are exposed to Home Assistant via NUT integration.
### Available Sensors
| Entity ID | Description |
|-----------|-------------|
| `sensor.cyberpower_battery_charge` | Battery % (0-100) |
| `sensor.cyberpower_battery_runtime` | Seconds remaining on battery |
| `sensor.cyberpower_load` | Load % (0-100) |
| `sensor.cyberpower_input_voltage` | Input voltage (V AC) |
| `sensor.cyberpower_output_voltage` | Output voltage (V AC) |
| `sensor.cyberpower_status` | Status text (OL, OB, LB, etc.) |
### Configuration
**Home Assistant**: See [HOMEASSISTANT.md](HOMEASSISTANT.md) for integration setup.
### Example Automations
**Send notification when on battery**:
```yaml
automation:
- alias: "UPS On Battery Alert"
trigger:
- platform: state
entity_id: sensor.cyberpower_status
to: "OB"
action:
- service: notify.mobile_app
data:
message: "⚠️ Power outage! UPS on battery. Runtime: {{ states('sensor.cyberpower_battery_runtime') }}s"
```
**Alert when battery low**:
```yaml
automation:
- alias: "UPS Low Battery Alert"
trigger:
- platform: numeric_state
entity_id: sensor.cyberpower_battery_runtime
below: 300
action:
- service: notify.mobile_app
data:
message: "🚨 UPS battery low! {{ states('sensor.cyberpower_battery_runtime') }}s remaining"
```
---
## Testing Results
### Full Power Failure Test (2025-12-21)
Complete end-to-end test of power failure and recovery:
| Event | Time | Duration | Notes |
|-------|------|----------|-------|
| **Power pulled** | 22:30 | - | UPS on battery, ~15 min runtime at 33% load |
| **Low battery trigger** | 22:40:38 | +10:38 | Runtime < 120s, shutdown script ran |
| **All VMs stopped** | 22:41:36 | +0:58 | Graceful shutdown completed |
| **UPS died** | 22:46:29 | +4:53 | Hosts lost power at 0% battery |
| **Power restored** | ~22:47 | - | Plugged back in |
| **PVE online** | 22:49:11 | +2:11 | BIOS boot, Proxmox started |
| **PVE2 online** | 22:50:47 | +3:47 | BIOS boot, Proxmox started |
| **All VMs running** | 22:53:39 | +6:39 | Auto-started in correct order |
| **Total recovery** | - | **~7 min** | From power return to fully operational |
**Results**:
✅ VMs shut down gracefully
✅ Hosts remained running until UPS died (as intended)
✅ Auto-boot on power restoration worked
✅ VMs started in correct order with appropriate delays
✅ No data corruption or issues
**Runtime calculation**:
- Load: ~33% (440W estimated)
- Total runtime on battery: ~16 minutes (22:30 → 22:46:29)
- Matches manufacturer estimate for 33% load
---
## Proxmox Cluster Quorum Fix
### Problem
With a 2-node cluster, if one node goes down, the other loses quorum and can't manage VMs.
During UPS testing, this would prevent the remaining node from starting VMs after power restoration.
### Solution
Modified `/etc/pve/corosync.conf` to enable 2-node mode:
```
quorum {
provider: corosync_votequorum
two_node: 1
}
```
**Effect**:
- Either node can operate independently if the other is down
- No more waiting for quorum when one server is offline
- Both nodes visible in single Proxmox interface when both up
**Applied**: 2025-12-21
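To confirm the setting is active, check the votequorum flags (a quick check; exact output layout varies by corosync version):
```bash
# Look for "2Node" in the Flags line and "Quorate" status
ssh pve 'pvecm status | grep -E "Flags|Quorum:"'
```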
---
## Maintenance
### Monthly Checks
```bash
# Check UPS status
ssh pve 'upsc cyberpower@localhost'
# Check NUT server running
ssh pve 'systemctl status nut-server'
ssh pve 'systemctl status nut-monitor'
# Check NUT client running (PVE2)
ssh pve2 'systemctl status nut-monitor'
# Verify PVE2 can see UPS
ssh pve2 'upsc cyberpower@10.10.10.120'
# Check logs for errors
ssh pve 'journalctl -u nut-server -n 50'
ssh pve 'journalctl -u nut-monitor -n 50'
```
### Battery Health
**Check battery stats**:
```bash
ssh pve 'upsc cyberpower@localhost | grep battery'
# Key metrics:
# battery.charge: 100 (should be near 100 when on AC)
# battery.runtime: 1200+ (seconds at current load)
# battery.voltage: ~24V (normal for 24V battery system)
```
**Battery replacement**: When runtime significantly decreases or UPS reports `REPLBATT`:
```bash
ssh pve 'upsc cyberpower@localhost | grep battery.mfr.date'
```
CyberPower batteries typically last 3-5 years.
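One way to catch gradual degradation early is to log runtime over time and watch the trend. A minimal sketch, run from cron or a systemd timer on PVE (the log path is an arbitrary choice):
```bash
# Append a timestamped charge/runtime sample for later trend review
echo "$(date '+%F %T') charge=$(upsc cyberpower@localhost battery.charge) runtime=$(upsc cyberpower@localhost battery.runtime)" >> /var/log/ups-battery-history.log
```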
### Firmware Updates
Check CyberPower website for firmware updates:
https://www.cyberpowersystems.com/support/firmware/
---
## Troubleshooting
### UPS Not Detected
```bash
# Check USB connection
ssh pve 'lsusb | grep Cyber'
# Expected:
# Bus 001 Device 003: ID 0764:0501 Cyber Power System, Inc. CP1500 AVR UPS
# Restart NUT driver
ssh pve 'systemctl restart nut-driver'
ssh pve 'systemctl status nut-driver'
```
### PVE2 Can't Connect
```bash
# Verify NUT server listening
ssh pve 'netstat -tuln | grep 3493'
# Should show:
# tcp 0 0 10.10.10.120:3493 0.0.0.0:* LISTEN
# Test connection from PVE2
ssh pve2 'telnet 10.10.10.120 3493'
# Check firewall (should allow port 3493)
ssh pve 'iptables -L -n | grep 3493'
```
### Shutdown Script Not Running
```bash
# Check script permissions
ssh pve 'ls -la /usr/local/bin/ups-shutdown.sh'
# Should be: -rwxr-xr-x (executable)
# Check logs
ssh pve 'cat /var/log/ups-shutdown.log'
# Test script manually
ssh pve '/usr/local/bin/ups-shutdown.sh'
```
### UPS Status Shows UNKNOWN
```bash
# Driver may not be compatible
ssh pve 'upsc cyberpower@localhost ups.status'
# Try different driver (in /etc/nut/ups.conf)
# driver = usbhid-ups
# or
# driver = blazer_usb
# Restart after change
ssh pve 'systemctl restart nut-driver nut-server'
```
---
## Future Improvements
- [ ] Add email alerts for UPS events (power fail, low battery)
- [ ] Log runtime statistics to track battery degradation
- [ ] Set up Grafana dashboard for UPS metrics
- [ ] Test battery runtime at different load levels
- [ ] Upgrade to 20A circuit, restore original 5-20P plug
- [ ] Consider adding network management card for out-of-band UPS access
---
## Related Documentation
- [POWER-MANAGEMENT.md](POWER-MANAGEMENT.md) - Overall power optimization
- [VMS.md](VMS.md) - VM startup order configuration
- [HOMEASSISTANT.md](HOMEASSISTANT.md) - UPS sensor integration
---
**Last Updated**: 2025-12-22

VMS.md Normal file

@@ -0,0 +1,579 @@
# VMs and Containers
Complete inventory of all virtual machines and LXC containers across both Proxmox servers.
## Overview
| Server | VMs | LXCs | Total |
|--------|-----|------|-------|
| **PVE** (10.10.10.120) | 6 | 3 | 9 |
| **PVE2** (10.10.10.102) | 2 | 0 | 2 |
| **Total** | **8** | **3** | **11** |
---
## PVE (10.10.10.120) - Primary Server
### Virtual Machines
| VMID | Name | IP | vCPUs | RAM | Storage | Purpose | GPU/Passthrough | QEMU Agent |
|------|------|-----|-------|-----|---------|---------|-----------------|------------|
| **100** | truenas | 10.10.10.200 | 8 | 32GB | nvme-mirror1 | NAS, central file storage | LSI SAS2308 HBA, Samsung NVMe | ✅ Yes |
| **101** | saltbox | 10.10.10.100 | 16 | 16GB | nvme-mirror1 | Media automation (Plex, *arr) | TITAN RTX | ✅ Yes |
| **105** | fs-dev | 10.10.10.5 | 10 | 8GB | rpool | Development environment | - | ✅ Yes |
| **110** | homeassistant | 10.10.10.110 | 2 | 2GB | rpool | Home automation platform | - | ❌ No |
| **111** | lmdev1 | 10.10.10.111 | 8 | 32GB | nvme-mirror1 | AI/LLM development | TITAN RTX | ✅ Yes |
| **201** | copyparty | 10.10.10.201 | 2 | 2GB | rpool | File sharing service | - | ✅ Yes |
| **206** | docker-host | 10.10.10.206 | 2 | 4GB | rpool | Docker services (Excalidraw, Happy, Pulse) | - | ✅ Yes |
### LXC Containers
| CTID | Name | IP | RAM | Storage | Purpose |
|------|------|-----|-----|---------|---------|
| **200** | pihole | 10.10.10.10 | - | rpool | DNS, ad blocking |
| **202** | traefik | 10.10.10.250 | - | rpool | Reverse proxy (primary) |
| **205** | findshyt | 10.10.10.8 | - | rpool | Custom app |
---
## PVE2 (10.10.10.102) - Secondary Server
### Virtual Machines
| VMID | Name | IP | vCPUs | RAM | Storage | Purpose | GPU/Passthrough | QEMU Agent |
|------|------|-----|-------|-----|---------|---------|-----------------|------------|
| **300** | gitea-vm | 10.10.10.220 | 2 | 4GB | nvme-mirror3 | Git server (Gitea) | - | ✅ Yes |
| **301** | trading-vm | 10.10.10.221 | 16 | 32GB | nvme-mirror3 | AI trading platform | RTX A6000 | ✅ Yes |
### LXC Containers
None on PVE2.
---
## VM Details
### 100 - TrueNAS (Storage Server)
**Purpose**: Central NAS for all file storage, NFS/SMB shares, and media libraries
**Specs**:
- **OS**: TrueNAS SCALE
- **vCPUs**: 8
- **RAM**: 32 GB
- **Storage**: nvme-mirror1 (OS), EMC storage enclosure (data pool via HBA passthrough)
- **Network**:
- Primary: 10 Gb (vmbr2)
- Secondary: Internal storage network (vmbr3 @ 10.10.20.x)
**Hardware Passthrough**:
- LSI SAS2308 HBA (for EMC enclosure drives)
- Samsung NVMe (for ZFS caching)
**ZFS Pools**:
- `vault`: Main storage pool on EMC drives
- Boot pool on passed-through NVMe
**See**: [STORAGE.md](STORAGE.md), [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md)
---
### 101 - Saltbox (Media Automation)
**Purpose**: Media server stack - Plex, Sonarr, Radarr, SABnzbd, Overseerr, etc.
**Specs**:
- **OS**: Ubuntu 22.04
- **vCPUs**: 16
- **RAM**: 16 GB
- **Storage**: nvme-mirror1
- **Network**: 10 Gb (vmbr2)
**GPU Passthrough**:
- NVIDIA TITAN RTX (for Plex hardware transcoding)
**Services**:
- Plex Media Server (plex.htsn.io)
- Sonarr, Radarr, Lidarr (TV/movie/music automation)
- SABnzbd, NZBGet (downloaders)
- Overseerr (request management)
- Tautulli (Plex stats)
- Organizr (dashboard)
- Authelia (SSO authentication)
- Traefik (reverse proxy - separate from CT 202)
**Managed By**: Saltbox Ansible playbooks
**See**: [SALTBOX.md](#) (coming soon)
---
### 105 - fs-dev (Development Environment)
**Purpose**: General development work, testing, prototyping
**Specs**:
- **OS**: Ubuntu 22.04
- **vCPUs**: 10
- **RAM**: 8 GB
- **Storage**: rpool
- **Network**: 1 Gb (vmbr0)
---
### 110 - Home Assistant (Home Automation)
**Purpose**: Smart home automation platform
**Specs**:
- **OS**: Home Assistant OS
- **vCPUs**: 2
- **RAM**: 2 GB
- **Storage**: rpool
- **Network**: 1 Gb (vmbr0)
**Access**:
- Web UI: https://homeassistant.htsn.io
- API: See [HOMEASSISTANT.md](HOMEASSISTANT.md)
**Special Notes**:
- ❌ No QEMU agent (Home Assistant OS doesn't support it)
- No SSH server by default (access via web terminal)
---
### 111 - lmdev1 (AI/LLM Development)
**Purpose**: AI model development, fine-tuning, inference
**Specs**:
- **OS**: Ubuntu 22.04
- **vCPUs**: 8
- **RAM**: 32 GB
- **Storage**: nvme-mirror1
- **Network**: 1 Gb (vmbr0)
**GPU Passthrough**:
- NVIDIA TITAN RTX (shared with Saltbox, but can be dedicated if needed)
**Installed**:
- CUDA toolkit
- Python 3.11+
- PyTorch, TensorFlow
- Hugging Face transformers
---
### 201 - Copyparty (File Sharing)
**Purpose**: Simple HTTP file sharing server
**Specs**:
- **OS**: Ubuntu 22.04
- **vCPUs**: 2
- **RAM**: 2 GB
- **Storage**: rpool
- **Network**: 1 Gb (vmbr0)
**Access**: https://copyparty.htsn.io
---
### 206 - docker-host (Docker Services)
**Purpose**: General-purpose Docker host for miscellaneous services
**Specs**:
- **OS**: Ubuntu 22.04
- **vCPUs**: 2
- **RAM**: 4 GB
- **Storage**: rpool
- **Network**: 1 Gb (vmbr0)
- **CPU**: `host` passthrough (for x86-64-v3 support)
**Services Running**:
- Excalidraw (excalidraw.htsn.io) - Whiteboard
- Happy Coder relay server (happy.htsn.io) - Self-hosted relay for Happy Coder mobile app
- Pulse (pulse.htsn.io) - Monitoring dashboard
**Docker Compose Files**: `/opt/*/docker-compose.yml`
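To see at a glance what is deployed and running there (a minimal sketch, assuming an SSH alias `docker-host` and the Compose v2 plugin):
```bash
# Iterate over /opt/*/docker-compose.yml projects and show container status for each
ssh docker-host 'for d in /opt/*/; do
  [ -f "$d/docker-compose.yml" ] || continue
  echo "== $d"
  (cd "$d" && docker compose ps)
done'
```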
---
### 300 - gitea-vm (Git Server)
**Purpose**: Self-hosted Git server
**Specs**:
- **OS**: Ubuntu 22.04
- **vCPUs**: 2
- **RAM**: 4 GB
- **Storage**: nvme-mirror3 (PVE2)
- **Network**: 1 Gb (vmbr0)
**Access**: https://git.htsn.io
**Repositories**:
- homelab-docs (this documentation)
- Personal projects
- Private repos
---
### 301 - trading-vm (AI Trading Platform)
**Purpose**: Algorithmic trading system with AI models
**Specs**:
- **OS**: Ubuntu 22.04
- **vCPUs**: 16
- **RAM**: 32 GB
- **Storage**: nvme-mirror3 (PVE2)
- **Network**: 1 Gb (vmbr0)
**GPU Passthrough**:
- NVIDIA RTX A6000 (300W TDP, 48GB VRAM)
**Software**:
- Trading algorithms
- AI models for market prediction
- Real-time data feeds
- Backtesting infrastructure
---
## LXC Container Details
### 200 - Pi-hole (DNS & Ad Blocking)
**Purpose**: Network-wide DNS server and ad blocker
**Type**: LXC (unprivileged)
**OS**: Ubuntu 22.04
**IP**: 10.10.10.10
**Storage**: rpool
**Access**:
- Web UI: http://10.10.10.10/admin
- Public URL: https://pihole.htsn.io
**Configuration**:
- Upstream DNS: Cloudflare (1.1.1.1)
- DHCP: Disabled (router handles DHCP)
- Interface: All interfaces
**Usage**: Set router DNS to 10.10.10.10 for network-wide ad blocking
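A quick functional check from any LAN machine (the blocked-domain response depends on Pi-hole's blocking mode, typically `0.0.0.0`):
```bash
# Normal resolution through Pi-hole
dig @10.10.10.10 example.com +short
# A known ad/tracking domain should return the block address (or no answer)
dig @10.10.10.10 doubleclick.net +short
```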
---
### 202 - Traefik (Reverse Proxy)
**Purpose**: Primary reverse proxy for all public-facing services
**Type**: LXC (unprivileged)
**OS**: Ubuntu 22.04
**IP**: 10.10.10.250
**Storage**: rpool
**Configuration**: `/etc/traefik/`
**Dynamic Configs**: `/etc/traefik/conf.d/*.yaml`
**See**: [TRAEFIK.md](TRAEFIK.md) for complete documentation
**⚠️ Important**: This is the PRIMARY Traefik instance. Do NOT confuse with Saltbox's Traefik (VM 101).
---
### 205 - FindShyt (Custom App)
**Purpose**: Custom application (details TBD)
**Type**: LXC (unprivileged)
**OS**: Ubuntu 22.04
**IP**: 10.10.10.8
**Storage**: rpool
**Access**: https://findshyt.htsn.io
---
## VM Startup Order & Dependencies
### Power-On Sequence
When servers boot (after power failure or restart), VMs/CTs start in this order:
#### PVE (10.10.10.120)
| Order | Wait | VMID | Name | Reason |
|-------|------|------|------|--------|
| **1** | 30s | 100 | TrueNAS | ⚠️ Storage must start first - other VMs depend on NFS |
| **2** | 60s | 101 | Saltbox | Depends on TrueNAS NFS mounts for media |
| **3** | 10s | 105, 110, 111, 201, 206 | Other VMs | General VMs, no critical dependencies |
| **4** | 5s | 200, 202, 205 | Containers | Lightweight, start quickly |
**Configure startup order** (already set):
```bash
# View current config
ssh pve 'qm config 100 | grep -E "startup|onboot"'
# Set startup order (example)
ssh pve 'qm set 100 --onboot 1 --startup order=1,up=30'
ssh pve 'qm set 101 --onboot 1 --startup order=2,up=60'
```
#### PVE2 (10.10.10.102)
| Order | Wait | VMID | Name |
|-------|------|------|------|
| **1** | 10s | 300, 301 | All VMs |
**Less critical** - no dependencies between PVE2 VMs.
---
## Resource Allocation Summary
### Total Allocated (PVE)
| Resource | Allocated | Physical | % Used |
|----------|-----------|----------|--------|
| **vCPUs** | 56 | 64 (32 cores × 2 threads) | 88% |
| **RAM** | 98 GB | 128 GB | 77% |
**Note**: vCPU overcommit is acceptable (VMs rarely use all cores simultaneously)
### Total Allocated (PVE2)
| Resource | Allocated | Physical | % Used |
|----------|-----------|----------|--------|
| **vCPUs** | 18 | 64 | 28% |
| **RAM** | 36 GB | 128 GB | 28% |
**PVE2** has significant headroom for additional VMs.
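These totals can be re-derived from the VM configs when allocations change. A minimal sketch (VMs only, containers excluded; run directly on the Proxmox host; assumes single-socket VMs so vCPUs = cores):
```bash
# Sum allocated cores and RAM (MB) across all VMs on this node
for id in $(qm list | awk 'NR>1 {print $1}'); do
  qm config "$id" | awk -F': ' 'BEGIN{c=1; m=512} /^cores:/ {c=$2} /^memory:/ {m=$2} END {print c, m}'
done | awk '{cores+=$1; mem+=$2} END {printf "Allocated: %d vCPUs, %.1f GB RAM\n", cores, mem/1024}'
```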
---
## Adding a New VM
### Quick Template
```bash
# Create VM
ssh pve 'qm create VMID \
--name myvm \
--memory 4096 \
--cores 2 \
--net0 virtio,bridge=vmbr0 \
--scsihw virtio-scsi-pci \
--scsi0 nvme-mirror1:32 \
--boot order=scsi0 \
--ostype l26 \
--agent enabled=1'
# Attach ISO for installation
ssh pve 'qm set VMID --ide2 local:iso/ubuntu-22.04.iso,media=cdrom'
# Start VM
ssh pve 'qm start VMID'
# Access console
ssh pve 'qm vncproxy VMID' # Then connect with VNC client
# Or via Proxmox web UI
```
### Cloud-Init Template (Faster)
Use cloud-init for automated VM deployment:
```bash
# Download cloud image
ssh pve 'wget https://cloud-images.ubuntu.com/releases/22.04/release/ubuntu-22.04-server-cloudimg-amd64.img -O /var/lib/vz/template/iso/ubuntu-22.04-cloud.img'
# Create VM
ssh pve 'qm create VMID --name myvm --memory 4096 --cores 2 --net0 virtio,bridge=vmbr0'
# Import disk
ssh pve 'qm importdisk VMID /var/lib/vz/template/iso/ubuntu-22.04-cloud.img nvme-mirror1'
# Attach disk
ssh pve 'qm set VMID --scsi0 nvme-mirror1:vm-VMID-disk-0'
# Add cloud-init drive
ssh pve 'qm set VMID --ide2 nvme-mirror1:cloudinit'
# Set boot disk
ssh pve 'qm set VMID --boot order=scsi0'
# Configure cloud-init (user, SSH key, network)
ssh pve 'qm set VMID --ciuser hutson --sshkeys ~/.ssh/homelab.pub --ipconfig0 ip=10.10.10.XXX/24,gw=10.10.10.1'
# Enable QEMU agent
ssh pve 'qm set VMID --agent enabled=1'
# Resize disk (cloud images are small by default)
ssh pve 'qm resize VMID scsi0 +30G'
# Start VM
ssh pve 'qm start VMID'
```
**Cloud-init VMs boot ready-to-use** with SSH keys, static IP, and user configured.
---
## Adding a New LXC Container
```bash
# Download template (if not already downloaded)
ssh pve 'pveam update'
ssh pve 'pveam available | grep ubuntu'
ssh pve 'pveam download local ubuntu-22.04-standard_22.04-1_amd64.tar.zst'
# Create container
ssh pve 'pct create CTID local:vztmpl/ubuntu-22.04-standard_22.04-1_amd64.tar.zst \
--hostname mycontainer \
--memory 2048 \
--cores 2 \
--net0 name=eth0,bridge=vmbr0,ip=10.10.10.XXX/24,gw=10.10.10.1 \
--rootfs local-zfs:8 \
--unprivileged 1 \
--features nesting=1 \
--start 1'
# Set root password
ssh pve 'pct exec CTID -- passwd'
# Add SSH key
ssh pve 'pct exec CTID -- mkdir -p /root/.ssh'
ssh pve 'pct exec CTID -- bash -c "echo \"$(cat ~/.ssh/homelab.pub)\" >> /root/.ssh/authorized_keys"'
ssh pve 'pct exec CTID -- bash -c "chmod 700 /root/.ssh && chmod 600 /root/.ssh/authorized_keys"'
```
---
## GPU Passthrough Configuration
### Current GPU Assignments
| GPU | Location | Passed To | VMID | Purpose |
|-----|----------|-----------|------|---------|
| **NVIDIA Quadro P2000** | PVE | - | - | Proxmox host (Plex transcoding via driver) |
| **NVIDIA TITAN RTX** | PVE | saltbox, lmdev1 | 101, 111 | Media transcoding + AI dev (shared) |
| **NVIDIA RTX A6000** | PVE2 | trading-vm | 301 | AI trading (dedicated) |
### How to Pass GPU to VM
1. **Identify GPU PCI ID**:
```bash
ssh pve 'lspci | grep -i nvidia'
# Example output:
# 81:00.0 VGA compatible controller: NVIDIA Corporation TU102 [TITAN RTX] (rev a1)
# 81:00.1 Audio device: NVIDIA Corporation TU102 High Definition Audio Controller (rev a1)
```
2. **Pass GPU to VM** (include both VGA and Audio):
```bash
ssh pve 'qm set VMID -hostpci0 81:00.0,pcie=1'
# If multi-function device (GPU + Audio), use:
ssh pve 'qm set VMID -hostpci0 81:00,pcie=1'
```
3. **Configure VM for GPU**:
```bash
# Set machine type to q35
ssh pve 'qm set VMID --machine q35'
# Set BIOS to OVMF (UEFI)
ssh pve 'qm set VMID --bios ovmf'
# Add EFI disk
ssh pve 'qm set VMID --efidisk0 nvme-mirror1:1,format=raw,efitype=4m,pre-enrolled-keys=1'
```
4. **Reboot VM** and install NVIDIA drivers inside the VM
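A quick way to verify the passthrough after the reboot (adjust the PCI address to match your GPU):
```bash
# Inside the VM: the GPU should appear on the PCI bus and the driver should load
lspci | grep -i nvidia
nvidia-smi
# On the host: while the VM runs, the device should be bound to vfio-pci, not nvidia/nouveau
ssh pve 'lspci -nnk -s 81:00.0 | grep "Kernel driver in use"'
```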
**See**: [GPU-PASSTHROUGH.md](#) (coming soon) for detailed guide
---
## Backup Priority
See [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md) for complete backup plan.
### Critical VMs (Must Backup)
| Priority | VMID | Name | Reason |
|----------|------|------|--------|
| 🔴 **CRITICAL** | 100 | truenas | All storage lives here - catastrophic if lost |
| 🟡 **HIGH** | 101 | saltbox | Complex media stack config |
| 🟡 **HIGH** | 110 | homeassistant | Home automation config |
| 🟡 **HIGH** | 300 | gitea-vm | Git repositories (code, docs) |
| 🟡 **HIGH** | 301 | trading-vm | Trading algorithms and AI models |
### Medium Priority
| VMID | Name | Notes |
|------|------|-------|
| 200 | pihole | Easy to rebuild, but DNS config valuable |
| 202 | traefik | Config files backed up separately |
### Low Priority (Ephemeral/Rebuildable)
| VMID | Name | Notes |
|------|------|-------|
| 105 | fs-dev | Development - code is in Git |
| 111 | lmdev1 | Ephemeral development |
| 201 | copyparty | Simple app, easy to redeploy |
| 206 | docker-host | Docker Compose files backed up separately |
---
## Quick Reference Commands
```bash
# List all VMs
ssh pve 'qm list'
ssh pve2 'qm list'
# List all containers
ssh pve 'pct list'
# Start/stop VM
ssh pve 'qm start VMID'
ssh pve 'qm stop VMID'
ssh pve 'qm shutdown VMID' # Graceful
# Start/stop container
ssh pve 'pct start CTID'
ssh pve 'pct stop CTID'
ssh pve 'pct shutdown CTID' # Graceful
# VM console
ssh pve 'qm terminal VMID'
# Container console
ssh pve 'pct enter CTID'
# Clone VM
ssh pve 'qm clone VMID NEW_VMID --name newvm'
# Delete VM
ssh pve 'qm destroy VMID'
# Delete container
ssh pve 'pct destroy CTID'
```
---
## Related Documentation
- [STORAGE.md](STORAGE.md) - Storage pool assignments
- [SSH-ACCESS.md](SSH-ACCESS.md) - How to access VMs
- [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md) - VM backup strategy
- [POWER-MANAGEMENT.md](POWER-MANAGEMENT.md) - VM resource optimization
- [NETWORK.md](NETWORK.md) - Which bridge to use for new VMs
---
**Last Updated**: 2025-12-22