From 56b82df4972240bb2471df6cdef0e959c98bbba9 Mon Sep 17 00:00:00 2001 From: Hutson Date: Tue, 23 Dec 2025 00:34:21 -0500 Subject: [PATCH] Complete Phase 2 documentation: Add HARDWARE, SERVICES, MONITORING, MAINTENANCE MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Phase 2 documentation implementation: - Created HARDWARE.md: Complete hardware inventory (servers, GPUs, storage, network cards) - Created SERVICES.md: Service inventory with URLs, credentials, health checks (25+ services) - Created MONITORING.md: Health monitoring recommendations, alert setup, implementation plan - Created MAINTENANCE.md: Regular procedures, update schedules, testing checklists - Updated README.md: Added all Phase 2 documentation links - Updated CLAUDE.md: Cleaned up to quick reference only (1340→377 lines) All detailed content now in specialized documentation files with cross-references. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 --- BACKUP-STRATEGY.md | 358 ++++++++++++ CLAUDE.md | 1273 ++++++++----------------------------------- HARDWARE.md | 455 ++++++++++++++++ HOMEASSISTANT.md | 36 ++ MAINTENANCE.md | 618 +++++++++++++++++++++ MONITORING.md | 546 +++++++++++++++++++ POWER-MANAGEMENT.md | 509 +++++++++++++++++ README.md | 148 +++++ SERVICES.md | 591 ++++++++++++++++++++ SSH-ACCESS.md | 464 ++++++++++++++++ STORAGE.md | 510 +++++++++++++++++ TRAEFIK.md | 672 +++++++++++++++++++++++ UPS.md | 605 ++++++++++++++++++++ VMS.md | 579 ++++++++++++++++++++ 14 files changed, 6328 insertions(+), 1036 deletions(-) create mode 100644 BACKUP-STRATEGY.md create mode 100644 HARDWARE.md create mode 100644 MAINTENANCE.md create mode 100644 MONITORING.md create mode 100644 POWER-MANAGEMENT.md create mode 100644 README.md create mode 100644 SERVICES.md create mode 100644 SSH-ACCESS.md create mode 100644 STORAGE.md create mode 100644 TRAEFIK.md create mode 100644 UPS.md create mode 100644 VMS.md diff --git a/BACKUP-STRATEGY.md b/BACKUP-STRATEGY.md new file mode 100644 index 0000000..d1d93b8 --- /dev/null +++ b/BACKUP-STRATEGY.md @@ -0,0 +1,358 @@ +# Backup Strategy + +## 🚨 Current Status: CRITICAL GAPS IDENTIFIED + +This document outlines the backup strategy for the homelab infrastructure. **As of 2025-12-22, there are significant gaps in backup coverage that need to be addressed.** + +## Executive Summary + +### What We Have ✅ +- **Syncthing**: File synchronization across 5+ devices +- **ZFS on TrueNAS**: Copy-on-write filesystem with snapshot capability (not yet configured) +- **Proxmox**: Built-in backup capabilities (not yet configured) + +### What We DON'T Have 🚨 +- ❌ No documented VM/CT backups +- ❌ No ZFS snapshot schedule +- ❌ No offsite backups +- ❌ No disaster recovery plan +- ❌ No tested restore procedures +- ❌ No configuration backups + +**Risk Level**: HIGH - A catastrophic failure could result in significant data loss. + +--- + +## Current State Analysis + +### Syncthing (File Synchronization) + +**What it is**: Real-time file sync across devices +**What it is NOT**: A backup solution + +| Folder | Devices | Size | Protected? 
| +|--------|---------|------|------------| +| documents | Mac Mini, MacBook, TrueNAS, Windows PC, Phone | 11 GB | ⚠️ Sync only | +| downloads | Mac Mini, TrueNAS | 38 GB | ⚠️ Sync only | +| pictures | Mac Mini, MacBook, TrueNAS, Phone | Unknown | ⚠️ Sync only | +| notes | Mac Mini, MacBook, TrueNAS, Phone | Unknown | ⚠️ Sync only | +| config | Mac Mini, MacBook, TrueNAS | Unknown | ⚠️ Sync only | + +**Limitations**: +- ❌ Accidental deletion → deleted everywhere +- ❌ Ransomware/corruption → spreads everywhere +- ❌ No point-in-time recovery +- ❌ No version history (unless file versioning enabled - not documented) + +**Verdict**: Syncthing provides redundancy and availability, NOT backup protection. + +### ZFS on TrueNAS (Potential Backup Target) + +**Current Status**: ❓ Unknown - snapshots may or may not be configured + +**Needs Investigation**: +```bash +# Check if snapshots exist +ssh truenas 'zfs list -t snapshot' + +# Check if automated snapshots are configured +ssh truenas 'cat /etc/cron.d/zfs-auto-snapshot' || echo "Not configured" + +# Check snapshot schedule via TrueNAS API/UI +``` + +**If configured**, ZFS snapshots provide: +- ✅ Point-in-time recovery +- ✅ Protection against accidental deletion +- ✅ Fast rollback capability +- ⚠️ Still single location (no offsite protection) + +### Proxmox VM/CT Backups + +**Current Status**: ❓ Unknown - no backup jobs documented + +**Needs Investigation**: +```bash +# Check backup configuration +ssh pve 'pvesh get /cluster/backup' + +# Check if any backups exist +ssh pve 'ls -lh /var/lib/vz/dump/' +ssh pve2 'ls -lh /var/lib/vz/dump/' +``` + +**Critical VMs Needing Backup**: +| VM/CT | VMID | Priority | Notes | +|-------|------|----------|-------| +| TrueNAS | 100 | 🔴 CRITICAL | All storage lives here | +| Saltbox | 101 | 🟡 HIGH | Media stack, complex config | +| homeassistant | 110 | 🟡 HIGH | Home automation config | +| gitea-vm | 300 | 🟡 HIGH | Git repositories | +| pihole | 200 | 🟢 MEDIUM | DNS config (easy to rebuild) | +| traefik | 202 | 🟢 MEDIUM | Reverse proxy config | +| trading-vm | 301 | 🟡 HIGH | AI trading platform | +| lmdev1 | 111 | 🟢 LOW | Development (ephemeral) | + +--- + +## Recommended Backup Strategy + +### Tier 1: Local Snapshots (IMPLEMENT IMMEDIATELY) + +**ZFS Snapshots on TrueNAS** + +Schedule automatic snapshots for all datasets: + +| Dataset | Frequency | Retention | +|---------|-----------|-----------| +| vault/documents | Every 15 min | 1 hour | +| vault/documents | Hourly | 24 hours | +| vault/documents | Daily | 30 days | +| vault/documents | Weekly | 12 weeks | +| vault/documents | Monthly | 12 months | + +**Implementation**: +```bash +# Via TrueNAS UI: Storage → Snapshots → Add +# Or via CLI: +ssh truenas 'zfs snapshot vault/documents@daily-$(date +%Y%m%d)' +``` + +**Proxmox VM Backups** + +Configure weekly backups to local storage: + +```bash +# Create backup job via Proxmox UI: +# Datacenter → Backup → Add +# - Schedule: Weekly (Sunday 2 AM) +# - Storage: local-zfs or nvme-mirror1 +# - Mode: Snapshot (fast) +# - Retention: 4 backups +``` + +**Or via CLI**: +```bash +ssh pve 'pvesh create /cluster/backup --schedule "sun 02:00" --storage local-zfs --mode snapshot --prune-backups keep-last=4' +``` + +### Tier 2: Offsite Backups (CRITICAL GAP) + +**Option A: Cloud Storage (Recommended)** + +Use **rclone** or **restic** to sync critical data to cloud: + +| Provider | Cost | Pros | Cons | +|----------|------|------|------| +| Backblaze B2 | $6/TB/mo | Cheap, reliable | Egress fees | +| AWS S3 Glacier | $4/TB/mo | Very 
cheap storage | Slow retrieval | +| Wasabi | $6.99/TB/mo | No egress fees | Minimum 90-day retention | + +**Implementation Example (Backblaze B2)**: +```bash +# Install on TrueNAS +ssh truenas 'pkg install rclone restic' + +# Configure B2 +rclone config # Follow prompts for B2 + +# Daily backup critical folders +0 3 * * * rclone sync /mnt/vault/documents b2:homelab-backup/documents --transfers 4 +``` + +**Option B: Offsite TrueNAS Replication** + +- Set up second TrueNAS at friend/family member's house +- Use ZFS replication to sync snapshots +- Requires: Static IP or Tailscale, trust + +**Option C: USB Drive Rotation** + +- Weekly backup to external USB drive +- Rotate 2-3 drives (one always offsite) +- Manual but simple + +### Tier 3: Configuration Backups + +**Proxmox Configuration** + +```bash +# Backup /etc/pve (configs are already in cluster filesystem) +# But also backup to external location: +ssh pve 'tar czf /tmp/pve-config-$(date +%Y%m%d).tar.gz /etc/pve /etc/network/interfaces /etc/systemd/system/*.service' + +# Copy to safe location +scp pve:/tmp/pve-config-*.tar.gz ~/Backups/proxmox/ +``` + +**VM-Specific Configs** + +- Traefik configs: `/etc/traefik/` on CT 202 +- Saltbox configs: `/srv/git/saltbox/` on VM 101 +- Home Assistant: `/config/` on VM 110 + +**Script to backup all configs**: +```bash +#!/bin/bash +# Save as ~/bin/backup-homelab-configs.sh + +DATE=$(date +%Y%m%d) +BACKUP_DIR=~/Backups/homelab-configs/$DATE + +mkdir -p $BACKUP_DIR + +# Proxmox configs +ssh pve 'tar czf -' /etc/pve /etc/network > $BACKUP_DIR/pve-config.tar.gz +ssh pve2 'tar czf -' /etc/pve /etc/network > $BACKUP_DIR/pve2-config.tar.gz + +# Traefik +ssh pve 'pct exec 202 -- tar czf -' /etc/traefik > $BACKUP_DIR/traefik-config.tar.gz + +# Saltbox +ssh saltbox 'tar czf -' /srv/git/saltbox > $BACKUP_DIR/saltbox-config.tar.gz + +# Home Assistant +ssh pve 'qm guest exec 110 -- tar czf -' /config > $BACKUP_DIR/homeassistant-config.tar.gz + +echo "Configs backed up to $BACKUP_DIR" +``` + +--- + +## Disaster Recovery Scenarios + +### Scenario 1: Single VM Failure + +**Impact**: Medium +**Recovery Time**: 30-60 minutes + +1. Restore from Proxmox backup: + ```bash + ssh pve 'qmrestore /path/to/backup.vma.zst VMID' + ``` +2. Start VM and verify +3. 
Update IP if needed + +### Scenario 2: TrueNAS Failure + +**Impact**: CATASTROPHIC (all storage lost) +**Recovery Time**: Unknown - NO PLAN + +**Current State**: 🚨 NO RECOVERY PLAN +**Needed**: +- Offsite backup of critical datasets +- Documented ZFS pool creation steps +- Share configuration export + +### Scenario 3: Complete PVE Server Failure + +**Impact**: SEVERE +**Recovery Time**: 4-8 hours + +**Current State**: ⚠️ PARTIALLY RECOVERABLE +**Needed**: +- VM backups stored on TrueNAS or PVE2 +- Proxmox reinstall procedure +- Network config documentation + +### Scenario 4: Complete Site Disaster (Fire/Flood) + +**Impact**: TOTAL LOSS +**Recovery Time**: Unknown + +**Current State**: 🚨 NO RECOVERY PLAN +**Needed**: +- Offsite backups (cloud or physical) +- Critical data prioritization +- Restore procedures + +--- + +## Action Plan + +### Immediate (Next 7 Days) + +- [ ] **Audit existing backups**: Check if ZFS snapshots or Proxmox backups exist + ```bash + ssh truenas 'zfs list -t snapshot' + ssh pve 'ls -lh /var/lib/vz/dump/' + ``` + +- [ ] **Enable ZFS snapshots**: Configure via TrueNAS UI for critical datasets + +- [ ] **Configure Proxmox backup jobs**: Weekly backups of critical VMs (100, 101, 110, 300) + +- [ ] **Test restore**: Pick one VM, back it up, restore it to verify process works + +### Short-term (Next 30 Days) + +- [ ] **Set up offsite backup**: Choose provider (Backblaze B2 recommended) + +- [ ] **Install backup tools**: rclone or restic on TrueNAS + +- [ ] **Configure daily cloud sync**: Critical folders to cloud storage + +- [ ] **Document restore procedures**: Step-by-step guides for each scenario + +### Long-term (Next 90 Days) + +- [ ] **Implement monitoring**: Alerts for backup failures + +- [ ] **Quarterly restore test**: Verify backups actually work + +- [ ] **Backup rotation policy**: Automate old backup cleanup + +- [ ] **Configuration backup automation**: Weekly cron job + +--- + +## Monitoring & Validation + +### Backup Health Checks + +```bash +# Check last ZFS snapshot +ssh truenas 'zfs list -t snapshot -o name,creation -s creation | tail -5' + +# Check Proxmox backup status +ssh pve 'pvesh get /cluster/backup-info/not-backed-up' + +# Check cloud sync status (if using rclone) +ssh truenas 'rclone ls b2:homelab-backup | wc -l' +``` + +### Alerts to Set Up + +- Email alert if no snapshot created in 24 hours +- Email alert if Proxmox backup fails +- Email alert if cloud sync fails +- Weekly backup status report + +--- + +## Cost Estimate + +**Monthly Backup Costs**: + +| Component | Cost | Notes | +|-----------|------|-------| +| Local storage (already owned) | $0 | Using existing TrueNAS | +| Proxmox backups (local) | $0 | Using existing storage | +| Cloud backup (1 TB) | $6-10/mo | Backblaze B2 or Wasabi | +| **Total** | **~$10/mo** | Minimal cost for peace of mind | + +**One-time**: +- External USB drives (3x 4TB) | ~$300 | Optional, for rotation backup + +--- + +## Related Documentation + +- [STORAGE.md](STORAGE.md) - ZFS pool layouts and capacity +- [VMS.md](VMS.md) - VM inventory and prioritization +- [DISASTER-RECOVERY.md](#) - Recovery procedures (coming soon) + +--- + +**Last Updated**: 2025-12-22 +**Status**: 🚨 CRITICAL GAPS - IMMEDIATE ACTION REQUIRED diff --git a/CLAUDE.md b/CLAUDE.md index dcc8ba9..7c805c5 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -1,1176 +1,377 @@ -# Homelab Infrastructure +# Homelab Infrastructure - Quick Reference + +**Start here**: [README.md](README.md) - Documentation index and overview + +This is your **quick reference 
guide** for common homelab tasks. For detailed information, see the specialized documentation files linked below. + +--- ## Quick Reference - Common Tasks -| Task | Section | Quick Command | -|------|---------|---------------| -| **Add new public service** | [Reverse Proxy](#reverse-proxy-architecture-traefik) | Create Traefik config + Cloudflare DNS | -| **Add Cloudflare DNS** | [Cloudflare API](#cloudflare-api-access) | `curl -X POST cloudflare.com/...` | +| Task | Documentation | Quick Command | +|------|--------------|---------------| +| **Add new public service** | [TRAEFIK.md](TRAEFIK.md) | Create Traefik config + Cloudflare DNS | +| **Check UPS status** | [UPS.md](UPS.md) | `ssh pve 'upsc cyberpower@localhost'` | | **Check server temps** | [Temperature Check](#server-temperature-check) | `ssh pve 'grep Tctl ...'` | -| **Syncthing issues** | [Troubleshooting](#troubleshooting-runbooks) | Check API connections | -| **SSL cert issues** | [Traefik DNS Challenge](#ssl-certificates) | Use `cloudflare` resolver | +| **Syncthing issues** | [SYNCTHING.md](SYNCTHING.md) | Check API connections | +| **VM/CT management** | [VMS.md](VMS.md) | `ssh pve 'qm list'` | +| **Storage issues** | [STORAGE.md](STORAGE.md) | `ssh pve 'zpool status'` | +| **SSH access** | [SSH-ACCESS.md](SSH-ACCESS.md) | Use host aliases in `~/.ssh/config` | +| **Power optimization** | [POWER-MANAGEMENT.md](POWER-MANAGEMENT.md) | CPU governors, GPU states | +| **Backup strategy** | [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md) | ⚠️ CRITICAL GAPS | -**Key Credentials (see sections for full details):** -- Cloudflare: `cloudflare@htsn.io` / API Key in [Cloudflare API](#cloudflare-api-access) +**Key Credentials:** - SSH Password: `GrilledCh33s3#` -- Traefik: CT 202 @ 10.10.10.250 +- Cloudflare: `cloudflare@htsn.io` / `849ebefd163d2ccdec25e49b3e1b3fe2cdadc` +- See individual docs for service-specific credentials --- ## Role -You are the **Homelab Assistant** - a Claude Code session dedicated to managing and maintaining Hutson's home infrastructure. Your responsibilities include: +You are the **Homelab Assistant** - a Claude Code session dedicated to managing and maintaining Hutson's home infrastructure. -- **Infrastructure Management**: Proxmox servers, VMs, containers, networking -- **File Sync**: Syncthing configuration across all devices (Mac Mini, MacBook, Windows PC, TrueNAS, Android) -- **Network Administration**: Router config, SSH access, Tailscale, device management -- **Power Optimization**: CPU governors, GPU power states, service tuning -- **Documentation**: Keep CLAUDE.md, SYNCTHING.md, and SHELL-ALIASES.md up to date -- **Automation**: Shell aliases, startup scripts, scheduled tasks +**Responsibilities:** +- Infrastructure Management (Proxmox, VMs, containers) +- File Sync (Syncthing across all devices) +- Network Administration +- Power Optimization +- Documentation (keep all docs current) +- Automation (shell aliases, scripts, scheduled tasks) -You have full access to all homelab devices via SSH and APIs. Use this context to help troubleshoot, configure, and optimize the infrastructure. 
+**Full access via**: SSH keys, APIs, QEMU guest agent -### Proactive Behaviors +--- -When the user mentions issues or asks questions, proactively: -- **"sync not working"** → Check Syncthing status on ALL devices, identify which is offline -- **"device offline"** → Ping both local and Tailscale IPs, check if service is running -- **"slow"** → Check CPU usage, running processes, Syncthing rescan activity +## Proactive Behaviors + +When the user mentions issues or asks questions: +- **"sync not working"** → Check Syncthing on ALL devices, identify which is offline +- **"device offline"** → Ping local + Tailscale IPs, check if service running +- **"slow"** → Check CPU usage, processes, Syncthing rescan activity - **"check status"** → Run full health check across all systems -- **"something's wrong"** → Run diagnostics on likely culprits based on context +- **"something's wrong"** → Run diagnostics on likely culprits -### Quick Health Checks +--- -Run these to get a quick overview of the homelab: +## Quick Health Checks ```bash # === FULL HEALTH CHECK === + # Syncthing connections (Mac Mini) -curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" "http://127.0.0.1:8384/rest/system/connections" | python3 -c "import sys,json; d=json.load(sys.stdin)['connections']; [print(f\"{v.get('name',k[:7])}: {'UP' if v['connected'] else 'DOWN'}\") for k,v in d.items()]" +curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \ + "http://127.0.0.1:8384/rest/system/connections" | \ + python3 -c "import sys,json; d=json.load(sys.stdin)['connections']; \ + [print(f\"{v.get('name',k[:7])}: {'UP' if v['connected'] else 'DOWN'}\") for k,v in d.items()]" # Proxmox VMs ssh pve 'qm list' 2>/dev/null || echo "PVE: unreachable" ssh pve2 'qm list' 2>/dev/null || echo "PVE2: unreachable" -# Ping critical devices +# Critical devices ping -c 1 -W 1 10.10.10.200 >/dev/null && echo "TrueNAS: UP" || echo "TrueNAS: DOWN" ping -c 1 -W 1 10.10.10.1 >/dev/null && echo "Router: UP" || echo "Router: DOWN" -# Check Windows PC Syncthing (often goes offline) -nc -zw1 10.10.10.150 22000 && echo "Windows Syncthing: UP" || echo "Windows Syncthing: DOWN" +# Windows PC Syncthing +nc -zw1 10.10.10.150 22000 && echo "Windows: UP" || echo "Windows: DOWN" ``` -### Troubleshooting Runbooks +--- -| Symptom | Check | Fix | -|---------|-------|-----| -| Device not syncing | `curl Syncthing API → connections` | Check if device online, restart Syncthing | -| Windows PC offline | `ping 10.10.10.150` then `nc -z 22000` | SSH in, `Start-ScheduledTask -TaskName "Syncthing"` | -| Phone not syncing | Phone Syncthing app in background? | User must open app, keep screen on | -| High CPU on TrueNAS | Syncthing rescan? KSM? | Check rescan intervals, disable KSM | -| VM won't start | Storage available? RAM free? | `ssh pve 'qm start VMID'`, check logs | -| Tailscale offline | `tailscale status` | `tailscale up` or restart service | -| Tailscale no subnet access | Check subnet routers | Verify pve or ucg-fiber advertising routes | -| Sync stuck at X% | Folder errors? Conflicts? 
| Check `rest/folder/errors?folder=NAME` | -| Server running hot | Check KSM, check CPU processes | Disable KSM, identify runaway process | -| Storage enclosure loud | Check fan speed via SES | See [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md) | -| Drives not detected | Check SAS link, LCC status | Switch LCC, rescan SCSI hosts | +## Troubleshooting Runbooks + +| Symptom | Check | Fix | Docs | +|---------|-------|-----|------| +| Device not syncing | `curl Syncthing API` | Restart Syncthing | [SYNCTHING.md](SYNCTHING.md) | +| VM won't start | Storage/RAM available? | `ssh pve 'qm start VMID'` | [VMS.md](VMS.md) | +| Server running hot | Check KSM, CPU processes | Disable KSM | [POWER-MANAGEMENT.md](POWER-MANAGEMENT.md) | +| Storage enclosure loud | Check fan speed via SES | Switch LCC | [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md) | +| UPS on battery | Check runtime | Monitor shutdown script | [UPS.md](UPS.md) | +| Service unreachable | Check Traefik config | Fix routing | [TRAEFIK.md](TRAEFIK.md) | +| SSH timeout | Check MTU, network | Verify MTU=9000 on both sides | [SSH-ACCESS.md](SSH-ACCESS.md) | + +--- + +## Server Temperature Check -### Server Temperature Check ```bash # Check temps on both servers (Threadripper PRO max safe: 90°C Tctl) -ssh pve 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE Tctl: $(($(cat $f)/1000))°C"; fi; done' -ssh pve2 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE2 Tctl: $(($(cat $f)/1000))°C"; fi; done' -``` -**Healthy temps**: 70-80°C under load. **Warning**: >85°C. **Throttle**: 90°C. +ssh pve 'for f in /sys/class/hwmon/hwmon*/temp*_input; do \ + label=$(cat ${f%_input}_label 2>/dev/null); \ + if [ "$label" = "Tctl" ]; then echo "PVE Tctl: $(($(cat $f)/1000))°C"; fi; done' -### Service Dependencies +ssh pve2 'for f in /sys/class/hwmon/hwmon*/temp*_input; do \ + label=$(cat ${f%_input}_label 2>/dev/null); \ + if [ "$label" = "Tctl" ]; then echo "PVE2 Tctl: $(($(cat $f)/1000))°C"; fi; done' +``` + +**Healthy**: 70-80°C under load | **Warning**: >85°C | **Throttle**: 90°C + +--- + +## Service Dependencies ``` TrueNAS (10.10.10.200) -├── Central Syncthing hub - if down, sync breaks between devices +├── Central Syncthing hub - if down, sync breaks ├── NFS/SMB shares for VMs └── Media storage for Plex PiHole (CT 200) -└── DNS for entire network - if down, name resolution fails +└── DNS for entire network Traefik (CT 202) -└── Reverse proxy - if down, external access to services fails +└── Reverse proxy - external access Router (10.10.10.1) -└── Everything - gateway for all traffic +└── Gateway for all traffic ``` -### API Quick Reference +--- + +## API Quick Reference | Service | Device | Endpoint | Auth | |---------|--------|----------|------| | Syncthing | Mac Mini | `http://127.0.0.1:8384/rest/` | `X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5` | -| Syncthing | MacBook | `http://127.0.0.1:8384/rest/` (via SSH) | `X-API-Key: qYkNdVLwy9qZZZ6MqnJr7tHX7KKdxGMJ` | +| Syncthing | MacBook | `http://127.0.0.1:8384/rest/` | `X-API-Key: qYkNdVLwy9qZZZ6MqnJr7tHX7KKdxGMJ` | | Syncthing | Phone | `https://10.10.10.54:8384/rest/` | `X-API-Key: Xxz3jDT4akUJe6psfwZsbZwG2LhfZuDM` | -| Proxmox | PVE | `https://10.10.10.120:8006/api2/json/` | SSH key auth | -| Proxmox | PVE2 | `https://10.10.10.102:8006/api2/json/` | SSH key auth | +| Proxmox | PVE/PVE2 | `https://10.10.10.120:8006/api2/json/` | SSH key auth | -### Common 
Maintenance Tasks +**See**: [SYNCTHING.md](SYNCTHING.md), [HOMEASSISTANT.md](HOMEASSISTANT.md) for more APIs -When user asks for maintenance or you notice issues: +--- -1. **Check Syncthing sync status** - Any folders behind? Errors? -2. **Verify all devices connected** - Run connection check -3. **Check disk space** - `ssh pve 'df -h'`, `ssh pve2 'df -h'` -4. **Review ZFS pool health** - `ssh pve 'zpool status'` -5. **Check for stuck processes** - High CPU? Memory pressure? -6. **Verify backups** - Are critical folders syncing? - -### Emergency Commands +## Emergency Commands ```bash -# Restart VM on Proxmox +# Restart VM ssh pve 'qm stop VMID && qm start VMID' -# Check what's using CPU +# Check CPU usage ssh pve 'ps aux --sort=-%cpu | head -10' -# Check ZFS pool status (via QEMU agent) +# Check ZFS pool (via QEMU agent) ssh pve 'qm guest exec 100 -- bash -c "zpool status vault"' -# Check EMC enclosure fans -ssh pve 'qm guest exec 100 -- bash -c "sg_ses --index=coo,-1 --get=speed_code /dev/sg15"' - # Force Syncthing rescan -curl -X POST "http://127.0.0.1:8384/rest/db/scan?folder=FOLDER" -H "X-API-Key: API_KEY" +curl -X POST "http://127.0.0.1:8384/rest/db/scan?folder=FOLDER" \ + -H "X-API-Key: API_KEY" -# Restart Syncthing on Windows (when stuck) -sshpass -p 'GrilledCh33s3#' ssh claude@10.10.10.150 'Stop-Process -Name syncthing -Force; Start-ScheduledTask -TaskName "Syncthing"' - -# Get all device IPs from router -expect -c 'spawn ssh root@10.10.10.1 "cat /proc/net/arp"; expect "Password:"; send "GrilledCh33s3#\r"; expect eof' +# Restart Syncthing on Windows +sshpass -p 'GrilledCh33s3#' ssh claude@10.10.10.150 \ + 'Stop-Process -Name syncthing -Force; Start-ScheduledTask -TaskName "Syncthing"' ``` -## Overview +--- -Two Proxmox servers running various VMs and containers for home infrastructure, media, development, and AI workloads. 
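+
+### Example: Combined Status Sweep
+
+A minimal sketch (hypothetical helper script, not part of the repo) that chains the health checks and emergency commands above into one pass, assuming the `pve`/`pve2` SSH aliases and the device IPs listed in this file:
+
+```bash
+#!/bin/bash
+# Hypothetical ~/bin/homelab-sweep.sh - adjust hosts/IPs as needed
+for host in pve pve2; do
+  echo "== $host =="
+  ssh "$host" 'uptime -p; qm list; zpool status -x' 2>/dev/null || echo "$host: unreachable"
+done
+
+# TrueNAS, Router, Traefik, Pi-hole
+for ip in 10.10.10.200 10.10.10.1 10.10.10.250 10.10.10.10; do
+  ping -c 1 -W 1 "$ip" >/dev/null 2>&1 && echo "$ip: UP" || echo "$ip: DOWN"
+done
+```
+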
+## Infrastructure Overview -## Servers +### Servers -### PVE (10.10.10.120) - Primary -- **CPU**: AMD Ryzen Threadripper PRO 3975WX (32-core, 64 threads, 280W TDP) -- **RAM**: 128 GB -- **Storage**: - - `nvme-mirror1`: 2x Sabrent Rocket Q NVMe (3.6TB usable) - - `nvme-mirror2`: 2x Kingston SFYRD 2TB (1.8TB usable) - - `rpool`: 2x Samsung 870 QVO 4TB SSD mirror (3.6TB usable) -- **GPUs**: - - NVIDIA Quadro P2000 (75W TDP) - Plex transcoding - - NVIDIA TITAN RTX (280W TDP) - AI workloads, passed to saltbox/lmdev1 -- **Role**: Primary VM host, TrueNAS, media services +| Server | CPU | RAM | Role | Details | +|--------|-----|-----|------|---------| +| **PVE** (10.10.10.120) | Threadripper PRO 3975WX (32C) | 128GB | Primary | [VMS.md](VMS.md) | +| **PVE2** (10.10.10.102) | Threadripper PRO 3975WX (32C) | 128GB | Secondary | [VMS.md](VMS.md) | -### PVE2 (10.10.10.102) - Secondary -- **CPU**: AMD Ryzen Threadripper PRO 3975WX (32-core, 64 threads, 280W TDP) -- **RAM**: 128 GB -- **Storage**: - - `nvme-mirror3`: 2x NVMe mirror - - `local-zfs2`: 2x WD Red 6TB HDD mirror -- **GPUs**: - - NVIDIA RTX A6000 (300W TDP) - passed to trading-vm -- **Role**: Trading platform, development +**Power**: ~1000-1350W under load | **UPS**: CyberPower 2200VA/1320W | **See**: [UPS.md](UPS.md), [POWER-MANAGEMENT.md](POWER-MANAGEMENT.md) -## SSH Access +### Critical VMs -### SSH Key Authentication (All Hosts) +| VMID | Name | IP | Purpose | Docs | +|------|------|-----|---------|------| +| 100 | truenas | 10.10.10.200 | NAS/storage | [STORAGE.md](STORAGE.md) | +| 101 | saltbox | 10.10.10.100 | Media stack (Plex) | [VMS.md](VMS.md) | +| 110 | homeassistant | 10.10.10.110 | Home automation | [HOMEASSISTANT.md](HOMEASSISTANT.md) | +| 202 | traefik (CT) | 10.10.10.250 | Reverse proxy | [TRAEFIK.md](TRAEFIK.md) | -SSH keys are configured in `~/.ssh/config` on both Mac Mini and MacBook. Use the `~/.ssh/homelab` key. +**Complete inventory**: [VMS.md](VMS.md) | **IP assignments**: [IP-ASSIGNMENTS.md](IP-ASSIGNMENTS.md) -| Host Alias | IP | User | Type | Notes | -|------------|-----|------|------|-------| -| `pve` | 10.10.10.120 | root | Proxmox | Primary server | -| `pve2` | 10.10.10.102 | root | Proxmox | Secondary server | -| `truenas` | 10.10.10.200 | root | VM | NAS/storage | -| `saltbox` | 10.10.10.100 | hutson | VM | Media automation | -| `lmdev1` | 10.10.10.111 | hutson | VM | AI/LLM development | -| `docker-host` | 10.10.10.206 | hutson | VM | Docker services | -| `fs-dev` | 10.10.10.5 | hutson | VM | Development | -| `copyparty` | 10.10.10.201 | hutson | VM | File sharing | -| `gitea-vm` | 10.10.10.220 | hutson | VM | Git server | -| `trading-vm` | 10.10.10.221 | hutson | VM | AI trading platform | -| `pihole` | 10.10.10.10 | root | LXC | DNS/Ad blocking | -| `traefik` | 10.10.10.250 | root | LXC | Reverse proxy | -| `findshyt` | 10.10.10.8 | root | LXC | Custom app | +--- -**Usage examples:** -```bash -ssh pve 'qm list' # List VMs -ssh truenas 'zpool status vault' # Check ZFS pool -ssh saltbox 'docker ps' # List containers -ssh pihole 'pihole status' # Check Pi-hole -``` +## Common Maintenance Tasks -### Password Auth (Special Cases) +1. **Check Syncthing sync** - Folders behind? Errors? +2. **Verify devices connected** - Run connection check +3. **Check disk space** - `ssh pve 'df -h'` +4. **Review ZFS health** - `ssh pve 'zpool status'` +5. **Check for stuck processes** - High CPU? Memory pressure? +6. **Verify backups** - Critical folders syncing? 
→ See [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md) -| Device | IP | User | Auth Method | Notes | -|--------|-----|------|-------------|-------| -| UniFi Router | 10.10.10.1 | root | expect (keyboard-interactive) | Gateway | -| Windows PC | 10.10.10.150 | claude | sshpass | PowerShell, use `;` not `&&` | -| HomeAssistant | 10.10.10.110 | - | QEMU agent only | No SSH server | +--- -**Router access (requires expect):** -```bash -# Run command on router -expect -c 'spawn ssh root@10.10.10.1 "hostname"; expect "Password:"; send "GrilledCh33s3#\r"; expect eof' +## Network Quick Reference -# Get ARP table (all device IPs) -expect -c 'spawn ssh root@10.10.10.1 "cat /proc/net/arp"; expect "Password:"; send "GrilledCh33s3#\r"; expect eof' -``` +**Ranges**: 10.10.10.0/24 (LAN), 10.10.20.0/24 (storage) +**Jumbo Frames**: MTU 9000 enabled +**Tailscale**: VPN with subnet routing (HA failover) -**Windows PC access:** -```bash -sshpass -p 'GrilledCh33s3#' ssh claude@10.10.10.150 'Get-Process | Select -First 5' -``` +**See**: [NETWORK.md](NETWORK.md) for complete details -**HomeAssistant (no SSH, use QEMU agent):** -```bash -ssh pve 'qm guest exec 110 -- bash -c "ha core info"' -``` - -## VMs and Containers - -### PVE (10.10.10.120) -| VMID | Name | vCPUs | RAM | Purpose | GPU/Passthrough | QEMU Agent | -|------|------|-------|-----|---------|-----------------|------------| -| 100 | truenas | 8 | 32GB | NAS, storage | LSI SAS2308 HBA, Samsung NVMe | Yes | -| 101 | saltbox | 16 | 16GB | Media automation | TITAN RTX | Yes | -| 105 | fs-dev | 10 | 8GB | Development | - | Yes | -| 110 | homeassistant | 2 | 2GB | Home automation | - | No | -| 111 | lmdev1 | 8 | 32GB | AI/LLM development | TITAN RTX | Yes | -| 201 | copyparty | 2 | 2GB | File sharing | - | Yes | -| 206 | docker-host | 2 | 4GB | Docker services | - | Yes | -| 200 | pihole (CT) | - | - | DNS/Ad blocking | - | N/A | -| 202 | traefik (CT) | - | - | Reverse proxy | - | N/A | -| 205 | findshyt (CT) | - | - | Custom app | - | N/A | - -### PVE2 (10.10.10.102) -| VMID | Name | vCPUs | RAM | Purpose | GPU/Passthrough | QEMU Agent | -|------|------|-------|-----|---------|-----------------|------------| -| 300 | gitea-vm | 2 | 4GB | Git server | - | Yes | -| 301 | trading-vm | 16 | 32GB | AI trading platform | RTX A6000 | Yes | - -### QEMU Guest Agent -VMs with QEMU agent can be managed via `qm guest exec`: -```bash -# Execute command in VM -ssh pve 'qm guest exec 100 -- bash -c "zpool status vault"' - -# Get VM IP addresses -ssh pve 'qm guest exec 100 -- bash -c "ip addr"' -``` -Only VM 110 (homeassistant) lacks QEMU agent - use its web UI instead. - -## Power Management - -### Estimated Power Draw -- **PVE**: 500-750W (CPU + TITAN RTX + P2000 + storage + HBAs) -- **PVE2**: 450-600W (CPU + RTX A6000 + storage) -- **Combined**: ~1000-1350W under load - -### Optimizations Applied -1. **KSMD Disabled** (2024-12-17 updated) - - Was consuming 44-57% CPU on PVE with negative profit - - Caused CPU temp to rise from 74°C to 83°C - - Savings: ~7-10W + significant temp reduction - - Made permanent via: - - systemd service: `/etc/systemd/system/disable-ksm.service` - - **ksmtuned masked**: `systemctl mask ksmtuned` (prevents re-enabling) - - **Note**: KSM can get re-enabled by Proxmox updates. If CPU is hot, check: - ```bash - cat /sys/kernel/mm/ksm/run # Should be 0 - ps aux | grep ksmd # Should show 0% CPU - # If KSM is running (run=1), disable it: - echo 0 > /sys/kernel/mm/ksm/run - systemctl mask ksmtuned - ``` - -2. 
**Syncthing Rescan Intervals** (2024-12-16) - - Changed aggressive 60s rescans to 3600s for large folders - - Affected: downloads (38GB), documents (11GB), desktop (7.2GB), movies, pictures, notes, config - - Savings: ~60-80W (TrueNAS VM was at constant 86% CPU) - -3. **CPU Governor Optimization** (2024-12-16) - - PVE: `powersave` governor + `balance_power` EPP (amd-pstate-epp driver) - - PVE2: `schedutil` governor (acpi-cpufreq driver) - - Made permanent via systemd service: `/etc/systemd/system/cpu-powersave.service` - - Savings: ~60-120W combined (CPUs now idle at 1.7-2.2GHz vs 4GHz) - -4. **GPU Power States** (2024-12-16) - Verified optimal - - RTX A6000: 11W idle (P8 state) - - TITAN RTX: 2-3W idle (P8 state) - - Quadro P2000: 25W (P0 - Plex keeps it active) - -5. **ksmtuned Disabled** (2024-12-16) - - KSM tuning daemon was still running after KSMD disabled - - Stopped and disabled on both servers - - Savings: ~2-5W - -6. **HDD Spindown on PVE2** (2024-12-16) - - local-zfs2 pool (2x WD Red 6TB) had only 768KB used but drives spinning 24/7 - - Set 30-minute spindown via `hdparm -S 241` - - Persistent via udev rule: `/etc/udev/rules.d/69-hdd-spindown.rules` - - Savings: ~10-16W when spun down - -### Potential Optimizations -- [ ] PCIe ASPM power management -- [ ] NMI watchdog disable - -## Memory Configuration -- Ballooning enabled on most VMs but not actively used -- No memory overcommit (98GB allocated on 128GB physical for PVE) -- KSMD was wasting CPU with no benefit (negative general_profit) - -## Network - -See [NETWORK.md](NETWORK.md) for full details. - -### Network Ranges -| Network | Range | Purpose | -|---------|-------|---------| -| LAN | 10.10.10.0/24 | Primary network, all external access | -| Internal | 10.10.20.0/24 | Inter-VM only (storage, NFS/iSCSI) | - -### PVE Bridges (10.10.10.120) -| Bridge | NIC | Speed | Purpose | Use For | -|--------|-----|-------|---------|---------| -| vmbr0 | enp1s0 | 1 Gb | Management | General VMs/CTs | -| vmbr1 | enp35s0f0 | 10 Gb | High-speed LXC | Bandwidth-heavy containers | -| vmbr2 | enp35s0f1 | 10 Gb | High-speed VM | TrueNAS, Saltbox, storage VMs | -| vmbr3 | (none) | Virtual | Internal only | NFS/iSCSI traffic, no internet | - -### Quick Reference -```bash -# Add VM to standard network (1Gb) -qm set VMID --net0 virtio,bridge=vmbr0 - -# Add VM to high-speed network (10Gb) -qm set VMID --net0 virtio,bridge=vmbr2 - -# Add secondary NIC for internal storage network -qm set VMID --net1 virtio,bridge=vmbr3 -``` - -### MTU 9000 (Jumbo Frames) - -Jumbo frames are enabled across the network for improved throughput on large transfers. - -| Device | Interface | MTU | Persistent | -|--------|-----------|-----|------------| -| Mac Mini | en0 | 9000 | Yes (networksetup) | -| PVE | vmbr0, enp1s0 | 9000 | Yes (/etc/network/interfaces) | -| PVE2 | vmbr0, nic1 | 9000 | Yes (/etc/network/interfaces) | -| TrueNAS | enp6s18, enp6s19 | 9000 | Yes | -| UCG-Fiber | br0 | 9216 | Yes (default) | - -**Verify MTU:** -```bash -# Mac Mini -ifconfig en0 | grep mtu - -# PVE/PVE2 -ssh pve 'ip link show vmbr0 | grep mtu' -ssh pve2 'ip link show vmbr0 | grep mtu' - -# Test jumbo frames -ping -c 1 -D -s 8000 10.10.10.120 # 8000 + 8 byte header = 8008 bytes -``` - -**Important:** When setting MTU on Proxmox bridges, ensure BOTH the bridge (vmbr0) AND the underlying physical interface (enp1s0/nic1) have the same MTU, otherwise packets will be dropped. - -### Tailscale VPN - -Tailscale provides secure remote access to the homelab from anywhere. 
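+
+A minimal failover check (a sketch assuming the Mac Mini's Tailscale CLI path and the `pve`/`ucg-fiber` subnet routers covered in NETWORK.md):
+
+```bash
+# List the subnet routers (pve primary, ucg-fiber failover) and their state
+/Applications/Tailscale.app/Contents/MacOS/Tailscale status | grep -E 'pve|ucg-fiber'
+
+# Confirm the LAN gateway is reachable over the advertised 10.10.10.0/24 route
+ping -c 1 -W 2 10.10.10.1 >/dev/null && echo "Subnet route: UP" || echo "Subnet route: DOWN"
+```
+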
- -**Subnet Routers (HA Failover)** - -Two devices advertise the `10.10.10.0/24` subnet for redundancy: - -| Device | Tailscale IP | Role | Notes | -|--------|--------------|------|-------| -| pve | 100.113.177.80 | Primary | Proxmox host | -| ucg-fiber | 100.94.246.32 | Failover | UniFi router (always on) | - -If Proxmox goes down, Tailscale automatically fails over to the router (~10-30 sec). - -**Router Tailscale Setup (UCG-Fiber)** -- Installed via: `curl -fsSL https://tailscale.com/install.sh | sh` -- Config: `tailscale up --advertise-routes=10.10.10.0/24 --accept-routes` -- Survives reboots (systemd service) -- Routes must be approved in [Tailscale Admin Console](https://login.tailscale.com/admin/machines) - -**Tailscale IPs Quick Reference** - -| Device | Tailscale IP | Local IP | -|--------|--------------|----------| -| Mac Mini | 100.108.89.58 | 10.10.10.125 | -| PVE | 100.113.177.80 | 10.10.10.120 | -| UCG-Fiber | 100.94.246.32 | 10.10.10.1 | -| TrueNAS | 100.100.94.71 | 10.10.10.200 | -| Pi-hole | 100.112.59.128 | 10.10.10.10 | - -**Check Tailscale Status** -```bash -# From Mac Mini -/Applications/Tailscale.app/Contents/MacOS/Tailscale status - -# From router -expect -c 'spawn ssh root@10.10.10.1 "tailscale status"; expect "Password:"; send "GrilledCh33s3#\r"; expect eof' -``` +--- ## Common Commands + ```bash -# Check VM status -ssh pve 'qm list' -ssh pve2 'qm list' +# VM management +ssh pve 'qm list' # List VMs +ssh pve 'qm start VMID' # Start VM +ssh pve 'qm shutdown VMID' # Graceful shutdown -# Check container status -ssh pve 'pct list' +# Container management +ssh pve 'pct list' # List containers +ssh pve 'pct enter CTID' # Enter container shell -# Monitor CPU/power -ssh pve 'top -bn1 | head -20' +# Storage +ssh pve 'zpool status' # Check ZFS pools +ssh truenas 'zpool status vault' # Check TrueNAS pool -# Check ZFS pools -ssh pve 'zpool status' - -# Check GPU (if nvidia-smi installed in VM) -ssh pve 'lspci | grep -i nvidia' +# QEMU guest agent +ssh pve 'qm guest exec VMID -- bash -c "COMMAND"' ``` -## Remote Claude Code Sessions (Mac Mini) +**See**: [SSH-ACCESS.md](SSH-ACCESS.md), [VMS.md](VMS.md) -### Overview -The Mac Mini (`hutson-mac-mini.local`) runs the Happy Coder daemon, enabling on-demand Claude Code sessions accessible from anywhere via the Happy Coder mobile app. Sessions are created when you need them - no persistent tmux sessions required. +--- -### Architecture -``` -Mac Mini (100.108.89.58 via Tailscale) -├── launchd (auto-starts on boot) -│ └── com.hutson.happy-daemon.plist (starts Happy daemon) -├── Happy Coder daemon (manages remote sessions) -└── Tailscale (secure remote access) -``` +## Documentation Index -### How It Works -1. Happy daemon runs on Mac Mini (auto-starts on boot) -2. Open Happy Coder app on phone/tablet -3. Start a new Claude session from the app -4. Session runs in any working directory you choose -5. 
Session ends when you're done - no cleanup needed +### Infrastructure +- [README.md](README.md) - Start here +- [VMS.md](VMS.md) - VM/CT inventory +- [STORAGE.md](STORAGE.md) - ZFS pools, shares +- [NETWORK.md](NETWORK.md) - Bridges, VLANs, Tailscale +- [POWER-MANAGEMENT.md](POWER-MANAGEMENT.md) - Optimizations +- [UPS.md](UPS.md) - UPS config, NUT monitoring -### Quick Commands -```bash -# Check daemon status -happy daemon list +### Services +- [TRAEFIK.md](TRAEFIK.md) - Reverse proxy, SSL +- [HOMEASSISTANT.md](HOMEASSISTANT.md) - Home automation +- [SYNCTHING.md](SYNCTHING.md) - File sync +- [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md) - Storage enclosure -# Start a new session manually (from Mac Mini terminal) -cd ~/Projects/homelab && happy claude +### Operations +- [SSH-ACCESS.md](SSH-ACCESS.md) - SSH keys, hosts +- [IP-ASSIGNMENTS.md](IP-ASSIGNMENTS.md) - IP addresses +- [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md) - ⚠️ Backups (CRITICAL) +- [SHELL-ALIASES.md](SHELL-ALIASES.md) - ZSH aliases -# Check active sessions -happy daemon list -``` +--- -### Mobile Access Setup (One-time) -1. Download Happy Coder app: - - iOS: https://apps.apple.com/us/app/happy-claude-code-client/id6748571505 - - Android: https://play.google.com/store/apps/details?id=com.ex3ndr.happy -2. On Mac Mini, ensure self-hosted server is configured: - ```bash - echo 'export HAPPY_SERVER_URL="https://happy.htsn.io"' >> ~/.zshrc - source ~/.zshrc - ``` -3. Authenticate with the Happy server: - ```bash - happy auth login --force # Opens browser, scan QR with app - ``` -4. Connect Claude API access: - ```bash - happy connect claude # Links your Anthropic API credentials - ``` -5. Ensure Claude is logged in locally (critical for spawned sessions): - ```bash - claude # Start Claude Code - /login # Authenticate if prompted - ``` -6. 
Daemon auto-starts on login via launchd - -### Daemon Management -```bash -happy daemon start # Start daemon -happy daemon stop # Stop daemon -happy daemon status # Check status -happy daemon list # List active sessions -``` - -### Remote Access via SSH + Tailscale -From any device on Tailscale network: -```bash -# SSH to Mac Mini -ssh hutson@100.108.89.58 - -# Or via hostname -ssh hutson@mac-mini - -# Start Claude in desired directory -cd ~/Projects/homelab && happy claude -``` - -### Files & Configuration -| File | Purpose | -|------|---------| -| `~/Library/LaunchAgents/com.hutson.happy-daemon.plist` | User LaunchAgent (starts at login) | -| `~/.happy/` | Happy Coder config, state, and logs | -| `~/.zshrc` | Contains `HAPPY_SERVER_URL` export | - -**Server:** `https://happy.htsn.io` (self-hosted Happy server on docker-host) - -### Troubleshooting -```bash -# Check if daemon is running -pgrep -f "happy.*daemon" - -# Check launchd status -launchctl list | grep happy - -# List active sessions -happy daemon list - -# Restart daemon -happy daemon stop && happy daemon start - -# If Tailscale is disconnected -/Applications/Tailscale.app/Contents/MacOS/Tailscale up -``` - -**Common Issues:** - -| Issue | Cause | Fix | -|-------|-------|-----| -| "Invalid API key" in spawned session | Claude not logged in locally | Run `claude` then `/login` on Mac Mini | -| "Failed to start daemon" | Stale lock file | `rm -f ~/.happy/daemon.state.json.lock ~/.happy/daemon.state.json` | -| Sessions not showing on phone | HAPPY_SERVER_URL not set | Add to `~/.zshrc`: `export HAPPY_SERVER_URL="https://happy.htsn.io"` | -| Slow responses | Cloudflare proxy enabled | Disable proxy for happy.htsn.io subdomain | - -## Happy Server (Self-Hosted Relay) - -Self-hosted Happy Coder relay server for lower latency and no external dependencies. - -### Architecture -``` -Phone App → https://happy.htsn.io → Traefik → docker-host:3002 → Happy Server - ↓ - PostgreSQL + Redis + MinIO (local) -``` - -### Service Details - -| Component | Location | Port | Notes | -|-----------|----------|------|-------| -| Happy Server | docker-host (10.10.10.206) | 3002 | Main relay service | -| PostgreSQL | docker-host | 5432 (internal) | User/session data | -| Redis | docker-host | 6379 (internal) | Real-time events | -| MinIO | docker-host | 9000 (internal) | File/image storage | -| Traefik | CT 202 | 443 | SSL termination | - -### Configuration - -**Docker Compose**: `/opt/happy-server/docker-compose.yml` -**Traefik Config**: `/etc/traefik/conf.d/happy.yaml` (on CT 202) -**DNS**: happy.htsn.io → 70.237.94.174 (Cloudflare DNS-only, NOT proxied for WebSocket performance) - -**Credentials**: -- Master Secret: `3ccbfd03a028d3c278da7d2cf36d99b94cd4b1fecabc49ab006e8e89bc7707ac` -- MinIO: `happyadmin` / `happyadmin123` -- PostgreSQL: `happy` / `happypass` - -### Quick Commands -```bash -# Check status -ssh docker-host 'docker ps --filter "name=happy"' - -# View logs -ssh docker-host 'docker logs -f happy-server' - -# Restart stack -ssh docker-host 'cd /opt/happy-server && sudo docker-compose restart' - -# Health check -curl https://happy.htsn.io/health - -# Run migrations (if needed) -ssh docker-host 'docker exec happy-server npx prisma migrate deploy' -``` - -### Connecting Devices - -**Phone (Happy App)**: -1. Settings → Relay Server URL -2. Enter: `https://happy.htsn.io` -3. 
Save and reconnect - -**CLI (Mac/Linux)**: -```bash -export HAPPY_SERVER_URL="https://happy.htsn.io" -happy auth # Re-authenticate with new server -``` - -### Maintenance - -**Backup data**: -```bash -ssh docker-host 'docker exec happy-postgres pg_dump -U happy happy > /tmp/happy-backup.sql' -``` - -**Update Happy Server**: -```bash -ssh docker-host 'cd /opt/happy-server && git pull && sudo docker-compose build && sudo docker-compose up -d' -``` - -## Agent and Tool Guidelines +## Agent & Tool Guidelines ### Background Agents -- **Always spin up background agents when doing multiple independent tasks** -- Background agents allow parallel execution of tasks that don't depend on each other -- This improves efficiency and reduces total execution time -- Use background agents for tasks like running tests, builds, or searches simultaneously +**Always** spin up background agents for multiple independent tasks: +- Parallel execution improves efficiency +- Use for: tests, builds, searches simultaneously -### MCP Tools for Web Searches +### MCP Tools -#### ref.tools - Documentation Lookups -- **`mcp__Ref__ref_search_documentation`**: Search through documentation for specific topics -- **`mcp__Ref__ref_read_url`**: Read and parse content from documentation URLs +| Tool | Provider | Use Case | +|------|----------|----------| +| `mcp__Ref__ref_search_documentation` | ref.tools | Search documentation | +| `mcp__Ref__ref_read_url` | ref.tools | Read doc URLs | +| `mcp__exa__web_search_exa` | Exa | General web search | +| `mcp__exa__get_code_context_exa` | Exa | Code-specific search | -#### Exa MCP - General Web and Code Searches -- **`mcp__exa__web_search_exa`**: General web searches for current information -- **`mcp__exa__get_code_context_exa`**: Code-related searches and repository lookups - -### MCP Tools Reference Table - -| Tool Name | Provider | Purpose | Use Case | -|-----------|----------|---------|----------| -| `mcp__Ref__ref_search_documentation` | ref.tools | Search documentation | Finding specific topics in official docs | -| `mcp__Ref__ref_read_url` | ref.tools | Read documentation URLs | Parsing and extracting content from doc pages | -| `mcp__exa__web_search_exa` | Exa MCP | General web search | Current events, general information lookup | -| `mcp__exa__get_code_context_exa` | Exa MCP | Code-specific search | Finding code examples, repository searches | - -## Reverse Proxy Architecture (Traefik) - -### Overview -There are **TWO separate Traefik instances** handling different services: - -| Instance | Location | IP | Purpose | Manages | -|----------|----------|-----|---------|---------| -| **Traefik-Primary** | CT 202 | **10.10.10.250** | General services | All non-Saltbox services | -| **Traefik-Saltbox** | VM 101 (Docker) | **10.10.10.100** | Saltbox services | Plex, *arr apps, media stack | - -### ⚠️ CRITICAL RULE: Which Traefik to Use - -**When adding ANY new service:** -- ✅ **Use Traefik-Primary (10.10.10.250)** - Unless service lives inside Saltbox VM -- ❌ **DO NOT touch Traefik-Saltbox** - It manages Saltbox services with their own certificates - -**Why this matters:** -- Traefik-Saltbox has complex Saltbox-managed configs -- Messing with it breaks Plex, Sonarr, Radarr, and all media services -- Each Traefik has its own Let's Encrypt certificates -- Mixing them causes certificate conflicts - -### Traefik-Primary (CT 202) - For New Services - -**Location**: `/etc/traefik/` on Container 202 -**Config**: `/etc/traefik/traefik.yaml` -**Dynamic Configs**: `/etc/traefik/conf.d/*.yaml` - 
-**Services using Traefik-Primary (10.10.10.250):** -- excalidraw.htsn.io → 10.10.10.206:8080 (docker-host) -- findshyt.htsn.io → 10.10.10.205 (CT 205) -- gitea (git.htsn.io) → 10.10.10.220:3000 -- homeassistant → 10.10.10.110 -- lmdev → 10.10.10.111 -- pihole → 10.10.10.200 -- truenas → 10.10.10.200 -- proxmox → 10.10.10.120 -- copyparty → 10.10.10.201 -- aitrade → trading server -- pulse.htsn.io → 10.10.10.206:7655 (Pulse monitoring) -- happy.htsn.io → 10.10.10.206:3002 (Happy Coder relay server) - -**Access Traefik config:** -```bash -# From Mac Mini: -ssh pve 'pct exec 202 -- cat /etc/traefik/traefik.yaml' -ssh pve 'pct exec 202 -- ls /etc/traefik/conf.d/' - -# Edit a service config: -ssh pve 'pct exec 202 -- vi /etc/traefik/conf.d/myservice.yaml' -``` - -### Traefik-Saltbox (VM 101) - DO NOT MODIFY - -**Location**: `/opt/traefik/` inside Saltbox VM -**Managed by**: Saltbox Ansible playbooks -**Mounts**: Docker bind mount from `/opt/traefik` → `/etc/traefik` in container - -**Services using Traefik-Saltbox (10.10.10.100):** -- Plex (plex.htsn.io) -- Sonarr, Radarr, Lidarr -- SABnzbd, NZBGet, qBittorrent -- Overseerr, Tautulli, Organizr -- Jackett, NZBHydra2 -- Authelia (SSO) -- All other Saltbox-managed containers - -**View Saltbox Traefik (read-only):** -```bash -ssh pve 'qm guest exec 101 -- bash -c "docker exec traefik cat /etc/traefik/traefik.yml"' -``` - -### Adding a New Public Service - Complete Workflow - -Follow these steps to deploy a new service and make it publicly accessible at `servicename.htsn.io`. - -#### Step 0. Deploy Your Service - -First, deploy your service on the appropriate host: - -**Option A: Docker on docker-host (10.10.10.206)** -```bash -ssh hutson@10.10.10.206 -sudo mkdir -p /opt/myservice -cat > /opt/myservice/docker-compose.yml << 'EOF' -version: "3.8" -services: - myservice: - image: myimage:latest - ports: - - "8080:80" - restart: unless-stopped -EOF -cd /opt/myservice && sudo docker-compose up -d -``` - -**Option B: New LXC Container on PVE** -```bash -ssh pve 'pct create CTID local:vztmpl/ubuntu-22.04-standard_22.04-1_amd64.tar.zst \ - --hostname myservice --memory 2048 --cores 2 \ - --net0 name=eth0,bridge=vmbr0,ip=10.10.10.XXX/24,gw=10.10.10.1 \ - --rootfs local-zfs:8 --unprivileged 1 --start 1' -``` - -**Option C: New VM on PVE** -```bash -ssh pve 'qm create VMID --name myservice --memory 2048 --cores 2 \ - --net0 virtio,bridge=vmbr0 --scsihw virtio-scsi-pci' -``` - -#### Step 1. 
Create Traefik Config File - -Use this template for new services on **Traefik-Primary (CT 202)**: - -```yaml -# /etc/traefik/conf.d/myservice.yaml -http: - routers: - # HTTPS router - myservice-secure: - entryPoints: - - websecure - rule: "Host(`myservice.htsn.io`)" - service: myservice - tls: - certResolver: cloudflare # Use 'cloudflare' for proxied domains, 'letsencrypt' for DNS-only - priority: 50 - - # HTTP → HTTPS redirect - myservice-redirect: - entryPoints: - - web - rule: "Host(`myservice.htsn.io`)" - middlewares: - - myservice-https-redirect - service: myservice - priority: 50 - - services: - myservice: - loadBalancer: - servers: - - url: "http://10.10.10.XXX:PORT" - - middlewares: - myservice-https-redirect: - redirectScheme: - scheme: https - permanent: true -``` - -### SSL Certificates - -Traefik has **two certificate resolvers** configured: - -| Resolver | Use When | Challenge Type | Notes | -|----------|----------|----------------|-------| -| `letsencrypt` | Cloudflare DNS-only (gray cloud) | HTTP-01 | Requires port 80 reachable | -| `cloudflare` | Cloudflare Proxied (orange cloud) | DNS-01 | Works with Cloudflare proxy | - -**⚠️ Important:** If Cloudflare proxy is enabled (orange cloud), HTTP challenge fails because Cloudflare redirects HTTP→HTTPS. Use `cloudflare` resolver instead. - -**Cloudflare API credentials** are configured in `/etc/systemd/system/traefik.service`: -```bash -Environment="CF_API_EMAIL=cloudflare@htsn.io" -Environment="CF_API_KEY=849ebefd163d2ccdec25e49b3e1b3fe2cdadc" -``` - -**Certificate storage:** -- HTTP challenge certs: `/etc/traefik/acme.json` -- DNS challenge certs: `/etc/traefik/acme-cf.json` - -**Deploy the config:** -```bash -# Create file on CT 202 -ssh pve 'pct exec 202 -- bash -c "cat > /etc/traefik/conf.d/myservice.yaml << '\''EOF'\'' - -EOF"' - -# Traefik auto-reloads (watches conf.d directory) -# Check logs: -ssh pve 'pct exec 202 -- tail -f /var/log/traefik/traefik.log' -``` - -#### 2. Add Cloudflare DNS Entry - -**Cloudflare Credentials:** -- Email: `cloudflare@htsn.io` -- API Key: `849ebefd163d2ccdec25e49b3e1b3fe2cdadc` - -**Manual method (via Cloudflare Dashboard):** -1. Go to https://dash.cloudflare.com/ -2. Select `htsn.io` domain -3. DNS → Add Record -4. Type: `A`, Name: `myservice`, IPv4: `70.237.94.174`, Proxied: ☑️ - -**Automated method (CLI script):** - -Save this as `~/bin/add-cloudflare-dns.sh`: -```bash -#!/bin/bash -# Add DNS record to Cloudflare for htsn.io - -SUBDOMAIN="$1" -CF_EMAIL="cloudflare@htsn.io" -CF_API_KEY="849ebefd163d2ccdec25e49b3e1b3fe2cdadc" -ZONE_ID="c0f5a80448c608af35d39aa820a5f3af" # htsn.io zone -PUBLIC_IP="70.237.94.174" # Update if IP changes: curl -s ifconfig.me - -if [ -z "$SUBDOMAIN" ]; then - echo "Usage: $0 " - echo "Example: $0 myservice # Creates myservice.htsn.io" - exit 1 -fi - -curl -X POST "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records" \ - -H "X-Auth-Email: $CF_EMAIL" \ - -H "X-Auth-Key: $CF_API_KEY" \ - -H "Content-Type: application/json" \ - --data "{ - \"type\":\"A\", - \"name\":\"$SUBDOMAIN\", - \"content\":\"$PUBLIC_IP\", - \"ttl\":1, - \"proxied\":true - }" | jq . -``` - -**Usage:** -```bash -chmod +x ~/bin/add-cloudflare-dns.sh -~/bin/add-cloudflare-dns.sh excalidraw # Creates excalidraw.htsn.io -``` - -#### 3. 
Testing - -```bash -# Check if DNS resolves -dig myservice.htsn.io - -# Test HTTP redirect -curl -I http://myservice.htsn.io - -# Test HTTPS -curl -I https://myservice.htsn.io - -# Check Traefik dashboard (if enabled) -# Access: http://10.10.10.250:8080/dashboard/ -``` - -#### Step 4. Update Documentation - -After deploying, update these files: - -1. **IP-ASSIGNMENTS.md** - Add to Services & Reverse Proxy Mapping table -2. **CLAUDE.md** - Add to "Services using Traefik-Primary" list (line ~495) - -### Quick Reference - One-Liner Commands - -```bash -# === DEPLOY SERVICE (example: myservice on docker-host port 8080) === - -# 1. Create Traefik config -ssh pve 'pct exec 202 -- bash -c "cat > /etc/traefik/conf.d/myservice.yaml << EOF -http: - routers: - myservice-secure: - entryPoints: [websecure] - rule: Host(\\\`myservice.htsn.io\\\`) - service: myservice - tls: {certResolver: letsencrypt} - services: - myservice: - loadBalancer: - servers: - - url: http://10.10.10.206:8080 -EOF"' - -# 2. Add Cloudflare DNS -curl -s -X POST "https://api.cloudflare.com/client/v4/zones/c0f5a80448c608af35d39aa820a5f3af/dns_records" \ - -H "X-Auth-Email: cloudflare@htsn.io" \ - -H "X-Auth-Key: 849ebefd163d2ccdec25e49b3e1b3fe2cdadc" \ - -H "Content-Type: application/json" \ - --data '{"type":"A","name":"myservice","content":"70.237.94.174","proxied":true}' - -# 3. Test (wait a few seconds for DNS propagation) -curl -I https://myservice.htsn.io -``` - -### Traefik Troubleshooting - -```bash -# View Traefik logs (CT 202) -ssh pve 'pct exec 202 -- tail -f /var/log/traefik/traefik.log' - -# Check if config is valid -ssh pve 'pct exec 202 -- cat /etc/traefik/conf.d/myservice.yaml' - -# List all dynamic configs -ssh pve 'pct exec 202 -- ls -la /etc/traefik/conf.d/' - -# Check certificate -ssh pve 'pct exec 202 -- cat /etc/traefik/acme.json | jq' - -# Restart Traefik (if needed) -ssh pve 'pct exec 202 -- systemctl restart traefik' -``` - -### Certificate Management - -**Let's Encrypt certificates** are automatically managed by Traefik. 
- -**Certificate storage:** -- Traefik-Primary: `/etc/traefik/acme.json` on CT 202 -- Traefik-Saltbox: `/opt/traefik/acme.json` on VM 101 - -**Certificate renewal:** -- Automatic via HTTP-01 challenge -- Traefik checks every 24h -- Renews 30 days before expiry - -**If certificates fail:** -```bash -# Check acme.json permissions (must be 600) -ssh pve 'pct exec 202 -- ls -la /etc/traefik/acme.json' - -# Check Traefik can reach Let's Encrypt -ssh pve 'pct exec 202 -- curl -I https://acme-v02.api.letsencrypt.org/directory' - -# Delete bad certificate (Traefik will re-request) -ssh pve 'pct exec 202 -- rm /etc/traefik/acme.json' -ssh pve 'pct exec 202 -- touch /etc/traefik/acme.json' -ssh pve 'pct exec 202 -- chmod 600 /etc/traefik/acme.json' -ssh pve 'pct exec 202 -- systemctl restart traefik' -``` - -### Docker Service with Traefik Labels (Alternative) - -If deploying a service via Docker on `docker-host` (VM 206), you can use Traefik labels instead of config files: - -```yaml -# docker-compose.yml -services: - myservice: - image: myimage:latest - labels: - - "traefik.enable=true" - - "traefik.http.routers.myservice.rule=Host(`myservice.htsn.io`)" - - "traefik.http.routers.myservice.entrypoints=websecure" - - "traefik.http.routers.myservice.tls.certresolver=letsencrypt" - - "traefik.http.services.myservice.loadbalancer.server.port=8080" - networks: - - traefik - -networks: - traefik: - external: true -``` - -**Note**: This requires Traefik to have access to Docker socket and be on same network. - -## Cloudflare API Access - -**Credentials** (stored in Saltbox config): -- Email: `cloudflare@htsn.io` -- API Key: `849ebefd163d2ccdec25e49b3e1b3fe2cdadc` -- Domain: `htsn.io` - -**Retrieve from Saltbox:** -```bash -ssh pve 'qm guest exec 101 -- bash -c "cat /srv/git/saltbox/accounts.yml | grep -A2 cloudflare"' -``` - -**Cloudflare API Documentation:** -- API Docs: https://developers.cloudflare.com/api/ -- DNS Records: https://developers.cloudflare.com/api/operations/dns-records-for-a-zone-create-dns-record - -**Common API operations:** - -```bash -# Set credentials -CF_EMAIL="cloudflare@htsn.io" -CF_API_KEY="849ebefd163d2ccdec25e49b3e1b3fe2cdadc" -ZONE_ID="c0f5a80448c608af35d39aa820a5f3af" - -# List all DNS records -curl -X GET "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records" \ - -H "X-Auth-Email: $CF_EMAIL" \ - -H "X-Auth-Key: $CF_API_KEY" | jq - -# Add A record -curl -X POST "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records" \ - -H "X-Auth-Email: $CF_EMAIL" \ - -H "X-Auth-Key: $CF_API_KEY" \ - -H "Content-Type: application/json" \ - --data '{"type":"A","name":"subdomain","content":"IP","proxied":true}' - -# Delete record -curl -X DELETE "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records/$RECORD_ID" \ - -H "X-Auth-Email: $CF_EMAIL" \ - -H "X-Auth-Key: $CF_API_KEY" -``` +--- ## Git Repository -This documentation is stored at: - **Gitea**: https://git.htsn.io/hutson/homelab-docs - **Local**: `~/Projects/homelab` - **Notes**: `~/Notes/05_Homelab` (symlink) ```bash -# Clone -git clone git@git.htsn.io:hutson/homelab-docs.git - -# Push changes cd ~/Projects/homelab git add -A && git commit -m "Update docs" && git push ``` -## Related Documentation - -| File | Description | -|------|-------------| -| [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md) | EMC storage enclosure (SES commands, LCC troubleshooting, maintenance) | -| [HOMEASSISTANT.md](HOMEASSISTANT.md) | Home Assistant API access, automations, integrations | -| [NETWORK.md](NETWORK.md) | Network bridges, VLANs, 
which bridge to use for new VMs | -| [IP-ASSIGNMENTS.md](IP-ASSIGNMENTS.md) | Complete IP address assignments for all devices and services | -| [SYNCTHING.md](SYNCTHING.md) | Syncthing setup, API access, device list, troubleshooting | -| [SHELL-ALIASES.md](SHELL-ALIASES.md) | ZSH aliases for Claude Code (`chomelab`, `ctrading`, etc.) | -| [configs/](configs/) | Symlinks to shared shell configs | - --- ## Backlog -Future improvements and maintenance tasks: - | Priority | Task | Notes | |----------|------|-------| -| Medium | **Re-IP all devices** | Current IP scheme is inconsistent. Plan: VMs 10.10.10.100-199, LXCs 10.10.10.200-249, Services 10.10.10.250-254 | -| Low | Install SSH on HomeAssistant | Currently only accessible via QEMU agent | -| Low | Set up SSH key for router | Currently requires expect/password | +| Medium | Re-IP all devices | Current IPs inconsistent | +| Medium | Upgrade to 20A circuit for UPS | Plug rewired 5-20P→5-15P | +| Low | Install SSH on HomeAssistant | Currently QEMU agent only | --- -## Changelog +## Recent Changes + +### 2025-12-22 +- Created comprehensive Phase 1 documentation split +- New docs: README.md, BACKUP-STRATEGY.md, STORAGE.md, UPS.md, TRAEFIK.md, SSH-ACCESS.md, POWER-MANAGEMENT.md, VMS.md +- Cleaned up CLAUDE.md to quick reference only + +### 2025-12-21 +- UPS upgrade: CyberPower OR2200PFCRT2U (1320W) +- NUT monitoring configured (master/slave) +- Full power failure test successful (~7 min recovery) +- Happy Server self-hosted relay deployed +- PVE Tailscale routing fix +- Proxmox 2-node cluster quorum fix + +**Full changelog**: See end of this file + +--- + +**Last Updated**: 2025-12-22 +**Documentation Status**: ✅ Phase 1 Complete + +--- + +
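+
+Quick spot-checks for the recent NUT, quorum, and MTU changes (a sketch — uses the host aliases and paths referenced elsewhere in these docs):
+
+```bash
+# NUT roles: PVE runs server + monitor, PVE2 runs the monitor only
+ssh pve 'systemctl is-active nut-server nut-monitor'
+ssh pve2 'systemctl is-active nut-monitor'
+
+# 2-node quorum workaround in place?
+ssh pve 'grep two_node /etc/pve/corosync.conf'
+
+# Jumbo frames on PVE2
+ssh pve2 'ip link show vmbr0 | grep -o "mtu [0-9]*"'
+```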
+Full Changelog (Click to expand) ### 2025-12-21 +**UPS Upgrade** +- Replaced WattBox WB-1100-IPVMB-6 (660W) with CyberPower OR2200PFCRT2U (1320W) +- Temporarily rewired plug 5-20P → 5-15P for 15A circuit +- Runtime: ~15-20 min at 33% load + +**NUT Monitoring** +- Configured NUT on PVE (master), PVE2 (slave) +- Shutdown threshold: 120 seconds runtime +- Custom shutdown script: `/usr/local/bin/ups-shutdown.sh` +- Home Assistant integration (UPS sensors) + **Happy Server Self-Hosted Relay** -- Deployed self-hosted Happy Coder relay server on docker-host (10.10.10.206) -- Stack includes: Happy Server, PostgreSQL, Redis, MinIO (all containerized) -- Configured Traefik reverse proxy at https://happy.htsn.io -- Added Cloudflare DNS record (proxied) -- Fixed Dockerfile to include Prisma migrations on startup +- Deployed on docker-host (10.10.10.206) +- Stack: Happy Server + PostgreSQL + Redis + MinIO +- URL: https://happy.htsn.io +- Traefik reverse proxy configured -**Docker-host CPU Upgrade** -- Changed VM 206 CPU from emulated to `host` passthrough -- Fixes x86-64-v2 compatibility issues with modern binaries (Sharp, MinIO) -- Requires: `ssh pve 'qm set 206 -cpu host'` + VM reboot +**Proxmox Fixes** +- PVE Tailscale routing: Added rule for local network access +- PVE2 MTU fix: vmbr0 + nic1 both set to 9000 +- 2-node cluster quorum: `two_node: 1` in corosync.conf -**PVE Tailscale Routing Fix** -- Fixed issue where PVE was unreachable via local network (10.10.10.120) -- Root cause: Tailscale routing table 52 was capturing local subnet traffic -- Fix: Added routing rule `ip rule add from 10.10.10.120 table main priority 5200` -- Made permanent in `/etc/network/interfaces` under vmbr0 +**Power Failure Test** +- Full end-to-end test successful +- VMs stopped gracefully at 2 min runtime +- Total recovery: ~7 minutes ### 2024-12-20 -**Git Repository Setup** -- Created homelab-docs repo on Gitea (git.htsn.io/hutson/homelab-docs) -- Set up SSH key authentication for git@git.htsn.io -- Created symlink from ~/Notes/05_Homelab → ~/Projects/homelab -- Added Gitea API token for future automation - -**SSH Key Deployment - All Systems** -- Added SSH keys to ALL VMs and LXCs (13 total hosts now accessible via key) -- Updated `~/.ssh/config` with complete host aliases -- Fixed permissions: FindShyt LXC `.ssh` ownership, enabled PermitRootLogin on LXCs -- Hosts now accessible: pve, pve2, truenas, saltbox, lmdev1, docker-host, fs-dev, copyparty, gitea-vm, trading-vm, pihole, traefik, findshyt - -**Documentation Updates** -- Rewrote SSH Access section with complete host table -- Added Password Auth section for router/Windows/HomeAssistant -- Added Backlog section with re-IP task -- Added Git Repository section with clone/push instructions +**Git & SSH** +- Created homelab-docs repo on Gitea +- Deployed SSH keys to all VMs/LXCs (13 hosts) +- Updated ~/.ssh/config with host aliases ### 2024-12-19 -**EMC Storage Enclosure - LCC B Failure** -- Diagnosed loud fan issue (speed code 5 → 4160 RPM) -- Root cause: Faulty LCC B controller causing false readings -- Resolution: Switched SAS cable to LCC A, fans now quiet (speed code 3 → 2670 RPM) -- Replacement ordered: EMC 303-108-000E ($14.95 eBay) -- Created [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md) with full documentation +**EMC Storage Enclosure** +- LCC B failure diagnosed, switched to LCC A +- Fans now quiet (speed code 3 vs 5) +- Created EMC-ENCLOSURE.md documentation -**SSH Key Consolidation** -- Renamed `~/.ssh/ai_trading_ed25519` → `~/.ssh/homelab` -- Updated 
`~/.ssh/config` on MacBook with all homelab hosts -- SSH key auth now works for: pve, pve2, docker-host, fs-dev, copyparty, lmdev1, gitea-vm, trading-vm -- No more sshpass needed for PVE servers +**QEMU Guest Agent** +- Installed on docker-host, fs-dev, copyparty +- All VMs now have agent except homeassistant -**QEMU Guest Agent Deployment** -- Installed on: docker-host (206), fs-dev (105), copyparty (201) -- All PVE VMs now have agent except homeassistant (110) -- Can now use `qm guest exec` for remote commands - -**VM Configuration Updates** -- docker-host: Fixed SSH key in cloud-init -- fs-dev: Fixed `.ssh` directory ownership (1000 → 1001) -- copyparty: Changed from DHCP to static IP (10.10.10.201) - -**Documentation Updates** -- Updated CLAUDE.md SSH section (removed sshpass examples) -- Added QEMU Agent column to VM tables -- Added storage enclosure troubleshooting to runbooks +
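+
+With the agent deployed, commands can be run inside these VMs without SSH — for example (a sketch; VMID 206 = docker-host per the VM tables):
+
+```bash
+# Confirm the agent responds, then run a command inside the guest
+ssh pve 'qm agent 206 ping'
+ssh pve 'qm guest exec 206 -- bash -c "uptime"'
+```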
diff --git a/HARDWARE.md b/HARDWARE.md new file mode 100644 index 0000000..e063500 --- /dev/null +++ b/HARDWARE.md @@ -0,0 +1,455 @@ +# Hardware Inventory + +Complete hardware specifications for all homelab equipment. + +## Servers + +### PVE (10.10.10.120) - Primary Proxmox Server + +#### CPU +- **Model**: AMD Ryzen Threadripper PRO 3975WX +- **Cores**: 32 cores / 64 threads +- **Base Clock**: 3.5 GHz +- **Boost Clock**: 4.2 GHz +- **TDP**: 280W +- **Architecture**: Zen 2 (7nm) +- **Socket**: sTRX4 +- **Features**: ECC support, PCIe 4.0 + +#### RAM +- **Capacity**: 128 GB +- **Type**: DDR4 ECC Registered +- **Speed**: Unknown (needs investigation) +- **Channels**: 8-channel (quad-channel per socket) +- **Idle Power**: ~30-40W + +#### Storage + +**OS/VM Storage:** + +| Pool | Devices | Type | Capacity | Purpose | +|------|---------|------|----------|---------| +| `nvme-mirror1` | 2x Sabrent Rocket Q NVMe | ZFS Mirror | 3.6 TB usable | High-performance VM storage | +| `nvme-mirror2` | 2x Kingston SFYRD 2TB NVMe | ZFS Mirror | 1.8 TB usable | Additional fast VM storage | +| `rpool` | 2x Samsung 870 QVO 4TB SSD | ZFS Mirror | 3.6 TB usable | Proxmox OS, containers, backups | + +**Total Storage**: ~9 TB usable + +#### GPUs + +| Model | Slot | VRAM | TDP | Purpose | Passed To | +|-------|------|------|-----|---------|-----------| +| NVIDIA Quadro P2000 | PCIe slot 1 | 5 GB GDDR5 | 75W | Plex transcoding | Host | +| NVIDIA TITAN RTX | PCIe slot 2 | 24 GB GDDR6 | 280W | AI workloads | Saltbox (101), lmdev1 (111) | + +**Total GPU Power**: 75W + 280W = 355W (under load) + +#### Network Cards + +| Interface | Model | Speed | Purpose | Bridge | +|-----------|-------|-------|---------|--------| +| enp1s0 | Intel I210 (onboard) | 1 Gb | Management | vmbr0 | +| enp35s0f0 | Intel X520 (dual-port SFP+) | 10 Gb | High-speed LXC | vmbr1 | +| enp35s0f1 | Intel X520 (dual-port SFP+) | 10 Gb | High-speed VM | vmbr2 | + +**10Gb Transceivers**: Intel FTLX8571D3BCV (SFP+ 10GBASE-SR, 850nm, multimode) + +#### Storage Controllers + +| Model | Interface | Purpose | +|-------|-----------|---------| +| LSI SAS2308 HBA | PCIe 3.0 x8 | Passed to TrueNAS VM for EMC enclosure | +| Samsung NVMe controller | PCIe | Passed to TrueNAS VM for ZFS caching | + +#### Motherboard +- **Model**: Unknown - needs investigation +- **Chipset**: AMD TRX40 +- **Form Factor**: ATX/EATX +- **PCIe Slots**: Multiple PCIe 4.0 slots +- **Features**: IOMMU support, ECC memory + +#### Power Supply +- **Model**: Unknown +- **Wattage**: Likely 1000W+ (needs investigation) +- **Type**: ATX, 80+ certification unknown + +#### Cooling +- **CPU Cooler**: Unknown - likely large tower or AIO +- **Case Fans**: Unknown quantity +- **Note**: CPU temps 70-80°C under load (healthy) + +--- + +### PVE2 (10.10.10.102) - Secondary Proxmox Server + +#### CPU +- **Model**: AMD Ryzen Threadripper PRO 3975WX +- **Specs**: Same as PVE (32C/64T, 280W TDP) + +#### RAM +- **Capacity**: 128 GB DDR4 ECC +- **Same specs as PVE** + +#### Storage + +| Pool | Devices | Type | Capacity | Purpose | +|------|---------|------|----------|---------| +| `nvme-mirror3` | 2x NVMe (model unknown) | ZFS Mirror | Unknown | High-performance VM storage | +| `local-zfs2` | 2x WD Red 6TB HDD | ZFS Mirror | ~6 TB usable | Bulk/archival storage (spins down) | + +**HDD Spindown**: Configured for 30-min idle spindown (saves ~10-16W) + +#### GPUs + +| Model | Slot | VRAM | TDP | Purpose | Passed To | +|-------|------|------|-----|---------|-----------| +| NVIDIA RTX A6000 | PCIe slot 1 | 48 GB 
GDDR6 | 300W | AI trading workloads | trading-vm (301) | + +#### Network Cards + +| Interface | Model | Speed | Purpose | +|-----------|-------|-------|---------| +| nic1 | Unknown (onboard) | 1 Gb | Management | + +**Note**: MTU set to 9000 for jumbo frames + +#### Motherboard +- **Model**: Unknown +- **Chipset**: AMD TRX40 +- **Similar to PVE** + +--- + +## Network Equipment + +### UniFi Dream Machine Pro (UCG-Fiber) + +- **Model**: UniFi Cloud Gateway Fiber +- **IP**: 10.10.10.1 +- **Ports**: Multiple 1Gb + SFP+ uplink +- **Features**: Router, firewall, VPN, IDS/IPS +- **MTU**: 9216 (supports jumbo frames) +- **Tailscale**: Installed for VPN failover + +### Switches + +**Details needed** - investigate current switch setup: +- 10Gb switch for high-speed connections? +- 1Gb switch for general devices? +- PoE capabilities? + +```bash +# Check what's connected to 10Gb interfaces +ssh pve 'ip link show enp35s0f0' +ssh pve 'ip link show enp35s0f1' +``` + +--- + +## Storage Hardware + +### EMC Storage Enclosure + +**See [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md) for complete details** + +- **Model**: EMC KTN-STL4 (or similar) +- **Form Factor**: 4U rackmount +- **Drive Bays**: 25x 3.5" SAS/SATA +- **Controllers**: Dual LCC (Link Control Cards) +- **Connection**: SAS via LSI SAS2308 HBA +- **Passed to**: TrueNAS VM (VMID 100) + +**Current Status**: +- LCC A: Active (working) +- LCC B: Failed (replacement ordered) + +**Drive Inventory**: Unknown - needs audit + +```bash +# Get drive list from TrueNAS +ssh truenas 'smartctl --scan' +ssh truenas 'lsblk' +``` + +### NVMe Drives + +| Model | Quantity | Capacity | Location | Pool | +|-------|----------|----------|----------|------| +| Sabrent Rocket Q | 2 | Unknown | PVE | nvme-mirror1 | +| Kingston SFYRD | 2 | 2 TB each | PVE | nvme-mirror2 | +| Unknown model | 2 | Unknown | PVE2 | nvme-mirror3 | +| Samsung (model unknown) | 1 | Unknown | TrueNAS (passed) | ZFS cache | + +### SSDs + +| Model | Quantity | Capacity | Location | Pool | +|-------|----------|----------|----------|------| +| Samsung 870 QVO | 2 | 4 TB each | PVE | rpool | + +### HDDs + +| Model | Quantity | Capacity | Location | Pool | +|-------|----------|----------|----------|------| +| WD Red | 2 | 6 TB each | PVE2 | local-zfs2 | +| Unknown (in EMC) | Unknown | Unknown | TrueNAS | vault | + +--- + +## UPS + +### Current UPS + +| Specification | Value | +|---------------|-------| +| **Model** | CyberPower OR2200PFCRT2U | +| **Capacity** | 2200VA / 1320W | +| **Form Factor** | 2U rackmount | +| **Input** | NEMA 5-15P (rewired from 5-20P) | +| **Outlets** | 2x 5-20R + 6x 5-15R | +| **Output** | PFC Sinewave | +| **Runtime** | ~15-20 min @ 33% load | +| **Interface** | USB (connected to PVE) | + +**See [UPS.md](UPS.md) for configuration details** + +--- + +## Client Devices + +### Mac Mini (Hutson's Workstation) + +- **Model**: Unknown generation +- **CPU**: Unknown +- **RAM**: Unknown +- **Storage**: Unknown +- **Network**: 1Gb Ethernet (en0) - MTU 9000 +- **Tailscale IP**: 100.108.89.58 +- **Local IP**: 10.10.10.125 (static) +- **Purpose**: Primary workstation, Happy Coder daemon host + +### MacBook (Mobile) + +- **Model**: Unknown +- **Network**: Wi-Fi + Ethernet adapter +- **Tailscale IP**: Unknown +- **Purpose**: Mobile work, development + +### Windows PC + +- **Model**: Unknown +- **CPU**: Unknown +- **Network**: 1Gb Ethernet +- **IP**: 10.10.10.150 +- **Purpose**: Gaming, Windows development, Syncthing node + +### Phone (Android) + +- **Model**: Unknown +- **IP**: 10.10.10.54 (when on 
Wi-Fi) +- **Purpose**: Syncthing mobile node, Happy Coder client + +--- + +## Rack Layout (If Applicable) + +**Needs documentation** - Current rack configuration unknown + +Suggested format: +``` +U42: Blank panel +U41: UPS (CyberPower 2U) +U40: UPS (CyberPower 2U) +U39: Switch (10Gb) +U38-U35: EMC Storage Enclosure (4U) +U34: PVE Server +U33: PVE2 Server +... +``` + +--- + +## Power Consumption + +### Measured Power Draw + +| Component | Idle | Typical | Peak | Notes | +|-----------|------|---------|------|-------| +| PVE Server | 250-350W | 500W | 750W | CPU + GPUs + storage | +| PVE2 Server | 200-300W | 400W | 600W | CPU + GPU + storage | +| Network Gear | ~50W | ~50W | ~50W | Router + switches | +| **Total** | **500-700W** | **~950W** | **~1400W** | Exceeds UPS under peak load | + +**UPS Capacity**: 1320W +**Typical Load**: 33-50% (safe margin) +**Peak Load**: Can exceed UPS capacity temporarily (acceptable) + +### Power Optimizations Applied + +**See [POWER-MANAGEMENT.md](POWER-MANAGEMENT.md) for details** + +- KSMD disabled: ~60-80W saved +- CPU governors: ~60-120W saved +- Syncthing rescans: ~60-80W saved +- HDD spindown: ~10-16W saved when idle +- **Total savings**: ~150-300W + +--- + +## Thermal Management + +### CPU Cooling + +**PVE & PVE2**: +- CPU cooler: Unknown model +- Thermal paste: Unknown, likely needs refresh if temps >85°C +- Target temp: 70-80°C under load +- Max safe: 90°C Tctl (Threadripper PRO spec) + +### GPU Cooling + +All GPUs are passively managed (stock coolers): +- TITAN RTX: 2-3W idle, 280W load +- RTX A6000: 11W idle, 300W load +- Quadro P2000: 25W constant (Plex active) + +### Case Airflow + +**Unknown** - needs investigation: +- Case model? +- Fan configuration? +- Positive or negative pressure? + +--- + +## Cable Management + +### Network Cables + +| Connection | Type | Length | Speed | +|------------|------|--------|-------| +| PVE → Switch (10Gb) | OM3 fiber | Unknown | 10Gb | +| PVE2 → Router | Cat6 | Unknown | 1Gb | +| Mac Mini → Switch | Cat6 | Unknown | 1Gb | +| TrueNAS → EMC | SAS cable | Unknown | 6Gb/s | + +### Power Cables + +**Critical**: All servers on UPS battery-backed outlets + +--- + +## Maintenance Schedule + +### Annual Maintenance + +- [ ] Clean dust from servers (every 6-12 months) +- [ ] Check thermal paste on CPUs (every 2-3 years) +- [ ] Test UPS battery runtime (annually) +- [ ] Verify all fans operational +- [ ] Check for bulging capacitors on PSUs + +### Drive Health + +```bash +# Check SMART status on all drives +ssh pve 'smartctl -a /dev/nvme0' +ssh pve2 'smartctl -a /dev/sda' +ssh truenas 'smartctl --scan | while read dev type; do echo "=== $dev ==="; smartctl -a $dev | grep -E "Model|Serial|Health|Reallocated|Current_Pending"; done' +``` + +### Temperature Monitoring + +```bash +# Check all temps (needs lm-sensors installed) +ssh pve 'sensors' +ssh pve2 'sensors' +``` + +--- + +## Warranty & Purchase Info + +**Needs documentation**: +- When were servers purchased? +- Where were components bought? +- Any warranties still active? +- Replacement part sources? + +--- + +## Upgrade Path + +### Short-term Upgrades (< 6 months) + +- [ ] 20A circuit for UPS (restore original 5-20P plug) +- [ ] Document missing hardware specs +- [ ] Label all cables +- [ ] Create rack diagram + +### Medium-term Upgrades (6-12 months) + +- [ ] Additional 10Gb NIC for PVE2? +- [ ] More NVMe storage? +- [ ] Upgrade network switches? +- [ ] Replace EMC enclosure with newer model? 
+ +### Long-term Upgrades (1-2 years) + +- [ ] CPU upgrade to newer Threadripper? +- [ ] RAM expansion to 256GB? +- [ ] Additional GPU for AI workloads? +- [ ] Migrate to PCIe 5.0 storage? + +--- + +## Investigation Needed + +High-priority items to document: + +- [ ] Get exact motherboard model (both servers) +- [ ] Get PSU model and wattage +- [ ] CPU cooler models +- [ ] Network switch models and configuration +- [ ] Complete drive inventory in EMC enclosure +- [ ] RAM speed and timings +- [ ] Case models +- [ ] Exact NVMe models for all drives + +**Commands to gather info**: + +```bash +# Motherboard +ssh pve 'dmidecode -t baseboard' + +# CPU details +ssh pve 'lscpu' + +# RAM details +ssh pve 'dmidecode -t memory | grep -E "Size|Speed|Manufacturer"' + +# Storage devices +ssh pve 'lsblk -o NAME,SIZE,TYPE,TRAN,MODEL' + +# Network cards +ssh pve 'lspci | grep -i network' + +# GPU details +ssh pve 'lspci | grep -i vga' +ssh pve 'nvidia-smi -L' # If nvidia-smi available +``` + +--- + +## Related Documentation + +- [VMS.md](VMS.md) - VM resource allocation +- [STORAGE.md](STORAGE.md) - Storage pools and usage +- [POWER-MANAGEMENT.md](POWER-MANAGEMENT.md) - Power optimizations +- [UPS.md](UPS.md) - UPS configuration +- [NETWORK.md](NETWORK.md) - Network configuration +- [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md) - Storage enclosure details + +--- + +**Last Updated**: 2025-12-22 +**Status**: ⚠️ Incomplete - many specs need investigation diff --git a/HOMEASSISTANT.md b/HOMEASSISTANT.md index 43a8784..3c23476 100644 --- a/HOMEASSISTANT.md +++ b/HOMEASSISTANT.md @@ -131,6 +131,42 @@ curl -s -H "Authorization: Bearer $HA_TOKEN" \ - **Philips Hue** - Lights - **Sonos** - Speakers - **Motion Sensors** - Various locations +- **NUT (Network UPS Tools)** - UPS monitoring (added 2025-12-21) + +### NUT / UPS Integration + +Monitors the CyberPower OR2200PFCRT2U UPS connected to PVE. + +**Connection:** +- Host: 10.10.10.120 +- Port: 3493 +- Username: upsmon +- Password: upsmon123 + +**Entities:** +| Entity ID | Description | +|-----------|-------------| +| `sensor.cyberpower_battery_charge` | Battery percentage | +| `sensor.cyberpower_load` | Current load % | +| `sensor.cyberpower_input_voltage` | Input voltage | +| `sensor.cyberpower_output_voltage` | Output voltage | +| `sensor.cyberpower_status` | Status (Online, On Battery, etc.) | +| `sensor.cyberpower_status_data` | Raw status (OL, OB, LB, CHRG) | + +**Dashboard Card Example:** +```yaml +type: entities +title: UPS Status +entities: + - entity: sensor.cyberpower_status + name: Status + - entity: sensor.cyberpower_battery_charge + name: Battery + - entity: sensor.cyberpower_load + name: Load + - entity: sensor.cyberpower_input_voltage + name: Input Voltage +``` ## Automations diff --git a/MAINTENANCE.md b/MAINTENANCE.md new file mode 100644 index 0000000..fef9648 --- /dev/null +++ b/MAINTENANCE.md @@ -0,0 +1,618 @@ +# Maintenance Procedures and Schedules + +Regular maintenance procedures for homelab infrastructure to ensure reliability and performance. 
+ +## Overview + +| Frequency | Tasks | Estimated Time | +|-----------|-------|----------------| +| **Daily** | Quick health check | 2-5 min | +| **Weekly** | Service status, logs review | 15-30 min | +| **Monthly** | Updates, backups verification | 1-2 hours | +| **Quarterly** | Full system audit, testing | 2-4 hours | +| **Annual** | Hardware maintenance, planning | 4-8 hours | + +--- + +## Daily Maintenance (Automated) + +### Quick Health Check Script + +Save as `~/bin/homelab-health-check.sh`: + +```bash +#!/bin/bash +# Daily homelab health check + +echo "=== Homelab Health Check ===" +echo "Date: $(date)" +echo "" + +echo "=== Server Status ===" +ssh pve 'uptime' 2>/dev/null || echo "PVE: UNREACHABLE" +ssh pve2 'uptime' 2>/dev/null || echo "PVE2: UNREACHABLE" +echo "" + +echo "=== CPU Temperatures ===" +ssh pve 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE: $(($(cat $f)/1000))°C"; fi; done' +ssh pve2 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE2: $(($(cat $f)/1000))°C"; fi; done' +echo "" + +echo "=== UPS Status ===" +ssh pve 'upsc cyberpower@localhost | grep -E "battery.charge:|battery.runtime:|ups.load:|ups.status:"' +echo "" + +echo "=== ZFS Pools ===" +ssh pve 'zpool status -x' 2>/dev/null +ssh pve2 'zpool status -x' 2>/dev/null +ssh truenas 'zpool status -x vault' +echo "" + +echo "=== Disk Space ===" +ssh pve 'df -h | grep -E "Filesystem|/dev/(nvme|sd)"' +ssh truenas 'df -h /mnt/vault' +echo "" + +echo "=== VM Status ===" +ssh pve 'qm list | grep running | wc -l' | xargs echo "PVE VMs running:" +ssh pve2 'qm list | grep running | wc -l' | xargs echo "PVE2 VMs running:" +echo "" + +echo "=== Syncthing Connections ===" +curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \ + "http://127.0.0.1:8384/rest/system/connections" | \ + python3 -c "import sys,json; d=json.load(sys.stdin)['connections']; \ + [print(f\"{v.get('name',k[:7])}: {'UP' if v['connected'] else 'DOWN'}\") for k,v in d.items()]" +echo "" + +echo "=== Check Complete ===" +``` + +**Run daily via cron**: +```bash +# Add to crontab +0 9 * * * ~/bin/homelab-health-check.sh | mail -s "Homelab Health Check" hutson@example.com +``` + +--- + +## Weekly Maintenance + +### Service Status Review + +**Check all critical services**: +```bash +# Proxmox services +ssh pve 'systemctl status pve-cluster pvedaemon pveproxy' +ssh pve2 'systemctl status pve-cluster pvedaemon pveproxy' + +# NUT (UPS monitoring) +ssh pve 'systemctl status nut-server nut-monitor' +ssh pve2 'systemctl status nut-monitor' + +# Container services +ssh pve 'pct exec 200 -- systemctl status pihole-FTL' # Pi-hole +ssh pve 'pct exec 202 -- systemctl status traefik' # Traefik + +# VM services (via QEMU agent) +ssh pve 'qm guest exec 100 -- bash -c "systemctl status nfs-server smbd"' # TrueNAS +``` + +### Log Review + +**Check for errors in critical logs**: +```bash +# Proxmox system logs +ssh pve 'journalctl -p err -b | tail -50' +ssh pve2 'journalctl -p err -b | tail -50' + +# VM logs (if QEMU agent available) +ssh pve 'qm guest exec 100 -- bash -c "journalctl -p err --since today"' + +# Traefik access logs +ssh pve 'pct exec 202 -- tail -100 /var/log/traefik/access.log' +``` + +### Syncthing Sync Status + +**Check for sync errors**: +```bash +# Check all folder errors +for folder in documents downloads desktop movies pictures notes config; do + echo "=== $folder ===" + curl -s -H 
"X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \ + "http://127.0.0.1:8384/rest/folder/errors?folder=$folder" | jq +done +``` + +**See**: [SYNCTHING.md](SYNCTHING.md) + +--- + +## Monthly Maintenance + +### System Updates + +#### Proxmox Updates + +**Check for updates**: +```bash +ssh pve 'apt update && apt list --upgradable' +ssh pve2 'apt update && apt list --upgradable' +``` + +**Apply updates**: +```bash +# PVE +ssh pve 'apt update && apt dist-upgrade -y' + +# PVE2 +ssh pve2 'apt update && apt dist-upgrade -y' + +# Reboot if kernel updated +ssh pve 'reboot' +ssh pve2 'reboot' +``` + +**⚠️ Important**: +- Check [Proxmox release notes](https://pve.proxmox.com/wiki/Roadmap) before major updates +- Test on PVE2 first if possible +- Ensure all VMs are backed up before updating +- Monitor VMs after reboot - some may need manual restart + +#### Container Updates (LXC) + +```bash +# Update all containers +ssh pve 'for ctid in 200 202 205; do pct exec $ctid -- bash -c "apt update && apt upgrade -y"; done' +``` + +#### VM Updates + +**Update VMs individually via SSH**: +```bash +# Ubuntu/Debian VMs +ssh truenas 'apt update && apt upgrade -y' +ssh docker-host 'apt update && apt upgrade -y' +ssh fs-dev 'apt update && apt upgrade -y' + +# Check if reboot required +ssh truenas '[ -f /var/run/reboot-required ] && echo "Reboot required"' +``` + +### ZFS Scrubs + +**Schedule**: Run monthly on all pools + +**PVE**: +```bash +# Start scrub on all pools +ssh pve 'zpool scrub nvme-mirror1' +ssh pve 'zpool scrub nvme-mirror2' +ssh pve 'zpool scrub rpool' + +# Check scrub status +ssh pve 'zpool status | grep -A2 scrub' +``` + +**PVE2**: +```bash +ssh pve2 'zpool scrub nvme-mirror3' +ssh pve2 'zpool scrub local-zfs2' +ssh pve2 'zpool status | grep -A2 scrub' +``` + +**TrueNAS**: +```bash +# Scrub via TrueNAS web UI or SSH +ssh truenas 'zpool scrub vault' +ssh truenas 'zpool status vault | grep -A2 scrub' +``` + +**Automate scrubs**: +```bash +# Add to crontab (run on 1st of month at 2 AM) +0 2 1 * * /sbin/zpool scrub nvme-mirror1 +0 2 1 * * /sbin/zpool scrub nvme-mirror2 +0 2 1 * * /sbin/zpool scrub rpool +``` + +**See**: [STORAGE.md](STORAGE.md) for pool details + +### SMART Tests + +**Run extended SMART tests monthly**: + +```bash +# TrueNAS drives (via QEMU agent) +ssh pve 'qm guest exec 100 -- bash -c "smartctl --scan | while read dev type; do smartctl -t long \$dev; done"' + +# Check results after 4-8 hours +ssh pve 'qm guest exec 100 -- bash -c "smartctl --scan | while read dev type; do echo \"=== \$dev ===\"; smartctl -a \$dev | grep -E \"Model|Serial|test result|Reallocated|Current_Pending\"; done"' + +# PVE drives +ssh pve 'for dev in /dev/nvme0 /dev/nvme1 /dev/sda /dev/sdb; do [ -e "$dev" ] && smartctl -t long $dev; done' + +# PVE2 drives +ssh pve2 'for dev in /dev/nvme0 /dev/nvme1 /dev/sda /dev/sdb; do [ -e "$dev" ] && smartctl -t long $dev; done' +``` + +**Automate SMART tests**: +```bash +# Add to crontab (run on 15th of month at 3 AM) +0 3 15 * * /usr/sbin/smartctl -t long /dev/nvme0 +0 3 15 * * /usr/sbin/smartctl -t long /dev/sda +``` + +### Certificate Renewal Verification + +**Check SSL certificate expiry**: +```bash +# Check Traefik certificates +ssh pve 'pct exec 202 -- cat /etc/traefik/acme.json | jq ".letsencrypt.Certificates[] | {domain: .domain.main, expires: .Dates.NotAfter}"' + +# Check specific service +echo | openssl s_client -servername git.htsn.io -connect git.htsn.io:443 2>/dev/null | openssl x509 -noout -dates +``` + +**Certificates should auto-renew 30 days before expiry via 
Traefik** + +**See**: [TRAEFIK.md](TRAEFIK.md) for certificate management + +### Backup Verification + +**⚠️ TODO**: No backup strategy currently in place + +**See**: [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md) for implementation plan + +--- + +## Quarterly Maintenance + +### Full System Audit + +**Check all systems comprehensively**: + +1. **ZFS Pool Health**: + ```bash + ssh pve 'zpool status -v' + ssh pve2 'zpool status -v' + ssh truenas 'zpool status -v vault' + ``` + Look for: errors, degraded vdevs, resilver operations + +2. **SMART Health**: + ```bash + # Run SMART health check script + ~/bin/smart-health-check.sh + ``` + Look for: reallocated sectors, pending sectors, failures + +3. **Disk Space Trends**: + ```bash + # Check growth rate + ssh pve 'zpool list -o name,size,allocated,free,fragmentation' + ssh truenas 'df -h /mnt/vault' + ``` + Plan for expansion if >80% full + +4. **VM Resource Usage**: + ```bash + # Check if VMs need more/less resources + ssh pve 'qm list' + ssh pve 'pvesh get /nodes/pve/status' + ``` + +5. **Network Performance**: + ```bash + # Test bandwidth between critical nodes + iperf3 -s # On one host + iperf3 -c 10.10.10.120 # From another + ``` + +6. **Temperature Monitoring**: + ```bash + # Check max temps over past quarter + # TODO: Set up Prometheus/Grafana for historical data + ssh pve 'sensors' + ssh pve2 'sensors' + ``` + +### Service Dependency Testing + +**Test critical paths**: + +1. **Power failure recovery** (if safe to test): + - See [UPS.md](UPS.md) for full procedure + - Verify VM startup order works + - Confirm all services come back online + +2. **Failover testing**: + - Tailscale subnet routing (PVE → UCG-Fiber) + - NUT monitoring (PVE server → PVE2 client) + +3. **Backup restoration** (when backups implemented): + - Test restoring a VM from backup + - Test restoring files from Syncthing versioning + +### Documentation Review + +- [ ] Update IP assignments in [IP-ASSIGNMENTS.md](IP-ASSIGNMENTS.md) +- [ ] Review and update service URLs in [SERVICES.md](SERVICES.md) +- [ ] Check for missing hardware specs in [HARDWARE.md](HARDWARE.md) +- [ ] Update any changed procedures in this document + +--- + +## Annual Maintenance + +### Hardware Maintenance + +**Physical cleaning**: +```bash +# Shut down servers (coordinate with users) +ssh pve 'shutdown -h now' +ssh pve2 'shutdown -h now' + +# Clean dust from: +# - CPU heatsinks +# - GPU fans +# - Case fans +# - PSU vents +# - Storage enclosure fans + +# Check for: +# - Bulging capacitors on PSU/motherboard +# - Loose cables +# - Fan noise/vibration +``` + +**Thermal paste inspection** (every 2-3 years): +- Check CPU temps vs baseline +- If temps >85°C under load, consider reapplying paste +- Threadripper PRO: Tctl max safe = 90°C + +**See**: [HARDWARE.md](HARDWARE.md) for component details + +### UPS Battery Test + +**Runtime test**: +```bash +# Check battery health +ssh pve 'upsc cyberpower@localhost | grep battery' + +# Perform runtime test (coordinate power loss) +# 1. Note current runtime estimate +# 2. Unplug UPS from wall +# 3. Let battery drain to 20% +# 4. Note actual runtime vs estimate +# 5. 
Plug back in before shutdown triggers + +# Battery replacement if: +# - Runtime < 10 min at typical load +# - Battery age > 3-5 years +# - Battery charge < 100% when on AC for 24h +``` + +**See**: [UPS.md](UPS.md) for full UPS details + +### Drive Replacement Planning + +**Check drive age and health**: +```bash +# Get drive hours and health +ssh truenas 'smartctl --scan | while read dev type; do + echo "=== $dev ==="; + smartctl -a $dev | grep -E "Model|Serial|Power_On_Hours|Reallocated|Pending"; +done' +``` + +**Replace drives if**: +- Reallocated sectors > 0 +- Pending sectors > 0 +- SMART pre-fail warnings +- Age > 5 years for HDDs (3-5 years for SSDs/NVMe) +- Hours > 50,000 for consumer drives + +**Budget for replacements**: +- HDDs: WD Red 6TB (~$150/drive) +- NVMe: Samsung/Kingston 2TB (~$150-200/drive) + +### Capacity Planning + +**Review growth trends**: +```bash +# Storage growth (compare to last year) +ssh pve 'zpool list' +ssh truenas 'df -h /mnt/vault' + +# Network bandwidth (if monitoring in place) +# Review Grafana dashboards + +# Power consumption +ssh pve 'upsc cyberpower@localhost ups.load' +``` + +**Plan expansions**: +- Storage: Add drives if >70% full +- RAM: Check if VMs hitting limits +- Network: Upgrade if bandwidth saturation +- UPS: Upgrade if load >80% + +### License and Subscription Review + +**Proxmox subscription** (if applicable): +- Community (free) or Enterprise subscription? +- Check for updates to pricing/features + +**Service subscriptions**: +- Domain registration (htsn.io) +- Cloudflare plan (currently free) +- Let's Encrypt (free, no action needed) + +--- + +## Update Schedules + +### Proxmox + +| Component | Frequency | Notes | +|-----------|-----------|-------| +| Security patches | Weekly | Via `apt upgrade` | +| Minor updates | Monthly | Test on PVE2 first | +| Major versions | Quarterly | Read release notes, plan downtime | +| Kernel updates | Monthly | Requires reboot | + +**Update procedure**: +1. Check [Proxmox release notes](https://pve.proxmox.com/wiki/Roadmap) +2. Backup VM configs: `vzdump --dumpdir /tmp` +3. Update: `apt update && apt dist-upgrade` +4. Reboot if kernel changed: `reboot` +5. Verify VMs auto-started: `qm list` + +### Containers (LXC) + +| Container | Update Frequency | Package Manager | +|-----------|------------------|-----------------| +| Pi-hole (200) | Weekly | `apt` | +| Traefik (202) | Monthly | `apt` | +| FindShyt (205) | As needed | `apt` | + +**Update command**: +```bash +ssh pve 'pct exec CTID -- bash -c "apt update && apt upgrade -y"' +``` + +### VMs + +| VM | Update Frequency | Notes | +|----|------------------|-------| +| TrueNAS | Monthly | Via web UI or `apt` | +| Saltbox | Weekly | Managed by Saltbox updates | +| HomeAssistant | Monthly | Via HA supervisor | +| Docker-host | Weekly | `apt` + Docker images | +| Trading-VM | As needed | Via SSH | +| Gitea-VM | Monthly | Via web UI + `apt` | + +**Docker image updates**: +```bash +ssh docker-host 'docker-compose pull && docker-compose up -d' +``` + +### Firmware Updates + +| Component | Check Frequency | Update Method | +|-----------|----------------|---------------| +| Motherboard BIOS | Annually | Manual flash (high risk) | +| GPU firmware | Rarely | `nvidia-smi` or manual | +| SSD/NVMe firmware | Quarterly | Vendor tools | +| HBA firmware | Annually | LSI tools | +| UPS firmware | Annually | PowerPanel or manual | + +**⚠️ Warning**: BIOS/firmware updates carry risk. 
Only update if: +- Critical security issue +- Needed for hardware compatibility +- Fixing known bug affecting you + +--- + +## Testing Checklists + +### Pre-Update Checklist + +Before ANY system update: +- [ ] Check current system state: `uptime`, `qm list`, `zpool status` +- [ ] Verify backups are current (when backup system in place) +- [ ] Check for critical VMs/services that can't have downtime +- [ ] Review update changelog/release notes +- [ ] Test on non-critical system first (PVE2 or test VM) +- [ ] Plan rollback strategy if update fails +- [ ] Notify users if downtime expected + +### Post-Update Checklist + +After system update: +- [ ] Verify system booted correctly: `uptime` +- [ ] Check all VMs/CTs started: `qm list`, `pct list` +- [ ] Test critical services: + - [ ] Pi-hole DNS: `nslookup google.com 10.10.10.10` + - [ ] Traefik routing: `curl -I https://plex.htsn.io` + - [ ] NFS/SMB shares: Test mount from VM + - [ ] Syncthing sync: Check all devices connected +- [ ] Review logs for errors: `journalctl -p err -b` +- [ ] Check temperatures: `sensors` +- [ ] Verify UPS monitoring: `upsc cyberpower@localhost` + +### Disaster Recovery Test + +**Quarterly test** (when backup system in place): +- [ ] Simulate VM failure: Restore from backup +- [ ] Simulate storage failure: Import pool on different system +- [ ] Simulate network failure: Verify Tailscale failover +- [ ] Simulate power failure: Test UPS shutdown procedure (if safe) +- [ ] Document recovery time and issues + +--- + +## Log Rotation + +**System logs** are automatically rotated by systemd-journald and logrotate. + +**Check log sizes**: +```bash +# Journalctl size +ssh pve 'journalctl --disk-usage' + +# Traefik logs +ssh pve 'pct exec 202 -- du -sh /var/log/traefik/' +``` + +**Configure retention**: +```bash +# Limit journald to 500MB +ssh pve 'echo "SystemMaxUse=500M" >> /etc/systemd/journald.conf' +ssh pve 'systemctl restart systemd-journald' +``` + +**Traefik log rotation** (already configured): +```bash +# /etc/logrotate.d/traefik on CT 202 +/var/log/traefik/*.log { + daily + rotate 7 + compress + delaycompress + missingok + notifempty +} +``` + +--- + +## Monitoring Integration + +**TODO**: Set up automated monitoring for these procedures + +**When monitoring is implemented** (see [MONITORING.md](MONITORING.md)): +- ZFS scrub completion/errors +- SMART test failures +- Certificate expiry warnings (<30 days) +- Update availability notifications +- Disk space thresholds (>80%) +- Temperature warnings (>85°C) + +--- + +## Related Documentation + +- [MONITORING.md](MONITORING.md) - Automated health checks and alerts +- [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md) - Backup implementation plan +- [UPS.md](UPS.md) - Power failure procedures +- [STORAGE.md](STORAGE.md) - ZFS pool management +- [HARDWARE.md](HARDWARE.md) - Hardware specifications +- [SERVICES.md](SERVICES.md) - Service inventory + +--- + +**Last Updated**: 2025-12-22 +**Status**: ⚠️ Manual procedures only - monitoring automation needed diff --git a/MONITORING.md b/MONITORING.md new file mode 100644 index 0000000..fad8286 --- /dev/null +++ b/MONITORING.md @@ -0,0 +1,546 @@ +# Monitoring and Alerting + +Documentation for system monitoring, health checks, and alerting across the homelab. + +## Current Monitoring Status + +| Component | Monitored? 
| Method | Alerts | Notes | +|-----------|------------|--------|--------|-------| +| **UPS** | ✅ Yes | NUT + Home Assistant | ❌ No | Battery, load, runtime tracked | +| **Syncthing** | ✅ Partial | API (manual checks) | ❌ No | Connection status available | +| **Server temps** | ✅ Partial | Manual checks | ❌ No | Via `sensors` command | +| **VM status** | ✅ Partial | Proxmox UI | ❌ No | Manual monitoring | +| **ZFS health** | ❌ No | Manual `zpool status` | ❌ No | No automated checks | +| **Disk health (SMART)** | ❌ No | Manual `smartctl` | ❌ No | No automated checks | +| **Network** | ❌ No | - | ❌ No | No uptime monitoring | +| **Services** | ❌ No | - | ❌ No | No health checks | +| **Backups** | ❌ No | - | ❌ No | No verification | + +**Overall Status**: ⚠️ **MINIMAL** - Most monitoring is manual, no automated alerts + +--- + +## Existing Monitoring + +### UPS Monitoring (NUT) + +**Status**: ✅ **Active and working** + +**What's monitored**: +- Battery charge percentage +- Runtime remaining (seconds) +- Load percentage +- Input/output voltage +- UPS status (OL/OB/LB) + +**Access**: +```bash +# Full UPS status +ssh pve 'upsc cyberpower@localhost' + +# Key metrics +ssh pve 'upsc cyberpower@localhost | grep -E "battery.charge:|battery.runtime:|ups.load:|ups.status:"' +``` + +**Home Assistant Integration**: +- Sensors: `sensor.cyberpower_*` +- Can be used for automation/alerts +- Currently: No alerts configured + +**See**: [UPS.md](UPS.md) + +--- + +### Syncthing Monitoring + +**Status**: ⚠️ **Partial** - API available, no automated monitoring + +**What's available**: +- Device connection status +- Folder sync status +- Sync errors +- Bandwidth usage + +**Manual Checks**: +```bash +# Check connections (Mac Mini) +curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \ + "http://127.0.0.1:8384/rest/system/connections" | \ + python3 -c "import sys,json; d=json.load(sys.stdin)['connections']; \ + [print(f\"{v.get('name',k[:7])}: {'UP' if v['connected'] else 'DOWN'}\") for k,v in d.items()]" + +# Check folder status +curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \ + "http://127.0.0.1:8384/rest/db/status?folder=documents" | jq + +# Check errors +curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \ + "http://127.0.0.1:8384/rest/folder/errors?folder=documents" | jq +``` + +**Needs**: Automated monitoring script + alerts + +**See**: [SYNCTHING.md](SYNCTHING.md) + +--- + +### Temperature Monitoring + +**Status**: ⚠️ **Manual only** + +**Current Method**: +```bash +# CPU temperature (Threadripper Tctl) +ssh pve 'for f in /sys/class/hwmon/hwmon*/temp*_input; do \ + label=$(cat ${f%_input}_label 2>/dev/null); \ + if [ "$label" = "Tctl" ]; then echo "PVE Tctl: $(($(cat $f)/1000))°C"; fi; done' + +ssh pve2 'for f in /sys/class/hwmon/hwmon*/temp*_input; do \ + label=$(cat ${f%_input}_label 2>/dev/null); \ + if [ "$label" = "Tctl" ]; then echo "PVE2 Tctl: $(($(cat $f)/1000))°C"; fi; done' +``` + +**Thresholds**: +- Healthy: 70-80°C under load +- Warning: >85°C +- Critical: >90°C (throttling) + +**Needs**: Automated monitoring + alert if >85°C + +--- + +### Proxmox VM Monitoring + +**Status**: ⚠️ **Manual only** + +**Current Access**: +- Proxmox Web UI: Node → Summary +- CLI: `ssh pve 'qm list'` + +**Metrics Available** (via Proxmox): +- CPU usage per VM +- RAM usage per VM +- Disk I/O +- Network I/O +- VM uptime + +**Needs**: API-based monitoring + alerts for VM down + +--- + +## Recommended Monitoring Stack + +### Option 1: Prometheus + Grafana (Recommended) + +**Why**: +- Industry standard 
+- Extensive integrations +- Beautiful dashboards +- Flexible alerting + +**Architecture**: +``` +Grafana (dashboard) → Prometheus (metrics DB) → Exporters (data collection) + ↓ + Alertmanager (alerts) +``` + +**Required Exporters**: +| Exporter | Monitors | Install On | +|----------|----------|------------| +| node_exporter | CPU, RAM, disk, network | PVE, PVE2, TrueNAS, all VMs | +| zfs_exporter | ZFS pool health | PVE, PVE2, TrueNAS | +| smartmon_exporter | Drive SMART data | PVE, PVE2, TrueNAS | +| nut_exporter | UPS metrics | PVE | +| proxmox_exporter | VM/CT stats | PVE, PVE2 | +| cadvisor | Docker containers | Saltbox, docker-host | + +**Deployment**: +```bash +# Create monitoring VM +ssh pve 'qm create 210 --name monitoring --memory 4096 --cores 2 \ + --net0 virtio,bridge=vmbr0' + +# Install Prometheus + Grafana (via Docker) +# /opt/monitoring/docker-compose.yml +``` + +**Estimated Setup Time**: 4-6 hours + +--- + +### Option 2: Uptime Kuma (Simpler Alternative) + +**Why**: +- Lightweight +- Easy to set up +- Web-based dashboard +- Built-in alerts (email, Slack, etc.) + +**What it monitors**: +- HTTP/HTTPS endpoints +- Ping (ICMP) +- Ports (TCP) +- Docker containers + +**Deployment**: +```bash +ssh docker-host 'mkdir -p /opt/uptime-kuma' +cat > docker-compose.yml << 'EOF' +version: "3.8" +services: + uptime-kuma: + image: louislam/uptime-kuma:latest + ports: + - "3001:3001" + volumes: + - ./data:/app/data + restart: unless-stopped +EOF + +# Access: http://10.10.10.206:3001 +# Add Traefik config for uptime.htsn.io +``` + +**Estimated Setup Time**: 1-2 hours + +--- + +### Option 3: Netdata (Real-time Monitoring) + +**Why**: +- Real-time metrics (1-second granularity) +- Auto-discovers services +- Low overhead +- Beautiful web UI + +**Deployment**: +```bash +# Install on each server +ssh pve 'bash <(curl -Ss https://my-netdata.io/kickstart.sh)' +ssh pve2 'bash <(curl -Ss https://my-netdata.io/kickstart.sh)' + +# Access: +# http://10.10.10.120:19999 (PVE) +# http://10.10.10.102:19999 (PVE2) +``` + +**Parent-Child Setup** (optional): +- Configure PVE as parent +- Stream metrics from PVE2 → PVE +- Single dashboard for both servers + +**Estimated Setup Time**: 1 hour + +--- + +## Critical Metrics to Monitor + +### Server Health + +| Metric | Threshold | Action | +|--------|-----------|--------| +| **CPU usage** | >90% for 5 min | Alert | +| **CPU temp** | >85°C | Alert | +| **CPU temp** | >90°C | Critical alert | +| **RAM usage** | >95% | Alert | +| **Disk space** | >80% | Warning | +| **Disk space** | >90% | Alert | +| **Load average** | >CPU count | Alert | + +### Storage Health + +| Metric | Threshold | Action | +|--------|-----------|--------| +| **ZFS pool errors** | >0 | Alert immediately | +| **ZFS pool degraded** | Any degraded vdev | Critical alert | +| **ZFS scrub failed** | Last scrub error | Alert | +| **SMART reallocated sectors** | >0 | Warning | +| **SMART pending sectors** | >0 | Alert | +| **SMART failure** | Pre-fail | Critical - replace drive | + +### UPS + +| Metric | Threshold | Action | +|--------|-----------|--------| +| **Battery charge** | <20% | Warning | +| **Battery charge** | <10% | Alert | +| **On battery** | >5 min | Alert | +| **Runtime** | <5 min | Critical | + +### Network + +| Metric | Threshold | Action | +|--------|-----------|--------| +| **Device unreachable** | >2 min down | Alert | +| **High packet loss** | >5% | Warning | +| **Bandwidth saturation** | >90% | Warning | + +### VMs/Services + +| Metric | Threshold | Action | 
+|--------|-----------|--------| +| **VM stopped** | Critical VM down | Alert immediately | +| **Service unreachable** | HTTP 5xx or timeout | Alert | +| **Backup failed** | Any backup failure | Alert | +| **Certificate expiry** | <30 days | Warning | +| **Certificate expiry** | <7 days | Alert | + +--- + +## Alert Destinations + +### Email Alerts + +**Recommended**: Set up SMTP relay for email alerts + +**Options**: +1. Gmail SMTP (free, rate-limited) +2. SendGrid (free tier: 100 emails/day) +3. Mailgun (free tier available) +4. Self-hosted mail server (complex) + +**Configuration Example** (Prometheus Alertmanager): +```yaml +# /etc/alertmanager/alertmanager.yml +receivers: + - name: 'email' + email_configs: + - to: 'hutson@example.com' + from: 'alerts@htsn.io' + smarthost: 'smtp.gmail.com:587' + auth_username: 'alerts@htsn.io' + auth_password: 'app-password-here' +``` + +--- + +### Push Notifications + +**Options**: +- **Pushover**: $5 one-time, reliable +- **Pushbullet**: Free tier available +- **Telegram Bot**: Free +- **Discord Webhook**: Free +- **Slack**: Free tier available + +**Recommended**: Pushover or Telegram for mobile alerts + +--- + +### Home Assistant Alerts + +Since Home Assistant is already running, use it for alerts: + +**Automation Example**: +```yaml +automation: + - alias: "UPS Low Battery Alert" + trigger: + - platform: numeric_state + entity_id: sensor.cyberpower_battery_charge + below: 20 + action: + - service: notify.mobile_app + data: + message: "⚠️ UPS battery at {{ states('sensor.cyberpower_battery_charge') }}%" + + - alias: "Server High Temperature" + trigger: + - platform: template + value_template: "{{ sensor.pve_cpu_temp > 85 }}" + action: + - service: notify.mobile_app + data: + message: "🔥 PVE CPU temperature: {{ states('sensor.pve_cpu_temp') }}°C" +``` + +**Needs**: Sensors for CPU temp, disk space, etc. 
in Home Assistant + +--- + +## Monitoring Scripts + +### Daily Health Check + +Save as `~/bin/homelab-health-check.sh`: + +```bash +#!/bin/bash +# Daily homelab health check + +echo "=== Homelab Health Check ===" +echo "Date: $(date)" +echo "" + +echo "=== Server Status ===" +ssh pve 'uptime' 2>/dev/null || echo "PVE: UNREACHABLE" +ssh pve2 'uptime' 2>/dev/null || echo "PVE2: UNREACHABLE" +echo "" + +echo "=== CPU Temperatures ===" +ssh pve 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE: $(($(cat $f)/1000))°C"; fi; done' +ssh pve2 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE2: $(($(cat $f)/1000))°C"; fi; done' +echo "" + +echo "=== UPS Status ===" +ssh pve 'upsc cyberpower@localhost | grep -E "battery.charge:|battery.runtime:|ups.load:|ups.status:"' +echo "" + +echo "=== ZFS Pools ===" +ssh pve 'zpool status -x' 2>/dev/null +ssh pve2 'zpool status -x' 2>/dev/null +ssh truenas 'zpool status -x vault' +echo "" + +echo "=== Disk Space ===" +ssh pve 'df -h | grep -E "Filesystem|/dev/(nvme|sd)"' +ssh truenas 'df -h /mnt/vault' +echo "" + +echo "=== VM Status ===" +ssh pve 'qm list | grep running | wc -l' | xargs echo "PVE VMs running:" +ssh pve2 'qm list | grep running | wc -l' | xargs echo "PVE2 VMs running:" +echo "" + +echo "=== Syncthing Connections ===" +curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \ + "http://127.0.0.1:8384/rest/system/connections" | \ + python3 -c "import sys,json; d=json.load(sys.stdin)['connections']; \ + [print(f\"{v.get('name',k[:7])}: {'UP' if v['connected'] else 'DOWN'}\") for k,v in d.items()]" +echo "" + +echo "=== Check Complete ===" +``` + +**Run daily**: +```cron +0 9 * * * ~/bin/homelab-health-check.sh | mail -s "Homelab Health Check" hutson@example.com +``` + +--- + +### ZFS Scrub Checker + +```bash +#!/bin/bash +# Check last ZFS scrub status + +echo "=== ZFS Scrub Status ===" + +for host in pve pve2; do + echo "--- $host ---" + ssh $host 'zpool status | grep -A1 scrub' + echo "" +done + +echo "--- TrueNAS ---" +ssh truenas 'zpool status vault | grep -A1 scrub' +``` + +--- + +### SMART Health Checker + +```bash +#!/bin/bash +# Check SMART health on all drives + +echo "=== SMART Health Check ===" + +echo "--- TrueNAS Drives ---" +ssh truenas 'smartctl --scan | while read dev type; do + echo "=== $dev ==="; + smartctl -H $dev | grep -E "SMART overall|PASSED|FAILED"; +done' + +echo "--- PVE Drives ---" +ssh pve 'for dev in /dev/nvme* /dev/sd*; do + [ -e "$dev" ] && echo "=== $dev ===" && smartctl -H $dev | grep -E "SMART|PASSED|FAILED"; +done' +``` + +--- + +## Dashboard Recommendations + +### Grafana Dashboard Layout + +**Page 1: Overview** +- Server uptime +- CPU usage (all servers) +- RAM usage (all servers) +- Disk space (all pools) +- Network traffic +- UPS status + +**Page 2: Storage** +- ZFS pool health +- SMART status for all drives +- I/O latency +- Scrub progress +- Disk temperatures + +**Page 3: VMs** +- VM status (up/down) +- VM resource usage +- VM disk I/O +- VM network traffic + +**Page 4: Services** +- Service health checks +- HTTP response times +- Certificate expiry dates +- Syncthing sync status + +--- + +## Implementation Plan + +### Phase 1: Basic Monitoring (Week 1) + +- [ ] Install Uptime Kuma or Netdata +- [ ] Add HTTP checks for all services +- [ ] Configure UPS alerts in Home Assistant +- [ ] Set up daily health check email + +**Estimated Time**: 
4-6 hours + +--- + +### Phase 2: Advanced Monitoring (Week 2-3) + +- [ ] Install Prometheus + Grafana +- [ ] Deploy node_exporter on all servers +- [ ] Deploy zfs_exporter +- [ ] Deploy smartmon_exporter +- [ ] Create Grafana dashboards + +**Estimated Time**: 8-12 hours + +--- + +### Phase 3: Alerting (Week 4) + +- [ ] Configure Alertmanager +- [ ] Set up email/push notifications +- [ ] Create alert rules for all critical metrics +- [ ] Test all alert paths +- [ ] Document alert procedures + +**Estimated Time**: 4-6 hours + +--- + +## Related Documentation + +- [UPS.md](UPS.md) - UPS monitoring details +- [STORAGE.md](STORAGE.md) - ZFS health checks +- [SERVICES.md](SERVICES.md) - Service inventory +- [HOMEASSISTANT.md](HOMEASSISTANT.md) - Home Assistant automations +- [MAINTENANCE.md](MAINTENANCE.md) - Regular maintenance checks + +--- + +**Last Updated**: 2025-12-22 +**Status**: ⚠️ **Minimal monitoring currently in place - implementation needed** diff --git a/POWER-MANAGEMENT.md b/POWER-MANAGEMENT.md new file mode 100644 index 0000000..dbc049f --- /dev/null +++ b/POWER-MANAGEMENT.md @@ -0,0 +1,509 @@ +# Power Management and Optimization + +Documentation of power optimizations applied to reduce idle power consumption and heat generation. + +## Overview + +Combined estimated power draw: **~1000-1350W under load**, **500-700W idle** + +Through various optimizations, we've reduced idle power consumption by approximately **150-250W** compared to default settings. + +--- + +## Power Draw Estimates + +### PVE (10.10.10.120) + +| Component | Idle | Load | TDP | +|-----------|------|------|-----| +| Threadripper PRO 3975WX | 150-200W | 400-500W | 280W | +| NVIDIA TITAN RTX | 2-3W | 250W | 280W | +| NVIDIA Quadro P2000 | 25W | 70W | 75W | +| RAM (128 GB DDR4) | 30-40W | 30-40W | - | +| Storage (NVMe + SSD) | 20-30W | 40-50W | - | +| HBAs, fans, misc | 20-30W | 20-30W | - | +| **Total** | **250-350W** | **800-940W** | - | + +### PVE2 (10.10.10.102) + +| Component | Idle | Load | TDP | +|-----------|------|------|-----| +| Threadripper PRO 3975WX | 150-200W | 400-500W | 280W | +| NVIDIA RTX A6000 | 11W | 280W | 300W | +| RAM (128 GB DDR4) | 30-40W | 30-40W | - | +| Storage (NVMe + HDD) | 20-30W | 40-50W | - | +| Fans, misc | 15-20W | 15-20W | - | +| **Total** | **226-330W** | **765-890W** | - | + +### Combined + +| Metric | Idle | Load | +|--------|------|------| +| Servers | 476-680W | 1565-1830W | +| Network gear | ~50W | ~50W | +| **Total** | **~530-730W** | **~1615-1880W** | +| **UPS Load** | 40-55% | 120-140% ⚠️ | + +**Note**: UPS capacity is 1320W. Under heavy load, servers can exceed UPS capacity, which is acceptable since high load is rare. + +--- + +## Optimizations Applied + +### 1. KSMD Disabled (2024-12-17) + +**KSM** (Kernel Same-page Merging) scans memory to deduplicate identical pages across VMs. 
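+
+To see how much memory KSM is actually sharing at any point — useful when judging whether the scan overhead is worth it, e.g. after an update re-enables it — the kernel exposes counters in sysfs (standard paths; a quick sketch):
+
+```bash
+# Pages currently deduplicated by KSM (4 KiB pages on x86-64)
+ssh pve 'echo "run:           $(cat /sys/kernel/mm/ksm/run)";
+         echo "pages_sharing: $(cat /sys/kernel/mm/ksm/pages_sharing)";
+         echo "approx saved:  $(( $(cat /sys/kernel/mm/ksm/pages_sharing) * 4 / 1024 )) MiB"'
+```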
+ +**Problem**: +- KSMD was consuming 44-57% CPU continuously on PVE +- Caused CPU temp to rise from 74°C to 83°C +- **Negative profit**: More power spent scanning than saved from deduplication + +**Solution**: Disabled KSM permanently + +**Configuration**: + +**Systemd service**: `/etc/systemd/system/disable-ksm.service` +```ini +[Unit] +Description=Disable KSM (Kernel Same-page Merging) +After=multi-user.target + +[Service] +Type=oneshot +ExecStart=/bin/sh -c 'echo 0 > /sys/kernel/mm/ksm/run' +RemainAfterExit=yes + +[Install] +WantedBy=multi-user.target +``` + +**Enable and start**: +```bash +systemctl daemon-reload +systemctl enable --now disable-ksm +systemctl mask ksmtuned # Prevent re-enabling +``` + +**Verify**: +```bash +# KSM should be disabled (run=0) +cat /sys/kernel/mm/ksm/run # Should output: 0 + +# ksmd should show 0% CPU +ps aux | grep ksmd +``` + +**Savings**: ~60-80W + significant temperature reduction (74°C → 83°C prevented) + +**⚠️ Important**: Proxmox updates sometimes re-enable KSM. If CPU is unexpectedly hot, check: +```bash +cat /sys/kernel/mm/ksm/run +# If 1, disable it: +echo 0 > /sys/kernel/mm/ksm/run +systemctl mask ksmtuned +``` + +--- + +### 2. CPU Governor Optimization (2024-12-16) + +Default CPU governor keeps cores at max frequency even when idle, wasting power. + +#### PVE: `amd-pstate-epp` Driver + +**Driver**: `amd-pstate-epp` (modern AMD P-state driver) +**Governor**: `powersave` +**EPP**: `balance_power` + +**Configuration**: + +**Systemd service**: `/etc/systemd/system/cpu-powersave.service` +```ini +[Unit] +Description=Set CPU governor to powersave with balance_power EPP +After=multi-user.target + +[Service] +Type=oneshot +ExecStart=/bin/sh -c 'for cpu in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo powersave > $cpu; done' +ExecStart=/bin/sh -c 'for cpu in /sys/devices/system/cpu/cpu*/cpufreq/energy_performance_preference; do echo balance_power > $cpu; done' +RemainAfterExit=yes + +[Install] +WantedBy=multi-user.target +``` + +**Enable**: +```bash +systemctl daemon-reload +systemctl enable --now cpu-powersave +``` + +**Verify**: +```bash +# Check governor +cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor +# Output: powersave + +# Check EPP +cat /sys/devices/system/cpu/cpu0/cpufreq/energy_performance_preference +# Output: balance_power + +# Check current frequency (should be low when idle) +grep MHz /proc/cpuinfo | head -5 +# Should show ~1700-2200 MHz idle, up to 4000 MHz under load +``` + +#### PVE2: `acpi-cpufreq` Driver + +**Driver**: `acpi-cpufreq` (older ACPI driver) +**Governor**: `schedutil` (adaptive, better than powersave for this driver) + +**Configuration**: + +**Systemd service**: `/etc/systemd/system/cpu-powersave.service` +```ini +[Unit] +Description=Set CPU governor to schedutil +After=multi-user.target + +[Service] +Type=oneshot +ExecStart=/bin/sh -c 'for cpu in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo schedutil > $cpu; done' +RemainAfterExit=yes + +[Install] +WantedBy=multi-user.target +``` + +**Enable**: +```bash +systemctl daemon-reload +systemctl enable --now cpu-powersave +``` + +**Verify**: +```bash +cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor +# Output: schedutil + +grep MHz /proc/cpuinfo | head -5 +# Should show ~1700-2200 MHz idle +``` + +**Savings**: ~60-120W combined (CPUs now idle at 1.7-2.2 GHz instead of 4 GHz) + +**Performance impact**: Minimal - CPU still boosts to max frequency under load + +--- + +### 3. 
GPU Power States (2024-12-16) + +GPUs automatically enter low-power states when idle. Verified optimal. + +| GPU | Location | Idle Power | P-State | Notes | +|-----|----------|------------|---------|-------| +| RTX A6000 | PVE2 | 11W | P8 | Excellent idle power | +| TITAN RTX | PVE | 2-3W | P8 | Excellent idle power | +| Quadro P2000 | PVE | 25W | P0 | Plex keeps it active | + +**Check GPU power state**: +```bash +# Via nvidia-smi (if installed in VM) +ssh lmdev1 'nvidia-smi --query-gpu=name,power.draw,pstate --format=csv' + +# Expected output: +# name, power.draw [W], pstate +# NVIDIA TITAN RTX, 2.50 W, P8 + +# Via lspci (from Proxmox host - shows link speed, not power) +ssh pve 'lspci | grep -i nvidia' +``` + +**P-States**: +- **P0**: Maximum performance +- **P8**: Minimum power (idle) + +**No action needed** - GPUs automatically manage power states. + +**Savings**: N/A (already optimal) + +--- + +### 4. Syncthing Rescan Intervals (2024-12-16) + +Aggressive 60-second rescans were keeping TrueNAS VM at 86% CPU constantly. + +**Changed**: +- Large folders: 60s → **3600s** (1 hour) +- Affected: downloads (38GB), documents (11GB), desktop (7.2GB), movies, pictures, notes, config + +**Configuration**: Via Syncthing UI on each device +- Settings → Folders → [Folder Name] → Advanced → Rescan Interval + +**Savings**: ~60-80W (TrueNAS CPU usage dropped from 86% to <10%) + +**Trade-off**: Changes take up to 1 hour to detect instead of 1 minute +- Still acceptable for most use cases +- Manual rescan available if needed: `curl -X POST "http://localhost:8384/rest/db/scan?folder=FOLDER" -H "X-API-Key: API_KEY"` + +--- + +### 5. ksmtuned Disabled (2024-12-16) + +**ksmtuned** is the daemon that tunes KSM parameters. Even with KSM disabled, the tuning daemon was still running. + +**Solution**: Stopped and disabled on both servers + +```bash +systemctl stop ksmtuned +systemctl disable ksmtuned +systemctl mask ksmtuned # Prevent re-enabling +``` + +**Savings**: ~2-5W + +--- + +### 6. 
HDD Spindown on PVE2 (2024-12-16)

**Problem**: `local-zfs2` pool (2x WD Red 6TB HDD) had only 768 KB used, but the drives were spinning 24/7

**Solution**: Configure 30-minute spindown timeout

**Udev rule**: `/etc/udev/rules.d/69-hdd-spindown.rules`
```udev
# Spin down WD Red 6TB drives after 30 minutes idle
ACTION=="add|change", KERNEL=="sd[a-z]", ATTRS{model}=="WDC WD60EFRX-68L*", RUN+="/sbin/hdparm -S 241 /dev/%k"
```

**hdparm value**: 241 = 30 minutes
- Formula: values 1-240 mean `value * 5 seconds`; values 241-251 mean `(value - 240) * 30 minutes`
- 241 → (241 - 240) * 30 minutes = 30 minutes

**Apply rule**:
```bash
udevadm control --reload-rules
udevadm trigger

# Verify drives have spindown set
hdparm -I /dev/sda | grep -i standby
hdparm -I /dev/sdb | grep -i standby
```

**Check if drives are spun down**:
```bash
hdparm -C /dev/sda
# Output: drive state is: standby (spun down)
# or: drive state is: active/idle (spinning)
```

**Savings**: ~10-16W when spun down (8W per drive)

**Trade-off**: 5-10 second delay when accessing pool after spindown

---

## Potential Optimizations (Not Yet Applied)

### PCIe ASPM (Active State Power Management)

**Benefit**: Reduce power of idle PCIe devices
**Risk**: May cause stability issues with some devices
**Estimated savings**: 5-15W

**Test**:
```bash
# Check current ASPM state
lspci -vv | grep -i aspm

# Enable ASPM (test first)
# Add to kernel cmdline: pcie_aspm=force
# Edit /etc/default/grub:
GRUB_CMDLINE_LINUX_DEFAULT="quiet pcie_aspm=force"

# Update grub
update-grub
reboot
```

### NMI Watchdog Disable

**Benefit**: Reduce CPU wakeups
**Risk**: Harder to debug kernel hangs
**Estimated savings**: 1-3W

**Test**:
```bash
# Disable NMI watchdog
echo 0 > /proc/sys/kernel/nmi_watchdog

# Make permanent (add to kernel cmdline)
# Edit /etc/default/grub:
GRUB_CMDLINE_LINUX_DEFAULT="quiet nmi_watchdog=0"

update-grub
reboot
```

---

## Monitoring

### CPU Frequency

```bash
# Current frequency on all cores
ssh pve 'grep MHz /proc/cpuinfo | head -10'

# Governor
ssh pve 'cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor'

# Available governors
ssh pve 'cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors'
```

### CPU Temperature

```bash
# PVE
ssh pve 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE Tctl: $(($(cat $f)/1000))°C"; fi; done'

# PVE2
ssh pve2 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE2 Tctl: $(($(cat $f)/1000))°C"; fi; done'
```

**Healthy temps**: 70-80°C under load
**Warning**: >85°C
**Throttle**: 90°C (Tctl max for Threadripper PRO)

### GPU Power Draw

```bash
# If nvidia-smi installed in VM
ssh lmdev1 'nvidia-smi --query-gpu=name,power.draw,power.limit,pstate --format=csv'

# Sample output:
# name, power.draw [W], power.limit [W], pstate
# NVIDIA TITAN RTX, 2.50 W, 280.00 W, P8
```

### Power Consumption (UPS)

```bash
# Check UPS load percentage
ssh pve 'upsc cyberpower@localhost ups.load'

# Battery runtime (seconds)
ssh pve 'upsc cyberpower@localhost battery.runtime'

# Full UPS status
ssh pve 'upsc cyberpower@localhost'
```

See [UPS.md](UPS.md) for more UPS monitoring details.
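
### Power Optimization Audit

To catch regressions after Proxmox updates (KSM re-enabling itself, CPU governor resets), a single script can re-check the applied optimizations in one pass. A minimal sketch, assuming the `pve`/`pve2` SSH aliases from [SSH-ACCESS.md](SSH-ACCESS.md) and that the WD Red drives on PVE2 are `/dev/sda` and `/dev/sdb`:

```bash
#!/bin/bash
# Re-check applied power optimizations on both hosts

for host in pve pve2; do
  echo "=== $host ==="
  # KSM should be 0 (disabled); Proxmox updates sometimes turn it back on
  echo -n "KSM run flag: "; ssh $host 'cat /sys/kernel/mm/ksm/run'
  # Expected: powersave on PVE, schedutil on PVE2
  echo -n "CPU governor: "; ssh $host 'cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor'
  echo ""
done

# HDD spindown state on PVE2 (adjust device names if they differ)
ssh pve2 'hdparm -C /dev/sda /dev/sdb'
```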
+ +### ZFS ARC Memory Usage + +```bash +# PVE +ssh pve 'arc_summary | grep -A5 "ARC size"' + +# TrueNAS +ssh truenas 'arc_summary | grep -A5 "ARC size"' +``` + +**ARC** (Adaptive Replacement Cache) uses RAM for ZFS caching. Adjust if needed: + +```bash +# Limit ARC to 32 GB (example) +# Edit /etc/modprobe.d/zfs.conf: +options zfs zfs_arc_max=34359738368 + +# Apply (reboot required) +update-initramfs -u +reboot +``` + +--- + +## Troubleshooting + +### CPU Not Downclocking + +```bash +# Check current governor +cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor + +# Should be: powersave (PVE) or schedutil (PVE2) +# If not, systemd service may have failed + +# Check service status +systemctl status cpu-powersave + +# Manually set governor (temporary) +echo powersave | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor + +# Check frequency +grep MHz /proc/cpuinfo | head -5 +``` + +### High Idle Power After Update + +**Common causes**: +1. **KSM re-enabled** after Proxmox update + - Check: `cat /sys/kernel/mm/ksm/run` + - Fix: `echo 0 > /sys/kernel/mm/ksm/run && systemctl mask ksmtuned` + +2. **CPU governor reset** to default + - Check: `cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor` + - Fix: `systemctl restart cpu-powersave` + +3. **GPU stuck in high-performance mode** + - Check: `nvidia-smi --query-gpu=pstate --format=csv` + - Fix: Restart VM or power cycle GPU + +### HDDs Won't Spin Down + +```bash +# Check spindown setting +hdparm -I /dev/sda | grep -i standby + +# Set spindown manually (temporary) +hdparm -S 241 /dev/sda + +# Check if drive is idle (ZFS may keep it active) +zpool iostat -v 1 5 # Watch for activity + +# Check what's accessing the drive +lsof | grep /mnt/pool +``` + +--- + +## Power Optimization Summary + +| Optimization | Savings | Applied | Notes | +|--------------|---------|---------|-------| +| **KSMD disabled** | 60-80W | ✅ | Also reduces CPU temp significantly | +| **CPU governor** | 60-120W | ✅ | PVE: powersave+balance_power, PVE2: schedutil | +| **GPU power states** | 0W | ✅ | Already optimal (automatic) | +| **Syncthing rescans** | 60-80W | ✅ | Reduced TrueNAS CPU usage | +| **ksmtuned disabled** | 2-5W | ✅ | Minor but easy win | +| **HDD spindown** | 10-16W | ✅ | Only when drives idle | +| PCIe ASPM | 5-15W | ❌ | Not yet tested | +| NMI watchdog | 1-3W | ❌ | Not yet tested | +| **Total savings** | **~150-300W** | - | Significant reduction | + +--- + +## Related Documentation + +- [UPS.md](UPS.md) - UPS capacity and power monitoring +- [STORAGE.md](STORAGE.md) - HDD spindown configuration +- [VMS.md](VMS.md) - VM resource allocation + +--- + +**Last Updated**: 2025-12-22 diff --git a/README.md b/README.md new file mode 100644 index 0000000..c10c8fc --- /dev/null +++ b/README.md @@ -0,0 +1,148 @@ +# Homelab Documentation + +Documentation for Hutson's home infrastructure - two Proxmox servers running VMs and containers for home automation, media, development, and AI workloads. + +## 🚀 Quick Start + +**New to this homelab?** Start here: +1. [CLAUDE.md](CLAUDE.md) - Quick reference guide for common tasks +2. [SSH-ACCESS.md](SSH-ACCESS.md) - How to connect to all systems +3. [IP-ASSIGNMENTS.md](IP-ASSIGNMENTS.md) - What's at what IP address +4. [SERVICES.md](SERVICES.md) - What services are running + +**Claude Code Session?** Read [CLAUDE.md](CLAUDE.md) first - it's your command center. 
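
To work with these docs locally, clone the repo to the path referenced throughout this documentation (see Getting Help below); adjust the destination if your checkout lives elsewhere:

```bash
# Clone the documentation repo to the expected local path
git clone https://git.htsn.io/hutson/homelab-docs ~/Projects/homelab
cd ~/Projects/homelab
```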
+ +## 📚 Documentation Index + +### Infrastructure + +| Document | Description | +|----------|-------------| +| [VMS.md](VMS.md) | Complete VM/LXC inventory, specs, GPU passthrough | +| [HARDWARE.md](HARDWARE.md) | Server specs, GPUs, network cards, HBAs | +| [STORAGE.md](STORAGE.md) | ZFS pools, NFS/SMB shares, capacity planning | +| [NETWORK.md](NETWORK.md) | Bridges, VLANs, MTU config, Tailscale VPN | +| [POWER-MANAGEMENT.md](POWER-MANAGEMENT.md) | CPU governors, GPU power states, optimizations | +| [UPS.md](UPS.md) | UPS configuration, NUT monitoring, power failure handling | + +### Services & Applications + +| Document | Description | +|----------|-------------| +| [SERVICES.md](SERVICES.md) | Complete service inventory with URLs and credentials | +| [TRAEFIK.md](TRAEFIK.md) | Reverse proxy setup, adding services, SSL certificates | +| [HOMEASSISTANT.md](HOMEASSISTANT.md) | Home Assistant API, automations, integrations | +| [SYNCTHING.md](SYNCTHING.md) | File sync across all devices, API access, troubleshooting | +| [SALTBOX.md](#) | Media automation stack (Plex, *arr apps) (coming soon) | + +### Access & Security + +| Document | Description | +|----------|-------------| +| [SSH-ACCESS.md](SSH-ACCESS.md) | SSH keys, host aliases, password auth, QEMU agent | +| [IP-ASSIGNMENTS.md](IP-ASSIGNMENTS.md) | Complete IP address assignments for all devices | +| [SECURITY.md](#) | Firewall, access control, certificates (coming soon) | + +### Operations + +| Document | Description | +|----------|-------------| +| [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md) | 🚨 Backup strategy, disaster recovery (CRITICAL) | +| [MAINTENANCE.md](MAINTENANCE.md) | Regular procedures, update schedules, testing checklists | +| [MONITORING.md](MONITORING.md) | Health monitoring, alerts, dashboard recommendations | +| [DISASTER-RECOVERY.md](#) | Recovery procedures (coming soon) | + +### Reference + +| Document | Description | +|----------|-------------| +| [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md) | Storage enclosure SES commands, LCC troubleshooting | +| [SHELL-ALIASES.md](SHELL-ALIASES.md) | ZSH aliases for Claude Code sessions | + +## 🖥️ System Overview + +### Servers + +- **PVE** (10.10.10.120) - Primary Proxmox server + - AMD Threadripper PRO 3975WX (32-core) + - 128 GB RAM + - NVIDIA Quadro P2000 + TITAN RTX + +- **PVE2** (10.10.10.102) - Secondary Proxmox server + - AMD Threadripper PRO 3975WX (32-core) + - 128 GB RAM + - NVIDIA RTX A6000 + +### Key Services + +| Service | Location | URL | +|---------|----------|-----| +| **Proxmox** | PVE | https://pve.htsn.io | +| **TrueNAS** | VM 100 | https://truenas.htsn.io | +| **Plex** | Saltbox VM | https://plex.htsn.io | +| **Home Assistant** | VM 110 | https://homeassistant.htsn.io | +| **Gitea** | VM 300 | https://git.htsn.io | +| **Pi-hole** | CT 200 | http://10.10.10.10/admin | +| **Traefik** | CT 202 | http://10.10.10.250:8080 | + +[See IP-ASSIGNMENTS.md for complete list](IP-ASSIGNMENTS.md) + +## 🔥 Emergency Procedures + +### Power Failure +1. UPS provides ~15 min runtime at typical load +2. At 2 min remaining, NUT triggers graceful VM shutdown +3. When power returns, servers auto-boot and start VMs in order + +See [UPS.md](UPS.md) for details. 
+ +### Service Down + +```bash +# Quick health check (run from Mac Mini) +ssh pve 'qm list' # Check VMs on PVE +ssh pve2 'qm list' # Check VMs on PVE2 +ssh pve 'pct list' # Check containers + +# Syncthing status +curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \ + "http://127.0.0.1:8384/rest/system/connections" + +# Restart a VM +ssh pve 'qm stop VMID && qm start VMID' +``` + +See [CLAUDE.md](CLAUDE.md) for complete troubleshooting runbooks. + +## 📞 Getting Help + +**Claude Code Assistant**: Start a session in this directory - all context is available in CLAUDE.md + +**Key Contacts**: +- Homelab Owner: Hutson +- Git Repo: https://git.htsn.io/hutson/homelab-docs +- Local Path: `~/Projects/homelab` + +## 🔄 Recent Changes + +See [CHANGELOG.md](#) (coming soon) or the Changelog section in [CLAUDE.md](CLAUDE.md). + +## 📝 Contributing + +When updating docs: +1. Keep CLAUDE.md as quick reference only +2. Move detailed content to specialized docs +3. Update cross-references +4. Test all commands before committing +5. Add entries to changelog + +```bash +cd ~/Projects/homelab +git add -A +git commit -m "Update documentation: " +git push +``` + +--- + +**Last Updated**: 2025-12-22 diff --git a/SERVICES.md b/SERVICES.md new file mode 100644 index 0000000..b8dd970 --- /dev/null +++ b/SERVICES.md @@ -0,0 +1,591 @@ +# Services Inventory + +Complete inventory of all services running across the homelab infrastructure. + +## Overview + +| Category | Services | Location | Access | +|----------|----------|----------|--------| +| **Infrastructure** | Proxmox, TrueNAS, Pi-hole, Traefik | VMs/CTs | Web UI + SSH | +| **Media** | Plex, *arr apps, downloaders | Saltbox VM | Web UI | +| **Development** | Gitea, Docker services | VMs | Web UI | +| **Home Automation** | Home Assistant, Happy Coder | VMs | Web UI + API | +| **Monitoring** | UPS (NUT), Syncthing, Pulse | Various | API | + +**Total Services**: 25+ running services + +--- + +## Service URLs Quick Reference + +| Service | URL | Authentication | Purpose | +|---------|-----|----------------|---------| +| **Proxmox** | https://pve.htsn.io:8006 | Username + 2FA | VM management | +| **TrueNAS** | https://truenas.htsn.io | Username/password | NAS management | +| **Plex** | https://plex.htsn.io | Plex account | Media streaming | +| **Home Assistant** | https://homeassistant.htsn.io | Username/password | Home automation | +| **Gitea** | https://git.htsn.io | Username/password | Git repositories | +| **Excalidraw** | https://excalidraw.htsn.io | None (public) | Whiteboard | +| **Happy Coder** | https://happy.htsn.io | QR code auth | Remote Claude sessions | +| **Pi-hole** | http://10.10.10.10/admin | Password | DNS/ad blocking | +| **Traefik** | http://10.10.10.250:8080 | None (internal) | Reverse proxy dashboard | +| **Pulse** | https://pulse.htsn.io | Unknown | Monitoring dashboard | +| **Copyparty** | https://copyparty.htsn.io | Unknown | File sharing | +| **FindShyt** | https://findshyt.htsn.io | Unknown | Custom app | + +--- + +## Infrastructure Services + +### Proxmox VE (PVE & PVE2) + +**Purpose**: Virtualization platform, VM/CT host +**Location**: Physical servers (10.10.10.120, 10.10.10.102) +**Access**: https://pve.htsn.io:8006, SSH +**Version**: Unknown (check: `pveversion`) + +**Key Features**: +- Web-based management +- VM and LXC container support +- ZFS storage pools +- Clustering (2-node) +- API access + +**Common Operations**: +```bash +# List VMs +ssh pve 'qm list' + +# Create VM +ssh pve 'qm create VMID --name myvm ...' 
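
# Fuller create example (illustrative values - adjust VMID, memory, cores, and bridge)
ssh pve 'qm create VMID --name myvm --memory 2048 --cores 2 --net0 virtio,bridge=vmbr0 --scsihw virtio-scsi-pci'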
+ +# Backup VM +ssh pve 'vzdump VMID --dumpdir /var/lib/vz/dump' +``` + +**See**: [VMS.md](VMS.md) + +--- + +### TrueNAS SCALE (VM 100) + +**Purpose**: Central file storage, NFS/SMB shares +**Location**: VM on PVE (10.10.10.200) +**Access**: https://truenas.htsn.io, SSH +**Version**: TrueNAS SCALE (check version in UI) + +**Key Features**: +- ZFS storage management +- NFS exports +- SMB shares +- Syncthing hub +- Snapshot management + +**Storage Pools**: +- `vault`: Main data pool on EMC enclosure + +**Shares** (needs documentation): +- NFS exports for Saltbox media +- SMB shares for Windows access +- Syncthing sync folders + +**See**: [STORAGE.md](STORAGE.md) + +--- + +### Pi-hole (CT 200) + +**Purpose**: Network-wide DNS server and ad blocker +**Location**: LXC on PVE (10.10.10.10) +**Access**: http://10.10.10.10/admin +**Version**: Unknown + +**Configuration**: +- **Upstream DNS**: Cloudflare (1.1.1.1) +- **Blocklists**: Unknown count +- **Queries**: All network DNS traffic +- **DHCP**: Disabled (router handles DHCP) + +**Stats** (example): +```bash +ssh pihole 'pihole -c -e' # Stats +ssh pihole 'pihole status' # Status +``` + +**Common Tasks**: +- Update blocklists: `ssh pihole 'pihole -g'` +- Whitelist domain: `ssh pihole 'pihole -w example.com'` +- View logs: `ssh pihole 'pihole -t'` + +--- + +### Traefik (CT 202) + +**Purpose**: Reverse proxy for all public-facing services +**Location**: LXC on PVE (10.10.10.250) +**Access**: http://10.10.10.250:8080/dashboard/ +**Version**: Unknown (check: `traefik version`) + +**Managed Services**: +- All *.htsn.io domains (except Saltbox services) +- SSL/TLS certificates via Let's Encrypt +- HTTP → HTTPS redirects + +**See**: [TRAEFIK.md](TRAEFIK.md) for complete configuration + +--- + +## Media Services (Saltbox VM) + +All media services run in Docker on the Saltbox VM (10.10.10.100). + +### Plex Media Server + +**Purpose**: Media streaming platform +**URL**: https://plex.htsn.io +**Access**: Plex account + +**Features**: +- Hardware transcoding (TITAN RTX) +- Libraries: Movies, TV, Music +- Remote access enabled +- Managed by Saltbox + +**Media Storage**: +- Source: TrueNAS NFS mounts +- Location: `/mnt/unionfs/` + +**Common Tasks**: +```bash +# View Plex status +ssh saltbox 'docker logs -f plex' + +# Restart Plex +ssh saltbox 'docker restart plex' + +# Scan library +# (via Plex UI: Settings → Library → Scan) +``` + +--- + +### *arr Apps (Media Automation) + +Running on Saltbox VM, managed via Traefik-Saltbox. 
+ +| Service | Purpose | URL | Notes | +|---------|---------|-----|-------| +| **Sonarr** | TV show automation | sonarr.htsn.io | Monitors, downloads, organizes TV | +| **Radarr** | Movie automation | radarr.htsn.io | Monitors, downloads, organizes movies | +| **Lidarr** | Music automation | lidarr.htsn.io | Monitors, downloads, organizes music | +| **Overseerr** | Request management | overseerr.htsn.io | User requests for media | +| **Bazarr** | Subtitle management | bazarr.htsn.io | Downloads subtitles | + +**Downloaders**: +| Service | Purpose | URL | +|---------|---------|-----| +| **SABnzbd** | Usenet downloader | sabnzbd.htsn.io | +| **NZBGet** | Usenet downloader | nzbget.htsn.io | +| **qBittorrent** | Torrent client | qbittorrent.htsn.io | + +**Indexers**: +| Service | Purpose | URL | +|---------|---------|-----| +| **Jackett** | Torrent indexer proxy | jackett.htsn.io | +| **NZBHydra2** | Usenet indexer proxy | nzbhydra2.htsn.io | + +--- + +### Supporting Media Services + +| Service | Purpose | URL | +|---------|---------|-----| +| **Tautulli** | Plex statistics | tautulli.htsn.io | +| **Organizr** | Service dashboard | organizr.htsn.io | +| **Authelia** | SSO authentication | auth.htsn.io | + +--- + +## Development Services + +### Gitea (VM 300) + +**Purpose**: Self-hosted Git server +**Location**: VM on PVE2 (10.10.10.220) +**URL**: https://git.htsn.io +**Access**: Username/password + +**Repositories**: +- homelab-docs (this documentation) +- Personal projects +- Private repos + +**Common Tasks**: +```bash +# SSH to Gitea VM +ssh gitea-vm + +# View logs +ssh gitea-vm 'journalctl -u gitea -f' + +# Backup +ssh gitea-vm 'gitea dump -c /etc/gitea/app.ini' +``` + +**See**: Gitea documentation for API usage + +--- + +### Docker Services (docker-host VM) + +Running on VM 206 (10.10.10.206). 
+ +| Service | URL | Purpose | Port | +|---------|-----|---------|------| +| **Excalidraw** | https://excalidraw.htsn.io | Whiteboard/diagramming | 8080 | +| **Happy Server** | https://happy.htsn.io | Happy Coder relay | 3002 | +| **Pulse** | https://pulse.htsn.io | Monitoring dashboard | 7655 | + +**Docker Compose files**: `/opt/{excalidraw,happy-server,pulse}/docker-compose.yml` + +**Managing services**: +```bash +ssh docker-host 'docker ps' +ssh docker-host 'cd /opt/excalidraw && sudo docker-compose logs -f' +ssh docker-host 'cd /opt/excalidraw && sudo docker-compose restart' +``` + +--- + +## Home Automation + +### Home Assistant (VM 110) + +**Purpose**: Smart home automation platform +**Location**: VM on PVE (10.10.10.110) +**URL**: https://homeassistant.htsn.io +**Access**: Username/password + +**Integrations**: +- UPS monitoring (NUT sensors) +- Unknown other integrations (needs documentation) + +**Sensors**: +- `sensor.cyberpower_battery_charge` +- `sensor.cyberpower_load` +- `sensor.cyberpower_battery_runtime` +- `sensor.cyberpower_status` + +**See**: [HOMEASSISTANT.md](HOMEASSISTANT.md) + +--- + +### Happy Coder Relay (docker-host) + +**Purpose**: Self-hosted relay server for Happy Coder mobile app +**Location**: docker-host (10.10.10.206) +**URL**: https://happy.htsn.io +**Access**: QR code authentication + +**Stack**: +- Happy Server (Node.js) +- PostgreSQL (user/session data) +- Redis (real-time events) +- MinIO (file/image storage) + +**Clients**: +- Mac Mini (Happy daemon) +- Mobile app (iOS/Android) + +**Credentials**: +- Master Secret: `3ccbfd03a028d3c278da7d2cf36d99b94cd4b1fecabc49ab006e8e89bc7707ac` +- PostgreSQL: `happy` / `happypass` +- MinIO: `happyadmin` / `happyadmin123` + +--- + +## File Sync & Storage + +### Syncthing + +**Purpose**: File synchronization across all devices +**Devices**: +- Mac Mini (10.10.10.125) - Hub +- MacBook - Mobile sync +- TrueNAS (10.10.10.200) - Central storage +- Windows PC (10.10.10.150) - Windows sync +- Phone (10.10.10.54) - Mobile sync + +**API Keys**: +- Mac Mini: `oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5` +- MacBook: `qYkNdVLwy9qZZZ6MqnJr7tHX7KKdxGMJ` +- Phone: `Xxz3jDT4akUJe6psfwZsbZwG2LhfZuDM` + +**Synced Folders**: +- documents (~11 GB) +- downloads (~38 GB) +- pictures +- notes +- desktop (~7.2 GB) +- config +- movies + +**See**: [SYNCTHING.md](SYNCTHING.md) + +--- + +### Copyparty (VM 201) + +**Purpose**: Simple HTTP file sharing +**Location**: VM on PVE (10.10.10.201) +**URL**: https://copyparty.htsn.io +**Access**: Unknown + +**Features**: +- Web-based file upload/download +- Lightweight + +--- + +## Trading & AI Services + +### AI Trading Platform (trading-vm) + +**Purpose**: Algorithmic trading with AI models +**Location**: VM 301 on PVE2 (10.10.10.221) +**URL**: https://aitrade.htsn.io (if accessible) +**GPU**: RTX A6000 (48GB VRAM) + +**Components**: +- Trading algorithms +- AI models for market prediction +- Real-time data feeds +- Backtesting infrastructure + +**Access**: SSH only (no web UI documented) + +--- + +### LM Dev (lmdev1) + +**Purpose**: AI/LLM development environment +**Location**: VM 111 on PVE (10.10.10.111) +**URL**: https://lmdev.htsn.io (if accessible) +**GPU**: TITAN RTX (shared with Saltbox) + +**Installed**: +- CUDA toolkit +- Python 3.11+ +- PyTorch, TensorFlow +- Hugging Face transformers + +--- + +## Monitoring & Utilities + +### UPS Monitoring (NUT) + +**Purpose**: Monitor UPS status and trigger shutdowns +**Location**: PVE (master), PVE2 (slave) +**Access**: Command-line (`upsc`) + +**Key Commands**: 
+```bash +ssh pve 'upsc cyberpower@localhost' +ssh pve 'upsc cyberpower@localhost ups.load' +ssh pve 'upsc cyberpower@localhost battery.runtime' +``` + +**Home Assistant Integration**: UPS sensors exposed + +**See**: [UPS.md](UPS.md) + +--- + +### Pulse Monitoring + +**Purpose**: Unknown monitoring dashboard +**Location**: docker-host (10.10.10.206:7655) +**URL**: https://pulse.htsn.io +**Access**: Unknown + +**Needs documentation**: +- What does it monitor? +- How to configure? +- Authentication? + +--- + +### Tailscale VPN + +**Purpose**: Secure remote access to homelab +**Subnet Routers**: +- PVE (100.113.177.80) - Primary +- UCG-Fiber (100.94.246.32) - Failover + +**Devices on Tailscale**: +- Mac Mini: 100.108.89.58 +- PVE: 100.113.177.80 +- TrueNAS: 100.100.94.71 +- Pi-hole: 100.112.59.128 + +**See**: [NETWORK.md](NETWORK.md) + +--- + +## Custom Applications + +### FindShyt (CT 205) + +**Purpose**: Unknown custom application +**Location**: LXC on PVE (10.10.10.8) +**URL**: https://findshyt.htsn.io +**Access**: Unknown + +**Needs documentation**: +- What is this app? +- How to use it? +- Tech stack? + +--- + +## Service Dependencies + +### Critical Dependencies + +``` +TrueNAS +├── Plex (media files via NFS) +├── *arr apps (downloads via NFS) +├── Syncthing (central storage hub) +└── Backups (if configured) + +Traefik (CT 202) +├── All *.htsn.io services +└── SSL certificate management + +Pi-hole +└── DNS for entire network + +Router +└── Gateway for all services +``` + +### Startup Order + +**See [VMS.md](VMS.md)** for VM boot order configuration: +1. TrueNAS (storage first) +2. Saltbox (depends on TrueNAS NFS) +3. Other VMs +4. Containers + +--- + +## Service Port Reference + +### Well-Known Ports + +| Port | Service | Protocol | Purpose | +|------|---------|----------|---------| +| 22 | SSH | TCP | Remote access | +| 53 | Pi-hole | UDP | DNS queries | +| 80 | Traefik | TCP | HTTP (redirects to 443) | +| 443 | Traefik | TCP | HTTPS | +| 3000 | Gitea | TCP | Git HTTP/S | +| 8006 | Proxmox | TCP | Web UI | +| 8096 | Plex | TCP | Plex Media Server | +| 8384 | Syncthing | TCP | Web UI | +| 22000 | Syncthing | TCP | Sync protocol | + +### Internal Ports + +| Port | Service | Purpose | +|------|---------|---------| +| 3002 | Happy Server | Relay backend | +| 5432 | PostgreSQL | Happy Server DB | +| 6379 | Redis | Happy Server cache | +| 7655 | Pulse | Monitoring | +| 8080 | Excalidraw | Whiteboard | +| 8080 | Traefik | Dashboard | +| 9000 | MinIO | Object storage | + +--- + +## Service Health Checks + +### Quick Health Check Script + +```bash +#!/bin/bash +# Check all critical services + +echo "=== Infrastructure ===" +curl -Is https://pve.htsn.io:8006 | head -1 +curl -Is https://truenas.htsn.io | head -1 +curl -I http://10.10.10.10/admin 2>/dev/null | head -1 +echo "" + +echo "=== Media Services ===" +curl -Is https://plex.htsn.io | head -1 +curl -Is https://sonarr.htsn.io | head -1 +curl -Is https://radarr.htsn.io | head -1 +echo "" + +echo "=== Development ===" +curl -Is https://git.htsn.io | head -1 +curl -Is https://excalidraw.htsn.io | head -1 +echo "" + +echo "=== Home Automation ===" +curl -Is https://homeassistant.htsn.io | head -1 +curl -Is https://happy.htsn.io/health | head -1 +``` + +### Service-Specific Checks + +```bash +# Proxmox VMs +ssh pve 'qm list | grep running' + +# Docker services +ssh docker-host 'docker ps --format "{{.Names}}: {{.Status}}"' + +# Syncthing +curl -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \ + "http://127.0.0.1:8384/rest/system/status" + +# UPS 
+ssh pve 'upsc cyberpower@localhost ups.status' +``` + +--- + +## Service Credentials + +**Location**: See individual service documentation + +| Service | Credentials Location | Notes | +|---------|---------------------|-------| +| Proxmox | Proxmox UI | Username + 2FA | +| TrueNAS | TrueNAS UI | Root password | +| Plex | Plex account | Managed externally | +| Gitea | Gitea DB | Self-managed | +| Pi-hole | `/etc/pihole/setupVars.conf` | Admin password | +| Happy Server | [CLAUDE.md](CLAUDE.md) | Master secret, DB passwords | + +**⚠️ Security Note**: Never commit credentials to Git. Use proper secrets management. + +--- + +## Related Documentation + +- [VMS.md](VMS.md) - VM/service locations +- [TRAEFIK.md](TRAEFIK.md) - Reverse proxy config +- [IP-ASSIGNMENTS.md](IP-ASSIGNMENTS.md) - Service IP addresses +- [NETWORK.md](NETWORK.md) - Network configuration +- [MONITORING.md](MONITORING.md) - Monitoring setup (coming soon) + +--- + +**Last Updated**: 2025-12-22 +**Status**: ⚠️ Incomplete - many services need documentation (passwords, features, usage) diff --git a/SSH-ACCESS.md b/SSH-ACCESS.md new file mode 100644 index 0000000..36507f1 --- /dev/null +++ b/SSH-ACCESS.md @@ -0,0 +1,464 @@ +# SSH Access + +Documentation for SSH access to all homelab systems, including key authentication, password authentication for special cases, and QEMU guest agent usage. + +## Overview + +Most systems use **SSH key authentication** with the `~/.ssh/homelab` key. A few special cases require **password authentication** (router, Windows PC) due to platform limitations. + +**SSH Password**: `GrilledCh33s3#` (for systems without key auth) + +--- + +## SSH Key Authentication (Primary Method) + +### SSH Key Configuration + +SSH keys are configured in `~/.ssh/config` on both Mac Mini and MacBook. 
+ +**Key file**: `~/.ssh/homelab` (Ed25519 key) + +**Key deployed to**: All Proxmox hosts, VMs, and LXCs (13 total hosts) + +### Host Aliases + +Use these convenient aliases instead of IP addresses: + +| Host Alias | IP | User | Type | Notes | +|------------|-----|------|------|-------| +| `pve` | 10.10.10.120 | root | Proxmox | Primary server | +| `pve2` | 10.10.10.102 | root | Proxmox | Secondary server | +| `truenas` | 10.10.10.200 | root | VM | NAS/storage | +| `saltbox` | 10.10.10.100 | hutson | VM | Media automation | +| `lmdev1` | 10.10.10.111 | hutson | VM | AI/LLM development | +| `docker-host` | 10.10.10.206 | hutson | VM | Docker services | +| `fs-dev` | 10.10.10.5 | hutson | VM | Development | +| `copyparty` | 10.10.10.201 | hutson | VM | File sharing | +| `gitea-vm` | 10.10.10.220 | hutson | VM | Git server | +| `trading-vm` | 10.10.10.221 | hutson | VM | AI trading platform | +| `pihole` | 10.10.10.10 | root | LXC | DNS/Ad blocking | +| `traefik` | 10.10.10.250 | root | LXC | Reverse proxy | +| `findshyt` | 10.10.10.8 | root | LXC | Custom app | + +### Usage Examples + +```bash +# List VMs on PVE +ssh pve 'qm list' + +# Check ZFS pool on TrueNAS +ssh truenas 'zpool status vault' + +# List Docker containers on Saltbox +ssh saltbox 'docker ps' + +# Check Pi-hole status +ssh pihole 'pihole status' + +# View Traefik config +ssh pve 'pct exec 202 -- cat /etc/traefik/traefik.yaml' +``` + +### SSH Config File + +**Location**: `~/.ssh/config` + +**Example entries**: + +```sshconfig +# Proxmox Servers +Host pve + HostName 10.10.10.120 + User root + IdentityFile ~/.ssh/homelab + +Host pve2 + HostName 10.10.10.102 + User root + IdentityFile ~/.ssh/homelab + # Post-quantum KEX causes MTU issues - use classic + KexAlgorithms curve25519-sha256 + +# VMs +Host truenas + HostName 10.10.10.200 + User root + IdentityFile ~/.ssh/homelab + +Host saltbox + HostName 10.10.10.100 + User hutson + IdentityFile ~/.ssh/homelab + +Host lmdev1 + HostName 10.10.10.111 + User hutson + IdentityFile ~/.ssh/homelab + +Host docker-host + HostName 10.10.10.206 + User hutson + IdentityFile ~/.ssh/homelab + +Host fs-dev + HostName 10.10.10.5 + User hutson + IdentityFile ~/.ssh/homelab + +Host copyparty + HostName 10.10.10.201 + User hutson + IdentityFile ~/.ssh/homelab + +Host gitea-vm + HostName 10.10.10.220 + User hutson + IdentityFile ~/.ssh/homelab + +Host trading-vm + HostName 10.10.10.221 + User hutson + IdentityFile ~/.ssh/homelab + +# LXC Containers +Host pihole + HostName 10.10.10.10 + User root + IdentityFile ~/.ssh/homelab + +Host traefik + HostName 10.10.10.250 + User root + IdentityFile ~/.ssh/homelab + +Host findshyt + HostName 10.10.10.8 + User root + IdentityFile ~/.ssh/homelab +``` + +--- + +## Password Authentication (Special Cases) + +Some systems don't support SSH key auth or have other limitations. 
+ +### UniFi Router (10.10.10.1) + +**Issue**: Uses `keyboard-interactive` auth method, incompatible with `sshpass` +**Solution**: Use `expect` to automate password entry + +**Commands**: + +```bash +# Run command on router +expect -c 'spawn ssh root@10.10.10.1 "hostname"; expect "Password:"; send "GrilledCh33s3#\r"; expect eof' + +# Get ARP table (all device IPs) +expect -c 'spawn ssh root@10.10.10.1 "cat /proc/net/arp"; expect "Password:"; send "GrilledCh33s3#\r"; expect eof' + +# Check Tailscale status +expect -c 'spawn ssh root@10.10.10.1 "tailscale status"; expect "Password:"; send "GrilledCh33s3#\r"; expect eof' +``` + +**Why not key auth?**: UniFi router firmware doesn't persist SSH keys across reboots. + +### Windows PC (10.10.10.150) + +**OS**: Windows with OpenSSH server +**User**: `claude` +**Password**: `GrilledCh33s3#` +**Shell**: PowerShell (not bash) + +**Commands**: + +```bash +# Run PowerShell command +sshpass -p 'GrilledCh33s3#' ssh claude@10.10.10.150 'Get-Process | Select -First 5' + +# Check Syncthing status +sshpass -p 'GrilledCh33s3#' ssh claude@10.10.10.150 'Get-Process -Name syncthing -ErrorAction SilentlyContinue' + +# Restart Syncthing +sshpass -p 'GrilledCh33s3#' ssh claude@10.10.10.150 'Stop-Process -Name syncthing -Force; Start-ScheduledTask -TaskName "Syncthing"' +``` + +**⚠️ Important**: Use `;` (semicolon) to chain PowerShell commands, NOT `&&` (bash syntax). + +**Why not key auth?**: Could be configured, but password auth works and is simpler for Windows. + +--- + +## QEMU Guest Agent + +Most VMs have the QEMU guest agent installed, allowing command execution without SSH. + +### VMs with QEMU Agent + +| VMID | VM Name | Use Case | +|------|---------|----------| +| 100 | truenas | Execute commands, check ZFS | +| 101 | saltbox | Execute commands, Docker mgmt | +| 105 | fs-dev | Execute commands | +| 111 | lmdev1 | Execute commands | +| 201 | copyparty | Execute commands | +| 206 | docker-host | Execute commands | +| 300 | gitea-vm | Execute commands | +| 301 | trading-vm | Execute commands | + +### VM WITHOUT QEMU Agent + +**VMID 110 (homeassistant)**: No QEMU agent installed +- Access via web UI only +- Or install SSH server manually if needed + +### Usage Examples + +**Basic syntax**: +```bash +ssh pve 'qm guest exec VMID -- bash -c "COMMAND"' +``` + +**Examples**: + +```bash +# Check ZFS pool on TrueNAS (without SSH) +ssh pve 'qm guest exec 100 -- bash -c "zpool status vault"' + +# Get VM IP addresses +ssh pve 'qm guest exec 100 -- bash -c "ip addr"' + +# Check Docker containers on Saltbox +ssh pve 'qm guest exec 101 -- bash -c "docker ps"' + +# Run multi-line command +ssh pve 'qm guest exec 100 -- bash -c "df -h; free -h; uptime"' +``` + +**When to use QEMU agent vs SSH**: +- ✅ Use **SSH** for interactive sessions, file editing, complex tasks +- ✅ Use **QEMU agent** for one-off commands, when SSH is down, or VM has no network +- ⚠️ QEMU agent is slower for multiple commands (use SSH instead) + +--- + +## Troubleshooting SSH Issues + +### Connection Refused + +```bash +# Check if SSH service is running +ssh pve 'systemctl status sshd' + +# Check if port 22 is open +nc -zv 10.10.10.XXX 22 + +# Check firewall +ssh pve 'iptables -L -n | grep 22' +``` + +### Permission Denied (Public Key) + +```bash +# Verify key file exists +ls -la ~/.ssh/homelab + +# Check key permissions (should be 600) +chmod 600 ~/.ssh/homelab + +# Test SSH key auth verbosely +ssh -vvv -i ~/.ssh/homelab root@10.10.10.120 + +# Check authorized_keys on remote (via QEMU agent if SSH 
broken) +ssh pve 'qm guest exec VMID -- bash -c "cat ~/.ssh/authorized_keys"' +``` + +### Slow SSH Connection (PVE2 Issue) + +**Problem**: SSH to PVE2 hangs for 30+ seconds before connecting +**Cause**: MTU mismatch (vmbr0=9000, nic1=1500) causing post-quantum KEX packet fragmentation +**Fix**: Use classic KEX algorithm instead + +**In `~/.ssh/config`**: +```sshconfig +Host pve2 + HostName 10.10.10.102 + User root + IdentityFile ~/.ssh/homelab + KexAlgorithms curve25519-sha256 # Avoid mlkem768x25519-sha256 +``` + +**Permanent fix**: Set `nic1` MTU to 9000 in `/etc/network/interfaces` on PVE2 + +--- + +## Adding SSH Keys to New Systems + +### Linux (VMs/LXCs) + +```bash +# Copy public key to new host +ssh-copy-id -i ~/.ssh/homelab user@hostname + +# Or manually: +ssh user@hostname 'mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys' < ~/.ssh/homelab.pub +ssh user@hostname 'chmod 700 ~/.ssh && chmod 600 ~/.ssh/authorized_keys' +``` + +### LXC Containers (Root User) + +```bash +# Via pct exec from Proxmox host +ssh pve 'pct exec CTID -- bash -c "mkdir -p /root/.ssh"' +ssh pve 'pct exec CTID -- bash -c "echo \"$(cat ~/.ssh/homelab.pub)\" >> /root/.ssh/authorized_keys"' +ssh pve 'pct exec CTID -- bash -c "chmod 700 /root/.ssh && chmod 600 /root/.ssh/authorized_keys"' + +# Also enable PermitRootLogin in sshd_config +ssh pve 'pct exec CTID -- bash -c "sed -i \"s/^#*PermitRootLogin.*/PermitRootLogin prohibit-password/\" /etc/ssh/sshd_config"' +ssh pve 'pct exec CTID -- bash -c "systemctl restart sshd"' +``` + +### VMs (via QEMU Agent) + +```bash +# Add key via QEMU agent (if SSH not working) +ssh pve 'qm guest exec VMID -- bash -c "mkdir -p ~/.ssh"' +ssh pve 'qm guest exec VMID -- bash -c "echo \"$(cat ~/.ssh/homelab.pub)\" >> ~/.ssh/authorized_keys"' +ssh pve 'qm guest exec VMID -- bash -c "chmod 700 ~/.ssh && chmod 600 ~/.ssh/authorized_keys"' +``` + +--- + +## SSH Key Management + +### Rotate SSH Keys (Future) + +When rotating SSH keys: + +1. Generate new key pair: + ```bash + ssh-keygen -t ed25519 -f ~/.ssh/homelab-new -C "homelab-new" + ``` + +2. Deploy new key to all hosts (keep old key for now): + ```bash + for host in pve pve2 truenas saltbox lmdev1 docker-host fs-dev copyparty gitea-vm trading-vm pihole traefik findshyt; do + ssh-copy-id -i ~/.ssh/homelab-new $host + done + ``` + +3. Update `~/.ssh/config` to use new key: + ```sshconfig + IdentityFile ~/.ssh/homelab-new + ``` + +4. Test all connections: + ```bash + for host in pve pve2 truenas saltbox lmdev1 docker-host fs-dev copyparty gitea-vm trading-vm pihole traefik findshyt; do + echo "Testing $host..." + ssh $host 'hostname' + done + ``` + +5. 
Remove old key from all hosts once confirmed working + +--- + +## Quick Reference + +### Common SSH Operations + +```bash +# Execute command on remote host +ssh host 'command' + +# Execute multiple commands +ssh host 'command1 && command2' + +# Copy file to remote +scp file host:/path/ + +# Copy file from remote +scp host:/path/file ./ + +# Execute command on Proxmox VM (via QEMU agent) +ssh pve 'qm guest exec VMID -- bash -c "command"' + +# Execute command on LXC +ssh pve 'pct exec CTID -- command' + +# Interactive shell +ssh host + +# SSH with X11 forwarding +ssh -X host +``` + +### Troubleshooting Commands + +```bash +# Test SSH with verbose output +ssh -vvv host + +# Check SSH service status (remote) +ssh host 'systemctl status sshd' + +# Check SSH config (local) +ssh -G host + +# Test port connectivity +nc -zv hostname 22 +``` + +--- + +## Security Best Practices + +### Current Security Posture + +✅ **Good**: +- SSH keys used instead of passwords (where possible) +- Keys use Ed25519 (modern, secure algorithm) +- Root login disabled on VMs (use sudo instead) +- SSH keys have proper permissions (600) + +⚠️ **Could Improve**: +- [ ] Disable password authentication on all hosts (force key-only) +- [ ] Use SSH certificate authority instead of individual keys +- [ ] Set up SSH bastion host (jump server) +- [ ] Enable 2FA for SSH (via PAM + Google Authenticator) +- [ ] Implement SSH key rotation policy (annually) + +### Hardening SSH (Future) + +For additional security, consider: + +```sshconfig +# /etc/ssh/sshd_config (on remote hosts) +PermitRootLogin prohibit-password # No root password login +PasswordAuthentication no # Disable password auth entirely +PubkeyAuthentication yes # Only allow key auth +AuthorizedKeysFile .ssh/authorized_keys +MaxAuthTries 3 # Limit auth attempts +MaxSessions 10 # Limit concurrent sessions +ClientAliveInterval 300 # Timeout idle sessions +ClientAliveCountMax 2 # Drop after 2 keepalives +``` + +**Apply after editing**: +```bash +systemctl restart sshd +``` + +--- + +## Related Documentation + +- [VMS.md](VMS.md) - Complete VM/CT inventory +- [NETWORK.md](NETWORK.md) - Network configuration +- [IP-ASSIGNMENTS.md](IP-ASSIGNMENTS.md) - IP addresses for all hosts +- [SECURITY.md](#) - Security policies (coming soon) + +--- + +**Last Updated**: 2025-12-22 diff --git a/STORAGE.md b/STORAGE.md new file mode 100644 index 0000000..3f3b702 --- /dev/null +++ b/STORAGE.md @@ -0,0 +1,510 @@ +# Storage Architecture + +Documentation of all storage pools, datasets, shares, and capacity planning across the homelab. 
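
For a quick at-a-glance check before digging into the details below, the pools on all three systems can be listed in one pass (assumes the SSH aliases from [SSH-ACCESS.md](SSH-ACCESS.md)):

```bash
# Pool health and capacity across PVE, PVE2, and TrueNAS
for host in pve pve2 truenas; do
  echo "=== $host ==="
  ssh $host 'zpool list'
done
```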
+ +## Overview + +### Storage Distribution + +| Location | Type | Capacity | Purpose | +|----------|------|----------|---------| +| **PVE** | NVMe + SSD mirrors | ~9 TB usable | VM storage, fast IO | +| **PVE2** | NVMe + HDD mirrors | ~6+ TB usable | VM storage, bulk data | +| **TrueNAS** | ZFS pool + EMC enclosure | ~12+ TB usable | Central file storage, NFS/SMB | + +--- + +## PVE (10.10.10.120) Storage Pools + +### nvme-mirror1 (Primary Fast Storage) +- **Type**: ZFS mirror +- **Devices**: 2x Sabrent Rocket Q NVMe +- **Capacity**: 3.6 TB usable +- **Purpose**: High-performance VM storage +- **Used By**: + - Critical VMs requiring fast IO + - Database workloads + - Development environments + +**Check status**: +```bash +ssh pve 'zpool status nvme-mirror1' +ssh pve 'zpool list nvme-mirror1' +``` + +### nvme-mirror2 (Secondary Fast Storage) +- **Type**: ZFS mirror +- **Devices**: 2x Kingston SFYRD 2TB NVMe +- **Capacity**: 1.8 TB usable +- **Purpose**: Additional fast VM storage +- **Used By**: TBD + +**Check status**: +```bash +ssh pve 'zpool status nvme-mirror2' +ssh pve 'zpool list nvme-mirror2' +``` + +### rpool (Root Pool) +- **Type**: ZFS mirror +- **Devices**: 2x Samsung 870 QVO 4TB SSD +- **Capacity**: 3.6 TB usable +- **Purpose**: Proxmox OS, container storage, VM backups +- **Used By**: + - Proxmox root filesystem + - LXC containers + - Local VM backups + +**Check status**: +```bash +ssh pve 'zpool status rpool' +ssh pve 'df -h /var/lib/vz' +``` + +### Storage Pool Usage Summary (PVE) + +**Get current usage**: +```bash +ssh pve 'zpool list' +ssh pve 'pvesm status' +``` + +--- + +## PVE2 (10.10.10.102) Storage Pools + +### nvme-mirror3 (Fast Storage) +- **Type**: ZFS mirror +- **Devices**: 2x NVMe (model unknown) +- **Capacity**: Unknown (needs investigation) +- **Purpose**: High-performance VM storage +- **Used By**: Trading VM (301), other VMs + +**Check status**: +```bash +ssh pve2 'zpool status nvme-mirror3' +ssh pve2 'zpool list nvme-mirror3' +``` + +### local-zfs2 (Bulk Storage) +- **Type**: ZFS mirror +- **Devices**: 2x WD Red 6TB HDD +- **Capacity**: ~6 TB usable +- **Purpose**: Bulk/archival storage +- **Power Management**: 30-minute spindown configured + - Saves ~10-16W when idle + - Udev rule: `/etc/udev/rules.d/69-hdd-spindown.rules` + - Command: `hdparm -S 241` (30 min) + +**Notes**: +- Pool had only 768 KB used as of 2024-12-16 +- Drives configured to spin down after 30 min idle +- Good for archival, NOT for active workloads + +**Check status**: +```bash +ssh pve2 'zpool status local-zfs2' +ssh pve2 'zpool list local-zfs2' + +# Check if drives are spun down +ssh pve2 'hdparm -C /dev/sdX' # Shows active/standby +``` + +--- + +## TrueNAS (VM 100 @ 10.10.10.200) - Central Storage + +### ZFS Pool: vault + +**Primary storage pool** for all shared data. 
+ +**Devices**: ❓ Needs investigation +- EMC storage enclosure with multiple drives +- SAS connection via LSI SAS2308 HBA (passed through to VM) + +**Capacity**: ❓ Needs investigation + +**Check pool status**: +```bash +ssh truenas 'zpool status vault' +ssh truenas 'zpool list vault' + +# Get detailed capacity +ssh truenas 'zfs list -o name,used,avail,refer,mountpoint' +``` + +### Datasets (Known) + +Based on Syncthing configuration, likely datasets: + +| Dataset | Purpose | Synced Devices | Notes | +|---------|---------|----------------|-------| +| vault/documents | Personal documents | Mac Mini, MacBook, Windows PC, Phone | ~11 GB | +| vault/downloads | Downloads folder | Mac Mini, TrueNAS | ~38 GB | +| vault/pictures | Photos | Mac Mini, MacBook, Phone | Unknown size | +| vault/notes | Note files | Mac Mini, MacBook, Phone | Unknown size | +| vault/desktop | Desktop sync | Unknown | 7.2 GB | +| vault/movies | Movie library | Unknown | Unknown size | +| vault/config | Config files | Mac Mini, MacBook | Unknown size | + +**Get complete dataset list**: +```bash +ssh truenas 'zfs list -r vault' +``` + +### NFS/SMB Shares + +**Status**: ❓ Not documented + +**Needs investigation**: +```bash +# List NFS exports +ssh truenas 'showmount -e localhost' + +# List SMB shares +ssh truenas 'smbclient -L localhost -N' + +# Via TrueNAS API/UI +# Sharing → Unix Shares (NFS) +# Sharing → Windows Shares (SMB) +``` + +**Expected shares**: +- Media libraries for Plex (on Saltbox VM) +- Document storage +- VM backups? +- ISO storage? + +### EMC Storage Enclosure + +**Model**: EMC KTN-STL4 (or similar) +**Connection**: SAS via LSI SAS2308 HBA (passthrough to TrueNAS VM) +**Drives**: ❓ Unknown count and capacity + +**See [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md)** for: +- SES commands +- Fan control +- LCC (Link Control Card) troubleshooting +- Maintenance procedures + +**Check enclosure status**: +```bash +ssh truenas 'sg_ses --page=0x02 /dev/sgX' # Element descriptor +ssh truenas 'smartctl --scan' # List all drives +``` + +--- + +## Storage Network Architecture + +### Internal Storage Network (10.10.10.20.0/24) + +**Purpose**: Dedicated network for NFS/iSCSI traffic to reduce congestion on main network. 
+ +**Bridge**: vmbr3 on PVE (virtual bridge, no physical NIC) +**Subnet**: 10.10.10.20.0/24 +**DHCP**: No +**Gateway**: No (internal only, no internet) + +**Connected VMs**: +- TrueNAS VM (secondary NIC) +- Saltbox VM (secondary NIC) - for NFS mounts +- Other VMs needing storage access + +**Configuration**: +```bash +# On TrueNAS VM - check second NIC +ssh truenas 'ip addr show enp6s19' + +# On Saltbox - check NFS mounts +ssh saltbox 'mount | grep nfs' +``` + +**Benefits**: +- Separates storage traffic from general network +- Prevents NFS/SMB from saturating main network +- Better performance for storage-heavy workloads + +--- + +## Storage Capacity Planning + +### Current Usage (Estimate) + +**Needs actual audit**: +```bash +# PVE pools +ssh pve 'zpool list -o name,size,alloc,free' + +# PVE2 pools +ssh pve2 'zpool list -o name,size,alloc,free' + +# TrueNAS vault pool +ssh truenas 'zpool list vault' + +# Get detailed breakdown +ssh truenas 'zfs list -r vault -o name,used,avail' +``` + +### Growth Rate + +**Needs tracking** - recommend monthly snapshots of capacity: + +```bash +#!/bin/bash +# Save as ~/bin/storage-capacity-report.sh + +DATE=$(date +%Y-%m-%d) +REPORT=~/Backups/storage-reports/capacity-$DATE.txt + +mkdir -p ~/Backups/storage-reports + +echo "Storage Capacity Report - $DATE" > $REPORT +echo "================================" >> $REPORT +echo "" >> $REPORT + +echo "PVE Pools:" >> $REPORT +ssh pve 'zpool list' >> $REPORT +echo "" >> $REPORT + +echo "PVE2 Pools:" >> $REPORT +ssh pve2 'zpool list' >> $REPORT +echo "" >> $REPORT + +echo "TrueNAS Pools:" >> $REPORT +ssh truenas 'zpool list' >> $REPORT +echo "" >> $REPORT + +echo "TrueNAS Datasets:" >> $REPORT +ssh truenas 'zfs list -r vault -o name,used,avail' >> $REPORT + +echo "Report saved to $REPORT" +``` + +**Run monthly via cron**: +```cron +0 9 1 * * ~/bin/storage-capacity-report.sh +``` + +### Expansion Planning + +**When to expand**: +- Pool reaches 80% capacity +- Performance degrades +- New workloads require more space + +**Expansion options**: +1. Add drives to existing pools (if mirrors, add mirror vdev) +2. Add new NVMe drives to PVE/PVE2 +3. Expand EMC enclosure (add more drives) +4. Add second EMC enclosure + +**Cost estimates**: TBD + +--- + +## ZFS Health Monitoring + +### Daily Health Checks + +```bash +# Check for errors on all pools +ssh pve 'zpool status -x' # Shows only unhealthy pools +ssh pve2 'zpool status -x' +ssh truenas 'zpool status -x' + +# Check scrub status +ssh pve 'zpool status | grep scrub' +ssh pve2 'zpool status | grep scrub' +ssh truenas 'zpool status | grep scrub' +``` + +### Scrub Schedule + +**Recommended**: Monthly scrub on all pools + +**Configure scrub**: +```bash +# Via Proxmox UI: Node → Disks → ZFS → Select pool → Scrub +# Or via cron: +0 2 1 * * /sbin/zpool scrub nvme-mirror1 +0 2 1 * * /sbin/zpool scrub rpool +``` + +**On TrueNAS**: +- Configure via UI: Storage → Pools → Scrub Tasks +- Recommended: 1st of every month at 2 AM + +### SMART Monitoring + +**Check drive health**: +```bash +# PVE +ssh pve 'smartctl -a /dev/nvme0' +ssh pve 'smartctl -a /dev/sda' + +# TrueNAS +ssh truenas 'smartctl --scan' +ssh truenas 'smartctl -a /dev/sdX' # For each drive +``` + +**Configure SMART tests**: +- TrueNAS UI: Tasks → S.M.A.R.T. 
Tests +- Recommended: Weekly short test, monthly long test + +### Alerts + +**Set up email alerts for**: +- ZFS pool errors +- SMART test failures +- Pool capacity > 80% +- Scrub failures + +--- + +## Storage Performance Tuning + +### ZFS ARC (Cache) + +**Check ARC usage**: +```bash +ssh pve 'arc_summary' +ssh truenas 'arc_summary' +``` + +**Tuning** (if needed): +- PVE/PVE2: Set max ARC in `/etc/modprobe.d/zfs.conf` +- TrueNAS: Configure via UI (System → Advanced → Tunables) + +### NFS Performance + +**Mount options** (on clients like Saltbox): +``` +rsize=131072,wsize=131072,hard,timeo=600,retrans=2,vers=3 +``` + +**Verify NFS mounts**: +```bash +ssh saltbox 'mount | grep nfs' +``` + +### Record Size Optimization + +**Different workloads need different record sizes**: +- VMs: 64K (default, good for VMs) +- Databases: 8K or 16K +- Media files: 1M (large sequential reads) + +**Set record size** (on TrueNAS datasets): +```bash +ssh truenas 'zfs set recordsize=1M vault/movies' +``` + +--- + +## Disaster Recovery + +### Pool Recovery + +**If a pool fails to import**: +```bash +# Try importing with different name +zpool import -f -N poolname newpoolname + +# Check pool with readonly +zpool import -f -o readonly=on poolname + +# Force import (last resort) +zpool import -f -F poolname +``` + +### Drive Replacement + +**When a drive fails**: +```bash +# Identify failed drive +zpool status poolname + +# Replace drive +zpool replace poolname old-device new-device + +# Monitor resilver +watch zpool status poolname +``` + +### Data Recovery + +**If pool is completely lost**: +1. Restore from offsite backup (see [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md)) +2. Recreate pool structure +3. Restore data + +**Critical**: This is why we need offsite backups! + +--- + +## Quick Reference + +### Common Commands + +```bash +# Pool status +zpool status [poolname] +zpool list + +# Dataset usage +zfs list +zfs list -r vault + +# Check pool health (only unhealthy) +zpool status -x + +# Scrub pool +zpool scrub poolname + +# Get pool IO stats +zpool iostat -v 1 + +# Snapshot management +zfs snapshot poolname/dataset@snapname +zfs list -t snapshot +zfs rollback poolname/dataset@snapname +zfs destroy poolname/dataset@snapname +``` + +### Storage Locations by Use Case + +| Use Case | Recommended Storage | Why | +|----------|---------------------|-----| +| VM OS disk | nvme-mirror1 (PVE) | Fastest IO | +| Database | nvme-mirror1/2 | Low latency | +| Media files | TrueNAS vault | Large capacity | +| Development | nvme-mirror2 | Fast, mid-tier | +| Containers | rpool | Good performance | +| Backups | TrueNAS or rpool | Large capacity | +| Archive | local-zfs2 (PVE2) | Cheap, can spin down | + +--- + +## Investigation Needed + +- [ ] Get complete TrueNAS dataset list +- [ ] Document NFS/SMB share configuration +- [ ] Inventory EMC enclosure drives (count, capacity, model) +- [ ] Document current pool usage percentages +- [ ] Set up monthly capacity reports +- [ ] Configure ZFS scrub schedules +- [ ] Set up storage health alerts + +--- + +## Related Documentation + +- [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md) - Backup and snapshot strategy +- [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md) - Storage enclosure maintenance +- [VMS.md](VMS.md) - VM storage assignments +- [NETWORK.md](NETWORK.md) - Storage network configuration + +--- + +**Last Updated**: 2025-12-22 diff --git a/TRAEFIK.md b/TRAEFIK.md new file mode 100644 index 0000000..77af1d9 --- /dev/null +++ b/TRAEFIK.md @@ -0,0 +1,672 @@ +# Traefik Reverse Proxy + +Documentation for 
Traefik reverse proxy setup, SSL certificates, and deploying new public services. + +## Overview + +There are **TWO separate Traefik instances** handling different services. Understanding which one to use is critical. + +| Instance | Location | IP | Purpose | Managed By | +|----------|----------|-----|---------|------------| +| **Traefik-Primary** | CT 202 | **10.10.10.250** | General services | Manual config files | +| **Traefik-Saltbox** | VM 101 (Docker) | **10.10.10.100** | Saltbox services only | Saltbox Ansible | + +--- + +## ⚠️ CRITICAL RULE: Which Traefik to Use + +### When Adding ANY New Service: + +✅ **USE Traefik-Primary (CT 202 @ 10.10.10.250)** - For ALL new services +❌ **DO NOT touch Traefik-Saltbox** - Unless you're modifying Saltbox itself + +### Why This Matters: + +- **Traefik-Saltbox** has complex Saltbox-managed configs (Ansible-generated) +- Messing with it breaks Plex, Sonarr, Radarr, and all media services +- Each Traefik has its own Let's Encrypt certificates +- Mixing them causes certificate conflicts and routing issues + +--- + +## Traefik-Primary (CT 202) - For New Services + +### Configuration + +**Location**: Container 202 on PVE (10.10.10.250) +**Config Directory**: `/etc/traefik/` +**Main Config**: `/etc/traefik/traefik.yaml` +**Dynamic Configs**: `/etc/traefik/conf.d/*.yaml` + +### Access Traefik Config + +```bash +# From Mac Mini: +ssh pve 'pct exec 202 -- cat /etc/traefik/traefik.yaml' +ssh pve 'pct exec 202 -- ls /etc/traefik/conf.d/' + +# Edit a service config: +ssh pve 'pct exec 202 -- vi /etc/traefik/conf.d/myservice.yaml' + +# View logs: +ssh pve 'pct exec 202 -- tail -f /var/log/traefik/traefik.log' +``` + +### Services Using Traefik-Primary + +| Service | Domain | Backend | +|---------|--------|---------| +| Excalidraw | excalidraw.htsn.io | 10.10.10.206:8080 (docker-host) | +| FindShyt | findshyt.htsn.io | 10.10.10.205 (CT 205) | +| Gitea | git.htsn.io | 10.10.10.220:3000 | +| Home Assistant | homeassistant.htsn.io | 10.10.10.110 | +| LM Dev | lmdev.htsn.io | 10.10.10.111 | +| Pi-hole | pihole.htsn.io | 10.10.10.200 | +| TrueNAS | truenas.htsn.io | 10.10.10.200 | +| Proxmox | pve.htsn.io | 10.10.10.120 | +| Copyparty | copyparty.htsn.io | 10.10.10.201 | +| AI Trade | aitrade.htsn.io | (trading server) | +| Pulse | pulse.htsn.io | 10.10.10.206:7655 (monitoring) | +| Happy | happy.htsn.io | 10.10.10.206:3002 (Happy Coder relay) | + +--- + +## Traefik-Saltbox (VM 101) - DO NOT MODIFY + +### Configuration + +**Location**: `/opt/traefik/` inside Saltbox VM +**Managed By**: Saltbox Ansible playbooks (automatic) +**Docker Mount**: `/opt/traefik` → `/etc/traefik` in container + +### Services Using Traefik-Saltbox + +- Plex (plex.htsn.io) +- Sonarr, Radarr, Lidarr +- SABnzbd, NZBGet, qBittorrent +- Overseerr, Tautulli, Organizr +- Jackett, NZBHydra2 +- Authelia (SSO authentication) +- All other Saltbox-managed containers + +### View Saltbox Traefik (Read-Only) + +```bash +# View config (don't edit!) +ssh pve 'qm guest exec 101 -- bash -c "docker exec traefik cat /etc/traefik/traefik.yml"' + +# View logs +ssh saltbox 'docker logs -f traefik' +``` + +**⚠️ WARNING**: Editing Saltbox Traefik configs manually will be overwritten by Ansible and may break media services. + +--- + +## Adding a New Public Service - Complete Workflow + +Follow these steps to deploy a new service and make it accessible at `servicename.htsn.io`. + +### Step 0: Deploy Your Service + +First, deploy your service on the appropriate host. 
+ +#### Option A: Docker on docker-host (10.10.10.206) + +```bash +ssh hutson@10.10.10.206 +sudo mkdir -p /opt/myservice +cat > /opt/myservice/docker-compose.yml << 'EOF' +version: "3.8" +services: + myservice: + image: myimage:latest + ports: + - "8080:80" + restart: unless-stopped +EOF +cd /opt/myservice && sudo docker-compose up -d +``` + +#### Option B: New LXC Container on PVE + +```bash +ssh pve 'pct create CTID local:vztmpl/ubuntu-22.04-standard_22.04-1_amd64.tar.zst \ + --hostname myservice --memory 2048 --cores 2 \ + --net0 name=eth0,bridge=vmbr0,ip=10.10.10.XXX/24,gw=10.10.10.1 \ + --rootfs local-zfs:8 --unprivileged 1 --start 1' +``` + +#### Option C: New VM on PVE + +```bash +ssh pve 'qm create VMID --name myservice --memory 2048 --cores 2 \ + --net0 virtio,bridge=vmbr0 --scsihw virtio-scsi-pci' +``` + +### Step 1: Create Traefik Config File + +Use this template for new services on **Traefik-Primary (CT 202)**: + +#### Basic Template + +```yaml +# /etc/traefik/conf.d/myservice.yaml +http: + routers: + # HTTPS router + myservice-secure: + entryPoints: + - websecure + rule: "Host(`myservice.htsn.io`)" + service: myservice + tls: + certResolver: cloudflare # Use 'cloudflare' for proxied domains, 'letsencrypt' for DNS-only + priority: 50 + + # HTTP → HTTPS redirect + myservice-redirect: + entryPoints: + - web + rule: "Host(`myservice.htsn.io`)" + middlewares: + - myservice-https-redirect + service: myservice + priority: 50 + + services: + myservice: + loadBalancer: + servers: + - url: "http://10.10.10.XXX:PORT" + + middlewares: + myservice-https-redirect: + redirectScheme: + scheme: https + permanent: true +``` + +#### Deploy the Config + +```bash +# Create file on CT 202 +ssh pve 'pct exec 202 -- bash -c "cat > /etc/traefik/conf.d/myservice.yaml << '\''EOF'\'' + +EOF"' + +# Traefik auto-reloads (watches conf.d directory) +# Check logs: +ssh pve 'pct exec 202 -- tail -f /var/log/traefik/traefik.log' +``` + +### Step 2: Add Cloudflare DNS Entry + +#### Cloudflare Credentials + +| Field | Value | +|-------|-------| +| Email | cloudflare@htsn.io | +| API Key | 849ebefd163d2ccdec25e49b3e1b3fe2cdadc | +| Zone ID (htsn.io) | c0f5a80448c608af35d39aa820a5f3af | +| Public IP | 70.237.94.174 | + +#### Method 1: Manual (Cloudflare Dashboard) + +1. Go to https://dash.cloudflare.com/ +2. Select `htsn.io` domain +3. DNS → Add Record +4. Type: `A`, Name: `myservice`, IPv4: `70.237.94.174`, Proxied: ☑️ + +#### Method 2: Automated (CLI) + +Save this as `~/bin/add-cloudflare-dns.sh`: + +```bash +#!/bin/bash +# Add DNS record to Cloudflare for htsn.io + +SUBDOMAIN="$1" +CF_EMAIL="cloudflare@htsn.io" +CF_API_KEY="849ebefd163d2ccdec25e49b3e1b3fe2cdadc" +ZONE_ID="c0f5a80448c608af35d39aa820a5f3af" +PUBLIC_IP="70.237.94.174" + +if [ -z "$SUBDOMAIN" ]; then + echo "Usage: $0 " + echo "Example: $0 myservice # Creates myservice.htsn.io" + exit 1 +fi + +curl -X POST "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records" \ + -H "X-Auth-Email: $CF_EMAIL" \ + -H "X-Auth-Key: $CF_API_KEY" \ + -H "Content-Type: application/json" \ + --data "{ + \"type\":\"A\", + \"name\":\"$SUBDOMAIN\", + \"content\":\"$PUBLIC_IP\", + \"ttl\":1, + \"proxied\":true + }" | jq . 
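
# Optional follow-up (assumption: the list-records endpoint's 'name' filter can be used to
# verify the new record; see the Cloudflare API Reference section below for the base call):
# curl -s -X GET "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records?name=$SUBDOMAIN.htsn.io" \
#   -H "X-Auth-Email: $CF_EMAIL" -H "X-Auth-Key: $CF_API_KEY" | jq '.result[] | {name, content, proxied}'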
+``` + +**Usage**: +```bash +chmod +x ~/bin/add-cloudflare-dns.sh +~/bin/add-cloudflare-dns.sh myservice # Creates myservice.htsn.io +``` + +### Step 3: Testing + +```bash +# Check if DNS resolves +dig myservice.htsn.io + +# Should return: 70.237.94.174 (or Cloudflare IPs if proxied) + +# Test HTTP redirect +curl -I http://myservice.htsn.io + +# Expected: 301 redirect to https:// + +# Test HTTPS +curl -I https://myservice.htsn.io + +# Expected: 200 OK + +# Check Traefik dashboard (if enabled) +# http://10.10.10.250:8080/dashboard/ +``` + +### Step 4: Update Documentation + +After deploying, update: + +1. **IP-ASSIGNMENTS.md** - Add to Services & Reverse Proxy Mapping table +2. **This file (TRAEFIK.md)** - Add to "Services Using Traefik-Primary" list +3. **CLAUDE.md** - Update quick reference if needed + +--- + +## SSL Certificates + +Traefik has **two certificate resolvers** configured: + +| Resolver | Use When | Challenge Type | Notes | +|----------|----------|----------------|-------| +| `letsencrypt` | Cloudflare DNS-only (gray cloud ☁️) | HTTP-01 | Requires port 80 reachable | +| `cloudflare` | Cloudflare Proxied (orange cloud 🟠) | DNS-01 | Works with Cloudflare proxy | + +### ⚠️ Important: HTTP Challenge vs DNS Challenge + +**If Cloudflare proxy is enabled** (orange cloud), HTTP challenge **FAILS** because Cloudflare redirects HTTP→HTTPS before the challenge reaches your server. + +**Solution**: Use `cloudflare` resolver (DNS-01 challenge) instead. + +### Certificate Resolver Configuration + +**Cloudflare API credentials** are configured in `/etc/systemd/system/traefik.service`: + +```ini +Environment="CF_API_EMAIL=cloudflare@htsn.io" +Environment="CF_API_KEY=849ebefd163d2ccdec25e49b3e1b3fe2cdadc" +``` + +### Certificate Storage + +| Resolver | Storage File | +|----------|--------------| +| HTTP challenge (`letsencrypt`) | `/etc/traefik/acme.json` | +| DNS challenge (`cloudflare`) | `/etc/traefik/acme-cf.json` | + +**Permissions**: Must be `600` (read/write owner only) + +```bash +# Check permissions +ssh pve 'pct exec 202 -- ls -la /etc/traefik/acme*.json' + +# Fix if needed +ssh pve 'pct exec 202 -- chmod 600 /etc/traefik/acme.json' +ssh pve 'pct exec 202 -- chmod 600 /etc/traefik/acme-cf.json' +``` + +### Certificate Renewal + +- **Automatic** via Traefik +- Checks every 24 hours +- Renews 30 days before expiry +- No manual intervention needed + +### Troubleshooting Certificates + +#### Certificate Fails to Issue + +```bash +# Check Traefik logs +ssh pve 'pct exec 202 -- tail -f /var/log/traefik/traefik.log | grep -i error' + +# Verify Cloudflare API access +curl -X GET "https://api.cloudflare.com/client/v4/user/tokens/verify" \ + -H "X-Auth-Email: cloudflare@htsn.io" \ + -H "X-Auth-Key: 849ebefd163d2ccdec25e49b3e1b3fe2cdadc" + +# Check acme.json permissions +ssh pve 'pct exec 202 -- ls -la /etc/traefik/acme*.json' +``` + +#### Force Certificate Renewal + +```bash +# Delete certificate (Traefik will re-request) +ssh pve 'pct exec 202 -- rm /etc/traefik/acme-cf.json' +ssh pve 'pct exec 202 -- touch /etc/traefik/acme-cf.json' +ssh pve 'pct exec 202 -- chmod 600 /etc/traefik/acme-cf.json' +ssh pve 'pct exec 202 -- systemctl restart traefik' + +# Watch logs +ssh pve 'pct exec 202 -- tail -f /var/log/traefik/traefik.log' +``` + +--- + +## Quick Deployment - One-Liner + +For fast deployment, use this all-in-one command: + +```bash +# === DEPLOY SERVICE (example: myservice on docker-host port 8080) === + +# 1. 
Create Traefik config +ssh pve 'pct exec 202 -- bash -c "cat > /etc/traefik/conf.d/myservice.yaml << EOF +http: + routers: + myservice-secure: + entryPoints: [websecure] + rule: Host(\\\`myservice.htsn.io\\\`) + service: myservice + tls: {certResolver: cloudflare} + services: + myservice: + loadBalancer: + servers: + - url: http://10.10.10.206:8080 +EOF"' + +# 2. Add Cloudflare DNS +curl -s -X POST "https://api.cloudflare.com/client/v4/zones/c0f5a80448c608af35d39aa820a5f3af/dns_records" \ + -H "X-Auth-Email: cloudflare@htsn.io" \ + -H "X-Auth-Key: 849ebefd163d2ccdec25e49b3e1b3fe2cdadc" \ + -H "Content-Type: application/json" \ + --data '{"type":"A","name":"myservice","content":"70.237.94.174","proxied":true}' + +# 3. Test (wait a few seconds for DNS propagation) +curl -I https://myservice.htsn.io +``` + +--- + +## Docker Service with Traefik Labels (Alternative) + +If deploying a service via Docker on `docker-host` (VM 206), you can use Traefik labels instead of config files. + +**Requirements**: +- Traefik must have access to Docker socket +- Service must be on same Docker network as Traefik + +**Example docker-compose.yml**: + +```yaml +version: "3.8" + +services: + myservice: + image: myimage:latest + labels: + - "traefik.enable=true" + - "traefik.http.routers.myservice.rule=Host(`myservice.htsn.io`)" + - "traefik.http.routers.myservice.entrypoints=websecure" + - "traefik.http.routers.myservice.tls.certresolver=letsencrypt" + - "traefik.http.services.myservice.loadbalancer.server.port=8080" + networks: + - traefik + +networks: + traefik: + external: true +``` + +**Note**: This method is NOT currently used on Traefik-Primary (CT 202), as it doesn't have Docker socket access. Config files are preferred. + +--- + +## Cloudflare API Reference + +### API Credentials + +| Field | Value | +|-------|-------| +| Email | cloudflare@htsn.io | +| API Key | 849ebefd163d2ccdec25e49b3e1b3fe2cdadc | +| Zone ID | c0f5a80448c608af35d39aa820a5f3af | + +### Common API Operations + +Set credentials: +```bash +CF_EMAIL="cloudflare@htsn.io" +CF_API_KEY="849ebefd163d2ccdec25e49b3e1b3fe2cdadc" +ZONE_ID="c0f5a80448c608af35d39aa820a5f3af" +``` + +**List all DNS records**: +```bash +curl -X GET "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records" \ + -H "X-Auth-Email: $CF_EMAIL" \ + -H "X-Auth-Key: $CF_API_KEY" | jq +``` + +**Add A record**: +```bash +curl -X POST "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records" \ + -H "X-Auth-Email: $CF_EMAIL" \ + -H "X-Auth-Key: $CF_API_KEY" \ + -H "Content-Type: application/json" \ + --data '{ + "type":"A", + "name":"subdomain", + "content":"70.237.94.174", + "proxied":true + }' +``` + +**Delete record**: +```bash +curl -X DELETE "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records/$RECORD_ID" \ + -H "X-Auth-Email: $CF_EMAIL" \ + -H "X-Auth-Key: $CF_API_KEY" +``` + +**Update record** (toggle proxy): +```bash +curl -X PATCH "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records/$RECORD_ID" \ + -H "X-Auth-Email: $CF_EMAIL" \ + -H "X-Auth-Key: $CF_API_KEY" \ + -H "Content-Type: application/json" \ + --data '{"proxied":false}' +``` + +--- + +## Troubleshooting + +### Service Not Accessible + +```bash +# 1. Check if DNS resolves +dig myservice.htsn.io + +# 2. Check if backend is reachable +curl -I http://10.10.10.XXX:PORT + +# 3. Check Traefik logs +ssh pve 'pct exec 202 -- tail -f /var/log/traefik/traefik.log' + +# 4. Check Traefik config is valid +ssh pve 'pct exec 202 -- cat /etc/traefik/conf.d/myservice.yaml' + +# 5. 
Restart Traefik (if needed) +ssh pve 'pct exec 202 -- systemctl restart traefik' +``` + +### Certificate Issues + +```bash +# Check certificate status in acme.json +ssh pve 'pct exec 202 -- cat /etc/traefik/acme-cf.json | jq' + +# Check certificate expiry +echo | openssl s_client -servername myservice.htsn.io -connect myservice.htsn.io:443 2>/dev/null | openssl x509 -noout -dates +``` + +### 502 Bad Gateway + +**Cause**: Backend service is down or unreachable + +```bash +# Check if backend is running +ssh backend-host 'systemctl status myservice' + +# Check if port is open +nc -zv 10.10.10.XXX PORT + +# Check firewall +ssh backend-host 'iptables -L -n | grep PORT' +``` + +### 404 Not Found + +**Cause**: Traefik can't match the request to a router + +```bash +# Check router rule matches domain +ssh pve 'pct exec 202 -- cat /etc/traefik/conf.d/myservice.yaml | grep rule' + +# Should be: rule: "Host(`myservice.htsn.io`)" + +# Check DNS is pointing to correct IP +dig myservice.htsn.io + +# Restart Traefik to reload config +ssh pve 'pct exec 202 -- systemctl restart traefik' +``` + +--- + +## Advanced Configuration Examples + +### WebSocket Support + +For services that use WebSockets (like Home Assistant): + +```yaml +http: + routers: + myservice-secure: + entryPoints: + - websecure + rule: "Host(`myservice.htsn.io`)" + service: myservice + tls: + certResolver: cloudflare + + services: + myservice: + loadBalancer: + servers: + - url: "http://10.10.10.XXX:PORT" + # No special config needed - WebSockets work by default in Traefik v2+ +``` + +### Custom Headers + +Add custom headers (e.g., security headers): + +```yaml +http: + routers: + myservice-secure: + middlewares: + - myservice-headers + + middlewares: + myservice-headers: + headers: + customResponseHeaders: + X-Frame-Options: "DENY" + X-Content-Type-Options: "nosniff" + Referrer-Policy: "strict-origin-when-cross-origin" +``` + +### Basic Authentication + +Protect a service with basic auth: + +```yaml +http: + routers: + myservice-secure: + middlewares: + - myservice-auth + + middlewares: + myservice-auth: + basicAuth: + users: + - "user:$apr1$..." # Generate with: htpasswd -nb user password +``` + +--- + +## Maintenance + +### Monthly Checks + +```bash +# Check Traefik status +ssh pve 'pct exec 202 -- systemctl status traefik' + +# Review logs for errors +ssh pve 'pct exec 202 -- grep -i error /var/log/traefik/traefik.log | tail -20' + +# Check certificate expiry dates +ssh pve 'pct exec 202 -- cat /etc/traefik/acme-cf.json | jq ".cloudflare.Certificates[] | {domain: .domain.main, expiry: .certificate}"' + +# Verify all services responding +for domain in plex.htsn.io git.htsn.io truenas.htsn.io; do + echo "Testing $domain..." 
+ curl -sI https://$domain | head -1 +done +``` + +### Backup Traefik Config + +```bash +# Backup all configs +ssh pve 'pct exec 202 -- tar czf /tmp/traefik-backup-$(date +%Y%m%d).tar.gz /etc/traefik' + +# Copy to safe location +scp "pve:/var/lib/lxc/202/rootfs/tmp/traefik-backup-*.tar.gz" ~/Backups/traefik/ +``` + +--- + +## Related Documentation + +- [IP-ASSIGNMENTS.md](IP-ASSIGNMENTS.md) - Service IP addresses +- [CLOUDFLARE.md](#) - Cloudflare DNS management (coming soon) +- [SERVICES.md](#) - Complete service inventory (coming soon) + +--- + +**Last Updated**: 2025-12-22 diff --git a/UPS.md b/UPS.md new file mode 100644 index 0000000..1aeb248 --- /dev/null +++ b/UPS.md @@ -0,0 +1,605 @@ +# UPS and Power Management + +Documentation for UPS (Uninterruptible Power Supply) configuration, NUT (Network UPS Tools) monitoring, and power failure procedures. + +## Hardware + +### Current UPS + +| Specification | Value | +|---------------|-------| +| **Model** | CyberPower OR2200PFCRT2U | +| **Capacity** | 2200VA / 1320W | +| **Form Factor** | 2U rackmount | +| **Output** | PFC Sinewave (compatible with active PFC PSUs) | +| **Outlets** | 2x NEMA 5-20R + 6x NEMA 5-15R (all battery + surge) | +| **Input Plug** | ⚠️ Originally NEMA 5-20P (20A), **rewired to 5-15P (15A)** | +| **Runtime** | ~15-20 min at typical load (~33% / 440W) | +| **Installed** | 2025-12-21 | +| **Status** | Active | + +### ⚠️ Temporary Wiring Modification + +**Issue**: UPS came with NEMA 5-20P plug (20A) but server rack is on 15A circuit +**Solution**: Temporarily rewired plug from 5-20P → 5-15P for compatibility +**Risk**: UPS can output 1320W but circuit limited to 1800W max (15A × 120V) +**Current draw**: ~1000-1350W total (safe margin) +**Backlog**: Upgrade to 20A circuit, restore original 5-20P plug + +### Previous UPS + +| Model | Capacity | Issue | Replaced | +|-------|----------|-------|----------| +| WattBox WB-1100-IPVMB-6 | 1100VA / 660W | Insufficient for dual Threadripper setup | 2025-12-21 | + +**Why replaced**: Combined server load of 1000-1350W exceeded 660W capacity. + +--- + +## Power Draw Estimates + +### Typical Load + +| Component | Idle | Load | Notes | +|-----------|------|------|-------| +| PVE Server | 250-350W | 500-750W | CPU + TITAN RTX + P2000 + storage | +| PVE2 Server | 200-300W | 450-600W | CPU + RTX A6000 + storage | +| Network gear | ~50W | ~50W | Router, switches | +| **Total** | **500-700W** | **1000-1400W** | Varies by workload | + +**UPS Load**: ~33-50% typical, 70-80% under heavy load + +### Runtime Calculation + +At **440W load** (33%): ~15-20 min runtime (tested 2025-12-21) +At **660W load** (50%): ~10-12 min estimated +At **1000W load** (75%): ~6-8 min estimated + +**NUT shutdown trigger**: 120 seconds (2 min) remaining runtime + +--- + +## NUT (Network UPS Tools) Configuration + +### Architecture + +``` +UPS (USB) ──> PVE (NUT Server/Master) ──> PVE2 (NUT Client/Slave) + │ + └──> Home Assistant (monitoring only) +``` + +**Master**: PVE (10.10.10.120) - UPS connected via USB, runs NUT server +**Slave**: PVE2 (10.10.10.102) - Monitors PVE's NUT server, shuts down when triggered + +### NUT Server Configuration (PVE) + +#### 1. 
UPS Driver Config: `/etc/nut/ups.conf` + +```ini +[cyberpower] + driver = usbhid-ups + port = auto + desc = "CyberPower OR2200PFCRT2U" + override.battery.charge.low = 20 + override.battery.runtime.low = 120 +``` + +**Key settings**: +- `driver = usbhid-ups`: USB HID UPS driver (generic for CyberPower) +- `port = auto`: Auto-detect USB device +- `override.battery.runtime.low = 120`: Trigger shutdown at 120 seconds (2 min) remaining + +#### 2. NUT Server Config: `/etc/nut/upsd.conf` + +```ini +LISTEN 127.0.0.1 3493 +LISTEN 10.10.10.120 3493 +``` + +**Listens on**: +- Localhost (for local monitoring) +- LAN IP (for PVE2 to connect) + +#### 3. User Config: `/etc/nut/upsd.users` + +```ini +[admin] + password = upsadmin123 + actions = SET + instcmds = ALL + +[upsmon] + password = upsmon123 + upsmon master +``` + +**Users**: +- `admin`: Full control, can run commands +- `upsmon`: Monitoring only (used by PVE2) + +#### 4. Monitor Config: `/etc/nut/upsmon.conf` + +```ini +MONITOR cyberpower@localhost 1 upsmon upsmon123 master + +MINSUPPLIES 1 +SHUTDOWNCMD "/usr/local/bin/ups-shutdown.sh" +NOTIFYCMD /usr/sbin/upssched +POLLFREQ 5 +POLLFREQALERT 5 +HOSTSYNC 15 +DEADTIME 15 +POWERDOWNFLAG /etc/killpower + +NOTIFYMSG ONLINE "UPS %s on line power" +NOTIFYMSG ONBATT "UPS %s on battery" +NOTIFYMSG LOWBATT "UPS %s battery is low" +NOTIFYMSG FSD "UPS %s: forced shutdown in progress" +NOTIFYMSG COMMOK "Communications with UPS %s established" +NOTIFYMSG COMMBAD "Communications with UPS %s lost" +NOTIFYMSG SHUTDOWN "Auto logout and shutdown proceeding" +NOTIFYMSG REPLBATT "UPS %s battery needs to be replaced" +NOTIFYMSG NOCOMM "UPS %s is unavailable" +NOTIFYMSG NOPARENT "upsmon parent process died - shutdown impossible" + +NOTIFYFLAG ONLINE SYSLOG+WALL +NOTIFYFLAG ONBATT SYSLOG+WALL +NOTIFYFLAG LOWBATT SYSLOG+WALL +NOTIFYFLAG FSD SYSLOG+WALL +NOTIFYFLAG COMMOK SYSLOG+WALL +NOTIFYFLAG COMMBAD SYSLOG+WALL +NOTIFYFLAG SHUTDOWN SYSLOG+WALL +NOTIFYFLAG REPLBATT SYSLOG+WALL +NOTIFYFLAG NOCOMM SYSLOG+WALL +NOTIFYFLAG NOPARENT SYSLOG +``` + +**Key settings**: +- `MONITOR cyberpower@localhost 1 upsmon upsmon123 master`: Monitor local UPS +- `SHUTDOWNCMD "/usr/local/bin/ups-shutdown.sh"`: Custom shutdown script +- `POLLFREQ 5`: Check UPS every 5 seconds + +#### 5. USB Permissions: `/etc/udev/rules.d/99-nut-ups.rules` + +```udev +SUBSYSTEM=="usb", ATTR{idVendor}=="0764", ATTR{idProduct}=="0501", MODE="0660", GROUP="nut" +``` + +**Purpose**: Ensure NUT can access USB UPS device + +**Apply rule**: +```bash +udevadm control --reload-rules +udevadm trigger +``` + +### NUT Client Configuration (PVE2) + +#### Monitor Config: `/etc/nut/upsmon.conf` + +```ini +MONITOR cyberpower@10.10.10.120 1 upsmon upsmon123 slave + +MINSUPPLIES 1 +SHUTDOWNCMD "/usr/local/bin/ups-shutdown.sh" +POLLFREQ 5 +POLLFREQALERT 5 +HOSTSYNC 15 +DEADTIME 15 +POWERDOWNFLAG /etc/killpower + +# Same NOTIFYMSG and NOTIFYFLAG as PVE +``` + +**Key difference**: `slave` instead of `master` - monitors remote UPS on PVE + +--- + +## Custom Shutdown Script + +### `/usr/local/bin/ups-shutdown.sh` (Same on both PVE and PVE2) + +```bash +#!/bin/bash +# Graceful VM/CT shutdown when UPS battery low + +LOG="/var/log/ups-shutdown.log" + +log() { + echo "$(date '+%Y-%m-%d %H:%M:%S') - $1" | tee -a "$LOG" +} + +log "=== UPS Shutdown Triggered ===" +log "Battery low - initiating graceful shutdown of VMs/CTs" + +# Get list of running VMs (skip TrueNAS for now) +VMS=$(qm list | awk '$3=="running" && $1!=100 {print $1}') +for VMID in $VMS; do + log "Stopping VM $VMID..." 
+ qm shutdown $VMID +done + +# Get list of running containers +CTS=$(pct list | awk '$2=="running" {print $1}') +for CTID in $CTS; do + log "Stopping CT $CTID..." + pct shutdown $CTID +done + +# Wait for VMs/CTs to stop +log "Waiting 60 seconds for VMs/CTs to shut down..." +sleep 60 + +# Now stop TrueNAS (storage - must be last) +if qm status 100 | grep -q running; then + log "Stopping TrueNAS (VM 100) last..." + qm shutdown 100 + sleep 30 +fi + +log "All VMs/CTs stopped. Host will remain running until UPS dies." +log "=== UPS Shutdown Complete ===" +``` + +**Make executable**: +```bash +chmod +x /usr/local/bin/ups-shutdown.sh +``` + +**Script behavior**: +1. Stops all VMs (except TrueNAS) +2. Stops all containers +3. Waits 60 seconds +4. Stops TrueNAS last (storage must be cleanly unmounted) +5. **Does NOT shut down Proxmox hosts** - intentionally left running + +**Why not shut down hosts?** +- BIOS configured to "Restore on AC Power Loss" +- When power returns, servers auto-boot and start VMs in order +- Avoids need for manual intervention + +--- + +## Power Failure Behavior + +### When Power Fails + +1. **UPS switches to battery** (`OB DISCHRG` status) +2. **NUT monitors runtime** - polls every 5 seconds +3. **At 120 seconds (2 min) remaining**: + - NUT triggers `/usr/local/bin/ups-shutdown.sh` on both servers + - Script gracefully stops all VMs/CTs + - TrueNAS stopped last (storage integrity) +4. **Hosts remain running** until UPS battery depletes +5. **UPS battery dies** → Hosts lose power (ungraceful but safe - VMs already stopped) + +### When Power Returns + +1. **UPS charges battery**, power returns to servers +2. **BIOS "Restore on AC Power Loss"** boots both servers +3. **Proxmox starts** and auto-starts VMs in configured order: + +| Order | Wait | VMs/CTs | Reason | +|-------|------|---------|--------| +| 1 | 30s | TrueNAS (VM 100) | Storage must start first | +| 2 | 60s | Saltbox (VM 101) | Depends on TrueNAS NFS | +| 3 | 10s | fs-dev, homeassistant, lmdev1, copyparty, docker-host | General VMs | +| 4 | 5s | pihole, traefik, findshyt | Containers | + +PVE2 VMs: order=1, wait=10s + +**Total recovery time**: ~7 minutes from power restoration to fully operational (tested 2025-12-21) + +--- + +## UPS Status Codes + +| Code | Meaning | Action | +|------|---------|--------| +| `OL` | Online (AC power) | Normal operation | +| `OB` | On Battery | Power outage - monitor runtime | +| `LB` | Low Battery | <2 min remaining - shutdown imminent | +| `CHRG` | Charging | Battery charging after power restored | +| `DISCHRG` | Discharging | On battery, draining | +| `FSD` | Forced Shutdown | NUT triggered shutdown | + +--- + +## Monitoring & Commands + +### Check UPS Status + +```bash +# Full status +ssh pve 'upsc cyberpower@localhost' + +# Key metrics only +ssh pve 'upsc cyberpower@localhost | grep -E "battery.charge:|battery.runtime:|ups.load:|ups.status:"' + +# Example output: +# battery.charge: 100 +# battery.runtime: 1234 (seconds remaining) +# ups.load: 33 (% load) +# ups.status: OL (online) +``` + +### Control UPS Beeper + +```bash +# Mute beeper (temporary - until next power event) +ssh pve 'upscmd -u admin -p upsadmin123 cyberpower@localhost beeper.mute' + +# Disable beeper (permanent) +ssh pve 'upscmd -u admin -p upsadmin123 cyberpower@localhost beeper.disable' + +# Enable beeper +ssh pve 'upscmd -u admin -p upsadmin123 cyberpower@localhost beeper.enable' +``` + +### Test Shutdown Procedure + +**Simulate low battery** (careful - this will shut down VMs!): + +```bash +# Set a very high 
low battery threshold to trigger shutdown +ssh pve 'upsrw -s battery.runtime.low=300 -u admin -p upsadmin123 cyberpower@localhost' + +# Watch it trigger (when runtime drops below 300 seconds) +ssh pve 'tail -f /var/log/ups-shutdown.log' + +# Reset to normal +ssh pve 'upsrw -s battery.runtime.low=120 -u admin -p upsadmin123 cyberpower@localhost' +``` + +**Better test**: Run shutdown script manually without actually triggering NUT: +```bash +ssh pve '/usr/local/bin/ups-shutdown.sh' +``` + +--- + +## Home Assistant Integration + +UPS metrics are exposed to Home Assistant via NUT integration. + +### Available Sensors + +| Entity ID | Description | +|-----------|-------------| +| `sensor.cyberpower_battery_charge` | Battery % (0-100) | +| `sensor.cyberpower_battery_runtime` | Seconds remaining on battery | +| `sensor.cyberpower_load` | Load % (0-100) | +| `sensor.cyberpower_input_voltage` | Input voltage (V AC) | +| `sensor.cyberpower_output_voltage` | Output voltage (V AC) | +| `sensor.cyberpower_status` | Status text (OL, OB, LB, etc.) | + +### Configuration + +**Home Assistant**: See [HOMEASSISTANT.md](HOMEASSISTANT.md) for integration setup. + +### Example Automations + +**Send notification when on battery**: +```yaml +automation: + - alias: "UPS On Battery Alert" + trigger: + - platform: state + entity_id: sensor.cyberpower_status + to: "OB" + action: + - service: notify.mobile_app + data: + message: "⚠️ Power outage! UPS on battery. Runtime: {{ states('sensor.cyberpower_battery_runtime') }}s" +``` + +**Alert when battery low**: +```yaml +automation: + - alias: "UPS Low Battery Alert" + trigger: + - platform: numeric_state + entity_id: sensor.cyberpower_battery_runtime + below: 300 + action: + - service: notify.mobile_app + data: + message: "🚨 UPS battery low! {{ states('sensor.cyberpower_battery_runtime') }}s remaining" +``` + +--- + +## Testing Results + +### Full Power Failure Test (2025-12-21) + +Complete end-to-end test of power failure and recovery: + +| Event | Time | Duration | Notes | +|-------|------|----------|-------| +| **Power pulled** | 22:30 | - | UPS on battery, ~15 min runtime at 33% load | +| **Low battery trigger** | 22:40:38 | +10:38 | Runtime < 120s, shutdown script ran | +| **All VMs stopped** | 22:41:36 | +0:58 | Graceful shutdown completed | +| **UPS died** | 22:46:29 | +4:53 | Hosts lost power at 0% battery | +| **Power restored** | ~22:47 | - | Plugged back in | +| **PVE online** | 22:49:11 | +2:11 | BIOS boot, Proxmox started | +| **PVE2 online** | 22:50:47 | +3:47 | BIOS boot, Proxmox started | +| **All VMs running** | 22:53:39 | +6:39 | Auto-started in correct order | +| **Total recovery** | - | **~7 min** | From power return to fully operational | + +**Results**: +✅ VMs shut down gracefully +✅ Hosts remained running until UPS died (as intended) +✅ Auto-boot on power restoration worked +✅ VMs started in correct order with appropriate delays +✅ No data corruption or issues + +**Runtime calculation**: +- Load: ~33% (440W estimated) +- Total runtime on battery: ~16 minutes (22:30 → 22:46:29) +- Matches manufacturer estimate for 33% load + +--- + +## Proxmox Cluster Quorum Fix + +### Problem + +With a 2-node cluster, if one node goes down, the other loses quorum and can't manage VMs. + +During UPS testing, this would prevent the remaining node from starting VMs after power restoration. 
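+
+A quick way to confirm the lost-quorum state during a test is to query the cluster from the surviving node. This is a minimal sketch using the standard `pvecm` CLI (not part of the original runbook; the grep'd field names come from typical `pvecm status` output and are worth verifying on your Proxmox version):
+
+```bash
+# On the node that stayed up, check whether the cluster is quorate
+ssh pve 'pvecm status | grep -E "Quorate|Expected votes|Total votes"'
+
+# "Quorate: No" means qm/pct operations are blocked until quorum returns.
+# Temporary emergency override (lowers expected votes to 1):
+ssh pve 'pvecm expected 1'
+```
+
+With the `two_node` setting below, this manual override is no longer needed.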
+ +### Solution + +Modified `/etc/pve/corosync.conf` to enable 2-node mode: + +``` +quorum { + provider: corosync_votequorum + two_node: 1 +} +``` + +**Effect**: +- Either node can operate independently if the other is down +- No more waiting for quorum when one server is offline +- Both nodes visible in single Proxmox interface when both up + +**Applied**: 2025-12-21 + +--- + +## Maintenance + +### Monthly Checks + +```bash +# Check UPS status +ssh pve 'upsc cyberpower@localhost' + +# Check NUT server running +ssh pve 'systemctl status nut-server' +ssh pve 'systemctl status nut-monitor' + +# Check NUT client running (PVE2) +ssh pve2 'systemctl status nut-monitor' + +# Verify PVE2 can see UPS +ssh pve2 'upsc cyberpower@10.10.10.120' + +# Check logs for errors +ssh pve 'journalctl -u nut-server -n 50' +ssh pve 'journalctl -u nut-monitor -n 50' +``` + +### Battery Health + +**Check battery stats**: +```bash +ssh pve 'upsc cyberpower@localhost | grep battery' + +# Key metrics: +# battery.charge: 100 (should be near 100 when on AC) +# battery.runtime: 1200+ (seconds at current load) +# battery.voltage: ~24V (normal for 24V battery system) +``` + +**Battery replacement**: When runtime significantly decreases or UPS reports `REPLBATT`: +```bash +ssh pve 'upsc cyberpower@localhost | grep battery.mfr.date' +``` + +CyberPower batteries typically last 3-5 years. + +### Firmware Updates + +Check CyberPower website for firmware updates: +https://www.cyberpowersystems.com/support/firmware/ + +--- + +## Troubleshooting + +### UPS Not Detected + +```bash +# Check USB connection +ssh pve 'lsusb | grep Cyber' + +# Expected: +# Bus 001 Device 003: ID 0764:0501 Cyber Power System, Inc. CP1500 AVR UPS + +# Restart NUT driver +ssh pve 'systemctl restart nut-driver' +ssh pve 'systemctl status nut-driver' +``` + +### PVE2 Can't Connect + +```bash +# Verify NUT server listening +ssh pve 'netstat -tuln | grep 3493' + +# Should show: +# tcp 0 0 10.10.10.120:3493 0.0.0.0:* LISTEN + +# Test connection from PVE2 +ssh pve2 'telnet 10.10.10.120 3493' + +# Check firewall (should allow port 3493) +ssh pve 'iptables -L -n | grep 3493' +``` + +### Shutdown Script Not Running + +```bash +# Check script permissions +ssh pve 'ls -la /usr/local/bin/ups-shutdown.sh' + +# Should be: -rwxr-xr-x (executable) + +# Check logs +ssh pve 'cat /var/log/ups-shutdown.log' + +# Test script manually +ssh pve '/usr/local/bin/ups-shutdown.sh' +``` + +### UPS Status Shows UNKNOWN + +```bash +# Driver may not be compatible +ssh pve 'upsc cyberpower@localhost ups.status' + +# Try different driver (in /etc/nut/ups.conf) +# driver = usbhid-ups +# or +# driver = blazer_usb + +# Restart after change +ssh pve 'systemctl restart nut-driver nut-server' +``` + +--- + +## Future Improvements + +- [ ] Add email alerts for UPS events (power fail, low battery) +- [ ] Log runtime statistics to track battery degradation +- [ ] Set up Grafana dashboard for UPS metrics +- [ ] Test battery runtime at different load levels +- [ ] Upgrade to 20A circuit, restore original 5-20P plug +- [ ] Consider adding network management card for out-of-band UPS access + +--- + +## Related Documentation + +- [POWER-MANAGEMENT.md](POWER-MANAGEMENT.md) - Overall power optimization +- [VMS.md](VMS.md) - VM startup order configuration +- [HOMEASSISTANT.md](HOMEASSISTANT.md) - UPS sensor integration + +--- + +**Last Updated**: 2025-12-22 diff --git a/VMS.md b/VMS.md new file mode 100644 index 0000000..b8e1d4c --- /dev/null +++ b/VMS.md @@ -0,0 +1,579 @@ +# VMs and Containers + 
+Complete inventory of all virtual machines and LXC containers across both Proxmox servers.
+
+## Overview
+
+| Server | VMs | LXCs | Total |
+|--------|-----|------|-------|
+| **PVE** (10.10.10.120) | 7 | 3 | 10 |
+| **PVE2** (10.10.10.102) | 2 | 0 | 2 |
+| **Total** | **9** | **3** | **12** |
+
+---
+
+## PVE (10.10.10.120) - Primary Server
+
+### Virtual Machines
+
+| VMID | Name | IP | vCPUs | RAM | Storage | Purpose | GPU/Passthrough | QEMU Agent |
+|------|------|-----|-------|-----|---------|---------|-----------------|------------|
+| **100** | truenas | 10.10.10.200 | 8 | 32GB | nvme-mirror1 | NAS, central file storage | LSI SAS2308 HBA, Samsung NVMe | ✅ Yes |
+| **101** | saltbox | 10.10.10.100 | 16 | 16GB | nvme-mirror1 | Media automation (Plex, *arr) | TITAN RTX | ✅ Yes |
+| **105** | fs-dev | 10.10.10.5 | 10 | 8GB | rpool | Development environment | - | ✅ Yes |
+| **110** | homeassistant | 10.10.10.110 | 2 | 2GB | rpool | Home automation platform | - | ❌ No |
+| **111** | lmdev1 | 10.10.10.111 | 8 | 32GB | nvme-mirror1 | AI/LLM development | TITAN RTX | ✅ Yes |
+| **201** | copyparty | 10.10.10.201 | 2 | 2GB | rpool | File sharing service | - | ✅ Yes |
+| **206** | docker-host | 10.10.10.206 | 2 | 4GB | rpool | Docker services (Excalidraw, Happy, Pulse) | - | ✅ Yes |
+
+### LXC Containers
+
+| CTID | Name | IP | RAM | Storage | Purpose |
+|------|------|-----|-----|---------|---------|
+| **200** | pihole | 10.10.10.10 | - | rpool | DNS, ad blocking |
+| **202** | traefik | 10.10.10.250 | - | rpool | Reverse proxy (primary) |
+| **205** | findshyt | 10.10.10.8 | - | rpool | Custom app |
+
+---
+
+## PVE2 (10.10.10.102) - Secondary Server
+
+### Virtual Machines
+
+| VMID | Name | IP | vCPUs | RAM | Storage | Purpose | GPU/Passthrough | QEMU Agent |
+|------|------|-----|-------|-----|---------|---------|-----------------|------------|
+| **300** | gitea-vm | 10.10.10.220 | 2 | 4GB | nvme-mirror3 | Git server (Gitea) | - | ✅ Yes |
+| **301** | trading-vm | 10.10.10.221 | 16 | 32GB | nvme-mirror3 | AI trading platform | RTX A6000 | ✅ Yes |
+
+### LXC Containers
+
+None on PVE2.
+
+---
+
+## VM Details
+
+### 100 - TrueNAS (Storage Server)
+
+**Purpose**: Central NAS for all file storage, NFS/SMB shares, and media libraries
+
+**Specs**:
+- **OS**: TrueNAS SCALE
+- **vCPUs**: 8
+- **RAM**: 32 GB
+- **Storage**: nvme-mirror1 (OS), EMC storage enclosure (data pool via HBA passthrough)
+- **Network**:
+  - Primary: 10 Gb (vmbr2)
+  - Secondary: Internal storage network (vmbr3 @ 10.10.20.x)
+
+**Hardware Passthrough**:
+- LSI SAS2308 HBA (for EMC enclosure drives)
+- Samsung NVMe (for ZFS caching)
+
+**ZFS Pools**:
+- `vault`: Main storage pool on EMC drives
+- Boot pool on passed-through NVMe
+
+**See**: [STORAGE.md](STORAGE.md), [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md)
+
+---
+
+### 101 - Saltbox (Media Automation)
+
+**Purpose**: Media server stack - Plex, Sonarr, Radarr, SABnzbd, Overseerr, etc.
+ +**Specs**: +- **OS**: Ubuntu 22.04 +- **vCPUs**: 16 +- **RAM**: 16 GB +- **Storage**: nvme-mirror1 +- **Network**: 10 Gb (vmbr2) + +**GPU Passthrough**: +- NVIDIA TITAN RTX (for Plex hardware transcoding) + +**Services**: +- Plex Media Server (plex.htsn.io) +- Sonarr, Radarr, Lidarr (TV/movie/music automation) +- SABnzbd, NZBGet (downloaders) +- Overseerr (request management) +- Tautulli (Plex stats) +- Organizr (dashboard) +- Authelia (SSO authentication) +- Traefik (reverse proxy - separate from CT 202) + +**Managed By**: Saltbox Ansible playbooks +**See**: [SALTBOX.md](#) (coming soon) + +--- + +### 105 - fs-dev (Development Environment) + +**Purpose**: General development work, testing, prototyping + +**Specs**: +- **OS**: Ubuntu 22.04 +- **vCPUs**: 10 +- **RAM**: 8 GB +- **Storage**: rpool +- **Network**: 1 Gb (vmbr0) + +--- + +### 110 - Home Assistant (Home Automation) + +**Purpose**: Smart home automation platform + +**Specs**: +- **OS**: Home Assistant OS +- **vCPUs**: 2 +- **RAM**: 2 GB +- **Storage**: rpool +- **Network**: 1 Gb (vmbr0) + +**Access**: +- Web UI: https://homeassistant.htsn.io +- API: See [HOMEASSISTANT.md](HOMEASSISTANT.md) + +**Special Notes**: +- ❌ No QEMU agent (Home Assistant OS doesn't support it) +- No SSH server by default (access via web terminal) + +--- + +### 111 - lmdev1 (AI/LLM Development) + +**Purpose**: AI model development, fine-tuning, inference + +**Specs**: +- **OS**: Ubuntu 22.04 +- **vCPUs**: 8 +- **RAM**: 32 GB +- **Storage**: nvme-mirror1 +- **Network**: 1 Gb (vmbr0) + +**GPU Passthrough**: +- NVIDIA TITAN RTX (shared with Saltbox, but can be dedicated if needed) + +**Installed**: +- CUDA toolkit +- Python 3.11+ +- PyTorch, TensorFlow +- Hugging Face transformers + +--- + +### 201 - Copyparty (File Sharing) + +**Purpose**: Simple HTTP file sharing server + +**Specs**: +- **OS**: Ubuntu 22.04 +- **vCPUs**: 2 +- **RAM**: 2 GB +- **Storage**: rpool +- **Network**: 1 Gb (vmbr0) + +**Access**: https://copyparty.htsn.io + +--- + +### 206 - docker-host (Docker Services) + +**Purpose**: General-purpose Docker host for miscellaneous services + +**Specs**: +- **OS**: Ubuntu 22.04 +- **vCPUs**: 2 +- **RAM**: 4 GB +- **Storage**: rpool +- **Network**: 1 Gb (vmbr0) +- **CPU**: `host` passthrough (for x86-64-v3 support) + +**Services Running**: +- Excalidraw (excalidraw.htsn.io) - Whiteboard +- Happy Coder relay server (happy.htsn.io) - Self-hosted relay for Happy Coder mobile app +- Pulse (pulse.htsn.io) - Monitoring dashboard + +**Docker Compose Files**: `/opt/*/docker-compose.yml` + +--- + +### 300 - gitea-vm (Git Server) + +**Purpose**: Self-hosted Git server + +**Specs**: +- **OS**: Ubuntu 22.04 +- **vCPUs**: 2 +- **RAM**: 4 GB +- **Storage**: nvme-mirror3 (PVE2) +- **Network**: 1 Gb (vmbr0) + +**Access**: https://git.htsn.io + +**Repositories**: +- homelab-docs (this documentation) +- Personal projects +- Private repos + +--- + +### 301 - trading-vm (AI Trading Platform) + +**Purpose**: Algorithmic trading system with AI models + +**Specs**: +- **OS**: Ubuntu 22.04 +- **vCPUs**: 16 +- **RAM**: 32 GB +- **Storage**: nvme-mirror3 (PVE2) +- **Network**: 1 Gb (vmbr0) + +**GPU Passthrough**: +- NVIDIA RTX A6000 (300W TDP, 48GB VRAM) + +**Software**: +- Trading algorithms +- AI models for market prediction +- Real-time data feeds +- Backtesting infrastructure + +--- + +## LXC Container Details + +### 200 - Pi-hole (DNS & Ad Blocking) + +**Purpose**: Network-wide DNS server and ad blocker + +**Type**: LXC (unprivileged) +**OS**: Ubuntu 22.04 +**IP**: 
10.10.10.10 +**Storage**: rpool + +**Access**: +- Web UI: http://10.10.10.10/admin +- Public URL: https://pihole.htsn.io + +**Configuration**: +- Upstream DNS: Cloudflare (1.1.1.1) +- DHCP: Disabled (router handles DHCP) +- Interface: All interfaces + +**Usage**: Set router DNS to 10.10.10.10 for network-wide ad blocking + +--- + +### 202 - Traefik (Reverse Proxy) + +**Purpose**: Primary reverse proxy for all public-facing services + +**Type**: LXC (unprivileged) +**OS**: Ubuntu 22.04 +**IP**: 10.10.10.250 +**Storage**: rpool + +**Configuration**: `/etc/traefik/` +**Dynamic Configs**: `/etc/traefik/conf.d/*.yaml` + +**See**: [TRAEFIK.md](TRAEFIK.md) for complete documentation + +**⚠️ Important**: This is the PRIMARY Traefik instance. Do NOT confuse with Saltbox's Traefik (VM 101). + +--- + +### 205 - FindShyt (Custom App) + +**Purpose**: Custom application (details TBD) + +**Type**: LXC (unprivileged) +**OS**: Ubuntu 22.04 +**IP**: 10.10.10.8 +**Storage**: rpool + +**Access**: https://findshyt.htsn.io + +--- + +## VM Startup Order & Dependencies + +### Power-On Sequence + +When servers boot (after power failure or restart), VMs/CTs start in this order: + +#### PVE (10.10.10.120) + +| Order | Wait | VMID | Name | Reason | +|-------|------|------|------|--------| +| **1** | 30s | 100 | TrueNAS | ⚠️ Storage must start first - other VMs depend on NFS | +| **2** | 60s | 101 | Saltbox | Depends on TrueNAS NFS mounts for media | +| **3** | 10s | 105, 110, 111, 201, 206 | Other VMs | General VMs, no critical dependencies | +| **4** | 5s | 200, 202, 205 | Containers | Lightweight, start quickly | + +**Configure startup order** (already set): +```bash +# View current config +ssh pve 'qm config 100 | grep -E "startup|onboot"' + +# Set startup order (example) +ssh pve 'qm set 100 --onboot 1 --startup order=1,up=30' +ssh pve 'qm set 101 --onboot 1 --startup order=2,up=60' +``` + +#### PVE2 (10.10.10.102) + +| Order | Wait | VMID | Name | +|-------|------|------|------| +| **1** | 10s | 300, 301 | All VMs | + +**Less critical** - no dependencies between PVE2 VMs. + +--- + +## Resource Allocation Summary + +### Total Allocated (PVE) + +| Resource | Allocated | Physical | % Used | +|----------|-----------|----------|--------| +| **vCPUs** | 56 | 64 (32 cores × 2 threads) | 88% | +| **RAM** | 98 GB | 128 GB | 77% | + +**Note**: vCPU overcommit is acceptable (VMs rarely use all cores simultaneously) + +### Total Allocated (PVE2) + +| Resource | Allocated | Physical | % Used | +|----------|-----------|----------|--------| +| **vCPUs** | 18 | 64 | 28% | +| **RAM** | 36 GB | 128 GB | 28% | + +**PVE2** has significant headroom for additional VMs. 
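+
+When VMs are added or resized, these totals can be re-derived directly from the VM configs instead of updating the table by hand. A minimal sketch (run on the host itself; assumes the standard `qm` CLI, counts QEMU VMs only, and ignores LXC containers and socket counts):
+
+```bash
+# Sum allocated vCPUs and RAM across all defined VMs on this host
+total_cores=0; total_mem=0
+for id in $(qm list | awk 'NR>1 {print $1}'); do
+  cores=$(qm config "$id" | awk '/^cores:/ {print $2}')
+  mem=$(qm config "$id" | awk '/^memory:/ {print $2}')   # "memory:" is in MB
+  total_cores=$((total_cores + ${cores:-1}))
+  total_mem=$((total_mem + ${mem:-0}))
+done
+echo "Allocated: ${total_cores} vCPUs, $((total_mem / 1024)) GB RAM"
+```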
+ +--- + +## Adding a New VM + +### Quick Template + +```bash +# Create VM +ssh pve 'qm create VMID \ + --name myvm \ + --memory 4096 \ + --cores 2 \ + --net0 virtio,bridge=vmbr0 \ + --scsihw virtio-scsi-pci \ + --scsi0 nvme-mirror1:32 \ + --boot order=scsi0 \ + --ostype l26 \ + --agent enabled=1' + +# Attach ISO for installation +ssh pve 'qm set VMID --ide2 local:iso/ubuntu-22.04.iso,media=cdrom' + +# Start VM +ssh pve 'qm start VMID' + +# Access console +ssh pve 'qm vncproxy VMID' # Then connect with VNC client +# Or via Proxmox web UI +``` + +### Cloud-Init Template (Faster) + +Use cloud-init for automated VM deployment: + +```bash +# Download cloud image +ssh pve 'wget https://cloud-images.ubuntu.com/releases/22.04/release/ubuntu-22.04-server-cloudimg-amd64.img -O /var/lib/vz/template/iso/ubuntu-22.04-cloud.img' + +# Create VM +ssh pve 'qm create VMID --name myvm --memory 4096 --cores 2 --net0 virtio,bridge=vmbr0' + +# Import disk +ssh pve 'qm importdisk VMID /var/lib/vz/template/iso/ubuntu-22.04-cloud.img nvme-mirror1' + +# Attach disk +ssh pve 'qm set VMID --scsi0 nvme-mirror1:vm-VMID-disk-0' + +# Add cloud-init drive +ssh pve 'qm set VMID --ide2 nvme-mirror1:cloudinit' + +# Set boot disk +ssh pve 'qm set VMID --boot order=scsi0' + +# Configure cloud-init (user, SSH key, network) +ssh pve 'qm set VMID --ciuser hutson --sshkeys ~/.ssh/homelab.pub --ipconfig0 ip=10.10.10.XXX/24,gw=10.10.10.1' + +# Enable QEMU agent +ssh pve 'qm set VMID --agent enabled=1' + +# Resize disk (cloud images are small by default) +ssh pve 'qm resize VMID scsi0 +30G' + +# Start VM +ssh pve 'qm start VMID' +``` + +**Cloud-init VMs boot ready-to-use** with SSH keys, static IP, and user configured. + +--- + +## Adding a New LXC Container + +```bash +# Download template (if not already downloaded) +ssh pve 'pveam update' +ssh pve 'pveam available | grep ubuntu' +ssh pve 'pveam download local ubuntu-22.04-standard_22.04-1_amd64.tar.zst' + +# Create container +ssh pve 'pct create CTID local:vztmpl/ubuntu-22.04-standard_22.04-1_amd64.tar.zst \ + --hostname mycontainer \ + --memory 2048 \ + --cores 2 \ + --net0 name=eth0,bridge=vmbr0,ip=10.10.10.XXX/24,gw=10.10.10.1 \ + --rootfs local-zfs:8 \ + --unprivileged 1 \ + --features nesting=1 \ + --start 1' + +# Set root password +ssh pve 'pct exec CTID -- passwd' + +# Add SSH key +ssh pve 'pct exec CTID -- mkdir -p /root/.ssh' +ssh pve 'pct exec CTID -- bash -c "echo \"$(cat ~/.ssh/homelab.pub)\" >> /root/.ssh/authorized_keys"' +ssh pve 'pct exec CTID -- chmod 700 /root/.ssh && chmod 600 /root/.ssh/authorized_keys' +``` + +--- + +## GPU Passthrough Configuration + +### Current GPU Assignments + +| GPU | Location | Passed To | VMID | Purpose | +|-----|----------|-----------|------|---------| +| **NVIDIA Quadro P2000** | PVE | - | - | Proxmox host (Plex transcoding via driver) | +| **NVIDIA TITAN RTX** | PVE | saltbox, lmdev1 | 101, 111 | Media transcoding + AI dev (shared) | +| **NVIDIA RTX A6000** | PVE2 | trading-vm | 301 | AI trading (dedicated) | + +### How to Pass GPU to VM + +1. **Identify GPU PCI ID**: + ```bash + ssh pve 'lspci | grep -i nvidia' + # Example output: + # 81:00.0 VGA compatible controller: NVIDIA Corporation TU102 [TITAN RTX] (rev a1) + # 81:00.1 Audio device: NVIDIA Corporation TU102 High Definition Audio Controller (rev a1) + ``` + +2. **Pass GPU to VM** (include both VGA and Audio): + ```bash + ssh pve 'qm set VMID -hostpci0 81:00.0,pcie=1' + # If multi-function device (GPU + Audio), use: + ssh pve 'qm set VMID -hostpci0 81:00,pcie=1' + ``` + +3. 
**Configure VM for GPU**: + ```bash + # Set machine type to q35 + ssh pve 'qm set VMID --machine q35' + + # Set BIOS to OVMF (UEFI) + ssh pve 'qm set VMID --bios ovmf' + + # Add EFI disk + ssh pve 'qm set VMID --efidisk0 nvme-mirror1:1,format=raw,efitype=4m,pre-enrolled-keys=1' + ``` + +4. **Reboot VM** and install NVIDIA drivers inside the VM + +**See**: [GPU-PASSTHROUGH.md](#) (coming soon) for detailed guide + +--- + +## Backup Priority + +See [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md) for complete backup plan. + +### Critical VMs (Must Backup) + +| Priority | VMID | Name | Reason | +|----------|------|------|--------| +| 🔴 **CRITICAL** | 100 | truenas | All storage lives here - catastrophic if lost | +| 🟡 **HIGH** | 101 | saltbox | Complex media stack config | +| 🟡 **HIGH** | 110 | homeassistant | Home automation config | +| 🟡 **HIGH** | 300 | gitea-vm | Git repositories (code, docs) | +| 🟡 **HIGH** | 301 | trading-vm | Trading algorithms and AI models | + +### Medium Priority + +| VMID | Name | Notes | +|------|------|-------| +| 200 | pihole | Easy to rebuild, but DNS config valuable | +| 202 | traefik | Config files backed up separately | + +### Low Priority (Ephemeral/Rebuildable) + +| VMID | Name | Notes | +|------|------|-------| +| 105 | fs-dev | Development - code is in Git | +| 111 | lmdev1 | Ephemeral development | +| 201 | copyparty | Simple app, easy to redeploy | +| 206 | docker-host | Docker Compose files backed up separately | + +--- + +## Quick Reference Commands + +```bash +# List all VMs +ssh pve 'qm list' +ssh pve2 'qm list' + +# List all containers +ssh pve 'pct list' + +# Start/stop VM +ssh pve 'qm start VMID' +ssh pve 'qm stop VMID' +ssh pve 'qm shutdown VMID' # Graceful + +# Start/stop container +ssh pve 'pct start CTID' +ssh pve 'pct stop CTID' +ssh pve 'pct shutdown CTID' # Graceful + +# VM console +ssh pve 'qm terminal VMID' + +# Container console +ssh pve 'pct enter CTID' + +# Clone VM +ssh pve 'qm clone VMID NEW_VMID --name newvm' + +# Delete VM +ssh pve 'qm destroy VMID' + +# Delete container +ssh pve 'pct destroy CTID' +``` + +--- + +## Related Documentation + +- [STORAGE.md](STORAGE.md) - Storage pool assignments +- [SSH-ACCESS.md](SSH-ACCESS.md) - How to access VMs +- [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md) - VM backup strategy +- [POWER-MANAGEMENT.md](POWER-MANAGEMENT.md) - VM resource optimization +- [NETWORK.md](NETWORK.md) - Which bridge to use for new VMs + +--- + +**Last Updated**: 2025-12-22