Compare commits

...

3 Commits

Author SHA1 Message Date
Hutson
eddd98c57f Auto-sync: 20260105-122831 2026-01-05 12:28:33 -05:00
Hutson
56b82df497 Complete Phase 2 documentation: Add HARDWARE, SERVICES, MONITORING, MAINTENANCE
Phase 2 documentation implementation:
- Created HARDWARE.md: Complete hardware inventory (servers, GPUs, storage, network cards)
- Created SERVICES.md: Service inventory with URLs, credentials, health checks (25+ services)
- Created MONITORING.md: Health monitoring recommendations, alert setup, implementation plan
- Created MAINTENANCE.md: Regular procedures, update schedules, testing checklists
- Updated README.md: Added all Phase 2 documentation links
- Updated CLAUDE.md: Cleaned up to quick reference only (1340→377 lines)

All detailed content now in specialized documentation files with cross-references.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-23 00:34:21 -05:00
Hutson
23e9df68c9 Update Happy Coder docs with complete setup flow and troubleshooting
- Expand Mobile Access Setup with full authentication steps
  (HAPPY_SERVER_URL, happy auth login, happy connect claude, local claude login)
- Fix launchd path: ~/Library/LaunchAgents/ not /Library/LaunchDaemons/
- Add Common Issues troubleshooting table with fixes for:
  - Invalid API key (Claude not logged in locally)
  - Failed to start daemon (stale lock files)
  - Sessions not showing (missing HAPPY_SERVER_URL)
  - Slow responses (Cloudflare proxy enabled)
- Update DNS note: Cloudflare proxy disabled for WebSocket performance
- Add .zshrc to Files & Configuration table

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-21 13:28:30 -05:00
25 changed files with 8079 additions and 1017 deletions

View File

@@ -0,0 +1,5 @@
# This directory is a Syncthing folder marker.
# Do not delete.
folderID: homelab
created: 2025-12-23T00:39:52-05:00

BACKUP-STRATEGY.md (new file, +358 lines)

@@ -0,0 +1,358 @@
# Backup Strategy
## 🚨 Current Status: CRITICAL GAPS IDENTIFIED
This document outlines the backup strategy for the homelab infrastructure. **As of 2025-12-22, there are significant gaps in backup coverage that need to be addressed.**
## Executive Summary
### What We Have ✅
- **Syncthing**: File synchronization across 5+ devices
- **ZFS on TrueNAS**: Copy-on-write filesystem with snapshot capability (not yet configured)
- **Proxmox**: Built-in backup capabilities (not yet configured)
### What We DON'T Have 🚨
- ❌ No documented VM/CT backups
- ❌ No ZFS snapshot schedule
- ❌ No offsite backups
- ❌ No disaster recovery plan
- ❌ No tested restore procedures
- ❌ No configuration backups
**Risk Level**: HIGH - A catastrophic failure could result in significant data loss.
---
## Current State Analysis
### Syncthing (File Synchronization)
**What it is**: Real-time file sync across devices
**What it is NOT**: A backup solution
| Folder | Devices | Size | Protected? |
|--------|---------|------|------------|
| documents | Mac Mini, MacBook, TrueNAS, Windows PC, Phone | 11 GB | ⚠️ Sync only |
| downloads | Mac Mini, TrueNAS | 38 GB | ⚠️ Sync only |
| pictures | Mac Mini, MacBook, TrueNAS, Phone | Unknown | ⚠️ Sync only |
| notes | Mac Mini, MacBook, TrueNAS, Phone | Unknown | ⚠️ Sync only |
| config | Mac Mini, MacBook, TrueNAS | Unknown | ⚠️ Sync only |
**Limitations**:
- ❌ Accidental deletion → deleted everywhere
- ❌ Ransomware/corruption → spreads everywhere
- ❌ No point-in-time recovery
- ❌ No version history (unless file versioning enabled - not documented)
**Verdict**: Syncthing provides redundancy and availability, NOT backup protection.
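If Syncthing file versioning is enabled per folder, deleted or overwritten files at least leave old copies behind. A minimal sketch, assuming a recent Syncthing release, the local API key used elsewhere in these docs, and the folder ID `documents` from the table above:
```bash
# Hedged sketch: enable staggered file versioning on the "documents" folder via
# Syncthing's config REST API. Endpoint and field names assume a recent release;
# verify against the running version before relying on this.
API_KEY="oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5"
curl -s -X PATCH -H "X-API-Key: $API_KEY" -H "Content-Type: application/json" \
  -d '{"versioning": {"type": "staggered", "params": {"maxAge": "2592000"}}}' \
  "http://127.0.0.1:8384/rest/config/folders/documents"
```
Even with versioning enabled, this is still not a substitute for snapshots or offsite copies.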
### ZFS on TrueNAS (Potential Backup Target)
**Current Status**: ❓ Unknown - snapshots may or may not be configured
**Needs Investigation**:
```bash
# Check if snapshots exist
ssh truenas 'zfs list -t snapshot'
# Check if automated snapshots are configured
ssh truenas 'cat /etc/cron.d/zfs-auto-snapshot' || echo "Not configured"
# Check snapshot schedule via TrueNAS API/UI
```
**If configured**, ZFS snapshots provide:
- ✅ Point-in-time recovery
- ✅ Protection against accidental deletion
- ✅ Fast rollback capability
- ⚠️ Still single location (no offsite protection)
### Proxmox VM/CT Backups
**Current Status**: ❓ Unknown - no backup jobs documented
**Needs Investigation**:
```bash
# Check backup configuration
ssh pve 'pvesh get /cluster/backup'
# Check if any backups exist
ssh pve 'ls -lh /var/lib/vz/dump/'
ssh pve2 'ls -lh /var/lib/vz/dump/'
```
**Critical VMs Needing Backup**:
| VM/CT | VMID | Priority | Notes |
|-------|------|----------|-------|
| TrueNAS | 100 | 🔴 CRITICAL | All storage lives here |
| Saltbox | 101 | 🟡 HIGH | Media stack, complex config |
| homeassistant | 110 | 🟡 HIGH | Home automation config |
| gitea-vm | 300 | 🟡 HIGH | Git repositories |
| pihole | 200 | 🟢 MEDIUM | DNS config (easy to rebuild) |
| traefik | 202 | 🟢 MEDIUM | Reverse proxy config |
| trading-vm | 301 | 🟡 HIGH | AI trading platform |
| lmdev1 | 111 | 🟢 LOW | Development (ephemeral) |
---
## Recommended Backup Strategy
### Tier 1: Local Snapshots (IMPLEMENT IMMEDIATELY)
**ZFS Snapshots on TrueNAS**
Schedule automatic snapshots for all datasets:
| Dataset | Frequency | Retention |
|---------|-----------|-----------|
| vault/documents | Every 15 min | 1 hour |
| vault/documents | Hourly | 24 hours |
| vault/documents | Daily | 30 days |
| vault/documents | Weekly | 12 weeks |
| vault/documents | Monthly | 12 months |
**Implementation**:
```bash
# Via TrueNAS UI: Storage → Snapshots → Add
# Or via CLI:
ssh truenas 'zfs snapshot vault/documents@daily-$(date +%Y%m%d)'
```
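For recurring snapshots, the TrueNAS UI's Periodic Snapshot Tasks can implement the schedule table above directly. As a rough fallback only, a cron-driven daily snapshot with 30-day pruning might look like this sketch (assumes TrueNAS SCALE's Linux userland; dataset name from the table above):
```bash
# Hedged sketch: daily snapshot of vault/documents plus pruning to the newest 30.
ssh truenas 'zfs snapshot vault/documents@daily-$(date +%Y%m%d)'
ssh truenas 'zfs list -H -t snapshot -d 1 -o name -s creation vault/documents \
  | grep "@daily-" | head -n -30 | xargs -r -n1 zfs destroy'
```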
**Proxmox VM Backups**
Configure weekly backups to local storage:
```bash
# Create backup job via Proxmox UI:
# Datacenter → Backup → Add
# - Schedule: Weekly (Sunday 2 AM)
# - Storage: local-zfs or nvme-mirror1
# - Mode: Snapshot (fast)
# - Retention: 4 backups
```
**Or via CLI**:
```bash
ssh pve 'pvesh create /cluster/backup --schedule "sun 02:00" --storage local-zfs --mode snapshot --prune-backups keep-last=4'
```
### Tier 2: Offsite Backups (CRITICAL GAP)
**Option A: Cloud Storage (Recommended)**
Use **rclone** or **restic** to sync critical data to cloud:
| Provider | Cost | Pros | Cons |
|----------|------|------|------|
| Backblaze B2 | $6/TB/mo | Cheap, reliable | Egress fees |
| AWS S3 Glacier | $4/TB/mo | Very cheap storage | Slow retrieval |
| Wasabi | $6.99/TB/mo | No egress fees | Minimum 90-day retention |
**Implementation Example (Backblaze B2)**:
```bash
# Install on TrueNAS
ssh truenas 'pkg install rclone restic'
# Configure B2
rclone config # Follow prompts for B2
# Daily backup of critical folders (crontab entry)
0 3 * * * rclone sync /mnt/vault/documents b2:homelab-backup/documents --transfers 4
```
**Option B: Offsite TrueNAS Replication**
- Set up second TrueNAS at friend/family member's house
- Use ZFS replication to sync snapshots
- Requires: a static IP or Tailscale connectivity, plus a trusted offsite host
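A minimal sketch of what that replication could look like, assuming an offsite box reachable via a hypothetical SSH alias `offsite-nas` and a hypothetical target pool named `backup`:
```bash
# Hedged sketch of Option B: push a snapshot of vault/documents to a second
# TrueNAS over SSH/Tailscale. "offsite-nas" and "backup/documents" are placeholders.
SNAP="vault/documents@daily-$(date +%Y%m%d)"
ssh truenas "zfs snapshot $SNAP"
# The first run sends everything; later runs should use incremental sends (zfs send -i).
ssh truenas "zfs send $SNAP" | ssh offsite-nas "zfs recv -u backup/documents"
```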
**Option C: USB Drive Rotation**
- Weekly backup to external USB drive
- Rotate 2-3 drives (one always offsite)
- Manual but simple
### Tier 3: Configuration Backups
**Proxmox Configuration**
```bash
# Backup /etc/pve (configs are already in cluster filesystem)
# But also backup to external location:
ssh pve 'tar czf /tmp/pve-config-$(date +%Y%m%d).tar.gz /etc/pve /etc/network/interfaces /etc/systemd/system/*.service'
# Copy to safe location
scp pve:/tmp/pve-config-*.tar.gz ~/Backups/proxmox/
```
**VM-Specific Configs**
- Traefik configs: `/etc/traefik/` on CT 202
- Saltbox configs: `/srv/git/saltbox/` on VM 101
- Home Assistant: `/config/` on VM 110
**Script to backup all configs**:
```bash
#!/bin/bash
# Save as ~/bin/backup-homelab-configs.sh
DATE=$(date +%Y%m%d)
BACKUP_DIR=~/Backups/homelab-configs/$DATE
mkdir -p $BACKUP_DIR
# Proxmox configs
ssh pve 'tar czf -' /etc/pve /etc/network > $BACKUP_DIR/pve-config.tar.gz
ssh pve2 'tar czf -' /etc/pve /etc/network > $BACKUP_DIR/pve2-config.tar.gz
# Traefik
ssh pve 'pct exec 202 -- tar czf -' /etc/traefik > $BACKUP_DIR/traefik-config.tar.gz
# Saltbox
ssh saltbox 'tar czf -' /srv/git/saltbox > $BACKUP_DIR/saltbox-config.tar.gz
# Home Assistant (note: 'qm guest exec' wraps output in JSON, so piping a raw tar
# stream this way may not yield a valid archive -- consider SSH or HA's built-in backups)
ssh pve 'qm guest exec 110 -- tar czf -' /config > $BACKUP_DIR/homeassistant-config.tar.gz
echo "Configs backed up to $BACKUP_DIR"
```
---
## Disaster Recovery Scenarios
### Scenario 1: Single VM Failure
**Impact**: Medium
**Recovery Time**: 30-60 minutes
1. Restore from Proxmox backup:
```bash
ssh pve 'qmrestore /path/to/backup.vma.zst VMID'
```
2. Start VM and verify
3. Update IP if needed
### Scenario 2: TrueNAS Failure
**Impact**: CATASTROPHIC (all storage lost)
**Recovery Time**: Unknown - NO PLAN
**Current State**: 🚨 NO RECOVERY PLAN
**Needed**:
- Offsite backup of critical datasets
- Documented ZFS pool creation steps
- Share configuration export
### Scenario 3: Complete PVE Server Failure
**Impact**: SEVERE
**Recovery Time**: 4-8 hours
**Current State**: ⚠️ PARTIALLY RECOVERABLE
**Needed**:
- VM backups stored on TrueNAS or PVE2
- Proxmox reinstall procedure
- Network config documentation
### Scenario 4: Complete Site Disaster (Fire/Flood)
**Impact**: TOTAL LOSS
**Recovery Time**: Unknown
**Current State**: 🚨 NO RECOVERY PLAN
**Needed**:
- Offsite backups (cloud or physical)
- Critical data prioritization
- Restore procedures
---
## Action Plan
### Immediate (Next 7 Days)
- [ ] **Audit existing backups**: Check if ZFS snapshots or Proxmox backups exist
```bash
ssh truenas 'zfs list -t snapshot'
ssh pve 'ls -lh /var/lib/vz/dump/'
```
- [ ] **Enable ZFS snapshots**: Configure via TrueNAS UI for critical datasets
- [ ] **Configure Proxmox backup jobs**: Weekly backups of critical VMs (100, 101, 110, 300)
- [ ] **Test restore**: Pick one VM, back it up, restore it to verify process works
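A rough sketch of such a test (the VMID, archive timestamp, and target storage are placeholders; restoring to scratch VMID 9999 leaves the original VM untouched):
```bash
# Hedged sketch: back up one VM, restore it as a scratch VM, verify, then discard it.
ssh pve 'vzdump <VMID> --dumpdir /var/lib/vz/dump --mode snapshot --compress zstd'
ssh pve 'ls -lh /var/lib/vz/dump/ | tail -3'     # note the new .vma.zst archive name
ssh pve 'qmrestore /var/lib/vz/dump/vzdump-qemu-<VMID>-<TIMESTAMP>.vma.zst 9999 --storage local-zfs'
ssh pve 'qm start 9999 && qm status 9999'        # verify it boots
ssh pve 'qm stop 9999 && qm destroy 9999'        # clean up the test clone
```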
### Short-term (Next 30 Days)
- [ ] **Set up offsite backup**: Choose provider (Backblaze B2 recommended)
- [ ] **Install backup tools**: rclone or restic on TrueNAS
- [ ] **Configure daily cloud sync**: Critical folders to cloud storage
- [ ] **Document restore procedures**: Step-by-step guides for each scenario
### Long-term (Next 90 Days)
- [ ] **Implement monitoring**: Alerts for backup failures
- [ ] **Quarterly restore test**: Verify backups actually work
- [ ] **Backup rotation policy**: Automate old backup cleanup
- [ ] **Configuration backup automation**: Weekly cron job
---
## Monitoring & Validation
### Backup Health Checks
```bash
# Check last ZFS snapshot
ssh truenas 'zfs list -t snapshot -o name,creation -s creation | tail -5'
# Check Proxmox backup status
ssh pve 'pvesh get /cluster/backup-info/not-backed-up'
# Check cloud sync status (if using rclone)
ssh truenas 'rclone ls b2:homelab-backup | wc -l'
```
### Alerts to Set Up
- Email alert if no snapshot created in 24 hours
- Email alert if Proxmox backup fails
- Email alert if cloud sync fails
- Weekly backup status report
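A minimal sketch of the first alert, assuming a working `mail` command on the machine running the check (the address is a placeholder):
```bash
# Hedged sketch: email if the newest snapshot of vault/documents is older than 24h.
LATEST=$(ssh truenas 'zfs list -Hp -t snapshot -d 1 -o creation -s creation vault/documents | tail -1')
NOW=$(date +%s)
if [ -z "$LATEST" ] || [ $((NOW - LATEST)) -gt 86400 ]; then
  echo "No snapshot of vault/documents in the last 24 hours" \
    | mail -s "ZFS snapshot alert" hutson@example.com
fi
```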
---
## Cost Estimate
**Monthly Backup Costs**:
| Component | Cost | Notes |
|-----------|------|-------|
| Local storage (already owned) | $0 | Using existing TrueNAS |
| Proxmox backups (local) | $0 | Using existing storage |
| Cloud backup (1 TB) | $6-10/mo | Backblaze B2 or Wasabi |
| **Total** | **~$10/mo** | Minimal cost for peace of mind |
**One-time**:
- External USB drives (3x 4TB): ~$300 (optional, for rotation backups)
---
## Related Documentation
- [STORAGE.md](STORAGE.md) - ZFS pool layouts and capacity
- [VMS.md](VMS.md) - VM inventory and prioritization
- [DISASTER-RECOVERY.md](#) - Recovery procedures (coming soon)
---
**Last Updated**: 2025-12-22
**Status**: 🚨 CRITICAL GAPS - IMMEDIATE ACTION REQUIRED

CLAUDE.md (1275 lines changed)

File diff suppressed because it is too large.

GATEWAY.md (new file, +339 lines)

@@ -0,0 +1,339 @@
# UniFi Gateway (UCG-Fiber)
Documentation for the UniFi Cloud Gateway Fiber (10.10.10.1) - the primary network gateway and router.
## Overview
| Property | Value |
|----------|-------|
| **Device** | UniFi Cloud Gateway Fiber (UCG-Fiber) |
| **IP Address** | 10.10.10.1 |
| **SSH User** | root |
| **SSH Auth** | SSH key (`~/.ssh/id_ed25519`) |
| **Host Aliases** | `ucg-fiber`, `gateway` |
| **Firmware** | v4.4.9 (as of 2026-01-02) |
| **UniFi Core** | 4.4.19 |
| **RAM** | 2.9 GB (shared with UniFi apps) |
---
## SSH Access
SSH key authentication is configured. Use host aliases:
```bash
# Quick access
ssh ucg-fiber 'hostname'
ssh gateway 'free -m'
# Or use IP directly
ssh root@10.10.10.1 'uptime'
```
**Note**: SSH key may need re-deployment after firmware updates if UniFi clears authorized_keys.
---
## Monitoring Services
Two custom monitoring services run on the gateway to prevent and diagnose issues.
### Internet Watchdog Service
**Purpose**: Auto-reboots gateway if internet connectivity is lost for 5+ minutes
**Location**: `/data/scripts/internet-watchdog.sh`
**How it works**:
1. Pings 1.1.1.1, 8.8.8.8, 208.67.222.222 every 60 seconds
2. If all three fail, increments failure counter
3. After 5 consecutive failures (~5 minutes), triggers reboot
4. Logs all activity to `/var/log/internet-watchdog.log`
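The deployed script is not reproduced here; a minimal sketch of the loop described above (illustration only, not the actual `/data/scripts/internet-watchdog.sh`):
```bash
#!/bin/bash
# Hedged sketch of the watchdog logic -- illustration only.
LOG=/var/log/internet-watchdog.log
FAILS=0
echo "$(date '+%F %T') - Watchdog started" >> "$LOG"
while true; do
  if ping -c1 -W2 1.1.1.1 >/dev/null 2>&1 || ping -c1 -W2 8.8.8.8 >/dev/null 2>&1 \
     || ping -c1 -W2 208.67.222.222 >/dev/null 2>&1; then
    [ "$FAILS" -gt 0 ] && echo "$(date '+%F %T') - Internet restored after $FAILS failures" >> "$LOG"
    FAILS=0
  else
    FAILS=$((FAILS + 1))
    echo "$(date '+%F %T') - Internet check failed ($FAILS/5)" >> "$LOG"
    [ "$FAILS" -ge 5 ] && { echo "$(date '+%F %T') - Rebooting" >> "$LOG"; reboot; }
  fi
  sleep 60
done
```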
**Commands**:
```bash
# Check service status
ssh ucg-fiber 'systemctl status internet-watchdog'
# View recent logs
ssh ucg-fiber 'tail -50 /var/log/internet-watchdog.log'
# Stop temporarily (if troubleshooting)
ssh ucg-fiber 'systemctl stop internet-watchdog'
# Restart
ssh ucg-fiber 'systemctl restart internet-watchdog'
```
**Log Format**:
```
2026-01-02 22:45:01 - Watchdog started
2026-01-02 22:46:01 - Internet check failed (1/5)
2026-01-02 22:47:01 - Internet restored after 1 failures
```
---
### Memory Monitor Service
**Purpose**: Logs memory usage and top processes every 10 minutes for diagnostics
**Location**: `/data/scripts/memory-monitor.sh`
**Log File**: `/data/logs/memory-history.log`
**How it works**:
1. Every 10 minutes, logs current memory usage (`free -m`)
2. Logs top 12 memory-consuming processes
3. Auto-rotates log when it exceeds 10MB (keeps one .old file)
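For reference, the core of such a logger might look like the sketch below (not the deployed `/data/scripts/memory-monitor.sh`):
```bash
# Hedged sketch of one logging pass plus the 10MB rotation described above.
LOG=/data/logs/memory-history.log
{
  echo "========== $(date '+%F %T') =========="
  echo "--- MEMORY ---"
  free -m
  echo "--- TOP MEMORY PROCESSES ---"
  ps -eo pid,rss,comm --sort=-rss | head -13   # header line + top 12 processes
} >> "$LOG"
if [ "$(stat -c%s "$LOG" 2>/dev/null || echo 0)" -gt $((10 * 1024 * 1024)) ]; then
  mv -f "$LOG" "$LOG.old"                      # keep a single .old copy
fi
```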
**Commands**:
```bash
# Check service status
ssh ucg-fiber 'systemctl status memory-monitor'
# View recent memory history
ssh ucg-fiber 'tail -100 /data/logs/memory-history.log'
# Check current memory usage
ssh ucg-fiber 'free -m'
# See top memory consumers right now
ssh ucg-fiber 'ps -eo pid,rss,comm --sort=-rss | head -12'
```
**Log Format**:
```
========== 2026-01-02 22:30:00 ==========
--- MEMORY ---
total used free shared buff/cache available
Mem: 2892 1890 102 456 899 1002
Swap: 512 88 424
--- TOP MEMORY PROCESSES ---
PID RSS COMMAND
1234 327456 unifi-protect
2345 252108 mongod
3456 236544 java
...
```
---
## Known Memory Consumers
| Process | Typical Memory | Purpose |
|---------|----------------|---------|
| unifi-protect | ~320 MB | Camera/NVR management |
| mongod | ~250 MB | UniFi configuration database |
| java (controller) | ~230 MB | UniFi Network controller |
| postgres | ~180 MB | PostgreSQL database |
| unifi-core | ~150 MB | UniFi OS core |
| tailscaled | ~80 MB | Tailscale VPN |
**Total available**: ~2.9 GB
**Typical usage**: ~1.8-2.0 GB (leaves ~1 GB free)
**Warning threshold**: <500 MB free
**Critical**: <200 MB free or swap >50% used
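A quick way to check those thresholds remotely (a sketch; threshold values match the figures above):
```bash
# Hedged sketch: read the "available" column from free -m and compare to thresholds.
AVAIL=$(ssh ucg-fiber "free -m | awk '/^Mem:/ {print \$7}'")
if   [ "$AVAIL" -lt 200 ]; then echo "CRITICAL: ${AVAIL} MB available"
elif [ "$AVAIL" -lt 500 ]; then echo "WARNING: ${AVAIL} MB available"
else echo "OK: ${AVAIL} MB available"
fi
```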
---
## Disabled Services
The following services were disabled to reduce memory usage:
| Service | Memory Saved | Reason Disabled |
|---------|--------------|-----------------|
| UniFi Connect | ~200 MB | Not needed (cameras use Protect) |
To re-enable if needed:
```bash
ssh ucg-fiber 'systemctl enable unifi-connect && systemctl start unifi-connect'
```
---
## Common Issues
### Gateway Freeze / Network Loss
**Symptoms**:
- All devices lose internet
- Cannot ping 10.10.10.1
- Physical reboot required
**Root Cause**: Memory exhaustion causing soft lockup
**Prevention**:
1. Internet watchdog auto-reboots after 5 min outage
2. Memory monitor logs help identify runaway processes
3. UniFi Connect disabled to free ~200 MB
**Post-Incident Analysis**:
```bash
# Check memory history for spike before freeze
ssh ucg-fiber 'grep -B5 "Swap:" /data/logs/memory-history.log | tail -50'
# Check watchdog logs
ssh ucg-fiber 'cat /var/log/internet-watchdog.log'
# Check system logs for errors
ssh ucg-fiber 'dmesg | tail -100'
ssh ucg-fiber 'journalctl -p err --since "1 hour ago"'
```
---
### High Memory Usage
**Check current state**:
```bash
ssh ucg-fiber 'free -m && echo "---" && ps -eo pid,rss,comm --sort=-rss | head -15'
```
**If swap is heavily used**:
```bash
# Check swap usage
ssh ucg-fiber 'cat /proc/swaps'
# See what's in swap
ssh ucg-fiber 'for pid in $(ls /proc | grep -E "^[0-9]+$"); do
swap=$(grep VmSwap /proc/$pid/status 2>/dev/null | awk "{print \$2}");
[ "$swap" -gt 10000 ] 2>/dev/null && echo "$pid: ${swap}kB - $(cat /proc/$pid/comm)";
done | sort -t: -k2 -rn | head -10'
```
**Consider reboot if**:
- Available memory <200 MB
- Swap usage >300 MB
- System becoming unresponsive
---
### Tailscale Issues
**Check Tailscale status**:
```bash
ssh ucg-fiber 'tailscale status'
```
**Common errors and fixes**:
| Error | Fix |
|-------|-----|
| `DNS resolution failed` | Check upstream DNS (Pi-hole at 10.10.10.10) |
| `TLS handshake failed` | Usually temporary; Tailscale auto-reconnects |
| `Not connected` | `ssh ucg-fiber 'tailscale up'` |
---
## Firmware Updates
**Check current version**:
```bash
ssh ucg-fiber 'ubnt-systool version'
```
**Update process**:
1. Check UniFi site for latest stable firmware
2. Download via UI or CLI
3. Schedule update during low-usage time
**After update**:
- Verify SSH key still works
- Check custom services still running
- Verify Tailscale reconnects
**Re-deploy SSH key if needed**:
```bash
ssh-copy-id -i ~/.ssh/id_ed25519 root@10.10.10.1
```
---
## Service Locations
| File | Purpose |
|------|---------|
| `/data/scripts/internet-watchdog.sh` | Watchdog script |
| `/data/scripts/memory-monitor.sh` | Memory monitor script |
| `/etc/systemd/system/internet-watchdog.service` | Watchdog systemd unit |
| `/etc/systemd/system/memory-monitor.service` | Memory monitor systemd unit |
| `/var/log/internet-watchdog.log` | Watchdog log |
| `/data/logs/memory-history.log` | Memory history log |
**Note**: `/data/` persists across firmware updates. `/var/log/` may not.
---
## Quick Reference Commands
```bash
# System status
ssh ucg-fiber 'uptime && free -m'
# Check both monitoring services
ssh ucg-fiber 'systemctl status internet-watchdog memory-monitor'
# Memory history (last hour)
ssh ucg-fiber 'tail -60 /data/logs/memory-history.log'
# Watchdog activity
ssh ucg-fiber 'tail -20 /var/log/internet-watchdog.log'
# Network devices (ARP table)
ssh ucg-fiber 'cat /proc/net/arp'
# Tailscale status
ssh ucg-fiber 'tailscale status'
# System logs
ssh ucg-fiber 'journalctl -p warning --since "1 hour ago" | head -50'
```
---
## Backup Considerations
Custom services in `/data/scripts/` persist across firmware updates but may need:
- Systemd services re-enabled after major updates
- Script permissions re-applied if wiped
**Backup critical files**:
```bash
# Copy scripts locally for reference
scp ucg-fiber:/data/scripts/*.sh ~/Projects/homelab/data/scripts/
```
---
## Related Documentation
- [SSH-ACCESS.md](SSH-ACCESS.md) - SSH configuration and host aliases
- [NETWORK.md](NETWORK.md) - Network architecture
- [MONITORING.md](MONITORING.md) - Overall monitoring strategy
- [HOMEASSISTANT.md](HOMEASSISTANT.md) - Home Assistant integration
---
## Incident History
### 2025-12-27 to 2025-12-29: Gateway Freeze
**Timeline**:
- Dec 7: Firmware update to v4.4.9
- Dec 24: Last healthy system logs
- Dec 27-29: "No internet detected" errors in logs
- Dec 29+: Complete silence (gateway frozen)
- Jan 2: Physical reboot restored access
**Root Cause**: Memory exhaustion causing soft lockup (no crash dump saved)
**Resolution**:
- Deployed internet-watchdog service
- Deployed memory-monitor service
- Disabled UniFi Connect (~200 MB saved)
- Configured SSH key auth
---
**Last Updated**: 2026-01-02

HARDWARE.md (new file, +455 lines)

@@ -0,0 +1,455 @@
# Hardware Inventory
Complete hardware specifications for all homelab equipment.
## Servers
### PVE (10.10.10.120) - Primary Proxmox Server
#### CPU
- **Model**: AMD Ryzen Threadripper PRO 3975WX
- **Cores**: 32 cores / 64 threads
- **Base Clock**: 3.5 GHz
- **Boost Clock**: 4.2 GHz
- **TDP**: 280W
- **Architecture**: Zen 2 (7nm)
- **Socket**: sWRX8 (Threadripper PRO)
- **Features**: ECC support, PCIe 4.0
#### RAM
- **Capacity**: 128 GB
- **Type**: DDR4 ECC Registered
- **Speed**: Unknown (needs investigation)
- **Channels**: 8-channel DDR4 (sWRX8 platform)
- **Idle Power**: ~30-40W
#### Storage
**OS/VM Storage:**
| Pool | Devices | Type | Capacity | Purpose |
|------|---------|------|----------|---------|
| `nvme-mirror1` | 2x Sabrent Rocket Q NVMe | ZFS Mirror | 3.6 TB usable | High-performance VM storage |
| `nvme-mirror2` | 2x Kingston SFYRD 2TB NVMe | ZFS Mirror | 1.8 TB usable | Additional fast VM storage |
| `rpool` | 2x Samsung 870 QVO 4TB SSD | ZFS Mirror | 3.6 TB usable | Proxmox OS, containers, backups |
**Total Storage**: ~9 TB usable
#### GPUs
| Model | Slot | VRAM | TDP | Purpose | Passed To |
|-------|------|------|-----|---------|-----------|
| NVIDIA Quadro P2000 | PCIe slot 1 | 5 GB GDDR5 | 75W | Plex transcoding | Host |
| NVIDIA TITAN RTX | PCIe slot 2 | 24 GB GDDR6 | 280W | AI workloads | Saltbox (101), lmdev1 (111) |
**Total GPU Power**: 75W + 280W = 355W (under load)
#### Network Cards
| Interface | Model | Speed | Purpose | Bridge |
|-----------|-------|-------|---------|--------|
| enp1s0 | Intel I210 (onboard) | 1 Gb | Management | vmbr0 |
| enp35s0f0 | Intel X520 (dual-port SFP+) | 10 Gb | High-speed LXC | vmbr1 |
| enp35s0f1 | Intel X520 (dual-port SFP+) | 10 Gb | High-speed VM | vmbr2 |
**10Gb Transceivers**: Intel FTLX8571D3BCV (SFP+ 10GBASE-SR, 850nm, multimode)
#### Storage Controllers
| Model | Interface | Purpose |
|-------|-----------|---------|
| LSI SAS2308 HBA | PCIe 3.0 x8 | Passed to TrueNAS VM for EMC enclosure |
| Samsung NVMe controller | PCIe | Passed to TrueNAS VM for ZFS caching |
#### Motherboard
- **Model**: Unknown - needs investigation
- **Chipset**: AMD WRX80 (required for Threadripper PRO)
- **Form Factor**: ATX/EATX
- **PCIe Slots**: Multiple PCIe 4.0 slots
- **Features**: IOMMU support, ECC memory
#### Power Supply
- **Model**: Unknown
- **Wattage**: Likely 1000W+ (needs investigation)
- **Type**: ATX, 80+ certification unknown
#### Cooling
- **CPU Cooler**: Unknown - likely large tower or AIO
- **Case Fans**: Unknown quantity
- **Note**: CPU temps 70-80°C under load (healthy)
---
### PVE2 (10.10.10.102) - Secondary Proxmox Server
#### CPU
- **Model**: AMD Ryzen Threadripper PRO 3975WX
- **Specs**: Same as PVE (32C/64T, 280W TDP)
#### RAM
- **Capacity**: 128 GB DDR4 ECC
- **Same specs as PVE**
#### Storage
| Pool | Devices | Type | Capacity | Purpose |
|------|---------|------|----------|---------|
| `nvme-mirror3` | 2x NVMe (model unknown) | ZFS Mirror | Unknown | High-performance VM storage |
| `local-zfs2` | 2x WD Red 6TB HDD | ZFS Mirror | ~6 TB usable | Bulk/archival storage (spins down) |
**HDD Spindown**: Configured for 30-min idle spindown (saves ~10-16W)
#### GPUs
| Model | Slot | VRAM | TDP | Purpose | Passed To |
|-------|------|------|-----|---------|-----------|
| NVIDIA RTX A6000 | PCIe slot 1 | 48 GB GDDR6 | 300W | AI trading workloads | trading-vm (301) |
#### Network Cards
| Interface | Model | Speed | Purpose |
|-----------|-------|-------|---------|
| nic1 | Unknown (onboard) | 1 Gb | Management |
**Note**: MTU set to 9000 for jumbo frames
#### Motherboard
- **Model**: Unknown
- **Chipset**: AMD WRX80
- **Similar to PVE**
---
## Network Equipment
### UniFi Cloud Gateway Fiber (UCG-Fiber)
- **Model**: UniFi Cloud Gateway Fiber
- **IP**: 10.10.10.1
- **Ports**: Multiple 1Gb + SFP+ uplink
- **Features**: Router, firewall, VPN, IDS/IPS
- **MTU**: 9216 (supports jumbo frames)
- **Tailscale**: Installed for VPN failover
### Switches
**Details needed** - investigate current switch setup:
- 10Gb switch for high-speed connections?
- 1Gb switch for general devices?
- PoE capabilities?
```bash
# Check what's connected to 10Gb interfaces
ssh pve 'ip link show enp35s0f0'
ssh pve 'ip link show enp35s0f1'
```
---
## Storage Hardware
### EMC Storage Enclosure
**See [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md) for complete details**
- **Model**: EMC KTN-STL4 (or similar)
- **Form Factor**: 4U rackmount
- **Drive Bays**: 25x 3.5" SAS/SATA
- **Controllers**: Dual LCC (Link Control Cards)
- **Connection**: SAS via LSI SAS2308 HBA
- **Passed to**: TrueNAS VM (VMID 100)
**Current Status**:
- LCC A: Active (working)
- LCC B: Failed (replacement ordered)
**Drive Inventory**: Unknown - needs audit
```bash
# Get drive list from TrueNAS
ssh truenas 'smartctl --scan'
ssh truenas 'lsblk'
```
### NVMe Drives
| Model | Quantity | Capacity | Location | Pool |
|-------|----------|----------|----------|------|
| Sabrent Rocket Q | 2 | Unknown | PVE | nvme-mirror1 |
| Kingston SFYRD | 2 | 2 TB each | PVE | nvme-mirror2 |
| Unknown model | 2 | Unknown | PVE2 | nvme-mirror3 |
| Samsung (model unknown) | 1 | Unknown | TrueNAS (passed) | ZFS cache |
### SSDs
| Model | Quantity | Capacity | Location | Pool |
|-------|----------|----------|----------|------|
| Samsung 870 QVO | 2 | 4 TB each | PVE | rpool |
### HDDs
| Model | Quantity | Capacity | Location | Pool |
|-------|----------|----------|----------|------|
| WD Red | 2 | 6 TB each | PVE2 | local-zfs2 |
| Unknown (in EMC) | Unknown | Unknown | TrueNAS | vault |
---
## UPS
### Current UPS
| Specification | Value |
|---------------|-------|
| **Model** | CyberPower OR2200PFCRT2U |
| **Capacity** | 2200VA / 1320W |
| **Form Factor** | 2U rackmount |
| **Input** | NEMA 5-15P (rewired from 5-20P) |
| **Outlets** | 2x 5-20R + 6x 5-15R |
| **Output** | PFC Sinewave |
| **Runtime** | ~15-20 min @ 33% load |
| **Interface** | USB (connected to PVE) |
**See [UPS.md](UPS.md) for configuration details**
---
## Client Devices
### Mac Mini (Hutson's Workstation)
- **Model**: Unknown generation
- **CPU**: Unknown
- **RAM**: Unknown
- **Storage**: Unknown
- **Network**: 1Gb Ethernet (en0) - MTU 9000
- **Tailscale IP**: 100.108.89.58
- **Local IP**: 10.10.10.125 (static)
- **Purpose**: Primary workstation, Happy Coder daemon host
### MacBook (Mobile)
- **Model**: Unknown
- **Network**: Wi-Fi + Ethernet adapter
- **Tailscale IP**: Unknown
- **Purpose**: Mobile work, development
### Windows PC
- **Model**: Unknown
- **CPU**: Unknown
- **Network**: 1Gb Ethernet
- **IP**: 10.10.10.150
- **Purpose**: Gaming, Windows development, Syncthing node
### Phone (Android)
- **Model**: Unknown
- **IP**: 10.10.10.54 (when on Wi-Fi)
- **Purpose**: Syncthing mobile node, Happy Coder client
---
## Rack Layout (If Applicable)
**Needs documentation** - Current rack configuration unknown
Suggested format:
```
U42: Blank panel
U41: UPS (CyberPower 2U)
U40: UPS (CyberPower 2U)
U39: Switch (10Gb)
U38-U35: EMC Storage Enclosure (4U)
U34: PVE Server
U33: PVE2 Server
...
```
---
## Power Consumption
### Measured Power Draw
| Component | Idle | Typical | Peak | Notes |
|-----------|------|---------|------|-------|
| PVE Server | 250-350W | 500W | 750W | CPU + GPUs + storage |
| PVE2 Server | 200-300W | 400W | 600W | CPU + GPU + storage |
| Network Gear | ~50W | ~50W | ~50W | Router + switches |
| **Total** | **500-700W** | **~950W** | **~1400W** | Exceeds UPS under peak load |
**UPS Capacity**: 1320W
**Typical Load**: 33-50% (safe margin)
**Peak Load**: Can exceed UPS capacity temporarily (acceptable)
### Power Optimizations Applied
**See [POWER-MANAGEMENT.md](POWER-MANAGEMENT.md) for details**
- KSMD disabled: ~60-80W saved
- CPU governors: ~60-120W saved
- Syncthing rescans: ~60-80W saved
- HDD spindown: ~10-16W saved when idle
- **Total savings**: ~150-300W
---
## Thermal Management
### CPU Cooling
**PVE & PVE2**:
- CPU cooler: Unknown model
- Thermal paste: Unknown, likely needs refresh if temps >85°C
- Target temp: 70-80°C under load
- Max safe: 90°C Tctl (Threadripper PRO spec)
### GPU Cooling
All GPUs use their stock coolers with automatic fan control:
- TITAN RTX: 2-3W idle, 280W load
- RTX A6000: 11W idle, 300W load
- Quadro P2000: 25W constant (Plex active)
### Case Airflow
**Unknown** - needs investigation:
- Case model?
- Fan configuration?
- Positive or negative pressure?
---
## Cable Management
### Network Cables
| Connection | Type | Length | Speed |
|------------|------|--------|-------|
| PVE → Switch (10Gb) | OM3 fiber | Unknown | 10Gb |
| PVE2 → Router | Cat6 | Unknown | 1Gb |
| Mac Mini → Switch | Cat6 | Unknown | 1Gb |
| TrueNAS → EMC | SAS cable | Unknown | 6Gb/s |
### Power Cables
**Critical**: All servers on UPS battery-backed outlets
---
## Maintenance Schedule
### Annual Maintenance
- [ ] Clean dust from servers (every 6-12 months)
- [ ] Check thermal paste on CPUs (every 2-3 years)
- [ ] Test UPS battery runtime (annually)
- [ ] Verify all fans operational
- [ ] Check for bulging capacitors on PSUs
### Drive Health
```bash
# Check SMART status on all drives
ssh pve 'smartctl -a /dev/nvme0'
ssh pve2 'smartctl -a /dev/sda'
ssh truenas 'smartctl --scan | while read dev type; do echo "=== $dev ==="; smartctl -a $dev | grep -E "Model|Serial|Health|Reallocated|Current_Pending"; done'
```
### Temperature Monitoring
```bash
# Check all temps (needs lm-sensors installed)
ssh pve 'sensors'
ssh pve2 'sensors'
```
---
## Warranty & Purchase Info
**Needs documentation**:
- When were servers purchased?
- Where were components bought?
- Any warranties still active?
- Replacement part sources?
---
## Upgrade Path
### Short-term Upgrades (< 6 months)
- [ ] 20A circuit for UPS (restore original 5-20P plug)
- [ ] Document missing hardware specs
- [ ] Label all cables
- [ ] Create rack diagram
### Medium-term Upgrades (6-12 months)
- [ ] Additional 10Gb NIC for PVE2?
- [ ] More NVMe storage?
- [ ] Upgrade network switches?
- [ ] Replace EMC enclosure with newer model?
### Long-term Upgrades (1-2 years)
- [ ] CPU upgrade to newer Threadripper?
- [ ] RAM expansion to 256GB?
- [ ] Additional GPU for AI workloads?
- [ ] Migrate to PCIe 5.0 storage?
---
## Investigation Needed
High-priority items to document:
- [ ] Get exact motherboard model (both servers)
- [ ] Get PSU model and wattage
- [ ] CPU cooler models
- [ ] Network switch models and configuration
- [ ] Complete drive inventory in EMC enclosure
- [ ] RAM speed and timings
- [ ] Case models
- [ ] Exact NVMe models for all drives
**Commands to gather info**:
```bash
# Motherboard
ssh pve 'dmidecode -t baseboard'
# CPU details
ssh pve 'lscpu'
# RAM details
ssh pve 'dmidecode -t memory | grep -E "Size|Speed|Manufacturer"'
# Storage devices
ssh pve 'lsblk -o NAME,SIZE,TYPE,TRAN,MODEL'
# Network cards
ssh pve 'lspci | grep -i network'
# GPU details
ssh pve 'lspci | grep -i vga'
ssh pve 'nvidia-smi -L' # If nvidia-smi available
```
---
## Related Documentation
- [VMS.md](VMS.md) - VM resource allocation
- [STORAGE.md](STORAGE.md) - Storage pools and usage
- [POWER-MANAGEMENT.md](POWER-MANAGEMENT.md) - Power optimizations
- [UPS.md](UPS.md) - UPS configuration
- [NETWORK.md](NETWORK.md) - Network configuration
- [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md) - Storage enclosure details
---
**Last Updated**: 2025-12-22
**Status**: ⚠️ Incomplete - many specs need investigation

View File

@@ -131,6 +131,42 @@ curl -s -H "Authorization: Bearer $HA_TOKEN" \
- **Philips Hue** - Lights
- **Sonos** - Speakers
- **Motion Sensors** - Various locations
- **NUT (Network UPS Tools)** - UPS monitoring (added 2025-12-21)
### NUT / UPS Integration
Monitors the CyberPower OR2200PFCRT2U UPS connected to PVE.
**Connection:**
- Host: 10.10.10.120
- Port: 3493
- Username: upsmon
- Password: upsmon123
**Entities:**
| Entity ID | Description |
|-----------|-------------|
| `sensor.cyberpower_battery_charge` | Battery percentage |
| `sensor.cyberpower_load` | Current load % |
| `sensor.cyberpower_input_voltage` | Input voltage |
| `sensor.cyberpower_output_voltage` | Output voltage |
| `sensor.cyberpower_status` | Status (Online, On Battery, etc.) |
| `sensor.cyberpower_status_data` | Raw status (OL, OB, LB, CHRG) |
**Dashboard Card Example:**
```yaml
type: entities
title: UPS Status
entities:
- entity: sensor.cyberpower_status
name: Status
- entity: sensor.cyberpower_battery_charge
name: Battery
- entity: sensor.cyberpower_load
name: Load
- entity: sensor.cyberpower_input_voltage
name: Input Voltage
```
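The same entities can be polled from scripts via the Home Assistant REST API, following the `curl`/`$HA_TOKEN` pattern used earlier in this file. A sketch (the URL is a placeholder for the Home Assistant VM's address):
```bash
# Hedged sketch: read the UPS battery charge entity over the REST API.
HA_URL="http://<home-assistant-ip>:8123"   # placeholder -- use the HA VM's address
curl -s -H "Authorization: Bearer $HA_TOKEN" \
  "$HA_URL/api/states/sensor.cyberpower_battery_charge" | jq -r '.state'
```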
## Automations

View File

@@ -45,6 +45,7 @@ This document tracks all IP addresses in the homelab infrastructure.
|------|------|------------|---------|--------|
| 300 | gitea-vm | 10.10.10.220 | Git server | Running |
| 301 | trading-vm | 10.10.10.221 | AI trading platform (RTX A6000) | Running |
| 302 | docker-host2 | 10.10.10.207 | Docker services (n8n, future apps) | Running |
## Workstations & Personal Devices
@@ -69,6 +70,9 @@ This document tracks all IP addresses in the homelab infrastructure.
| CopyParty | cp.htsn.io | 10.10.10.201:3923 | Traefik-Primary |
| LMDev | lmdev.htsn.io | 10.10.10.111 | Traefik-Primary |
| Excalidraw | excalidraw.htsn.io | 10.10.10.206:8080 | Traefik-Primary |
| MetaMCP | metamcp.htsn.io | 10.10.10.207:12008 | Traefik-Primary |
| n8n | n8n.htsn.io | 10.10.10.207:5678 | Traefik-Primary |
| Crafty Controller | mc.htsn.io | 10.10.10.207:8443 | Traefik-Primary |
| Plex | plex.htsn.io | 10.10.10.100:32400 | Traefik-Saltbox |
| Sonarr | sonarr.htsn.io | 10.10.10.100:8989 | Traefik-Saltbox |
| Radarr | radarr.htsn.io | 10.10.10.100:7878 | Traefik-Saltbox |
@@ -92,6 +96,7 @@ This document tracks all IP addresses in the homelab infrastructure.
- .200 - TrueNAS
- .201 - CopyParty
- .206 - Docker-host
- .207 - Docker-host2
- .220 - Gitea
- .221 - Trading VM
- .250 - Traefik-Primary
@@ -110,7 +115,7 @@ This document tracks all IP addresses in the homelab infrastructure.
- 10.10.10.148 - 10.10.10.149 (2 IPs)
- 10.10.10.151 - 10.10.10.199 (49 IPs)
- 10.10.10.202 - 10.10.10.205 (4 IPs)
- 10.10.10.208 - 10.10.10.219 (12 IPs)
- 10.10.10.222 - 10.10.10.249 (28 IPs)
- 10.10.10.251 - 10.10.10.254 (4 IPs)
@@ -123,6 +128,18 @@ This document tracks all IP addresses in the homelab infrastructure.
| Portainer Agent | 9001 | Remote management from other Portainer |
| Gotenberg | 3000 | PDF generation API |
## Docker Host 2 Services (10.10.10.207) - PVE2
| Service | Port | Purpose |
|---------|------|---------|
| MetaMCP | 12008 | MCP Aggregator/Gateway (metamcp.htsn.io) |
| n8n | 5678 | Workflow automation |
| Crafty Controller | 8443 | Minecraft server management (mc.htsn.io) |
| Minecraft Java | 25565 | Minecraft Java Edition server |
| Minecraft Bedrock | 19132/udp | Minecraft Bedrock Edition (Geyser) |
| Trading Redis | 6379 | Redis for trading platform |
| Trading TimescaleDB | 5433 | TimescaleDB for trading platform |
## Syncthing API Endpoints
| Device | IP | Port | API Key |

MAINTENANCE.md (new file, +618 lines)

@@ -0,0 +1,618 @@
# Maintenance Procedures and Schedules
Regular maintenance procedures for homelab infrastructure to ensure reliability and performance.
## Overview
| Frequency | Tasks | Estimated Time |
|-----------|-------|----------------|
| **Daily** | Quick health check | 2-5 min |
| **Weekly** | Service status, logs review | 15-30 min |
| **Monthly** | Updates, backups verification | 1-2 hours |
| **Quarterly** | Full system audit, testing | 2-4 hours |
| **Annual** | Hardware maintenance, planning | 4-8 hours |
---
## Daily Maintenance (Automated)
### Quick Health Check Script
Save as `~/bin/homelab-health-check.sh`:
```bash
#!/bin/bash
# Daily homelab health check
echo "=== Homelab Health Check ==="
echo "Date: $(date)"
echo ""
echo "=== Server Status ==="
ssh pve 'uptime' 2>/dev/null || echo "PVE: UNREACHABLE"
ssh pve2 'uptime' 2>/dev/null || echo "PVE2: UNREACHABLE"
echo ""
echo "=== CPU Temperatures ==="
ssh pve 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE: $(($(cat $f)/1000))°C"; fi; done'
ssh pve2 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE2: $(($(cat $f)/1000))°C"; fi; done'
echo ""
echo "=== UPS Status ==="
ssh pve 'upsc cyberpower@localhost | grep -E "battery.charge:|battery.runtime:|ups.load:|ups.status:"'
echo ""
echo "=== ZFS Pools ==="
ssh pve 'zpool status -x' 2>/dev/null
ssh pve2 'zpool status -x' 2>/dev/null
ssh truenas 'zpool status -x vault'
echo ""
echo "=== Disk Space ==="
ssh pve 'df -h | grep -E "Filesystem|/dev/(nvme|sd)"'
ssh truenas 'df -h /mnt/vault'
echo ""
echo "=== VM Status ==="
ssh pve 'qm list | grep running | wc -l' | xargs echo "PVE VMs running:"
ssh pve2 'qm list | grep running | wc -l' | xargs echo "PVE2 VMs running:"
echo ""
echo "=== Syncthing Connections ==="
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
"http://127.0.0.1:8384/rest/system/connections" | \
python3 -c "import sys,json; d=json.load(sys.stdin)['connections']; \
[print(f\"{v.get('name',k[:7])}: {'UP' if v['connected'] else 'DOWN'}\") for k,v in d.items()]"
echo ""
echo "=== Check Complete ==="
```
**Run daily via cron**:
```bash
# Add to crontab
0 9 * * * ~/bin/homelab-health-check.sh | mail -s "Homelab Health Check" hutson@example.com
```
---
## Weekly Maintenance
### Service Status Review
**Check all critical services**:
```bash
# Proxmox services
ssh pve 'systemctl status pve-cluster pvedaemon pveproxy'
ssh pve2 'systemctl status pve-cluster pvedaemon pveproxy'
# NUT (UPS monitoring)
ssh pve 'systemctl status nut-server nut-monitor'
ssh pve2 'systemctl status nut-monitor'
# Container services
ssh pve 'pct exec 200 -- systemctl status pihole-FTL' # Pi-hole
ssh pve 'pct exec 202 -- systemctl status traefik' # Traefik
# VM services (via QEMU agent)
ssh pve 'qm guest exec 100 -- bash -c "systemctl status nfs-server smbd"' # TrueNAS
```
### Log Review
**Check for errors in critical logs**:
```bash
# Proxmox system logs
ssh pve 'journalctl -p err -b | tail -50'
ssh pve2 'journalctl -p err -b | tail -50'
# VM logs (if QEMU agent available)
ssh pve 'qm guest exec 100 -- bash -c "journalctl -p err --since today"'
# Traefik access logs
ssh pve 'pct exec 202 -- tail -100 /var/log/traefik/access.log'
```
### Syncthing Sync Status
**Check for sync errors**:
```bash
# Check all folder errors
for folder in documents downloads desktop movies pictures notes config; do
echo "=== $folder ==="
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
"http://127.0.0.1:8384/rest/folder/errors?folder=$folder" | jq
done
```
**See**: [SYNCTHING.md](SYNCTHING.md)
---
## Monthly Maintenance
### System Updates
#### Proxmox Updates
**Check for updates**:
```bash
ssh pve 'apt update && apt list --upgradable'
ssh pve2 'apt update && apt list --upgradable'
```
**Apply updates**:
```bash
# PVE
ssh pve 'apt update && apt dist-upgrade -y'
# PVE2
ssh pve2 'apt update && apt dist-upgrade -y'
# Reboot if kernel updated
ssh pve 'reboot'
ssh pve2 'reboot'
```
**⚠️ Important**:
- Check [Proxmox release notes](https://pve.proxmox.com/wiki/Roadmap) before major updates
- Test on PVE2 first if possible
- Ensure all VMs are backed up before updating
- Monitor VMs after reboot - some may need manual restart
#### Container Updates (LXC)
```bash
# Update all containers
ssh pve 'for ctid in 200 202 205; do pct exec $ctid -- bash -c "apt update && apt upgrade -y"; done'
```
#### VM Updates
**Update VMs individually via SSH**:
```bash
# Ubuntu/Debian VMs
ssh truenas 'apt update && apt upgrade -y'
ssh docker-host 'apt update && apt upgrade -y'
ssh fs-dev 'apt update && apt upgrade -y'
# Check if reboot required
ssh truenas '[ -f /var/run/reboot-required ] && echo "Reboot required"'
```
### ZFS Scrubs
**Schedule**: Run monthly on all pools
**PVE**:
```bash
# Start scrub on all pools
ssh pve 'zpool scrub nvme-mirror1'
ssh pve 'zpool scrub nvme-mirror2'
ssh pve 'zpool scrub rpool'
# Check scrub status
ssh pve 'zpool status | grep -A2 scrub'
```
**PVE2**:
```bash
ssh pve2 'zpool scrub nvme-mirror3'
ssh pve2 'zpool scrub local-zfs2'
ssh pve2 'zpool status | grep -A2 scrub'
```
**TrueNAS**:
```bash
# Scrub via TrueNAS web UI or SSH
ssh truenas 'zpool scrub vault'
ssh truenas 'zpool status vault | grep -A2 scrub'
```
**Automate scrubs**:
```bash
# Add to crontab (run on 1st of month at 2 AM)
0 2 1 * * /sbin/zpool scrub nvme-mirror1
0 2 1 * * /sbin/zpool scrub nvme-mirror2
0 2 1 * * /sbin/zpool scrub rpool
```
**See**: [STORAGE.md](STORAGE.md) for pool details
### SMART Tests
**Run extended SMART tests monthly**:
```bash
# TrueNAS drives (via QEMU agent)
ssh pve 'qm guest exec 100 -- bash -c "smartctl --scan | while read dev type; do smartctl -t long \$dev; done"'
# Check results after 4-8 hours
ssh pve 'qm guest exec 100 -- bash -c "smartctl --scan | while read dev type; do echo \"=== \$dev ===\"; smartctl -a \$dev | grep -E \"Model|Serial|test result|Reallocated|Current_Pending\"; done"'
# PVE drives
ssh pve 'for dev in /dev/nvme0 /dev/nvme1 /dev/sda /dev/sdb; do [ -e "$dev" ] && smartctl -t long $dev; done'
# PVE2 drives
ssh pve2 'for dev in /dev/nvme0 /dev/nvme1 /dev/sda /dev/sdb; do [ -e "$dev" ] && smartctl -t long $dev; done'
```
**Automate SMART tests**:
```bash
# Add to crontab (run on 15th of month at 3 AM)
0 3 15 * * /usr/sbin/smartctl -t long /dev/nvme0
0 3 15 * * /usr/sbin/smartctl -t long /dev/sda
```
### Certificate Renewal Verification
**Check SSL certificate expiry**:
```bash
# Check Traefik certificates
ssh pve 'pct exec 202 -- cat /etc/traefik/acme.json | jq ".letsencrypt.Certificates[] | {domain: .domain.main, expires: .Dates.NotAfter}"'
# Check specific service
echo | openssl s_client -servername git.htsn.io -connect git.htsn.io:443 2>/dev/null | openssl x509 -noout -dates
```
**Certificates should auto-renew 30 days before expiry via Traefik**
**See**: [TRAEFIK.md](TRAEFIK.md) for certificate management
### Backup Verification
**⚠️ TODO**: No backup strategy currently in place
**See**: [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md) for implementation plan
---
## Quarterly Maintenance
### Full System Audit
**Check all systems comprehensively**:
1. **ZFS Pool Health**:
```bash
ssh pve 'zpool status -v'
ssh pve2 'zpool status -v'
ssh truenas 'zpool status -v vault'
```
Look for: errors, degraded vdevs, resilver operations
2. **SMART Health**:
```bash
# Run SMART health check script
~/bin/smart-health-check.sh
```
Look for: reallocated sectors, pending sectors, failures
3. **Disk Space Trends**:
```bash
# Check growth rate
ssh pve 'zpool list -o name,size,allocated,free,fragmentation'
ssh truenas 'df -h /mnt/vault'
```
Plan for expansion if >80% full
4. **VM Resource Usage**:
```bash
# Check if VMs need more/less resources
ssh pve 'qm list'
ssh pve 'pvesh get /nodes/pve/status'
```
5. **Network Performance**:
```bash
# Test bandwidth between critical nodes
iperf3 -s # On one host
iperf3 -c 10.10.10.120 # From another
```
6. **Temperature Monitoring**:
```bash
# Check max temps over past quarter
# TODO: Set up Prometheus/Grafana for historical data
ssh pve 'sensors'
ssh pve2 'sensors'
```
### Service Dependency Testing
**Test critical paths**:
1. **Power failure recovery** (if safe to test):
- See [UPS.md](UPS.md) for full procedure
- Verify VM startup order works
- Confirm all services come back online
2. **Failover testing**:
- Tailscale subnet routing (PVE → UCG-Fiber)
- NUT monitoring (PVE server → PVE2 client)
3. **Backup restoration** (when backups implemented):
- Test restoring a VM from backup
- Test restoring files from Syncthing versioning
### Documentation Review
- [ ] Update IP assignments in [IP-ASSIGNMENTS.md](IP-ASSIGNMENTS.md)
- [ ] Review and update service URLs in [SERVICES.md](SERVICES.md)
- [ ] Check for missing hardware specs in [HARDWARE.md](HARDWARE.md)
- [ ] Update any changed procedures in this document
---
## Annual Maintenance
### Hardware Maintenance
**Physical cleaning**:
```bash
# Shut down servers (coordinate with users)
ssh pve 'shutdown -h now'
ssh pve2 'shutdown -h now'
# Clean dust from:
# - CPU heatsinks
# - GPU fans
# - Case fans
# - PSU vents
# - Storage enclosure fans
# Check for:
# - Bulging capacitors on PSU/motherboard
# - Loose cables
# - Fan noise/vibration
```
**Thermal paste inspection** (every 2-3 years):
- Check CPU temps vs baseline
- If temps >85°C under load, consider reapplying paste
- Threadripper PRO: Tctl max safe = 90°C
**See**: [HARDWARE.md](HARDWARE.md) for component details
### UPS Battery Test
**Runtime test**:
```bash
# Check battery health
ssh pve 'upsc cyberpower@localhost | grep battery'
# Perform runtime test (coordinate power loss)
# 1. Note current runtime estimate
# 2. Unplug UPS from wall
# 3. Let battery drain to 20%
# 4. Note actual runtime vs estimate
# 5. Plug back in before shutdown triggers
# Battery replacement if:
# - Runtime < 10 min at typical load
# - Battery age > 3-5 years
# - Battery charge < 100% when on AC for 24h
```
**See**: [UPS.md](UPS.md) for full UPS details
### Drive Replacement Planning
**Check drive age and health**:
```bash
# Get drive hours and health
ssh truenas 'smartctl --scan | while read dev type; do
echo "=== $dev ===";
smartctl -a $dev | grep -E "Model|Serial|Power_On_Hours|Reallocated|Pending";
done'
```
**Replace drives if**:
- Reallocated sectors > 0
- Pending sectors > 0
- SMART pre-fail warnings
- Age > 5 years for HDDs (3-5 years for SSDs/NVMe)
- Hours > 50,000 for consumer drives
**Budget for replacements**:
- HDDs: WD Red 6TB (~$150/drive)
- NVMe: Samsung/Kingston 2TB (~$150-200/drive)
### Capacity Planning
**Review growth trends**:
```bash
# Storage growth (compare to last year)
ssh pve 'zpool list'
ssh truenas 'df -h /mnt/vault'
# Network bandwidth (if monitoring in place)
# Review Grafana dashboards
# Power consumption
ssh pve 'upsc cyberpower@localhost ups.load'
```
**Plan expansions**:
- Storage: Add drives if >70% full
- RAM: Check if VMs hitting limits
- Network: Upgrade if bandwidth saturation
- UPS: Upgrade if load >80%
### License and Subscription Review
**Proxmox subscription** (if applicable):
- Community (free) or Enterprise subscription?
- Check for updates to pricing/features
**Service subscriptions**:
- Domain registration (htsn.io)
- Cloudflare plan (currently free)
- Let's Encrypt (free, no action needed)
---
## Update Schedules
### Proxmox
| Component | Frequency | Notes |
|-----------|-----------|-------|
| Security patches | Weekly | Via `apt upgrade` |
| Minor updates | Monthly | Test on PVE2 first |
| Major versions | Quarterly | Read release notes, plan downtime |
| Kernel updates | Monthly | Requires reboot |
**Update procedure**:
1. Check [Proxmox release notes](https://pve.proxmox.com/wiki/Roadmap)
2. Backup VM configs: `vzdump --dumpdir /tmp`
3. Update: `apt update && apt dist-upgrade`
4. Reboot if kernel changed: `reboot`
5. Verify VMs auto-started: `qm list`
### Containers (LXC)
| Container | Update Frequency | Package Manager |
|-----------|------------------|-----------------|
| Pi-hole (200) | Weekly | `apt` |
| Traefik (202) | Monthly | `apt` |
| FindShyt (205) | As needed | `apt` |
**Update command**:
```bash
ssh pve 'pct exec CTID -- bash -c "apt update && apt upgrade -y"'
```
### VMs
| VM | Update Frequency | Notes |
|----|------------------|-------|
| TrueNAS | Monthly | Via web UI or `apt` |
| Saltbox | Weekly | Managed by Saltbox updates |
| HomeAssistant | Monthly | Via HA supervisor |
| Docker-host | Weekly | `apt` + Docker images |
| Trading-VM | As needed | Via SSH |
| Gitea-VM | Monthly | Via web UI + `apt` |
**Docker image updates**:
```bash
ssh docker-host 'docker-compose pull && docker-compose up -d'
```
### Firmware Updates
| Component | Check Frequency | Update Method |
|-----------|----------------|---------------|
| Motherboard BIOS | Annually | Manual flash (high risk) |
| GPU firmware | Rarely | `nvidia-smi` or manual |
| SSD/NVMe firmware | Quarterly | Vendor tools |
| HBA firmware | Annually | LSI tools |
| UPS firmware | Annually | PowerPanel or manual |
**⚠️ Warning**: BIOS/firmware updates carry risk. Only update if:
- Critical security issue
- Needed for hardware compatibility
- Fixing known bug affecting you
---
## Testing Checklists
### Pre-Update Checklist
Before ANY system update:
- [ ] Check current system state: `uptime`, `qm list`, `zpool status`
- [ ] Verify backups are current (when backup system in place)
- [ ] Check for critical VMs/services that can't have downtime
- [ ] Review update changelog/release notes
- [ ] Test on non-critical system first (PVE2 or test VM)
- [ ] Plan rollback strategy if update fails
- [ ] Notify users if downtime expected
### Post-Update Checklist
After system update:
- [ ] Verify system booted correctly: `uptime`
- [ ] Check all VMs/CTs started: `qm list`, `pct list`
- [ ] Test critical services:
- [ ] Pi-hole DNS: `nslookup google.com 10.10.10.10`
- [ ] Traefik routing: `curl -I https://plex.htsn.io`
- [ ] NFS/SMB shares: Test mount from VM
- [ ] Syncthing sync: Check all devices connected
- [ ] Review logs for errors: `journalctl -p err -b`
- [ ] Check temperatures: `sensors`
- [ ] Verify UPS monitoring: `upsc cyberpower@localhost`
### Disaster Recovery Test
**Quarterly test** (when backup system in place):
- [ ] Simulate VM failure: Restore from backup
- [ ] Simulate storage failure: Import pool on different system
- [ ] Simulate network failure: Verify Tailscale failover
- [ ] Simulate power failure: Test UPS shutdown procedure (if safe)
- [ ] Document recovery time and issues
---
## Log Rotation
**System logs** are automatically rotated by systemd-journald and logrotate.
**Check log sizes**:
```bash
# Journalctl size
ssh pve 'journalctl --disk-usage'
# Traefik logs
ssh pve 'pct exec 202 -- du -sh /var/log/traefik/'
```
**Configure retention**:
```bash
# Limit journald to 500MB
ssh pve 'echo "SystemMaxUse=500M" >> /etc/systemd/journald.conf'
ssh pve 'systemctl restart systemd-journald'
```
**Traefik log rotation** (already configured):
```bash
# /etc/logrotate.d/traefik on CT 202
/var/log/traefik/*.log {
daily
rotate 7
compress
delaycompress
missingok
notifempty
}
```
---
## Monitoring Integration
**TODO**: Set up automated monitoring for these procedures
**When monitoring is implemented** (see [MONITORING.md](MONITORING.md)):
- ZFS scrub completion/errors
- SMART test failures
- Certificate expiry warnings (<30 days)
- Update availability notifications
- Disk space thresholds (>80%)
- Temperature warnings (>85°C)
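As one example of how these could be wired up today with plain cron, a sketch for the disk-space threshold (assumes a working local `mail` command; the address is a placeholder):
```bash
# Hedged sketch: warn when any ZFS pool on PVE crosses 80% capacity.
ssh pve 'zpool list -H -o name,capacity' | while read -r pool cap; do
  if [ "${cap%\%}" -gt 80 ]; then
    echo "Pool $pool on PVE is at $cap" | mail -s "Disk space warning" hutson@example.com
  fi
done
```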
---
## Related Documentation
- [MONITORING.md](MONITORING.md) - Automated health checks and alerts
- [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md) - Backup implementation plan
- [UPS.md](UPS.md) - Power failure procedures
- [STORAGE.md](STORAGE.md) - ZFS pool management
- [HARDWARE.md](HARDWARE.md) - Hardware specifications
- [SERVICES.md](SERVICES.md) - Service inventory
---
**Last Updated**: 2025-12-22
**Status**: ⚠️ Manual procedures only - monitoring automation needed

MINECRAFT.md (new file, +478 lines)

@@ -0,0 +1,478 @@
# Minecraft Server - Hutworld
Minecraft server running on docker-host2 via Crafty Controller 4.
---
## Quick Reference
| Setting | Value |
|---------|-------|
| **Web GUI** | https://mc.htsn.io |
| **Game Server (Java)** | hutworld.htsn.io:25565 |
| **Game Server (Bedrock)** | hutworld.htsn.io:19132 |
| **Host** | docker-host2 (10.10.10.207) |
| **Server Type** | Paper 1.21.11 |
| **World Name** | hutworld |
| **Memory** | 2GB min / 4GB max |
---
## Crafty Controller Access
| Setting | Value |
|---------|-------|
| **URL** | https://mc.htsn.io |
| **Username** | admin |
| **Password** | See `/crafty/data/config/default-creds.txt` on docker-host2 |
**Get password:**
```bash
ssh docker-host2 'cat ~/crafty/data/config/default-creds.txt'
```
---
## Current Status
### Completed
- [x] Crafty Controller 4.4.7 deployed on docker-host2
- [x] Traefik reverse proxy configured (mc.htsn.io → 10.10.10.207:8443)
- [x] DNS A record created for hutworld.htsn.io (non-proxied, points to public IP)
- [x] Port forwarding configured via UniFi API:
- TCP/UDP 25565 → 10.10.10.207 (Java Edition)
- UDP 19132 → 10.10.10.207 (Bedrock via Geyser)
- [x] Server files transferred from Windows PC (D:\Minecraft\mcss\servers\hutworld)
- [x] Server imported into Crafty and running
- [x] Paper upgraded from 1.21.5 to 1.21.11
- [x] Plugins updated (GSit 3.1.1, LuckPerms 5.5.22)
- [x] Orphaned plugin data cleaned up
- [x] LuckPerms database restored with original permissions
- [x] Automated backups to TrueNAS configured (every 6 hours)
### Pending
- [ ] Change Crafty admin password to something memorable
- [ ] Test external connectivity from outside network
---
## Import Instructions
To import the hutworld server in Crafty:
1. Go to **Servers** → Click **+ Create New Server**
2. Select **Import Server** tab
3. Fill in:
- **Server Name:** `Hutworld`
- **Import Path:** `/crafty/import/hutworld`
- **Server JAR:** `paper.jar`
- **Min RAM:** `2048` (2GB)
- **Max RAM:** `6144` (6GB)
- **Server Port:** `25565`
4. Click **Import Server**
5. Go to server → Click **Start**
---
## Server Configuration
### World Data
| World | Description |
|-------|-------------|
| hutworld | Main overworld |
| hutworld_nether | Nether dimension |
| hutworld_the_end | End dimension |
### Installed Plugins
| Plugin | Version | Purpose |
|--------|---------|---------|
| EssentialsX | 2.20.1 | Core server commands |
| EssentialsXChat | 2.20.1 | Chat formatting |
| EssentialsXSpawn | 2.20.1 | Spawn management |
| Geyser-Spigot | Latest | Bedrock Edition support |
| floodgate | Latest | Bedrock authentication |
| GSit | 3.1.1 | Sit/lay/crawl animations |
| LuckPerms | 5.5.22 | Permissions management |
| PluginPortal | 2.2.2 | Plugin management |
| Vault | 1.7.3 | Economy/permissions API |
| ViaVersion | Latest | Multi-version support |
| ViaBackwards | Latest | Older client support |
| randomtp | Latest | Random teleportation |
**Removed plugins** (cleaned up 2026-01-03):
- GriefPrevention, Multiverse-Core, Multiverse-Portals, ProtocolLib, WorldEdit, WorldGuard (disabled/orphaned)
---
## Docker Configuration
**Location:** `~/crafty/docker-compose.yml` on docker-host2
```yaml
services:
crafty:
image: registry.gitlab.com/crafty-controller/crafty-4:4.4.7
container_name: crafty
restart: unless-stopped
environment:
- TZ=America/New_York
ports:
- "8443:8443" # Web GUI (HTTPS)
- "8123:8123" # Dynmap (if used)
- "25565:25565" # Minecraft Java
- "25566:25566" # Additional server
- "19132:19132/udp" # Minecraft Bedrock (Geyser)
volumes:
- ./data/backups:/crafty/backups
- ./data/logs:/crafty/logs
- ./data/servers:/crafty/servers
- ./data/config:/crafty/app/config
- ./data/import:/crafty/import
```
---
## Traefik Configuration
**File:** `/etc/traefik/conf.d/crafty.yaml` on CT 202 (10.10.10.250)
```yaml
http:
routers:
crafty-secure:
entryPoints:
- websecure
rule: "Host(`mc.htsn.io`)"
service: crafty
tls:
certResolver: cloudflare
priority: 50
services:
crafty:
loadBalancer:
servers:
- url: "https://10.10.10.207:8443"
serversTransport: crafty-transport@file
serversTransports:
crafty-transport:
insecureSkipVerify: true
```
---
## Port Forwarding (UniFi)
Configured via UniFi API on UCG-Fiber (10.10.10.1):
| Rule Name | Port | Protocol | Destination |
|-----------|------|----------|-------------|
| Minecraft Java | 25565 | TCP/UDP | 10.10.10.207:25565 |
| Minecraft Bedrock | 19132 | UDP | 10.10.10.207:19132 |
---
## DNS Records (Cloudflare)
| Record | Type | Value | Proxied |
|--------|------|-------|---------|
| mc.htsn.io | CNAME | htsn.io | Yes (for web GUI) |
| hutworld.htsn.io | A | 70.237.94.174 | No (direct for game traffic) |
**Note:** Game traffic (25565, 19132) cannot be proxied through Cloudflare - only HTTP/HTTPS works with Cloudflare proxy.
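To confirm the split is working as intended, a quick `dig` from any machine should show the proxied web record returning Cloudflare edge IPs while the game record returns the raw WAN IP:
```bash
# Proxied web record - expect Cloudflare edge IPs
dig +short mc.htsn.io

# Direct game record - expect the WAN IP (70.237.94.174)
dig +short hutworld.htsn.io
```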
---
## LuckPerms Web Editor
After server is running:
1. Open Crafty console for Hutworld server
2. Run command: `/lp editor`
3. A unique URL will be generated (cloud-hosted by LuckPerms)
4. Open the URL in browser to manage permissions
The editor is hosted by LuckPerms, so no additional port forwarding is needed.
---
## Backup Configuration
### Automated Backups to TrueNAS
Backups run automatically every 6 hours and are stored on TrueNAS.
| Setting | Value |
|---------|-------|
| **Destination** | TrueNAS (10.10.10.200) |
| **Path** | `/mnt/vault/users/backups/minecraft/` |
| **Frequency** | Every 6 hours (12am, 6am, 12pm, 6pm) |
| **Retention** | 14 backups (~3.5 days of history) |
| **Size** | ~2.3 GB per backup |
| **Script** | `/home/hutson/minecraft-backup.sh` on docker-host2 |
| **Log** | `/home/hutson/minecraft-backup.log` on docker-host2 |
### Backup Script
**Location:** `~/minecraft-backup.sh` on docker-host2
```bash
#!/bin/bash
# Minecraft Server Backup Script
# Backs up Crafty server data to TrueNAS
BACKUP_SRC="$HOME/crafty/data/servers/19f604a9-f037-442d-9283-0761c73cfd60"
BACKUP_DEST="hutson@10.10.10.200:/mnt/vault/users/backups/minecraft"
DATE=$(date +%Y-%m-%d_%H%M)
BACKUP_NAME="hutworld-$DATE.tar.gz"
LOCAL_BACKUP="/tmp/$BACKUP_NAME"
# Create compressed backup (exclude large unnecessary files)
tar -czf "$LOCAL_BACKUP" \
--exclude="*.jar" \
--exclude="cache" \
--exclude="libraries" \
--exclude=".paper-remapped" \
-C "$HOME/crafty/data/servers" \
19f604a9-f037-442d-9283-0761c73cfd60
# Transfer to TrueNAS
sshpass -p 'GrilledCh33s3#' scp -o StrictHostKeyChecking=no "$LOCAL_BACKUP" "$BACKUP_DEST/"
# Clean up local temp file
rm -f "$LOCAL_BACKUP"
# Keep only last 14 backups on TrueNAS
sshpass -p 'GrilledCh33s3#' ssh -o StrictHostKeyChecking=no hutson@10.10.10.200 '
cd /mnt/vault/users/backups/minecraft
ls -t hutworld-*.tar.gz 2>/dev/null | tail -n +15 | xargs -r rm -f
'
```
### Cron Schedule
```bash
# View current schedule
ssh docker-host2 'crontab -l | grep minecraft'
# Output: 0 */6 * * * /home/hutson/minecraft-backup.sh >> /home/hutson/minecraft-backup.log 2>&1
```
### Manual Backup Commands
```bash
# Run backup manually
ssh docker-host2 '~/minecraft-backup.sh'
# Check backup log
ssh docker-host2 'tail -20 ~/minecraft-backup.log'
# List backups on TrueNAS
sshpass -p 'GrilledCh33s3#' ssh -o StrictHostKeyChecking=no hutson@10.10.10.200 \
'ls -lh /mnt/vault/users/backups/minecraft/'
```
### Restore from Backup
```bash
# 1. Stop the server in Crafty web UI
# 2. Copy backup from TrueNAS
sshpass -p 'GrilledCh33s3#' scp -o StrictHostKeyChecking=no \
hutson@10.10.10.200:/mnt/vault/users/backups/minecraft/hutworld-YYYY-MM-DD_HHMM.tar.gz \
/tmp/
# 3. Extract to server directory (backup existing first)
ssh docker-host2 'cd ~/crafty/data/servers && \
mv 19f604a9-f037-442d-9283-0761c73cfd60 19f604a9-f037-442d-9283-0761c73cfd60.old && \
tar -xzf /tmp/hutworld-YYYY-MM-DD_HHMM.tar.gz'
# 4. Start server in Crafty web UI
```
---
## Common Tasks
### Start/Stop Server
Via Crafty web UI at https://mc.htsn.io, or:
```bash
# Check Crafty container status
ssh docker-host2 'docker ps | grep crafty'
# Restart Crafty container
ssh docker-host2 'cd ~/crafty && docker compose restart'
# View Crafty logs
ssh docker-host2 'docker logs -f crafty'
```
### Backup Server
See [Backup Configuration](#backup-configuration) for full details.
```bash
# Run backup manually
ssh docker-host2 '~/minecraft-backup.sh'
# Check recent backups
sshpass -p 'GrilledCh33s3#' ssh -o StrictHostKeyChecking=no hutson@10.10.10.200 \
'ls -lht /mnt/vault/users/backups/minecraft/ | head -5'
```
### Update Plugins
1. Download new plugin JAR
2. Upload via Crafty Files tab, or:
```bash
scp plugin.jar docker-host2:~/crafty/data/servers/hutworld/plugins/
```
3. Restart server in Crafty
### Check Server Logs
Via Crafty web UI (Logs tab), or:
```bash
ssh docker-host2 'tail -f ~/crafty/data/servers/hutworld/logs/latest.log'
```
---
## Troubleshooting
### Server won't start
```bash
# Check Crafty container logs
ssh docker-host2 'docker logs crafty --tail 50'
# Check server logs
ssh docker-host2 'cat ~/crafty/data/servers/hutworld/logs/latest.log | tail -100'
# Check Java version in container
ssh docker-host2 'docker exec crafty java -version'
```
### Can't connect externally
1. Verify port forwarding is active:
```bash
ssh root@10.10.10.1 'iptables -t nat -L -n | grep 25565'
```
2. Test from external network:
```bash
nc -zv hutworld.htsn.io 25565
```
3. Check if server is listening:
```bash
ssh docker-host2 'netstat -tlnp | grep 25565'
```
### Bedrock players can't connect
1. Verify Geyser plugin is installed and enabled
2. Check Geyser config: `~/crafty/data/servers/hutworld/plugins/Geyser-Spigot/config.yml`
3. Ensure UDP 19132 is forwarded and not blocked
### LuckPerms missing users/permissions
If LuckPerms shows a fresh database (missing users like Suwan):
1. **Check if original database exists:**
```bash
ssh docker-host2 'ls -la ~/crafty/data/import/hutworld/plugins/LuckPerms/*.db'
```
2. **Restore from import backup:**
```bash
# Stop server in Crafty UI first
ssh docker-host2 'cp ~/crafty/data/import/hutworld/plugins/LuckPerms/luckperms-h2-v2.mv.db \
~/crafty/data/servers/19f604a9-f037-442d-9283-0761c73cfd60/plugins/LuckPerms/'
```
3. **Or restore from TrueNAS backup:**
```bash
# List available backups
sshpass -p 'GrilledCh33s3#' ssh -o StrictHostKeyChecking=no hutson@10.10.10.200 \
'ls -lt /mnt/vault/users/backups/minecraft/'
# Extract LuckPerms database from backup
sshpass -p 'GrilledCh33s3#' scp hutson@10.10.10.200:/mnt/vault/users/backups/minecraft/hutworld-YYYY-MM-DD_HHMM.tar.gz /tmp/
tar -xzf /tmp/hutworld-*.tar.gz -C /tmp --strip-components=2 \
'*/plugins/LuckPerms/luckperms-h2-v2.mv.db'
```
4. **Restart server in Crafty UI**
---
## Migration History
### 2026-01-04: Backup System
- Configured automated backups to TrueNAS every 6 hours
- Set 14-backup retention (~3.5 days of recovery points)
- Created backup script with compression and cleanup
- Storage: `/mnt/vault/users/backups/minecraft/`
### 2026-01-03: Server Fixes & Updates
**Updates:**
- Upgraded Paper from 1.21.5 to 1.21.11 (build 69)
- Updated GSit from 2.3.2 to 3.1.1
- Fixed corrupted LuckPerms JAR (re-downloaded 5.5.22)
- Restored original LuckPerms database with user permissions
**Cleanup:**
- Removed disabled plugins: Dynmap, Graves
- Removed orphaned data folders: GriefPreventionData, SilkSpawners_v2, Graves, ViaRewind
**Fixes:**
- Fixed memory allocation (was attempting 2TB, set to 2GB min / 4GB max)
- Fixed file permissions for Docker container access
### 2026-01-03: Initial Migration
**Source:** Windows PC (10.10.10.150) - D:\Minecraft\mcss\servers\hutworld
**Steps completed:**
1. Compressed hutworld folder on Windows (2.4GB zip)
2. Transferred via SCP to docker-host2
3. Unzipped to ~/crafty/data/import/hutworld
4. Downloaded Paper 1.21.5 JAR (later upgraded to 1.21.11)
5. Imported server into Crafty Controller
6. Configured port forwarding (updated existing 25565 rule, added 19132)
7. Created DNS record for hutworld.htsn.io
**Original MCSS config preserved:** `mcss_server_config.json`
---
## Related Documentation
- [IP Assignments](IP-ASSIGNMENTS.md) - Network configuration
- [Traefik](TRAEFIK.md) - Reverse proxy setup
- [VMs](VMS.md) - docker-host2 details
- [Gateway](GATEWAY.md) - UCG-Fiber configuration
---
## Resources
- [Crafty Controller Docs](https://docs.craftycontrol.com/)
- [Paper MC](https://papermc.io/)
- [Geyser MC](https://geysermc.org/)
- [LuckPerms](https://luckperms.net/)
---
**Last Updated:** 2026-01-04

583
MONITORING.md Normal file
View File

@@ -0,0 +1,583 @@
# Monitoring and Alerting
Documentation for system monitoring, health checks, and alerting across the homelab.
## Current Monitoring Status
| Component | Monitored? | Method | Alerts | Notes |
|-----------|------------|--------|--------|-------|
| **Gateway** | ✅ Yes | Custom services | ✅ Auto-reboot | Internet watchdog + memory monitor |
| **UPS** | ✅ Yes | NUT + Home Assistant | ❌ No | Battery, load, runtime tracked |
| **Syncthing** | ✅ Partial | API (manual checks) | ❌ No | Connection status available |
| **Server temps** | ✅ Partial | Manual checks | ❌ No | Via `sensors` command |
| **VM status** | ✅ Partial | Proxmox UI | ❌ No | Manual monitoring |
| **ZFS health** | ❌ No | Manual `zpool status` | ❌ No | No automated checks |
| **Disk health (SMART)** | ❌ No | Manual `smartctl` | ❌ No | No automated checks |
| **Network** | ✅ Partial | Gateway watchdog | ✅ Auto-reboot | Connectivity check every 60s |
| **Services** | ❌ No | - | ❌ No | No health checks |
| **Backups** | ❌ No | - | ❌ No | No verification |
**Overall Status**: ⚠️ **PARTIAL** - Gateway monitoring is active; nearly everything else is still checked manually
---
## Existing Monitoring
### UPS Monitoring (NUT)
**Status**: ✅ **Active and working**
**What's monitored**:
- Battery charge percentage
- Runtime remaining (seconds)
- Load percentage
- Input/output voltage
- UPS status (OL/OB/LB)
**Access**:
```bash
# Full UPS status
ssh pve 'upsc cyberpower@localhost'
# Key metrics
ssh pve 'upsc cyberpower@localhost | grep -E "battery.charge:|battery.runtime:|ups.load:|ups.status:"'
```
**Home Assistant Integration**:
- Sensors: `sensor.cyberpower_*`
- Can be used for automation/alerts
- Currently: No alerts configured
**See**: [UPS.md](UPS.md)
---
### Gateway Monitoring
**Status**: ✅ **Active with auto-recovery**
Two custom systemd services monitor the UCG-Fiber gateway (10.10.10.1):
**1. Internet Watchdog** (`internet-watchdog.service`)
- Pings external DNS (1.1.1.1, 8.8.8.8, 208.67.222.222) every 60 seconds
- Auto-reboots gateway after 5 consecutive failures (~5 minutes)
- Logs to `/var/log/internet-watchdog.log`
**2. Memory Monitor** (`memory-monitor.service`)
- Logs memory usage and top processes every 10 minutes
- Logs to `/data/logs/memory-history.log`
- Auto-rotates when log exceeds 10MB
**Quick Commands**:
```bash
# Check service status
ssh ucg-fiber 'systemctl status internet-watchdog memory-monitor'
# View watchdog activity
ssh ucg-fiber 'tail -20 /var/log/internet-watchdog.log'
# View memory history
ssh ucg-fiber 'tail -100 /data/logs/memory-history.log'
# Current memory usage
ssh ucg-fiber 'free -m && ps -eo pid,rss,comm --sort=-rss | head -12'
```
**See**: [GATEWAY.md](GATEWAY.md)
---
### Syncthing Monitoring
**Status**: ⚠️ **Partial** - API available, no automated monitoring
**What's available**:
- Device connection status
- Folder sync status
- Sync errors
- Bandwidth usage
**Manual Checks**:
```bash
# Check connections (Mac Mini)
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
"http://127.0.0.1:8384/rest/system/connections" | \
python3 -c "import sys,json; d=json.load(sys.stdin)['connections']; \
[print(f\"{v.get('name',k[:7])}: {'UP' if v['connected'] else 'DOWN'}\") for k,v in d.items()]"
# Check folder status
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
"http://127.0.0.1:8384/rest/db/status?folder=documents" | jq
# Check errors
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
"http://127.0.0.1:8384/rest/folder/errors?folder=documents" | jq
```
**Needs**: Automated monitoring script + alerts
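A minimal cron-able sketch of such a check (same API key and local port as above; it prints output and exits non-zero only when a device is disconnected):
```bash
#!/bin/bash
# Report any disconnected Syncthing devices (exit 1 if any are down)
API_KEY="oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5"
curl -s -H "X-API-Key: $API_KEY" \
  "http://127.0.0.1:8384/rest/system/connections" | \
python3 -c '
import sys, json
conns = json.load(sys.stdin)["connections"]
down = [v.get("name", k[:7]) for k, v in conns.items() if not v["connected"]]
if down:
    print("Syncthing devices DOWN: " + ", ".join(down))
    sys.exit(1)
'
```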
**See**: [SYNCTHING.md](SYNCTHING.md)
---
### Temperature Monitoring
**Status**: ⚠️ **Manual only**
**Current Method**:
```bash
# CPU temperature (Threadripper Tctl)
ssh pve 'for f in /sys/class/hwmon/hwmon*/temp*_input; do \
label=$(cat ${f%_input}_label 2>/dev/null); \
if [ "$label" = "Tctl" ]; then echo "PVE Tctl: $(($(cat $f)/1000))°C"; fi; done'
ssh pve2 'for f in /sys/class/hwmon/hwmon*/temp*_input; do \
label=$(cat ${f%_input}_label 2>/dev/null); \
if [ "$label" = "Tctl" ]; then echo "PVE2 Tctl: $(($(cat $f)/1000))°C"; fi; done'
```
**Thresholds**:
- Healthy: 70-80°C under load
- Warning: >85°C
- Critical: >90°C (throttling)
**Needs**: Automated monitoring + alert if >85°C
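A minimal sketch of what automated checking could look like (reuses the Tctl lookup above; the alert action at the end is a placeholder for whichever channel gets wired up):
```bash
#!/bin/bash
# Alert if either Proxmox host's Tctl temperature exceeds 85°C
THRESHOLD=85
for host in pve pve2; do
  temp=$(ssh "$host" 'for f in /sys/class/hwmon/hwmon*/temp*_input; do
    label=$(cat ${f%_input}_label 2>/dev/null)
    [ "$label" = "Tctl" ] && echo $(( $(cat $f) / 1000 ))
  done' | head -1)
  if [ -n "$temp" ] && [ "$temp" -gt "$THRESHOLD" ]; then
    echo "WARNING: $host Tctl is ${temp}°C (threshold ${THRESHOLD}°C)"
    # e.g. pipe this to mail, Telegram, or a Home Assistant webhook
  fi
done
```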
---
### Proxmox VM Monitoring
**Status**: ⚠️ **Manual only**
**Current Access**:
- Proxmox Web UI: Node → Summary
- CLI: `ssh pve 'qm list'`
**Metrics Available** (via Proxmox):
- CPU usage per VM
- RAM usage per VM
- Disk I/O
- Network I/O
- VM uptime
**Needs**: API-based monitoring + alerts for VM down
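Until that exists, a small sketch using `qm status` covers the basics (the VM IDs below are the critical ones from the service overview; adjust to taste):
```bash
#!/bin/bash
# Alert if a critical VM is not in the 'running' state
CRITICAL_VMS="100 110"   # 100 = TrueNAS, 110 = Home Assistant
for vmid in $CRITICAL_VMS; do
  status=$(ssh pve "qm status $vmid" 2>/dev/null | awk '{print $2}')
  if [ "$status" != "running" ]; then
    echo "ALERT: VM $vmid on PVE reports status '$status'"
  fi
done
```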
---
## Recommended Monitoring Stack
### Option 1: Prometheus + Grafana (Recommended)
**Why**:
- Industry standard
- Extensive integrations
- Beautiful dashboards
- Flexible alerting
**Architecture**:
```
Grafana (dashboard) → Prometheus (metrics DB) → Exporters (data collection)
Alertmanager (alerts)
```
**Required Exporters**:
| Exporter | Monitors | Install On |
|----------|----------|------------|
| node_exporter | CPU, RAM, disk, network | PVE, PVE2, TrueNAS, all VMs |
| zfs_exporter | ZFS pool health | PVE, PVE2, TrueNAS |
| smartmon_exporter | Drive SMART data | PVE, PVE2, TrueNAS |
| nut_exporter | UPS metrics | PVE |
| proxmox_exporter | VM/CT stats | PVE, PVE2 |
| cadvisor | Docker containers | Saltbox, docker-host |
**Deployment**:
```bash
# Create monitoring VM
ssh pve 'qm create 210 --name monitoring --memory 4096 --cores 2 \
--net0 virtio,bridge=vmbr0'
# Install Prometheus + Grafana (via Docker)
# /opt/monitoring/docker-compose.yml
```
**Estimated Setup Time**: 4-6 hours
---
### Option 2: Uptime Kuma (Simpler Alternative)
**Why**:
- Lightweight
- Easy to set up
- Web-based dashboard
- Built-in alerts (email, Slack, etc.)
**What it monitors**:
- HTTP/HTTPS endpoints
- Ping (ICMP)
- Ports (TCP)
- Docker containers
**Deployment**:
```bash
ssh docker-host 'mkdir -p /opt/uptime-kuma'
cat > docker-compose.yml << 'EOF'
version: "3.8"
services:
uptime-kuma:
image: louislam/uptime-kuma:latest
ports:
- "3001:3001"
volumes:
- ./data:/app/data
restart: unless-stopped
EOF
# Access: http://10.10.10.206:3001
# Add Traefik config for uptime.htsn.io
```
**Estimated Setup Time**: 1-2 hours
---
### Option 3: Netdata (Real-time Monitoring)
**Why**:
- Real-time metrics (1-second granularity)
- Auto-discovers services
- Low overhead
- Beautiful web UI
**Deployment**:
```bash
# Install on each server
ssh pve 'bash <(curl -Ss https://my-netdata.io/kickstart.sh)'
ssh pve2 'bash <(curl -Ss https://my-netdata.io/kickstart.sh)'
# Access:
# http://10.10.10.120:19999 (PVE)
# http://10.10.10.102:19999 (PVE2)
```
**Parent-Child Setup** (optional):
- Configure PVE as parent
- Stream metrics from PVE2 → PVE
- Single dashboard for both servers
**Estimated Setup Time**: 1 hour
---
## Critical Metrics to Monitor
### Server Health
| Metric | Threshold | Action |
|--------|-----------|--------|
| **CPU usage** | >90% for 5 min | Alert |
| **CPU temp** | >85°C | Alert |
| **CPU temp** | >90°C | Critical alert |
| **RAM usage** | >95% | Alert |
| **Disk space** | >80% | Warning |
| **Disk space** | >90% | Alert |
| **Load average** | >CPU count | Alert |
### Storage Health
| Metric | Threshold | Action |
|--------|-----------|--------|
| **ZFS pool errors** | >0 | Alert immediately |
| **ZFS pool degraded** | Any degraded vdev | Critical alert |
| **ZFS scrub failed** | Last scrub error | Alert |
| **SMART reallocated sectors** | >0 | Warning |
| **SMART pending sectors** | >0 | Alert |
| **SMART failure** | Pre-fail | Critical - replace drive |
### UPS
| Metric | Threshold | Action |
|--------|-----------|--------|
| **Battery charge** | <20% | Warning |
| **Battery charge** | <10% | Alert |
| **On battery** | >5 min | Alert |
| **Runtime** | <5 min | Critical |
### Network
| Metric | Threshold | Action |
|--------|-----------|--------|
| **Device unreachable** | >2 min down | Alert |
| **High packet loss** | >5% | Warning |
| **Bandwidth saturation** | >90% | Warning |
### VMs/Services
| Metric | Threshold | Action |
|--------|-----------|--------|
| **VM stopped** | Critical VM down | Alert immediately |
| **Service unreachable** | HTTP 5xx or timeout | Alert |
| **Backup failed** | Any backup failure | Alert |
| **Certificate expiry** | <30 days | Warning |
| **Certificate expiry** | <7 days | Alert |
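Certificate expiry in particular is easy to spot-check from any host with OpenSSL (the hostname is just an example):
```bash
# Print the expiry date of the certificate served for a given host
echo | openssl s_client -servername plex.htsn.io -connect plex.htsn.io:443 2>/dev/null \
  | openssl x509 -noout -enddate
```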
---
## Alert Destinations
### Email Alerts
**Recommended**: Set up SMTP relay for email alerts
**Options**:
1. Gmail SMTP (free, rate-limited)
2. SendGrid (free tier: 100 emails/day)
3. Mailgun (free tier available)
4. Self-hosted mail server (complex)
**Configuration Example** (Prometheus Alertmanager):
```yaml
# /etc/alertmanager/alertmanager.yml
receivers:
- name: 'email'
email_configs:
- to: 'hutson@example.com'
from: 'alerts@htsn.io'
smarthost: 'smtp.gmail.com:587'
auth_username: 'alerts@htsn.io'
auth_password: 'app-password-here'
```
---
### Push Notifications
**Options**:
- **Pushover**: $5 one-time, reliable
- **Pushbullet**: Free tier available
- **Telegram Bot**: Free
- **Discord Webhook**: Free
- **Slack**: Free tier available
**Recommended**: Pushover or Telegram for mobile alerts
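For reference, a Telegram alert is a single HTTP call once a bot token and chat ID exist (placeholders below; see [N8N-INTEGRATIONS.md](N8N-INTEGRATIONS.md) for the bot already in use):
```bash
# Send a test message via the Telegram Bot API
BOT_TOKEN="123456:ABC-your-bot-token"   # placeholder
CHAT_ID="your-chat-id"                  # placeholder
curl -s "https://api.telegram.org/bot${BOT_TOKEN}/sendMessage" \
  -d chat_id="${CHAT_ID}" \
  -d text="Test alert from homelab monitoring"
```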
---
### Home Assistant Alerts
Since Home Assistant is already running, use it for alerts:
**Automation Example**:
```yaml
automation:
- alias: "UPS Low Battery Alert"
trigger:
- platform: numeric_state
entity_id: sensor.cyberpower_battery_charge
below: 20
action:
- service: notify.mobile_app
data:
message: "⚠️ UPS battery at {{ states('sensor.cyberpower_battery_charge') }}%"
- alias: "Server High Temperature"
trigger:
- platform: template
value_template: "{{ states('sensor.pve_cpu_temp') | float(0) > 85 }}"
action:
- service: notify.mobile_app
data:
message: "🔥 PVE CPU temperature: {{ states('sensor.pve_cpu_temp') }}°C"
```
**Needs**: Sensors for CPU temp, disk space, etc. in Home Assistant
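One hedged way to get such values into Home Assistant without extra integrations is to push them from a cron job via the REST API (the sensor name and token below are placeholders):
```bash
#!/bin/bash
# Push the PVE CPU temperature into Home Assistant as sensor.pve_cpu_temp
HA_URL="http://10.10.10.110:8123"
HA_TOKEN="YOUR_LONG_LIVED_TOKEN"   # placeholder
TEMP=$(ssh pve 'for f in /sys/class/hwmon/hwmon*/temp*_input; do
  label=$(cat ${f%_input}_label 2>/dev/null)
  [ "$label" = "Tctl" ] && echo $(( $(cat $f) / 1000 ))
done' | head -1)
curl -s -X POST "$HA_URL/api/states/sensor.pve_cpu_temp" \
  -H "Authorization: Bearer $HA_TOKEN" \
  -H "Content-Type: application/json" \
  -d "{\"state\": \"$TEMP\", \"attributes\": {\"unit_of_measurement\": \"°C\"}}"
```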
---
## Monitoring Scripts
### Daily Health Check
Save as `~/bin/homelab-health-check.sh`:
```bash
#!/bin/bash
# Daily homelab health check
echo "=== Homelab Health Check ==="
echo "Date: $(date)"
echo ""
echo "=== Server Status ==="
ssh pve 'uptime' 2>/dev/null || echo "PVE: UNREACHABLE"
ssh pve2 'uptime' 2>/dev/null || echo "PVE2: UNREACHABLE"
echo ""
echo "=== CPU Temperatures ==="
ssh pve 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE: $(($(cat $f)/1000))°C"; fi; done'
ssh pve2 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE2: $(($(cat $f)/1000))°C"; fi; done'
echo ""
echo "=== UPS Status ==="
ssh pve 'upsc cyberpower@localhost | grep -E "battery.charge:|battery.runtime:|ups.load:|ups.status:"'
echo ""
echo "=== ZFS Pools ==="
ssh pve 'zpool status -x' 2>/dev/null
ssh pve2 'zpool status -x' 2>/dev/null
ssh truenas 'zpool status -x vault'
echo ""
echo "=== Disk Space ==="
ssh pve 'df -h | grep -E "Filesystem|/dev/(nvme|sd)"'
ssh truenas 'df -h /mnt/vault'
echo ""
echo "=== VM Status ==="
ssh pve 'qm list | grep running | wc -l' | xargs echo "PVE VMs running:"
ssh pve2 'qm list | grep running | wc -l' | xargs echo "PVE2 VMs running:"
echo ""
echo "=== Syncthing Connections ==="
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
"http://127.0.0.1:8384/rest/system/connections" | \
python3 -c "import sys,json; d=json.load(sys.stdin)['connections']; \
[print(f\"{v.get('name',k[:7])}: {'UP' if v['connected'] else 'DOWN'}\") for k,v in d.items()]"
echo ""
echo "=== Check Complete ==="
```
**Run daily**:
```cron
0 9 * * * ~/bin/homelab-health-check.sh | mail -s "Homelab Health Check" hutson@example.com
```
---
### ZFS Scrub Checker
```bash
#!/bin/bash
# Check last ZFS scrub status
echo "=== ZFS Scrub Status ==="
for host in pve pve2; do
echo "--- $host ---"
ssh $host 'zpool status | grep -A1 scrub'
echo ""
done
echo "--- TrueNAS ---"
ssh truenas 'zpool status vault | grep -A1 scrub'
```
---
### SMART Health Checker
```bash
#!/bin/bash
# Check SMART health on all drives
echo "=== SMART Health Check ==="
echo "--- TrueNAS Drives ---"
ssh truenas 'smartctl --scan | while read dev type; do
echo "=== $dev ===";
smartctl -H $dev | grep -E "SMART overall|PASSED|FAILED";
done'
echo "--- PVE Drives ---"
ssh pve 'for dev in /dev/nvme* /dev/sd*; do
[ -e "$dev" ] && echo "=== $dev ===" && smartctl -H $dev | grep -E "SMART|PASSED|FAILED";
done'
```
---
## Dashboard Recommendations
### Grafana Dashboard Layout
**Page 1: Overview**
- Server uptime
- CPU usage (all servers)
- RAM usage (all servers)
- Disk space (all pools)
- Network traffic
- UPS status
**Page 2: Storage**
- ZFS pool health
- SMART status for all drives
- I/O latency
- Scrub progress
- Disk temperatures
**Page 3: VMs**
- VM status (up/down)
- VM resource usage
- VM disk I/O
- VM network traffic
**Page 4: Services**
- Service health checks
- HTTP response times
- Certificate expiry dates
- Syncthing sync status
---
## Implementation Plan
### Phase 1: Basic Monitoring (Week 1)
- [ ] Install Uptime Kuma or Netdata
- [ ] Add HTTP checks for all services
- [ ] Configure UPS alerts in Home Assistant
- [ ] Set up daily health check email
**Estimated Time**: 4-6 hours
---
### Phase 2: Advanced Monitoring (Week 2-3)
- [ ] Install Prometheus + Grafana
- [ ] Deploy node_exporter on all servers
- [ ] Deploy zfs_exporter
- [ ] Deploy smartmon_exporter
- [ ] Create Grafana dashboards
**Estimated Time**: 8-12 hours
---
### Phase 3: Alerting (Week 4)
- [ ] Configure Alertmanager
- [ ] Set up email/push notifications
- [ ] Create alert rules for all critical metrics
- [ ] Test all alert paths
- [ ] Document alert procedures
**Estimated Time**: 4-6 hours
---
## Related Documentation
- [GATEWAY.md](GATEWAY.md) - Gateway monitoring and troubleshooting
- [UPS.md](UPS.md) - UPS monitoring details
- [STORAGE.md](STORAGE.md) - ZFS health checks
- [SERVICES.md](SERVICES.md) - Service inventory
- [HOMEASSISTANT.md](HOMEASSISTANT.md) - Home Assistant automations
- [MAINTENANCE.md](MAINTENANCE.md) - Regular maintenance checks
---
**Last Updated**: 2026-01-02
**Status**: ⚠️ **Partial monitoring - Gateway active, other systems need implementation**

382
N8N-INTEGRATIONS.md Normal file
View File

@@ -0,0 +1,382 @@
# n8n Homelab Integrations - Quick Start Guide
n8n is running on your homelab network (10.10.10.207) and can access all local services. This guide sets up useful automations.
---
## Network Access Verified
n8n can connect to:
- **Home Assistant** (10.10.10.110:8123)
- **Prometheus** (10.10.10.206:9090)
- **Grafana** (10.10.10.206:3001)
- **Syncthing** (10.10.10.200:8384)
- **PiHole** (10.10.10.10)
- **Gitea** (10.10.10.220:3000)
- **Proxmox** (10.10.10.120:8006, 10.10.10.102:8006)
- **TrueNAS** (10.10.10.200)
- **All external APIs** (via internet)
---
## Initial Setup (First-Time)
1. Open **https://n8n.htsn.io**
2. Complete the setup wizard:
- **Owner Email:** hutson@htsn.io
- **Owner Name:** Hutson
- **Password:** (choose secure password)
3. Skip data sharing (optional)
---
## Credentials to Add in n8n
Go to **Settings → Credentials** and add:
### 1. Home Assistant
| Field | Value |
|-------|-------|
| **Credential Type** | Home Assistant API |
| **Host** | `http://10.10.10.110:8123` |
| **Access Token** | (get from Home Assistant) |
**Get Token:** Home Assistant → Profile → Long-Lived Access Tokens → Create Token
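Before pasting the token into n8n, it can be verified with a quick call from any LAN host (expected response: `{"message": "API running."}`; the token value is a placeholder):
```bash
curl -s -H "Authorization: Bearer YOUR_LONG_LIVED_TOKEN" \
  -H "Content-Type: application/json" \
  http://10.10.10.110:8123/api/
```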
---
### 2. Prometheus
| Field | Value |
|-------|-------|
| **Credential Type** | HTTP Request (Generic) |
| **URL** | `http://10.10.10.206:9090` |
| **Authentication** | None |
---
### 3. Grafana
| Field | Value |
|-------|-------|
| **Credential Type** | Grafana API |
| **URL** | `http://10.10.10.206:3001` |
| **API Key** | (create in Grafana) |
**Get API Key:** Grafana → Administration → Service Accounts → Create → Add Token
---
### 4. Syncthing
| Field | Value |
|-------|-------|
| **Credential Type** | HTTP Request (Generic) |
| **URL** | `http://10.10.10.200:8384` |
| **Header Name** | `X-API-Key` |
| **Header Value** | `VFJ7XZPJoWvkYj6fKzpQxc9u3XC8KUBs` |
---
### 5. Telegram Bot
| Field | Value |
|-------|-------|
| **Credential Type** | Telegram API |
| **Access Token** | `8450212653:AAHoVBlNUuA0vtrVPMNUfSgJh_gmFMxlrBg` |
**Your Chat ID:** `1004084736`
---
### 6. Proxmox
| Field | Value |
|-------|-------|
| **Credential Type** | HTTP Request (Generic) |
| **URL** | `http://10.10.10.120:8006` |
| **Authentication** | API Token |
| **Token** | (use monitoring@pve token if needed) |
---
## Starter Workflows
### Workflow 1: Homelab Health Check (Every Hour)
**Nodes:**
1. **Schedule Trigger** (every hour)
2. **HTTP Request** → Prometheus query for down hosts
- URL: `http://10.10.10.206:9090/api/v1/query`
- Query param: `query=up{job=~"node.*"} == 0`
3. **If** → Check if any hosts are down
4. **Telegram** → Send alert if hosts down
**PromQL Query:**
```
up{job=~"node.*"} == 0
```
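The query can be sanity-checked outside n8n with a plain HTTP call to the Prometheus API (an empty `result` array means no hosts are down):
```bash
curl -s 'http://10.10.10.206:9090/api/v1/query' \
  --data-urlencode 'query=up{job=~"node.*"} == 0' | jq '.data.result'
```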
---
### Workflow 2: Daily Backup Status
**Nodes:**
1. **Schedule Trigger** (8am daily)
2. **HTTP Request** → Query Syncthing sync status
- URL: `http://10.10.10.200:8384/rest/db/status?folder=backup`
- Header: `X-API-Key: VFJ7XZPJoWvkYj6fKzpQxc9u3XC8KUBs`
3. **Function** → Check if folder is syncing
4. **Telegram** → Send daily status report
---
### Workflow 3: High CPU Alert
**Nodes:**
1. **Schedule Trigger** (every 5 minutes)
2. **HTTP Request** → Prometheus CPU query
- URL: `http://10.10.10.206:9090/api/v1/query`
- Query: `100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)`
3. **If** → CPU > 90%
4. **Telegram** → Send alert
---
### Workflow 4: UPS Power Event
**Webhook Trigger Setup:**
1. Create webhook trigger in n8n
2. Get webhook URL: `https://n8n.htsn.io/webhook/ups-alert`
3. Configure NUT to call the webhook on power events (see the NUT-side sketch below)
**Nodes:**
1. **Webhook Trigger** → Receive UPS event
2. **Switch** → Route by event type (on battery, low battery, online)
3. **Telegram** → Send appropriate alert
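A minimal sketch of the NUT side, assuming `upsmon`'s documented `NOTIFYCMD` behavior of passing the message text as `$1` and exporting `NOTIFYTYPE` (script path and webhook name are whatever you choose):
```bash
#!/bin/bash
# /usr/local/bin/ups-webhook.sh - forward a NUT notification to n8n
# upsmon sets $NOTIFYTYPE (ONBATT, ONLINE, LOWBATT, ...) and passes the
# human-readable message as the first argument.
curl -s -X POST "https://n8n.htsn.io/webhook/ups-alert" \
  -H "Content-Type: application/json" \
  -d "{\"event\": \"${NOTIFYTYPE:-UNKNOWN}\", \"message\": \"$1\"}"
```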
---
### Workflow 5: Gitea → Deploy on Push
**Nodes:**
1. **Webhook Trigger** → Gitea push event
2. **If** → Check if branch is `main`
3. **SSH** → Connect to target server
4. **Execute Command** → `git pull && docker-compose up -d`
5. **Telegram** → Notify deployment complete
---
### Workflow 6: Syncthing Folder Behind Alert
**Nodes:**
1. **Schedule Trigger** (every 30 minutes)
2. **HTTP Request** → Get all folder statuses
- URL: `http://10.10.10.200:8384/rest/stats/folder`
3. **Function** → Check if any folder has errors or is significantly behind
4. **If** → Errors found
5. **Telegram** → Alert with folder name and status
---
### Workflow 7: Grafana Alert Forwarder
**Purpose:** Forward Grafana alerts to Telegram
**Nodes:**
1. **Webhook Trigger** → Grafana webhook
2. **Function** → Parse alert data
3. **Telegram** → Format and send alert
**Grafana Setup:**
- Contact Point → Add webhook: `https://n8n.htsn.io/webhook/grafana-alerts`
---
### Workflow 8: Daily Homelab Summary
**Nodes:**
1. **Schedule Trigger** (9am daily)
2. **Multiple HTTP Requests in parallel:**
- Prometheus: System uptime
- Prometheus: Average CPU usage (24h)
- Prometheus: Disk usage
- Syncthing: Sync status (all folders)
- PiHole: Queries blocked (24h)
3. **Function** → Format data as summary
4. **Telegram** → Send daily report
**Example Output:**
```
🏠 Homelab Daily Summary
✅ All systems operational
⏱️ Uptime: 14 days
📊 Avg CPU: 12%
💾 Disk: 45% used
🔄 Syncthing: All folders in sync
🛡️ PiHole: 2,341 queries blocked
Last updated: 2025-12-27 09:00
```
---
### Workflow 9: VM State Change Monitor
**Nodes:**
1. **Schedule Trigger** (every 1 minute)
2. **HTTP Request** → Query Proxmox API for VM list
3. **Function** → Compare with previous state (use Set node)
4. **If** → VM state changed
5. **Telegram** → Notify VM started/stopped
---
### Workflow 10: Internet Speed Test Alert
**Nodes:**
1. **Schedule Trigger** (every 6 hours)
2. **HTTP Request** → Prometheus speedtest exporter
3. **If** → Download speed < 500 Mbps
4. **Telegram** → Alert about slow internet
---
## Advanced Integration Ideas
### Home Assistant Automations
- Turn on lights when server room temperature > 80°F
- Trigger workflows from HA button press
- Send sensor data to external services
### Proxmox Automation
- Auto-snapshot VMs before updates
- Clone VMs for testing
- Monitor resource usage and rebalance
### Media Management
- Notify when new Plex content added
- Auto-organize downloads
- Send weekly watch statistics
### Backup Monitoring
- Verify all Syncthing folders synced
- Alert on ZFS scrub errors
- Monitor snapshot ages
### Security
- Alert on failed SSH attempts (from logs)
- Monitor SSL certificate expiration
- Track unusual network traffic patterns
---
## n8n Best Practices
1. **Error Handling:** Always add error workflows to catch failures
2. **Rate Limiting:** Don't query APIs too frequently
3. **Credentials:** Never hardcode - always use credential store
4. **Testing:** Use manual trigger during development
5. **Logging:** Add Set nodes to track workflow state
6. **Backups:** Export workflows regularly (Settings → Export)
---
## Useful PromQL Queries for n8n
**CPU Usage:**
```promql
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```
**Memory Usage:**
```promql
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
```
**Disk Usage:**
```promql
(node_filesystem_size_bytes{mountpoint="/"} - node_filesystem_avail_bytes{mountpoint="/"}) / node_filesystem_size_bytes{mountpoint="/"} * 100
```
**Hosts Down:**
```promql
up{job=~"node.*"} == 0
```
**Syncthing Disconnected:**
```promql
up{job=~"syncthing.*"} == 0
```
---
## Webhook URLs
After creating webhooks in n8n, you'll get URLs like:
- `https://n8n.htsn.io/webhook/your-webhook-name`
These can be called from:
- Grafana alerts
- Home Assistant automations
- Gitea webhooks
- Custom scripts
- UPS monitoring (NUT)
---
## Testing Credentials
Test each credential after adding:
1. Create simple workflow with manual trigger
2. Add HTTP Request node with credential
3. Execute and check response
4. Verify data returned correctly
---
## Troubleshooting
**Can't reach local service:**
- Verify service IP and port
- Check if service requires HTTPS
- Test with `curl` from docker-host2 first
**Webhook not triggering:**
- Check n8n is accessible: `curl https://n8n.htsn.io/webhook/test`
- Verify webhook URL in external service
- Check n8n execution logs
**Workflow fails silently:**
- Enable "Execute on Error" workflow
- Check workflow execution list
- Add Function nodes to log data
**API authentication fails:**
- Verify credential is saved
- Check API token hasn't expired
- Test with curl manually first
---
## Next Steps
1. **Add Credentials** - Start with Telegram and Prometheus
2. **Create Test Workflow** - Simple hourly health check
3. **Test Telegram** - Verify messages arrive
4. **Build Gradually** - Add one workflow at a time
5. **Export Backups** - Save workflows regularly
---
## Resources
- **n8n Docs:** https://docs.n8n.io
- **Community Workflows:** https://n8n.io/workflows
- **Your n8n:** https://n8n.htsn.io
- **Your API Docs:** [N8N.md](N8N.md)
**Last Updated:** 2025-12-27

308
N8N.md Normal file
View File

@@ -0,0 +1,308 @@
# n8n - Workflow Automation
n8n is an extendable workflow automation tool deployed on docker-host2 for automating tasks across your homelab and external services.
---
## Quick Reference
| Setting | Value |
|---------|-------|
| **URL** | https://n8n.htsn.io |
| **Local IP** | 10.10.10.207:5678 |
| **Server** | docker-host2 (PVE2 VMID 302) |
| **Database** | PostgreSQL (containerized) |
| **API Endpoint** | http://10.10.10.207:5678/api/v1/ |
---
## Claude Code Integration (MCP)
### n8n-MCP Server
The n8n-MCP server gives Claude Code deep knowledge of all 545+ n8n nodes, enabling it to build complete workflows from natural language descriptions.
**Installation:** Already configured in `~/Library/Application Support/Claude/claude_desktop_config.json`
```json
{
"mcpServers": {
"n8n-nodes": {
"command": "npx",
"args": ["-y", "@czlonkowski/n8n-mcp"]
}
}
}
```
**What This Enables:**
- ✅ Build n8n workflows from natural language
- ✅ Get detailed help with node parameters and options
- ✅ Best practices for n8n node usage
- ✅ Debug workflow issues with full node context
**Example Prompts:**
```
"Create an n8n workflow to monitor Prometheus and send Telegram alerts"
"Build a workflow that triggers when Syncthing has errors"
"What's the best n8n node to parse JSON responses?"
```
**How It Works:**
- MCP server provides offline documentation for all n8n nodes
- No connection to your n8n instance required
- Claude builds workflows that you can then import into https://n8n.htsn.io
**Resources:**
- [n8n-MCP GitHub](https://github.com/czlonkowski/n8n-mcp)
- [MCP Documentation](https://docs.n8n.io/advanced-ai/accessing-n8n-mcp-server/)
---
## API Access
### API Key
```
X-N8N-API-KEY: eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiI3NTdiMDA5YS1hMjM2LTQ5MzUtODkwNS0xZDY1MjYzZWE2OWYiLCJpc3MiOiJuOG4iLCJhdWQiOiJwdWJsaWMtYXBpIiwiaWF0IjoxNzY2ODEwMzA3fQ.RIZAbpDa7LiUPWk48qOscJ9-d9gRAA0afMDX_V3oSVo
```
### API Examples
**List Workflows:**
```bash
curl -H "X-N8N-API-KEY: eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiI3NTdiMDA5YS1hMjM2LTQ5MzUtODkwNS0xZDY1MjYzZWE2OWYiLCJpc3MiOiJuOG4iLCJhdWQiOiJwdWJsaWMtYXBpIiwiaWF0IjoxNzY2ODEwMzA3fQ.RIZAbpDa7LiUPWk48qOscJ9-d9gRAA0afMDX_V3oSVo" \
http://10.10.10.207:5678/api/v1/workflows
```
**Get Workflow by ID:**
```bash
curl -H "X-N8N-API-KEY: YOUR_API_KEY" \
http://10.10.10.207:5678/api/v1/workflows/{id}
```
**Trigger Workflow:**
```bash
curl -X POST \
-H "X-N8N-API-KEY: YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{"data": {"key": "value"}}' \
http://10.10.10.207:5678/api/v1/workflows/{id}/execute
```
**API Documentation:** https://docs.n8n.io/api/
---
## Deployment Details
### Docker Compose
**Location:** `/opt/n8n/docker-compose.yml` on docker-host2
**Services:**
- `n8n` - Main application (port 5678)
- `postgres` - Database backend
**Volumes:**
- `n8n_data` - Workflow data, credentials, settings
- `postgres_data` - Database storage
### Environment Configuration
```yaml
N8N_HOST: n8n.htsn.io
N8N_PORT: 5678
N8N_PROTOCOL: https
NODE_ENV: production
WEBHOOK_URL: https://n8n.htsn.io/
GENERIC_TIMEZONE: America/Los_Angeles
DB_TYPE: postgresdb
DB_POSTGRESDB_HOST: postgres
DB_POSTGRESDB_DATABASE: n8n
DB_POSTGRESDB_USER: n8n
DB_POSTGRESDB_PASSWORD: n8n_secure_password_2024
```
### Resource Limits
- **Memory**: 512MB-1GB (soft/hard)
- **CPU**: Shared (4 vCPUs on host)
---
## Common Tasks
### Restart n8n
```bash
ssh docker-host2 'cd /opt/n8n && docker compose restart n8n'
```
### View Logs
```bash
ssh docker-host2 'docker logs -f n8n'
```
### Backup Workflows
Workflows are stored in PostgreSQL. To backup:
```bash
ssh docker-host2 'docker exec n8n-postgres pg_dump -U n8n n8n > /tmp/n8n-backup-$(date +%Y%m%d).sql'
```
### Update n8n
```bash
ssh docker-host2 'cd /opt/n8n && docker compose pull n8n && docker compose up -d n8n'
```
---
## Traefik Configuration
**File:** `/etc/traefik/conf.d/n8n.yaml` on CT 202
```yaml
http:
routers:
n8n-secure:
entryPoints:
- websecure
rule: "Host(`n8n.htsn.io`)"
service: n8n
tls:
certResolver: cloudflare
priority: 50
n8n-redirect:
entryPoints:
- web
rule: "Host(`n8n.htsn.io`)"
middlewares:
- n8n-https-redirect
service: n8n
priority: 50
services:
n8n:
loadBalancer:
servers:
- url: "http://10.10.10.207:5678"
middlewares:
n8n-https-redirect:
redirectScheme:
scheme: https
permanent: true
```
---
## Monitoring
### Prometheus
n8n exposes metrics at `http://10.10.10.207:5678/metrics` (if enabled)
### Grafana
n8n metrics can be visualized in Grafana dashboards
### Uptime Monitoring
Add to Pulse: https://pulse.htsn.io
- Monitor: https://n8n.htsn.io
- Check interval: 60s
---
## Troubleshooting
### n8n won't start
```bash
ssh docker-host2 'docker logs n8n | tail -50'
ssh docker-host2 'docker logs n8n-postgres | tail -50'
```
### Database connection issues
```bash
# Check postgres health
ssh docker-host2 'docker exec n8n-postgres pg_isready -U n8n'
# Restart postgres
ssh docker-host2 'cd /opt/n8n && docker compose restart postgres'
```
### SSL/HTTPS issues
```bash
# Check Traefik config
ssh root@10.10.10.250 'cat /etc/traefik/conf.d/n8n.yaml'
# Reload Traefik
ssh root@10.10.10.250 'systemctl reload traefik'
```
### API not responding
```bash
# Test API locally
curl -H "X-N8N-API-KEY: YOUR_KEY" http://10.10.10.207:5678/api/v1/workflows
# Check if n8n container is healthy
ssh docker-host2 'docker ps | grep n8n'
```
---
## Integration Examples
### Homelab Automation Ideas
1. **Backup Notifications** - Send Telegram alerts when backups complete
2. **Server Monitoring** - Query Prometheus and alert on high CPU/memory
3. **Media Management** - Trigger Sonarr/Radarr downloads
4. **Home Assistant Integration** - Automate smart home workflows
5. **Git Webhooks** - Deploy changes from Gitea automatically
6. **Syncthing Monitoring** - Alert when sync folders get behind
7. **UPS Alerts** - Notify on power events from NUT
---
## Security Notes
- API key provides full access to all workflows and data
- Store API key securely (added to this doc for homelab reference)
- n8n credentials are encrypted at rest in PostgreSQL
- HTTPS enforced via Traefik
- No public internet exposure (only via Tailscale)
---
## Quick Start
**New to n8n?** Start here: **[N8N-INTEGRATIONS.md](N8N-INTEGRATIONS.md)** ⭐
This guide includes:
- ✅ Network access verification
- ✅ Credential setup for all homelab services
- ✅ 10 ready-to-use starter workflows
- ✅ Home Assistant, Prometheus, Syncthing, Telegram integrations
- ✅ Troubleshooting tips
---
## Related Documentation
- [n8n Homelab Integrations Guide](N8N-INTEGRATIONS.md) - **START HERE**
- [docker-host2 VM details](VMS.md)
- [Traefik reverse proxy](TRAEFIK.md)
- [IP Assignments](IP-ASSIGNMENTS.md)
- [Pulse Setup](PULSE-SETUP.md)
**Last Updated:** 2025-12-26

509
POWER-MANAGEMENT.md Normal file
View File

@@ -0,0 +1,509 @@
# Power Management and Optimization
Documentation of power optimizations applied to reduce idle power consumption and heat generation.
## Overview
Combined estimated power draw: **~1000-1350W under typical load**, **500-700W idle** (the full-load figures in the tables below run higher).
Through various optimizations, we've reduced idle power consumption by approximately **150-250W** compared to default settings.
---
## Power Draw Estimates
### PVE (10.10.10.120)
| Component | Idle | Load | TDP |
|-----------|------|------|-----|
| Threadripper PRO 3975WX | 150-200W | 400-500W | 280W |
| NVIDIA TITAN RTX | 2-3W | 250W | 280W |
| NVIDIA Quadro P2000 | 25W | 70W | 75W |
| RAM (128 GB DDR4) | 30-40W | 30-40W | - |
| Storage (NVMe + SSD) | 20-30W | 40-50W | - |
| HBAs, fans, misc | 20-30W | 20-30W | - |
| **Total** | **250-350W** | **800-940W** | - |
### PVE2 (10.10.10.102)
| Component | Idle | Load | TDP |
|-----------|------|------|-----|
| Threadripper PRO 3975WX | 150-200W | 400-500W | 280W |
| NVIDIA RTX A6000 | 11W | 280W | 300W |
| RAM (128 GB DDR4) | 30-40W | 30-40W | - |
| Storage (NVMe + HDD) | 20-30W | 40-50W | - |
| Fans, misc | 15-20W | 15-20W | - |
| **Total** | **226-330W** | **765-890W** | - |
### Combined
| Metric | Idle | Load |
|--------|------|------|
| Servers | 476-680W | 1565-1830W |
| Network gear | ~50W | ~50W |
| **Total** | **~530-730W** | **~1615-1880W** |
| **UPS Load** | 40-55% | 120-140% ⚠️ |
**Note**: UPS capacity is 1320W. Under heavy load, servers can exceed UPS capacity, which is acceptable since high load is rare.
---
## Optimizations Applied
### 1. KSMD Disabled (2024-12-17)
**KSM** (Kernel Same-page Merging) scans memory to deduplicate identical pages across VMs.
**Problem**:
- KSMD was consuming 44-57% CPU continuously on PVE
- Caused CPU temp to rise from 74°C to 83°C
- **Net loss**: more power was spent scanning than was saved by page deduplication
**Solution**: Disabled KSM permanently
**Configuration**:
**Systemd service**: `/etc/systemd/system/disable-ksm.service`
```ini
[Unit]
Description=Disable KSM (Kernel Same-page Merging)
After=multi-user.target
[Service]
Type=oneshot
ExecStart=/bin/sh -c 'echo 0 > /sys/kernel/mm/ksm/run'
RemainAfterExit=yes
[Install]
WantedBy=multi-user.target
```
**Enable and start**:
```bash
systemctl daemon-reload
systemctl enable --now disable-ksm
systemctl mask ksmtuned # Prevent re-enabling
```
**Verify**:
```bash
# KSM should be disabled (run=0)
cat /sys/kernel/mm/ksm/run # Should output: 0
# ksmd should show 0% CPU
ps aux | grep ksmd
```
**Savings**: ~60-80W, plus a significant temperature reduction (CPU Tctl dropped from 83°C back to ~74°C)
**⚠️ Important**: Proxmox updates sometimes re-enable KSM. If CPU is unexpectedly hot, check:
```bash
cat /sys/kernel/mm/ksm/run
# If 1, disable it:
echo 0 > /sys/kernel/mm/ksm/run
systemctl mask ksmtuned
```
---
### 2. CPU Governor Optimization (2024-12-16)
Default CPU governor keeps cores at max frequency even when idle, wasting power.
#### PVE: `amd-pstate-epp` Driver
**Driver**: `amd-pstate-epp` (modern AMD P-state driver)
**Governor**: `powersave`
**EPP**: `balance_power`
**Configuration**:
**Systemd service**: `/etc/systemd/system/cpu-powersave.service`
```ini
[Unit]
Description=Set CPU governor to powersave with balance_power EPP
After=multi-user.target
[Service]
Type=oneshot
ExecStart=/bin/sh -c 'for cpu in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo powersave > $cpu; done'
ExecStart=/bin/sh -c 'for cpu in /sys/devices/system/cpu/cpu*/cpufreq/energy_performance_preference; do echo balance_power > $cpu; done'
RemainAfterExit=yes
[Install]
WantedBy=multi-user.target
```
**Enable**:
```bash
systemctl daemon-reload
systemctl enable --now cpu-powersave
```
**Verify**:
```bash
# Check governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# Output: powersave
# Check EPP
cat /sys/devices/system/cpu/cpu0/cpufreq/energy_performance_preference
# Output: balance_power
# Check current frequency (should be low when idle)
grep MHz /proc/cpuinfo | head -5
# Should show ~1700-2200 MHz idle, up to 4000 MHz under load
```
#### PVE2: `acpi-cpufreq` Driver
**Driver**: `acpi-cpufreq` (older ACPI driver)
**Governor**: `schedutil` (adaptive, better than powersave for this driver)
**Configuration**:
**Systemd service**: `/etc/systemd/system/cpu-powersave.service`
```ini
[Unit]
Description=Set CPU governor to schedutil
After=multi-user.target
[Service]
Type=oneshot
ExecStart=/bin/sh -c 'for cpu in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo schedutil > $cpu; done'
RemainAfterExit=yes
[Install]
WantedBy=multi-user.target
```
**Enable**:
```bash
systemctl daemon-reload
systemctl enable --now cpu-powersave
```
**Verify**:
```bash
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# Output: schedutil
grep MHz /proc/cpuinfo | head -5
# Should show ~1700-2200 MHz idle
```
**Savings**: ~60-120W combined (CPUs now idle at 1.7-2.2 GHz instead of 4 GHz)
**Performance impact**: Minimal - CPU still boosts to max frequency under load
---
### 3. GPU Power States (2024-12-16)
GPUs automatically enter low-power states when idle. Verified optimal.
| GPU | Location | Idle Power | P-State | Notes |
|-----|----------|------------|---------|-------|
| RTX A6000 | PVE2 | 11W | P8 | Excellent idle power |
| TITAN RTX | PVE | 2-3W | P8 | Excellent idle power |
| Quadro P2000 | PVE | 25W | P0 | Plex keeps it active |
**Check GPU power state**:
```bash
# Via nvidia-smi (if installed in VM)
ssh lmdev1 'nvidia-smi --query-gpu=name,power.draw,pstate --format=csv'
# Expected output:
# name, power.draw [W], pstate
# NVIDIA TITAN RTX, 2.50 W, P8
# Via lspci (from Proxmox host - shows link speed, not power)
ssh pve 'lspci | grep -i nvidia'
```
**P-States**:
- **P0**: Maximum performance
- **P8**: Minimum power (idle)
**No action needed** - GPUs automatically manage power states.
**Savings**: N/A (already optimal)
---
### 4. Syncthing Rescan Intervals (2024-12-16)
Aggressive 60-second rescans were keeping TrueNAS VM at 86% CPU constantly.
**Changed**:
- Large folders: 60s → **3600s** (1 hour)
- Affected: downloads (38GB), documents (11GB), desktop (7.2GB), movies, pictures, notes, config
**Configuration**: Via Syncthing UI on each device
- Settings → Folders → [Folder Name] → Advanced → Rescan Interval
**Savings**: ~60-80W (TrueNAS CPU usage dropped from 86% to <10%)
**Trade-off**: Changes take up to 1 hour to detect instead of 1 minute
- Still acceptable for most use cases
- Manual rescan available if needed: `curl -X POST "http://localhost:8384/rest/db/scan?folder=FOLDER" -H "X-API-Key: API_KEY"`
---
### 5. ksmtuned Disabled (2024-12-16)
**ksmtuned** is the daemon that tunes KSM parameters. Even with KSM disabled, the tuning daemon was still running.
**Solution**: Stopped and disabled on both servers
```bash
systemctl stop ksmtuned
systemctl disable ksmtuned
systemctl mask ksmtuned # Prevent re-enabling
```
**Savings**: ~2-5W
---
### 6. HDD Spindown on PVE2 (2024-12-16)
**Problem**: `local-zfs2` pool (2x WD Red 6TB HDD) had only 768 KB used but drives spinning 24/7
**Solution**: Configure 30-minute spindown timeout
**Udev rule**: `/etc/udev/rules.d/69-hdd-spindown.rules`
```udev
# Spin down WD Red 6TB drives after 30 minutes idle
ACTION=="add|change", KERNEL=="sd[a-z]", ATTRS{model}=="WDC WD60EFRX-68L*", RUN+="/sbin/hdparm -S 241 /dev/%k"
```
**hdparm value**: 241 = 30 minutes
- Values 1-240 set the timeout in multiples of 5 seconds (e.g. 240 = 20 minutes)
- Values 241-251 set it in multiples of 30 minutes, so 241 = 1 × 30 minutes
**Apply rule**:
```bash
udevadm control --reload-rules
udevadm trigger
# Verify drives have spindown set
hdparm -I /dev/sda | grep -i standby
hdparm -I /dev/sdb | grep -i standby
```
**Check if drives are spun down**:
```bash
hdparm -C /dev/sda
# Output: drive state is: standby (spun down)
# or: drive state is: active/idle (spinning)
```
**Savings**: ~10-16W when spun down (8W per drive)
**Trade-off**: 5-10 second delay when accessing pool after spindown
---
## Potential Optimizations (Not Yet Applied)
### PCIe ASPM (Active State Power Management)
**Benefit**: Reduce power of idle PCIe devices
**Risk**: May cause stability issues with some devices
**Estimated savings**: 5-15W
**Test**:
```bash
# Check current ASPM state
lspci -vv | grep -i aspm
# Enable ASPM (test first)
# Add to kernel cmdline: pcie_aspm=force
# Edit /etc/default/grub:
GRUB_CMDLINE_LINUX_DEFAULT="quiet pcie_aspm=force"
# Update grub
update-grub
reboot
```
### NMI Watchdog Disable
**Benefit**: Reduce CPU wakeups
**Risk**: Harder to debug kernel hangs
**Estimated savings**: 1-3W
**Test**:
```bash
# Disable NMI watchdog
echo 0 > /proc/sys/kernel/nmi_watchdog
# Make permanent (add to kernel cmdline)
# Edit /etc/default/grub:
GRUB_CMDLINE_LINUX_DEFAULT="quiet nmi_watchdog=0"
update-grub
reboot
```
---
## Monitoring
### CPU Frequency
```bash
# Current frequency on all cores
ssh pve 'grep MHz /proc/cpuinfo | head -10'
# Governor
ssh pve 'cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor'
# Available governors
ssh pve 'cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors'
```
### CPU Temperature
```bash
# PVE
ssh pve 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE Tctl: $(($(cat $f)/1000))°C"; fi; done'
# PVE2
ssh pve2 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE2 Tctl: $(($(cat $f)/1000))°C"; fi; done'
```
**Healthy temps**: 70-80°C under load
**Warning**: >85°C
**Throttle**: 90°C (Tctl max for Threadripper PRO)
### GPU Power Draw
```bash
# If nvidia-smi installed in VM
ssh lmdev1 'nvidia-smi --query-gpu=name,power.draw,power.limit,pstate --format=csv'
# Sample output:
# name, power.draw [W], power.limit [W], pstate
# NVIDIA TITAN RTX, 2.50 W, 280.00 W, P8
```
### Power Consumption (UPS)
```bash
# Check UPS load percentage
ssh pve 'upsc cyberpower@localhost ups.load'
# Battery runtime (seconds)
ssh pve 'upsc cyberpower@localhost battery.runtime'
# Full UPS status
ssh pve 'upsc cyberpower@localhost'
```
See [UPS.md](UPS.md) for more UPS monitoring details.
### ZFS ARC Memory Usage
```bash
# PVE
ssh pve 'arc_summary | grep -A5 "ARC size"'
# TrueNAS
ssh truenas 'arc_summary | grep -A5 "ARC size"'
```
**ARC** (Adaptive Replacement Cache) uses RAM for ZFS caching. Adjust if needed:
```bash
# Limit ARC to 32 GB (example)
# Edit /etc/modprobe.d/zfs.conf:
options zfs zfs_arc_max=34359738368
# Apply (reboot required)
update-initramfs -u
reboot
```
---
## Troubleshooting
### CPU Not Downclocking
```bash
# Check current governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# Should be: powersave (PVE) or schedutil (PVE2)
# If not, systemd service may have failed
# Check service status
systemctl status cpu-powersave
# Manually set governor (temporary)
echo powersave | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
# Check frequency
grep MHz /proc/cpuinfo | head -5
```
### High Idle Power After Update
**Common causes**:
1. **KSM re-enabled** after Proxmox update
- Check: `cat /sys/kernel/mm/ksm/run`
- Fix: `echo 0 > /sys/kernel/mm/ksm/run && systemctl mask ksmtuned`
2. **CPU governor reset** to default
- Check: `cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor`
- Fix: `systemctl restart cpu-powersave`
3. **GPU stuck in high-performance mode**
- Check: `nvidia-smi --query-gpu=pstate --format=csv`
- Fix: Restart VM or power cycle GPU
### HDDs Won't Spin Down
```bash
# Check spindown setting
hdparm -I /dev/sda | grep -i standby
# Set spindown manually (temporary)
hdparm -S 241 /dev/sda
# Check if drive is idle (ZFS may keep it active)
zpool iostat -v 1 5 # Watch for activity
# Check what's accessing the drive
lsof | grep /mnt/pool
```
---
## Power Optimization Summary
| Optimization | Savings | Applied | Notes |
|--------------|---------|---------|-------|
| **KSMD disabled** | 60-80W | ✅ | Also reduces CPU temp significantly |
| **CPU governor** | 60-120W | ✅ | PVE: powersave+balance_power, PVE2: schedutil |
| **GPU power states** | 0W | ✅ | Already optimal (automatic) |
| **Syncthing rescans** | 60-80W | ✅ | Reduced TrueNAS CPU usage |
| **ksmtuned disabled** | 2-5W | ✅ | Minor but easy win |
| **HDD spindown** | 10-16W | ✅ | Only when drives idle |
| PCIe ASPM | 5-15W | ❌ | Not yet tested |
| NMI watchdog | 1-3W | ❌ | Not yet tested |
| **Total savings** | **~150-300W** | - | Significant reduction |
---
## Related Documentation
- [UPS.md](UPS.md) - UPS capacity and power monitoring
- [STORAGE.md](STORAGE.md) - HDD spindown configuration
- [VMS.md](VMS.md) - VM resource allocation
---
**Last Updated**: 2025-12-22

69
PULSE-SETUP.md Normal file
View File

@@ -0,0 +1,69 @@
# Add n8n and docker-host2 to Pulse Monitoring
Pulse automatically monitors based on Prometheus targets, but you can also add custom HTTP monitors.
## Quick Steps
1. Open **https://pulse.htsn.io** in your browser
2. Login if required
3. Click **"+ Add Monitor"** or **"New Monitor"**
---
## Monitor: n8n
| Field | Value |
|-------|-------|
| **Name** | n8n Workflow Automation |
| **URL** | https://n8n.htsn.io |
| **Check Interval** | 60 seconds |
| **Monitor Type** | HTTP/HTTPS |
| **Expected Status** | 200 |
| **Timeout** | 10 seconds |
| **Alert After** | 2 failed checks |
---
## Monitor: docker-host2
| Field | Value |
|-------|-------|
| **Name** | docker-host2 (node_exporter) |
| **URL** | http://10.10.10.207:9100/metrics |
| **Check Interval** | 60 seconds |
| **Monitor Type** | HTTP |
| **Expected Status** | 200 |
| **Expected Content** | `node_exporter` |
| **Timeout** | 5 seconds |
| **Alert After** | 2 failed checks |
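Before adding the monitor, the endpoint can be verified from any LAN host (the status code should be 200 and the body should contain the expected `node_exporter` string):
```bash
# HTTP status of the metrics endpoint
curl -s -o /dev/null -w '%{http_code}\n' http://10.10.10.207:9100/metrics

# Confirm the expected content string is present
curl -s http://10.10.10.207:9100/metrics | grep -m1 node_exporter
```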
---
## Optional: docker-host2 SSH
| Field | Value |
|-------|-------|
| **Name** | docker-host2 SSH |
| **Host** | 10.10.10.207 |
| **Port** | 22 |
| **Monitor Type** | TCP Port |
| **Check Interval** | 60 seconds |
| **Timeout** | 5 seconds |
---
## Verification
After adding monitors, you should see:
- ✅ Green status for both monitors
- Response time graphs
- Uptime percentage
- Alert history (should be empty)
Access Pulse dashboard: **https://pulse.htsn.io**
---
**Note:** Pulse may already be monitoring these services via Prometheus integration. Check existing monitors before adding duplicates.
**Last Updated:** 2025-12-27

149
README.md Normal file
View File

@@ -0,0 +1,149 @@
# Homelab Documentation
Documentation for Hutson's home infrastructure - two Proxmox servers running VMs and containers for home automation, media, development, and AI workloads.
## 🚀 Quick Start
**New to this homelab?** Start here:
1. [CLAUDE.md](CLAUDE.md) - Quick reference guide for common tasks
2. [SSH-ACCESS.md](SSH-ACCESS.md) - How to connect to all systems
3. [IP-ASSIGNMENTS.md](IP-ASSIGNMENTS.md) - What's at what IP address
4. [SERVICES.md](SERVICES.md) - What services are running
**Claude Code Session?** Read [CLAUDE.md](CLAUDE.md) first - it's your command center.
## 📚 Documentation Index
### Infrastructure
| Document | Description |
|----------|-------------|
| [GATEWAY.md](GATEWAY.md) | UniFi gateway monitoring, watchdog services, troubleshooting |
| [VMS.md](VMS.md) | Complete VM/LXC inventory, specs, GPU passthrough |
| [HARDWARE.md](HARDWARE.md) | Server specs, GPUs, network cards, HBAs |
| [STORAGE.md](STORAGE.md) | ZFS pools, NFS/SMB shares, capacity planning |
| [NETWORK.md](NETWORK.md) | Bridges, VLANs, MTU config, Tailscale VPN |
| [POWER-MANAGEMENT.md](POWER-MANAGEMENT.md) | CPU governors, GPU power states, optimizations |
| [UPS.md](UPS.md) | UPS configuration, NUT monitoring, power failure handling |
### Services & Applications
| Document | Description |
|----------|-------------|
| [SERVICES.md](SERVICES.md) | Complete service inventory with URLs and credentials |
| [TRAEFIK.md](TRAEFIK.md) | Reverse proxy setup, adding services, SSL certificates |
| [HOMEASSISTANT.md](HOMEASSISTANT.md) | Home Assistant API, automations, integrations |
| [SYNCTHING.md](SYNCTHING.md) | File sync across all devices, API access, troubleshooting |
| [SALTBOX.md](#) | Media automation stack (Plex, *arr apps) (coming soon) |
### Access & Security
| Document | Description |
|----------|-------------|
| [SSH-ACCESS.md](SSH-ACCESS.md) | SSH keys, host aliases, password auth, QEMU agent |
| [IP-ASSIGNMENTS.md](IP-ASSIGNMENTS.md) | Complete IP address assignments for all devices |
| [SECURITY.md](#) | Firewall, access control, certificates (coming soon) |
### Operations
| Document | Description |
|----------|-------------|
| [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md) | 🚨 Backup strategy, disaster recovery (CRITICAL) |
| [MAINTENANCE.md](MAINTENANCE.md) | Regular procedures, update schedules, testing checklists |
| [MONITORING.md](MONITORING.md) | Health monitoring, alerts, dashboard recommendations |
| [DISASTER-RECOVERY.md](#) | Recovery procedures (coming soon) |
### Reference
| Document | Description |
|----------|-------------|
| [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md) | Storage enclosure SES commands, LCC troubleshooting |
| [SHELL-ALIASES.md](SHELL-ALIASES.md) | ZSH aliases for Claude Code sessions |
## 🖥️ System Overview
### Servers
- **PVE** (10.10.10.120) - Primary Proxmox server
- AMD Threadripper PRO 3975WX (32-core)
- 128 GB RAM
- NVIDIA Quadro P2000 + TITAN RTX
- **PVE2** (10.10.10.102) - Secondary Proxmox server
- AMD Threadripper PRO 3975WX (32-core)
- 128 GB RAM
- NVIDIA RTX A6000
### Key Services
| Service | Location | URL |
|---------|----------|-----|
| **Proxmox** | PVE | https://pve.htsn.io |
| **TrueNAS** | VM 100 | https://truenas.htsn.io |
| **Plex** | Saltbox VM | https://plex.htsn.io |
| **Home Assistant** | VM 110 | https://homeassistant.htsn.io |
| **Gitea** | VM 300 | https://git.htsn.io |
| **Pi-hole** | CT 200 | http://10.10.10.10/admin |
| **Traefik** | CT 202 | http://10.10.10.250:8080 |
[See IP-ASSIGNMENTS.md for complete list](IP-ASSIGNMENTS.md)
## 🔥 Emergency Procedures
### Power Failure
1. UPS provides ~15 min runtime at typical load
2. At 2 min remaining, NUT triggers graceful VM shutdown
3. When power returns, servers auto-boot and start VMs in order
See [UPS.md](UPS.md) for details.
### Service Down
```bash
# Quick health check (run from Mac Mini)
ssh pve 'qm list' # Check VMs on PVE
ssh pve2 'qm list' # Check VMs on PVE2
ssh pve 'pct list' # Check containers
# Syncthing status
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
"http://127.0.0.1:8384/rest/system/connections"
# Restart a VM
ssh pve 'qm stop VMID && qm start VMID'
```
See [CLAUDE.md](CLAUDE.md) for complete troubleshooting runbooks.
## 📞 Getting Help
**Claude Code Assistant**: Start a session in this directory - all context is available in CLAUDE.md
**Key Contacts**:
- Homelab Owner: Hutson
- Git Repo: https://git.htsn.io/hutson/homelab-docs
- Local Path: `~/Projects/homelab`
## 🔄 Recent Changes
See [CHANGELOG.md](#) (coming soon) or the Changelog section in [CLAUDE.md](CLAUDE.md).
## 📝 Contributing
When updating docs:
1. Keep CLAUDE.md as quick reference only
2. Move detailed content to specialized docs
3. Update cross-references
4. Test all commands before committing
5. Add entries to changelog
```bash
cd ~/Projects/homelab
git add -A
git commit -m "Update documentation: <description>"
git push
```
---
**Last Updated**: 2026-01-02

591
SERVICES.md Normal file
View File

@@ -0,0 +1,591 @@
# Services Inventory
Complete inventory of all services running across the homelab infrastructure.
## Overview
| Category | Services | Location | Access |
|----------|----------|----------|--------|
| **Infrastructure** | Proxmox, TrueNAS, Pi-hole, Traefik | VMs/CTs | Web UI + SSH |
| **Media** | Plex, *arr apps, downloaders | Saltbox VM | Web UI |
| **Development** | Gitea, Docker services | VMs | Web UI |
| **Home Automation** | Home Assistant, Happy Coder | VMs | Web UI + API |
| **Monitoring** | UPS (NUT), Syncthing, Pulse | Various | API |
**Total Services**: 25+ running services
---
## Service URLs Quick Reference
| Service | URL | Authentication | Purpose |
|---------|-----|----------------|---------|
| **Proxmox** | https://pve.htsn.io:8006 | Username + 2FA | VM management |
| **TrueNAS** | https://truenas.htsn.io | Username/password | NAS management |
| **Plex** | https://plex.htsn.io | Plex account | Media streaming |
| **Home Assistant** | https://homeassistant.htsn.io | Username/password | Home automation |
| **Gitea** | https://git.htsn.io | Username/password | Git repositories |
| **Excalidraw** | https://excalidraw.htsn.io | None (public) | Whiteboard |
| **Happy Coder** | https://happy.htsn.io | QR code auth | Remote Claude sessions |
| **Pi-hole** | http://10.10.10.10/admin | Password | DNS/ad blocking |
| **Traefik** | http://10.10.10.250:8080 | None (internal) | Reverse proxy dashboard |
| **Pulse** | https://pulse.htsn.io | Unknown | Monitoring dashboard |
| **Copyparty** | https://copyparty.htsn.io | Unknown | File sharing |
| **FindShyt** | https://findshyt.htsn.io | Unknown | Custom app |
---
## Infrastructure Services
### Proxmox VE (PVE & PVE2)
**Purpose**: Virtualization platform, VM/CT host
**Location**: Physical servers (10.10.10.120, 10.10.10.102)
**Access**: https://pve.htsn.io:8006, SSH
**Version**: Unknown (check: `pveversion`)
**Key Features**:
- Web-based management
- VM and LXC container support
- ZFS storage pools
- Clustering (2-node)
- API access
**Common Operations**:
```bash
# List VMs
ssh pve 'qm list'
# Create VM
ssh pve 'qm create VMID --name myvm ...'
# Backup VM
ssh pve 'vzdump VMID --dumpdir /var/lib/vz/dump'
```
**See**: [VMS.md](VMS.md)
---
### TrueNAS SCALE (VM 100)
**Purpose**: Central file storage, NFS/SMB shares
**Location**: VM on PVE (10.10.10.200)
**Access**: https://truenas.htsn.io, SSH
**Version**: TrueNAS SCALE (check version in UI)
**Key Features**:
- ZFS storage management
- NFS exports
- SMB shares
- Syncthing hub
- Snapshot management
**Storage Pools**:
- `vault`: Main data pool on EMC enclosure
**Shares** (needs documentation):
- NFS exports for Saltbox media
- SMB shares for Windows access
- Syncthing sync folders
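Until the shares are documented, a quick read-only check:
```bash
# Enumerate current NFS exports and SMB shares on TrueNAS (read-only)
ssh truenas 'showmount -e localhost'
ssh truenas 'smbclient -L localhost -N'
```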
**See**: [STORAGE.md](STORAGE.md)
---
### Pi-hole (CT 200)
**Purpose**: Network-wide DNS server and ad blocker
**Location**: LXC on PVE (10.10.10.10)
**Access**: http://10.10.10.10/admin
**Version**: Unknown
**Configuration**:
- **Upstream DNS**: Cloudflare (1.1.1.1)
- **Blocklists**: Unknown count
- **Queries**: All network DNS traffic
- **DHCP**: Disabled (router handles DHCP)
**Stats** (example):
```bash
ssh pihole 'pihole -c -e' # Stats
ssh pihole 'pihole status' # Status
```
**Common Tasks**:
- Update blocklists: `ssh pihole 'pihole -g'`
- Whitelist domain: `ssh pihole 'pihole -w example.com'`
- View logs: `ssh pihole 'pihole -t'`
---
### Traefik (CT 202)
**Purpose**: Reverse proxy for all public-facing services
**Location**: LXC on PVE (10.10.10.250)
**Access**: http://10.10.10.250:8080/dashboard/
**Version**: Unknown (check: `traefik version`)
**Managed Services**:
- All *.htsn.io domains (except Saltbox services)
- SSL/TLS certificates via Let's Encrypt
- HTTP → HTTPS redirects
**See**: [TRAEFIK.md](TRAEFIK.md) for complete configuration
---
## Media Services (Saltbox VM)
All media services run in Docker on the Saltbox VM (10.10.10.100).
### Plex Media Server
**Purpose**: Media streaming platform
**URL**: https://plex.htsn.io
**Access**: Plex account
**Features**:
- Hardware transcoding (TITAN RTX)
- Libraries: Movies, TV, Music
- Remote access enabled
- Managed by Saltbox
**Media Storage**:
- Source: TrueNAS NFS mounts
- Location: `/mnt/unionfs/`
**Common Tasks**:
```bash
# View Plex status
ssh saltbox 'docker logs -f plex'
# Restart Plex
ssh saltbox 'docker restart plex'
# Scan library
# (via Plex UI: Settings → Library → Scan)
```
---
### *arr Apps (Media Automation)
Running on Saltbox VM, managed via Traefik-Saltbox.
| Service | Purpose | URL | Notes |
|---------|---------|-----|-------|
| **Sonarr** | TV show automation | sonarr.htsn.io | Monitors, downloads, organizes TV |
| **Radarr** | Movie automation | radarr.htsn.io | Monitors, downloads, organizes movies |
| **Lidarr** | Music automation | lidarr.htsn.io | Monitors, downloads, organizes music |
| **Overseerr** | Request management | overseerr.htsn.io | User requests for media |
| **Bazarr** | Subtitle management | bazarr.htsn.io | Downloads subtitles |
**Downloaders**:
| Service | Purpose | URL |
|---------|---------|-----|
| **SABnzbd** | Usenet downloader | sabnzbd.htsn.io |
| **NZBGet** | Usenet downloader | nzbget.htsn.io |
| **qBittorrent** | Torrent client | qbittorrent.htsn.io |
**Indexers**:
| Service | Purpose | URL |
|---------|---------|-----|
| **Jackett** | Torrent indexer proxy | jackett.htsn.io |
| **NZBHydra2** | Usenet indexer proxy | nzbhydra2.htsn.io |
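To verify the stack is up without opening each UI, a quick Docker check (container names are assumed to match the service names above):
```bash
# List running media containers on the Saltbox VM
ssh saltbox 'docker ps --format "{{.Names}}: {{.Status}}" | grep -Ei "sonarr|radarr|lidarr|bazarr|overseerr|sabnzbd|nzbget|qbittorrent|jackett|nzbhydra"'
```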
---
### Supporting Media Services
| Service | Purpose | URL |
|---------|---------|-----|
| **Tautulli** | Plex statistics | tautulli.htsn.io |
| **Organizr** | Service dashboard | organizr.htsn.io |
| **Authelia** | SSO authentication | auth.htsn.io |
---
## Development Services
### Gitea (VM 300)
**Purpose**: Self-hosted Git server
**Location**: VM on PVE2 (10.10.10.220)
**URL**: https://git.htsn.io
**Access**: Username/password
**Repositories**:
- homelab-docs (this documentation)
- Personal projects
- Private repos
**Common Tasks**:
```bash
# SSH to Gitea VM
ssh gitea-vm
# View logs
ssh gitea-vm 'journalctl -u gitea -f'
# Backup
ssh gitea-vm 'gitea dump -c /etc/gitea/app.ini'
```
**See**: Gitea documentation for API usage
---
### Docker Services (docker-host VM)
Running on VM 206 (10.10.10.206).
| Service | URL | Purpose | Port |
|---------|-----|---------|------|
| **Excalidraw** | https://excalidraw.htsn.io | Whiteboard/diagramming | 8080 |
| **Happy Server** | https://happy.htsn.io | Happy Coder relay | 3002 |
| **Pulse** | https://pulse.htsn.io | Monitoring dashboard | 7655 |
**Docker Compose files**: `/opt/{excalidraw,happy-server,pulse}/docker-compose.yml`
**Managing services**:
```bash
ssh docker-host 'docker ps'
ssh docker-host 'cd /opt/excalidraw && sudo docker-compose logs -f'
ssh docker-host 'cd /opt/excalidraw && sudo docker-compose restart'
```
---
## Home Automation
### Home Assistant (VM 110)
**Purpose**: Smart home automation platform
**Location**: VM on PVE (10.10.10.110)
**URL**: https://homeassistant.htsn.io
**Access**: Username/password
**Integrations**:
- UPS monitoring (NUT sensors)
- Unknown other integrations (needs documentation)
**Sensors**:
- `sensor.cyberpower_battery_charge`
- `sensor.cyberpower_load`
- `sensor.cyberpower_battery_runtime`
- `sensor.cyberpower_status`
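To enumerate what is actually configured, the Home Assistant REST API can be queried. This is a sketch assuming a long-lived access token (created in the HA user profile) exported as `HASS_TOKEN`:
```bash
# List entity IDs known to Home Assistant (the token is an assumption, not documented here)
curl -s -H "Authorization: Bearer $HASS_TOKEN" \
  -H "Content-Type: application/json" \
  https://homeassistant.htsn.io/api/states | jq -r '.[].entity_id' | sort | head -40
```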
**See**: [HOMEASSISTANT.md](HOMEASSISTANT.md)
---
### Happy Coder Relay (docker-host)
**Purpose**: Self-hosted relay server for Happy Coder mobile app
**Location**: docker-host (10.10.10.206)
**URL**: https://happy.htsn.io
**Access**: QR code authentication
**Stack**:
- Happy Server (Node.js)
- PostgreSQL (user/session data)
- Redis (real-time events)
- MinIO (file/image storage)
**Clients**:
- Mac Mini (Happy daemon)
- Mobile app (iOS/Android)
**Credentials**:
- Master Secret: `3ccbfd03a028d3c278da7d2cf36d99b94cd4b1fecabc49ab006e8e89bc7707ac`
- PostgreSQL: `happy` / `happypass`
- MinIO: `happyadmin` / `happyadmin123`
---
## File Sync & Storage
### Syncthing
**Purpose**: File synchronization across all devices
**Devices**:
- Mac Mini (10.10.10.125) - Hub
- MacBook - Mobile sync
- TrueNAS (10.10.10.200) - Central storage
- Windows PC (10.10.10.150) - Windows sync
- Phone (10.10.10.54) - Mobile sync
**API Keys**:
- Mac Mini: `oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5`
- MacBook: `qYkNdVLwy9qZZZ6MqnJr7tHX7KKdxGMJ`
- Phone: `Xxz3jDT4akUJe6psfwZsbZwG2LhfZuDM`
**Synced Folders**:
- documents (~11 GB)
- downloads (~38 GB)
- pictures
- notes
- desktop (~7.2 GB)
- config
- movies
**See**: [SYNCTHING.md](SYNCTHING.md)
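Per-folder sync progress can also be queried via the REST API. A sketch using the Mac Mini key (the folder ID `documents` is an assumption; list real IDs via `/rest/config/folders`):
```bash
# Sync completion for one folder on the Mac Mini instance (100 = fully in sync)
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
  "http://127.0.0.1:8384/rest/db/completion?folder=documents" | jq '.completion'
```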
---
### Copyparty (VM 201)
**Purpose**: Simple HTTP file sharing
**Location**: VM on PVE (10.10.10.201)
**URL**: https://copyparty.htsn.io
**Access**: Unknown
**Features**:
- Web-based file upload/download
- Lightweight
---
## Trading & AI Services
### AI Trading Platform (trading-vm)
**Purpose**: Algorithmic trading with AI models
**Location**: VM 301 on PVE2 (10.10.10.221)
**URL**: https://aitrade.htsn.io (if accessible)
**GPU**: RTX A6000 (48GB VRAM)
**Components**:
- Trading algorithms
- AI models for market prediction
- Real-time data feeds
- Backtesting infrastructure
**Access**: SSH only (no web UI documented)
---
### LM Dev (lmdev1)
**Purpose**: AI/LLM development environment
**Location**: VM 111 on PVE (10.10.10.111)
**URL**: https://lmdev.htsn.io (if accessible)
**GPU**: TITAN RTX (shared with Saltbox)
**Installed**:
- CUDA toolkit
- Python 3.11+
- PyTorch, TensorFlow
- Hugging Face transformers
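A quick check that the GPU is currently visible inside the VM (assumes the NVIDIA driver is installed alongside CUDA):
```bash
# Show GPU name, memory, and current utilization on lmdev1
ssh lmdev1 'nvidia-smi --query-gpu=name,memory.total,utilization.gpu --format=csv'
```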
---
## Monitoring & Utilities
### UPS Monitoring (NUT)
**Purpose**: Monitor UPS status and trigger shutdowns
**Location**: PVE (master), PVE2 (slave)
**Access**: Command-line (`upsc`)
**Key Commands**:
```bash
ssh pve 'upsc cyberpower@localhost'
ssh pve 'upsc cyberpower@localhost ups.load'
ssh pve 'upsc cyberpower@localhost battery.runtime'
```
**Home Assistant Integration**: UPS sensors exposed
**See**: [UPS.md](UPS.md)
---
### Pulse Monitoring
**Purpose**: Unknown monitoring dashboard
**Location**: docker-host (10.10.10.206:7655)
**URL**: https://pulse.htsn.io
**Access**: Unknown
**Needs documentation**:
- What does it monitor?
- How to configure?
- Authentication?
---
### Tailscale VPN
**Purpose**: Secure remote access to homelab
**Subnet Routers**:
- PVE (100.113.177.80) - Primary
- UCG-Fiber (100.94.246.32) - Failover
**Devices on Tailscale**:
- Mac Mini: 100.108.89.58
- PVE: 100.113.177.80
- TrueNAS: 100.100.94.71
- Pi-hole: 100.112.59.128
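From any Tailscale-joined machine (e.g. the Mac Mini), connectivity can be verified with the standard CLI:
```bash
tailscale status                  # list peers and their connection state
tailscale ping 100.113.177.80     # verify a direct (non-relayed) path to PVE
```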
**See**: [NETWORK.md](NETWORK.md)
---
## Custom Applications
### FindShyt (CT 205)
**Purpose**: Unknown custom application
**Location**: LXC on PVE (10.10.10.8)
**URL**: https://findshyt.htsn.io
**Access**: Unknown
**Needs documentation**:
- What is this app?
- How to use it?
- Tech stack?
---
## Service Dependencies
### Critical Dependencies
```
TrueNAS
├── Plex (media files via NFS)
├── *arr apps (downloads via NFS)
├── Syncthing (central storage hub)
└── Backups (if configured)
Traefik (CT 202)
├── All *.htsn.io services
└── SSL certificate management
Pi-hole
└── DNS for entire network
Router
└── Gateway for all services
```
### Startup Order
**See [VMS.md](VMS.md)** for VM boot order configuration:
1. TrueNAS (storage first)
2. Saltbox (depends on TrueNAS NFS)
3. Other VMs
4. Containers
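A hedged sketch of how this ordering is set on Proxmox (the order and delay values shown are illustrative; confirm the real ones in VMS.md):
```bash
# Boot order: order=N sets sequence, up=seconds to wait before starting the next guest
ssh pve 'qm set 100 --startup order=1,up=30'   # TrueNAS boots first
ssh pve 'qm set 101 --startup order=2,up=60'   # Saltbox after storage is up
ssh pve 'qm config 100 | grep startup'         # verify the current setting
```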
---
## Service Port Reference
### Well-Known Ports
| Port | Service | Protocol | Purpose |
|------|---------|----------|---------|
| 22 | SSH | TCP | Remote access |
| 53 | Pi-hole | UDP | DNS queries |
| 80 | Traefik | TCP | HTTP (redirects to 443) |
| 443 | Traefik | TCP | HTTPS |
| 3000 | Gitea | TCP | Git HTTP/S |
| 8006 | Proxmox | TCP | Web UI |
| 8384 | Syncthing | TCP | Web UI |
| 22000 | Syncthing | TCP | Sync protocol |
| 32400 | Plex | TCP | Plex Media Server |
### Internal Ports
| Port | Service | Purpose |
|------|---------|---------|
| 3002 | Happy Server | Relay backend |
| 5432 | PostgreSQL | Happy Server DB |
| 6379 | Redis | Happy Server cache |
| 7655 | Pulse | Monitoring |
| 8080 | Excalidraw | Whiteboard |
| 8080 | Traefik | Dashboard |
| 9000 | MinIO | Object storage |
---
## Service Health Checks
### Quick Health Check Script
```bash
#!/bin/bash
# Check all critical services
echo "=== Infrastructure ==="
curl -Is https://pve.htsn.io:8006 | head -1
curl -Is https://truenas.htsn.io | head -1
curl -I http://10.10.10.10/admin 2>/dev/null | head -1
echo ""
echo "=== Media Services ==="
curl -Is https://plex.htsn.io | head -1
curl -Is https://sonarr.htsn.io | head -1
curl -Is https://radarr.htsn.io | head -1
echo ""
echo "=== Development ==="
curl -Is https://git.htsn.io | head -1
curl -Is https://excalidraw.htsn.io | head -1
echo ""
echo "=== Home Automation ==="
curl -Is https://homeassistant.htsn.io | head -1
curl -Is https://happy.htsn.io/health | head -1
```
### Service-Specific Checks
```bash
# Proxmox VMs
ssh pve 'qm list | grep running'
# Docker services
ssh docker-host 'docker ps --format "{{.Names}}: {{.Status}}"'
# Syncthing
curl -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
"http://127.0.0.1:8384/rest/system/status"
# UPS
ssh pve 'upsc cyberpower@localhost ups.status'
```
---
## Service Credentials
**Location**: See individual service documentation
| Service | Credentials Location | Notes |
|---------|---------------------|-------|
| Proxmox | Proxmox UI | Username + 2FA |
| TrueNAS | TrueNAS UI | Root password |
| Plex | Plex account | Managed externally |
| Gitea | Gitea DB | Self-managed |
| Pi-hole | `/etc/pihole/setupVars.conf` | Admin password |
| Happy Server | [CLAUDE.md](CLAUDE.md) | Master secret, DB passwords |
**⚠️ Security Note**: Never commit credentials to Git. Use proper secrets management.
---
## Related Documentation
- [VMS.md](VMS.md) - VM/service locations
- [TRAEFIK.md](TRAEFIK.md) - Reverse proxy config
- [IP-ASSIGNMENTS.md](IP-ASSIGNMENTS.md) - Service IP addresses
- [NETWORK.md](NETWORK.md) - Network configuration
- [MONITORING.md](MONITORING.md) - Monitoring setup (coming soon)
---
**Last Updated**: 2025-12-22
**Status**: ⚠️ Incomplete - many services need documentation (passwords, features, usage)

475
SSH-ACCESS.md Normal file
View File

@@ -0,0 +1,475 @@
# SSH Access
Documentation for SSH access to all homelab systems, including key authentication, password authentication for special cases, and QEMU guest agent usage.
## Overview
Most systems use **SSH key authentication** with the `~/.ssh/homelab` key. A few special cases rely on **password authentication** due to platform limitations: currently the Windows PC. (The UniFi router previously required a password but now uses key auth; see below.)
**SSH Password**: `GrilledCh33s3#` (for systems without key auth)
---
## SSH Key Authentication (Primary Method)
### SSH Key Configuration
SSH keys are configured in `~/.ssh/config` on both Mac Mini and MacBook.
**Key file**: `~/.ssh/homelab` (Ed25519 key)
**Key deployed to**: All Proxmox hosts, VMs, and LXCs (13 total hosts)
### Host Aliases
Use these convenient aliases instead of IP addresses:
| Host Alias | IP | User | Type | Notes |
|------------|-----|------|------|-------|
| `ucg-fiber` / `gateway` | 10.10.10.1 | root | UniFi Gateway | Router/firewall |
| `pve` | 10.10.10.120 | root | Proxmox | Primary server |
| `pve2` | 10.10.10.102 | root | Proxmox | Secondary server |
| `truenas` | 10.10.10.200 | root | VM | NAS/storage |
| `saltbox` | 10.10.10.100 | hutson | VM | Media automation |
| `lmdev1` | 10.10.10.111 | hutson | VM | AI/LLM development |
| `docker-host` | 10.10.10.206 | hutson | VM | Docker services (PVE) |
| `docker-host2` | 10.10.10.207 | hutson | VM | Docker services (PVE2) - MetaMCP, n8n |
| `fs-dev` | 10.10.10.5 | hutson | VM | Development |
| `copyparty` | 10.10.10.201 | hutson | VM | File sharing |
| `gitea-vm` | 10.10.10.220 | hutson | VM | Git server |
| `trading-vm` | 10.10.10.221 | hutson | VM | AI trading platform |
| `pihole` | 10.10.10.10 | root | LXC | DNS/Ad blocking |
| `traefik` | 10.10.10.250 | root | LXC | Reverse proxy |
| `findshyt` | 10.10.10.8 | root | LXC | Custom app |
### Usage Examples
```bash
# List VMs on PVE
ssh pve 'qm list'
# Check ZFS pool on TrueNAS
ssh truenas 'zpool status vault'
# List Docker containers on Saltbox
ssh saltbox 'docker ps'
# Check Pi-hole status
ssh pihole 'pihole status'
# View Traefik config
ssh pve 'pct exec 202 -- cat /etc/traefik/traefik.yaml'
```
### SSH Config File
**Location**: `~/.ssh/config`
**Example entries**:
```sshconfig
# Proxmox Servers
Host pve
HostName 10.10.10.120
User root
IdentityFile ~/.ssh/homelab
Host pve2
HostName 10.10.10.102
User root
IdentityFile ~/.ssh/homelab
# Post-quantum KEX causes MTU issues - use classic
KexAlgorithms curve25519-sha256
# VMs
Host truenas
HostName 10.10.10.200
User root
IdentityFile ~/.ssh/homelab
Host saltbox
HostName 10.10.10.100
User hutson
IdentityFile ~/.ssh/homelab
Host lmdev1
HostName 10.10.10.111
User hutson
IdentityFile ~/.ssh/homelab
Host docker-host
HostName 10.10.10.206
User hutson
IdentityFile ~/.ssh/homelab
Host docker-host2
HostName 10.10.10.207
User hutson
IdentityFile ~/.ssh/homelab
Host fs-dev
HostName 10.10.10.5
User hutson
IdentityFile ~/.ssh/homelab
Host copyparty
HostName 10.10.10.201
User hutson
IdentityFile ~/.ssh/homelab
Host gitea-vm
HostName 10.10.10.220
User hutson
IdentityFile ~/.ssh/homelab
Host trading-vm
HostName 10.10.10.221
User hutson
IdentityFile ~/.ssh/homelab
# LXC Containers
Host pihole
HostName 10.10.10.10
User root
IdentityFile ~/.ssh/homelab
Host traefik
HostName 10.10.10.250
User root
IdentityFile ~/.ssh/homelab
Host findshyt
HostName 10.10.10.8
User root
IdentityFile ~/.ssh/homelab
```
---
## Password Authentication (Special Cases)
Some systems don't support SSH key auth or have other limitations.
### UniFi Router (10.10.10.1) - NOW USES KEY AUTH
**Host alias**: `ucg-fiber` or `gateway`
**Status**: SSH key authentication now works (as of 2026-01-02)
**Commands**:
```bash
# Run command on router (using SSH key)
ssh ucg-fiber 'hostname'
# Get ARP table (all device IPs)
ssh ucg-fiber 'cat /proc/net/arp'
# Check Tailscale status
ssh ucg-fiber 'tailscale status'
# Check memory usage
ssh ucg-fiber 'free -m'
```
**Note**: Key may need to be re-deployed after firmware updates if UniFi clears authorized_keys.
### Windows PC (10.10.10.150)
**OS**: Windows with OpenSSH server
**User**: `claude`
**Password**: `GrilledCh33s3#`
**Shell**: PowerShell (not bash)
**Commands**:
```bash
# Run PowerShell command
sshpass -p 'GrilledCh33s3#' ssh claude@10.10.10.150 'Get-Process | Select -First 5'
# Check Syncthing status
sshpass -p 'GrilledCh33s3#' ssh claude@10.10.10.150 'Get-Process -Name syncthing -ErrorAction SilentlyContinue'
# Restart Syncthing
sshpass -p 'GrilledCh33s3#' ssh claude@10.10.10.150 'Stop-Process -Name syncthing -Force; Start-ScheduledTask -TaskName "Syncthing"'
```
**⚠️ Important**: Use `;` (semicolon) to chain PowerShell commands, NOT `&&` (bash syntax).
**Why not key auth?**: Could be configured, but password auth works and is simpler for Windows.
---
## QEMU Guest Agent
Most VMs have the QEMU guest agent installed, allowing command execution without SSH.
### VMs with QEMU Agent
| VMID | VM Name | Use Case |
|------|---------|----------|
| 100 | truenas | Execute commands, check ZFS |
| 101 | saltbox | Execute commands, Docker mgmt |
| 105 | fs-dev | Execute commands |
| 111 | lmdev1 | Execute commands |
| 201 | copyparty | Execute commands |
| 206 | docker-host | Execute commands |
| 300 | gitea-vm | Execute commands |
| 301 | trading-vm | Execute commands |
### VM WITHOUT QEMU Agent
**VMID 110 (homeassistant)**: No QEMU agent installed
- Access via web UI only
- Or install SSH server manually if needed
### Usage Examples
**Basic syntax**:
```bash
ssh pve 'qm guest exec VMID -- bash -c "COMMAND"'
```
**Examples**:
```bash
# Check ZFS pool on TrueNAS (without SSH)
ssh pve 'qm guest exec 100 -- bash -c "zpool status vault"'
# Get VM IP addresses
ssh pve 'qm guest exec 100 -- bash -c "ip addr"'
# Check Docker containers on Saltbox
ssh pve 'qm guest exec 101 -- bash -c "docker ps"'
# Run multi-line command
ssh pve 'qm guest exec 100 -- bash -c "df -h; free -h; uptime"'
```
**When to use QEMU agent vs SSH**:
- ✅ Use **SSH** for interactive sessions, file editing, complex tasks
- ✅ Use **QEMU agent** for one-off commands, when SSH is down, or VM has no network
- ⚠️ QEMU agent is slower for multiple commands (use SSH instead)
---
## Troubleshooting SSH Issues
### Connection Refused
```bash
# Check if SSH service is running
ssh pve 'systemctl status sshd'
# Check if port 22 is open
nc -zv 10.10.10.XXX 22
# Check firewall
ssh pve 'iptables -L -n | grep 22'
```
### Permission Denied (Public Key)
```bash
# Verify key file exists
ls -la ~/.ssh/homelab
# Check key permissions (should be 600)
chmod 600 ~/.ssh/homelab
# Test SSH key auth verbosely
ssh -vvv -i ~/.ssh/homelab root@10.10.10.120
# Check authorized_keys on remote (via QEMU agent if SSH broken)
ssh pve 'qm guest exec VMID -- bash -c "cat ~/.ssh/authorized_keys"'
```
### Slow SSH Connection (PVE2 Issue)
**Problem**: SSH to PVE2 hangs for 30+ seconds before connecting
**Cause**: MTU mismatch (vmbr0=9000, nic1=1500) causing post-quantum KEX packet fragmentation
**Fix**: Use classic KEX algorithm instead
**In `~/.ssh/config`**:
```sshconfig
Host pve2
HostName 10.10.10.102
User root
IdentityFile ~/.ssh/homelab
KexAlgorithms curve25519-sha256 # Avoid mlkem768x25519-sha256
```
**Permanent fix**: Set `nic1` MTU to 9000 in `/etc/network/interfaces` on PVE2
---
## Adding SSH Keys to New Systems
### Linux (VMs/LXCs)
```bash
# Copy public key to new host
ssh-copy-id -i ~/.ssh/homelab user@hostname
# Or manually:
ssh user@hostname 'mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys' < ~/.ssh/homelab.pub
ssh user@hostname 'chmod 700 ~/.ssh && chmod 600 ~/.ssh/authorized_keys'
```
### LXC Containers (Root User)
```bash
# Via pct exec from Proxmox host
ssh pve 'pct exec CTID -- bash -c "mkdir -p /root/.ssh"'
ssh pve 'pct exec CTID -- bash -c "echo \"$(cat ~/.ssh/homelab.pub)\" >> /root/.ssh/authorized_keys"'
ssh pve 'pct exec CTID -- bash -c "chmod 700 /root/.ssh && chmod 600 /root/.ssh/authorized_keys"'
# Also enable PermitRootLogin in sshd_config
ssh pve 'pct exec CTID -- bash -c "sed -i \"s/^#*PermitRootLogin.*/PermitRootLogin prohibit-password/\" /etc/ssh/sshd_config"'
ssh pve 'pct exec CTID -- bash -c "systemctl restart sshd"'
```
### VMs (via QEMU Agent)
```bash
# Add key via QEMU agent (if SSH not working)
ssh pve 'qm guest exec VMID -- bash -c "mkdir -p ~/.ssh"'
ssh pve 'qm guest exec VMID -- bash -c "echo \"$(cat ~/.ssh/homelab.pub)\" >> ~/.ssh/authorized_keys"'
ssh pve 'qm guest exec VMID -- bash -c "chmod 700 ~/.ssh && chmod 600 ~/.ssh/authorized_keys"'
```
---
## SSH Key Management
### Rotate SSH Keys (Future)
When rotating SSH keys:
1. Generate new key pair:
```bash
ssh-keygen -t ed25519 -f ~/.ssh/homelab-new -C "homelab-new"
```
2. Deploy new key to all hosts (keep old key for now):
```bash
for host in pve pve2 truenas saltbox lmdev1 docker-host fs-dev copyparty gitea-vm trading-vm pihole traefik findshyt; do
ssh-copy-id -i ~/.ssh/homelab-new $host
done
```
3. Update `~/.ssh/config` to use new key:
```sshconfig
IdentityFile ~/.ssh/homelab-new
```
4. Test all connections:
```bash
for host in pve pve2 truenas saltbox lmdev1 docker-host fs-dev copyparty gitea-vm trading-vm pihole traefik findshyt; do
echo "Testing $host..."
ssh $host 'hostname'
done
```
5. Remove old key from all hosts once confirmed working
---
## Quick Reference
### Common SSH Operations
```bash
# Execute command on remote host
ssh host 'command'
# Execute multiple commands
ssh host 'command1 && command2'
# Copy file to remote
scp file host:/path/
# Copy file from remote
scp host:/path/file ./
# Execute command on Proxmox VM (via QEMU agent)
ssh pve 'qm guest exec VMID -- bash -c "command"'
# Execute command on LXC
ssh pve 'pct exec CTID -- command'
# Interactive shell
ssh host
# SSH with X11 forwarding
ssh -X host
```
### Troubleshooting Commands
```bash
# Test SSH with verbose output
ssh -vvv host
# Check SSH service status (remote)
ssh host 'systemctl status sshd'
# Check SSH config (local)
ssh -G host
# Test port connectivity
nc -zv hostname 22
```
---
## Security Best Practices
### Current Security Posture
✅ **Good**:
- SSH keys used instead of passwords (where possible)
- Keys use Ed25519 (modern, secure algorithm)
- Root login disabled on VMs (use sudo instead)
- SSH keys have proper permissions (600)
⚠️ **Could Improve**:
- [ ] Disable password authentication on all hosts (force key-only)
- [ ] Use SSH certificate authority instead of individual keys
- [ ] Set up SSH bastion host (jump server)
- [ ] Enable 2FA for SSH (via PAM + Google Authenticator)
- [ ] Implement SSH key rotation policy (annually)
### Hardening SSH (Future)
For additional security, consider:
```sshconfig
# /etc/ssh/sshd_config (on remote hosts)
PermitRootLogin prohibit-password # No root password login
PasswordAuthentication no # Disable password auth entirely
PubkeyAuthentication yes # Only allow key auth
AuthorizedKeysFile .ssh/authorized_keys
MaxAuthTries 3 # Limit auth attempts
MaxSessions 10 # Limit concurrent sessions
ClientAliveInterval 300 # Timeout idle sessions
ClientAliveCountMax 2 # Drop after 2 keepalives
```
**Apply after editing**:
```bash
systemctl restart sshd
```
---
## Related Documentation
- [VMS.md](VMS.md) - Complete VM/CT inventory
- [NETWORK.md](NETWORK.md) - Network configuration
- [IP-ASSIGNMENTS.md](IP-ASSIGNMENTS.md) - IP addresses for all hosts
- [SECURITY.md](#) - Security policies (coming soon)
---
**Last Updated**: 2025-12-22

510
STORAGE.md Normal file
View File

@@ -0,0 +1,510 @@
# Storage Architecture
Documentation of all storage pools, datasets, shares, and capacity planning across the homelab.
## Overview
### Storage Distribution
| Location | Type | Capacity | Purpose |
|----------|------|----------|---------|
| **PVE** | NVMe + SSD mirrors | ~9 TB usable | VM storage, fast IO |
| **PVE2** | NVMe + HDD mirrors | ~6+ TB usable | VM storage, bulk data |
| **TrueNAS** | ZFS pool + EMC enclosure | ~12+ TB usable | Central file storage, NFS/SMB |
---
## PVE (10.10.10.120) Storage Pools
### nvme-mirror1 (Primary Fast Storage)
- **Type**: ZFS mirror
- **Devices**: 2x Sabrent Rocket Q NVMe
- **Capacity**: 3.6 TB usable
- **Purpose**: High-performance VM storage
- **Used By**:
- Critical VMs requiring fast IO
- Database workloads
- Development environments
**Check status**:
```bash
ssh pve 'zpool status nvme-mirror1'
ssh pve 'zpool list nvme-mirror1'
```
### nvme-mirror2 (Secondary Fast Storage)
- **Type**: ZFS mirror
- **Devices**: 2x Kingston SFYRD 2TB NVMe
- **Capacity**: 1.8 TB usable
- **Purpose**: Additional fast VM storage
- **Used By**: TBD
**Check status**:
```bash
ssh pve 'zpool status nvme-mirror2'
ssh pve 'zpool list nvme-mirror2'
```
### rpool (Root Pool)
- **Type**: ZFS mirror
- **Devices**: 2x Samsung 870 QVO 4TB SSD
- **Capacity**: 3.6 TB usable
- **Purpose**: Proxmox OS, container storage, VM backups
- **Used By**:
- Proxmox root filesystem
- LXC containers
- Local VM backups
**Check status**:
```bash
ssh pve 'zpool status rpool'
ssh pve 'df -h /var/lib/vz'
```
### Storage Pool Usage Summary (PVE)
**Get current usage**:
```bash
ssh pve 'zpool list'
ssh pve 'pvesm status'
```
---
## PVE2 (10.10.10.102) Storage Pools
### nvme-mirror3 (Fast Storage)
- **Type**: ZFS mirror
- **Devices**: 2x NVMe (model unknown)
- **Capacity**: Unknown (needs investigation)
- **Purpose**: High-performance VM storage
- **Used By**: Trading VM (301), other VMs
**Check status**:
```bash
ssh pve2 'zpool status nvme-mirror3'
ssh pve2 'zpool list nvme-mirror3'
```
### local-zfs2 (Bulk Storage)
- **Type**: ZFS mirror
- **Devices**: 2x WD Red 6TB HDD
- **Capacity**: ~6 TB usable
- **Purpose**: Bulk/archival storage
- **Power Management**: 30-minute spindown configured
- Saves ~10-16W when idle
- Udev rule: `/etc/udev/rules.d/69-hdd-spindown.rules`
- Command: `hdparm -S 241` (30 min)
**Notes**:
- Pool had only 768 KB used as of 2024-12-16
- Drives configured to spin down after 30 min idle
- Good for archival, NOT for active workloads
**Check status**:
```bash
ssh pve2 'zpool status local-zfs2'
ssh pve2 'zpool list local-zfs2'
# Check if drives are spun down
ssh pve2 'hdparm -C /dev/sdX' # Shows active/standby
```
---
## TrueNAS (VM 100 @ 10.10.10.200) - Central Storage
### ZFS Pool: vault
**Primary storage pool** for all shared data.
**Devices**: ❓ Needs investigation
- EMC storage enclosure with multiple drives
- SAS connection via LSI SAS2308 HBA (passed through to VM)
**Capacity**: ❓ Needs investigation
**Check pool status**:
```bash
ssh truenas 'zpool status vault'
ssh truenas 'zpool list vault'
# Get detailed capacity
ssh truenas 'zfs list -o name,used,avail,refer,mountpoint'
```
### Datasets (Known)
Based on Syncthing configuration, likely datasets:
| Dataset | Purpose | Synced Devices | Notes |
|---------|---------|----------------|-------|
| vault/documents | Personal documents | Mac Mini, MacBook, Windows PC, Phone | ~11 GB |
| vault/downloads | Downloads folder | Mac Mini, TrueNAS | ~38 GB |
| vault/pictures | Photos | Mac Mini, MacBook, Phone | Unknown size |
| vault/notes | Note files | Mac Mini, MacBook, Phone | Unknown size |
| vault/desktop | Desktop sync | Unknown | 7.2 GB |
| vault/movies | Movie library | Unknown | Unknown size |
| vault/config | Config files | Mac Mini, MacBook | Unknown size |
**Get complete dataset list**:
```bash
ssh truenas 'zfs list -r vault'
```
### NFS/SMB Shares
**Status**: ❓ Not documented
**Needs investigation**:
```bash
# List NFS exports
ssh truenas 'showmount -e localhost'
# List SMB shares
ssh truenas 'smbclient -L localhost -N'
# Via TrueNAS API/UI
# Sharing → Unix Shares (NFS)
# Sharing → Windows Shares (SMB)
```
**Expected shares**:
- Media libraries for Plex (on Saltbox VM)
- Document storage
- VM backups?
- ISO storage?
### EMC Storage Enclosure
**Model**: EMC KTN-STL4 (or similar)
**Connection**: SAS via LSI SAS2308 HBA (passthrough to TrueNAS VM)
**Drives**: ❓ Unknown count and capacity
**See [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md)** for:
- SES commands
- Fan control
- LCC (Link Control Card) troubleshooting
- Maintenance procedures
**Check enclosure status**:
```bash
ssh truenas 'sg_ses --page=0x02 /dev/sgX' # Element descriptor
ssh truenas 'smartctl --scan' # List all drives
```
---
## Storage Network Architecture
### Internal Storage Network (10.10.20.0/24)
**Purpose**: Dedicated network for NFS/iSCSI traffic to reduce congestion on main network.
**Bridge**: vmbr3 on PVE (virtual bridge, no physical NIC)
**Subnet**: 10.10.20.0/24
**DHCP**: No
**Gateway**: No (internal only, no internet)
**Connected VMs**:
- TrueNAS VM (secondary NIC)
- Saltbox VM (secondary NIC) - for NFS mounts
- Other VMs needing storage access
**Configuration**:
```bash
# On TrueNAS VM - check second NIC
ssh truenas 'ip addr show enp6s19'
# On Saltbox - check NFS mounts
ssh saltbox 'mount | grep nfs'
```
**Benefits**:
- Separates storage traffic from general network
- Prevents NFS/SMB from saturating main network
- Better performance for storage-heavy workloads
---
## Storage Capacity Planning
### Current Usage (Estimate)
**Needs actual audit**:
```bash
# PVE pools
ssh pve 'zpool list -o name,size,alloc,free'
# PVE2 pools
ssh pve2 'zpool list -o name,size,alloc,free'
# TrueNAS vault pool
ssh truenas 'zpool list vault'
# Get detailed breakdown
ssh truenas 'zfs list -r vault -o name,used,avail'
```
### Growth Rate
**Needs tracking** - recommend monthly snapshots of capacity:
```bash
#!/bin/bash
# Save as ~/bin/storage-capacity-report.sh
DATE=$(date +%Y-%m-%d)
REPORT=~/Backups/storage-reports/capacity-$DATE.txt
mkdir -p ~/Backups/storage-reports
echo "Storage Capacity Report - $DATE" > $REPORT
echo "================================" >> $REPORT
echo "" >> $REPORT
echo "PVE Pools:" >> $REPORT
ssh pve 'zpool list' >> $REPORT
echo "" >> $REPORT
echo "PVE2 Pools:" >> $REPORT
ssh pve2 'zpool list' >> $REPORT
echo "" >> $REPORT
echo "TrueNAS Pools:" >> $REPORT
ssh truenas 'zpool list' >> $REPORT
echo "" >> $REPORT
echo "TrueNAS Datasets:" >> $REPORT
ssh truenas 'zfs list -r vault -o name,used,avail' >> $REPORT
echo "Report saved to $REPORT"
```
**Run monthly via cron**:
```cron
0 9 1 * * ~/bin/storage-capacity-report.sh
```
### Expansion Planning
**When to expand**:
- Pool reaches 80% capacity
- Performance degrades
- New workloads require more space
**Expansion options**:
1. Add drives to existing pools (if mirrors, add mirror vdev)
2. Add new NVMe drives to PVE/PVE2
3. Expand EMC enclosure (add more drives)
4. Add second EMC enclosure
**Cost estimates**: TBD
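For option 1 above, a minimal sketch of growing a mirrored pool by adding a second mirror vdev (device paths are placeholders; verify with `zpool status` first, since `zpool add` cannot be undone):
```bash
# Add a second mirror vdev to an existing mirrored pool (placeholder device IDs)
ssh pve 'zpool add nvme-mirror1 mirror /dev/disk/by-id/nvme-NEW-DISK-1 /dev/disk/by-id/nvme-NEW-DISK-2'
ssh pve 'zpool status nvme-mirror1'
```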
---
## ZFS Health Monitoring
### Daily Health Checks
```bash
# Check for errors on all pools
ssh pve 'zpool status -x' # Shows only unhealthy pools
ssh pve2 'zpool status -x'
ssh truenas 'zpool status -x'
# Check scrub status
ssh pve 'zpool status | grep scrub'
ssh pve2 'zpool status | grep scrub'
ssh truenas 'zpool status | grep scrub'
```
### Scrub Schedule
**Recommended**: Monthly scrub on all pools
**Configure scrub**:
```bash
# Via Proxmox UI: Node → Disks → ZFS → Select pool → Scrub
# Or via cron:
0 2 1 * * /sbin/zpool scrub nvme-mirror1
0 2 1 * * /sbin/zpool scrub rpool
```
**On TrueNAS**:
- Configure via UI: Storage → Pools → Scrub Tasks
- Recommended: 1st of every month at 2 AM
### SMART Monitoring
**Check drive health**:
```bash
# PVE
ssh pve 'smartctl -a /dev/nvme0'
ssh pve 'smartctl -a /dev/sda'
# TrueNAS
ssh truenas 'smartctl --scan'
ssh truenas 'smartctl -a /dev/sdX' # For each drive
```
**Configure SMART tests**:
- TrueNAS UI: Tasks → S.M.A.R.T. Tests
- Recommended: Weekly short test, monthly long test
### Alerts
**Set up email alerts for**:
- ZFS pool errors
- SMART test failures
- Pool capacity > 80%
- Scrub failures
---
## Storage Performance Tuning
### ZFS ARC (Cache)
**Check ARC usage**:
```bash
ssh pve 'arc_summary'
ssh truenas 'arc_summary'
```
**Tuning** (if needed):
- PVE/PVE2: Set max ARC in `/etc/modprobe.d/zfs.conf`
- TrueNAS: Configure via UI (System → Advanced → Tunables)
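A minimal sketch for capping ARC on a Proxmox host (the 16 GiB value is an assumption; size it to leave enough RAM for VMs):
```bash
# 17179869184 bytes = 16 GiB; overwrites any existing zfs.conf
ssh pve 'echo "options zfs zfs_arc_max=17179869184" > /etc/modprobe.d/zfs.conf'
ssh pve 'update-initramfs -u'
# Apply immediately without a reboot:
ssh pve 'echo 17179869184 > /sys/module/zfs/parameters/zfs_arc_max'
```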
### NFS Performance
**Mount options** (on clients like Saltbox):
```
rsize=131072,wsize=131072,hard,timeo=600,retrans=2,vers=3
```
**Verify NFS mounts**:
```bash
ssh saltbox 'mount | grep nfs'
```
### Record Size Optimization
**Different workloads need different record sizes**:
- VMs: 64K (default, good for VMs)
- Databases: 8K or 16K
- Media files: 1M (large sequential reads)
**Set record size** (on TrueNAS datasets):
```bash
ssh truenas 'zfs set recordsize=1M vault/movies'
```
---
## Disaster Recovery
### Pool Recovery
**If a pool fails to import**:
```bash
# Try importing with different name
zpool import -f -N poolname newpoolname
# Check pool with readonly
zpool import -f -o readonly=on poolname
# Force import (last resort)
zpool import -f -F poolname
```
### Drive Replacement
**When a drive fails**:
```bash
# Identify failed drive
zpool status poolname
# Replace drive
zpool replace poolname old-device new-device
# Monitor resilver
watch zpool status poolname
```
### Data Recovery
**If pool is completely lost**:
1. Restore from offsite backup (see [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md))
2. Recreate pool structure
3. Restore data
**Critical**: This is why we need offsite backups!
---
## Quick Reference
### Common Commands
```bash
# Pool status
zpool status [poolname]
zpool list
# Dataset usage
zfs list
zfs list -r vault
# Check pool health (only unhealthy)
zpool status -x
# Scrub pool
zpool scrub poolname
# Get pool IO stats
zpool iostat -v 1
# Snapshot management
zfs snapshot poolname/dataset@snapname
zfs list -t snapshot
zfs rollback poolname/dataset@snapname
zfs destroy poolname/dataset@snapname
```
### Storage Locations by Use Case
| Use Case | Recommended Storage | Why |
|----------|---------------------|-----|
| VM OS disk | nvme-mirror1 (PVE) | Fastest IO |
| Database | nvme-mirror1/2 | Low latency |
| Media files | TrueNAS vault | Large capacity |
| Development | nvme-mirror2 | Fast, mid-tier |
| Containers | rpool | Good performance |
| Backups | TrueNAS or rpool | Large capacity |
| Archive | local-zfs2 (PVE2) | Cheap, can spin down |
---
## Investigation Needed
- [ ] Get complete TrueNAS dataset list
- [ ] Document NFS/SMB share configuration
- [ ] Inventory EMC enclosure drives (count, capacity, model)
- [ ] Document current pool usage percentages
- [ ] Set up monthly capacity reports
- [ ] Configure ZFS scrub schedules
- [ ] Set up storage health alerts
---
## Related Documentation
- [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md) - Backup and snapshot strategy
- [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md) - Storage enclosure maintenance
- [VMS.md](VMS.md) - VM storage assignments
- [NETWORK.md](NETWORK.md) - Storage network configuration
---
**Last Updated**: 2025-12-22

673
TRAEFIK.md Normal file
View File

@@ -0,0 +1,673 @@
# Traefik Reverse Proxy
Documentation for Traefik reverse proxy setup, SSL certificates, and deploying new public services.
## Overview
There are **TWO separate Traefik instances** handling different services. Understanding which one to use is critical.
| Instance | Location | IP | Purpose | Managed By |
|----------|----------|-----|---------|------------|
| **Traefik-Primary** | CT 202 | **10.10.10.250** | General services | Manual config files |
| **Traefik-Saltbox** | VM 101 (Docker) | **10.10.10.100** | Saltbox services only | Saltbox Ansible |
---
## ⚠️ CRITICAL RULE: Which Traefik to Use
### When Adding ANY New Service:
**USE Traefik-Primary (CT 202 @ 10.10.10.250)** - For ALL new services
**DO NOT touch Traefik-Saltbox** - Unless you're modifying Saltbox itself
### Why This Matters:
- **Traefik-Saltbox** has complex Saltbox-managed configs (Ansible-generated)
- Messing with it breaks Plex, Sonarr, Radarr, and all media services
- Each Traefik has its own Let's Encrypt certificates
- Mixing them causes certificate conflicts and routing issues
---
## Traefik-Primary (CT 202) - For New Services
### Configuration
**Location**: Container 202 on PVE (10.10.10.250)
**Config Directory**: `/etc/traefik/`
**Main Config**: `/etc/traefik/traefik.yaml`
**Dynamic Configs**: `/etc/traefik/conf.d/*.yaml`
### Access Traefik Config
```bash
# From Mac Mini:
ssh pve 'pct exec 202 -- cat /etc/traefik/traefik.yaml'
ssh pve 'pct exec 202 -- ls /etc/traefik/conf.d/'
# Edit a service config:
ssh pve 'pct exec 202 -- vi /etc/traefik/conf.d/myservice.yaml'
# View logs:
ssh pve 'pct exec 202 -- tail -f /var/log/traefik/traefik.log'
```
### Services Using Traefik-Primary
| Service | Domain | Backend |
|---------|--------|---------|
| Excalidraw | excalidraw.htsn.io | 10.10.10.206:8080 (docker-host) |
| FindShyt | findshyt.htsn.io | 10.10.10.205 (CT 205) |
| Gitea | git.htsn.io | 10.10.10.220:3000 |
| Home Assistant | homeassistant.htsn.io | 10.10.10.110 |
| LM Dev | lmdev.htsn.io | 10.10.10.111 |
| MetaMCP | metamcp.htsn.io | 10.10.10.207:12008 (docker-host2) |
| Pi-hole | pihole.htsn.io | 10.10.10.10 |
| TrueNAS | truenas.htsn.io | 10.10.10.200 |
| Proxmox | pve.htsn.io | 10.10.10.120 |
| Copyparty | copyparty.htsn.io | 10.10.10.201 |
| AI Trade | aitrade.htsn.io | (trading server) |
| Pulse | pulse.htsn.io | 10.10.10.206:7655 (monitoring) |
| Happy | happy.htsn.io | 10.10.10.206:3002 (Happy Coder relay) |
---
## Traefik-Saltbox (VM 101) - DO NOT MODIFY
### Configuration
**Location**: `/opt/traefik/` inside Saltbox VM
**Managed By**: Saltbox Ansible playbooks (automatic)
**Docker Mount**: `/opt/traefik` → `/etc/traefik` in container
### Services Using Traefik-Saltbox
- Plex (plex.htsn.io)
- Sonarr, Radarr, Lidarr
- SABnzbd, NZBGet, qBittorrent
- Overseerr, Tautulli, Organizr
- Jackett, NZBHydra2
- Authelia (SSO authentication)
- All other Saltbox-managed containers
### View Saltbox Traefik (Read-Only)
```bash
# View config (don't edit!)
ssh pve 'qm guest exec 101 -- bash -c "docker exec traefik cat /etc/traefik/traefik.yml"'
# View logs
ssh saltbox 'docker logs -f traefik'
```
**⚠️ WARNING**: Manual edits to Saltbox Traefik configs will be overwritten by Ansible and may break media services.
---
## Adding a New Public Service - Complete Workflow
Follow these steps to deploy a new service and make it accessible at `servicename.htsn.io`.
### Step 0: Deploy Your Service
First, deploy your service on the appropriate host.
#### Option A: Docker on docker-host (10.10.10.206)
```bash
ssh hutson@10.10.10.206
sudo mkdir -p /opt/myservice
cat > /opt/myservice/docker-compose.yml << 'EOF'
version: "3.8"
services:
myservice:
image: myimage:latest
ports:
- "8080:80"
restart: unless-stopped
EOF
cd /opt/myservice && sudo docker-compose up -d
```
#### Option B: New LXC Container on PVE
```bash
ssh pve 'pct create CTID local:vztmpl/ubuntu-22.04-standard_22.04-1_amd64.tar.zst \
--hostname myservice --memory 2048 --cores 2 \
--net0 name=eth0,bridge=vmbr0,ip=10.10.10.XXX/24,gw=10.10.10.1 \
--rootfs local-zfs:8 --unprivileged 1 --start 1'
```
#### Option C: New VM on PVE
```bash
ssh pve 'qm create VMID --name myservice --memory 2048 --cores 2 \
--net0 virtio,bridge=vmbr0 --scsihw virtio-scsi-pci'
```
### Step 1: Create Traefik Config File
Use this template for new services on **Traefik-Primary (CT 202)**:
#### Basic Template
```yaml
# /etc/traefik/conf.d/myservice.yaml
http:
routers:
# HTTPS router
myservice-secure:
entryPoints:
- websecure
rule: "Host(`myservice.htsn.io`)"
service: myservice
tls:
certResolver: cloudflare # Use 'cloudflare' for proxied domains, 'letsencrypt' for DNS-only
priority: 50
# HTTP → HTTPS redirect
myservice-redirect:
entryPoints:
- web
rule: "Host(`myservice.htsn.io`)"
middlewares:
- myservice-https-redirect
service: myservice
priority: 50
services:
myservice:
loadBalancer:
servers:
- url: "http://10.10.10.XXX:PORT"
middlewares:
myservice-https-redirect:
redirectScheme:
scheme: https
permanent: true
```
#### Deploy the Config
```bash
# Create file on CT 202
ssh pve 'pct exec 202 -- bash -c "cat > /etc/traefik/conf.d/myservice.yaml << '\''EOF'\''
<paste config here>
EOF"'
# Traefik auto-reloads (watches conf.d directory)
# Check logs:
ssh pve 'pct exec 202 -- tail -f /var/log/traefik/traefik.log'
```
### Step 2: Add Cloudflare DNS Entry
#### Cloudflare Credentials
| Field | Value |
|-------|-------|
| Email | cloudflare@htsn.io |
| API Key | 849ebefd163d2ccdec25e49b3e1b3fe2cdadc |
| Zone ID (htsn.io) | c0f5a80448c608af35d39aa820a5f3af |
| Public IP | 70.237.94.174 |
#### Method 1: Manual (Cloudflare Dashboard)
1. Go to https://dash.cloudflare.com/
2. Select `htsn.io` domain
3. DNS → Add Record
4. Type: `A`, Name: `myservice`, IPv4: `70.237.94.174`, Proxied: ☑️
#### Method 2: Automated (CLI)
Save this as `~/bin/add-cloudflare-dns.sh`:
```bash
#!/bin/bash
# Add DNS record to Cloudflare for htsn.io
SUBDOMAIN="$1"
CF_EMAIL="cloudflare@htsn.io"
CF_API_KEY="849ebefd163d2ccdec25e49b3e1b3fe2cdadc"
ZONE_ID="c0f5a80448c608af35d39aa820a5f3af"
PUBLIC_IP="70.237.94.174"
if [ -z "$SUBDOMAIN" ]; then
echo "Usage: $0 <subdomain>"
echo "Example: $0 myservice # Creates myservice.htsn.io"
exit 1
fi
curl -X POST "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records" \
-H "X-Auth-Email: $CF_EMAIL" \
-H "X-Auth-Key: $CF_API_KEY" \
-H "Content-Type: application/json" \
--data "{
\"type\":\"A\",
\"name\":\"$SUBDOMAIN\",
\"content\":\"$PUBLIC_IP\",
\"ttl\":1,
\"proxied\":true
}" | jq .
```
**Usage**:
```bash
chmod +x ~/bin/add-cloudflare-dns.sh
~/bin/add-cloudflare-dns.sh myservice # Creates myservice.htsn.io
```
### Step 3: Testing
```bash
# Check if DNS resolves
dig myservice.htsn.io
# Should return: 70.237.94.174 (or Cloudflare IPs if proxied)
# Test HTTP redirect
curl -I http://myservice.htsn.io
# Expected: 301 redirect to https://
# Test HTTPS
curl -I https://myservice.htsn.io
# Expected: 200 OK
# Check Traefik dashboard (if enabled)
# http://10.10.10.250:8080/dashboard/
```
### Step 4: Update Documentation
After deploying, update:
1. **IP-ASSIGNMENTS.md** - Add to Services & Reverse Proxy Mapping table
2. **This file (TRAEFIK.md)** - Add to "Services Using Traefik-Primary" list
3. **CLAUDE.md** - Update quick reference if needed
---
## SSL Certificates
Traefik has **two certificate resolvers** configured:
| Resolver | Use When | Challenge Type | Notes |
|----------|----------|----------------|-------|
| `letsencrypt` | Cloudflare DNS-only (gray cloud ☁️) | HTTP-01 | Requires port 80 reachable |
| `cloudflare` | Cloudflare Proxied (orange cloud 🟠) | DNS-01 | Works with Cloudflare proxy |
### ⚠️ Important: HTTP Challenge vs DNS Challenge
**If Cloudflare proxy is enabled** (orange cloud), HTTP challenge **FAILS** because Cloudflare redirects HTTP→HTTPS before the challenge reaches your server.
**Solution**: Use `cloudflare` resolver (DNS-01 challenge) instead.
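To see how the two resolvers are actually defined on CT 202 (read-only check):
```bash
ssh pve 'pct exec 202 -- grep -A 10 "certificatesResolvers" /etc/traefik/traefik.yaml'
```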
### Certificate Resolver Configuration
**Cloudflare API credentials** are configured in `/etc/systemd/system/traefik.service`:
```ini
Environment="CF_API_EMAIL=cloudflare@htsn.io"
Environment="CF_API_KEY=849ebefd163d2ccdec25e49b3e1b3fe2cdadc"
```
### Certificate Storage
| Resolver | Storage File |
|----------|--------------|
| HTTP challenge (`letsencrypt`) | `/etc/traefik/acme.json` |
| DNS challenge (`cloudflare`) | `/etc/traefik/acme-cf.json` |
**Permissions**: Must be `600` (read/write owner only)
```bash
# Check permissions
ssh pve 'pct exec 202 -- ls -la /etc/traefik/acme*.json'
# Fix if needed
ssh pve 'pct exec 202 -- chmod 600 /etc/traefik/acme.json'
ssh pve 'pct exec 202 -- chmod 600 /etc/traefik/acme-cf.json'
```
### Certificate Renewal
- **Automatic** via Traefik
- Checks every 24 hours
- Renews 30 days before expiry
- No manual intervention needed
### Troubleshooting Certificates
#### Certificate Fails to Issue
```bash
# Check Traefik logs
ssh pve 'pct exec 202 -- tail -f /var/log/traefik/traefik.log | grep -i error'
# Verify Cloudflare API access
curl -X GET "https://api.cloudflare.com/client/v4/user/tokens/verify" \
-H "X-Auth-Email: cloudflare@htsn.io" \
-H "X-Auth-Key: 849ebefd163d2ccdec25e49b3e1b3fe2cdadc"
# Check acme.json permissions
ssh pve 'pct exec 202 -- ls -la /etc/traefik/acme*.json'
```
#### Force Certificate Renewal
```bash
# Delete certificate (Traefik will re-request)
ssh pve 'pct exec 202 -- rm /etc/traefik/acme-cf.json'
ssh pve 'pct exec 202 -- touch /etc/traefik/acme-cf.json'
ssh pve 'pct exec 202 -- chmod 600 /etc/traefik/acme-cf.json'
ssh pve 'pct exec 202 -- systemctl restart traefik'
# Watch logs
ssh pve 'pct exec 202 -- tail -f /var/log/traefik/traefik.log'
```
---
## Quick Deployment - One-Liner
For fast deployment, use this all-in-one command:
```bash
# === DEPLOY SERVICE (example: myservice on docker-host port 8080) ===
# 1. Create Traefik config
ssh pve 'pct exec 202 -- bash -c "cat > /etc/traefik/conf.d/myservice.yaml << EOF
http:
routers:
myservice-secure:
entryPoints: [websecure]
rule: Host(\\\`myservice.htsn.io\\\`)
service: myservice
tls: {certResolver: cloudflare}
services:
myservice:
loadBalancer:
servers:
- url: http://10.10.10.206:8080
EOF"'
# 2. Add Cloudflare DNS
curl -s -X POST "https://api.cloudflare.com/client/v4/zones/c0f5a80448c608af35d39aa820a5f3af/dns_records" \
-H "X-Auth-Email: cloudflare@htsn.io" \
-H "X-Auth-Key: 849ebefd163d2ccdec25e49b3e1b3fe2cdadc" \
-H "Content-Type: application/json" \
--data '{"type":"A","name":"myservice","content":"70.237.94.174","proxied":true}'
# 3. Test (wait a few seconds for DNS propagation)
curl -I https://myservice.htsn.io
```
---
## Docker Service with Traefik Labels (Alternative)
If deploying a service via Docker on `docker-host` (VM 206), you can use Traefik labels instead of config files.
**Requirements**:
- Traefik must have access to Docker socket
- Service must be on same Docker network as Traefik
**Example docker-compose.yml**:
```yaml
version: "3.8"
services:
myservice:
image: myimage:latest
labels:
- "traefik.enable=true"
- "traefik.http.routers.myservice.rule=Host(`myservice.htsn.io`)"
- "traefik.http.routers.myservice.entrypoints=websecure"
- "traefik.http.routers.myservice.tls.certresolver=letsencrypt"
- "traefik.http.services.myservice.loadbalancer.server.port=8080"
networks:
- traefik
networks:
traefik:
external: true
```
**Note**: This method is NOT currently used on Traefik-Primary (CT 202), as it doesn't have Docker socket access. Config files are preferred.
---
## Cloudflare API Reference
### API Credentials
| Field | Value |
|-------|-------|
| Email | cloudflare@htsn.io |
| API Key | 849ebefd163d2ccdec25e49b3e1b3fe2cdadc |
| Zone ID | c0f5a80448c608af35d39aa820a5f3af |
### Common API Operations
Set credentials:
```bash
CF_EMAIL="cloudflare@htsn.io"
CF_API_KEY="849ebefd163d2ccdec25e49b3e1b3fe2cdadc"
ZONE_ID="c0f5a80448c608af35d39aa820a5f3af"
```
**List all DNS records**:
```bash
curl -X GET "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records" \
-H "X-Auth-Email: $CF_EMAIL" \
-H "X-Auth-Key: $CF_API_KEY" | jq
```
**Add A record**:
```bash
curl -X POST "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records" \
-H "X-Auth-Email: $CF_EMAIL" \
-H "X-Auth-Key: $CF_API_KEY" \
-H "Content-Type: application/json" \
--data '{
"type":"A",
"name":"subdomain",
"content":"70.237.94.174",
"proxied":true
}'
```
**Delete record**:
```bash
curl -X DELETE "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records/$RECORD_ID" \
-H "X-Auth-Email: $CF_EMAIL" \
-H "X-Auth-Key: $CF_API_KEY"
```
**Update record** (toggle proxy):
```bash
curl -X PATCH "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records/$RECORD_ID" \
-H "X-Auth-Email: $CF_EMAIL" \
-H "X-Auth-Key: $CF_API_KEY" \
-H "Content-Type: application/json" \
--data '{"proxied":false}'
```
---
## Troubleshooting
### Service Not Accessible
```bash
# 1. Check if DNS resolves
dig myservice.htsn.io
# 2. Check if backend is reachable
curl -I http://10.10.10.XXX:PORT
# 3. Check Traefik logs
ssh pve 'pct exec 202 -- tail -f /var/log/traefik/traefik.log'
# 4. Check Traefik config is valid
ssh pve 'pct exec 202 -- cat /etc/traefik/conf.d/myservice.yaml'
# 5. Restart Traefik (if needed)
ssh pve 'pct exec 202 -- systemctl restart traefik'
```
### Certificate Issues
```bash
# Check certificate status in acme.json
ssh pve 'pct exec 202 -- cat /etc/traefik/acme-cf.json | jq'
# Check certificate expiry
echo | openssl s_client -servername myservice.htsn.io -connect myservice.htsn.io:443 2>/dev/null | openssl x509 -noout -dates
```
### 502 Bad Gateway
**Cause**: Backend service is down or unreachable
```bash
# Check if backend is running
ssh backend-host 'systemctl status myservice'
# Check if port is open
nc -zv 10.10.10.XXX PORT
# Check firewall
ssh backend-host 'iptables -L -n | grep PORT'
```
### 404 Not Found
**Cause**: Traefik can't match the request to a router
```bash
# Check router rule matches domain
ssh pve 'pct exec 202 -- cat /etc/traefik/conf.d/myservice.yaml | grep rule'
# Should be: rule: "Host(`myservice.htsn.io`)"
# Check DNS is pointing to correct IP
dig myservice.htsn.io
# Restart Traefik to reload config
ssh pve 'pct exec 202 -- systemctl restart traefik'
```
---
## Advanced Configuration Examples
### WebSocket Support
For services that use WebSockets (like Home Assistant):
```yaml
http:
routers:
myservice-secure:
entryPoints:
- websecure
rule: "Host(`myservice.htsn.io`)"
service: myservice
tls:
certResolver: cloudflare
services:
myservice:
loadBalancer:
servers:
- url: "http://10.10.10.XXX:PORT"
# No special config needed - WebSockets work by default in Traefik v2+
```
### Custom Headers
Add custom headers (e.g., security headers):
```yaml
http:
routers:
myservice-secure:
middlewares:
- myservice-headers
middlewares:
myservice-headers:
headers:
customResponseHeaders:
X-Frame-Options: "DENY"
X-Content-Type-Options: "nosniff"
Referrer-Policy: "strict-origin-when-cross-origin"
```
### Basic Authentication
Protect a service with basic auth:
```yaml
http:
routers:
myservice-secure:
middlewares:
- myservice-auth
middlewares:
myservice-auth:
basicAuth:
users:
- "user:$apr1$..." # Generate with: htpasswd -nb user password
```
---
## Maintenance
### Monthly Checks
```bash
# Check Traefik status
ssh pve 'pct exec 202 -- systemctl status traefik'
# Review logs for errors
ssh pve 'pct exec 202 -- grep -i error /var/log/traefik/traefik.log | tail -20'
# Check certificate expiry dates
ssh pve 'pct exec 202 -- cat /etc/traefik/acme-cf.json | jq ".cloudflare.Certificates[] | {domain: .domain.main, expiry: .certificate}"'
# Verify all services responding
for domain in plex.htsn.io git.htsn.io truenas.htsn.io; do
echo "Testing $domain..."
curl -sI https://$domain | head -1
done
```
### Backup Traefik Config
```bash
# Backup all configs
ssh pve 'pct exec 202 -- tar czf /tmp/traefik-backup-$(date +%Y%m%d).tar.gz /etc/traefik'
# Copy to safe location
scp "pve:/var/lib/lxc/202/rootfs/tmp/traefik-backup-*.tar.gz" ~/Backups/traefik/
```
---
## Related Documentation
- [IP-ASSIGNMENTS.md](IP-ASSIGNMENTS.md) - Service IP addresses
- [CLOUDFLARE.md](#) - Cloudflare DNS management (coming soon)
- [SERVICES.md](#) - Complete service inventory (coming soon)
---
**Last Updated**: 2025-12-22

605
UPS.md Normal file
View File

@@ -0,0 +1,605 @@
# UPS and Power Management
Documentation for UPS (Uninterruptible Power Supply) configuration, NUT (Network UPS Tools) monitoring, and power failure procedures.
## Hardware
### Current UPS
| Specification | Value |
|---------------|-------|
| **Model** | CyberPower OR2200PFCRT2U |
| **Capacity** | 2200VA / 1320W |
| **Form Factor** | 2U rackmount |
| **Output** | PFC Sinewave (compatible with active PFC PSUs) |
| **Outlets** | 2x NEMA 5-20R + 6x NEMA 5-15R (all battery + surge) |
| **Input Plug** | ⚠️ Originally NEMA 5-20P (20A), **rewired to 5-15P (15A)** |
| **Runtime** | ~15-20 min at typical load (~33% / 440W) |
| **Installed** | 2025-12-21 |
| **Status** | Active |
### ⚠️ Temporary Wiring Modification
**Issue**: UPS came with NEMA 5-20P plug (20A) but server rack is on 15A circuit
**Solution**: Temporarily rewired plug from 5-20P → 5-15P for compatibility
**Risk**: UPS can output 1320W but circuit limited to 1800W max (15A × 120V)
**Current draw**: ~1000-1350W total (safe margin)
**Backlog**: Upgrade to 20A circuit, restore original 5-20P plug
### Previous UPS
| Model | Capacity | Issue | Replaced |
|-------|----------|-------|----------|
| WattBox WB-1100-IPVMB-6 | 1100VA / 660W | Insufficient for dual Threadripper setup | 2025-12-21 |
**Why replaced**: Combined server load of 1000-1350W exceeded 660W capacity.
---
## Power Draw Estimates
### Typical Load
| Component | Idle | Load | Notes |
|-----------|------|------|-------|
| PVE Server | 250-350W | 500-750W | CPU + TITAN RTX + P2000 + storage |
| PVE2 Server | 200-300W | 450-600W | CPU + RTX A6000 + storage |
| Network gear | ~50W | ~50W | Router, switches |
| **Total** | **500-700W** | **1000-1400W** | Varies by workload |
**UPS Load**: ~33-50% typical, 70-80% under heavy load
### Runtime Calculation
At **440W load** (33%): ~15-20 min runtime (tested 2025-12-21)
At **660W load** (50%): ~10-12 min estimated
At **1000W load** (75%): ~6-8 min estimated
**NUT shutdown trigger**: 120 seconds (2 min) remaining runtime
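To confirm the configured threshold and the live runtime estimate it is compared against:
```bash
ssh pve 'upsc cyberpower@localhost battery.runtime.low'   # configured trigger (should report 120)
ssh pve 'upsc cyberpower@localhost battery.runtime'       # current runtime estimate in seconds
```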
---
## NUT (Network UPS Tools) Configuration
### Architecture
```
UPS (USB) ──> PVE (NUT Server/Master) ──> PVE2 (NUT Client/Slave)
└──> Home Assistant (monitoring only)
```
**Master**: PVE (10.10.10.120) - UPS connected via USB, runs NUT server
**Slave**: PVE2 (10.10.10.102) - Monitors PVE's NUT server, shuts down when triggered
### NUT Server Configuration (PVE)
#### 1. UPS Driver Config: `/etc/nut/ups.conf`
```ini
[cyberpower]
driver = usbhid-ups
port = auto
desc = "CyberPower OR2200PFCRT2U"
override.battery.charge.low = 20
override.battery.runtime.low = 120
```
**Key settings**:
- `driver = usbhid-ups`: USB HID UPS driver (generic for CyberPower)
- `port = auto`: Auto-detect USB device
- `override.battery.runtime.low = 120`: Trigger shutdown at 120 seconds (2 min) remaining
#### 2. NUT Server Config: `/etc/nut/upsd.conf`
```ini
LISTEN 127.0.0.1 3493
LISTEN 10.10.10.120 3493
```
**Listens on**:
- Localhost (for local monitoring)
- LAN IP (for PVE2 to connect)
#### 3. User Config: `/etc/nut/upsd.users`
```ini
[admin]
password = upsadmin123
actions = SET
instcmds = ALL
[upsmon]
password = upsmon123
upsmon master
```
**Users**:
- `admin`: Full control, can run commands
- `upsmon`: Monitoring only (used by PVE2)
#### 4. Monitor Config: `/etc/nut/upsmon.conf`
```ini
MONITOR cyberpower@localhost 1 upsmon upsmon123 master
MINSUPPLIES 1
SHUTDOWNCMD "/usr/local/bin/ups-shutdown.sh"
NOTIFYCMD /usr/sbin/upssched
POLLFREQ 5
POLLFREQALERT 5
HOSTSYNC 15
DEADTIME 15
POWERDOWNFLAG /etc/killpower
NOTIFYMSG ONLINE "UPS %s on line power"
NOTIFYMSG ONBATT "UPS %s on battery"
NOTIFYMSG LOWBATT "UPS %s battery is low"
NOTIFYMSG FSD "UPS %s: forced shutdown in progress"
NOTIFYMSG COMMOK "Communications with UPS %s established"
NOTIFYMSG COMMBAD "Communications with UPS %s lost"
NOTIFYMSG SHUTDOWN "Auto logout and shutdown proceeding"
NOTIFYMSG REPLBATT "UPS %s battery needs to be replaced"
NOTIFYMSG NOCOMM "UPS %s is unavailable"
NOTIFYMSG NOPARENT "upsmon parent process died - shutdown impossible"
NOTIFYFLAG ONLINE SYSLOG+WALL
NOTIFYFLAG ONBATT SYSLOG+WALL
NOTIFYFLAG LOWBATT SYSLOG+WALL
NOTIFYFLAG FSD SYSLOG+WALL
NOTIFYFLAG COMMOK SYSLOG+WALL
NOTIFYFLAG COMMBAD SYSLOG+WALL
NOTIFYFLAG SHUTDOWN SYSLOG+WALL
NOTIFYFLAG REPLBATT SYSLOG+WALL
NOTIFYFLAG NOCOMM SYSLOG+WALL
NOTIFYFLAG NOPARENT SYSLOG
```
**Key settings**:
- `MONITOR cyberpower@localhost 1 upsmon upsmon123 master`: Monitor local UPS
- `SHUTDOWNCMD "/usr/local/bin/ups-shutdown.sh"`: Custom shutdown script
- `POLLFREQ 5`: Check UPS every 5 seconds
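After editing any of these files, restart the corresponding services so the changes take effect; a sketch using the same service names as the rest of this document:
```bash
# After ups.conf / upsd.conf / upsd.users changes
ssh pve 'systemctl restart nut-driver nut-server'
# After upsmon.conf changes
ssh pve 'systemctl restart nut-monitor'
```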
#### 5. USB Permissions: `/etc/udev/rules.d/99-nut-ups.rules`
```udev
SUBSYSTEM=="usb", ATTR{idVendor}=="0764", ATTR{idProduct}=="0501", MODE="0660", GROUP="nut"
```
**Purpose**: Ensure NUT can access USB UPS device
**Apply rule**:
```bash
udevadm control --reload-rules
udevadm trigger
```
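To verify the rule applied, locate the UPS with `lsusb` and check the device node's ownership (the bus/device numbers below are examples and will vary):
```bash
ssh pve 'lsusb | grep -i cyber'
# Using the Bus/Device numbers reported above, e.g. bus 001, device 003:
ssh pve 'ls -l /dev/bus/usb/001/003'   # group should be "nut", mode 0660
```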
### NUT Client Configuration (PVE2)
#### Monitor Config: `/etc/nut/upsmon.conf`
```ini
MONITOR cyberpower@10.10.10.120 1 upsmon upsmon123 slave
MINSUPPLIES 1
SHUTDOWNCMD "/usr/local/bin/ups-shutdown.sh"
POLLFREQ 5
POLLFREQALERT 5
HOSTSYNC 15
DEADTIME 15
POWERDOWNFLAG /etc/killpower
# Same NOTIFYMSG and NOTIFYFLAG as PVE
```
**Key difference**: `slave` instead of `master` - monitors remote UPS on PVE
---
## Custom Shutdown Script
### `/usr/local/bin/ups-shutdown.sh` (Same on both PVE and PVE2)
```bash
#!/bin/bash
# Graceful VM/CT shutdown when UPS battery low

LOG="/var/log/ups-shutdown.log"

log() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') - $1" | tee -a "$LOG"
}

log "=== UPS Shutdown Triggered ==="
log "Battery low - initiating graceful shutdown of VMs/CTs"

# Get list of running VMs (skip TrueNAS for now)
VMS=$(qm list | awk '$3=="running" && $1!=100 {print $1}')
for VMID in $VMS; do
    log "Stopping VM $VMID..."
    qm shutdown $VMID
done

# Get list of running containers
CTS=$(pct list | awk '$2=="running" {print $1}')
for CTID in $CTS; do
    log "Stopping CT $CTID..."
    pct shutdown $CTID
done

# Wait for VMs/CTs to stop
log "Waiting 60 seconds for VMs/CTs to shut down..."
sleep 60

# Now stop TrueNAS (storage - must be last)
if qm status 100 | grep -q running; then
    log "Stopping TrueNAS (VM 100) last..."
    qm shutdown 100
    sleep 30
fi

log "All VMs/CTs stopped. Host will remain running until UPS dies."
log "=== UPS Shutdown Complete ==="
```
**Make executable**:
```bash
chmod +x /usr/local/bin/ups-shutdown.sh
```
**Script behavior**:
1. Stops all VMs (except TrueNAS)
2. Stops all containers
3. Waits 60 seconds
4. Stops TrueNAS last (storage must be cleanly unmounted)
5. **Does NOT shut down Proxmox hosts** - intentionally left running
**Why not shut down hosts?**
- BIOS configured to "Restore on AC Power Loss"
- When power returns, servers auto-boot and start VMs in order
- Avoids need for manual intervention
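Because recovery depends on VMs auto-starting once power returns, it's worth periodically confirming that `onboot` and `startup` are set on every VM. A minimal check, run directly on a Proxmox host (sketch):
```bash
# List onboot/startup settings for every VM on this node
for id in $(qm list | awk 'NR>1 {print $1}'); do
    printf 'VM %s: ' "$id"
    qm config "$id" | grep -E '^(onboot|startup):' | tr '\n' ' '
    echo
done
```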
---
## Power Failure Behavior
### When Power Fails
1. **UPS switches to battery** (`OB DISCHRG` status)
2. **NUT monitors runtime** - polls every 5 seconds
3. **At 120 seconds (2 min) remaining**:
- NUT triggers `/usr/local/bin/ups-shutdown.sh` on both servers
- Script gracefully stops all VMs/CTs
- TrueNAS stopped last (storage integrity)
4. **Hosts remain running** until UPS battery depletes
5. **UPS battery dies** → Hosts lose power (ungraceful but safe - VMs already stopped)
### When Power Returns
1. **UPS charges battery**, power returns to servers
2. **BIOS "Restore on AC Power Loss"** boots both servers
3. **Proxmox starts** and auto-starts VMs in configured order:
| Order | Wait | VMs/CTs | Reason |
|-------|------|---------|--------|
| 1 | 30s | TrueNAS (VM 100) | Storage must start first |
| 2 | 60s | Saltbox (VM 101) | Depends on TrueNAS NFS |
| 3 | 10s | fs-dev, homeassistant, lmdev1, copyparty, docker-host | General VMs |
| 4 | 5s | pihole, traefik, findshyt | Containers |
PVE2 VMs: order=1, wait=10s
**Total recovery time**: ~7 minutes from power restoration to fully operational (tested 2025-12-21)
---
## UPS Status Codes
| Code | Meaning | Action |
|------|---------|--------|
| `OL` | Online (AC power) | Normal operation |
| `OB` | On Battery | Power outage - monitor runtime |
| `LB` | Low Battery | <2 min remaining - shutdown imminent |
| `CHRG` | Charging | Battery charging after power restored |
| `DISCHRG` | Discharging | On battery, draining |
| `FSD` | Forced Shutdown | NUT triggered shutdown |
---
## Monitoring & Commands
### Check UPS Status
```bash
# Full status
ssh pve 'upsc cyberpower@localhost'
# Key metrics only
ssh pve 'upsc cyberpower@localhost | grep -E "battery.charge:|battery.runtime:|ups.load:|ups.status:"'
# Example output:
# battery.charge: 100
# battery.runtime: 1234 (seconds remaining)
# ups.load: 33 (% load)
# ups.status: OL (online)
```
### Control UPS Beeper
```bash
# Mute beeper (temporary - until next power event)
ssh pve 'upscmd -u admin -p upsadmin123 cyberpower@localhost beeper.mute'
# Disable beeper (permanent)
ssh pve 'upscmd -u admin -p upsadmin123 cyberpower@localhost beeper.disable'
# Enable beeper
ssh pve 'upscmd -u admin -p upsadmin123 cyberpower@localhost beeper.enable'
```
### Test Shutdown Procedure
**Simulate low battery** (careful - this will shut down VMs!):
```bash
# Set a very high low battery threshold to trigger shutdown
ssh pve 'upsrw -s battery.runtime.low=300 -u admin -p upsadmin123 cyberpower@localhost'
# Pull input power, then watch it trigger once the UPS is on battery and runtime drops below 300 seconds
ssh pve 'tail -f /var/log/ups-shutdown.log'
# Reset to normal
ssh pve 'upsrw -s battery.runtime.low=120 -u admin -p upsadmin123 cyberpower@localhost'
```
**Better test**: Run the shutdown script manually (this still stops all VMs/CTs, but leaves the NUT thresholds untouched):
```bash
ssh pve '/usr/local/bin/ups-shutdown.sh'
```
---
## Home Assistant Integration
UPS metrics are exposed to Home Assistant via NUT integration.
### Available Sensors
| Entity ID | Description |
|-----------|-------------|
| `sensor.cyberpower_battery_charge` | Battery % (0-100) |
| `sensor.cyberpower_battery_runtime` | Seconds remaining on battery |
| `sensor.cyberpower_load` | Load % (0-100) |
| `sensor.cyberpower_input_voltage` | Input voltage (V AC) |
| `sensor.cyberpower_output_voltage` | Output voltage (V AC) |
| `sensor.cyberpower_status` | Status text (OL, OB, LB, etc.) |
### Configuration
**Home Assistant**: See [HOMEASSISTANT.md](HOMEASSISTANT.md) for integration setup.
### Example Automations
**Send notification when on battery**:
```yaml
automation:
  - alias: "UPS On Battery Alert"
    trigger:
      - platform: state
        entity_id: sensor.cyberpower_status
        to: "OB"
    action:
      - service: notify.mobile_app
        data:
          message: "⚠️ Power outage! UPS on battery. Runtime: {{ states('sensor.cyberpower_battery_runtime') }}s"
```
**Alert when battery low**:
```yaml
automation:
  - alias: "UPS Low Battery Alert"
    trigger:
      - platform: numeric_state
        entity_id: sensor.cyberpower_battery_runtime
        below: 300
    action:
      - service: notify.mobile_app
        data:
          message: "🚨 UPS battery low! {{ states('sensor.cyberpower_battery_runtime') }}s remaining"
```
---
## Testing Results
### Full Power Failure Test (2025-12-21)
Complete end-to-end test of power failure and recovery:
| Event | Time | Duration | Notes |
|-------|------|----------|-------|
| **Power pulled** | 22:30 | - | UPS on battery, ~15 min runtime at 33% load |
| **Low battery trigger** | 22:40:38 | +10:38 | Runtime < 120s, shutdown script ran |
| **All VMs stopped** | 22:41:36 | +0:58 | Graceful shutdown completed |
| **UPS died** | 22:46:29 | +4:53 | Hosts lost power at 0% battery |
| **Power restored** | ~22:47 | - | Plugged back in |
| **PVE online** | 22:49:11 | +2:11 | BIOS boot, Proxmox started |
| **PVE2 online** | 22:50:47 | +3:47 | BIOS boot, Proxmox started |
| **All VMs running** | 22:53:39 | +6:39 | Auto-started in correct order |
| **Total recovery** | - | **~7 min** | From power return to fully operational |
**Results**:
✅ VMs shut down gracefully
✅ Hosts remained running until UPS died (as intended)
✅ Auto-boot on power restoration worked
✅ VMs started in correct order with appropriate delays
✅ No data corruption or issues
**Runtime calculation**:
- Load: ~33% (440W estimated)
- Total runtime on battery: ~16 minutes (22:30 → 22:46:29)
- Matches manufacturer estimate for 33% load
---
## Proxmox Cluster Quorum Fix
### Problem
With a 2-node cluster, if one node goes down, the other loses quorum and can't manage VMs.
During UPS testing, this would prevent the remaining node from starting VMs after power restoration.
### Solution
Modified `/etc/pve/corosync.conf` to enable 2-node mode:
```
quorum {
  provider: corosync_votequorum
  two_node: 1
}
```
**Effect**:
- Either node can operate independently if the other is down
- No more waiting for quorum when one server is offline
- Both nodes visible in single Proxmox interface when both up
**Applied**: 2025-12-21
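To confirm two-node mode is active, check the votequorum flags (exact output varies by Proxmox/corosync version):
```bash
ssh pve 'pvecm status | grep -A6 "Votequorum information"'
# Look for "Flags: 2Node Quorate" in the output
```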
---
## Maintenance
### Monthly Checks
```bash
# Check UPS status
ssh pve 'upsc cyberpower@localhost'
# Check NUT server running
ssh pve 'systemctl status nut-server'
ssh pve 'systemctl status nut-monitor'
# Check NUT client running (PVE2)
ssh pve2 'systemctl status nut-monitor'
# Verify PVE2 can see UPS
ssh pve2 'upsc cyberpower@10.10.10.120'
# Check logs for errors
ssh pve 'journalctl -u nut-server -n 50'
ssh pve 'journalctl -u nut-monitor -n 50'
```
### Battery Health
**Check battery stats**:
```bash
ssh pve 'upsc cyberpower@localhost | grep battery'
# Key metrics:
# battery.charge: 100 (should be near 100 when on AC)
# battery.runtime: 1200+ (seconds at current load)
# battery.voltage: ~24V (normal for 24V battery system)
```
**Battery replacement**: Replace the battery when runtime significantly decreases or the UPS reports `REPLBATT`. To check battery age:
```bash
ssh pve 'upsc cyberpower@localhost | grep battery.mfr.date'
```
CyberPower batteries typically last 3-5 years.
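If the unit exposes it, NUT can also trigger a battery self-test. Available instant commands vary by model, so list them first; `test.battery.start.quick` below is only an assumption until it appears in that list:
```bash
# List instant commands the driver exposes for this UPS
ssh pve 'upscmd -l cyberpower@localhost'
# If listed, start a quick battery self-test
ssh pve 'upscmd -u admin -p upsadmin123 cyberpower@localhost test.battery.start.quick'
```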
### Firmware Updates
Check CyberPower website for firmware updates:
https://www.cyberpowersystems.com/support/firmware/
---
## Troubleshooting
### UPS Not Detected
```bash
# Check USB connection
ssh pve 'lsusb | grep Cyber'
# Expected:
# Bus 001 Device 003: ID 0764:0501 Cyber Power System, Inc. CP1500 AVR UPS
# Restart NUT driver
ssh pve 'systemctl restart nut-driver'
ssh pve 'systemctl status nut-driver'
```
### PVE2 Can't Connect
```bash
# Verify NUT server listening
ssh pve 'netstat -tuln | grep 3493'
# Should show:
# tcp 0 0 10.10.10.120:3493 0.0.0.0:* LISTEN
# Test connection from PVE2
ssh pve2 'telnet 10.10.10.120 3493'
# Check firewall (should allow port 3493)
ssh pve 'iptables -L -n | grep 3493'
```
### Shutdown Script Not Running
```bash
# Check script permissions
ssh pve 'ls -la /usr/local/bin/ups-shutdown.sh'
# Should be: -rwxr-xr-x (executable)
# Check logs
ssh pve 'cat /var/log/ups-shutdown.log'
# Test script manually
ssh pve '/usr/local/bin/ups-shutdown.sh'
```
### UPS Status Shows UNKNOWN
```bash
# Driver may not be compatible
ssh pve 'upsc cyberpower@localhost ups.status'
# Try different driver (in /etc/nut/ups.conf)
# driver = usbhid-ups
# or
# driver = blazer_usb
# Restart after change
ssh pve 'systemctl restart nut-driver nut-server'
```
---
## Future Improvements
- [ ] Add email alerts for UPS events (power fail, low battery)
- [ ] Log runtime statistics to track battery degradation
- [ ] Set up Grafana dashboard for UPS metrics
- [ ] Test battery runtime at different load levels
- [ ] Upgrade to 20A circuit, restore original 5-20P plug
- [ ] Consider adding network management card for out-of-band UPS access
---
## Related Documentation
- [POWER-MANAGEMENT.md](POWER-MANAGEMENT.md) - Overall power optimization
- [VMS.md](VMS.md) - VM startup order configuration
- [HOMEASSISTANT.md](HOMEASSISTANT.md) - UPS sensor integration
---
**Last Updated**: 2025-12-22

580
VMS.md Normal file
View File

@@ -0,0 +1,580 @@
# VMs and Containers
Complete inventory of all virtual machines and LXC containers across both Proxmox servers.
## Overview
| Server | VMs | LXCs | Total |
|--------|-----|------|-------|
| **PVE** (10.10.10.120) | 7 | 3 | 10 |
| **PVE2** (10.10.10.102) | 3 | 0 | 3 |
| **Total** | **10** | **3** | **13** |
---
## PVE (10.10.10.120) - Primary Server
### Virtual Machines
| VMID | Name | IP | vCPUs | RAM | Storage | Purpose | GPU/Passthrough | QEMU Agent |
|------|------|-----|-------|-----|---------|---------|-----------------|------------|
| **100** | truenas | 10.10.10.200 | 8 | 32GB | nvme-mirror1 | NAS, central file storage | LSI SAS2308 HBA, Samsung NVMe | ✅ Yes |
| **101** | saltbox | 10.10.10.100 | 16 | 16GB | nvme-mirror1 | Media automation (Plex, *arr) | TITAN RTX | ✅ Yes |
| **105** | fs-dev | 10.10.10.5 | 10 | 8GB | rpool | Development environment | - | ✅ Yes |
| **110** | homeassistant | 10.10.10.110 | 2 | 2GB | rpool | Home automation platform | - | ❌ No |
| **111** | lmdev1 | 10.10.10.111 | 8 | 32GB | nvme-mirror1 | AI/LLM development | TITAN RTX | ✅ Yes |
| **201** | copyparty | 10.10.10.201 | 2 | 2GB | rpool | File sharing service | - | ✅ Yes |
| **206** | docker-host | 10.10.10.206 | 2 | 4GB | rpool | Docker services (Excalidraw, Happy, Pulse) | - | ✅ Yes |
### LXC Containers
| CTID | Name | IP | RAM | Storage | Purpose |
|------|------|-----|-----|---------|---------|
| **200** | pihole | 10.10.10.10 | - | rpool | DNS, ad blocking |
| **202** | traefik | 10.10.10.250 | - | rpool | Reverse proxy (primary) |
| **205** | findshyt | 10.10.10.8 | - | rpool | Custom app |
---
## PVE2 (10.10.10.102) - Secondary Server
### Virtual Machines
| VMID | Name | IP | vCPUs | RAM | Storage | Purpose | GPU/Passthrough | QEMU Agent |
|------|------|-----|-------|-----|---------|---------|-----------------|------------|
| **300** | gitea-vm | 10.10.10.220 | 2 | 4GB | nvme-mirror3 | Git server (Gitea) | - | ✅ Yes |
| **301** | trading-vm | 10.10.10.221 | 16 | 32GB | nvme-mirror3 | AI trading platform | RTX A6000 | ✅ Yes |
| **302** | docker-host2 | 10.10.10.207 | 4 | 8GB | nvme-mirror3 | Docker host (n8n, automation) | - | ✅ Yes |
### LXC Containers
None on PVE2.
---
## VM Details
### 100 - TrueNAS (Storage Server)
**Purpose**: Central NAS for all file storage, NFS/SMB shares, and media libraries
**Specs**:
- **OS**: TrueNAS SCALE
- **vCPUs**: 8
- **RAM**: 32 GB
- **Storage**: nvme-mirror1 (OS), EMC storage enclosure (data pool via HBA passthrough)
- **Network**:
- Primary: 10 Gb (vmbr2)
- Secondary: Internal storage network (vmbr3 @ 10.10.20.x)
**Hardware Passthrough**:
- LSI SAS2308 HBA (for EMC enclosure drives)
- Samsung NVMe (for ZFS caching)
**ZFS Pools**:
- `vault`: Main storage pool on EMC drives
- Boot pool on passed-through NVMe
**See**: [STORAGE.md](STORAGE.md), [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md)
---
### 101 - Saltbox (Media Automation)
**Purpose**: Media server stack - Plex, Sonarr, Radarr, SABnzbd, Overseerr, etc.
**Specs**:
- **OS**: Ubuntu 22.04
- **vCPUs**: 16
- **RAM**: 16 GB
- **Storage**: nvme-mirror1
- **Network**: 10 Gb (vmbr2)
**GPU Passthrough**:
- NVIDIA TITAN RTX (for Plex hardware transcoding)
**Services**:
- Plex Media Server (plex.htsn.io)
- Sonarr, Radarr, Lidarr (TV/movie/music automation)
- SABnzbd, NZBGet (downloaders)
- Overseerr (request management)
- Tautulli (Plex stats)
- Organizr (dashboard)
- Authelia (SSO authentication)
- Traefik (reverse proxy - separate from CT 202)
**Managed By**: Saltbox Ansible playbooks
**See**: [SALTBOX.md](#) (coming soon)
---
### 105 - fs-dev (Development Environment)
**Purpose**: General development work, testing, prototyping
**Specs**:
- **OS**: Ubuntu 22.04
- **vCPUs**: 10
- **RAM**: 8 GB
- **Storage**: rpool
- **Network**: 1 Gb (vmbr0)
---
### 110 - Home Assistant (Home Automation)
**Purpose**: Smart home automation platform
**Specs**:
- **OS**: Home Assistant OS
- **vCPUs**: 2
- **RAM**: 2 GB
- **Storage**: rpool
- **Network**: 1 Gb (vmbr0)
**Access**:
- Web UI: https://homeassistant.htsn.io
- API: See [HOMEASSISTANT.md](HOMEASSISTANT.md)
**Special Notes**:
- ❌ No QEMU agent (Home Assistant OS doesn't support it)
- No SSH server by default (access via web terminal)
---
### 111 - lmdev1 (AI/LLM Development)
**Purpose**: AI model development, fine-tuning, inference
**Specs**:
- **OS**: Ubuntu 22.04
- **vCPUs**: 8
- **RAM**: 32 GB
- **Storage**: nvme-mirror1
- **Network**: 1 Gb (vmbr0)
**GPU Passthrough**:
- NVIDIA TITAN RTX (shared with Saltbox, but can be dedicated if needed)
**Installed**:
- CUDA toolkit
- Python 3.11+
- PyTorch, TensorFlow
- Hugging Face transformers
---
### 201 - Copyparty (File Sharing)
**Purpose**: Simple HTTP file sharing server
**Specs**:
- **OS**: Ubuntu 22.04
- **vCPUs**: 2
- **RAM**: 2 GB
- **Storage**: rpool
- **Network**: 1 Gb (vmbr0)
**Access**: https://copyparty.htsn.io
---
### 206 - docker-host (Docker Services)
**Purpose**: General-purpose Docker host for miscellaneous services
**Specs**:
- **OS**: Ubuntu 22.04
- **vCPUs**: 2
- **RAM**: 4 GB
- **Storage**: rpool
- **Network**: 1 Gb (vmbr0)
- **CPU**: `host` passthrough (for x86-64-v3 support)
**Services Running**:
- Excalidraw (excalidraw.htsn.io) - Whiteboard
- Happy Coder relay server (happy.htsn.io) - Self-hosted relay for Happy Coder mobile app
- Pulse (pulse.htsn.io) - Monitoring dashboard
**Docker Compose Files**: `/opt/*/docker-compose.yml`
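For reference, the `host` CPU type mentioned above can be set (or re-applied after a rebuild) with a single `qm` call; a sketch, taking effect on the next VM start:
```bash
ssh pve 'qm set 206 --cpu host'
```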
---
### 300 - gitea-vm (Git Server)
**Purpose**: Self-hosted Git server
**Specs**:
- **OS**: Ubuntu 22.04
- **vCPUs**: 2
- **RAM**: 4 GB
- **Storage**: nvme-mirror3 (PVE2)
- **Network**: 1 Gb (vmbr0)
**Access**: https://git.htsn.io
**Repositories**:
- homelab-docs (this documentation)
- Personal projects
- Private repos
---
### 301 - trading-vm (AI Trading Platform)
**Purpose**: Algorithmic trading system with AI models
**Specs**:
- **OS**: Ubuntu 22.04
- **vCPUs**: 16
- **RAM**: 32 GB
- **Storage**: nvme-mirror3 (PVE2)
- **Network**: 1 Gb (vmbr0)
**GPU Passthrough**:
- NVIDIA RTX A6000 (300W TDP, 48GB VRAM)
**Software**:
- Trading algorithms
- AI models for market prediction
- Real-time data feeds
- Backtesting infrastructure
---
## LXC Container Details
### 200 - Pi-hole (DNS & Ad Blocking)
**Purpose**: Network-wide DNS server and ad blocker
**Type**: LXC (unprivileged)
**OS**: Ubuntu 22.04
**IP**: 10.10.10.10
**Storage**: rpool
**Access**:
- Web UI: http://10.10.10.10/admin
- Public URL: https://pihole.htsn.io
**Configuration**:
- Upstream DNS: Cloudflare (1.1.1.1)
- DHCP: Disabled (router handles DHCP)
- Interface: All interfaces
**Usage**: Set router DNS to 10.10.10.10 for network-wide ad blocking
---
### 202 - Traefik (Reverse Proxy)
**Purpose**: Primary reverse proxy for all public-facing services
**Type**: LXC (unprivileged)
**OS**: Ubuntu 22.04
**IP**: 10.10.10.250
**Storage**: rpool
**Configuration**: `/etc/traefik/`
**Dynamic Configs**: `/etc/traefik/conf.d/*.yaml`
**See**: [TRAEFIK.md](TRAEFIK.md) for complete documentation
**⚠️ Important**: This is the PRIMARY Traefik instance. Do NOT confuse with Saltbox's Traefik (VM 101).
---
### 205 - FindShyt (Custom App)
**Purpose**: Custom application (details TBD)
**Type**: LXC (unprivileged)
**OS**: Ubuntu 22.04
**IP**: 10.10.10.8
**Storage**: rpool
**Access**: https://findshyt.htsn.io
---
## VM Startup Order & Dependencies
### Power-On Sequence
When servers boot (after power failure or restart), VMs/CTs start in this order:
#### PVE (10.10.10.120)
| Order | Wait | VMID | Name | Reason |
|-------|------|------|------|--------|
| **1** | 30s | 100 | TrueNAS | ⚠️ Storage must start first - other VMs depend on NFS |
| **2** | 60s | 101 | Saltbox | Depends on TrueNAS NFS mounts for media |
| **3** | 10s | 105, 110, 111, 201, 206 | Other VMs | General VMs, no critical dependencies |
| **4** | 5s | 200, 202, 205 | Containers | Lightweight, start quickly |
**Configure startup order** (already set):
```bash
# View current config
ssh pve 'qm config 100 | grep -E "startup|onboot"'
# Set startup order (example)
ssh pve 'qm set 100 --onboot 1 --startup order=1,up=30'
ssh pve 'qm set 101 --onboot 1 --startup order=2,up=60'
```
#### PVE2 (10.10.10.102)
| Order | Wait | VMID | Name |
|-------|------|------|------|
| **1** | 10s | 300, 301, 302 | All VMs |
**Less critical** - no dependencies between PVE2 VMs.
---
## Resource Allocation Summary
### Total Allocated (PVE)
| Resource | Allocated | Physical | % Used |
|----------|-----------|----------|--------|
| **vCPUs** | 56 | 64 (32 cores × 2 threads) | 88% |
| **RAM** | 98 GB | 128 GB | 77% |
**Note**: vCPU overcommit is acceptable (VMs rarely use all cores simultaneously)
### Total Allocated (PVE2)
| Resource | Allocated | Physical | % Used |
|----------|-----------|----------|--------|
| **vCPUs** | 18 | 64 | 28% |
| **RAM** | 36 GB | 128 GB | 28% |
**PVE2** has significant headroom for additional VMs.
---
## Adding a New VM
### Quick Template
```bash
# Create VM
ssh pve 'qm create VMID \
--name myvm \
--memory 4096 \
--cores 2 \
--net0 virtio,bridge=vmbr0 \
--scsihw virtio-scsi-pci \
--scsi0 nvme-mirror1:32 \
--boot order=scsi0 \
--ostype l26 \
--agent enabled=1'
# Attach ISO for installation
ssh pve 'qm set VMID --ide2 local:iso/ubuntu-22.04.iso,media=cdrom'
# Start VM
ssh pve 'qm start VMID'
# Access console
ssh pve 'qm vncproxy VMID' # Then connect with VNC client
# Or via Proxmox web UI
```
### Cloud-Init Template (Faster)
Use cloud-init for automated VM deployment:
```bash
# Download cloud image
ssh pve 'wget https://cloud-images.ubuntu.com/releases/22.04/release/ubuntu-22.04-server-cloudimg-amd64.img -O /var/lib/vz/template/iso/ubuntu-22.04-cloud.img'
# Create VM
ssh pve 'qm create VMID --name myvm --memory 4096 --cores 2 --net0 virtio,bridge=vmbr0'
# Import disk
ssh pve 'qm importdisk VMID /var/lib/vz/template/iso/ubuntu-22.04-cloud.img nvme-mirror1'
# Attach disk
ssh pve 'qm set VMID --scsi0 nvme-mirror1:vm-VMID-disk-0'
# Add cloud-init drive
ssh pve 'qm set VMID --ide2 nvme-mirror1:cloudinit'
# Set boot disk
ssh pve 'qm set VMID --boot order=scsi0'
# Configure cloud-init (user, SSH key, network)
ssh pve 'qm set VMID --ciuser hutson --sshkeys ~/.ssh/homelab.pub --ipconfig0 ip=10.10.10.XXX/24,gw=10.10.10.1'
# Enable QEMU agent
ssh pve 'qm set VMID --agent enabled=1'
# Resize disk (cloud images are small by default)
ssh pve 'qm resize VMID scsi0 +30G'
# Start VM
ssh pve 'qm start VMID'
```
**Cloud-init VMs boot ready-to-use** with SSH keys, static IP, and user configured.
---
## Adding a New LXC Container
```bash
# Download template (if not already downloaded)
ssh pve 'pveam update'
ssh pve 'pveam available | grep ubuntu'
ssh pve 'pveam download local ubuntu-22.04-standard_22.04-1_amd64.tar.zst'
# Create container
ssh pve 'pct create CTID local:vztmpl/ubuntu-22.04-standard_22.04-1_amd64.tar.zst \
--hostname mycontainer \
--memory 2048 \
--cores 2 \
--net0 name=eth0,bridge=vmbr0,ip=10.10.10.XXX/24,gw=10.10.10.1 \
--rootfs local-zfs:8 \
--unprivileged 1 \
--features nesting=1 \
--start 1'
# Set root password
ssh pve 'pct exec CTID -- passwd'
# Add SSH key
ssh pve 'pct exec CTID -- mkdir -p /root/.ssh'
ssh pve 'pct exec CTID -- bash -c "echo \"$(cat ~/.ssh/homelab.pub)\" >> /root/.ssh/authorized_keys"'
ssh pve 'pct exec CTID -- bash -c "chmod 700 /root/.ssh && chmod 600 /root/.ssh/authorized_keys"'
```
---
## GPU Passthrough Configuration
### Current GPU Assignments
| GPU | Location | Passed To | VMID | Purpose |
|-----|----------|-----------|------|---------|
| **NVIDIA Quadro P2000** | PVE | - | - | Proxmox host (Plex transcoding via driver) |
| **NVIDIA TITAN RTX** | PVE | saltbox, lmdev1 | 101, 111 | Media transcoding + AI dev (shared) |
| **NVIDIA RTX A6000** | PVE2 | trading-vm | 301 | AI trading (dedicated) |
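Because a single GPU can only be attached to one running VM at a time, the current assignments can be confirmed by checking each VM's `hostpci` entries; a quick sketch:
```bash
ssh pve 'qm config 101 | grep hostpci'    # saltbox (TITAN RTX)
ssh pve 'qm config 111 | grep hostpci'    # lmdev1 (TITAN RTX, when reassigned)
ssh pve2 'qm config 301 | grep hostpci'   # trading-vm (RTX A6000)
```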
### How to Pass GPU to VM
1. **Identify GPU PCI ID**:
```bash
ssh pve 'lspci | grep -i nvidia'
# Example output:
# 81:00.0 VGA compatible controller: NVIDIA Corporation TU102 [TITAN RTX] (rev a1)
# 81:00.1 Audio device: NVIDIA Corporation TU102 High Definition Audio Controller (rev a1)
```
2. **Pass GPU to VM** (include both VGA and Audio):
```bash
ssh pve 'qm set VMID -hostpci0 81:00.0,pcie=1'
# If multi-function device (GPU + Audio), use:
ssh pve 'qm set VMID -hostpci0 81:00,pcie=1'
```
3. **Configure VM for GPU**:
```bash
# Set machine type to q35
ssh pve 'qm set VMID --machine q35'
# Set BIOS to OVMF (UEFI)
ssh pve 'qm set VMID --bios ovmf'
# Add EFI disk
ssh pve 'qm set VMID --efidisk0 nvme-mirror1:1,format=raw,efitype=4m,pre-enrolled-keys=1'
```
4. **Reboot VM** and install NVIDIA drivers inside the VM
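Once the VM is back up with drivers installed, a quick sanity check from inside the guest (sketch):
```bash
# Run inside the VM
lspci | grep -i nvidia   # GPU should appear as a passed-through PCI device
nvidia-smi               # driver loaded and GPU usable
```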
**See**: [GPU-PASSTHROUGH.md](#) (coming soon) for detailed guide
---
## Backup Priority
See [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md) for complete backup plan.
### Critical VMs (Must Backup)
| Priority | VMID | Name | Reason |
|----------|------|------|--------|
| 🔴 **CRITICAL** | 100 | truenas | All storage lives here - catastrophic if lost |
| 🟡 **HIGH** | 101 | saltbox | Complex media stack config |
| 🟡 **HIGH** | 110 | homeassistant | Home automation config |
| 🟡 **HIGH** | 300 | gitea-vm | Git repositories (code, docs) |
| 🟡 **HIGH** | 301 | trading-vm | Trading algorithms and AI models |
### Medium Priority
| VMID | Name | Notes |
|------|------|-------|
| 200 | pihole | Easy to rebuild, but DNS config valuable |
| 202 | traefik | Config files backed up separately |
### Low Priority (Ephemeral/Rebuildable)
| VMID | Name | Notes |
|------|------|-------|
| 105 | fs-dev | Development - code is in Git |
| 111 | lmdev1 | Ephemeral development |
| 201 | copyparty | Simple app, easy to redeploy |
| 206 | docker-host | Docker Compose files backed up separately |
---
## Quick Reference Commands
```bash
# List all VMs
ssh pve 'qm list'
ssh pve2 'qm list'
# List all containers
ssh pve 'pct list'
# Start/stop VM
ssh pve 'qm start VMID'
ssh pve 'qm stop VMID'
ssh pve 'qm shutdown VMID' # Graceful
# Start/stop container
ssh pve 'pct start CTID'
ssh pve 'pct stop CTID'
ssh pve 'pct shutdown CTID' # Graceful
# VM console
ssh pve 'qm terminal VMID'
# Container console
ssh pve 'pct enter CTID'
# Clone VM
ssh pve 'qm clone VMID NEW_VMID --name newvm'
# Delete VM
ssh pve 'qm destroy VMID'
# Delete container
ssh pve 'pct destroy CTID'
```
---
## Related Documentation
- [STORAGE.md](STORAGE.md) - Storage pool assignments
- [SSH-ACCESS.md](SSH-ACCESS.md) - How to access VMs
- [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md) - VM backup strategy
- [POWER-MANAGEMENT.md](POWER-MANAGEMENT.md) - VM resource optimization
- [NETWORK.md](NETWORK.md) - Which bridge to use for new VMs
---
**Last Updated**: 2025-12-22

View File

@@ -0,0 +1 @@
{"web":{"client_id":"693027753314-hdjfnvfnarlcnehba6u8plbehv78rfh9.apps.googleusercontent.com","project_id":"spheric-method-482514-f8","auth_uri":"https://accounts.google.com/o/oauth2/auth","token_uri":"https://oauth2.googleapis.com/token","auth_provider_x509_cert_url":"https://www.googleapis.com/oauth2/v1/certs","client_secret":"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","redirect_uris":["https://my.home-assistant.io/redirect/oauth"]}}

View File

@@ -0,0 +1,41 @@
#!/bin/bash
# Internet Watchdog - Reboots if internet is unreachable for 5 minutes

LOG_FILE="/var/log/internet-watchdog.log"
FAIL_COUNT=0
MAX_FAILS=5
CHECK_INTERVAL=60

log() {
    echo "$(date "+%Y-%m-%d %H:%M:%S") - $1" >> "$LOG_FILE"
}

check_internet() {
    for endpoint in 1.1.1.1 8.8.8.8 208.67.222.222; do
        if ping -c 1 -W 5 "$endpoint" > /dev/null 2>&1; then
            return 0
        fi
    done
    return 1
}

log "Watchdog started"

while true; do
    if check_internet; then
        if [ $FAIL_COUNT -gt 0 ]; then
            log "Internet restored after $FAIL_COUNT failures"
        fi
        FAIL_COUNT=0
    else
        FAIL_COUNT=$((FAIL_COUNT + 1))
        log "Internet check failed ($FAIL_COUNT/$MAX_FAILS)"
        if [ $FAIL_COUNT -ge $MAX_FAILS ]; then
            log "CRITICAL: $MAX_FAILS consecutive failures - REBOOTING"
            sync
            sleep 2
            reboot
        fi
    fi
    sleep $CHECK_INTERVAL
done

View File

@@ -0,0 +1,23 @@
#!/bin/bash
LOG_DIR="/data/logs"
LOG_FILE="$LOG_DIR/memory-history.log"

mkdir -p "$LOG_DIR"

while true; do
    # Rotate if over 10MB
    if [ -f "$LOG_FILE" ]; then
        SIZE=$(wc -c < "$LOG_FILE" 2>/dev/null || echo 0)
        if [ "$SIZE" -gt 10485760 ]; then
            mv "$LOG_FILE" "$LOG_FILE.old"
        fi
    fi
    echo "========== $(date +%Y-%m-%d\ %H:%M:%S) ==========" >> "$LOG_FILE"
    echo "--- MEMORY ---" >> "$LOG_FILE"
    free -m >> "$LOG_FILE"
    echo "--- TOP MEMORY PROCESSES ---" >> "$LOG_FILE"
    ps -eo pid,rss,comm --sort=-rss | head -12 >> "$LOG_FILE"
    echo "" >> "$LOG_FILE"
    sleep 600
done