Compare commits: 65b7c48348 ... eddd98c57f

3 commits: eddd98c57f, 56b82df497, 23e9df68c9
.stfolder/syncthing-folder-8be0b5.txt (new file, 5 lines)

```
@@ -0,0 +1,5 @@
# This directory is a Syncthing folder marker.
# Do not delete.

folderID: homelab
created: 2025-12-23T00:39:52-05:00
```
BACKUP-STRATEGY.md (new file, 358 lines)

@@ -0,0 +1,358 @@

# Backup Strategy

## 🚨 Current Status: CRITICAL GAPS IDENTIFIED

This document outlines the backup strategy for the homelab infrastructure. **As of 2025-12-22, there are significant gaps in backup coverage that need to be addressed.**

## Executive Summary

### What We Have ✅
- **Syncthing**: File synchronization across 5+ devices
- **ZFS on TrueNAS**: Copy-on-write filesystem with snapshot capability (not yet configured)
- **Proxmox**: Built-in backup capabilities (not yet configured)

### What We DON'T Have 🚨
- ❌ No documented VM/CT backups
- ❌ No ZFS snapshot schedule
- ❌ No offsite backups
- ❌ No disaster recovery plan
- ❌ No tested restore procedures
- ❌ No configuration backups

**Risk Level**: HIGH - A catastrophic failure could result in significant data loss.

---

## Current State Analysis

### Syncthing (File Synchronization)

**What it is**: Real-time file sync across devices
**What it is NOT**: A backup solution

| Folder | Devices | Size | Protected? |
|--------|---------|------|------------|
| documents | Mac Mini, MacBook, TrueNAS, Windows PC, Phone | 11 GB | ⚠️ Sync only |
| downloads | Mac Mini, TrueNAS | 38 GB | ⚠️ Sync only |
| pictures | Mac Mini, MacBook, TrueNAS, Phone | Unknown | ⚠️ Sync only |
| notes | Mac Mini, MacBook, TrueNAS, Phone | Unknown | ⚠️ Sync only |
| config | Mac Mini, MacBook, TrueNAS | Unknown | ⚠️ Sync only |

**Limitations**:
- ❌ Accidental deletion → deleted everywhere
- ❌ Ransomware/corruption → spreads everywhere
- ❌ No point-in-time recovery
- ❌ No version history (unless file versioning enabled - not documented)

**Verdict**: Syncthing provides redundancy and availability, NOT backup protection.

### ZFS on TrueNAS (Potential Backup Target)

**Current Status**: ❓ Unknown - snapshots may or may not be configured

**Needs Investigation**:
```bash
# Check if snapshots exist
ssh truenas 'zfs list -t snapshot'

# Check if automated snapshots are configured
ssh truenas 'cat /etc/cron.d/zfs-auto-snapshot' || echo "Not configured"

# Check snapshot schedule via TrueNAS API/UI
```

**If configured**, ZFS snapshots provide:
- ✅ Point-in-time recovery
- ✅ Protection against accidental deletion
- ✅ Fast rollback capability
- ⚠️ Still single location (no offsite protection)

### Proxmox VM/CT Backups

**Current Status**: ❓ Unknown - no backup jobs documented

**Needs Investigation**:
```bash
# Check backup configuration
ssh pve 'pvesh get /cluster/backup'

# Check if any backups exist
ssh pve 'ls -lh /var/lib/vz/dump/'
ssh pve2 'ls -lh /var/lib/vz/dump/'
```

**Critical VMs Needing Backup**:

| VM/CT | VMID | Priority | Notes |
|-------|------|----------|-------|
| TrueNAS | 100 | 🔴 CRITICAL | All storage lives here |
| Saltbox | 101 | 🟡 HIGH | Media stack, complex config |
| homeassistant | 110 | 🟡 HIGH | Home automation config |
| gitea-vm | 300 | 🟡 HIGH | Git repositories |
| pihole | 200 | 🟢 MEDIUM | DNS config (easy to rebuild) |
| traefik | 202 | 🟢 MEDIUM | Reverse proxy config |
| trading-vm | 301 | 🟡 HIGH | AI trading platform |
| lmdev1 | 111 | 🟢 LOW | Development (ephemeral) |

---

## Recommended Backup Strategy

### Tier 1: Local Snapshots (IMPLEMENT IMMEDIATELY)

**ZFS Snapshots on TrueNAS**

Schedule automatic snapshots for all datasets:

| Dataset | Frequency | Retention |
|---------|-----------|-----------|
| vault/documents | Every 15 min | 1 hour |
| vault/documents | Hourly | 24 hours |
| vault/documents | Daily | 30 days |
| vault/documents | Weekly | 12 weeks |
| vault/documents | Monthly | 12 months |

**Implementation**:
```bash
# Via TrueNAS UI: Storage → Snapshots → Add
# Or via CLI:
ssh truenas 'zfs snapshot vault/documents@daily-$(date +%Y%m%d)'
```
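
The one-liner above creates a single snapshot; the retention tiers in the table still need a scheduler. TrueNAS's built-in Periodic Snapshot Tasks are the supported way to do this; purely as an illustration of what the daily tier amounts to, here is a minimal cron-style sketch (the `daily-` naming and keep-30 prune policy are assumptions, and the prune is left as a dry run):

```bash
# Crontab sketch on TrueNAS (illustrative; prefer Periodic Snapshot Tasks)
# 10 0 * * * zfs snapshot vault/documents@daily-$(date +\%Y\%m\%d)

# Prune: keep the 30 newest daily snapshots; prints the destroy commands
# instead of running them (remove the `echo` to actually prune)
zfs list -H -t snapshot -o name -s creation vault/documents \
  | grep '@daily-' \
  | awk -v keep=30 '{a[NR]=$0} END {for (i=1; i<=NR-keep; i++) print a[i]}' \
  | while read -r snap; do echo zfs destroy "$snap"; done
```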

**Proxmox VM Backups**

Configure weekly backups to local storage:

```bash
# Create backup job via Proxmox UI:
# Datacenter → Backup → Add
# - Schedule: Weekly (Sunday 2 AM)
# - Storage: local-zfs or nvme-mirror1
# - Mode: Snapshot (fast)
# - Retention: 4 backups
```

**Or via CLI**:
```bash
ssh pve 'pvesh create /cluster/backup --schedule "sun 02:00" --storage local-zfs --mode snapshot --prune-backups keep-last=4'
```

### Tier 2: Offsite Backups (CRITICAL GAP)

**Option A: Cloud Storage (Recommended)**

Use **rclone** or **restic** to sync critical data to cloud:

| Provider | Cost | Pros | Cons |
|----------|------|------|------|
| Backblaze B2 | $6/TB/mo | Cheap, reliable | Egress fees |
| AWS S3 Glacier | $4/TB/mo | Very cheap storage | Slow retrieval |
| Wasabi | $6.99/TB/mo | No egress fees | Minimum 90-day retention |

**Implementation Example (Backblaze B2)**:
```bash
# Install on TrueNAS
ssh truenas 'pkg install rclone restic'

# Configure B2
rclone config  # Follow prompts for B2

# Daily backup of critical folders (crontab entry)
0 3 * * * rclone sync /mnt/vault/documents b2:homelab-backup/documents --transfers 4
```

**Option B: Offsite TrueNAS Replication**

- Set up second TrueNAS at friend/family member's house
- Use ZFS replication to sync snapshots (see the sketch after this list)
- Requires: Static IP or Tailscale, trust
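
For Option B, replication itself is a pipe of `zfs send` into `zfs receive`. A minimal sketch, assuming the remote box is reachable as `truenas-offsite` (hypothetical alias) with a pool also named `vault`, and that snapshot names follow the daily scheme above:

```bash
# First run: full send of the most recent snapshot (names illustrative)
zfs send vault/documents@daily-20251222 \
  | ssh truenas-offsite 'zfs receive -u vault/documents'

# Later runs: incremental send of everything between the last snapshot
# both sides share and the newest local one (-I includes intermediates)
zfs send -I @daily-20251222 vault/documents@daily-20251229 \
  | ssh truenas-offsite 'zfs receive -u vault/documents'
```

TrueNAS can also drive this as a Replication Task in the UI, which handles the incremental bookkeeping and works over SSH or Tailscale.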
**Option C: USB Drive Rotation**

- Weekly backup to external USB drive
- Rotate 2-3 drives (one always offsite)
- Manual but simple (see the restic sketch after this list)
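
A minimal restic sketch for the rotation option, assuming the current drive mounts at `/mnt/usb-backup` (the path and repo location are illustrative):

```bash
# One-time per drive: initialize a restic repository on it
restic init --repo /mnt/usb-backup/restic

# Weekly run: back up, then thin old snapshots with a simple policy
# (restic prompts for the repo password unless RESTIC_PASSWORD is set)
export RESTIC_REPOSITORY=/mnt/usb-backup/restic
restic backup /mnt/vault/documents
restic forget --keep-weekly 8 --keep-monthly 12 --prune
```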
### Tier 3: Configuration Backups

**Proxmox Configuration**

```bash
# Backup /etc/pve (configs are already in the cluster filesystem)
# But also back up to an external location:
ssh pve 'tar czf /tmp/pve-config-$(date +%Y%m%d).tar.gz /etc/pve /etc/network/interfaces /etc/systemd/system/*.service'

# Copy to a safe location
scp pve:/tmp/pve-config-*.tar.gz ~/Backups/proxmox/
```

**VM-Specific Configs**

- Traefik configs: `/etc/traefik/` on CT 202
- Saltbox configs: `/srv/git/saltbox/` on VM 101
- Home Assistant: `/config/` on VM 110

**Script to backup all configs**:
```bash
#!/bin/bash
# Save as ~/bin/backup-homelab-configs.sh

DATE=$(date +%Y%m%d)
BACKUP_DIR=~/Backups/homelab-configs/$DATE

mkdir -p "$BACKUP_DIR"

# Proxmox configs
ssh pve 'tar czf - /etc/pve /etc/network' > "$BACKUP_DIR/pve-config.tar.gz"
ssh pve2 'tar czf - /etc/pve /etc/network' > "$BACKUP_DIR/pve2-config.tar.gz"

# Traefik (CT 202)
ssh pve 'pct exec 202 -- tar czf - /etc/traefik' > "$BACKUP_DIR/traefik-config.tar.gz"

# Saltbox
ssh saltbox 'tar czf - /srv/git/saltbox' > "$BACKUP_DIR/saltbox-config.tar.gz"

# Home Assistant - note: `qm guest exec` JSON-wraps command output, so it
# cannot stream a tarball; fetch over SSH instead (assumes an SSH alias
# `homeassistant` for VM 110 with SSH enabled)
ssh homeassistant 'tar czf - /config' > "$BACKUP_DIR/homeassistant-config.tar.gz"

echo "Configs backed up to $BACKUP_DIR"
```

---

## Disaster Recovery Scenarios

### Scenario 1: Single VM Failure

**Impact**: Medium
**Recovery Time**: 30-60 minutes

1. Restore from Proxmox backup:
   ```bash
   ssh pve 'qmrestore /path/to/backup.vma.zst VMID'
   ```
2. Start VM and verify
3. Update IP if needed

### Scenario 2: TrueNAS Failure

**Impact**: CATASTROPHIC (all storage lost)
**Recovery Time**: Unknown - NO PLAN

**Current State**: 🚨 NO RECOVERY PLAN
**Needed**:
- Offsite backup of critical datasets
- Documented ZFS pool creation steps
- Share configuration export

### Scenario 3: Complete PVE Server Failure

**Impact**: SEVERE
**Recovery Time**: 4-8 hours

**Current State**: ⚠️ PARTIALLY RECOVERABLE
**Needed**:
- VM backups stored on TrueNAS or PVE2
- Proxmox reinstall procedure
- Network config documentation

### Scenario 4: Complete Site Disaster (Fire/Flood)

**Impact**: TOTAL LOSS
**Recovery Time**: Unknown

**Current State**: 🚨 NO RECOVERY PLAN
**Needed**:
- Offsite backups (cloud or physical)
- Critical data prioritization
- Restore procedures

---

## Action Plan

### Immediate (Next 7 Days)

- [ ] **Audit existing backups**: Check if ZFS snapshots or Proxmox backups exist
  ```bash
  ssh truenas 'zfs list -t snapshot'
  ssh pve 'ls -lh /var/lib/vz/dump/'
  ```

- [ ] **Enable ZFS snapshots**: Configure via TrueNAS UI for critical datasets

- [ ] **Configure Proxmox backup jobs**: Weekly backups of critical VMs (100, 101, 110, 300)

- [ ] **Test restore**: Pick one VM, back it up, restore it to verify the process works (see the sketch below)
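
For the test-restore item, one full round trip looks roughly like this (pihole/VMID 200 is picked here only because the table above marks it easy to rebuild; the sketch assumes it is a container, and 999 is a spare VMID):

```bash
# One-off backup of CT 200 to the local dump directory
ssh pve 'vzdump 200 --dumpdir /var/lib/vz/dump --mode snapshot --compress zstd'

# Restore the dump under a spare VMID so the original is untouched
# (use qmrestore instead of pct restore for full VMs)
ssh pve 'pct restore 999 /var/lib/vz/dump/vzdump-lxc-200-*.tar.zst --storage local-zfs'

# Boot it, verify, then throw it away
ssh pve 'pct start 999 && pct status 999'
ssh pve 'pct stop 999 && pct destroy 999'
```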
### Short-term (Next 30 Days)

- [ ] **Set up offsite backup**: Choose provider (Backblaze B2 recommended)

- [ ] **Install backup tools**: rclone or restic on TrueNAS

- [ ] **Configure daily cloud sync**: Critical folders to cloud storage

- [ ] **Document restore procedures**: Step-by-step guides for each scenario

### Long-term (Next 90 Days)

- [ ] **Implement monitoring**: Alerts for backup failures

- [ ] **Quarterly restore test**: Verify backups actually work

- [ ] **Backup rotation policy**: Automate old backup cleanup

- [ ] **Configuration backup automation**: Weekly cron job

---

## Monitoring & Validation

### Backup Health Checks

```bash
# Check last ZFS snapshot
ssh truenas 'zfs list -t snapshot -o name,creation -s creation | tail -5'

# Check Proxmox backup status
ssh pve 'pvesh get /cluster/backup-info/not-backed-up'

# Check cloud sync status (if using rclone)
ssh truenas 'rclone ls b2:homelab-backup | wc -l'
```

### Alerts to Set Up

- Email alert if no snapshot created in 24 hours (a check sketch follows this list)
- Email alert if Proxmox backup fails
- Email alert if cloud sync fails
- Weekly backup status report
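
A minimal sketch of the snapshot-age alert (assumes GNU `date` can parse `zfs list`'s creation column and that local `mail` delivery works; the address is a placeholder):

```bash
#!/bin/bash
# Alert if the newest snapshot of vault/documents is older than 24 hours
newest=$(ssh truenas 'zfs list -H -t snapshot -o creation -s creation vault/documents | tail -1')
age=$(( $(date +%s) - $(date -d "$newest" +%s) ))
if [ "$age" -gt 86400 ]; then
  echo "Newest vault/documents snapshot is $((age / 3600))h old" \
    | mail -s "ZFS snapshot alert" admin@example.com
fi
```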
---

## Cost Estimate

**Monthly Backup Costs**:

| Component | Cost | Notes |
|-----------|------|-------|
| Local storage (already owned) | $0 | Using existing TrueNAS |
| Proxmox backups (local) | $0 | Using existing storage |
| Cloud backup (1 TB) | $6-10/mo | Backblaze B2 or Wasabi |
| **Total** | **~$10/mo** | Minimal cost for peace of mind |

**One-time**:
- External USB drives (3x 4TB): ~$300 (optional, for rotation backups)

---

## Related Documentation

- [STORAGE.md](STORAGE.md) - ZFS pool layouts and capacity
- [VMS.md](VMS.md) - VM inventory and prioritization
- [DISASTER-RECOVERY.md](#) - Recovery procedures (coming soon)

---

**Last Updated**: 2025-12-22
**Status**: 🚨 CRITICAL GAPS - IMMEDIATE ACTION REQUIRED
CHANGELOG.md (14 lines changed)

````diff
@@ -36,12 +36,12 @@ Investigated UPS power limit issues across both Proxmox servers.
 [Unit]
 Description=Disable KSM (Kernel Same-page Merging)
 After=multi-user.target
 
 [Service]
 Type=oneshot
 ExecStart=/bin/sh -c "echo 0 > /sys/kernel/mm/ksm/run"
 RemainAfterExit=yes
 
 [Install]
 WantedBy=multi-user.target
 ```
@@ -108,12 +108,12 @@ curl -X POST -H "X-API-Key: xxx" http://localhost:20910/rest/system/restart
 [Unit]
 Description=Set CPU governor to powersave with balance_power EPP
 After=multi-user.target
 
 [Service]
 Type=oneshot
 ExecStart=/bin/bash -c 'for gov in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo powersave > "$gov"; done; for epp in /sys/devices/system/cpu/cpu*/cpufreq/energy_performance_preference; do echo balance_power > "$epp"; done'
 RemainAfterExit=yes
 
 [Install]
 WantedBy=multi-user.target
 ```
@@ -127,12 +127,12 @@ curl -X POST -H "X-API-Key: xxx" http://localhost:20910/rest/system/restart
 [Unit]
 Description=Set CPU governor to schedutil for power savings
 After=multi-user.target
 
 [Service]
 Type=oneshot
 ExecStart=/bin/bash -c 'for gov in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo schedutil > "$gov"; done'
 RemainAfterExit=yes
 
 [Install]
 WantedBy=multi-user.target
 ```
@@ -194,4 +194,4 @@ Not useful when:
 - `general_profit` is negative
 
 ### What is Memory Ballooning?
 Guest-cooperative memory management. Hypervisor can request VMs to give back unused RAM. Independent from KSMD. Both are Proxmox/KVM memory optimization features but serve different purposes.
````
GATEWAY.md (new file, 339 lines)

@@ -0,0 +1,339 @@

# UniFi Gateway (UCG-Fiber)

Documentation for the UniFi Cloud Gateway Fiber (10.10.10.1) - the primary network gateway and router.

## Overview

| Property | Value |
|----------|-------|
| **Device** | UniFi Cloud Gateway Fiber (UCG-Fiber) |
| **IP Address** | 10.10.10.1 |
| **SSH User** | root |
| **SSH Auth** | SSH key (`~/.ssh/id_ed25519`) |
| **Host Aliases** | `ucg-fiber`, `gateway` |
| **Firmware** | v4.4.9 (as of 2026-01-02) |
| **UniFi Core** | 4.4.19 |
| **RAM** | 2.9 GB (shared with UniFi apps) |

---

## SSH Access

SSH key authentication is configured. Use host aliases:

```bash
# Quick access
ssh ucg-fiber 'hostname'
ssh gateway 'free -m'

# Or use IP directly
ssh root@10.10.10.1 'uptime'
```

**Note**: The SSH key may need re-deployment after firmware updates if UniFi clears authorized_keys.

---

## Monitoring Services

Two custom monitoring services run on the gateway to prevent and diagnose issues.

### Internet Watchdog Service

**Purpose**: Auto-reboots the gateway if internet connectivity is lost for 5+ minutes

**Location**: `/data/scripts/internet-watchdog.sh`

**How it works** (a sketch of the loop follows this list):
1. Pings 1.1.1.1, 8.8.8.8, 208.67.222.222 every 60 seconds
2. If all three fail, increments a failure counter
3. After 5 consecutive failures (~5 minutes), triggers a reboot
4. Logs all activity to `/var/log/internet-watchdog.log`
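
The deployed script itself isn't shown in this commit range; a minimal reconstruction of the logic above (the log lines match the format samples below, everything else is an assumption):

```bash
#!/bin/sh
# Sketch of /data/scripts/internet-watchdog.sh (reconstruction, not the
# deployed script): reboot after 5 consecutive failed connectivity checks.
LOG=/var/log/internet-watchdog.log
FAILS=0
echo "$(date '+%F %T') - Watchdog started" >> "$LOG"
while true; do
    ok=0
    for host in 1.1.1.1 8.8.8.8 208.67.222.222; do
        ping -c1 -W2 "$host" >/dev/null 2>&1 && ok=1 && break
    done
    if [ "$ok" = 1 ]; then
        [ "$FAILS" -gt 0 ] && echo "$(date '+%F %T') - Internet restored after $FAILS failures" >> "$LOG"
        FAILS=0
    else
        FAILS=$((FAILS + 1))
        echo "$(date '+%F %T') - Internet check failed ($FAILS/5)" >> "$LOG"
        [ "$FAILS" -ge 5 ] && { echo "$(date '+%F %T') - Rebooting" >> "$LOG"; reboot; }
    fi
    sleep 60
done
```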
**Commands**:
```bash
# Check service status
ssh ucg-fiber 'systemctl status internet-watchdog'

# View recent logs
ssh ucg-fiber 'tail -50 /var/log/internet-watchdog.log'

# Stop temporarily (if troubleshooting)
ssh ucg-fiber 'systemctl stop internet-watchdog'

# Restart
ssh ucg-fiber 'systemctl restart internet-watchdog'
```

**Log Format**:
```
2026-01-02 22:45:01 - Watchdog started
2026-01-02 22:46:01 - Internet check failed (1/5)
2026-01-02 22:47:01 - Internet restored after 1 failures
```

---

### Memory Monitor Service

**Purpose**: Logs memory usage and top processes every 10 minutes for diagnostics

**Location**: `/data/scripts/memory-monitor.sh`

**Log File**: `/data/logs/memory-history.log`

**How it works** (a sketch follows this list):
1. Every 10 minutes, logs current memory usage (`free -m`)
2. Logs the top 12 memory-consuming processes
3. Auto-rotates the log when it exceeds 10MB (keeps one .old file)
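
As with the watchdog, the deployed script isn't shown; a minimal reconstruction of the described behavior:

```bash
#!/bin/sh
# Sketch of /data/scripts/memory-monitor.sh (reconstruction, not the
# deployed script): snapshot memory state every 10 minutes, rotate at 10MB.
LOG=/data/logs/memory-history.log
while true; do
    # Rotate if the log has grown past 10MB, keeping one .old copy
    if [ -f "$LOG" ] && [ "$(wc -c < "$LOG")" -gt 10485760 ]; then
        mv "$LOG" "$LOG.old"
    fi
    {
        echo "========== $(date '+%F %T') =========="
        echo "--- MEMORY ---"
        free -m
        echo "--- TOP MEMORY PROCESSES ---"
        ps -eo pid,rss,comm --sort=-rss | head -13   # header + 12 processes
    } >> "$LOG"
    sleep 600
done
```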
**Commands**:
```bash
# Check service status
ssh ucg-fiber 'systemctl status memory-monitor'

# View recent memory history
ssh ucg-fiber 'tail -100 /data/logs/memory-history.log'

# Check current memory usage
ssh ucg-fiber 'free -m'

# See top memory consumers right now
ssh ucg-fiber 'ps -eo pid,rss,comm --sort=-rss | head -12'
```

**Log Format**:
```
========== 2026-01-02 22:30:00 ==========
--- MEMORY ---
              total        used        free      shared  buff/cache   available
Mem:           2892        1890         102         456         899        1002
Swap:           512          88         424
--- TOP MEMORY PROCESSES ---
 PID    RSS COMMAND
1234 327456 unifi-protect
2345 252108 mongod
3456 236544 java
...
```

---

## Known Memory Consumers

| Process | Typical Memory | Purpose |
|---------|----------------|---------|
| unifi-protect | ~320 MB | Camera/NVR management |
| mongod | ~250 MB | UniFi configuration database |
| java (controller) | ~230 MB | UniFi Network controller |
| postgres | ~180 MB | PostgreSQL database |
| unifi-core | ~150 MB | UniFi OS core |
| tailscaled | ~80 MB | Tailscale VPN |

**Total available**: ~2.9 GB
**Typical usage**: ~1.8-2.0 GB (leaves ~1 GB free)
**Warning threshold**: <500 MB free
**Critical**: <200 MB free or swap >50% used

---

## Disabled Services

The following services were disabled to reduce memory usage:

| Service | Memory Saved | Reason Disabled |
|---------|--------------|-----------------|
| UniFi Connect | ~200 MB | Not needed (cameras use Protect) |

To re-enable if needed:
```bash
ssh ucg-fiber 'systemctl enable unifi-connect && systemctl start unifi-connect'
```

---

## Common Issues

### Gateway Freeze / Network Loss

**Symptoms**:
- All devices lose internet
- Cannot ping 10.10.10.1
- Physical reboot required

**Root Cause**: Memory exhaustion causing soft lockup

**Prevention**:
1. Internet watchdog auto-reboots after 5 min outage
2. Memory monitor logs help identify runaway processes
3. UniFi Connect disabled to free ~200 MB

**Post-Incident Analysis**:
```bash
# Check memory history for spike before freeze
ssh ucg-fiber 'grep -B5 "Swap:" /data/logs/memory-history.log | tail -50'

# Check watchdog logs
ssh ucg-fiber 'cat /var/log/internet-watchdog.log'

# Check system logs for errors
ssh ucg-fiber 'dmesg | tail -100'
ssh ucg-fiber 'journalctl -p err --since "1 hour ago"'
```

---

### High Memory Usage

**Check current state**:
```bash
ssh ucg-fiber 'free -m && echo "---" && ps -eo pid,rss,comm --sort=-rss | head -15'
```

**If swap is heavily used**:
```bash
# Check swap usage
ssh ucg-fiber 'cat /proc/swaps'

# See what's in swap
ssh ucg-fiber 'for pid in $(ls /proc | grep -E "^[0-9]+$"); do
  swap=$(grep VmSwap /proc/$pid/status 2>/dev/null | awk "{print \$2}");
  [ "$swap" -gt 10000 ] 2>/dev/null && echo "$pid: ${swap}kB - $(cat /proc/$pid/comm)";
done | sort -t: -k2 -rn | head -10'
```

**Consider reboot if**:
- Available memory <200 MB
- Swap usage >300 MB
- System becoming unresponsive

---

### Tailscale Issues

**Check Tailscale status**:
```bash
ssh ucg-fiber 'tailscale status'
```

**Common errors and fixes**:

| Error | Fix |
|-------|-----|
| `DNS resolution failed` | Check upstream DNS (Pi-hole at 10.10.10.10) |
| `TLS handshake failed` | Usually temporary; Tailscale auto-reconnects |
| `Not connected` | `ssh ucg-fiber 'tailscale up'` |

---

## Firmware Updates

**Check current version**:
```bash
ssh ucg-fiber 'ubnt-systool version'
```

**Update process**:
1. Check UniFi site for latest stable firmware
2. Download via UI or CLI
3. Schedule update during low-usage time

**After update**:
- Verify SSH key still works
- Check custom services still running
- Verify Tailscale reconnects

**Re-deploy SSH key if needed**:
```bash
ssh-copy-id -i ~/.ssh/id_ed25519 root@10.10.10.1
```

---

## Service Locations

| File | Purpose |
|------|---------|
| `/data/scripts/internet-watchdog.sh` | Watchdog script |
| `/data/scripts/memory-monitor.sh` | Memory monitor script |
| `/etc/systemd/system/internet-watchdog.service` | Watchdog systemd unit |
| `/etc/systemd/system/memory-monitor.service` | Memory monitor systemd unit |
| `/var/log/internet-watchdog.log` | Watchdog log |
| `/data/logs/memory-history.log` | Memory history log |

**Note**: `/data/` persists across firmware updates. `/var/log/` may not.

---

## Quick Reference Commands

```bash
# System status
ssh ucg-fiber 'uptime && free -m'

# Check both monitoring services
ssh ucg-fiber 'systemctl status internet-watchdog memory-monitor'

# Memory history (last hour)
ssh ucg-fiber 'tail -60 /data/logs/memory-history.log'

# Watchdog activity
ssh ucg-fiber 'tail -20 /var/log/internet-watchdog.log'

# Network devices (ARP table)
ssh ucg-fiber 'cat /proc/net/arp'

# Tailscale status
ssh ucg-fiber 'tailscale status'

# System logs
ssh ucg-fiber 'journalctl -p warning --since "1 hour ago" | head -50'
```

---

## Backup Considerations

Custom services in `/data/scripts/` persist across firmware updates but may need:
- Systemd services re-enabled after major updates
- Script permissions re-applied if wiped

**Backup critical files**:
```bash
# Copy scripts locally for reference
scp ucg-fiber:/data/scripts/*.sh ~/Projects/homelab/data/scripts/
```

---

## Related Documentation

- [SSH-ACCESS.md](SSH-ACCESS.md) - SSH configuration and host aliases
- [NETWORK.md](NETWORK.md) - Network architecture
- [MONITORING.md](MONITORING.md) - Overall monitoring strategy
- [HOMEASSISTANT.md](HOMEASSISTANT.md) - Home Assistant integration

---

## Incident History

### 2025-12-27 to 2025-12-29: Gateway Freeze

**Timeline**:
- Dec 7: Firmware update to v4.4.9
- Dec 24: Last healthy system logs
- Dec 27-29: "No internet detected" errors in logs
- Dec 29+: Complete silence (gateway frozen)
- Jan 2: Physical reboot restored access

**Root Cause**: Memory exhaustion causing soft lockup (no crash dump saved)

**Resolution**:
- Deployed internet-watchdog service
- Deployed memory-monitor service
- Disabled UniFi Connect (~200 MB saved)
- Configured SSH key auth

---

**Last Updated**: 2026-01-02
HARDWARE.md (new file, 455 lines)

@@ -0,0 +1,455 @@

# Hardware Inventory

Complete hardware specifications for all homelab equipment.

## Servers

### PVE (10.10.10.120) - Primary Proxmox Server

#### CPU
- **Model**: AMD Ryzen Threadripper PRO 3975WX
- **Cores**: 32 cores / 64 threads
- **Base Clock**: 3.5 GHz
- **Boost Clock**: 4.2 GHz
- **TDP**: 280W
- **Architecture**: Zen 2 (7nm)
- **Socket**: sTRX4
- **Features**: ECC support, PCIe 4.0

#### RAM
- **Capacity**: 128 GB
- **Type**: DDR4 ECC Registered
- **Speed**: Unknown (needs investigation)
- **Channels**: 8-channel
- **Idle Power**: ~30-40W

#### Storage

**OS/VM Storage:**

| Pool | Devices | Type | Capacity | Purpose |
|------|---------|------|----------|---------|
| `nvme-mirror1` | 2x Sabrent Rocket Q NVMe | ZFS Mirror | 3.6 TB usable | High-performance VM storage |
| `nvme-mirror2` | 2x Kingston SFYRD 2TB NVMe | ZFS Mirror | 1.8 TB usable | Additional fast VM storage |
| `rpool` | 2x Samsung 870 QVO 4TB SSD | ZFS Mirror | 3.6 TB usable | Proxmox OS, containers, backups |

**Total Storage**: ~9 TB usable

#### GPUs

| Model | Slot | VRAM | TDP | Purpose | Passed To |
|-------|------|------|-----|---------|-----------|
| NVIDIA Quadro P2000 | PCIe slot 1 | 5 GB GDDR5 | 75W | Plex transcoding | Host |
| NVIDIA TITAN RTX | PCIe slot 2 | 24 GB GDDR6 | 280W | AI workloads | Saltbox (101), lmdev1 (111) |

**Total GPU Power**: 75W + 280W = 355W (under load)

#### Network Cards

| Interface | Model | Speed | Purpose | Bridge |
|-----------|-------|-------|---------|--------|
| enp1s0 | Intel I210 (onboard) | 1 Gb | Management | vmbr0 |
| enp35s0f0 | Intel X520 (dual-port SFP+) | 10 Gb | High-speed LXC | vmbr1 |
| enp35s0f1 | Intel X520 (dual-port SFP+) | 10 Gb | High-speed VM | vmbr2 |

**10Gb Transceivers**: Intel FTLX8571D3BCV (SFP+ 10GBASE-SR, 850nm, multimode)

#### Storage Controllers

| Model | Interface | Purpose |
|-------|-----------|---------|
| LSI SAS2308 HBA | PCIe 3.0 x8 | Passed to TrueNAS VM for EMC enclosure |
| Samsung NVMe controller | PCIe | Passed to TrueNAS VM for ZFS caching |

#### Motherboard
- **Model**: Unknown - needs investigation
- **Chipset**: AMD TRX40
- **Form Factor**: ATX/EATX
- **PCIe Slots**: Multiple PCIe 4.0 slots
- **Features**: IOMMU support, ECC memory

#### Power Supply
- **Model**: Unknown
- **Wattage**: Likely 1000W+ (needs investigation)
- **Type**: ATX, 80+ certification unknown

#### Cooling
- **CPU Cooler**: Unknown - likely large tower or AIO
- **Case Fans**: Unknown quantity
- **Note**: CPU temps 70-80°C under load (healthy)

---

### PVE2 (10.10.10.102) - Secondary Proxmox Server

#### CPU
- **Model**: AMD Ryzen Threadripper PRO 3975WX
- **Specs**: Same as PVE (32C/64T, 280W TDP)

#### RAM
- **Capacity**: 128 GB DDR4 ECC
- **Same specs as PVE**

#### Storage

| Pool | Devices | Type | Capacity | Purpose |
|------|---------|------|----------|---------|
| `nvme-mirror3` | 2x NVMe (model unknown) | ZFS Mirror | Unknown | High-performance VM storage |
| `local-zfs2` | 2x WD Red 6TB HDD | ZFS Mirror | ~6 TB usable | Bulk/archival storage (spins down) |

**HDD Spindown**: Configured for 30-min idle spindown, saving ~10-16W (see the hdparm sketch below)
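
The doc doesn't record how the spindown is implemented; a typical `hdparm` sketch (device names illustrative):

```bash
# Spin down after 30 minutes of idle: in hdparm's -S encoding,
# values 241-251 mean (n-240) * 30 minutes, so 241 = 30 min.
hdparm -S 241 /dev/sda
hdparm -S 241 /dev/sdb

# Check whether a drive is currently active or in standby
hdparm -C /dev/sda
```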
#### GPUs

| Model | Slot | VRAM | TDP | Purpose | Passed To |
|-------|------|------|-----|---------|-----------|
| NVIDIA RTX A6000 | PCIe slot 1 | 48 GB GDDR6 | 300W | AI trading workloads | trading-vm (301) |

#### Network Cards

| Interface | Model | Speed | Purpose |
|-----------|-------|-------|---------|
| nic1 | Unknown (onboard) | 1 Gb | Management |

**Note**: MTU set to 9000 for jumbo frames

#### Motherboard
- **Model**: Unknown
- **Chipset**: AMD TRX40
- **Similar to PVE**

---

## Network Equipment

### UniFi Cloud Gateway Fiber (UCG-Fiber)

- **Model**: UniFi Cloud Gateway Fiber
- **IP**: 10.10.10.1
- **Ports**: Multiple 1Gb + SFP+ uplink
- **Features**: Router, firewall, VPN, IDS/IPS
- **MTU**: 9216 (supports jumbo frames)
- **Tailscale**: Installed for VPN failover

### Switches

**Details needed** - investigate current switch setup:
- 10Gb switch for high-speed connections?
- 1Gb switch for general devices?
- PoE capabilities?

```bash
# Check what's connected to 10Gb interfaces
ssh pve 'ip link show enp35s0f0'
ssh pve 'ip link show enp35s0f1'
```

---

## Storage Hardware

### EMC Storage Enclosure

**See [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md) for complete details**

- **Model**: EMC KTN-STL4 (or similar)
- **Form Factor**: 4U rackmount
- **Drive Bays**: 25x 3.5" SAS/SATA
- **Controllers**: Dual LCC (Link Control Cards)
- **Connection**: SAS via LSI SAS2308 HBA
- **Passed to**: TrueNAS VM (VMID 100)

**Current Status**:
- LCC A: Active (working)
- LCC B: Failed (replacement ordered)

**Drive Inventory**: Unknown - needs audit

```bash
# Get drive list from TrueNAS
ssh truenas 'smartctl --scan'
ssh truenas 'lsblk'
```

### NVMe Drives

| Model | Quantity | Capacity | Location | Pool |
|-------|----------|----------|----------|------|
| Sabrent Rocket Q | 2 | Unknown | PVE | nvme-mirror1 |
| Kingston SFYRD | 2 | 2 TB each | PVE | nvme-mirror2 |
| Unknown model | 2 | Unknown | PVE2 | nvme-mirror3 |
| Samsung (model unknown) | 1 | Unknown | TrueNAS (passed) | ZFS cache |

### SSDs

| Model | Quantity | Capacity | Location | Pool |
|-------|----------|----------|----------|------|
| Samsung 870 QVO | 2 | 4 TB each | PVE | rpool |

### HDDs

| Model | Quantity | Capacity | Location | Pool |
|-------|----------|----------|----------|------|
| WD Red | 2 | 6 TB each | PVE2 | local-zfs2 |
| Unknown (in EMC) | Unknown | Unknown | TrueNAS | vault |

---

## UPS

### Current UPS

| Specification | Value |
|---------------|-------|
| **Model** | CyberPower OR2200PFCRT2U |
| **Capacity** | 2200VA / 1320W |
| **Form Factor** | 2U rackmount |
| **Input** | NEMA 5-15P (rewired from 5-20P) |
| **Outlets** | 2x 5-20R + 6x 5-15R |
| **Output** | PFC Sinewave |
| **Runtime** | ~15-20 min @ 33% load |
| **Interface** | USB (connected to PVE) |

**See [UPS.md](UPS.md) for configuration details**

---

## Client Devices

### Mac Mini (Hutson's Workstation)

- **Model**: Unknown generation
- **CPU**: Unknown
- **RAM**: Unknown
- **Storage**: Unknown
- **Network**: 1Gb Ethernet (en0) - MTU 9000
- **Tailscale IP**: 100.108.89.58
- **Local IP**: 10.10.10.125 (static)
- **Purpose**: Primary workstation, Happy Coder daemon host

### MacBook (Mobile)

- **Model**: Unknown
- **Network**: Wi-Fi + Ethernet adapter
- **Tailscale IP**: Unknown
- **Purpose**: Mobile work, development

### Windows PC

- **Model**: Unknown
- **CPU**: Unknown
- **Network**: 1Gb Ethernet
- **IP**: 10.10.10.150
- **Purpose**: Gaming, Windows development, Syncthing node

### Phone (Android)

- **Model**: Unknown
- **IP**: 10.10.10.54 (when on Wi-Fi)
- **Purpose**: Syncthing mobile node, Happy Coder client

---

## Rack Layout (If Applicable)

**Needs documentation** - Current rack configuration unknown

Suggested format:
```
U42: Blank panel
U41: UPS (CyberPower 2U)
U40: UPS (CyberPower 2U)
U39: Switch (10Gb)
U38-U35: EMC Storage Enclosure (4U)
U34: PVE Server
U33: PVE2 Server
...
```

---

## Power Consumption

### Measured Power Draw

| Component | Idle | Typical | Peak | Notes |
|-----------|------|---------|------|-------|
| PVE Server | 250-350W | 500W | 750W | CPU + GPUs + storage |
| PVE2 Server | 200-300W | 400W | 600W | CPU + GPU + storage |
| Network Gear | ~50W | ~50W | ~50W | Router + switches |
| **Total** | **500-700W** | **~950W** | **~1400W** | Exceeds UPS under peak load |

**UPS Capacity**: 1320W
**Typical Load**: 33-50% (safe margin)
**Peak Load**: Can exceed UPS capacity temporarily (acceptable)

### Power Optimizations Applied

**See [POWER-MANAGEMENT.md](POWER-MANAGEMENT.md) for details**

- KSMD disabled: ~60-80W saved
- CPU governors: ~60-120W saved
- Syncthing rescans: ~60-80W saved
- HDD spindown: ~10-16W saved when idle
- **Total savings**: ~150-300W

---

## Thermal Management

### CPU Cooling

**PVE & PVE2**:
- CPU cooler: Unknown model
- Thermal paste: Unknown, likely needs refresh if temps >85°C
- Target temp: 70-80°C under load
- Max safe: 90°C Tctl (Threadripper PRO spec)

### GPU Cooling

All GPUs are passively managed (stock coolers):
- TITAN RTX: 2-3W idle, 280W load
- RTX A6000: 11W idle, 300W load
- Quadro P2000: 25W constant (Plex active)

### Case Airflow

**Unknown** - needs investigation:
- Case model?
- Fan configuration?
- Positive or negative pressure?

---

## Cable Management

### Network Cables

| Connection | Type | Length | Speed |
|------------|------|--------|-------|
| PVE → Switch (10Gb) | OM3 fiber | Unknown | 10Gb |
| PVE2 → Router | Cat6 | Unknown | 1Gb |
| Mac Mini → Switch | Cat6 | Unknown | 1Gb |
| TrueNAS → EMC | SAS cable | Unknown | 6Gb/s |

### Power Cables

**Critical**: All servers on UPS battery-backed outlets

---

## Maintenance Schedule

### Annual Maintenance

- [ ] Clean dust from servers (every 6-12 months)
- [ ] Check thermal paste on CPUs (every 2-3 years)
- [ ] Test UPS battery runtime (annually)
- [ ] Verify all fans operational
- [ ] Check for bulging capacitors on PSUs

### Drive Health

```bash
# Check SMART status on all drives
ssh pve 'smartctl -a /dev/nvme0'
ssh pve2 'smartctl -a /dev/sda'
ssh truenas 'smartctl --scan | while read dev type; do echo "=== $dev ==="; smartctl -a $dev | grep -E "Model|Serial|Health|Reallocated|Current_Pending"; done'
```

### Temperature Monitoring

```bash
# Check all temps (needs lm-sensors installed)
ssh pve 'sensors'
ssh pve2 'sensors'
```

---

## Warranty & Purchase Info

**Needs documentation**:
- When were servers purchased?
- Where were components bought?
- Any warranties still active?
- Replacement part sources?

---

## Upgrade Path

### Short-term Upgrades (< 6 months)

- [ ] 20A circuit for UPS (restore original 5-20P plug)
- [ ] Document missing hardware specs
- [ ] Label all cables
- [ ] Create rack diagram

### Medium-term Upgrades (6-12 months)

- [ ] Additional 10Gb NIC for PVE2?
- [ ] More NVMe storage?
- [ ] Upgrade network switches?
- [ ] Replace EMC enclosure with newer model?

### Long-term Upgrades (1-2 years)

- [ ] CPU upgrade to newer Threadripper?
- [ ] RAM expansion to 256GB?
- [ ] Additional GPU for AI workloads?
- [ ] Migrate to PCIe 5.0 storage?

---

## Investigation Needed

High-priority items to document:

- [ ] Get exact motherboard model (both servers)
- [ ] Get PSU model and wattage
- [ ] CPU cooler models
- [ ] Network switch models and configuration
- [ ] Complete drive inventory in EMC enclosure
- [ ] RAM speed and timings
- [ ] Case models
- [ ] Exact NVMe models for all drives

**Commands to gather info**:

```bash
# Motherboard
ssh pve 'dmidecode -t baseboard'

# CPU details
ssh pve 'lscpu'

# RAM details
ssh pve 'dmidecode -t memory | grep -E "Size|Speed|Manufacturer"'

# Storage devices
ssh pve 'lsblk -o NAME,SIZE,TYPE,TRAN,MODEL'

# Network cards
ssh pve 'lspci | grep -i network'

# GPU details
ssh pve 'lspci | grep -i vga'
ssh pve 'nvidia-smi -L'  # If nvidia-smi available
```

---

## Related Documentation

- [VMS.md](VMS.md) - VM resource allocation
- [STORAGE.md](STORAGE.md) - Storage pools and usage
- [POWER-MANAGEMENT.md](POWER-MANAGEMENT.md) - Power optimizations
- [UPS.md](UPS.md) - UPS configuration
- [NETWORK.md](NETWORK.md) - Network configuration
- [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md) - Storage enclosure details

---

**Last Updated**: 2025-12-22
**Status**: ⚠️ Incomplete - many specs need investigation
HOMEASSISTANT.md

````diff
@@ -131,6 +131,42 @@ curl -s -H "Authorization: Bearer $HA_TOKEN" \
 - **Philips Hue** - Lights
 - **Sonos** - Speakers
 - **Motion Sensors** - Various locations
+- **NUT (Network UPS Tools)** - UPS monitoring (added 2025-12-21)
+
+### NUT / UPS Integration
+
+Monitors the CyberPower OR2200PFCRT2U UPS connected to PVE.
+
+**Connection:**
+- Host: 10.10.10.120
+- Port: 3493
+- Username: upsmon
+- Password: upsmon123
+
+**Entities:**
+| Entity ID | Description |
+|-----------|-------------|
+| `sensor.cyberpower_battery_charge` | Battery percentage |
+| `sensor.cyberpower_load` | Current load % |
+| `sensor.cyberpower_input_voltage` | Input voltage |
+| `sensor.cyberpower_output_voltage` | Output voltage |
+| `sensor.cyberpower_status` | Status (Online, On Battery, etc.) |
+| `sensor.cyberpower_status_data` | Raw status (OL, OB, LB, CHRG) |
+
+**Dashboard Card Example:**
+```yaml
+type: entities
+title: UPS Status
+entities:
+  - entity: sensor.cyberpower_status
+    name: Status
+  - entity: sensor.cyberpower_battery_charge
+    name: Battery
+  - entity: sensor.cyberpower_load
+    name: Load
+  - entity: sensor.cyberpower_input_voltage
+    name: Input Voltage
+```
+
 ## Automations
````
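
The NUT connection above can be sanity-checked from any LAN machine with `upsc` (nut-client); the UPS name `cyberpower` matches the health-check script in MAINTENANCE.md:

```bash
# Query the NUT server on PVE directly
upsc cyberpower@10.10.10.120 ups.status
upsc cyberpower@10.10.10.120 battery.charge

# Or dump every variable the UPS exposes
upsc cyberpower@10.10.10.120
```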
IP address inventory

````diff
@@ -45,6 +45,7 @@ This document tracks all IP addresses in the homelab infrastructure.
 |------|------|------------|---------|--------|
 | 300 | gitea-vm | 10.10.10.220 | Git server | Running |
 | 301 | trading-vm | 10.10.10.221 | AI trading platform (RTX A6000) | Running |
+| 302 | docker-host2 | 10.10.10.207 | Docker services (n8n, future apps) | Running |
 
 ## Workstations & Personal Devices
 
@@ -69,6 +70,9 @@ This document tracks all IP addresses in the homelab infrastructure.
 | CopyParty | cp.htsn.io | 10.10.10.201:3923 | Traefik-Primary |
 | LMDev | lmdev.htsn.io | 10.10.10.111 | Traefik-Primary |
 | Excalidraw | excalidraw.htsn.io | 10.10.10.206:8080 | Traefik-Primary |
+| MetaMCP | metamcp.htsn.io | 10.10.10.207:12008 | Traefik-Primary |
+| n8n | n8n.htsn.io | 10.10.10.207:5678 | Traefik-Primary |
+| Crafty Controller | mc.htsn.io | 10.10.10.207:8443 | Traefik-Primary |
 | Plex | plex.htsn.io | 10.10.10.100:32400 | Traefik-Saltbox |
 | Sonarr | sonarr.htsn.io | 10.10.10.100:8989 | Traefik-Saltbox |
 | Radarr | radarr.htsn.io | 10.10.10.100:7878 | Traefik-Saltbox |
@@ -92,6 +96,7 @@ This document tracks all IP addresses in the homelab infrastructure.
 - .200 - TrueNAS
 - .201 - CopyParty
 - .206 - Docker-host
+- .207 - Docker-host2
 - .220 - Gitea
 - .221 - Trading VM
 - .250 - Traefik-Primary
@@ -110,7 +115,7 @@ This document tracks all IP addresses in the homelab infrastructure.
 - 10.10.10.148 - 10.10.10.149 (2 IPs)
 - 10.10.10.151 - 10.10.10.199 (49 IPs)
 - 10.10.10.202 - 10.10.10.205 (4 IPs)
-- 10.10.10.207 - 10.10.10.219 (13 IPs)
+- 10.10.10.208 - 10.10.10.219 (12 IPs)
 - 10.10.10.222 - 10.10.10.249 (28 IPs)
 - 10.10.10.251 - 10.10.10.254 (4 IPs)
 
@@ -123,6 +128,18 @@ This document tracks all IP addresses in the homelab infrastructure.
 | Portainer Agent | 9001 | Remote management from other Portainer |
 | Gotenberg | 3000 | PDF generation API |
 
+## Docker Host 2 Services (10.10.10.207) - PVE2
+
+| Service | Port | Purpose |
+|---------|------|---------|
+| MetaMCP | 12008 | MCP Aggregator/Gateway (metamcp.htsn.io) |
+| n8n | 5678 | Workflow automation |
+| Crafty Controller | 8443 | Minecraft server management (mc.htsn.io) |
+| Minecraft Java | 25565 | Minecraft Java Edition server |
+| Minecraft Bedrock | 19132/udp | Minecraft Bedrock Edition (Geyser) |
+| Trading Redis | 6379 | Redis for trading platform |
+| Trading TimescaleDB | 5433 | TimescaleDB for trading platform |
+
 ## Syncthing API Endpoints
 
 | Device | IP | Port | API Key |
````
618
MAINTENANCE.md
Normal file
618
MAINTENANCE.md
Normal file
@@ -0,0 +1,618 @@
|
|||||||
|
# Maintenance Procedures and Schedules
|
||||||
|
|
||||||
|
Regular maintenance procedures for homelab infrastructure to ensure reliability and performance.
|
||||||
|
|
||||||
|
## Overview
|
||||||
|
|
||||||
|
| Frequency | Tasks | Estimated Time |
|
||||||
|
|-----------|-------|----------------|
|
||||||
|
| **Daily** | Quick health check | 2-5 min |
|
||||||
|
| **Weekly** | Service status, logs review | 15-30 min |
|
||||||
|
| **Monthly** | Updates, backups verification | 1-2 hours |
|
||||||
|
| **Quarterly** | Full system audit, testing | 2-4 hours |
|
||||||
|
| **Annual** | Hardware maintenance, planning | 4-8 hours |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Daily Maintenance (Automated)
|
||||||
|
|
||||||
|
### Quick Health Check Script
|
||||||
|
|
||||||
|
Save as `~/bin/homelab-health-check.sh`:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
#!/bin/bash
|
||||||
|
# Daily homelab health check
|
||||||
|
|
||||||
|
echo "=== Homelab Health Check ==="
|
||||||
|
echo "Date: $(date)"
|
||||||
|
echo ""
|
||||||
|
|
||||||
|
echo "=== Server Status ==="
|
||||||
|
ssh pve 'uptime' 2>/dev/null || echo "PVE: UNREACHABLE"
|
||||||
|
ssh pve2 'uptime' 2>/dev/null || echo "PVE2: UNREACHABLE"
|
||||||
|
echo ""
|
||||||
|
|
||||||
|
echo "=== CPU Temperatures ==="
|
||||||
|
ssh pve 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE: $(($(cat $f)/1000))°C"; fi; done'
|
||||||
|
ssh pve2 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE2: $(($(cat $f)/1000))°C"; fi; done'
|
||||||
|
echo ""
|
||||||
|
|
||||||
|
echo "=== UPS Status ==="
|
||||||
|
ssh pve 'upsc cyberpower@localhost | grep -E "battery.charge:|battery.runtime:|ups.load:|ups.status:"'
|
||||||
|
echo ""
|
||||||
|
|
||||||
|
echo "=== ZFS Pools ==="
|
||||||
|
ssh pve 'zpool status -x' 2>/dev/null
|
||||||
|
ssh pve2 'zpool status -x' 2>/dev/null
|
||||||
|
ssh truenas 'zpool status -x vault'
|
||||||
|
echo ""
|
||||||
|
|
||||||
|
echo "=== Disk Space ==="
|
||||||
|
ssh pve 'df -h | grep -E "Filesystem|/dev/(nvme|sd)"'
|
||||||
|
ssh truenas 'df -h /mnt/vault'
|
||||||
|
echo ""
|
||||||
|
|
||||||
|
echo "=== VM Status ==="
|
||||||
|
ssh pve 'qm list | grep running | wc -l' | xargs echo "PVE VMs running:"
|
||||||
|
ssh pve2 'qm list | grep running | wc -l' | xargs echo "PVE2 VMs running:"
|
||||||
|
echo ""
|
||||||
|
|
||||||
|
echo "=== Syncthing Connections ==="
|
||||||
|
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
|
||||||
|
"http://127.0.0.1:8384/rest/system/connections" | \
|
||||||
|
python3 -c "import sys,json; d=json.load(sys.stdin)['connections']; \
|
||||||
|
[print(f\"{v.get('name',k[:7])}: {'UP' if v['connected'] else 'DOWN'}\") for k,v in d.items()]"
|
||||||
|
echo ""
|
||||||
|
|
||||||
|
echo "=== Check Complete ==="
|
||||||
|
```
|
||||||
|
|
||||||
|
**Run daily via cron**:
|
||||||
|
```bash
|
||||||
|
# Add to crontab
|
||||||
|
0 9 * * * ~/bin/homelab-health-check.sh | mail -s "Homelab Health Check" hutson@example.com
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Weekly Maintenance
|
||||||
|
|
||||||
|
### Service Status Review
|
||||||
|
|
||||||
|
**Check all critical services**:
|
||||||
|
```bash
|
||||||
|
# Proxmox services
|
||||||
|
ssh pve 'systemctl status pve-cluster pvedaemon pveproxy'
|
||||||
|
ssh pve2 'systemctl status pve-cluster pvedaemon pveproxy'
|
||||||
|
|
||||||
|
# NUT (UPS monitoring)
|
||||||
|
ssh pve 'systemctl status nut-server nut-monitor'
|
||||||
|
ssh pve2 'systemctl status nut-monitor'
|
||||||
|
|
||||||
|
# Container services
|
||||||
|
ssh pve 'pct exec 200 -- systemctl status pihole-FTL' # Pi-hole
|
||||||
|
ssh pve 'pct exec 202 -- systemctl status traefik' # Traefik
|
||||||
|
|
||||||
|
# VM services (via QEMU agent)
|
||||||
|
ssh pve 'qm guest exec 100 -- bash -c "systemctl status nfs-server smbd"' # TrueNAS
|
||||||
|
```
|
||||||
|
|
||||||
|
### Log Review
|
||||||
|
|
||||||
|
**Check for errors in critical logs**:
|
||||||
|
```bash
|
||||||
|
# Proxmox system logs
|
||||||
|
ssh pve 'journalctl -p err -b | tail -50'
|
||||||
|
ssh pve2 'journalctl -p err -b | tail -50'
|
||||||
|
|
||||||
|
# VM logs (if QEMU agent available)
|
||||||
|
ssh pve 'qm guest exec 100 -- bash -c "journalctl -p err --since today"'
|
||||||
|
|
||||||
|
# Traefik access logs
|
||||||
|
ssh pve 'pct exec 202 -- tail -100 /var/log/traefik/access.log'
|
||||||
|
```
|
||||||
|
|
||||||
|
### Syncthing Sync Status
|
||||||
|
|
||||||
|
**Check for sync errors**:
|
||||||
|
```bash
|
||||||
|
# Check all folder errors
|
||||||
|
for folder in documents downloads desktop movies pictures notes config; do
|
||||||
|
echo "=== $folder ==="
|
||||||
|
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
|
||||||
|
"http://127.0.0.1:8384/rest/folder/errors?folder=$folder" | jq
|
||||||
|
done
|
||||||
|
```
|
||||||
|
|
||||||
|
**See**: [SYNCTHING.md](SYNCTHING.md)
|
||||||
|
|
||||||
|
---

## Monthly Maintenance

### System Updates

#### Proxmox Updates

**Check for updates**:
```bash
ssh pve 'apt update && apt list --upgradable'
ssh pve2 'apt update && apt list --upgradable'
```

**Apply updates**:
```bash
# PVE
ssh pve 'apt update && apt dist-upgrade -y'

# PVE2
ssh pve2 'apt update && apt dist-upgrade -y'

# Reboot if kernel updated
ssh pve 'reboot'
ssh pve2 'reboot'
```

**⚠️ Important**:
- Check [Proxmox release notes](https://pve.proxmox.com/wiki/Roadmap) before major updates
- Test on PVE2 first if possible
- Ensure all VMs are backed up before updating
- Monitor VMs after reboot - some may need manual restart

#### Container Updates (LXC)

```bash
# Update all containers
ssh pve 'for ctid in 200 202 205; do pct exec $ctid -- bash -c "apt update && apt upgrade -y"; done'
```

#### VM Updates

**Update VMs individually via SSH**:
```bash
# Ubuntu/Debian VMs
ssh truenas 'apt update && apt upgrade -y'
ssh docker-host 'apt update && apt upgrade -y'
ssh fs-dev 'apt update && apt upgrade -y'

# Check if reboot required
ssh truenas '[ -f /var/run/reboot-required ] && echo "Reboot required"'
```
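
The same reboot check can be swept across every Debian-based VM in one loop; a sketch (host list assumed from the commands above):

```bash
# Sketch: check the reboot-required flag on each VM
for host in truenas docker-host fs-dev; do
  ssh "$host" '[ -f /var/run/reboot-required ] && echo "$(hostname): reboot required"'
done
```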

### ZFS Scrubs

**Schedule**: Run monthly on all pools

**PVE**:
```bash
# Start scrub on all pools
ssh pve 'zpool scrub nvme-mirror1'
ssh pve 'zpool scrub nvme-mirror2'
ssh pve 'zpool scrub rpool'

# Check scrub status
ssh pve 'zpool status | grep -A2 scrub'
```

**PVE2**:
```bash
ssh pve2 'zpool scrub nvme-mirror3'
ssh pve2 'zpool scrub local-zfs2'
ssh pve2 'zpool status | grep -A2 scrub'
```

**TrueNAS**:
```bash
# Scrub via TrueNAS web UI or SSH
ssh truenas 'zpool scrub vault'
ssh truenas 'zpool status vault | grep -A2 scrub'
```

**Automate scrubs**:
```bash
# Add to crontab (run on 1st of month at 2 AM)
0 2 1 * * /sbin/zpool scrub nvme-mirror1
0 2 1 * * /sbin/zpool scrub nvme-mirror2
0 2 1 * * /sbin/zpool scrub rpool
```

**See**: [STORAGE.md](STORAGE.md) for pool details

### SMART Tests

**Run extended SMART tests monthly**:

```bash
# TrueNAS drives (via QEMU agent)
ssh pve 'qm guest exec 100 -- bash -c "smartctl --scan | while read dev type; do smartctl -t long \$dev; done"'

# Check results after 4-8 hours
ssh pve 'qm guest exec 100 -- bash -c "smartctl --scan | while read dev type; do echo \"=== \$dev ===\"; smartctl -a \$dev | grep -E \"Model|Serial|test result|Reallocated|Current_Pending\"; done"'

# PVE drives
ssh pve 'for dev in /dev/nvme0 /dev/nvme1 /dev/sda /dev/sdb; do [ -e "$dev" ] && smartctl -t long $dev; done'

# PVE2 drives
ssh pve2 'for dev in /dev/nvme0 /dev/nvme1 /dev/sda /dev/sdb; do [ -e "$dev" ] && smartctl -t long $dev; done'
```

**Automate SMART tests**:
```bash
# Add to crontab (run on 15th of month at 3 AM)
0 3 15 * * /usr/sbin/smartctl -t long /dev/nvme0
0 3 15 * * /usr/sbin/smartctl -t long /dev/sda
```

### Certificate Renewal Verification

**Check SSL certificate expiry**:
```bash
# Check Traefik certificates
ssh pve 'pct exec 202 -- cat /etc/traefik/acme.json | jq ".letsencrypt.Certificates[] | {domain: .domain.main, expires: .Dates.NotAfter}"'

# Check specific service
echo | openssl s_client -servername git.htsn.io -connect git.htsn.io:443 2>/dev/null | openssl x509 -noout -dates
```

**Certificates should auto-renew 30 days before expiry via Traefik**
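
To turn the manual check into a pass/fail threshold, the expiry date can be compared against a 30-day window. A minimal sketch (assumes GNU `date`; domain reused from the example above):

```bash
# Sketch: warn when a certificate is within 30 days of expiry
expiry=$(echo | openssl s_client -servername git.htsn.io -connect git.htsn.io:443 2>/dev/null \
  | openssl x509 -noout -enddate | cut -d= -f2)
days_left=$(( ($(date -d "$expiry" +%s) - $(date +%s)) / 86400 ))
[ "$days_left" -lt 30 ] && echo "WARN: git.htsn.io cert expires in $days_left days"
```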

**See**: [TRAEFIK.md](TRAEFIK.md) for certificate management

### Backup Verification

**⚠️ TODO**: No backup strategy currently in place

**See**: [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md) for implementation plan

---

## Quarterly Maintenance

### Full System Audit

**Check all systems comprehensively**:

1. **ZFS Pool Health**:
   ```bash
   ssh pve 'zpool status -v'
   ssh pve2 'zpool status -v'
   ssh truenas 'zpool status -v vault'
   ```
   Look for: errors, degraded vdevs, resilver operations

2. **SMART Health**:
   ```bash
   # Run SMART health check script
   ~/bin/smart-health-check.sh
   ```
   Look for: reallocated sectors, pending sectors, failures

3. **Disk Space Trends**:
   ```bash
   # Check growth rate
   ssh pve 'zpool list -o name,size,allocated,free,fragmentation'
   ssh truenas 'df -h /mnt/vault'
   ```
   Plan for expansion if >80% full (scripted check in the sketch after this list)

4. **VM Resource Usage**:
   ```bash
   # Check if VMs need more/less resources
   ssh pve 'qm list'
   ssh pve 'pvesh get /nodes/pve/status'
   ```

5. **Network Performance**:
   ```bash
   # Test bandwidth between critical nodes
   iperf3 -s                # On one host
   iperf3 -c 10.10.10.120   # From another
   ```

6. **Temperature Monitoring**:
   ```bash
   # Check max temps over past quarter
   # TODO: Set up Prometheus/Grafana for historical data
   ssh pve 'sensors'
   ssh pve2 'sensors'
   ```
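
The >80% threshold from item 3 can be scripted straight off `zpool list`; a sketch for one host:

```bash
# Sketch: flag any ZFS pool above 80% capacity
ssh pve 'zpool list -H -o name,capacity' | while read -r pool cap; do
  [ "${cap%\%}" -gt 80 ] && echo "WARN: $pool at $cap"
done
```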

### Service Dependency Testing

**Test critical paths**:

1. **Power failure recovery** (if safe to test):
   - See [UPS.md](UPS.md) for full procedure
   - Verify VM startup order works
   - Confirm all services come back online

2. **Failover testing**:
   - Tailscale subnet routing (PVE → UCG-Fiber)
   - NUT monitoring (PVE server → PVE2 client)

3. **Backup restoration** (when backups implemented):
   - Test restoring a VM from backup
   - Test restoring files from Syncthing versioning

### Documentation Review

- [ ] Update IP assignments in [IP-ASSIGNMENTS.md](IP-ASSIGNMENTS.md)
- [ ] Review and update service URLs in [SERVICES.md](SERVICES.md)
- [ ] Check for missing hardware specs in [HARDWARE.md](HARDWARE.md)
- [ ] Update any changed procedures in this document

---

## Annual Maintenance

### Hardware Maintenance

**Physical cleaning**:
```bash
# Shut down servers (coordinate with users)
ssh pve 'shutdown -h now'
ssh pve2 'shutdown -h now'

# Clean dust from:
# - CPU heatsinks
# - GPU fans
# - Case fans
# - PSU vents
# - Storage enclosure fans

# Check for:
# - Bulging capacitors on PSU/motherboard
# - Loose cables
# - Fan noise/vibration
```

**Thermal paste inspection** (every 2-3 years):
- Check CPU temps vs baseline
- If temps >85°C under load, consider reapplying paste
- Threadripper PRO: Tctl max safe = 90°C

**See**: [HARDWARE.md](HARDWARE.md) for component details

### UPS Battery Test

**Runtime test**:
```bash
# Check battery health
ssh pve 'upsc cyberpower@localhost | grep battery'

# Perform runtime test (coordinate power loss)
# 1. Note current runtime estimate
# 2. Unplug UPS from wall
# 3. Let battery drain to 20%
# 4. Note actual runtime vs estimate
# 5. Plug back in before shutdown triggers

# Battery replacement if:
# - Runtime < 10 min at typical load
# - Battery age > 3-5 years
# - Battery charge < 100% when on AC for 24h
```

**See**: [UPS.md](UPS.md) for full UPS details

### Drive Replacement Planning

**Check drive age and health**:
```bash
# Get drive hours and health
ssh truenas 'smartctl --scan | while read dev type; do
  echo "=== $dev ===";
  smartctl -a $dev | grep -E "Model|Serial|Power_On_Hours|Reallocated|Pending";
done'
```

**Replace drives if**:
- Reallocated sectors > 0
- Pending sectors > 0
- SMART pre-fail warnings
- Age > 5 years for HDDs (3-5 years for SSDs/NVMe)
- Hours > 50,000 for consumer drives

**Budget for replacements**:
- HDDs: WD Red 6TB (~$150/drive)
- NVMe: Samsung/Kingston 2TB (~$150-200/drive)

### Capacity Planning

**Review growth trends**:
```bash
# Storage growth (compare to last year)
ssh pve 'zpool list'
ssh truenas 'df -h /mnt/vault'

# Network bandwidth (if monitoring in place)
# Review Grafana dashboards

# Power consumption
ssh pve 'upsc cyberpower@localhost ups.load'
```

**Plan expansions**:
- Storage: Add drives if >70% full
- RAM: Check if VMs hitting limits
- Network: Upgrade if bandwidth saturation
- UPS: Upgrade if load >80%

### License and Subscription Review

**Proxmox subscription** (if applicable):
- Community (free) or Enterprise subscription?
- Check for updates to pricing/features

**Service subscriptions**:
- Domain registration (htsn.io)
- Cloudflare plan (currently free)
- Let's Encrypt (free, no action needed)

---

## Update Schedules

### Proxmox

| Component | Frequency | Notes |
|-----------|-----------|-------|
| Security patches | Weekly | Via `apt upgrade` |
| Minor updates | Monthly | Test on PVE2 first |
| Major versions | Quarterly | Read release notes, plan downtime |
| Kernel updates | Monthly | Requires reboot |

**Update procedure**:
1. Check [Proxmox release notes](https://pve.proxmox.com/wiki/Roadmap)
2. Backup VM configs: `vzdump --dumpdir /tmp`
3. Update: `apt update && apt dist-upgrade`
4. Reboot if kernel changed: `reboot`
5. Verify VMs auto-started: `qm list`

### Containers (LXC)

| Container | Update Frequency | Package Manager |
|-----------|------------------|-----------------|
| Pi-hole (200) | Weekly | `apt` |
| Traefik (202) | Monthly | `apt` |
| FindShyt (205) | As needed | `apt` |

**Update command**:
```bash
ssh pve 'pct exec CTID -- bash -c "apt update && apt upgrade -y"'
```

### VMs

| VM | Update Frequency | Notes |
|----|------------------|-------|
| TrueNAS | Monthly | Via web UI or `apt` |
| Saltbox | Weekly | Managed by Saltbox updates |
| HomeAssistant | Monthly | Via HA supervisor |
| Docker-host | Weekly | `apt` + Docker images |
| Trading-VM | As needed | Via SSH |
| Gitea-VM | Monthly | Via web UI + `apt` |

**Docker image updates**:
```bash
ssh docker-host 'docker-compose pull && docker-compose up -d'
```

### Firmware Updates

| Component | Check Frequency | Update Method |
|-----------|----------------|---------------|
| Motherboard BIOS | Annually | Manual flash (high risk) |
| GPU firmware | Rarely | `nvidia-smi` or manual |
| SSD/NVMe firmware | Quarterly | Vendor tools |
| HBA firmware | Annually | LSI tools |
| UPS firmware | Annually | PowerPanel or manual |

**⚠️ Warning**: BIOS/firmware updates carry risk. Only update if:
- Critical security issue
- Needed for hardware compatibility
- Fixing known bug affecting you

---

## Testing Checklists

### Pre-Update Checklist

Before ANY system update:
- [ ] Check current system state: `uptime`, `qm list`, `zpool status`
- [ ] Verify backups are current (when backup system in place)
- [ ] Check for critical VMs/services that can't have downtime
- [ ] Review update changelog/release notes
- [ ] Test on non-critical system first (PVE2 or test VM)
- [ ] Plan rollback strategy if update fails
- [ ] Notify users if downtime expected

### Post-Update Checklist

After system update:
- [ ] Verify system booted correctly: `uptime`
- [ ] Check all VMs/CTs started: `qm list`, `pct list`
- [ ] Test critical services:
  - [ ] Pi-hole DNS: `nslookup google.com 10.10.10.10`
  - [ ] Traefik routing: `curl -I https://plex.htsn.io`
  - [ ] NFS/SMB shares: Test mount from VM
  - [ ] Syncthing sync: Check all devices connected
- [ ] Review logs for errors: `journalctl -p err -b`
- [ ] Check temperatures: `sensors`
- [ ] Verify UPS monitoring: `upsc cyberpower@localhost`
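
Most of this checklist collapses into one smoke-test script; a sketch reusing the commands above:

```bash
#!/bin/bash
# Sketch: post-update smoke test (run from a workstation with SSH access)
nslookup google.com 10.10.10.10 > /dev/null && echo "DNS: OK" || echo "DNS: FAIL"
curl -fsI https://plex.htsn.io > /dev/null && echo "Traefik: OK" || echo "Traefik: FAIL"
ssh pve 'qm list | grep -c running' | xargs echo "PVE VMs running:"
ssh pve 'journalctl -p err -b --no-pager | tail -5'
ssh pve 'upsc cyberpower@localhost ups.status'
```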

### Disaster Recovery Test

**Quarterly test** (when backup system in place):
- [ ] Simulate VM failure: Restore from backup
- [ ] Simulate storage failure: Import pool on different system
- [ ] Simulate network failure: Verify Tailscale failover
- [ ] Simulate power failure: Test UPS shutdown procedure (if safe)
- [ ] Document recovery time and issues

---

## Log Rotation

**System logs** are automatically rotated by systemd-journald and logrotate.

**Check log sizes**:
```bash
# Journalctl size
ssh pve 'journalctl --disk-usage'

# Traefik logs
ssh pve 'pct exec 202 -- du -sh /var/log/traefik/'
```

**Configure retention**:
```bash
# Limit journald to 500MB
ssh pve 'echo "SystemMaxUse=500M" >> /etc/systemd/journald.conf'
ssh pve 'systemctl restart systemd-journald'
```
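
The `SystemMaxUse` cap applies to new writes; to trim journals that already exceed it, `journalctl` can vacuum in place:

```bash
# One-shot trim of existing journal files to the same 500MB cap
ssh pve 'journalctl --vacuum-size=500M'
```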

**Traefik log rotation** (already configured):
```bash
# /etc/logrotate.d/traefik on CT 202
/var/log/traefik/*.log {
    daily
    rotate 7
    compress
    delaycompress
    missingok
    notifempty
}
```

---

## Monitoring Integration

**TODO**: Set up automated monitoring for these procedures

**When monitoring is implemented** (see [MONITORING.md](MONITORING.md)):
- ZFS scrub completion/errors
- SMART test failures
- Certificate expiry warnings (<30 days)
- Update availability notifications
- Disk space thresholds (>80%)
- Temperature warnings (>85°C)

---

## Related Documentation

- [MONITORING.md](MONITORING.md) - Automated health checks and alerts
- [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md) - Backup implementation plan
- [UPS.md](UPS.md) - Power failure procedures
- [STORAGE.md](STORAGE.md) - ZFS pool management
- [HARDWARE.md](HARDWARE.md) - Hardware specifications
- [SERVICES.md](SERVICES.md) - Service inventory

---

**Last Updated**: 2025-12-22
**Status**: ⚠️ Manual procedures only - monitoring automation needed
478
MINECRAFT.md
Normal file
@@ -0,0 +1,478 @@

# Minecraft Server - Hutworld

Minecraft server running on docker-host2 via Crafty Controller 4.

---

## Quick Reference

| Setting | Value |
|---------|-------|
| **Web GUI** | https://mc.htsn.io |
| **Game Server (Java)** | hutworld.htsn.io:25565 |
| **Game Server (Bedrock)** | hutworld.htsn.io:19132 |
| **Host** | docker-host2 (10.10.10.207) |
| **Server Type** | Paper 1.21.11 |
| **World Name** | hutworld |
| **Memory** | 2GB min / 4GB max |

---

## Crafty Controller Access

| Setting | Value |
|---------|-------|
| **URL** | https://mc.htsn.io |
| **Username** | admin |
| **Password** | See `/crafty/data/config/default-creds.txt` on docker-host2 |

**Get password:**
```bash
ssh docker-host2 'cat ~/crafty/data/config/default-creds.txt'
```

---

## Current Status

### Completed

- [x] Crafty Controller 4.4.7 deployed on docker-host2
- [x] Traefik reverse proxy configured (mc.htsn.io → 10.10.10.207:8443)
- [x] DNS A record created for hutworld.htsn.io (non-proxied, points to public IP)
- [x] Port forwarding configured via UniFi API:
  - TCP/UDP 25565 → 10.10.10.207 (Java Edition)
  - UDP 19132 → 10.10.10.207 (Bedrock via Geyser)
- [x] Server files transferred from Windows PC (D:\Minecraft\mcss\servers\hutworld)
- [x] Server imported into Crafty and running
- [x] Paper upgraded from 1.21.5 to 1.21.11
- [x] Plugins updated (GSit 3.1.1, LuckPerms 5.5.22)
- [x] Orphaned plugin data cleaned up
- [x] LuckPerms database restored with original permissions
- [x] Automated backups to TrueNAS configured (every 6 hours)

### Pending

- [ ] Change Crafty admin password to something memorable
- [ ] Test external connectivity from outside network

---

## Import Instructions

To import the hutworld server in Crafty:

1. Go to **Servers** → Click **+ Create New Server**
2. Select **Import Server** tab
3. Fill in:
   - **Server Name:** `Hutworld`
   - **Import Path:** `/crafty/import/hutworld`
   - **Server JAR:** `paper.jar`
   - **Min RAM:** `2048` (2GB)
   - **Max RAM:** `6144` (6GB)
   - **Server Port:** `25565`
4. Click **Import Server**
5. Go to server → Click **Start**

---

## Server Configuration

### World Data

| World | Description |
|-------|-------------|
| hutworld | Main overworld |
| hutworld_nether | Nether dimension |
| hutworld_the_end | End dimension |

### Installed Plugins

| Plugin | Version | Purpose |
|--------|---------|---------|
| EssentialsX | 2.20.1 | Core server commands |
| EssentialsXChat | 2.20.1 | Chat formatting |
| EssentialsXSpawn | 2.20.1 | Spawn management |
| Geyser-Spigot | Latest | Bedrock Edition support |
| floodgate | Latest | Bedrock authentication |
| GSit | 3.1.1 | Sit/lay/crawl animations |
| LuckPerms | 5.5.22 | Permissions management |
| PluginPortal | 2.2.2 | Plugin management |
| Vault | 1.7.3 | Economy/permissions API |
| ViaVersion | Latest | Multi-version support |
| ViaBackwards | Latest | Older client support |
| randomtp | Latest | Random teleportation |

**Removed plugins** (cleaned up 2026-01-03):
- GriefPrevention, Multiverse-Core, Multiverse-Portals, ProtocolLib, WorldEdit, WorldGuard (disabled/orphaned)

---

## Docker Configuration

**Location:** `~/crafty/docker-compose.yml` on docker-host2

```yaml
services:
  crafty:
    image: registry.gitlab.com/crafty-controller/crafty-4:4.4.7
    container_name: crafty
    restart: unless-stopped
    environment:
      - TZ=America/New_York
    ports:
      - "8443:8443"        # Web GUI (HTTPS)
      - "8123:8123"        # Dynmap (if used)
      - "25565:25565"      # Minecraft Java
      - "25566:25566"      # Additional server
      - "19132:19132/udp"  # Minecraft Bedrock (Geyser)
    volumes:
      - ./data/backups:/crafty/backups
      - ./data/logs:/crafty/logs
      - ./data/servers:/crafty/servers
      - ./data/config:/crafty/app/config
      - ./data/import:/crafty/import
```

---

## Traefik Configuration

**File:** `/etc/traefik/conf.d/crafty.yaml` on CT 202 (10.10.10.250)

```yaml
http:
  routers:
    crafty-secure:
      entryPoints:
        - websecure
      rule: "Host(`mc.htsn.io`)"
      service: crafty
      tls:
        certResolver: cloudflare
      priority: 50

  services:
    crafty:
      loadBalancer:
        servers:
          - url: "https://10.10.10.207:8443"
        serversTransport: crafty-transport@file

  serversTransports:
    crafty-transport:
      insecureSkipVerify: true
```

---

## Port Forwarding (UniFi)

Configured via UniFi API on UCG-Fiber (10.10.10.1):

| Rule Name | Port | Protocol | Destination |
|-----------|------|----------|-------------|
| Minecraft Java | 25565 | TCP/UDP | 10.10.10.207:25565 |
| Minecraft Bedrock | 19132 | UDP | 10.10.10.207:19132 |

---

## DNS Records (Cloudflare)

| Record | Type | Value | Proxied |
|--------|------|-------|---------|
| mc.htsn.io | CNAME | htsn.io | Yes (for web GUI) |
| hutworld.htsn.io | A | 70.237.94.174 | No (direct for game traffic) |

**Note:** Game traffic (25565, 19132) cannot be proxied through Cloudflare - only HTTP/HTTPS works with Cloudflare proxy.

---

## LuckPerms Web Editor

After server is running:

1. Open Crafty console for Hutworld server
2. Run command: `/lp editor`
3. A unique URL will be generated (cloud-hosted by LuckPerms)
4. Open the URL in browser to manage permissions

The editor is hosted by LuckPerms, so no additional port forwarding is needed.

---

## Backup Configuration

### Automated Backups to TrueNAS

Backups run automatically every 6 hours and are stored on TrueNAS.

| Setting | Value |
|---------|-------|
| **Destination** | TrueNAS (10.10.10.200) |
| **Path** | `/mnt/vault/users/backups/minecraft/` |
| **Frequency** | Every 6 hours (12am, 6am, 12pm, 6pm) |
| **Retention** | 14 backups (~3.5 days of history) |
| **Size** | ~2.3 GB per backup |
| **Script** | `/home/hutson/minecraft-backup.sh` on docker-host2 |
| **Log** | `/home/hutson/minecraft-backup.log` on docker-host2 |

### Backup Script

**Location:** `~/minecraft-backup.sh` on docker-host2

```bash
#!/bin/bash
# Minecraft Server Backup Script
# Backs up Crafty server data to TrueNAS

BACKUP_SRC="$HOME/crafty/data/servers/19f604a9-f037-442d-9283-0761c73cfd60"
BACKUP_DEST="hutson@10.10.10.200:/mnt/vault/users/backups/minecraft"
DATE=$(date +%Y-%m-%d_%H%M)
BACKUP_NAME="hutworld-$DATE.tar.gz"
LOCAL_BACKUP="/tmp/$BACKUP_NAME"

# Create compressed backup (exclude large unnecessary files)
tar -czf "$LOCAL_BACKUP" \
  --exclude="*.jar" \
  --exclude="cache" \
  --exclude="libraries" \
  --exclude=".paper-remapped" \
  -C "$HOME/crafty/data/servers" \
  19f604a9-f037-442d-9283-0761c73cfd60

# Transfer to TrueNAS
sshpass -p 'GrilledCh33s3#' scp -o StrictHostKeyChecking=no "$LOCAL_BACKUP" "$BACKUP_DEST/"

# Clean up local temp file
rm -f "$LOCAL_BACKUP"

# Keep only last 14 backups on TrueNAS
sshpass -p 'GrilledCh33s3#' ssh -o StrictHostKeyChecking=no hutson@10.10.10.200 '
  cd /mnt/vault/users/backups/minecraft
  ls -t hutworld-*.tar.gz 2>/dev/null | tail -n +15 | xargs -r rm -f
'
```

### Cron Schedule

```bash
# View current schedule
ssh docker-host2 'crontab -l | grep minecraft'

# Output: 0 */6 * * * /home/hutson/minecraft-backup.sh >> /home/hutson/minecraft-backup.log 2>&1
```

### Manual Backup Commands

```bash
# Run backup manually
ssh docker-host2 '~/minecraft-backup.sh'

# Check backup log
ssh docker-host2 'tail -20 ~/minecraft-backup.log'

# List backups on TrueNAS
sshpass -p 'GrilledCh33s3#' ssh -o StrictHostKeyChecking=no hutson@10.10.10.200 \
  'ls -lh /mnt/vault/users/backups/minecraft/'
```
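
A quick way to confirm the cron job is actually producing backups is to check the newest file's age; a sketch (assumes GNU `stat` on TrueNAS SCALE, paths from the table above):

```bash
# Sketch: warn if the newest Minecraft backup is older than ~7 hours
sshpass -p 'GrilledCh33s3#' ssh -o StrictHostKeyChecking=no hutson@10.10.10.200 '
  latest=$(ls -t /mnt/vault/users/backups/minecraft/hutworld-*.tar.gz 2>/dev/null | head -1)
  if [ -z "$latest" ]; then echo "WARN: no backups found"; exit 1; fi
  age_h=$(( ($(date +%s) - $(stat -c %Y "$latest")) / 3600 ))
  [ "$age_h" -gt 7 ] && echo "WARN: newest backup is ${age_h}h old: $latest"
'
```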

### Restore from Backup

```bash
# 1. Stop the server in Crafty web UI

# 2. Copy backup from TrueNAS (run on docker-host2 so the file lands in its /tmp)
sshpass -p 'GrilledCh33s3#' scp -o StrictHostKeyChecking=no \
  hutson@10.10.10.200:/mnt/vault/users/backups/minecraft/hutworld-YYYY-MM-DD_HHMM.tar.gz \
  /tmp/

# 3. Extract to server directory (backup existing first)
ssh docker-host2 'cd ~/crafty/data/servers && \
  mv 19f604a9-f037-442d-9283-0761c73cfd60 19f604a9-f037-442d-9283-0761c73cfd60.old && \
  tar -xzf /tmp/hutworld-YYYY-MM-DD_HHMM.tar.gz'

# 4. Start server in Crafty web UI
```

---

## Common Tasks

### Start/Stop Server

Via Crafty web UI at https://mc.htsn.io, or:

```bash
# Check Crafty container status
ssh docker-host2 'docker ps | grep crafty'

# Restart Crafty container
ssh docker-host2 'cd ~/crafty && docker compose restart'

# View Crafty logs
ssh docker-host2 'docker logs -f crafty'
```

### Backup Server

See [Backup Configuration](#backup-configuration) for full details.

```bash
# Run backup manually
ssh docker-host2 '~/minecraft-backup.sh'

# Check recent backups
sshpass -p 'GrilledCh33s3#' ssh -o StrictHostKeyChecking=no hutson@10.10.10.200 \
  'ls -lht /mnt/vault/users/backups/minecraft/ | head -5'
```

### Update Plugins

1. Download new plugin JAR
2. Upload via Crafty Files tab, or:
   ```bash
   scp plugin.jar docker-host2:~/crafty/data/servers/hutworld/plugins/
   ```
3. Restart server in Crafty

### Check Server Logs

Via Crafty web UI (Logs tab), or:
```bash
ssh docker-host2 'tail -f ~/crafty/data/servers/hutworld/logs/latest.log'
```

---

## Troubleshooting

### Server won't start

```bash
# Check Crafty container logs
ssh docker-host2 'docker logs crafty --tail 50'

# Check server logs
ssh docker-host2 'cat ~/crafty/data/servers/hutworld/logs/latest.log | tail -100'

# Check Java version in container
ssh docker-host2 'docker exec crafty java -version'
```

### Can't connect externally

1. Verify port forwarding is active:
   ```bash
   ssh root@10.10.10.1 'iptables -t nat -L -n | grep 25565'
   ```

2. Test from external network:
   ```bash
   nc -zv hutworld.htsn.io 25565
   ```

3. Check if server is listening:
   ```bash
   ssh docker-host2 'netstat -tlnp | grep 25565'
   ```

### Bedrock players can't connect

1. Verify Geyser plugin is installed and enabled
2. Check Geyser config: `~/crafty/data/servers/hutworld/plugins/Geyser-Spigot/config.yml`
3. Ensure UDP 19132 is forwarded and not blocked

### LuckPerms missing users/permissions

If LuckPerms shows a fresh database (missing users like Suwan):

1. **Check if original database exists:**
   ```bash
   ssh docker-host2 'ls -la ~/crafty/data/import/hutworld/plugins/LuckPerms/*.db'
   ```

2. **Restore from import backup:**
   ```bash
   # Stop server in Crafty UI first
   ssh docker-host2 'cp ~/crafty/data/import/hutworld/plugins/LuckPerms/luckperms-h2-v2.mv.db \
     ~/crafty/data/servers/19f604a9-f037-442d-9283-0761c73cfd60/plugins/LuckPerms/'
   ```

3. **Or restore from TrueNAS backup:**
   ```bash
   # List available backups
   sshpass -p 'GrilledCh33s3#' ssh -o StrictHostKeyChecking=no hutson@10.10.10.200 \
     'ls -lt /mnt/vault/users/backups/minecraft/'

   # Extract LuckPerms database from backup
   sshpass -p 'GrilledCh33s3#' scp hutson@10.10.10.200:/mnt/vault/users/backups/minecraft/hutworld-YYYY-MM-DD_HHMM.tar.gz /tmp/
   tar -xzf /tmp/hutworld-*.tar.gz -C /tmp --strip-components=2 \
     '*/plugins/LuckPerms/luckperms-h2-v2.mv.db'
   ```

4. **Restart server in Crafty UI**

---

## Migration History

### 2026-01-04: Backup System

- Configured automated backups to TrueNAS every 6 hours
- Set 14-backup retention (~3.5 days of recovery points)
- Created backup script with compression and cleanup
- Storage: `/mnt/vault/users/backups/minecraft/`

### 2026-01-03: Server Fixes & Updates

**Updates:**
- Upgraded Paper from 1.21.5 to 1.21.11 (build 69)
- Updated GSit from 2.3.2 to 3.1.1
- Fixed corrupted LuckPerms JAR (re-downloaded 5.5.22)
- Restored original LuckPerms database with user permissions

**Cleanup:**
- Removed disabled plugins: Dynmap, Graves
- Removed orphaned data folders: GriefPreventionData, SilkSpawners_v2, Graves, ViaRewind

**Fixes:**
- Fixed memory allocation (was attempting 2TB, set to 2GB min / 4GB max)
- Fixed file permissions for Docker container access

### 2026-01-03: Initial Migration

**Source:** Windows PC (10.10.10.150) - D:\Minecraft\mcss\servers\hutworld

**Steps completed:**
1. Compressed hutworld folder on Windows (2.4GB zip)
2. Transferred via SCP to docker-host2
3. Unzipped to ~/crafty/data/import/hutworld
4. Downloaded Paper 1.21.5 JAR (later upgraded to 1.21.11)
5. Imported server into Crafty Controller
6. Configured port forwarding (updated existing 25565 rule, added 19132)
7. Created DNS record for hutworld.htsn.io

**Original MCSS config preserved:** `mcss_server_config.json`

---

## Related Documentation

- [IP Assignments](IP-ASSIGNMENTS.md) - Network configuration
- [Traefik](TRAEFIK.md) - Reverse proxy setup
- [VMs](VMS.md) - docker-host2 details
- [Gateway](GATEWAY.md) - UCG-Fiber configuration

---

## Resources

- [Crafty Controller Docs](https://docs.craftycontrol.com/)
- [Paper MC](https://papermc.io/)
- [Geyser MC](https://geysermc.org/)
- [LuckPerms](https://luckperms.net/)

---

**Last Updated:** 2026-01-04
583
MONITORING.md
Normal file
@@ -0,0 +1,583 @@

# Monitoring and Alerting

Documentation for system monitoring, health checks, and alerting across the homelab.

## Current Monitoring Status

| Component | Monitored? | Method | Alerts | Notes |
|-----------|------------|--------|--------|-------|
| **Gateway** | ✅ Yes | Custom services | ✅ Auto-reboot | Internet watchdog + memory monitor |
| **UPS** | ✅ Yes | NUT + Home Assistant | ❌ No | Battery, load, runtime tracked |
| **Syncthing** | ✅ Partial | API (manual checks) | ❌ No | Connection status available |
| **Server temps** | ✅ Partial | Manual checks | ❌ No | Via `sensors` command |
| **VM status** | ✅ Partial | Proxmox UI | ❌ No | Manual monitoring |
| **ZFS health** | ❌ No | Manual `zpool status` | ❌ No | No automated checks |
| **Disk health (SMART)** | ❌ No | Manual `smartctl` | ❌ No | No automated checks |
| **Network** | ✅ Partial | Gateway watchdog | ✅ Auto-reboot | Connectivity check every 60s |
| **Services** | ❌ No | - | ❌ No | No health checks |
| **Backups** | ❌ No | - | ❌ No | No verification |

**Overall Status**: ⚠️ **PARTIAL** - Gateway monitoring is active; most of the rest is manual

---

## Existing Monitoring

### UPS Monitoring (NUT)

**Status**: ✅ **Active and working**

**What's monitored**:
- Battery charge percentage
- Runtime remaining (seconds)
- Load percentage
- Input/output voltage
- UPS status (OL/OB/LB)

**Access**:
```bash
# Full UPS status
ssh pve 'upsc cyberpower@localhost'

# Key metrics
ssh pve 'upsc cyberpower@localhost | grep -E "battery.charge:|battery.runtime:|ups.load:|ups.status:"'
```

**Home Assistant Integration**:
- Sensors: `sensor.cyberpower_*`
- Can be used for automation/alerts
- Currently: No alerts configured

**See**: [UPS.md](UPS.md)

---

### Gateway Monitoring

**Status**: ✅ **Active with auto-recovery**

Two custom systemd services monitor the UCG-Fiber gateway (10.10.10.1):

**1. Internet Watchdog** (`internet-watchdog.service`)
- Pings external DNS (1.1.1.1, 8.8.8.8, 208.67.222.222) every 60 seconds
- Auto-reboots gateway after 5 consecutive failures (~5 minutes)
- Logs to `/var/log/internet-watchdog.log`

**2. Memory Monitor** (`memory-monitor.service`)
- Logs memory usage and top processes every 10 minutes
- Logs to `/data/logs/memory-history.log`
- Auto-rotates when log exceeds 10MB
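
The watchdog's core loop is simple; below is a minimal sketch of the logic described above (the log path matches, but the exact script running on the gateway may differ):

```bash
#!/bin/bash
# Minimal internet-watchdog sketch: ping rotation, reboot after 5 straight failures
FAILS=0
while true; do
  if ping -c1 -W2 1.1.1.1 >/dev/null 2>&1 \
     || ping -c1 -W2 8.8.8.8 >/dev/null 2>&1 \
     || ping -c1 -W2 208.67.222.222 >/dev/null 2>&1; then
    FAILS=0
  else
    FAILS=$((FAILS + 1))
    echo "$(date): connectivity check failed ($FAILS/5)" >> /var/log/internet-watchdog.log
    if [ "$FAILS" -ge 5 ]; then
      echo "$(date): rebooting gateway" >> /var/log/internet-watchdog.log
      reboot
    fi
  fi
  sleep 60
done
```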

**Quick Commands**:
```bash
# Check service status
ssh ucg-fiber 'systemctl status internet-watchdog memory-monitor'

# View watchdog activity
ssh ucg-fiber 'tail -20 /var/log/internet-watchdog.log'

# View memory history
ssh ucg-fiber 'tail -100 /data/logs/memory-history.log'

# Current memory usage
ssh ucg-fiber 'free -m && ps -eo pid,rss,comm --sort=-rss | head -12'
```

**See**: [GATEWAY.md](GATEWAY.md)

---

### Syncthing Monitoring

**Status**: ⚠️ **Partial** - API available, no automated monitoring

**What's available**:
- Device connection status
- Folder sync status
- Sync errors
- Bandwidth usage

**Manual Checks**:
```bash
# Check connections (Mac Mini)
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
  "http://127.0.0.1:8384/rest/system/connections" | \
  python3 -c "import sys,json; d=json.load(sys.stdin)['connections']; \
[print(f\"{v.get('name',k[:7])}: {'UP' if v['connected'] else 'DOWN'}\") for k,v in d.items()]"

# Check folder status
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
  "http://127.0.0.1:8384/rest/db/status?folder=documents" | jq

# Check errors
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
  "http://127.0.0.1:8384/rest/folder/errors?folder=documents" | jq
```

**Needs**: Automated monitoring script + alerts
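
Until proper alerting exists, a cron-able sketch that exits nonzero (and prints the offenders) when any device is disconnected:

```bash
#!/bin/bash
# Sketch: fail when any Syncthing device is disconnected
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
  "http://127.0.0.1:8384/rest/system/connections" | python3 -c '
import sys, json
conns = json.load(sys.stdin)["connections"]
down = [v.get("name", k[:7]) for k, v in conns.items() if not v["connected"]]
if down:
    print("Syncthing DOWN:", ", ".join(down))
    sys.exit(1)
'
```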

**See**: [SYNCTHING.md](SYNCTHING.md)

---

### Temperature Monitoring

**Status**: ⚠️ **Manual only**

**Current Method**:
```bash
# CPU temperature (Threadripper Tctl)
ssh pve 'for f in /sys/class/hwmon/hwmon*/temp*_input; do \
  label=$(cat ${f%_input}_label 2>/dev/null); \
  if [ "$label" = "Tctl" ]; then echo "PVE Tctl: $(($(cat $f)/1000))°C"; fi; done'

ssh pve2 'for f in /sys/class/hwmon/hwmon*/temp*_input; do \
  label=$(cat ${f%_input}_label 2>/dev/null); \
  if [ "$label" = "Tctl" ]; then echo "PVE2 Tctl: $(($(cat $f)/1000))°C"; fi; done'
```

**Thresholds**:
- Healthy: 70-80°C under load
- Warning: >85°C
- Critical: >90°C (throttling)

**Needs**: Automated monitoring + alert if >85°C
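
A sketch of that alert as a cron job (prints a warning line; wire the output to mail or Telegram as preferred):

```bash
#!/bin/bash
# Sketch: warn when Tctl exceeds 85°C on either node
for host in pve pve2; do
  t=$(ssh "$host" 'for f in /sys/class/hwmon/hwmon*/temp*_input; do
        [ "$(cat ${f%_input}_label 2>/dev/null)" = "Tctl" ] && echo $(($(cat $f)/1000))
      done' | head -1)
  [ -n "$t" ] && [ "$t" -gt 85 ] && echo "WARN: $host Tctl at ${t}°C"
done
```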

---

### Proxmox VM Monitoring

**Status**: ⚠️ **Manual only**

**Current Access**:
- Proxmox Web UI: Node → Summary
- CLI: `ssh pve 'qm list'`

**Metrics Available** (via Proxmox):
- CPU usage per VM
- RAM usage per VM
- Disk I/O
- Network I/O
- VM uptime

**Needs**: API-based monitoring + alerts for VM down
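
As a stopgap, `qm list` output can be parsed for anything not running; a sketch:

```bash
# Sketch: flag VMs that are not in the "running" state on each node
for host in pve pve2; do
  ssh "$host" 'qm list' | awk -v h="$host" \
    'NR>1 && $3 != "running" {print "WARN: " h " VM " $1 " (" $2 ") is " $3}'
done
```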

---

## Recommended Monitoring Stack

### Option 1: Prometheus + Grafana (Recommended)

**Why**:
- Industry standard
- Extensive integrations
- Beautiful dashboards
- Flexible alerting

**Architecture**:
```
Grafana (dashboard) → Prometheus (metrics DB) → Exporters (data collection)
                                 ↓
                      Alertmanager (alerts)
```

**Required Exporters**:
| Exporter | Monitors | Install On |
|----------|----------|------------|
| node_exporter | CPU, RAM, disk, network | PVE, PVE2, TrueNAS, all VMs |
| zfs_exporter | ZFS pool health | PVE, PVE2, TrueNAS |
| smartmon_exporter | Drive SMART data | PVE, PVE2, TrueNAS |
| nut_exporter | UPS metrics | PVE |
| proxmox_exporter | VM/CT stats | PVE, PVE2 |
| cadvisor | Docker containers | Saltbox, docker-host |

**Deployment**:
```bash
# Create monitoring VM
ssh pve 'qm create 210 --name monitoring --memory 4096 --cores 2 \
  --net0 virtio,bridge=vmbr0'

# Install Prometheus + Grafana (via Docker)
# /opt/monitoring/docker-compose.yml
```
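
A starting point for that compose file might look like the sketch below; image tags, host ports, and volume layout are assumptions to adapt, and the scrape config (`prometheus.yml`) still needs to be written separately:

```yaml
# /opt/monitoring/docker-compose.yml (sketch)
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml  # scrape targets go here
      - prom-data:/prometheus
    restart: unless-stopped

  grafana:
    image: grafana/grafana-oss:latest
    ports:
      - "3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana
    restart: unless-stopped

volumes:
  prom-data:
  grafana-data:
```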

**Estimated Setup Time**: 4-6 hours

---

### Option 2: Uptime Kuma (Simpler Alternative)

**Why**:
- Lightweight
- Easy to set up
- Web-based dashboard
- Built-in alerts (email, Slack, etc.)

**What it monitors**:
- HTTP/HTTPS endpoints
- Ping (ICMP)
- Ports (TCP)
- Docker containers

**Deployment**:
```bash
ssh docker-host 'mkdir -p /opt/uptime-kuma'
ssh docker-host 'cat > /opt/uptime-kuma/docker-compose.yml' << 'EOF'
version: "3.8"
services:
  uptime-kuma:
    image: louislam/uptime-kuma:latest
    ports:
      - "3001:3001"
    volumes:
      - ./data:/app/data
    restart: unless-stopped
EOF
ssh docker-host 'cd /opt/uptime-kuma && docker compose up -d'

# Access: http://10.10.10.206:3001
# Add Traefik config for uptime.htsn.io
```

**Estimated Setup Time**: 1-2 hours

---

### Option 3: Netdata (Real-time Monitoring)

**Why**:
- Real-time metrics (1-second granularity)
- Auto-discovers services
- Low overhead
- Beautiful web UI

**Deployment**:
```bash
# Install on each server
ssh pve 'bash <(curl -Ss https://my-netdata.io/kickstart.sh)'
ssh pve2 'bash <(curl -Ss https://my-netdata.io/kickstart.sh)'

# Access:
# http://10.10.10.120:19999 (PVE)
# http://10.10.10.102:19999 (PVE2)
```

**Parent-Child Setup** (optional):
- Configure PVE as parent
- Stream metrics from PVE2 → PVE
- Single dashboard for both servers

**Estimated Setup Time**: 1 hour

---

## Critical Metrics to Monitor

### Server Health

| Metric | Threshold | Action |
|--------|-----------|--------|
| **CPU usage** | >90% for 5 min | Alert |
| **CPU temp** | >85°C | Alert |
| **CPU temp** | >90°C | Critical alert |
| **RAM usage** | >95% | Alert |
| **Disk space** | >80% | Warning |
| **Disk space** | >90% | Alert |
| **Load average** | >CPU count | Alert |

### Storage Health

| Metric | Threshold | Action |
|--------|-----------|--------|
| **ZFS pool errors** | >0 | Alert immediately |
| **ZFS pool degraded** | Any degraded vdev | Critical alert |
| **ZFS scrub failed** | Last scrub error | Alert |
| **SMART reallocated sectors** | >0 | Warning |
| **SMART pending sectors** | >0 | Alert |
| **SMART failure** | Pre-fail | Critical - replace drive |

### UPS

| Metric | Threshold | Action |
|--------|-----------|--------|
| **Battery charge** | <20% | Warning |
| **Battery charge** | <10% | Alert |
| **On battery** | >5 min | Alert |
| **Runtime** | <5 min | Critical |

### Network

| Metric | Threshold | Action |
|--------|-----------|--------|
| **Device unreachable** | >2 min down | Alert |
| **High packet loss** | >5% | Warning |
| **Bandwidth saturation** | >90% | Warning |

### VMs/Services

| Metric | Threshold | Action |
|--------|-----------|--------|
| **VM stopped** | Critical VM down | Alert immediately |
| **Service unreachable** | HTTP 5xx or timeout | Alert |
| **Backup failed** | Any backup failure | Alert |
| **Certificate expiry** | <30 days | Warning |
| **Certificate expiry** | <7 days | Alert |

---

## Alert Destinations

### Email Alerts

**Recommended**: Set up SMTP relay for email alerts

**Options**:
1. Gmail SMTP (free, rate-limited)
2. SendGrid (free tier: 100 emails/day)
3. Mailgun (free tier available)
4. Self-hosted mail server (complex)

**Configuration Example** (Prometheus Alertmanager):
```yaml
# /etc/alertmanager/alertmanager.yml
receivers:
  - name: 'email'
    email_configs:
      - to: 'hutson@example.com'
        from: 'alerts@htsn.io'
        smarthost: 'smtp.gmail.com:587'
        auth_username: 'alerts@htsn.io'
        auth_password: 'app-password-here'
```

---

### Push Notifications

**Options**:
- **Pushover**: $5 one-time, reliable
- **Pushbullet**: Free tier available
- **Telegram Bot**: Free
- **Discord Webhook**: Free
- **Slack**: Free tier available

**Recommended**: Pushover or Telegram for mobile alerts
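
With a Telegram bot, an alert is a single HTTP call, which makes it easy to bolt onto any of the scripts in this document; a sketch (token and chat ID are placeholders):

```bash
# Sketch: send a push alert via a Telegram bot
curl -s "https://api.telegram.org/bot<TOKEN>/sendMessage" \
  -d chat_id=<CHAT_ID> \
  -d text="⚠️ Homelab alert: check the health report"
```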

---

### Home Assistant Alerts

Since Home Assistant is already running, use it for alerts:

**Automation Example**:
```yaml
automation:
  - alias: "UPS Low Battery Alert"
    trigger:
      - platform: numeric_state
        entity_id: sensor.cyberpower_battery_charge
        below: 20
    action:
      - service: notify.mobile_app
        data:
          message: "⚠️ UPS battery at {{ states('sensor.cyberpower_battery_charge') }}%"

  - alias: "Server High Temperature"
    trigger:
      - platform: template
        value_template: "{{ states('sensor.pve_cpu_temp') | float(0) > 85 }}"
    action:
      - service: notify.mobile_app
        data:
          message: "🔥 PVE CPU temperature: {{ states('sensor.pve_cpu_temp') }}°C"
```

**Needs**: Sensors for CPU temp, disk space, etc. in Home Assistant

---

## Monitoring Scripts

### Daily Health Check

Save as `~/bin/homelab-health-check.sh`:

```bash
#!/bin/bash
# Daily homelab health check

echo "=== Homelab Health Check ==="
echo "Date: $(date)"
echo ""

echo "=== Server Status ==="
ssh pve 'uptime' 2>/dev/null || echo "PVE: UNREACHABLE"
ssh pve2 'uptime' 2>/dev/null || echo "PVE2: UNREACHABLE"
echo ""

echo "=== CPU Temperatures ==="
ssh pve 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE: $(($(cat $f)/1000))°C"; fi; done'
ssh pve2 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE2: $(($(cat $f)/1000))°C"; fi; done'
echo ""

echo "=== UPS Status ==="
ssh pve 'upsc cyberpower@localhost | grep -E "battery.charge:|battery.runtime:|ups.load:|ups.status:"'
echo ""

echo "=== ZFS Pools ==="
ssh pve 'zpool status -x' 2>/dev/null
ssh pve2 'zpool status -x' 2>/dev/null
ssh truenas 'zpool status -x vault'
echo ""

echo "=== Disk Space ==="
ssh pve 'df -h | grep -E "Filesystem|/dev/(nvme|sd)"'
ssh truenas 'df -h /mnt/vault'
echo ""

echo "=== VM Status ==="
ssh pve 'qm list | grep running | wc -l' | xargs echo "PVE VMs running:"
ssh pve2 'qm list | grep running | wc -l' | xargs echo "PVE2 VMs running:"
echo ""

echo "=== Syncthing Connections ==="
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
  "http://127.0.0.1:8384/rest/system/connections" | \
  python3 -c "import sys,json; d=json.load(sys.stdin)['connections']; \
[print(f\"{v.get('name',k[:7])}: {'UP' if v['connected'] else 'DOWN'}\") for k,v in d.items()]"
echo ""

echo "=== Check Complete ==="
```

**Run daily**:
```cron
0 9 * * * ~/bin/homelab-health-check.sh | mail -s "Homelab Health Check" hutson@example.com
```

---

### ZFS Scrub Checker

```bash
#!/bin/bash
# Check last ZFS scrub status

echo "=== ZFS Scrub Status ==="

for host in pve pve2; do
  echo "--- $host ---"
  ssh $host 'zpool status | grep -A1 scrub'
  echo ""
done

echo "--- TrueNAS ---"
ssh truenas 'zpool status vault | grep -A1 scrub'
```

---

### SMART Health Checker

```bash
#!/bin/bash
# Check SMART health on all drives

echo "=== SMART Health Check ==="

echo "--- TrueNAS Drives ---"
ssh truenas 'smartctl --scan | while read dev type; do
  echo "=== $dev ===";
  smartctl -H $dev | grep -E "SMART overall|PASSED|FAILED";
done'

echo "--- PVE Drives ---"
ssh pve 'for dev in /dev/nvme* /dev/sd*; do
  [ -e "$dev" ] && echo "=== $dev ===" && smartctl -H $dev | grep -E "SMART|PASSED|FAILED";
done'
```

---

## Dashboard Recommendations

### Grafana Dashboard Layout

**Page 1: Overview**
- Server uptime
- CPU usage (all servers)
- RAM usage (all servers)
- Disk space (all pools)
- Network traffic
- UPS status

**Page 2: Storage**
- ZFS pool health
- SMART status for all drives
- I/O latency
- Scrub progress
- Disk temperatures

**Page 3: VMs**
- VM status (up/down)
- VM resource usage
- VM disk I/O
- VM network traffic

**Page 4: Services**
- Service health checks
- HTTP response times
- Certificate expiry dates
- Syncthing sync status

---

## Implementation Plan

### Phase 1: Basic Monitoring (Week 1)

- [ ] Install Uptime Kuma or Netdata
- [ ] Add HTTP checks for all services
- [ ] Configure UPS alerts in Home Assistant
- [ ] Set up daily health check email

**Estimated Time**: 4-6 hours

---

### Phase 2: Advanced Monitoring (Week 2-3)

- [ ] Install Prometheus + Grafana
- [ ] Deploy node_exporter on all servers
- [ ] Deploy zfs_exporter
- [ ] Deploy smartmon_exporter
- [ ] Create Grafana dashboards

**Estimated Time**: 8-12 hours

---

### Phase 3: Alerting (Week 4)

- [ ] Configure Alertmanager
- [ ] Set up email/push notifications
- [ ] Create alert rules for all critical metrics
- [ ] Test all alert paths
- [ ] Document alert procedures

**Estimated Time**: 4-6 hours

---

## Related Documentation

- [GATEWAY.md](GATEWAY.md) - Gateway monitoring and troubleshooting
- [UPS.md](UPS.md) - UPS monitoring details
- [STORAGE.md](STORAGE.md) - ZFS health checks
- [SERVICES.md](SERVICES.md) - Service inventory
- [HOMEASSISTANT.md](HOMEASSISTANT.md) - Home Assistant automations
- [MAINTENANCE.md](MAINTENANCE.md) - Regular maintenance checks

---

**Last Updated**: 2026-01-02
**Status**: ⚠️ **Partial monitoring - Gateway active, other systems need implementation**
382
N8N-INTEGRATIONS.md
Normal file
@@ -0,0 +1,382 @@
|
|||||||
|
# n8n Homelab Integrations - Quick Start Guide

n8n is running on your homelab network (10.10.10.207) and can access all local services. This guide sets up useful automations.

---

## Network Access Verified

n8n can connect to:
- ✅ **Home Assistant** (10.10.10.110:8123)
- ✅ **Prometheus** (10.10.10.206:9090)
- ✅ **Grafana** (10.10.10.206:3001)
- ✅ **Syncthing** (10.10.10.200:8384)
- ✅ **PiHole** (10.10.10.10)
- ✅ **Gitea** (10.10.10.220:3000)
- ✅ **Proxmox** (10.10.10.120:8006, 10.10.10.102:8006)
- ✅ **TrueNAS** (10.10.10.200)
- ✅ **All external APIs** (via internet)

---

## Initial Setup (First-Time)

1. Open **https://n8n.htsn.io**
2. Complete the setup wizard:
   - **Owner Email:** hutson@htsn.io
   - **Owner Name:** Hutson
   - **Password:** (choose secure password)
3. Skip data sharing (optional)

---

## Credentials to Add in n8n

Go to **Settings → Credentials** and add:

### 1. Home Assistant

| Field | Value |
|-------|-------|
| **Credential Type** | Home Assistant API |
| **Host** | `http://10.10.10.110:8123` |
| **Access Token** | (get from Home Assistant) |

**Get Token:** Home Assistant → Profile → Long-Lived Access Tokens → Create Token

---

### 2. Prometheus

| Field | Value |
|-------|-------|
| **Credential Type** | HTTP Request (Generic) |
| **URL** | `http://10.10.10.206:9090` |
| **Authentication** | None |

---

### 3. Grafana

| Field | Value |
|-------|-------|
| **Credential Type** | Grafana API |
| **URL** | `http://10.10.10.206:3001` |
| **API Key** | (create in Grafana) |

**Get API Key:** Grafana → Administration → Service Accounts → Create → Add Token

---

### 4. Syncthing

| Field | Value |
|-------|-------|
| **Credential Type** | HTTP Request (Generic) |
| **URL** | `http://10.10.10.200:8384` |
| **Header Name** | `X-API-Key` |
| **Header Value** | `VFJ7XZPJoWvkYj6fKzpQxc9u3XC8KUBs` |

---

### 5. Telegram Bot

| Field | Value |
|-------|-------|
| **Credential Type** | Telegram API |
| **Access Token** | `8450212653:AAHoVBlNUuA0vtrVPMNUfSgJh_gmFMxlrBg` |

**Your Chat ID:** `1004084736`

---
### 6. Proxmox

| Field | Value |
|-------|-------|
| **Credential Type** | HTTP Request (Generic) |
| **URL** | `https://10.10.10.120:8006` |
| **Authentication** | API Token |
| **Token** | (use monitoring@pve token if needed) |

---

## Starter Workflows

### Workflow 1: Homelab Health Check (Every Hour)

**Nodes:**
1. **Schedule Trigger** (every hour)
2. **HTTP Request** → Prometheus query for down hosts
   - URL: `http://10.10.10.206:9090/api/v1/query`
   - Query param: `query=up{job=~"node.*"} == 0`
3. **If** → Check if any hosts are down
4. **Telegram** → Send alert if hosts down

**PromQL Query:**
```
up{job=~"node.*"} == 0
```
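
You can sanity-check the query from any machine before wiring it into the HTTP Request node (a quick sketch; assumes `jq` is installed, otherwise drop the pipe):

```bash
# Ask Prometheus for node targets that are currently down
curl -sG 'http://10.10.10.206:9090/api/v1/query' \
  --data-urlencode 'query=up{job=~"node.*"} == 0' | jq '.data.result'
# An empty array ([]) means every node target is up
```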

---

### Workflow 2: Daily Backup Status

**Nodes:**
1. **Schedule Trigger** (8am daily)
2. **HTTP Request** → Query Syncthing sync status
   - URL: `http://10.10.10.200:8384/rest/db/status?folder=backup`
   - Header: `X-API-Key: VFJ7XZPJoWvkYj6fKzpQxc9u3XC8KUBs`
3. **Function** → Check if folder is syncing
4. **Telegram** → Send daily status report
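
The same request is easy to test by hand (assuming the `backup` folder ID exists in Syncthing):

```bash
# Query Syncthing's per-folder status; "state" should be "idle" when fully synced
curl -s -H "X-API-Key: VFJ7XZPJoWvkYj6fKzpQxc9u3XC8KUBs" \
  "http://10.10.10.200:8384/rest/db/status?folder=backup"
```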

---

### Workflow 3: High CPU Alert

**Nodes:**
1. **Schedule Trigger** (every 5 minutes)
2. **HTTP Request** → Prometheus CPU query
   - URL: `http://10.10.10.206:9090/api/v1/query`
   - Query: `100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)`
3. **If** → CPU > 90%
4. **Telegram** → Send alert

---

### Workflow 4: UPS Power Event

**Webhook Trigger Setup:**
1. Create webhook trigger in n8n
2. Get webhook URL: `https://n8n.htsn.io/webhook/ups-alert`
3. Configure NUT to call webhook on power events

**Nodes:**
1. **Webhook Trigger** → Receive UPS event
2. **Switch** → Route by event type (on battery, low battery, online)
3. **Telegram** → Send appropriate alert
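
One way to wire NUT to the webhook is a small notify script; a sketch only - the script path and pointing upsmon's NOTIFYCMD at it are assumptions, not current config (see [UPS.md](UPS.md) for the actual NUT setup):

```bash
#!/bin/sh
# Hypothetical /usr/local/bin/ups-webhook.sh, called by upsmon's NOTIFYCMD.
# upsmon exports NOTIFYTYPE (ONBATT, LOWBATT, ONLINE, ...) and UPSNAME,
# and passes the human-readable message as $1.
curl -s -X POST "https://n8n.htsn.io/webhook/ups-alert" \
  -H "Content-Type: application/json" \
  -d "{\"event\": \"${NOTIFYTYPE:-UNKNOWN}\", \"ups\": \"${UPSNAME:-ups}\", \"message\": \"$1\"}"
```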

---

### Workflow 5: Gitea → Deploy on Push

**Nodes:**
1. **Webhook Trigger** → Gitea push event
2. **If** → Check if branch is `main`
3. **SSH** → Connect to target server
4. **Execute Command** → `git pull && docker-compose up -d`
5. **Telegram** → Notify deployment complete

---

### Workflow 6: Syncthing Folder Behind Alert

**Nodes:**
1. **Schedule Trigger** (every 30 minutes)
2. **HTTP Request** → Get all folder statuses
   - URL: `http://10.10.10.200:8384/rest/stats/folder`
3. **Function** → Check if any folder has errors or is significantly behind
4. **If** → Errors found
5. **Telegram** → Alert with folder name and status

---

### Workflow 7: Grafana Alert Forwarder

**Purpose:** Forward Grafana alerts to Telegram

**Nodes:**
1. **Webhook Trigger** → Grafana webhook
2. **Function** → Parse alert data
3. **Telegram** → Format and send alert

**Grafana Setup:**
- Contact Point → Add webhook: `https://n8n.htsn.io/webhook/grafana-alerts`

---

### Workflow 8: Daily Homelab Summary

**Nodes:**
1. **Schedule Trigger** (9am daily)
2. **Multiple HTTP Requests in parallel:**
   - Prometheus: System uptime
   - Prometheus: Average CPU usage (24h)
   - Prometheus: Disk usage
   - Syncthing: Sync status (all folders)
   - PiHole: Queries blocked (24h)
3. **Function** → Format data as summary
4. **Telegram** → Send daily report

**Example Output:**
```
🏠 Homelab Daily Summary

✅ All systems operational
⏱️ Uptime: 14 days
📊 Avg CPU: 12%
💾 Disk: 45% used
🔄 Syncthing: All folders in sync
🛡️ PiHole: 2,341 queries blocked

Last updated: 2025-12-27 09:00
```

---

### Workflow 9: VM State Change Monitor

**Nodes:**
1. **Schedule Trigger** (every 1 minute)
2. **HTTP Request** → Query Proxmox API for VM list
3. **Function** → Compare with previous state (use Set node)
4. **If** → VM state changed
5. **Telegram** → Notify VM started/stopped

---

### Workflow 10: Internet Speed Test Alert

**Nodes:**
1. **Schedule Trigger** (every 6 hours)
2. **HTTP Request** → Prometheus speedtest exporter
3. **If** → Download speed < 500 Mbps
4. **Telegram** → Alert about slow internet

---

## Advanced Integration Ideas

### Home Assistant Automations
- Turn on lights when server room temperature > 80°F
- Trigger workflows from HA button press
- Send sensor data to external services

### Proxmox Automation
- Auto-snapshot VMs before updates
- Clone VMs for testing
- Monitor resource usage and rebalance

### Media Management
- Notify when new Plex content added
- Auto-organize downloads
- Send weekly watch statistics

### Backup Monitoring
- Verify all Syncthing folders synced
- Alert on ZFS scrub errors
- Monitor snapshot ages

### Security
- Alert on failed SSH attempts (from logs)
- Monitor SSL certificate expiration
- Track unusual network traffic patterns

---

## n8n Best Practices

1. **Error Handling:** Always add error workflows to catch failures
2. **Rate Limiting:** Don't query APIs too frequently
3. **Credentials:** Never hardcode - always use credential store
4. **Testing:** Use manual trigger during development
5. **Logging:** Add Set nodes to track workflow state
6. **Backups:** Export workflows regularly (Settings → Export)

---

## Useful PromQL Queries for n8n

**CPU Usage:**
```promql
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```

**Memory Usage:**
```promql
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
```

**Disk Usage:**
```promql
(node_filesystem_size_bytes{mountpoint="/"} - node_filesystem_avail_bytes{mountpoint="/"}) / node_filesystem_size_bytes{mountpoint="/"} * 100
```

**Hosts Down:**
```promql
up{job=~"node.*"} == 0
```

**Syncthing Disconnected:**
```promql
up{job=~"syncthing.*"} == 0
```

---

## Webhook URLs

After creating webhooks in n8n, you'll get URLs like:
- `https://n8n.htsn.io/webhook/your-webhook-name`

These can be called from:
- Grafana alerts
- Home Assistant automations
- Gitea webhooks
- Custom scripts
- UPS monitoring (NUT)

---

## Testing Credentials

Test each credential after adding:
1. Create simple workflow with manual trigger
2. Add HTTP Request node with credential
3. Execute and check response
4. Verify data returned correctly

---

## Troubleshooting

**Can't reach local service:**
- Verify service IP and port
- Check if service requires HTTPS
- Test with `curl` from docker-host2 first

**Webhook not triggering:**
- Check n8n is accessible: `curl https://n8n.htsn.io/webhook/test`
- Verify webhook URL in external service
- Check n8n execution logs

**Workflow fails silently:**
- Enable "Execute on Error" workflow
- Check workflow execution list
- Add Function nodes to log data

**API authentication fails:**
- Verify credential is saved
- Check API token hasn't expired
- Test with curl manually first

---

## Next Steps

1. **Add Credentials** - Start with Telegram and Prometheus
2. **Create Test Workflow** - Simple hourly health check
3. **Test Telegram** - Verify messages arrive
4. **Build Gradually** - Add one workflow at a time
5. **Export Backups** - Save workflows regularly

---

## Resources

- **n8n Docs:** https://docs.n8n.io
- **Community Workflows:** https://n8n.io/workflows
- **Your n8n:** https://n8n.htsn.io
- **Your API Docs:** [N8N.md](N8N.md)

**Last Updated:** 2025-12-27
308
N8N.md
Normal file
@@ -0,0 +1,308 @@
# n8n - Workflow Automation

n8n is an extendable workflow automation tool deployed on docker-host2 for automating tasks across your homelab and external services.

---

## Quick Reference

| Setting | Value |
|---------|-------|
| **URL** | https://n8n.htsn.io |
| **Local IP** | 10.10.10.207:5678 |
| **Server** | docker-host2 (PVE2 VMID 302) |
| **Database** | PostgreSQL (containerized) |
| **API Endpoint** | http://10.10.10.207:5678/api/v1/ |

---

## Claude Code Integration (MCP)

### n8n-MCP Server

The n8n-MCP server gives Claude Code deep knowledge of all 545+ n8n nodes, enabling it to build complete workflows from natural language descriptions.

**Installation:** Already configured in `~/Library/Application Support/Claude/claude_desktop_config.json`

```json
{
  "mcpServers": {
    "n8n-nodes": {
      "command": "npx",
      "args": ["-y", "@czlonkowski/n8n-mcp"]
    }
  }
}
```

**What This Enables:**
- ✅ Build n8n workflows from natural language
- ✅ Get detailed help with node parameters and options
- ✅ Best practices for n8n node usage
- ✅ Debug workflow issues with full node context

**Example Prompts:**
```
"Create an n8n workflow to monitor Prometheus and send Telegram alerts"
"Build a workflow that triggers when Syncthing has errors"
"What's the best n8n node to parse JSON responses?"
```

**How It Works:**
- MCP server provides offline documentation for all n8n nodes
- No connection to your n8n instance required
- Claude builds workflows that you can then import into https://n8n.htsn.io

**Resources:**
- [n8n-MCP GitHub](https://github.com/czlonkowski/n8n-mcp)
- [MCP Documentation](https://docs.n8n.io/advanced-ai/accessing-n8n-mcp-server/)

---

## API Access

### API Key

```
X-N8N-API-KEY: eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiI3NTdiMDA5YS1hMjM2LTQ5MzUtODkwNS0xZDY1MjYzZWE2OWYiLCJpc3MiOiJuOG4iLCJhdWQiOiJwdWJsaWMtYXBpIiwiaWF0IjoxNzY2ODEwMzA3fQ.RIZAbpDa7LiUPWk48qOscJ9-d9gRAA0afMDX_V3oSVo
```

### API Examples

**List Workflows:**
```bash
curl -H "X-N8N-API-KEY: eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiI3NTdiMDA5YS1hMjM2LTQ5MzUtODkwNS0xZDY1MjYzZWE2OWYiLCJpc3MiOiJuOG4iLCJhdWQiOiJwdWJsaWMtYXBpIiwiaWF0IjoxNzY2ODEwMzA3fQ.RIZAbpDa7LiUPWk48qOscJ9-d9gRAA0afMDX_V3oSVo" \
  http://10.10.10.207:5678/api/v1/workflows
```

**Get Workflow by ID:**
```bash
curl -H "X-N8N-API-KEY: YOUR_API_KEY" \
  http://10.10.10.207:5678/api/v1/workflows/{id}
```

**Trigger Workflow:**
```bash
curl -X POST \
  -H "X-N8N-API-KEY: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"data": {"key": "value"}}' \
  http://10.10.10.207:5678/api/v1/workflows/{id}/execute
```

**API Documentation:** https://docs.n8n.io/api/

---

## Deployment Details

### Docker Compose

**Location:** `/opt/n8n/docker-compose.yml` on docker-host2

**Services:**
- `n8n` - Main application (port 5678)
- `postgres` - Database backend

**Volumes:**
- `n8n_data` - Workflow data, credentials, settings
- `postgres_data` - Database storage

### Environment Configuration

```yaml
N8N_HOST: n8n.htsn.io
N8N_PORT: 5678
N8N_PROTOCOL: https
NODE_ENV: production
WEBHOOK_URL: https://n8n.htsn.io/
GENERIC_TIMEZONE: America/Los_Angeles
DB_TYPE: postgresdb
DB_POSTGRESDB_HOST: postgres
DB_POSTGRESDB_DATABASE: n8n
DB_POSTGRESDB_USER: n8n
DB_POSTGRESDB_PASSWORD: n8n_secure_password_2024
```

### Resource Limits

- **Memory**: 512MB-1GB (soft/hard)
- **CPU**: Shared (4 vCPUs on host)

---

## Common Tasks

### Restart n8n

```bash
ssh docker-host2 'cd /opt/n8n && docker compose restart n8n'
```

### View Logs

```bash
ssh docker-host2 'docker logs -f n8n'
```

### Backup Workflows

Workflows are stored in PostgreSQL. To backup:

```bash
ssh docker-host2 'docker exec n8n-postgres pg_dump -U n8n n8n > /tmp/n8n-backup-$(date +%Y%m%d).sql'
```
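
Restoring from such a dump is the mirror image (a sketch, assuming a dump file produced by the command above; untested here):

```bash
# Feed a plain-SQL dump back into the containerized PostgreSQL
# (replace YYYYMMDD with the actual date in the filename)
ssh docker-host2 'docker exec -i n8n-postgres psql -U n8n n8n < /tmp/n8n-backup-YYYYMMDD.sql'
```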

### Update n8n

```bash
ssh docker-host2 'cd /opt/n8n && docker compose pull n8n && docker compose up -d n8n'
```

---

## Traefik Configuration

**File:** `/etc/traefik/conf.d/n8n.yaml` on CT 202

```yaml
http:
  routers:
    n8n-secure:
      entryPoints:
        - websecure
      rule: "Host(`n8n.htsn.io`)"
      service: n8n
      tls:
        certResolver: cloudflare
      priority: 50

    n8n-redirect:
      entryPoints:
        - web
      rule: "Host(`n8n.htsn.io`)"
      middlewares:
        - n8n-https-redirect
      service: n8n
      priority: 50

  services:
    n8n:
      loadBalancer:
        servers:
          - url: "http://10.10.10.207:5678"

  middlewares:
    n8n-https-redirect:
      redirectScheme:
        scheme: https
        permanent: true
```

---

## Monitoring

### Prometheus

n8n exposes metrics at `http://10.10.10.207:5678/metrics` (if enabled)
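
To check whether the endpoint is live (if it 404s, enabling it is typically done via n8n's `N8N_METRICS=true` environment variable - treat that variable as an assumption to verify against the n8n docs):

```bash
# Probe the metrics endpoint; 200 plus Prometheus text format means it's enabled
curl -s -o /dev/null -w '%{http_code}\n' http://10.10.10.207:5678/metrics
curl -s http://10.10.10.207:5678/metrics | head -5
```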

### Grafana

n8n metrics can be visualized in Grafana dashboards

### Uptime Monitoring

Add to Pulse: https://pulse.htsn.io
- Monitor: https://n8n.htsn.io
- Check interval: 60s

---

## Troubleshooting

### n8n won't start

```bash
ssh docker-host2 'docker logs n8n | tail -50'
ssh docker-host2 'docker logs n8n-postgres | tail -50'
```

### Database connection issues

```bash
# Check postgres health
ssh docker-host2 'docker exec n8n-postgres pg_isready -U n8n'

# Restart postgres
ssh docker-host2 'cd /opt/n8n && docker compose restart postgres'
```

### SSL/HTTPS issues

```bash
# Check Traefik config
ssh root@10.10.10.250 'cat /etc/traefik/conf.d/n8n.yaml'

# Reload Traefik
ssh root@10.10.10.250 'systemctl reload traefik'
```

### API not responding

```bash
# Test API locally
curl -H "X-N8N-API-KEY: YOUR_KEY" http://10.10.10.207:5678/api/v1/workflows

# Check if n8n container is healthy
ssh docker-host2 'docker ps | grep n8n'
```

---

## Integration Examples

### Homelab Automation Ideas

1. **Backup Notifications** - Send Telegram alerts when backups complete
2. **Server Monitoring** - Query Prometheus and alert on high CPU/memory
3. **Media Management** - Trigger Sonarr/Radarr downloads
4. **Home Assistant Integration** - Automate smart home workflows
5. **Git Webhooks** - Deploy changes from Gitea automatically
6. **Syncthing Monitoring** - Alert when sync folders get behind
7. **UPS Alerts** - Notify on power events from NUT

---

## Security Notes

- API key provides full access to all workflows and data
- Store API key securely (added to this doc for homelab reference)
- n8n credentials are encrypted at rest in PostgreSQL
- HTTPS enforced via Traefik
- No public internet exposure (only via Tailscale)

---

## Quick Start

**New to n8n?** Start here: **[N8N-INTEGRATIONS.md](N8N-INTEGRATIONS.md)** ⭐

This guide includes:
- ✅ Network access verification
- ✅ Credential setup for all homelab services
- ✅ 10 ready-to-use starter workflows
- ✅ Home Assistant, Prometheus, Syncthing, Telegram integrations
- ✅ Troubleshooting tips

---

## Related Documentation

- [n8n Homelab Integrations Guide](N8N-INTEGRATIONS.md) - **START HERE**
- [docker-host2 VM details](VMS.md)
- [Traefik reverse proxy](TRAEFIK.md)
- [IP Assignments](IP-ASSIGNMENTS.md)
- [Pulse Setup](PULSE-SETUP.md)

**Last Updated:** 2025-12-26
509
POWER-MANAGEMENT.md
Normal file
@@ -0,0 +1,509 @@
# Power Management and Optimization

Documentation of power optimizations applied to reduce idle power consumption and heat generation.

## Overview

Combined estimated power draw: **~1615-1880W under load**, **~530-730W idle** (totals from the estimates below)

Through various optimizations, we've reduced idle power consumption by approximately **150-300W** compared to default settings.

---

## Power Draw Estimates

### PVE (10.10.10.120)

| Component | Idle | Load | TDP |
|-----------|------|------|-----|
| Threadripper PRO 3975WX | 150-200W | 400-500W | 280W |
| NVIDIA TITAN RTX | 2-3W | 250W | 280W |
| NVIDIA Quadro P2000 | 25W | 70W | 75W |
| RAM (128 GB DDR4) | 30-40W | 30-40W | - |
| Storage (NVMe + SSD) | 20-30W | 40-50W | - |
| HBAs, fans, misc | 20-30W | 20-30W | - |
| **Total** | **250-350W** | **800-940W** | - |

### PVE2 (10.10.10.102)

| Component | Idle | Load | TDP |
|-----------|------|------|-----|
| Threadripper PRO 3975WX | 150-200W | 400-500W | 280W |
| NVIDIA RTX A6000 | 11W | 280W | 300W |
| RAM (128 GB DDR4) | 30-40W | 30-40W | - |
| Storage (NVMe + HDD) | 20-30W | 40-50W | - |
| Fans, misc | 15-20W | 15-20W | - |
| **Total** | **226-330W** | **765-890W** | - |

### Combined

| Metric | Idle | Load |
|--------|------|------|
| Servers | 476-680W | 1565-1830W |
| Network gear | ~50W | ~50W |
| **Total** | **~530-730W** | **~1615-1880W** |
| **UPS Load** | 40-55% | 120-140% ⚠️ |

**Note**: UPS capacity is 1320W. Under heavy load, servers can exceed UPS capacity, which is acceptable since high load is rare.

---

## Optimizations Applied

### 1. KSMD Disabled (2024-12-17)

**KSM** (Kernel Same-page Merging) scans memory to deduplicate identical pages across VMs.

**Problem**:
- KSMD was consuming 44-57% CPU continuously on PVE
- Caused CPU temp to rise from 74°C to 83°C
- **Negative profit**: More power spent scanning than saved from deduplication

**Solution**: Disabled KSM permanently

**Configuration**:

**Systemd service**: `/etc/systemd/system/disable-ksm.service`
```ini
[Unit]
Description=Disable KSM (Kernel Same-page Merging)
After=multi-user.target

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'echo 0 > /sys/kernel/mm/ksm/run'
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
```

**Enable and start**:
```bash
systemctl daemon-reload
systemctl enable --now disable-ksm
systemctl mask ksmtuned  # Prevent re-enabling
```

**Verify**:
```bash
# KSM should be disabled (run=0)
cat /sys/kernel/mm/ksm/run  # Should output: 0

# ksmd should show 0% CPU
ps aux | grep ksmd
```

**Savings**: ~60-80W, plus it prevents the CPU temperature climb from 74°C to 83°C

**⚠️ Important**: Proxmox updates sometimes re-enable KSM. If CPU is unexpectedly hot, check:
```bash
cat /sys/kernel/mm/ksm/run
# If 1, disable it:
echo 0 > /sys/kernel/mm/ksm/run
systemctl mask ksmtuned
```

---

### 2. CPU Governor Optimization (2024-12-16)

Default CPU governor keeps cores at max frequency even when idle, wasting power.

#### PVE: `amd-pstate-epp` Driver

**Driver**: `amd-pstate-epp` (modern AMD P-state driver)
**Governor**: `powersave`
**EPP**: `balance_power`

**Configuration**:

**Systemd service**: `/etc/systemd/system/cpu-powersave.service`
```ini
[Unit]
Description=Set CPU governor to powersave with balance_power EPP
After=multi-user.target

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'for cpu in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo powersave > $cpu; done'
ExecStart=/bin/sh -c 'for cpu in /sys/devices/system/cpu/cpu*/cpufreq/energy_performance_preference; do echo balance_power > $cpu; done'
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
```

**Enable**:
```bash
systemctl daemon-reload
systemctl enable --now cpu-powersave
```

**Verify**:
```bash
# Check governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# Output: powersave

# Check EPP
cat /sys/devices/system/cpu/cpu0/cpufreq/energy_performance_preference
# Output: balance_power

# Check current frequency (should be low when idle)
grep MHz /proc/cpuinfo | head -5
# Should show ~1700-2200 MHz idle, up to 4000 MHz under load
```

#### PVE2: `acpi-cpufreq` Driver

**Driver**: `acpi-cpufreq` (older ACPI driver)
**Governor**: `schedutil` (adaptive, better than powersave for this driver)

**Configuration**:

**Systemd service**: `/etc/systemd/system/cpu-powersave.service`
```ini
[Unit]
Description=Set CPU governor to schedutil
After=multi-user.target

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'for cpu in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo schedutil > $cpu; done'
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
```

**Enable**:
```bash
systemctl daemon-reload
systemctl enable --now cpu-powersave
```

**Verify**:
```bash
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# Output: schedutil

grep MHz /proc/cpuinfo | head -5
# Should show ~1700-2200 MHz idle
```

**Savings**: ~60-120W combined (CPUs now idle at 1.7-2.2 GHz instead of 4 GHz)

**Performance impact**: Minimal - CPU still boosts to max frequency under load

---

### 3. GPU Power States (2024-12-16)

GPUs automatically enter low-power states when idle. Verified optimal.

| GPU | Location | Idle Power | P-State | Notes |
|-----|----------|------------|---------|-------|
| RTX A6000 | PVE2 | 11W | P8 | Excellent idle power |
| TITAN RTX | PVE | 2-3W | P8 | Excellent idle power |
| Quadro P2000 | PVE | 25W | P0 | Plex keeps it active |

**Check GPU power state**:
```bash
# Via nvidia-smi (if installed in VM)
ssh lmdev1 'nvidia-smi --query-gpu=name,power.draw,pstate --format=csv'

# Expected output:
# name, power.draw [W], pstate
# NVIDIA TITAN RTX, 2.50 W, P8

# Via lspci (from Proxmox host - shows link speed, not power)
ssh pve 'lspci | grep -i nvidia'
```

**P-States**:
- **P0**: Maximum performance
- **P8**: Minimum power (idle)

**No action needed** - GPUs automatically manage power states.

**Savings**: N/A (already optimal)

---

### 4. Syncthing Rescan Intervals (2024-12-16)

Aggressive 60-second rescans were keeping TrueNAS VM at 86% CPU constantly.

**Changed**:
- Large folders: 60s → **3600s** (1 hour)
- Affected: downloads (38GB), documents (11GB), desktop (7.2GB), movies, pictures, notes, config

**Configuration**: Via Syncthing UI on each device
- Settings → Folders → [Folder Name] → Advanced → Rescan Interval

**Savings**: ~60-80W (TrueNAS CPU usage dropped from 86% to <10%)

**Trade-off**: Changes take up to 1 hour to detect instead of 1 minute
- Still acceptable for most use cases
- Manual rescan available if needed: `curl -X POST "http://localhost:8384/rest/db/scan?folder=FOLDER" -H "X-API-Key: API_KEY"`
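
If you'd rather script the interval change than click through the UI, Syncthing's config REST API can patch a folder; a sketch, assuming a reasonably recent Syncthing with the `/rest/config` endpoints and a valid folder ID:

```bash
# Set a folder's rescan interval to 1 hour via the config API
# ("downloads" is an illustrative folder ID - substitute your own)
curl -s -X PATCH "http://localhost:8384/rest/config/folders/downloads" \
  -H "X-API-Key: API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"rescanIntervalS": 3600}'
```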

---

### 5. ksmtuned Disabled (2024-12-16)

**ksmtuned** is the daemon that tunes KSM parameters. Even with KSM disabled, the tuning daemon was still running.

**Solution**: Stopped and disabled on both servers

```bash
systemctl stop ksmtuned
systemctl disable ksmtuned
systemctl mask ksmtuned  # Prevent re-enabling
```

**Savings**: ~2-5W

---

### 6. HDD Spindown on PVE2 (2024-12-16)

**Problem**: `local-zfs2` pool (2x WD Red 6TB HDD) had only 768 KB used but drives spinning 24/7

**Solution**: Configure 30-minute spindown timeout

**Udev rule**: `/etc/udev/rules.d/69-hdd-spindown.rules`
```udev
# Spin down WD Red 6TB drives after 30 minutes idle
ACTION=="add|change", KERNEL=="sd[a-z]", ATTRS{model}=="WDC WD60EFRX-68L*", RUN+="/sbin/hdparm -S 241 /dev/%k"
```

**hdparm value**: 241 = 30 minutes
- For values 1-240, the timeout is `value * 5 seconds` (e.g. 120 = 10 minutes)
- For values 241-251, the timeout is `(value - 240) * 30 minutes`, so 241 = 30 minutes

**Apply rule**:
```bash
udevadm control --reload-rules
udevadm trigger

# Verify drives have spindown set
hdparm -I /dev/sda | grep -i standby
hdparm -I /dev/sdb | grep -i standby
```

**Check if drives are spun down**:
```bash
hdparm -C /dev/sda
# Output: drive state is: standby (spun down)
# or: drive state is: active/idle (spinning)
```

**Savings**: ~10-16W when spun down (8W per drive)

**Trade-off**: 5-10 second delay when accessing pool after spindown

---

## Potential Optimizations (Not Yet Applied)

### PCIe ASPM (Active State Power Management)

**Benefit**: Reduce power of idle PCIe devices
**Risk**: May cause stability issues with some devices
**Estimated savings**: 5-15W

**Test**:
```bash
# Check current ASPM state
lspci -vv | grep -i aspm

# Enable ASPM (test first)
# Add to kernel cmdline: pcie_aspm=force
# Edit /etc/default/grub:
GRUB_CMDLINE_LINUX_DEFAULT="quiet pcie_aspm=force"

# Update grub
update-grub
reboot
```

### NMI Watchdog Disable

**Benefit**: Reduce CPU wakeups
**Risk**: Harder to debug kernel hangs
**Estimated savings**: 1-3W

**Test**:
```bash
# Disable NMI watchdog
echo 0 > /proc/sys/kernel/nmi_watchdog

# Make permanent (add to kernel cmdline)
# Edit /etc/default/grub:
GRUB_CMDLINE_LINUX_DEFAULT="quiet nmi_watchdog=0"

update-grub
reboot
```

---

## Monitoring

### CPU Frequency

```bash
# Current frequency on all cores
ssh pve 'grep MHz /proc/cpuinfo | head -10'

# Governor
ssh pve 'cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor'

# Available governors
ssh pve 'cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors'
```

### CPU Temperature

```bash
# PVE
ssh pve 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE Tctl: $(($(cat $f)/1000))°C"; fi; done'

# PVE2
ssh pve2 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE2 Tctl: $(($(cat $f)/1000))°C"; fi; done'
```

**Healthy temps**: 70-80°C under load
**Warning**: >85°C
**Throttle**: 90°C (Tctl max for Threadripper PRO)

### GPU Power Draw

```bash
# If nvidia-smi installed in VM
ssh lmdev1 'nvidia-smi --query-gpu=name,power.draw,power.limit,pstate --format=csv'

# Sample output:
# name, power.draw [W], power.limit [W], pstate
# NVIDIA TITAN RTX, 2.50 W, 280.00 W, P8
```

### Power Consumption (UPS)

```bash
# Check UPS load percentage
ssh pve 'upsc cyberpower@localhost ups.load'

# Battery runtime (seconds)
ssh pve 'upsc cyberpower@localhost battery.runtime'

# Full UPS status
ssh pve 'upsc cyberpower@localhost'
```

See [UPS.md](UPS.md) for more UPS monitoring details.

### ZFS ARC Memory Usage

```bash
# PVE
ssh pve 'arc_summary | grep -A5 "ARC size"'

# TrueNAS
ssh truenas 'arc_summary | grep -A5 "ARC size"'
```

**ARC** (Adaptive Replacement Cache) uses RAM for ZFS caching. Adjust if needed:

```bash
# Limit ARC to 32 GB (example)
# Edit /etc/modprobe.d/zfs.conf:
options zfs zfs_arc_max=34359738368

# Apply (reboot required)
update-initramfs -u
reboot
```

---

## Troubleshooting

### CPU Not Downclocking

```bash
# Check current governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

# Should be: powersave (PVE) or schedutil (PVE2)
# If not, systemd service may have failed

# Check service status
systemctl status cpu-powersave

# Manually set governor (temporary)
echo powersave | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

# Check frequency
grep MHz /proc/cpuinfo | head -5
```

### High Idle Power After Update

**Common causes**:
1. **KSM re-enabled** after Proxmox update
   - Check: `cat /sys/kernel/mm/ksm/run`
   - Fix: `echo 0 > /sys/kernel/mm/ksm/run && systemctl mask ksmtuned`

2. **CPU governor reset** to default
   - Check: `cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor`
   - Fix: `systemctl restart cpu-powersave`

3. **GPU stuck in high-performance mode**
   - Check: `nvidia-smi --query-gpu=pstate --format=csv`
   - Fix: Restart VM or power cycle GPU

### HDDs Won't Spin Down

```bash
# Check spindown setting
hdparm -I /dev/sda | grep -i standby

# Set spindown manually (temporary)
hdparm -S 241 /dev/sda

# Check if drive is idle (ZFS may keep it active)
zpool iostat -v 1 5  # Watch for activity

# Check what's accessing the drive
lsof | grep /mnt/pool
```

---

## Power Optimization Summary

| Optimization | Savings | Applied | Notes |
|--------------|---------|---------|-------|
| **KSMD disabled** | 60-80W | ✅ | Also reduces CPU temp significantly |
| **CPU governor** | 60-120W | ✅ | PVE: powersave+balance_power, PVE2: schedutil |
| **GPU power states** | 0W | ✅ | Already optimal (automatic) |
| **Syncthing rescans** | 60-80W | ✅ | Reduced TrueNAS CPU usage |
| **ksmtuned disabled** | 2-5W | ✅ | Minor but easy win |
| **HDD spindown** | 10-16W | ✅ | Only when drives idle |
| PCIe ASPM | 5-15W | ❌ | Not yet tested |
| NMI watchdog | 1-3W | ❌ | Not yet tested |
| **Total savings** | **~150-300W** | - | Significant reduction |

---

## Related Documentation

- [UPS.md](UPS.md) - UPS capacity and power monitoring
- [STORAGE.md](STORAGE.md) - HDD spindown configuration
- [VMS.md](VMS.md) - VM resource allocation

---

**Last Updated**: 2025-12-22
69
PULSE-SETUP.md
Normal file
@@ -0,0 +1,69 @@
# Add n8n and docker-host2 to Pulse Monitoring

Pulse automatically monitors based on Prometheus targets, but you can also add custom HTTP monitors.

## Quick Steps

1. Open **https://pulse.htsn.io** in your browser
2. Login if required
3. Click **"+ Add Monitor"** or **"New Monitor"**

---

## Monitor: n8n

| Field | Value |
|-------|-------|
| **Name** | n8n Workflow Automation |
| **URL** | https://n8n.htsn.io |
| **Check Interval** | 60 seconds |
| **Monitor Type** | HTTP/HTTPS |
| **Expected Status** | 200 |
| **Timeout** | 10 seconds |
| **Alert After** | 2 failed checks |

---

## Monitor: docker-host2

| Field | Value |
|-------|-------|
| **Name** | docker-host2 (node_exporter) |
| **URL** | http://10.10.10.207:9100/metrics |
| **Check Interval** | 60 seconds |
| **Monitor Type** | HTTP |
| **Expected Status** | 200 |
| **Expected Content** | `node_exporter` |
| **Timeout** | 5 seconds |
| **Alert After** | 2 failed checks |

---

## Optional: docker-host2 SSH

| Field | Value |
|-------|-------|
| **Name** | docker-host2 SSH |
| **Host** | 10.10.10.207 |
| **Port** | 22 |
| **Monitor Type** | TCP Port |
| **Check Interval** | 60 seconds |
| **Timeout** | 5 seconds |

---

## Verification

After adding monitors, you should see:
- ✅ Green status for both monitors
- Response time graphs
- Uptime percentage
- Alert history (should be empty)
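
If a monitor comes up red, it helps to check the targets by hand before digging into Pulse itself (a quick sketch; `nc` assumes netcat is installed):

```bash
# n8n should answer 200 over HTTPS
curl -sI https://n8n.htsn.io | head -n1

# node_exporter should serve metrics containing its own name
curl -s http://10.10.10.207:9100/metrics | grep -m1 node_exporter

# SSH port should accept a TCP connection
nc -zv 10.10.10.207 22
```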

Access Pulse dashboard: **https://pulse.htsn.io**

---

**Note:** Pulse may already be monitoring these services via Prometheus integration. Check existing monitors before adding duplicates.

**Last Updated:** 2025-12-27
149
README.md
Normal file
@@ -0,0 +1,149 @@
# Homelab Documentation

Documentation for Hutson's home infrastructure - two Proxmox servers running VMs and containers for home automation, media, development, and AI workloads.

## 🚀 Quick Start

**New to this homelab?** Start here:
1. [CLAUDE.md](CLAUDE.md) - Quick reference guide for common tasks
2. [SSH-ACCESS.md](SSH-ACCESS.md) - How to connect to all systems
3. [IP-ASSIGNMENTS.md](IP-ASSIGNMENTS.md) - What's at what IP address
4. [SERVICES.md](SERVICES.md) - What services are running

**Claude Code Session?** Read [CLAUDE.md](CLAUDE.md) first - it's your command center.

## 📚 Documentation Index

### Infrastructure

| Document | Description |
|----------|-------------|
| [GATEWAY.md](GATEWAY.md) | UniFi gateway monitoring, watchdog services, troubleshooting |
| [VMS.md](VMS.md) | Complete VM/LXC inventory, specs, GPU passthrough |
| [HARDWARE.md](HARDWARE.md) | Server specs, GPUs, network cards, HBAs |
| [STORAGE.md](STORAGE.md) | ZFS pools, NFS/SMB shares, capacity planning |
| [NETWORK.md](NETWORK.md) | Bridges, VLANs, MTU config, Tailscale VPN |
| [POWER-MANAGEMENT.md](POWER-MANAGEMENT.md) | CPU governors, GPU power states, optimizations |
| [UPS.md](UPS.md) | UPS configuration, NUT monitoring, power failure handling |

### Services & Applications

| Document | Description |
|----------|-------------|
| [SERVICES.md](SERVICES.md) | Complete service inventory with URLs and credentials |
| [TRAEFIK.md](TRAEFIK.md) | Reverse proxy setup, adding services, SSL certificates |
| [HOMEASSISTANT.md](HOMEASSISTANT.md) | Home Assistant API, automations, integrations |
| [SYNCTHING.md](SYNCTHING.md) | File sync across all devices, API access, troubleshooting |
| [SALTBOX.md](#) | Media automation stack (Plex, *arr apps) (coming soon) |

### Access & Security

| Document | Description |
|----------|-------------|
| [SSH-ACCESS.md](SSH-ACCESS.md) | SSH keys, host aliases, password auth, QEMU agent |
| [IP-ASSIGNMENTS.md](IP-ASSIGNMENTS.md) | Complete IP address assignments for all devices |
| [SECURITY.md](#) | Firewall, access control, certificates (coming soon) |

### Operations

| Document | Description |
|----------|-------------|
| [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md) | 🚨 Backup strategy, disaster recovery (CRITICAL) |
| [MAINTENANCE.md](MAINTENANCE.md) | Regular procedures, update schedules, testing checklists |
| [MONITORING.md](MONITORING.md) | Health monitoring, alerts, dashboard recommendations |
| [DISASTER-RECOVERY.md](#) | Recovery procedures (coming soon) |

### Reference

| Document | Description |
|----------|-------------|
| [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md) | Storage enclosure SES commands, LCC troubleshooting |
| [SHELL-ALIASES.md](SHELL-ALIASES.md) | ZSH aliases for Claude Code sessions |

## 🖥️ System Overview

### Servers

- **PVE** (10.10.10.120) - Primary Proxmox server
  - AMD Threadripper PRO 3975WX (32-core)
  - 128 GB RAM
  - NVIDIA Quadro P2000 + TITAN RTX

- **PVE2** (10.10.10.102) - Secondary Proxmox server
  - AMD Threadripper PRO 3975WX (32-core)
  - 128 GB RAM
  - NVIDIA RTX A6000

### Key Services

| Service | Location | URL |
|---------|----------|-----|
| **Proxmox** | PVE | https://pve.htsn.io |
| **TrueNAS** | VM 100 | https://truenas.htsn.io |
| **Plex** | Saltbox VM | https://plex.htsn.io |
| **Home Assistant** | VM 110 | https://homeassistant.htsn.io |
| **Gitea** | VM 300 | https://git.htsn.io |
| **Pi-hole** | CT 200 | http://10.10.10.10/admin |
| **Traefik** | CT 202 | http://10.10.10.250:8080 |

[See IP-ASSIGNMENTS.md for complete list](IP-ASSIGNMENTS.md)

## 🔥 Emergency Procedures

### Power Failure
1. UPS provides ~15 min runtime at typical load
2. At 2 min remaining, NUT triggers graceful VM shutdown
3. When power returns, servers auto-boot and start VMs in order

See [UPS.md](UPS.md) for details.

### Service Down

```bash
# Quick health check (run from Mac Mini)
ssh pve 'qm list'    # Check VMs on PVE
ssh pve2 'qm list'   # Check VMs on PVE2
ssh pve 'pct list'   # Check containers

# Syncthing status
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
  "http://127.0.0.1:8384/rest/system/connections"

# Restart a VM
ssh pve 'qm stop VMID && qm start VMID'
```

See [CLAUDE.md](CLAUDE.md) for complete troubleshooting runbooks.

## 📞 Getting Help

**Claude Code Assistant**: Start a session in this directory - all context is available in CLAUDE.md

**Key Contacts**:
- Homelab Owner: Hutson
- Git Repo: https://git.htsn.io/hutson/homelab-docs
- Local Path: `~/Projects/homelab`

## 🔄 Recent Changes

See [CHANGELOG.md](#) (coming soon) or the Changelog section in [CLAUDE.md](CLAUDE.md).

## 📝 Contributing

When updating docs:
1. Keep CLAUDE.md as quick reference only
2. Move detailed content to specialized docs
3. Update cross-references
4. Test all commands before committing
5. Add entries to changelog

```bash
cd ~/Projects/homelab
git add -A
git commit -m "Update documentation: <description>"
git push
```

---

**Last Updated**: 2026-01-02
591
SERVICES.md
Normal file
@@ -0,0 +1,591 @@
# Services Inventory

Complete inventory of all services running across the homelab infrastructure.

## Overview

| Category | Services | Location | Access |
|----------|----------|----------|--------|
| **Infrastructure** | Proxmox, TrueNAS, Pi-hole, Traefik | VMs/CTs | Web UI + SSH |
| **Media** | Plex, *arr apps, downloaders | Saltbox VM | Web UI |
| **Development** | Gitea, Docker services | VMs | Web UI |
| **Home Automation** | Home Assistant, Happy Coder | VMs | Web UI + API |
| **Monitoring** | UPS (NUT), Syncthing, Pulse | Various | API |

**Total Services**: 25+ running services

---

## Service URLs Quick Reference

| Service | URL | Authentication | Purpose |
|---------|-----|----------------|---------|
| **Proxmox** | https://pve.htsn.io:8006 | Username + 2FA | VM management |
| **TrueNAS** | https://truenas.htsn.io | Username/password | NAS management |
| **Plex** | https://plex.htsn.io | Plex account | Media streaming |
| **Home Assistant** | https://homeassistant.htsn.io | Username/password | Home automation |
| **Gitea** | https://git.htsn.io | Username/password | Git repositories |
| **Excalidraw** | https://excalidraw.htsn.io | None (public) | Whiteboard |
| **Happy Coder** | https://happy.htsn.io | QR code auth | Remote Claude sessions |
| **Pi-hole** | http://10.10.10.10/admin | Password | DNS/ad blocking |
| **Traefik** | http://10.10.10.250:8080 | None (internal) | Reverse proxy dashboard |
| **Pulse** | https://pulse.htsn.io | Unknown | Monitoring dashboard |
| **Copyparty** | https://copyparty.htsn.io | Unknown | File sharing |
| **FindShyt** | https://findshyt.htsn.io | Unknown | Custom app |

---

## Infrastructure Services

### Proxmox VE (PVE & PVE2)

**Purpose**: Virtualization platform, VM/CT host
**Location**: Physical servers (10.10.10.120, 10.10.10.102)
**Access**: https://pve.htsn.io:8006, SSH
**Version**: Unknown (check: `pveversion`)

**Key Features**:
- Web-based management
- VM and LXC container support
- ZFS storage pools
- Clustering (2-node)
- API access

**Common Operations**:
```bash
# List VMs
ssh pve 'qm list'

# Create VM
ssh pve 'qm create VMID --name myvm ...'

# Backup VM
ssh pve 'vzdump VMID --dumpdir /var/lib/vz/dump'
```

**See**: [VMS.md](VMS.md)

---

### TrueNAS SCALE (VM 100)

**Purpose**: Central file storage, NFS/SMB shares
**Location**: VM on PVE (10.10.10.200)
**Access**: https://truenas.htsn.io, SSH
**Version**: TrueNAS SCALE (check version in UI)

**Key Features**:
- ZFS storage management
- NFS exports
- SMB shares
- Syncthing hub
- Snapshot management

**Storage Pools**:
- `vault`: Main data pool on EMC enclosure
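
A quick pool health check from the CLI (a sketch; assumes the `truenas` SSH alias from [SSH-ACCESS.md](SSH-ACCESS.md) and that the pool is named `vault` as above):

```bash
# "all pools are healthy" is the output you want from -x
ssh truenas 'zpool status -x'

# Capacity and fragmentation at a glance
ssh truenas 'zpool list vault'
```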

**Shares** (needs documentation):
- NFS exports for Saltbox media
- SMB shares for Windows access
- Syncthing sync folders

**See**: [STORAGE.md](STORAGE.md)

---

### Pi-hole (CT 200)

**Purpose**: Network-wide DNS server and ad blocker
**Location**: LXC on PVE (10.10.10.10)
**Access**: http://10.10.10.10/admin
**Version**: Unknown

**Configuration**:
- **Upstream DNS**: Cloudflare (1.1.1.1)
- **Blocklists**: Unknown count
- **Queries**: All network DNS traffic
- **DHCP**: Disabled (router handles DHCP)

**Stats** (example):
```bash
ssh pihole 'pihole -c -e'   # Stats
ssh pihole 'pihole status'  # Status
```

**Common Tasks**:
- Update blocklists: `ssh pihole 'pihole -g'`
- Whitelist domain: `ssh pihole 'pihole -w example.com'`
- View logs: `ssh pihole 'pihole -t'`

---

### Traefik (CT 202)

**Purpose**: Reverse proxy for all public-facing services
**Location**: LXC on PVE (10.10.10.250)
**Access**: http://10.10.10.250:8080/dashboard/
**Version**: Unknown (check: `traefik version`)

**Managed Services**:
- All *.htsn.io domains (except Saltbox services)
- SSL/TLS certificates via Let's Encrypt
- HTTP → HTTPS redirects

**See**: [TRAEFIK.md](TRAEFIK.md) for complete configuration
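
A quick liveness check (a sketch; assumes Traefik runs as a systemd service inside CT 202, which matches the `systemctl reload traefik` usage elsewhere in these docs):

```bash
# Service state and recent log lines
ssh root@10.10.10.250 'systemctl status traefik --no-pager | head -8'

# Dashboard should answer on the internal port
curl -sI http://10.10.10.250:8080/dashboard/ | head -n1
```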

---

## Media Services (Saltbox VM)

All media services run in Docker on the Saltbox VM (10.10.10.100).

### Plex Media Server

**Purpose**: Media streaming platform
**URL**: https://plex.htsn.io
**Access**: Plex account

**Features**:
- Hardware transcoding (TITAN RTX)
- Libraries: Movies, TV, Music
- Remote access enabled
- Managed by Saltbox

**Media Storage**:
- Source: TrueNAS NFS mounts
- Location: `/mnt/unionfs/`

**Common Tasks**:
```bash
# View Plex status
ssh saltbox 'docker logs -f plex'

# Restart Plex
ssh saltbox 'docker restart plex'

# Scan library
# (via Plex UI: Settings → Library → Scan)
```

---

### *arr Apps (Media Automation)

Running on Saltbox VM, managed via Traefik-Saltbox.

| Service | Purpose | URL | Notes |
|---------|---------|-----|-------|
| **Sonarr** | TV show automation | sonarr.htsn.io | Monitors, downloads, organizes TV |
| **Radarr** | Movie automation | radarr.htsn.io | Monitors, downloads, organizes movies |
| **Lidarr** | Music automation | lidarr.htsn.io | Monitors, downloads, organizes music |
| **Overseerr** | Request management | overseerr.htsn.io | User requests for media |
| **Bazarr** | Subtitle management | bazarr.htsn.io | Downloads subtitles |

**Downloaders**:
| Service | Purpose | URL |
|---------|---------|-----|
| **SABnzbd** | Usenet downloader | sabnzbd.htsn.io |
| **NZBGet** | Usenet downloader | nzbget.htsn.io |
| **qBittorrent** | Torrent client | qbittorrent.htsn.io |

**Indexers**:
| Service | Purpose | URL |
|---------|---------|-----|
| **Jackett** | Torrent indexer proxy | jackett.htsn.io |
| **NZBHydra2** | Usenet indexer proxy | nzbhydra2.htsn.io |

---

### Supporting Media Services

| Service | Purpose | URL |
|---------|---------|-----|
| **Tautulli** | Plex statistics | tautulli.htsn.io |
| **Organizr** | Service dashboard | organizr.htsn.io |
| **Authelia** | SSO authentication | auth.htsn.io |

---

## Development Services

### Gitea (VM 300)

**Purpose**: Self-hosted Git server
**Location**: VM on PVE2 (10.10.10.220)
**URL**: https://git.htsn.io
**Access**: Username/password

**Repositories**:
- homelab-docs (this documentation)
- Personal projects
- Private repos

**Common Tasks**:
```bash
# SSH to Gitea VM
ssh gitea-vm

# View logs
ssh gitea-vm 'journalctl -u gitea -f'

# Backup
ssh gitea-vm 'gitea dump -c /etc/gitea/app.ini'
```

**See**: Gitea documentation for API usage

---

### Docker Services (docker-host VM)

Running on VM 206 (10.10.10.206).

| Service | URL | Purpose | Port |
|---------|-----|---------|------|
| **Excalidraw** | https://excalidraw.htsn.io | Whiteboard/diagramming | 8080 |
| **Happy Server** | https://happy.htsn.io | Happy Coder relay | 3002 |
| **Pulse** | https://pulse.htsn.io | Monitoring dashboard | 7655 |

**Docker Compose files**: `/opt/{excalidraw,happy-server,pulse}/docker-compose.yml`

**Managing services**:
```bash
ssh docker-host 'docker ps'
ssh docker-host 'cd /opt/excalidraw && sudo docker-compose logs -f'
ssh docker-host 'cd /opt/excalidraw && sudo docker-compose restart'
```

---

## Home Automation

### Home Assistant (VM 110)

**Purpose**: Smart home automation platform
**Location**: VM on PVE (10.10.10.110)
**URL**: https://homeassistant.htsn.io
**Access**: Username/password

**Integrations**:
- UPS monitoring (NUT sensors)
- Unknown other integrations (needs documentation)

**Sensors**:
- `sensor.cyberpower_battery_charge`
- `sensor.cyberpower_load`
- `sensor.cyberpower_battery_runtime`
- `sensor.cyberpower_status`
|
||||||
|
|
||||||
|
**See**: [HOMEASSISTANT.md](HOMEASSISTANT.md)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Happy Coder Relay (docker-host)
|
||||||
|
|
||||||
|
**Purpose**: Self-hosted relay server for Happy Coder mobile app
|
||||||
|
**Location**: docker-host (10.10.10.206)
|
||||||
|
**URL**: https://happy.htsn.io
|
||||||
|
**Access**: QR code authentication
|
||||||
|
|
||||||
|
**Stack**:
|
||||||
|
- Happy Server (Node.js)
|
||||||
|
- PostgreSQL (user/session data)
|
||||||
|
- Redis (real-time events)
|
||||||
|
- MinIO (file/image storage)
|
||||||
|
|
||||||
|
**Clients**:
|
||||||
|
- Mac Mini (Happy daemon)
|
||||||
|
- Mobile app (iOS/Android)
|
||||||
|
|
||||||
|
**Credentials**:
|
||||||
|
- Master Secret: `3ccbfd03a028d3c278da7d2cf36d99b94cd4b1fecabc49ab006e8e89bc7707ac`
|
||||||
|
- PostgreSQL: `happy` / `happypass`
|
||||||
|
- MinIO: `happyadmin` / `happyadmin123`
|
||||||
|
|
||||||
|
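To spot-check the relay end to end, something like this works (a hedged sketch; the container name filter is an assumption based on the stack above):

```bash
# List Happy-related containers and their state (names assumed to contain "happy")
ssh docker-host 'docker ps --filter name=happy --format "{{.Names}}: {{.Status}}"'

# Hit the public health endpoint through Traefik
curl -Is https://happy.htsn.io/health | head -1
```
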
---

## File Sync & Storage

### Syncthing

**Purpose**: File synchronization across all devices

**Devices**:
- Mac Mini (10.10.10.125) - Hub
- MacBook - Mobile sync
- TrueNAS (10.10.10.200) - Central storage
- Windows PC (10.10.10.150) - Windows sync
- Phone (10.10.10.54) - Mobile sync

**API Keys**:
- Mac Mini: `oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5`
- MacBook: `qYkNdVLwy9qZZZ6MqnJr7tHX7KKdxGMJ`
- Phone: `Xxz3jDT4akUJe6psfwZsbZwG2LhfZuDM`

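These keys drive the Syncthing REST API; for example, listing the Mac Mini's active device connections (a minimal sketch run locally on the Mac Mini; assumes `jq` is installed):

```bash
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
  "http://127.0.0.1:8384/rest/system/connections" | jq '.connections | keys'
```
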
**Synced Folders**:
- documents (~11 GB)
- downloads (~38 GB)
- pictures
- notes
- desktop (~7.2 GB)
- config
- movies

**See**: [SYNCTHING.md](SYNCTHING.md)

---

### Copyparty (VM 201)

**Purpose**: Simple HTTP file sharing
**Location**: VM on PVE (10.10.10.201)
**URL**: https://copyparty.htsn.io
**Access**: Unknown

**Features**:
- Web-based file upload/download
- Lightweight

---

## Trading & AI Services

### AI Trading Platform (trading-vm)

**Purpose**: Algorithmic trading with AI models
**Location**: VM 301 on PVE2 (10.10.10.221)
**URL**: https://aitrade.htsn.io (if accessible)
**GPU**: RTX A6000 (48GB VRAM)

**Components**:
- Trading algorithms
- AI models for market prediction
- Real-time data feeds
- Backtesting infrastructure

**Access**: SSH only (no web UI documented)

---

### LM Dev (lmdev1)

**Purpose**: AI/LLM development environment
**Location**: VM 111 on PVE (10.10.10.111)
**URL**: https://lmdev.htsn.io (if accessible)
**GPU**: TITAN RTX (shared with Saltbox)

**Installed**:
- CUDA toolkit
- Python 3.11+
- PyTorch, TensorFlow
- Hugging Face transformers

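To confirm the shared TITAN RTX is visible inside the VM (assumes the NVIDIA driver is present, which the CUDA toolkit implies):

```bash
ssh lmdev1 'nvidia-smi'   # Should list the TITAN RTX and driver/CUDA versions
```
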
---

## Monitoring & Utilities

### UPS Monitoring (NUT)

**Purpose**: Monitor UPS status and trigger shutdowns
**Location**: PVE (master), PVE2 (slave)
**Access**: Command-line (`upsc`)

**Key Commands**:
```bash
ssh pve 'upsc cyberpower@localhost'
ssh pve 'upsc cyberpower@localhost ups.load'
ssh pve 'upsc cyberpower@localhost battery.runtime'
```

**Home Assistant Integration**: UPS sensors exposed

**See**: [UPS.md](UPS.md)

---

### Pulse Monitoring

**Purpose**: Unknown monitoring dashboard
**Location**: docker-host (10.10.10.206:7655)
**URL**: https://pulse.htsn.io
**Access**: Unknown

**Needs documentation**:
- What does it monitor?
- How to configure?
- Authentication?

---

### Tailscale VPN

**Purpose**: Secure remote access to homelab

**Subnet Routers**:
- PVE (100.113.177.80) - Primary
- UCG-Fiber (100.94.246.32) - Failover

**Devices on Tailscale**:
- Mac Mini: 100.108.89.58
- PVE: 100.113.177.80
- TrueNAS: 100.100.94.71
- Pi-hole: 100.112.59.128

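Quick connectivity checks from the primary subnet router (standard Tailscale CLI; the ping target is the TrueNAS tailnet IP from the list above):

```bash
ssh pve 'tailscale status'               # Peers, routes, and connection state
ssh pve 'tailscale ping 100.100.94.71'   # Verify TrueNAS is reachable over the tailnet
```
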
**See**: [NETWORK.md](NETWORK.md)

---

## Custom Applications

### FindShyt (CT 205)

**Purpose**: Unknown custom application
**Location**: LXC on PVE (10.10.10.8)
**URL**: https://findshyt.htsn.io
**Access**: Unknown

**Needs documentation**:
- What is this app?
- How to use it?
- Tech stack?

---

## Service Dependencies

### Critical Dependencies

```
TrueNAS
├── Plex (media files via NFS)
├── *arr apps (downloads via NFS)
├── Syncthing (central storage hub)
└── Backups (if configured)

Traefik (CT 202)
├── All *.htsn.io services
└── SSL certificate management

Pi-hole
└── DNS for entire network

Router
└── Gateway for all services
```

### Startup Order

**See [VMS.md](VMS.md)** for VM boot order configuration (example sketch after this list):
1. TrueNAS (storage first)
2. Saltbox (depends on TrueNAS NFS)
3. Other VMs
4. Containers

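Boot order is configured per VM in Proxmox; a minimal sketch (the VMID matches TrueNAS, but the 60-second delay is illustrative):

```bash
# Start TrueNAS (VM 100) first at boot, then wait 60s before starting dependents
ssh pve 'qm set 100 --startup order=1,up=60'
```
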
---

## Service Port Reference

### Well-Known Ports

| Port | Service | Protocol | Purpose |
|------|---------|----------|---------|
| 22 | SSH | TCP | Remote access |
| 53 | Pi-hole | UDP | DNS queries |
| 80 | Traefik | TCP | HTTP (redirects to 443) |
| 443 | Traefik | TCP | HTTPS |
| 3000 | Gitea | TCP | Git HTTP/S |
| 8006 | Proxmox | TCP | Web UI |
| 32400 | Plex | TCP | Plex Media Server |
| 8384 | Syncthing | TCP | Web UI |
| 22000 | Syncthing | TCP | Sync protocol |

### Internal Ports

| Port | Service | Purpose |
|------|---------|---------|
| 3002 | Happy Server | Relay backend |
| 5432 | PostgreSQL | Happy Server DB |
| 6379 | Redis | Happy Server cache |
| 7655 | Pulse | Monitoring |
| 8080 | Excalidraw | Whiteboard |
| 8080 | Traefik | Dashboard |
| 9000 | MinIO | Object storage |

---

## Service Health Checks

### Quick Health Check Script

```bash
#!/bin/bash
# Check all critical services

echo "=== Infrastructure ==="
curl -Is https://pve.htsn.io:8006 | head -1
curl -Is https://truenas.htsn.io | head -1
curl -I http://10.10.10.10/admin 2>/dev/null | head -1
echo ""

echo "=== Media Services ==="
curl -Is https://plex.htsn.io | head -1
curl -Is https://sonarr.htsn.io | head -1
curl -Is https://radarr.htsn.io | head -1
echo ""

echo "=== Development ==="
curl -Is https://git.htsn.io | head -1
curl -Is https://excalidraw.htsn.io | head -1
echo ""

echo "=== Home Automation ==="
curl -Is https://homeassistant.htsn.io | head -1
curl -Is https://happy.htsn.io/health | head -1
```

### Service-Specific Checks

```bash
# Proxmox VMs
ssh pve 'qm list | grep running'

# Docker services
ssh docker-host 'docker ps --format "{{.Names}}: {{.Status}}"'

# Syncthing
curl -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
  "http://127.0.0.1:8384/rest/system/status"

# UPS
ssh pve 'upsc cyberpower@localhost ups.status'
```

---

## Service Credentials

**Location**: See individual service documentation

| Service | Credentials Location | Notes |
|---------|---------------------|-------|
| Proxmox | Proxmox UI | Username + 2FA |
| TrueNAS | TrueNAS UI | Root password |
| Plex | Plex account | Managed externally |
| Gitea | Gitea DB | Self-managed |
| Pi-hole | `/etc/pihole/setupVars.conf` | Admin password |
| Happy Server | [CLAUDE.md](CLAUDE.md) | Master secret, DB passwords |

**⚠️ Security Note**: Never commit credentials to Git. Use proper secrets management.

---

## Related Documentation

- [VMS.md](VMS.md) - VM/service locations
- [TRAEFIK.md](TRAEFIK.md) - Reverse proxy config
- [IP-ASSIGNMENTS.md](IP-ASSIGNMENTS.md) - Service IP addresses
- [NETWORK.md](NETWORK.md) - Network configuration
- [MONITORING.md](MONITORING.md) - Monitoring setup (coming soon)

---

**Last Updated**: 2025-12-22
**Status**: ⚠️ Incomplete - many services need documentation (passwords, features, usage)
475
SSH-ACCESS.md
Normal file
@@ -0,0 +1,475 @@
# SSH Access

Documentation for SSH access to all homelab systems, including key authentication, password authentication for special cases, and QEMU guest agent usage.

## Overview

Most systems use **SSH key authentication** with the `~/.ssh/homelab` key. A few special cases require **password authentication** due to platform limitations: currently the Windows PC (the UniFi router previously did too, but now accepts key auth; see below).

**SSH Password**: `GrilledCh33s3#` (for systems without key auth)

---

## SSH Key Authentication (Primary Method)

### SSH Key Configuration

SSH keys are configured in `~/.ssh/config` on both Mac Mini and MacBook.

**Key file**: `~/.ssh/homelab` (Ed25519 key)

**Key deployed to**: All Proxmox hosts, VMs, and LXCs (13 total hosts)

### Host Aliases

Use these convenient aliases instead of IP addresses:

| Host Alias | IP | User | Type | Notes |
|------------|-----|------|------|-------|
| `ucg-fiber` / `gateway` | 10.10.10.1 | root | UniFi Gateway | Router/firewall |
| `pve` | 10.10.10.120 | root | Proxmox | Primary server |
| `pve2` | 10.10.10.102 | root | Proxmox | Secondary server |
| `truenas` | 10.10.10.200 | root | VM | NAS/storage |
| `saltbox` | 10.10.10.100 | hutson | VM | Media automation |
| `lmdev1` | 10.10.10.111 | hutson | VM | AI/LLM development |
| `docker-host` | 10.10.10.206 | hutson | VM | Docker services (PVE) |
| `docker-host2` | 10.10.10.207 | hutson | VM | Docker services (PVE2) - MetaMCP, n8n |
| `fs-dev` | 10.10.10.5 | hutson | VM | Development |
| `copyparty` | 10.10.10.201 | hutson | VM | File sharing |
| `gitea-vm` | 10.10.10.220 | hutson | VM | Git server |
| `trading-vm` | 10.10.10.221 | hutson | VM | AI trading platform |
| `pihole` | 10.10.10.10 | root | LXC | DNS/Ad blocking |
| `traefik` | 10.10.10.250 | root | LXC | Reverse proxy |
| `findshyt` | 10.10.10.8 | root | LXC | Custom app |

### Usage Examples

```bash
# List VMs on PVE
ssh pve 'qm list'

# Check ZFS pool on TrueNAS
ssh truenas 'zpool status vault'

# List Docker containers on Saltbox
ssh saltbox 'docker ps'

# Check Pi-hole status
ssh pihole 'pihole status'

# View Traefik config
ssh pve 'pct exec 202 -- cat /etc/traefik/traefik.yaml'
```

### SSH Config File

**Location**: `~/.ssh/config`

**Example entries**:

```sshconfig
# Proxmox Servers
Host pve
    HostName 10.10.10.120
    User root
    IdentityFile ~/.ssh/homelab

Host pve2
    HostName 10.10.10.102
    User root
    IdentityFile ~/.ssh/homelab
    # Post-quantum KEX causes MTU issues - use classic
    KexAlgorithms curve25519-sha256

# VMs
Host truenas
    HostName 10.10.10.200
    User root
    IdentityFile ~/.ssh/homelab

Host saltbox
    HostName 10.10.10.100
    User hutson
    IdentityFile ~/.ssh/homelab

Host lmdev1
    HostName 10.10.10.111
    User hutson
    IdentityFile ~/.ssh/homelab

Host docker-host
    HostName 10.10.10.206
    User hutson
    IdentityFile ~/.ssh/homelab

Host docker-host2
    HostName 10.10.10.207
    User hutson
    IdentityFile ~/.ssh/homelab

Host fs-dev
    HostName 10.10.10.5
    User hutson
    IdentityFile ~/.ssh/homelab

Host copyparty
    HostName 10.10.10.201
    User hutson
    IdentityFile ~/.ssh/homelab

Host gitea-vm
    HostName 10.10.10.220
    User hutson
    IdentityFile ~/.ssh/homelab

Host trading-vm
    HostName 10.10.10.221
    User hutson
    IdentityFile ~/.ssh/homelab

# LXC Containers
Host pihole
    HostName 10.10.10.10
    User root
    IdentityFile ~/.ssh/homelab

Host traefik
    HostName 10.10.10.250
    User root
    IdentityFile ~/.ssh/homelab

Host findshyt
    HostName 10.10.10.8
    User root
    IdentityFile ~/.ssh/homelab
```

---

## Password Authentication (Special Cases)

Some systems don't support SSH key auth or have other limitations.

### UniFi Router (10.10.10.1) - NOW USES KEY AUTH

**Host alias**: `ucg-fiber` or `gateway`

**Status**: SSH key authentication now works (as of 2026-01-02)

**Commands**:

```bash
# Run command on router (using SSH key)
ssh ucg-fiber 'hostname'

# Get ARP table (all device IPs)
ssh ucg-fiber 'cat /proc/net/arp'

# Check Tailscale status
ssh ucg-fiber 'tailscale status'

# Check memory usage
ssh ucg-fiber 'free -m'
```

**Note**: Key may need to be re-deployed after firmware updates if UniFi clears authorized_keys.

### Windows PC (10.10.10.150)

**OS**: Windows with OpenSSH server
**User**: `claude`
**Password**: `GrilledCh33s3#`
**Shell**: PowerShell (not bash)

**Commands**:

```bash
# Run PowerShell command
sshpass -p 'GrilledCh33s3#' ssh claude@10.10.10.150 'Get-Process | Select -First 5'

# Check Syncthing status
sshpass -p 'GrilledCh33s3#' ssh claude@10.10.10.150 'Get-Process -Name syncthing -ErrorAction SilentlyContinue'

# Restart Syncthing
sshpass -p 'GrilledCh33s3#' ssh claude@10.10.10.150 'Stop-Process -Name syncthing -Force; Start-ScheduledTask -TaskName "Syncthing"'
```

**⚠️ Important**: Use `;` (semicolon) to chain PowerShell commands, NOT `&&` (bash syntax).

**Why not key auth?**: Could be configured, but password auth works and is simpler for Windows.

---

## QEMU Guest Agent

Most VMs have the QEMU guest agent installed, allowing command execution without SSH.

### VMs with QEMU Agent

| VMID | VM Name | Use Case |
|------|---------|----------|
| 100 | truenas | Execute commands, check ZFS |
| 101 | saltbox | Execute commands, Docker mgmt |
| 105 | fs-dev | Execute commands |
| 111 | lmdev1 | Execute commands |
| 201 | copyparty | Execute commands |
| 206 | docker-host | Execute commands |
| 300 | gitea-vm | Execute commands |
| 301 | trading-vm | Execute commands |

### VM WITHOUT QEMU Agent

**VMID 110 (homeassistant)**: No QEMU agent installed
- Access via web UI only
- Or install SSH server manually if needed

### Usage Examples

**Basic syntax**:
```bash
ssh pve 'qm guest exec VMID -- bash -c "COMMAND"'
```

**Examples**:

```bash
# Check ZFS pool on TrueNAS (without SSH)
ssh pve 'qm guest exec 100 -- bash -c "zpool status vault"'

# Get VM IP addresses
ssh pve 'qm guest exec 100 -- bash -c "ip addr"'

# Check Docker containers on Saltbox
ssh pve 'qm guest exec 101 -- bash -c "docker ps"'

# Run multi-line command
ssh pve 'qm guest exec 100 -- bash -c "df -h; free -h; uptime"'
```

**When to use QEMU agent vs SSH**:
- ✅ Use **SSH** for interactive sessions, file editing, complex tasks
- ✅ Use **QEMU agent** for one-off commands, when SSH is down, or VM has no network
- ⚠️ QEMU agent is slower for multiple commands (use SSH instead)

---

## Troubleshooting SSH Issues

### Connection Refused

```bash
# Check if SSH service is running
ssh pve 'systemctl status sshd'

# Check if port 22 is open
nc -zv 10.10.10.XXX 22

# Check firewall
ssh pve 'iptables -L -n | grep 22'
```

### Permission Denied (Public Key)

```bash
# Verify key file exists
ls -la ~/.ssh/homelab

# Check key permissions (should be 600)
chmod 600 ~/.ssh/homelab

# Test SSH key auth verbosely
ssh -vvv -i ~/.ssh/homelab root@10.10.10.120

# Check authorized_keys on remote (via QEMU agent if SSH broken)
ssh pve 'qm guest exec VMID -- bash -c "cat ~/.ssh/authorized_keys"'
```

### Slow SSH Connection (PVE2 Issue)

**Problem**: SSH to PVE2 hangs for 30+ seconds before connecting
**Cause**: MTU mismatch (vmbr0=9000, nic1=1500) causing post-quantum KEX packet fragmentation
**Fix**: Use classic KEX algorithm instead

**In `~/.ssh/config`**:
```sshconfig
Host pve2
    HostName 10.10.10.102
    User root
    IdentityFile ~/.ssh/homelab
    # Avoid post-quantum mlkem768x25519-sha256
    KexAlgorithms curve25519-sha256
```

**Permanent fix**: Set `nic1` MTU to 9000 in `/etc/network/interfaces` on PVE2

---

## Adding SSH Keys to New Systems

### Linux (VMs/LXCs)

```bash
# Copy public key to new host
ssh-copy-id -i ~/.ssh/homelab user@hostname

# Or manually:
ssh user@hostname 'mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys' < ~/.ssh/homelab.pub
ssh user@hostname 'chmod 700 ~/.ssh && chmod 600 ~/.ssh/authorized_keys'
```

### LXC Containers (Root User)

```bash
# Via pct exec from Proxmox host
ssh pve 'pct exec CTID -- bash -c "mkdir -p /root/.ssh"'
ssh pve 'pct exec CTID -- bash -c "echo \"$(cat ~/.ssh/homelab.pub)\" >> /root/.ssh/authorized_keys"'
ssh pve 'pct exec CTID -- bash -c "chmod 700 /root/.ssh && chmod 600 /root/.ssh/authorized_keys"'

# Also enable PermitRootLogin in sshd_config
ssh pve 'pct exec CTID -- bash -c "sed -i \"s/^#*PermitRootLogin.*/PermitRootLogin prohibit-password/\" /etc/ssh/sshd_config"'
ssh pve 'pct exec CTID -- bash -c "systemctl restart sshd"'
```

### VMs (via QEMU Agent)

```bash
# Add key via QEMU agent (if SSH not working)
ssh pve 'qm guest exec VMID -- bash -c "mkdir -p ~/.ssh"'
ssh pve 'qm guest exec VMID -- bash -c "echo \"$(cat ~/.ssh/homelab.pub)\" >> ~/.ssh/authorized_keys"'
ssh pve 'qm guest exec VMID -- bash -c "chmod 700 ~/.ssh && chmod 600 ~/.ssh/authorized_keys"'
```

---

## SSH Key Management

### Rotate SSH Keys (Future)

When rotating SSH keys:

1. Generate new key pair:
```bash
ssh-keygen -t ed25519 -f ~/.ssh/homelab-new -C "homelab-new"
```

2. Deploy new key to all hosts (keep old key for now):
```bash
for host in pve pve2 truenas saltbox lmdev1 docker-host fs-dev copyparty gitea-vm trading-vm pihole traefik findshyt; do
  ssh-copy-id -i ~/.ssh/homelab-new $host
done
```

3. Update `~/.ssh/config` to use new key:
```sshconfig
IdentityFile ~/.ssh/homelab-new
```

4. Test all connections:
```bash
for host in pve pve2 truenas saltbox lmdev1 docker-host fs-dev copyparty gitea-vm trading-vm pihole traefik findshyt; do
  echo "Testing $host..."
  ssh $host 'hostname'
done
```

5. Remove old key from all hosts once confirmed working

---

## Quick Reference

### Common SSH Operations

```bash
# Execute command on remote host
ssh host 'command'

# Execute multiple commands
ssh host 'command1 && command2'

# Copy file to remote
scp file host:/path/

# Copy file from remote
scp host:/path/file ./

# Execute command on Proxmox VM (via QEMU agent)
ssh pve 'qm guest exec VMID -- bash -c "command"'

# Execute command on LXC
ssh pve 'pct exec CTID -- command'

# Interactive shell
ssh host

# SSH with X11 forwarding
ssh -X host
```

### Troubleshooting Commands

```bash
# Test SSH with verbose output
ssh -vvv host

# Check SSH service status (remote)
ssh host 'systemctl status sshd'

# Check SSH config (local)
ssh -G host

# Test port connectivity
nc -zv hostname 22
```

---

## Security Best Practices

### Current Security Posture

✅ **Good**:
- SSH keys used instead of passwords (where possible)
- Keys use Ed25519 (modern, secure algorithm)
- Root login disabled on VMs (use sudo instead)
- SSH keys have proper permissions (600)

⚠️ **Could Improve**:
- [ ] Disable password authentication on all hosts (force key-only)
- [ ] Use SSH certificate authority instead of individual keys
- [ ] Set up SSH bastion host (jump server)
- [ ] Enable 2FA for SSH (via PAM + Google Authenticator)
- [ ] Implement SSH key rotation policy (annually)

### Hardening SSH (Future)

For additional security, consider:

```sshconfig
# /etc/ssh/sshd_config (on remote hosts)
# Note: sshd does not allow trailing comments after values
# No root password login
PermitRootLogin prohibit-password
# Disable password auth entirely
PasswordAuthentication no
# Only allow key auth
PubkeyAuthentication yes
AuthorizedKeysFile .ssh/authorized_keys
# Limit auth attempts
MaxAuthTries 3
# Limit concurrent sessions
MaxSessions 10
# Timeout idle sessions
ClientAliveInterval 300
# Drop after 2 missed keepalives
ClientAliveCountMax 2
```

**Apply after editing**:
```bash
systemctl restart sshd
```

---

## Related Documentation

- [VMS.md](VMS.md) - Complete VM/CT inventory
- [NETWORK.md](NETWORK.md) - Network configuration
- [IP-ASSIGNMENTS.md](IP-ASSIGNMENTS.md) - IP addresses for all hosts
- [SECURITY.md](#) - Security policies (coming soon)

---

**Last Updated**: 2025-12-22
510
STORAGE.md
Normal file
@@ -0,0 +1,510 @@
# Storage Architecture

Documentation of all storage pools, datasets, shares, and capacity planning across the homelab.

## Overview

### Storage Distribution

| Location | Type | Capacity | Purpose |
|----------|------|----------|---------|
| **PVE** | NVMe + SSD mirrors | ~9 TB usable | VM storage, fast IO |
| **PVE2** | NVMe + HDD mirrors | ~6+ TB usable | VM storage, bulk data |
| **TrueNAS** | ZFS pool + EMC enclosure | ~12+ TB usable | Central file storage, NFS/SMB |

---

## PVE (10.10.10.120) Storage Pools

### nvme-mirror1 (Primary Fast Storage)
- **Type**: ZFS mirror
- **Devices**: 2x Sabrent Rocket Q NVMe
- **Capacity**: 3.6 TB usable
- **Purpose**: High-performance VM storage
- **Used By**:
  - Critical VMs requiring fast IO
  - Database workloads
  - Development environments

**Check status**:
```bash
ssh pve 'zpool status nvme-mirror1'
ssh pve 'zpool list nvme-mirror1'
```

### nvme-mirror2 (Secondary Fast Storage)
- **Type**: ZFS mirror
- **Devices**: 2x Kingston SFYRD 2TB NVMe
- **Capacity**: 1.8 TB usable
- **Purpose**: Additional fast VM storage
- **Used By**: TBD

**Check status**:
```bash
ssh pve 'zpool status nvme-mirror2'
ssh pve 'zpool list nvme-mirror2'
```

### rpool (Root Pool)
- **Type**: ZFS mirror
- **Devices**: 2x Samsung 870 QVO 4TB SSD
- **Capacity**: 3.6 TB usable
- **Purpose**: Proxmox OS, container storage, VM backups
- **Used By**:
  - Proxmox root filesystem
  - LXC containers
  - Local VM backups

**Check status**:
```bash
ssh pve 'zpool status rpool'
ssh pve 'df -h /var/lib/vz'
```

### Storage Pool Usage Summary (PVE)

**Get current usage**:
```bash
ssh pve 'zpool list'
ssh pve 'pvesm status'
```

---

## PVE2 (10.10.10.102) Storage Pools

### nvme-mirror3 (Fast Storage)
- **Type**: ZFS mirror
- **Devices**: 2x NVMe (model unknown)
- **Capacity**: Unknown (needs investigation)
- **Purpose**: High-performance VM storage
- **Used By**: Trading VM (301), other VMs

**Check status**:
```bash
ssh pve2 'zpool status nvme-mirror3'
ssh pve2 'zpool list nvme-mirror3'
```

### local-zfs2 (Bulk Storage)
- **Type**: ZFS mirror
- **Devices**: 2x WD Red 6TB HDD
- **Capacity**: ~6 TB usable
- **Purpose**: Bulk/archival storage
- **Power Management**: 30-minute spindown configured
  - Saves ~10-16W when idle
  - Udev rule: `/etc/udev/rules.d/69-hdd-spindown.rules`
  - Command: `hdparm -S 241` (30 min)

**Notes**:
- Pool had only 768 KB used as of 2024-12-16
- Drives configured to spin down after 30 min idle
- Good for archival, NOT for active workloads

**Check status**:
```bash
ssh pve2 'zpool status local-zfs2'
ssh pve2 'zpool list local-zfs2'

# Check if drives are spun down
ssh pve2 'hdparm -C /dev/sdX'   # Shows active/standby
```

---

## TrueNAS (VM 100 @ 10.10.10.200) - Central Storage

### ZFS Pool: vault

**Primary storage pool** for all shared data.

**Devices**: ❓ Needs investigation
- EMC storage enclosure with multiple drives
- SAS connection via LSI SAS2308 HBA (passed through to VM)

**Capacity**: ❓ Needs investigation

**Check pool status**:
```bash
ssh truenas 'zpool status vault'
ssh truenas 'zpool list vault'

# Get detailed capacity
ssh truenas 'zfs list -o name,used,avail,refer,mountpoint'
```

### Datasets (Known)

Based on Syncthing configuration, likely datasets:

| Dataset | Purpose | Synced Devices | Notes |
|---------|---------|----------------|-------|
| vault/documents | Personal documents | Mac Mini, MacBook, Windows PC, Phone | ~11 GB |
| vault/downloads | Downloads folder | Mac Mini, TrueNAS | ~38 GB |
| vault/pictures | Photos | Mac Mini, MacBook, Phone | Unknown size |
| vault/notes | Note files | Mac Mini, MacBook, Phone | Unknown size |
| vault/desktop | Desktop sync | Unknown | 7.2 GB |
| vault/movies | Movie library | Unknown | Unknown size |
| vault/config | Config files | Mac Mini, MacBook | Unknown size |

**Get complete dataset list**:
```bash
ssh truenas 'zfs list -r vault'
```

### NFS/SMB Shares

**Status**: ❓ Not documented

**Needs investigation**:
```bash
# List NFS exports
ssh truenas 'showmount -e localhost'

# List SMB shares
ssh truenas 'smbclient -L localhost -N'

# Via TrueNAS API/UI
# Sharing → Unix Shares (NFS)
# Sharing → Windows Shares (SMB)
```

**Expected shares**:
- Media libraries for Plex (on Saltbox VM)
- Document storage
- VM backups?
- ISO storage?

### EMC Storage Enclosure

**Model**: EMC KTN-STL4 (or similar)
**Connection**: SAS via LSI SAS2308 HBA (passthrough to TrueNAS VM)
**Drives**: ❓ Unknown count and capacity

**See [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md)** for:
- SES commands
- Fan control
- LCC (Link Control Card) troubleshooting
- Maintenance procedures

**Check enclosure status**:
```bash
ssh truenas 'sg_ses --page=0x02 /dev/sgX'    # Enclosure status page
ssh truenas 'smartctl --scan'                # List all drives
```

---

## Storage Network Architecture

### Internal Storage Network (10.10.20.0/24)

**Purpose**: Dedicated network for NFS/iSCSI traffic to reduce congestion on the main network.

**Bridge**: vmbr3 on PVE (virtual bridge, no physical NIC)
**Subnet**: 10.10.20.0/24
**DHCP**: No
**Gateway**: No (internal only, no internet)

**Connected VMs**:
- TrueNAS VM (secondary NIC)
- Saltbox VM (secondary NIC) - for NFS mounts
- Other VMs needing storage access

**Configuration**:
```bash
# On TrueNAS VM - check second NIC
ssh truenas 'ip addr show enp6s19'

# On Saltbox - check NFS mounts
ssh saltbox 'mount | grep nfs'
```

**Benefits**:
- Separates storage traffic from general network
- Prevents NFS/SMB from saturating main network
- Better performance for storage-heavy workloads

---

## Storage Capacity Planning

### Current Usage (Estimate)

**Needs actual audit**:
```bash
# PVE pools
ssh pve 'zpool list -o name,size,alloc,free'

# PVE2 pools
ssh pve2 'zpool list -o name,size,alloc,free'

# TrueNAS vault pool
ssh truenas 'zpool list vault'

# Get detailed breakdown
ssh truenas 'zfs list -r vault -o name,used,avail'
```

### Growth Rate

**Needs tracking** - recommend monthly snapshots of capacity:

```bash
#!/bin/bash
# Save as ~/bin/storage-capacity-report.sh

DATE=$(date +%Y-%m-%d)
REPORT=~/Backups/storage-reports/capacity-$DATE.txt

mkdir -p ~/Backups/storage-reports

echo "Storage Capacity Report - $DATE" > $REPORT
echo "================================" >> $REPORT
echo "" >> $REPORT

echo "PVE Pools:" >> $REPORT
ssh pve 'zpool list' >> $REPORT
echo "" >> $REPORT

echo "PVE2 Pools:" >> $REPORT
ssh pve2 'zpool list' >> $REPORT
echo "" >> $REPORT

echo "TrueNAS Pools:" >> $REPORT
ssh truenas 'zpool list' >> $REPORT
echo "" >> $REPORT

echo "TrueNAS Datasets:" >> $REPORT
ssh truenas 'zfs list -r vault -o name,used,avail' >> $REPORT

echo "Report saved to $REPORT"
```

**Run monthly via cron**:
```cron
0 9 1 * * ~/bin/storage-capacity-report.sh
```

### Expansion Planning

**When to expand**:
- Pool reaches 80% capacity
- Performance degrades
- New workloads require more space

**Expansion options**:
1. Add drives to existing pools (if mirrors, add a mirror vdev - sketch after this list)
2. Add new NVMe drives to PVE/PVE2
3. Expand EMC enclosure (add more drives)
4. Add second EMC enclosure
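A minimal sketch of option 1, assuming a mirrored pool and two new disks (pool and device paths are placeholders):

```bash
# Grow a mirrored pool by adding another two-disk mirror vdev
zpool add vault mirror /dev/disk/by-id/NEW-DISK-1 /dev/disk/by-id/NEW-DISK-2
```
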

**Cost estimates**: TBD

---

## ZFS Health Monitoring

### Daily Health Checks

```bash
# Check for errors on all pools
ssh pve 'zpool status -x'   # Shows only unhealthy pools
ssh pve2 'zpool status -x'
ssh truenas 'zpool status -x'

# Check scrub status
ssh pve 'zpool status | grep scrub'
ssh pve2 'zpool status | grep scrub'
ssh truenas 'zpool status | grep scrub'
```

### Scrub Schedule

**Recommended**: Monthly scrub on all pools

**Configure scrub**:
```bash
# Via Proxmox UI: Node → Disks → ZFS → Select pool → Scrub
# Or via cron:
0 2 1 * * /sbin/zpool scrub nvme-mirror1
0 2 1 * * /sbin/zpool scrub rpool
```

**On TrueNAS**:
- Configure via UI: Storage → Pools → Scrub Tasks
- Recommended: 1st of every month at 2 AM

### SMART Monitoring

**Check drive health**:
```bash
# PVE
ssh pve 'smartctl -a /dev/nvme0'
ssh pve 'smartctl -a /dev/sda'

# TrueNAS
ssh truenas 'smartctl --scan'
ssh truenas 'smartctl -a /dev/sdX'   # For each drive
```

**Configure SMART tests**:
- TrueNAS UI: Tasks → S.M.A.R.T. Tests
- Recommended: Weekly short test, monthly long test

### Alerts

**Set up email alerts for** (see the ZED sketch after this list):
- ZFS pool errors
- SMART test failures
- Pool capacity > 80%
- Scrub failures
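On PVE/PVE2 the stock mechanism for these mails is the ZFS Event Daemon; a minimal sketch of `/etc/zfs/zed.d/zed.rc` (the address is a placeholder; TrueNAS alerting is configured in its own UI instead):

```bash
# /etc/zfs/zed.d/zed.rc - ZED mails on pool errors, scrub results, etc.
ZED_EMAIL_ADDR="admin@example.com"   # placeholder destination address
ZED_NOTIFY_VERBOSE=1                 # also mail when scrubs finish healthy

# Apply with: systemctl restart zfs-zed
```
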
---

## Storage Performance Tuning

### ZFS ARC (Cache)

**Check ARC usage**:
```bash
ssh pve 'arc_summary'
ssh truenas 'arc_summary'
```

**Tuning** (if needed):
- PVE/PVE2: Set max ARC in `/etc/modprobe.d/zfs.conf` (sketch below)
- TrueNAS: Configure via UI (System → Advanced → Tunables)
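A minimal sketch for the PVE side, assuming a 16 GiB cap is wanted (the value is illustrative):

```bash
# Cap ARC at 16 GiB (value in bytes); takes effect after initramfs rebuild + reboot
echo "options zfs zfs_arc_max=17179869184" > /etc/modprobe.d/zfs.conf
update-initramfs -u
```
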

### NFS Performance

**Mount options** (on clients like Saltbox):
```
rsize=131072,wsize=131072,hard,timeo=600,retrans=2,vers=3
```

**Verify NFS mounts**:
```bash
ssh saltbox 'mount | grep nfs'
```

### Record Size Optimization

**Different workloads need different record sizes**:
- VMs: 64K (good for VM disks; note the ZFS dataset default is 128K)
- Databases: 8K or 16K
- Media files: 1M (large sequential reads)

**Set record size** (on TrueNAS datasets):
```bash
ssh truenas 'zfs set recordsize=1M vault/movies'
```

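To confirm what a dataset is currently using:

```bash
ssh truenas 'zfs get recordsize vault/movies'
```
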
---

## Disaster Recovery

### Pool Recovery

**If a pool fails to import**:
```bash
# Try importing with different name
zpool import -f -N poolname newpoolname

# Check pool with readonly
zpool import -f -o readonly=on poolname

# Force import (last resort)
zpool import -f -F poolname
```

### Drive Replacement

**When a drive fails**:
```bash
# Identify failed drive
zpool status poolname

# Replace drive
zpool replace poolname old-device new-device

# Monitor resilver
watch zpool status poolname
```

### Data Recovery

**If pool is completely lost**:
1. Restore from offsite backup (see [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md))
2. Recreate pool structure
3. Restore data

**Critical**: This is why we need offsite backups!

---

## Quick Reference

### Common Commands

```bash
# Pool status
zpool status [poolname]
zpool list

# Dataset usage
zfs list
zfs list -r vault

# Check pool health (only unhealthy)
zpool status -x

# Scrub pool
zpool scrub poolname

# Get pool IO stats
zpool iostat -v 1

# Snapshot management
zfs snapshot poolname/dataset@snapname
zfs list -t snapshot
zfs rollback poolname/dataset@snapname
zfs destroy poolname/dataset@snapname
```

### Storage Locations by Use Case

| Use Case | Recommended Storage | Why |
|----------|---------------------|-----|
| VM OS disk | nvme-mirror1 (PVE) | Fastest IO |
| Database | nvme-mirror1/2 | Low latency |
| Media files | TrueNAS vault | Large capacity |
| Development | nvme-mirror2 | Fast, mid-tier |
| Containers | rpool | Good performance |
| Backups | TrueNAS or rpool | Large capacity |
| Archive | local-zfs2 (PVE2) | Cheap, can spin down |

---

## Investigation Needed

- [ ] Get complete TrueNAS dataset list
- [ ] Document NFS/SMB share configuration
- [ ] Inventory EMC enclosure drives (count, capacity, model)
- [ ] Document current pool usage percentages
- [ ] Set up monthly capacity reports
- [ ] Configure ZFS scrub schedules
- [ ] Set up storage health alerts

---

## Related Documentation

- [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md) - Backup and snapshot strategy
- [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md) - Storage enclosure maintenance
- [VMS.md](VMS.md) - VM storage assignments
- [NETWORK.md](NETWORK.md) - Storage network configuration

---

**Last Updated**: 2025-12-22
673
TRAEFIK.md
Normal file
@@ -0,0 +1,673 @@
# Traefik Reverse Proxy

Documentation for Traefik reverse proxy setup, SSL certificates, and deploying new public services.

## Overview

There are **TWO separate Traefik instances** handling different services. Understanding which one to use is critical.

| Instance | Location | IP | Purpose | Managed By |
|----------|----------|-----|---------|------------|
| **Traefik-Primary** | CT 202 | **10.10.10.250** | General services | Manual config files |
| **Traefik-Saltbox** | VM 101 (Docker) | **10.10.10.100** | Saltbox services only | Saltbox Ansible |

---

## ⚠️ CRITICAL RULE: Which Traefik to Use

### When Adding ANY New Service:

✅ **USE Traefik-Primary (CT 202 @ 10.10.10.250)** - For ALL new services
❌ **DO NOT touch Traefik-Saltbox** - Unless you're modifying Saltbox itself

### Why This Matters:

- **Traefik-Saltbox** has complex Saltbox-managed configs (Ansible-generated)
- Messing with it breaks Plex, Sonarr, Radarr, and all media services
- Each Traefik has its own Let's Encrypt certificates
- Mixing them causes certificate conflicts and routing issues

---

## Traefik-Primary (CT 202) - For New Services

### Configuration

**Location**: Container 202 on PVE (10.10.10.250)
**Config Directory**: `/etc/traefik/`
**Main Config**: `/etc/traefik/traefik.yaml`
**Dynamic Configs**: `/etc/traefik/conf.d/*.yaml`

### Access Traefik Config

```bash
# From Mac Mini:
ssh pve 'pct exec 202 -- cat /etc/traefik/traefik.yaml'
ssh pve 'pct exec 202 -- ls /etc/traefik/conf.d/'

# Edit a service config:
ssh pve 'pct exec 202 -- vi /etc/traefik/conf.d/myservice.yaml'

# View logs:
ssh pve 'pct exec 202 -- tail -f /var/log/traefik/traefik.log'
```

### Services Using Traefik-Primary

| Service | Domain | Backend |
|---------|--------|---------|
| Excalidraw | excalidraw.htsn.io | 10.10.10.206:8080 (docker-host) |
| FindShyt | findshyt.htsn.io | 10.10.10.8 (CT 205) |
| Gitea | git.htsn.io | 10.10.10.220:3000 |
| Home Assistant | homeassistant.htsn.io | 10.10.10.110 |
| LM Dev | lmdev.htsn.io | 10.10.10.111 |
| MetaMCP | metamcp.htsn.io | 10.10.10.207:12008 (docker-host2) |
| Pi-hole | pihole.htsn.io | 10.10.10.10 |
| TrueNAS | truenas.htsn.io | 10.10.10.200 |
| Proxmox | pve.htsn.io | 10.10.10.120 |
| Copyparty | copyparty.htsn.io | 10.10.10.201 |
| AI Trade | aitrade.htsn.io | (trading server) |
| Pulse | pulse.htsn.io | 10.10.10.206:7655 (monitoring) |
| Happy | happy.htsn.io | 10.10.10.206:3002 (Happy Coder relay) |

---

## Traefik-Saltbox (VM 101) - DO NOT MODIFY

### Configuration

**Location**: `/opt/traefik/` inside Saltbox VM
**Managed By**: Saltbox Ansible playbooks (automatic)
**Docker Mount**: `/opt/traefik` → `/etc/traefik` in container

### Services Using Traefik-Saltbox

- Plex (plex.htsn.io)
- Sonarr, Radarr, Lidarr
- SABnzbd, NZBGet, qBittorrent
- Overseerr, Tautulli, Organizr
- Jackett, NZBHydra2
- Authelia (SSO authentication)
- All other Saltbox-managed containers

### View Saltbox Traefik (Read-Only)

```bash
# View config (don't edit!)
ssh pve 'qm guest exec 101 -- bash -c "docker exec traefik cat /etc/traefik/traefik.yml"'

# View logs
ssh saltbox 'docker logs -f traefik'
```

**⚠️ WARNING**: Editing Saltbox Traefik configs manually will be overwritten by Ansible and may break media services.

---

## Adding a New Public Service - Complete Workflow

Follow these steps to deploy a new service and make it accessible at `servicename.htsn.io`.

### Step 0: Deploy Your Service

First, deploy your service on the appropriate host.

#### Option A: Docker on docker-host (10.10.10.206)

```bash
ssh hutson@10.10.10.206
sudo mkdir -p /opt/myservice
cat > /opt/myservice/docker-compose.yml << 'EOF'
version: "3.8"
services:
  myservice:
    image: myimage:latest
    ports:
      - "8080:80"
    restart: unless-stopped
EOF
cd /opt/myservice && sudo docker-compose up -d
```

#### Option B: New LXC Container on PVE

```bash
ssh pve 'pct create CTID local:vztmpl/ubuntu-22.04-standard_22.04-1_amd64.tar.zst \
  --hostname myservice --memory 2048 --cores 2 \
  --net0 name=eth0,bridge=vmbr0,ip=10.10.10.XXX/24,gw=10.10.10.1 \
  --rootfs local-zfs:8 --unprivileged 1 --start 1'
```

#### Option C: New VM on PVE

```bash
ssh pve 'qm create VMID --name myservice --memory 2048 --cores 2 \
  --net0 virtio,bridge=vmbr0 --scsihw virtio-scsi-pci'
```

### Step 1: Create Traefik Config File

Use this template for new services on **Traefik-Primary (CT 202)**:

#### Basic Template

```yaml
# /etc/traefik/conf.d/myservice.yaml
http:
  routers:
    # HTTPS router
    myservice-secure:
      entryPoints:
        - websecure
      rule: "Host(`myservice.htsn.io`)"
      service: myservice
      tls:
        certResolver: cloudflare  # Use 'cloudflare' for proxied domains, 'letsencrypt' for DNS-only
      priority: 50

    # HTTP → HTTPS redirect
    myservice-redirect:
      entryPoints:
        - web
      rule: "Host(`myservice.htsn.io`)"
      middlewares:
        - myservice-https-redirect
      service: myservice
      priority: 50

  services:
    myservice:
      loadBalancer:
        servers:
          - url: "http://10.10.10.XXX:PORT"

  middlewares:
    myservice-https-redirect:
      redirectScheme:
        scheme: https
        permanent: true
```

#### Deploy the Config

```bash
# Create file on CT 202
ssh pve 'pct exec 202 -- bash -c "cat > /etc/traefik/conf.d/myservice.yaml << '\''EOF'\''
<paste config here>
EOF"'

# Traefik auto-reloads (watches conf.d directory)
# Check logs:
ssh pve 'pct exec 202 -- tail -f /var/log/traefik/traefik.log'
```

### Step 2: Add Cloudflare DNS Entry

#### Cloudflare Credentials

| Field | Value |
|-------|-------|
| Email | cloudflare@htsn.io |
| API Key | 849ebefd163d2ccdec25e49b3e1b3fe2cdadc |
| Zone ID (htsn.io) | c0f5a80448c608af35d39aa820a5f3af |
| Public IP | 70.237.94.174 |

#### Method 1: Manual (Cloudflare Dashboard)

1. Go to https://dash.cloudflare.com/
2. Select `htsn.io` domain
3. DNS → Add Record
4. Type: `A`, Name: `myservice`, IPv4: `70.237.94.174`, Proxied: ☑️

#### Method 2: Automated (CLI)

Save this as `~/bin/add-cloudflare-dns.sh`:

```bash
#!/bin/bash
# Add DNS record to Cloudflare for htsn.io

SUBDOMAIN="$1"
CF_EMAIL="cloudflare@htsn.io"
CF_API_KEY="849ebefd163d2ccdec25e49b3e1b3fe2cdadc"
ZONE_ID="c0f5a80448c608af35d39aa820a5f3af"
PUBLIC_IP="70.237.94.174"

if [ -z "$SUBDOMAIN" ]; then
  echo "Usage: $0 <subdomain>"
  echo "Example: $0 myservice   # Creates myservice.htsn.io"
  exit 1
fi

curl -X POST "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records" \
  -H "X-Auth-Email: $CF_EMAIL" \
  -H "X-Auth-Key: $CF_API_KEY" \
  -H "Content-Type: application/json" \
  --data "{
    \"type\":\"A\",
    \"name\":\"$SUBDOMAIN\",
    \"content\":\"$PUBLIC_IP\",
    \"ttl\":1,
    \"proxied\":true
  }" | jq .
```

**Usage**:
```bash
chmod +x ~/bin/add-cloudflare-dns.sh
~/bin/add-cloudflare-dns.sh myservice   # Creates myservice.htsn.io
```

### Step 3: Testing

```bash
# Check if DNS resolves
dig myservice.htsn.io

# Should return: 70.237.94.174 (or Cloudflare IPs if proxied)

# Test HTTP redirect
curl -I http://myservice.htsn.io

# Expected: 301 redirect to https://

# Test HTTPS
curl -I https://myservice.htsn.io

# Expected: 200 OK

# Check Traefik dashboard (if enabled)
# http://10.10.10.250:8080/dashboard/
```

### Step 4: Update Documentation

After deploying, update:

1. **IP-ASSIGNMENTS.md** - Add to Services & Reverse Proxy Mapping table
2. **This file (TRAEFIK.md)** - Add to "Services Using Traefik-Primary" list
3. **CLAUDE.md** - Update quick reference if needed

---

## SSL Certificates

Traefik has **two certificate resolvers** configured:

| Resolver | Use When | Challenge Type | Notes |
|----------|----------|----------------|-------|
| `letsencrypt` | Cloudflare DNS-only (gray cloud ☁️) | HTTP-01 | Requires port 80 reachable |
| `cloudflare` | Cloudflare Proxied (orange cloud 🟠) | DNS-01 | Works with Cloudflare proxy |

### ⚠️ Important: HTTP Challenge vs DNS Challenge

**If Cloudflare proxy is enabled** (orange cloud), HTTP challenge **FAILS** because Cloudflare redirects HTTP→HTTPS before the challenge reaches your server.

**Solution**: Use `cloudflare` resolver (DNS-01 challenge) instead.

### Certificate Resolver Configuration

**Cloudflare API credentials** are configured in `/etc/systemd/system/traefik.service`:

```ini
Environment="CF_API_EMAIL=cloudflare@htsn.io"
Environment="CF_API_KEY=849ebefd163d2ccdec25e49b3e1b3fe2cdadc"
```

### Certificate Storage

| Resolver | Storage File |
|----------|--------------|
| HTTP challenge (`letsencrypt`) | `/etc/traefik/acme.json` |
| DNS challenge (`cloudflare`) | `/etc/traefik/acme-cf.json` |

**Permissions**: Must be `600` (read/write owner only)

```bash
# Check permissions
ssh pve 'pct exec 202 -- ls -la /etc/traefik/acme*.json'

# Fix if needed
ssh pve 'pct exec 202 -- chmod 600 /etc/traefik/acme.json'
ssh pve 'pct exec 202 -- chmod 600 /etc/traefik/acme-cf.json'
```
|
||||||
|
### Certificate Renewal
|
||||||
|
|
||||||
|
- **Automatic** via Traefik
|
||||||
|
- Checks every 24 hours
|
||||||
|
- Renews 30 days before expiry
|
||||||
|
- No manual intervention needed

### Troubleshooting Certificates

#### Certificate Fails to Issue

```bash
# Check Traefik logs
ssh pve 'pct exec 202 -- tail -f /var/log/traefik/traefik.log | grep -i error'

# Verify Cloudflare API access (the global API key authenticates against /user;
# /user/tokens/verify only works with Bearer API tokens)
curl -X GET "https://api.cloudflare.com/client/v4/user" \
    -H "X-Auth-Email: cloudflare@htsn.io" \
    -H "X-Auth-Key: 849ebefd163d2ccdec25e49b3e1b3fe2cdadc"

# Check acme.json permissions
ssh pve 'pct exec 202 -- ls -la /etc/traefik/acme*.json'
```

#### Force Certificate Renewal

```bash
# Delete certificate (Traefik will re-request)
ssh pve 'pct exec 202 -- rm /etc/traefik/acme-cf.json'
ssh pve 'pct exec 202 -- touch /etc/traefik/acme-cf.json'
ssh pve 'pct exec 202 -- chmod 600 /etc/traefik/acme-cf.json'
ssh pve 'pct exec 202 -- systemctl restart traefik'

# Watch logs
ssh pve 'pct exec 202 -- tail -f /var/log/traefik/traefik.log'
```

---

## Quick Deployment - One-Liner

For fast deployment, use this all-in-one command:

```bash
# === DEPLOY SERVICE (example: myservice on docker-host port 8080) ===

# 1. Create Traefik config
ssh pve 'pct exec 202 -- bash -c "cat > /etc/traefik/conf.d/myservice.yaml << EOF
http:
  routers:
    myservice-secure:
      entryPoints: [websecure]
      rule: Host(\\\`myservice.htsn.io\\\`)
      service: myservice
      tls: {certResolver: cloudflare}
  services:
    myservice:
      loadBalancer:
        servers:
          - url: http://10.10.10.206:8080
EOF"'

# 2. Add Cloudflare DNS
curl -s -X POST "https://api.cloudflare.com/client/v4/zones/c0f5a80448c608af35d39aa820a5f3af/dns_records" \
    -H "X-Auth-Email: cloudflare@htsn.io" \
    -H "X-Auth-Key: 849ebefd163d2ccdec25e49b3e1b3fe2cdadc" \
    -H "Content-Type: application/json" \
    --data '{"type":"A","name":"myservice","content":"70.237.94.174","proxied":true}'

# 3. Test (wait a few seconds for DNS propagation)
curl -I https://myservice.htsn.io
```

---

## Docker Service with Traefik Labels (Alternative)

If deploying a service via Docker on `docker-host` (VM 206), you can use Traefik labels instead of config files.

**Requirements**:
- Traefik must have access to the Docker socket
- Service must be on the same Docker network as Traefik

**Example docker-compose.yml**:

```yaml
version: "3.8"

services:
  myservice:
    image: myimage:latest
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.myservice.rule=Host(`myservice.htsn.io`)"
      - "traefik.http.routers.myservice.entrypoints=websecure"
      - "traefik.http.routers.myservice.tls.certresolver=letsencrypt"
      - "traefik.http.services.myservice.loadbalancer.server.port=8080"
    networks:
      - traefik

networks:
  traefik:
    external: true
```
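
If this label-based approach is ever adopted, bringing the stack up is the standard Compose flow (a sketch; it assumes the shared `traefik` network name matches whatever network the Traefik container itself is attached to):

```bash
# One-time: create the shared network the compose file declares as external
docker network create traefik

# Start the service; Traefik discovers the labels automatically
docker compose up -d
```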

**Note**: This method is NOT currently used on Traefik-Primary (CT 202), as it doesn't have Docker socket access. Config files are preferred.

---

## Cloudflare API Reference

### API Credentials

| Field | Value |
|-------|-------|
| Email | cloudflare@htsn.io |
| API Key | 849ebefd163d2ccdec25e49b3e1b3fe2cdadc |
| Zone ID | c0f5a80448c608af35d39aa820a5f3af |

### Common API Operations

Set credentials:
```bash
CF_EMAIL="cloudflare@htsn.io"
CF_API_KEY="849ebefd163d2ccdec25e49b3e1b3fe2cdadc"
ZONE_ID="c0f5a80448c608af35d39aa820a5f3af"
```

**List all DNS records**:
```bash
curl -X GET "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records" \
    -H "X-Auth-Email: $CF_EMAIL" \
    -H "X-Auth-Key: $CF_API_KEY" | jq
```
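
The delete and update calls below need a `$RECORD_ID`; one way to grab it from the listing (a sketch, assuming `jq` is installed; the `?name=` filter takes the full FQDN):

```bash
# Look up the record ID for a given hostname (hypothetical helper, not in the repo)
RECORD_ID=$(curl -s -X GET "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records?name=myservice.htsn.io" \
    -H "X-Auth-Email: $CF_EMAIL" \
    -H "X-Auth-Key: $CF_API_KEY" | jq -r '.result[0].id')
echo "$RECORD_ID"
```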

**Add A record**:
```bash
curl -X POST "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records" \
    -H "X-Auth-Email: $CF_EMAIL" \
    -H "X-Auth-Key: $CF_API_KEY" \
    -H "Content-Type: application/json" \
    --data '{
        "type":"A",
        "name":"subdomain",
        "content":"70.237.94.174",
        "proxied":true
    }'
```

**Delete record**:
```bash
curl -X DELETE "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records/$RECORD_ID" \
    -H "X-Auth-Email: $CF_EMAIL" \
    -H "X-Auth-Key: $CF_API_KEY"
```

**Update record** (toggle proxy):
```bash
curl -X PATCH "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records/$RECORD_ID" \
    -H "X-Auth-Email: $CF_EMAIL" \
    -H "X-Auth-Key: $CF_API_KEY" \
    -H "Content-Type: application/json" \
    --data '{"proxied":false}'
```

---

## Troubleshooting

### Service Not Accessible

```bash
# 1. Check if DNS resolves
dig myservice.htsn.io

# 2. Check if backend is reachable
curl -I http://10.10.10.XXX:PORT

# 3. Check Traefik logs
ssh pve 'pct exec 202 -- tail -f /var/log/traefik/traefik.log'

# 4. Check Traefik config is valid
ssh pve 'pct exec 202 -- cat /etc/traefik/conf.d/myservice.yaml'

# 5. Restart Traefik (if needed)
ssh pve 'pct exec 202 -- systemctl restart traefik'
```

### Certificate Issues

```bash
# Check certificate status in acme.json
ssh pve 'pct exec 202 -- cat /etc/traefik/acme-cf.json | jq'

# Check certificate expiry
echo | openssl s_client -servername myservice.htsn.io -connect myservice.htsn.io:443 2>/dev/null | openssl x509 -noout -dates
```

### 502 Bad Gateway

**Cause**: Backend service is down or unreachable

```bash
# Check if backend is running
ssh backend-host 'systemctl status myservice'

# Check if port is open
nc -zv 10.10.10.XXX PORT

# Check firewall
ssh backend-host 'iptables -L -n | grep PORT'
```

### 404 Not Found

**Cause**: Traefik can't match the request to a router

```bash
# Check router rule matches domain
ssh pve 'pct exec 202 -- cat /etc/traefik/conf.d/myservice.yaml | grep rule'

# Should be: rule: "Host(`myservice.htsn.io`)"

# Check DNS is pointing to correct IP
dig myservice.htsn.io

# Restart Traefik to reload config
ssh pve 'pct exec 202 -- systemctl restart traefik'
```

---

## Advanced Configuration Examples

### WebSocket Support

For services that use WebSockets (like Home Assistant):

```yaml
http:
  routers:
    myservice-secure:
      entryPoints:
        - websecure
      rule: "Host(`myservice.htsn.io`)"
      service: myservice
      tls:
        certResolver: cloudflare

  services:
    myservice:
      loadBalancer:
        servers:
          - url: "http://10.10.10.XXX:PORT"
# No special config needed - WebSockets work by default in Traefik v2+
```

### Custom Headers

Add custom headers (e.g., security headers):

```yaml
http:
  routers:
    myservice-secure:
      middlewares:
        - myservice-headers

  middlewares:
    myservice-headers:
      headers:
        customResponseHeaders:
          X-Frame-Options: "DENY"
          X-Content-Type-Options: "nosniff"
          Referrer-Policy: "strict-origin-when-cross-origin"
```

### Basic Authentication

Protect a service with basic auth:

```yaml
http:
  routers:
    myservice-secure:
      middlewares:
        - myservice-auth

  middlewares:
    myservice-auth:
      basicAuth:
        users:
          - "user:$apr1$..."  # Generate with: htpasswd -nb user password
```
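
Generating the hash on a workstation (assumes `htpasswd` from apache2-utils is available):

```bash
# Produce a user:hash pair for the basicAuth users list
htpasswd -nb user 'S3cretPassw0rd'
# Output looks like: user:$apr1$...  - paste the whole line into the YAML above
```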

---

## Maintenance

### Monthly Checks

```bash
# Check Traefik status
ssh pve 'pct exec 202 -- systemctl status traefik'

# Review logs for errors
ssh pve 'pct exec 202 -- grep -i error /var/log/traefik/traefik.log | tail -20'

# List domains with issued certificates (check expiry with the openssl
# one-liner under "Certificate Issues" above; acme.json stores the cert
# blob, not a readable expiry date)
ssh pve 'pct exec 202 -- cat /etc/traefik/acme-cf.json | jq -r ".cloudflare.Certificates[].domain.main"'

# Verify all services responding
for domain in plex.htsn.io git.htsn.io truenas.htsn.io; do
    echo "Testing $domain..."
    curl -sI https://$domain | head -1
done
```

### Backup Traefik Config

```bash
# Backup all configs
ssh pve 'pct exec 202 -- tar czf /tmp/traefik-backup-$(date +%Y%m%d).tar.gz /etc/traefik'

# Pull the archive out of the container to the PVE host, then to a safe location
ssh pve "pct pull 202 /tmp/traefik-backup-$(date +%Y%m%d).tar.gz /tmp/traefik-backup-$(date +%Y%m%d).tar.gz"
scp "pve:/tmp/traefik-backup-*.tar.gz" ~/Backups/traefik/
```
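
To take the archive step on a schedule rather than by hand, a cron entry along these lines could work (a sketch; the monthly cadence is an assumption, and `%` must be escaped in crontab syntax; the copy-out still follows the `pct pull`/`scp` steps above):

```bash
# In `crontab -e` on PVE: archive Traefik config at 03:00 on the 1st of each month
0 3 1 * * pct exec 202 -- tar czf /tmp/traefik-backup-$(date +\%Y\%m\%d).tar.gz /etc/traefik
```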

---

## Related Documentation

- [IP-ASSIGNMENTS.md](IP-ASSIGNMENTS.md) - Service IP addresses
- [CLOUDFLARE.md](#) - Cloudflare DNS management (coming soon)
- [SERVICES.md](#) - Complete service inventory (coming soon)

---

**Last Updated**: 2025-12-22
605
UPS.md
Normal file
@@ -0,0 +1,605 @@
# UPS and Power Management

Documentation for UPS (Uninterruptible Power Supply) configuration, NUT (Network UPS Tools) monitoring, and power failure procedures.

## Hardware

### Current UPS

| Specification | Value |
|---------------|-------|
| **Model** | CyberPower OR2200PFCRT2U |
| **Capacity** | 2200VA / 1320W |
| **Form Factor** | 2U rackmount |
| **Output** | PFC Sinewave (compatible with active PFC PSUs) |
| **Outlets** | 2x NEMA 5-20R + 6x NEMA 5-15R (all battery + surge) |
| **Input Plug** | ⚠️ Originally NEMA 5-20P (20A), **rewired to 5-15P (15A)** |
| **Runtime** | ~15-20 min at typical load (~33% / 440W) |
| **Installed** | 2025-12-21 |
| **Status** | Active |

### ⚠️ Temporary Wiring Modification

**Issue**: UPS came with a NEMA 5-20P plug (20A) but the server rack is on a 15A circuit
**Solution**: Temporarily rewired the plug from 5-20P → 5-15P for compatibility
**Risk**: The UPS is designed for a 20A branch circuit; on the 15A circuit, total draw must stay below 1800W (15A × 120V)
**Current draw**: ~1000-1350W total (safe margin)
**Backlog**: Upgrade to a 20A circuit, restore the original 5-20P plug

### Previous UPS

| Model | Capacity | Issue | Replaced |
|-------|----------|-------|----------|
| WattBox WB-1100-IPVMB-6 | 1100VA / 660W | Insufficient for dual Threadripper setup | 2025-12-21 |

**Why replaced**: Combined server load of 1000-1350W exceeded the 660W capacity.

---

## Power Draw Estimates

### Typical Load

| Component | Idle | Load | Notes |
|-----------|------|------|-------|
| PVE Server | 250-350W | 500-750W | CPU + TITAN RTX + P2000 + storage |
| PVE2 Server | 200-300W | 450-600W | CPU + RTX A6000 + storage |
| Network gear | ~50W | ~50W | Router, switches |
| **Total** | **500-700W** | **1000-1400W** | Varies by workload |

**UPS Load**: ~33-50% typical, 70-80% under heavy load

### Runtime Calculation

At **440W load** (33%): ~15-20 min runtime (tested 2025-12-21)
At **660W load** (50%): ~10-12 min estimated
At **1000W load** (75%): ~6-8 min estimated

**NUT shutdown trigger**: 120 seconds (2 min) remaining runtime
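
A back-of-the-envelope way to interpolate these estimates from the tested data point (a sketch only; it ignores inverter efficiency curves, which make real runtime sub-linear at higher loads):

```bash
# Tested: 440 W held for ~16 min, so runtime_min ≈ (440 * 16) / load_watts
LOAD=660
echo "$(( (440 * 16) / LOAD )) min"   # → 10 min, in line with the 50% estimate above
```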

---

## NUT (Network UPS Tools) Configuration

### Architecture

```
UPS (USB) ──> PVE (NUT Server/Master) ──> PVE2 (NUT Client/Slave)
                   │
                   └──> Home Assistant (monitoring only)
```

**Master**: PVE (10.10.10.120) - UPS connected via USB, runs the NUT server
**Slave**: PVE2 (10.10.10.102) - Monitors PVE's NUT server, shuts down when triggered

### NUT Server Configuration (PVE)

#### 1. UPS Driver Config: `/etc/nut/ups.conf`

```ini
[cyberpower]
    driver = usbhid-ups
    port = auto
    desc = "CyberPower OR2200PFCRT2U"
    override.battery.charge.low = 20
    override.battery.runtime.low = 120
```

**Key settings**:
- `driver = usbhid-ups`: USB HID UPS driver (generic for CyberPower)
- `port = auto`: Auto-detect USB device
- `override.battery.runtime.low = 120`: Trigger shutdown at 120 seconds (2 min) remaining

#### 2. NUT Server Config: `/etc/nut/upsd.conf`

```ini
LISTEN 127.0.0.1 3493
LISTEN 10.10.10.120 3493
```

**Listens on**:
- Localhost (for local monitoring)
- LAN IP (for PVE2 to connect)

#### 3. User Config: `/etc/nut/upsd.users`

```ini
[admin]
    password = upsadmin123
    actions = SET
    instcmds = ALL

[upsmon]
    password = upsmon123
    upsmon master
```

**Users**:
- `admin`: Full control, can run commands
- `upsmon`: Monitoring only (used by PVE2)

#### 4. Monitor Config: `/etc/nut/upsmon.conf`

```ini
MONITOR cyberpower@localhost 1 upsmon upsmon123 master

MINSUPPLIES 1
SHUTDOWNCMD "/usr/local/bin/ups-shutdown.sh"
NOTIFYCMD /usr/sbin/upssched
POLLFREQ 5
POLLFREQALERT 5
HOSTSYNC 15
DEADTIME 15
POWERDOWNFLAG /etc/killpower

NOTIFYMSG ONLINE "UPS %s on line power"
NOTIFYMSG ONBATT "UPS %s on battery"
NOTIFYMSG LOWBATT "UPS %s battery is low"
NOTIFYMSG FSD "UPS %s: forced shutdown in progress"
NOTIFYMSG COMMOK "Communications with UPS %s established"
NOTIFYMSG COMMBAD "Communications with UPS %s lost"
NOTIFYMSG SHUTDOWN "Auto logout and shutdown proceeding"
NOTIFYMSG REPLBATT "UPS %s battery needs to be replaced"
NOTIFYMSG NOCOMM "UPS %s is unavailable"
NOTIFYMSG NOPARENT "upsmon parent process died - shutdown impossible"

NOTIFYFLAG ONLINE SYSLOG+WALL
NOTIFYFLAG ONBATT SYSLOG+WALL
NOTIFYFLAG LOWBATT SYSLOG+WALL
NOTIFYFLAG FSD SYSLOG+WALL
NOTIFYFLAG COMMOK SYSLOG+WALL
NOTIFYFLAG COMMBAD SYSLOG+WALL
NOTIFYFLAG SHUTDOWN SYSLOG+WALL
NOTIFYFLAG REPLBATT SYSLOG+WALL
NOTIFYFLAG NOCOMM SYSLOG+WALL
NOTIFYFLAG NOPARENT SYSLOG
```

**Key settings**:
- `MONITOR cyberpower@localhost 1 upsmon upsmon123 master`: Monitor local UPS
- `SHUTDOWNCMD "/usr/local/bin/ups-shutdown.sh"`: Custom shutdown script
- `POLLFREQ 5`: Check UPS every 5 seconds
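
`NOTIFYCMD /usr/sbin/upssched` implies an `/etc/nut/upssched.conf` also exists, though it isn't captured in this repo. A minimal sketch of what one typically looks like (the timer values and CMDSCRIPT path here are assumptions, not taken from this setup):

```ini
# /etc/nut/upssched.conf (illustrative; the actual file is not documented here)
CMDSCRIPT /usr/local/bin/upssched-cmd

PIPEFN /run/nut/upssched.pipe
LOCKFN /run/nut/upssched.lock

# Fire "onbatt" 30s after going on battery; cancel if power returns first
AT ONBATT * START-TIMER onbatt 30
AT ONLINE * CANCEL-TIMER onbatt
```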

#### 5. USB Permissions: `/etc/udev/rules.d/99-nut-ups.rules`

```udev
SUBSYSTEM=="usb", ATTR{idVendor}=="0764", ATTR{idProduct}=="0501", MODE="0660", GROUP="nut"
```

**Purpose**: Ensure NUT can access the USB UPS device

**Apply rule**:
```bash
udevadm control --reload-rules
udevadm trigger
```
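
A quick check that the rule took effect (the bus/device numbers below come from the `lsusb` output and will differ per boot):

```bash
# Find the UPS device
ssh pve 'lsusb -d 0764:0501'
# e.g. "Bus 001 Device 003: ..." - that node should now be group "nut"
ssh pve 'ls -l /dev/bus/usb/001/003'
```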

### NUT Client Configuration (PVE2)

#### Monitor Config: `/etc/nut/upsmon.conf`

```ini
MONITOR cyberpower@10.10.10.120 1 upsmon upsmon123 slave

MINSUPPLIES 1
SHUTDOWNCMD "/usr/local/bin/ups-shutdown.sh"
POLLFREQ 5
POLLFREQALERT 5
HOSTSYNC 15
DEADTIME 15
POWERDOWNFLAG /etc/killpower

# Same NOTIFYMSG and NOTIFYFLAG as PVE
```

**Key difference**: `slave` instead of `master` - monitors the remote UPS on PVE
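
After editing, it's worth confirming the slave can actually reach the master before relying on it:

```bash
# From PVE2, query the UPS through PVE's NUT server
ssh pve2 'upsc cyberpower@10.10.10.120 ups.status'
# Expect: OL (online)
```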

---

## Custom Shutdown Script

### `/usr/local/bin/ups-shutdown.sh` (Same on both PVE and PVE2)

```bash
#!/bin/bash
# Graceful VM/CT shutdown when UPS battery low

LOG="/var/log/ups-shutdown.log"

log() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') - $1" | tee -a "$LOG"
}

log "=== UPS Shutdown Triggered ==="
log "Battery low - initiating graceful shutdown of VMs/CTs"

# Get list of running VMs (skip TrueNAS for now)
VMS=$(qm list | awk '$3=="running" && $1!=100 {print $1}')
for VMID in $VMS; do
    log "Stopping VM $VMID..."
    qm shutdown $VMID
done

# Get list of running containers
CTS=$(pct list | awk '$2=="running" {print $1}')
for CTID in $CTS; do
    log "Stopping CT $CTID..."
    pct shutdown $CTID
done

# Wait for VMs/CTs to stop
log "Waiting 60 seconds for VMs/CTs to shut down..."
sleep 60

# Now stop TrueNAS (storage - must be last)
if qm status 100 | grep -q running; then
    log "Stopping TrueNAS (VM 100) last..."
    qm shutdown 100
    sleep 30
fi

log "All VMs/CTs stopped. Host will remain running until UPS dies."
log "=== UPS Shutdown Complete ==="
```

**Make executable**:
```bash
chmod +x /usr/local/bin/ups-shutdown.sh
```

**Script behavior**:
1. Stops all VMs (except TrueNAS)
2. Stops all containers
3. Waits 60 seconds
4. Stops TrueNAS last (storage must be cleanly unmounted)
5. **Does NOT shut down Proxmox hosts** - intentionally left running

**Why not shut down hosts?**
- BIOS configured to "Restore on AC Power Loss"
- When power returns, servers auto-boot and start VMs in order
- Avoids need for manual intervention

---

## Power Failure Behavior

### When Power Fails

1. **UPS switches to battery** (`OB DISCHRG` status)
2. **NUT monitors runtime** - polls every 5 seconds
3. **At 120 seconds (2 min) remaining**:
   - NUT triggers `/usr/local/bin/ups-shutdown.sh` on both servers
   - Script gracefully stops all VMs/CTs
   - TrueNAS stopped last (storage integrity)
4. **Hosts remain running** until UPS battery depletes
5. **UPS battery dies** → Hosts lose power (ungraceful but safe - VMs already stopped)

### When Power Returns

1. **UPS charges battery**, power returns to servers
2. **BIOS "Restore on AC Power Loss"** boots both servers
3. **Proxmox starts** and auto-starts VMs in configured order:

| Order | Wait | VMs/CTs | Reason |
|-------|------|---------|--------|
| 1 | 30s | TrueNAS (VM 100) | Storage must start first |
| 2 | 60s | Saltbox (VM 101) | Depends on TrueNAS NFS |
| 3 | 10s | fs-dev, homeassistant, lmdev1, copyparty, docker-host | General VMs |
| 4 | 5s | pihole, traefik, findshyt | Containers |

PVE2 VMs: order=1, wait=10s

**Total recovery time**: ~7 minutes from power restoration to fully operational (tested 2025-12-21)

---

## UPS Status Codes

| Code | Meaning | Action |
|------|---------|--------|
| `OL` | Online (AC power) | Normal operation |
| `OB` | On Battery | Power outage - monitor runtime |
| `LB` | Low Battery | <2 min remaining - shutdown imminent |
| `CHRG` | Charging | Battery charging after power restored |
| `DISCHRG` | Discharging | On battery, draining |
| `FSD` | Forced Shutdown | NUT triggered shutdown |

---

## Monitoring & Commands

### Check UPS Status

```bash
# Full status
ssh pve 'upsc cyberpower@localhost'

# Key metrics only
ssh pve 'upsc cyberpower@localhost | grep -E "battery.charge:|battery.runtime:|ups.load:|ups.status:"'

# Example output:
# battery.charge: 100
# battery.runtime: 1234   (seconds remaining)
# ups.load: 33            (% load)
# ups.status: OL          (online)
```

### Control UPS Beeper

```bash
# Mute beeper (temporary - until next power event)
ssh pve 'upscmd -u admin -p upsadmin123 cyberpower@localhost beeper.mute'

# Disable beeper (permanent)
ssh pve 'upscmd -u admin -p upsadmin123 cyberpower@localhost beeper.disable'

# Enable beeper
ssh pve 'upscmd -u admin -p upsadmin123 cyberpower@localhost beeper.enable'
```

### Test Shutdown Procedure

**Simulate low battery** (careful - this will shut down VMs!):

```bash
# Set a very high low-battery threshold to trigger shutdown
ssh pve 'upsrw -s battery.runtime.low=300 -u admin -p upsadmin123 cyberpower@localhost'

# Watch it trigger (when runtime drops below 300 seconds)
ssh pve 'tail -f /var/log/ups-shutdown.log'

# Reset to normal
ssh pve 'upsrw -s battery.runtime.low=120 -u admin -p upsadmin123 cyberpower@localhost'
```

**Better test**: Run the shutdown script manually without actually triggering NUT:
```bash
ssh pve '/usr/local/bin/ups-shutdown.sh'
```

---

## Home Assistant Integration

UPS metrics are exposed to Home Assistant via the NUT integration.

### Available Sensors

| Entity ID | Description |
|-----------|-------------|
| `sensor.cyberpower_battery_charge` | Battery % (0-100) |
| `sensor.cyberpower_battery_runtime` | Seconds remaining on battery |
| `sensor.cyberpower_load` | Load % (0-100) |
| `sensor.cyberpower_input_voltage` | Input voltage (V AC) |
| `sensor.cyberpower_output_voltage` | Output voltage (V AC) |
| `sensor.cyberpower_status` | Status text (OL, OB, LB, etc.) |

### Configuration

**Home Assistant**: See [HOMEASSISTANT.md](HOMEASSISTANT.md) for integration setup.

### Example Automations

**Send notification when on battery**:
```yaml
automation:
  - alias: "UPS On Battery Alert"
    trigger:
      - platform: state
        entity_id: sensor.cyberpower_status
        to: "OB"
    action:
      - service: notify.mobile_app
        data:
          message: "⚠️ Power outage! UPS on battery. Runtime: {{ states('sensor.cyberpower_battery_runtime') }}s"
```

**Alert when battery low**:
```yaml
automation:
  - alias: "UPS Low Battery Alert"
    trigger:
      - platform: numeric_state
        entity_id: sensor.cyberpower_battery_runtime
        below: 300
    action:
      - service: notify.mobile_app
        data:
          message: "🚨 UPS battery low! {{ states('sensor.cyberpower_battery_runtime') }}s remaining"
```

---

## Testing Results

### Full Power Failure Test (2025-12-21)

Complete end-to-end test of power failure and recovery:

| Event | Time | Duration | Notes |
|-------|------|----------|-------|
| **Power pulled** | 22:30 | - | UPS on battery, ~15 min runtime at 33% load |
| **Low battery trigger** | 22:40:38 | +10:38 | Runtime < 120s, shutdown script ran |
| **All VMs stopped** | 22:41:36 | +0:58 | Graceful shutdown completed |
| **UPS died** | 22:46:29 | +4:53 | Hosts lost power at 0% battery |
| **Power restored** | ~22:47 | - | Plugged back in |
| **PVE online** | 22:49:11 | +2:11 | BIOS boot, Proxmox started |
| **PVE2 online** | 22:50:47 | +3:47 | BIOS boot, Proxmox started |
| **All VMs running** | 22:53:39 | +6:39 | Auto-started in correct order |
| **Total recovery** | - | **~7 min** | From power return to fully operational |

**Results**:
✅ VMs shut down gracefully
✅ Hosts remained running until UPS died (as intended)
✅ Auto-boot on power restoration worked
✅ VMs started in correct order with appropriate delays
✅ No data corruption or issues

**Runtime calculation**:
- Load: ~33% (440W estimated)
- Total runtime on battery: ~16 minutes (22:30 → 22:46:29)
- Matches manufacturer estimate for 33% load

---

## Proxmox Cluster Quorum Fix

### Problem

With a 2-node cluster, if one node goes down, the other loses quorum and can't manage VMs.

During UPS testing, this would prevent the remaining node from starting VMs after power restoration.

### Solution

Modified `/etc/pve/corosync.conf` to enable 2-node mode:

```
quorum {
  provider: corosync_votequorum
  two_node: 1
}
```

**Effect**:
- Either node can operate independently if the other is down
- No more waiting for quorum when one server is offline
- Both nodes visible in a single Proxmox interface when both are up

**Applied**: 2025-12-21
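
A quick way to confirm the flag is active (standard `pvecm` tooling; the exact flag string can vary slightly by corosync version):

```bash
# Run on either node after restarting corosync
ssh pve 'pvecm status | grep -i -A4 votequorum'
# Look for "Flags: Quorate 2Node" in the output
```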

---

## Maintenance

### Monthly Checks

```bash
# Check UPS status
ssh pve 'upsc cyberpower@localhost'

# Check NUT server running
ssh pve 'systemctl status nut-server'
ssh pve 'systemctl status nut-monitor'

# Check NUT client running (PVE2)
ssh pve2 'systemctl status nut-monitor'

# Verify PVE2 can see UPS
ssh pve2 'upsc cyberpower@10.10.10.120'

# Check logs for errors
ssh pve 'journalctl -u nut-server -n 50'
ssh pve 'journalctl -u nut-monitor -n 50'
```

### Battery Health

**Check battery stats**:
```bash
ssh pve 'upsc cyberpower@localhost | grep battery'

# Key metrics:
# battery.charge: 100     (should be near 100 when on AC)
# battery.runtime: 1200+  (seconds at current load)
# battery.voltage: ~24V   (normal for 24V battery system)
```

**Battery replacement**: When runtime significantly decreases or the UPS reports `REPLBATT`:
```bash
ssh pve 'upsc cyberpower@localhost | grep battery.mfr.date'
```

CyberPower batteries typically last 3-5 years.

### Firmware Updates

Check the CyberPower website for firmware updates:
https://www.cyberpowersystems.com/support/firmware/

---

## Troubleshooting

### UPS Not Detected

```bash
# Check USB connection
ssh pve 'lsusb | grep Cyber'

# Expected:
# Bus 001 Device 003: ID 0764:0501 Cyber Power System, Inc. CP1500 AVR UPS

# Restart NUT driver
ssh pve 'systemctl restart nut-driver'
ssh pve 'systemctl status nut-driver'
```

### PVE2 Can't Connect

```bash
# Verify NUT server listening
ssh pve 'netstat -tuln | grep 3493'

# Should show:
# tcp   0   0 10.10.10.120:3493   0.0.0.0:*   LISTEN

# Test connection from PVE2
ssh pve2 'telnet 10.10.10.120 3493'

# Check firewall (should allow port 3493)
ssh pve 'iptables -L -n | grep 3493'
```

### Shutdown Script Not Running

```bash
# Check script permissions
ssh pve 'ls -la /usr/local/bin/ups-shutdown.sh'

# Should be: -rwxr-xr-x (executable)

# Check logs
ssh pve 'cat /var/log/ups-shutdown.log'

# Test script manually
ssh pve '/usr/local/bin/ups-shutdown.sh'
```

### UPS Status Shows UNKNOWN

```bash
# Driver may not be compatible
ssh pve 'upsc cyberpower@localhost ups.status'

# Try a different driver (in /etc/nut/ups.conf)
# driver = usbhid-ups
# or
# driver = blazer_usb

# Restart after change
ssh pve 'systemctl restart nut-driver nut-server'
```

---

## Future Improvements

- [ ] Add email alerts for UPS events (power fail, low battery)
- [ ] Log runtime statistics to track battery degradation
- [ ] Set up Grafana dashboard for UPS metrics
- [ ] Test battery runtime at different load levels
- [ ] Upgrade to 20A circuit, restore original 5-20P plug
- [ ] Consider adding network management card for out-of-band UPS access

---

## Related Documentation

- [POWER-MANAGEMENT.md](POWER-MANAGEMENT.md) - Overall power optimization
- [VMS.md](VMS.md) - VM startup order configuration
- [HOMEASSISTANT.md](HOMEASSISTANT.md) - UPS sensor integration

---

**Last Updated**: 2025-12-22
580
VMS.md
Normal file
@@ -0,0 +1,580 @@
# VMs and Containers

Complete inventory of all virtual machines and LXC containers across both Proxmox servers.

## Overview

| Server | VMs | LXCs | Total |
|--------|-----|------|-------|
| **PVE** (10.10.10.120) | 7 | 3 | 10 |
| **PVE2** (10.10.10.102) | 3 | 0 | 3 |
| **Total** | **10** | **3** | **13** |

---

## PVE (10.10.10.120) - Primary Server

### Virtual Machines

| VMID | Name | IP | vCPUs | RAM | Storage | Purpose | GPU/Passthrough | QEMU Agent |
|------|------|-----|-------|-----|---------|---------|-----------------|------------|
| **100** | truenas | 10.10.10.200 | 8 | 32GB | nvme-mirror1 | NAS, central file storage | LSI SAS2308 HBA, Samsung NVMe | ✅ Yes |
| **101** | saltbox | 10.10.10.100 | 16 | 16GB | nvme-mirror1 | Media automation (Plex, *arr) | TITAN RTX | ✅ Yes |
| **105** | fs-dev | 10.10.10.5 | 10 | 8GB | rpool | Development environment | - | ✅ Yes |
| **110** | homeassistant | 10.10.10.110 | 2 | 2GB | rpool | Home automation platform | - | ❌ No |
| **111** | lmdev1 | 10.10.10.111 | 8 | 32GB | nvme-mirror1 | AI/LLM development | TITAN RTX | ✅ Yes |
| **201** | copyparty | 10.10.10.201 | 2 | 2GB | rpool | File sharing service | - | ✅ Yes |
| **206** | docker-host | 10.10.10.206 | 2 | 4GB | rpool | Docker services (Excalidraw, Happy, Pulse) | - | ✅ Yes |

### LXC Containers

| CTID | Name | IP | RAM | Storage | Purpose |
|------|------|-----|-----|---------|---------|
| **200** | pihole | 10.10.10.10 | - | rpool | DNS, ad blocking |
| **202** | traefik | 10.10.10.250 | - | rpool | Reverse proxy (primary) |
| **205** | findshyt | 10.10.10.8 | - | rpool | Custom app |

---

## PVE2 (10.10.10.102) - Secondary Server

### Virtual Machines

| VMID | Name | IP | vCPUs | RAM | Storage | Purpose | GPU/Passthrough | QEMU Agent |
|------|------|-----|-------|-----|---------|---------|-----------------|------------|
| **300** | gitea-vm | 10.10.10.220 | 2 | 4GB | nvme-mirror3 | Git server (Gitea) | - | ✅ Yes |
| **301** | trading-vm | 10.10.10.221 | 16 | 32GB | nvme-mirror3 | AI trading platform | RTX A6000 | ✅ Yes |
| **302** | docker-host2 | 10.10.10.207 | 4 | 8GB | nvme-mirror3 | Docker host (n8n, automation) | - | ✅ Yes |

### LXC Containers

None on PVE2.

---

## VM Details

### 100 - TrueNAS (Storage Server)

**Purpose**: Central NAS for all file storage, NFS/SMB shares, and media libraries

**Specs**:
- **OS**: TrueNAS SCALE
- **vCPUs**: 8
- **RAM**: 32 GB
- **Storage**: nvme-mirror1 (OS), EMC storage enclosure (data pool via HBA passthrough)
- **Network**:
  - Primary: 10 Gb (vmbr2)
  - Secondary: Internal storage network (vmbr3 @ 10.10.20.x)

**Hardware Passthrough**:
- LSI SAS2308 HBA (for EMC enclosure drives)
- Samsung NVMe (for ZFS caching)

**ZFS Pools**:
- `vault`: Main storage pool on EMC drives
- Boot pool on passed-through NVMe

**See**: [STORAGE.md](STORAGE.md), [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md)

---

### 101 - Saltbox (Media Automation)

**Purpose**: Media server stack - Plex, Sonarr, Radarr, SABnzbd, Overseerr, etc.

**Specs**:
- **OS**: Ubuntu 22.04
- **vCPUs**: 16
- **RAM**: 16 GB
- **Storage**: nvme-mirror1
- **Network**: 10 Gb (vmbr2)

**GPU Passthrough**:
- NVIDIA TITAN RTX (for Plex hardware transcoding)

**Services**:
- Plex Media Server (plex.htsn.io)
- Sonarr, Radarr, Lidarr (TV/movie/music automation)
- SABnzbd, NZBGet (downloaders)
- Overseerr (request management)
- Tautulli (Plex stats)
- Organizr (dashboard)
- Authelia (SSO authentication)
- Traefik (reverse proxy - separate from CT 202)

**Managed By**: Saltbox Ansible playbooks
**See**: [SALTBOX.md](#) (coming soon)

---

### 105 - fs-dev (Development Environment)

**Purpose**: General development work, testing, prototyping

**Specs**:
- **OS**: Ubuntu 22.04
- **vCPUs**: 10
- **RAM**: 8 GB
- **Storage**: rpool
- **Network**: 1 Gb (vmbr0)

---

### 110 - Home Assistant (Home Automation)

**Purpose**: Smart home automation platform

**Specs**:
- **OS**: Home Assistant OS
- **vCPUs**: 2
- **RAM**: 2 GB
- **Storage**: rpool
- **Network**: 1 Gb (vmbr0)

**Access**:
- Web UI: https://homeassistant.htsn.io
- API: See [HOMEASSISTANT.md](HOMEASSISTANT.md)

**Special Notes**:
- ❌ No QEMU agent (Home Assistant OS doesn't support it)
- No SSH server by default (access via web terminal)

---

### 111 - lmdev1 (AI/LLM Development)

**Purpose**: AI model development, fine-tuning, inference

**Specs**:
- **OS**: Ubuntu 22.04
- **vCPUs**: 8
- **RAM**: 32 GB
- **Storage**: nvme-mirror1
- **Network**: 1 Gb (vmbr0)

**GPU Passthrough**:
- NVIDIA TITAN RTX (shared with Saltbox, but can be dedicated if needed)

**Installed**:
- CUDA toolkit
- Python 3.11+
- PyTorch, TensorFlow
- Hugging Face transformers

---

### 201 - Copyparty (File Sharing)

**Purpose**: Simple HTTP file sharing server

**Specs**:
- **OS**: Ubuntu 22.04
- **vCPUs**: 2
- **RAM**: 2 GB
- **Storage**: rpool
- **Network**: 1 Gb (vmbr0)

**Access**: https://copyparty.htsn.io

---

### 206 - docker-host (Docker Services)

**Purpose**: General-purpose Docker host for miscellaneous services

**Specs**:
- **OS**: Ubuntu 22.04
- **vCPUs**: 2
- **RAM**: 4 GB
- **Storage**: rpool
- **Network**: 1 Gb (vmbr0)
- **CPU**: `host` passthrough (for x86-64-v3 support)

**Services Running**:
- Excalidraw (excalidraw.htsn.io) - Whiteboard
- Happy Coder relay server (happy.htsn.io) - Self-hosted relay for Happy Coder mobile app
- Pulse (pulse.htsn.io) - Monitoring dashboard

**Docker Compose Files**: `/opt/*/docker-compose.yml`

---

### 300 - gitea-vm (Git Server)

**Purpose**: Self-hosted Git server

**Specs**:
- **OS**: Ubuntu 22.04
- **vCPUs**: 2
- **RAM**: 4 GB
- **Storage**: nvme-mirror3 (PVE2)
- **Network**: 1 Gb (vmbr0)

**Access**: https://git.htsn.io

**Repositories**:
- homelab-docs (this documentation)
- Personal projects
- Private repos

---

### 301 - trading-vm (AI Trading Platform)

**Purpose**: Algorithmic trading system with AI models

**Specs**:
- **OS**: Ubuntu 22.04
- **vCPUs**: 16
- **RAM**: 32 GB
- **Storage**: nvme-mirror3 (PVE2)
- **Network**: 1 Gb (vmbr0)

**GPU Passthrough**:
- NVIDIA RTX A6000 (300W TDP, 48GB VRAM)

**Software**:
- Trading algorithms
- AI models for market prediction
- Real-time data feeds
- Backtesting infrastructure

---

## LXC Container Details

### 200 - Pi-hole (DNS & Ad Blocking)

**Purpose**: Network-wide DNS server and ad blocker

**Type**: LXC (unprivileged)
**OS**: Ubuntu 22.04
**IP**: 10.10.10.10
**Storage**: rpool

**Access**:
- Web UI: http://10.10.10.10/admin
- Public URL: https://pihole.htsn.io

**Configuration**:
- Upstream DNS: Cloudflare (1.1.1.1)
- DHCP: Disabled (router handles DHCP)
- Interface: All interfaces

**Usage**: Set router DNS to 10.10.10.10 for network-wide ad blocking

---

### 202 - Traefik (Reverse Proxy)

**Purpose**: Primary reverse proxy for all public-facing services

**Type**: LXC (unprivileged)
**OS**: Ubuntu 22.04
**IP**: 10.10.10.250
**Storage**: rpool

**Configuration**: `/etc/traefik/`
**Dynamic Configs**: `/etc/traefik/conf.d/*.yaml`

**See**: [TRAEFIK.md](TRAEFIK.md) for complete documentation

**⚠️ Important**: This is the PRIMARY Traefik instance. Do NOT confuse with Saltbox's Traefik (VM 101).

---

### 205 - FindShyt (Custom App)

**Purpose**: Custom application (details TBD)

**Type**: LXC (unprivileged)
**OS**: Ubuntu 22.04
**IP**: 10.10.10.8
**Storage**: rpool

**Access**: https://findshyt.htsn.io

---

## VM Startup Order & Dependencies

### Power-On Sequence

When servers boot (after power failure or restart), VMs/CTs start in this order:

#### PVE (10.10.10.120)

| Order | Wait | VMID | Name | Reason |
|-------|------|------|------|--------|
| **1** | 30s | 100 | TrueNAS | ⚠️ Storage must start first - other VMs depend on NFS |
| **2** | 60s | 101 | Saltbox | Depends on TrueNAS NFS mounts for media |
| **3** | 10s | 105, 110, 111, 201, 206 | Other VMs | General VMs, no critical dependencies |
| **4** | 5s | 200, 202, 205 | Containers | Lightweight, start quickly |

**Configure startup order** (already set):
```bash
# View current config
ssh pve 'qm config 100 | grep -E "startup|onboot"'

# Set startup order (example)
ssh pve 'qm set 100 --onboot 1 --startup order=1,up=30'
ssh pve 'qm set 101 --onboot 1 --startup order=2,up=60'
```

#### PVE2 (10.10.10.102)

| Order | Wait | VMID | Name |
|-------|------|------|------|
| **1** | 10s | 300, 301, 302 | All VMs |

**Less critical** - no dependencies between PVE2 VMs; the equivalent configuration commands are shown below.
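
A sketch of the matching `qm set` calls for PVE2 (same flags as the PVE examples above; it assumes all three VMs share order 1):

```bash
for vmid in 300 301 302; do
    ssh pve2 "qm set $vmid --onboot 1 --startup order=1,up=10"
done
```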

---

## Resource Allocation Summary

### Total Allocated (PVE)

| Resource | Allocated | Physical | % Used |
|----------|-----------|----------|--------|
| **vCPUs** | 56 | 64 (32 cores × 2 threads) | 88% |
| **RAM** | 98 GB | 128 GB | 77% |

**Note**: vCPU overcommit is acceptable (VMs rarely use all cores simultaneously)

### Total Allocated (PVE2)

| Resource | Allocated | Physical | % Used |
|----------|-----------|----------|--------|
| **vCPUs** | 22 | 64 | 34% |
| **RAM** | 44 GB | 128 GB | 34% |

**PVE2** has significant headroom for additional VMs.

---

## Adding a New VM

### Quick Template

```bash
# Create VM
ssh pve 'qm create VMID \
  --name myvm \
  --memory 4096 \
  --cores 2 \
  --net0 virtio,bridge=vmbr0 \
  --scsihw virtio-scsi-pci \
  --scsi0 nvme-mirror1:32 \
  --boot order=scsi0 \
  --ostype l26 \
  --agent enabled=1'

# Attach ISO for installation
ssh pve 'qm set VMID --ide2 local:iso/ubuntu-22.04.iso,media=cdrom'

# Start VM
ssh pve 'qm start VMID'

# Access console
ssh pve 'qm vncproxy VMID'   # Then connect with VNC client
# Or via Proxmox web UI
```

### Cloud-Init Template (Faster)

Use cloud-init for automated VM deployment:

```bash
# Download cloud image
ssh pve 'wget https://cloud-images.ubuntu.com/releases/22.04/release/ubuntu-22.04-server-cloudimg-amd64.img -O /var/lib/vz/template/iso/ubuntu-22.04-cloud.img'

# Create VM
ssh pve 'qm create VMID --name myvm --memory 4096 --cores 2 --net0 virtio,bridge=vmbr0'

# Import disk
ssh pve 'qm importdisk VMID /var/lib/vz/template/iso/ubuntu-22.04-cloud.img nvme-mirror1'

# Attach disk
ssh pve 'qm set VMID --scsi0 nvme-mirror1:vm-VMID-disk-0'

# Add cloud-init drive
ssh pve 'qm set VMID --ide2 nvme-mirror1:cloudinit'

# Set boot disk
ssh pve 'qm set VMID --boot order=scsi0'

# Configure cloud-init (user, SSH key, network)
ssh pve 'qm set VMID --ciuser hutson --sshkeys ~/.ssh/homelab.pub --ipconfig0 ip=10.10.10.XXX/24,gw=10.10.10.1'

# Enable QEMU agent
ssh pve 'qm set VMID --agent enabled=1'

# Resize disk (cloud images are small by default)
ssh pve 'qm resize VMID scsi0 +30G'

# Start VM
ssh pve 'qm start VMID'
```

**Cloud-init VMs boot ready-to-use** with SSH keys, static IP, and user configured.
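
Once one cloud-init VM is dialed in, it can be frozen as a reusable template so future VMs are a clone away (standard `qm` workflow; VMID/NEW_VMID are placeholders):

```bash
# Convert the configured VM into a template (it can no longer be started directly)
ssh pve 'qm template VMID'

# Full-clone new VMs from it, then adjust cloud-init per clone
ssh pve 'qm clone VMID NEW_VMID --name myvm2 --full'
ssh pve 'qm set NEW_VMID --ipconfig0 ip=10.10.10.YYY/24,gw=10.10.10.1'
```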

---

## Adding a New LXC Container

```bash
# Download template (if not already downloaded)
ssh pve 'pveam update'
ssh pve 'pveam available | grep ubuntu'
ssh pve 'pveam download local ubuntu-22.04-standard_22.04-1_amd64.tar.zst'

# Create container
ssh pve 'pct create CTID local:vztmpl/ubuntu-22.04-standard_22.04-1_amd64.tar.zst \
  --hostname mycontainer \
  --memory 2048 \
  --cores 2 \
  --net0 name=eth0,bridge=vmbr0,ip=10.10.10.XXX/24,gw=10.10.10.1 \
  --rootfs local-zfs:8 \
  --unprivileged 1 \
  --features nesting=1 \
  --start 1'

# Set root password
ssh pve 'pct exec CTID -- passwd'

# Add SSH key
ssh pve 'pct exec CTID -- mkdir -p /root/.ssh'
ssh pve 'pct exec CTID -- bash -c "echo \"$(cat ~/.ssh/homelab.pub)\" >> /root/.ssh/authorized_keys"'
ssh pve 'pct exec CTID -- chmod 700 /root/.ssh && chmod 600 /root/.ssh/authorized_keys'
```

---

## GPU Passthrough Configuration

### Current GPU Assignments

| GPU | Location | Passed To | VMID | Purpose |
|-----|----------|-----------|------|---------|
| **NVIDIA Quadro P2000** | PVE | - | - | Proxmox host (Plex transcoding via driver) |
| **NVIDIA TITAN RTX** | PVE | saltbox, lmdev1 | 101, 111 | Media transcoding + AI dev (shared) |
| **NVIDIA RTX A6000** | PVE2 | trading-vm | 301 | AI trading (dedicated) |

### How to Pass GPU to VM

1. **Identify GPU PCI ID**:
```bash
ssh pve 'lspci | grep -i nvidia'
# Example output:
# 81:00.0 VGA compatible controller: NVIDIA Corporation TU102 [TITAN RTX] (rev a1)
# 81:00.1 Audio device: NVIDIA Corporation TU102 High Definition Audio Controller (rev a1)
```

2. **Pass GPU to VM** (include both VGA and Audio):
```bash
ssh pve 'qm set VMID -hostpci0 81:00.0,pcie=1'
# If multi-function device (GPU + Audio), use:
ssh pve 'qm set VMID -hostpci0 81:00,pcie=1'
```

3. **Configure VM for GPU**:
```bash
# Set machine type to q35
ssh pve 'qm set VMID --machine q35'

# Set BIOS to OVMF (UEFI)
ssh pve 'qm set VMID --bios ovmf'

# Add EFI disk
ssh pve 'qm set VMID --efidisk0 nvme-mirror1:1,format=raw,efitype=4m,pre-enrolled-keys=1'
```

4. **Reboot VM** and install NVIDIA drivers inside the VM
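
Two quick sanity checks around this procedure (standard tooling, nothing homelab-specific): IOMMU must be active on the host before step 2 works, and `nvidia-smi` inside the guest confirms the card arrived:

```bash
# On the host: confirm IOMMU/VT-d is enabled (kernel cmdline + BIOS setting)
ssh pve 'dmesg | grep -e DMAR -e IOMMU | head'

# Inside the VM after driver install: the GPU should enumerate
nvidia-smi
```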

**See**: [GPU-PASSTHROUGH.md](#) (coming soon) for detailed guide

---

## Backup Priority

See [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md) for complete backup plan.

### Critical VMs (Must Backup)

| Priority | VMID | Name | Reason |
|----------|------|------|--------|
| 🔴 **CRITICAL** | 100 | truenas | All storage lives here - catastrophic if lost |
| 🟡 **HIGH** | 101 | saltbox | Complex media stack config |
| 🟡 **HIGH** | 110 | homeassistant | Home automation config |
| 🟡 **HIGH** | 300 | gitea-vm | Git repositories (code, docs) |
| 🟡 **HIGH** | 301 | trading-vm | Trading algorithms and AI models |

### Medium Priority

| VMID | Name | Notes |
|------|------|-------|
| 200 | pihole | Easy to rebuild, but DNS config valuable |
| 202 | traefik | Config files backed up separately |

### Low Priority (Ephemeral/Rebuildable)

| VMID | Name | Notes |
|------|------|-------|
| 105 | fs-dev | Development - code is in Git |
| 111 | lmdev1 | Ephemeral development |
| 201 | copyparty | Simple app, easy to redeploy |
| 206 | docker-host | Docker Compose files backed up separately |

---

## Quick Reference Commands

```bash
# List all VMs
ssh pve 'qm list'
ssh pve2 'qm list'

# List all containers
ssh pve 'pct list'

# Start/stop VM
ssh pve 'qm start VMID'
ssh pve 'qm stop VMID'
ssh pve 'qm shutdown VMID'   # Graceful

# Start/stop container
ssh pve 'pct start CTID'
ssh pve 'pct stop CTID'
ssh pve 'pct shutdown CTID'   # Graceful

# VM console
ssh pve 'qm terminal VMID'

# Container console
ssh pve 'pct enter CTID'

# Clone VM
ssh pve 'qm clone VMID NEW_VMID --name newvm'

# Delete VM
ssh pve 'qm destroy VMID'

# Delete container
ssh pve 'pct destroy CTID'
```

---

## Related Documentation

- [STORAGE.md](STORAGE.md) - Storage pool assignments
- [SSH-ACCESS.md](SSH-ACCESS.md) - How to access VMs
- [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md) - VM backup strategy
- [POWER-MANAGEMENT.md](POWER-MANAGEMENT.md) - VM resource optimization
- [NETWORK.md](NETWORK.md) - Which bridge to use for new VMs

---

**Last Updated**: 2025-12-22
@@ -0,0 +1 @@
{"web":{"client_id":"693027753314-hdjfnvfnarlcnehba6u8plbehv78rfh9.apps.googleusercontent.com","project_id":"spheric-method-482514-f8","auth_uri":"https://accounts.google.com/o/oauth2/auth","token_uri":"https://oauth2.googleapis.com/token","auth_provider_x509_cert_url":"https://www.googleapis.com/oauth2/v1/certs","client_secret":"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","redirect_uris":["https://my.home-assistant.io/redirect/oauth"]}}
41
data/scripts/internet-watchdog.sh
Normal file
@@ -0,0 +1,41 @@
#!/bin/bash
# Internet Watchdog - Reboots if internet is unreachable for 5 minutes
LOG_FILE="/var/log/internet-watchdog.log"
FAIL_COUNT=0
MAX_FAILS=5
CHECK_INTERVAL=60

log() {
    echo "$(date "+%Y-%m-%d %H:%M:%S") - $1" >> "$LOG_FILE"
}

check_internet() {
    for endpoint in 1.1.1.1 8.8.8.8 208.67.222.222; do
        if ping -c 1 -W 5 "$endpoint" > /dev/null 2>&1; then
            return 0
        fi
    done
    return 1
}

log "Watchdog started"

while true; do
    if check_internet; then
        if [ $FAIL_COUNT -gt 0 ]; then
            log "Internet restored after $FAIL_COUNT failures"
        fi
        FAIL_COUNT=0
    else
        FAIL_COUNT=$((FAIL_COUNT + 1))
        log "Internet check failed ($FAIL_COUNT/$MAX_FAILS)"

        if [ $FAIL_COUNT -ge $MAX_FAILS ]; then
            log "CRITICAL: $MAX_FAILS consecutive failures - REBOOTING"
            sync
            sleep 2
            reboot
        fi
    fi
    sleep $CHECK_INTERVAL
done
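#
# Deployment note: this script loops forever, so it needs a supervisor to
# start it at boot. A minimal systemd unit could look like the following
# (the unit name and target are assumptions; the diff does not show how
# the script is actually launched):
#
#   # /etc/systemd/system/internet-watchdog.service
#   [Unit]
#   Description=Internet watchdog - reboot after 5 min without connectivity
#   After=network-online.target
#
#   [Service]
#   ExecStart=/data/scripts/internet-watchdog.sh
#   Restart=always
#
#   [Install]
#   WantedBy=multi-user.target
#
# Enable with: systemctl enable --now internet-watchdog
# The same pattern applies to memory-monitor.sh below.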
23
data/scripts/memory-monitor.sh
Normal file
@@ -0,0 +1,23 @@
#!/bin/bash
LOG_DIR="/data/logs"
LOG_FILE="$LOG_DIR/memory-history.log"
mkdir -p "$LOG_DIR"

while true; do
    # Rotate if over 10MB
    if [ -f "$LOG_FILE" ]; then
        SIZE=$(wc -c < "$LOG_FILE" 2>/dev/null || echo 0)
        if [ "$SIZE" -gt 10485760 ]; then
            mv "$LOG_FILE" "$LOG_FILE.old"
        fi
    fi

    echo "========== $(date +%Y-%m-%d\ %H:%M:%S) ==========" >> "$LOG_FILE"
    echo "--- MEMORY ---" >> "$LOG_FILE"
    free -m >> "$LOG_FILE"
    echo "--- TOP MEMORY PROCESSES ---" >> "$LOG_FILE"
    ps -eo pid,rss,comm --sort=-rss | head -12 >> "$LOG_FILE"
    echo "" >> "$LOG_FILE"

    sleep 600
done