Compare commits

...

25 Commits

Author SHA1 Message Date
Hutson
38a7a2c52e Auto-sync: 20260123-015626 2026-01-23 01:56:27 -05:00
Hutson
52d8f2f133 Add central configuration reference section
Reference ~/.secrets, ~/.hosts, and ~/.ssh/config for centralized
credentials and host management. Includes homelab-specific variables
for Syncthing, Home Assistant, n8n, and Cloudflare.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-20 15:13:16 -05:00
Hutson
80b6ab43d3 Auto-sync: 20260120-145048 2026-01-20 14:50:49 -05:00
Hutson
6932ee1ca9 Auto-sync: 20260116-161159 2026-01-16 16:12:19 -05:00
Hutson
42cfdd8552 Auto-sync: 20260116-155016 2026-01-16 15:50:17 -05:00
Hutson
d54447949e Add Oura Ring integration and automations documentation
- Document HACS and Oura Ring v2 integration setup
- Add OAuth credentials for Oura developer portal
- Document 9 Oura automations:
  - Sleep/wake detection (HR-based thermostat control)
  - Health alerts (low readiness, SpO2, fever detection)
  - Sleep comfort (temperature-based thermostat adjustment)
  - Activity reminders (sedentary alert)
- Add Nest thermostat to integrations list
- Mark completed TODOs

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-16 15:25:21 -05:00
Hutson
4535969566 Auto-sync: 20260116-152013 2026-01-16 15:20:14 -05:00
Hutson
8c1cbf3dac Auto-sync: 20260116-150510 2026-01-16 15:05:12 -05:00
Hutson
d38de8bfb1 Auto-sync: 20260115-110247 2026-01-15 11:02:48 -05:00
Hutson
db7ac68312 Auto-sync: 20260114-183121 2026-01-14 18:31:23 -05:00
Hutson
bd3ed4e4ef Auto-sync: 20260114-002941 2026-01-14 00:29:42 -05:00
Hutson
e7c8d7f86f Auto-sync: 20260113-134342 2026-01-13 13:43:43 -05:00
Hutson
1dcb7ff9e5 Auto-sync: 20260113-093539 2026-01-13 09:35:40 -05:00
Hutson
f234fe96cb Auto-sync: 20260113-015009 2026-01-13 01:50:10 -05:00
Hutson
1abd618b52 Auto-sync: 20260113-013507 2026-01-13 01:35:08 -05:00
Hutson
35fba5a6ae Auto-sync: 20260113-012006 2026-01-13 01:20:07 -05:00
Hutson
eb698f0c38 Auto-sync: 20260111-164757 2026-01-11 16:47:58 -05:00
Hutson
d66ed5c55a Auto-sync: 20260111-161755 2026-01-11 16:17:56 -05:00
Hutson
5ac698db0d Auto-sync: 20260107-000953 2026-01-07 00:09:54 -05:00
Hutson
7eacc846e6 Auto-sync: 20260105-213809 2026-01-05 21:38:10 -05:00
Hutson
b832cc9e57 Auto-sync: 20260105-212307 2026-01-05 21:23:08 -05:00
Hutson
54a71124ae Auto-sync: 20260105-172251 2026-01-05 17:22:52 -05:00
Hutson
eddd98c57f Auto-sync: 20260105-122831 2026-01-05 12:28:33 -05:00
Hutson
56b82df497 Complete Phase 2 documentation: Add HARDWARE, SERVICES, MONITORING, MAINTENANCE
Phase 2 documentation implementation:
- Created HARDWARE.md: Complete hardware inventory (servers, GPUs, storage, network cards)
- Created SERVICES.md: Service inventory with URLs, credentials, health checks (25+ services)
- Created MONITORING.md: Health monitoring recommendations, alert setup, implementation plan
- Created MAINTENANCE.md: Regular procedures, update schedules, testing checklists
- Updated README.md: Added all Phase 2 documentation links
- Updated CLAUDE.md: Cleaned up to quick reference only (1340→377 lines)

All detailed content now in specialized documentation files with cross-references.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-23 00:34:21 -05:00
Hutson
23e9df68c9 Update Happy Coder docs with complete setup flow and troubleshooting
- Expand Mobile Access Setup with full authentication steps
  (HAPPY_SERVER_URL, happy auth login, happy connect claude, local claude login)
- Fix launchd path: ~/Library/LaunchAgents/ not /Library/LaunchDaemons/
- Add Common Issues troubleshooting table with fixes for:
  - Invalid API key (Claude not logged in locally)
  - Failed to start daemon (stale lock files)
  - Sessions not showing (missing HAPPY_SERVER_URL)
  - Slow responses (Cloudflare proxy enabled)
- Update DNS note: Cloudflare proxy disabled for WebSocket performance
- Add .zshrc to Files & Configuration table

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-21 13:28:30 -05:00
33 changed files with 9873 additions and 1022 deletions


@@ -0,0 +1,5 @@
# This directory is a Syncthing folder marker.
# Do not delete.
folderID: homelab
created: 2025-12-23T00:39:52-05:00

AUTOMATION-WELCOME-HOME.md (new file, +190)

@@ -0,0 +1,190 @@
# Welcome Home Automation
## Overview
Automatically turns on lights when you arrive home after sunset, creating a warm welcome.
## Status
- **Created:** 2026-01-14
- **State:** Active (enabled)
- **Entity ID:** `automation.welcome_home`
- **Last Triggered:** Never (newly created)
## How It Works
### Trigger
- Activates when **person.hutson** enters **zone.home** (100m radius)
- GPS tracking via device_tracker.honor (Honor phone)
### Conditions
The automation only runs when it's dark:
- After sunset (with 30-minute early start) **OR**
- Before sunrise
This prevents lights from turning on during daytime arrivals.
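To sanity-check the dark-hours condition, the `sun.sun` entity can be inspected directly; a quick sketch (assumes `HA_TOKEN` is exported the same way as in the examples further down):
```bash
# What the sun-based condition sees right now
curl -s -H "Authorization: Bearer $HA_TOKEN" \
  "http://10.10.10.210:8123/api/states/sun.sun" | \
  python3 -c "import json, sys; s = json.load(sys.stdin); print(s['state'], s['attributes']['next_setting'], s['attributes']['next_rising'])"
```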
### Actions
When triggered, the following lights turn on:
| Light | Brightness | Purpose |
|-------|------------|---------|
| **Living Room** | 75% | Main ambient lighting |
| **Living Room Lamp** | 60% | Softer accent light |
| **Kitchen** | 80% | Task lighting for entry |
## Climate Control Note
No climate/heating entities were found in your Home Assistant setup. To add heating control in the future:
1. Integrate your thermostat/HVAC with Home Assistant
2. Add a climate action to this automation (see customization below)
## Customization
### Adjust Trigger Distance
The home zone has a 100m radius. To change this:
```bash
# In Home Assistant UI: Settings → Areas → Zones → Home
# Or via API:
curl -X PUT \
-H "Authorization: Bearer $HA_TOKEN" \
-H "Content-Type: application/json" \
-d '{"latitude": 35.6542655, "longitude": -78.7417665, "radius": 150}' \
"http://10.10.10.210:8123/api/config/zone/zone.home"
```
### Add More Lights
To add additional lights (e.g., Office, Front Porch):
```bash
HA_TOKEN="your-token-here"
# Get current config
curl -s -H "Authorization: Bearer $HA_TOKEN" \
"http://10.10.10.210:8123/api/config/automation/config/welcome_home" > automation.json
# Edit automation.json to add more light actions
# Then update:
curl -X POST \
-H "Authorization: Bearer $HA_TOKEN" \
-H "Content-Type: application/json" \
-d @automation.json \
"http://10.10.10.210:8123/api/config/automation/config/welcome_home"
```
### Add Climate Control (when available)
Add this action to the automation:
```json
{
  "service": "climate.set_temperature",
  "target": {
    "entity_id": "climate.thermostat"
  },
  "data": {
    "temperature": 72,
    "hvac_mode": "heat"
  }
}
```
### Use a Scene Instead
To activate a predefined scene instead of individual lights:
```json
{
  "service": "scene.turn_on",
  "target": {
    "entity_id": "scene.living_room_relax"
  }
}
```
Available scenes include:
- `scene.living_room_relax`
- `scene.living_room_dimmed`
- `scene.all_nightlight`
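To list every scene entity the instance currently exposes, one option (assumes `HA_TOKEN` is exported as in the testing examples below):
```bash
# List all scene entities currently defined in Home Assistant
curl -s -H "Authorization: Bearer $HA_TOKEN" \
  "http://10.10.10.210:8123/api/states" | \
  python3 -c "import json, sys; [print(s['entity_id']) for s in json.load(sys.stdin) if s['entity_id'].startswith('scene.')]"
```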
## Testing
### Manual Trigger
```bash
HA_TOKEN="eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiIwZThjZmJjMzVlNDA0NzYwOTMzMjg3MTQ5ZjkwOGU2NyIsImlhdCI6MTc2NTk5MjQ4OCwiZXhwIjoyMDgxMzUyNDg4fQ.r743tsb3E5NNlrwEEu9glkZdiI4j_3SKIT1n5PGUytY"
curl -X POST \
-H "Authorization: Bearer $HA_TOKEN" \
-H "Content-Type: application/json" \
-d '{"entity_id": "automation.welcome_home"}' \
"http://10.10.10.210:8123/api/services/automation/trigger"
```
### Check Last Triggered
```bash
curl -s -H "Authorization: Bearer $HA_TOKEN" \
"http://10.10.10.210:8123/api/states/automation.welcome_home" | \
python3 -c "import json, sys; print(json.load(sys.stdin)['attributes']['last_triggered'])"
```
## Disable/Enable
### Disable
```bash
curl -X POST \
-H "Authorization: Bearer $HA_TOKEN" \
-H "Content-Type: application/json" \
-d '{"entity_id": "automation.welcome_home"}' \
"http://10.10.10.210:8123/api/services/automation/turn_off"
```
### Enable
```bash
curl -X POST \
-H "Authorization: Bearer $HA_TOKEN" \
-H "Content-Type: application/json" \
-d '{"entity_id": "automation.welcome_home"}' \
"http://10.10.10.210:8123/api/services/automation/turn_on"
```
## Monitoring
### View in Home Assistant UI
1. Go to http://10.10.10.210:8123
2. Settings → Automations & Scenes → Automations
3. Find "Welcome Home"
### Check Automation State
The automation is currently: **ON**
### Troubleshooting
If the automation doesn't trigger:
1. Check person.hutson GPS accuracy (should be < 50 m) - see the check below
2. Verify zone.home coordinates match your actual home location
3. Confirm the arrival happened during dark hours (the sun condition blocks daytime triggers)
4. Review Home Assistant logs for errors
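A shell sketch of the first two checks (assumes `HA_TOKEN` is exported as in the examples above):
```bash
# Dump the presence, zone, and tracker entities the trigger depends on
for e in person.hutson zone.home device_tracker.honor; do
  echo "=== $e ==="
  curl -s -H "Authorization: Bearer $HA_TOKEN" \
    "http://10.10.10.210:8123/api/states/$e" | python3 -m json.tool
done
```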
## Related Documentation
- [Home Assistant API](./HOMEASSISTANT.md)
- [Personal Assistant Integration](../personal-assistant/CLAUDE.md)
- [Smart Home Control](../personal-assistant/docs/services-matrix.md)
## Future Enhancements
Potential improvements:
- Add motion sensor override (don't trigger if motion already detected)
- Integrate with calendar (different scenes for work vs personal time)
- Add climate control when thermostat is integrated
- Create "leaving home" automation to turn off lights
- Add notification to phone when automation triggers
- Adjust brightness based on time of day
- Add office lights during work hours
---
*Created: 2026-01-14*
*Last Updated: 2026-01-14*

BACKUP-STRATEGY.md (new file, +358)

@@ -0,0 +1,358 @@
# Backup Strategy
## 🚨 Current Status: CRITICAL GAPS IDENTIFIED
This document outlines the backup strategy for the homelab infrastructure. **As of 2025-12-22, there are significant gaps in backup coverage that need to be addressed.**
## Executive Summary
### What We Have ✅
- **Syncthing**: File synchronization across 5+ devices
- **ZFS on TrueNAS**: Copy-on-write filesystem with snapshot capability (not yet configured)
- **Proxmox**: Built-in backup capabilities (not yet configured)
### What We DON'T Have 🚨
- ❌ No documented VM/CT backups
- ❌ No ZFS snapshot schedule
- ❌ No offsite backups
- ❌ No disaster recovery plan
- ❌ No tested restore procedures
- ❌ No configuration backups
**Risk Level**: HIGH - A catastrophic failure could result in significant data loss.
---
## Current State Analysis
### Syncthing (File Synchronization)
**What it is**: Real-time file sync across devices
**What it is NOT**: A backup solution
| Folder | Devices | Size | Protected? |
|--------|---------|------|------------|
| documents | Mac Mini, MacBook, TrueNAS, Windows PC, Phone | 11 GB | ⚠️ Sync only |
| downloads | Mac Mini, TrueNAS | 38 GB | ⚠️ Sync only |
| pictures | Mac Mini, MacBook, TrueNAS, Phone | Unknown | ⚠️ Sync only |
| notes | Mac Mini, MacBook, TrueNAS, Phone | Unknown | ⚠️ Sync only |
| config | Mac Mini, MacBook, TrueNAS | Unknown | ⚠️ Sync only |
**Limitations**:
- ❌ Accidental deletion → deleted everywhere
- ❌ Ransomware/corruption → spreads everywhere
- ❌ No point-in-time recovery
- ❌ No version history (unless file versioning enabled - not documented)
**Verdict**: Syncthing provides redundancy and availability, NOT backup protection.
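Whether per-folder file versioning is enabled can be confirmed from the Syncthing REST API; a sketch, assuming the Mac Mini instance and its API key exported as `SYNCTHING_API_KEY` (see the Syncthing API endpoints table elsewhere in these docs):
```bash
# Print the versioning type configured for each folder ("none" means no version history)
curl -s -H "X-API-Key: $SYNCTHING_API_KEY" "http://127.0.0.1:8384/rest/config/folders" | \
  python3 -c "import json, sys; [print(f['id'], '->', f['versioning'].get('type') or 'none') for f in json.load(sys.stdin)]"
```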
### ZFS on TrueNAS (Potential Backup Target)
**Current Status**: ❓ Unknown - snapshots may or may not be configured
**Needs Investigation**:
```bash
# Check if snapshots exist
ssh truenas 'zfs list -t snapshot'
# Check if automated snapshots are configured
ssh truenas 'cat /etc/cron.d/zfs-auto-snapshot' || echo "Not configured"
# Check snapshot schedule via TrueNAS API/UI
```
**If configured**, ZFS snapshots provide:
- ✅ Point-in-time recovery
- ✅ Protection against accidental deletion
- ✅ Fast rollback capability
- ⚠️ Still single location (no offsite protection)
### Proxmox VM/CT Backups
**Current Status**: ❓ Unknown - no backup jobs documented
**Needs Investigation**:
```bash
# Check backup configuration
ssh pve 'pvesh get /cluster/backup'
# Check if any backups exist
ssh pve 'ls -lh /var/lib/vz/dump/'
ssh pve2 'ls -lh /var/lib/vz/dump/'
```
**Critical VMs Needing Backup**:
| VM/CT | VMID | Priority | Notes |
|-------|------|----------|-------|
| TrueNAS | 100 | 🔴 CRITICAL | All storage lives here |
| Saltbox | 101 | 🟡 HIGH | Media stack, complex config |
| homeassistant | 110 | 🟡 HIGH | Home automation config |
| gitea-vm | 300 | 🟡 HIGH | Git repositories |
| pihole | 200 | 🟢 MEDIUM | DNS config (easy to rebuild) |
| traefik | 202 | 🟢 MEDIUM | Reverse proxy config |
| trading-vm | 301 | 🟡 HIGH | AI trading platform |
| lmdev1 | 111 | 🟢 LOW | Development (ephemeral) |
---
## Recommended Backup Strategy
### Tier 1: Local Snapshots (IMPLEMENT IMMEDIATELY)
**ZFS Snapshots on TrueNAS**
Schedule automatic snapshots for all datasets:
| Dataset | Frequency | Retention |
|---------|-----------|-----------|
| vault/documents | Every 15 min | 1 hour |
| vault/documents | Hourly | 24 hours |
| vault/documents | Daily | 30 days |
| vault/documents | Weekly | 12 weeks |
| vault/documents | Monthly | 12 months |
**Implementation**:
```bash
# Via TrueNAS UI: Storage → Snapshots → Add
# Or via CLI:
ssh truenas 'zfs snapshot vault/documents@daily-$(date +%Y%m%d)'
```
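The one-off snapshot above does not implement the retention tiers in the table; periodic snapshot tasks in the TrueNAS UI are the intended mechanism, but a rough cron-based sketch of two of the tiers on TrueNAS looks like this (dataset, retention counts, and GNU `head -n -N` behaviour are assumptions):
```bash
# Hourly and daily snapshots of vault/documents, with crude pruning of old hourly snapshots
0 * * * * zfs snapshot vault/documents@hourly-$(date +\%Y\%m\%d\%H)
0 0 * * * zfs snapshot vault/documents@daily-$(date +\%Y\%m\%d)
30 0 * * * zfs list -H -t snapshot -o name -s creation -r vault/documents | grep '@hourly-' | head -n -24 | xargs -r -n1 zfs destroy
```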
**Proxmox VM Backups**
Configure weekly backups to local storage:
```bash
# Create backup job via Proxmox UI:
# Datacenter → Backup → Add
# - Schedule: Weekly (Sunday 2 AM)
# - Storage: local-zfs or nvme-mirror1
# - Mode: Snapshot (fast)
# - Retention: 4 backups
```
**Or via CLI**:
```bash
ssh pve 'pvesh create /cluster/backup --schedule "sun 02:00" --storage local-zfs --mode snapshot --prune-backups keep-last=4'
```
### Tier 2: Offsite Backups (CRITICAL GAP)
**Option A: Cloud Storage (Recommended)**
Use **rclone** or **restic** to sync critical data to cloud:
| Provider | Cost | Pros | Cons |
|----------|------|------|------|
| Backblaze B2 | $6/TB/mo | Cheap, reliable | Egress fees |
| AWS S3 Glacier | $4/TB/mo | Very cheap storage | Slow retrieval |
| Wasabi | $6.99/TB/mo | No egress fees | Minimum 90-day retention |
**Implementation Example (Backblaze B2)**:
```bash
# Install on TrueNAS
ssh truenas 'pkg install rclone restic'
# Configure B2
rclone config # Follow prompts for B2
# Daily backup of critical folders (add this line to crontab, e.g. via crontab -e)
0 3 * * * rclone sync /mnt/vault/documents b2:homelab-backup/documents --transfers 4
```
**Option B: Offsite TrueNAS Replication**
- Set up second TrueNAS at friend/family member's house
- Use ZFS replication to sync snapshots (see the sketch below)
- Requires: Static IP or Tailscale, trust
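A minimal replication sketch for Option B, assuming a receiving host reachable as `offsite-truenas` and a destination pool named `backup` (both placeholders):
```bash
# Run on the primary TrueNAS; the first run sends the full dataset,
# later runs should use incremental sends (zfs send -R -i <prev> <new>)
SNAP="vault/documents@offsite-$(date +%Y%m%d)"
zfs snapshot -r "$SNAP"
zfs send -R "$SNAP" | ssh offsite-truenas zfs receive -uF backup/documents
```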
**Option C: USB Drive Rotation**
- Weekly backup to external USB drive
- Rotate 2-3 drives (one always offsite)
- Manual but simple
### Tier 3: Configuration Backups
**Proxmox Configuration**
```bash
# Backup /etc/pve (configs are already in cluster filesystem)
# But also backup to external location:
ssh pve 'tar czf /tmp/pve-config-$(date +%Y%m%d).tar.gz /etc/pve /etc/network/interfaces /etc/systemd/system/*.service'
# Copy to safe location
scp pve:/tmp/pve-config-*.tar.gz ~/Backups/proxmox/
```
**VM-Specific Configs**
- Traefik configs: `/etc/traefik/` on CT 202
- Saltbox configs: `/srv/git/saltbox/` on VM 101
- Home Assistant: `/config/` on VM 110
**Script to backup all configs**:
```bash
#!/bin/bash
# Save as ~/bin/backup-homelab-configs.sh
DATE=$(date +%Y%m%d)
BACKUP_DIR=~/Backups/homelab-configs/$DATE
mkdir -p "$BACKUP_DIR"
# Proxmox configs
ssh pve 'tar czf - /etc/pve /etc/network' > "$BACKUP_DIR/pve-config.tar.gz"
ssh pve2 'tar czf - /etc/pve /etc/network' > "$BACKUP_DIR/pve2-config.tar.gz"
# Traefik (LXC 202)
ssh pve 'pct exec 202 -- tar czf - /etc/traefik' > "$BACKUP_DIR/traefik-config.tar.gz"
# Saltbox (VM 101)
ssh saltbox 'tar czf - /srv/git/saltbox' > "$BACKUP_DIR/saltbox-config.tar.gz"
# Home Assistant (VM 110) - qm guest exec wraps stdout in JSON, so SSH into HA OS directly
ssh root@10.10.10.210 'tar czf - /config' > "$BACKUP_DIR/homeassistant-config.tar.gz"
echo "Configs backed up to $BACKUP_DIR"
```
---
## Disaster Recovery Scenarios
### Scenario 1: Single VM Failure
**Impact**: Medium
**Recovery Time**: 30-60 minutes
1. Restore from Proxmox backup:
```bash
ssh pve 'qmrestore /path/to/backup.vma.zst VMID'
```
2. Start VM and verify
3. Update IP if needed
### Scenario 2: TrueNAS Failure
**Impact**: CATASTROPHIC (all storage lost)
**Recovery Time**: Unknown - NO PLAN
**Current State**: 🚨 NO RECOVERY PLAN
**Needed**:
- Offsite backup of critical datasets
- Documented ZFS pool creation steps
- Share configuration export
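Even before offsite backups exist, the pool layout can be captured now so a rebuild document can be written from it; a sketch:
```bash
# Capture the current pool and dataset layout for a future rebuild procedure
ssh truenas 'zpool status vault' > truenas-pool-layout.txt
ssh truenas 'zfs list -r -o name,used,avail,mountpoint vault' > truenas-datasets.txt
ssh truenas 'zfs get -r -s local all vault' > truenas-dataset-properties.txt
```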
### Scenario 3: Complete PVE Server Failure
**Impact**: SEVERE
**Recovery Time**: 4-8 hours
**Current State**: ⚠️ PARTIALLY RECOVERABLE
**Needed**:
- VM backups stored on TrueNAS or PVE2
- Proxmox reinstall procedure
- Network config documentation
### Scenario 4: Complete Site Disaster (Fire/Flood)
**Impact**: TOTAL LOSS
**Recovery Time**: Unknown
**Current State**: 🚨 NO RECOVERY PLAN
**Needed**:
- Offsite backups (cloud or physical)
- Critical data prioritization
- Restore procedures
---
## Action Plan
### Immediate (Next 7 Days)
- [ ] **Audit existing backups**: Check if ZFS snapshots or Proxmox backups exist
```bash
ssh truenas 'zfs list -t snapshot'
ssh pve 'ls -lh /var/lib/vz/dump/'
```
- [ ] **Enable ZFS snapshots**: Configure via TrueNAS UI for critical datasets
- [ ] **Configure Proxmox backup jobs**: Weekly backups of critical VMs (100, 101, 110, 300)
- [ ] **Test restore**: Pick one VM, back it up, restore it to verify process works
### Short-term (Next 30 Days)
- [ ] **Set up offsite backup**: Choose provider (Backblaze B2 recommended)
- [ ] **Install backup tools**: rclone or restic on TrueNAS
- [ ] **Configure daily cloud sync**: Critical folders to cloud storage
- [ ] **Document restore procedures**: Step-by-step guides for each scenario
### Long-term (Next 90 Days)
- [ ] **Implement monitoring**: Alerts for backup failures
- [ ] **Quarterly restore test**: Verify backups actually work
- [ ] **Backup rotation policy**: Automate old backup cleanup
- [ ] **Configuration backup automation**: Weekly cron job
---
## Monitoring & Validation
### Backup Health Checks
```bash
# Check last ZFS snapshot
ssh truenas 'zfs list -t snapshot -o name,creation -s creation | tail -5'
# Check Proxmox backup status
ssh pve 'pvesh get /cluster/backup-info/not-backed-up'
# Check cloud sync status (if using rclone)
ssh truenas 'rclone ls b2:homelab-backup | wc -l'
```
### Alerts to Set Up
- Email alert if no snapshot created in 24 hours
- Email alert if Proxmox backup fails
- Email alert if cloud sync fails
- Weekly backup status report
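A minimal cron-able sketch of the first alert, assuming outgoing mail works and using the same placeholder address as the MAINTENANCE.md examples:
```bash
#!/bin/bash
# Alert if the newest snapshot under 'vault' is older than 24 hours
LAST=$(ssh truenas 'zfs list -H -p -r -t snapshot -o creation -s creation vault | tail -1')
NOW=$(date +%s)
if [ -z "$LAST" ] || [ $((NOW - LAST)) -gt 86400 ]; then
  echo "No ZFS snapshot created on vault in the last 24 hours" | \
    mail -s "Backup alert: stale snapshots" hutson@example.com
fi
```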
---
## Cost Estimate
**Monthly Backup Costs**:
| Component | Cost | Notes |
|-----------|------|-------|
| Local storage (already owned) | $0 | Using existing TrueNAS |
| Proxmox backups (local) | $0 | Using existing storage |
| Cloud backup (1 TB) | $6-10/mo | Backblaze B2 or Wasabi |
| **Total** | **~$10/mo** | Minimal cost for peace of mind |
**One-time**:
- External USB drives (3× 4 TB): ~$300 (optional, for rotation backups)
---
## Related Documentation
- [STORAGE.md](STORAGE.md) - ZFS pool layouts and capacity
- [VMS.md](VMS.md) - VM inventory and prioritization
- [DISASTER-RECOVERY.md](#) - Recovery procedures (coming soon)
---
**Last Updated**: 2025-12-22
**Status**: 🚨 CRITICAL GAPS - IMMEDIATE ACTION REQUIRED


@@ -36,12 +36,12 @@ Investigated UPS power limit issues across both Proxmox servers.
[Unit]
Description=Disable KSM (Kernel Same-page Merging)
After=multi-user.target
[Service]
Type=oneshot
ExecStart=/bin/sh -c "echo 0 > /sys/kernel/mm/ksm/run"
RemainAfterExit=yes
[Install]
WantedBy=multi-user.target
```
@@ -108,12 +108,12 @@ curl -X POST -H "X-API-Key: xxx" http://localhost:20910/rest/system/restart
[Unit]
Description=Set CPU governor to powersave with balance_power EPP
After=multi-user.target
[Service]
Type=oneshot
ExecStart=/bin/bash -c 'for gov in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo powersave > "$gov"; done; for epp in /sys/devices/system/cpu/cpu*/cpufreq/energy_performance_preference; do echo balance_power > "$epp"; done'
RemainAfterExit=yes
[Install]
WantedBy=multi-user.target
```
@@ -127,12 +127,12 @@ curl -X POST -H "X-API-Key: xxx" http://localhost:20910/rest/system/restart
[Unit]
Description=Set CPU governor to schedutil for power savings
After=multi-user.target
[Service]
Type=oneshot
ExecStart=/bin/bash -c 'for gov in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo schedutil > "$gov"; done'
RemainAfterExit=yes
[Install]
WantedBy=multi-user.target
```
@@ -194,4 +194,4 @@ Not useful when:
- `general_profit` is negative
### What is Memory Ballooning?
Guest-cooperative memory management. Hypervisor can request VMs to give back unused RAM. Independent from KSMD. Both are Proxmox/KVM memory optimization features but serve different purposes.

CLAUDE.md (1336 lines changed)

File diff suppressed because it is too large

GATEWAY.md (new file, +339)

@@ -0,0 +1,339 @@
# UniFi Gateway (UCG-Fiber)
Documentation for the UniFi Cloud Gateway Fiber (10.10.10.1) - the primary network gateway and router.
## Overview
| Property | Value |
|----------|-------|
| **Device** | UniFi Cloud Gateway Fiber (UCG-Fiber) |
| **IP Address** | 10.10.10.1 |
| **SSH User** | root |
| **SSH Auth** | SSH key (`~/.ssh/id_ed25519`) |
| **Host Aliases** | `ucg-fiber`, `gateway` |
| **Firmware** | v4.4.9 (as of 2026-01-02) |
| **UniFi Core** | 4.4.19 |
| **RAM** | 2.9 GB (shared with UniFi apps) |
---
## SSH Access
SSH key authentication is configured. Use host aliases:
```bash
# Quick access
ssh ucg-fiber 'hostname'
ssh gateway 'free -m'
# Or use IP directly
ssh root@10.10.10.1 'uptime'
```
**Note**: SSH key may need re-deployment after firmware updates if UniFi clears authorized_keys.
---
## Monitoring Services
Two custom monitoring services run on the gateway to prevent and diagnose issues.
### Internet Watchdog Service
**Purpose**: Auto-reboots gateway if internet connectivity is lost for 5+ minutes
**Location**: `/data/scripts/internet-watchdog.sh`
**How it works** (sketch below):
1. Pings 1.1.1.1, 8.8.8.8, 208.67.222.222 every 60 seconds
2. If all three fail, increments failure counter
3. After 5 consecutive failures (~5 minutes), triggers reboot
4. Logs all activity to `/var/log/internet-watchdog.log`
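The deployed script lives at `/data/scripts/internet-watchdog.sh` and is not reproduced here; a minimal sketch of the same logic (targets, threshold, and log format as described above):
```bash
#!/bin/sh
# Illustrative watchdog loop: reboot after 5 consecutive minutes with no reachable target
FAILS=0
LOG=/var/log/internet-watchdog.log
while true; do
  if ping -c1 -W2 1.1.1.1 >/dev/null 2>&1 || ping -c1 -W2 8.8.8.8 >/dev/null 2>&1 || ping -c1 -W2 208.67.222.222 >/dev/null 2>&1; then
    [ "$FAILS" -gt 0 ] && echo "$(date '+%F %T') - Internet restored after $FAILS failures" >> "$LOG"
    FAILS=0
  else
    FAILS=$((FAILS + 1))
    echo "$(date '+%F %T') - Internet check failed ($FAILS/5)" >> "$LOG"
    if [ "$FAILS" -ge 5 ]; then
      echo "$(date '+%F %T') - Rebooting gateway" >> "$LOG"
      reboot
    fi
  fi
  sleep 60
done
```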
**Commands**:
```bash
# Check service status
ssh ucg-fiber 'systemctl status internet-watchdog'
# View recent logs
ssh ucg-fiber 'tail -50 /var/log/internet-watchdog.log'
# Stop temporarily (if troubleshooting)
ssh ucg-fiber 'systemctl stop internet-watchdog'
# Restart
ssh ucg-fiber 'systemctl restart internet-watchdog'
```
**Log Format**:
```
2026-01-02 22:45:01 - Watchdog started
2026-01-02 22:46:01 - Internet check failed (1/5)
2026-01-02 22:47:01 - Internet restored after 1 failures
```
---
### Memory Monitor Service
**Purpose**: Logs memory usage and top processes every 10 minutes for diagnostics
**Location**: `/data/scripts/memory-monitor.sh`
**Log File**: `/data/logs/memory-history.log`
**How it works**:
1. Every 10 minutes, logs current memory usage (`free -m`)
2. Logs top 12 memory-consuming processes
3. Auto-rotates log when it exceeds 10MB (keeps one .old file)
**Commands**:
```bash
# Check service status
ssh ucg-fiber 'systemctl status memory-monitor'
# View recent memory history
ssh ucg-fiber 'tail -100 /data/logs/memory-history.log'
# Check current memory usage
ssh ucg-fiber 'free -m'
# See top memory consumers right now
ssh ucg-fiber 'ps -eo pid,rss,comm --sort=-rss | head -12'
```
**Log Format**:
```
========== 2026-01-02 22:30:00 ==========
--- MEMORY ---
total used free shared buff/cache available
Mem: 2892 1890 102 456 899 1002
Swap: 512 88 424
--- TOP MEMORY PROCESSES ---
PID RSS COMMAND
1234 327456 unifi-protect
2345 252108 mongod
3456 236544 java
...
```
---
## Known Memory Consumers
| Process | Typical Memory | Purpose |
|---------|----------------|---------|
| unifi-protect | ~320 MB | Camera/NVR management |
| mongod | ~250 MB | UniFi configuration database |
| java (controller) | ~230 MB | UniFi Network controller |
| postgres | ~180 MB | PostgreSQL database |
| unifi-core | ~150 MB | UniFi OS core |
| tailscaled | ~80 MB | Tailscale VPN |
**Total available**: ~2.9 GB
**Typical usage**: ~1.8-2.0 GB (leaves ~1 GB free)
**Warning threshold**: <500 MB free
**Critical**: <200 MB free or swap >50% used
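A one-liner to spot-check against these thresholds:
```bash
# Available memory and swap used, in MB
ssh ucg-fiber "free -m | awk '/^Mem:/ {print \"available_mb\", \$7} /^Swap:/ {print \"swap_used_mb\", \$3}'"
```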
---
## Disabled Services
The following services were disabled to reduce memory usage:
| Service | Memory Saved | Reason Disabled |
|---------|--------------|-----------------|
| UniFi Connect | ~200 MB | Not needed (cameras use Protect) |
To re-enable if needed:
```bash
ssh ucg-fiber 'systemctl enable unifi-connect && systemctl start unifi-connect'
```
---
## Common Issues
### Gateway Freeze / Network Loss
**Symptoms**:
- All devices lose internet
- Cannot ping 10.10.10.1
- Physical reboot required
**Root Cause**: Memory exhaustion causing soft lockup
**Prevention**:
1. Internet watchdog auto-reboots after 5 min outage
2. Memory monitor logs help identify runaway processes
3. UniFi Connect disabled to free ~200 MB
**Post-Incident Analysis**:
```bash
# Check memory history for spike before freeze
ssh ucg-fiber 'grep -B5 "Swap:" /data/logs/memory-history.log | tail -50'
# Check watchdog logs
ssh ucg-fiber 'cat /var/log/internet-watchdog.log'
# Check system logs for errors
ssh ucg-fiber 'dmesg | tail -100'
ssh ucg-fiber 'journalctl -p err --since "1 hour ago"'
```
---
### High Memory Usage
**Check current state**:
```bash
ssh ucg-fiber 'free -m && echo "---" && ps -eo pid,rss,comm --sort=-rss | head -15'
```
**If swap is heavily used**:
```bash
# Check swap usage
ssh ucg-fiber 'cat /proc/swaps'
# See what's in swap
ssh ucg-fiber 'for pid in $(ls /proc | grep -E "^[0-9]+$"); do
swap=$(grep VmSwap /proc/$pid/status 2>/dev/null | awk "{print \$2}");
[ "$swap" -gt 10000 ] 2>/dev/null && echo "$pid: ${swap}kB - $(cat /proc/$pid/comm)";
done | sort -t: -k2 -rn | head -10'
```
**Consider reboot if**:
- Available memory <200 MB
- Swap usage >300 MB
- System becoming unresponsive
---
### Tailscale Issues
**Check Tailscale status**:
```bash
ssh ucg-fiber 'tailscale status'
```
**Common errors and fixes**:
| Error | Fix |
|-------|-----|
| `DNS resolution failed` | Check upstream DNS (Pi-hole at 10.10.10.10) |
| `TLS handshake failed` | Usually temporary; Tailscale auto-reconnects |
| `Not connected` | `ssh ucg-fiber 'tailscale up'` |
---
## Firmware Updates
**Check current version**:
```bash
ssh ucg-fiber 'ubnt-systool version'
```
**Update process**:
1. Check UniFi site for latest stable firmware
2. Download via UI or CLI
3. Schedule update during low-usage time
**After update**:
- Verify SSH key still works
- Check custom services still running
- Verify Tailscale reconnects
**Re-deploy SSH key if needed**:
```bash
ssh-copy-id -i ~/.ssh/id_ed25519 root@10.10.10.1
```
---
## Service Locations
| File | Purpose |
|------|---------|
| `/data/scripts/internet-watchdog.sh` | Watchdog script |
| `/data/scripts/memory-monitor.sh` | Memory monitor script |
| `/etc/systemd/system/internet-watchdog.service` | Watchdog systemd unit |
| `/etc/systemd/system/memory-monitor.service` | Memory monitor systemd unit |
| `/var/log/internet-watchdog.log` | Watchdog log |
| `/data/logs/memory-history.log` | Memory history log |
**Note**: `/data/` persists across firmware updates. `/var/log/` may not.
---
## Quick Reference Commands
```bash
# System status
ssh ucg-fiber 'uptime && free -m'
# Check both monitoring services
ssh ucg-fiber 'systemctl status internet-watchdog memory-monitor'
# Memory history (last hour)
ssh ucg-fiber 'tail -60 /data/logs/memory-history.log'
# Watchdog activity
ssh ucg-fiber 'tail -20 /var/log/internet-watchdog.log'
# Network devices (ARP table)
ssh ucg-fiber 'cat /proc/net/arp'
# Tailscale status
ssh ucg-fiber 'tailscale status'
# System logs
ssh ucg-fiber 'journalctl -p warning --since "1 hour ago" | head -50'
```
---
## Backup Considerations
Custom services in `/data/scripts/` persist across firmware updates but may need:
- Systemd services re-enabled after major updates
- Script permissions re-applied if wiped
**Backup critical files**:
```bash
# Copy scripts locally for reference
scp ucg-fiber:/data/scripts/*.sh ~/Projects/homelab/data/scripts/
```
---
## Related Documentation
- [SSH-ACCESS.md](SSH-ACCESS.md) - SSH configuration and host aliases
- [NETWORK.md](NETWORK.md) - Network architecture
- [MONITORING.md](MONITORING.md) - Overall monitoring strategy
- [HOMEASSISTANT.md](HOMEASSISTANT.md) - Home Assistant integration
---
## Incident History
### 2025-12-27 to 2025-12-29: Gateway Freeze
**Timeline**:
- Dec 7: Firmware update to v4.4.9
- Dec 24: Last healthy system logs
- Dec 27-29: "No internet detected" errors in logs
- Dec 29+: Complete silence (gateway frozen)
- Jan 2: Physical reboot restored access
**Root Cause**: Memory exhaustion causing soft lockup (no crash dump saved)
**Resolution**:
- Deployed internet-watchdog service
- Deployed memory-monitor service
- Disabled UniFi Connect (~200 MB saved)
- Configured SSH key auth
---
**Last Updated**: 2026-01-02

HARDWARE.md (new file, +455)

@@ -0,0 +1,455 @@
# Hardware Inventory
Complete hardware specifications for all homelab equipment.
## Servers
### PVE (10.10.10.120) - Primary Proxmox Server
#### CPU
- **Model**: AMD Ryzen Threadripper PRO 3975WX
- **Cores**: 32 cores / 64 threads
- **Base Clock**: 3.5 GHz
- **Boost Clock**: 4.2 GHz
- **TDP**: 280W
- **Architecture**: Zen 2 (7nm)
- **Socket**: sTRX4
- **Features**: ECC support, PCIe 4.0
#### RAM
- **Capacity**: 128 GB
- **Type**: DDR4 ECC Registered
- **Speed**: Unknown (needs investigation)
- **Channels**: 8-channel (quad-channel per socket)
- **Idle Power**: ~30-40W
#### Storage
**OS/VM Storage:**
| Pool | Devices | Type | Capacity | Purpose |
|------|---------|------|----------|---------|
| `nvme-mirror1` | 2x Sabrent Rocket Q NVMe | ZFS Mirror | 3.6 TB usable | High-performance VM storage |
| `nvme-mirror2` | 2x Kingston SFYRD 2TB NVMe | ZFS Mirror | 1.8 TB usable | Additional fast VM storage |
| `rpool` | 2x Samsung 870 QVO 4TB SSD | ZFS Mirror | 3.6 TB usable | Proxmox OS, containers, backups |
**Total Storage**: ~9 TB usable
#### GPUs
| Model | Slot | VRAM | TDP | Purpose | Passed To |
|-------|------|------|-----|---------|-----------|
| NVIDIA Quadro P2000 | PCIe slot 1 | 5 GB GDDR5 | 75W | Plex transcoding | Host |
| NVIDIA TITAN RTX | PCIe slot 2 | 24 GB GDDR6 | 280W | AI workloads | Saltbox (101), lmdev1 (111) |
**Total GPU Power**: 75W + 280W = 355W (under load)
#### Network Cards
| Interface | Model | Speed | Purpose | Bridge |
|-----------|-------|-------|---------|--------|
| enp1s0 | Intel I210 (onboard) | 1 Gb | Management | vmbr0 |
| enp35s0f0 | Intel X520 (dual-port SFP+) | 10 Gb | High-speed LXC | vmbr1 |
| enp35s0f1 | Intel X520 (dual-port SFP+) | 10 Gb | High-speed VM | vmbr2 |
**10Gb Transceivers**: Intel FTLX8571D3BCV (SFP+ 10GBASE-SR, 850nm, multimode)
#### Storage Controllers
| Model | Interface | Purpose |
|-------|-----------|---------|
| LSI SAS2308 HBA | PCIe 3.0 x8 | Passed to TrueNAS VM for EMC enclosure |
| Samsung NVMe controller | PCIe | Passed to TrueNAS VM for ZFS caching |
#### Motherboard
- **Model**: Unknown - needs investigation
- **Chipset**: AMD TRX40
- **Form Factor**: ATX/EATX
- **PCIe Slots**: Multiple PCIe 4.0 slots
- **Features**: IOMMU support, ECC memory
#### Power Supply
- **Model**: Unknown
- **Wattage**: Likely 1000W+ (needs investigation)
- **Type**: ATX, 80+ certification unknown
#### Cooling
- **CPU Cooler**: Unknown - likely large tower or AIO
- **Case Fans**: Unknown quantity
- **Note**: CPU temps 70-80°C under load (healthy)
---
### PVE2 (10.10.10.102) - Secondary Proxmox Server
#### CPU
- **Model**: AMD Ryzen Threadripper PRO 3975WX
- **Specs**: Same as PVE (32C/64T, 280W TDP)
#### RAM
- **Capacity**: 128 GB DDR4 ECC
- **Same specs as PVE**
#### Storage
| Pool | Devices | Type | Capacity | Purpose |
|------|---------|------|----------|---------|
| `nvme-mirror3` | 2x NVMe (model unknown) | ZFS Mirror | Unknown | High-performance VM storage |
| `local-zfs2` | 2x WD Red 6TB HDD | ZFS Mirror | ~6 TB usable | Bulk/archival storage (spins down) |
**HDD Spindown**: Configured for 30-min idle spindown (saves ~10-16W)
#### GPUs
| Model | Slot | VRAM | TDP | Purpose | Passed To |
|-------|------|------|-----|---------|-----------|
| NVIDIA RTX A6000 | PCIe slot 1 | 48 GB GDDR6 | 300W | AI trading workloads | trading-vm (301) |
#### Network Cards
| Interface | Model | Speed | Purpose |
|-----------|-------|-------|---------|
| nic1 | Unknown (onboard) | 1 Gb | Management |
**Note**: MTU set to 9000 for jumbo frames
#### Motherboard
- **Model**: Unknown
- **Chipset**: AMD TRX40
- **Similar to PVE**
---
## Network Equipment
### UniFi Cloud Gateway Fiber (UCG-Fiber)
- **Model**: UniFi Cloud Gateway Fiber
- **IP**: 10.10.10.1
- **Ports**: Multiple 1Gb + SFP+ uplink
- **Features**: Router, firewall, VPN, IDS/IPS
- **MTU**: 9216 (supports jumbo frames)
- **Tailscale**: Installed for VPN failover
### Switches
**Details needed** - investigate current switch setup:
- 10Gb switch for high-speed connections?
- 1Gb switch for general devices?
- PoE capabilities?
```bash
# Check what's connected to 10Gb interfaces
ssh pve 'ip link show enp35s0f0'
ssh pve 'ip link show enp35s0f1'
```
---
## Storage Hardware
### EMC Storage Enclosure
**See [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md) for complete details**
- **Model**: EMC KTN-STL4 (or similar)
- **Form Factor**: 4U rackmount
- **Drive Bays**: 25x 3.5" SAS/SATA
- **Controllers**: Dual LCC (Link Control Cards)
- **Connection**: SAS via LSI SAS2308 HBA
- **Passed to**: TrueNAS VM (VMID 100)
**Current Status**:
- LCC A: Active (working)
- LCC B: Failed (replacement ordered)
**Drive Inventory**: Unknown - needs audit
```bash
# Get drive list from TrueNAS
ssh truenas 'smartctl --scan'
ssh truenas 'lsblk'
```
### NVMe Drives
| Model | Quantity | Capacity | Location | Pool |
|-------|----------|----------|----------|------|
| Sabrent Rocket Q | 2 | Unknown | PVE | nvme-mirror1 |
| Kingston SFYRD | 2 | 2 TB each | PVE | nvme-mirror2 |
| Unknown model | 2 | Unknown | PVE2 | nvme-mirror3 |
| Samsung (model unknown) | 1 | Unknown | TrueNAS (passed) | ZFS cache |
### SSDs
| Model | Quantity | Capacity | Location | Pool |
|-------|----------|----------|----------|------|
| Samsung 870 QVO | 2 | 4 TB each | PVE | rpool |
### HDDs
| Model | Quantity | Capacity | Location | Pool |
|-------|----------|----------|----------|------|
| WD Red | 2 | 6 TB each | PVE2 | local-zfs2 |
| Unknown (in EMC) | Unknown | Unknown | TrueNAS | vault |
---
## UPS
### Current UPS
| Specification | Value |
|---------------|-------|
| **Model** | CyberPower OR2200PFCRT2U |
| **Capacity** | 2200VA / 1320W |
| **Form Factor** | 2U rackmount |
| **Input** | NEMA 5-15P (rewired from 5-20P) |
| **Outlets** | 2x 5-20R + 6x 5-15R |
| **Output** | PFC Sinewave |
| **Runtime** | ~15-20 min @ 33% load |
| **Interface** | USB (connected to PVE) |
**See [UPS.md](UPS.md) for configuration details**
---
## Client Devices
### Mac Mini (Hutson's Workstation)
- **Model**: Unknown generation
- **CPU**: Unknown
- **RAM**: Unknown
- **Storage**: Unknown
- **Network**: 1Gb Ethernet (en0) - MTU 9000
- **Tailscale IP**: 100.108.89.58
- **Local IP**: 10.10.10.125 (static)
- **Purpose**: Primary workstation, Happy Coder daemon host
### MacBook (Mobile)
- **Model**: Unknown
- **Network**: Wi-Fi + Ethernet adapter
- **Tailscale IP**: Unknown
- **Purpose**: Mobile work, development
### Windows PC
- **Model**: Unknown
- **CPU**: Unknown
- **Network**: 1Gb Ethernet
- **IP**: 10.10.10.150
- **Purpose**: Gaming, Windows development, Syncthing node
### Phone (Android)
- **Model**: Unknown
- **IP**: 10.10.10.54 (when on Wi-Fi)
- **Purpose**: Syncthing mobile node, Happy Coder client
---
## Rack Layout (If Applicable)
**Needs documentation** - Current rack configuration unknown
Suggested format:
```
U42: Blank panel
U41: UPS (CyberPower 2U)
U40: UPS (CyberPower 2U)
U39: Switch (10Gb)
U38-U35: EMC Storage Enclosure (4U)
U34: PVE Server
U33: PVE2 Server
...
```
---
## Power Consumption
### Measured Power Draw
| Component | Idle | Typical | Peak | Notes |
|-----------|------|---------|------|-------|
| PVE Server | 250-350W | 500W | 750W | CPU + GPUs + storage |
| PVE2 Server | 200-300W | 400W | 600W | CPU + GPU + storage |
| Network Gear | ~50W | ~50W | ~50W | Router + switches |
| **Total** | **500-700W** | **~950W** | **~1400W** | Exceeds UPS under peak load |
**UPS Capacity**: 1320W
**Typical Load**: 33-50% (safe margin)
**Peak Load**: Can exceed UPS capacity temporarily (acceptable)
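The table values are estimates; actual draw can be spot-checked through the UPS via NUT (UPS name `cyberpower` as used in the MAINTENANCE.md health check):
```bash
# Current load and charge reported by the UPS attached to PVE
ssh pve 'upsc cyberpower@localhost | grep -E "ups.load|ups.realpower|battery.charge"'
```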
### Power Optimizations Applied
**See [POWER-MANAGEMENT.md](POWER-MANAGEMENT.md) for details**
- KSMD disabled: ~60-80W saved
- CPU governors: ~60-120W saved
- Syncthing rescans: ~60-80W saved
- HDD spindown: ~10-16W saved when idle
- **Total savings**: ~150-300W
---
## Thermal Management
### CPU Cooling
**PVE & PVE2**:
- CPU cooler: Unknown model
- Thermal paste: Unknown, likely needs refresh if temps >85°C
- Target temp: 70-80°C under load
- Max safe: 90°C Tctl (Threadripper PRO spec)
### GPU Cooling
All GPUs run on their stock (factory) coolers; no custom cooling is fitted:
- TITAN RTX: 2-3W idle, 280W load
- RTX A6000: 11W idle, 300W load
- Quadro P2000: 25W constant (Plex active)
### Case Airflow
**Unknown** - needs investigation:
- Case model?
- Fan configuration?
- Positive or negative pressure?
---
## Cable Management
### Network Cables
| Connection | Type | Length | Speed |
|------------|------|--------|-------|
| PVE → Switch (10Gb) | OM3 fiber | Unknown | 10Gb |
| PVE2 → Router | Cat6 | Unknown | 1Gb |
| Mac Mini → Switch | Cat6 | Unknown | 1Gb |
| TrueNAS → EMC | SAS cable | Unknown | 6Gb/s |
### Power Cables
**Critical**: All servers on UPS battery-backed outlets
---
## Maintenance Schedule
### Annual Maintenance
- [ ] Clean dust from servers (every 6-12 months)
- [ ] Check thermal paste on CPUs (every 2-3 years)
- [ ] Test UPS battery runtime (annually)
- [ ] Verify all fans operational
- [ ] Check for bulging capacitors on PSUs
### Drive Health
```bash
# Check SMART status on all drives
ssh pve 'smartctl -a /dev/nvme0'
ssh pve2 'smartctl -a /dev/sda'
ssh truenas 'smartctl --scan | while read dev type; do echo "=== $dev ==="; smartctl -a $dev | grep -E "Model|Serial|Health|Reallocated|Current_Pending"; done'
```
### Temperature Monitoring
```bash
# Check all temps (needs lm-sensors installed)
ssh pve 'sensors'
ssh pve2 'sensors'
```
---
## Warranty & Purchase Info
**Needs documentation**:
- When were servers purchased?
- Where were components bought?
- Any warranties still active?
- Replacement part sources?
---
## Upgrade Path
### Short-term Upgrades (< 6 months)
- [ ] 20A circuit for UPS (restore original 5-20P plug)
- [ ] Document missing hardware specs
- [ ] Label all cables
- [ ] Create rack diagram
### Medium-term Upgrades (6-12 months)
- [ ] Additional 10Gb NIC for PVE2?
- [ ] More NVMe storage?
- [ ] Upgrade network switches?
- [ ] Replace EMC enclosure with newer model?
### Long-term Upgrades (1-2 years)
- [ ] CPU upgrade to newer Threadripper?
- [ ] RAM expansion to 256GB?
- [ ] Additional GPU for AI workloads?
- [ ] Migrate to PCIe 5.0 storage?
---
## Investigation Needed
High-priority items to document:
- [ ] Get exact motherboard model (both servers)
- [ ] Get PSU model and wattage
- [ ] CPU cooler models
- [ ] Network switch models and configuration
- [ ] Complete drive inventory in EMC enclosure
- [ ] RAM speed and timings
- [ ] Case models
- [ ] Exact NVMe models for all drives
**Commands to gather info**:
```bash
# Motherboard
ssh pve 'dmidecode -t baseboard'
# CPU details
ssh pve 'lscpu'
# RAM details
ssh pve 'dmidecode -t memory | grep -E "Size|Speed|Manufacturer"'
# Storage devices
ssh pve 'lsblk -o NAME,SIZE,TYPE,TRAN,MODEL'
# Network cards
ssh pve 'lspci | grep -i network'
# GPU details
ssh pve 'lspci | grep -i vga'
ssh pve 'nvidia-smi -L' # If nvidia-smi available
```
---
## Related Documentation
- [VMS.md](VMS.md) - VM resource allocation
- [STORAGE.md](STORAGE.md) - Storage pools and usage
- [POWER-MANAGEMENT.md](POWER-MANAGEMENT.md) - Power optimizations
- [UPS.md](UPS.md) - UPS configuration
- [NETWORK.md](NETWORK.md) - Network configuration
- [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md) - Storage enclosure details
---
**Last Updated**: 2025-12-22
**Status**: ⚠️ Incomplete - many specs need investigation


@@ -130,16 +130,232 @@ curl -s -H "Authorization: Bearer $HA_TOKEN" \
- **Philips Hue** - Lights
- **Sonos** - Speakers
- **Nest** - Thermostat (climate.thermostat)
- **Motion Sensors** - Various locations
- **NUT (Network UPS Tools)** - UPS monitoring (added 2025-12-21)
- **VeSync** - Levoit humidifier control (added 2026-01-14)
- **HomeKit Controller** - Homebridge bridge for Govee sensors (added 2026-01-14)
- **Oura Ring v2** - Sleep/health tracking via HACS (added 2026-01-16)
- **HACS** - Home Assistant Community Store for custom integrations
### NUT / UPS Integration
Monitors the CyberPower OR2200PFCRT2U UPS connected to PVE.
**Connection:**
- Host: 10.10.10.120
- Port: 3493
- Username: upsmon
- Password: upsmon123
**Entities:**
| Entity ID | Description |
|-----------|-------------|
| `sensor.cyberpower_battery_charge` | Battery percentage |
| `sensor.cyberpower_load` | Current load % |
| `sensor.cyberpower_input_voltage` | Input voltage |
| `sensor.cyberpower_output_voltage` | Output voltage |
| `sensor.cyberpower_status` | Status (Online, On Battery, etc.) |
| `sensor.cyberpower_status_data` | Raw status (OL, OB, LB, CHRG) |
**Dashboard Card Example:**
```yaml
type: entities
title: UPS Status
entities:
  - entity: sensor.cyberpower_status
    name: Status
  - entity: sensor.cyberpower_battery_charge
    name: Battery
  - entity: sensor.cyberpower_load
    name: Load
  - entity: sensor.cyberpower_input_voltage
    name: Input Voltage
```
### VeSync / Levoit LV600S Integration
Controls the Levoit LV600S humidifier via VeSync cloud API.
**Account:** vesync@htsn.io
**Entities:**
| Entity ID | Description |
|-----------|-------------|
| `humidifier.lv600s` | Main humidifier on/off control |
| `sensor.lv600s_humidity` | Built-in humidity sensor (reads high near mist) |
| `number.lv600s_mist_level` | Mist intensity (1-9) |
| `switch.lv600s_display` | Display on/off |
| `binary_sensor.lv600s_low_water` | Low water warning |
| `binary_sensor.lv600s_water_tank_lifted` | Tank removed detection |
### Oura Ring Integration (HACS)
Monitors sleep, activity, and health metrics from Oura Ring via HACS custom integration.
**Installation:** HACS → Integrations → Oura Ring v2
**OAuth Credentials (Oura Developer Portal):**
- Client ID: `e925a2a0-7767-4390-8b80-3a385a5b3ddc`
- Client Secret: `xFSFSfUPihet1foWQRLAMUQbL9-kChqT_CjtHHpAxZs`
- Redirect URI: `https://my.home-assistant.io/redirect/oauth`
**Key Entities:**
| Entity ID | Description |
|-----------|-------------|
| `sensor.oura_ring_readiness_score` | Daily readiness (0-100) |
| `sensor.oura_ring_sleep_score` | Sleep quality (0-100) |
| `sensor.oura_ring_current_heart_rate` | Current HR (bpm) |
| `sensor.oura_ring_average_sleep_heart_rate` | Average HR during sleep |
| `sensor.oura_ring_lowest_sleep_heart_rate` | Lowest HR during sleep |
| `sensor.oura_ring_temperature_deviation` | Body temp deviation (°C) |
| `sensor.oura_ring_spo2_average` | Blood oxygen (%) |
| `sensor.oura_ring_steps` | Daily step count |
| `sensor.oura_ring_activity_score` | Activity score (0-100) |
**Troubleshooting:**
- If sensors show "unavailable", check config entry state: `setup_retry` usually means API returned no data
- Force sync the Oura app on your phone, then reload the integration
- The integration polls Oura's API periodically; data updates after ring syncs to cloud
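A quick freshness check for the key sensor (assumes `HA_TOKEN` is exported):
```bash
# Current readiness value and when it was last updated
curl -s -H "Authorization: Bearer $HA_TOKEN" \
  "http://10.10.10.210:8123/api/states/sensor.oura_ring_readiness_score" | \
  python3 -c "import json, sys; s = json.load(sys.stdin); print(s['state'], s['last_updated'])"
```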
### HomeKit Controller / Homebridge Integration
Connects to Homebridge running on Mac Mini to access BLE devices (Govee sensors).
**Homebridge Details:**
- Host: Mac Mini (localhost)
- Port: 51826
- PIN: 031-45-154
- Config: `~/.homebridge/config.json`
- Logs: `~/.homebridge/homebridge.log`
- LaunchAgent: `~/Library/LaunchAgents/com.homebridge.server.plist`
**Govee H5074 Entities:**
| Entity ID | Description |
|-----------|-------------|
| `sensor.goveeh5074_5059_humidity` | Room humidity (accurate reading) |
| `sensor.goveeh5074_5059_temperature` | Room temperature |
| `sensor.goveeh5074_5059_battery` | Sensor battery level |
**Homebridge Management:**
```bash
# Check status
launchctl list | grep homebridge
# View logs
tail -f ~/.homebridge/homebridge.log
# Restart Homebridge
launchctl stop com.homebridge.server
launchctl start com.homebridge.server
# Stop Homebridge
launchctl unload ~/Library/LaunchAgents/com.homebridge.server.plist
# Start Homebridge
launchctl load ~/Library/LaunchAgents/com.homebridge.server.plist
```
## Automations
TODO: Document key automations
### Guitar Room Humidity Control
Maintains 45-47% humidity for guitar storage (Lowden recommends 49% ±2%).
**Automations:**
| Automation | Trigger | Action |
|------------|---------|--------|
| `guitar_room_humidity_low_turn_on_humidifier` | Govee H5074 < 45% | Turn ON humidifier, set mist to 6 |
| `guitar_room_humidity_reached_turn_off_humidifier` | Govee H5074 > 47% | Turn OFF humidifier |
**Why two thresholds (hysteresis):**
- Prevents rapid on/off cycling
- 45% turn-on, 47% turn-off creates a 2% buffer
- Target range: 45-47% (conservatively below Lowden's 49% spec)
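To confirm both halves of the hysteresis pair are enabled and firing, a sketch (assumes `HA_TOKEN` is exported):
```bash
# State and last trigger time for both humidity automations
for a in guitar_room_humidity_low_turn_on_humidifier guitar_room_humidity_reached_turn_off_humidifier; do
  curl -s -H "Authorization: Bearer $HA_TOKEN" \
    "http://10.10.10.210:8123/api/states/automation.$a" | \
    python3 -c "import json, sys; s = json.load(sys.stdin); print(s['entity_id'], s['state'], s['attributes'].get('last_triggered'))"
done
```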
### Oura Ring Health & Sleep Automations
Uses Oura Ring biometrics for smart thermostat control and health alerts.
**Sleep/Wake Detection:**
| Automation | Trigger | Conditions | Action |
|------------|---------|------------|--------|
| `oura_sleep_detected_bedtime_mode` | HR < 55 bpm | Home, after 10pm | Thermostat → 66°F, front door light off, Telegram notify |
| `oura_wake_up_detected_morning_mode` | HR > 65 bpm | Home, 5-11am, thermostat < 68°F | Thermostat → 69°F, Telegram notify |
**Health Alerts:**
| Automation | Trigger | Action |
|------------|---------|--------|
| `oura_low_readiness_alert` | 8am daily, readiness < 70 | Telegram: suggest rest day |
| `oura_spo2_health_alert` | SpO2 < 94% | Urgent Telegram: health warning |
| `oura_fever_detection_alert` | Temp deviation > 1°C | Telegram: possible illness alert |
| `oura_sedentary_reminder` | 2pm weekdays, steps < 500 | Telegram: reminder to move |
**Sleep Comfort & Recovery:**
| Automation | Trigger | Conditions | Action |
|------------|---------|------------|--------|
| `oura_poor_sleep_recovery_mode` | 7am daily | Home, sleep score < 70 | Thermostat → 71°F (warmer for recovery) |
| `oura_sleep_temp_adjustment_too_hot` | Temp deviation > +0.5°C | Home, 10pm-6am, HR < 60 | Thermostat → 64°F |
| `oura_sleep_temp_adjustment_too_cold` | Temp deviation < -0.3°C | Home, 10pm-6am, HR < 60 | Thermostat → 68°F |
**Notification Setup:**
All notifications use `rest_command.notify_telegram` - ensure this is configured in `configuration.yaml`:
```yaml
rest_command:
  notify_telegram:
    url: "https://api.telegram.org/bot<TOKEN>/sendMessage"
    method: POST
    content_type: "application/json"
    payload: '{"chat_id": "<CHAT_ID>", "text": "{{ message }}"}'
```
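Once configured, the command can be exercised end to end with a service call (assumes `HA_TOKEN` is exported):
```bash
# Send a test message through rest_command.notify_telegram
curl -X POST \
  -H "Authorization: Bearer $HA_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"message": "Test notification from Home Assistant"}' \
  "http://10.10.10.210:8123/api/services/rest_command/notify_telegram"
```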
## SSH Access (Terminal & SSH Add-on)
The Terminal & SSH add-on provides remote shell access to Home Assistant OS.
**Connection:**
```bash
ssh root@10.10.10.210 -p 22
```
**Authentication:** SSH key from Mac Mini (`~/.ssh/id_ed25519.pub`)
**Hostname:** `core-ssh`
**Features:**
- Direct shell access to Home Assistant OS
- Access to Home Assistant CLI (`ha` command)
- File system access for debugging
## MCP Server Integration
Home Assistant has a built-in Model Context Protocol (MCP) Server integration for AI assistant connectivity.
**Status:** Enabled (configured with "Assist" service)
**Endpoint:** `http://10.10.10.210:8123/api/mcp`
**Claude Code Configuration:** Added to `~/.cursor/mcp.json`:
```json
{
  "homeassistant": {
    "type": "http",
    "url": "http://10.10.10.210:8123/api/mcp",
    "headers": {
      "Authorization": "Bearer <HA_API_TOKEN>"
    }
  }
}
```
**Note:** The MCP server uses the Assist API to expose entities and services to AI clients.
## TODO
- [ ] Set static IP (currently DHCP at .210, should be .110)
- [ ] Add API token to this document
- [ ] Document installed integrations
- [ ] Document automations
- [x] Add API token to this document
- [x] Document installed integrations
- [x] Document automations
- [ ] Set up Traefik reverse proxy (ha.htsn.io)
- [x] Install Terminal & SSH add-on
- [x] Enable MCP Server integration


@@ -45,7 +45,7 @@
| 10.10.10.1 | router | Gateway/Firewall |
| 10.10.10.102 | pve2 | Proxmox Server 2 |
| 10.10.10.120 | pve | Proxmox Server 1 (Primary) |
| 10.10.10.123 | mac-mini | Mac Mini (Syncthing node) |
| 10.10.10.125 | mac-mini | Mac Mini (Syncthing node) |
| 10.10.10.150 | windows-pc | Windows PC (Syncthing node) |
| 10.10.10.147 | macbook | MacBook Pro (Syncthing node) |
| 10.10.10.200 | truenas | TrueNAS (Storage/Syncthing hub) |


@@ -45,6 +45,7 @@ This document tracks all IP addresses in the homelab infrastructure.
|------|------|------------|---------|--------|
| 300 | gitea-vm | 10.10.10.220 | Git server | Running |
| 301 | trading-vm | 10.10.10.221 | AI trading platform (RTX A6000) | Running |
| 302 | docker-host2 | 10.10.10.207 | Docker services (n8n, future apps) | Running |
## Workstations & Personal Devices
@@ -69,6 +70,10 @@ This document tracks all IP addresses in the homelab infrastructure.
| CopyParty | cp.htsn.io | 10.10.10.201:3923 | Traefik-Primary |
| LMDev | lmdev.htsn.io | 10.10.10.111 | Traefik-Primary |
| Excalidraw | excalidraw.htsn.io | 10.10.10.206:8080 | Traefik-Primary |
| MetaMCP | metamcp.htsn.io | 10.10.10.207:12008 | Traefik-Primary |
| n8n | n8n.htsn.io | 10.10.10.207:5678 | Traefik-Primary |
| PA API | pa.htsn.io | 10.10.10.207:8401 | Traefik-Primary (Tailscale only) |
| Crafty Controller | mc.htsn.io | 10.10.10.207:8443 | Traefik-Primary |
| Plex | plex.htsn.io | 10.10.10.100:32400 | Traefik-Saltbox |
| Sonarr | sonarr.htsn.io | 10.10.10.100:8989 | Traefik-Saltbox |
| Radarr | radarr.htsn.io | 10.10.10.100:7878 | Traefik-Saltbox |
@@ -92,6 +97,7 @@ This document tracks all IP addresses in the homelab infrastructure.
- .200 - TrueNAS
- .201 - CopyParty
- .206 - Docker-host
- .207 - Docker-host2
- .220 - Gitea
- .221 - Trading VM
- .250 - Traefik-Primary
@@ -110,7 +116,7 @@ This document tracks all IP addresses in the homelab infrastructure.
- 10.10.10.148 - 10.10.10.149 (2 IPs)
- 10.10.10.151 - 10.10.10.199 (49 IPs)
- 10.10.10.202 - 10.10.10.205 (4 IPs)
- 10.10.10.207 - 10.10.10.219 (13 IPs)
- 10.10.10.208 - 10.10.10.219 (12 IPs)
- 10.10.10.222 - 10.10.10.249 (28 IPs)
- 10.10.10.251 - 10.10.10.254 (4 IPs)
@@ -123,6 +129,19 @@ This document tracks all IP addresses in the homelab infrastructure.
| Portainer Agent | 9001 | Remote management from other Portainer |
| Gotenberg | 3000 | PDF generation API |
## Docker Host 2 Services (10.10.10.207) - PVE2
| Service | Port | Purpose |
|---------|------|---------|
| PA API | 8401 | Personal Assistant API (pa.htsn.io) - Tailscale only |
| MetaMCP | 12008 | MCP Aggregator/Gateway (metamcp.htsn.io) |
| n8n | 5678 | Workflow automation |
| Crafty Controller | 8443 | Minecraft server management (mc.htsn.io) |
| Minecraft Java | 25565 | Minecraft Java Edition server |
| Minecraft Bedrock | 19132/udp | Minecraft Bedrock Edition (Geyser) |
| Trading Redis | 6379 | Redis for trading platform |
| Trading TimescaleDB | 5433 | TimescaleDB for trading platform |
## Syncthing API Endpoints
| Device | IP | Port | API Key |
@@ -132,6 +151,16 @@ This document tracks all IP addresses in the homelab infrastructure.
| Android Phone | 10.10.10.54 | 8384 | Xxz3jDT4akUJe6psfwZsbZwG2LhfZuDM |
| TrueNAS | 10.10.10.200 | 8384 | (check TrueNAS config) |
## Mac Mini Services (10.10.10.125)
| Service | Port | Purpose |
|---------|------|---------|
| MCP Bridge | 8400 | HTTP bridge for MCP tool execution (PA API backend) |
| Beeper Desktop | 23373 | Message aggregation (Telegram, iMessage, SMS) |
| Proton Bridge IMAP | 1143 | Personal email access |
| Proton Bridge SMTP | 1025 | Personal email sending |
| Syncthing | 8384 | File sync API |
## Notes
- **MTU 9000** (jumbo frames) enabled on storage networks

MAINTENANCE.md (new file, +618)

@@ -0,0 +1,618 @@
# Maintenance Procedures and Schedules
Regular maintenance procedures for homelab infrastructure to ensure reliability and performance.
## Overview
| Frequency | Tasks | Estimated Time |
|-----------|-------|----------------|
| **Daily** | Quick health check | 2-5 min |
| **Weekly** | Service status, logs review | 15-30 min |
| **Monthly** | Updates, backups verification | 1-2 hours |
| **Quarterly** | Full system audit, testing | 2-4 hours |
| **Annual** | Hardware maintenance, planning | 4-8 hours |
---
## Daily Maintenance (Automated)
### Quick Health Check Script
Save as `~/bin/homelab-health-check.sh`:
```bash
#!/bin/bash
# Daily homelab health check
echo "=== Homelab Health Check ==="
echo "Date: $(date)"
echo ""
echo "=== Server Status ==="
ssh pve 'uptime' 2>/dev/null || echo "PVE: UNREACHABLE"
ssh pve2 'uptime' 2>/dev/null || echo "PVE2: UNREACHABLE"
echo ""
echo "=== CPU Temperatures ==="
ssh pve 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE: $(($(cat $f)/1000))°C"; fi; done'
ssh pve2 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE2: $(($(cat $f)/1000))°C"; fi; done'
echo ""
echo "=== UPS Status ==="
ssh pve 'upsc cyberpower@localhost | grep -E "battery.charge:|battery.runtime:|ups.load:|ups.status:"'
echo ""
echo "=== ZFS Pools ==="
ssh pve 'zpool status -x' 2>/dev/null
ssh pve2 'zpool status -x' 2>/dev/null
ssh truenas 'zpool status -x vault'
echo ""
echo "=== Disk Space ==="
ssh pve 'df -h | grep -E "Filesystem|/dev/(nvme|sd)"'
ssh truenas 'df -h /mnt/vault'
echo ""
echo "=== VM Status ==="
ssh pve 'qm list | grep running | wc -l' | xargs echo "PVE VMs running:"
ssh pve2 'qm list | grep running | wc -l' | xargs echo "PVE2 VMs running:"
echo ""
echo "=== Syncthing Connections ==="
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
"http://127.0.0.1:8384/rest/system/connections" | \
python3 -c "import sys,json; d=json.load(sys.stdin)['connections']; \
[print(f\"{v.get('name',k[:7])}: {'UP' if v['connected'] else 'DOWN'}\") for k,v in d.items()]"
echo ""
echo "=== Check Complete ==="
```
**Run daily via cron**:
```bash
# Add to crontab
0 9 * * * ~/bin/homelab-health-check.sh | mail -s "Homelab Health Check" hutson@example.com
```
---
## Weekly Maintenance
### Service Status Review
**Check all critical services**:
```bash
# Proxmox services
ssh pve 'systemctl status pve-cluster pvedaemon pveproxy'
ssh pve2 'systemctl status pve-cluster pvedaemon pveproxy'
# NUT (UPS monitoring)
ssh pve 'systemctl status nut-server nut-monitor'
ssh pve2 'systemctl status nut-monitor'
# Container services
ssh pve 'pct exec 200 -- systemctl status pihole-FTL' # Pi-hole
ssh pve 'pct exec 202 -- systemctl status traefik' # Traefik
# VM services (via QEMU agent)
ssh pve 'qm guest exec 100 -- bash -c "systemctl status nfs-server smbd"' # TrueNAS
```
### Log Review
**Check for errors in critical logs**:
```bash
# Proxmox system logs
ssh pve 'journalctl -p err -b | tail -50'
ssh pve2 'journalctl -p err -b | tail -50'
# VM logs (if QEMU agent available)
ssh pve 'qm guest exec 100 -- bash -c "journalctl -p err --since today"'
# Traefik access logs
ssh pve 'pct exec 202 -- tail -100 /var/log/traefik/access.log'
```
### Syncthing Sync Status
**Check for sync errors**:
```bash
# Check all folder errors
for folder in documents downloads desktop movies pictures notes config; do
echo "=== $folder ==="
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
"http://127.0.0.1:8384/rest/folder/errors?folder=$folder" | jq
done
```
**See**: [SYNCTHING.md](SYNCTHING.md)
---
## Monthly Maintenance
### System Updates
#### Proxmox Updates
**Check for updates**:
```bash
ssh pve 'apt update && apt list --upgradable'
ssh pve2 'apt update && apt list --upgradable'
```
**Apply updates**:
```bash
# PVE
ssh pve 'apt update && apt dist-upgrade -y'
# PVE2
ssh pve2 'apt update && apt dist-upgrade -y'
# Reboot if kernel updated
ssh pve 'reboot'
ssh pve2 'reboot'
```
**⚠️ Important**:
- Check [Proxmox release notes](https://pve.proxmox.com/wiki/Roadmap) before major updates
- Test on PVE2 first if possible
- Ensure all VMs are backed up before updating (see the one-liner below)
- Monitor VMs after reboot - some may need manual restart
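A one-off backup of the critical set before updating, per BACKUP-STRATEGY.md (run each VMID on the node that hosts it):
```bash
# Snapshot-mode backups of the critical VMs to local storage
ssh pve 'vzdump 100 101 110 300 --mode snapshot --storage local-zfs --compress zstd'
```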
#### Container Updates (LXC)
```bash
# Update all containers
ssh pve 'for ctid in 200 202 205; do pct exec $ctid -- bash -c "apt update && apt upgrade -y"; done'
```
#### VM Updates
**Update VMs individually via SSH**:
```bash
# Ubuntu/Debian VMs
ssh truenas 'apt update && apt upgrade -y'
ssh docker-host 'apt update && apt upgrade -y'
ssh fs-dev 'apt update && apt upgrade -y'
# Check if reboot required
ssh truenas '[ -f /var/run/reboot-required ] && echo "Reboot required"'
```
### ZFS Scrubs
**Schedule**: Run monthly on all pools
**PVE**:
```bash
# Start scrub on all pools
ssh pve 'zpool scrub nvme-mirror1'
ssh pve 'zpool scrub nvme-mirror2'
ssh pve 'zpool scrub rpool'
# Check scrub status
ssh pve 'zpool status | grep -A2 scrub'
```
**PVE2**:
```bash
ssh pve2 'zpool scrub nvme-mirror3'
ssh pve2 'zpool scrub local-zfs2'
ssh pve2 'zpool status | grep -A2 scrub'
```
**TrueNAS**:
```bash
# Scrub via TrueNAS web UI or SSH
ssh truenas 'zpool scrub vault'
ssh truenas 'zpool status vault | grep -A2 scrub'
```
**Automate scrubs**:
```bash
# Add to crontab (run on 1st of month at 2 AM)
0 2 1 * * /sbin/zpool scrub nvme-mirror1
0 2 1 * * /sbin/zpool scrub nvme-mirror2
0 2 1 * * /sbin/zpool scrub rpool
```
**See**: [STORAGE.md](STORAGE.md) for pool details
### SMART Tests
**Run extended SMART tests monthly**:
```bash
# TrueNAS drives (via QEMU agent)
ssh pve 'qm guest exec 100 -- bash -c "smartctl --scan | while read dev type; do smartctl -t long \$dev; done"'
# Check results after 4-8 hours
ssh pve 'qm guest exec 100 -- bash -c "smartctl --scan | while read dev type; do echo \"=== \$dev ===\"; smartctl -a \$dev | grep -E \"Model|Serial|test result|Reallocated|Current_Pending\"; done"'
# PVE drives
ssh pve 'for dev in /dev/nvme0 /dev/nvme1 /dev/sda /dev/sdb; do [ -e "$dev" ] && smartctl -t long $dev; done'
# PVE2 drives
ssh pve2 'for dev in /dev/nvme0 /dev/nvme1 /dev/sda /dev/sdb; do [ -e "$dev" ] && smartctl -t long $dev; done'
```
**Automate SMART tests**:
```bash
# Add to crontab (run on 15th of month at 3 AM)
0 3 15 * * /usr/sbin/smartctl -t long /dev/nvme0
0 3 15 * * /usr/sbin/smartctl -t long /dev/sda
```
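A cron-able follow-up sketch that mails a warning when any PVE drive reports a health check other than PASSED (device list and mail address are illustrative):
```bash
# Mail a warning if smartctl -H does not report PASSED on any PVE drive
out=$(ssh pve 'for dev in /dev/nvme0 /dev/nvme1 /dev/sda /dev/sdb; do
  [ -e "$dev" ] || continue
  smartctl -H "$dev" | grep -q PASSED || echo "SMART health not PASSED on $dev"
done')
[ -n "$out" ] && echo "$out" | mail -s "SMART warning on PVE" hutson@example.com
```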
### Certificate Renewal Verification
**Check SSL certificate expiry**:
```bash
# Check Traefik certificates
ssh pve 'pct exec 202 -- cat /etc/traefik/acme.json | jq ".letsencrypt.Certificates[] | {domain: .domain.main, expires: .Dates.NotAfter}"'
# Check specific service
echo | openssl s_client -servername git.htsn.io -connect git.htsn.io:443 2>/dev/null | openssl x509 -noout -dates
```
**Certificates should auto-renew 30 days before expiry via Traefik**
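A small loop to spot-check several public hostnames against that 30-day window (the domain list is illustrative):
```bash
# openssl -checkend returns non-zero if the cert expires within the given seconds
for host in git.htsn.io plex.htsn.io mc.htsn.io; do
  if echo | openssl s_client -servername "$host" -connect "$host:443" 2>/dev/null | \
     openssl x509 -noout -checkend $((30*24*3600)) >/dev/null 2>&1; then
    echo "$host: OK (more than 30 days remaining)"
  else
    echo "$host: WARNING - expires within 30 days (or unreachable)"
  fi
done
```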
**See**: [TRAEFIK.md](TRAEFIK.md) for certificate management
### Backup Verification
**⚠️ TODO**: No backup strategy currently in place
**See**: [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md) for implementation plan
---
## Quarterly Maintenance
### Full System Audit
**Check all systems comprehensively**:
1. **ZFS Pool Health**:
```bash
ssh pve 'zpool status -v'
ssh pve2 'zpool status -v'
ssh truenas 'zpool status -v vault'
```
Look for: errors, degraded vdevs, resilver operations
2. **SMART Health**:
```bash
# Run SMART health check script
~/bin/smart-health-check.sh
```
Look for: reallocated sectors, pending sectors, failures
3. **Disk Space Trends**:
```bash
# Check growth rate
ssh pve 'zpool list -o name,size,allocated,free,fragmentation'
ssh truenas 'df -h /mnt/vault'
```
Plan for expansion if >80% full
4. **VM Resource Usage**:
```bash
# Check if VMs need more/less resources
ssh pve 'qm list'
ssh pve 'pvesh get /nodes/pve/status'
```
5. **Network Performance**:
```bash
# Test bandwidth between critical nodes
iperf3 -s # On one host
iperf3 -c 10.10.10.120 # From another
```
6. **Temperature Monitoring**:
```bash
# Check max temps over past quarter
# TODO: Set up Prometheus/Grafana for historical data
ssh pve 'sensors'
ssh pve2 'sensors'
```
### Service Dependency Testing
**Test critical paths**:
1. **Power failure recovery** (if safe to test):
- See [UPS.md](UPS.md) for full procedure
- Verify VM startup order works
- Confirm all services come back online
2. **Failover testing**:
- Tailscale subnet routing (PVE → UCG-Fiber)
- NUT monitoring (PVE server → PVE2 client)
3. **Backup restoration** (when backups implemented):
- Test restoring a VM from backup
- Test restoring files from Syncthing versioning
### Documentation Review
- [ ] Update IP assignments in [IP-ASSIGNMENTS.md](IP-ASSIGNMENTS.md)
- [ ] Review and update service URLs in [SERVICES.md](SERVICES.md)
- [ ] Check for missing hardware specs in [HARDWARE.md](HARDWARE.md)
- [ ] Update any changed procedures in this document
---
## Annual Maintenance
### Hardware Maintenance
**Physical cleaning**:
```bash
# Shut down servers (coordinate with users)
ssh pve 'shutdown -h now'
ssh pve2 'shutdown -h now'
# Clean dust from:
# - CPU heatsinks
# - GPU fans
# - Case fans
# - PSU vents
# - Storage enclosure fans
# Check for:
# - Bulging capacitors on PSU/motherboard
# - Loose cables
# - Fan noise/vibration
```
**Thermal paste inspection** (every 2-3 years):
- Check CPU temps vs baseline
- If temps >85°C under load, consider reapplying paste
- Threadripper PRO: Tctl max safe = 90°C
**See**: [HARDWARE.md](HARDWARE.md) for component details
### UPS Battery Test
**Runtime test**:
```bash
# Check battery health
ssh pve 'upsc cyberpower@localhost | grep battery'
# Perform runtime test (coordinate power loss)
# 1. Note current runtime estimate
# 2. Unplug UPS from wall
# 3. Let battery drain to 20%
# 4. Note actual runtime vs estimate
# 5. Plug back in before shutdown triggers
# Battery replacement if:
# - Runtime < 10 min at typical load
# - Battery age > 3-5 years
# - Battery charge < 100% when on AC for 24h
```
**See**: [UPS.md](UPS.md) for full UPS details
### Drive Replacement Planning
**Check drive age and health**:
```bash
# Get drive hours and health
ssh truenas 'smartctl --scan | while read dev type; do
echo "=== $dev ===";
smartctl -a $dev | grep -E "Model|Serial|Power_On_Hours|Reallocated|Pending";
done'
```
**Replace drives if**:
- Reallocated sectors > 0
- Pending sectors > 0
- SMART pre-fail warnings
- Age > 5 years for HDDs (3-5 years for SSDs/NVMe)
- Hours > 50,000 for consumer drives
**Budget for replacements**:
- HDDs: WD Red 6TB (~$150/drive)
- NVMe: Samsung/Kingston 2TB (~$150-200/drive)
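A sketch that prints the key replacement indicators per TrueNAS drive (uses ATA/SAS attribute names; NVMe drives report these fields differently):
```bash
# Summarize power-on hours and reallocated/pending sectors for each drive
ssh truenas 'smartctl --scan | while read dev type; do
  attrs=$(smartctl -A "$dev")
  hours=$(echo "$attrs" | awk "/Power_On_Hours/ {print \$10}")
  realloc=$(echo "$attrs" | awk "/Reallocated_Sector_Ct/ {print \$10}")
  pending=$(echo "$attrs" | awk "/Current_Pending_Sector/ {print \$10}")
  echo "$dev: hours=${hours:-n/a} reallocated=${realloc:-0} pending=${pending:-0}"
done'
```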
### Capacity Planning
**Review growth trends**:
```bash
# Storage growth (compare to last year)
ssh pve 'zpool list'
ssh truenas 'df -h /mnt/vault'
# Network bandwidth (if monitoring in place)
# Review Grafana dashboards
# Power consumption
ssh pve 'upsc cyberpower@localhost ups.load'
```
**Plan expansions**:
- Storage: Add drives if >70% full
- RAM: Check if VMs hitting limits
- Network: Upgrade if bandwidth saturation
- UPS: Upgrade if load >80%
### License and Subscription Review
**Proxmox subscription** (if applicable):
- Community (free) or Enterprise subscription?
- Check for updates to pricing/features
**Service subscriptions**:
- Domain registration (htsn.io)
- Cloudflare plan (currently free)
- Let's Encrypt (free, no action needed)
---
## Update Schedules
### Proxmox
| Component | Frequency | Notes |
|-----------|-----------|-------|
| Security patches | Weekly | Via `apt upgrade` |
| Minor updates | Monthly | Test on PVE2 first |
| Major versions | Quarterly | Read release notes, plan downtime |
| Kernel updates | Monthly | Requires reboot |
**Update procedure**:
1. Check [Proxmox release notes](https://pve.proxmox.com/wiki/Roadmap)
2. Backup VM configs: `vzdump --dumpdir /tmp`
3. Update: `apt update && apt dist-upgrade`
4. Reboot if kernel changed: `reboot`
5. Verify VMs auto-started: `qm list`
### Containers (LXC)
| Container | Update Frequency | Package Manager |
|-----------|------------------|-----------------|
| Pi-hole (200) | Weekly | `apt` |
| Traefik (202) | Monthly | `apt` |
| FindShyt (205) | As needed | `apt` |
**Update command**:
```bash
ssh pve 'pct exec CTID -- bash -c "apt update && apt upgrade -y"'
```
### VMs
| VM | Update Frequency | Notes |
|----|------------------|-------|
| TrueNAS | Monthly | Via web UI or `apt` |
| Saltbox | Weekly | Managed by Saltbox updates |
| HomeAssistant | Monthly | Via HA supervisor |
| Docker-host | Weekly | `apt` + Docker images |
| Trading-VM | As needed | Via SSH |
| Gitea-VM | Monthly | Via web UI + `apt` |
**Docker image updates**:
```bash
ssh docker-host 'docker-compose pull && docker-compose up -d'
```
### Firmware Updates
| Component | Check Frequency | Update Method |
|-----------|----------------|---------------|
| Motherboard BIOS | Annually | Manual flash (high risk) |
| GPU firmware | Rarely | `nvidia-smi` or manual |
| SSD/NVMe firmware | Quarterly | Vendor tools |
| HBA firmware | Annually | LSI tools |
| UPS firmware | Annually | PowerPanel or manual |
**⚠️ Warning**: BIOS/firmware updates carry risk. Only update if:
- Critical security issue
- Needed for hardware compatibility
- Fixing known bug affecting you
---
## Testing Checklists
### Pre-Update Checklist
Before ANY system update:
- [ ] Check current system state: `uptime`, `qm list`, `zpool status`
- [ ] Verify backups are current (when backup system in place)
- [ ] Check for critical VMs/services that can't have downtime
- [ ] Review update changelog/release notes
- [ ] Test on non-critical system first (PVE2 or test VM)
- [ ] Plan rollback strategy if update fails
- [ ] Notify users if downtime expected
### Post-Update Checklist
After system update:
- [ ] Verify system booted correctly: `uptime`
- [ ] Check all VMs/CTs started: `qm list`, `pct list`
- [ ] Test critical services:
- [ ] Pi-hole DNS: `nslookup google.com 10.10.10.10`
- [ ] Traefik routing: `curl -I https://plex.htsn.io`
- [ ] NFS/SMB shares: Test mount from VM
- [ ] Syncthing sync: Check all devices connected
- [ ] Review logs for errors: `journalctl -p err -b`
- [ ] Check temperatures: `sensors`
- [ ] Verify UPS monitoring: `upsc cyberpower@localhost`
### Disaster Recovery Test
**Quarterly test** (when backup system in place):
- [ ] Simulate VM failure: Restore from backup
- [ ] Simulate storage failure: Import pool on different system
- [ ] Simulate network failure: Verify Tailscale failover
- [ ] Simulate power failure: Test UPS shutdown procedure (if safe)
- [ ] Document recovery time and issues
---
## Log Rotation
**System logs** are automatically rotated by systemd-journald and logrotate.
**Check log sizes**:
```bash
# Journalctl size
ssh pve 'journalctl --disk-usage'
# Traefik logs
ssh pve 'pct exec 202 -- du -sh /var/log/traefik/'
```
**Configure retention**:
```bash
# Limit journald to 500MB
ssh pve 'echo "SystemMaxUse=500M" >> /etc/systemd/journald.conf'
ssh pve 'systemctl restart systemd-journald'
```
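If the journal has already grown past the new limit, it can be trimmed immediately:
```bash
# One-off cleanup of existing journal files down to 500MB
ssh pve 'journalctl --vacuum-size=500M'
```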
**Traefik log rotation** (already configured):
```bash
# /etc/logrotate.d/traefik on CT 202
/var/log/traefik/*.log {
    daily
    rotate 7
    compress
    delaycompress
    missingok
    notifempty
}
```
---
## Monitoring Integration
**TODO**: Set up automated monitoring for these procedures
**When monitoring is implemented** (see [MONITORING.md](MONITORING.md)):
- ZFS scrub completion/errors
- SMART test failures
- Certificate expiry warnings (<30 days)
- Update availability notifications
- Disk space thresholds (>80%)
- Temperature warnings (>85°C)
---
## Related Documentation
- [MONITORING.md](MONITORING.md) - Automated health checks and alerts
- [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md) - Backup implementation plan
- [UPS.md](UPS.md) - Power failure procedures
- [STORAGE.md](STORAGE.md) - ZFS pool management
- [HARDWARE.md](HARDWARE.md) - Hardware specifications
- [SERVICES.md](SERVICES.md) - Service inventory
---
**Last Updated**: 2025-12-22
**Status**: ⚠️ Manual procedures only - monitoring automation needed

MINECRAFT.md Normal file

@@ -0,0 +1,711 @@
# Minecraft Servers
Minecraft servers running on docker-host2 via Crafty Controller 4.
---
## Servers Overview
| Server | Address | Port | Version | Status |
|--------|---------|------|---------|--------|
| **Hutworld** | hutworld.htsn.io | 25565 | Paper 1.21.11 | Running |
| **Backrooms** | backrooms.htsn.io | 25566 | Paper 1.21.4 | Running |
### Web Map
| Setting | Value |
|---------|-------|
| **URL** | https://map.htsn.io |
| **Username** | hutworld |
| **Password** | Suwanna123 |
| **Plugin** | BlueMap 5.15 |
| **Port** | 8100 (exposed via Docker) |
---
## Quick Reference
### Hutworld (Main Server)
| Setting | Value |
|---------|-------|
| **Web GUI** | https://mc.htsn.io |
| **Game Server (Java)** | hutworld.htsn.io:25565 |
| **Game Server (Bedrock)** | hutworld.htsn.io:19132 |
| **Host** | docker-host2 (10.10.10.207) |
| **Server Type** | Paper 1.21.11 |
| **World Name** | hutworld |
| **Memory** | 4GB min / 8GB max |
### Backrooms (Horror/Exploration)
| Setting | Value |
|---------|-------|
| **Web GUI** | https://mc.htsn.io |
| **Game Server (Java)** | backrooms.htsn.io:25566 |
| **Host** | docker-host2 (10.10.10.207) |
| **Server Type** | Paper 1.21.4 |
| **World Name** | backrooms |
| **Memory** | 512MB min / 1.5GB max |
| **Datapack** | The Backrooms v2.2.0 |
**Backrooms Features:**
- 50+ custom dimensions based on Backrooms lore
- Use `/execute in backrooms:level0 run tp @s ~ ~ ~` to travel to Level 0
- Horror-themed exploration gameplay
- No client mods required (datapack only)
---
## Crafty Controller Access
| Setting | Value |
|---------|-------|
| **URL** | https://mc.htsn.io |
| **Username** | admin |
| **Password** | See `/crafty/data/config/default-creds.txt` on docker-host2 |
**Get password:**
```bash
ssh docker-host2 'cat ~/crafty/data/config/default-creds.txt'
```
---
## Current Status
### Completed
- [x] Crafty Controller 4.4.7 deployed on docker-host2
- [x] Traefik reverse proxy configured (mc.htsn.io → 10.10.10.207:8443)
- [x] DNS A record created for hutworld.htsn.io (non-proxied, points to public IP)
- [x] Port forwarding configured via UniFi API:
- TCP/UDP 25565 → 10.10.10.207 (Java Edition)
- UDP 19132 → 10.10.10.207 (Bedrock via Geyser)
- [x] Server files transferred from Windows PC (D:\Minecraft\mcss\servers\hutworld)
- [x] Server imported into Crafty and running
- [x] Paper upgraded from 1.21.5 to 1.21.11
- [x] Plugins updated (GSit 3.1.1, LuckPerms 5.5.22)
- [x] Orphaned plugin data cleaned up
- [x] LuckPerms database restored with original permissions
- [x] Automated backups to TrueNAS configured (every 2 hours)
### Pending
- [ ] Install SilkSpawners plugin (allows mining spawners with Silk Touch)
- [ ] Change Crafty admin password to something memorable
- [ ] Test external connectivity from outside network
---
## Import Instructions
To import the hutworld server in Crafty:
1. Go to **Servers** → Click **+ Create New Server**
2. Select **Import Server** tab
3. Fill in:
- **Server Name:** `Hutworld`
- **Import Path:** `/crafty/import/hutworld`
- **Server JAR:** `paper.jar`
- **Min RAM:** `2048` (2GB)
- **Max RAM:** `6144` (6GB)
- **Server Port:** `25565`
4. Click **Import Server**
5. Go to server → Click **Start**
---
## Server Configuration
### World Data
| World | Description |
|-------|-------------|
| hutworld | Main overworld |
| hutworld_nether | Nether dimension |
| hutworld_the_end | End dimension |
### Installed Plugins
| Plugin | Version | Purpose |
|--------|---------|---------|
| EssentialsX | 2.20.1 | Core server commands |
| EssentialsXChat | 2.20.1 | Chat formatting |
| EssentialsXSpawn | 2.20.1 | Spawn management |
| Geyser-Spigot | Latest | Bedrock Edition support |
| floodgate | Latest | Bedrock authentication |
| GSit | 3.1.1 | Sit/lay/crawl animations |
| LuckPerms | 5.5.22 | Permissions management |
| PluginPortal | 2.2.2 | Plugin management |
| Vault | 1.7.3 | Economy/permissions API |
| ViaVersion | Latest | Multi-version support |
| ViaBackwards | 5.2.1 | Older client support |
| randomtp | Latest | Random teleportation |
| BlueMap | 5.15 | 3D web map with player tracking |
| WorldEdit | 7.3.10 | World editing and terraforming |
**Removed plugins** (cleaned up 2026-01-03):
- GriefPrevention, Multiverse-Core, Multiverse-Portals, ProtocolLib, WorldGuard (disabled/orphaned)
---
## Docker Configuration
**Location:** `~/crafty/docker-compose.yml` on docker-host2
```yaml
services:
  crafty:
    image: registry.gitlab.com/crafty-controller/crafty-4:4.4.7
    container_name: crafty
    restart: unless-stopped
    environment:
      - TZ=America/New_York
    ports:
      - "8443:8443"        # Web GUI (HTTPS)
      - "8123:8123"        # Crafty HTTP
      - "25565:25565"      # Minecraft Java
      - "25566:25566"      # Additional server
      - "19132:19132/udp"  # Minecraft Bedrock (Geyser)
      - "8100:8100"        # BlueMap web server
    volumes:
      - ./data/backups:/crafty/backups
      - ./data/logs:/crafty/logs
      - ./data/servers:/crafty/servers
      - ./data/config:/crafty/app/config
      - ./data/import:/crafty/import
```
---
## Traefik Configuration
**File:** `/etc/traefik/conf.d/crafty.yaml` on CT 202 (10.10.10.250)
```yaml
http:
  routers:
    crafty-secure:
      entryPoints:
        - websecure
      rule: "Host(`mc.htsn.io`)"
      service: crafty
      tls:
        certResolver: cloudflare
      priority: 50
  services:
    crafty:
      loadBalancer:
        servers:
          - url: "https://10.10.10.207:8443"
        serversTransport: crafty-transport@file
  serversTransports:
    crafty-transport:
      insecureSkipVerify: true
```
---
## Port Forwarding (UniFi)
Configured via UniFi controller on UCG-Fiber (10.10.10.1):
| Rule Name | Port | Protocol | Destination | Status |
|-----------|------|----------|-------------|--------|
| Minecraft Java | 25565 | TCP/UDP | 10.10.10.207:25565 | Active |
| Minecraft Bedrock | 19132 | UDP | 10.10.10.207:19132 | Active |
| Minecraft Backrooms | 25566 | TCP/UDP | 10.10.10.207:25566 | Active |
---
## DNS Records (Cloudflare)
| Record | Type | Value | Proxied |
|--------|------|-------|---------|
| mc.htsn.io | CNAME | htsn.io | Yes (for web GUI) |
| hutworld.htsn.io | A | 70.237.94.174 | No (direct for game traffic) |
| backrooms.htsn.io | A | 70.237.94.174 | No (direct for game traffic) |
**Note:** Game traffic (25565, 25566, 19132) cannot be proxied through Cloudflare - only HTTP/HTTPS works with Cloudflare proxy.
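To confirm the non-proxied records resolve straight to the public IP (and not to Cloudflare edge addresses):
```bash
# Both should return the public IP from the table above
dig +short hutworld.htsn.io backrooms.htsn.io
```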
---
## LuckPerms Web Editor
After server is running:
1. Open Crafty console for Hutworld server
2. Run command: `/lp editor`
3. A unique URL will be generated (cloud-hosted by LuckPerms)
4. Open the URL in browser to manage permissions
The editor is hosted by LuckPerms, so no additional port forwarding is needed.
---
## Backup Configuration
### Automated Backups to TrueNAS
Backups run automatically every 2 hours and are stored on TrueNAS for both servers.
| Setting | Value |
|---------|-------|
| **Destination** | TrueNAS (10.10.10.200) |
| **Path** | `/mnt/vault/users/backups/minecraft/` |
| **Frequency** | Every 2 hours (12 backups per day) |
| **Retention** | 30 backups per server (~2.5 days of history) |
| **Hutworld Size** | ~2-7 GB per backup |
| **Backrooms Size** | ~100-150 MB per backup |
| **Script** | `/home/hutson/minecraft-backup-all.sh` on docker-host2 |
| **Log** | `/home/hutson/minecraft-backup.log` on docker-host2 |
### Backup Scripts
**Main Script:** `~/minecraft-backup-all.sh` on docker-host2 (backs up both servers)
**Legacy Script:** `~/minecraft-backup.sh` on docker-host2 (Hutworld only)
```bash
#!/bin/bash
# Minecraft Server Backup Script
# Backs up Crafty server data to TrueNAS
BACKUP_SRC="$HOME/crafty/data/servers/19f604a9-f037-442d-9283-0761c73cfd60"
BACKUP_DEST="hutson@10.10.10.200:/mnt/vault/users/backups/minecraft"
DATE=$(date +%Y-%m-%d_%H%M)
BACKUP_NAME="hutworld-$DATE.tar.gz"
LOCAL_BACKUP="/tmp/$BACKUP_NAME"
# Create compressed backup (exclude large unnecessary files)
tar -czf "$LOCAL_BACKUP" \
--exclude="*.jar" \
--exclude="cache" \
--exclude="libraries" \
--exclude=".paper-remapped" \
-C "$HOME/crafty/data/servers" \
19f604a9-f037-442d-9283-0761c73cfd60
# Transfer to TrueNAS
sshpass -p 'GrilledCh33s3#' scp -o StrictHostKeyChecking=no "$LOCAL_BACKUP" "$BACKUP_DEST/"
# Clean up local temp file
rm -f "$LOCAL_BACKUP"
# Keep only last 30 backups on TrueNAS
sshpass -p 'GrilledCh33s3#' ssh -o StrictHostKeyChecking=no hutson@10.10.10.200 '
cd /mnt/vault/users/backups/minecraft
ls -t hutworld-*.tar.gz 2>/dev/null | tail -n +31 | xargs -r rm -f
'
```
### Cron Schedule
```bash
# View current schedule
ssh docker-host2 'crontab -l | grep minecraft'
# Output: 0 */2 * * * /home/hutson/minecraft-backup-all.sh >> /home/hutson/minecraft-backup.log 2>&1
```
### Manual Backup Commands
```bash
# Run backup manually
ssh docker-host2 '~/minecraft-backup.sh'
# Check backup log
ssh docker-host2 'tail -20 ~/minecraft-backup.log'
# List backups on TrueNAS
sshpass -p 'GrilledCh33s3#' ssh -o StrictHostKeyChecking=no hutson@10.10.10.200 \
'ls -lh /mnt/vault/users/backups/minecraft/'
```
### Restore from Backup
```bash
# 1. Stop the server in Crafty web UI
# 2. Copy backup from TrueNAS
sshpass -p 'GrilledCh33s3#' scp -o StrictHostKeyChecking=no \
hutson@10.10.10.200:/mnt/vault/users/backups/minecraft/hutworld-YYYY-MM-DD_HHMM.tar.gz \
/tmp/
# 3. Extract to server directory (backup existing first)
ssh docker-host2 'cd ~/crafty/data/servers && \
mv 19f604a9-f037-442d-9283-0761c73cfd60 19f604a9-f037-442d-9283-0761c73cfd60.old && \
tar -xzf /tmp/hutworld-YYYY-MM-DD_HHMM.tar.gz'
# 4. Start server in Crafty web UI
```
---
## Admin Commands
### Give Mob Spawner (1.21+ Syntax)
In Minecraft 1.21+, the NBT syntax changed. Use `minecraft:give` to bypass Essentials:
```
minecraft:give <player> spawner[block_entity_data={id:"minecraft:mob_spawner",SpawnData:{entity:{id:"minecraft:<mob_type>"}}}]
```
**Examples:**
```bash
# Magma cube spawner
minecraft:give suwann spawner[block_entity_data={id:"minecraft:mob_spawner",SpawnData:{entity:{id:"minecraft:magma_cube"}}}]
# Zombie spawner
minecraft:give suwann spawner[block_entity_data={id:"minecraft:mob_spawner",SpawnData:{entity:{id:"minecraft:zombie"}}}]
# Skeleton spawner
minecraft:give suwann spawner[block_entity_data={id:"minecraft:mob_spawner",SpawnData:{entity:{id:"minecraft:skeleton"}}}]
# Blaze spawner
minecraft:give suwann spawner[block_entity_data={id:"minecraft:mob_spawner",SpawnData:{entity:{id:"minecraft:blaze"}}}]
```
**Note:** Must use `minecraft:give` prefix to use vanilla command instead of Essentials `/give`.
### RCON Access
For remote console access to the server:
| Setting | Value |
|---------|-------|
| **Host** | 10.10.10.207 |
| **Port** | 25575 |
| **Password** | HutworldRCON2026 |
Example using mcrcon:
```bash
mcrcon -H 10.10.10.207 -P 25575 -p HutworldRCON2026
```
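mcrcon can also run a single command and exit, which is handy for scripts:
```bash
# One-shot example: list online players
mcrcon -H 10.10.10.207 -P 25575 -p HutworldRCON2026 "list"
```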
### BlueMap Commands
```bash
# Start full world render
/bluemap render
# Pause rendering
/bluemap pause
# Resume rendering
/bluemap resume
# Check render status
/bluemap status
# Reload BlueMap config
/bluemap reload
```
---
## Common Tasks
### Start/Stop Server
Via Crafty web UI at https://mc.htsn.io, or:
```bash
# Check Crafty container status
ssh docker-host2 'docker ps | grep crafty'
# Restart Crafty container
ssh docker-host2 'cd ~/crafty && docker compose restart'
# View Crafty logs
ssh docker-host2 'docker logs -f crafty'
```
### Backup Server
See [Backup Configuration](#backup-configuration) for full details.
```bash
# Run backup manually
ssh docker-host2 '~/minecraft-backup.sh'
# Check recent backups
sshpass -p 'GrilledCh33s3#' ssh -o StrictHostKeyChecking=no hutson@10.10.10.200 \
'ls -lht /mnt/vault/users/backups/minecraft/ | head -5'
```
### Update Plugins
1. Download new plugin JAR
2. Upload via Crafty Files tab, or:
```bash
scp plugin.jar docker-host2:~/crafty/data/servers/hutworld/plugins/
```
3. Restart server in Crafty
### Check Server Logs
Via Crafty web UI (Logs tab), or:
```bash
ssh docker-host2 'tail -f ~/crafty/data/servers/hutworld/logs/latest.log'
```
---
## Troubleshooting
### Plugin Permission Issues (IMPORTANT)
**Root Cause**: Crafty Docker container requires all files to be owned by `<user>:root` (not `<user>:<user>`) for permissions to work correctly.
**Permanent Fix**:
```bash
# Fix all permissions immediately
ssh docker-host2 'sudo chown -R hutson:root ~/crafty/data/servers/ && \
sudo find ~/crafty/data/servers/ -type d -exec chmod 2775 {} \; && \
sudo find ~/crafty/data/servers/ -type f -exec chmod 664 {} \;'
```
**Prevention**:
1. **Always upload plugins through Crafty web UI** - this ensures correct permissions
2. **Or use the import directory**: Copy to `~/crafty/data/import/` then restart container
3. **Never directly copy files** to the servers directory
**Check for permission issues**:
```bash
# Use the permission check script (recommended)
ssh docker-host2 '~/check-crafty-permissions.sh'
# Or manually check for wrong group ownership
ssh docker-host2 'find ~/crafty/data/servers -type f ! -group root -ls'
ssh docker-host2 'find ~/crafty/data/servers -type d ! -group root -ls'
```
**Permission Check Script**: Located at `~/check-crafty-permissions.sh` on docker-host2
- Automatically detects permission issues
- Offers to fix them with one command
- Ignores temporary files that are expected to have different permissions
### Crafty Shows Server Offline or "Another Instance Running"
**Cause**: This happens when the server was started manually (not through Crafty) or when Crafty loses track of the server process.
**Fix**:
```bash
# 1. Kill any orphaned server processes
ssh docker-host2 'docker exec crafty pkill -f "paper.jar"'
# 2. Restart Crafty container to clear state
ssh docker-host2 'cd ~/crafty && docker compose restart'
# 3. Wait 30-60 seconds - Crafty will auto-start the server
```
**Prevention**:
- Always use Crafty web UI to start/stop servers
- Never manually start the server with java command
- If you must restart, use the container restart method above
### Server won't start
```bash
# Check Crafty container logs
ssh docker-host2 'docker logs crafty --tail 50'
# Check server logs
ssh docker-host2 'cat ~/crafty/data/servers/hutworld/logs/latest.log | tail -100'
# Check Java version in container
ssh docker-host2 'docker exec crafty java -version'
```
### Can't connect externally
1. Verify port forwarding is active:
```bash
ssh root@10.10.10.1 'iptables -t nat -L -n | grep 25565'
```
2. Test from external network:
```bash
nc -zv hutworld.htsn.io 25565
```
3. Check if server is listening:
```bash
ssh docker-host2 'netstat -tlnp | grep 25565'
```
### Bedrock players can't connect
1. Verify Geyser plugin is installed and enabled
2. Check Geyser config: `~/crafty/data/servers/hutworld/plugins/Geyser-Spigot/config.yml`
3. Ensure UDP 19132 is forwarded and not blocked
### Corrupted plugin JARs (ZipException)
If you see `java.util.zip.ZipException: zip END header not found`:
1. **Check all plugins for corruption:**
```bash
ssh docker-host2 'cd ~/crafty/data/servers/19f604a9-f037-442d-9283-0761c73cfd60/plugins && \
for jar in *.jar; do unzip -t "$jar" > /dev/null 2>&1 && echo "OK: $jar" || echo "CORRUPT: $jar"; done'
```
2. **Re-download corrupted plugins from Hangar/Modrinth/SpigotMC**
3. **Restart server**
### Session lock errors
If server fails with `session.lock: already locked`:
```bash
# Kill stale Java processes and remove locks
ssh docker-host2 'docker exec crafty bash -c "pkill -f paper.jar; rm -f /crafty/servers/*/hutworld*/session.lock"'
```
### Permission denied errors in Docker
If world files show `AccessDeniedException`:
```bash
# Fix permissions (crafty user is UID 1000)
ssh docker-host2 'docker exec crafty bash -c "chown -R 1000:0 /crafty/servers/19f604a9-f037-442d-9283-0761c73cfd60/ && chmod -R u+rwX /crafty/servers/19f604a9-f037-442d-9283-0761c73cfd60/"'
```
### LuckPerms missing users/permissions
If LuckPerms shows a fresh database (missing users like Suwan):
1. **Check if original database exists:**
```bash
ssh docker-host2 'ls -la ~/crafty/data/import/hutworld/plugins/LuckPerms/*.db'
```
2. **Restore from import backup:**
```bash
# Stop server in Crafty UI first
ssh docker-host2 'cp ~/crafty/data/import/hutworld/plugins/LuckPerms/luckperms-h2-v2.mv.db \
~/crafty/data/servers/19f604a9-f037-442d-9283-0761c73cfd60/plugins/LuckPerms/'
```
3. **Or restore from TrueNAS backup:**
```bash
# List available backups
sshpass -p 'GrilledCh33s3#' ssh -o StrictHostKeyChecking=no hutson@10.10.10.200 \
'ls -lt /mnt/vault/users/backups/minecraft/'
# Extract LuckPerms database from backup
sshpass -p 'GrilledCh33s3#' scp hutson@10.10.10.200:/mnt/vault/users/backups/minecraft/hutworld-YYYY-MM-DD_HHMM.tar.gz /tmp/
tar -xzf /tmp/hutworld-*.tar.gz -C /tmp --strip-components=2 \
'*/plugins/LuckPerms/luckperms-h2-v2.mv.db'
```
4. **Restart server in Crafty UI**
---
## Migration History
### 2026-01-04: Backup System (Updated 2026-01-13)
- Configured automated backups to TrueNAS
- **Updated frequency:** Every 2 hours (was 6 hours)
- **Updated retention:** 30 backups (~2.5 days) (was 14 backups)
- Created backup script with compression and cleanup
- Storage: `/mnt/vault/users/backups/minecraft/`
### 2026-01-03: Server Fixes & Updates
**Updates:**
- Upgraded Paper from 1.21.5 to 1.21.11 (build 69)
- Updated GSit from 2.3.2 to 3.1.1
- Fixed corrupted LuckPerms JAR (re-downloaded 5.5.22)
- Restored original LuckPerms database with user permissions
**Cleanup:**
- Removed disabled plugins: Dynmap, Graves
- Removed orphaned data folders: GriefPreventionData, SilkSpawners_v2, Graves, ViaRewind
**Fixes:**
- Fixed memory allocation (was attempting 2TB, set to 2GB min / 4GB max)
- Fixed file permissions for Docker container access
### 2026-01-03: Initial Migration
**Source:** Windows PC (10.10.10.150) - D:\Minecraft\mcss\servers\hutworld
**Steps completed:**
1. Compressed hutworld folder on Windows (2.4GB zip)
2. Transferred via SCP to docker-host2
3. Unzipped to ~/crafty/data/import/hutworld
4. Downloaded Paper 1.21.5 JAR (later upgraded to 1.21.11)
5. Imported server into Crafty Controller
6. Configured port forwarding (updated existing 25565 rule, added 19132)
7. Created DNS record for hutworld.htsn.io
**Original MCSS config preserved:** `mcss_server_config.json`
---
## Related Documentation
- [IP Assignments](IP-ASSIGNMENTS.md) - Network configuration
- [Traefik](TRAEFIK.md) - Reverse proxy setup
- [VMs](VMS.md) - docker-host2 details
- [Gateway](GATEWAY.md) - UCG-Fiber configuration
---
## Resources
- [Crafty Controller Docs](https://docs.craftycontrol.com/)
- [Paper MC](https://papermc.io/)
- [Geyser MC](https://geysermc.org/)
- [LuckPerms](https://luckperms.net/)
---
**Last Updated:** 2026-01-11
---
## Migration History (Hutworld)
### 2026-01-13: Server Infrastructure Upgrades ✅
- **RAM Upgraded:** Increased from 2GB/4GB to 4GB/8GB (min/max)
- **Storage Expanded:** VM disk increased from 32GB to 64GB (33% used)
- **RCON Enabled:** Remote console access configured on port 25575 - TESTED & WORKING
- **WorldEdit Installed:** Version 7.3.10 for world editing capabilities
- **Auto-Start Configured:** Server auto-starts with Crafty container
- **Docker Cleanup:** Freed 1.1GB by removing unused images and containers
- **Container Fixed:** Recreated with proper port mappings for RCON access
### 2026-01-11: BlueMap Web Map Added
- Installed BlueMap 5.15 plugin (supports MC 1.21.11)
- Exposed port 8100 in docker-compose.yml for BlueMap web server
- Configured Traefik routing: map.htsn.io → 10.10.10.207:8100
- Added basic auth password protection via Traefik middleware
- Fixed corrupted ViaVersion/ViaBackwards plugins (re-downloaded from Hangar)
- Fixed Docker file permission issues (chown to UID 1000)
- Documented 1.21+ spawner give command syntax
---
## Migration History (Backrooms)
### 2026-01-05: Backrooms Server Created
- Created new Backrooms server in Crafty Controller
- Installed Paper 1.21.4 build 232 (recommended version for datapack)
- Installed The Backrooms datapack v2.2.0 from Modrinth
- DNS record created for backrooms.htsn.io
- Memory configured for 512MB-1.5GB (VM memory constrained)
- Server running on port 25566
- **Pending:** Port forwarding for external access

MONITORING.md Normal file

@@ -0,0 +1,711 @@
# Monitoring and Alerting
Documentation for system monitoring, health checks, and alerting across the homelab.
## Current Monitoring Status
| Component | Monitored? | Method | Alerts | Notes |
|-----------|------------|--------|--------|-------|
| **Gateway** | ✅ Yes | Custom services | ✅ Auto-reboot | Internet watchdog + memory monitor |
| **UPS** | ✅ Yes | NUT + Home Assistant | ❌ No | Battery, load, runtime tracked |
| **Syncthing** | ✅ Partial | API (manual checks) | ❌ No | Connection status available |
| **Server temps** | ✅ Partial | Manual checks | ❌ No | Via `sensors` command |
| **VM status** | ✅ Partial | Proxmox UI | ❌ No | Manual monitoring |
| **ZFS health** | ❌ No | Manual `zpool status` | ❌ No | No automated checks |
| **Disk health (SMART)** | ❌ No | Manual `smartctl` | ❌ No | No automated checks |
| **Network** | ✅ Partial | Gateway watchdog | ✅ Auto-reboot | Connectivity check every 60s |
| **Services** | ❌ No | - | ❌ No | No health checks |
| **Backups** | ❌ No | - | ❌ No | No verification |
| **Claude Code** | ✅ Yes | Prometheus + Grafana | ✅ Yes | Token usage, burn rate, cost tracking |
**Overall Status**: ⚠️ **PARTIAL** - Gateway and Claude Code monitoring are active; everything else is still mostly manual
---
## Existing Monitoring
### UPS Monitoring (NUT)
**Status**: ✅ **Active and working**
**What's monitored**:
- Battery charge percentage
- Runtime remaining (seconds)
- Load percentage
- Input/output voltage
- UPS status (OL/OB/LB)
**Access**:
```bash
# Full UPS status
ssh pve 'upsc cyberpower@localhost'
# Key metrics
ssh pve 'upsc cyberpower@localhost | grep -E "battery.charge:|battery.runtime:|ups.load:|ups.status:"'
```
**Home Assistant Integration**:
- Sensors: `sensor.cyberpower_*`
- Can be used for automation/alerts
- Currently: No alerts configured
**See**: [UPS.md](UPS.md)
---
### Gateway Monitoring
**Status**: ✅ **Active with auto-recovery**
Two custom systemd services monitor the UCG-Fiber gateway (10.10.10.1):
**1. Internet Watchdog** (`internet-watchdog.service`)
- Pings external DNS (1.1.1.1, 8.8.8.8, 208.67.222.222) every 60 seconds
- Auto-reboots gateway after 5 consecutive failures (~5 minutes)
- Logs to `/var/log/internet-watchdog.log`
**2. Memory Monitor** (`memory-monitor.service`)
- Logs memory usage and top processes every 10 minutes
- Logs to `/data/logs/memory-history.log`
- Auto-rotates when log exceeds 10MB
**Quick Commands**:
```bash
# Check service status
ssh ucg-fiber 'systemctl status internet-watchdog memory-monitor'
# View watchdog activity
ssh ucg-fiber 'tail -20 /var/log/internet-watchdog.log'
# View memory history
ssh ucg-fiber 'tail -100 /data/logs/memory-history.log'
# Current memory usage
ssh ucg-fiber 'free -m && ps -eo pid,rss,comm --sort=-rss | head -12'
```
**See**: [GATEWAY.md](GATEWAY.md)
---
### Claude Code Token Monitoring
**Status**: ✅ **Active with alerts**
Monitors Claude Code token usage across all machines to track subscription consumption and prevent hitting weekly limits.
**Architecture**:
```
Claude Code (MacBook/Mac Mini)
    ▼ (OTLP HTTP push every 60s)
OTEL Collector (docker-host:4318)
    ▼ (Prometheus exporter on :8889)
Prometheus (docker-host:9090) ─── scrapes ───► otel-collector:8889
    ├──► Grafana Dashboard
    └──► Alertmanager (burn rate alerts)
```
**Note**: Uses Prometheus exporter instead of Remote Write because Claude Code sends Delta temporality metrics, which Remote Write doesn't support.
**Monitored Devices**:
All Claude Code sessions on any device automatically push metrics via OTLP.
**What's monitored**:
- Token usage (input/output/cache) over time
- Burn rate (tokens/hour)
- Cost tracking (USD)
- Usage by model (Opus, Sonnet, Haiku)
- Session count
- Per-device breakdown
**Dashboard**: https://grafana.htsn.io/d/claude-code-usage/claude-code-token-usage
**Alerts Configured**:
| Alert | Threshold | Severity |
|-------|-----------|----------|
| High Burn Rate | >100k tokens/hour for 15min | Warning |
| Weekly Limit Risk | Projected >5M tokens/week | Critical |
| No Metrics | Scrape fails for 5min | Info |
**Configuration Files**:
- Shell config: `~/.zshrc` (on each Mac - synced via Syncthing)
- OTEL Collector: `/opt/monitoring/otel-collector/config.yaml` (docker-host)
- Alert rules: `/opt/monitoring/prometheus/rules/claude-code.yml` (docker-host)
**Shell Environment Setup** (in `~/.zshrc`):
```bash
# Claude Code OpenTelemetry Metrics (push to OTEL Collector)
export CLAUDE_CODE_ENABLE_TELEMETRY=1
export OTEL_METRICS_EXPORTER=otlp
export OTEL_EXPORTER_OTLP_ENDPOINT="http://10.10.10.206:4318"
export OTEL_EXPORTER_OTLP_PROTOCOL="http/protobuf"
export OTEL_METRIC_EXPORT_INTERVAL=60000
```
**Note**: These can be set either in shell environment (`~/.zshrc`) or in `~/.claude/settings.json` under the `env` block. Both methods work.
**OTEL Collector Config** (`/opt/monitoring/otel-collector/config.yaml`):
```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  batch:
    timeout: 10s
exporters:
  prometheus:
    endpoint: 0.0.0.0:8889
    resource_to_telemetry_conversion:
      enabled: true
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
```
**Prometheus Scrape Config** (add to `/opt/monitoring/prometheus/prometheus.yml`):
```yaml
  - job_name: "claude-code"
    static_configs:
      - targets: ["otel-collector:8889"]
        labels:
          group: "claude-code"
```
**Useful PromQL Queries**:
```promql
# Total tokens by model
sum(claude_code_token_usage_tokens_total) by (model)
# Burn rate (tokens/hour)
sum(rate(claude_code_token_usage_tokens_total[1h])) * 3600
# Total cost by model
sum(claude_code_cost_usage_USD_total) by (model)
# Usage by type (input, output, cacheRead, cacheCreation)
sum(claude_code_token_usage_tokens_total) by (type)
# Projected weekly usage (rough estimate)
sum(increase(claude_code_token_usage_tokens_total[24h])) * 7
```
**Important Notes**:
- After changing `~/.zshrc`, start a new terminal/shell session before running Claude Code
- Metrics only flow while Claude Code is running
- Weekly subscription resets Monday 1am (America/New_York)
- Verify env vars are set: `env | grep OTEL`
**Added**: 2026-01-16
---
### Syncthing Monitoring
**Status**: ⚠️ **Partial** - API available, no automated monitoring
**What's available**:
- Device connection status
- Folder sync status
- Sync errors
- Bandwidth usage
**Manual Checks**:
```bash
# Check connections (Mac Mini)
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
"http://127.0.0.1:8384/rest/system/connections" | \
python3 -c "import sys,json; d=json.load(sys.stdin)['connections']; \
[print(f\"{v.get('name',k[:7])}: {'UP' if v['connected'] else 'DOWN'}\") for k,v in d.items()]"
# Check folder status
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
"http://127.0.0.1:8384/rest/db/status?folder=documents" | jq
# Check errors
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
"http://127.0.0.1:8384/rest/folder/errors?folder=documents" | jq
```
**Needs**: Automated monitoring script + alerts
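A minimal sketch of such a script (run on the Mac Mini, same API key as above); wire a non-zero exit into cron + mail or a Home Assistant notification:
```bash
#!/bin/bash
# Exit non-zero and list any Syncthing devices that are currently disconnected
API_KEY="oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5"
down=$(curl -s -H "X-API-Key: $API_KEY" "http://127.0.0.1:8384/rest/system/connections" | \
  python3 -c "import sys,json; d=json.load(sys.stdin)['connections']; \
print('\n'.join(v.get('name',k[:7]) for k,v in d.items() if not v['connected']))")
if [ -n "$down" ]; then
  echo "Syncthing devices down:"
  echo "$down"
  exit 1
fi
echo "All Syncthing devices connected"
```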
**See**: [SYNCTHING.md](SYNCTHING.md)
---
### Temperature Monitoring
**Status**: ⚠️ **Manual only**
**Current Method**:
```bash
# CPU temperature (Threadripper Tctl)
ssh pve 'for f in /sys/class/hwmon/hwmon*/temp*_input; do \
label=$(cat ${f%_input}_label 2>/dev/null); \
if [ "$label" = "Tctl" ]; then echo "PVE Tctl: $(($(cat $f)/1000))°C"; fi; done'
ssh pve2 'for f in /sys/class/hwmon/hwmon*/temp*_input; do \
label=$(cat ${f%_input}_label 2>/dev/null); \
if [ "$label" = "Tctl" ]; then echo "PVE2 Tctl: $(($(cat $f)/1000))°C"; fi; done'
```
**Thresholds**:
- Healthy: 70-80°C under load
- Warning: >85°C
- Critical: >90°C (throttling)
**Needs**: Automated monitoring + alert if >85°C
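A cron-able sketch using the same Tctl readout as above (mail address is illustrative):
```bash
# Alert if either server's Tctl exceeds 85°C
for host in pve pve2; do
  temp=$(ssh "$host" 'for f in /sys/class/hwmon/hwmon*/temp*_input; do
    label=$(cat ${f%_input}_label 2>/dev/null)
    [ "$label" = "Tctl" ] && echo $(($(cat $f)/1000))
  done' | head -1)
  if [ -n "$temp" ] && [ "$temp" -gt 85 ]; then
    echo "$host Tctl at ${temp}°C" | mail -s "Temperature warning: $host" hutson@example.com
  fi
done
```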
---
### Proxmox VM Monitoring
**Status**: ⚠️ **Manual only**
**Current Access**:
- Proxmox Web UI: Node → Summary
- CLI: `ssh pve 'qm list'`
**Metrics Available** (via Proxmox):
- CPU usage per VM
- RAM usage per VM
- Disk I/O
- Network I/O
- VM uptime
**Needs**: API-based monitoring + alerts for VM down
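Until API-based monitoring exists, a simple sketch over `qm list` can catch stopped VMs (wrap in cron + mail for alerts):
```bash
# Print any VM that is not in the "running" state
for host in pve pve2; do
  stopped=$(ssh "$host" 'qm list' | awk 'NR>1 && $3 != "running" {print $1, $2, $3}')
  if [ -n "$stopped" ]; then
    echo "$host has non-running VMs:"
    echo "$stopped"
  fi
done
```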
---
## Recommended Monitoring Stack
### Option 1: Prometheus + Grafana (Recommended)
**Why**:
- Industry standard
- Extensive integrations
- Beautiful dashboards
- Flexible alerting
**Architecture**:
```
Grafana (dashboard) → Prometheus (metrics DB) → Exporters (data collection)
Alertmanager (alerts)
```
**Required Exporters**:
| Exporter | Monitors | Install On |
|----------|----------|------------|
| node_exporter | CPU, RAM, disk, network | PVE, PVE2, TrueNAS, all VMs |
| zfs_exporter | ZFS pool health | PVE, PVE2, TrueNAS |
| smartmon_exporter | Drive SMART data | PVE, PVE2, TrueNAS |
| nut_exporter | UPS metrics | PVE |
| proxmox_exporter | VM/CT stats | PVE, PVE2 |
| cadvisor | Docker containers | Saltbox, docker-host |
**Deployment**:
```bash
# Create monitoring VM
ssh pve 'qm create 210 --name monitoring --memory 4096 --cores 2 \
--net0 virtio,bridge=vmbr0'
# Install Prometheus + Grafana (via Docker)
# /opt/monitoring/docker-compose.yml
```
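A minimal starting point for that compose file (a sketch only; image tags, ports, and data paths are assumptions to adjust, and exporters, Alertmanager, and TLS are omitted):
```bash
# Sketch of /opt/monitoring/docker-compose.yml on the monitoring VM
mkdir -p /opt/monitoring && cat > /opt/monitoring/docker-compose.yml << 'EOF'
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus:/etc/prometheus
      - prometheus-data:/prometheus
    restart: unless-stopped
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana
    restart: unless-stopped
volumes:
  prometheus-data:
  grafana-data:
EOF
```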
**Estimated Setup Time**: 4-6 hours
---
### Option 2: Uptime Kuma (Simpler Alternative)
**Why**:
- Lightweight
- Easy to set up
- Web-based dashboard
- Built-in alerts (email, Slack, etc.)
**What it monitors**:
- HTTP/HTTPS endpoints
- Ping (ICMP)
- Ports (TCP)
- Docker containers
**Deployment**:
```bash
ssh docker-host 'mkdir -p /opt/uptime-kuma'
cat > docker-compose.yml << 'EOF'
version: "3.8"
services:
uptime-kuma:
image: louislam/uptime-kuma:latest
ports:
- "3001:3001"
volumes:
- ./data:/app/data
restart: unless-stopped
EOF
# Access: http://10.10.10.206:3001
# Add Traefik config for uptime.htsn.io
```
**Estimated Setup Time**: 1-2 hours
---
### Option 3: Netdata (Real-time Monitoring)
**Why**:
- Real-time metrics (1-second granularity)
- Auto-discovers services
- Low overhead
- Beautiful web UI
**Deployment**:
```bash
# Install on each server
ssh pve 'bash <(curl -Ss https://my-netdata.io/kickstart.sh)'
ssh pve2 'bash <(curl -Ss https://my-netdata.io/kickstart.sh)'
# Access:
# http://10.10.10.120:19999 (PVE)
# http://10.10.10.102:19999 (PVE2)
```
**Parent-Child Setup** (optional):
- Configure PVE as parent
- Stream metrics from PVE2 → PVE
- Single dashboard for both servers
**Estimated Setup Time**: 1 hour
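A rough outline of the parent-child streaming setup above (hedged sketch; section names and paths assume a default kickstart install, so check the Netdata streaming docs before applying):
```bash
# On PVE2 (child), /etc/netdata/stream.conf needs roughly:
#   [stream]
#       enabled = yes
#       destination = 10.10.10.120:19999
#       api key = <any UUID, e.g. from uuidgen>
# On PVE (parent), the same file needs a section named after that UUID:
#   [<same UUID>]
#       enabled = yes
# Then restart both agents:
ssh pve2 'systemctl restart netdata'
ssh pve 'systemctl restart netdata'
```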
---
## Critical Metrics to Monitor
### Server Health
| Metric | Threshold | Action |
|--------|-----------|--------|
| **CPU usage** | >90% for 5 min | Alert |
| **CPU temp** | >85°C | Alert |
| **CPU temp** | >90°C | Critical alert |
| **RAM usage** | >95% | Alert |
| **Disk space** | >80% | Warning |
| **Disk space** | >90% | Alert |
| **Load average** | >CPU count | Alert |
### Storage Health
| Metric | Threshold | Action |
|--------|-----------|--------|
| **ZFS pool errors** | >0 | Alert immediately |
| **ZFS pool degraded** | Any degraded vdev | Critical alert |
| **ZFS scrub failed** | Last scrub error | Alert |
| **SMART reallocated sectors** | >0 | Warning |
| **SMART pending sectors** | >0 | Alert |
| **SMART failure** | Pre-fail | Critical - replace drive |
### UPS
| Metric | Threshold | Action |
|--------|-----------|--------|
| **Battery charge** | <20% | Warning |
| **Battery charge** | <10% | Alert |
| **On battery** | >5 min | Alert |
| **Runtime** | <5 min | Critical |
### Network
| Metric | Threshold | Action |
|--------|-----------|--------|
| **Device unreachable** | >2 min down | Alert |
| **High packet loss** | >5% | Warning |
| **Bandwidth saturation** | >90% | Warning |
### VMs/Services
| Metric | Threshold | Action |
|--------|-----------|--------|
| **VM stopped** | Critical VM down | Alert immediately |
| **Service unreachable** | HTTP 5xx or timeout | Alert |
| **Backup failed** | Any backup failure | Alert |
| **Certificate expiry** | <30 days | Warning |
| **Certificate expiry** | <7 days | Alert |
---
## Alert Destinations
### Email Alerts
**Recommended**: Set up SMTP relay for email alerts
**Options**:
1. Gmail SMTP (free, rate-limited)
2. SendGrid (free tier: 100 emails/day)
3. Mailgun (free tier available)
4. Self-hosted mail server (complex)
**Configuration Example** (Prometheus Alertmanager):
```yaml
# /etc/alertmanager/alertmanager.yml
receivers:
  - name: 'email'
    email_configs:
      - to: 'hutson@example.com'
        from: 'alerts@htsn.io'
        smarthost: 'smtp.gmail.com:587'
        auth_username: 'alerts@htsn.io'
        auth_password: 'app-password-here'
```
---
### Push Notifications
**Options**:
- **Pushover**: $5 one-time, reliable
- **Pushbullet**: Free tier available
- **Telegram Bot**: Free
- **Discord Webhook**: Free
- **Slack**: Free tier available
**Recommended**: Pushover or Telegram for mobile alerts
---
### Home Assistant Alerts
Since Home Assistant is already running, use it for alerts:
**Automation Example**:
```yaml
automation:
  - alias: "UPS Low Battery Alert"
    trigger:
      - platform: numeric_state
        entity_id: sensor.cyberpower_battery_charge
        below: 20
    action:
      - service: notify.mobile_app
        data:
          message: "⚠️ UPS battery at {{ states('sensor.cyberpower_battery_charge') }}%"
  - alias: "Server High Temperature"
    trigger:
      - platform: template
        value_template: "{{ states('sensor.pve_cpu_temp') | float(0) > 85 }}"
    action:
      - service: notify.mobile_app
        data:
          message: "🔥 PVE CPU temperature: {{ states('sensor.pve_cpu_temp') }}°C"
```
**Needs**: Sensors for CPU temp, disk space, etc. in Home Assistant
---
## Monitoring Scripts
### Daily Health Check
Save as `~/bin/homelab-health-check.sh`:
```bash
#!/bin/bash
# Daily homelab health check
echo "=== Homelab Health Check ==="
echo "Date: $(date)"
echo ""
echo "=== Server Status ==="
ssh pve 'uptime' 2>/dev/null || echo "PVE: UNREACHABLE"
ssh pve2 'uptime' 2>/dev/null || echo "PVE2: UNREACHABLE"
echo ""
echo "=== CPU Temperatures ==="
ssh pve 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE: $(($(cat $f)/1000))°C"; fi; done'
ssh pve2 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE2: $(($(cat $f)/1000))°C"; fi; done'
echo ""
echo "=== UPS Status ==="
ssh pve 'upsc cyberpower@localhost | grep -E "battery.charge:|battery.runtime:|ups.load:|ups.status:"'
echo ""
echo "=== ZFS Pools ==="
ssh pve 'zpool status -x' 2>/dev/null
ssh pve2 'zpool status -x' 2>/dev/null
ssh truenas 'zpool status -x vault'
echo ""
echo "=== Disk Space ==="
ssh pve 'df -h | grep -E "Filesystem|/dev/(nvme|sd)"'
ssh truenas 'df -h /mnt/vault'
echo ""
echo "=== VM Status ==="
ssh pve 'qm list | grep running | wc -l' | xargs echo "PVE VMs running:"
ssh pve2 'qm list | grep running | wc -l' | xargs echo "PVE2 VMs running:"
echo ""
echo "=== Syncthing Connections ==="
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
"http://127.0.0.1:8384/rest/system/connections" | \
python3 -c "import sys,json; d=json.load(sys.stdin)['connections']; \
[print(f\"{v.get('name',k[:7])}: {'UP' if v['connected'] else 'DOWN'}\") for k,v in d.items()]"
echo ""
echo "=== Check Complete ==="
```
**Run daily**:
```cron
0 9 * * * ~/bin/homelab-health-check.sh | mail -s "Homelab Health Check" hutson@example.com
```
---
### ZFS Scrub Checker
```bash
#!/bin/bash
# Check last ZFS scrub status
echo "=== ZFS Scrub Status ==="
for host in pve pve2; do
echo "--- $host ---"
ssh $host 'zpool status | grep -A1 scrub'
echo ""
done
echo "--- TrueNAS ---"
ssh truenas 'zpool status vault | grep -A1 scrub'
```
---
### SMART Health Checker
```bash
#!/bin/bash
# Check SMART health on all drives
echo "=== SMART Health Check ==="
echo "--- TrueNAS Drives ---"
ssh truenas 'smartctl --scan | while read dev type; do
echo "=== $dev ===";
smartctl -H $dev | grep -E "SMART overall|PASSED|FAILED";
done'
echo "--- PVE Drives ---"
ssh pve 'for dev in /dev/nvme* /dev/sd*; do
[ -e "$dev" ] && echo "=== $dev ===" && smartctl -H $dev | grep -E "SMART|PASSED|FAILED";
done'
```
---
## Dashboard Recommendations
### Grafana Dashboard Layout
**Page 1: Overview**
- Server uptime
- CPU usage (all servers)
- RAM usage (all servers)
- Disk space (all pools)
- Network traffic
- UPS status
**Page 2: Storage**
- ZFS pool health
- SMART status for all drives
- I/O latency
- Scrub progress
- Disk temperatures
**Page 3: VMs**
- VM status (up/down)
- VM resource usage
- VM disk I/O
- VM network traffic
**Page 4: Services**
- Service health checks
- HTTP response times
- Certificate expiry dates
- Syncthing sync status
---
## Implementation Plan
### Phase 1: Basic Monitoring (Week 1)
- [ ] Install Uptime Kuma or Netdata
- [ ] Add HTTP checks for all services
- [ ] Configure UPS alerts in Home Assistant
- [ ] Set up daily health check email
**Estimated Time**: 4-6 hours
---
### Phase 2: Advanced Monitoring (Week 2-3)
- [ ] Install Prometheus + Grafana
- [ ] Deploy node_exporter on all servers
- [ ] Deploy zfs_exporter
- [ ] Deploy smartmon_exporter
- [ ] Create Grafana dashboards
**Estimated Time**: 8-12 hours
---
### Phase 3: Alerting (Week 4)
- [ ] Configure Alertmanager
- [ ] Set up email/push notifications
- [ ] Create alert rules for all critical metrics
- [ ] Test all alert paths
- [ ] Document alert procedures
**Estimated Time**: 4-6 hours
---
## Related Documentation
- [GATEWAY.md](GATEWAY.md) - Gateway monitoring and troubleshooting
- [UPS.md](UPS.md) - UPS monitoring details
- [STORAGE.md](STORAGE.md) - ZFS health checks
- [SERVICES.md](SERVICES.md) - Service inventory
- [HOMEASSISTANT.md](HOMEASSISTANT.md) - Home Assistant automations
- [MAINTENANCE.md](MAINTENANCE.md) - Regular maintenance checks
---
**Last Updated**: 2026-01-02
**Status**: ⚠️ **Partial monitoring - Gateway active, other systems need implementation**

N8N-INTEGRATIONS.md Normal file

@@ -0,0 +1,382 @@
# n8n Homelab Integrations - Quick Start Guide
n8n is running on your homelab network (10.10.10.207) and can access all local services. This guide sets up useful automations.
---
## Network Access Verified
n8n can connect to:
- **Home Assistant** (10.10.10.110:8123)
- **Prometheus** (10.10.10.206:9090)
- **Grafana** (10.10.10.206:3001)
- **Syncthing** (10.10.10.200:8384)
- **PiHole** (10.10.10.10)
- **Gitea** (10.10.10.220:3000)
- **Proxmox** (10.10.10.120:8006, 10.10.10.102:8006)
- **TrueNAS** (10.10.10.200)
- **All external APIs** (via internet)
---
## Initial Setup (First-Time)
1. Open **https://n8n.htsn.io**
2. Complete the setup wizard:
- **Owner Email:** hutson@htsn.io
- **Owner Name:** Hutson
- **Password:** (choose secure password)
3. Skip data sharing (optional)
---
## Credentials to Add in n8n
Go to **Settings → Credentials** and add:
### 1. Home Assistant
| Field | Value |
|-------|-------|
| **Credential Type** | Home Assistant API |
| **Host** | `http://10.10.10.110:8123` |
| **Access Token** | (get from Home Assistant) |
**Get Token:** Home Assistant → Profile → Long-Lived Access Tokens → Create Token
---
### 2. Prometheus
| Field | Value |
|-------|-------|
| **Credential Type** | HTTP Request (Generic) |
| **URL** | `http://10.10.10.206:9090` |
| **Authentication** | None |
---
### 3. Grafana
| Field | Value |
|-------|-------|
| **Credential Type** | Grafana API |
| **URL** | `http://10.10.10.206:3001` |
| **API Key** | (create in Grafana) |
**Get API Key:** Grafana → Administration → Service Accounts → Create → Add Token
---
### 4. Syncthing
| Field | Value |
|-------|-------|
| **Credential Type** | HTTP Request (Generic) |
| **URL** | `http://10.10.10.200:8384` |
| **Header Name** | `X-API-Key` |
| **Header Value** | `VFJ7XZPJoWvkYj6fKzpQxc9u3XC8KUBs` |
---
### 5. Telegram Bot
| Field | Value |
|-------|-------|
| **Credential Type** | Telegram API |
| **Access Token** | `8450212653:AAHoVBlNUuA0vtrVPMNUfSgJh_gmFMxlrBg` |
**Your Chat ID:** `1004084736`
---
### 6. Proxmox
| Field | Value |
|-------|-------|
| **Credential Type** | HTTP Request (Generic) |
| **URL** | `http://10.10.10.120:8006` |
| **Authentication** | API Token |
| **Token** | (use monitoring@pve token if needed) |
---
## Starter Workflows
### Workflow 1: Homelab Health Check (Every Hour)
**Nodes:**
1. **Schedule Trigger** (every hour)
2. **HTTP Request** → Prometheus query for down hosts
- URL: `http://10.10.10.206:9090/api/v1/query`
- Query param: `query=up{job=~"node.*"} == 0`
3. **If** → Check if any hosts are down
4. **Telegram** → Send alert if hosts down
**PromQL Query:**
```
up{job=~"node.*"} == 0
```
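The same query can be tested by hand before wiring it into the workflow:
```bash
# Ask Prometheus directly for node exporters that are down
curl -s 'http://10.10.10.206:9090/api/v1/query' \
  --data-urlencode 'query=up{job=~"node.*"} == 0' | jq '.data.result'
```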
---
### Workflow 2: Daily Backup Status
**Nodes:**
1. **Schedule Trigger** (8am daily)
2. **HTTP Request** → Query Syncthing sync status
- URL: `http://10.10.10.200:8384/rest/db/status?folder=backup`
- Header: `X-API-Key: VFJ7XZPJoWvkYj6fKzpQxc9u3XC8KUBs`
3. **Function** → Check if folder is syncing
4. **Telegram** → Send daily status report
---
### Workflow 3: High CPU Alert
**Nodes:**
1. **Schedule Trigger** (every 5 minutes)
2. **HTTP Request** → Prometheus CPU query
- URL: `http://10.10.10.206:9090/api/v1/query`
- Query: `100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)`
3. **If** → CPU > 90%
4. **Telegram** → Send alert
---
### Workflow 4: UPS Power Event
**Webhook Trigger Setup:**
1. Create webhook trigger in n8n
2. Get webhook URL: `https://n8n.htsn.io/webhook/ups-alert`
3. Configure NUT to call webhook on power events
**Nodes:**
1. **Webhook Trigger** → Receive UPS event
2. **Switch** → Route by event type (on battery, low battery, online)
3. **Telegram** → Send appropriate alert
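A minimal sketch for step 3 of the webhook setup (the script path and NOTIFYFLAG choices are assumptions; upsmon exports `UPSNAME`/`NOTIFYTYPE` and passes the message as `$1`):
```bash
#!/bin/bash
# e.g. /usr/local/bin/ups-webhook.sh on PVE, referenced from /etc/nut/upsmon.conf:
#   NOTIFYCMD /usr/local/bin/ups-webhook.sh
#   NOTIFYFLAG ONBATT  SYSLOG+EXEC
#   NOTIFYFLAG ONLINE  SYSLOG+EXEC
#   NOTIFYFLAG LOWBATT SYSLOG+EXEC
curl -s -X POST "https://n8n.htsn.io/webhook/ups-alert" \
  -H "Content-Type: application/json" \
  -d "{\"ups\": \"${UPSNAME:-unknown}\", \"event\": \"${NOTIFYTYPE:-unknown}\", \"message\": \"$1\"}"
```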
---
### Workflow 5: Gitea → Deploy on Push
**Nodes:**
1. **Webhook Trigger** → Gitea push event
2. **If** → Check if branch is `main`
3. **SSH** → Connect to target server
4. **Execute Command** → `git pull && docker-compose up -d`
5. **Telegram** → Notify deployment complete
---
### Workflow 6: Syncthing Folder Behind Alert
**Nodes:**
1. **Schedule Trigger** (every 30 minutes)
2. **HTTP Request** → Get all folder statuses
- URL: `http://10.10.10.200:8384/rest/stats/folder`
3. **Function** → Check if any folder has errors or is significantly behind
4. **If** → Errors found
5. **Telegram** → Alert with folder name and status
---
### Workflow 7: Grafana Alert Forwarder
**Purpose:** Forward Grafana alerts to Telegram
**Nodes:**
1. **Webhook Trigger** → Grafana webhook
2. **Function** → Parse alert data
3. **Telegram** → Format and send alert
**Grafana Setup:**
- Contact Point → Add webhook: `https://n8n.htsn.io/webhook/grafana-alerts`
---
### Workflow 8: Daily Homelab Summary
**Nodes:**
1. **Schedule Trigger** (9am daily)
2. **Multiple HTTP Requests in parallel:**
- Prometheus: System uptime
- Prometheus: Average CPU usage (24h)
- Prometheus: Disk usage
- Syncthing: Sync status (all folders)
- PiHole: Queries blocked (24h)
3. **Function** → Format data as summary
4. **Telegram** → Send daily report
**Example Output:**
```
🏠 Homelab Daily Summary
✅ All systems operational
⏱️ Uptime: 14 days
📊 Avg CPU: 12%
💾 Disk: 45% used
🔄 Syncthing: All folders in sync
🛡️ PiHole: 2,341 queries blocked
Last updated: 2025-12-27 09:00
```
---
### Workflow 9: VM State Change Monitor
**Nodes:**
1. **Schedule Trigger** (every 1 minute)
2. **HTTP Request** → Query Proxmox API for VM list
3. **Function** → Compare with previous state (use Set node)
4. **If** → VM state changed
5. **Telegram** → Notify VM started/stopped
---
### Workflow 10: Internet Speed Test Alert
**Nodes:**
1. **Schedule Trigger** (every 6 hours)
2. **HTTP Request** → Prometheus speedtest exporter
3. **If** → Download speed < 500 Mbps
4. **Telegram** → Alert about slow internet
---
## Advanced Integration Ideas
### Home Assistant Automations
- Turn on lights when server room temperature > 80°F
- Trigger workflows from HA button press
- Send sensor data to external services
### Proxmox Automation
- Auto-snapshot VMs before updates
- Clone VMs for testing
- Monitor resource usage and rebalance
### Media Management
- Notify when new Plex content added
- Auto-organize downloads
- Send weekly watch statistics
### Backup Monitoring
- Verify all Syncthing folders synced
- Alert on ZFS scrub errors
- Monitor snapshot ages
### Security
- Alert on failed SSH attempts (from logs)
- Monitor SSL certificate expiration
- Track unusual network traffic patterns
---
## n8n Best Practices
1. **Error Handling:** Always add error workflows to catch failures
2. **Rate Limiting:** Don't query APIs too frequently
3. **Credentials:** Never hardcode - always use credential store
4. **Testing:** Use manual trigger during development
5. **Logging:** Add Set nodes to track workflow state
6. **Backups:** Export workflows regularly (Settings → Export)
---
## Useful PromQL Queries for n8n
**CPU Usage:**
```promql
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```
**Memory Usage:**
```promql
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
```
**Disk Usage:**
```promql
(node_filesystem_size_bytes{mountpoint="/"} - node_filesystem_avail_bytes{mountpoint="/"}) / node_filesystem_size_bytes{mountpoint="/"} * 100
```
**Hosts Down:**
```promql
up{job=~"node.*"} == 0
```
**Syncthing Disconnected:**
```promql
up{job=~"syncthing.*"} == 0
```
---
## Webhook URLs
After creating webhooks in n8n, you'll get URLs like:
- `https://n8n.htsn.io/webhook/your-webhook-name`
These can be called from:
- Grafana alerts
- Home Assistant automations
- Gitea webhooks
- Custom scripts
- UPS monitoring (NUT)
---
## Testing Credentials
Test each credential after adding:
1. Create simple workflow with manual trigger
2. Add HTTP Request node with credential
3. Execute and check response
4. Verify data returned correctly
---
## Troubleshooting
**Can't reach local service:**
- Verify service IP and port
- Check if service requires HTTPS
- Test with `curl` from docker-host2 first
**Webhook not triggering:**
- Check n8n is accessible: `curl https://n8n.htsn.io/webhook/test`
- Verify webhook URL in external service
- Check n8n execution logs
**Workflow fails silently:**
- Enable "Execute on Error" workflow
- Check workflow execution list
- Add Function nodes to log data
**API authentication fails:**
- Verify credential is saved
- Check API token hasn't expired
- Test with curl manually first
---
## Next Steps
1. **Add Credentials** - Start with Telegram and Prometheus
2. **Create Test Workflow** - Simple hourly health check
3. **Test Telegram** - Verify messages arrive
4. **Build Gradually** - Add one workflow at a time
5. **Export Backups** - Save workflows regularly
---
## Resources
- **n8n Docs:** https://docs.n8n.io
- **Community Workflows:** https://n8n.io/workflows
- **Your n8n:** https://n8n.htsn.io
- **Your API Docs:** [N8N.md](N8N.md)
**Last Updated:** 2025-12-27

346
N8N.md Normal file
View File

@@ -0,0 +1,346 @@
# n8n - Workflow Automation
n8n is an extendable workflow automation tool deployed on docker-host2 for automating tasks across your homelab and external services.
---
## Quick Reference
| Setting | Value |
|---------|-------|
| **URL** | https://n8n.htsn.io |
| **Local IP** | 10.10.10.207:5678 |
| **Server** | docker-host2 (PVE2 VMID 302) |
| **Database** | PostgreSQL (containerized) |
| **API Endpoint** | http://10.10.10.207:5678/api/v1/ |
---
## Claude Code Integration (MCP)
### n8n-MCP Server
The n8n-MCP server gives Claude Code deep knowledge of all 545+ n8n nodes, enabling it to build complete workflows from natural language descriptions.
**Installation:** Already configured in `~/Library/Application Support/Claude/claude_desktop_config.json`
```json
{
"mcpServers": {
"n8n-nodes": {
"command": "npx",
"args": ["-y", "@czlonkowski/n8n-mcp"]
}
}
}
```
**What This Enables:**
- ✅ Build n8n workflows from natural language
- ✅ Get detailed help with node parameters and options
- ✅ Best practices for n8n node usage
- ✅ Debug workflow issues with full node context
**Example Prompts:**
```
"Create an n8n workflow to monitor Prometheus and send Telegram alerts"
"Build a workflow that triggers when Syncthing has errors"
"What's the best n8n node to parse JSON responses?"
```
**How It Works:**
- MCP server provides offline documentation for all n8n nodes
- No connection to your n8n instance required
- Claude builds workflows that you can then import into https://n8n.htsn.io
**Resources:**
- [n8n-MCP GitHub](https://github.com/czlonkowski/n8n-mcp)
- [MCP Documentation](https://docs.n8n.io/advanced-ai/accessing-n8n-mcp-server/)
---
## API Access
### API Key
```
X-N8N-API-KEY: eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiI3NTdiMDA5YS1hMjM2LTQ5MzUtODkwNS0xZDY1MjYzZWE2OWYiLCJpc3MiOiJuOG4iLCJhdWQiOiJwdWJsaWMtYXBpIiwiaWF0IjoxNzY2ODEwMzA3fQ.RIZAbpDa7LiUPWk48qOscJ9-d9gRAA0afMDX_V3oSVo
```
### API Examples
**List Workflows:**
```bash
curl -H "X-N8N-API-KEY: eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiI3NTdiMDA5YS1hMjM2LTQ5MzUtODkwNS0xZDY1MjYzZWE2OWYiLCJpc3MiOiJuOG4iLCJhdWQiOiJwdWJsaWMtYXBpIiwiaWF0IjoxNzY2ODEwMzA3fQ.RIZAbpDa7LiUPWk48qOscJ9-d9gRAA0afMDX_V3oSVo" \
http://10.10.10.207:5678/api/v1/workflows
```
**Get Workflow by ID:**
```bash
curl -H "X-N8N-API-KEY: YOUR_API_KEY" \
http://10.10.10.207:5678/api/v1/workflows/{id}
```
**Trigger Workflow:**
```bash
curl -X POST \
-H "X-N8N-API-KEY: YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{"data": {"key": "value"}}' \
http://10.10.10.207:5678/api/v1/workflows/{id}/execute
```
**API Documentation:** https://docs.n8n.io/api/
---
## Deployment Details
### Docker Compose
**Location:** `/opt/n8n/docker-compose.yml` on docker-host2
**Services:**
- `n8n` - Main application (port 5678)
- `postgres` - Database backend
**Volumes:**
- `n8n_data` - Workflow data, credentials, settings
- `postgres_data` - Database storage
### Environment Configuration
```yaml
N8N_HOST: n8n.htsn.io
N8N_PORT: 5678
N8N_PROTOCOL: https
NODE_ENV: production
WEBHOOK_URL: https://n8n.htsn.io/
GENERIC_TIMEZONE: America/Los_Angeles
DB_TYPE: postgresdb
DB_POSTGRESDB_HOST: postgres
DB_POSTGRESDB_DATABASE: n8n
DB_POSTGRESDB_USER: n8n
DB_POSTGRESDB_PASSWORD: n8n_secure_password_2024
```
### Resource Limits
- **Memory**: 512MB-1GB (soft/hard)
- **CPU**: Shared (4 vCPUs on host)
---
## Common Tasks
### Restart n8n
```bash
ssh docker-host2 'cd /opt/n8n && docker compose restart n8n'
```
### View Logs
```bash
ssh docker-host2 'docker logs -f n8n'
```
### Backup Workflows
Workflows are stored in PostgreSQL. To backup:
```bash
ssh docker-host2 'docker exec n8n-postgres pg_dump -U n8n n8n > /tmp/n8n-backup-$(date +%Y%m%d).sql'
```
### Update n8n
```bash
ssh docker-host2 'cd /opt/n8n && docker compose pull n8n && docker compose up -d n8n'
```
---
## Traefik Configuration
**File:** `/etc/traefik/conf.d/n8n.yaml` on CT 202
```yaml
http:
routers:
n8n-secure:
entryPoints:
- websecure
rule: "Host(`n8n.htsn.io`)"
service: n8n
tls:
certResolver: cloudflare
priority: 50
n8n-redirect:
entryPoints:
- web
rule: "Host(`n8n.htsn.io`)"
middlewares:
- n8n-https-redirect
service: n8n
priority: 50
services:
n8n:
loadBalancer:
servers:
- url: "http://10.10.10.207:5678"
middlewares:
n8n-https-redirect:
redirectScheme:
scheme: https
permanent: true
```
---
## Monitoring
### Prometheus
n8n exposes metrics at `http://10.10.10.207:5678/metrics` (if enabled)
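To enable the endpoint, a sketch assuming the `N8N_METRICS` environment variable (verify against the n8n docs for your version):
```bash
# Add N8N_METRICS: "true" to the environment block in /opt/n8n/docker-compose.yml,
# then recreate the container and check the endpoint
ssh docker-host2 'cd /opt/n8n && docker compose up -d n8n'
curl -s http://10.10.10.207:5678/metrics | head
```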
### Grafana
n8n metrics can be visualized in Grafana dashboards
### Uptime Monitoring
Add to Pulse: https://pulse.htsn.io
- Monitor: https://n8n.htsn.io
- Check interval: 60s
---
## Troubleshooting
### n8n won't start
```bash
ssh docker-host2 'docker logs n8n | tail -50'
ssh docker-host2 'docker logs n8n-postgres | tail -50'
```
### Database connection issues
```bash
# Check postgres health
ssh docker-host2 'docker exec n8n-postgres pg_isready -U n8n'
# Restart postgres
ssh docker-host2 'cd /opt/n8n && docker compose restart postgres'
```
### SSL/HTTPS issues
```bash
# Check Traefik config
ssh root@10.10.10.250 'cat /etc/traefik/conf.d/n8n.yaml'
# Reload Traefik
ssh root@10.10.10.250 'systemctl reload traefik'
```
### API not responding
```bash
# Test API locally
curl -H "X-N8N-API-KEY: YOUR_KEY" http://10.10.10.207:5678/api/v1/workflows
# Check if n8n container is healthy
ssh docker-host2 'docker ps | grep n8n'
```
### Remove "This message was sent automatically by n8n" signature from Telegram messages
**Problem:** n8n Telegram node adds attribution signature to all messages by default.
**Solution:** Use the correct parameter name `appendAttribution` (camelCase, not snake_case) in `additionalFields`:
```bash
# Get workflow
curl -H "X-N8N-API-KEY: $(cat /tmp/n8n-key.txt)" \
http://10.10.10.207:5678/api/v1/workflows/WORKFLOW_ID > workflow.json
# Update all Telegram nodes (using jq)
cat workflow.json | jq '.nodes = (.nodes | map(
if .type == "n8n-nodes-base.telegram" then
.parameters.additionalFields.appendAttribution = false
else
.
end
))' | jq '{name, nodes, connections, settings, staticData}' > workflow-fixed.json
# Upload updated workflow
curl -X PUT \
-H "X-N8N-API-KEY: $(cat /tmp/n8n-key.txt)" \
-H 'Content-Type: application/json' \
-d @workflow-fixed.json \
http://10.10.10.207:5678/api/v1/workflows/WORKFLOW_ID
# Restart n8n to reload workflow
ssh docker-host2 'cd /opt/n8n && docker compose restart n8n'
```
**Important Notes:**
- Parameter must be `appendAttribution` (camelCase), not `append_attribution` or `append_n8n_attribution`
- Must restart n8n after updating workflow for changes to take effect
- This applies to all Telegram message nodes in the workflow
**Fixed:** 2026-01-23
---
## Integration Examples
### Homelab Automation Ideas
1. **Backup Notifications** - Send Telegram alerts when backups complete
2. **Server Monitoring** - Query Prometheus and alert on high CPU/memory
3. **Media Management** - Trigger Sonarr/Radarr downloads
4. **Home Assistant Integration** - Automate smart home workflows
5. **Git Webhooks** - Deploy changes from Gitea automatically
6. **Syncthing Monitoring** - Alert when sync folders get behind
7. **UPS Alerts** - Notify on power events from NUT
---
## Security Notes
- API key provides full access to all workflows and data
- Store API key securely (added to this doc for homelab reference)
- n8n credentials are encrypted at rest in PostgreSQL
- HTTPS enforced via Traefik
- No public internet exposure (only via Tailscale)
---
## Quick Start
**New to n8n?** Start here: **[N8N-INTEGRATIONS.md](N8N-INTEGRATIONS.md)** ⭐
This guide includes:
- ✅ Network access verification
- ✅ Credential setup for all homelab services
- ✅ 10 ready-to-use starter workflows
- ✅ Home Assistant, Prometheus, Syncthing, Telegram integrations
- ✅ Troubleshooting tips
---
## Related Documentation
- [n8n Homelab Integrations Guide](N8N-INTEGRATIONS.md) - **START HERE**
- [docker-host2 VM details](VMS.md)
- [Traefik reverse proxy](TRAEFIK.md)
- [IP Assignments](IP-ASSIGNMENTS.md)
- [Pulse Setup](PULSE-SETUP.md)
**Last Updated:** 2025-12-26

339
PA-API.md Normal file
View File

@@ -0,0 +1,339 @@
# Personal Assistant API
Backend API for the Personal Assistant system - provides Claude-powered voice/text interface to all PA capabilities (calendar, tasks, messages, smart home, etc.).
---
## Quick Reference
| Setting | Value |
|---------|-------|
| **Domain** | pa.htsn.io |
| **Local IP** | 10.10.10.207:8401 |
| **Server** | docker-host2 (PVE2 VMID 302) |
| **Compose** | `/opt/pa-api/docker-compose.yml` |
| **Access** | Tailscale only (not publicly exposed) |
| **GitHub** | Private repo: `pa-api` |
---
## Architecture
```
     Android/Telegram
    ┌─────────────────┐
    │     PA API      │  ← Claude SDK, model routing
    │  docker-host2   │
    │      :8401      │
    └────────┬────────┘
        ┌────┴────┐
        │         │
        ▼         ▼
   ┌───────┐  ┌──────────┐
   │ Rube  │  │MCP Bridge│  ← Mac Mini (Beeper, Proton, etc.)
   │ Exa   │  │  :8400   │
   │ etc.  │  └──────────┘
   └───────┘
```
**PA API handles:**
- Claude SDK integration (no CLI startup delay)
- Model routing (Haiku/Sonnet/Opus)
- Session management
- Direct API tools (Exa, Ref, Rube, Airtable)
**MCP Bridge handles:**
- Tools requiring Mac Mini (Beeper, Proton Bridge, filesystem)
- Runs on Mac Mini at 10.10.10.125:8400
---
## API Endpoints
| Endpoint | Method | Purpose |
|----------|--------|---------|
| `/chat` | POST | Main query endpoint (streaming SSE) |
| `/health` | GET | Health check |
### POST /chat
**Request:**
```json
{
"message": "What's on my calendar today?",
"session_id": "abc123"
}
```
**Response (Server-Sent Events):**
```
data: {"type": "model", "name": "sonnet"}
data: {"type": "chunk", "text": "You have "}
data: {"type": "chunk", "text": "3 meetings today..."}
data: {"type": "done", "full_text": "You have 3 meetings today..."}
```
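To watch the stream from a shell (a sketch, assuming direct access to the service on docker-host2):
```bash
# -N disables curl buffering so SSE events print as they arrive
curl -N -X POST http://10.10.10.207:8401/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "Hello", "session_id": "abc123"}'
```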
### Model Routing
| Query Type | Model | Examples |
|------------|-------|----------|
| Simple facts | Haiku | "How old is X?", "What's 15% of 80?" |
| PA queries | Sonnet | "What's on my calendar?", "Add task" |
| Complex reasoning | Opus | "Help me plan my week" |
**Override:** Say "Use Opus" to force model selection (sticky per session).
---
## Deployment
### Docker Compose
Location: `/opt/pa-api/docker-compose.yml`
```yaml
version: '3.8'
services:
pa-api:
image: pa-api:latest
build: .
container_name: pa-api
restart: unless-stopped
ports:
- "8401:8401"
environment:
- ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
- MCP_BRIDGE_URL=http://10.10.10.125:8400
- EXA_API_KEY=${EXA_API_KEY}
# Add other API keys as needed
volumes:
- ./data:/app/data
networks:
- pa-network
networks:
pa-network:
driver: bridge
```
### Environment Variables
| Variable | Purpose |
|----------|---------|
| `ANTHROPIC_API_KEY` | Claude API access |
| `MCP_BRIDGE_URL` | Mac Mini bridge endpoint |
| `EXA_API_KEY` | Exa web search |
| `AIRTABLE_API_KEY` | Airtable access |
Store in `/opt/pa-api/.env` (not committed to git).
---
## Traefik Configuration
File: `/etc/traefik/conf.d/pa-api.yaml` (on CT 202)
```yaml
http:
routers:
pa-api:
rule: "Host(`pa.htsn.io`)"
entryPoints:
- websecure
service: pa-api
tls:
certResolver: cloudflare
services:
pa-api:
loadBalancer:
servers:
- url: "http://10.10.10.207:8401"
```
**Note:** This service is Tailscale-only. The Traefik route exists for convenience but should not be exposed publicly via Cloudflare.
---
## Common Tasks
### Start/Stop Service
```bash
# SSH to docker-host2
ssh docker-host2
# Start
cd /opt/pa-api && docker-compose up -d
# Stop
cd /opt/pa-api && docker-compose down
# View logs
docker logs -f pa-api
# Restart
docker-compose restart pa-api
```
### Update Service
```bash
ssh docker-host2
cd /opt/pa-api
git pull
docker-compose build
docker-compose up -d
```
### Health Check
```bash
# From any machine on network
curl http://10.10.10.207:8401/health
# Test chat endpoint
curl -X POST http://10.10.10.207:8401/chat \
-H "Content-Type: application/json" \
-d '{"message": "Hello", "session_id": "test"}'
```
---
## MCP Bridge (Mac Mini)
The MCP Bridge runs on Mac Mini and exposes MCP tools as HTTP endpoints.
| Setting | Value |
|---------|-------|
| **Location** | Mac Mini (10.10.10.125) |
| **Port** | 8400 |
| **Purpose** | Execute MCP tools (Beeper, Proton, TickTick, HA, etc.) |
### Bridge Endpoints
| Endpoint | Method | Purpose |
|----------|--------|---------|
| `/tools` | GET | List available tools |
| `/execute` | POST | Execute a tool |
| `/health` | GET | Health check |
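A sketch of exercising the bridge by hand; the `/execute` request body and tool name below are assumptions, not the documented schema:
```bash
# List available tools
curl -s http://10.10.10.125:8400/tools | jq
# Execute a tool (hypothetical request body - check the bridge code for the real schema)
curl -s -X POST http://10.10.10.125:8400/execute \
  -H "Content-Type: application/json" \
  -d '{"tool": "homeassistant.get_state", "arguments": {"entity_id": "light.office"}}' | jq
```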
### Start MCP Bridge
```bash
# SSH to Mac Mini
ssh macmini
# Start bridge (managed by launchd)
launchctl load ~/Library/LaunchAgents/com.hutson.mcp-bridge.plist
# Check status
curl http://localhost:8400/health
```
---
## Integration Points
### Related Services
| Service | Relationship |
|---------|--------------|
| n8n | Telegram bot uses n8n → Claude CLI (separate path) |
| MetaMCP | PA API does NOT use MetaMCP (direct MCP Bridge) |
| Home Assistant | Controlled via MCP Bridge |
| Claude-Mem | Shared memory database for context |
### Clients
| Client | Connection |
|--------|------------|
| Android App | HTTPS via Tailscale → pa.htsn.io |
| (Future) Web UI | Same endpoint |
---
## Monitoring
### Health Checks
```bash
# PA API
curl -s http://10.10.10.207:8401/health | jq
# MCP Bridge
curl -s http://10.10.10.125:8400/health | jq
```
### Logs
```bash
# PA API logs
ssh docker-host2 'docker logs -f pa-api --tail 100'
# MCP Bridge logs (Mac Mini)
ssh macmini 'tail -f ~/Library/Logs/mcp-bridge.log'
```
---
## Troubleshooting
### PA API Not Responding
1. Check container status:
```bash
ssh docker-host2 'docker ps | grep pa-api'
```
2. Check logs for errors:
```bash
ssh docker-host2 'docker logs pa-api --tail 50'
```
3. Verify network:
```bash
curl http://10.10.10.207:8401/health
```
### MCP Bridge Not Responding
1. Check if Mac Mini is reachable:
```bash
ping 10.10.10.125
```
2. Check bridge process:
```bash
ssh macmini 'pgrep -f mcp-bridge'
```
3. Restart bridge:
```bash
ssh macmini 'launchctl unload ~/Library/LaunchAgents/com.hutson.mcp-bridge.plist'
ssh macmini 'launchctl load ~/Library/LaunchAgents/com.hutson.mcp-bridge.plist'
```
### Model Routing Issues
- Check Claude API key is valid
- Verify Haiku classifier is responding
- Check session storage for stuck model overrides
---
## Related Documentation
- [IP-ASSIGNMENTS.md](IP-ASSIGNMENTS.md) - Service IP mapping
- [VMS.md](VMS.md) - docker-host2 VM details
- [TRAEFIK.md](TRAEFIK.md) - Reverse proxy configuration
- [Personal Assistant Project](~/Projects/personal-assistant/CLAUDE.md) - PA system overview
- [Services Matrix](~/Projects/personal-assistant/docs/services-matrix.md) - All MCP tools
---
**Last Updated**: 2026-01-07

509
POWER-MANAGEMENT.md Normal file
View File

@@ -0,0 +1,509 @@
# Power Management and Optimization
Documentation of power optimizations applied to reduce idle power consumption and heat generation.
## Overview
Combined estimated power draw: **~1000-1350W under load**, **500-700W idle**
Through various optimizations, we've reduced idle power consumption by approximately **150-300W** compared to default settings.
---
## Power Draw Estimates
### PVE (10.10.10.120)
| Component | Idle | Load | TDP |
|-----------|------|------|-----|
| Threadripper PRO 3975WX | 150-200W | 400-500W | 280W |
| NVIDIA TITAN RTX | 2-3W | 250W | 280W |
| NVIDIA Quadro P2000 | 25W | 70W | 75W |
| RAM (128 GB DDR4) | 30-40W | 30-40W | - |
| Storage (NVMe + SSD) | 20-30W | 40-50W | - |
| HBAs, fans, misc | 20-30W | 20-30W | - |
| **Total** | **250-350W** | **800-940W** | - |
### PVE2 (10.10.10.102)
| Component | Idle | Load | TDP |
|-----------|------|------|-----|
| Threadripper PRO 3975WX | 150-200W | 400-500W | 280W |
| NVIDIA RTX A6000 | 11W | 280W | 300W |
| RAM (128 GB DDR4) | 30-40W | 30-40W | - |
| Storage (NVMe + HDD) | 20-30W | 40-50W | - |
| Fans, misc | 15-20W | 15-20W | - |
| **Total** | **226-330W** | **765-890W** | - |
### Combined
| Metric | Idle | Load |
|--------|------|------|
| Servers | 476-680W | 1565-1830W |
| Network gear | ~50W | ~50W |
| **Total** | **~530-730W** | **~1615-1880W** |
| **UPS Load** | 40-55% | 120-140% ⚠️ |
**Note**: UPS capacity is 1320W. Under heavy load, servers can exceed UPS capacity, which is acceptable since high load is rare.
---
## Optimizations Applied
### 1. KSMD Disabled (2024-12-17)
**KSM** (Kernel Same-page Merging) scans memory to deduplicate identical pages across VMs.
**Problem**:
- KSMD was consuming 44-57% CPU continuously on PVE
- Caused CPU temp to rise from 74°C to 83°C
- **Net loss**: more power was spent scanning than was saved by deduplication
**Solution**: Disabled KSM permanently
**Configuration**:
**Systemd service**: `/etc/systemd/system/disable-ksm.service`
```ini
[Unit]
Description=Disable KSM (Kernel Same-page Merging)
After=multi-user.target
[Service]
Type=oneshot
ExecStart=/bin/sh -c 'echo 0 > /sys/kernel/mm/ksm/run'
RemainAfterExit=yes
[Install]
WantedBy=multi-user.target
```
**Enable and start**:
```bash
systemctl daemon-reload
systemctl enable --now disable-ksm
systemctl mask ksmtuned # Prevent re-enabling
```
**Verify**:
```bash
# KSM should be disabled (run=0)
cat /sys/kernel/mm/ksm/run # Should output: 0
# ksmd should show 0% CPU
ps aux | grep ksmd
```
**Savings**: ~60-80W, plus a significant temperature reduction (prevents the 74°C → 83°C climb)
**⚠️ Important**: Proxmox updates sometimes re-enable KSM. If CPU is unexpectedly hot, check:
```bash
cat /sys/kernel/mm/ksm/run
# If 1, disable it:
echo 0 > /sys/kernel/mm/ksm/run
systemctl mask ksmtuned
```
---
### 2. CPU Governor Optimization (2024-12-16)
Default CPU governor keeps cores at max frequency even when idle, wasting power.
#### PVE: `amd-pstate-epp` Driver
**Driver**: `amd-pstate-epp` (modern AMD P-state driver)
**Governor**: `powersave`
**EPP**: `balance_power`
**Configuration**:
**Systemd service**: `/etc/systemd/system/cpu-powersave.service`
```ini
[Unit]
Description=Set CPU governor to powersave with balance_power EPP
After=multi-user.target
[Service]
Type=oneshot
ExecStart=/bin/sh -c 'for cpu in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo powersave > $cpu; done'
ExecStart=/bin/sh -c 'for cpu in /sys/devices/system/cpu/cpu*/cpufreq/energy_performance_preference; do echo balance_power > $cpu; done'
RemainAfterExit=yes
[Install]
WantedBy=multi-user.target
```
**Enable**:
```bash
systemctl daemon-reload
systemctl enable --now cpu-powersave
```
**Verify**:
```bash
# Check governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# Output: powersave
# Check EPP
cat /sys/devices/system/cpu/cpu0/cpufreq/energy_performance_preference
# Output: balance_power
# Check current frequency (should be low when idle)
grep MHz /proc/cpuinfo | head -5
# Should show ~1700-2200 MHz idle, up to 4000 MHz under load
```
#### PVE2: `acpi-cpufreq` Driver
**Driver**: `acpi-cpufreq` (older ACPI driver)
**Governor**: `schedutil` (adaptive, better than powersave for this driver)
**Configuration**:
**Systemd service**: `/etc/systemd/system/cpu-powersave.service`
```ini
[Unit]
Description=Set CPU governor to schedutil
After=multi-user.target
[Service]
Type=oneshot
ExecStart=/bin/sh -c 'for cpu in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo schedutil > $cpu; done'
RemainAfterExit=yes
[Install]
WantedBy=multi-user.target
```
**Enable**:
```bash
systemctl daemon-reload
systemctl enable --now cpu-powersave
```
**Verify**:
```bash
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# Output: schedutil
grep MHz /proc/cpuinfo | head -5
# Should show ~1700-2200 MHz idle
```
**Savings**: ~60-120W combined (CPUs now idle at 1.7-2.2 GHz instead of 4 GHz)
**Performance impact**: Minimal - CPU still boosts to max frequency under load
---
### 3. GPU Power States (2024-12-16)
GPUs automatically enter low-power states when idle. Verified optimal.
| GPU | Location | Idle Power | P-State | Notes |
|-----|----------|------------|---------|-------|
| RTX A6000 | PVE2 | 11W | P8 | Excellent idle power |
| TITAN RTX | PVE | 2-3W | P8 | Excellent idle power |
| Quadro P2000 | PVE | 25W | P0 | Plex keeps it active |
**Check GPU power state**:
```bash
# Via nvidia-smi (if installed in VM)
ssh lmdev1 'nvidia-smi --query-gpu=name,power.draw,pstate --format=csv'
# Expected output:
# name, power.draw [W], pstate
# NVIDIA TITAN RTX, 2.50 W, P8
# Via lspci (from Proxmox host - shows link speed, not power)
ssh pve 'lspci | grep -i nvidia'
```
**P-States**:
- **P0**: Maximum performance
- **P8**: Minimum power (idle)
**No action needed** - GPUs automatically manage power states.
**Savings**: N/A (already optimal)
---
### 4. Syncthing Rescan Intervals (2024-12-16)
Aggressive 60-second rescans were keeping TrueNAS VM at 86% CPU constantly.
**Changed**:
- Large folders: 60s → **3600s** (1 hour)
- Affected: downloads (38GB), documents (11GB), desktop (7.2GB), movies, pictures, notes, config
**Configuration**: Via Syncthing UI on each device
- Settings → Folders → [Folder Name] → Advanced → Rescan Interval
**Savings**: ~60-80W (TrueNAS CPU usage dropped from 86% to <10%)
**Trade-off**: Changes take up to 1 hour to detect instead of 1 minute
- Still acceptable for most use cases
- Manual rescan available if needed: `curl -X POST "http://localhost:8384/rest/db/scan?folder=FOLDER" -H "X-API-Key: API_KEY"`
---
### 5. ksmtuned Disabled (2024-12-16)
**ksmtuned** is the daemon that tunes KSM parameters. Even with KSM disabled, the tuning daemon was still running.
**Solution**: Stopped and disabled on both servers
```bash
systemctl stop ksmtuned
systemctl disable ksmtuned
systemctl mask ksmtuned # Prevent re-enabling
```
**Savings**: ~2-5W
---
### 6. HDD Spindown on PVE2 (2024-12-16)
**Problem**: `local-zfs2` pool (2x WD Red 6TB HDD) had only 768 KB used but drives spinning 24/7
**Solution**: Configure 30-minute spindown timeout
**Udev rule**: `/etc/udev/rules.d/69-hdd-spindown.rules`
```udev
# Spin down WD Red 6TB drives after 30 minutes idle
ACTION=="add|change", KERNEL=="sd[a-z]", ATTRS{model}=="WDC WD60EFRX-68L*", RUN+="/sbin/hdparm -S 241 /dev/%k"
```
**hdparm value**: 241 = 30 minutes
- Formula: values 1-240 mean `value * 5 seconds`; values 241-251 mean `(value - 240) * 30 minutes`
- So 241 = (241 - 240) * 30 = 30 minutes
**Apply rule**:
```bash
udevadm control --reload-rules
udevadm trigger
# Verify drives have spindown set
hdparm -I /dev/sda | grep -i standby
hdparm -I /dev/sdb | grep -i standby
```
**Check if drives are spun down**:
```bash
hdparm -C /dev/sda
# Output: drive state is: standby (spun down)
# or: drive state is: active/idle (spinning)
```
**Savings**: ~10-16W when spun down (8W per drive)
**Trade-off**: 5-10 second delay when accessing pool after spindown
---
## Potential Optimizations (Not Yet Applied)
### PCIe ASPM (Active State Power Management)
**Benefit**: Reduce power of idle PCIe devices
**Risk**: May cause stability issues with some devices
**Estimated savings**: 5-15W
**Test**:
```bash
# Check current ASPM state
lspci -vv | grep -i aspm
# Enable ASPM (test first)
# Add to kernel cmdline: pcie_aspm=force
# Edit /etc/default/grub:
GRUB_CMDLINE_LINUX_DEFAULT="quiet pcie_aspm=force"
# Update grub
update-grub
reboot
```
### NMI Watchdog Disable
**Benefit**: Reduce CPU wakeups
**Risk**: Harder to debug kernel hangs
**Estimated savings**: 1-3W
**Test**:
```bash
# Disable NMI watchdog
echo 0 > /proc/sys/kernel/nmi_watchdog
# Make permanent (add to kernel cmdline)
# Edit /etc/default/grub:
GRUB_CMDLINE_LINUX_DEFAULT="quiet nmi_watchdog=0"
update-grub
reboot
```
---
## Monitoring
### CPU Frequency
```bash
# Current frequency on all cores
ssh pve 'grep MHz /proc/cpuinfo | head -10'
# Governor
ssh pve 'cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor'
# Available governors
ssh pve 'cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors'
```
### CPU Temperature
```bash
# PVE
ssh pve 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE Tctl: $(($(cat $f)/1000))°C"; fi; done'
# PVE2
ssh pve2 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE2 Tctl: $(($(cat $f)/1000))°C"; fi; done'
```
**Healthy temps**: 70-80°C under load
**Warning**: >85°C
**Throttle**: 90°C (Tctl max for Threadripper PRO)
### GPU Power Draw
```bash
# If nvidia-smi installed in VM
ssh lmdev1 'nvidia-smi --query-gpu=name,power.draw,power.limit,pstate --format=csv'
# Sample output:
# name, power.draw [W], power.limit [W], pstate
# NVIDIA TITAN RTX, 2.50 W, 280.00 W, P8
```
### Power Consumption (UPS)
```bash
# Check UPS load percentage
ssh pve 'upsc cyberpower@localhost ups.load'
# Battery runtime (seconds)
ssh pve 'upsc cyberpower@localhost battery.runtime'
# Full UPS status
ssh pve 'upsc cyberpower@localhost'
```
See [UPS.md](UPS.md) for more UPS monitoring details.
### ZFS ARC Memory Usage
```bash
# PVE
ssh pve 'arc_summary | grep -A5 "ARC size"'
# TrueNAS
ssh truenas 'arc_summary | grep -A5 "ARC size"'
```
**ARC** (Adaptive Replacement Cache) uses RAM for ZFS caching. Adjust if needed:
```bash
# Limit ARC to 32 GB (example)
# Edit /etc/modprobe.d/zfs.conf:
options zfs zfs_arc_max=34359738368
# Apply (reboot required)
update-initramfs -u
reboot
```
---
## Troubleshooting
### CPU Not Downclocking
```bash
# Check current governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# Should be: powersave (PVE) or schedutil (PVE2)
# If not, systemd service may have failed
# Check service status
systemctl status cpu-powersave
# Manually set governor (temporary)
echo powersave | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
# Check frequency
grep MHz /proc/cpuinfo | head -5
```
### High Idle Power After Update
**Common causes**:
1. **KSM re-enabled** after Proxmox update
- Check: `cat /sys/kernel/mm/ksm/run`
- Fix: `echo 0 > /sys/kernel/mm/ksm/run && systemctl mask ksmtuned`
2. **CPU governor reset** to default
- Check: `cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor`
- Fix: `systemctl restart cpu-powersave`
3. **GPU stuck in high-performance mode**
- Check: `nvidia-smi --query-gpu=pstate --format=csv`
- Fix: Restart VM or power cycle GPU
### HDDs Won't Spin Down
```bash
# Check spindown setting
hdparm -I /dev/sda | grep -i standby
# Set spindown manually (temporary)
hdparm -S 241 /dev/sda
# Check if drive is idle (ZFS may keep it active)
zpool iostat -v 1 5 # Watch for activity
# Check what's accessing the drive
lsof | grep /mnt/pool
```
---
## Power Optimization Summary
| Optimization | Savings | Applied | Notes |
|--------------|---------|---------|-------|
| **KSMD disabled** | 60-80W | ✅ | Also reduces CPU temp significantly |
| **CPU governor** | 60-120W | ✅ | PVE: powersave+balance_power, PVE2: schedutil |
| **GPU power states** | 0W | ✅ | Already optimal (automatic) |
| **Syncthing rescans** | 60-80W | ✅ | Reduced TrueNAS CPU usage |
| **ksmtuned disabled** | 2-5W | ✅ | Minor but easy win |
| **HDD spindown** | 10-16W | ✅ | Only when drives idle |
| PCIe ASPM | 5-15W | ❌ | Not yet tested |
| NMI watchdog | 1-3W | ❌ | Not yet tested |
| **Total savings** | **~150-300W** | - | Significant reduction |
---
## Related Documentation
- [UPS.md](UPS.md) - UPS capacity and power monitoring
- [STORAGE.md](STORAGE.md) - HDD spindown configuration
- [VMS.md](VMS.md) - VM resource allocation
---
**Last Updated**: 2025-12-22

69
PULSE-SETUP.md Normal file
View File

@@ -0,0 +1,69 @@
# Add n8n and docker-host2 to Pulse Monitoring
Pulse automatically monitors based on Prometheus targets, but you can also add custom HTTP monitors.
## Quick Steps
1. Open **https://pulse.htsn.io** in your browser
2. Login if required
3. Click **"+ Add Monitor"** or **"New Monitor"**
---
## Monitor: n8n
| Field | Value |
|-------|-------|
| **Name** | n8n Workflow Automation |
| **URL** | https://n8n.htsn.io |
| **Check Interval** | 60 seconds |
| **Monitor Type** | HTTP/HTTPS |
| **Expected Status** | 200 |
| **Timeout** | 10 seconds |
| **Alert After** | 2 failed checks |
---
## Monitor: docker-host2
| Field | Value |
|-------|-------|
| **Name** | docker-host2 (node_exporter) |
| **URL** | http://10.10.10.207:9100/metrics |
| **Check Interval** | 60 seconds |
| **Monitor Type** | HTTP |
| **Expected Status** | 200 |
| **Expected Content** | `node_exporter` |
| **Timeout** | 5 seconds |
| **Alert After** | 2 failed checks |
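Before saving the monitor, you can confirm the URL and expected-content match from any machine on the network:
```bash
curl -s http://10.10.10.207:9100/metrics | grep node_exporter_build_info
# Should return the HELP/TYPE lines and the node_exporter_build_info metric itself
```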
---
## Optional: docker-host2 SSH
| Field | Value |
|-------|-------|
| **Name** | docker-host2 SSH |
| **Host** | 10.10.10.207 |
| **Port** | 22 |
| **Monitor Type** | TCP Port |
| **Check Interval** | 60 seconds |
| **Timeout** | 5 seconds |
---
## Verification
After adding monitors, you should see:
- ✅ Green status for both monitors
- Response time graphs
- Uptime percentage
- Alert history (should be empty)
Access Pulse dashboard: **https://pulse.htsn.io**
---
**Note:** Pulse may already be monitoring these services via Prometheus integration. Check existing monitors before adding duplicates.
**Last Updated:** 2025-12-27

102
QUICK-REF-WELCOME-HOME.md Normal file
View File

@@ -0,0 +1,102 @@
# Welcome Home Automation - Quick Reference
## Quick Test (Manual Trigger)
```bash
HA_TOKEN="eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiIwZThjZmJjMzVlNDA0NzYwOTMzMjg3MTQ5ZjkwOGU2NyIsImlhdCI6MTc2NTk5MjQ4OCwiZXhwIjoyMDgxMzUyNDg4fQ.r743tsb3E5NNlrwEEu9glkZdiI4j_3SKIT1n5PGUytY"
# Test the automation now (ignores conditions)
curl -X POST \
-H "Authorization: Bearer $HA_TOKEN" \
-H "Content-Type: application/json" \
-d '{"entity_id": "automation.welcome_home"}' \
"http://10.10.10.210:8123/api/services/automation/trigger"
```
## Current Configuration
**Lights that turn on:**
- Living Room (75%)
- Living Room Lamp (60%)
- Kitchen (80%)
**When:** After sunset (30 min early) OR before sunrise
**Trigger:** Entering home zone (100m radius)
## Quick Modifications
### Add Office Light
```bash
# Get current config
curl -s -H "Authorization: Bearer $HA_TOKEN" \
"http://10.10.10.210:8123/api/config/automation/config/welcome_home" > /tmp/welcome.json
# Edit /tmp/welcome.json and add to "actions" array:
# {
# "target": {"entity_id": "light.office"},
# "data": {"brightness_pct": 70},
# "action": "light.turn_on"
# }
# Update automation
curl -X POST \
-H "Authorization: Bearer $HA_TOKEN" \
-H "Content-Type: application/json" \
-d @/tmp/welcome.json \
"http://10.10.10.210:8123/api/config/automation/config/welcome_home"
```
### Change to Scene Instead
Replace all light actions with a single scene:
```json
{
"actions": [
{
"service": "scene.turn_on",
"target": {
"entity_id": "scene.living_room_relax"
}
}
]
}
```
## Status Check
```bash
# Check if automation is enabled
curl -s -H "Authorization: Bearer $HA_TOKEN" \
"http://10.10.10.210:8123/api/states/automation.welcome_home" | \
python3 -c "import json, sys; data=json.load(sys.stdin); print(f\"State: {data['state']}\"); print(f\"Last triggered: {data['attributes']['last_triggered']}\")"
# Check current location
curl -s -H "Authorization: Bearer $HA_TOKEN" \
"http://10.10.10.210:8123/api/states/person.hutson" | \
python3 -c "import json, sys; data=json.load(sys.stdin); print(f\"Location: {data['state']}\"); print(f\"GPS: {data['attributes']['latitude']}, {data['attributes']['longitude']}\"); print(f\"Accuracy: {data['attributes']['gps_accuracy']}m\")"
```
## Toggle On/Off
```bash
# Disable
curl -X POST -H "Authorization: Bearer $HA_TOKEN" \
-H "Content-Type: application/json" \
-d '{"entity_id": "automation.welcome_home"}' \
"http://10.10.10.210:8123/api/services/automation/turn_off"
# Enable
curl -X POST -H "Authorization: Bearer $HA_TOKEN" \
-H "Content-Type: application/json" \
-d '{"entity_id": "automation.welcome_home"}' \
"http://10.10.10.210:8123/api/services/automation/turn_on"
```
## Web UI
http://10.10.10.210:8123 → Settings → Automations & Scenes → "Welcome Home"
---
*Entity ID: automation.welcome_home*

151
README.md Normal file
View File

@@ -0,0 +1,151 @@
# Homelab Documentation
Documentation for Hutson's home infrastructure - two Proxmox servers running VMs and containers for home automation, media, development, and AI workloads.
## 🚀 Quick Start
**New to this homelab?** Start here:
1. [CLAUDE.md](CLAUDE.md) - Quick reference guide for common tasks
2. [SSH-ACCESS.md](SSH-ACCESS.md) - How to connect to all systems
3. [IP-ASSIGNMENTS.md](IP-ASSIGNMENTS.md) - What's at what IP address
4. [SERVICES.md](SERVICES.md) - What services are running
**Claude Code Session?** Read [CLAUDE.md](CLAUDE.md) first - it's your command center.
## 📚 Documentation Index
### Infrastructure
| Document | Description |
|----------|-------------|
| [GATEWAY.md](GATEWAY.md) | UniFi gateway monitoring, watchdog services, troubleshooting |
| [VMS.md](VMS.md) | Complete VM/LXC inventory, specs, GPU passthrough |
| [HARDWARE.md](HARDWARE.md) | Server specs, GPUs, network cards, HBAs |
| [STORAGE.md](STORAGE.md) | ZFS pools, NFS/SMB shares, capacity planning |
| [NETWORK.md](NETWORK.md) | Bridges, VLANs, MTU config, Tailscale VPN |
| [POWER-MANAGEMENT.md](POWER-MANAGEMENT.md) | CPU governors, GPU power states, optimizations |
| [UPS.md](UPS.md) | UPS configuration, NUT monitoring, power failure handling |
### Services & Applications
| Document | Description |
|----------|-------------|
| [SERVICES.md](SERVICES.md) | Complete service inventory with URLs and credentials |
| [TRAEFIK.md](TRAEFIK.md) | Reverse proxy setup, adding services, SSL certificates |
| [HOMEASSISTANT.md](HOMEASSISTANT.md) | Home Assistant API, automations, integrations |
| [PA-API.md](PA-API.md) | Personal Assistant API, MCP Bridge, Claude integration |
| [SYNCTHING.md](SYNCTHING.md) | File sync across all devices, API access, troubleshooting |
| [SALTBOX.md](#) | Media automation stack (Plex, *arr apps) (coming soon) |
### Access & Security
| Document | Description |
|----------|-------------|
| [SSH-ACCESS.md](SSH-ACCESS.md) | SSH keys, host aliases, password auth, QEMU agent |
| [IP-ASSIGNMENTS.md](IP-ASSIGNMENTS.md) | Complete IP address assignments for all devices |
| [SECURITY.md](#) | Firewall, access control, certificates (coming soon) |
### Operations
| Document | Description |
|----------|-------------|
| [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md) | 🚨 Backup strategy, disaster recovery (CRITICAL) |
| [MAINTENANCE.md](MAINTENANCE.md) | Regular procedures, update schedules, testing checklists |
| [MONITORING.md](MONITORING.md) | Health monitoring, alerts, dashboard recommendations |
| [DISASTER-RECOVERY.md](#) | Recovery procedures (coming soon) |
### Reference
| Document | Description |
|----------|-------------|
| [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md) | Storage enclosure SES commands, LCC troubleshooting |
| [SHELL-ALIASES.md](SHELL-ALIASES.md) | ZSH aliases for Claude Code sessions |
## 🖥️ System Overview
### Servers
- **PVE** (10.10.10.120) - Primary Proxmox server
- AMD Threadripper PRO 3975WX (32-core)
- 128 GB RAM
- NVIDIA Quadro P2000 + TITAN RTX
- **PVE2** (10.10.10.102) - Secondary Proxmox server
- AMD Threadripper PRO 3975WX (32-core)
- 128 GB RAM
- NVIDIA RTX A6000
### Key Services
| Service | Location | URL |
|---------|----------|-----|
| **Proxmox** | PVE | https://pve.htsn.io |
| **TrueNAS** | VM 100 | https://truenas.htsn.io |
| **Plex** | Saltbox VM | https://plex.htsn.io |
| **Home Assistant** | VM 110 | https://homeassistant.htsn.io |
| **Gitea** | VM 300 | https://git.htsn.io |
| **PA API** | docker-host2 | https://pa.htsn.io (Tailscale) |
| **Pi-hole** | CT 200 | http://10.10.10.10/admin |
| **Traefik** | CT 202 | http://10.10.10.250:8080 |
[See IP-ASSIGNMENTS.md for complete list](IP-ASSIGNMENTS.md)
## 🔥 Emergency Procedures
### Power Failure
1. UPS provides ~15 min runtime at typical load
2. At 2 min remaining, NUT triggers graceful VM shutdown
3. When power returns, servers auto-boot and start VMs in order
See [UPS.md](UPS.md) for details.
### Service Down
```bash
# Quick health check (run from Mac Mini)
ssh pve 'qm list' # Check VMs on PVE
ssh pve2 'qm list' # Check VMs on PVE2
ssh pve 'pct list' # Check containers
# Syncthing status
curl -s -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
"http://127.0.0.1:8384/rest/system/connections"
# Restart a VM
ssh pve 'qm stop VMID && qm start VMID'
```
See [CLAUDE.md](CLAUDE.md) for complete troubleshooting runbooks.
## 📞 Getting Help
**Claude Code Assistant**: Start a session in this directory - all context is available in CLAUDE.md
**Key Contacts**:
- Homelab Owner: Hutson
- Git Repo: https://git.htsn.io/hutson/homelab-docs
- Local Path: `~/Projects/homelab`
## 🔄 Recent Changes
See [CHANGELOG.md](#) (coming soon) or the Changelog section in [CLAUDE.md](CLAUDE.md).
## 📝 Contributing
When updating docs:
1. Keep CLAUDE.md as quick reference only
2. Move detailed content to specialized docs
3. Update cross-references
4. Test all commands before committing
5. Add entries to changelog
```bash
cd ~/Projects/homelab
git add -A
git commit -m "Update documentation: <description>"
git push
```
---
**Last Updated**: 2026-01-02

591
SERVICES.md Normal file
View File

@@ -0,0 +1,591 @@
# Services Inventory
Complete inventory of all services running across the homelab infrastructure.
## Overview
| Category | Services | Location | Access |
|----------|----------|----------|--------|
| **Infrastructure** | Proxmox, TrueNAS, Pi-hole, Traefik | VMs/CTs | Web UI + SSH |
| **Media** | Plex, *arr apps, downloaders | Saltbox VM | Web UI |
| **Development** | Gitea, Docker services | VMs | Web UI |
| **Home Automation** | Home Assistant, Happy Coder | VMs | Web UI + API |
| **Monitoring** | UPS (NUT), Syncthing, Pulse | Various | API |
**Total Services**: 25+ running services
---
## Service URLs Quick Reference
| Service | URL | Authentication | Purpose |
|---------|-----|----------------|---------|
| **Proxmox** | https://pve.htsn.io:8006 | Username + 2FA | VM management |
| **TrueNAS** | https://truenas.htsn.io | Username/password | NAS management |
| **Plex** | https://plex.htsn.io | Plex account | Media streaming |
| **Home Assistant** | https://homeassistant.htsn.io | Username/password | Home automation |
| **Gitea** | https://git.htsn.io | Username/password | Git repositories |
| **Excalidraw** | https://excalidraw.htsn.io | None (public) | Whiteboard |
| **Happy Coder** | https://happy.htsn.io | QR code auth | Remote Claude sessions |
| **Pi-hole** | http://10.10.10.10/admin | Password | DNS/ad blocking |
| **Traefik** | http://10.10.10.250:8080 | None (internal) | Reverse proxy dashboard |
| **Pulse** | https://pulse.htsn.io | Unknown | Monitoring dashboard |
| **Copyparty** | https://copyparty.htsn.io | Unknown | File sharing |
| **FindShyt** | https://findshyt.htsn.io | Unknown | Custom app |
---
## Infrastructure Services
### Proxmox VE (PVE & PVE2)
**Purpose**: Virtualization platform, VM/CT host
**Location**: Physical servers (10.10.10.120, 10.10.10.102)
**Access**: https://pve.htsn.io:8006, SSH
**Version**: Unknown (check: `pveversion`)
**Key Features**:
- Web-based management
- VM and LXC container support
- ZFS storage pools
- Clustering (2-node)
- API access
**Common Operations**:
```bash
# List VMs
ssh pve 'qm list'
# Create VM
ssh pve 'qm create VMID --name myvm ...'
# Backup VM
ssh pve 'vzdump VMID --dumpdir /var/lib/vz/dump'
```
**See**: [VMS.md](VMS.md)
---
### TrueNAS SCALE (VM 100)
**Purpose**: Central file storage, NFS/SMB shares
**Location**: VM on PVE (10.10.10.200)
**Access**: https://truenas.htsn.io, SSH
**Version**: TrueNAS SCALE (check version in UI)
**Key Features**:
- ZFS storage management
- NFS exports
- SMB shares
- Syncthing hub
- Snapshot management
**Storage Pools**:
- `vault`: Main data pool on EMC enclosure
**Shares** (needs documentation):
- NFS exports for Saltbox media
- SMB shares for Windows access
- Syncthing sync folders
**See**: [STORAGE.md](STORAGE.md)
---
### Pi-hole (CT 200)
**Purpose**: Network-wide DNS server and ad blocker
**Location**: LXC on PVE (10.10.10.10)
**Access**: http://10.10.10.10/admin
**Version**: Unknown
**Configuration**:
- **Upstream DNS**: Cloudflare (1.1.1.1)
- **Blocklists**: Unknown count
- **Queries**: All network DNS traffic
- **DHCP**: Disabled (router handles DHCP)
**Stats** (example):
```bash
ssh pihole 'pihole -c -e' # Stats
ssh pihole 'pihole status' # Status
```
**Common Tasks**:
- Update blocklists: `ssh pihole 'pihole -g'`
- Whitelist domain: `ssh pihole 'pihole -w example.com'`
- View logs: `ssh pihole 'pihole -t'`
---
### Traefik (CT 202)
**Purpose**: Reverse proxy for all public-facing services
**Location**: LXC on PVE (10.10.10.250)
**Access**: http://10.10.10.250:8080/dashboard/
**Version**: Unknown (check: `traefik version`)
**Managed Services**:
- All *.htsn.io domains (except Saltbox services)
- SSL/TLS certificates via Let's Encrypt
- HTTP → HTTPS redirects
**See**: [TRAEFIK.md](TRAEFIK.md) for complete configuration
---
## Media Services (Saltbox VM)
All media services run in Docker on the Saltbox VM (10.10.10.100).
### Plex Media Server
**Purpose**: Media streaming platform
**URL**: https://plex.htsn.io
**Access**: Plex account
**Features**:
- Hardware transcoding (TITAN RTX)
- Libraries: Movies, TV, Music
- Remote access enabled
- Managed by Saltbox
**Media Storage**:
- Source: TrueNAS NFS mounts
- Location: `/mnt/unionfs/`
**Common Tasks**:
```bash
# View Plex status
ssh saltbox 'docker logs -f plex'
# Restart Plex
ssh saltbox 'docker restart plex'
# Scan library
# (via Plex UI: Settings → Library → Scan)
```
---
### *arr Apps (Media Automation)
Running on Saltbox VM, managed via Traefik-Saltbox.
| Service | Purpose | URL | Notes |
|---------|---------|-----|-------|
| **Sonarr** | TV show automation | sonarr.htsn.io | Monitors, downloads, organizes TV |
| **Radarr** | Movie automation | radarr.htsn.io | Monitors, downloads, organizes movies |
| **Lidarr** | Music automation | lidarr.htsn.io | Monitors, downloads, organizes music |
| **Overseerr** | Request management | overseerr.htsn.io | User requests for media |
| **Bazarr** | Subtitle management | bazarr.htsn.io | Downloads subtitles |
**Downloaders**:
| Service | Purpose | URL |
|---------|---------|-----|
| **SABnzbd** | Usenet downloader | sabnzbd.htsn.io |
| **NZBGet** | Usenet downloader | nzbget.htsn.io |
| **qBittorrent** | Torrent client | qbittorrent.htsn.io |
**Indexers**:
| Service | Purpose | URL |
|---------|---------|-----|
| **Jackett** | Torrent indexer proxy | jackett.htsn.io |
| **NZBHydra2** | Usenet indexer proxy | nzbhydra2.htsn.io |
---
### Supporting Media Services
| Service | Purpose | URL |
|---------|---------|-----|
| **Tautulli** | Plex statistics | tautulli.htsn.io |
| **Organizr** | Service dashboard | organizr.htsn.io |
| **Authelia** | SSO authentication | auth.htsn.io |
---
## Development Services
### Gitea (VM 300)
**Purpose**: Self-hosted Git server
**Location**: VM on PVE2 (10.10.10.220)
**URL**: https://git.htsn.io
**Access**: Username/password
**Repositories**:
- homelab-docs (this documentation)
- Personal projects
- Private repos
**Common Tasks**:
```bash
# SSH to Gitea VM
ssh gitea-vm
# View logs
ssh gitea-vm 'journalctl -u gitea -f'
# Backup
ssh gitea-vm 'gitea dump -c /etc/gitea/app.ini'
```
**See**: Gitea documentation for API usage
---
### Docker Services (docker-host VM)
Running on VM 206 (10.10.10.206).
| Service | URL | Purpose | Port |
|---------|-----|---------|------|
| **Excalidraw** | https://excalidraw.htsn.io | Whiteboard/diagramming | 8080 |
| **Happy Server** | https://happy.htsn.io | Happy Coder relay | 3002 |
| **Pulse** | https://pulse.htsn.io | Monitoring dashboard | 7655 |
**Docker Compose files**: `/opt/{excalidraw,happy-server,pulse}/docker-compose.yml`
**Managing services**:
```bash
ssh docker-host 'docker ps'
ssh docker-host 'cd /opt/excalidraw && sudo docker-compose logs -f'
ssh docker-host 'cd /opt/excalidraw && sudo docker-compose restart'
```
---
## Home Automation
### Home Assistant (VM 110)
**Purpose**: Smart home automation platform
**Location**: VM on PVE (10.10.10.110)
**URL**: https://homeassistant.htsn.io
**Access**: Username/password
**Integrations**:
- UPS monitoring (NUT sensors)
- Unknown other integrations (needs documentation)
**Sensors**:
- `sensor.cyberpower_battery_charge`
- `sensor.cyberpower_load`
- `sensor.cyberpower_battery_runtime`
- `sensor.cyberpower_status`
**See**: [HOMEASSISTANT.md](HOMEASSISTANT.md)
---
### Happy Coder Relay (docker-host)
**Purpose**: Self-hosted relay server for Happy Coder mobile app
**Location**: docker-host (10.10.10.206)
**URL**: https://happy.htsn.io
**Access**: QR code authentication
**Stack**:
- Happy Server (Node.js)
- PostgreSQL (user/session data)
- Redis (real-time events)
- MinIO (file/image storage)
**Clients**:
- Mac Mini (Happy daemon)
- Mobile app (iOS/Android)
**Credentials**:
- Master Secret: `3ccbfd03a028d3c278da7d2cf36d99b94cd4b1fecabc49ab006e8e89bc7707ac`
- PostgreSQL: `happy` / `happypass`
- MinIO: `happyadmin` / `happyadmin123`
---
## File Sync & Storage
### Syncthing
**Purpose**: File synchronization across all devices
**Devices**:
- Mac Mini (10.10.10.125) - Hub
- MacBook - Mobile sync
- TrueNAS (10.10.10.200) - Central storage
- Windows PC (10.10.10.150) - Windows sync
- Phone (10.10.10.54) - Mobile sync
**API Keys**:
- Mac Mini: `oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5`
- MacBook: `qYkNdVLwy9qZZZ6MqnJr7tHX7KKdxGMJ`
- Phone: `Xxz3jDT4akUJe6psfwZsbZwG2LhfZuDM`
**Synced Folders**:
- documents (~11 GB)
- downloads (~38 GB)
- pictures
- notes
- desktop (~7.2 GB)
- config
- movies
**See**: [SYNCTHING.md](SYNCTHING.md)
---
### Copyparty (VM 201)
**Purpose**: Simple HTTP file sharing
**Location**: VM on PVE (10.10.10.201)
**URL**: https://copyparty.htsn.io
**Access**: Unknown
**Features**:
- Web-based file upload/download
- Lightweight
---
## Trading & AI Services
### AI Trading Platform (trading-vm)
**Purpose**: Algorithmic trading with AI models
**Location**: VM 301 on PVE2 (10.10.10.221)
**URL**: https://aitrade.htsn.io (if accessible)
**GPU**: RTX A6000 (48GB VRAM)
**Components**:
- Trading algorithms
- AI models for market prediction
- Real-time data feeds
- Backtesting infrastructure
**Access**: SSH only (no web UI documented)
---
### LM Dev (lmdev1)
**Purpose**: AI/LLM development environment
**Location**: VM 111 on PVE (10.10.10.111)
**URL**: https://lmdev.htsn.io (if accessible)
**GPU**: TITAN RTX (shared with Saltbox)
**Installed**:
- CUDA toolkit
- Python 3.11+
- PyTorch, TensorFlow
- Hugging Face transformers
---
## Monitoring & Utilities
### UPS Monitoring (NUT)
**Purpose**: Monitor UPS status and trigger shutdowns
**Location**: PVE (master), PVE2 (slave)
**Access**: Command-line (`upsc`)
**Key Commands**:
```bash
ssh pve 'upsc cyberpower@localhost'
ssh pve 'upsc cyberpower@localhost ups.load'
ssh pve 'upsc cyberpower@localhost battery.runtime'
```
**Home Assistant Integration**: UPS sensors exposed
**See**: [UPS.md](UPS.md)
---
### Pulse Monitoring
**Purpose**: Unknown monitoring dashboard
**Location**: docker-host (10.10.10.206:7655)
**URL**: https://pulse.htsn.io
**Access**: Unknown
**Needs documentation**:
- What does it monitor?
- How to configure?
- Authentication?
---
### Tailscale VPN
**Purpose**: Secure remote access to homelab
**Subnet Routers**:
- PVE (100.113.177.80) - Primary
- UCG-Fiber (100.94.246.32) - Failover
**Devices on Tailscale**:
- Mac Mini: 100.108.89.58
- PVE: 100.113.177.80
- TrueNAS: 100.100.94.71
- Pi-hole: 100.112.59.128
**See**: [NETWORK.md](NETWORK.md)
---
## Custom Applications
### FindShyt (CT 205)
**Purpose**: Unknown custom application
**Location**: LXC on PVE (10.10.10.8)
**URL**: https://findshyt.htsn.io
**Access**: Unknown
**Needs documentation**:
- What is this app?
- How to use it?
- Tech stack?
---
## Service Dependencies
### Critical Dependencies
```
TrueNAS
├── Plex (media files via NFS)
├── *arr apps (downloads via NFS)
├── Syncthing (central storage hub)
└── Backups (if configured)
Traefik (CT 202)
├── All *.htsn.io services
└── SSL certificate management
Pi-hole
└── DNS for entire network
Router
└── Gateway for all services
```
### Startup Order
**See [VMS.md](VMS.md)** for VM boot order configuration:
1. TrueNAS (storage first)
2. Saltbox (depends on TrueNAS NFS)
3. Other VMs
4. Containers
---
## Service Port Reference
### Well-Known Ports
| Port | Service | Protocol | Purpose |
|------|---------|----------|---------|
| 22 | SSH | TCP | Remote access |
| 53 | Pi-hole | UDP | DNS queries |
| 80 | Traefik | TCP | HTTP (redirects to 443) |
| 443 | Traefik | TCP | HTTPS |
| 3000 | Gitea | TCP | Git HTTP/S |
| 8006 | Proxmox | TCP | Web UI |
| 32400 | Plex | TCP | Plex Media Server |
| 8384 | Syncthing | TCP | Web UI |
| 22000 | Syncthing | TCP | Sync protocol |
### Internal Ports
| Port | Service | Purpose |
|------|---------|---------|
| 3002 | Happy Server | Relay backend |
| 5432 | PostgreSQL | Happy Server DB |
| 6379 | Redis | Happy Server cache |
| 7655 | Pulse | Monitoring |
| 8080 | Excalidraw | Whiteboard |
| 8080 | Traefik | Dashboard |
| 9000 | MinIO | Object storage |
---
## Service Health Checks
### Quick Health Check Script
```bash
#!/bin/bash
# Check all critical services
echo "=== Infrastructure ==="
curl -Is https://pve.htsn.io:8006 | head -1
curl -Is https://truenas.htsn.io | head -1
curl -I http://10.10.10.10/admin 2>/dev/null | head -1
echo ""
echo "=== Media Services ==="
curl -Is https://plex.htsn.io | head -1
curl -Is https://sonarr.htsn.io | head -1
curl -Is https://radarr.htsn.io | head -1
echo ""
echo "=== Development ==="
curl -Is https://git.htsn.io | head -1
curl -Is https://excalidraw.htsn.io | head -1
echo ""
echo "=== Home Automation ==="
curl -Is https://homeassistant.htsn.io | head -1
curl -Is https://happy.htsn.io/health | head -1
```
### Service-Specific Checks
```bash
# Proxmox VMs
ssh pve 'qm list | grep running'
# Docker services
ssh docker-host 'docker ps --format "{{.Names}}: {{.Status}}"'
# Syncthing
curl -H "X-API-Key: oSQSrPnMnrEXuHqjWrRdrvq3TSXesAT5" \
"http://127.0.0.1:8384/rest/system/status"
# UPS
ssh pve 'upsc cyberpower@localhost ups.status'
```
---
## Service Credentials
**Location**: See individual service documentation
| Service | Credentials Location | Notes |
|---------|---------------------|-------|
| Proxmox | Proxmox UI | Username + 2FA |
| TrueNAS | TrueNAS UI | Root password |
| Plex | Plex account | Managed externally |
| Gitea | Gitea DB | Self-managed |
| Pi-hole | `/etc/pihole/setupVars.conf` | Admin password |
| Happy Server | [CLAUDE.md](CLAUDE.md) | Master secret, DB passwords |
**⚠️ Security Note**: Never commit credentials to Git. Use proper secrets management.
---
## Related Documentation
- [VMS.md](VMS.md) - VM/service locations
- [TRAEFIK.md](TRAEFIK.md) - Reverse proxy config
- [IP-ASSIGNMENTS.md](IP-ASSIGNMENTS.md) - Service IP addresses
- [NETWORK.md](NETWORK.md) - Network configuration
- [MONITORING.md](MONITORING.md) - Monitoring setup (coming soon)
---
**Last Updated**: 2025-12-22
**Status**: ⚠️ Incomplete - many services need documentation (passwords, features, usage)

475
SSH-ACCESS.md Normal file
View File

@@ -0,0 +1,475 @@
# SSH Access
Documentation for SSH access to all homelab systems, including key authentication, password authentication for special cases, and QEMU guest agent usage.
## Overview
Most systems use **SSH key authentication** with the `~/.ssh/homelab` key. A few special cases require **password authentication** (router, Windows PC) due to platform limitations.
**SSH Password**: `GrilledCh33s3#` (for systems without key auth)
---
## SSH Key Authentication (Primary Method)
### SSH Key Configuration
SSH keys are configured in `~/.ssh/config` on both Mac Mini and MacBook.
**Key file**: `~/.ssh/homelab` (Ed25519 key)
**Key deployed to**: All Proxmox hosts, VMs, and LXCs (13 total hosts)
### Host Aliases
Use these convenient aliases instead of IP addresses:
| Host Alias | IP | User | Type | Notes |
|------------|-----|------|------|-------|
| `ucg-fiber` / `gateway` | 10.10.10.1 | root | UniFi Gateway | Router/firewall |
| `pve` | 10.10.10.120 | root | Proxmox | Primary server |
| `pve2` | 10.10.10.102 | root | Proxmox | Secondary server |
| `truenas` | 10.10.10.200 | root | VM | NAS/storage |
| `saltbox` | 10.10.10.100 | hutson | VM | Media automation |
| `lmdev1` | 10.10.10.111 | hutson | VM | AI/LLM development |
| `docker-host` | 10.10.10.206 | hutson | VM | Docker services (PVE) |
| `docker-host2` | 10.10.10.207 | hutson | VM | Docker services (PVE2) - MetaMCP, n8n |
| `fs-dev` | 10.10.10.5 | hutson | VM | Development |
| `copyparty` | 10.10.10.201 | hutson | VM | File sharing |
| `gitea-vm` | 10.10.10.220 | hutson | VM | Git server |
| `trading-vm` | 10.10.10.221 | hutson | VM | AI trading platform |
| `pihole` | 10.10.10.10 | root | LXC | DNS/Ad blocking |
| `traefik` | 10.10.10.250 | root | LXC | Reverse proxy |
| `findshyt` | 10.10.10.8 | root | LXC | Custom app |
### Usage Examples
```bash
# List VMs on PVE
ssh pve 'qm list'
# Check ZFS pool on TrueNAS
ssh truenas 'zpool status vault'
# List Docker containers on Saltbox
ssh saltbox 'docker ps'
# Check Pi-hole status
ssh pihole 'pihole status'
# View Traefik config
ssh pve 'pct exec 202 -- cat /etc/traefik/traefik.yaml'
```
### SSH Config File
**Location**: `~/.ssh/config`
**Example entries**:
```sshconfig
# Proxmox Servers
Host pve
HostName 10.10.10.120
User root
IdentityFile ~/.ssh/homelab
Host pve2
HostName 10.10.10.102
User root
IdentityFile ~/.ssh/homelab
# Post-quantum KEX causes MTU issues - use classic
KexAlgorithms curve25519-sha256
# VMs
Host truenas
HostName 10.10.10.200
User root
IdentityFile ~/.ssh/homelab
Host saltbox
HostName 10.10.10.100
User hutson
IdentityFile ~/.ssh/homelab
Host lmdev1
HostName 10.10.10.111
User hutson
IdentityFile ~/.ssh/homelab
Host docker-host
HostName 10.10.10.206
User hutson
IdentityFile ~/.ssh/homelab
Host docker-host2
HostName 10.10.10.207
User hutson
IdentityFile ~/.ssh/homelab
Host fs-dev
HostName 10.10.10.5
User hutson
IdentityFile ~/.ssh/homelab
Host copyparty
HostName 10.10.10.201
User hutson
IdentityFile ~/.ssh/homelab
Host gitea-vm
HostName 10.10.10.220
User hutson
IdentityFile ~/.ssh/homelab
Host trading-vm
HostName 10.10.10.221
User hutson
IdentityFile ~/.ssh/homelab
# LXC Containers
Host pihole
HostName 10.10.10.10
User root
IdentityFile ~/.ssh/homelab
Host traefik
HostName 10.10.10.250
User root
IdentityFile ~/.ssh/homelab
Host findshyt
HostName 10.10.10.8
User root
IdentityFile ~/.ssh/homelab
```
---
## Password Authentication (Special Cases)
Some systems don't support SSH key auth or have other limitations.
### UniFi Router (10.10.10.1) - NOW USES KEY AUTH
**Host alias**: `ucg-fiber` or `gateway`
**Status**: SSH key authentication now works (as of 2026-01-02)
**Commands**:
```bash
# Run command on router (using SSH key)
ssh ucg-fiber 'hostname'
# Get ARP table (all device IPs)
ssh ucg-fiber 'cat /proc/net/arp'
# Check Tailscale status
ssh ucg-fiber 'tailscale status'
# Check memory usage
ssh ucg-fiber 'free -m'
```
**Note**: Key may need to be re-deployed after firmware updates if UniFi clears authorized_keys.
### Windows PC (10.10.10.150)
**OS**: Windows with OpenSSH server
**User**: `claude`
**Password**: `GrilledCh33s3#`
**Shell**: PowerShell (not bash)
**Commands**:
```bash
# Run PowerShell command
sshpass -p 'GrilledCh33s3#' ssh claude@10.10.10.150 'Get-Process | Select -First 5'
# Check Syncthing status
sshpass -p 'GrilledCh33s3#' ssh claude@10.10.10.150 'Get-Process -Name syncthing -ErrorAction SilentlyContinue'
# Restart Syncthing
sshpass -p 'GrilledCh33s3#' ssh claude@10.10.10.150 'Stop-Process -Name syncthing -Force; Start-ScheduledTask -TaskName "Syncthing"'
```
**⚠️ Important**: Use `;` (semicolon) to chain PowerShell commands, NOT `&&` (bash syntax).
**Why not key auth?**: Could be configured, but password auth works and is simpler for Windows.
---
## QEMU Guest Agent
Most VMs have the QEMU guest agent installed, allowing command execution without SSH.
### VMs with QEMU Agent
| VMID | VM Name | Use Case |
|------|---------|----------|
| 100 | truenas | Execute commands, check ZFS |
| 101 | saltbox | Execute commands, Docker mgmt |
| 105 | fs-dev | Execute commands |
| 111 | lmdev1 | Execute commands |
| 201 | copyparty | Execute commands |
| 206 | docker-host | Execute commands |
| 300 | gitea-vm | Execute commands |
| 301 | trading-vm | Execute commands |
### VM WITHOUT QEMU Agent
**VMID 110 (homeassistant)**: No QEMU agent installed
- Access via web UI only
- Or install SSH server manually if needed
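**Check agent availability first** (quick sketch, assuming VMID 110):
```bash
# Ping the guest agent - fails with an error if the agent isn't installed/running
ssh pve 'qm agent 110 ping' && echo "agent OK" || echo "no QEMU agent"
```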
### Usage Examples
**Basic syntax**:
```bash
ssh pve 'qm guest exec VMID -- bash -c "COMMAND"'
```
**Examples**:
```bash
# Check ZFS pool on TrueNAS (without SSH)
ssh pve 'qm guest exec 100 -- bash -c "zpool status vault"'
# Get VM IP addresses
ssh pve 'qm guest exec 100 -- bash -c "ip addr"'
# Check Docker containers on Saltbox
ssh pve 'qm guest exec 101 -- bash -c "docker ps"'
# Run multi-line command
ssh pve 'qm guest exec 100 -- bash -c "df -h; free -h; uptime"'
```
**When to use QEMU agent vs SSH**:
- ✅ Use **SSH** for interactive sessions, file editing, complex tasks
- ✅ Use **QEMU agent** for one-off commands, when SSH is down, or VM has no network
- ⚠️ QEMU agent is slower for multiple commands (use SSH instead)
---
## Troubleshooting SSH Issues
### Connection Refused
```bash
# Check if SSH service is running
ssh pve 'systemctl status sshd'
# Check if port 22 is open
nc -zv 10.10.10.XXX 22
# Check firewall
ssh pve 'iptables -L -n | grep 22'
```
### Permission Denied (Public Key)
```bash
# Verify key file exists
ls -la ~/.ssh/homelab
# Check key permissions (should be 600)
chmod 600 ~/.ssh/homelab
# Test SSH key auth verbosely
ssh -vvv -i ~/.ssh/homelab root@10.10.10.120
# Check authorized_keys on remote (via QEMU agent if SSH broken)
ssh pve 'qm guest exec VMID -- bash -c "cat ~/.ssh/authorized_keys"'
```
### Slow SSH Connection (PVE2 Issue)
**Problem**: SSH to PVE2 hangs for 30+ seconds before connecting
**Cause**: MTU mismatch (vmbr0=9000, nic1=1500) causing post-quantum KEX packet fragmentation
**Fix**: Use classic KEX algorithm instead
**In `~/.ssh/config`**:
```sshconfig
Host pve2
    HostName 10.10.10.102
    User root
    IdentityFile ~/.ssh/homelab
    # Avoid mlkem768x25519-sha256 (post-quantum KEX fragments packets at MTU 1500)
    KexAlgorithms curve25519-sha256
```
**Permanent fix**: Set `nic1` MTU to 9000 in `/etc/network/interfaces` on PVE2
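**Sketch of the permanent fix** (assumes the interface really is named `nic1` in `/etc/network/interfaces` on PVE2 - verify the name before editing):
```bash
# Compare current MTUs on PVE2
ssh pve2 'ip link show | grep -E "vmbr0|nic1"'
# Apply MTU 9000 immediately (non-persistent)
ssh pve2 'ip link set nic1 mtu 9000'
# Persist it by adding "mtu 9000" to the nic1 stanza in /etc/network/interfaces
ssh pve2 'grep -A3 "iface nic1" /etc/network/interfaces'
```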
---
## Adding SSH Keys to New Systems
### Linux (VMs/LXCs)
```bash
# Copy public key to new host
ssh-copy-id -i ~/.ssh/homelab user@hostname
# Or manually:
ssh user@hostname 'mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys' < ~/.ssh/homelab.pub
ssh user@hostname 'chmod 700 ~/.ssh && chmod 600 ~/.ssh/authorized_keys'
```
### LXC Containers (Root User)
```bash
# Via pct exec from Proxmox host
ssh pve 'pct exec CTID -- bash -c "mkdir -p /root/.ssh"'
ssh pve 'pct exec CTID -- bash -c "echo \"$(cat ~/.ssh/homelab.pub)\" >> /root/.ssh/authorized_keys"'
ssh pve 'pct exec CTID -- bash -c "chmod 700 /root/.ssh && chmod 600 /root/.ssh/authorized_keys"'
# Also enable PermitRootLogin in sshd_config
ssh pve 'pct exec CTID -- bash -c "sed -i \"s/^#*PermitRootLogin.*/PermitRootLogin prohibit-password/\" /etc/ssh/sshd_config"'
ssh pve 'pct exec CTID -- bash -c "systemctl restart sshd"'
```
### VMs (via QEMU Agent)
```bash
# Add key via QEMU agent (if SSH not working)
ssh pve 'qm guest exec VMID -- bash -c "mkdir -p ~/.ssh"'
ssh pve 'qm guest exec VMID -- bash -c "echo \"$(cat ~/.ssh/homelab.pub)\" >> ~/.ssh/authorized_keys"'
ssh pve 'qm guest exec VMID -- bash -c "chmod 700 ~/.ssh && chmod 600 ~/.ssh/authorized_keys"'
```
---
## SSH Key Management
### Rotate SSH Keys (Future)
When rotating SSH keys:
1. Generate new key pair:
```bash
ssh-keygen -t ed25519 -f ~/.ssh/homelab-new -C "homelab-new"
```
2. Deploy new key to all hosts (keep old key for now):
```bash
for host in pve pve2 truenas saltbox lmdev1 docker-host fs-dev copyparty gitea-vm trading-vm pihole traefik findshyt; do
ssh-copy-id -i ~/.ssh/homelab-new $host
done
```
3. Update `~/.ssh/config` to use new key:
```sshconfig
IdentityFile ~/.ssh/homelab-new
```
4. Test all connections:
```bash
for host in pve pve2 truenas saltbox lmdev1 docker-host fs-dev copyparty gitea-vm trading-vm pihole traefik findshyt; do
echo "Testing $host..."
ssh $host 'hostname'
done
```
5. Remove old key from all hosts once confirmed working
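**Sketch for step 5** (identifies the old key by its key body; review each host's `authorized_keys` afterwards):
```bash
# Key body (second field of the .pub file) uniquely identifies the old key
OLD_KEY=$(awk '{print $2}' ~/.ssh/homelab.pub)
for host in pve pve2 truenas saltbox lmdev1 docker-host fs-dev copyparty gitea-vm trading-vm pihole traefik findshyt; do
  echo "Removing old key from $host..."
  # Connect with the new key and delete any authorized_keys line containing the old key body
  ssh -i ~/.ssh/homelab-new $host "sed -i '\|$OLD_KEY|d' ~/.ssh/authorized_keys"
done
```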
---
## Quick Reference
### Common SSH Operations
```bash
# Execute command on remote host
ssh host 'command'
# Execute multiple commands
ssh host 'command1 && command2'
# Copy file to remote
scp file host:/path/
# Copy file from remote
scp host:/path/file ./
# Execute command on Proxmox VM (via QEMU agent)
ssh pve 'qm guest exec VMID -- bash -c "command"'
# Execute command on LXC
ssh pve 'pct exec CTID -- command'
# Interactive shell
ssh host
# SSH with X11 forwarding
ssh -X host
```
### Troubleshooting Commands
```bash
# Test SSH with verbose output
ssh -vvv host
# Check SSH service status (remote)
ssh host 'systemctl status sshd'
# Check SSH config (local)
ssh -G host
# Test port connectivity
nc -zv hostname 22
```
---
## Security Best Practices
### Current Security Posture
✅ **Good**:
- SSH keys used instead of passwords (where possible)
- Keys use Ed25519 (modern, secure algorithm)
- Root login disabled on VMs (use sudo instead)
- SSH keys have proper permissions (600)
⚠️ **Could Improve**:
- [ ] Disable password authentication on all hosts (force key-only)
- [ ] Use SSH certificate authority instead of individual keys
- [ ] Set up SSH bastion host (jump server)
- [ ] Enable 2FA for SSH (via PAM + Google Authenticator)
- [ ] Implement SSH key rotation policy (annually)
### Hardening SSH (Future)
For additional security, consider:
```sshconfig
# /etc/ssh/sshd_config (on remote hosts)
PermitRootLogin prohibit-password # No root password login
PasswordAuthentication no # Disable password auth entirely
PubkeyAuthentication yes # Only allow key auth
AuthorizedKeysFile .ssh/authorized_keys
MaxAuthTries 3 # Limit auth attempts
MaxSessions 10 # Limit concurrent sessions
ClientAliveInterval 300 # Timeout idle sessions
ClientAliveCountMax 2 # Drop after 2 keepalives
```
**Apply after editing**:
```bash
systemctl restart sshd
```
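A safer variant is to syntax-check the file before restarting, so a typo can't lock out remote access (not part of the original setup, but cheap insurance):
```bash
# Only restart if sshd_config parses cleanly
sshd -t && systemctl restart sshd
```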
---
## Related Documentation
- [VMS.md](VMS.md) - Complete VM/CT inventory
- [NETWORK.md](NETWORK.md) - Network configuration
- [IP-ASSIGNMENTS.md](IP-ASSIGNMENTS.md) - IP addresses for all hosts
- [SECURITY.md](#) - Security policies (coming soon)
---
**Last Updated**: 2025-12-22

STORAGE.md Normal file

@@ -0,0 +1,510 @@
# Storage Architecture
Documentation of all storage pools, datasets, shares, and capacity planning across the homelab.
## Overview
### Storage Distribution
| Location | Type | Capacity | Purpose |
|----------|------|----------|---------|
| **PVE** | NVMe + SSD mirrors | ~9 TB usable | VM storage, fast IO |
| **PVE2** | NVMe + HDD mirrors | ~6+ TB usable | VM storage, bulk data |
| **TrueNAS** | ZFS pool + EMC enclosure | ~12+ TB usable | Central file storage, NFS/SMB |
---
## PVE (10.10.10.120) Storage Pools
### nvme-mirror1 (Primary Fast Storage)
- **Type**: ZFS mirror
- **Devices**: 2x Sabrent Rocket Q NVMe
- **Capacity**: 3.6 TB usable
- **Purpose**: High-performance VM storage
- **Used By**:
- Critical VMs requiring fast IO
- Database workloads
- Development environments
**Check status**:
```bash
ssh pve 'zpool status nvme-mirror1'
ssh pve 'zpool list nvme-mirror1'
```
### nvme-mirror2 (Secondary Fast Storage)
- **Type**: ZFS mirror
- **Devices**: 2x Kingston SFYRD 2TB NVMe
- **Capacity**: 1.8 TB usable
- **Purpose**: Additional fast VM storage
- **Used By**: TBD
**Check status**:
```bash
ssh pve 'zpool status nvme-mirror2'
ssh pve 'zpool list nvme-mirror2'
```
### rpool (Root Pool)
- **Type**: ZFS mirror
- **Devices**: 2x Samsung 870 QVO 4TB SSD
- **Capacity**: 3.6 TB usable
- **Purpose**: Proxmox OS, container storage, VM backups
- **Used By**:
- Proxmox root filesystem
- LXC containers
- Local VM backups
**Check status**:
```bash
ssh pve 'zpool status rpool'
ssh pve 'df -h /var/lib/vz'
```
### Storage Pool Usage Summary (PVE)
**Get current usage**:
```bash
ssh pve 'zpool list'
ssh pve 'pvesm status'
```
---
## PVE2 (10.10.10.102) Storage Pools
### nvme-mirror3 (Fast Storage)
- **Type**: ZFS mirror
- **Devices**: 2x NVMe (model unknown)
- **Capacity**: Unknown (needs investigation)
- **Purpose**: High-performance VM storage
- **Used By**: Trading VM (301), other VMs
**Check status**:
```bash
ssh pve2 'zpool status nvme-mirror3'
ssh pve2 'zpool list nvme-mirror3'
```
### local-zfs2 (Bulk Storage)
- **Type**: ZFS mirror
- **Devices**: 2x WD Red 6TB HDD
- **Capacity**: ~6 TB usable
- **Purpose**: Bulk/archival storage
- **Power Management**: 30-minute spindown configured
- Saves ~10-16W when idle
- Udev rule: `/etc/udev/rules.d/69-hdd-spindown.rules`
- Command: `hdparm -S 241` (30 min)
**Notes**:
- Pool had only 768 KB used as of 2024-12-16
- Drives configured to spin down after 30 min idle
- Good for archival, NOT for active workloads
**Check status**:
```bash
ssh pve2 'zpool status local-zfs2'
ssh pve2 'zpool list local-zfs2'
# Check if drives are spun down
ssh pve2 'hdparm -C /dev/sdX' # Shows active/standby
```
---
## TrueNAS (VM 100 @ 10.10.10.200) - Central Storage
### ZFS Pool: vault
**Primary storage pool** for all shared data.
**Devices**: ❓ Needs investigation
- EMC storage enclosure with multiple drives
- SAS connection via LSI SAS2308 HBA (passed through to VM)
**Capacity**: ❓ Needs investigation
**Check pool status**:
```bash
ssh truenas 'zpool status vault'
ssh truenas 'zpool list vault'
# Get detailed capacity
ssh truenas 'zfs list -o name,used,avail,refer,mountpoint'
```
### Datasets (Known)
Based on Syncthing configuration, likely datasets:
| Dataset | Purpose | Synced Devices | Notes |
|---------|---------|----------------|-------|
| vault/documents | Personal documents | Mac Mini, MacBook, Windows PC, Phone | ~11 GB |
| vault/downloads | Downloads folder | Mac Mini, TrueNAS | ~38 GB |
| vault/pictures | Photos | Mac Mini, MacBook, Phone | Unknown size |
| vault/notes | Note files | Mac Mini, MacBook, Phone | Unknown size |
| vault/desktop | Desktop sync | Unknown | 7.2 GB |
| vault/movies | Movie library | Unknown | Unknown size |
| vault/config | Config files | Mac Mini, MacBook | Unknown size |
**Get complete dataset list**:
```bash
ssh truenas 'zfs list -r vault'
```
### NFS/SMB Shares
**Status**: ❓ Not documented
**Needs investigation**:
```bash
# List NFS exports
ssh truenas 'showmount -e localhost'
# List SMB shares
ssh truenas 'smbclient -L localhost -N'
# Via TrueNAS API/UI
# Sharing → Unix Shares (NFS)
# Sharing → Windows Shares (SMB)
```
**Expected shares**:
- Media libraries for Plex (on Saltbox VM)
- Document storage
- VM backups?
- ISO storage?
### EMC Storage Enclosure
**Model**: EMC KTN-STL4 (or similar)
**Connection**: SAS via LSI SAS2308 HBA (passthrough to TrueNAS VM)
**Drives**: ❓ Unknown count and capacity
**See [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md)** for:
- SES commands
- Fan control
- LCC (Link Control Card) troubleshooting
- Maintenance procedures
**Check enclosure status**:
```bash
ssh truenas 'sg_ses --page=0x02 /dev/sgX' # Element descriptor
ssh truenas 'smartctl --scan' # List all drives
```
---
## Storage Network Architecture
### Internal Storage Network (10.10.20.0/24)
**Purpose**: Dedicated network for NFS/iSCSI traffic to reduce congestion on the main network.
**Bridge**: vmbr3 on PVE (virtual bridge, no physical NIC)
**Subnet**: 10.10.20.0/24
**DHCP**: No
**Gateway**: No (internal only, no internet)
**Connected VMs**:
- TrueNAS VM (secondary NIC)
- Saltbox VM (secondary NIC) - for NFS mounts
- Other VMs needing storage access
**Configuration**:
```bash
# On TrueNAS VM - check second NIC
ssh truenas 'ip addr show enp6s19'
# On Saltbox - check NFS mounts
ssh saltbox 'mount | grep nfs'
```
**Benefits**:
- Separates storage traffic from general network
- Prevents NFS/SMB from saturating main network
- Better performance for storage-heavy workloads
---
## Storage Capacity Planning
### Current Usage (Estimate)
**Needs actual audit**:
```bash
# PVE pools
ssh pve 'zpool list -o name,size,alloc,free'
# PVE2 pools
ssh pve2 'zpool list -o name,size,alloc,free'
# TrueNAS vault pool
ssh truenas 'zpool list vault'
# Get detailed breakdown
ssh truenas 'zfs list -r vault -o name,used,avail'
```
### Growth Rate
**Needs tracking** - recommend monthly snapshots of capacity:
```bash
#!/bin/bash
# Save as ~/bin/storage-capacity-report.sh
DATE=$(date +%Y-%m-%d)
REPORT=~/Backups/storage-reports/capacity-$DATE.txt
mkdir -p ~/Backups/storage-reports
echo "Storage Capacity Report - $DATE" > $REPORT
echo "================================" >> $REPORT
echo "" >> $REPORT
echo "PVE Pools:" >> $REPORT
ssh pve 'zpool list' >> $REPORT
echo "" >> $REPORT
echo "PVE2 Pools:" >> $REPORT
ssh pve2 'zpool list' >> $REPORT
echo "" >> $REPORT
echo "TrueNAS Pools:" >> $REPORT
ssh truenas 'zpool list' >> $REPORT
echo "" >> $REPORT
echo "TrueNAS Datasets:" >> $REPORT
ssh truenas 'zfs list -r vault -o name,used,avail' >> $REPORT
echo "Report saved to $REPORT"
```
**Run monthly via cron**:
```cron
0 9 1 * * ~/bin/storage-capacity-report.sh
```
### Expansion Planning
**When to expand**:
- Pool reaches 80% capacity
- Performance degrades
- New workloads require more space
**Expansion options**:
1. Add drives to existing pools (if mirrors, add mirror vdev)
2. Add new NVMe drives to PVE/PVE2
3. Expand EMC enclosure (add more drives)
4. Add second EMC enclosure
**Cost estimates**: TBD
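**Illustration of option 1** - adding a mirror vdev to an existing pool (device paths below are placeholders; use `/dev/disk/by-id/` names in practice):
```bash
# Hypothetical example: grow nvme-mirror1 by adding a second mirrored pair
ssh pve 'zpool add nvme-mirror1 mirror /dev/disk/by-id/nvme-NEW1 /dev/disk/by-id/nvme-NEW2'
# Confirm the new vdev and capacity
ssh pve 'zpool status nvme-mirror1 && zpool list nvme-mirror1'
```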
---
## ZFS Health Monitoring
### Daily Health Checks
```bash
# Check for errors on all pools
ssh pve 'zpool status -x' # Shows only unhealthy pools
ssh pve2 'zpool status -x'
ssh truenas 'zpool status -x'
# Check scrub status
ssh pve 'zpool status | grep scrub'
ssh pve2 'zpool status | grep scrub'
ssh truenas 'zpool status | grep scrub'
```
### Scrub Schedule
**Recommended**: Monthly scrub on all pools
**Configure scrub**:
```bash
# Via Proxmox UI: Node → Disks → ZFS → Select pool → Scrub
# Or via cron:
0 2 1 * * /sbin/zpool scrub nvme-mirror1
0 2 1 * * /sbin/zpool scrub rpool
```
**On TrueNAS**:
- Configure via UI: Storage → Pools → Scrub Tasks
- Recommended: 1st of every month at 2 AM
### SMART Monitoring
**Check drive health**:
```bash
# PVE
ssh pve 'smartctl -a /dev/nvme0'
ssh pve 'smartctl -a /dev/sda'
# TrueNAS
ssh truenas 'smartctl --scan'
ssh truenas 'smartctl -a /dev/sdX' # For each drive
```
**Configure SMART tests**:
- TrueNAS UI: Tasks → S.M.A.R.T. Tests
- Recommended: Weekly short test, monthly long test
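**Run a test ad hoc** (sketch; `/dev/sdX` is a placeholder, find real devices with `smartctl --scan`):
```bash
# Kick off a short self-test on one drive
ssh truenas 'smartctl -t short /dev/sdX'
# Short tests usually finish within a few minutes; then check the result
ssh truenas 'smartctl -a /dev/sdX | grep -i -A5 self-test'
```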
### Alerts
**Set up email alerts for**:
- ZFS pool errors
- SMART test failures
- Pool capacity > 80%
- Scrub failures
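**Minimal alert sketch** (assumes outbound mail via `mail`/`msmtp` is configured on the machine running it; the recipient and 80% threshold are placeholders):
```bash
#!/bin/bash
# zfs-alert.sh - email if any pool is unhealthy or above 80% capacity
ALERT_TO="you@example.com"   # placeholder recipient
for host in pve pve2 truenas; do
  # zpool status -x prints "all pools are healthy" when nothing is wrong
  STATUS=$(ssh "$host" 'zpool status -x')
  if [ "$STATUS" != "all pools are healthy" ]; then
    echo "$STATUS" | mail -s "ZFS health alert on $host" "$ALERT_TO"
  fi
  # Flag any pool over 80% full (capacity prints like "81%")
  FULL=$(ssh "$host" 'zpool list -H -o name,capacity' | awk '{gsub("%","",$2); if ($2+0 > 80) print}')
  if [ -n "$FULL" ]; then
    echo "$FULL" | mail -s "ZFS capacity warning on $host" "$ALERT_TO"
  fi
done
```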
---
## Storage Performance Tuning
### ZFS ARC (Cache)
**Check ARC usage**:
```bash
ssh pve 'arc_summary'
ssh truenas 'arc_summary'
```
**Tuning** (if needed):
- PVE/PVE2: Set max ARC in `/etc/modprobe.d/zfs.conf`
- TrueNAS: Configure via UI (System → Advanced → Tunables)
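**Example ARC cap on PVE** (the 16 GiB value is illustrative only, not a measured recommendation):
```bash
# Persist the cap (overwrites zfs.conf; append instead if it already has options)
ssh pve 'echo "options zfs zfs_arc_max=17179869184" > /etc/modprobe.d/zfs.conf'
# Apply immediately without a reboot
ssh pve 'echo 17179869184 > /sys/module/zfs/parameters/zfs_arc_max'
# On ZFS-root systems also refresh the initramfs so the setting survives reboots
ssh pve 'update-initramfs -u'
# Verify
ssh pve 'arc_summary | grep -i "max size"'
```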
### NFS Performance
**Mount options** (on clients like Saltbox):
```
rsize=131072,wsize=131072,hard,timeo=600,retrans=2,vers=3
```
**Verify NFS mounts**:
```bash
ssh saltbox 'mount | grep nfs'
```
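**Manual mount example** (hypothetical export path and storage-network IP; confirm the real export with `showmount -e` first):
```bash
# Mount a TrueNAS export on a client using the tuned options above
sudo mount -t nfs -o rsize=131072,wsize=131072,hard,timeo=600,retrans=2,vers=3 \
  10.10.20.200:/mnt/vault/media /mnt/media
```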
### Record Size Optimization
**Different workloads need different record sizes**:
- VMs: 64K (default, good for VMs)
- Databases: 8K or 16K
- Media files: 1M (large sequential reads)
**Set record size** (on TrueNAS datasets):
```bash
ssh truenas 'zfs set recordsize=1M vault/movies'
```
---
## Disaster Recovery
### Pool Recovery
**If a pool fails to import**:
```bash
# Try importing with different name
zpool import -f -N poolname newpoolname
# Check pool with readonly
zpool import -f -o readonly=on poolname
# Force import (last resort)
zpool import -f -F poolname
```
### Drive Replacement
**When a drive fails**:
```bash
# Identify failed drive
zpool status poolname
# Replace drive
zpool replace poolname old-device new-device
# Monitor resilver
watch zpool status poolname
```
### Data Recovery
**If pool is completely lost**:
1. Restore from offsite backup (see [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md))
2. Recreate pool structure
3. Restore data
**Critical**: This is why we need offsite backups!
---
## Quick Reference
### Common Commands
```bash
# Pool status
zpool status [poolname]
zpool list
# Dataset usage
zfs list
zfs list -r vault
# Check pool health (only unhealthy)
zpool status -x
# Scrub pool
zpool scrub poolname
# Get pool IO stats
zpool iostat -v 1
# Snapshot management
zfs snapshot poolname/dataset@snapname
zfs list -t snapshot
zfs rollback poolname/dataset@snapname
zfs destroy poolname/dataset@snapname
```
### Storage Locations by Use Case
| Use Case | Recommended Storage | Why |
|----------|---------------------|-----|
| VM OS disk | nvme-mirror1 (PVE) | Fastest IO |
| Database | nvme-mirror1/2 | Low latency |
| Media files | TrueNAS vault | Large capacity |
| Development | nvme-mirror2 | Fast, mid-tier |
| Containers | rpool | Good performance |
| Backups | TrueNAS or rpool | Large capacity |
| Archive | local-zfs2 (PVE2) | Cheap, can spin down |
---
## Investigation Needed
- [ ] Get complete TrueNAS dataset list
- [ ] Document NFS/SMB share configuration
- [ ] Inventory EMC enclosure drives (count, capacity, model)
- [ ] Document current pool usage percentages
- [ ] Set up monthly capacity reports
- [ ] Configure ZFS scrub schedules
- [ ] Set up storage health alerts
---
## Related Documentation
- [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md) - Backup and snapshot strategy
- [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md) - Storage enclosure maintenance
- [VMS.md](VMS.md) - VM storage assignments
- [NETWORK.md](NETWORK.md) - Storage network configuration
---
**Last Updated**: 2025-12-22


@@ -63,6 +63,20 @@ curl -sk "https://10.10.10.54:8384/rest/system/status" -H "X-API-Key: $API_KEY"
curl -sk "https://100.106.175.37:8384/rest/system/status" -H "X-API-Key: $API_KEY"
```
### TrueNAS (Docker Container)
```bash
API_KEY="LNWnrRmeyrw4dbngSmJMYN4a5Z2VnhSE"
# Access via Tailscale (port 20910, not 8384)
curl -s "http://100.100.94.71:20910/rest/system/status" -H "X-API-Key: $API_KEY"
# Or via local network
curl -s "http://10.10.10.200:20910/rest/system/status" -H "X-API-Key: $API_KEY"
```
**Note:** TrueNAS Syncthing runs in Docker with:
- Config: `/mnt/.ix-apps/app_mounts/syncthing/config`
- Data: `/mnt/vault/shares/syncthing` → mounted as `/data` in container
- Container name: `ix-syncthing-syncthing-1`
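**Quick container checks** (sketch, using the container name above; assumes root Docker CLI access on TrueNAS SCALE):
```bash
# Confirm the container is up and inspect recent logs
ssh truenas 'docker ps --filter name=ix-syncthing-syncthing-1'
ssh truenas 'docker logs --tail 20 ix-syncthing-syncthing-1'
# Restart it if the API stops responding
ssh truenas 'docker restart ix-syncthing-syncthing-1'
```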
## Common Commands
### Check Status

TAILSCALE.md Normal file

@@ -0,0 +1,296 @@
# Tailscale VPN Configuration
## Overview
Tailscale provides secure remote access to the homelab via a mesh VPN. This document covers the configuration, subnet routing, and critical gotchas learned from troubleshooting.
---
## Network Architecture
```
   Remote Clients (MacBook, Phone)
                │  Tailscale Mesh (100.x.x.x)
        ┌───────┴────────┐
        │                │
        ▼                ▼
PVE (Subnet Router)   UCG-Fiber (Gateway)
  100.113.177.80        100.94.246.32
        │                │
        │ 10.10.10.0/24  │
        └───────┬────────┘
          ┌─────┴─────┐
          │           │
       PiHole       TrueNAS
    10.10.10.10   10.10.10.200
```
---
## Device Configuration
| Device | Tailscale IP | Role | Accept Routes | Advertise Routes |
|--------|--------------|------|---------------|------------------|
| **PVE** | 100.113.177.80 | Subnet Router (Primary) | **NO** | 10.10.10.0/24, 10.10.20.0/24 |
| **UCG-Fiber** | 100.94.246.32 | Gateway (backup) | **NO** | (disabled) |
| **PiHole** | 100.112.59.128 | DNS Server | **NO** | None |
| **TrueNAS** | 100.100.94.71 | NAS | Yes | None |
| **Mac-Mini** | 100.108.89.58 | Desktop | Yes | None |
| **MacBook** | 100.88.161.1 | Laptop | Yes | None |
| **Phone** | 100.106.175.37 | Mobile | Yes | None |
---
## Critical Configuration Rules
### 1. Devices on the Advertised Subnet MUST Have `--accept-routes=false`
**Problem:** If a device is directly connected to 10.10.10.0/24 AND has `--accept-routes=true`, Tailscale will route local subnet traffic through the mesh instead of the local interface.
**Symptom:** Device can't reach neighbors on the same subnet; `ip route get 10.10.10.X` shows `dev tailscale0` instead of the local interface.
**Fix:**
```bash
# On any device directly connected to 10.10.10.0/24
tailscale set --accept-routes=false
```
**Affected devices:**
- UCG-Fiber (gateway) - directly on 10.10.10.0/24
- PiHole - directly on 10.10.10.0/24
- PVE - directly on 10.10.10.0/24 (but is the subnet router, so different)
### 2. Only ONE Device Should Be Primary Subnet Router
**Problem:** Multiple devices advertising the same subnet can cause routing conflicts or failover issues.
**Current Setup:**
- **PVE** is the primary subnet router for both 10.10.10.0/24 and 10.10.20.0/24
- **UCG-Fiber** has subnet advertisement DISABLED (was causing relay-only connections)
**To change subnet router:**
1. Go to https://login.tailscale.com/admin/machines
2. Disable route on old device, enable on new device
3. Or set primary if both advertise
### 3. VPNs on Tailscale Devices Can Break Connectivity
**Problem:** A full-tunnel VPN (like ProtonVPN with `AllowedIPs = 0.0.0.0/0`) will route Tailscale's DERP/STUN traffic through the VPN, breaking NAT traversal.
**Symptom:** Device shows relay-only connections with asymmetric traffic (high TX, near-zero RX).
**Fix:** Use split-tunnel configuration that excludes Tailscale traffic. See [PiHole ProtonVPN Configuration](#pihole-protonvpn-split-tunnel) below.
---
## DNS Configuration
### Tailscale Admin DNS Settings
- **Nameserver:** 10.10.10.10 (PiHole via subnet route)
- **Fallback:** None configured
### How DNS Works
1. Remote client enables "Use Tailscale DNS"
2. DNS queries go to 10.10.10.10
3. Traffic routes through PVE (subnet router) to PiHole
4. PiHole resolves via Unbound (recursive) through ProtonVPN
---
## Subnet Routing
### Current Primary Routes
```
PVE advertises:
- 10.10.10.0/24 (LAN)
- 10.10.20.0/24 (Storage network)
```
### Verifying Routes
```bash
# From MacBook - check who's advertising routes
tailscale status --json | python3 -c "
import sys, json
data = json.load(sys.stdin)
for peer in data.get('Peer', {}).values():
    routes = peer.get('PrimaryRoutes', [])
    if routes:
        print(f\"{peer.get('HostName')}: {routes}\")"
```
### Testing Subnet Connectivity
```bash
# Test from remote client
ping 10.10.10.10 # PiHole
ping 10.10.10.120 # PVE
ping 10.10.10.1 # Gateway
dig @10.10.10.10 google.com # DNS
```
---
## PiHole ProtonVPN Split-Tunnel
PiHole runs a WireGuard tunnel to ProtonVPN for encrypted upstream DNS queries. The configuration uses policy-based routing to ONLY route Unbound's DNS traffic through the VPN.
### Configuration File: `/etc/wireguard/piehole.conf`
```ini
[Interface]
PrivateKey = <key>
Address = 10.2.0.2/32
# CRITICAL: Disable automatic routing - we handle it manually
Table = off
# Policy routing: only route Unbound DNS through VPN
PostUp = ip route add default dev %i table 51820
PostUp = ip rule add fwmark 0x51820 table 51820 priority 100
PostUp = iptables -t mangle -N UNBOUND_VPN 2>/dev/null || true
PostUp = iptables -t mangle -F UNBOUND_VPN
PostUp = iptables -t mangle -A UNBOUND_VPN -d 10.0.0.0/8 -j RETURN
PostUp = iptables -t mangle -A UNBOUND_VPN -d 127.0.0.0/8 -j RETURN
PostUp = iptables -t mangle -A UNBOUND_VPN -d 100.64.0.0/10 -j RETURN
PostUp = iptables -t mangle -A UNBOUND_VPN -d 192.168.0.0/16 -j RETURN
PostUp = iptables -t mangle -A UNBOUND_VPN -d 172.16.0.0/12 -j RETURN
PostUp = iptables -t mangle -A UNBOUND_VPN -j MARK --set-mark 0x51820
PostUp = iptables -t mangle -A OUTPUT -p udp --dport 53 -m owner --uid-owner unbound -j UNBOUND_VPN
PostUp = iptables -t mangle -A OUTPUT -p tcp --dport 53 -m owner --uid-owner unbound -j UNBOUND_VPN
PostUp = iptables -t nat -A POSTROUTING -o %i -j MASQUERADE
PostDown = iptables -t mangle -D OUTPUT -p udp --dport 53 -m owner --uid-owner unbound -j UNBOUND_VPN
PostDown = iptables -t mangle -D OUTPUT -p tcp --dport 53 -m owner --uid-owner unbound -j UNBOUND_VPN
PostDown = iptables -t mangle -F UNBOUND_VPN
PostDown = iptables -t mangle -X UNBOUND_VPN
PostDown = ip rule del fwmark 0x51820 table 51820 priority 100
PostDown = ip route del default dev %i table 51820
PostDown = iptables -t nat -D POSTROUTING -o %i -j MASQUERADE
[Peer]
PublicKey = <ProtonVPN-key>
AllowedIPs = 0.0.0.0/0, ::/0
Endpoint = 149.102.242.1:51820
PersistentKeepalive = 25
```
**Key Points:**
- `Table = off` prevents wg-quick from adding default routes
- Only traffic from the `unbound` user to port 53 gets marked and routed through VPN
- Local, private, and Tailscale (100.64.0.0/10) traffic is excluded
---
## Troubleshooting
### Symptom: Can't reach subnet (10.10.10.x) from remote
**Check 1:** Is PVE online and advertising routes?
```bash
tailscale status | grep pve
# Should show "active" not "offline"
```
**Check 2:** Is PVE the primary subnet router?
```bash
tailscale status --json | python3 -c "..." # See above
```
**Check 3:** Can PVE reach the target on local network?
```bash
ssh pve 'ping -c 1 10.10.10.10'
```
### Symptom: Device shows "relay" with asymmetric traffic (high TX, low RX)
**Cause:** Usually a VPN or firewall blocking Tailscale's UDP traffic.
**Check:** Run netcheck on the affected device:
```bash
tailscale netcheck
```
Look for:
- Wrong external IP (indicates VPN routing issue)
- Missing DERP latencies
- `MappingVariesByDestIP: true` with no direct connections
### Symptom: Local devices can't reach each other
**Cause:** `--accept-routes=true` on a device that's directly on the subnet.
**Fix:**
```bash
# Check current setting
tailscale debug prefs | grep -i route
# Disable accept-routes
tailscale set --accept-routes=false
```
### Symptom: Gateway can ping Tailscale IPs but not local IPs
**Check routing:**
```bash
ip route get 10.10.10.120
# If it shows "dev tailscale0" instead of "dev br0", that's the problem
```
**Fix:** `tailscale set --accept-routes=false` on the gateway
---
## Maintenance Commands
### Restart Tailscale
```bash
# On Linux
systemctl restart tailscaled
# Check status
tailscale status
```
### Re-advertise Routes (PVE)
```bash
tailscale set --advertise-routes=10.10.10.0/24,10.10.20.0/24
```
### Check Connection Type
```bash
# Shows direct vs relay for each peer
tailscale status
# Detailed ping with path info
tailscale ping <tailscale-ip>
```
### Force Re-connection
```bash
tailscale down && tailscale up
```
---
## Known Issues
### UCG-Fiber Relay-Only Connections
The UniFi gateway sometimes fails to establish direct Tailscale connections, falling back to relay. This appears related to memory pressure or the gateway's NAT implementation. Current workaround: use PVE as the subnet router instead.
### Gateway Memory Pressure
The UCG-Fiber has limited RAM (~3GB) and can become unstable under load. The internet-watchdog service will auto-reboot if connectivity is lost. See [GATEWAY.md](GATEWAY.md).
---
## Change History
### 2026-01-05
- Switched subnet router from UCG-Fiber to PVE
- Fixed PiHole ProtonVPN from full-tunnel to split-tunnel (DNS-only)
- Disabled `--accept-routes` on UCG-Fiber and PiHole
- Documented critical configuration rules
---
**Last Updated:** 2026-01-05

TRAEFIK.md Normal file

@@ -0,0 +1,676 @@
# Traefik Reverse Proxy
Documentation for Traefik reverse proxy setup, SSL certificates, and deploying new public services.
## Overview
There are **TWO separate Traefik instances** handling different services. Understanding which one to use is critical.
| Instance | Location | IP | Purpose | Managed By |
|----------|----------|-----|---------|------------|
| **Traefik-Primary** | CT 202 | **10.10.10.250** | General services | Manual config files |
| **Traefik-Saltbox** | VM 101 (Docker) | **10.10.10.100** | Saltbox services only | Saltbox Ansible |
---
## ⚠️ CRITICAL RULE: Which Traefik to Use
### When Adding ANY New Service:
**USE Traefik-Primary (CT 202 @ 10.10.10.250)** - For ALL new services
**DO NOT touch Traefik-Saltbox** - Unless you're modifying Saltbox itself
### Why This Matters:
- **Traefik-Saltbox** has complex Saltbox-managed configs (Ansible-generated)
- Messing with it breaks Plex, Sonarr, Radarr, and all media services
- Each Traefik has its own Let's Encrypt certificates
- Mixing them causes certificate conflicts and routing issues
---
## Traefik-Primary (CT 202) - For New Services
### Configuration
**Location**: Container 202 on PVE (10.10.10.250)
**Config Directory**: `/etc/traefik/`
**Main Config**: `/etc/traefik/traefik.yaml`
**Dynamic Configs**: `/etc/traefik/conf.d/*.yaml`
### Access Traefik Config
```bash
# From Mac Mini:
ssh pve 'pct exec 202 -- cat /etc/traefik/traefik.yaml'
ssh pve 'pct exec 202 -- ls /etc/traefik/conf.d/'
# Edit a service config:
ssh pve 'pct exec 202 -- vi /etc/traefik/conf.d/myservice.yaml'
# View logs:
ssh pve 'pct exec 202 -- tail -f /var/log/traefik/traefik.log'
```
### Services Using Traefik-Primary
| Service | Domain | Backend |
|---------|--------|---------|
| Excalidraw | excalidraw.htsn.io | 10.10.10.206:8080 (docker-host) |
| FindShyt | findshyt.htsn.io | 10.10.10.205 (CT 205) |
| Gitea | git.htsn.io | 10.10.10.220:3000 |
| Home Assistant | homeassistant.htsn.io | 10.10.10.110 |
| LM Dev | lmdev.htsn.io | 10.10.10.111 |
| MetaMCP | metamcp.htsn.io | 10.10.10.207:12008 (docker-host2) |
| Pi-hole | pihole.htsn.io | 10.10.10.10 |
| TrueNAS | truenas.htsn.io | 10.10.10.200 |
| Proxmox | pve.htsn.io | 10.10.10.120 |
| Copyparty | copyparty.htsn.io | 10.10.10.201 |
| AI Trade | aitrade.htsn.io | (trading server) |
| Pulse | pulse.htsn.io | 10.10.10.206:7655 (monitoring) |
| Happy | happy.htsn.io | 10.10.10.206:3002 (Happy Coder relay) |
| BlueMap | map.htsn.io | 10.10.10.207:8100 (Minecraft web map, password protected) |
| Notes Redirect | notes.htsn.io | 10.10.10.207:8765 (HTTP→obsidian:// redirect) |
| Todo Redirect | todo.htsn.io | 10.10.10.207:8765 (HTTP→ticktick:// redirect) |
---
## Traefik-Saltbox (VM 101) - DO NOT MODIFY
### Configuration
**Location**: `/opt/traefik/` inside Saltbox VM
**Managed By**: Saltbox Ansible playbooks (automatic)
**Docker Mount**: `/opt/traefik` → `/etc/traefik` in container
### Services Using Traefik-Saltbox
- Plex (plex.htsn.io)
- Sonarr, Radarr, Lidarr
- SABnzbd, NZBGet, qBittorrent
- Overseerr, Tautulli, Organizr
- Jackett, NZBHydra2
- Authelia (SSO authentication)
- All other Saltbox-managed containers
### View Saltbox Traefik (Read-Only)
```bash
# View config (don't edit!)
ssh pve 'qm guest exec 101 -- bash -c "docker exec traefik cat /etc/traefik/traefik.yml"'
# View logs
ssh saltbox 'docker logs -f traefik'
```
**⚠️ WARNING**: Editing Saltbox Traefik configs manually will be overwritten by Ansible and may break media services.
---
## Adding a New Public Service - Complete Workflow
Follow these steps to deploy a new service and make it accessible at `servicename.htsn.io`.
### Step 0: Deploy Your Service
First, deploy your service on the appropriate host.
#### Option A: Docker on docker-host (10.10.10.206)
```bash
ssh hutson@10.10.10.206
sudo mkdir -p /opt/myservice
cat > /opt/myservice/docker-compose.yml << 'EOF'
version: "3.8"
services:
  myservice:
    image: myimage:latest
    ports:
      - "8080:80"
    restart: unless-stopped
EOF
cd /opt/myservice && sudo docker-compose up -d
```
#### Option B: New LXC Container on PVE
```bash
ssh pve 'pct create CTID local:vztmpl/ubuntu-22.04-standard_22.04-1_amd64.tar.zst \
--hostname myservice --memory 2048 --cores 2 \
--net0 name=eth0,bridge=vmbr0,ip=10.10.10.XXX/24,gw=10.10.10.1 \
--rootfs local-zfs:8 --unprivileged 1 --start 1'
```
#### Option C: New VM on PVE
```bash
ssh pve 'qm create VMID --name myservice --memory 2048 --cores 2 \
--net0 virtio,bridge=vmbr0 --scsihw virtio-scsi-pci'
```
### Step 1: Create Traefik Config File
Use this template for new services on **Traefik-Primary (CT 202)**:
#### Basic Template
```yaml
# /etc/traefik/conf.d/myservice.yaml
http:
  routers:
    # HTTPS router
    myservice-secure:
      entryPoints:
        - websecure
      rule: "Host(`myservice.htsn.io`)"
      service: myservice
      tls:
        certResolver: cloudflare  # Use 'cloudflare' for proxied domains, 'letsencrypt' for DNS-only
      priority: 50
    # HTTP → HTTPS redirect
    myservice-redirect:
      entryPoints:
        - web
      rule: "Host(`myservice.htsn.io`)"
      middlewares:
        - myservice-https-redirect
      service: myservice
      priority: 50
  services:
    myservice:
      loadBalancer:
        servers:
          - url: "http://10.10.10.XXX:PORT"
  middlewares:
    myservice-https-redirect:
      redirectScheme:
        scheme: https
        permanent: true
```
#### Deploy the Config
```bash
# Create file on CT 202
ssh pve 'pct exec 202 -- bash -c "cat > /etc/traefik/conf.d/myservice.yaml << '\''EOF'\''
<paste config here>
EOF"'
# Traefik auto-reloads (watches conf.d directory)
# Check logs:
ssh pve 'pct exec 202 -- tail -f /var/log/traefik/traefik.log'
```
### Step 2: Add Cloudflare DNS Entry
#### Cloudflare Credentials
| Field | Value |
|-------|-------|
| Email | cloudflare@htsn.io |
| API Key | 849ebefd163d2ccdec25e49b3e1b3fe2cdadc |
| Zone ID (htsn.io) | c0f5a80448c608af35d39aa820a5f3af |
| Public IP | 70.237.94.174 |
#### Method 1: Manual (Cloudflare Dashboard)
1. Go to https://dash.cloudflare.com/
2. Select `htsn.io` domain
3. DNS → Add Record
4. Type: `A`, Name: `myservice`, IPv4: `70.237.94.174`, Proxied: ☑️
#### Method 2: Automated (CLI)
Save this as `~/bin/add-cloudflare-dns.sh`:
```bash
#!/bin/bash
# Add DNS record to Cloudflare for htsn.io
SUBDOMAIN="$1"
CF_EMAIL="cloudflare@htsn.io"
CF_API_KEY="849ebefd163d2ccdec25e49b3e1b3fe2cdadc"
ZONE_ID="c0f5a80448c608af35d39aa820a5f3af"
PUBLIC_IP="70.237.94.174"
if [ -z "$SUBDOMAIN" ]; then
    echo "Usage: $0 <subdomain>"
    echo "Example: $0 myservice   # Creates myservice.htsn.io"
    exit 1
fi
curl -X POST "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records" \
-H "X-Auth-Email: $CF_EMAIL" \
-H "X-Auth-Key: $CF_API_KEY" \
-H "Content-Type: application/json" \
--data "{
\"type\":\"A\",
\"name\":\"$SUBDOMAIN\",
\"content\":\"$PUBLIC_IP\",
\"ttl\":1,
\"proxied\":true
}" | jq .
```
**Usage**:
```bash
chmod +x ~/bin/add-cloudflare-dns.sh
~/bin/add-cloudflare-dns.sh myservice # Creates myservice.htsn.io
```
### Step 3: Testing
```bash
# Check if DNS resolves
dig myservice.htsn.io
# Should return: 70.237.94.174 (or Cloudflare IPs if proxied)
# Test HTTP redirect
curl -I http://myservice.htsn.io
# Expected: 301 redirect to https://
# Test HTTPS
curl -I https://myservice.htsn.io
# Expected: 200 OK
# Check Traefik dashboard (if enabled)
# http://10.10.10.250:8080/dashboard/
```
### Step 4: Update Documentation
After deploying, update:
1. **IP-ASSIGNMENTS.md** - Add to Services & Reverse Proxy Mapping table
2. **This file (TRAEFIK.md)** - Add to "Services Using Traefik-Primary" list
3. **CLAUDE.md** - Update quick reference if needed
---
## SSL Certificates
Traefik has **two certificate resolvers** configured:
| Resolver | Use When | Challenge Type | Notes |
|----------|----------|----------------|-------|
| `letsencrypt` | Cloudflare DNS-only (gray cloud ☁️) | HTTP-01 | Requires port 80 reachable |
| `cloudflare` | Cloudflare Proxied (orange cloud 🟠) | DNS-01 | Works with Cloudflare proxy |
### ⚠️ Important: HTTP Challenge vs DNS Challenge
**If Cloudflare proxy is enabled** (orange cloud), HTTP challenge **FAILS** because Cloudflare redirects HTTP→HTTPS before the challenge reaches your server.
**Solution**: Use `cloudflare` resolver (DNS-01 challenge) instead.
### Certificate Resolver Configuration
**Cloudflare API credentials** are configured in `/etc/systemd/system/traefik.service`:
```ini
Environment="CF_API_EMAIL=cloudflare@htsn.io"
Environment="CF_API_KEY=849ebefd163d2ccdec25e49b3e1b3fe2cdadc"
```
### Certificate Storage
| Resolver | Storage File |
|----------|--------------|
| HTTP challenge (`letsencrypt`) | `/etc/traefik/acme.json` |
| DNS challenge (`cloudflare`) | `/etc/traefik/acme-cf.json` |
**Permissions**: Must be `600` (read/write owner only)
```bash
# Check permissions
ssh pve 'pct exec 202 -- ls -la /etc/traefik/acme*.json'
# Fix if needed
ssh pve 'pct exec 202 -- chmod 600 /etc/traefik/acme.json'
ssh pve 'pct exec 202 -- chmod 600 /etc/traefik/acme-cf.json'
```
### Certificate Renewal
- **Automatic** via Traefik
- Checks every 24 hours
- Renews 30 days before expiry
- No manual intervention needed
### Troubleshooting Certificates
#### Certificate Fails to Issue
```bash
# Check Traefik logs
ssh pve 'pct exec 202 -- tail -f /var/log/traefik/traefik.log | grep -i error'
# Verify Cloudflare API access
curl -X GET "https://api.cloudflare.com/client/v4/user/tokens/verify" \
-H "X-Auth-Email: cloudflare@htsn.io" \
-H "X-Auth-Key: 849ebefd163d2ccdec25e49b3e1b3fe2cdadc"
# Check acme.json permissions
ssh pve 'pct exec 202 -- ls -la /etc/traefik/acme*.json'
```
#### Force Certificate Renewal
```bash
# Delete certificate (Traefik will re-request)
ssh pve 'pct exec 202 -- rm /etc/traefik/acme-cf.json'
ssh pve 'pct exec 202 -- touch /etc/traefik/acme-cf.json'
ssh pve 'pct exec 202 -- chmod 600 /etc/traefik/acme-cf.json'
ssh pve 'pct exec 202 -- systemctl restart traefik'
# Watch logs
ssh pve 'pct exec 202 -- tail -f /var/log/traefik/traefik.log'
```
---
## Quick Deployment - One-Liner
For fast deployment, use this all-in-one command:
```bash
# === DEPLOY SERVICE (example: myservice on docker-host port 8080) ===
# 1. Create Traefik config
ssh pve 'pct exec 202 -- bash -c "cat > /etc/traefik/conf.d/myservice.yaml << EOF
http:
  routers:
    myservice-secure:
      entryPoints: [websecure]
      rule: Host(\\\`myservice.htsn.io\\\`)
      service: myservice
      tls: {certResolver: cloudflare}
  services:
    myservice:
      loadBalancer:
        servers:
          - url: http://10.10.10.206:8080
EOF"'
# 2. Add Cloudflare DNS
curl -s -X POST "https://api.cloudflare.com/client/v4/zones/c0f5a80448c608af35d39aa820a5f3af/dns_records" \
-H "X-Auth-Email: cloudflare@htsn.io" \
-H "X-Auth-Key: 849ebefd163d2ccdec25e49b3e1b3fe2cdadc" \
-H "Content-Type: application/json" \
--data '{"type":"A","name":"myservice","content":"70.237.94.174","proxied":true}'
# 3. Test (wait a few seconds for DNS propagation)
curl -I https://myservice.htsn.io
```
---
## Docker Service with Traefik Labels (Alternative)
If deploying a service via Docker on `docker-host` (VM 206), you can use Traefik labels instead of config files.
**Requirements**:
- Traefik must have access to Docker socket
- Service must be on same Docker network as Traefik
**Example docker-compose.yml**:
```yaml
version: "3.8"
services:
  myservice:
    image: myimage:latest
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.myservice.rule=Host(`myservice.htsn.io`)"
      - "traefik.http.routers.myservice.entrypoints=websecure"
      - "traefik.http.routers.myservice.tls.certresolver=letsencrypt"
      - "traefik.http.services.myservice.loadbalancer.server.port=8080"
    networks:
      - traefik

networks:
  traefik:
    external: true
```
**Note**: This method is NOT currently used on Traefik-Primary (CT 202), as it doesn't have Docker socket access. Config files are preferred.
---
## Cloudflare API Reference
### API Credentials
| Field | Value |
|-------|-------|
| Email | cloudflare@htsn.io |
| API Key | 849ebefd163d2ccdec25e49b3e1b3fe2cdadc |
| Zone ID | c0f5a80448c608af35d39aa820a5f3af |
### Common API Operations
Set credentials:
```bash
CF_EMAIL="cloudflare@htsn.io"
CF_API_KEY="849ebefd163d2ccdec25e49b3e1b3fe2cdadc"
ZONE_ID="c0f5a80448c608af35d39aa820a5f3af"
```
**List all DNS records**:
```bash
curl -X GET "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records" \
-H "X-Auth-Email: $CF_EMAIL" \
-H "X-Auth-Key: $CF_API_KEY" | jq
```
**Add A record**:
```bash
curl -X POST "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records" \
-H "X-Auth-Email: $CF_EMAIL" \
-H "X-Auth-Key: $CF_API_KEY" \
-H "Content-Type: application/json" \
--data '{
"type":"A",
"name":"subdomain",
"content":"70.237.94.174",
"proxied":true
}'
```
**Delete record**:
```bash
curl -X DELETE "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records/$RECORD_ID" \
-H "X-Auth-Email: $CF_EMAIL" \
-H "X-Auth-Key: $CF_API_KEY"
```
**Update record** (toggle proxy):
```bash
curl -X PATCH "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records/$RECORD_ID" \
-H "X-Auth-Email: $CF_EMAIL" \
-H "X-Auth-Key: $CF_API_KEY" \
-H "Content-Type: application/json" \
--data '{"proxied":false}'
```
---
## Troubleshooting
### Service Not Accessible
```bash
# 1. Check if DNS resolves
dig myservice.htsn.io
# 2. Check if backend is reachable
curl -I http://10.10.10.XXX:PORT
# 3. Check Traefik logs
ssh pve 'pct exec 202 -- tail -f /var/log/traefik/traefik.log'
# 4. Check Traefik config is valid
ssh pve 'pct exec 202 -- cat /etc/traefik/conf.d/myservice.yaml'
# 5. Restart Traefik (if needed)
ssh pve 'pct exec 202 -- systemctl restart traefik'
```
### Certificate Issues
```bash
# Check certificate status in acme.json
ssh pve 'pct exec 202 -- cat /etc/traefik/acme-cf.json | jq'
# Check certificate expiry
echo | openssl s_client -servername myservice.htsn.io -connect myservice.htsn.io:443 2>/dev/null | openssl x509 -noout -dates
```
### 502 Bad Gateway
**Cause**: Backend service is down or unreachable
```bash
# Check if backend is running
ssh backend-host 'systemctl status myservice'
# Check if port is open
nc -zv 10.10.10.XXX PORT
# Check firewall
ssh backend-host 'iptables -L -n | grep PORT'
```
### 404 Not Found
**Cause**: Traefik can't match the request to a router
```bash
# Check router rule matches domain
ssh pve 'pct exec 202 -- cat /etc/traefik/conf.d/myservice.yaml | grep rule'
# Should be: rule: "Host(`myservice.htsn.io`)"
# Check DNS is pointing to correct IP
dig myservice.htsn.io
# Restart Traefik to reload config
ssh pve 'pct exec 202 -- systemctl restart traefik'
```
---
## Advanced Configuration Examples
### WebSocket Support
For services that use WebSockets (like Home Assistant):
```yaml
http:
  routers:
    myservice-secure:
      entryPoints:
        - websecure
      rule: "Host(`myservice.htsn.io`)"
      service: myservice
      tls:
        certResolver: cloudflare
  services:
    myservice:
      loadBalancer:
        servers:
          - url: "http://10.10.10.XXX:PORT"
# No special config needed - WebSockets work by default in Traefik v2+
```
### Custom Headers
Add custom headers (e.g., security headers):
```yaml
http:
  routers:
    myservice-secure:
      middlewares:
        - myservice-headers
  middlewares:
    myservice-headers:
      headers:
        customResponseHeaders:
          X-Frame-Options: "DENY"
          X-Content-Type-Options: "nosniff"
          Referrer-Policy: "strict-origin-when-cross-origin"
```
### Basic Authentication
Protect a service with basic auth:
```yaml
http:
  routers:
    myservice-secure:
      middlewares:
        - myservice-auth
  middlewares:
    myservice-auth:
      basicAuth:
        users:
          - "user:$apr1$..."  # Generate with: htpasswd -nb user password
```
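**Generating the hash** (sketch; username and password are placeholders - in this file-provider config the `$` signs do not need doubling, unlike docker-compose labels):
```bash
# Produce a user:hash pair suitable for the basicAuth users list
htpasswd -nb admin 'S0mePassw0rd!'
# Output looks like: admin:$apr1$...
```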
---
## Maintenance
### Monthly Checks
```bash
# Check Traefik status
ssh pve 'pct exec 202 -- systemctl status traefik'
# Review logs for errors
ssh pve 'pct exec 202 -- grep -i error /var/log/traefik/traefik.log | tail -20'
# Check certificate expiry dates
ssh pve 'pct exec 202 -- cat /etc/traefik/acme-cf.json | jq ".cloudflare.Certificates[] | {domain: .domain.main, expiry: .certificate}"'
# Verify all services responding
for domain in plex.htsn.io git.htsn.io truenas.htsn.io; do
echo "Testing $domain..."
curl -sI https://$domain | head -1
done
```
### Backup Traefik Config
```bash
# Backup all configs
ssh pve 'pct exec 202 -- tar czf /tmp/traefik-backup-$(date +%Y%m%d).tar.gz /etc/traefik'
# Copy to safe location
scp "pve:/var/lib/lxc/202/rootfs/tmp/traefik-backup-*.tar.gz" ~/Backups/traefik/
```
---
## Related Documentation
- [IP-ASSIGNMENTS.md](IP-ASSIGNMENTS.md) - Service IP addresses
- [CLOUDFLARE.md](#) - Cloudflare DNS management (coming soon)
- [SERVICES.md](#) - Complete service inventory (coming soon)
---
**Last Updated**: 2025-12-22

UPS.md Normal file

@@ -0,0 +1,605 @@
# UPS and Power Management
Documentation for UPS (Uninterruptible Power Supply) configuration, NUT (Network UPS Tools) monitoring, and power failure procedures.
## Hardware
### Current UPS
| Specification | Value |
|---------------|-------|
| **Model** | CyberPower OR2200PFCRT2U |
| **Capacity** | 2200VA / 1320W |
| **Form Factor** | 2U rackmount |
| **Output** | PFC Sinewave (compatible with active PFC PSUs) |
| **Outlets** | 2x NEMA 5-20R + 6x NEMA 5-15R (all battery + surge) |
| **Input Plug** | ⚠️ Originally NEMA 5-20P (20A), **rewired to 5-15P (15A)** |
| **Runtime** | ~15-20 min at typical load (~33% / 440W) |
| **Installed** | 2025-12-21 |
| **Status** | Active |
### ⚠️ Temporary Wiring Modification
**Issue**: UPS came with NEMA 5-20P plug (20A) but server rack is on 15A circuit
**Solution**: Temporarily rewired plug from 5-20P → 5-15P for compatibility
**Risk**: UPS can output 1320W but circuit limited to 1800W max (15A × 120V)
**Current draw**: ~1000-1350W total (safe margin)
**Backlog**: Upgrade to 20A circuit, restore original 5-20P plug
### Previous UPS
| Model | Capacity | Issue | Replaced |
|-------|----------|-------|----------|
| WattBox WB-1100-IPVMB-6 | 1100VA / 660W | Insufficient for dual Threadripper setup | 2025-12-21 |
**Why replaced**: Combined server load of 1000-1350W exceeded 660W capacity.
---
## Power Draw Estimates
### Typical Load
| Component | Idle | Load | Notes |
|-----------|------|------|-------|
| PVE Server | 250-350W | 500-750W | CPU + TITAN RTX + P2000 + storage |
| PVE2 Server | 200-300W | 450-600W | CPU + RTX A6000 + storage |
| Network gear | ~50W | ~50W | Router, switches |
| **Total** | **500-700W** | **1000-1400W** | Varies by workload |
**UPS Load**: ~33-50% typical, 70-80% under heavy load
### Runtime Calculation
At **440W load** (33%): ~15-20 min runtime (tested 2025-12-21)
At **660W load** (50%): ~10-12 min estimated
At **1000W load** (75%): ~6-8 min estimated
**NUT shutdown trigger**: 120 seconds (2 min) remaining runtime
---
## NUT (Network UPS Tools) Configuration
### Architecture
```
UPS (USB) ──> PVE (NUT Server/Master) ──> PVE2 (NUT Client/Slave)
                  └──> Home Assistant (monitoring only)
```
**Master**: PVE (10.10.10.120) - UPS connected via USB, runs NUT server
**Slave**: PVE2 (10.10.10.102) - Monitors PVE's NUT server, shuts down when triggered
### NUT Server Configuration (PVE)
#### 1. UPS Driver Config: `/etc/nut/ups.conf`
```ini
[cyberpower]
driver = usbhid-ups
port = auto
desc = "CyberPower OR2200PFCRT2U"
override.battery.charge.low = 20
override.battery.runtime.low = 120
```
**Key settings**:
- `driver = usbhid-ups`: USB HID UPS driver (generic for CyberPower)
- `port = auto`: Auto-detect USB device
- `override.battery.runtime.low = 120`: Trigger shutdown at 120 seconds (2 min) remaining
#### 2. NUT Server Config: `/etc/nut/upsd.conf`
```ini
LISTEN 127.0.0.1 3493
LISTEN 10.10.10.120 3493
```
**Listens on**:
- Localhost (for local monitoring)
- LAN IP (for PVE2 to connect)
#### 3. User Config: `/etc/nut/upsd.users`
```ini
[admin]
password = upsadmin123
actions = SET
instcmds = ALL
[upsmon]
password = upsmon123
upsmon master
```
**Users**:
- `admin`: Full control, can run commands
- `upsmon`: Monitoring only (used by PVE2)
#### 4. Monitor Config: `/etc/nut/upsmon.conf`
```ini
MONITOR cyberpower@localhost 1 upsmon upsmon123 master
MINSUPPLIES 1
SHUTDOWNCMD "/usr/local/bin/ups-shutdown.sh"
NOTIFYCMD /usr/sbin/upssched
POLLFREQ 5
POLLFREQALERT 5
HOSTSYNC 15
DEADTIME 15
POWERDOWNFLAG /etc/killpower
NOTIFYMSG ONLINE "UPS %s on line power"
NOTIFYMSG ONBATT "UPS %s on battery"
NOTIFYMSG LOWBATT "UPS %s battery is low"
NOTIFYMSG FSD "UPS %s: forced shutdown in progress"
NOTIFYMSG COMMOK "Communications with UPS %s established"
NOTIFYMSG COMMBAD "Communications with UPS %s lost"
NOTIFYMSG SHUTDOWN "Auto logout and shutdown proceeding"
NOTIFYMSG REPLBATT "UPS %s battery needs to be replaced"
NOTIFYMSG NOCOMM "UPS %s is unavailable"
NOTIFYMSG NOPARENT "upsmon parent process died - shutdown impossible"
NOTIFYFLAG ONLINE SYSLOG+WALL
NOTIFYFLAG ONBATT SYSLOG+WALL
NOTIFYFLAG LOWBATT SYSLOG+WALL
NOTIFYFLAG FSD SYSLOG+WALL
NOTIFYFLAG COMMOK SYSLOG+WALL
NOTIFYFLAG COMMBAD SYSLOG+WALL
NOTIFYFLAG SHUTDOWN SYSLOG+WALL
NOTIFYFLAG REPLBATT SYSLOG+WALL
NOTIFYFLAG NOCOMM SYSLOG+WALL
NOTIFYFLAG NOPARENT SYSLOG
```
**Key settings**:
- `MONITOR cyberpower@localhost 1 upsmon upsmon123 master`: Monitor local UPS
- `SHUTDOWNCMD "/usr/local/bin/ups-shutdown.sh"`: Custom shutdown script
- `POLLFREQ 5`: Check UPS every 5 seconds
#### 5. USB Permissions: `/etc/udev/rules.d/99-nut-ups.rules`
```udev
SUBSYSTEM=="usb", ATTR{idVendor}=="0764", ATTR{idProduct}=="0501", MODE="0660", GROUP="nut"
```
**Purpose**: Ensure NUT can access USB UPS device
**Apply rule**:
```bash
udevadm control --reload-rules
udevadm trigger
```
### NUT Client Configuration (PVE2)
#### Monitor Config: `/etc/nut/upsmon.conf`
```ini
MONITOR cyberpower@10.10.10.120 1 upsmon upsmon123 slave
MINSUPPLIES 1
SHUTDOWNCMD "/usr/local/bin/ups-shutdown.sh"
POLLFREQ 5
POLLFREQALERT 5
HOSTSYNC 15
DEADTIME 15
POWERDOWNFLAG /etc/killpower
# Same NOTIFYMSG and NOTIFYFLAG as PVE
```
**Key difference**: `slave` instead of `master` - monitors remote UPS on PVE
---
## Custom Shutdown Script
### `/usr/local/bin/ups-shutdown.sh` (Same on both PVE and PVE2)
```bash
#!/bin/bash
# Graceful VM/CT shutdown when UPS battery low
LOG="/var/log/ups-shutdown.log"
log() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') - $1" | tee -a "$LOG"
}

log "=== UPS Shutdown Triggered ==="
log "Battery low - initiating graceful shutdown of VMs/CTs"

# Get list of running VMs (skip TrueNAS for now)
VMS=$(qm list | awk '$3=="running" && $1!=100 {print $1}')
for VMID in $VMS; do
    log "Stopping VM $VMID..."
    qm shutdown $VMID
done

# Get list of running containers
CTS=$(pct list | awk '$2=="running" {print $1}')
for CTID in $CTS; do
    log "Stopping CT $CTID..."
    pct shutdown $CTID
done

# Wait for VMs/CTs to stop
log "Waiting 60 seconds for VMs/CTs to shut down..."
sleep 60

# Now stop TrueNAS (storage - must be last)
if qm status 100 | grep -q running; then
    log "Stopping TrueNAS (VM 100) last..."
    qm shutdown 100
    sleep 30
fi
log "All VMs/CTs stopped. Host will remain running until UPS dies."
log "=== UPS Shutdown Complete ==="
```
**Make executable**:
```bash
chmod +x /usr/local/bin/ups-shutdown.sh
```
**Script behavior**:
1. Stops all VMs (except TrueNAS)
2. Stops all containers
3. Waits 60 seconds
4. Stops TrueNAS last (storage must be cleanly unmounted)
5. **Does NOT shut down Proxmox hosts** - intentionally left running
**Why not shut down hosts?**
- BIOS configured to "Restore on AC Power Loss"
- When power returns, servers auto-boot and start VMs in order
- Avoids need for manual intervention
---
## Power Failure Behavior
### When Power Fails
1. **UPS switches to battery** (`OB DISCHRG` status)
2. **NUT monitors runtime** - polls every 5 seconds
3. **At 120 seconds (2 min) remaining**:
- NUT triggers `/usr/local/bin/ups-shutdown.sh` on both servers
- Script gracefully stops all VMs/CTs
- TrueNAS stopped last (storage integrity)
4. **Hosts remain running** until UPS battery depletes
5. **UPS battery dies** → Hosts lose power (ungraceful but safe - VMs already stopped)
### When Power Returns
1. **UPS charges battery**, power returns to servers
2. **BIOS "Restore on AC Power Loss"** boots both servers
3. **Proxmox starts** and auto-starts VMs in configured order:
| Order | Wait | VMs/CTs | Reason |
|-------|------|---------|--------|
| 1 | 30s | TrueNAS (VM 100) | Storage must start first |
| 2 | 60s | Saltbox (VM 101) | Depends on TrueNAS NFS |
| 3 | 10s | fs-dev, homeassistant, lmdev1, copyparty, docker-host | General VMs |
| 4 | 5s | pihole, traefik, findshyt | Containers |
PVE2 VMs: order=1, wait=10s
**Total recovery time**: ~7 minutes from power restoration to fully operational (tested 2025-12-21)
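**How the startup order is set** (sketch using VMIDs/CTIDs from [VMS.md](VMS.md); values mirror the table above):
```bash
# TrueNAS first, 30 s settle time before the next guest starts
ssh pve 'qm set 100 --onboot 1 --startup order=1,up=30'
# Saltbox second, 60 s before the general VMs start
ssh pve 'qm set 101 --onboot 1 --startup order=2,up=60'
# Containers last (example: Traefik CT 202)
ssh pve 'pct set 202 --onboot 1 --startup order=4,up=5'
```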
---
## UPS Status Codes
| Code | Meaning | Action |
|------|---------|--------|
| `OL` | Online (AC power) | Normal operation |
| `OB` | On Battery | Power outage - monitor runtime |
| `LB` | Low Battery | <2 min remaining - shutdown imminent |
| `CHRG` | Charging | Battery charging after power restored |
| `DISCHRG` | Discharging | On battery, draining |
| `FSD` | Forced Shutdown | NUT triggered shutdown |
---
## Monitoring & Commands
### Check UPS Status
```bash
# Full status
ssh pve 'upsc cyberpower@localhost'
# Key metrics only
ssh pve 'upsc cyberpower@localhost | grep -E "battery.charge:|battery.runtime:|ups.load:|ups.status:"'
# Example output:
# battery.charge: 100
# battery.runtime: 1234 (seconds remaining)
# ups.load: 33 (% load)
# ups.status: OL (online)
```
### Control UPS Beeper
```bash
# Mute beeper (temporary - until next power event)
ssh pve 'upscmd -u admin -p upsadmin123 cyberpower@localhost beeper.mute'
# Disable beeper (permanent)
ssh pve 'upscmd -u admin -p upsadmin123 cyberpower@localhost beeper.disable'
# Enable beeper
ssh pve 'upscmd -u admin -p upsadmin123 cyberpower@localhost beeper.enable'
```
### Test Shutdown Procedure
**Simulate low battery** (careful - this will shut down VMs!):
```bash
# Set a very high low battery threshold to trigger shutdown
ssh pve 'upsrw -s battery.runtime.low=300 -u admin -p upsadmin123 cyberpower@localhost'
# Watch it trigger (when runtime drops below 300 seconds)
ssh pve 'tail -f /var/log/ups-shutdown.log'
# Reset to normal
ssh pve 'upsrw -s battery.runtime.low=120 -u admin -p upsadmin123 cyberpower@localhost'
```
**Better test**: Run shutdown script manually without actually triggering NUT:
```bash
ssh pve '/usr/local/bin/ups-shutdown.sh'
```
---
## Home Assistant Integration
UPS metrics are exposed to Home Assistant via NUT integration.
### Available Sensors
| Entity ID | Description |
|-----------|-------------|
| `sensor.cyberpower_battery_charge` | Battery % (0-100) |
| `sensor.cyberpower_battery_runtime` | Seconds remaining on battery |
| `sensor.cyberpower_load` | Load % (0-100) |
| `sensor.cyberpower_input_voltage` | Input voltage (V AC) |
| `sensor.cyberpower_output_voltage` | Output voltage (V AC) |
| `sensor.cyberpower_status` | Status text (OL, OB, LB, etc.) |
### Configuration
**Home Assistant**: See [HOMEASSISTANT.md](HOMEASSISTANT.md) for integration setup.
### Example Automations
**Send notification when on battery**:
```yaml
automation:
  - alias: "UPS On Battery Alert"
    trigger:
      - platform: state
        entity_id: sensor.cyberpower_status
        to: "OB"
    action:
      - service: notify.mobile_app
        data:
          message: "⚠️ Power outage! UPS on battery. Runtime: {{ states('sensor.cyberpower_battery_runtime') }}s"
```
**Alert when battery low**:
```yaml
automation:
  - alias: "UPS Low Battery Alert"
    trigger:
      - platform: numeric_state
        entity_id: sensor.cyberpower_battery_runtime
        below: 300
    action:
      - service: notify.mobile_app
        data:
          message: "🚨 UPS battery low! {{ states('sensor.cyberpower_battery_runtime') }}s remaining"
```
---
## Testing Results
### Full Power Failure Test (2025-12-21)
Complete end-to-end test of power failure and recovery:
| Event | Time | Duration | Notes |
|-------|------|----------|-------|
| **Power pulled** | 22:30 | - | UPS on battery, ~15 min runtime at 33% load |
| **Low battery trigger** | 22:40:38 | +10:38 | Runtime < 120s, shutdown script ran |
| **All VMs stopped** | 22:41:36 | +0:58 | Graceful shutdown completed |
| **UPS died** | 22:46:29 | +4:53 | Hosts lost power at 0% battery |
| **Power restored** | ~22:47 | - | Plugged back in |
| **PVE online** | 22:49:11 | +2:11 | BIOS boot, Proxmox started |
| **PVE2 online** | 22:50:47 | +3:47 | BIOS boot, Proxmox started |
| **All VMs running** | 22:53:39 | +6:39 | Auto-started in correct order |
| **Total recovery** | - | **~7 min** | From power return to fully operational |
**Results**:
✅ VMs shut down gracefully
✅ Hosts remained running until UPS died (as intended)
✅ Auto-boot on power restoration worked
✅ VMs started in correct order with appropriate delays
✅ No data corruption or issues
**Runtime calculation**:
- Load: ~33% (440W estimated)
- Total runtime on battery: ~16 minutes (22:30 → 22:46:29)
- Matches manufacturer estimate for 33% load
---
## Proxmox Cluster Quorum Fix
### Problem
With a 2-node cluster, if one node goes down, the other loses quorum and can't manage VMs.
During UPS testing, this would prevent the remaining node from starting VMs after power restoration.
### Solution
Modified `/etc/pve/corosync.conf` to enable 2-node mode:
```
quorum {
  provider: corosync_votequorum
  two_node: 1
}
```
**Effect**:
- Either node can operate independently if the other is down
- No more waiting for quorum when one server is offline
- Both nodes visible in single Proxmox interface when both up
**Applied**: 2025-12-21
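**Verify the setting took effect** (quick check, not part of the original change):
```bash
# Quorum flags should include "2Node" once corosync has reloaded
ssh pve 'pvecm status | grep -E "Expected votes|Flags"'
```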
---
## Maintenance
### Monthly Checks
```bash
# Check UPS status
ssh pve 'upsc cyberpower@localhost'
# Check NUT server running
ssh pve 'systemctl status nut-server'
ssh pve 'systemctl status nut-monitor'
# Check NUT client running (PVE2)
ssh pve2 'systemctl status nut-monitor'
# Verify PVE2 can see UPS
ssh pve2 'upsc cyberpower@10.10.10.120'
# Check logs for errors
ssh pve 'journalctl -u nut-server -n 50'
ssh pve 'journalctl -u nut-monitor -n 50'
```
### Battery Health
**Check battery stats**:
```bash
ssh pve 'upsc cyberpower@localhost | grep battery'
# Key metrics:
# battery.charge: 100 (should be near 100 when on AC)
# battery.runtime: 1200+ (seconds at current load)
# battery.voltage: ~24V (normal for 24V battery system)
```
**Battery replacement**: When runtime significantly decreases or UPS reports `REPLBATT`:
```bash
ssh pve 'upsc cyberpower@localhost | grep battery.mfr.date'
```
CyberPower batteries typically last 3-5 years.
### Firmware Updates
Check CyberPower website for firmware updates:
https://www.cyberpowersystems.com/support/firmware/
---
## Troubleshooting
### UPS Not Detected
```bash
# Check USB connection
ssh pve 'lsusb | grep Cyber'
# Expected:
# Bus 001 Device 003: ID 0764:0501 Cyber Power System, Inc. CP1500 AVR UPS
# Restart NUT driver
ssh pve 'systemctl restart nut-driver'
ssh pve 'systemctl status nut-driver'
```
### PVE2 Can't Connect
```bash
# Verify NUT server listening
ssh pve 'netstat -tuln | grep 3493'
# Should show:
# tcp 0 0 10.10.10.120:3493 0.0.0.0:* LISTEN
# Test connection from PVE2
ssh pve2 'telnet 10.10.10.120 3493'
# Check firewall (should allow port 3493)
ssh pve 'iptables -L -n | grep 3493'
```
### Shutdown Script Not Running
```bash
# Check script permissions
ssh pve 'ls -la /usr/local/bin/ups-shutdown.sh'
# Should be: -rwxr-xr-x (executable)
# Check logs
ssh pve 'cat /var/log/ups-shutdown.log'
# Test script manually
ssh pve '/usr/local/bin/ups-shutdown.sh'
```
### UPS Status Shows UNKNOWN
```bash
# Driver may not be compatible
ssh pve 'upsc cyberpower@localhost ups.status'
# Try different driver (in /etc/nut/ups.conf)
# driver = usbhid-ups
# or
# driver = blazer_usb
# Restart after change
ssh pve 'systemctl restart nut-driver nut-server'
```
---
## Future Improvements
- [ ] Add email alerts for UPS events (power fail, low battery)
- [ ] Log runtime statistics to track battery degradation (see the sketch after this list)
- [ ] Set up Grafana dashboard for UPS metrics
- [ ] Test battery runtime at different load levels
- [ ] Upgrade to 20A circuit, restore original 5-20P plug
- [ ] Consider adding network management card for out-of-band UPS access
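One possible approach to the runtime-statistics item, sketched under the assumption that it would run hourly from cron on `pve` (the script name and log path are placeholders, not deployed yet):
```bash
#!/bin/bash
# Hypothetical battery-trend logger: append charge, runtime, and load so
# degradation can be graphed later (e.g. in Grafana).
LOG=/var/log/ups-battery-trend.csv
[ -f "$LOG" ] || echo "timestamp,charge_pct,runtime_s,load_pct" > "$LOG"
echo "$(date -Is),$(upsc cyberpower@localhost battery.charge 2>/dev/null),$(upsc cyberpower@localhost battery.runtime 2>/dev/null),$(upsc cyberpower@localhost ups.load 2>/dev/null)" >> "$LOG"
```
A cron entry such as `0 * * * * /usr/local/bin/ups-battery-trend.sh` (path hypothetical) would give one sample per hour, which is enough to spot a downward runtime trend over months.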
---
## Related Documentation
- [POWER-MANAGEMENT.md](POWER-MANAGEMENT.md) - Overall power optimization
- [VMS.md](VMS.md) - VM startup order configuration
- [HOMEASSISTANT.md](HOMEASSISTANT.md) - UPS sensor integration
---
**Last Updated**: 2025-12-22

VMS.md Normal file

@@ -0,0 +1,580 @@
# VMs and Containers
Complete inventory of all virtual machines and LXC containers across both Proxmox servers.
## Overview
| Server | VMs | LXCs | Total |
|--------|-----|------|-------|
| **PVE** (10.10.10.120) | 7 | 3 | 10 |
| **PVE2** (10.10.10.102) | 3 | 0 | 3 |
| **Total** | **10** | **3** | **13** |
---
## PVE (10.10.10.120) - Primary Server
### Virtual Machines
| VMID | Name | IP | vCPUs | RAM | Storage | Purpose | GPU/Passthrough | QEMU Agent |
|------|------|-----|-------|-----|---------|---------|-----------------|------------|
| **100** | truenas | 10.10.10.200 | 8 | 32GB | nvme-mirror1 | NAS, central file storage | LSI SAS2308 HBA, Samsung NVMe | ✅ Yes |
| **101** | saltbox | 10.10.10.100 | 16 | 16GB | nvme-mirror1 | Media automation (Plex, *arr) | TITAN RTX | ✅ Yes |
| **105** | fs-dev | 10.10.10.5 | 10 | 8GB | rpool | Development environment | - | ✅ Yes |
| **110** | homeassistant | 10.10.10.110 | 2 | 2GB | rpool | Home automation platform | - | ❌ No |
| **111** | lmdev1 | 10.10.10.111 | 8 | 32GB | nvme-mirror1 | AI/LLM development | TITAN RTX | ✅ Yes |
| **201** | copyparty | 10.10.10.201 | 2 | 2GB | rpool | File sharing service | - | ✅ Yes |
| **206** | docker-host | 10.10.10.206 | 2 | 4GB | rpool | Docker services (Excalidraw, Happy, Pulse) | - | ✅ Yes |
### LXC Containers
| CTID | Name | IP | RAM | Storage | Purpose |
|------|------|-----|-----|---------|---------|
| **200** | pihole | 10.10.10.10 | - | rpool | DNS, ad blocking |
| **202** | traefik | 10.10.10.250 | - | rpool | Reverse proxy (primary) |
| **205** | findshyt | 10.10.10.8 | - | rpool | Custom app |
---
## PVE2 (10.10.10.102) - Secondary Server
### Virtual Machines
| VMID | Name | IP | vCPUs | RAM | Storage | Purpose | GPU/Passthrough | QEMU Agent |
|------|------|-----|-------|-----|---------|---------|-----------------|------------|
| **300** | gitea-vm | 10.10.10.220 | 2 | 4GB | nvme-mirror3 | Git server (Gitea) | - | ✅ Yes |
| **301** | trading-vm | 10.10.10.221 | 16 | 32GB | nvme-mirror3 | AI trading platform | RTX A6000 | ✅ Yes |
| **302** | docker-host2 | 10.10.10.207 | 4 | 8GB | nvme-mirror3 | Docker host (n8n, automation) | - | ✅ Yes |
### LXC Containers
None on PVE2.
---
## VM Details
### 100 - TrueNAS (Storage Server)
**Purpose**: Central NAS for all file storage, NFS/SMB shares, and media libraries
**Specs**:
- **OS**: TrueNAS SCALE
- **vCPUs**: 8
- **RAM**: 32 GB
- **Storage**: nvme-mirror1 (OS), EMC storage enclosure (data pool via HBA passthrough)
- **Network**:
- Primary: 10 Gb (vmbr2)
- Secondary: Internal storage network (vmbr3 @ 10.10.20.x)
**Hardware Passthrough**:
- LSI SAS2308 HBA (for EMC enclosure drives)
- Samsung NVMe (for ZFS caching)
**ZFS Pools**:
- `vault`: Main storage pool on EMC drives
- Boot pool on passed-through NVMe
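A quick pool health check from the CLI (a sketch, assuming SSH access as `hutson@10.10.10.200` as used by the backup scripts; `vault` is the pool name documented above):
```bash
# "-x" prints only pools with problems; a healthy system reports "all pools are healthy"
ssh hutson@10.10.10.200 'zpool status -x'
ssh hutson@10.10.10.200 'zpool list vault'
```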
**See**: [STORAGE.md](STORAGE.md), [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md)
---
### 101 - Saltbox (Media Automation)
**Purpose**: Media server stack - Plex, Sonarr, Radarr, SABnzbd, Overseerr, etc.
**Specs**:
- **OS**: Ubuntu 22.04
- **vCPUs**: 16
- **RAM**: 16 GB
- **Storage**: nvme-mirror1
- **Network**: 10 Gb (vmbr2)
**GPU Passthrough**:
- NVIDIA TITAN RTX (for Plex hardware transcoding)
**Services**:
- Plex Media Server (plex.htsn.io)
- Sonarr, Radarr, Lidarr (TV/movie/music automation)
- SABnzbd, NZBGet (downloaders)
- Overseerr (request management)
- Tautulli (Plex stats)
- Organizr (dashboard)
- Authelia (SSO authentication)
- Traefik (reverse proxy - separate from CT 202)
**Managed By**: Saltbox Ansible playbooks
**See**: [SALTBOX.md](#) (coming soon)
---
### 105 - fs-dev (Development Environment)
**Purpose**: General development work, testing, prototyping
**Specs**:
- **OS**: Ubuntu 22.04
- **vCPUs**: 10
- **RAM**: 8 GB
- **Storage**: rpool
- **Network**: 1 Gb (vmbr0)
---
### 110 - Home Assistant (Home Automation)
**Purpose**: Smart home automation platform
**Specs**:
- **OS**: Home Assistant OS
- **vCPUs**: 2
- **RAM**: 2 GB
- **Storage**: rpool
- **Network**: 1 Gb (vmbr0)
**Access**:
- Web UI: https://homeassistant.htsn.io
- API: See [HOMEASSISTANT.md](HOMEASSISTANT.md)
**Special Notes**:
- ❌ No QEMU agent (Home Assistant OS doesn't support it)
- No SSH server by default (access via web terminal)
---
### 111 - lmdev1 (AI/LLM Development)
**Purpose**: AI model development, fine-tuning, inference
**Specs**:
- **OS**: Ubuntu 22.04
- **vCPUs**: 8
- **RAM**: 32 GB
- **Storage**: nvme-mirror1
- **Network**: 1 Gb (vmbr0)
**GPU Passthrough**:
- NVIDIA TITAN RTX (shared with Saltbox, but can be dedicated if needed)
**Installed**:
- CUDA toolkit
- Python 3.11+
- PyTorch, TensorFlow
- Hugging Face transformers
---
### 201 - Copyparty (File Sharing)
**Purpose**: Simple HTTP file sharing server
**Specs**:
- **OS**: Ubuntu 22.04
- **vCPUs**: 2
- **RAM**: 2 GB
- **Storage**: rpool
- **Network**: 1 Gb (vmbr0)
**Access**: https://copyparty.htsn.io
---
### 206 - docker-host (Docker Services)
**Purpose**: General-purpose Docker host for miscellaneous services
**Specs**:
- **OS**: Ubuntu 22.04
- **vCPUs**: 2
- **RAM**: 4 GB
- **Storage**: rpool
- **Network**: 1 Gb (vmbr0)
- **CPU**: `host` passthrough (for x86-64-v3 support)
**Services Running**:
- Excalidraw (excalidraw.htsn.io) - Whiteboard
- Happy Coder relay server (happy.htsn.io) - Self-hosted relay for Happy Coder mobile app
- Pulse (pulse.htsn.io) - Monitoring dashboard
**Docker Compose Files**: `/opt/*/docker-compose.yml`
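To see what is actually running, a sketch (assumes an SSH alias `docker-host`; substitute `hutson@10.10.10.206` if no alias is configured):
```bash
# List containers for every compose project under /opt
ssh docker-host 'for d in /opt/*/; do echo "== $d"; docker compose -f "$d/docker-compose.yml" ps; done'
```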
---
### 300 - gitea-vm (Git Server)
**Purpose**: Self-hosted Git server
**Specs**:
- **OS**: Ubuntu 22.04
- **vCPUs**: 2
- **RAM**: 4 GB
- **Storage**: nvme-mirror3 (PVE2)
- **Network**: 1 Gb (vmbr0)
**Access**: https://git.htsn.io
**Repositories**:
- homelab-docs (this documentation)
- Personal projects
- Private repos
---
### 301 - trading-vm (AI Trading Platform)
**Purpose**: Algorithmic trading system with AI models
**Specs**:
- **OS**: Ubuntu 22.04
- **vCPUs**: 16
- **RAM**: 32 GB
- **Storage**: nvme-mirror3 (PVE2)
- **Network**: 1 Gb (vmbr0)
**GPU Passthrough**:
- NVIDIA RTX A6000 (300W TDP, 48GB VRAM)
**Software**:
- Trading algorithms
- AI models for market prediction
- Real-time data feeds
- Backtesting infrastructure
---
## LXC Container Details
### 200 - Pi-hole (DNS & Ad Blocking)
**Purpose**: Network-wide DNS server and ad blocker
**Type**: LXC (unprivileged)
**OS**: Ubuntu 22.04
**IP**: 10.10.10.10
**Storage**: rpool
**Access**:
- Web UI: http://10.10.10.10/admin
- Public URL: https://pihole.htsn.io
**Configuration**:
- Upstream DNS: Cloudflare (1.1.1.1)
- DHCP: Disabled (router handles DHCP)
- Interface: All interfaces
**Usage**: Set router DNS to 10.10.10.10 for network-wide ad blocking
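A quick way to confirm both resolution and blocking from any LAN client (assumes `dig` is installed; the blocked domain is only an example):
```bash
# Normal lookup should return a real address
dig @10.10.10.10 +short example.com
# A blocklisted domain should return 0.0.0.0 (or no answer, depending on Pi-hole's blocking mode)
dig @10.10.10.10 +short doubleclick.net
```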
---
### 202 - Traefik (Reverse Proxy)
**Purpose**: Primary reverse proxy for all public-facing services
**Type**: LXC (unprivileged)
**OS**: Ubuntu 22.04
**IP**: 10.10.10.250
**Storage**: rpool
**Configuration**: `/etc/traefik/`
**Dynamic Configs**: `/etc/traefik/conf.d/*.yaml`
**See**: [TRAEFIK.md](TRAEFIK.md) for complete documentation
**⚠️ Important**: This is the PRIMARY Traefik instance. Do NOT confuse with Saltbox's Traefik (VM 101).
---
### 205 - FindShyt (Custom App)
**Purpose**: Custom application (details TBD)
**Type**: LXC (unprivileged)
**OS**: Ubuntu 22.04
**IP**: 10.10.10.8
**Storage**: rpool
**Access**: https://findshyt.htsn.io
---
## VM Startup Order & Dependencies
### Power-On Sequence
When servers boot (after power failure or restart), VMs/CTs start in this order:
#### PVE (10.10.10.120)
| Order | Wait | VMID | Name | Reason |
|-------|------|------|------|--------|
| **1** | 30s | 100 | TrueNAS | ⚠️ Storage must start first - other VMs depend on NFS |
| **2** | 60s | 101 | Saltbox | Depends on TrueNAS NFS mounts for media |
| **3** | 10s | 105, 110, 111, 201, 206 | Other VMs | General VMs, no critical dependencies |
| **4** | 5s | 200, 202, 205 | Containers | Lightweight, start quickly |
**Configure startup order** (already set):
```bash
# View current config
ssh pve 'qm config 100 | grep -E "startup|onboot"'
# Set startup order (example)
ssh pve 'qm set 100 --onboot 1 --startup order=1,up=30'
ssh pve 'qm set 101 --onboot 1 --startup order=2,up=60'
```
#### PVE2 (10.10.10.102)
| Order | Wait | VMID | Name |
|-------|------|------|------|
| **1** | 10s | 300, 301, 302 | All VMs |
**Less critical** - no dependencies between PVE2 VMs.
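If the same onboot/startup pattern is wanted on PVE2, a matching sketch (delays are illustrative, mirroring the single startup group in the table above):
```bash
# Put all PVE2 VMs in one startup group with a short delay
for id in 300 301 302; do ssh pve2 "qm set $id --onboot 1 --startup order=1,up=10"; done
```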
---
## Resource Allocation Summary
### Total Allocated (PVE)
| Resource | Allocated | Physical | % Used |
|----------|-----------|----------|--------|
| **vCPUs** | 56 | 64 (32 cores × 2 threads) | 88% |
| **RAM** | 98 GB | 128 GB | 77% |
**Note**: vCPU overcommit is acceptable (VMs rarely use all cores simultaneously)
### Total Allocated (PVE2)
| Resource | Allocated | Physical | % Used |
|----------|-----------|----------|--------|
| **vCPUs** | 18 | 64 | 28% |
| **RAM** | 36 GB | 128 GB | 28% |
**PVE2** has significant headroom for additional VMs.
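These figures were tallied by hand; a rough way to recompute them directly on a host (sketch, run on `pve` or `pve2`; assumes single-socket VM configs, since it ignores any `sockets:` multiplier):
```bash
# Sum cores and memory across all VMs defined on this node
for id in $(qm list | awk 'NR>1 {print $1}'); do
  qm config "$id" | grep -E '^(cores|memory):'
done | awk -F': ' '/cores/ {c+=$2} /memory/ {m+=$2} END {printf "vCPUs: %d  RAM: %.0f GB\n", c, m/1024}'
```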
---
## Adding a New VM
### Quick Template
```bash
# Create VM
ssh pve 'qm create VMID \
  --name myvm \
  --memory 4096 \
  --cores 2 \
  --net0 virtio,bridge=vmbr0 \
  --scsihw virtio-scsi-pci \
  --scsi0 nvme-mirror1:32 \
  --boot order=scsi0 \
  --ostype l26 \
  --agent enabled=1'
# Attach ISO for installation
ssh pve 'qm set VMID --ide2 local:iso/ubuntu-22.04.iso,media=cdrom'
# Start VM
ssh pve 'qm start VMID'
# Access console
ssh pve 'qm vncproxy VMID' # Then connect with VNC client
# Or via Proxmox web UI
```
### Cloud-Init Template (Faster)
Use cloud-init for automated VM deployment:
```bash
# Download cloud image
ssh pve 'wget https://cloud-images.ubuntu.com/releases/22.04/release/ubuntu-22.04-server-cloudimg-amd64.img -O /var/lib/vz/template/iso/ubuntu-22.04-cloud.img'
# Create VM
ssh pve 'qm create VMID --name myvm --memory 4096 --cores 2 --net0 virtio,bridge=vmbr0'
# Import disk
ssh pve 'qm importdisk VMID /var/lib/vz/template/iso/ubuntu-22.04-cloud.img nvme-mirror1'
# Attach disk
ssh pve 'qm set VMID --scsi0 nvme-mirror1:vm-VMID-disk-0'
# Add cloud-init drive
ssh pve 'qm set VMID --ide2 nvme-mirror1:cloudinit'
# Set boot disk
ssh pve 'qm set VMID --boot order=scsi0'
# Configure cloud-init (user, SSH key, network)
ssh pve 'qm set VMID --ciuser hutson --sshkeys ~/.ssh/homelab.pub --ipconfig0 ip=10.10.10.XXX/24,gw=10.10.10.1'
# Enable QEMU agent
ssh pve 'qm set VMID --agent enabled=1'
# Resize disk (cloud images are small by default)
ssh pve 'qm resize VMID scsi0 +30G'
# Start VM
ssh pve 'qm start VMID'
```
**Cloud-init VMs boot ready-to-use** with SSH keys, static IP, and user configured.
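After the first boot it's worth confirming the guest agent and cloud-init both finished (VMID and IP are placeholders, as above):
```bash
# The QEMU guest agent responds once the VM has fully booted
ssh pve 'qm agent VMID ping'
# cloud-init reports "status: done" when provisioning completed
ssh hutson@10.10.10.XXX 'cloud-init status'
```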
---
## Adding a New LXC Container
```bash
# Download template (if not already downloaded)
ssh pve 'pveam update'
ssh pve 'pveam available | grep ubuntu'
ssh pve 'pveam download local ubuntu-22.04-standard_22.04-1_amd64.tar.zst'
# Create container
ssh pve 'pct create CTID local:vztmpl/ubuntu-22.04-standard_22.04-1_amd64.tar.zst \
  --hostname mycontainer \
  --memory 2048 \
  --cores 2 \
  --net0 name=eth0,bridge=vmbr0,ip=10.10.10.XXX/24,gw=10.10.10.1 \
  --rootfs local-zfs:8 \
  --unprivileged 1 \
  --features nesting=1 \
  --start 1'
# Set root password
ssh pve 'pct exec CTID -- passwd'
# Add SSH key
ssh pve 'pct exec CTID -- mkdir -p /root/.ssh'
ssh pve 'pct exec CTID -- bash -c "echo \"$(cat ~/.ssh/homelab.pub)\" >> /root/.ssh/authorized_keys"'
ssh pve 'pct exec CTID -- chmod 700 /root/.ssh && chmod 600 /root/.ssh/authorized_keys'
```
---
## GPU Passthrough Configuration
### Current GPU Assignments
| GPU | Location | Passed To | VMID | Purpose |
|-----|----------|-----------|------|---------|
| **NVIDIA Quadro P2000** | PVE | - | - | Proxmox host (Plex transcoding via driver) |
| **NVIDIA TITAN RTX** | PVE | saltbox, lmdev1 | 101, 111 | Media transcoding + AI dev (shared) |
| **NVIDIA RTX A6000** | PVE2 | trading-vm | 301 | AI trading (dedicated) |
### How to Pass GPU to VM
1. **Identify GPU PCI ID**:
```bash
ssh pve 'lspci | grep -i nvidia'
# Example output:
# 81:00.0 VGA compatible controller: NVIDIA Corporation TU102 [TITAN RTX] (rev a1)
# 81:00.1 Audio device: NVIDIA Corporation TU102 High Definition Audio Controller (rev a1)
```
2. **Pass GPU to VM** (include both VGA and Audio):
```bash
ssh pve 'qm set VMID -hostpci0 81:00.0,pcie=1'
# If multi-function device (GPU + Audio), use:
ssh pve 'qm set VMID -hostpci0 81:00,pcie=1'
```
3. **Configure VM for GPU**:
```bash
# Set machine type to q35
ssh pve 'qm set VMID --machine q35'
# Set BIOS to OVMF (UEFI)
ssh pve 'qm set VMID --bios ovmf'
# Add EFI disk
ssh pve 'qm set VMID --efidisk0 nvme-mirror1:1,format=raw,efitype=4m,pre-enrolled-keys=1'
```
4. **Reboot VM** and install NVIDIA drivers inside the VM
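To confirm the passthrough is wired up correctly (PCI ID from step 1; assumes NVIDIA drivers are already installed in the guest):
```bash
# On the host: the GPU should be bound to vfio-pci while the VM is running
ssh pve 'lspci -nnk -s 81:00.0 | grep "Kernel driver in use"'
# Inside the VM: the card should enumerate normally
nvidia-smi
```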
**See**: [GPU-PASSTHROUGH.md](#) (coming soon) for detailed guide
---
## Backup Priority
See [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md) for complete backup plan.
### Critical VMs (Must Backup)
| Priority | VMID | Name | Reason |
|----------|------|------|--------|
| 🔴 **CRITICAL** | 100 | truenas | All storage lives here - catastrophic if lost |
| 🟡 **HIGH** | 101 | saltbox | Complex media stack config |
| 🟡 **HIGH** | 110 | homeassistant | Home automation config |
| 🟡 **HIGH** | 300 | gitea-vm | Git repositories (code, docs) |
| 🟡 **HIGH** | 301 | trading-vm | Trading algorithms and AI models |
### Medium Priority
| VMID | Name | Notes |
|------|------|-------|
| 200 | pihole | Easy to rebuild, but DNS config valuable |
| 202 | traefik | Config files backed up separately |
### Low Priority (Ephemeral/Rebuildable)
| VMID | Name | Notes |
|------|------|-------|
| 105 | fs-dev | Development - code is in Git |
| 111 | lmdev1 | Ephemeral development |
| 201 | copyparty | Simple app, easy to redeploy |
| 206 | docker-host | Docker Compose files backed up separately |
---
## Quick Reference Commands
```bash
# List all VMs
ssh pve 'qm list'
ssh pve2 'qm list'
# List all containers
ssh pve 'pct list'
# Start/stop VM
ssh pve 'qm start VMID'
ssh pve 'qm stop VMID'
ssh pve 'qm shutdown VMID' # Graceful
# Start/stop container
ssh pve 'pct start CTID'
ssh pve 'pct stop CTID'
ssh pve 'pct shutdown CTID' # Graceful
# VM console
ssh pve 'qm terminal VMID'
# Container console
ssh pve 'pct enter CTID'
# Clone VM
ssh pve 'qm clone VMID NEW_VMID --name newvm'
# Delete VM
ssh pve 'qm destroy VMID'
# Delete container
ssh pve 'pct destroy CTID'
```
---
## Related Documentation
- [STORAGE.md](STORAGE.md) - Storage pool assignments
- [SSH-ACCESS.md](SSH-ACCESS.md) - How to access VMs
- [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md) - VM backup strategy
- [POWER-MANAGEMENT.md](POWER-MANAGEMENT.md) - VM resource optimization
- [NETWORK.md](NETWORK.md) - Which bridge to use for new VMs
---
**Last Updated**: 2025-12-22


@@ -0,0 +1 @@
{"web":{"client_id":"693027753314-hdjfnvfnarlcnehba6u8plbehv78rfh9.apps.googleusercontent.com","project_id":"spheric-method-482514-f8","auth_uri":"https://accounts.google.com/o/oauth2/auth","token_uri":"https://oauth2.googleapis.com/token","auth_provider_x509_cert_url":"https://www.googleapis.com/oauth2/v1/certs","client_secret":"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","redirect_uris":["https://my.home-assistant.io/redirect/oauth"]}}


@@ -0,0 +1,41 @@
#!/bin/bash
# Internet Watchdog - Reboots if internet is unreachable for 5 minutes
LOG_FILE="/var/log/internet-watchdog.log"
FAIL_COUNT=0
MAX_FAILS=5
CHECK_INTERVAL=60
log() {
  echo "$(date "+%Y-%m-%d %H:%M:%S") - $1" >> "$LOG_FILE"
}
check_internet() {
  for endpoint in 1.1.1.1 8.8.8.8 208.67.222.222; do
    if ping -c 1 -W 5 "$endpoint" > /dev/null 2>&1; then
      return 0
    fi
  done
  return 1
}
log "Watchdog started"
while true; do
  if check_internet; then
    if [ $FAIL_COUNT -gt 0 ]; then
      log "Internet restored after $FAIL_COUNT failures"
    fi
    FAIL_COUNT=0
  else
    FAIL_COUNT=$((FAIL_COUNT + 1))
    log "Internet check failed ($FAIL_COUNT/$MAX_FAILS)"
    if [ $FAIL_COUNT -ge $MAX_FAILS ]; then
      log "CRITICAL: $MAX_FAILS consecutive failures - REBOOTING"
      sync
      sleep 2
      reboot
    fi
  fi
  sleep $CHECK_INTERVAL
done


@@ -0,0 +1,23 @@
#!/bin/bash
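# Memory History Logger
# Appends memory usage and the top memory consumers to a rotating log every 10 minutes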
LOG_DIR="/data/logs"
LOG_FILE="$LOG_DIR/memory-history.log"
mkdir -p "$LOG_DIR"
while true; do
  # Rotate if over 10MB
  if [ -f "$LOG_FILE" ]; then
    SIZE=$(wc -c < "$LOG_FILE" 2>/dev/null || echo 0)
    if [ "$SIZE" -gt 10485760 ]; then
      mv "$LOG_FILE" "$LOG_FILE.old"
    fi
  fi
  echo "========== $(date +%Y-%m-%d\ %H:%M:%S) ==========" >> "$LOG_FILE"
  echo "--- MEMORY ---" >> "$LOG_FILE"
  free -m >> "$LOG_FILE"
  echo "--- TOP MEMORY PROCESSES ---" >> "$LOG_FILE"
  ps -eo pid,rss,comm --sort=-rss | head -12 >> "$LOG_FILE"
  echo "" >> "$LOG_FILE"
  sleep 600
done


@@ -0,0 +1,114 @@
#!/bin/bash
# Crafty Permission Checker Script
# Checks for permission issues that could break plugin functionality
echo "Crafty Permission Check - $(date)"
echo "================================"
# Base directory
CRAFTY_DIR="/home/hutson/crafty/data/servers"
# Check if running on docker-host2
if [ "$(hostname)" != "docker-host2" ]; then
echo "⚠️ This script should be run on docker-host2"
echo " Use: ssh docker-host2 '~/check-crafty-permissions.sh'"
exit 1
fi
# Function to check permissions
check_permissions() {
  local issues_found=0
  # Check for files not owned by root group
  echo -e "\n📁 Checking file ownership..."
  wrong_group=$(find "$CRAFTY_DIR" -type f ! -group root 2>/dev/null)
  if [ ! -z "$wrong_group" ]; then
    echo "❌ Files with incorrect group (should be 'root'):"
    echo "$wrong_group" | head -10
    issues_found=$((issues_found + 1))
  else
    echo "✅ All files have correct group ownership (root)"
  fi
  # Check for directories not owned by root group
  echo -e "\n📁 Checking directory ownership..."
  wrong_dir_group=$(find "$CRAFTY_DIR" -type d ! -group root 2>/dev/null)
  if [ ! -z "$wrong_dir_group" ]; then
    echo "❌ Directories with incorrect group (should be 'root'):"
    echo "$wrong_dir_group" | head -10
    issues_found=$((issues_found + 1))
  else
    echo "✅ All directories have correct group ownership (root)"
  fi
  # Check for directories without setgid bit
  echo -e "\n🔒 Checking setgid bit on directories..."
  no_setgid=$(find "$CRAFTY_DIR" -type d ! -perm -g+s 2>/dev/null)
  if [ ! -z "$no_setgid" ]; then
    echo "⚠️ Directories without setgid bit (may cause future issues):"
    echo "$no_setgid" | head -10
    issues_found=$((issues_found + 1))
  else
    echo "✅ All directories have setgid bit set"
  fi
  # Check for files that crafty user can't read (excluding temp files)
  echo -e "\n📖 Checking read permissions..."
  unreadable=$(find "$CRAFTY_DIR" -type f ! -perm -g+r ! -name "*.tmp" 2>/dev/null)
  if [ ! -z "$unreadable" ]; then
    echo "❌ Files that crafty user can't read:"
    echo "$unreadable" | head -10
    issues_found=$((issues_found + 1))
  else
    echo "✅ All files are readable by crafty user"
  fi
  return $issues_found
}
# Function to fix permissions
fix_permissions() {
  echo -e "\n🔧 Fixing permissions..."
  # Fix ownership
  sudo chown -R hutson:root "$CRAFTY_DIR"
  # Fix directory permissions (2775 = rwxrwsr-x)
  sudo find "$CRAFTY_DIR" -type d -exec chmod 2775 {} \;
  # Fix file permissions (664 = rw-rw-r--)
  sudo find "$CRAFTY_DIR" -type f -exec chmod 664 {} \;
  echo "✅ Permissions fixed!"
}
# Main execution
echo "Checking Crafty server permissions..."
check_permissions
result=$?
if [ $result -gt 0 ]; then
  echo -e "\n⚠ Found $result permission issue(s)!"
  echo -n "Would you like to fix them automatically? (y/n): "
  read -r response
  if [[ "$response" =~ ^[Yy]$ ]]; then
    fix_permissions
    echo -e "\n🔄 Re-checking permissions..."
    check_permissions
    if [ $? -eq 0 ]; then
      echo -e "\n✅ All permission issues resolved!"
    else
      echo -e "\n❌ Some issues remain. You may need to restart the Crafty container."
    fi
  else
    echo -e "\nTo fix manually, run:"
    echo "sudo chown -R hutson:root $CRAFTY_DIR"
    echo "sudo find $CRAFTY_DIR -type d -exec chmod 2775 {} \;"
    echo "sudo find $CRAFTY_DIR -type f -exec chmod 664 {} \;"
  fi
else
  echo -e "\n✅ No permission issues found!"
fi
echo -e "\n================================"
echo "Check complete - $(date)"


@@ -0,0 +1,77 @@
#!/bin/bash
# Minecraft Servers Backup Script (All Servers)
# Backs up both Hutworld and Backrooms servers to TrueNAS
BACKUP_DEST="hutson@10.10.10.200:/mnt/vault/users/backups/minecraft"
DATE=$(date +%Y-%m-%d_%H%M)
echo "[$(date)] Starting Minecraft servers backup..."
# Backup Hutworld server
HUTWORLD_SRC="$HOME/crafty/data/servers/19f604a9-f037-442d-9283-0761c73cfd60"
HUTWORLD_BACKUP="/tmp/hutworld-$DATE.tar.gz"
echo "[$(date)] Backing up Hutworld server..."
tar -czf "$HUTWORLD_BACKUP" \
--exclude="*.jar" \
--exclude="cache" \
--exclude="libraries" \
--exclude=".paper-remapped" \
-C "$HOME/crafty/data/servers" \
19f604a9-f037-442d-9283-0761c73cfd60
echo "[$(date)] Hutworld backup created: $(ls -lh $HUTWORLD_BACKUP | awk '{print $5}')"
# Transfer Hutworld backup to TrueNAS
sshpass -p 'GrilledCh33s3#' scp -o StrictHostKeyChecking=no "$HUTWORLD_BACKUP" "$BACKUP_DEST/"
if [ $? -eq 0 ]; then
  echo "[$(date)] Hutworld backup transferred successfully"
  rm "$HUTWORLD_BACKUP"
else
  echo "[$(date)] ERROR: Failed to transfer Hutworld backup"
fi
# Backup Backrooms server
BACKROOMS_SRC="$HOME/crafty/data/servers/64079d6c-acb0-48c4-9b21-23e0fa354522"
BACKROOMS_BACKUP="/tmp/backrooms-$DATE.tar.gz"
echo "[$(date)] Backing up Backrooms server..."
tar -czf "$BACKROOMS_BACKUP" \
--exclude="*.jar" \
--exclude="cache" \
--exclude="libraries" \
--exclude=".paper-remapped" \
-C "$HOME/crafty/data/servers" \
64079d6c-acb0-48c4-9b21-23e0fa354522
echo "[$(date)] Backrooms backup created: $(ls -lh $BACKROOMS_BACKUP | awk '{print $5}')"
# Transfer Backrooms backup to TrueNAS
sshpass -p 'GrilledCh33s3#' scp -o StrictHostKeyChecking=no "$BACKROOMS_BACKUP" "$BACKUP_DEST/"
if [ $? -eq 0 ]; then
  echo "[$(date)] Backrooms backup transferred successfully"
  rm "$BACKROOMS_BACKUP"
else
  echo "[$(date)] ERROR: Failed to transfer Backrooms backup"
fi
# Clean up old backups (keep last 30 of each server)
echo "[$(date)] Cleaning up old backups..."
sshpass -p 'GrilledCh33s3#' ssh -o StrictHostKeyChecking=no hutson@10.10.10.200 '
cd /mnt/vault/users/backups/minecraft
# Keep only last 30 Hutworld backups
ls -t hutworld-*.tar.gz 2>/dev/null | tail -n +31 | xargs -r rm -f
# Keep only last 30 Backrooms backups
ls -t backrooms-*.tar.gz 2>/dev/null | tail -n +31 | xargs -r rm -f
echo "Current backups:"
echo "Hutworld: $(ls -1 hutworld-*.tar.gz 2>/dev/null | wc -l) backups"
echo "Backrooms: $(ls -1 backrooms-*.tar.gz 2>/dev/null | wc -l) backups"
echo "Total size: $(du -sh . | cut -f1)"
'
echo "[$(date)] All backups complete!"