Complete Phase 2 documentation: Add HARDWARE, SERVICES, MONITORING, MAINTENANCE

Phase 2 documentation implementation:
- Created HARDWARE.md: Complete hardware inventory (servers, GPUs, storage, network cards)
- Created SERVICES.md: Service inventory with URLs, credentials, health checks (25+ services)
- Created MONITORING.md: Health monitoring recommendations, alert setup, implementation plan
- Created MAINTENANCE.md: Regular procedures, update schedules, testing checklists
- Updated README.md: Added all Phase 2 documentation links
- Updated CLAUDE.md: Cleaned up to quick reference only (1340→377 lines)

All detailed content now in specialized documentation files with cross-references.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Hutson
2025-12-23 00:34:21 -05:00
parent 23e9df68c9
commit 56b82df497
14 changed files with 6328 additions and 1036 deletions

UPS.md Normal file

@@ -0,0 +1,605 @@
# UPS and Power Management
Documentation for UPS (Uninterruptible Power Supply) configuration, NUT (Network UPS Tools) monitoring, and power failure procedures.
## Hardware
### Current UPS
| Specification | Value |
|---------------|-------|
| **Model** | CyberPower OR2200PFCRT2U |
| **Capacity** | 2200VA / 1320W |
| **Form Factor** | 2U rackmount |
| **Output** | PFC Sinewave (compatible with active PFC PSUs) |
| **Outlets** | 2x NEMA 5-20R + 6x NEMA 5-15R (all battery + surge) |
| **Input Plug** | ⚠️ Originally NEMA 5-20P (20A), **rewired to 5-15P (15A)** |
| **Runtime** | ~15-20 min at typical load (~33% / 440W) |
| **Installed** | 2025-12-21 |
| **Status** | Active |
### ⚠️ Temporary Wiring Modification
**Issue**: UPS came with NEMA 5-20P plug (20A) but server rack is on 15A circuit
**Solution**: Temporarily rewired plug from 5-20P → 5-15P for compatibility
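**Risk**: UPS can output up to 1320W; the 15A circuit is limited to 1800W peak (15A × 120V), or about 1440W continuous under the common 80% rule. Battery charging and conversion losses add to the wall draw.
**Current draw**: ~1000-1350W total under load (below the 1800W limit, but close to the 1440W continuous guideline at the top of the range)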
**Backlog**: Upgrade to 20A circuit, restore original 5-20P plug
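
The circuit arithmetic above can be sketched as a small helper. This is a minimal illustration, assuming 120V nominal and the 80% continuous-load guideline; the function names `circuit_limits` and `headroom_w` are hypothetical and not part of the original setup.

```bash
#!/bin/bash
# Sketch only: branch-circuit limits, assuming 120V nominal and an
# 80% continuous-load guideline (both are assumptions).

# circuit_limits AMPS [VOLTS] -> prints "<peak_watts> <continuous_watts>"
circuit_limits() {
    local amps=$1 volts=${2:-120}
    local peak=$(( amps * volts ))
    local continuous=$(( peak * 80 / 100 ))
    echo "$peak $continuous"
}

# headroom_w LIMIT_W DRAW_W -> remaining watts (negative means over the limit)
headroom_w() {
    echo $(( $1 - $2 ))
}
```

For example, `circuit_limits 15` gives the 15A figures and `circuit_limits 20` the figures for the planned 20A circuit; `headroom_w 1440 1350` shows how little continuous margin remains at the top of the estimated draw.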
### Previous UPS
| Model | Capacity | Issue | Replaced |
|-------|----------|-------|----------|
| WattBox WB-1100-IPVMB-6 | 1100VA / 660W | Insufficient for dual Threadripper setup | 2025-12-21 |
**Why replaced**: Combined server load of 1000-1350W exceeded 660W capacity.
---
## Power Draw Estimates
### Typical Load
| Component | Idle | Load | Notes |
|-----------|------|------|-------|
| PVE Server | 250-350W | 500-750W | CPU + TITAN RTX + P2000 + storage |
| PVE2 Server | 200-300W | 450-600W | CPU + RTX A6000 + storage |
| Network gear | ~50W | ~50W | Router, switches |
| **Total** | **500-700W** | **1000-1400W** | Varies by workload |
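**UPS Load**: ~33-50% typical (440-660W of 1320W), ~75% at 1000W. The 1400W upper estimate would exceed the UPS's 1320W rating, so sustained peak load on both servers at once risks an overload.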
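
The load percentages follow directly from the 1320W rating. A minimal sketch of that calculation, using integer division; `ups_load_pct` is a hypothetical helper name, not something NUT provides.

```bash
#!/bin/bash
# Sketch only: convert a wattage estimate into a UPS load percentage.
# The default capacity of 1320W is the rating from the hardware table.

# ups_load_pct WATTS [CAPACITY_W] -> integer percent (rounded down)
ups_load_pct() {
    local watts=$1 capacity=${2:-1320}
    echo $(( watts * 100 / capacity ))
}
```

Calling `ups_load_pct 440`, `ups_load_pct 660`, and `ups_load_pct 1000` reproduces the 33%, 50%, and 75% figures used in this document; any result above 100 means the estimate exceeds the UPS rating.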
### Runtime Calculation
At **440W load** (33%): ~15-20 min runtime (tested 2025-12-21)
At **660W load** (50%): ~10-12 min estimated
At **1000W load** (75%): ~6-8 min estimated
**NUT shutdown trigger**: 120 seconds (2 min) remaining runtime
---
## NUT (Network UPS Tools) Configuration
### Architecture
```
UPS (USB) ──> PVE (NUT Server/Master) ──> PVE2 (NUT Client/Slave)
                                      └──> Home Assistant (monitoring only)
```
**Master**: PVE (10.10.10.120) - UPS connected via USB, runs NUT server
**Slave**: PVE2 (10.10.10.102) - Monitors PVE's NUT server, shuts down when triggered
### NUT Server Configuration (PVE)
#### 1. UPS Driver Config: `/etc/nut/ups.conf`
```ini
[cyberpower]
driver = usbhid-ups
port = auto
desc = "CyberPower OR2200PFCRT2U"
override.battery.charge.low = 20
override.battery.runtime.low = 120
```
**Key settings**:
- `driver = usbhid-ups`: USB HID UPS driver (generic for CyberPower)
- `port = auto`: Auto-detect USB device
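- `override.battery.charge.low = 20`: Report a low-battery charge threshold of 20%
- `override.battery.runtime.low = 120`: Report a low-battery runtime threshold of 120 seconds (2 min), the intended shutdown trigger. Per the NUT `ups.conf` documentation, the driver only derives `LB` from these overridden thresholds when the `ignorelb` flag is also set; without it, the UPS's own low-battery flag decides.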
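
The threshold logic can be illustrated with a short sketch. It models the rule that NUT documents for `ignorelb` (low battery when charge or runtime falls below its threshold); it is not the driver's actual code, and `is_low_battery` is a hypothetical name.

```bash
#!/bin/bash
# Sketch only: low-battery decision from charge and runtime thresholds.
# Defaults mirror ups.conf above (20% charge, 120 s runtime).

# is_low_battery CHARGE_PCT RUNTIME_S [CHARGE_LOW] [RUNTIME_LOW]
# Exit status 0 means "low battery", 1 means "not low".
is_low_battery() {
    local charge=$1 runtime=$2 charge_low=${3:-20} runtime_low=${4:-120}
    (( charge < charge_low || runtime < runtime_low ))
}
```

With the defaults, `is_low_battery 50 119` succeeds because runtime is under 120 seconds, while `is_low_battery 100 1234` does not.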
#### 2. NUT Server Config: `/etc/nut/upsd.conf`
```ini
LISTEN 127.0.0.1 3493
LISTEN 10.10.10.120 3493
```
**Listens on**:
- Localhost (for local monitoring)
- LAN IP (for PVE2 to connect)
#### 3. User Config: `/etc/nut/upsd.users`
```ini
[admin]
password = upsadmin123
actions = SET
instcmds = ALL
[upsmon]
password = upsmon123
upsmon master
```
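**Users**:
- `admin`: Full control; can set variables (`actions = SET`) and run instant commands (`instcmds = ALL`)
- `upsmon`: Used by `upsmon` on both PVE (locally) and PVE2 (over the LAN) for monitoring and shutdown coordination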
#### 4. Monitor Config: `/etc/nut/upsmon.conf`
```ini
MONITOR cyberpower@localhost 1 upsmon upsmon123 master
MINSUPPLIES 1
SHUTDOWNCMD "/usr/local/bin/ups-shutdown.sh"
NOTIFYCMD /usr/sbin/upssched
POLLFREQ 5
POLLFREQALERT 5
HOSTSYNC 15
DEADTIME 15
POWERDOWNFLAG /etc/killpower
NOTIFYMSG ONLINE "UPS %s on line power"
NOTIFYMSG ONBATT "UPS %s on battery"
NOTIFYMSG LOWBATT "UPS %s battery is low"
NOTIFYMSG FSD "UPS %s: forced shutdown in progress"
NOTIFYMSG COMMOK "Communications with UPS %s established"
NOTIFYMSG COMMBAD "Communications with UPS %s lost"
NOTIFYMSG SHUTDOWN "Auto logout and shutdown proceeding"
NOTIFYMSG REPLBATT "UPS %s battery needs to be replaced"
NOTIFYMSG NOCOMM "UPS %s is unavailable"
NOTIFYMSG NOPARENT "upsmon parent process died - shutdown impossible"
NOTIFYFLAG ONLINE SYSLOG+WALL
NOTIFYFLAG ONBATT SYSLOG+WALL
NOTIFYFLAG LOWBATT SYSLOG+WALL
NOTIFYFLAG FSD SYSLOG+WALL
NOTIFYFLAG COMMOK SYSLOG+WALL
NOTIFYFLAG COMMBAD SYSLOG+WALL
NOTIFYFLAG SHUTDOWN SYSLOG+WALL
NOTIFYFLAG REPLBATT SYSLOG+WALL
NOTIFYFLAG NOCOMM SYSLOG+WALL
NOTIFYFLAG NOPARENT SYSLOG
```
**Key settings**:
- `MONITOR cyberpower@localhost 1 upsmon upsmon123 master`: Monitor local UPS
- `SHUTDOWNCMD "/usr/local/bin/ups-shutdown.sh"`: Custom shutdown script
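- `POLLFREQ 5`: Check UPS every 5 seconds
- `NOTIFYCMD /usr/sbin/upssched`: Only invoked for events whose `NOTIFYFLAG` includes `EXEC`; with the `SYSLOG+WALL` flags above it is never called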
#### 5. USB Permissions: `/etc/udev/rules.d/99-nut-ups.rules`
```udev
SUBSYSTEM=="usb", ATTR{idVendor}=="0764", ATTR{idProduct}=="0501", MODE="0660", GROUP="nut"
```
**Purpose**: Ensure NUT can access USB UPS device
**Apply rule**:
```bash
udevadm control --reload-rules
udevadm trigger
```
### NUT Client Configuration (PVE2)
#### Monitor Config: `/etc/nut/upsmon.conf`
```ini
MONITOR cyberpower@10.10.10.120 1 upsmon upsmon123 slave
MINSUPPLIES 1
SHUTDOWNCMD "/usr/local/bin/ups-shutdown.sh"
POLLFREQ 5
POLLFREQALERT 5
HOSTSYNC 15
DEADTIME 15
POWERDOWNFLAG /etc/killpower
# Same NOTIFYMSG and NOTIFYFLAG as PVE
```
**Key difference**: `slave` instead of `master` - monitors remote UPS on PVE
---
## Custom Shutdown Script
### `/usr/local/bin/ups-shutdown.sh` (Same on both PVE and PVE2)
```bash
#!/bin/bash
# Graceful VM/CT shutdown when UPS battery low
LOG="/var/log/ups-shutdown.log"

# Print a timestamped message and append it to the log file
log() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') - $1" | tee -a "$LOG"
}

log "=== UPS Shutdown Triggered ==="
log "Battery low - initiating graceful shutdown of VMs/CTs"

# Get list of running VMs (skip TrueNAS for now)
VMS=$(qm list | awk '$3=="running" && $1!=100 {print $1}')
for VMID in $VMS; do
    log "Stopping VM $VMID..."
    qm shutdown "$VMID"
done

# Get list of running containers
CTS=$(pct list | awk '$2=="running" {print $1}')
for CTID in $CTS; do
    log "Stopping CT $CTID..."
    pct shutdown "$CTID"
done

# Wait for VMs/CTs to stop
log "Waiting 60 seconds for VMs/CTs to shut down..."
sleep 60

# Now stop TrueNAS (storage - must be last).
# stderr is silenced because VM 100 does not exist on every host.
if qm status 100 2>/dev/null | grep -q running; then
    log "Stopping TrueNAS (VM 100) last..."
    qm shutdown 100
    sleep 30
fi

log "All VMs/CTs stopped. Host will remain running until UPS dies."
log "=== UPS Shutdown Complete ==="
```
**Make executable**:
```bash
chmod +x /usr/local/bin/ups-shutdown.sh
```
**Script behavior**:
1. Stops all VMs (except TrueNAS)
2. Stops all containers
3. Waits 60 seconds
4. Stops TrueNAS last (storage must be cleanly unmounted)
5. **Does NOT shut down Proxmox hosts** - intentionally left running
**Why not shut down hosts?**
- BIOS configured to "Restore on AC Power Loss"
- When power returns, servers auto-boot and start VMs in order
- Avoids need for manual intervention
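
The script waits a fixed 60 seconds in step 3. One possible variation, shown here only as a sketch and not as the installed script, is to poll until no guests are running or a timeout expires. `wait_until_stopped` and `running_vms` are hypothetical helpers.

```bash
#!/bin/bash
# Sketch only: poll until a "list running guests" command prints nothing,
# or give up after a timeout. Not part of the installed ups-shutdown.sh.

# wait_until_stopped TIMEOUT_S INTERVAL_S CMD [ARGS...]
# CMD must print one running guest ID per line, and nothing when all are stopped.
# Exit status 0 = everything stopped, 1 = timeout reached.
wait_until_stopped() {
    local timeout=$1 interval=$2
    shift 2
    local waited=0
    while [ -n "$("$@")" ]; do
        if (( waited >= timeout )); then
            return 1
        fi
        sleep "$interval"
        waited=$(( waited + interval ))
    done
    return 0
}

# Running VM IDs except TrueNAS (VM 100), same filter as the script above.
running_vms() {
    qm list | awk '$3=="running" && $1!=100 {print $1}'
}
```

In the script, `wait_until_stopped 60 5 running_vms` could stand in for `sleep 60`, returning early when the guests stop sooner and still capping the wait at 60 seconds.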
---
## Power Failure Behavior
### When Power Fails
1. **UPS switches to battery** (`OB DISCHRG` status)
2. **NUT monitors runtime** - polls every 5 seconds
3. **At 120 seconds (2 min) remaining**:
- NUT triggers `/usr/local/bin/ups-shutdown.sh` on both servers
- Script gracefully stops all VMs/CTs
- TrueNAS stopped last (storage integrity)
4. **Hosts remain running** until UPS battery depletes
5. **UPS battery dies** → Hosts lose power (ungraceful but safe - VMs already stopped)
### When Power Returns
1. **UPS charges battery**, power returns to servers
2. **BIOS "Restore on AC Power Loss"** boots both servers
3. **Proxmox starts** and auto-starts VMs in configured order:
| Order | Wait | VMs/CTs | Reason |
|-------|------|---------|--------|
| 1 | 30s | TrueNAS (VM 100) | Storage must start first |
| 2 | 60s | Saltbox (VM 101) | Depends on TrueNAS NFS |
| 3 | 10s | fs-dev, homeassistant, lmdev1, copyparty, docker-host | General VMs |
| 4 | 5s | pihole, traefik, findshyt | Containers |
PVE2 VMs: order=1, wait=10s
**Total recovery time**: ~7 minutes from power restoration to fully operational (tested 2025-12-21)
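
The startup table maps onto Proxmox's per-guest `onboot` and `startup` options. The sketch below only prints the commands rather than running them, and it assumes the "Wait" column corresponds to the `up` delay; `startup_cmd` is a hypothetical helper.

```bash
#!/bin/bash
# Sketch only: build the qm/pct command that sets boot order and delay.
# Assumes the table's "Wait" column is the `up` delay in seconds.

# startup_cmd TOOL GUEST_ID ORDER UP_DELAY_S
# TOOL is "qm" for VMs or "pct" for containers.
startup_cmd() {
    local tool=$1 id=$2 order=$3 up=$4
    echo "$tool set $id --onboot 1 --startup order=${order},up=${up}"
}
```

For instance, `startup_cmd qm 100 1 30` prints the command for TrueNAS and `startup_cmd qm 101 2 60` the one for Saltbox; review the output before running it on the host.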
---
## UPS Status Codes
| Code | Meaning | Action |
|------|---------|--------|
| `OL` | Online (AC power) | Normal operation |
| `OB` | On Battery | Power outage - monitor runtime |
| `LB` | Low Battery | <2 min remaining - shutdown imminent |
| `CHRG` | Charging | Battery charging after power restored |
| `DISCHRG` | Discharging | On battery, draining |
| `FSD` | Forced Shutdown | NUT triggered shutdown |
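
`ups.status` usually carries several of these codes at once (for example `OB DISCHRG`). A minimal sketch of how such a string can be interpreted, following the rule that `upsmon` treats a UPS as critical when it is on battery and low, or when `FSD` is set; `has_flag` and `should_shutdown` are hypothetical names.

```bash
#!/bin/bash
# Sketch only: interpret a space-separated ups.status string.

# has_flag "STATUS STRING" FLAG -> exit 0 if FLAG is one of the tokens
has_flag() {
    local status=$1 flag=$2 token
    for token in $status; do
        [ "$token" = "$flag" ] && return 0
    done
    return 1
}

# should_shutdown "STATUS STRING" -> exit 0 on FSD, or on OB together with LB
should_shutdown() {
    local status=$1
    has_flag "$status" FSD && return 0
    has_flag "$status" OB && has_flag "$status" LB
}
```

`should_shutdown "OB DISCHRG LB"` succeeds, while `should_shutdown "OB DISCHRG"` and `should_shutdown "OL CHRG"` do not.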
---
## Monitoring & Commands
### Check UPS Status
```bash
# Full status
ssh pve 'upsc cyberpower@localhost'
# Key metrics only
ssh pve 'upsc cyberpower@localhost | grep -E "battery.charge:|battery.runtime:|ups.load:|ups.status:"'
# Example output:
# battery.charge: 100
# battery.runtime: 1234 (seconds remaining)
# ups.load: 33 (% load)
# ups.status: OL (online)
```
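
For scripting, a single value can be pulled out of the `upsc` output. The sketch below parses the `key: value` lines from standard input; `upsc_value` is a hypothetical helper. `upsc` also accepts a variable name as a second argument (as used in the troubleshooting section), which is simpler when only one value is needed.

```bash
#!/bin/bash
# Sketch only: extract one variable from `upsc` output read on stdin.

# upsc_value KEY -> prints the value for KEY, or nothing if KEY is absent
upsc_value() {
    awk -F': ' -v key="$1" '$1 == key { print $2; exit }'
}
```

Usage: `ssh pve 'upsc cyberpower@localhost' | upsc_value battery.runtime`.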
### Control UPS Beeper
```bash
# Mute beeper (temporary - until next power event)
ssh pve 'upscmd -u admin -p upsadmin123 cyberpower@localhost beeper.mute'
# Disable beeper (permanent)
ssh pve 'upscmd -u admin -p upsadmin123 cyberpower@localhost beeper.disable'
# Enable beeper
ssh pve 'upscmd -u admin -p upsadmin123 cyberpower@localhost beeper.enable'
```
### Test Shutdown Procedure
**Simulate low battery** (careful - this will shut down VMs!):
```bash
# Raise the low-runtime threshold so shutdown triggers earlier
# (only takes effect while the UPS is on battery)
ssh pve 'upsrw -s battery.runtime.low=300 -u admin -p upsadmin123 cyberpower@localhost'
# Watch it trigger (when runtime drops below 300 seconds on battery)
ssh pve 'tail -f /var/log/ups-shutdown.log'
# Reset to normal
ssh pve 'upsrw -s battery.runtime.low=120 -u admin -p upsadmin123 cyberpower@localhost'
```
**Better test**: Run the shutdown script manually without triggering NUT (this still stops all running VMs/CTs on the host):
```bash
ssh pve '/usr/local/bin/ups-shutdown.sh'
```
---
## Home Assistant Integration
UPS metrics are exposed to Home Assistant via NUT integration.
### Available Sensors
| Entity ID | Description |
|-----------|-------------|
| `sensor.cyberpower_battery_charge` | Battery % (0-100) |
| `sensor.cyberpower_battery_runtime` | Seconds remaining on battery |
| `sensor.cyberpower_load` | Load % (0-100) |
| `sensor.cyberpower_input_voltage` | Input voltage (V AC) |
| `sensor.cyberpower_output_voltage` | Output voltage (V AC) |
| `sensor.cyberpower_status` | Status text (OL, OB, LB, etc.) |
### Configuration
**Home Assistant**: See [HOMEASSISTANT.md](HOMEASSISTANT.md) for integration setup.
### Example Automations
**Send notification when on battery**:
```yaml
automation:
  - alias: "UPS On Battery Alert"
    trigger:
      - platform: state
        entity_id: sensor.cyberpower_status
        to: "OB"
    action:
      - service: notify.mobile_app
        data:
          message: "⚠️ Power outage! UPS on battery. Runtime: {{ states('sensor.cyberpower_battery_runtime') }}s"
```
**Alert when battery low**:
```yaml
automation:
  - alias: "UPS Low Battery Alert"
    trigger:
      - platform: numeric_state
        entity_id: sensor.cyberpower_battery_runtime
        below: 300
    action:
      - service: notify.mobile_app
        data:
          message: "🚨 UPS battery low! {{ states('sensor.cyberpower_battery_runtime') }}s remaining"
```
---
## Testing Results
### Full Power Failure Test (2025-12-21)
Complete end-to-end test of power failure and recovery:
| Event | Time | Duration | Notes |
|-------|------|----------|-------|
| **Power pulled** | 22:30 | - | UPS on battery, ~15 min runtime at 33% load |
| **Low battery trigger** | 22:40:38 | +10:38 | Runtime < 120s, shutdown script ran |
| **All VMs stopped** | 22:41:36 | +0:58 | Graceful shutdown completed |
| **UPS died** | 22:46:29 | +4:53 | Hosts lost power at 0% battery |
| **Power restored** | ~22:47 | - | Plugged back in |
| **PVE online** | 22:49:11 | +2:11 | BIOS boot, Proxmox started |
| **PVE2 online** | 22:50:47 | +3:47 | BIOS boot, Proxmox started |
| **All VMs running** | 22:53:39 | +6:39 | Auto-started in correct order |
| **Total recovery** | - | **~7 min** | From power return to fully operational |
**Results**:
✅ VMs shut down gracefully
✅ Hosts remained running until UPS died (as intended)
✅ Auto-boot on power restoration worked
✅ VMs started in correct order with appropriate delays
✅ No data corruption or issues
**Runtime calculation**:
- Load: ~33% (440W estimated)
- Total runtime on battery: ~16.5 minutes (22:30 → 22:46:29)
- Matches manufacturer estimate for 33% load
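
The durations in the test table can be rechecked from the timestamps. A minimal sketch, assuming both times fall on the same day; `to_secs` and `elapsed` are hypothetical helper names.

```bash
#!/bin/bash
# Sketch only: elapsed time between two same-day timestamps.

# to_secs HH:MM[:SS] -> seconds since midnight
to_secs() {
    local h m s
    IFS=: read -r h m s <<< "$1"
    # 10# forces base 10 so values like "08" are not read as octal
    echo $(( 10#$h * 3600 + 10#$m * 60 + 10#${s:-0} ))
}

# elapsed START END -> prints M:SS between the two times
elapsed() {
    local d=$(( $(to_secs "$2") - $(to_secs "$1") ))
    printf '%d:%02d\n' $(( d / 60 )) $(( d % 60 ))
}
```

`elapsed 22:30 22:46:29` gives the total time on battery, and `elapsed 22:40:38 22:41:36` the time the shutdown script needed.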
---
## Proxmox Cluster Quorum Fix
### Problem
With a 2-node cluster, if one node goes down, the other loses quorum and can't manage VMs.
During UPS testing, this would prevent the remaining node from starting VMs after power restoration.
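
The quorum problem is plain majority arithmetic. The sketch below assumes one vote per node and no special quorum options; `quorum_needed` and `has_quorum` are hypothetical names.

```bash
#!/bin/bash
# Sketch only: majority quorum with one vote per node (an assumption).

# quorum_needed TOTAL_NODES -> votes required for a majority
quorum_needed() {
    echo $(( $1 / 2 + 1 ))
}

# has_quorum TOTAL_NODES ONLINE_NODES -> exit 0 if the online nodes hold a majority
has_quorum() {
    (( $2 >= $(quorum_needed "$1") ))
}
```

With two nodes, `quorum_needed 2` is 2, so `has_quorum 2 1` fails: a single surviving node cannot reach a majority on its own.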
### Solution
Modified `/etc/pve/corosync.conf` to enable 2-node mode (increment `config_version` in the `totem` section whenever this file is edited):
```
quorum {
  provider: corosync_votequorum
  two_node: 1
}
```
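**Effect**:
- The cluster stays quorate when one node goes down, so the surviving node can still manage VMs
- No more waiting for quorum when one server drops offline while the cluster is running
- Caveat: `two_node: 1` enables `wait_for_all` by default, so after a full outage both nodes must see each other once before the cluster becomes quorate
- Both nodes visible in single Proxmox interface when both up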
**Applied**: 2025-12-21
---
## Maintenance
### Monthly Checks
```bash
# Check UPS status
ssh pve 'upsc cyberpower@localhost'
# Check NUT server running
ssh pve 'systemctl status nut-server'
ssh pve 'systemctl status nut-monitor'
# Check NUT client running (PVE2)
ssh pve2 'systemctl status nut-monitor'
# Verify PVE2 can see UPS
ssh pve2 'upsc cyberpower@10.10.10.120'
# Check logs for errors
ssh pve 'journalctl -u nut-server -n 50'
ssh pve 'journalctl -u nut-monitor -n 50'
```
### Battery Health
**Check battery stats**:
```bash
ssh pve 'upsc cyberpower@localhost | grep battery'
# Key metrics:
# battery.charge: 100 (should be near 100 when on AC)
# battery.runtime: 1200+ (seconds at current load)
# battery.voltage: ~24V (normal for 24V battery system)
```
**Battery replacement**: When runtime significantly decreases or UPS reports `REPLBATT`:
```bash
ssh pve 'upsc cyberpower@localhost | grep battery.mfr.date'
```
CyberPower batteries typically last 3-5 years.
### Firmware Updates
Check CyberPower website for firmware updates:
https://www.cyberpowersystems.com/support/firmware/
---
## Troubleshooting
### UPS Not Detected
```bash
# Check USB connection
ssh pve 'lsusb | grep Cyber'
# Expected:
# Bus 001 Device 003: ID 0764:0501 Cyber Power System, Inc. CP1500 AVR UPS
# Restart NUT driver
ssh pve 'systemctl restart nut-driver'
ssh pve 'systemctl status nut-driver'
```
### PVE2 Can't Connect
```bash
# Verify NUT server listening
ssh pve 'netstat -tuln | grep 3493'
# Should show:
# tcp 0 0 10.10.10.120:3493 0.0.0.0:* LISTEN
# Test connection from PVE2
ssh pve2 'telnet 10.10.10.120 3493'
# Check firewall (should allow port 3493)
ssh pve 'iptables -L -n | grep 3493'
```
### Shutdown Script Not Running
```bash
# Check script permissions
ssh pve 'ls -la /usr/local/bin/ups-shutdown.sh'
# Should be: -rwxr-xr-x (executable)
# Check logs
ssh pve 'cat /var/log/ups-shutdown.log'
# Test script manually
ssh pve '/usr/local/bin/ups-shutdown.sh'
```
### UPS Status Shows UNKNOWN
```bash
# Driver may not be compatible
ssh pve 'upsc cyberpower@localhost ups.status'
# Try different driver (in /etc/nut/ups.conf)
# driver = usbhid-ups
# or
# driver = blazer_usb
# Restart after change
ssh pve 'systemctl restart nut-driver nut-server'
```
---
## Future Improvements
- [ ] Add email alerts for UPS events (power fail, low battery)
- [ ] Log runtime statistics to track battery degradation
- [ ] Set up Grafana dashboard for UPS metrics
- [ ] Test battery runtime at different load levels
- [ ] Upgrade to 20A circuit, restore original 5-20P plug
- [ ] Consider adding network management card for out-of-band UPS access
---
## Related Documentation
- [POWER-MANAGEMENT.md](POWER-MANAGEMENT.md) - Overall power optimization
- [VMS.md](VMS.md) - VM startup order configuration
- [HOMEASSISTANT.md](HOMEASSISTANT.md) - UPS sensor integration
---
**Last Updated**: 2025-12-22