UPS and Power Management
Documentation for UPS (Uninterruptible Power Supply) configuration, NUT (Network UPS Tools) monitoring, and power failure procedures.
Hardware
Current UPS
| Specification | Value |
|---|---|
| Model | CyberPower OR2200PFCRT2U |
| Capacity | 2200VA / 1320W |
| Form Factor | 2U rackmount |
| Output | PFC Sinewave (compatible with active PFC PSUs) |
| Outlets | 2x NEMA 5-20R + 6x NEMA 5-15R (all battery + surge) |
| Input Plug | ⚠️ Originally NEMA 5-20P (20A), rewired to 5-15P (15A) |
| Runtime | ~15-20 min at typical load (~33% / 440W) |
| Installed | 2025-12-21 |
| Status | Active |
⚠️ Temporary Wiring Modification
- Issue: UPS came with a NEMA 5-20P plug (20A), but the server rack is on a 15A circuit
- Solution: Temporarily rewired the plug from 5-20P → 5-15P for compatibility
- Risk: UPS can output 1320W, but the circuit is limited to 1800W max (15A × 120V)
- Current draw: ~1000-1350W total (safe margin)
- Backlog: Upgrade to 20A circuit, restore original 5-20P plug
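The circuit limit above is plain amps × volts arithmetic. As a rough sketch, the helper below (`circuit_watts` is a hypothetical name) reproduces the 1800W figure and also shows what remains if an 80% continuous-load margin is applied. That margin is a commonly used guideline and an assumption here, not something this setup documents.

```shell
# Hypothetical helper: watts available on a circuit.
# $1 = breaker amps, $2 = volts, $3 = usable percentage (default 100 = absolute limit)
circuit_watts() {
    echo $(( $1 * $2 * ${3:-100} / 100 ))
}

circuit_watts 15 120       # absolute limit of the current 15A circuit
circuit_watts 15 120 80    # with an assumed 80% continuous-load margin
circuit_watts 20 120 80    # same margin after the planned 20A upgrade
```

With the assumed margin, the current circuit leaves about 1440W, which is close to the upper end of the estimated draw.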
Previous UPS
| Model | Capacity | Issue | Replaced |
|---|---|---|---|
| WattBox WB-1100-IPVMB-6 | 1100VA / 660W | Insufficient for dual Threadripper setup | 2025-12-21 |
Why replaced: Combined server load of 1000-1350W exceeded 660W capacity.
Power Draw Estimates
Typical Load
| Component | Idle | Load | Notes |
|---|---|---|---|
| PVE Server | 250-350W | 500-750W | CPU + TITAN RTX + P2000 + storage |
| PVE2 Server | 200-300W | 450-600W | CPU + RTX A6000 + storage |
| Network gear | ~50W | ~50W | Router, switches |
| Total | 500-700W | 1000-1400W | Varies by workload |
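UPS Load: ~33% at the measured typical load (440W). The estimates above correspond to roughly 38-53% of the 1320W rating at idle and 76-106% at full load, so both servers peaking together could exceed the UPS rating.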
Runtime Calculation
- At 440W load (33%): ~15-20 min runtime (tested 2025-12-21)
- At 660W load (50%): ~10-12 min estimated
- At 1000W load (75%): ~6-8 min estimated
NUT shutdown trigger: 120 seconds (2 min) remaining runtime
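The load percentages above are the load in watts divided by the UPS's 1320W rating. A minimal sketch of that conversion follows; `load_pct` is a hypothetical helper name, and the rounding to the nearest percent is a choice made for this example.

```shell
# Hypothetical helper: express a load in watts as a percentage of UPS capacity.
# The default of 1320 is the OR2200PFCRT2U's rated wattage from the hardware table.
load_pct() {
    watts=$1
    capacity=${2:-1320}
    # Integer arithmetic, rounded to the nearest percent
    echo $(( (watts * 100 + capacity / 2) / capacity ))
}

load_pct 440     # typical load
load_pct 660
load_pct 1000
```

For 1000W this gives 76, which the list above rounds to 75%.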
NUT (Network UPS Tools) Configuration
Architecture
UPS (USB) ──> PVE (NUT Server/Master) ──> PVE2 (NUT Client/Slave)
              │
              └──> Home Assistant (monitoring only)
- Master: PVE (10.10.10.120) - UPS connected via USB, runs NUT server
- Slave: PVE2 (10.10.10.102) - Monitors PVE's NUT server, shuts down when triggered
NUT Server Configuration (PVE)
1. UPS Driver Config: /etc/nut/ups.conf
[cyberpower]
driver = usbhid-ups
port = auto
desc = "CyberPower OR2200PFCRT2U"
override.battery.charge.low = 20
override.battery.runtime.low = 120
Key settings:
- driver = usbhid-ups: USB HID UPS driver (generic for CyberPower)
- port = auto: Auto-detect USB device
- override.battery.runtime.low = 120: Trigger shutdown at 120 seconds (2 min) remaining
2. NUT Server Config: /etc/nut/upsd.conf
LISTEN 127.0.0.1 3493
LISTEN 10.10.10.120 3493
Listens on:
- Localhost (for local monitoring)
- LAN IP (for PVE2 to connect)
3. User Config: /etc/nut/upsd.users
[admin]
password = upsadmin123
actions = SET
instcmds = ALL
[upsmon]
password = upsmon123
upsmon master
Users:
- admin: Full control, can set variables and run commands
- upsmon: Used by upsmon on both PVE (as master) and PVE2 (as slave)
4. Monitor Config: /etc/nut/upsmon.conf
MONITOR cyberpower@localhost 1 upsmon upsmon123 master
MINSUPPLIES 1
SHUTDOWNCMD "/usr/local/bin/ups-shutdown.sh"
NOTIFYCMD /usr/sbin/upssched
POLLFREQ 5
POLLFREQALERT 5
HOSTSYNC 15
DEADTIME 15
POWERDOWNFLAG /etc/killpower
NOTIFYMSG ONLINE "UPS %s on line power"
NOTIFYMSG ONBATT "UPS %s on battery"
NOTIFYMSG LOWBATT "UPS %s battery is low"
NOTIFYMSG FSD "UPS %s: forced shutdown in progress"
NOTIFYMSG COMMOK "Communications with UPS %s established"
NOTIFYMSG COMMBAD "Communications with UPS %s lost"
NOTIFYMSG SHUTDOWN "Auto logout and shutdown proceeding"
NOTIFYMSG REPLBATT "UPS %s battery needs to be replaced"
NOTIFYMSG NOCOMM "UPS %s is unavailable"
NOTIFYMSG NOPARENT "upsmon parent process died - shutdown impossible"
NOTIFYFLAG ONLINE SYSLOG+WALL
NOTIFYFLAG ONBATT SYSLOG+WALL
NOTIFYFLAG LOWBATT SYSLOG+WALL
NOTIFYFLAG FSD SYSLOG+WALL
NOTIFYFLAG COMMOK SYSLOG+WALL
NOTIFYFLAG COMMBAD SYSLOG+WALL
NOTIFYFLAG SHUTDOWN SYSLOG+WALL
NOTIFYFLAG REPLBATT SYSLOG+WALL
NOTIFYFLAG NOCOMM SYSLOG+WALL
NOTIFYFLAG NOPARENT SYSLOG
Key settings:
- MONITOR cyberpower@localhost 1 upsmon upsmon123 master: Monitor local UPS
- SHUTDOWNCMD "/usr/local/bin/ups-shutdown.sh": Custom shutdown script
- POLLFREQ 5: Check UPS every 5 seconds
5. USB Permissions: /etc/udev/rules.d/99-nut-ups.rules
SUBSYSTEM=="usb", ATTR{idVendor}=="0764", ATTR{idProduct}=="0501", MODE="0660", GROUP="nut"
Purpose: Ensure NUT can access USB UPS device
Apply rule:
udevadm control --reload-rules
udevadm trigger
NUT Client Configuration (PVE2)
Monitor Config: /etc/nut/upsmon.conf
MONITOR cyberpower@10.10.10.120 1 upsmon upsmon123 slave
MINSUPPLIES 1
SHUTDOWNCMD "/usr/local/bin/ups-shutdown.sh"
POLLFREQ 5
POLLFREQALERT 5
HOSTSYNC 15
DEADTIME 15
POWERDOWNFLAG /etc/killpower
# Same NOTIFYMSG and NOTIFYFLAG as PVE
Key difference: slave instead of master - monitors remote UPS on PVE
Custom Shutdown Script
/usr/local/bin/ups-shutdown.sh (Same on both PVE and PVE2)
#!/bin/bash
# Graceful VM/CT shutdown when UPS battery low
LOG="/var/log/ups-shutdown.log"
log() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') - $1" | tee -a "$LOG"
}
log "=== UPS Shutdown Triggered ==="
log "Battery low - initiating graceful shutdown of VMs/CTs"
# Get list of running VMs (skip TrueNAS, VM 100, for now)
VMS=$(qm list | awk '$3=="running" && $1!=100 {print $1}')
for VMID in $VMS; do
    log "Stopping VM $VMID..."
    qm shutdown "$VMID"
done
# Get list of running containers
CTS=$(pct list | awk '$2=="running" {print $1}')
for CTID in $CTS; do
    log "Stopping CT $CTID..."
    pct shutdown "$CTID"
done
# Wait for VMs/CTs to stop
log "Waiting 60 seconds for VMs/CTs to shut down..."
sleep 60
# Now stop TrueNAS (storage - must be last)
if qm status 100 | grep -q running; then
    log "Stopping TrueNAS (VM 100) last..."
    qm shutdown 100
    sleep 30
fi
log "All VMs/CTs stopped. Host will remain running until UPS dies."
log "=== UPS Shutdown Complete ==="
Make executable:
chmod +x /usr/local/bin/ups-shutdown.sh
Script behavior:
- Stops all VMs (except TrueNAS)
- Stops all containers
- Waits 60 seconds
- Stops TrueNAS last (storage must be cleanly unmounted)
- Does NOT shut down Proxmox hosts - intentionally left running
Why not shut down hosts?
- BIOS configured to "Restore on AC Power Loss"
- When power returns, servers auto-boot and start VMs in order
- Avoids need for manual intervention
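The VM selection step of the shutdown script can be tried against sample text without touching real VMs. The sketch below wraps the same awk filter in a hypothetical function, `running_vms_except`. The sample input only imitates the column layout of `qm list`, and its VM names and values are made up.

```shell
# Print the VMIDs of running VMs from `qm list` output on stdin,
# skipping the header line and the VMID given as $1 (TrueNAS = 100 here).
running_vms_except() {
    awk -v skip="$1" 'NR > 1 && $3 == "running" && $1 != skip { print $1 }'
}

# Sample input shaped like `qm list` output (names and values are made up)
sample='      VMID NAME                 STATUS     MEM(MB)    BOOTDISK(GB) PID
       100 truenas              running    16384             32.00 1234
       101 saltbox              running    8192              64.00 2345
       110 testvm               stopped    2048              16.00 0'

printf '%s\n' "$sample" | running_vms_except 100
```

On a host, the same function could be fed with `qm list | running_vms_except 100` to preview which VMs the script would stop first.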
Power Failure Behavior
When Power Fails
- UPS switches to battery (OB DISCHRG status)
- NUT monitors runtime - polls every 5 seconds
- At 120 seconds (2 min) remaining:
  - NUT triggers /usr/local/bin/ups-shutdown.sh on both servers
  - Script gracefully stops all VMs/CTs
  - TrueNAS stopped last (storage integrity)
- Hosts remain running until UPS battery depletes
- UPS battery dies → Hosts lose power (ungraceful but safe - VMs already stopped)
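The sequence above can be summarized as a small decision function. This is only an illustrative model using the thresholds from ups.conf (120 seconds runtime, 20% charge). In a real setup the low-battery flag comes from the UPS and driver, and upsmon reacts to it, so the exact comparison may differ. `power_state` is a hypothetical name.

```shell
# Illustrative model only: mirrors the documented thresholds, not NUT internals.
# $1 = ups.status, $2 = battery.runtime (seconds), $3 = battery.charge (%)
power_state() {
    case " $1 " in
        *" OL "*) echo "online"; return ;;
    esac
    # On battery: shutdown once either configured low threshold is reached
    if [ "$2" -le 120 ] || [ "$3" -le 20 ]; then
        echo "shutdown"
    else
        echo "on-battery"
    fi
}

power_state "OL CHRG" 1234 100
power_state "OB DISCHRG" 600 70
power_state "OB DISCHRG" 118 25
```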
When Power Returns
- UPS charges battery, power returns to servers
- BIOS "Restore on AC Power Loss" boots both servers
- Proxmox starts and auto-starts VMs in configured order:
| Order | Wait | VMs/CTs | Reason |
|---|---|---|---|
| 1 | 30s | TrueNAS (VM 100) | Storage must start first |
| 2 | 60s | Saltbox (VM 101) | Depends on TrueNAS NFS |
| 3 | 10s | fs-dev, homeassistant, lmdev1, copyparty, docker-host | General VMs |
| 4 | 5s | pihole, traefik, findshyt | Containers |
PVE2 VMs: order=1, wait=10s
Total recovery time: ~7 minutes from power restoration to fully operational (tested 2025-12-21)
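For reference, Proxmox stores startup ordering per guest via the `--startup` option of `qm set` (and `pct set` for containers). The dry-run sketch below only prints the commands for the two VMs whose IDs are given in the table. Mapping the table's "Wait" column to the `up=` delay is an assumption, and `startup_cmd` is a hypothetical helper.

```shell
# Dry-run helper: print (not execute) the command that would set the
# startup order and delay for a VM.
# $1 = VMID, $2 = order, $3 = delay in seconds before the next guest starts
startup_cmd() {
    echo "qm set $1 --startup order=$2,up=$3"
}

startup_cmd 100 1 30    # TrueNAS
startup_cmd 101 2 60    # Saltbox
```

The actual configuration for this setup is described in VMS.md.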
UPS Status Codes
| Code | Meaning | Action |
|---|---|---|
| OL | Online (AC power) | Normal operation |
| OB | On Battery | Power outage - monitor runtime |
| LB | Low Battery | <2 min remaining - shutdown imminent |
| CHRG | Charging | Battery charging after power restored |
| DISCHRG | Discharging | On battery, draining |
| FSD | Forced Shutdown | NUT triggered shutdown |
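Because ups.status can carry several codes at once (for example OB DISCHRG), a small lookup can make log output easier to read. A minimal sketch follows; `describe_status` is a hypothetical helper that covers only the codes in the table above.

```shell
# Translate a ups.status value (space-separated codes) into the
# meanings from the table above; other codes are passed through.
describe_status() {
    for code in $1; do
        case "$code" in
            OL)      echo "OL: Online (AC power)" ;;
            OB)      echo "OB: On Battery" ;;
            LB)      echo "LB: Low Battery" ;;
            CHRG)    echo "CHRG: Charging" ;;
            DISCHRG) echo "DISCHRG: Discharging" ;;
            FSD)     echo "FSD: Forced Shutdown" ;;
            *)       echo "$code: (not in table)" ;;
        esac
    done
}

describe_status "OB DISCHRG"
```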
Monitoring & Commands
Check UPS Status
# Full status
ssh pve 'upsc cyberpower@localhost'
# Key metrics only
ssh pve 'upsc cyberpower@localhost | grep -E "battery.charge:|battery.runtime:|ups.load:|ups.status:"'
# Example output:
# battery.charge: 100
# battery.runtime: 1234 (seconds remaining)
# ups.load: 33 (% load)
# ups.status: OL (online)
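For a quick glance, the key metrics can be condensed into one line. The following is a sketch, not part of the current setup: `ups_summary` is a hypothetical function that reads `upsc` output on stdin, and the sample input reuses the example values above.

```shell
# Condense `upsc` output (read from stdin) into a one-line summary.
ups_summary() {
    awk -F': ' '
        $1 == "battery.charge"  { charge = $2 }
        $1 == "battery.runtime" { runtime = $2 }
        $1 == "ups.load"        { load = $2 }
        $1 == "ups.status"      { status = $2 }
        END {
            # Runtime is reported in seconds; show it as minutes and seconds
            printf "status=%s charge=%s%% load=%s%% runtime=%dm%02ds\n",
                   status, charge, load, int(runtime / 60), runtime % 60
        }'
}

# Sample values taken from the example output above
printf 'battery.charge: 100\nbattery.runtime: 1234\nups.load: 33\nups.status: OL\n' | ups_summary
```

Against the live UPS this would be run as `ssh pve 'upsc cyberpower@localhost' | ups_summary`.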
Control UPS Beeper
# Mute beeper (temporary - until next power event)
ssh pve 'upscmd -u admin -p upsadmin123 cyberpower@localhost beeper.mute'
# Disable beeper (permanent)
ssh pve 'upscmd -u admin -p upsadmin123 cyberpower@localhost beeper.disable'
# Enable beeper
ssh pve 'upscmd -u admin -p upsadmin123 cyberpower@localhost beeper.enable'
Test Shutdown Procedure
Simulate low battery (careful - this will shut down VMs!):
# Set a very high low battery threshold to trigger shutdown
ssh pve 'upsrw -s battery.runtime.low=300 -u admin -p upsadmin123 cyberpower@localhost'
# Watch it trigger (when runtime drops below 300 seconds)
ssh pve 'tail -f /var/log/ups-shutdown.log'
# Reset to normal
ssh pve 'upsrw -s battery.runtime.low=120 -u admin -p upsadmin123 cyberpower@localhost'
Better test: Run the shutdown script manually, bypassing NUT (this still stops every VM/CT on the host):
ssh pve '/usr/local/bin/ups-shutdown.sh'
Home Assistant Integration
UPS metrics are exposed to Home Assistant via NUT integration.
Available Sensors
| Entity ID | Description |
|---|---|
| sensor.cyberpower_battery_charge | Battery % (0-100) |
| sensor.cyberpower_battery_runtime | Seconds remaining on battery |
| sensor.cyberpower_load | Load % (0-100) |
| sensor.cyberpower_input_voltage | Input voltage (V AC) |
| sensor.cyberpower_output_voltage | Output voltage (V AC) |
| sensor.cyberpower_status | Status text (OL, OB, LB, etc.) |
Configuration
Home Assistant: See HOMEASSISTANT.md for integration setup.
Example Automations
Send notification when on battery:
automation:
  - alias: "UPS On Battery Alert"
    trigger:
      - platform: state
        entity_id: sensor.cyberpower_status
        to: "OB"
    action:
      - service: notify.mobile_app
        data:
          message: "⚠️ Power outage! UPS on battery. Runtime: {{ states('sensor.cyberpower_battery_runtime') }}s"
Alert when battery low:
automation:
  - alias: "UPS Low Battery Alert"
    trigger:
      - platform: numeric_state
        entity_id: sensor.cyberpower_battery_runtime
        below: 300
    action:
      - service: notify.mobile_app
        data:
          message: "🚨 UPS battery low! {{ states('sensor.cyberpower_battery_runtime') }}s remaining"
Testing Results
Full Power Failure Test (2025-12-21)
Complete end-to-end test of power failure and recovery:
| Event | Time | Duration | Notes |
|---|---|---|---|
| Power pulled | 22:30 | - | UPS on battery, ~15 min runtime at 33% load |
| Low battery trigger | 22:40:38 | +10:38 | Runtime < 120s, shutdown script ran |
| All VMs stopped | 22:41:36 | +0:58 | Graceful shutdown completed |
| UPS died | 22:46:29 | +4:53 | Hosts lost power at 0% battery |
| Power restored | ~22:47 | - | Plugged back in |
| PVE online | 22:49:11 | +2:11 | BIOS boot, Proxmox started |
| PVE2 online | 22:50:47 | +3:47 | BIOS boot, Proxmox started |
| All VMs running | 22:53:39 | +6:39 | Auto-started in correct order |
| Total recovery | - | ~7 min | From power return to fully operational |
Results:
- ✅ VMs shut down gracefully
- ✅ Hosts remained running until UPS died (as intended)
- ✅ Auto-boot on power restoration worked
- ✅ VMs started in correct order with appropriate delays
- ✅ No data corruption or issues
Runtime calculation:
- Load: ~33% (440W estimated)
- Total runtime on battery: ~16 minutes (22:30 → 22:46:29)
- Matches manufacturer estimate for 33% load
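The durations in the test table can be re-derived from the logged timestamps. A small sketch, assuming both timestamps fall on the same day (`to_seconds` and `elapsed` are hypothetical helper names):

```shell
# Seconds since midnight for an HH:MM:SS timestamp
to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Elapsed time between two same-day timestamps, printed as M:SS
elapsed() {
    start=$(to_seconds "$1")
    end=$(to_seconds "$2")
    diff=$(( end - start ))
    printf '%d:%02d\n' $(( diff / 60 )) $(( diff % 60 ))
}

elapsed 22:30:00 22:46:29    # power pulled to UPS died
elapsed 22:30:00 22:40:38    # power pulled to low-battery trigger
```

The first call yields 16:29, consistent with the ~16 minutes on battery noted above, and the second yields the +10:38 shown in the table.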
Proxmox Cluster Quorum Fix
Problem
With a 2-node cluster, if one node goes down, the other loses quorum and can't manage VMs.
During UPS testing, this would prevent the remaining node from starting VMs after power restoration.
Solution
Modified /etc/pve/corosync.conf to enable 2-node mode:
quorum {
  provider: corosync_votequorum
  two_node: 1
}
Effect:
- Either node can operate independently if the other is down
- No more waiting for quorum when one server is offline
- Both nodes visible in single Proxmox interface when both up
Applied: 2025-12-21
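To confirm that the setting is still present after later cluster changes, the config text can be checked with a short filter. This is a sketch only: `has_two_node` is a hypothetical function, and the sample input repeats the quorum block shown above.

```shell
# Succeeds (exit 0) if the corosync.conf text on stdin sets two_node: 1
has_two_node() {
    awk '$1 == "two_node:" && $2 == 1 { found = 1 } END { exit !found }'
}

# Sample input: the quorum block from this section
printf 'quorum {\n  provider: corosync_votequorum\n  two_node: 1\n}\n' \
    | has_two_node && echo "two_node enabled"
```

On a node this could be run as `has_two_node < /etc/pve/corosync.conf`.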
Maintenance
Monthly Checks
# Check UPS status
ssh pve 'upsc cyberpower@localhost'
# Check NUT server running
ssh pve 'systemctl status nut-server'
ssh pve 'systemctl status nut-monitor'
# Check NUT client running (PVE2)
ssh pve2 'systemctl status nut-monitor'
# Verify PVE2 can see UPS
ssh pve2 'upsc cyberpower@10.10.10.120'
# Check logs for errors
ssh pve 'journalctl -u nut-server -n 50'
ssh pve 'journalctl -u nut-monitor -n 50'
Battery Health
Check battery stats:
ssh pve 'upsc cyberpower@localhost | grep battery'
# Key metrics:
# battery.charge: 100 (should be near 100 when on AC)
# battery.runtime: 1200+ (seconds at current load)
# battery.voltage: ~24V (normal for 24V battery system)
Battery replacement: Replace the battery when runtime drops significantly or the UPS reports REPLBATT. To check the battery's manufacture date:
ssh pve 'upsc cyberpower@localhost | grep battery.mfr.date'
CyberPower batteries typically last 3-5 years.
Firmware Updates
Check CyberPower website for firmware updates: https://www.cyberpowersystems.com/support/firmware/
Troubleshooting
UPS Not Detected
# Check USB connection
ssh pve 'lsusb | grep Cyber'
# Expected:
# Bus 001 Device 003: ID 0764:0501 Cyber Power System, Inc. CP1500 AVR UPS
# Restart NUT driver
ssh pve 'systemctl restart nut-driver'
ssh pve 'systemctl status nut-driver'
PVE2 Can't Connect
# Verify NUT server listening
ssh pve 'netstat -tuln | grep 3493'
# Should show:
# tcp 0 0 10.10.10.120:3493 0.0.0.0:* LISTEN
# Test connection from PVE2
ssh pve2 'telnet 10.10.10.120 3493'
# Check firewall (should allow port 3493)
ssh pve 'iptables -L -n | grep 3493'
Shutdown Script Not Running
# Check script permissions
ssh pve 'ls -la /usr/local/bin/ups-shutdown.sh'
# Should be: -rwxr-xr-x (executable)
# Check logs
ssh pve 'cat /var/log/ups-shutdown.log'
# Test script manually
ssh pve '/usr/local/bin/ups-shutdown.sh'
UPS Status Shows UNKNOWN
# Driver may not be compatible
ssh pve 'upsc cyberpower@localhost ups.status'
# Try different driver (in /etc/nut/ups.conf)
# driver = usbhid-ups
# or
# driver = blazer_usb
# Restart after change
ssh pve 'systemctl restart nut-driver nut-server'
Future Improvements
- Add email alerts for UPS events (power fail, low battery)
- Log runtime statistics to track battery degradation
- Set up Grafana dashboard for UPS metrics
- Test battery runtime at different load levels
- Upgrade to 20A circuit, restore original 5-20P plug
- Consider adding network management card for out-of-band UPS access
Related Documentation
- POWER-MANAGEMENT.md - Overall power optimization
- VMS.md - VM startup order configuration
- HOMEASSISTANT.md - UPS sensor integration
Last Updated: 2025-12-22