# UPS and Power Management

Documentation for UPS (Uninterruptible Power Supply) configuration, NUT (Network UPS Tools) monitoring, and power failure procedures.

## Hardware

### Current UPS

| Specification | Value |
|---------------|-------|
| **Model** | CyberPower OR2200PFCRT2U |
| **Capacity** | 2200VA / 1320W |
| **Form Factor** | 2U rackmount |
| **Output** | PFC Sinewave (compatible with active PFC PSUs) |
| **Outlets** | 2x NEMA 5-20R + 6x NEMA 5-15R (all battery + surge) |
| **Input Plug** | ⚠️ Originally NEMA 5-20P (20A), **rewired to 5-15P (15A)** |
| **Runtime** | ~15-20 min at typical load (~33% / 440W) |
| **Installed** | 2025-12-21 |
| **Status** | Active |

### ⚠️ Temporary Wiring Modification

- **Issue**: The UPS shipped with a NEMA 5-20P plug (20A), but the server rack is on a 15A circuit.
- **Solution**: Temporarily rewired the plug from 5-20P → 5-15P for compatibility.
- **Risk**: The UPS can output 1320W, while the circuit is limited to 1800W max (15A × 120V), or 1440W at the usual 80% continuous-load derating. A UPS near full output that is also recharging its battery can approach that limit.
- **Current draw**: ~1000-1350W total under load. This is below the 1800W circuit limit, but the top of the range sits at the UPS's own 1320W rating.
- **Backlog**: Upgrade to a 20A circuit and restore the original 5-20P plug.

### Previous UPS

| Model | Capacity | Issue | Replaced |
|-------|----------|-------|----------|
| WattBox WB-1100-IPVMB-6 | 1100VA / 660W | Insufficient for dual Threadripper setup | 2025-12-21 |

**Why replaced**: The combined server load of 1000-1350W exceeded the 660W capacity.
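The capacity comparison above comes down to simple percentages. As a rough sketch (the helper name `load_pct` is hypothetical, and the wattages are the estimates already given in this document, not new measurements), integer shell arithmetic is enough to compare a load against a UPS watt rating:

```bash
#!/bin/bash
# Hypothetical helper: express a load in watts as a percentage of a
# UPS watt rating, rounded to the nearest whole percent.
load_pct() {
    local load_w=$1 capacity_w=$2
    echo $(( (load_w * 100 + capacity_w / 2) / capacity_w ))
}

# Figures taken from the hardware tables in this document.
echo "Typical load on OR2200PFCRT2U (440W of 1320W):  $(load_pct 440 1320)%"
echo "Peak estimate on old WattBox (1350W of 660W):   $(load_pct 1350 660)%"
echo "Peak estimate on OR2200PFCRT2U (1350W of 1320W): $(load_pct 1350 1320)%"
```

On these estimates the old unit would have been at roughly double its rating, while the current unit is near its rating at the top of the load range.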

---

## Power Draw Estimates

### Typical Load

| Component | Idle | Load | Notes |
|-----------|------|------|-------|
| PVE Server | 250-350W | 500-750W | CPU + TITAN RTX + P2000 + storage |
| PVE2 Server | 200-300W | 450-600W | CPU + RTX A6000 + storage |
| Network gear | ~50W | ~50W | Router, switches |
| **Total** | **500-700W** | **1000-1400W** | Varies by workload |

**UPS Load**: ~33-50% typical. The 1000-1400W load estimate works out to ~76-106% of the 1320W rating, so sustained heavy load can reach or exceed UPS capacity.

### Runtime Calculation

- At **440W load** (33%): ~15-20 min runtime (tested 2025-12-21)
- At **660W load** (50%): ~10-12 min estimated
- At **1000W load** (75%): ~6-8 min estimated

**NUT shutdown trigger**: 120 seconds (2 min) remaining runtime

---

## NUT (Network UPS Tools) Configuration

### Architecture

```
UPS (USB) ──> PVE (NUT Server/Master) ──> PVE2 (NUT Client/Slave)
                        │
                        └──> Home Assistant (monitoring only)
```

- **Master**: PVE (10.10.10.120) - UPS connected via USB, runs NUT server
- **Slave**: PVE2 (10.10.10.102) - Monitors PVE's NUT server, shuts down when triggered

### NUT Server Configuration (PVE)

#### 1. UPS Driver Config: `/etc/nut/ups.conf`

```ini
[cyberpower]
    driver = usbhid-ups
    port = auto
    desc = "CyberPower OR2200PFCRT2U"
    override.battery.charge.low = 20
    override.battery.runtime.low = 120
```

**Key settings**:
- `driver = usbhid-ups`: USB HID UPS driver (generic for CyberPower)
- `port = auto`: Auto-detect USB device
- `override.battery.runtime.low = 120`: Trigger shutdown at 120 seconds (2 min) remaining

**Note**: Per the `ups.conf` man page, the driver evaluates runtime against this override itself only when the `ignorelb` flag is also set. Without it, `LB` comes from the UPS's own low-battery flag, so the actual trigger point may differ from 120 seconds.

#### 2. NUT Server Config: `/etc/nut/upsd.conf`

```ini
LISTEN 127.0.0.1 3493
LISTEN 10.10.10.120 3493
```

**Listens on**:
- Localhost (for local monitoring)
- LAN IP (for PVE2 to connect)

#### 3. User Config: `/etc/nut/upsd.users`

```ini
[admin]
    password = upsadmin123
    actions = SET
    instcmds = ALL

[upsmon]
    password = upsmon123
    upsmon master
```

**Users**:
- `admin`: Full control, can set variables and run instant commands
- `upsmon`: Monitoring account used by `upsmon` on both PVE and PVE2

#### 4. Monitor Config: `/etc/nut/upsmon.conf`

```ini
MONITOR cyberpower@localhost 1 upsmon upsmon123 master

MINSUPPLIES 1
SHUTDOWNCMD "/usr/local/bin/ups-shutdown.sh"
NOTIFYCMD /usr/sbin/upssched
POLLFREQ 5
POLLFREQALERT 5
HOSTSYNC 15
DEADTIME 15
POWERDOWNFLAG /etc/killpower

NOTIFYMSG ONLINE    "UPS %s on line power"
NOTIFYMSG ONBATT    "UPS %s on battery"
NOTIFYMSG LOWBATT   "UPS %s battery is low"
NOTIFYMSG FSD       "UPS %s: forced shutdown in progress"
NOTIFYMSG COMMOK    "Communications with UPS %s established"
NOTIFYMSG COMMBAD   "Communications with UPS %s lost"
NOTIFYMSG SHUTDOWN  "Auto logout and shutdown proceeding"
NOTIFYMSG REPLBATT  "UPS %s battery needs to be replaced"
NOTIFYMSG NOCOMM    "UPS %s is unavailable"
NOTIFYMSG NOPARENT  "upsmon parent process died - shutdown impossible"

NOTIFYFLAG ONLINE   SYSLOG+WALL
NOTIFYFLAG ONBATT   SYSLOG+WALL
NOTIFYFLAG LOWBATT  SYSLOG+WALL
NOTIFYFLAG FSD      SYSLOG+WALL
NOTIFYFLAG COMMOK   SYSLOG+WALL
NOTIFYFLAG COMMBAD  SYSLOG+WALL
NOTIFYFLAG SHUTDOWN SYSLOG+WALL
NOTIFYFLAG REPLBATT SYSLOG+WALL
NOTIFYFLAG NOCOMM   SYSLOG+WALL
NOTIFYFLAG NOPARENT SYSLOG
```

**Key settings**:
- `MONITOR cyberpower@localhost 1 upsmon upsmon123 master`: Monitor local UPS
- `SHUTDOWNCMD "/usr/local/bin/ups-shutdown.sh"`: Custom shutdown script
- `POLLFREQ 5`: Check UPS every 5 seconds
- `NOTIFYCMD /usr/sbin/upssched`: Only invoked for events whose `NOTIFYFLAG` includes `EXEC`. None do here, so `upssched` is currently unused.

#### 5. USB Permissions: `/etc/udev/rules.d/99-nut-ups.rules`

```udev
SUBSYSTEM=="usb", ATTR{idVendor}=="0764", ATTR{idProduct}=="0501", MODE="0660", GROUP="nut"
```

**Purpose**: Ensure NUT can access the USB UPS device

**Apply rule**:

```bash
udevadm control --reload-rules
udevadm trigger
```

### NUT Client Configuration (PVE2)

#### Monitor Config: `/etc/nut/upsmon.conf`

```ini
MONITOR cyberpower@10.10.10.120 1 upsmon upsmon123 slave

MINSUPPLIES 1
SHUTDOWNCMD "/usr/local/bin/ups-shutdown.sh"
POLLFREQ 5
POLLFREQALERT 5
HOSTSYNC 15
DEADTIME 15
POWERDOWNFLAG /etc/killpower

# Same NOTIFYMSG and NOTIFYFLAG as PVE
```

**Key difference**: `slave` instead of `master` - monitors the remote UPS on PVE

---

## Custom Shutdown Script

### `/usr/local/bin/ups-shutdown.sh`

(Same on both PVE and PVE2)

```bash
#!/bin/bash
# Graceful VM/CT shutdown when UPS battery low

LOG="/var/log/ups-shutdown.log"

# Timestamped logging to stdout and the log file
log() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') - $1" | tee -a "$LOG"
}

log "=== UPS Shutdown Triggered ==="
log "Battery low - initiating graceful shutdown of VMs/CTs"

# Get list of running VMs (skip TrueNAS for now)
VMS=$(qm list | awk '$3=="running" && $1!=100 {print $1}')
for VMID in $VMS; do
    log "Stopping VM $VMID..."
    qm shutdown "$VMID"
done

# Get list of running containers
CTS=$(pct list | awk '$2=="running" {print $1}')
for CTID in $CTS; do
    log "Stopping CT $CTID..."
    pct shutdown "$CTID"
done

# Wait for VMs/CTs to stop
log "Waiting 60 seconds for VMs/CTs to shut down..."
sleep 60

# Now stop TrueNAS (storage - must be last)
if qm status 100 | grep -q running; then
    log "Stopping TrueNAS (VM 100) last..."
    qm shutdown 100
    sleep 30
fi

log "All VMs/CTs stopped. Host will remain running until UPS dies."
log "=== UPS Shutdown Complete ==="
```

**Make executable**:

```bash
chmod +x /usr/local/bin/ups-shutdown.sh
```

**Script behavior**:
1. Stops all VMs (except TrueNAS)
2. Stops all containers
3. Waits 60 seconds
4. Stops TrueNAS last (storage must be cleanly unmounted)
5. **Does NOT shut down Proxmox hosts** - intentionally left running

**Why not shut down hosts?**
- BIOS configured to "Restore on AC Power Loss"
- When power returns, servers auto-boot and start VMs in order
- Avoids need for manual intervention

---

## Power Failure Behavior

### When Power Fails

1. **UPS switches to battery** (`OB DISCHRG` status)
2. **NUT monitors runtime** - polls every 5 seconds
3. **At 120 seconds (2 min) remaining**:
   - NUT triggers `/usr/local/bin/ups-shutdown.sh` on both servers
   - Script gracefully stops all VMs/CTs
   - TrueNAS stopped last (storage integrity)
4. **Hosts remain running** until UPS battery depletes
5. **UPS battery dies** → Hosts lose power (ungraceful but safe - VMs already stopped)

### When Power Returns

1. **UPS charges battery**, power returns to servers
2. **BIOS "Restore on AC Power Loss"** boots both servers
3. **Proxmox starts** and auto-starts VMs in configured order:

| Order | Wait | VMs/CTs | Reason |
|-------|------|---------|--------|
| 1 | 30s | TrueNAS (VM 100) | Storage must start first |
| 2 | 60s | Saltbox (VM 101) | Depends on TrueNAS NFS |
| 3 | 10s | fs-dev, homeassistant, lmdev1, copyparty, docker-host | General VMs |
| 4 | 5s | pihole, traefik, findshyt | Containers |

PVE2 VMs: order=1, wait=10s

**Total recovery time**: ~7 minutes from power restoration to fully operational (tested 2025-12-21)

---

## UPS Status Codes

| Code | Meaning | Action |
|------|---------|--------|
| `OL` | Online (AC power) | Normal operation |
| `OB` | On Battery | Power outage - monitor runtime |
| `LB` | Low Battery | <2 min remaining - shutdown imminent |
| `CHRG` | Charging | Battery charging after power restored |
| `DISCHRG` | Discharging | On battery, draining |
| `FSD` | Forced Shutdown | NUT triggered shutdown |

---

## Monitoring & Commands

### Check UPS Status

```bash
# Full status
ssh pve 'upsc cyberpower@localhost'

# Key metrics only
ssh pve 'upsc cyberpower@localhost | grep -E "battery.charge:|battery.runtime:|ups.load:|ups.status:"'

# Example output:
# battery.charge: 100
# battery.runtime: 1234  (seconds remaining)
# ups.load: 33           (% load)
# ups.status: OL         (online)
```

### Control UPS Beeper

```bash
# Mute beeper (temporary - until next power event)
ssh pve 'upscmd -u admin -p upsadmin123 cyberpower@localhost beeper.mute'

# Disable beeper (permanent)
ssh pve 'upscmd -u admin -p upsadmin123 cyberpower@localhost beeper.disable'

# Enable beeper
ssh pve 'upscmd -u admin -p upsadmin123 cyberpower@localhost beeper.enable'
```

### Test Shutdown Procedure

**Simulate low battery** (careful - this will shut down VMs!):

```bash
# Raise the low-runtime threshold so shutdown triggers earlier while on battery
ssh pve 'upsrw -s battery.runtime.low=300 -u admin -p upsadmin123 cyberpower@localhost'

# Watch it trigger (when runtime drops below 300 seconds)
ssh pve 'tail -f /var/log/ups-shutdown.log'

# Reset to normal
ssh pve 'upsrw -s battery.runtime.low=120 -u admin -p upsadmin123 cyberpower@localhost'
```

**Better test**: Run the shutdown script manually without involving NUT. This still stops all VMs/CTs on the host:

```bash
ssh pve '/usr/local/bin/ups-shutdown.sh'
```

---

## Home Assistant Integration

UPS metrics are exposed to Home Assistant via the NUT integration.

### Available Sensors

| Entity ID | Description |
|-----------|-------------|
| `sensor.cyberpower_battery_charge` | Battery % (0-100) |
| `sensor.cyberpower_battery_runtime` | Seconds remaining on battery |
| `sensor.cyberpower_load` | Load % (0-100) |
| `sensor.cyberpower_input_voltage` | Input voltage (V AC) |
| `sensor.cyberpower_output_voltage` | Output voltage (V AC) |
| `sensor.cyberpower_status` | Status text (OL, OB, LB, etc.) |

### Configuration

**Home Assistant**: See [HOMEASSISTANT.md](HOMEASSISTANT.md) for integration setup.
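
NUT reports `ups.status` as a space-separated list of flags (for example `OB DISCHRG` during an outage), so the status text can carry more than one code at a time. Below is a minimal sketch of decoding such a string using the meanings from the status code table in this document. `decode_ups_status` is a hypothetical helper, not part of NUT or Home Assistant:

```bash
#!/bin/bash
# Hypothetical helper: print one line per NUT status flag, using the
# meanings from the "UPS Status Codes" table. Flags not in that table
# are reported as unknown rather than guessed at.
decode_ups_status() {
    local flag
    # Intentionally unquoted: split the status string on whitespace.
    for flag in $1; do
        case "$flag" in
            OL)      echo "Online (AC power)" ;;
            OB)      echo "On Battery" ;;
            LB)      echo "Low Battery" ;;
            CHRG)    echo "Charging" ;;
            DISCHRG) echo "Discharging" ;;
            FSD)     echo "Forced Shutdown" ;;
            *)       echo "Unknown flag: $flag" ;;
        esac
    done
}

# Example: the combined status seen when power fails
decode_ups_status "OB DISCHRG"
```

On the NUT server the live value could be passed in with `decode_ups_status "$(upsc cyberpower@localhost ups.status)"`.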

### Example Automations

**Send notification when on battery**:

```yaml
automation:
  - alias: "UPS On Battery Alert"
    trigger:
      - platform: state
        entity_id: sensor.cyberpower_status
        to: "OB"
    action:
      - service: notify.mobile_app
        data:
          message: "⚠️ Power outage! UPS on battery. Runtime: {{ states('sensor.cyberpower_battery_runtime') }}s"
```

**Note**: NUT reports status as space-separated flags (e.g. `OB DISCHRG`), so confirm the sensor's actual state value in Home Assistant before relying on the exact match `to: "OB"`.

**Alert when battery low**:

```yaml
automation:
  - alias: "UPS Low Battery Alert"
    trigger:
      - platform: numeric_state
        entity_id: sensor.cyberpower_battery_runtime
        below: 300
    action:
      - service: notify.mobile_app
        data:
          message: "🚨 UPS battery low! {{ states('sensor.cyberpower_battery_runtime') }}s remaining"
```

---

## Testing Results

### Full Power Failure Test (2025-12-21)

Complete end-to-end test of power failure and recovery:

| Event | Time | Duration | Notes |
|-------|------|----------|-------|
| **Power pulled** | 22:30 | - | UPS on battery, ~15 min runtime at 33% load |
| **Low battery trigger** | 22:40:38 | +10:38 | Runtime < 120s, shutdown script ran |
| **All VMs stopped** | 22:41:36 | +0:58 | Graceful shutdown completed |
| **UPS died** | 22:46:29 | +4:53 | Hosts lost power at 0% battery |
| **Power restored** | ~22:47 | - | Plugged back in |
| **PVE online** | 22:49:11 | +2:11 | BIOS boot, Proxmox started |
| **PVE2 online** | 22:50:47 | +3:47 | BIOS boot, Proxmox started |
| **All VMs running** | 22:53:39 | +6:39 | Auto-started in correct order |
| **Total recovery** | - | **~7 min** | From power return to fully operational |

Durations for the first four rows are measured from the previous event; the three "online/running" rows are measured from power restoration.

**Results**:
- ✅ VMs shut down gracefully
- ✅ Hosts remained running until UPS died (as intended)
- ✅ Auto-boot on power restoration worked
- ✅ VMs started in correct order with appropriate delays
- ✅ No data corruption or issues

**Runtime calculation**:
- Load: ~33% (440W estimated)
- Total runtime on battery: ~16 minutes (22:30 → 22:46:29)
- Matches manufacturer estimate for 33% load

---

## Proxmox Cluster Quorum Fix

### Problem

With a 2-node cluster, if one node goes down, the other loses quorum and can't manage VMs. During UPS testing, this would prevent the remaining node from starting VMs after power restoration.

### Solution

Modified `/etc/pve/corosync.conf` to enable 2-node mode:

```
quorum {
  provider: corosync_votequorum
  two_node: 1
}
```

**Effect**:
- Either node can operate independently if the other is down
- No more waiting for quorum when one server is offline
- Both nodes visible in single Proxmox interface when both up

**Caveat**: `two_node: 1` enables `wait_for_all` by default, so after a cold start a node becomes quorate only once it has seen its peer. If one server fails to boot after an outage, the other will keep waiting unless `wait_for_all: 0` is also set.

**Applied**: 2025-12-21

---

## Maintenance

### Monthly Checks

```bash
# Check UPS status
ssh pve 'upsc cyberpower@localhost'

# Check NUT server running
ssh pve 'systemctl status nut-server'
ssh pve 'systemctl status nut-monitor'

# Check NUT client running (PVE2)
ssh pve2 'systemctl status nut-monitor'

# Verify PVE2 can see UPS
ssh pve2 'upsc cyberpower@10.10.10.120'

# Check logs for errors
ssh pve 'journalctl -u nut-server -n 50'
ssh pve 'journalctl -u nut-monitor -n 50'
```

### Battery Health

**Check battery stats**:

```bash
ssh pve 'upsc cyberpower@localhost | grep battery'

# Key metrics:
# battery.charge: 100     (should be near 100 when on AC)
# battery.runtime: 1200+  (seconds at current load)
# battery.voltage: ~24V   (normal for 24V battery system)
```

**Battery replacement**: Replace when runtime drops significantly or the UPS reports `REPLBATT`. To check the battery manufacture date:

```bash
ssh pve 'upsc cyberpower@localhost | grep battery.mfr.date'
```

CyberPower batteries typically last 3-5 years.

### Firmware Updates

Check the CyberPower website for firmware updates: https://www.cyberpowersystems.com/support/firmware/

---

## Troubleshooting

### UPS Not Detected

```bash
# Check USB connection
ssh pve 'lsusb | grep Cyber'

# Expected:
# Bus 001 Device 003: ID 0764:0501 Cyber Power System, Inc. CP1500 AVR UPS

# Restart NUT driver
ssh pve 'systemctl restart nut-driver'
ssh pve 'systemctl status nut-driver'
```

### PVE2 Can't Connect

```bash
# Verify NUT server listening
ssh pve 'netstat -tuln | grep 3493'

# Should show:
# tcp 0 0 10.10.10.120:3493 0.0.0.0:* LISTEN

# Test connection from PVE2
ssh pve2 'telnet 10.10.10.120 3493'

# Check firewall (should allow port 3493)
ssh pve 'iptables -L -n | grep 3493'
```

### Shutdown Script Not Running

```bash
# Check script permissions
ssh pve 'ls -la /usr/local/bin/ups-shutdown.sh'
# Should be: -rwxr-xr-x (executable)

# Check logs
ssh pve 'cat /var/log/ups-shutdown.log'

# Test script manually (this stops running VMs/CTs)
ssh pve '/usr/local/bin/ups-shutdown.sh'
```

### UPS Status Shows UNKNOWN

```bash
# Driver may not be compatible
ssh pve 'upsc cyberpower@localhost ups.status'

# Try different driver (in /etc/nut/ups.conf)
# driver = usbhid-ups
# or
# driver = blazer_usb

# Restart after change
ssh pve 'systemctl restart nut-driver nut-server'
```

---

## Future Improvements

- [ ] Add email alerts for UPS events (power fail, low battery)
- [ ] Log runtime statistics to track battery degradation
- [ ] Set up Grafana dashboard for UPS metrics
- [ ] Test battery runtime at different load levels
- [ ] Upgrade to 20A circuit, restore original 5-20P plug
- [ ] Consider adding network management card for out-of-band UPS access

---

## Related Documentation

- [POWER-MANAGEMENT.md](POWER-MANAGEMENT.md) - Overall power optimization
- [VMS.md](VMS.md) - VM startup order configuration
- [HOMEASSISTANT.md](HOMEASSISTANT.md) - UPS sensor integration

---

**Last Updated**: 2025-12-22