homelab-docs/UPS.md


UPS and Power Management

Documentation for UPS (Uninterruptible Power Supply) configuration, NUT (Network UPS Tools) monitoring, and power failure procedures.

Hardware

Current UPS

| Specification | Value |
| --- | --- |
| Model | CyberPower OR2200PFCRT2U |
| Capacity | 2200VA / 1320W |
| Form Factor | 2U rackmount |
| Output | PFC Sinewave (compatible with active PFC PSUs) |
| Outlets | 2x NEMA 5-20R + 6x NEMA 5-15R (all battery + surge) |
| Input Plug | ⚠️ Originally NEMA 5-20P (20A), rewired to 5-15P (15A) |
| Runtime | ~15-20 min at typical load (~33% / 440W) |
| Installed | 2025-12-21 |
| Status | Active |

⚠️ Temporary Wiring Modification

Issue: UPS came with a NEMA 5-20P plug (20A), but the server rack is on a 15A circuit.
Solution: Temporarily rewired the plug from 5-20P → 5-15P for compatibility.
Risk: The UPS can deliver up to 1320W to its outlets, while the circuit is limited to 1800W max (15A × 120V). Input draw can rise above the output load while the battery recharges.
Current draw: ~1000-1350W total under load (below the 1800W circuit limit).
Backlog: Upgrade to a 20A circuit and restore the original 5-20P plug.
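
The circuit arithmetic above can be checked with a short sketch. It assumes a 120 V nominal supply and applies the common 80% guideline for continuous loads, which is an assumption added here and not something stated in this document. The function names are hypothetical.

```bash
#!/bin/bash
# Sketch: circuit capacity arithmetic for the rewired 15A plug.
# Assumptions (not from the original doc): 120 V nominal supply and the
# common 80% guideline for continuous loads.

circuit_watts() {        # usage: circuit_watts AMPS VOLTS
    echo $(( $1 * $2 ))
}

continuous_watts() {     # usage: continuous_watts AMPS VOLTS (80% of peak)
    echo $(( $1 * $2 * 80 / 100 ))
}

headroom_watts() {       # usage: headroom_watts LIMIT_W DRAW_W
    echo $(( $1 - $2 ))
}

peak=$(circuit_watts 15 120)
cont=$(continuous_watts 15 120)
echo "15A circuit peak:       ${peak} W"
echo "15A circuit continuous: ${cont} W"
echo "Headroom at 1350 W:     $(headroom_watts "$cont" 1350) W"
```

Under that guideline the upper end of the estimated draw leaves little headroom on a 15A circuit, which supports the backlog item to move to a 20A circuit.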

Previous UPS

| Model | Capacity | Issue | Replaced |
| --- | --- | --- | --- |
| WattBox WB-1100-IPVMB-6 | 1100VA / 660W | Insufficient for dual Threadripper setup | 2025-12-21 |

Why replaced: Combined server load of 1000-1350W exceeded 660W capacity.


Power Draw Estimates

Typical Load

| Component | Idle | Load | Notes |
| --- | --- | --- | --- |
| PVE Server | 250-350W | 500-750W | CPU + TITAN RTX + P2000 + storage |
| PVE2 Server | 200-300W | 450-600W | CPU + RTX A6000 + storage |
| Network gear | ~50W | ~50W | Router, switches |
| Total | 500-700W | 1000-1400W | Varies by workload |

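UPS Load: ~33% measured at typical load (440W). The estimates above work out to ~38-53% of the 1320W rating at idle and ~76-106% under full load, so a sustained peak on both servers could exceed the UPS rating.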
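
As a quick check of the load percentages, the sketch below converts watts to a percentage of the 1320W rating, rounded to the nearest whole percent. The function name `load_pct` is hypothetical.

```bash
#!/bin/bash
# Sketch: convert a load in watts to a percentage of UPS capacity.
# Uses integer arithmetic with round-to-nearest.

load_pct() {             # usage: load_pct LOAD_W CAPACITY_W
    echo $(( ($1 * 100 + $2 / 2) / $2 ))
}

UPS_WATTS=1320           # OR2200PFCRT2U rated output

for watts in 440 500 700 1000 1400; do
    printf '%5d W -> %3d%%\n' "$watts" "$(load_pct "$watts" "$UPS_WATTS")"
done
```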

Runtime Calculation

  • At 440W load (33%): ~15-20 min runtime (tested 2025-12-21)
  • At 660W load (50%): ~10-12 min estimated
  • At 1000W load (75%): ~6-8 min estimated

NUT shutdown trigger: 120 seconds (2 min) remaining runtime


NUT (Network UPS Tools) Configuration

Architecture

UPS (USB) ──> PVE (NUT Server/Master) ──> PVE2 (NUT Client/Slave)
                      │
                      └──> Home Assistant (monitoring only)

  • Master: PVE (10.10.10.120) - UPS connected via USB, runs the NUT server
  • Slave: PVE2 (10.10.10.102) - monitors PVE's NUT server, shuts down when triggered

NUT Server Configuration (PVE)

1. UPS Driver Config: /etc/nut/ups.conf

[cyberpower]
    driver = usbhid-ups
    port = auto
    desc = "CyberPower OR2200PFCRT2U"
    override.battery.charge.low = 20
    override.battery.runtime.low = 120

Key settings:

  • driver = usbhid-ups: USB HID UPS driver (generic for CyberPower)
  • port = auto: Auto-detect USB device
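  • override.battery.runtime.low = 120: Set the low-battery runtime threshold to 120 seconds (2 min). Per the NUT ups.conf documentation, the driver evaluates overridden thresholds itself only when ignorelb is also set; without it, the UPS firmware decides when to raise LB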
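
To see how close the UPS is to the configured threshold, the `upsc` output can be parsed directly. This is a minimal sketch: the helper names `upsc_var` and `runtime_is_low` are hypothetical, and both read `upsc` output from stdin so they can be tried against saved output.

```bash
#!/bin/bash
# Sketch: compare battery.runtime from `upsc` output against a threshold.
# Both helpers read "name: value" lines on stdin.

upsc_var() {             # usage: <upsc output> | upsc_var VARIABLE_NAME
    awk -F': ' -v key="$1" '$1 == key { print $2; exit }'
}

runtime_is_low() {       # usage: <upsc output> | runtime_is_low THRESHOLD_SECONDS
    local runtime
    runtime=$(upsc_var battery.runtime)
    # Succeeds only if a runtime was reported and it is below the threshold
    [ -n "$runtime" ] && [ "$runtime" -lt "$1" ]
}
```

On PVE this could be used as `upsc cyberpower@localhost | runtime_is_low 120 && echo "below threshold"`.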

2. NUT Server Config: /etc/nut/upsd.conf

LISTEN 127.0.0.1 3493
LISTEN 10.10.10.120 3493

Listens on:

  • Localhost (for local monitoring)
  • LAN IP (for PVE2 to connect)
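
A small sketch for listing the addresses that upsd.conf asks the server to bind, useful when comparing against what is actually listening. The function name `listen_addrs` is hypothetical; 3493 is the standard NUT port and is used when a LISTEN line omits the port.

```bash
#!/bin/bash
# Sketch: print "address:port" for each LISTEN directive in upsd.conf.
# Reads the config on stdin; commented-out lines are ignored because their
# first field is not exactly "LISTEN".

listen_addrs() {         # usage: listen_addrs < /etc/nut/upsd.conf
    awk '$1 == "LISTEN" { port = ($3 == "" ? 3493 : $3); print $2 ":" port }'
}
```

For example, `listen_addrs < /etc/nut/upsd.conf` on PVE should list both the localhost and LAN entries shown above.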

3. User Config: /etc/nut/upsd.users

[admin]
    password = upsadmin123
    actions = SET
    instcmds = ALL

[upsmon]
    password = upsmon123
    upsmon master

Users:

  • admin: Full control, can run commands
  • upsmon: Used by upsmon on both hosts (as master on PVE, as slave on PVE2); has no SET or instcmds rights

4. Monitor Config: /etc/nut/upsmon.conf

MONITOR cyberpower@localhost 1 upsmon upsmon123 master

MINSUPPLIES 1
SHUTDOWNCMD "/usr/local/bin/ups-shutdown.sh"
NOTIFYCMD /usr/sbin/upssched
POLLFREQ 5
POLLFREQALERT 5
HOSTSYNC 15
DEADTIME 15
POWERDOWNFLAG /etc/killpower

NOTIFYMSG ONLINE    "UPS %s on line power"
NOTIFYMSG ONBATT    "UPS %s on battery"
NOTIFYMSG LOWBATT   "UPS %s battery is low"
NOTIFYMSG FSD       "UPS %s: forced shutdown in progress"
NOTIFYMSG COMMOK    "Communications with UPS %s established"
NOTIFYMSG COMMBAD   "Communications with UPS %s lost"
NOTIFYMSG SHUTDOWN  "Auto logout and shutdown proceeding"
NOTIFYMSG REPLBATT  "UPS %s battery needs to be replaced"
NOTIFYMSG NOCOMM    "UPS %s is unavailable"
NOTIFYMSG NOPARENT  "upsmon parent process died - shutdown impossible"

NOTIFYFLAG ONLINE   SYSLOG+WALL
NOTIFYFLAG ONBATT   SYSLOG+WALL
NOTIFYFLAG LOWBATT  SYSLOG+WALL
NOTIFYFLAG FSD      SYSLOG+WALL
NOTIFYFLAG COMMOK   SYSLOG+WALL
NOTIFYFLAG COMMBAD  SYSLOG+WALL
NOTIFYFLAG SHUTDOWN SYSLOG+WALL
NOTIFYFLAG REPLBATT SYSLOG+WALL
NOTIFYFLAG NOCOMM   SYSLOG+WALL
NOTIFYFLAG NOPARENT SYSLOG

Key settings:

  • MONITOR cyberpower@localhost 1 upsmon upsmon123 master: Monitor local UPS
  • SHUTDOWNCMD "/usr/local/bin/ups-shutdown.sh": Custom shutdown script
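  • POLLFREQ 5: Check UPS every 5 seconds
  • NOTIFYCMD /usr/sbin/upssched: Only invoked for events whose NOTIFYFLAG includes EXEC; the flags above use SYSLOG+WALL only, so upssched is not called as configured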
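
upsmon runs NOTIFYCMD only for events whose NOTIFYFLAG includes EXEC. The sketch below lists the events in an upsmon.conf that lack that flag, which is a quick way to see which events would never reach the notify command. The function name `events_without_exec` is hypothetical.

```bash
#!/bin/bash
# Sketch: list NOTIFYFLAG events that do not include the EXEC flag.
# Reads upsmon.conf on stdin; flags are "+"-separated in the third field.

events_without_exec() {  # usage: events_without_exec < /etc/nut/upsmon.conf
    awk '$1 == "NOTIFYFLAG" && $3 !~ /(^|\+)EXEC(\+|$)/ { print $2 }'
}
```

Run against the configuration shown above, this would print every event name, since none of the flags include EXEC.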

5. USB Permissions: /etc/udev/rules.d/99-nut-ups.rules

SUBSYSTEM=="usb", ATTR{idVendor}=="0764", ATTR{idProduct}=="0501", MODE="0660", GROUP="nut"

Purpose: Ensure NUT can access USB UPS device

Apply rule:

udevadm control --reload-rules
udevadm trigger

NUT Client Configuration (PVE2)

Monitor Config: /etc/nut/upsmon.conf

MONITOR cyberpower@10.10.10.120 1 upsmon upsmon123 slave

MINSUPPLIES 1
SHUTDOWNCMD "/usr/local/bin/ups-shutdown.sh"
POLLFREQ 5
POLLFREQALERT 5
HOSTSYNC 15
DEADTIME 15
POWERDOWNFLAG /etc/killpower

# Same NOTIFYMSG and NOTIFYFLAG as PVE

Key difference: slave instead of master - monitors remote UPS on PVE


Custom Shutdown Script

/usr/local/bin/ups-shutdown.sh (Same on both PVE and PVE2)

#!/bin/bash
# Graceful VM/CT shutdown when UPS battery low

LOG="/var/log/ups-shutdown.log"

log() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') - $1" | tee -a "$LOG"
}

log "=== UPS Shutdown Triggered ==="
log "Battery low - initiating graceful shutdown of VMs/CTs"

# Get list of running VMs (skip TrueNAS for now)
VMS=$(qm list | awk '$3=="running" && $1!=100 {print $1}')
for VMID in $VMS; do
    log "Stopping VM $VMID..."
    qm shutdown $VMID
done

# Get list of running containers
CTS=$(pct list | awk '$2=="running" {print $1}')
for CTID in $CTS; do
    log "Stopping CT $CTID..."
    pct shutdown $CTID
done

# Wait for VMs/CTs to stop
log "Waiting 60 seconds for VMs/CTs to shut down..."
sleep 60

# Now stop TrueNAS (storage - must be last)
# VM 100 exists on only one node, so silence the lookup error on the other
if qm status 100 2>/dev/null | grep -q running; then
    log "Stopping TrueNAS (VM 100) last..."
    qm shutdown 100
    sleep 30
fi

log "All VMs/CTs stopped. Host will remain running until UPS dies."
log "=== UPS Shutdown Complete ==="

Make executable:

chmod +x /usr/local/bin/ups-shutdown.sh

Script behavior:

  1. Stops all VMs (except TrueNAS)
  2. Stops all containers
  3. Waits 60 seconds
  4. Stops TrueNAS last (storage must be cleanly unmounted)
  5. Does NOT shut down Proxmox hosts - intentionally left running
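
The fixed 60-second wait in step 3 could instead poll until the guests have actually stopped. The following is a sketch of that alternative, not the script currently deployed. The helper names are hypothetical, and it assumes the status column positions used by the script above (column 3 for `qm list`, column 2 for `pct list`).

```bash
#!/bin/bash
# Sketch: wait until guests have stopped instead of sleeping a fixed time.

count_running() {        # usage: <list output> | count_running STATUS_COLUMN [SKIP_ID]
    # Skips the header row and, optionally, one guest ID (e.g. TrueNAS)
    awk -v col="$1" -v skip="$2" \
        'NR > 1 && $1 != skip && $col == "running" { n++ } END { print n + 0 }'
}

wait_for_stopped() {     # usage: wait_for_stopped STATUS_COLUMN SKIP_ID MAX_SECONDS CMD...
    local col=$1 skip=$2 max=$3 waited=0
    shift 3
    while [ "$("$@" | count_running "$col" "$skip")" -gt 0 ]; do
        # Give up once the time budget is spent
        [ "$waited" -ge "$max" ] && return 1
        sleep 5
        waited=$(( waited + 5 ))
    done
    return 0
}
```

In the shutdown script this might be called as `wait_for_stopped 3 100 60 qm list` to wait up to 60 seconds for every VM except TrueNAS, followed by `wait_for_stopped 2 "" 60 pct list` for containers.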

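Why not shut down hosts?

  • BIOS configured to "Restore on AC Power Loss"
  • Auto-boot relies on the hosts actually losing power: a host that had already powered itself off may stay off if AC returns before the UPS runs flat
  • When power returns, servers auto-boot and start VMs in order
  • Avoids need for manual intervention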

Power Failure Behavior

When Power Fails

  1. UPS switches to battery (OB DISCHRG status)
  2. NUT monitors runtime - polls every 5 seconds
  3. At 120 seconds (2 min) remaining:
    • NUT triggers /usr/local/bin/ups-shutdown.sh on both servers
    • Script gracefully stops all VMs/CTs
    • TrueNAS stopped last (storage integrity)
  4. Hosts remain running until UPS battery depletes
  5. UPS battery dies → Hosts lose power (ungraceful but safe - VMs already stopped)

When Power Returns

  1. UPS charges battery, power returns to servers
  2. BIOS "Restore on AC Power Loss" boots both servers
  3. Proxmox starts and auto-starts VMs in configured order:

| Order | Wait | VMs/CTs | Reason |
| --- | --- | --- | --- |
| 1 | 30s | TrueNAS (VM 100) | Storage must start first |
| 2 | 60s | Saltbox (VM 101) | Depends on TrueNAS NFS |
| 3 | 10s | fs-dev, homeassistant, lmdev1, copyparty, docker-host | General VMs |
| 4 | 5s | pihole, traefik, findshyt | Containers |

PVE2 VMs: order=1, wait=10s

Total recovery time: ~7 minutes from power restoration to fully operational (tested 2025-12-21)
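
The startup order above is set per guest with the Proxmox `--startup` option. The sketch below only prints the commands so they can be reviewed before running. It assumes the "Wait" column corresponds to the `up` delay, and it covers only the two guests whose IDs appear in this document. The function name `startup_cmd` is hypothetical.

```bash
#!/bin/bash
# Sketch: print (not run) the commands that would set startup order and delay.
# Assumption: the table's "Wait" value maps to the `up` delay in seconds.

startup_cmd() {          # usage: startup_cmd TOOL GUEST_ID ORDER UP_DELAY_SECONDS
    echo "$1 set $2 --startup order=$3,up=$4"
}

startup_cmd qm 100 1 30  # TrueNAS: storage first
startup_cmd qm 101 2 60  # Saltbox: needs TrueNAS NFS
```

Containers would use `pct` as the tool name in the same way.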


UPS Status Codes

| Code | Meaning | Action |
| --- | --- | --- |
| OL | Online (AC power) | Normal operation |
| OB | On Battery | Power outage - monitor runtime |
| LB | Low Battery | <2 min remaining - shutdown imminent |
| CHRG | Charging | Battery charging after power restored |
| DISCHRG | Discharging | On battery, draining |
| FSD | Forced Shutdown | NUT triggered shutdown |
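
ups.status usually carries several of these codes at once, separated by spaces (for example "OB DISCHRG"). A minimal sketch for decoding such a string and testing for a single flag follows. The function names are hypothetical, and only the codes from the table above are mapped.

```bash
#!/bin/bash
# Sketch: decode a space-separated ups.status string using the table above.

decode_status() {        # usage: decode_status "OB DISCHRG"
    local out="" flag text
    for flag in $1; do   # intentional word splitting on spaces
        case "$flag" in
            OL)      text="online" ;;
            OB)      text="on battery" ;;
            LB)      text="low battery" ;;
            CHRG)    text="charging" ;;
            DISCHRG) text="discharging" ;;
            FSD)     text="forced shutdown" ;;
            *)       text="unknown($flag)" ;;
        esac
        out="${out:+$out, }$text"
    done
    echo "$out"
}

has_flag() {             # usage: has_flag "OB DISCHRG" OB
    case " $1 " in
        *" $2 "*) return 0 ;;
    esac
    return 1
}
```

Matching on a whole flag avoids false hits such as finding "CHRG" inside "DISCHRG".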

Monitoring & Commands

Check UPS Status

# Full status
ssh pve 'upsc cyberpower@localhost'

# Key metrics only
ssh pve 'upsc cyberpower@localhost | grep -E "battery.charge:|battery.runtime:|ups.load:|ups.status:"'

# Example output:
# battery.charge: 100
# battery.runtime: 1234        (seconds remaining)
# ups.load: 33                  (% load)
# ups.status: OL                (online)

Control UPS Beeper

# Mute beeper (temporary - until next power event)
ssh pve 'upscmd -u admin -p upsadmin123 cyberpower@localhost beeper.mute'

# Disable beeper (permanent)
ssh pve 'upscmd -u admin -p upsadmin123 cyberpower@localhost beeper.disable'

# Enable beeper
ssh pve 'upscmd -u admin -p upsadmin123 cyberpower@localhost beeper.enable'

Test Shutdown Procedure

Simulate low battery (careful - this will shut down VMs!):

# Raise the low-runtime threshold so the shutdown triggers earlier during a battery test
ssh pve 'upsrw -s battery.runtime.low=300 -u admin -p upsadmin123 cyberpower@localhost'

# Watch it trigger (only while on battery, once runtime drops below 300 seconds)
ssh pve 'tail -f /var/log/ups-shutdown.log'

# Reset to normal
ssh pve 'upsrw -s battery.runtime.low=120 -u admin -p upsadmin123 cyberpower@localhost'

Lower-impact test: Run the shutdown script manually to exercise it without involving NUT. Note that this still stops all running VMs/CTs on that host:

ssh pve '/usr/local/bin/ups-shutdown.sh'
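
To rehearse the script without stopping anything, the `qm` and `pct` calls could be routed through a dry-run wrapper. This is a sketch of one possible approach and is not part of the deployed script. The `run` helper and the `DRY_RUN` variable are hypothetical names.

```bash
#!/bin/bash
# Sketch: dry-run wrapper for commands that change state.
# With DRY_RUN=1 the command is printed instead of executed.

run() {                  # usage: run COMMAND [ARGS...]
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "DRY-RUN: $*"
    else
        "$@"
    fi
}
```

Inside the shutdown script, `qm shutdown $VMID` would become `run qm shutdown "$VMID"`, and a rehearsal would be started with `DRY_RUN=1 /usr/local/bin/ups-shutdown.sh`.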

Home Assistant Integration

UPS metrics are exposed to Home Assistant via NUT integration.

Available Sensors

| Entity ID | Description |
| --- | --- |
| sensor.cyberpower_battery_charge | Battery % (0-100) |
| sensor.cyberpower_battery_runtime | Seconds remaining on battery |
| sensor.cyberpower_load | Load % (0-100) |
| sensor.cyberpower_input_voltage | Input voltage (V AC) |
| sensor.cyberpower_output_voltage | Output voltage (V AC) |
| sensor.cyberpower_status | Status text (OL, OB, LB, etc.) |

Configuration

Home Assistant: See HOMEASSISTANT.md for integration setup.

Example Automations

Send notification when on battery:

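automation:
  - alias: "UPS On Battery Alert"
    trigger:
      # The status is often compound (e.g. "OB DISCHRG"), so match the OB flag
      # instead of the exact string "OB". Assumes the sensor holds raw status codes.
      - platform: template
        value_template: "{{ 'OB' in states('sensor.cyberpower_status').split() }}"
    action:
      - service: notify.mobile_app
        data:
          message: "⚠️ Power outage! UPS on battery. Runtime: {{ states('sensor.cyberpower_battery_runtime') }}s"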

Alert when battery low:

automation:
  - alias: "UPS Low Battery Alert"
    trigger:
      - platform: numeric_state
        entity_id: sensor.cyberpower_battery_runtime
        below: 300
    action:
      - service: notify.mobile_app
        data:
          message: "🚨 UPS battery low! {{ states('sensor.cyberpower_battery_runtime') }}s remaining"

Testing Results

Full Power Failure Test (2025-12-21)

Complete end-to-end test of power failure and recovery:

| Event | Time | Elapsed | Notes |
| --- | --- | --- | --- |
| Power pulled | 22:30 | - | UPS on battery, ~15 min runtime at 33% load |
| Low battery trigger | 22:40:38 | +10:38 | Runtime < 120s, shutdown script ran |
| All VMs stopped | 22:41:36 | +0:58 | Graceful shutdown completed |
| UPS died | 22:46:29 | +4:53 | Hosts lost power at 0% battery |
| Power restored | ~22:47 | - | Plugged back in |
| PVE online | 22:49:11 | +2:11 | BIOS boot, Proxmox started |
| PVE2 online | 22:50:47 | +3:47 | BIOS boot, Proxmox started |
| All VMs running | 22:53:39 | +6:39 | Auto-started in correct order |
| Total recovery | - | ~7 min | From power return to fully operational |

Elapsed times during the outage are measured from the previous event; recovery times are measured from power restoration (~22:47).

Results:

  • VMs shut down gracefully
  • Hosts remained running until UPS died (as intended)
  • Auto-boot on power restoration worked
  • VMs started in correct order with appropriate delays
  • No data corruption or issues

Runtime calculation:

  • Load: ~33% (440W estimated) until the VMs stopped at 22:41:36, lower afterwards
  • Total runtime on battery: ~16 minutes (22:30 → 22:46:29)
  • Matches manufacturer estimate for 33% load
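
The intervals in the test log can be recomputed from the timestamps with a short sketch. The function names are hypothetical, and the times are assumed to fall on the same day.

```bash
#!/bin/bash
# Sketch: compute elapsed time between two HH:MM:SS timestamps (same day).

to_seconds() {           # usage: to_seconds HH:MM:SS
    # awk treats "08" as decimal 8, avoiding octal pitfalls in shell arithmetic
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

elapsed() {              # usage: elapsed START END  -> seconds
    echo $(( $(to_seconds "$2") - $(to_seconds "$1") ))
}

fmt_ms() {               # usage: fmt_ms SECONDS  -> M:SS
    printf '%d:%02d\n' $(( $1 / 60 )) $(( $1 % 60 ))
}

echo "On battery:          $(fmt_ms "$(elapsed 22:30:00 22:46:29)")"
echo "Trigger to VMs down: $(fmt_ms "$(elapsed 22:40:38 22:41:36)")"
echo "VMs down to UPS off: $(fmt_ms "$(elapsed 22:41:36 22:46:29)")"
```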

Proxmox Cluster Quorum Fix

Problem

With a 2-node cluster, if one node goes down, the other loses quorum and can't manage VMs.

During UPS testing, this would prevent the remaining node from starting VMs after power restoration.

Solution

Modified /etc/pve/corosync.conf to enable 2-node mode:

quorum {
    provider: corosync_votequorum
    two_node: 1
}

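Effect:

  • Either node can keep operating on its own if the other goes down
  • two_node: 1 implicitly enables wait_for_all, so after a full power loss both nodes must be up at the same time once before the cluster becomes quorate
  • Both nodes visible in single Proxmox interface when both up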

Applied: 2025-12-21
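
Whether a node currently has quorum can be read from `pvecm status`. The sketch below parses that output from stdin. It assumes the output contains a line of the form "Quorate: Yes" and a "Flags:" line, which is how the layout is recalled here and not something shown in this document. The function names are hypothetical.

```bash
#!/bin/bash
# Sketch: check quorum state from `pvecm status` output on stdin.
# Assumption: the output has "Quorate: Yes|No" and "Flags: ..." lines.

is_quorate() {           # usage: pvecm status | is_quorate
    awk '$1 == "Quorate:" { found = ($2 == "Yes") } END { exit !found }'
}

quorum_flags() {         # usage: pvecm status | quorum_flags
    awk '$1 == "Flags:" { $1 = ""; sub(/^ +/, ""); print }'
}
```

After a recovery test, `ssh pve 'pvecm status' | is_quorate && echo quorate` gives a quick pass or fail on either node.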


Maintenance

Monthly Checks

# Check UPS status
ssh pve 'upsc cyberpower@localhost'

# Check NUT server running
ssh pve 'systemctl status nut-server'
ssh pve 'systemctl status nut-monitor'

# Check NUT client running (PVE2)
ssh pve2 'systemctl status nut-monitor'

# Verify PVE2 can see UPS
ssh pve2 'upsc cyberpower@10.10.10.120'

# Check logs for errors
ssh pve 'journalctl -u nut-server -n 50'
ssh pve 'journalctl -u nut-monitor -n 50'

Battery Health

Check battery stats:

ssh pve 'upsc cyberpower@localhost | grep battery'

# Key metrics:
# battery.charge: 100          (should be near 100 when on AC)
# battery.runtime: 1200+       (seconds at current load)
# battery.voltage: ~24V        (normal for 24V battery system)

Battery replacement: When runtime significantly decreases or UPS reports REPLBATT:

ssh pve 'upsc cyberpower@localhost | grep battery.mfr.date'

CyberPower batteries typically last 3-5 years.

Firmware Updates

Check CyberPower website for firmware updates: https://www.cyberpowersystems.com/support/firmware/


Troubleshooting

UPS Not Detected

# Check USB connection
ssh pve 'lsusb | grep Cyber'

# Expected:
# Bus 001 Device 003: ID 0764:0501 Cyber Power System, Inc. CP1500 AVR UPS

# Restart NUT driver
# (on NUT 2.8+ the unit may be named nut-driver@cyberpower instead)
ssh pve 'systemctl restart nut-driver'
ssh pve 'systemctl status nut-driver'

PVE2 Can't Connect

# Verify NUT server listening
ssh pve 'netstat -tuln | grep 3493'

# Should show:
# tcp 0 0 10.10.10.120:3493 0.0.0.0:* LISTEN

# Test connection from PVE2
ssh pve2 'telnet 10.10.10.120 3493'

# Check firewall (should allow port 3493)
ssh pve 'iptables -L -n | grep 3493'

Shutdown Script Not Running

# Check script permissions
ssh pve 'ls -la /usr/local/bin/ups-shutdown.sh'

# Should be: -rwxr-xr-x (executable)

# Check logs
ssh pve 'cat /var/log/ups-shutdown.log'

# Test script manually
ssh pve '/usr/local/bin/ups-shutdown.sh'

UPS Status Shows UNKNOWN

# Driver may not be compatible
ssh pve 'upsc cyberpower@localhost ups.status'

# Try different driver (in /etc/nut/ups.conf)
# driver = usbhid-ups
# or
# driver = blazer_usb

# Restart after change
ssh pve 'systemctl restart nut-driver nut-server'

Future Improvements

  • Add email alerts for UPS events (power fail, low battery)
  • Log runtime statistics to track battery degradation
  • Set up Grafana dashboard for UPS metrics
  • Test battery runtime at different load levels
  • Upgrade to 20A circuit, restore original 5-20P plug
  • Consider adding network management card for out-of-band UPS access


Last Updated: 2025-12-22