UPS and Power Management

Documentation for UPS (Uninterruptible Power Supply) configuration, NUT (Network UPS Tools) monitoring, and power failure procedures.

Hardware

Current UPS

Specification	Value
Model	CyberPower OR2200PFCRT2U
Capacity	2200VA / 1320W
Form Factor	2U rackmount
Output	PFC Sinewave (compatible with active PFC PSUs)
Outlets	2x NEMA 5-20R + 6x NEMA 5-15R (all battery + surge)
Input Plug	⚠️ Originally NEMA 5-20P (20A), rewired to 5-15P (15A)
Runtime	~15-20 min at typical load (~33% / 440W)
Installed	2025-12-21
Status	Active

⚠️ Temporary Wiring Modification

Issue: UPS came with a NEMA 5-20P plug (20A), but the server rack is on a 15A circuit
Solution: Temporarily rewired the plug from 5-20P → 5-15P for compatibility
Risk: UPS can output 1320W, but the circuit is limited to 1800W max (15A × 120V)
Current draw: ~1000-1350W total (safe margin)
Backlog: Upgrade to 20A circuit, restore original 5-20P plug
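
The circuit arithmetic above can be checked with a short shell calculation. This is a minimal sketch: `circuit_watts` is a hypothetical helper, and the 80% continuous-load derating is a common electrical rule of thumb that is assumed here, not something stated in these notes.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the original setup):
# prints the wattage available on a branch circuit.
# usage: circuit_watts AMPS VOLTS [DERATE_PERCENT]
circuit_watts() {
    local amps=$1 volts=$2 derate=${3:-100}
    echo $(( amps * volts * derate / 100 ))
}

circuit_watts 15 120      # peak capacity of the current 15A circuit: 1800
circuit_watts 15 120 80   # assumed 80% continuous-load limit: 1440
circuit_watts 20 120 80   # same limit after the planned 20A upgrade: 1920
```

Under that assumption, the upper end of the estimated draw (~1350W) sits just below the 1440W figure for the 15A circuit.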

Previous UPS

Model	Capacity	Issue	Replaced
WattBox WB-1100-IPVMB-6	1100VA / 660W	Insufficient for dual Threadripper setup	2025-12-21

Why replaced: Combined server load of 1000-1350W exceeded 660W capacity.

Power Draw Estimates

Typical Load

Component	Idle	Load	Notes
PVE Server	250-350W	500-750W	CPU + TITAN RTX + P2000 + storage
PVE2 Server	200-300W	450-600W	CPU + RTX A6000 + storage
Network gear	~50W	~50W	Router, switches
Total	500-700W	1000-1400W	Varies by workload

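UPS Load: ~33-50% typical (33% measured during the 2025-12-21 test), 70-80% under heavy load. Note: the 1400W upper estimate in the table exceeds the UPS's 1320W rating.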
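
The load percentages follow from dividing the draw in watts by the UPS's 1320W rating. The sketch below shows that conversion for the wattages mentioned in this section; `load_pct` is a hypothetical helper name, and the result is rounded to the nearest integer.

```bash
#!/usr/bin/env bash
UPS_CAPACITY_W=1320   # rated output of the OR2200PFCRT2U (from the hardware table)

# Hypothetical helper: UPS load as a rounded integer percentage.
load_pct() {
    local watts=$1 capacity=${2:-$UPS_CAPACITY_W}
    echo $(( (watts * 100 + capacity / 2) / capacity ))
}

for w in 440 660 1000 1400; do
    echo "${w}W -> $(load_pct "$w")%"
done
# 440W -> 33%
# 660W -> 50%
# 1000W -> 76%
# 1400W -> 106%
```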

Runtime Calculation

At 440W load (33%): ~15-20 min runtime (tested 2025-12-21)
At 660W load (50%): ~10-12 min estimated
At 1000W load (75%): ~6-8 min estimated

NUT shutdown trigger: 120 seconds (2 min) remaining runtime

NUT (Network UPS Tools) Configuration

Architecture

UPS (USB) ──> PVE (NUT Server/Master) ──> PVE2 (NUT Client/Slave)
                      │
                      └──> Home Assistant (monitoring only)

Master: PVE (10.10.10.120) - UPS connected via USB, runs NUT server
Slave: PVE2 (10.10.10.102) - Monitors PVE's NUT server, shuts down when triggered

NUT Server Configuration (PVE)

1. UPS Driver Config: `/etc/nut/ups.conf`

[cyberpower]
    driver = usbhid-ups
    port = auto
    desc = "CyberPower OR2200PFCRT2U"
    override.battery.charge.low = 20
    override.battery.runtime.low = 120

Key settings:

driver = usbhid-ups: USB HID UPS driver (generic for CyberPower)
port = auto: Auto-detect USB device
override.battery.charge.low = 20: Report low battery at 20% charge
override.battery.runtime.low = 120: Trigger shutdown at 120 seconds (2 min) remaining
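
The two `override.battery.*.low` values amount to a simple "either threshold crossed" rule. The sketch below illustrates that rule only; `is_low_battery` is a hypothetical helper, and the strict less-than comparison is an assumption. As far as the NUT `ups.conf` documentation describes it, the driver evaluates these thresholds itself only when the `ignorelb` flag is also set, and otherwise relies on the low-battery flag reported by the UPS; these notes do not say which applies here.

```bash
#!/usr/bin/env bash
CHARGE_LOW=20     # override.battery.charge.low (percent)
RUNTIME_LOW=120   # override.battery.runtime.low (seconds)

# Hypothetical helper: succeeds when either threshold is crossed.
# usage: is_low_battery CHARGE_PERCENT RUNTIME_SECONDS
is_low_battery() {
    local charge=$1 runtime=$2
    (( charge < CHARGE_LOW || runtime < RUNTIME_LOW ))
}

is_low_battery 100 1234 && echo "low" || echo "ok"   # ok
is_low_battery 45 110   && echo "low" || echo "ok"   # low (runtime below 120 s)
is_low_battery 18 600   && echo "low" || echo "ok"   # low (charge below 20%)
```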

2. NUT Server Config: `/etc/nut/upsd.conf`

LISTEN 127.0.0.1 3493
LISTEN 10.10.10.120 3493

Listens on:

Localhost (for local monitoring)
LAN IP (for PVE2 to connect)

3. User Config: `/etc/nut/upsd.users`

[admin]
    password = upsadmin123
    actions = SET
    instcmds = ALL

[upsmon]
    password = upsmon123
    upsmon master

Users:

admin: Full control, can run commands
upsmon: Used by upsmon on both PVE (master) and PVE2 (slave) to monitor the UPS

4. Monitor Config: `/etc/nut/upsmon.conf`

MONITOR cyberpower@localhost 1 upsmon upsmon123 master

MINSUPPLIES 1
SHUTDOWNCMD "/usr/local/bin/ups-shutdown.sh"
NOTIFYCMD /usr/sbin/upssched
POLLFREQ 5
POLLFREQALERT 5
HOSTSYNC 15
DEADTIME 15
POWERDOWNFLAG /etc/killpower

NOTIFYMSG ONLINE    "UPS %s on line power"
NOTIFYMSG ONBATT    "UPS %s on battery"
NOTIFYMSG LOWBATT   "UPS %s battery is low"
NOTIFYMSG FSD       "UPS %s: forced shutdown in progress"
NOTIFYMSG COMMOK    "Communications with UPS %s established"
NOTIFYMSG COMMBAD   "Communications with UPS %s lost"
NOTIFYMSG SHUTDOWN  "Auto logout and shutdown proceeding"
NOTIFYMSG REPLBATT  "UPS %s battery needs to be replaced"
NOTIFYMSG NOCOMM    "UPS %s is unavailable"
NOTIFYMSG NOPARENT  "upsmon parent process died - shutdown impossible"

NOTIFYFLAG ONLINE   SYSLOG+WALL
NOTIFYFLAG ONBATT   SYSLOG+WALL
NOTIFYFLAG LOWBATT  SYSLOG+WALL
NOTIFYFLAG FSD      SYSLOG+WALL
NOTIFYFLAG COMMOK   SYSLOG+WALL
NOTIFYFLAG COMMBAD  SYSLOG+WALL
NOTIFYFLAG SHUTDOWN SYSLOG+WALL
NOTIFYFLAG REPLBATT SYSLOG+WALL
NOTIFYFLAG NOCOMM   SYSLOG+WALL
NOTIFYFLAG NOPARENT SYSLOG

Key settings:

MONITOR cyberpower@localhost 1 upsmon upsmon123 master: Monitor local UPS
SHUTDOWNCMD "/usr/local/bin/ups-shutdown.sh": Custom shutdown script
POLLFREQ 5: Check UPS every 5 seconds
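
One detail worth checking in this file: upsmon runs NOTIFYCMD only for notify types whose NOTIFYFLAG includes EXEC, and the flags listed above use only SYSLOG and WALL. The sketch below lists the notify types that lack EXEC. `notify_types_without_exec` is a hypothetical helper and the heredoc is a shortened sample, not the full config.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads an upsmon.conf on stdin and prints every
# notify type whose NOTIFYFLAG line does not include EXEC.
notify_types_without_exec() {
    awk '$1 == "NOTIFYFLAG" && $3 !~ /(^|\+)EXEC(\+|$)/ { print $2 }'
}

# Small sample in the same format as the config above.
notify_types_without_exec <<'EOF'
NOTIFYFLAG ONBATT   SYSLOG+WALL
NOTIFYFLAG LOWBATT  SYSLOG+WALL+EXEC
NOTIFYFLAG NOPARENT SYSLOG
EOF
# prints ONBATT and NOPARENT
```

On PVE the same function can be fed the real file: `notify_types_without_exec < /etc/nut/upsmon.conf`.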

5. USB Permissions: `/etc/udev/rules.d/99-nut-ups.rules`

SUBSYSTEM=="usb", ATTR{idVendor}=="0764", ATTR{idProduct}=="0501", MODE="0660", GROUP="nut"

Purpose: Ensure NUT can access USB UPS device

Apply rule:

udevadm control --reload-rules
udevadm trigger

NUT Client Configuration (PVE2)

Monitor Config: `/etc/nut/upsmon.conf`

MONITOR cyberpower@10.10.10.120 1 upsmon upsmon123 slave

MINSUPPLIES 1
SHUTDOWNCMD "/usr/local/bin/ups-shutdown.sh"
POLLFREQ 5
POLLFREQALERT 5
HOSTSYNC 15
DEADTIME 15
POWERDOWNFLAG /etc/killpower

# Same NOTIFYMSG and NOTIFYFLAG as PVE

Key difference: slave instead of master - monitors remote UPS on PVE

Custom Shutdown Script

`/usr/local/bin/ups-shutdown.sh` (Same on both PVE and PVE2)

#!/bin/bash
# Graceful VM/CT shutdown when UPS battery low

LOG="/var/log/ups-shutdown.log"

log() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') - $1" | tee -a "$LOG"
}

log "=== UPS Shutdown Triggered ==="
log "Battery low - initiating graceful shutdown of VMs/CTs"

# Get list of running VMs (skip TrueNAS for now)
VMS=$(qm list | awk '$3=="running" && $1!=100 {print $1}')
for VMID in $VMS; do
    log "Stopping VM $VMID..."
    qm shutdown $VMID
done

# Get list of running containers
CTS=$(pct list | awk '$2=="running" {print $1}')
for CTID in $CTS; do
    log "Stopping CT $CTID..."
    pct shutdown $CTID
done

# Wait for VMs/CTs to stop
log "Waiting 60 seconds for VMs/CTs to shut down..."
sleep 60

# Now stop TrueNAS (storage - must be last)
if qm status 100 | grep -q running; then
    log "Stopping TrueNAS (VM 100) last..."
    qm shutdown 100
    sleep 30
fi

log "All VMs/CTs stopped. Host will remain running until UPS dies."
log "=== UPS Shutdown Complete ==="

Make executable:

chmod +x /usr/local/bin/ups-shutdown.sh

Script behavior:

Stops all VMs (except TrueNAS)
Stops all containers
Waits 60 seconds
Stops TrueNAS last (storage must be cleanly unmounted)
Does NOT shut down Proxmox hosts - intentionally left running

Why not shut down hosts?

BIOS configured to "Restore on AC Power Loss"
When power returns, servers auto-boot and start VMs in order
Avoids need for manual intervention
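
The VM selection used by the script can be tried without stopping anything by feeding the same awk filter a captured `qm list` listing. This is a sketch: `select_running_vms` is a hypothetical wrapper, and the sample listing (including VM ID 105 and all memory, disk, and PID values) is made up for illustration.

```bash
#!/usr/bin/env bash
STORAGE_VMID=100   # TrueNAS, which the script stops last

# Same filter as the script: running VMs except the storage VM.
# Reads `qm list` output on stdin.
select_running_vms() {
    awk -v skip="$STORAGE_VMID" '$3 == "running" && $1 != skip { print $1 }'
}

# Made-up sample output; on a host, pipe in `qm list` instead.
select_running_vms <<'EOF'
      VMID NAME                 STATUS     MEM(MB)    BOOTDISK(GB) PID
       100 truenas              running    32768             32.00 1201
       101 saltbox              running    16384            200.00 1350
       105 lmdev1               stopped    8192              64.00 0
EOF
# prints 101
```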

Power Failure Behavior

When Power Fails

1. UPS switches to battery (OB DISCHRG status)
2. NUT monitors runtime, polling every 5 seconds
3. At 120 seconds (2 min) remaining:
   - NUT triggers /usr/local/bin/ups-shutdown.sh on both servers
   - Script gracefully stops all VMs/CTs
   - TrueNAS stopped last (storage integrity)
4. Hosts remain running until UPS battery depletes
5. UPS battery dies → Hosts lose power (ungraceful but safe - VMs already stopped)

When Power Returns

UPS charges battery, power returns to servers
BIOS "Restore on AC Power Loss" boots both servers
Proxmox starts and auto-starts VMs in configured order:

Order	Wait	VMs/CTs	Reason
1	30s	TrueNAS (VM 100)	Storage must start first
2	60s	Saltbox (VM 101)	Depends on TrueNAS NFS
3	10s	fs-dev, homeassistant, lmdev1, copyparty, docker-host	General VMs
4	5s	pihole, traefik, findshyt	Containers

PVE2 VMs: order=1, wait=10s

Total recovery time: ~7 minutes from power restoration to fully operational (tested 2025-12-21)
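
For reference, Proxmox stores this ordering per guest in its `startup` option, which `qm set` and `pct set` can write. The sketch below only prints the commands for the two VMs whose IDs appear in this document. `startup_cmd` is a hypothetical helper, and mapping the table's "Wait" column to the `up=` delay is an assumption; see VMS.md for the actual configuration.

```bash
#!/usr/bin/env bash
# Hypothetical helper: prints (does not run) the command that sets
# start-on-boot, boot order, and post-start delay for one guest.
# usage: startup_cmd qm|pct ID ORDER DELAY_SECONDS
startup_cmd() {
    local tool=$1 id=$2 order=$3 delay=$4
    echo "$tool set $id --onboot 1 --startup order=$order,up=$delay"
}

startup_cmd qm 100 1 30   # TrueNAS first, then wait 30s
startup_cmd qm 101 2 60   # Saltbox second, then wait 60s
```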

UPS Status Codes

Code	Meaning	Action
`OL`	Online (AC power)	Normal operation
`OB`	On Battery	Power outage - monitor runtime
`LB`	Low Battery	<2 min remaining - shutdown imminent
`CHRG`	Charging	Battery charging after power restored
`DISCHRG`	Discharging	On battery, draining
`FSD`	Forced Shutdown	NUT triggered shutdown
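
`ups.status` usually carries several of these codes at once, separated by spaces (for example `OB DISCHRG`), so scripts should test for a token and not compare the whole string. A minimal sketch with two hypothetical helpers, `has_flag` and `describe_status`, using the meanings from the table above:

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds if FLAG is one of the tokens in STATUS.
# usage: has_flag "OB DISCHRG" OB
has_flag() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Prints one line per token, using the meanings from the table above.
describe_status() {
    local token
    for token in $1; do    # unquoted on purpose: split on whitespace
        case $token in
            OL)      echo "OL: Online (AC power)" ;;
            OB)      echo "OB: On Battery" ;;
            LB)      echo "LB: Low Battery" ;;
            CHRG)    echo "CHRG: Charging" ;;
            DISCHRG) echo "DISCHRG: Discharging" ;;
            FSD)     echo "FSD: Forced Shutdown" ;;
            *)       echo "$token: not in the table" ;;
        esac
    done
}

describe_status "OB DISCHRG"
has_flag "OB DISCHRG" OB && echo "on battery"
```

Padding both strings with spaces keeps `CHRG` from matching inside `DISCHRG`.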

Monitoring & Commands

Check UPS Status

# Full status
ssh pve 'upsc cyberpower@localhost'

# Key metrics only
ssh pve 'upsc cyberpower@localhost | grep -E "battery.charge:|battery.runtime:|ups.load:|ups.status:"'

# Example output:
# battery.charge: 100
# battery.runtime: 1234        (seconds remaining)
# ups.load: 33                  (% load)
# ups.status: OL                (online)
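
For scripting, the `upsc` output can be reduced to single values and the runtime converted to minutes. The sketch below uses two hypothetical helpers, `ups_var` and `fmt_runtime`, and runs against the example values shown above so that it works without a UPS attached.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads `upsc` output on stdin, prints one variable.
ups_var() {
    awk -F': ' -v key="$1" '$1 == key { print $2 }'
}

# Hypothetical helper: seconds -> "MmSSs", e.g. for battery.runtime.
fmt_runtime() {
    printf '%dm%02ds\n' $(( $1 / 60 )) $(( $1 % 60 ))
}

# Sample values taken from the example output above.
sample='battery.charge: 100
battery.runtime: 1234
ups.load: 33
ups.status: OL'

runtime=$(printf '%s\n' "$sample" | ups_var battery.runtime)
fmt_runtime "$runtime"   # 20m34s
```

On PVE the live value would come from `upsc cyberpower@localhost | ups_var battery.runtime`.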

Control UPS Beeper

# Mute beeper (temporary - until next power event)
ssh pve 'upscmd -u admin -p upsadmin123 cyberpower@localhost beeper.mute'

# Disable beeper (permanent)
ssh pve 'upscmd -u admin -p upsadmin123 cyberpower@localhost beeper.disable'

# Enable beeper
ssh pve 'upscmd -u admin -p upsadmin123 cyberpower@localhost beeper.enable'

Test Shutdown Procedure

Simulate low battery (careful - this will shut down VMs!):

# Raise the low-runtime threshold to 300 s so shutdown triggers earlier while on battery
ssh pve 'upsrw -s battery.runtime.low=300 -u admin -p upsadmin123 cyberpower@localhost'

# Watch it trigger (when runtime drops below 300 seconds)
ssh pve 'tail -f /var/log/ups-shutdown.log'

# Reset to normal
ssh pve 'upsrw -s battery.runtime.low=120 -u admin -p upsadmin123 cyberpower@localhost'

Better test: Run the shutdown script manually, bypassing NUT. Note that this still stops all running VMs/CTs on the host:

ssh pve '/usr/local/bin/ups-shutdown.sh'

Home Assistant Integration

UPS metrics are exposed to Home Assistant via NUT integration.

Available Sensors

Entity ID	Description
`sensor.cyberpower_battery_charge`	Battery % (0-100)
`sensor.cyberpower_battery_runtime`	Seconds remaining on battery
`sensor.cyberpower_load`	Load % (0-100)
`sensor.cyberpower_input_voltage`	Input voltage (V AC)
`sensor.cyberpower_output_voltage`	Output voltage (V AC)
`sensor.cyberpower_status`	Status text (OL, OB, LB, etc.)

Configuration

Home Assistant: See HOMEASSISTANT.md for integration setup.

Example Automations

Send notification when on battery:

automation:
  - alias: "UPS On Battery Alert"
    trigger:
      # Template trigger also matches combined states such as "OB DISCHRG"
      - platform: template
        value_template: "{{ 'OB' in states('sensor.cyberpower_status') }}"
    action:
      - service: notify.mobile_app
        data:
          message: "⚠️ Power outage! UPS on battery. Runtime: {{ states('sensor.cyberpower_battery_runtime') }}s"

Alert when battery low:

automation:
  - alias: "UPS Low Battery Alert"
    trigger:
      - platform: numeric_state
        entity_id: sensor.cyberpower_battery_runtime
        below: 300
    action:
      - service: notify.mobile_app
        data:
          message: "🚨 UPS battery low! {{ states('sensor.cyberpower_battery_runtime') }}s remaining"

Testing Results

Full Power Failure Test (2025-12-21)

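Complete end-to-end test of power failure and recovery. Durations up to "UPS died" are measured from the previous event; durations after "Power restored" are measured from power restoration (~22:47):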

Event	Time	Duration	Notes
Power pulled	22:30	-	UPS on battery, ~15 min runtime at 33% load
Low battery trigger	22:40:38	+10:38	Runtime < 120s, shutdown script ran
All VMs stopped	22:41:36	+0:58	Graceful shutdown completed
UPS died	22:46:29	+4:53	Hosts lost power at 0% battery
Power restored	~22:47	-	Plugged back in
PVE online	22:49:11	+2:11	BIOS boot, Proxmox started
PVE2 online	22:50:47	+3:47	BIOS boot, Proxmox started
All VMs running	22:53:39	+6:39	Auto-started in correct order
Total recovery	-	~7 min	From power return to fully operational

Results:

✅ VMs shut down gracefully
✅ Hosts remained running until UPS died (as intended)
✅ Auto-boot on power restoration worked
✅ VMs started in correct order with appropriate delays
✅ No data corruption or issues

Runtime calculation:

Load: ~33% (440W estimated)
Total runtime on battery: ~16 minutes (22:30 → 22:46:29)
Matches manufacturer estimate for 33% load
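
The durations in the test table can be re-derived from the timestamps. The sketch below uses two hypothetical helpers, `to_seconds` and `elapsed`, and assumes both times fall on the same day (it does not handle a test that crosses midnight).

```bash
#!/usr/bin/env bash
# Hypothetical helper: "HH:MM" or "HH:MM:SS" -> seconds since midnight.
to_seconds() {
    local h m s
    IFS=: read -r h m s <<< "$1"
    # 10# forces base 10 so values like "08" are not parsed as octal
    echo $(( 10#$h * 3600 + 10#$m * 60 + 10#${s:-0} ))
}

# Hypothetical helper: elapsed time between two same-day timestamps as M:SS.
elapsed() {
    local d=$(( $(to_seconds "$2") - $(to_seconds "$1") ))
    printf '%d:%02d\n' $(( d / 60 )) $(( d % 60 ))
}

elapsed 22:30 22:40:38      # power pulled -> low battery trigger: 10:38
elapsed 22:30 22:46:29      # power pulled -> UPS died: 16:29
elapsed 22:40:38 22:46:29   # low battery trigger -> UPS died: 5:51
elapsed 22:47 22:53:39      # power restored -> all VMs running: 6:39
```

The third line shows that, in this test, the UPS kept running for nearly six minutes after the shutdown trigger fired.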

Proxmox Cluster Quorum Fix

Problem

With a 2-node cluster, if one node goes down, the other loses quorum and can't manage VMs.

During UPS testing, this would prevent the remaining node from starting VMs after power restoration.

Solution

Modified /etc/pve/corosync.conf to enable 2-node mode:

quorum {
    provider: corosync_votequorum
    two_node: 1
}

Effect:

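Either node can operate independently if the other goes down
No more waiting for quorum when one server goes offline
Both nodes visible in single Proxmox interface when both up
Note: two_node: 1 enables wait_for_all by default, so after a cold start both nodes must see each other once before the cluster becomes quorate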

Applied: 2025-12-21
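
Whether a node currently has quorum can be read from `pvecm status`. The sketch below parses the `Quorate:` line; `is_quorate` is a hypothetical helper, and the heredoc is an illustrative excerpt, not output captured from this cluster.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads `pvecm status` output on stdin and
# succeeds only if the "Quorate:" line says Yes.
is_quorate() {
    awk '$1 == "Quorate:" { found = ($2 == "Yes") } END { exit !found }'
}

# Illustrative excerpt; the real output contains more fields.
if is_quorate <<'EOF'
Quorum information
------------------
Nodes:            2
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   2
Total votes:      2
Flags:            2Node Quorate WaitForAll
EOF
then
    echo "cluster is quorate"
fi
```

On a node, the check would be `pvecm status | is_quorate`.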

Maintenance

Monthly Checks

# Check UPS status
ssh pve 'upsc cyberpower@localhost'

# Check NUT server running
ssh pve 'systemctl status nut-server'
ssh pve 'systemctl status nut-monitor'

# Check NUT client running (PVE2)
ssh pve2 'systemctl status nut-monitor'

# Verify PVE2 can see UPS
ssh pve2 'upsc cyberpower@10.10.10.120'

# Check logs for errors
ssh pve 'journalctl -u nut-server -n 50'
ssh pve 'journalctl -u nut-monitor -n 50'

Battery Health

Check battery stats:

ssh pve 'upsc cyberpower@localhost | grep battery'

# Key metrics:
# battery.charge: 100          (should be near 100 when on AC)
# battery.runtime: 1200+       (seconds at current load)
# battery.voltage: ~24V        (normal for 24V battery system)

Battery replacement: Replace the battery when runtime significantly decreases or the UPS reports REPLBATT. To check the battery manufacture date:

ssh pve 'upsc cyberpower@localhost | grep battery.mfr.date'

CyberPower batteries typically last 3-5 years.
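
"Significantly decreases" is easier to judge against a recorded baseline. The sketch below computes the percentage drop between two `battery.runtime` readings. `runtime_drop_pct` is a hypothetical helper, the 25% threshold is an assumed alert level, and the 900-second reading is made up; readings should be compared at a similar load, since runtime depends on it.

```bash
#!/usr/bin/env bash
# Hypothetical helper: percentage drop of CURRENT runtime vs BASELINE.
# usage: runtime_drop_pct BASELINE_SECONDS CURRENT_SECONDS
runtime_drop_pct() {
    local baseline=$1 current=$2
    echo $(( (baseline - current) * 100 / baseline ))
}

BASELINE=1234   # example battery.runtime value from this document, in seconds
THRESHOLD=25    # assumed alert level in percent, not from the original notes

drop=$(runtime_drop_pct "$BASELINE" 900)   # 900 is a made-up later reading
if (( drop >= THRESHOLD )); then
    echo "runtime down ${drop}% - consider checking the battery"
fi
```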

Firmware Updates

Check CyberPower website for firmware updates: https://www.cyberpowersystems.com/support/firmware/

Troubleshooting

UPS Not Detected

# Check USB connection
ssh pve 'lsusb | grep Cyber'

# Expected:
# Bus 001 Device 003: ID 0764:0501 Cyber Power System, Inc. CP1500 AVR UPS

# Restart NUT driver
ssh pve 'systemctl restart nut-driver'
ssh pve 'systemctl status nut-driver'

PVE2 Can't Connect

# Verify NUT server listening
ssh pve 'netstat -tuln | grep 3493'

# Should show:
# tcp 0 0 10.10.10.120:3493 0.0.0.0:* LISTEN

# Test connection from PVE2
ssh pve2 'telnet 10.10.10.120 3493'

# Check firewall (should allow port 3493)
ssh pve 'iptables -L -n | grep 3493'

Shutdown Script Not Running

# Check script permissions
ssh pve 'ls -la /usr/local/bin/ups-shutdown.sh'

# Should be: -rwxr-xr-x (executable)

# Check logs
ssh pve 'cat /var/log/ups-shutdown.log'

# Test script manually
ssh pve '/usr/local/bin/ups-shutdown.sh'

UPS Status Shows UNKNOWN

# Driver may not be compatible
ssh pve 'upsc cyberpower@localhost ups.status'

# Try different driver (in /etc/nut/ups.conf)
# driver = usbhid-ups
# or
# driver = blazer_usb

# Restart after change
ssh pve 'systemctl restart nut-driver nut-server'

Future Improvements

Add email alerts for UPS events (power fail, low battery)
Log runtime statistics to track battery degradation
Set up Grafana dashboard for UPS metrics
Test battery runtime at different load levels
Upgrade to 20A circuit, restore original 5-20P plug
Consider adding network management card for out-of-band UPS access

Related Documentation

POWER-MANAGEMENT.md - Overall power optimization
VMS.md - VM startup order configuration
HOMEASSISTANT.md - UPS sensor integration

Last Updated: 2025-12-22
