
Power Management and Optimization

Documentation of power optimizations applied to reduce idle power consumption and heat generation.

Overview

Combined estimated power draw: ~1615-1880W under full load, ~530-730W idle (see the breakdown tables below)

Through various optimizations, we've reduced idle power consumption by approximately 150-300W compared to default settings.


Power Draw Estimates

PVE (10.10.10.120)

| Component | Idle | Load | TDP |
|---|---|---|---|
| Threadripper PRO 3975WX | 150-200W | 400-500W | 280W |
| NVIDIA TITAN RTX | 2-3W | 250W | 280W |
| NVIDIA Quadro P2000 | 25W | 70W | 75W |
| RAM (128 GB DDR4) | 30-40W | 30-40W | - |
| Storage (NVMe + SSD) | 20-30W | 40-50W | - |
| HBAs, fans, misc | 20-30W | 20-30W | - |
| Total | 250-350W | 800-940W | - |

PVE2 (10.10.10.102)

| Component | Idle | Load | TDP |
|---|---|---|---|
| Threadripper PRO 3975WX | 150-200W | 400-500W | 280W |
| NVIDIA RTX A6000 | 11W | 280W | 300W |
| RAM (128 GB DDR4) | 30-40W | 30-40W | - |
| Storage (NVMe + HDD) | 20-30W | 40-50W | - |
| Fans, misc | 15-20W | 15-20W | - |
| Total | 226-330W | 765-890W | - |

Combined

| Metric | Idle | Load |
|---|---|---|
| Servers | 476-680W | 1565-1830W |
| Network gear | ~50W | ~50W |
| Total | ~530-730W | ~1615-1880W |
| UPS Load | 40-55% | 120-140% ⚠️ |

Note: UPS capacity is 1320W. Under heavy load, servers can exceed UPS capacity, which is acceptable since high load is rare.
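
To sanity-check these estimates against reality, the reported UPS load percentage can be converted to approximate watts. A minimal sketch, assuming the 1320W capacity above and the cyberpower NUT name used in the Monitoring section:

# Convert reported UPS load (%) to approximate watts (1320W capacity)
ssh pve 'upsc cyberpower@localhost ups.load' | awk '{printf "~%.0f W (%s%% of 1320W)\n", $1 * 13.2, $1}'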


Optimizations Applied

1. KSMD Disabled (2024-12-17)

KSM (Kernel Same-page Merging) scans memory to deduplicate identical pages across VMs.

Problem:

  • KSMD was consuming 44-57% CPU continuously on PVE
  • Caused CPU temp to rise from 74°C to 83°C
  • Net loss: more power was spent scanning than was saved through deduplication

Solution: Disabled KSM permanently

Configuration:

Systemd service: /etc/systemd/system/disable-ksm.service

[Unit]
Description=Disable KSM (Kernel Same-page Merging)
After=multi-user.target

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'echo 0 > /sys/kernel/mm/ksm/run'
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target

Enable and start:

systemctl daemon-reload
systemctl enable --now disable-ksm
systemctl mask ksmtuned  # Prevent re-enabling

Verify:

# KSM should be disabled (run=0)
cat /sys/kernel/mm/ksm/run  # Should output: 0

# ksmd should show 0% CPU
ps aux | grep ksmd

Savings: ~60-80W, plus a significant temperature reduction (CPU dropped back from 83°C to ~74°C)

⚠️ Important: Proxmox updates sometimes re-enable KSM. If CPU is unexpectedly hot, check:

cat /sys/kernel/mm/ksm/run
# If 1, disable it:
echo 0 > /sys/kernel/mm/ksm/run
systemctl mask ksmtuned
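
Optionally, a small cron guard can re-assert the setting after updates. A minimal sketch (the /etc/cron.d/disable-ksm path is an arbitrary choice, not part of the current setup):

# /etc/cron.d/disable-ksm: re-disable KSM every 30 minutes if an update turned it back on
*/30 * * * * root [ "$(cat /sys/kernel/mm/ksm/run)" = "0" ] || echo 0 > /sys/kernel/mm/ksm/run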

2. CPU Governor Optimization (2024-12-16)

Default CPU governor keeps cores at max frequency even when idle, wasting power.

PVE: amd-pstate-epp Driver

Driver: amd-pstate-epp (modern AMD P-state driver)
Governor: powersave
EPP: balance_power

Configuration:

Systemd service: /etc/systemd/system/cpu-powersave.service

[Unit]
Description=Set CPU governor to powersave with balance_power EPP
After=multi-user.target

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'for cpu in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo powersave > $cpu; done'
ExecStart=/bin/sh -c 'for cpu in /sys/devices/system/cpu/cpu*/cpufreq/energy_performance_preference; do echo balance_power > $cpu; done'
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target

Enable:

systemctl daemon-reload
systemctl enable --now cpu-powersave

Verify:

# Check governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# Output: powersave

# Check EPP
cat /sys/devices/system/cpu/cpu0/cpufreq/energy_performance_preference
# Output: balance_power

# Check current frequency (should be low when idle)
grep MHz /proc/cpuinfo | head -5
# Should show ~1700-2200 MHz idle, up to 4000 MHz under load

PVE2: acpi-cpufreq Driver

Driver: acpi-cpufreq (older ACPI driver)
Governor: schedutil (adaptive, better than powersave for this driver)

Configuration:

Systemd service: /etc/systemd/system/cpu-powersave.service

[Unit]
Description=Set CPU governor to schedutil
After=multi-user.target

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'for cpu in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo schedutil > $cpu; done'
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target

Enable:

systemctl daemon-reload
systemctl enable --now cpu-powersave

Verify:

cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# Output: schedutil

grep MHz /proc/cpuinfo | head -5
# Should show ~1700-2200 MHz idle

Savings: ~60-120W combined (CPUs now idle at 1.7-2.2 GHz instead of 4 GHz)

Performance impact: Minimal - CPU still boosts to max frequency under load
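
To confirm that boost still kicks in after the governor change, load a few cores and watch the frequencies climb. A minimal sketch, assuming stress-ng is installed (it is not part of a default Proxmox install):

# Load 4 cores for 30 seconds, then show the highest per-core frequencies
stress-ng --cpu 4 --timeout 30s &
sleep 10
grep MHz /proc/cpuinfo | sort -t: -k2 -rn | head -5
wait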


3. GPU Power States (2024-12-16)

GPUs automatically enter low-power states when idle. Verified optimal.

| GPU | Location | Idle Power | P-State | Notes |
|---|---|---|---|---|
| RTX A6000 | PVE2 | 11W | P8 | Excellent idle power |
| TITAN RTX | PVE | 2-3W | P8 | Excellent idle power |
| Quadro P2000 | PVE | 25W | P0 | Plex keeps it active |

Check GPU power state:

# Via nvidia-smi (if installed in VM)
ssh lmdev1 'nvidia-smi --query-gpu=name,power.draw,pstate --format=csv'

# Expected output:
# name, power.draw [W], pstate
# NVIDIA TITAN RTX, 2.50 W, P8

# Via lspci (from Proxmox host - shows link speed, not power)
ssh pve 'lspci | grep -i nvidia'

P-States:

  • P0: Maximum performance
  • P8: Minimum power (idle)

No action needed - GPUs automatically manage power states.
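
To watch the automatic transitions, the P-state can be polled while a GPU workload starts and stops. A minimal sketch using nvidia-smi's loop flag, run inside the VM that owns the GPU (e.g. lmdev1):

# Poll P-state and power draw every 2 seconds (Ctrl+C to stop)
nvidia-smi --query-gpu=name,pstate,power.draw --format=csv -l 2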

Savings: N/A (already optimal)


4. Syncthing Rescan Intervals (2024-12-16)

Aggressive 60-second rescans were keeping TrueNAS VM at 86% CPU constantly.

Changed:

  • Large folders: 60s → 3600s (1 hour)
  • Affected: downloads (38GB), documents (11GB), desktop (7.2GB), movies, pictures, notes, config

Configuration: Via Syncthing UI on each device

  • Settings → Folders → [Folder Name] → Advanced → Rescan Interval

Savings: ~60-80W (TrueNAS CPU usage dropped from 86% to <10%)

Trade-off: Changes take up to 1 hour to detect instead of 1 minute

  • Still acceptable for most use cases
  • Manual rescan available if needed: curl -X POST "http://localhost:8384/rest/db/scan?folder=FOLDER" -H "X-API-Key: API_KEY"
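
The same interval change can also be scripted against Syncthing's config REST API instead of the UI. A minimal sketch, assuming Syncthing 1.12+ (which exposes /rest/config) and the same FOLDER/API_KEY placeholders as above:

# Set the rescan interval to 1 hour for one folder via the REST API
curl -X PATCH "http://localhost:8384/rest/config/folders/FOLDER" \
  -H "X-API-Key: API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"rescanIntervalS": 3600}'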

5. ksmtuned Disabled (2024-12-16)

ksmtuned is the daemon that tunes KSM parameters. Even with KSM disabled, the tuning daemon was still running.

Solution: Stopped and disabled on both servers

systemctl stop ksmtuned
systemctl disable ksmtuned
systemctl mask ksmtuned  # Prevent re-enabling

Savings: ~2-5W


6. HDD Spindown on PVE2 (2024-12-16)

Problem: The local-zfs2 pool (2x WD Red 6TB HDDs) had only 768 KB in use, yet the drives were spinning 24/7

Solution: Configure 30-minute spindown timeout

Udev rule: /etc/udev/rules.d/69-hdd-spindown.rules

# Spin down WD Red 6TB drives after 30 minutes idle
ACTION=="add|change", KERNEL=="sd[a-z]", ATTRS{model}=="WDC WD60EFRX-68L*", RUN+="/sbin/hdparm -S 241 /dev/%k"

hdparm value: 241 = 30 minutes

  • Values 1-240: timeout = value × 5 seconds
  • Values 241-251: timeout = (value − 240) × 30 minutes, so 241 = 1 × 30 minutes = 30 minutes

Apply rule:

udevadm control --reload-rules
udevadm trigger

# Verify drives have spindown set
hdparm -I /dev/sda | grep -i standby
hdparm -I /dev/sdb | grep -i standby

Check if drives are spun down:

hdparm -C /dev/sda
# Output: drive state is:  standby  (spun down)
# or:     drive state is:  active/idle  (spinning)

Savings: ~10-16W when spun down (~5-8W per drive)

Trade-off: 5-10 second delay when accessing pool after spindown


Potential Optimizations (Not Yet Applied)

PCIe ASPM (Active State Power Management)

Benefit: Reduce power of idle PCIe devices
Risk: May cause stability issues with some devices
Estimated savings: 5-15W

Test:

# Check current ASPM state
lspci -vv | grep -i aspm

# Enable ASPM (test first)
# Add to kernel cmdline: pcie_aspm=force
# Edit /etc/default/grub:
GRUB_CMDLINE_LINUX_DEFAULT="quiet pcie_aspm=force"

# Update grub
update-grub
reboot
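
After the reboot, the kernel's ASPM handling can be confirmed before trusting the change. A minimal check (the policy file is provided by the pcie_aspm module; the active policy is shown in brackets):

# Confirm ASPM policy and look for ASPM-related messages
cat /sys/module/pcie_aspm/parameters/policy
dmesg | grep -i aspm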

NMI Watchdog Disable

Benefit: Reduce CPU wakeups
Risk: Harder to debug kernel hangs
Estimated savings: 1-3W

Test:

# Disable NMI watchdog
echo 0 > /proc/sys/kernel/nmi_watchdog

# Make permanent (add to kernel cmdline)
# Edit /etc/default/grub:
GRUB_CMDLINE_LINUX_DEFAULT="quiet nmi_watchdog=0"

update-grub
reboot
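
Verification after the reboot (0 means the watchdog is disabled):

cat /proc/sys/kernel/nmi_watchdog
# Expected output: 0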

Monitoring

CPU Frequency

# Current frequency on all cores
ssh pve 'grep MHz /proc/cpuinfo | head -10'

# Governor
ssh pve 'cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor'

# Available governors
ssh pve 'cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors'

CPU Temperature

# PVE
ssh pve 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE Tctl: $(($(cat $f)/1000))°C"; fi; done'

# PVE2
ssh pve2 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE2 Tctl: $(($(cat $f)/1000))°C"; fi; done'

Healthy temps: 70-80°C under load
Warning: >85°C
Throttle: 90°C (Tctl max for Threadripper PRO)
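
These thresholds can feed a very simple alert. A minimal sketch built on the Tctl read-out above, intended to run from cron on each host (the 85°C threshold matches the warning level listed here):

# Log a syslog warning if Tctl exceeds 85°C
for f in /sys/class/hwmon/hwmon*/temp*_input; do
  [ "$(cat ${f%_input}_label 2>/dev/null)" = "Tctl" ] || continue
  t=$(( $(cat $f) / 1000 ))
  [ "$t" -gt 85 ] && logger -p user.warning "CPU Tctl is ${t}°C (threshold 85°C)"
done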

GPU Power Draw

# If nvidia-smi installed in VM
ssh lmdev1 'nvidia-smi --query-gpu=name,power.draw,power.limit,pstate --format=csv'

# Sample output:
# name, power.draw [W], power.limit [W], pstate
# NVIDIA TITAN RTX, 2.50 W, 280.00 W, P8

Power Consumption (UPS)

# Check UPS load percentage
ssh pve 'upsc cyberpower@localhost ups.load'

# Battery runtime (seconds)
ssh pve 'upsc cyberpower@localhost battery.runtime'

# Full UPS status
ssh pve 'upsc cyberpower@localhost'

See UPS.md for more UPS monitoring details.

ZFS ARC Memory Usage

# PVE
ssh pve 'arc_summary | grep -A5 "ARC size"'

# TrueNAS
ssh truenas 'arc_summary | grep -A5 "ARC size"'

ARC (Adaptive Replacement Cache) uses RAM for ZFS caching. Adjust if needed:

# Limit ARC to 32 GB (example)
# Edit /etc/modprobe.d/zfs.conf:
options zfs zfs_arc_max=34359738368

# Apply (reboot required)
update-initramfs -u
reboot
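
The current limit can also be inspected (and, on OpenZFS, usually adjusted) at runtime via the module parameter, though the zfs.conf entry is still needed to survive reboots:

# Show the active ARC size limit in bytes (0 means the built-in default)
cat /sys/module/zfs/parameters/zfs_arc_max

# Optionally apply the new limit immediately without rebooting
echo 34359738368 > /sys/module/zfs/parameters/zfs_arc_max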

Troubleshooting

CPU Not Downclocking

# Check current governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

# Should be: powersave (PVE) or schedutil (PVE2)
# If not, systemd service may have failed

# Check service status
systemctl status cpu-powersave

# Manually set governor (temporary)
echo powersave | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

# Check frequency
grep MHz /proc/cpuinfo | head -5

High Idle Power After Update

Common causes (a combined check sketch follows this list):

  1. KSM re-enabled after Proxmox update

    • Check: cat /sys/kernel/mm/ksm/run
    • Fix: echo 0 > /sys/kernel/mm/ksm/run && systemctl mask ksmtuned
  2. CPU governor reset to default

    • Check: cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
    • Fix: systemctl restart cpu-powersave
  3. GPU stuck in high-performance mode

    • Check: nvidia-smi --query-gpu=pstate --format=csv
    • Fix: Restart VM or power cycle GPU
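
The three checks above can be combined into one quick audit. A minimal sketch, assuming the pve, pve2, and lmdev1 SSH aliases used elsewhere in this document:

# Post-update power audit: KSM state, CPU governor, GPU P-state
for host in pve pve2; do
  echo "== $host =="
  ssh $host 'echo "ksm run:  $(cat /sys/kernel/mm/ksm/run)"; echo "governor: $(cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor)"'
done
ssh lmdev1 'nvidia-smi --query-gpu=name,pstate --format=csv,noheader'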

HDDs Won't Spin Down

# Check spindown setting
hdparm -I /dev/sda | grep -i standby

# Set spindown manually (temporary)
hdparm -S 241 /dev/sda

# Check if drive is idle (ZFS may keep it active)
zpool iostat -v 1 5  # Watch for activity

# Check what's accessing the drive
lsof | grep /mnt/pool

Power Optimization Summary

| Optimization | Savings | Applied | Notes |
|---|---|---|---|
| KSMD disabled | 60-80W | ✅ | Also reduces CPU temp significantly |
| CPU governor | 60-120W | ✅ | PVE: powersave+balance_power, PVE2: schedutil |
| GPU power states | 0W | ✅ | Already optimal (automatic) |
| Syncthing rescans | 60-80W | ✅ | Reduced TrueNAS CPU usage |
| ksmtuned disabled | 2-5W | ✅ | Minor but easy win |
| HDD spindown | 10-16W | ✅ | Only when drives idle |
| PCIe ASPM | 5-15W | ❌ | Not yet tested |
| NMI watchdog | 1-3W | ❌ | Not yet tested |
| Total savings | ~150-300W | - | Significant reduction |

Related Documentation

  • UPS.md - UPS capacity and power monitoring
  • STORAGE.md - HDD spindown configuration
  • VMS.md - VM resource allocation

Last Updated: 2025-12-22