
Homelab Changelog

2024-12-16

Power Investigation

Investigated UPS power limit issues across both Proxmox servers.

Findings

  1. KSMD (Kernel Same-page Merging Daemon) was consuming 50-57% CPU constantly on PVE (see the sysfs check after this list)

    • sleep_millisecs set to 12ms (extremely aggressive; default is 200ms)
    • general_profit was negative (-320MB), meaning KSM was burning CPU while saving no net memory
    • No memory overcommit (98GB allocated of 128GB RAM)
    • Diverse workloads (TrueNAS, Windows, Linux) = few duplicate pages to merge
  2. GPU Power Draw identified as major consumers:

    • RTX A6000 on PVE2: up to 300W TDP
    • TITAN RTX on PVE: up to 280W TDP
    • Quadro P2000 on PVE: up to 75W TDP
  3. TrueNAS VM occasionally spiking to 86% CPU (needs investigation)
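These KSM knobs all live under sysfs; the profitability check above can be reproduced with a few reads (general_profit requires a reasonably recent kernel):

# Inspect KSM tuning and profitability
cat /sys/kernel/mm/ksm/run              # 0 = off, 1 = scanning, 2 = unmerge all pages
cat /sys/kernel/mm/ksm/sleep_millisecs  # delay between scan batches (was 12 here)
cat /sys/kernel/mm/ksm/general_profit   # bytes saved minus overhead; negative = net loss
cat /sys/kernel/mm/ksm/pages_sharing    # pages currently deduplicated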

Changes Made

  • Disabled KSMD on PVE (10.10.10.120)
    echo 0 > /sys/kernel/mm/ksm/run
    
    • Immediate result: KSMD CPU dropped from 51-57% to 0%
    • Load average dropped from 1.88 to 1.28
    • Estimated savings: ~7-10W continuous

Additional Changes

  • Made the KSMD disable persistent across reboots on both hosts
    • Note: KSM is controlled via sysfs, not sysctl
    • Created systemd service /etc/systemd/system/disable-ksm.service:
    [Unit]
    Description=Disable KSM (Kernel Same-page Merging)
    After=multi-user.target
    
    [Service]
    Type=oneshot
    ExecStart=/bin/sh -c "echo 0 > /sys/kernel/mm/ksm/run"
    RemainAfterExit=yes
    
    [Install]
    WantedBy=multi-user.target
    
    • Enabled on both PVE and PVE2: systemctl enable disable-ksm.service
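A minimal persistence check after the next reboot (assumes the service ran at boot):

systemctl status disable-ksm.service --no-pager
cat /sys/kernel/mm/ksm/run   # expect 0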

Syncthing Rescan Interval Fix

Root Cause: Syncthing on TrueNAS was rescanning 56GB of data every 60 seconds, causing constant 100% CPU usage (~3172 minutes of CPU time over 3 days).

Folders affected (changed from 60s to 3600s):

  • downloads (38GB)
  • documents (11GB)
  • desktop (7.2GB)
  • config, movies, notes, pictures

Fix applied:

# Downloaded config from TrueNAS (qm guest exec wraps output in JSON; extract the file contents)
ssh pve 'qm guest exec 100 -- cat /mnt/.ix-apps/app_mounts/syncthing/config/config/config.xml' | jq -r '."out-data"' > config.xml

# Changed all rescanIntervalS="60" to rescanIntervalS="3600"
sed -i 's/rescanIntervalS="60"/rescanIntervalS="3600"/g' config.xml

# Uploaded and restarted Syncthing
curl -X POST -H "X-API-Key: xxx" http://localhost:20910/rest/system/restart

Note: fsWatcher is enabled, so changes are detected in real-time. The rescan is just a safety net.
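The new interval can be confirmed over the same REST API (API key redacted as above; /rest/config/folders exists on current Syncthing releases):

curl -s -H "X-API-Key: xxx" http://localhost:20910/rest/config/folders \
  | grep -o '"rescanIntervalS": *[0-9]*'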

Estimated savings: ~60-80W (TrueNAS VM CPU should drop from 86% to ~5-10% at idle)

GPU Power State Investigation

GPU            VM                 Idle Power   P-State   Status
RTX A6000      trading-vm (301)   11W          P8        Optimal
TITAN RTX      lmdev1 (111)       2W           P8        Excellent!
Quadro P2000   saltbox (101)      25W          P0        Stuck due to Plex

Findings:

  • RTX A6000: Properly entering P8 (11W idle) - excellent
  • TITAN RTX: Only 2W at idle despite ComfyUI/Python processes (436MiB VRAM used)
    • Modern GPUs have much better idle power management
  • Quadro P2000: Stuck in P0 at 25W because Plex Transcoder holds GPU memory
    • Older Quadro cards don't idle as efficiently with processes attached
    • Power limit fixed at 75W (not adjustable)
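The idle power and P-state figures above can be reproduced with a standard nvidia-smi query (plain nvidia-smi also lists the processes holding the GPU, which is how the Plex Transcoder shows up):

nvidia-smi --query-gpu=name,power.draw,pstate,memory.used --format=csv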

Changes made:

  • Installed QEMU guest agent on lmdev1 (VM 111)
  • Added SSH key access to lmdev1 (10.10.10.111)
  • Updated ~/.ssh/config with lmdev1 entry

CPU Governor Optimization

Issue: Both servers were using the performance CPU governor, keeping CPUs at high frequencies (3-4GHz) even when ~99% idle.
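For reference, the driver, governor, and current frequency are all readable from sysfs (same commands on both hosts):

cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver
sort -u /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq   # current frequency in kHz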

Changes:

PVE (10.10.10.120)

  • Driver: amd-pstate-epp (modern AMD P-State with Energy Performance Preference)
  • Change: Governor performance → powersave, EPP performance → balance_power
  • Result: Idle frequencies dropped from ~4GHz to ~1.7GHz
  • Persistence: Created /etc/systemd/system/cpu-powersave.service
    [Unit]
    Description=Set CPU governor to powersave with balance_power EPP
    After=multi-user.target
    
    [Service]
    Type=oneshot
    ExecStart=/bin/bash -c 'for gov in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo powersave > "$gov"; done; for epp in /sys/devices/system/cpu/cpu*/cpufreq/energy_performance_preference; do echo balance_power > "$epp"; done'
    RemainAfterExit=yes
    
    [Install]
    WantedBy=multi-user.target
    

PVE2 (10.10.10.102)

  • Driver: acpi-cpufreq (older driver)
  • Change: Governor performance → schedutil
  • Result: Idle frequencies dropped from ~4GHz to ~2.2GHz
  • Persistence: Created /etc/systemd/system/cpu-powersave.service
    [Unit]
    Description=Set CPU governor to schedutil for power savings
    After=multi-user.target
    
    [Service]
    Type=oneshot
    ExecStart=/bin/bash -c 'for gov in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo schedutil > "$gov"; done'
    RemainAfterExit=yes
    
    [Install]
    WantedBy=multi-user.target
    

Estimated savings: 30-60W per server (60-120W total)

ksmtuned Service Disabled

Issue: ksmtuned (the KSM tuning daemon) was still running on both servers even after KSMD was disabled, consuming ~39 min of CPU on PVE and ~12 min on PVE2 over 3 days.

Fix:

systemctl stop ksmtuned
systemctl disable ksmtuned

Applied to both PVE and PVE2.
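Quick check that it stays off across reboots:

systemctl is-enabled ksmtuned   # expect "disabled"
systemctl is-active ksmtuned    # expect "inactive"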

Estimated savings: ~2-5W

HDD Spindown on PVE2

Issue: Two WD Red 6TB drives (the local-zfs2 pool) were spinning 24/7 despite the pool holding only 768KB of data. Each drive draws 5-8W while spinning.

Fix:

# Set a 30-minute spindown timeout (-S values 241-251 encode (n-240) x 30 minutes)
hdparm -S 241 /dev/sda /dev/sdb

Persistence: Created udev rule /etc/udev/rules.d/69-hdd-spindown.rules:

ACTION=="add", KERNEL=="sd[a-z]", ATTRS{model}=="WDC WD60EFRX-68L*", RUN+="/usr/sbin/hdparm -S 241 /dev/%k"

Estimated savings: ~10-16W (when drives spin down)
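Whether the drives have actually spun down can be checked at any time:

hdparm -C /dev/sda /dev/sdb   # reports "active/idle" or "standby"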

Pending Changes

  • Monitor overall power consumption after all optimizations
  • Consider PCIe ASPM optimization
  • Consider NMI watchdog disable
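If the last two are pursued, both are reachable through standard kernel interfaces; a sketch only (not yet applied):

cat /sys/module/pcie_aspm/parameters/policy   # shows available and active ASPM policies
echo 0 > /proc/sys/kernel/nmi_watchdog        # disables the NMI watchdog, freeing a perf counter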

SSH Key Setup

  • Added SSH key authentication to both Proxmox servers
  • Updated ~/.ssh/config with entries for pve and pve2

Notes

What is KSMD?

Kernel Same-page Merging Daemon - scans memory for duplicate pages across VMs and merges them. Trades CPU cycles for RAM savings. Useful when:

  • Overcommitting memory
  • Running many identical VMs

Not useful when:

  • Plenty of RAM headroom (our case)
  • Diverse workloads with few duplicate pages
  • general_profit is negative

What is Memory Ballooning?

Guest-cooperative memory management. Hypervisor can request VMs to give back unused RAM. Independent from KSMD. Both are Proxmox/KVM memory optimization features but serve different purposes.