
Homelab Changelog

2024-12-16

Power Investigation

Investigated UPS power limit issues across both Proxmox servers.

Findings

  1. KSMD (Kernel Same-page Merging Daemon) was consuming 50-57% CPU constantly on PVE (see the sysfs check after this list)

    • sleep_millisecs set to 12ms (extremely aggressive; default is 200ms)
    • general_profit was negative (-320MB), meaning KSM was burning CPU while saving no net memory
    • No memory overcommit (98GB allocated of 128GB RAM)
    • Diverse workloads (TrueNAS, Windows, Linux) = few duplicate pages to merge
  2. GPU Power Draw identified as major consumers:

    • RTX A6000 on PVE2: up to 300W TDP
    • TITAN RTX on PVE: up to 280W TDP
    • Quadro P2000 on PVE: up to 75W TDP
  3. TrueNAS VM occasionally spiking to 86% CPU (needs investigation)
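These KSM knobs all live under sysfs; the profitability check above can be reproduced with a few reads (general_profit requires a reasonably recent kernel):

# Inspect KSM tuning and profitability
cat /sys/kernel/mm/ksm/run              # 0 = off, 1 = scanning, 2 = unmerge all pages
cat /sys/kernel/mm/ksm/sleep_millisecs  # delay between scan batches (was 12 here)
cat /sys/kernel/mm/ksm/general_profit   # bytes saved minus overhead; negative = net loss
cat /sys/kernel/mm/ksm/pages_sharing    # pages currently deduplicated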

Changes Made

  • Disabled KSMD on PVE (10.10.10.120)
    echo 0 > /sys/kernel/mm/ksm/run
    
    • Immediate result: KSMD CPU dropped from 51-57% to 0%
    • Load average dropped from 1.88 to 1.28
    • Estimated savings: ~7-10W continuous

Additional Changes

  • Made the KSMD disable persistent across reboots on both hosts
    • Note: KSM is controlled via sysfs, not sysctl
    • Created systemd service /etc/systemd/system/disable-ksm.service:
    [Unit]
    Description=Disable KSM (Kernel Same-page Merging)
    After=multi-user.target
    
    [Service]
    Type=oneshot
    ExecStart=/bin/sh -c "echo 0 > /sys/kernel/mm/ksm/run"
    RemainAfterExit=yes
    
    [Install]
    WantedBy=multi-user.target
    
    • Enabled on both PVE and PVE2: systemctl enable disable-ksm.service
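A minimal persistence check after the next reboot (assumes the service ran at boot):

systemctl status disable-ksm.service --no-pager
cat /sys/kernel/mm/ksm/run   # expect 0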

Syncthing Rescan Interval Fix

Root Cause: Syncthing on TrueNAS was rescanning 56GB of data every 60 seconds, causing constant 100% CPU usage (~3172 minutes of CPU time over 3 days).

Folders affected (changed from 60s to 3600s):

  • downloads (38GB)
  • documents (11GB)
  • desktop (7.2GB)
  • config, movies, notes, pictures

Fix applied:

# Downloaded config from TrueNAS (qm guest exec wraps output in JSON; extract the file contents)
ssh pve 'qm guest exec 100 -- cat /mnt/.ix-apps/app_mounts/syncthing/config/config/config.xml' | jq -r '."out-data"' > config.xml

# Changed all rescanIntervalS="60" to rescanIntervalS="3600"
sed -i 's/rescanIntervalS="60"/rescanIntervalS="3600"/g' config.xml

# Uploaded and restarted Syncthing
curl -X POST -H "X-API-Key: xxx" http://localhost:20910/rest/system/restart

Note: fsWatcher is enabled, so changes are detected in real-time. The rescan is just a safety net.
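The new interval can be confirmed over the same REST API (API key redacted as above; /rest/config/folders exists on current Syncthing releases):

curl -s -H "X-API-Key: xxx" http://localhost:20910/rest/config/folders \
  | grep -o '"rescanIntervalS": *[0-9]*'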

Estimated savings: ~60-80W (TrueNAS VM CPU should drop from 86% to ~5-10% at idle)

GPU Power State Investigation

GPU            VM                 Idle Power   P-State   Status
RTX A6000      trading-vm (301)   11W          P8        Optimal
TITAN RTX      lmdev1 (111)       2W           P8        Excellent!
Quadro P2000   saltbox (101)      25W          P0        Stuck due to Plex

Findings:

  • RTX A6000: Properly entering P8 (11W idle) - excellent
  • TITAN RTX: Only 2W at idle despite ComfyUI/Python processes (436MiB VRAM used)
    • Modern GPUs have much better idle power management
  • Quadro P2000: Stuck in P0 at 25W because Plex Transcoder holds GPU memory
    • Older Quadro cards don't idle as efficiently with processes attached
    • Power limit fixed at 75W (not adjustable)
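The idle power and P-state figures above can be reproduced with a standard nvidia-smi query (plain nvidia-smi also lists the processes holding the GPU, which is how the Plex Transcoder shows up):

nvidia-smi --query-gpu=name,power.draw,pstate,memory.used --format=csv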

Changes made:

  • Installed QEMU guest agent on lmdev1 (VM 111)
  • Added SSH key access to lmdev1 (10.10.10.111)
  • Updated ~/.ssh/config with lmdev1 entry

CPU Governor Optimization

Issue: Both servers were using the performance CPU governor, keeping CPUs at high frequencies (3-4GHz) even when ~99% idle.
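For reference, the driver, governor, and current frequency are all readable from sysfs (same commands on both hosts):

cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver
sort -u /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq   # current frequency in kHz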

Changes:

PVE (10.10.10.120)

  • Driver: amd-pstate-epp (modern AMD P-State with Energy Performance Preference)
  • Change: Governor performance → powersave, EPP performance → balance_power
  • Result: Idle frequencies dropped from ~4GHz to ~1.7GHz
  • Persistence: Created /etc/systemd/system/cpu-powersave.service
    [Unit]
    Description=Set CPU governor to powersave with balance_power EPP
    After=multi-user.target
    
    [Service]
    Type=oneshot
    ExecStart=/bin/bash -c 'for gov in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo powersave > "$gov"; done; for epp in /sys/devices/system/cpu/cpu*/cpufreq/energy_performance_preference; do echo balance_power > "$epp"; done'
    RemainAfterExit=yes
    
    [Install]
    WantedBy=multi-user.target
    

PVE2 (10.10.10.102)

  • Driver: acpi-cpufreq (older driver)
  • Change: Governor performance → schedutil
  • Result: Idle frequencies dropped from ~4GHz to ~2.2GHz
  • Persistence: Created /etc/systemd/system/cpu-powersave.service
    [Unit]
    Description=Set CPU governor to schedutil for power savings
    After=multi-user.target
    
    [Service]
    Type=oneshot
    ExecStart=/bin/bash -c 'for gov in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo schedutil > "$gov"; done'
    RemainAfterExit=yes
    
    [Install]
    WantedBy=multi-user.target
    

Estimated savings: 30-60W per server (60-120W total)

ksmtuned Service Disabled

Issue: ksmtuned (the KSM tuning daemon) was still running on both servers even after KSMD was disabled, consuming ~39 min of CPU on PVE and ~12 min on PVE2 over 3 days.

Fix:

systemctl stop ksmtuned
systemctl disable ksmtuned

Applied to both PVE and PVE2.
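Quick check that it stays off across reboots:

systemctl is-enabled ksmtuned   # expect "disabled"
systemctl is-active ksmtuned    # expect "inactive"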

Estimated savings: ~2-5W

HDD Spindown on PVE2

Issue: Two WD Red 6TB drives (the local-zfs2 pool) were spinning 24/7 despite the pool holding only 768KB of data. Each drive draws 5-8W while spinning.

Fix:

# Set a 30-minute spindown timeout (-S values 241-251 encode (n-240) x 30 minutes)
hdparm -S 241 /dev/sda /dev/sdb

Persistence: Created udev rule /etc/udev/rules.d/69-hdd-spindown.rules:

ACTION=="add", KERNEL=="sd[a-z]", ATTRS{model}=="WDC WD60EFRX-68L*", RUN+="/usr/sbin/hdparm -S 241 /dev/%k"

Estimated savings: ~10-16W (when drives spin down)
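Whether the drives have actually spun down can be checked at any time:

hdparm -C /dev/sda /dev/sdb   # reports "active/idle" or "standby"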

Pending Changes

  • Monitor overall power consumption after all optimizations
  • Consider PCIe ASPM optimization
  • Consider NMI watchdog disable
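If the last two are pursued, both are reachable through standard kernel interfaces; a sketch only (not yet applied):

cat /sys/module/pcie_aspm/parameters/policy   # shows available and active ASPM policies
echo 0 > /proc/sys/kernel/nmi_watchdog        # disables the NMI watchdog, freeing a perf counter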

SSH Key Setup

  • Added SSH key authentication to both Proxmox servers
  • Updated ~/.ssh/config with entries for pve and pve2

Notes

What is KSMD?

Kernel Same-page Merging Daemon - scans memory for duplicate pages across VMs and merges them. Trades CPU cycles for RAM savings. Useful when:

  • Overcommitting memory
  • Running many identical VMs

Not useful when:

  • Plenty of RAM headroom (our case)
  • Diverse workloads with few duplicate pages
  • general_profit is negative

What is Memory Ballooning?

Guest-cooperative memory management. Hypervisor can request VMs to give back unused RAM. Independent from KSMD. Both are Proxmox/KVM memory optimization features but serve different purposes.