# Homelab Changelog

## 2024-12-16

### Power Investigation

Investigated UPS power limit issues across both Proxmox servers.

#### Findings

1. **KSMD (Kernel Same-page Merging Daemon)** was consuming 50-57% CPU constantly on PVE
   - `sleep_millisecs` set to 12ms (extremely aggressive, default is 200ms)
   - `general_profit` was **negative** (-320MB) meaning it was wasting CPU
   - No memory overcommit situation (98GB allocated on 128GB RAM)
   - Diverse workloads (TrueNAS, Windows, Linux) = few duplicate pages to merge

2. **GPU power draw** identified as a major consumer:
   - RTX A6000 on PVE2: up to 300W TDP
   - TITAN RTX on PVE: up to 280W TDP
   - Quadro P2000 on PVE: up to 75W TDP

3. **TrueNAS VM** occasionally spiking to 86% CPU (needs investigation)

#### Changes Made

- [x] **Disabled KSMD on PVE** (10.10.10.120)
  ```bash
  echo 0 > /sys/kernel/mm/ksm/run
  ```
  - Immediate result: KSMD CPU dropped from 51-57% to 0%
  - Load average dropped from 1.88 to 1.28
  - Estimated savings: ~7-10W continuous

#### Additional Changes

- [x] **Made KSMD disable persistent on both hosts**
  - Note: KSM is controlled via sysfs, not sysctl
  - Created systemd service `/etc/systemd/system/disable-ksm.service`:
    ```ini
    [Unit]
    Description=Disable KSM (Kernel Same-page Merging)
    After=multi-user.target

    [Service]
    Type=oneshot
    ExecStart=/bin/sh -c "echo 0 > /sys/kernel/mm/ksm/run"
    RemainAfterExit=yes

    [Install]
    WantedBy=multi-user.target
    ```
  - Enabled on both PVE and PVE2: `systemctl enable disable-ksm.service`

### Syncthing Rescan Interval Fix

**Root Cause**: Syncthing on TrueNAS was rescanning 56GB of data every 60 seconds, causing constant 100% CPU usage (~3172 minutes CPU time in 3 days).

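
One way to see how many folders sit on a short rescan interval is to tally the `rescanIntervalS` values in a local copy of `config.xml`. This is a minimal sketch, assuming the attribute format quoted in this entry; the function name `rescan_summary` is hypothetical and not part of Syncthing.

```bash
#!/usr/bin/env bash
# Hypothetical helper: count folders per rescan interval in a Syncthing config.
# Assumes a local copy of config.xml with rescanIntervalS="N" attributes.
rescan_summary() {
    local config="${1:-config.xml}"
    # Pull out each attribute, keep only the digits, then tally identical values.
    grep -o 'rescanIntervalS="[0-9]*"' "$config" \
        | tr -cd '0-9\n' \
        | sort -n \
        | uniq -c
}
```

Running `rescan_summary config.xml` prints one line per distinct interval, with the number of folders using it in the first column.
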
**Folders affected** (changed from 60s to 3600s):
- downloads (38GB)
- documents (11GB)
- desktop (7.2GB)
- config, movies, notes, pictures

**Fix applied**:
```bash
# Downloaded config from TrueNAS
ssh pve 'qm guest exec 100 -- cat /mnt/.ix-apps/app_mounts/syncthing/config/config/config.xml'

# Changed all rescanIntervalS="60" to rescanIntervalS="3600"
sed -i 's/rescanIntervalS="60"/rescanIntervalS="3600"/g' config.xml

# Uploaded and restarted Syncthing
curl -X POST -H "X-API-Key: xxx" http://localhost:20910/rest/system/restart
```

**Note**: fsWatcher is enabled, so changes are detected in real-time. The rescan is just a safety net.

**Estimated savings**: ~60-80W (TrueNAS VM CPU will drop from 86% to ~5-10% at idle)

### GPU Power State Investigation

| GPU | VM | Idle Power | P-State | Status |
|-----|-----|-----------|---------|--------|
| RTX A6000 | trading-vm (301) | **11W** | P8 | Optimal |
| TITAN RTX | lmdev1 (111) | **2W** | P8 | Excellent! |
| Quadro P2000 | saltbox (101) | **25W** | P0 | Stuck due to Plex |

**Findings**:
- RTX A6000: Properly entering P8 (11W idle) - excellent
- TITAN RTX: Only 2W at idle despite ComfyUI/Python processes (436MiB VRAM used)
  - Modern GPUs have much better idle power management
- Quadro P2000: Stuck in P0 at 25W because Plex Transcoder holds GPU memory
  - Older Quadro cards don't idle as efficiently with processes attached
  - Power limit fixed at 75W (not adjustable)

**Changes made**:
- [x] Installed QEMU guest agent on lmdev1 (VM 111)
- [x] Added SSH key access to lmdev1 (10.10.10.111)
- [x] Updated ~/.ssh/config with lmdev1 entry

### CPU Governor Optimization

**Issue**: Both servers using `performance` CPU governor, keeping CPUs at high frequencies (3-4GHz) even when 99% idle.

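
A quick way to confirm which governor is active and how fast the cores are running is to read the cpufreq files under sysfs. The sketch below assumes the standard `/sys/devices/system/cpu/cpu*/cpufreq/` layout; the function names `governor_summary` and `avg_freq_mhz` are hypothetical, and the optional argument exists only so the functions can be pointed at a different directory.

```bash
#!/usr/bin/env bash
# Hypothetical helper: tally the scaling governor in use across all CPUs.
# The optional argument overrides the sysfs root (assumed default shown).
governor_summary() {
    local root="${1:-/sys/devices/system/cpu}"
    cat "$root"/cpu[0-9]*/cpufreq/scaling_governor | sort | uniq -c
}

# Hypothetical helper: average current frequency across CPUs, in MHz.
# scaling_cur_freq is reported in kHz, hence the division by 1000.
avg_freq_mhz() {
    local root="${1:-/sys/devices/system/cpu}"
    cat "$root"/cpu[0-9]*/cpufreq/scaling_cur_freq \
        | awk '{ sum += $1 } END { if (NR) printf "%d\n", sum / NR / 1000 }'
}
```

Running both before and after a governor change gives a rough before/after comparison of idle frequencies.
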
**Changes**:

#### PVE (10.10.10.120)

- **Driver**: `amd-pstate-epp` (modern AMD P-State with Energy Performance Preference)
- **Change**: Governor `performance` → `powersave`, EPP `performance` → `balance_power`
- **Result**: Idle frequencies dropped from ~4GHz to ~1.7GHz
- **Persistence**: Created `/etc/systemd/system/cpu-powersave.service`
  ```ini
  [Unit]
  Description=Set CPU governor to powersave with balance_power EPP
  After=multi-user.target

  [Service]
  Type=oneshot
  ExecStart=/bin/bash -c 'for gov in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo powersave > "$gov"; done; for epp in /sys/devices/system/cpu/cpu*/cpufreq/energy_performance_preference; do echo balance_power > "$epp"; done'
  RemainAfterExit=yes

  [Install]
  WantedBy=multi-user.target
  ```

#### PVE2 (10.10.10.102)

- **Driver**: `acpi-cpufreq` (older driver)
- **Change**: Governor `performance` → `schedutil`
- **Result**: Idle frequencies dropped from ~4GHz to ~2.2GHz
- **Persistence**: Created `/etc/systemd/system/cpu-powersave.service`
  ```ini
  [Unit]
  Description=Set CPU governor to schedutil for power savings
  After=multi-user.target

  [Service]
  Type=oneshot
  ExecStart=/bin/bash -c 'for gov in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo schedutil > "$gov"; done'
  RemainAfterExit=yes

  [Install]
  WantedBy=multi-user.target
  ```

**Estimated savings**: 30-60W per server (60-120W total)

### ksmtuned Service Disabled

**Issue**: The `ksmtuned` (KSM tuning daemon) was still running on both servers even after KSMD was disabled. Consuming ~39 min CPU on PVE and ~12 min CPU on PVE2 over 3 days.

**Fix**:
```bash
systemctl stop ksmtuned
systemctl disable ksmtuned
```

Applied to both PVE and PVE2.

**Estimated savings**: ~2-5W

### HDD Spindown on PVE2

**Issue**: Two WD Red 6TB drives (local-zfs2 pool) spinning 24/7 despite pool having only 768KB used. Each drive uses 5-8W spinning.

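
For context on spindown timeouts: `hdparm -S` takes an encoded value, not a number of minutes. The sketch below decodes that value following the encoding described in the hdparm man page (1-240 are multiples of 5 seconds, 241-251 are units of 30 minutes). The function name `hdparm_s_to_seconds` is hypothetical and only illustrates the arithmetic.

```bash
#!/usr/bin/env bash
# Hypothetical helper: decode an `hdparm -S` value into a timeout in seconds.
# Encoding taken from the hdparm man page; not part of hdparm itself.
hdparm_s_to_seconds() {
    local v="$1"
    if   (( v == 0 )); then
        echo "disabled"                      # 0 turns the standby timer off
    elif (( v >= 1 && v <= 240 )); then
        echo $(( v * 5 ))                    # multiples of 5 seconds
    elif (( v >= 241 && v <= 251 )); then
        echo $(( (v - 240) * 30 * 60 ))      # 1 to 11 units of 30 minutes
    elif (( v == 252 )); then
        echo $(( 21 * 60 ))                  # 21 minutes
    elif (( v == 255 )); then
        echo $(( 21 * 60 + 15 ))             # 21 minutes plus 15 seconds
    else
        echo "vendor-defined or reserved"    # 253 and 254
    fi
}
```

For example, `hdparm_s_to_seconds 241` yields 1800 seconds, which is a 30-minute timeout.
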
**Fix**:
```bash
# Set 30-minute spindown timeout
hdparm -S 241 /dev/sda /dev/sdb
```

**Persistence**: Created udev rule `/etc/udev/rules.d/69-hdd-spindown.rules`:
```
ACTION=="add", KERNEL=="sd[a-z]", ATTRS{model}=="WDC WD60EFRX-68L*", RUN+="/usr/sbin/hdparm -S 241 /dev/%k"
```

**Estimated savings**: ~10-16W (when drives spin down)

#### Pending Changes

- [ ] Monitor overall power consumption after all optimizations
- [ ] Consider PCIe ASPM optimization
- [ ] Consider NMI watchdog disable

### SSH Key Setup

- Added SSH key authentication to both Proxmox servers
- Updated `~/.ssh/config` with entries for `pve` and `pve2`

---

## Notes

### What is KSMD?

Kernel Same-page Merging Daemon - scans memory for duplicate pages across VMs and merges them. Trades CPU cycles for RAM savings.

Useful when:
- Overcommitting memory
- Running many identical VMs

Not useful when:
- Plenty of RAM headroom (our case)
- Diverse workloads with few duplicate pages
- `general_profit` is negative

### What is Memory Ballooning?

Guest-cooperative memory management. Hypervisor can request VMs to give back unused RAM. Independent from KSMD. Both are Proxmox/KVM memory optimization features but serve different purposes.
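
To apply the `general_profit` criterion from the KSMD note on a given host, the KSM counters can be read straight from sysfs. This is a minimal sketch, assuming the files live under `/sys/kernel/mm/ksm` and that `general_profit` is reported in bytes; the function name `ksm_report` is hypothetical, and older kernels may not expose every file.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print key KSM counters and a rough verdict.
# The optional argument overrides the sysfs directory (assumed default shown).
ksm_report() {
    local dir="${1:-/sys/kernel/mm/ksm}"
    local name
    for name in run sleep_millisecs pages_sharing general_profit; do
        # Skip files this kernel does not provide instead of failing.
        if [ -r "$dir/$name" ]; then
            printf '%s=%s\n' "$name" "$(cat "$dir/$name")"
        fi
    done
    # A negative general_profit suggests KSM costs more than it saves.
    if [ -r "$dir/general_profit" ] && [ "$(cat "$dir/general_profit")" -lt 0 ]; then
        echo "verdict: KSM is currently costing more than it saves"
    fi
}
```

Calling `ksm_report` with no argument reads the live values; a negative `general_profit` together with `run=1` matches the situation described in the findings.
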