# Homelab Changelog

## 2024-12-16

### Power Investigation
Investigated UPS power limit issues across both Proxmox servers.

#### Findings
1. **KSMD (Kernel Same-page Merging Daemon)** was consuming 50-57% CPU constantly on PVE
   - `sleep_millisecs` set to 12ms (aggressive; the kernel's own default is 20ms)
   - `general_profit` was **negative** (-320MB), meaning KSM's bookkeeping cost more memory than merging saved, so the CPU time was wasted
   - No memory overcommit situation (98GB allocated on 128GB RAM)
   - Diverse workloads (TrueNAS, Windows, Linux) = few duplicate pages to merge

2. **GPUs** identified as major power consumers:
   - RTX A6000 on PVE2: up to 300W TDP
   - TITAN RTX on PVE: up to 280W TDP
   - Quadro P2000 on PVE: up to 75W TDP

3. **TrueNAS VM** occasionally spiking to 86% CPU (needs investigation)
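
The KSM figures in finding 1 are read from files under `/sys/kernel/mm/ksm`. The following is a minimal sketch of how they could be collected, not the exact commands used during the investigation. The function names `bytes_to_mib` and `ksm_report` are hypothetical, and the sketch assumes `general_profit` is reported in bytes and that the -320MB figure is in MiB.

```bash
# Hypothetical helper: convert a byte count (the unit assumed for general_profit) to whole MiB.
bytes_to_mib() {
  echo $(( $1 / 1048576 ))
}

# Hypothetical helper: print selected KSM counters from a sysfs-style directory.
# The optional directory argument exists only so the function can be tried on a copy.
ksm_report() {
  dir="${1:-/sys/kernel/mm/ksm}"
  for f in run sleep_millisecs pages_shared pages_sharing general_profit; do
    if [ -r "$dir/$f" ]; then
      printf '%s=%s\n' "$f" "$(cat "$dir/$f")"
    fi
  done
}

bytes_to_mib -335544320   # prints -320
```

Running `ksm_report` with no argument on a Proxmox host should print one `name=value` line per counter that exists on that kernel.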

#### Changes Made
- [x] **Disabled KSMD on PVE** (10.10.10.120)
  ```bash
  echo 0 > /sys/kernel/mm/ksm/run
  ```
  - Immediate result: KSMD CPU dropped from 51-57% to 0%
  - Load average dropped from 1.88 to 1.28
  - Estimated savings: ~7-10W continuous
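
The value written to `/sys/kernel/mm/ksm/run` has three documented meanings, and `0` stops the scanner without undoing merges that already happened. The sketch below decodes the value; `ksm_run_meaning` is a hypothetical name used only for illustration.

```bash
# Hypothetical helper: describe a value read from /sys/kernel/mm/ksm/run.
ksm_run_meaning() {
  case "$1" in
    0) echo "stopped, merged pages kept" ;;
    1) echo "running" ;;
    2) echo "stopped, all pages unmerged" ;;
    *) echo "unknown" ;;
  esac
}

# On a real host the argument would come from: cat /sys/kernel/mm/ksm/run
ksm_run_meaning 0   # prints: stopped, merged pages kept
```

If the already-merged pages should also be split again, writing `2` instead of `0` is the documented way to do that; this changelog used `0`.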

#### Additional Changes
- [x] **Made KSMD disable persistent on both hosts**
  - Note: KSM is controlled via sysfs, not sysctl
  - Created systemd service `/etc/systemd/system/disable-ksm.service`:
    ```ini
    [Unit]
    Description=Disable KSM (Kernel Same-page Merging)
    After=multi-user.target

    [Service]
    Type=oneshot
    ExecStart=/bin/sh -c "echo 0 > /sys/kernel/mm/ksm/run"
    RemainAfterExit=yes

    [Install]
    WantedBy=multi-user.target
    ```
  - Enabled on both PVE and PVE2: `systemctl enable disable-ksm.service`

### Syncthing Rescan Interval Fix
**Root Cause**: Syncthing on TrueNAS was rescanning 56GB of data every 60 seconds, causing constant 100% CPU usage (~3172 minutes CPU time in 3 days).
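
As a rough cross-check of the CPU time figure, the accumulated minutes can be turned into an average load. This is a sketch under two assumptions that the log does not state: the 3 days are full 24-hour days, and the percentage is relative to a single core. `avg_cpu_percent` is a hypothetical helper name.

```bash
# Hypothetical helper: average CPU utilisation (whole percent of one core)
# from accumulated CPU minutes over a number of wall-clock days.
avg_cpu_percent() {
  # $1 = CPU minutes consumed, $2 = wall-clock days (1440 minutes each)
  echo $(( $1 * 100 / ($2 * 1440) ))
}

avg_cpu_percent 3172 3   # prints 73
```

Under those assumptions the process averaged roughly three quarters of one core over the whole period.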

**Folders affected** (changed from 60s to 3600s):
- downloads (38GB)
- documents (11GB)
- desktop (7.2GB)
- config, movies, notes, pictures

**Fix applied**:
```bash
# Downloaded config from TrueNAS
ssh pve 'qm guest exec 100 -- cat /mnt/.ix-apps/app_mounts/syncthing/config/config/config.xml'

# Changed all rescanIntervalS="60" to rescanIntervalS="3600"
sed -i 's/rescanIntervalS="60"/rescanIntervalS="3600"/g' config.xml

# Uploaded and restarted Syncthing
curl -X POST -H "X-API-Key: xxx" http://localhost:20910/rest/system/restart
```
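
Before uploading the edited file, it can help to confirm that no folder still carries the old interval. The sketch below counts occurrences of a given `rescanIntervalS` value in a local copy of `config.xml`; `count_rescan` is a hypothetical helper name, and the check was not part of the recorded fix.

```bash
# Hypothetical helper: count how many times a given rescan interval appears.
# $1 = path to a local config.xml, $2 = interval in seconds
count_rescan() {
  grep -o "rescanIntervalS=\"$2\"" "$1" | wc -l | tr -d ' '
}

# Intended use after the sed step:
#   count_rescan config.xml 60     # should print 0
#   count_rescan config.xml 3600   # should print the number of folders
```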

**Note**: fsWatcher is enabled, so changes are detected in real time. The rescan is just a safety net.

**Estimated savings**: ~60-80W (TrueNAS VM CPU expected to drop from 86% to ~5-10% at idle)

### GPU Power State Investigation

| GPU | VM | Idle Power | P-State | Status |
|-----|-----|-----------|---------|--------|
| RTX A6000 | trading-vm (301) | **11W** | P8 | Optimal |
| TITAN RTX | lmdev1 (111) | **2W** | P8 | Excellent! |
| Quadro P2000 | saltbox (101) | **25W** | P0 | Stuck due to Plex |

**Findings**:
- RTX A6000: Properly entering P8 (11W idle) - excellent
- TITAN RTX: Only 2W at idle despite ComfyUI/Python processes (436MiB VRAM used)
  - Modern GPUs have much better idle power management
- Quadro P2000: Stuck in P0 at 25W because Plex Transcoder holds GPU memory
  - Older Quadro cards don't idle as efficiently with processes attached
  - Power limit fixed at 75W (not adjustable)
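
Readings like those in the table can be pulled with `nvidia-smi` inside each VM. The sketch below filters its CSV output for GPUs that are not in the P8 idle state; `gpus_not_idle` is a hypothetical helper, and the sample lines in the usage comment are illustrative rather than captured output.

```bash
# Hypothetical helper: read "name, pstate, power" CSV lines on stdin and
# print every GPU that is not in the P8 idle state.
gpus_not_idle() {
  awk -F', ' '$2 != "P8" { print $1 " is in " $2 " drawing " $3 }'
}

# Intended use inside a VM with the NVIDIA driver installed:
#   nvidia-smi --query-gpu=name,pstate,power.draw --format=csv,noheader | gpus_not_idle
```

An empty result means every GPU visible to that VM has dropped to P8.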

**Changes made**:
- [x] Installed QEMU guest agent on lmdev1 (VM 111)
- [x] Added SSH key access to lmdev1 (10.10.10.111)
- [x] Updated ~/.ssh/config with lmdev1 entry

### CPU Governor Optimization

**Issue**: Both servers were using the `performance` CPU governor, keeping CPUs at high frequencies (3-4GHz) even when 99% idle.

**Changes**:

#### PVE (10.10.10.120)
- **Driver**: `amd-pstate-epp` (modern AMD P-State with Energy Performance Preference)
- **Change**: Governor `performance` → `powersave`, EPP `performance` → `balance_power`
- **Result**: Idle frequencies dropped from ~4GHz to ~1.7GHz
- **Persistence**: Created `/etc/systemd/system/cpu-powersave.service`
  ```ini
  [Unit]
  Description=Set CPU governor to powersave with balance_power EPP
  After=multi-user.target

  [Service]
  Type=oneshot
  ExecStart=/bin/bash -c 'for gov in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo powersave > "$gov"; done; for epp in /sys/devices/system/cpu/cpu*/cpufreq/energy_performance_preference; do echo balance_power > "$epp"; done'
  RemainAfterExit=yes

  [Install]
  WantedBy=multi-user.target
  ```

#### PVE2 (10.10.10.102)
- **Driver**: `acpi-cpufreq` (older driver)
- **Change**: Governor `performance` → `schedutil`
- **Result**: Idle frequencies dropped from ~4GHz to ~2.2GHz
- **Persistence**: Created `/etc/systemd/system/cpu-powersave.service`
  ```ini
  [Unit]
  Description=Set CPU governor to schedutil for power savings
  After=multi-user.target

  [Service]
  Type=oneshot
  ExecStart=/bin/bash -c 'for gov in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo schedutil > "$gov"; done'
  RemainAfterExit=yes

  [Install]
  WantedBy=multi-user.target
  ```

**Estimated savings**: 30-60W per server (60-120W total)
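
After a reboot, the governor change can be verified by tallying what each core reports. This is a minimal sketch; `governor_summary` is a hypothetical helper, and its optional argument exists only so it can be pointed at a test directory instead of the real sysfs tree.

```bash
# Hypothetical helper: print "governor=count" for every governor in use.
# $1 = cpu sysfs root (defaults to the real one)
governor_summary() {
  root="${1:-/sys/devices/system/cpu}"
  cat "$root"/cpu*/cpufreq/scaling_governor 2>/dev/null \
    | sort | uniq -c | awk '{ print $2 "=" $1 }'
}

# On PVE a single line naming powersave would be expected, on PVE2 one naming schedutil.
governor_summary
```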

### ksmtuned Service Disabled

**Issue**: The `ksmtuned` service (KSM tuning daemon) was still running on both servers even after KSMD was disabled. It had consumed ~39 min of CPU time on PVE and ~12 min on PVE2 over 3 days.

**Fix**:
```bash
systemctl stop ksmtuned
systemctl disable ksmtuned
```

Applied to both PVE and PVE2.

**Estimated savings**: ~2-5W

### HDD Spindown on PVE2

**Issue**: Two WD Red 6TB drives (local-zfs2 pool) were spinning 24/7 despite the pool having only 768KB used. Each drive draws 5-8W while spinning.

**Fix**:
```bash
# Set 30-minute spindown timeout
hdparm -S 241 /dev/sda /dev/sdb
```

**Persistence**: Created udev rule `/etc/udev/rules.d/69-hdd-spindown.rules`:
```
ACTION=="add", KERNEL=="sd[a-z]", ATTRS{model}=="WDC WD60EFRX-68L*", RUN+="/usr/sbin/hdparm -S 241 /dev/%k"
```

**Estimated savings**: ~10-16W (when drives spin down)
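
The `-S 241` value is not a number of minutes; `hdparm` encodes the standby timeout in two ranges. The sketch below decodes the common ranges to seconds. `hdparm_S_to_seconds` is a hypothetical helper, and the special values 252-255 are deliberately left unhandled.

```bash
# Hypothetical helper: decode an hdparm -S value into seconds.
hdparm_S_to_seconds() {
  n="$1"
  if [ "$n" -eq 0 ]; then
    echo 0                          # 0 disables the standby timer
  elif [ "$n" -le 240 ]; then
    echo $(( n * 5 ))               # 1-240: units of 5 seconds
  elif [ "$n" -le 251 ]; then
    echo $(( (n - 240) * 1800 ))    # 241-251: units of 30 minutes
  else
    echo "unsupported"              # 252-255 are special cases, not covered here
  fi
}

hdparm_S_to_seconds 241   # prints 1800 (30 minutes)
```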

#### Pending Changes
- [ ] Monitor overall power consumption after all optimizations
- [ ] Consider PCIe ASPM optimization
- [ ] Consider NMI watchdog disable
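
As a baseline for the monitoring item, the per-change estimates in this entry can be added up. This is only a sketch: it treats the estimates as independent and additive, which the log does not claim, and it uses the 60-120W total for the governor change. `sum_ranges` is a hypothetical helper name.

```bash
# Hypothetical helper: sum "low high" watt estimates, one pair per line on stdin.
sum_ranges() {
  awk '{ lo += $1; hi += $2 } END { print lo " " hi }'
}

# KSMD, Syncthing, CPU governors (both servers), ksmtuned, HDD spindown
printf '%s\n' '7 10' '60 80' '60 120' '2 5' '10 16' | sum_ranges   # prints 139 231
```

A measured drop at the UPS well below the low end would suggest that some of the estimates overlap.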

### SSH Key Setup
- Added SSH key authentication to both Proxmox servers
- Updated `~/.ssh/config` with entries for `pve` and `pve2`

---

## Notes

### What is KSMD?
Kernel Same-page Merging Daemon - scans memory for duplicate pages across VMs and merges them. Trades CPU cycles for RAM savings. Useful when:
- Overcommitting memory
- Running many identical VMs

Not useful when:
- Plenty of RAM headroom (our case)
- Diverse workloads with few duplicate pages
- `general_profit` is negative

### What is Memory Ballooning?
Guest-cooperative memory management. The hypervisor can ask VMs to give back unused RAM. Independent of KSMD. Both are Proxmox/KVM memory optimization features but serve different purposes.