
# Homelab Changelog

## 2024-12-16

### Power Investigation
Investigated UPS power limit issues across both Proxmox servers.

#### Findings
1. **KSMD (Kernel Same-page Merging Daemon)** was consuming 50-57% CPU constantly on PVE
   - `sleep_millisecs` set to 12ms (more aggressive than the kernel default of 20ms)
   - `general_profit` was **negative** (-320MB) meaning it was wasting CPU
   - No memory overcommit situation (98GB allocated on 128GB RAM)
   - Diverse workloads (TrueNAS, Windows, Linux) = few duplicate pages to merge

2. **GPUs** identified as major power consumers (rated TDP):
   - RTX A6000 on PVE2: up to 300W TDP
   - TITAN RTX on PVE: up to 280W TDP
   - Quadro P2000 on PVE: up to 75W TDP

3. **TrueNAS VM** occasionally spiking to 86% CPU (needs investigation)
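
The KSM figures in finding 1 are read from sysfs. A minimal sketch of collecting them, assuming the files live under `/sys/kernel/mm/ksm` and that the kernel is new enough to expose `general_profit` (reported in bytes). The function name `ksm_summary` is made up for this example:

```bash
# Print the KSM settings relevant to the findings above.
# The directory argument is optional and exists so the function can be
# pointed at a copy of the sysfs files.
ksm_summary() {
    local dir="${1:-/sys/kernel/mm/ksm}"
    local run sleep_ms profit
    run=$(cat "$dir/run")
    sleep_ms=$(cat "$dir/sleep_millisecs")
    if [ -r "$dir/general_profit" ]; then
        # general_profit is in bytes; convert to MiB (integer division)
        profit=$(cat "$dir/general_profit")
        echo "run=$run sleep_millisecs=$sleep_ms general_profit_mib=$((profit / 1048576))"
    else
        # Older kernels do not expose general_profit
        echo "run=$run sleep_millisecs=$sleep_ms general_profit_mib=unknown"
    fi
}
```

Calling `ksm_summary` with no argument on a host reads the live values; once KSM is disabled, `run` should read 0.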

#### Changes Made
- [x] **Disabled KSMD on PVE** (10.10.10.120)
  ```bash
  echo 0 > /sys/kernel/mm/ksm/run
  ```
  - Immediate result: KSMD CPU dropped from 51-57% to 0%
  - Load average dropped from 1.88 to 1.28
  - Estimated savings: ~7-10W continuous

#### Additional Changes
- [x] **Made KSMD disable persistent on both hosts**
  - Note: KSM is controlled via sysfs, not sysctl
  - Created systemd service `/etc/systemd/system/disable-ksm.service`:
  ```ini
  [Unit]
  Description=Disable KSM (Kernel Same-page Merging)
  After=multi-user.target

  [Service]
  Type=oneshot
  ExecStart=/bin/sh -c "echo 0 > /sys/kernel/mm/ksm/run"
  RemainAfterExit=yes

  [Install]
  WantedBy=multi-user.target
  ```
  - Enabled on both PVE and PVE2: `systemctl enable disable-ksm.service`

### Syncthing Rescan Interval Fix
**Root Cause**: Syncthing on TrueNAS was rescanning 56GB of data every 60 seconds, causing constant 100% CPU usage (~3172 minutes CPU time in 3 days).

**Folders affected** (changed from 60s to 3600s):
- downloads (38GB)
- documents (11GB)
- desktop (7.2GB)
- config, movies, notes, pictures

**Fix applied**:
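```bash
# Download the config from the TrueNAS VM via the guest agent.
# qm guest exec wraps the command's stdout in JSON under "out-data" (jq required).
ssh pve 'qm guest exec 100 -- cat /mnt/.ix-apps/app_mounts/syncthing/config/config/config.xml' \
  | jq -r '."out-data"' > config.xml

# Change every rescanIntervalS="60" to rescanIntervalS="3600"
sed -i 's/rescanIntervalS="60"/rescanIntervalS="3600"/g' config.xml

# After uploading the edited config, restart Syncthing through its REST API
curl -X POST -H "X-API-Key: xxx" http://localhost:20910/rest/system/restart
```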

**Note**: fsWatcher is enabled, so changes are detected in real-time. The rescan is just a safety net.
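
One way to check the edited `config.xml` before uploading it is to count how many folders sit at each interval. A minimal sketch; `count_rescan_interval` is a hypothetical helper, and it assumes the attribute appears exactly in the form matched by the `sed` command above:

```bash
# Count occurrences of rescanIntervalS="<interval>" in a Syncthing config file.
# Usage: count_rescan_interval <config.xml> <seconds>
count_rescan_interval() {
    local config="$1" interval="$2"
    # grep -o prints one line per match, so wc -l gives the match count
    grep -o "rescanIntervalS=\"${interval}\"" "$config" | wc -l | tr -d ' '
}
```

After the edit, `count_rescan_interval config.xml 60` should print 0 and `count_rescan_interval config.xml 3600` should match the number of folders.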

**Estimated savings**: ~60-80W (TrueNAS VM CPU expected to drop from 86% to ~5-10% at idle)

### GPU Power State Investigation

| GPU | VM | Idle Power | P-State | Status |
|-----|-----|-----------|---------|--------|
| RTX A6000 | trading-vm (301) | **11W** | P8 | Optimal |
| TITAN RTX | lmdev1 (111) | **2W** | P8 | Excellent! |
| Quadro P2000 | saltbox (101) | **25W** | P0 | Stuck due to Plex |

**Findings**:
- RTX A6000: Properly entering P8 (11W idle) - excellent
- TITAN RTX: Only 2W at idle despite ComfyUI/Python processes (436MiB VRAM used)
  - Newer architectures idle better: the TITAN RTX is Turing, while the Quadro P2000 is the older Pascal generation
- Quadro P2000: Stuck in P0 at 25W because Plex Transcoder holds GPU memory
  - Older Quadro cards don't idle as efficiently with processes attached
  - Power limit fixed at 75W (not adjustable)
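
The P-state and power figures above can be re-checked from inside each VM with `nvidia-smi`. A minimal sketch, assuming the NVIDIA driver tools are installed in the guest; `gpus_not_idle` is a hypothetical helper that reads `name, pstate, power.draw` CSV lines and lists GPUs that are not in P8:

```bash
# Read CSV lines of the form "name, pstate, power.draw" on stdin and
# report every GPU that is not in the P8 idle state.
gpus_not_idle() {
    awk -F', ' '$2 != "P8" { printf "%s is in %s drawing %s\n", $1, $2, $3 }'
}
```

Inside a VM it would be fed like this: `nvidia-smi --query-gpu=name,pstate,power.draw --format=csv,noheader | gpus_not_idle`. No output means every GPU in that VM is idling in P8.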

**Changes made**:
- [x] Installed QEMU guest agent on lmdev1 (VM 111)
- [x] Added SSH key access to lmdev1 (10.10.10.111)
- [x] Updated ~/.ssh/config with lmdev1 entry

### CPU Governor Optimization

**Issue**: Both servers using `performance` CPU governor, keeping CPUs at high frequencies (3-4GHz) even when 99% idle.

**Changes**:

#### PVE (10.10.10.120)
- **Driver**: `amd-pstate-epp` (modern AMD P-State with Energy Performance Preference)
- **Change**: Governor `performance` → `powersave`, EPP `performance` → `balance_power`
- **Result**: Idle frequencies dropped from ~4GHz to ~1.7GHz
- **Persistence**: Created `/etc/systemd/system/cpu-powersave.service`
  ```ini
  [Unit]
  Description=Set CPU governor to powersave with balance_power EPP
  After=multi-user.target

  [Service]
  Type=oneshot
  ExecStart=/bin/bash -c 'for gov in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo powersave > "$gov"; done; for epp in /sys/devices/system/cpu/cpu*/cpufreq/energy_performance_preference; do echo balance_power > "$epp"; done'
  RemainAfterExit=yes

  [Install]
  WantedBy=multi-user.target
  ```

#### PVE2 (10.10.10.102)
- **Driver**: `acpi-cpufreq` (older driver)
- **Change**: Governor `performance` → `schedutil`
- **Result**: Idle frequencies dropped from ~4GHz to ~2.2GHz
- **Persistence**: Created `/etc/systemd/system/cpu-powersave.service`
  ```ini
  [Unit]
  Description=Set CPU governor to schedutil for power savings
  After=multi-user.target

  [Service]
  Type=oneshot
  ExecStart=/bin/bash -c 'for gov in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo schedutil > "$gov"; done'
  RemainAfterExit=yes

  [Install]
  WantedBy=multi-user.target
  ```

**Estimated savings**: 30-60W per server (60-120W total)
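
To confirm that the services applied the intended governor on every core, the per-CPU values can be tallied. A sketch assuming the standard cpufreq sysfs layout used in the unit files above; `governor_summary` is a hypothetical helper:

```bash
# Print "<governor>=<number of CPUs>" for each governor in use.
# The root directory is optional and defaults to the live sysfs tree.
governor_summary() {
    local root="${1:-/sys/devices/system/cpu}"
    cat "$root"/cpu[0-9]*/cpufreq/scaling_governor 2>/dev/null \
        | sort | uniq -c | awk '{ print $2 "=" $1 }'
}
```

On PVE a single `powersave=<n>` line is the expected result, and on PVE2 a single `schedutil=<n>` line; any second line means some cores kept a different governor.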

### ksmtuned Service Disabled

**Issue**: The `ksmtuned` service (KSM tuning daemon) was still running on both servers after KSMD was disabled, consuming ~39 min of CPU time on PVE and ~12 min on PVE2 over 3 days.

**Fix**:
```bash
systemctl stop ksmtuned
systemctl disable ksmtuned
```

Applied to both PVE and PVE2.

**Estimated savings**: ~2-5W

### HDD Spindown on PVE2

**Issue**: Two WD Red 6TB drives (local-zfs2 pool) were spinning 24/7 despite the pool holding only 768KB of data. Each drive draws 5-8W while spinning.

**Fix**:
```bash
# Set 30-minute spindown timeout
hdparm -S 241 /dev/sda /dev/sdb
```

**Persistence**: Created udev rule `/etc/udev/rules.d/69-hdd-spindown.rules`:
```
ACTION=="add", KERNEL=="sd[a-z]", ATTRS{model}=="WDC WD60EFRX-68L*", RUN+="/usr/sbin/hdparm -S 241 /dev/%k"
```

**Estimated savings**: ~10-16W (when drives spin down)
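
The `-S` argument is an encoded timeout rather than a number of minutes. A small decoder following the encoding described in the hdparm man page (values 252 to 255 have special meanings and are not handled here); `hdparm_standby_seconds` is a hypothetical helper:

```bash
# Convert an hdparm -S value into a standby timeout in seconds.
hdparm_standby_seconds() {
    local v="$1"
    if [ "$v" -eq 0 ]; then
        echo 0                       # 0 disables the standby timer
    elif [ "$v" -ge 1 ] && [ "$v" -le 240 ]; then
        echo $((v * 5))              # 1-240: multiples of 5 seconds
    elif [ "$v" -ge 241 ] && [ "$v" -le 251 ]; then
        echo $(((v - 240) * 1800))   # 241-251: multiples of 30 minutes
    else
        echo "unsupported -S value: $v" >&2
        return 1
    fi
}
```

`hdparm_standby_seconds 241` prints 1800, which is the 30-minute timeout set above.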

### Pending Changes
- [ ] Monitor overall power consumption after all optimizations
- [ ] Consider PCIe ASPM optimization
- [ ] Consider NMI watchdog disable
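
For the monitoring item, summing the per-change estimates in this entry gives a rough figure to compare against the measured draw. This is only arithmetic on the estimates above, not a measurement, and the individual ranges may overlap (for example, the KSMD and CPU governor savings both come from CPU power):

```bash
# Sum "low high" watt ranges read from stdin, one range per line.
sum_ranges() {
    awk '{ lo += $1; hi += $2 } END { printf "%d-%dW\n", lo, hi }'
}

# Estimates from this entry, in order: KSMD, Syncthing,
# CPU governors (both hosts), ksmtuned, HDD spindown
printf '%s\n' '7 10' '60 80' '60 120' '2 5' '10 16' | sum_ranges
# prints 139-231W
```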

### SSH Key Setup
- Added SSH key authentication to both Proxmox servers
- Updated `~/.ssh/config` with entries for `pve` and `pve2`

---

## Notes

### What is KSMD?
Kernel Same-page Merging Daemon - scans memory for duplicate pages across VMs and merges them. Trades CPU cycles for RAM savings. Useful when:
- Overcommitting memory
- Running many identical VMs

Not useful when:
- Plenty of RAM headroom (our case)
- Diverse workloads with few duplicate pages
- `general_profit` is negative

### What is Memory Ballooning?
Guest-cooperative memory management: the hypervisor can ask VMs to give back unused RAM through a balloon driver in the guest. Independent of KSMD. Both are Proxmox/KVM memory optimization features, but ballooning reclaims unused guest RAM while KSMD deduplicates identical pages.