homelab-docs/CHANGELOG.md
Hutson 93821d1557 Initial commit: Homelab infrastructure documentation
- CLAUDE.md: Main homelab assistant context and instructions
- IP-ASSIGNMENTS.md: Complete IP address assignments
- NETWORK.md: Network bridges, VLANs, and configuration
- EMC-ENCLOSURE.md: EMC storage enclosure documentation
- SYNCTHING.md: Syncthing setup and device list
- SHELL-ALIASES.md: ZSH aliases for Claude Code sessions
- HOMEASSISTANT.md: Home Assistant API and automations
- INFRASTRUCTURE.md: Server hardware and power management
- configs/: Shared shell configurations
- scripts/: Utility scripts
- mcp-central/: MCP server configuration

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-20 02:31:02 -05:00

# Homelab Changelog
## 2024-12-16
### Power Investigation
Investigated UPS power limit issues across both Proxmox servers.
#### Findings
1. **KSMD (Kernel Same-page Merging Daemon)** was consuming 50-57% CPU constantly on PVE
   - `sleep_millisecs` set to 12ms (aggressive; the upstream kernel default is 20ms)
   - `general_profit` was **negative** (-320MB), meaning KSM's own bookkeeping used more memory than merging saved, so the CPU time was wasted
   - No memory overcommit situation (98GB allocated on 128GB RAM)
   - Diverse workloads (TrueNAS, Windows, Linux) = few duplicate pages to merge
2. **GPUs** identified as major power consumers:
   - RTX A6000 on PVE2: up to 300W TDP
   - TITAN RTX on PVE: up to 280W TDP
   - Quadro P2000 on PVE: up to 75W TDP
3. **TrueNAS VM** occasionally spiking to 86% CPU (needs investigation)
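The KSM figures in finding 1 come from sysfs. A minimal sketch of how they could be collected in one pass is below; the function name `ksm_report` and the optional directory argument are assumptions for illustration, and `general_profit` is only present on newer kernels.

```bash
# Print the KSM counters from sysfs in a readable form.
# ksm_report is a hypothetical helper; the directory argument defaults to the
# kernel's KSM sysfs path and exists only so the function can be pointed elsewhere.
ksm_report() (
    # Subshell body keeps the variables below local to the function.
    dir="${1:-/sys/kernel/mm/ksm}"
    for name in run sleep_millisecs pages_shared pages_sharing general_profit; do
        if [ -r "$dir/$name" ]; then
            printf '%-16s %s\n' "$name" "$(cat "$dir/$name")"
        else
            printf '%-16s %s\n' "$name" "n/a"   # file missing on this kernel
        fi
    done
    # general_profit is reported in bytes; also show it in MiB.
    if [ -r "$dir/general_profit" ]; then
        printf '%-16s %d MiB\n' "profit_mib" "$(( $(cat "$dir/general_profit") / 1048576 ))"
    fi
)
```

Running `ksm_report` with no argument on a Proxmox host reads the live values; a negative `profit_mib` corresponds to the negative `general_profit` noted above.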
#### Changes Made
- [x] **Disabled KSMD on PVE** (10.10.10.120)
```bash
echo 0 > /sys/kernel/mm/ksm/run
```
- Immediate result: KSMD CPU dropped from 51-57% to 0%
- Load average dropped from 1.88 to 1.28
- Estimated savings: ~7-10W continuous
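Per the kernel's KSM documentation, the `run` file accepts three values: `0` stops ksmd but leaves already-merged pages merged, `1` runs it, and `2` stops it and unmerges everything. The sketch below decodes the current value; `ksm_run_state` is a hypothetical helper name, not an existing tool.

```bash
# Translate the value of /sys/kernel/mm/ksm/run into words.
# ksm_run_state is a hypothetical helper; the optional argument is the file to read.
ksm_run_state() {
    case "$(cat "${1:-/sys/kernel/mm/ksm/run}")" in
        0) echo "stopped (already-merged pages stay merged)" ;;
        1) echo "running" ;;
        2) echo "stopped (all merged pages unmerged)" ;;
        *) echo "unknown" ;;
    esac
}
```

After the change recorded here, `ksm_run_state` should report the first case on PVE.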
#### Additional Changes
- [x] **Made KSMD disable persistent on both hosts**
- Note: KSM is controlled via sysfs, not sysctl
- Created systemd service `/etc/systemd/system/disable-ksm.service`:
```ini
[Unit]
Description=Disable KSM (Kernel Same-page Merging)
After=multi-user.target
[Service]
Type=oneshot
ExecStart=/bin/sh -c "echo 0 > /sys/kernel/mm/ksm/run"
RemainAfterExit=yes
[Install]
WantedBy=multi-user.target
```
- Enabled on both PVE and PVE2: `systemctl enable disable-ksm.service`
### Syncthing Rescan Interval Fix
**Root Cause**: Syncthing on TrueNAS was rescanning 56GB of data every 60 seconds, causing constant 100% CPU usage (~3172 minutes CPU time in 3 days).
**Folders affected** (changed from 60s to 3600s):
- downloads (38GB)
- documents (11GB)
- desktop (7.2GB)
- config, movies, notes, pictures
**Fix applied**:
```bash
# Downloaded config from TrueNAS
ssh pve 'qm guest exec 100 -- cat /mnt/.ix-apps/app_mounts/syncthing/config/config/config.xml'
# Changed all rescanIntervalS="60" to rescanIntervalS="3600"
sed -i 's/rescanIntervalS="60"/rescanIntervalS="3600"/g' config.xml
# Uploaded and restarted Syncthing
curl -X POST -H "X-API-Key: xxx" http://localhost:20910/rest/system/restart
```
**Note**: fsWatcher is enabled, so changes are detected in real time. The periodic rescan is just a safety net.
**Estimated savings**: ~60-80W (TrueNAS VM CPU will drop from 86% to ~5-10% at idle)
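To confirm that no folder was left on the old interval, the edited file can be summarised before it is uploaded. This is a sketch under the assumption that `config.xml` is the local copy from the steps above; `rescan_intervals` is a hypothetical helper name.

```bash
# Count how many folders use each rescan interval in a Syncthing config.xml.
# rescan_intervals is a hypothetical helper; the argument defaults to ./config.xml.
rescan_intervals() {
    grep -o 'rescanIntervalS="[0-9]*"' "${1:-config.xml}" \
        | sed 's/[^0-9]//g' \
        | sort -n \
        | uniq -c \
        | awk '{ printf "%ss: %d folder(s)\n", $2, $1 }'
}
```

If the `sed` replacement worked, the output contains no `60s:` line.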
### GPU Power State Investigation
| GPU | VM | Idle Power | P-State | Status |
|-----|-----|-----------|---------|--------|
| RTX A6000 | trading-vm (301) | **11W** | P8 | Optimal |
| TITAN RTX | lmdev1 (111) | **2W** | P8 | Excellent! |
| Quadro P2000 | saltbox (101) | **25W** | P0 | Stuck due to Plex |
**Findings**:
- RTX A6000: Properly entering P8 (11W idle) - excellent
- TITAN RTX: Only 2W at idle despite ComfyUI/Python processes (436MiB VRAM used)
  - Modern GPUs have much better idle power management
- Quadro P2000: Stuck in P0 at 25W because Plex Transcoder holds GPU memory
  - Older Quadro cards don't idle as efficiently with processes attached
  - Power limit fixed at 75W (not adjustable)
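The P-state and power figures above can be re-checked with `nvidia-smi`, run inside the VM that owns each card. The sketch below filters that output for GPUs that have not reached P8; `not_idle_gpus` is a hypothetical helper, and it assumes the three query fields are requested in exactly this order.

```bash
# List GPUs that are not in the lowest-power P8 state.
# not_idle_gpus is a hypothetical helper. It reads lines on stdin in the form produced by:
#   nvidia-smi --query-gpu=name,pstate,power.draw --format=csv,noheader
not_idle_gpus() {
    awk -F', ' '$2 != "P8" { printf "%s is in %s drawing %s\n", $1, $2, $3 }'
}
```

Usage inside a guest: `nvidia-smi --query-gpu=name,pstate,power.draw --format=csv,noheader | not_idle_gpus`. No output means every GPU in that VM is in P8.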
**Changes made**:
- [x] Installed QEMU guest agent on lmdev1 (VM 111)
- [x] Added SSH key access to lmdev1 (10.10.10.111)
- [x] Updated ~/.ssh/config with lmdev1 entry
### CPU Governor Optimization
**Issue**: Both servers using `performance` CPU governor, keeping CPUs at high frequencies (3-4GHz) even when 99% idle.
**Changes**:
#### PVE (10.10.10.120)
- **Driver**: `amd-pstate-epp` (modern AMD P-State with Energy Performance Preference)
- **Change**: Governor `performance` → `powersave`, EPP `performance` → `balance_power`
- **Result**: Idle frequencies dropped from ~4GHz to ~1.7GHz
- **Persistence**: Created `/etc/systemd/system/cpu-powersave.service`
```ini
[Unit]
Description=Set CPU governor to powersave with balance_power EPP
After=multi-user.target
[Service]
Type=oneshot
ExecStart=/bin/bash -c 'for gov in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo powersave > "$gov"; done; for epp in /sys/devices/system/cpu/cpu*/cpufreq/energy_performance_preference; do echo balance_power > "$epp"; done'
RemainAfterExit=yes
[Install]
WantedBy=multi-user.target
```
#### PVE2 (10.10.10.102)
- **Driver**: `acpi-cpufreq` (older driver)
- **Change**: Governor `performance` → `schedutil`
- **Result**: Idle frequencies dropped from ~4GHz to ~2.2GHz
- **Persistence**: Created `/etc/systemd/system/cpu-powersave.service`
```ini
[Unit]
Description=Set CPU governor to schedutil for power savings
After=multi-user.target
[Service]
Type=oneshot
ExecStart=/bin/bash -c 'for gov in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo schedutil > "$gov"; done'
RemainAfterExit=yes
[Install]
WantedBy=multi-user.target
```
**Estimated savings**: 30-60W per server (60-120W total)
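To verify that the services took effect after a reboot, the active governor and current frequencies can be read back from the same sysfs tree the units write to. A minimal sketch follows; both function names are hypothetical, and the optional argument only exists so the CPU root can be overridden.

```bash
# Summarise which governor each CPU is using.
# governor_summary and avg_freq_mhz are hypothetical helpers.
governor_summary() {
    cat "${1:-/sys/devices/system/cpu}"/cpu[0-9]*/cpufreq/scaling_governor \
        | sort | uniq -c \
        | awk '{ printf "%s: %d CPU(s)\n", $2, $1 }'
}

# Average the current frequency across all CPUs (scaling_cur_freq is in kHz).
avg_freq_mhz() {
    cat "${1:-/sys/devices/system/cpu}"/cpu[0-9]*/cpufreq/scaling_cur_freq \
        | awk '{ sum += $1 } END { if (NR) printf "%d MHz\n", sum / NR / 1000 }'
}
```

On PVE, `governor_summary` should list only `powersave`; on PVE2, only `schedutil`. `avg_freq_mhz` gives a quick idle-frequency reading to compare with the results above.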
### ksmtuned Service Disabled
**Issue**: The `ksmtuned` service (KSM tuning daemon) was still running on both servers even after KSMD was disabled, consuming ~39 min of CPU time on PVE and ~12 min on PVE2 over 3 days.
**Fix**:
```bash
systemctl stop ksmtuned
systemctl disable ksmtuned
```
Applied to both PVE and PVE2.
**Estimated savings**: ~2-5W
### HDD Spindown on PVE2
**Issue**: Two WD Red 6TB drives (local-zfs2 pool) spinning 24/7 despite pool having only 768KB used. Each drive uses 5-8W spinning.
**Fix**:
```bash
# Set 30-minute spindown timeout
hdparm -S 241 /dev/sda /dev/sdb
```
**Persistence**: Created udev rule `/etc/udev/rules.d/69-hdd-spindown.rules`:
```
ACTION=="add", KERNEL=="sd[a-z]", ATTRS{model}=="WDC WD60EFRX-68L*", RUN+="/usr/sbin/hdparm -S 241 /dev/%k"
```
**Estimated savings**: ~10-16W (when drives spin down)
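The `-S 241` value is not a count of minutes. According to the hdparm man page, values 1-240 are multiples of 5 seconds and 241-251 are multiples of 30 minutes. The sketch below decodes a value; `hdparm_s_timeout` is a hypothetical helper written for illustration.

```bash
# Decode an `hdparm -S` value (0-255) into a human-readable standby timeout,
# following the encoding described in the hdparm man page.
# hdparm_s_timeout is a hypothetical helper, not part of hdparm.
hdparm_s_timeout() {
    if   [ "$1" -eq 0 ];   then echo "disabled"
    elif [ "$1" -le 240 ]; then echo "$(( $1 * 5 )) seconds"
    elif [ "$1" -le 251 ]; then echo "$(( ($1 - 240) * 30 )) minutes"
    elif [ "$1" -eq 252 ]; then echo "21 minutes"
    else                        echo "special or vendor-defined"
    fi
}
```

`hdparm_s_timeout 241` prints `30 minutes`, matching the comment in the fix. Whether a drive has actually spun down can be checked with `hdparm -C /dev/sda`.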
### Pending Changes
- [ ] Monitor overall power consumption after all optimizations
- [ ] Consider PCIe ASPM optimization
- [ ] Consider NMI watchdog disable
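As a baseline for the monitoring item, the per-change estimates in this entry can be added up. This is only arithmetic on the ranges quoted above. The individual estimates are rough and may overlap, so the total is a ceiling to compare measurements against, not a prediction.

```bash
# Sum the low and high ends of the estimated savings listed in this entry (watts).
# Each line is label:low:high, copied from the "Estimated savings" figures above.
low=0; high=0
while IFS=: read -r label lo hi; do
    low=$(( low + lo ))
    high=$(( high + hi ))
done <<'EOF'
KSMD disabled:7:10
Syncthing rescan interval:60:80
CPU governors (both servers):60:120
ksmtuned disabled:2:5
HDD spindown:10:16
EOF
echo "Estimated total: ${low}-${high}W"
```

This prints `Estimated total: 139-231W`.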
### SSH Key Setup
- Added SSH key authentication to both Proxmox servers
- Updated `~/.ssh/config` with entries for `pve` and `pve2`
---
## Notes
### What is KSMD?
Kernel Same-page Merging Daemon - scans memory for duplicate pages across VMs and merges them. Trades CPU cycles for RAM savings. Useful when:
- Overcommitting memory
- Running many identical VMs
Not useful when:
- Plenty of RAM headroom (our case)
- Diverse workloads with few duplicate pages
- `general_profit` is negative
### What is Memory Ballooning?
Guest-cooperative memory management: the hypervisor can ask VMs to give back unused RAM through a balloon driver in the guest. It is independent of KSMD. Both are Proxmox/KVM memory optimization features, but they serve different purposes.