Complete Phase 2 documentation: Add HARDWARE, SERVICES, MONITORING, MAINTENANCE

Phase 2 documentation implementation:

- Created HARDWARE.md: Complete hardware inventory (servers, GPUs, storage, network cards)
- Created SERVICES.md: Service inventory with URLs, credentials, health checks (25+ services)
- Created MONITORING.md: Health monitoring recommendations, alert setup, implementation plan
- Created MAINTENANCE.md: Regular procedures, update schedules, testing checklists
- Updated README.md: Added all Phase 2 documentation links
- Updated CLAUDE.md: Cleaned up to quick reference only (1340→377 lines)

All detailed content now in specialized documentation files with cross-references.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-23 00:34:21 -05:00
parent 23e9df68c9
commit 56b82df497
14 changed files with 6328 additions and 1036 deletions
--- a/POWER-MANAGEMENT.md
+++ b/POWER-MANAGEMENT.md
@@ -0,0 +1,509 @@
+# Power Management and Optimization
+
+Documentation of power optimizations applied to reduce idle power consumption and heat generation.
+
+## Overview
+
+Combined estimated power draw (both servers plus network gear): **~1615-1880W under load**, **~530-730W idle**
+
+Through the optimizations documented below, we've reduced idle power consumption by approximately **150-300W** compared to default settings.
+
+---
+
+## Power Draw Estimates
+
+### PVE (10.10.10.120)
+
+| Component | Idle | Load | TDP |
+|-----------|------|------|-----|
+| Threadripper PRO 3975WX | 150-200W | 400-500W | 280W |
+| NVIDIA TITAN RTX | 2-3W | 250W | 280W |
+| NVIDIA Quadro P2000 | 25W | 70W | 75W |
+| RAM (128 GB DDR4) | 30-40W | 30-40W | - |
+| Storage (NVMe + SSD) | 20-30W | 40-50W | - |
+| HBAs, fans, misc | 20-30W | 20-30W | - |
+| **Total** | **250-350W** | **800-940W** | - |
+
+### PVE2 (10.10.10.102)
+
+| Component | Idle | Load | TDP |
+|-----------|------|------|-----|
+| Threadripper PRO 3975WX | 150-200W | 400-500W | 280W |
+| NVIDIA RTX A6000 | 11W | 280W | 300W |
+| RAM (128 GB DDR4) | 30-40W | 30-40W | - |
+| Storage (NVMe + HDD) | 20-30W | 40-50W | - |
+| Fans, misc | 15-20W | 15-20W | - |
+| **Total** | **226-330W** | **765-890W** | - |
+
+### Combined
+
+| Metric | Idle | Load |
+|--------|------|------|
+| Servers | 476-680W | 1565-1830W |
+| Network gear | ~50W | ~50W |
+| **Total** | **~530-730W** | **~1615-1880W** |
+| **UPS Load** | 40-55% | 120-140% ⚠️ |
+
+**Note**: UPS capacity is 1320W. Under heavy load, the servers can exceed UPS capacity. This is considered acceptable because high load is rare.
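+
+The UPS load percentages in the table follow from dividing the estimated draw by the 1320W capacity. A minimal sketch of that arithmetic, assuming a hypothetical helper named `ups_load_pct` (not part of any installed tooling):
+
+```bash
+# Convert a power draw in watts to a rounded percentage of UPS capacity.
+# Usage: ups_load_pct WATTS [CAPACITY_WATTS]
+# ups_load_pct is an illustrative helper, not an existing command.
+ups_load_pct() {
+    watts=$1
+    capacity=${2:-1320}   # default taken from the UPS capacity noted above
+    echo $(( (watts * 100 + capacity / 2) / capacity ))
+}
+
+# Idle and load bounds from the Combined table
+for w in 530 730 1615 1880; do
+    echo "${w}W -> $(ups_load_pct "$w")% of UPS capacity"
+done
+```
+
+The results land at roughly 40%, 55%, 122% and 142%, which the table rounds to 40-55% and 120-140%.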
+
+---
+
+## Optimizations Applied
+
+### 1. KSMD Disabled (2024-12-17)
+
+**KSM** (Kernel Same-page Merging) scans memory to deduplicate identical pages across VMs.
+
+**Problem**:
+- KSMD was consuming 44-57% CPU continuously on PVE
+- Caused CPU temp to rise from 74°C to 83°C
+- **Net loss**: More power was spent scanning than was saved by deduplication
+
+**Solution**: Disabled KSM permanently
+
+**Configuration**:
+
+**Systemd service**: `/etc/systemd/system/disable-ksm.service`
+```ini
+[Unit]
+Description=Disable KSM (Kernel Same-page Merging)
+After=multi-user.target
+
+[Service]
+Type=oneshot
+ExecStart=/bin/sh -c 'echo 0 > /sys/kernel/mm/ksm/run'
+RemainAfterExit=yes
+
+[Install]
+WantedBy=multi-user.target
+```
+
+**Enable and start**:
+```bash
+systemctl daemon-reload
+systemctl enable --now disable-ksm
+systemctl mask ksmtuned  # Prevent re-enabling
+```
+
+**Verify**:
+```bash
+# KSM should be disabled (run=0)
+cat /sys/kernel/mm/ksm/run  # Should output: 0
+
+# ksmd should show 0% CPU
+ps aux | grep '[k]smd'  # bracket pattern keeps grep from matching itself
+```
+
+**Savings**: ~60-80W, and the CPU temperature rise from 74°C to 83°C is avoided
+
+**⚠️ Important**: Proxmox updates sometimes re-enable KSM. If CPU is unexpectedly hot, check:
+```bash
+cat /sys/kernel/mm/ksm/run
+# If 1, disable it:
+echo 0 > /sys/kernel/mm/ksm/run
+systemctl mask ksmtuned
+```
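+
+The kernel defines three values for `/sys/kernel/mm/ksm/run`: `0` stops ksmd but keeps already merged pages, `1` runs ksmd, and `2` stops ksmd and unmerges all pages. A small sketch that turns the raw value into a readable state; the function name `ksm_state` and its optional path argument are illustrative assumptions:
+
+```bash
+# Print a human-readable KSM state from the sysfs "run" file.
+# Usage: ksm_state [RUN_FILE]   (ksm_state is a hypothetical helper)
+ksm_state() {
+    run_file=${1:-/sys/kernel/mm/ksm/run}
+    case "$(cat "$run_file" 2>/dev/null)" in
+        0) echo "stopped" ;;      # ksmd not scanning, merged pages kept
+        1) echo "running" ;;      # ksmd actively scanning
+        2) echo "unmerging" ;;    # ksmd stopped, merged pages being split
+        *) echo "unknown" ;;      # file missing or unexpected content
+    esac
+}
+
+echo "KSM is: $(ksm_state)"
+```
+
+Anything other than `stopped` after a Proxmox update is a hint to re-apply the fix above.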
+
+---
+
+### 2. CPU Governor Optimization (2024-12-16)
+
+Default CPU governor keeps cores at max frequency even when idle, wasting power.
+
+#### PVE: `amd-pstate-epp` Driver
+
+**Driver**: `amd-pstate-epp` (modern AMD P-state driver)
+**Governor**: `powersave`
+**EPP**: `balance_power`
+
+**Configuration**:
+
+**Systemd service**: `/etc/systemd/system/cpu-powersave.service`
+```ini
+[Unit]
+Description=Set CPU governor to powersave with balance_power EPP
+After=multi-user.target
+
+[Service]
+Type=oneshot
+ExecStart=/bin/sh -c 'for cpu in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo powersave > $cpu; done'
+ExecStart=/bin/sh -c 'for cpu in /sys/devices/system/cpu/cpu*/cpufreq/energy_performance_preference; do echo balance_power > $cpu; done'
+RemainAfterExit=yes
+
+[Install]
+WantedBy=multi-user.target
+```
+
+**Enable**:
+```bash
+systemctl daemon-reload
+systemctl enable --now cpu-powersave
+```
+
+**Verify**:
+```bash
+# Check governor
+cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
+# Output: powersave
+
+# Check EPP
+cat /sys/devices/system/cpu/cpu0/cpufreq/energy_performance_preference
+# Output: balance_power
+
+# Check current frequency (should be low when idle)
+grep MHz /proc/cpuinfo | head -5
+# Should show ~1700-2200 MHz idle, up to 4000 MHz under load
+```
+
+#### PVE2: `acpi-cpufreq` Driver
+
+**Driver**: `acpi-cpufreq` (older ACPI driver)
+**Governor**: `schedutil` (adaptive, better than powersave for this driver)
+
+**Configuration**:
+
+**Systemd service**: `/etc/systemd/system/cpu-powersave.service`
+```ini
+[Unit]
+Description=Set CPU governor to schedutil
+After=multi-user.target
+
+[Service]
+Type=oneshot
+ExecStart=/bin/sh -c 'for cpu in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo schedutil > $cpu; done'
+RemainAfterExit=yes
+
+[Install]
+WantedBy=multi-user.target
+```
+
+**Enable**:
+```bash
+systemctl daemon-reload
+systemctl enable --now cpu-powersave
+```
+
+**Verify**:
+```bash
+cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
+# Output: schedutil
+
+grep MHz /proc/cpuinfo | head -5
+# Should show ~1700-2200 MHz idle
+```
+
+**Savings**: ~60-120W combined (CPUs now idle at 1.7-2.2 GHz instead of 4 GHz)
+
+**Performance impact**: Minimal - CPU still boosts to max frequency under load
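+
+The verify steps above only read `cpu0`. To confirm every core picked up the setting, a sketch along these lines could be used; `count_governor_mismatches` is a hypothetical helper, and the optional base directory argument exists only so it can be pointed at a test tree:
+
+```bash
+# Count cores whose scaling governor differs from the expected one.
+# Usage: count_governor_mismatches EXPECTED [SYSFS_CPU_DIR]
+count_governor_mismatches() {
+    expected=$1
+    base=${2:-/sys/devices/system/cpu}
+    count=0
+    for f in "$base"/cpu[0-9]*/cpufreq/scaling_governor; do
+        [ -r "$f" ] || continue                      # skip if cpufreq is absent
+        [ "$(cat "$f")" = "$expected" ] || count=$((count + 1))
+    done
+    echo "$count"
+}
+
+# PVE expects powersave; on PVE2 pass schedutil instead
+echo "cores not on powersave: $(count_governor_mismatches powersave)"
+```
+
+A result of 0 means all cores with a readable governor file match.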
+
+---
+
+### 3. GPU Power States (2024-12-16)
+
+GPUs automatically enter low-power states when idle. Verified optimal.
+
+| GPU | Location | Idle Power | P-State | Notes |
+|-----|----------|------------|---------|-------|
+| RTX A6000 | PVE2 | 11W | P8 | Excellent idle power |
+| TITAN RTX | PVE | 2-3W | P8 | Excellent idle power |
+| Quadro P2000 | PVE | 25W | P0 | Plex keeps it active |
+
+**Check GPU power state**:
+```bash
+# Via nvidia-smi (if installed in VM)
+ssh lmdev1 'nvidia-smi --query-gpu=name,power.draw,pstate --format=csv'
+
+# Expected output:
+# name, power.draw [W], pstate
+# NVIDIA TITAN RTX, 2.50 W, P8
+
+# Via lspci (from Proxmox host - only lists the devices; it does not report power draw)
+ssh pve 'lspci | grep -i nvidia'
+```
+
+**P-States**:
+- **P0**: Maximum performance
+- **P8**: Minimum power (idle)
+
+**No action needed** - GPUs automatically manage power states.
+
+**Savings**: N/A (already optimal)
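+
+To spot a GPU that is not idling, the CSV output of the `nvidia-smi` query shown above can be filtered for anything outside P8. This is a sketch only: `gpus_not_idle` is a hypothetical helper, and the sample rows below are illustrative values based on the table, not captured output.
+
+```bash
+# Read "name, power.draw [W], pstate" CSV on stdin and print GPUs not in P8.
+# gpus_not_idle is an illustrative helper; the header row is skipped.
+gpus_not_idle() {
+    awk -F', ' 'NR > 1 && $3 != "P8" { print $1 " (" $3 ")" }'
+}
+
+# Real use (assumes nvidia-smi is installed in the VM):
+#   ssh lmdev1 'nvidia-smi --query-gpu=name,power.draw,pstate --format=csv' | gpus_not_idle
+
+# Demo with illustrative sample rows
+printf '%s\n' \
+    'name, power.draw [W], pstate' \
+    'NVIDIA TITAN RTX, 2.50 W, P8' \
+    'Quadro P2000, 25.00 W, P0' | gpus_not_idle
+```
+
+An empty result means every reported GPU is in P8. The Quadro P2000 is expected to show up while Plex keeps it active.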
+
+---
+
+### 4. Syncthing Rescan Intervals (2024-12-16)
+
+Aggressive 60-second rescans were keeping the TrueNAS VM at a constant 86% CPU.
+
+**Changed**:
+- Large folders: 60s → **3600s** (1 hour)
+- Affected: downloads (38GB), documents (11GB), desktop (7.2GB), movies, pictures, notes, config
+
+**Configuration**: Via Syncthing UI on each device
+- Settings → Folders → [Folder Name] → Advanced → Rescan Interval
+
+**Savings**: ~60-80W (TrueNAS CPU usage dropped from 86% to <10%)
+
+**Trade-off**: Changes take up to 1 hour to detect instead of 1 minute
+- Still acceptable for most use cases
+- Manual rescan available if needed: `curl -X POST "http://localhost:8384/rest/db/scan?folder=FOLDER" -H "X-API-Key: API_KEY"`
+
+---
+
+### 5. ksmtuned Disabled (2024-12-16)
+
+**ksmtuned** is the daemon that tunes KSM parameters. Even with KSM disabled, the tuning daemon was still running.
+
+**Solution**: Stopped and disabled on both servers
+
+```bash
+systemctl stop ksmtuned
+systemctl disable ksmtuned
+systemctl mask ksmtuned  # Prevent re-enabling
+```
+
+**Savings**: ~2-5W
+
+---
+
+### 6. HDD Spindown on PVE2 (2024-12-16)
+
+**Problem**: The `local-zfs2` pool (2x WD Red 6TB HDD) had only 768 KB used, but the drives were spinning 24/7
+
+**Solution**: Configure 30-minute spindown timeout
+
+**Udev rule**: `/etc/udev/rules.d/69-hdd-spindown.rules`
+```udev
+# Spin down WD Red 6TB drives after 30 minutes idle
+ACTION=="add|change", KERNEL=="sd[a-z]", ATTRS{model}=="WDC WD60EFRX-68L*", RUN+="/sbin/hdparm -S 241 /dev/%k"
+```
+
+**hdparm value**: 241 = 30 minutes
+- Values 1-240 are multiples of 5 seconds (5 seconds to 20 minutes)
+- Values 241-251 are 1 to 11 units of 30 minutes, so 241 = 1 × 30 minutes
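+
+Per the `hdparm` man page, the `-S` value is piecewise rather than a single multiplier: 1-240 are multiples of 5 seconds, 241-251 are units of 30 minutes, 252 means 21 minutes, and 255 means 21 minutes 15 seconds. A sketch that decodes a value into seconds, using a hypothetical helper name `hdparm_standby_seconds`:
+
+```bash
+# Decode an hdparm -S value into a standby timeout in seconds.
+# hdparm_standby_seconds is an illustrative helper, not part of hdparm.
+hdparm_standby_seconds() {
+    v=$1
+    if [ "$v" -eq 0 ]; then
+        echo "disabled"                     # 0 turns the timeout off
+    elif [ "$v" -le 240 ]; then
+        echo $(( v * 5 ))                   # 5 seconds to 20 minutes
+    elif [ "$v" -le 251 ]; then
+        echo $(( (v - 240) * 30 * 60 ))     # 30 minutes to 5.5 hours
+    elif [ "$v" -eq 252 ]; then
+        echo $(( 21 * 60 ))                 # 21 minutes
+    elif [ "$v" -eq 255 ]; then
+        echo $(( 21 * 60 + 15 ))            # 21 minutes 15 seconds
+    else
+        echo "vendor-defined or reserved"   # 253 and 254
+    fi
+}
+
+echo "-S 241 -> $(hdparm_standby_seconds 241) seconds"
+```
+
+For the value used in the udev rule this gives 1800 seconds, i.e. 30 minutes.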
+
+**Apply rule**:
+```bash
+udevadm control --reload-rules
+udevadm trigger
+
+# Check standby support (hdparm -I lists capabilities; it does not report the configured timeout)
+hdparm -I /dev/sda | grep -i standby
+hdparm -I /dev/sdb | grep -i standby
+```
+
+**Check if drives are spun down**:
+```bash
+hdparm -C /dev/sda
+# Output: drive state is:  standby  (spun down)
+# or:     drive state is:  active/idle  (spinning)
+```
+
+**Savings**: ~10-16W when both drives are spun down (5-8W per drive)
+
+**Trade-off**: 5-10 second delay when accessing pool after spindown
+
+---
+
+## Potential Optimizations (Not Yet Applied)
+
+### PCIe ASPM (Active State Power Management)
+
+**Benefit**: Reduce power of idle PCIe devices
+**Risk**: May cause stability issues with some devices
+**Estimated savings**: 5-15W
+
+**Test**:
+```bash
+# Check current ASPM state
+lspci -vv | grep -i aspm
+
+# Enable ASPM (test first)
+# Add to kernel cmdline: pcie_aspm=force
+# Edit /etc/default/grub:
+GRUB_CMDLINE_LINUX_DEFAULT="quiet pcie_aspm=force"
+
+# Update grub
+update-grub
+reboot
+```
+
+### NMI Watchdog Disable
+
+**Benefit**: Reduce CPU wakeups
+**Risk**: Harder to debug kernel hangs
+**Estimated savings**: 1-3W
+
+**Test**:
+```bash
+# Disable NMI watchdog
+echo 0 > /proc/sys/kernel/nmi_watchdog
+
+# Make permanent (add to kernel cmdline)
+# Edit /etc/default/grub:
+GRUB_CMDLINE_LINUX_DEFAULT="quiet nmi_watchdog=0"
+
+update-grub
+reboot
+```
+
+---
+
+## Monitoring
+
+### CPU Frequency
+
+```bash
+# Current frequency on all cores
+ssh pve 'grep MHz /proc/cpuinfo | head -10'
+
+# Governor
+ssh pve 'cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor'
+
+# Available governors
+ssh pve 'cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors'
+```
+
+### CPU Temperature
+
+```bash
+# PVE
+ssh pve 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE Tctl: $(($(cat $f)/1000))°C"; fi; done'
+
+# PVE2
+ssh pve2 'for f in /sys/class/hwmon/hwmon*/temp*_input; do label=$(cat ${f%_input}_label 2>/dev/null); if [ "$label" = "Tctl" ]; then echo "PVE2 Tctl: $(($(cat $f)/1000))°C"; fi; done'
+```
+
+**Healthy temps**: 70-80°C under load
+**Warning**: >85°C
+**Throttle**: 90°C (Tctl max for Threadripper PRO)
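+
+The thresholds above can be folded into a simple classification for scripts or alerts. This is a sketch under the stated limits only; `classify_tctl` is a hypothetical helper that takes whole degrees Celsius, such as the value printed by the loops above:
+
+```bash
+# Classify a Tctl reading (whole °C) against the thresholds listed above.
+# classify_tctl is an illustrative helper.
+classify_tctl() {
+    t=$1
+    if [ "$t" -ge 90 ]; then
+        echo "throttle"     # at or above the 90°C Tctl limit
+    elif [ "$t" -gt 85 ]; then
+        echo "warning"      # above 85°C
+    else
+        echo "ok"
+    fi
+}
+
+for t in 74 83 86 90; do
+    echo "${t}°C -> $(classify_tctl "$t")"
+done
+```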
+
+### GPU Power Draw
+
+```bash
+# If nvidia-smi installed in VM
+ssh lmdev1 'nvidia-smi --query-gpu=name,power.draw,power.limit,pstate --format=csv'
+
+# Sample output:
+# name, power.draw [W], power.limit [W], pstate
+# NVIDIA TITAN RTX, 2.50 W, 280.00 W, P8
+```
+
+### Power Consumption (UPS)
+
+```bash
+# Check UPS load percentage
+ssh pve 'upsc cyberpower@localhost ups.load'
+
+# Battery runtime (seconds)
+ssh pve 'upsc cyberpower@localhost battery.runtime'
+
+# Full UPS status
+ssh pve 'upsc cyberpower@localhost'
+```
+
+See [UPS.md](UPS.md) for more UPS monitoring details.
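+
+`ups.load` is a percentage and `battery.runtime` is in seconds, so both need a small conversion to compare against the wattage estimates in this document. A minimal sketch, assuming the 1320W capacity stated earlier and two hypothetical helpers (`ups_load_watts`, `runtime_minutes`):
+
+```bash
+# Convert a UPS load percentage into approximate watts.
+# Usage: ups_load_watts LOAD_PCT [CAPACITY_WATTS]   (illustrative helper)
+ups_load_watts() {
+    pct=${1%.*}             # drop any decimal part for integer math
+    capacity=${2:-1320}     # capacity assumed from this document
+    echo $(( pct * capacity / 100 ))
+}
+
+# Convert a runtime in seconds into whole minutes (illustrative helper).
+runtime_minutes() {
+    echo $(( ${1%.*} / 60 ))
+}
+
+# Real use (assumes NUT is configured as shown above):
+#   ups_load_watts "$(ssh pve 'upsc cyberpower@localhost ups.load')"
+echo "45% load is about $(ups_load_watts 45)W"
+```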
+
+### ZFS ARC Memory Usage
+
+```bash
+# PVE
+ssh pve 'arc_summary | grep -A5 "ARC size"'
+
+# TrueNAS
+ssh truenas 'arc_summary | grep -A5 "ARC size"'
+```
+
+**ARC** (Adaptive Replacement Cache) uses RAM for ZFS caching. Adjust if needed:
+
+```bash
+# Limit ARC to 32 GiB (example; zfs_arc_max is given in bytes)
+# Edit /etc/modprobe.d/zfs.conf:
+options zfs zfs_arc_max=34359738368
+
+# Apply (reboot required)
+update-initramfs -u
+reboot
+```
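+
+`zfs_arc_max` takes a byte count, which is easy to mistype. A sketch for deriving the value from a size in GiB; `gib_to_bytes` is a hypothetical helper:
+
+```bash
+# Convert a size in GiB to bytes for use as zfs_arc_max.
+# gib_to_bytes is an illustrative helper.
+gib_to_bytes() {
+    echo $(( $1 * 1024 * 1024 * 1024 ))
+}
+
+echo "options zfs zfs_arc_max=$(gib_to_bytes 32)"
+```
+
+For 32 GiB this yields 34359738368, matching the example above.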
+
+---
+
+## Troubleshooting
+
+### CPU Not Downclocking
+
+```bash
+# Check current governor
+cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
+
+# Should be: powersave (PVE) or schedutil (PVE2)
+# If not, systemd service may have failed
+
+# Check service status
+systemctl status cpu-powersave
+
+# Manually set governor (temporary; use schedutil instead of powersave on PVE2)
+echo powersave | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
+
+# Check frequency
+grep MHz /proc/cpuinfo | head -5
+```
+
+### High Idle Power After Update
+
+**Common causes**:
+1. **KSM re-enabled** after Proxmox update
+   - Check: `cat /sys/kernel/mm/ksm/run`
+   - Fix: `echo 0 > /sys/kernel/mm/ksm/run && systemctl mask ksmtuned`
+
+2. **CPU governor reset** to default
+   - Check: `cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor`
+   - Fix: `systemctl restart cpu-powersave`
+
+3. **GPU stuck in high-performance mode**
+   - Check: `nvidia-smi --query-gpu=pstate --format=csv`
+   - Fix: Restart VM or power cycle GPU
+
+### HDDs Won't Spin Down
+
+```bash
+# Check standby support (the configured timeout itself is not reported)
+hdparm -I /dev/sda | grep -i standby
+
+# Set spindown manually (temporary)
+hdparm -S 241 /dev/sda
+
+# Check if drive is idle (ZFS may keep it active)
+zpool iostat -v 1 5  # Watch for activity
+
+# Check what's accessing the drive
+lsof | grep /mnt/pool
+```
+
+---
+
+## Power Optimization Summary
+
+| Optimization | Savings | Applied | Notes |
+|--------------|---------|---------|-------|
+| **KSMD disabled** | 60-80W | ✅ | Also reduces CPU temp significantly |
+| **CPU governor** | 60-120W | ✅ | PVE: powersave+balance_power, PVE2: schedutil |
+| **GPU power states** | 0W | ✅ | Already optimal (automatic) |
+| **Syncthing rescans** | 60-80W | ✅ | Reduced TrueNAS CPU usage |
+| **ksmtuned disabled** | 2-5W | ✅ | Minor but easy win |
+| **HDD spindown** | 10-16W | ✅ | Only when drives idle |
+| PCIe ASPM | 5-15W | ❌ | Not yet tested |
+| NMI watchdog | 1-3W | ❌ | Not yet tested |
+| **Total savings** | **~150-300W** | - | Significant reduction |
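+
+As a cross-check on the total, the ranges of the applied rows can be summed directly. A sketch using a hypothetical helper `sum_ranges` that takes `low-high` pairs:
+
+```bash
+# Sum the low and high ends of a list of "low-high" ranges (in watts).
+# sum_ranges is an illustrative helper.
+sum_ranges() {
+    lo=0
+    hi=0
+    for r in "$@"; do
+        lo=$(( lo + ${r%-*} ))    # part before the dash
+        hi=$(( hi + ${r#*-} ))    # part after the dash
+    done
+    echo "${lo}-${hi}"
+}
+
+# Applied rows: KSMD, CPU governor, GPU, Syncthing, ksmtuned, HDD spindown
+sum_ranges 60-80 60-120 0-0 60-80 2-5 10-16
+```
+
+This prints `192-301`, so the upper bound of the stated total matches the row sum, while the ~150W lower bound is a more conservative figure than the rows alone imply.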
+
+---
+
+## Related Documentation
+
+- [UPS.md](UPS.md) - UPS capacity and power monitoring
+- [STORAGE.md](STORAGE.md) - HDD spindown configuration
+- [VMS.md](VMS.md) - VM resource allocation
+
+---
+
+**Last Updated**: 2025-12-22