homelab-docs/HARDWARE.md
Hutson 56b82df497 Complete Phase 2 documentation: Add HARDWARE, SERVICES, MONITORING, MAINTENANCE
Phase 2 documentation implementation:
- Created HARDWARE.md: Complete hardware inventory (servers, GPUs, storage, network cards)
- Created SERVICES.md: Service inventory with URLs, credentials, health checks (25+ services)
- Created MONITORING.md: Health monitoring recommendations, alert setup, implementation plan
- Created MAINTENANCE.md: Regular procedures, update schedules, testing checklists
- Updated README.md: Added all Phase 2 documentation links
- Updated CLAUDE.md: Cleaned up to quick reference only (1340→377 lines)

All detailed content now in specialized documentation files with cross-references.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-23 00:34:21 -05:00


# Hardware Inventory
Complete hardware specifications for all homelab equipment.
## Servers
### PVE (10.10.10.120) - Primary Proxmox Server
#### CPU
- **Model**: AMD Ryzen Threadripper PRO 3975WX
- **Cores**: 32 cores / 64 threads
- **Base Clock**: 3.5 GHz
- **Boost Clock**: 4.2 GHz
- **TDP**: 280W
- **Architecture**: Zen 2 (7nm)
- **Socket**: sWRX8
- **Features**: ECC support, PCIe 4.0
#### RAM
- **Capacity**: 128 GB
- **Type**: DDR4 ECC Registered
- **Speed**: Unknown (needs investigation)
- **Channels**: 8-channel (platform maximum on this single-socket CPU; populated channels need verification)
- **Idle Power**: ~30-40W
#### Storage
**OS/VM Storage:**
| Pool | Devices | Type | Capacity | Purpose |
|------|---------|------|----------|---------|
| `nvme-mirror1` | 2x Sabrent Rocket Q NVMe | ZFS Mirror | 3.6 TB usable | High-performance VM storage |
| `nvme-mirror2` | 2x Kingston SFYRD 2TB NVMe | ZFS Mirror | 1.8 TB usable | Additional fast VM storage |
| `rpool` | 2x Samsung 870 QVO 4TB SSD | ZFS Mirror | 3.6 TB usable | Proxmox OS, containers, backups |
**Total Storage**: ~9 TB usable
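The ~9 TB figure is the sum of the usable capacities of the three mirrors. A minimal sketch of that arithmetic follows; `sum_usable_tb` is a hypothetical helper name, and the inputs are the values from the table rather than live pool data.

```bash
# Sum usable pool capacities given in TB (hypothetical helper, not part of any tool).
sum_usable_tb() {
    # Print one value per line and let awk add them up, rounded to one decimal place.
    printf '%s\n' "$@" | awk '{ total += $1 } END { printf "%.1f\n", total }'
}

# nvme-mirror1 + nvme-mirror2 + rpool, as listed in the table
sum_usable_tb 3.6 1.8 3.6
```

On the live host, `zpool list -o name,size,free` should report the actual pool sizes, which may differ slightly from these rounded values.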
#### GPUs
| Model | Slot | VRAM | TDP | Purpose | Passed To |
|-------|------|------|-----|---------|-----------|
| NVIDIA Quadro P2000 | PCIe slot 1 | 5 GB GDDR5 | 75W | Plex transcoding | Host |
| NVIDIA TITAN RTX | PCIe slot 2 | 24 GB GDDR6 | 280W | AI workloads | Saltbox (101), lmdev1 (111) |
**Total GPU Power**: 75W + 280W = 355W (under load)
#### Network Cards
| Interface | Model | Speed | Purpose | Bridge |
|-----------|-------|-------|---------|--------|
| enp1s0 | Intel I210 (onboard) | 1 Gb | Management | vmbr0 |
| enp35s0f0 | Intel X520 (dual-port SFP+) | 10 Gb | High-speed LXC | vmbr1 |
| enp35s0f1 | Intel X520 (dual-port SFP+) | 10 Gb | High-speed VM | vmbr2 |
**10Gb Transceivers**: Intel FTLX8571D3BCV (SFP+ 10GBASE-SR, 850nm, multimode)
#### Storage Controllers
| Model | Interface | Purpose |
|-------|-----------|---------|
| LSI SAS2308 HBA | PCIe 3.0 x8 | Passed to TrueNAS VM for EMC enclosure |
| Samsung NVMe controller | PCIe | Passed to TrueNAS VM for ZFS caching |
#### Motherboard
- **Model**: Unknown - needs investigation
- **Chipset**: AMD WRX80 (the chipset paired with Threadripper PRO 3000WX; confirm once the board model is known)
- **Form Factor**: ATX/EATX
- **PCIe Slots**: Multiple PCIe 4.0 slots
- **Features**: IOMMU support, ECC memory
#### Power Supply
- **Model**: Unknown
- **Wattage**: Likely 1000W+ (needs investigation)
- **Type**: ATX, 80+ certification unknown
#### Cooling
- **CPU Cooler**: Unknown - likely large tower or AIO
- **Case Fans**: Unknown quantity
- **Note**: CPU temps 70-80°C under load (healthy)
---
### PVE2 (10.10.10.102) - Secondary Proxmox Server
#### CPU
- **Model**: AMD Ryzen Threadripper PRO 3975WX
- **Specs**: Same as PVE (32C/64T, 280W TDP)
#### RAM
- **Capacity**: 128 GB DDR4 ECC
- **Same specs as PVE**
#### Storage
| Pool | Devices | Type | Capacity | Purpose |
|------|---------|------|----------|---------|
| `nvme-mirror3` | 2x NVMe (model unknown) | ZFS Mirror | Unknown | High-performance VM storage |
| `local-zfs2` | 2x WD Red 6TB HDD | ZFS Mirror | ~6 TB usable | Bulk/archival storage (spins down) |
**HDD Spindown**: Configured for 30-min idle spindown (saves ~10-16W)
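This document does not record which tool applies the spindown. Assuming `hdparm` is used (an assumption), its `-S` flag takes an encoded value rather than minutes: 1-240 are multiples of 5 seconds, and 241-251 are 1 to 11 units of 30 minutes. The sketch below converts minutes to that value; `hdparm_standby_value` is a hypothetical helper name.

```bash
# Convert a standby timeout in minutes to the value expected by `hdparm -S`.
# Encoding (per the hdparm man page): 1-240 = multiples of 5 seconds,
# 241-251 = 1 to 11 units of 30 minutes. Hypothetical helper name.
hdparm_standby_value() {
    minutes=$1
    if [ "$minutes" -ge 1 ] && [ "$minutes" -le 20 ]; then
        # 60 s / 5 s = 12 steps per minute
        echo $(( minutes * 12 ))
    elif [ "$minutes" -ge 30 ] && [ "$minutes" -le 330 ] && [ $(( minutes % 30 )) -eq 0 ]; then
        # Whole multiples of 30 minutes, up to 5.5 hours
        echo $(( 240 + minutes / 30 ))
    else
        echo "unsupported timeout: ${minutes} min" >&2
        return 1
    fi
}

# Value for the 30-minute spindown described above
hdparm_standby_value 30
```

The result would then be passed as `hdparm -S <value> /dev/sdX`, where `/dev/sdX` is a placeholder for each WD Red device.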
#### GPUs
| Model | Slot | VRAM | TDP | Purpose | Passed To |
|-------|------|------|-----|---------|-----------|
| NVIDIA RTX A6000 | PCIe slot 1 | 48 GB GDDR6 | 300W | AI trading workloads | trading-vm (301) |
#### Network Cards
| Interface | Model | Speed | Purpose |
|-----------|-------|-------|---------|
| nic1 | Unknown (onboard) | 1 Gb | Management |
**Note**: MTU set to 9000 for jumbo frames
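Jumbo frames only help if every hop on the path accepts them. One common check is a ping with fragmentation prohibited and a payload sized to fill the MTU exactly. The sketch below computes that payload; `icmp_payload_for_mtu` is a hypothetical helper name, and IPv4 without header options is assumed.

```bash
# Largest ICMP payload that fits in one unfragmented IPv4 packet for a given MTU.
# 20 bytes IPv4 header + 8 bytes ICMP header = 28 bytes of overhead.
icmp_payload_for_mtu() {
    echo $(( $1 - 28 ))
}

# Payload size to test an MTU of 9000
icmp_payload_for_mtu 9000
```

With Linux iputils, the resulting value can be used as `ping -M do -s 8972 -c 3 10.10.10.120` from PVE2. If any device on the path is still at MTU 1500, the ping fails instead of silently fragmenting.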
#### Motherboard
- **Model**: Unknown
- **Chipset**: AMD WRX80 (the chipset paired with Threadripper PRO 3000WX; confirm once the board model is known)
- **Similar to PVE**
---
## Network Equipment
### UniFi Cloud Gateway Fiber (UCG-Fiber)
- **Model**: UniFi Cloud Gateway Fiber
- **IP**: 10.10.10.1
- **Ports**: Multiple 1Gb + SFP+ uplink
- **Features**: Router, firewall, VPN, IDS/IPS
- **MTU**: 9216 (supports jumbo frames)
- **Tailscale**: Installed for VPN failover
### Switches
**Details needed** - investigate current switch setup:
- 10Gb switch for high-speed connections?
- 1Gb switch for general devices?
- PoE capabilities?
```bash
# Check link state on the 10Gb interfaces (shows whether a link is up, not which device is on the other end)
ssh pve 'ip link show enp35s0f0'
ssh pve 'ip link show enp35s0f1'
```
---
## Storage Hardware
### EMC Storage Enclosure
**See [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md) for complete details**
- **Model**: EMC KTN-STL4 (or similar)
- **Form Factor**: 4U rackmount
- **Drive Bays**: 25x 3.5" SAS/SATA
- **Controllers**: Dual LCC (Link Control Cards)
- **Connection**: SAS via LSI SAS2308 HBA
- **Passed to**: TrueNAS VM (VMID 100)
**Current Status**:
- LCC A: Active (working)
- LCC B: Failed (replacement ordered)
**Drive Inventory**: Unknown - needs audit
```bash
# Get drive list from TrueNAS
ssh truenas 'smartctl --scan'
ssh truenas 'lsblk'
```
### NVMe Drives
| Model | Quantity | Capacity | Location | Pool |
|-------|----------|----------|----------|------|
| Sabrent Rocket Q | 2 | Likely 4 TB each (inferred from 3.6 TB usable mirror; needs verification) | PVE | nvme-mirror1 |
| Kingston SFYRD | 2 | 2 TB each | PVE | nvme-mirror2 |
| Unknown model | 2 | Unknown | PVE2 | nvme-mirror3 |
| Samsung (model unknown) | 1 | Unknown | TrueNAS (passed) | ZFS cache |
### SSDs
| Model | Quantity | Capacity | Location | Pool |
|-------|----------|----------|----------|------|
| Samsung 870 QVO | 2 | 4 TB each | PVE | rpool |
### HDDs
| Model | Quantity | Capacity | Location | Pool |
|-------|----------|----------|----------|------|
| WD Red | 2 | 6 TB each | PVE2 | local-zfs2 |
| Unknown (in EMC) | Unknown | Unknown | TrueNAS | vault |
---
## UPS
### Current UPS
| Specification | Value |
|---------------|-------|
| **Model** | CyberPower OR2200PFCRT2U |
| **Capacity** | 2200VA / 1320W |
| **Form Factor** | 2U rackmount |
| **Input** | NEMA 5-15P (rewired from 5-20P) |
| **Outlets** | 2x 5-20R + 6x 5-15R |
| **Output** | PFC Sinewave |
| **Runtime** | ~15-20 min @ 33% load |
| **Interface** | USB (connected to PVE) |
**See [UPS.md](UPS.md) for configuration details**
---
## Client Devices
### Mac Mini (Hutson's Workstation)
- **Model**: Unknown generation
- **CPU**: Unknown
- **RAM**: Unknown
- **Storage**: Unknown
- **Network**: 1Gb Ethernet (en0) - MTU 9000
- **Tailscale IP**: 100.108.89.58
- **Local IP**: 10.10.10.125 (static)
- **Purpose**: Primary workstation, Happy Coder daemon host
### MacBook (Mobile)
- **Model**: Unknown
- **Network**: Wi-Fi + Ethernet adapter
- **Tailscale IP**: Unknown
- **Purpose**: Mobile work, development
### Windows PC
- **Model**: Unknown
- **CPU**: Unknown
- **Network**: 1Gb Ethernet
- **IP**: 10.10.10.150
- **Purpose**: Gaming, Windows development, Syncthing node
### Phone (Android)
- **Model**: Unknown
- **IP**: 10.10.10.54 (when on Wi-Fi)
- **Purpose**: Syncthing mobile node, Happy Coder client
---
## Rack Layout (If Applicable)
**Needs documentation** - Current rack configuration unknown
Suggested format:
```
U42: Blank panel
U41: UPS (CyberPower 2U)
U40: UPS (CyberPower 2U)
U39: Switch (10Gb)
U38-U35: EMC Storage Enclosure (4U)
U34: PVE Server
U33: PVE2 Server
...
```
---
## Power Consumption
### Measured Power Draw
| Component | Idle | Typical | Peak | Notes |
|-----------|------|---------|------|-------|
| PVE Server | 250-350W | 500W | 750W | CPU + GPUs + storage |
| PVE2 Server | 200-300W | 400W | 600W | CPU + GPU + storage |
| Network Gear | ~50W | ~50W | ~50W | Router + switches |
| **Total** | **500-700W** | **~950W** | **~1400W** | Exceeds UPS under peak load |
**UPS Capacity**: 1320W
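**Typical Load**: ~38-53% of capacity at idle, ~72% at typical draw (within capacity)
**Peak Load**: ~1400W is about 106% of capacity; a brief excursion may be tolerated, but a sustained overload can trip the UPS, so both servers should not run at peak together for long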
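The load percentages follow from dividing each total in the power table by the 1320W rating. A short sketch of that calculation; `ups_load_pct` is a hypothetical helper name, and the wattages are the table's totals rather than live readings.

```bash
# Express a power draw in watts as a rounded percentage of UPS capacity.
ups_load_pct() {
    awk -v w="$1" -v cap="$2" 'BEGIN { printf "%.0f\n", w / cap * 100 }'
}

# Idle low, idle high, typical, and peak totals against the 1320W rating
for watts in 500 700 950 1400; do
    echo "${watts}W -> $(ups_load_pct "$watts" 1320)%"
done
```

Anything above 100% means the combined draw exceeds what the UPS is rated to deliver.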
### Power Optimizations Applied
**See [POWER-MANAGEMENT.md](POWER-MANAGEMENT.md) for details**
- KSM (`ksmd`) disabled: ~60-80W saved
- CPU governors: ~60-120W saved
- Syncthing rescans: ~60-80W saved
- HDD spindown: ~10-16W saved when idle
- **Total savings**: ~190-300W (sum of the ranges above)
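Adding the per-item ranges gives the bounds on total savings if every optimization contributes at once. A minimal sketch of that sum; `sum_watt_ranges` is a hypothetical helper name, and it assumes the savings are independent, which may overstate the real total if they overlap.

```bash
# Sum "low-high" watt ranges into one combined "low-high" range (hypothetical helper).
sum_watt_ranges() {
    printf '%s\n' "$@" | awk -F- '{ lo += $1; hi += $2 } END { printf "%d-%d\n", lo, hi }'
}

# KSM, CPU governors, Syncthing rescans, HDD spindown
sum_watt_ranges 60-80 60-120 60-80 10-16
```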
---
## Thermal Management
### CPU Cooling
**PVE & PVE2**:
- CPU cooler: Unknown model
- Thermal paste: Unknown, likely needs refresh if temps >85°C
- Target temp: 70-80°C under load
- Max safe: 90°C Tctl (Threadripper PRO spec)
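The thresholds above can be turned into a simple check. The sketch below classifies a whole-degree Tctl reading; `classify_tctl` and its label names are hypothetical, and the cut-offs are taken directly from this list (80°C load target, 85°C paste-refresh hint, 90°C maximum).

```bash
# Classify a CPU Tctl reading (whole degrees C) against the documented thresholds.
classify_tctl() {
    t=$1
    if [ "$t" -ge 90 ]; then
        echo "critical"       # at or above the 90°C maximum
    elif [ "$t" -gt 85 ]; then
        echo "check-paste"    # above the paste-refresh threshold
    elif [ "$t" -gt 80 ]; then
        echo "elevated"       # above the 70-80°C load target
    else
        echo "ok"
    fi
}

classify_tctl 75
```

The input would typically come from the `Tctl` line of `sensors` output on each host, assuming the k10temp driver exposes it.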
### GPU Cooling
All GPUs rely on their stock coolers. Observed power draw:
- TITAN RTX: 2-3W idle, 280W load
- RTX A6000: 11W idle, 300W load
- Quadro P2000: 25W constant (Plex active)
### Case Airflow
**Unknown** - needs investigation:
- Case model?
- Fan configuration?
- Positive or negative pressure?
---
## Cable Management
### Network Cables
| Connection | Type | Length | Speed |
|------------|------|--------|-------|
| PVE → Switch (10Gb) | OM3 fiber | Unknown | 10Gb |
| PVE2 → Router | Cat6 | Unknown | 1Gb |
| Mac Mini → Switch | Cat6 | Unknown | 1Gb |
| TrueNAS → EMC | SAS cable | Unknown | 6Gb/s |
### Power Cables
**Critical**: All servers on UPS battery-backed outlets
---
## Maintenance Schedule
### Periodic Maintenance
- [ ] Clean dust from servers (every 6-12 months)
- [ ] Check thermal paste on CPUs (every 2-3 years)
- [ ] Test UPS battery runtime (annually)
- [ ] Verify all fans operational
- [ ] Check for bulging capacitors on PSUs
### Drive Health
```bash
# Spot-check SMART data for one drive on each Proxmox host
ssh pve 'smartctl -a /dev/nvme0'
ssh pve2 'smartctl -a /dev/sda'
# Summarize every drive that smartctl detects on TrueNAS
ssh truenas 'smartctl --scan | while read -r dev _; do echo "=== $dev ==="; smartctl -a "$dev" | grep -E "Model|Serial|Health|Reallocated|Current_Pending"; done'
```
### Temperature Monitoring
```bash
# Check all temps (needs lm-sensors installed)
ssh pve 'sensors'
ssh pve2 'sensors'
```
---
## Warranty & Purchase Info
**Needs documentation**:
- When were servers purchased?
- Where were components bought?
- Any warranties still active?
- Replacement part sources?
---
## Upgrade Path
### Short-term Upgrades (< 6 months)
- [ ] 20A circuit for UPS (restore original 5-20P plug)
- [ ] Document missing hardware specs
- [ ] Label all cables
- [ ] Create rack diagram
### Medium-term Upgrades (6-12 months)
- [ ] Additional 10Gb NIC for PVE2?
- [ ] More NVMe storage?
- [ ] Upgrade network switches?
- [ ] Replace EMC enclosure with newer model?
### Long-term Upgrades (1-2 years)
- [ ] CPU upgrade to newer Threadripper?
- [ ] RAM expansion to 256GB?
- [ ] Additional GPU for AI workloads?
- [ ] Migrate to PCIe 5.0 storage?
---
## Investigation Needed
High-priority items to document:
- [ ] Get exact motherboard model (both servers)
- [ ] Get PSU model and wattage
- [ ] CPU cooler models
- [ ] Network switch models and configuration
- [ ] Complete drive inventory in EMC enclosure
- [ ] RAM speed and timings
- [ ] Case models
- [ ] Exact NVMe models for all drives
**Commands to gather info**:
```bash
# Motherboard
ssh pve 'dmidecode -t baseboard'
# CPU details
ssh pve 'lscpu'
# RAM details
ssh pve 'dmidecode -t memory | grep -E "Size|Speed|Manufacturer"'
# Storage devices
ssh pve 'lsblk -o NAME,SIZE,TYPE,TRAN,MODEL'
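# Network cards (wired NICs are listed by lspci as "Ethernet controller")
ssh pve 'lspci | grep -iE "ethernet|network"'
# GPU details (some NVIDIA cards are listed as "3D controller" rather than VGA)
ssh pve 'lspci | grep -iE "vga|3d controller"'
ssh pve 'nvidia-smi -L' # If nvidia-smi available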
```
---
## Related Documentation
- [VMS.md](VMS.md) - VM resource allocation
- [STORAGE.md](STORAGE.md) - Storage pools and usage
- [POWER-MANAGEMENT.md](POWER-MANAGEMENT.md) - Power optimizations
- [UPS.md](UPS.md) - UPS configuration
- [NETWORK.md](NETWORK.md) - Network configuration
- [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md) - Storage enclosure details
---
**Last Updated**: 2025-12-22
**Status**: ⚠️ Incomplete - many specs need investigation