Complete Phase 2 documentation: Add HARDWARE, SERVICES, MONITORING, MAINTENANCE

Phase 2 documentation implementation:
- Created HARDWARE.md: Complete hardware inventory (servers, GPUs, storage, network cards)
- Created SERVICES.md: Service inventory with URLs, credentials, health checks (25+ services)
- Created MONITORING.md: Health monitoring recommendations, alert setup, implementation plan
- Created MAINTENANCE.md: Regular procedures, update schedules, testing checklists
- Updated README.md: Added all Phase 2 documentation links
- Updated CLAUDE.md: Cleaned up to quick reference only (1340→377 lines)

All detailed content now in specialized documentation files with cross-references.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Commit 56b82df497 (parent 23e9df68c9) by Hutson, 2025-12-23. 14 changed files, 6328 additions, 1036 deletions.

STORAGE.md (new file, 510 lines):
# Storage Architecture
Documentation of all storage pools, datasets, shares, and capacity planning across the homelab.
## Overview
### Storage Distribution
| Location | Type | Capacity | Purpose |
|----------|------|----------|---------|
| **PVE** | NVMe + SSD mirrors | ~9 TB usable | VM storage, fast IO |
| **PVE2** | NVMe + HDD mirrors | ~6+ TB usable | VM storage, bulk data |
| **TrueNAS** | ZFS pool + EMC enclosure | ~12+ TB usable | Central file storage, NFS/SMB |
---
## PVE (10.10.10.120) Storage Pools
### nvme-mirror1 (Primary Fast Storage)
- **Type**: ZFS mirror
- **Devices**: 2x Sabrent Rocket Q NVMe
- **Capacity**: 3.6 TB usable
- **Purpose**: High-performance VM storage
- **Used By**:
  - Critical VMs requiring fast IO
  - Database workloads
  - Development environments
**Check status**:
```bash
ssh pve 'zpool status nvme-mirror1'
ssh pve 'zpool list nvme-mirror1'
```
### nvme-mirror2 (Secondary Fast Storage)
- **Type**: ZFS mirror
- **Devices**: 2x Kingston SFYRD 2TB NVMe
- **Capacity**: 1.8 TB usable
- **Purpose**: Additional fast VM storage
- **Used By**: TBD
**Check status**:
```bash
ssh pve 'zpool status nvme-mirror2'
ssh pve 'zpool list nvme-mirror2'
```
### rpool (Root Pool)
- **Type**: ZFS mirror
- **Devices**: 2x Samsung 870 QVO 4TB SSD
- **Capacity**: 3.6 TB usable
- **Purpose**: Proxmox OS, container storage, VM backups
- **Used By**:
  - Proxmox root filesystem
  - LXC containers
  - Local VM backups
**Check status**:
```bash
ssh pve 'zpool status rpool'
ssh pve 'df -h /var/lib/vz'
```
### Storage Pool Usage Summary (PVE)
**Get current usage**:
```bash
ssh pve 'zpool list'
ssh pve 'pvesm status'
```
---
## PVE2 (10.10.10.102) Storage Pools
### nvme-mirror3 (Fast Storage)
- **Type**: ZFS mirror
- **Devices**: 2x NVMe (model unknown)
- **Capacity**: Unknown (needs investigation)
- **Purpose**: High-performance VM storage
- **Used By**: Trading VM (301), other VMs
**Check status**:
```bash
ssh pve2 'zpool status nvme-mirror3'
ssh pve2 'zpool list nvme-mirror3'
```
### local-zfs2 (Bulk Storage)
- **Type**: ZFS mirror
- **Devices**: 2x WD Red 6TB HDD
- **Capacity**: ~6 TB usable
- **Purpose**: Bulk/archival storage
- **Power Management**: 30-minute spindown configured
  - Saves ~10-16 W when idle
  - Udev rule: `/etc/udev/rules.d/69-hdd-spindown.rules`
  - Command: `hdparm -S 241` (30 min)
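A minimal sketch of recreating that udev rule, in case it ever needs to be rebuilt. The `WDC WD60EFRX*` model match is an assumption; check the real string with `cat /sys/block/sdX/device/model` first.
```bash
# Sketch only - recreate /etc/udev/rules.d/69-hdd-spindown.rules on PVE2
ssh pve2 "cat > /etc/udev/rules.d/69-hdd-spindown.rules" <<'EOF'
ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="sd?", ATTRS{model}=="WDC WD60EFRX*", RUN+="/usr/sbin/hdparm -S 241 /dev/%k"
EOF
# Reload rules and re-trigger block devices so the setting applies without a reboot
ssh pve2 'udevadm control --reload-rules && udevadm trigger --subsystem-match=block'
```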
**Notes**:
- Pool had only 768 KB used as of 2024-12-16
- Drives configured to spin down after 30 min idle
- Good for archival, NOT for active workloads
**Check status**:
```bash
ssh pve2 'zpool status local-zfs2'
ssh pve2 'zpool list local-zfs2'
# Check if drives are spun down
ssh pve2 'hdparm -C /dev/sdX' # Shows active/standby
```
---
## TrueNAS (VM 100 @ 10.10.10.200) - Central Storage
### ZFS Pool: vault
**Primary storage pool** for all shared data.
**Devices**: ❓ Needs investigation
- EMC storage enclosure with multiple drives
- SAS connection via LSI SAS2308 HBA (passed through to VM)
**Capacity**: ❓ Needs investigation
**Check pool status**:
```bash
ssh truenas 'zpool status vault'
ssh truenas 'zpool list vault'
# Get detailed capacity
ssh truenas 'zfs list -o name,used,avail,refer,mountpoint'
```
### Datasets (Known)
Likely datasets, inferred from the Syncthing configuration:
| Dataset | Purpose | Synced Devices | Notes |
|---------|---------|----------------|-------|
| vault/documents | Personal documents | Mac Mini, MacBook, Windows PC, Phone | ~11 GB |
| vault/downloads | Downloads folder | Mac Mini, TrueNAS | ~38 GB |
| vault/pictures | Photos | Mac Mini, MacBook, Phone | Unknown size |
| vault/notes | Note files | Mac Mini, MacBook, Phone | Unknown size |
| vault/desktop | Desktop sync | Unknown | 7.2 GB |
| vault/movies | Movie library | Unknown | Unknown size |
| vault/config | Config files | Mac Mini, MacBook | Unknown size |
**Get complete dataset list**:
```bash
ssh truenas 'zfs list -r vault'
```
### NFS/SMB Shares
**Status**: ❓ Not documented
**Needs investigation**:
```bash
# List NFS exports
ssh truenas 'showmount -e localhost'
# List SMB shares
ssh truenas 'smbclient -L localhost -N'
# Via TrueNAS API/UI
# Sharing → Unix Shares (NFS)
# Sharing → Windows Shares (SMB)
```
**Expected shares**:
- Media libraries for Plex (on Saltbox VM)
- Document storage
- VM backups?
- ISO storage?
### EMC Storage Enclosure
**Model**: EMC KTN-STL4 (or similar)
**Connection**: SAS via LSI SAS2308 HBA (passthrough to TrueNAS VM)
**Drives**: ❓ Unknown count and capacity
**See [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md)** for:
- SES commands
- Fan control
- LCC (Link Control Card) troubleshooting
- Maintenance procedures
**Check enclosure status**:
```bash
ssh truenas 'sg_ses --page=0x02 /dev/sgX' # Enclosure status (fans, PSUs, drive slots)
ssh truenas 'smartctl --scan' # List all drives
```
---
## Storage Network Architecture
### Internal Storage Network (10.10.20.0/24)
**Purpose**: Dedicated network for NFS/iSCSI traffic, keeping storage I/O off the main network.
**Bridge**: vmbr3 on PVE (virtual bridge, no physical NIC)
**Subnet**: 10.10.20.0/24
**DHCP**: No
**Gateway**: No (internal only, no internet)
**Connected VMs**:
- TrueNAS VM (secondary NIC)
- Saltbox VM (secondary NIC) - for NFS mounts
- Other VMs needing storage access
**Configuration**:
```bash
# On TrueNAS VM - check second NIC
ssh truenas 'ip addr show enp6s19'
# On Saltbox - check NFS mounts
ssh saltbox 'mount | grep nfs'
```
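To confirm the bridge itself on the PVE host (generic checks, no assumptions beyond the bridge name):
```bash
# On PVE - confirm vmbr3 exists and see which VM NICs are attached to it
ssh pve 'ip link show vmbr3'
ssh pve 'ip -br link show master vmbr3'
ssh pve "grep -A6 'iface vmbr3' /etc/network/interfaces"
```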
**Benefits**:
- Separates storage traffic from the general network
- Prevents NFS/SMB from saturating the main network
- Better performance for storage-heavy workloads
---
## Storage Capacity Planning
### Current Usage (Estimate)
**Needs actual audit**:
```bash
# PVE pools
ssh pve 'zpool list -o name,size,alloc,free'
# PVE2 pools
ssh pve2 'zpool list -o name,size,alloc,free'
# TrueNAS vault pool
ssh truenas 'zpool list vault'
# Get detailed breakdown
ssh truenas 'zfs list -r vault -o name,used,avail'
```
### Growth Rate
**Needs tracking** - recommend monthly snapshots of capacity:
```bash
#!/bin/bash
# Save as ~/bin/storage-capacity-report.sh
DATE=$(date +%Y-%m-%d)
REPORT=~/Backups/storage-reports/capacity-$DATE.txt
mkdir -p ~/Backups/storage-reports
echo "Storage Capacity Report - $DATE" > $REPORT
echo "================================" >> $REPORT
echo "" >> $REPORT
echo "PVE Pools:" >> $REPORT
ssh pve 'zpool list' >> $REPORT
echo "" >> $REPORT
echo "PVE2 Pools:" >> $REPORT
ssh pve2 'zpool list' >> $REPORT
echo "" >> $REPORT
echo "TrueNAS Pools:" >> $REPORT
ssh truenas 'zpool list' >> $REPORT
echo "" >> $REPORT
echo "TrueNAS Datasets:" >> $REPORT
ssh truenas 'zfs list -r vault -o name,used,avail' >> $REPORT
echo "Report saved to $REPORT"
```
**Run monthly via cron**:
```cron
0 9 1 * * ~/bin/storage-capacity-report.sh
```
### Expansion Planning
**When to expand**:
- Pool reaches 80% capacity
- Performance degrades
- New workloads require more space
**Expansion options**:
1. Add drives to existing pools (for mirrors, add another mirror vdev; see the sketch after this list)
2. Add new NVMe drives to PVE/PVE2
3. Expand EMC enclosure (add more drives)
4. Add second EMC enclosure
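For option 1, a minimal sketch, assuming nvme-mirror2 is the pool being grown and using placeholder device paths. Adding a vdev is permanent, so do the dry run first:
```bash
# Dry run: -n prints the resulting pool layout without changing anything
ssh pve 'zpool add -n nvme-mirror2 mirror /dev/disk/by-id/nvme-NEWDISK1 /dev/disk/by-id/nvme-NEWDISK2'
# If the layout looks right, repeat without -n to actually add the mirror vdev
```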
**Cost estimates**: TBD
---
## ZFS Health Monitoring
### Daily Health Checks
```bash
# Check for errors on all pools
ssh pve 'zpool status -x' # Shows only unhealthy pools
ssh pve2 'zpool status -x'
ssh truenas 'zpool status -x'
# Check scrub status
ssh pve 'zpool status | grep scrub'
ssh pve2 'zpool status | grep scrub'
ssh truenas 'zpool status | grep scrub'
```
### Scrub Schedule
**Recommended**: Monthly scrub on all pools
**Configure scrub**:
```bash
# Via Proxmox UI: Node → Disks → ZFS → Select pool → Scrub
# Or via cron:
0 2 1 * * /sbin/zpool scrub nvme-mirror1
0 2 1 * * /sbin/zpool scrub rpool
```
**On TrueNAS**:
- Configure via UI: Storage → Pools → Scrub Tasks
- Recommended: 1st of every month at 2 AM
### SMART Monitoring
**Check drive health**:
```bash
# PVE
ssh pve 'smartctl -a /dev/nvme0'
ssh pve 'smartctl -a /dev/sda'
# TrueNAS
ssh truenas 'smartctl --scan'
ssh truenas 'smartctl -a /dev/sdX' # For each drive
```
**Configure SMART tests**:
- TrueNAS UI: Tasks → S.M.A.R.T. Tests
- Recommended: Weekly short test, monthly long test
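The same tests can also be started manually from the shell to spot-check a drive (substitute a real device from `smartctl --scan`):
```bash
ssh truenas 'smartctl -t short /dev/sdX'     # short self-test, ~2 minutes
ssh truenas 'smartctl -t long /dev/sdX'      # extended self-test, hours on large drives
ssh truenas 'smartctl -l selftest /dev/sdX'  # view results once the test finishes
```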
### Alerts
**Set up email alerts for** (a sample check script follows this list):
- ZFS pool errors
- SMART test failures
- Pool capacity > 80%
- Scrub failures
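Until proper alerting exists, a rough sketch of a cron-able check script is below. It assumes the machine running it has working outbound mail (`mail` command) and SSH access to all three hosts; the recipient address is a placeholder.
```bash
#!/bin/bash
# Sketch: daily ZFS health and capacity check across pve, pve2, truenas
RECIPIENT="you@example.com"   # placeholder - set a real address

for host in pve pve2 truenas; do
    # "zpool status -x" prints "all pools are healthy" when nothing is wrong
    status=$(ssh "$host" 'zpool status -x')
    if [ "$status" != "all pools are healthy" ]; then
        echo "$status" | mail -s "ZFS health alert: $host" "$RECIPIENT"
    fi

    # Warn when any pool is above 80% capacity
    ssh "$host" 'zpool list -H -o name,capacity' | while read -r pool cap; do
        if [ "${cap%\%}" -gt 80 ]; then
            echo "$pool on $host is at $cap" | mail -s "ZFS capacity alert: $host/$pool" "$RECIPIENT"
        fi
    done
done
```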
---
## Storage Performance Tuning
### ZFS ARC (Cache)
**Check ARC usage**:
```bash
ssh pve 'arc_summary'
ssh truenas 'arc_summary'
```
**Tuning** (if needed):
- PVE/PVE2: Set max ARC in `/etc/modprobe.d/zfs.conf` (sketch below)
- TrueNAS: Configure via UI (System → Advanced → Tunables)
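For PVE/PVE2, a minimal sketch of that file, run on the host itself. The 16 GiB cap is only an example value; size it to leave enough RAM for VMs.
```bash
# Run on the PVE host (example value: cap ARC at 16 GiB)
echo "options zfs zfs_arc_max=$((16 * 1024**3))" > /etc/modprobe.d/zfs.conf
update-initramfs -u                                                # persist across reboots
echo $((16 * 1024**3)) > /sys/module/zfs/parameters/zfs_arc_max    # apply immediately
```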
### NFS Performance
**Mount options** (on clients like Saltbox):
```
rsize=131072,wsize=131072,hard,timeo=600,retrans=2,vers=3
```
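A hedged example of how those options might appear in a client's `/etc/fstab`; the export path and mountpoint are placeholders, not the actual share layout (the server IP is the documented TrueNAS address):
```bash
# Hypothetical /etc/fstab line on a client such as Saltbox (export path and mountpoint are placeholders)
# 10.10.10.200:/mnt/vault/media  /mnt/media  nfs  rsize=131072,wsize=131072,hard,timeo=600,retrans=2,vers=3  0  0

# One-off test mount with the same options (no fstab edit needed)
mount -t nfs -o rsize=131072,wsize=131072,hard,timeo=600,retrans=2,vers=3 10.10.10.200:/mnt/vault/media /mnt/media
```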
**Verify NFS mounts**:
```bash
ssh saltbox 'mount | grep nfs'
```
### Record Size Optimization
**Different workloads need different record sizes**:
- VM disks: 64K (a reasonable middle ground; the ZFS dataset default is 128K)
- Databases: 8K or 16K
- Media files: 1M (large sequential reads)
**Set record size** (on TrueNAS datasets):
```bash
ssh truenas 'zfs set recordsize=1M vault/movies'
```
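Note that `recordsize` only affects data written after the change; existing files keep their old block size until they are rewritten. To confirm the setting:
```bash
ssh truenas 'zfs get recordsize vault/movies'
```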
---
## Disaster Recovery
### Pool Recovery
**If a pool fails to import**:
```bash
# Import under a new name without mounting its datasets (-N)
zpool import -f -N poolname newpoolname
# Import read-only to inspect the data safely
zpool import -f -o readonly=on poolname
# Recovery-mode import, discarding the last few transactions (last resort)
zpool import -f -F poolname
```
### Drive Replacement
**When a drive fails**:
```bash
# Identify failed drive
zpool status poolname
# Replace drive
zpool replace poolname old-device new-device
# Monitor resilver
watch zpool status poolname
```
### Data Recovery
**If pool is completely lost**:
1. Restore from offsite backup (see [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md))
2. Recreate pool structure (see the sketch below)
3. Restore data
**Critical**: This is why we need offsite backups!
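For step 2, a rough sketch of recreating a mirror pool from the shell; the device IDs and the striped-mirror layout are placeholders, so match whatever topology the original pool used. On TrueNAS, recreate the pool through the UI instead so the middleware tracks it.
```bash
# Placeholder device IDs - substitute real /dev/disk/by-id/ paths and the intended layout
zpool create -o ashift=12 vault \
  mirror /dev/disk/by-id/scsi-DRIVE1 /dev/disk/by-id/scsi-DRIVE2 \
  mirror /dev/disk/by-id/scsi-DRIVE3 /dev/disk/by-id/scsi-DRIVE4
```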
---
## Quick Reference
### Common Commands
```bash
# Pool status
zpool status [poolname]
zpool list
# Dataset usage
zfs list
zfs list -r vault
# Check pool health (only unhealthy)
zpool status -x
# Scrub pool
zpool scrub poolname
# Get pool IO stats
zpool iostat -v 1
# Snapshot management
zfs snapshot poolname/dataset@snapname
zfs list -t snapshot
zfs rollback poolname/dataset@snapname
zfs destroy poolname/dataset@snapname
```
### Storage Locations by Use Case
| Use Case | Recommended Storage | Why |
|----------|---------------------|-----|
| VM OS disk | nvme-mirror1 (PVE) | Fastest IO |
| Database | nvme-mirror1/2 | Low latency |
| Media files | TrueNAS vault | Large capacity |
| Development | nvme-mirror2 | Fast, mid-tier |
| Containers | rpool | Good performance |
| Backups | TrueNAS or rpool | Large capacity |
| Archive | local-zfs2 (PVE2) | Cheap, can spin down |
---
## Investigation Needed
- [ ] Get complete TrueNAS dataset list
- [ ] Document NFS/SMB share configuration
- [ ] Inventory EMC enclosure drives (count, capacity, model)
- [ ] Document current pool usage percentages
- [ ] Set up monthly capacity reports
- [ ] Configure ZFS scrub schedules
- [ ] Set up storage health alerts
---
## Related Documentation
- [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md) - Backup and snapshot strategy
- [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md) - Storage enclosure maintenance
- [VMS.md](VMS.md) - VM storage assignments
- [NETWORK.md](NETWORK.md) - Storage network configuration
---
**Last Updated**: 2025-12-22