Complete Phase 2 documentation: Add HARDWARE, SERVICES, MONITORING, MAINTENANCE

Phase 2 documentation implementation:
- Created HARDWARE.md: Complete hardware inventory (servers, GPUs, storage, network cards)
- Created SERVICES.md: Service inventory with URLs, credentials, health checks (25+ services)
- Created MONITORING.md: Health monitoring recommendations, alert setup, implementation plan
- Created MAINTENANCE.md: Regular procedures, update schedules, testing checklists
- Updated README.md: Added all Phase 2 documentation links
- Updated CLAUDE.md: Cleaned up to quick reference only (1340→377 lines)

All detailed content now in specialized documentation files with cross-references.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Commit 56b82df497 (parent 23e9df68c9) by Hutson, 2025-12-23. 14 changed files, 6328 additions, 1036 deletions.

STORAGE.md (new file, 510 lines):
# Storage Architecture
Documentation of all storage pools, datasets, shares, and capacity planning across the homelab.
## Overview
### Storage Distribution
| Location | Type | Capacity | Purpose |
|----------|------|----------|---------|
| **PVE** | NVMe + SSD mirrors | ~9 TB usable | VM storage, fast IO |
| **PVE2** | NVMe + HDD mirrors | ~6+ TB usable | VM storage, bulk data |
| **TrueNAS** | ZFS pool + EMC enclosure | ~12+ TB usable | Central file storage, NFS/SMB |
---
## PVE (10.10.10.120) Storage Pools
### nvme-mirror1 (Primary Fast Storage)
- **Type**: ZFS mirror
- **Devices**: 2x Sabrent Rocket Q NVMe
- **Capacity**: 3.6 TB usable
- **Purpose**: High-performance VM storage
- **Used By**:
  - Critical VMs requiring fast IO
  - Database workloads
  - Development environments
**Check status**:
```bash
ssh pve 'zpool status nvme-mirror1'
ssh pve 'zpool list nvme-mirror1'
```
### nvme-mirror2 (Secondary Fast Storage)
- **Type**: ZFS mirror
- **Devices**: 2x Kingston SFYRD 2TB NVMe
- **Capacity**: 1.8 TB usable
- **Purpose**: Additional fast VM storage
- **Used By**: TBD
**Check status**:
```bash
ssh pve 'zpool status nvme-mirror2'
ssh pve 'zpool list nvme-mirror2'
```
### rpool (Root Pool)
- **Type**: ZFS mirror
- **Devices**: 2x Samsung 870 QVO 4TB SSD
- **Capacity**: 3.6 TB usable
- **Purpose**: Proxmox OS, container storage, VM backups
- **Used By**:
  - Proxmox root filesystem
  - LXC containers
  - Local VM backups
**Check status**:
```bash
ssh pve 'zpool status rpool'
ssh pve 'df -h /var/lib/vz'
```
### Storage Pool Usage Summary (PVE)
**Get current usage**:
```bash
ssh pve 'zpool list'
ssh pve 'pvesm status'
```
---
## PVE2 (10.10.10.102) Storage Pools
### nvme-mirror3 (Fast Storage)
- **Type**: ZFS mirror
- **Devices**: 2x NVMe (model unknown)
- **Capacity**: Unknown (needs investigation)
- **Purpose**: High-performance VM storage
- **Used By**: Trading VM (301), other VMs
**Check status**:
```bash
ssh pve2 'zpool status nvme-mirror3'
ssh pve2 'zpool list nvme-mirror3'
```
### local-zfs2 (Bulk Storage)
- **Type**: ZFS mirror
- **Devices**: 2x WD Red 6TB HDD
- **Capacity**: ~6 TB usable
- **Purpose**: Bulk/archival storage
- **Power Management**: 30-minute spindown configured
  - Saves ~10-16 W when idle
  - Udev rule: `/etc/udev/rules.d/69-hdd-spindown.rules`
  - Command: `hdparm -S 241` (30 min)
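A minimal sketch of recreating that udev rule, in case it ever needs to be rebuilt. The `WDC WD60EFRX*` model match is an assumption; check the real string with `cat /sys/block/sdX/device/model` first.
```bash
# Sketch only - recreate /etc/udev/rules.d/69-hdd-spindown.rules on PVE2
ssh pve2 "cat > /etc/udev/rules.d/69-hdd-spindown.rules" <<'EOF'
ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="sd?", ATTRS{model}=="WDC WD60EFRX*", RUN+="/usr/sbin/hdparm -S 241 /dev/%k"
EOF
# Reload rules and re-trigger block devices so the setting applies without a reboot
ssh pve2 'udevadm control --reload-rules && udevadm trigger --subsystem-match=block'
```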
**Notes**:
- Pool had only 768 KB used as of 2024-12-16
- Drives configured to spin down after 30 min idle
- Good for archival, NOT for active workloads
**Check status**:
```bash
ssh pve2 'zpool status local-zfs2'
ssh pve2 'zpool list local-zfs2'
# Check if drives are spun down
ssh pve2 'hdparm -C /dev/sdX' # Shows active/standby
```
---
## TrueNAS (VM 100 @ 10.10.10.200) - Central Storage
### ZFS Pool: vault
**Primary storage pool** for all shared data.
**Devices**: ❓ Needs investigation
- EMC storage enclosure with multiple drives
- SAS connection via LSI SAS2308 HBA (passed through to VM)
**Capacity**: ❓ Needs investigation
**Check pool status**:
```bash
ssh truenas 'zpool status vault'
ssh truenas 'zpool list vault'
# Get detailed capacity
ssh truenas 'zfs list -o name,used,avail,refer,mountpoint'
```
### Datasets (Known)
Likely datasets, inferred from the Syncthing configuration:
| Dataset | Purpose | Synced Devices | Notes |
|---------|---------|----------------|-------|
| vault/documents | Personal documents | Mac Mini, MacBook, Windows PC, Phone | ~11 GB |
| vault/downloads | Downloads folder | Mac Mini, TrueNAS | ~38 GB |
| vault/pictures | Photos | Mac Mini, MacBook, Phone | Unknown size |
| vault/notes | Note files | Mac Mini, MacBook, Phone | Unknown size |
| vault/desktop | Desktop sync | Unknown | 7.2 GB |
| vault/movies | Movie library | Unknown | Unknown size |
| vault/config | Config files | Mac Mini, MacBook | Unknown size |
**Get complete dataset list**:
```bash
ssh truenas 'zfs list -r vault'
```
### NFS/SMB Shares
**Status**: ❓ Not documented
**Needs investigation**:
```bash
# List NFS exports
ssh truenas 'showmount -e localhost'
# List SMB shares
ssh truenas 'smbclient -L localhost -N'
# Via TrueNAS API/UI
# Sharing → Unix Shares (NFS)
# Sharing → Windows Shares (SMB)
```
**Expected shares**:
- Media libraries for Plex (on Saltbox VM)
- Document storage
- VM backups?
- ISO storage?
### EMC Storage Enclosure
**Model**: EMC KTN-STL4 (or similar)
**Connection**: SAS via LSI SAS2308 HBA (passthrough to TrueNAS VM)
**Drives**: ❓ Unknown count and capacity
**See [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md)** for:
- SES commands
- Fan control
- LCC (Link Control Card) troubleshooting
- Maintenance procedures
**Check enclosure status**:
```bash
ssh truenas 'sg_ses --page=0x02 /dev/sgX' # Enclosure status (fans, PSUs, drive slots)
ssh truenas 'smartctl --scan' # List all drives
```
---
## Storage Network Architecture
### Internal Storage Network (10.10.20.0/24)
**Purpose**: Dedicated network for NFS/iSCSI traffic, keeping storage I/O off the main network.
**Bridge**: vmbr3 on PVE (virtual bridge, no physical NIC)
**Subnet**: 10.10.20.0/24
**DHCP**: No
**Gateway**: No (internal only, no internet)
**Connected VMs**:
- TrueNAS VM (secondary NIC)
- Saltbox VM (secondary NIC) - for NFS mounts
- Other VMs needing storage access
**Configuration**:
```bash
# On TrueNAS VM - check second NIC
ssh truenas 'ip addr show enp6s19'
# On Saltbox - check NFS mounts
ssh saltbox 'mount | grep nfs'
```
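To confirm the bridge itself on the PVE host (generic checks, no assumptions beyond the bridge name):
```bash
# On PVE - confirm vmbr3 exists and see which VM NICs are attached to it
ssh pve 'ip link show vmbr3'
ssh pve 'ip -br link show master vmbr3'
ssh pve "grep -A6 'iface vmbr3' /etc/network/interfaces"
```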
**Benefits**:
- Separates storage traffic from the general network
- Prevents NFS/SMB from saturating the main network
- Better performance for storage-heavy workloads
---
## Storage Capacity Planning
### Current Usage (Estimate)
**Needs actual audit**:
```bash
# PVE pools
ssh pve 'zpool list -o name,size,alloc,free'
# PVE2 pools
ssh pve2 'zpool list -o name,size,alloc,free'
# TrueNAS vault pool
ssh truenas 'zpool list vault'
# Get detailed breakdown
ssh truenas 'zfs list -r vault -o name,used,avail'
```
### Growth Rate
**Needs tracking** - recommend monthly snapshots of capacity:
```bash
#!/bin/bash
# Save as ~/bin/storage-capacity-report.sh
DATE=$(date +%Y-%m-%d)
REPORT=~/Backups/storage-reports/capacity-$DATE.txt
mkdir -p ~/Backups/storage-reports
echo "Storage Capacity Report - $DATE" > $REPORT
echo "================================" >> $REPORT
echo "" >> $REPORT
echo "PVE Pools:" >> $REPORT
ssh pve 'zpool list' >> $REPORT
echo "" >> $REPORT
echo "PVE2 Pools:" >> $REPORT
ssh pve2 'zpool list' >> $REPORT
echo "" >> $REPORT
echo "TrueNAS Pools:" >> $REPORT
ssh truenas 'zpool list' >> $REPORT
echo "" >> $REPORT
echo "TrueNAS Datasets:" >> $REPORT
ssh truenas 'zfs list -r vault -o name,used,avail' >> $REPORT
echo "Report saved to $REPORT"
```
**Run monthly via cron**:
```cron
0 9 1 * * ~/bin/storage-capacity-report.sh
```
### Expansion Planning
**When to expand**:
- Pool reaches 80% capacity
- Performance degrades
- New workloads require more space
**Expansion options**:
1. Add drives to existing pools (for mirrors, add another mirror vdev; see the sketch after this list)
2. Add new NVMe drives to PVE/PVE2
3. Expand EMC enclosure (add more drives)
4. Add second EMC enclosure
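For option 1, a minimal sketch, assuming nvme-mirror2 is the pool being grown and using placeholder device paths. Adding a vdev is permanent, so do the dry run first:
```bash
# Dry run: -n prints the resulting pool layout without changing anything
ssh pve 'zpool add -n nvme-mirror2 mirror /dev/disk/by-id/nvme-NEWDISK1 /dev/disk/by-id/nvme-NEWDISK2'
# If the layout looks right, repeat without -n to actually add the mirror vdev
```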
**Cost estimates**: TBD
---
## ZFS Health Monitoring
### Daily Health Checks
```bash
# Check for errors on all pools
ssh pve 'zpool status -x' # Shows only unhealthy pools
ssh pve2 'zpool status -x'
ssh truenas 'zpool status -x'
# Check scrub status
ssh pve 'zpool status | grep scrub'
ssh pve2 'zpool status | grep scrub'
ssh truenas 'zpool status | grep scrub'
```
### Scrub Schedule
**Recommended**: Monthly scrub on all pools
**Configure scrub**:
```bash
# Via Proxmox UI: Node → Disks → ZFS → Select pool → Scrub
# Or via cron:
0 2 1 * * /sbin/zpool scrub nvme-mirror1
0 2 1 * * /sbin/zpool scrub rpool
```
**On TrueNAS**:
- Configure via UI: Storage → Pools → Scrub Tasks
- Recommended: 1st of every month at 2 AM
### SMART Monitoring
**Check drive health**:
```bash
# PVE
ssh pve 'smartctl -a /dev/nvme0'
ssh pve 'smartctl -a /dev/sda'
# TrueNAS
ssh truenas 'smartctl --scan'
ssh truenas 'smartctl -a /dev/sdX' # For each drive
```
**Configure SMART tests**:
- TrueNAS UI: Tasks → S.M.A.R.T. Tests
- Recommended: Weekly short test, monthly long test
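The same tests can also be started manually from the shell to spot-check a drive (substitute a real device from `smartctl --scan`):
```bash
ssh truenas 'smartctl -t short /dev/sdX'     # short self-test, ~2 minutes
ssh truenas 'smartctl -t long /dev/sdX'      # extended self-test, hours on large drives
ssh truenas 'smartctl -l selftest /dev/sdX'  # view results once the test finishes
```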
### Alerts
**Set up email alerts for** (a sample check script follows this list):
- ZFS pool errors
- SMART test failures
- Pool capacity > 80%
- Scrub failures
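Until proper alerting exists, a rough sketch of a cron-able check script is below. It assumes the machine running it has working outbound mail (`mail` command) and SSH access to all three hosts; the recipient address is a placeholder.
```bash
#!/bin/bash
# Sketch: daily ZFS health and capacity check across pve, pve2, truenas
RECIPIENT="you@example.com"   # placeholder - set a real address

for host in pve pve2 truenas; do
    # "zpool status -x" prints "all pools are healthy" when nothing is wrong
    status=$(ssh "$host" 'zpool status -x')
    if [ "$status" != "all pools are healthy" ]; then
        echo "$status" | mail -s "ZFS health alert: $host" "$RECIPIENT"
    fi

    # Warn when any pool is above 80% capacity
    ssh "$host" 'zpool list -H -o name,capacity' | while read -r pool cap; do
        if [ "${cap%\%}" -gt 80 ]; then
            echo "$pool on $host is at $cap" | mail -s "ZFS capacity alert: $host/$pool" "$RECIPIENT"
        fi
    done
done
```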
---
## Storage Performance Tuning
### ZFS ARC (Cache)
**Check ARC usage**:
```bash
ssh pve 'arc_summary'
ssh truenas 'arc_summary'
```
**Tuning** (if needed):
- PVE/PVE2: Set max ARC in `/etc/modprobe.d/zfs.conf` (sketch below)
- TrueNAS: Configure via UI (System → Advanced → Tunables)
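For PVE/PVE2, a minimal sketch of that file, run on the host itself. The 16 GiB cap is only an example value; size it to leave enough RAM for VMs.
```bash
# Run on the PVE host (example value: cap ARC at 16 GiB)
echo "options zfs zfs_arc_max=$((16 * 1024**3))" > /etc/modprobe.d/zfs.conf
update-initramfs -u                                                # persist across reboots
echo $((16 * 1024**3)) > /sys/module/zfs/parameters/zfs_arc_max    # apply immediately
```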
### NFS Performance
**Mount options** (on clients like Saltbox):
```
rsize=131072,wsize=131072,hard,timeo=600,retrans=2,vers=3
```
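A hedged example of how those options might appear in a client's `/etc/fstab`; the export path and mountpoint are placeholders, not the actual share layout (the server IP is the documented TrueNAS address):
```bash
# Hypothetical /etc/fstab line on a client such as Saltbox (export path and mountpoint are placeholders)
# 10.10.10.200:/mnt/vault/media  /mnt/media  nfs  rsize=131072,wsize=131072,hard,timeo=600,retrans=2,vers=3  0  0

# One-off test mount with the same options (no fstab edit needed)
mount -t nfs -o rsize=131072,wsize=131072,hard,timeo=600,retrans=2,vers=3 10.10.10.200:/mnt/vault/media /mnt/media
```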
**Verify NFS mounts**:
```bash
ssh saltbox 'mount | grep nfs'
```
### Record Size Optimization
**Different workloads need different record sizes**:
- VM disks: 64K (a reasonable middle ground; the ZFS dataset default is 128K)
- Databases: 8K or 16K
- Media files: 1M (large sequential reads)
**Set record size** (on TrueNAS datasets):
```bash
ssh truenas 'zfs set recordsize=1M vault/movies'
```
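Note that `recordsize` only affects data written after the change; existing files keep their old block size until they are rewritten. To confirm the setting:
```bash
ssh truenas 'zfs get recordsize vault/movies'
```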
---
## Disaster Recovery
### Pool Recovery
**If a pool fails to import**:
```bash
# Import under a new name without mounting its datasets (-N)
zpool import -f -N poolname newpoolname
# Import read-only to inspect the data safely
zpool import -f -o readonly=on poolname
# Recovery-mode import, discarding the last few transactions (last resort)
zpool import -f -F poolname
```
### Drive Replacement
**When a drive fails**:
```bash
# Identify failed drive
zpool status poolname
# Replace drive
zpool replace poolname old-device new-device
# Monitor resilver
watch zpool status poolname
```
### Data Recovery
**If pool is completely lost**:
1. Restore from offsite backup (see [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md))
2. Recreate pool structure (see the sketch below)
3. Restore data
**Critical**: This is why we need offsite backups!
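For step 2, a rough sketch of recreating a mirror pool from the shell; the device IDs and the striped-mirror layout are placeholders, so match whatever topology the original pool used. On TrueNAS, recreate the pool through the UI instead so the middleware tracks it.
```bash
# Placeholder device IDs - substitute real /dev/disk/by-id/ paths and the intended layout
zpool create -o ashift=12 vault \
  mirror /dev/disk/by-id/scsi-DRIVE1 /dev/disk/by-id/scsi-DRIVE2 \
  mirror /dev/disk/by-id/scsi-DRIVE3 /dev/disk/by-id/scsi-DRIVE4
```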
---
## Quick Reference
### Common Commands
```bash
# Pool status
zpool status [poolname]
zpool list
# Dataset usage
zfs list
zfs list -r vault
# Check pool health (only unhealthy)
zpool status -x
# Scrub pool
zpool scrub poolname
# Get pool IO stats
zpool iostat -v 1
# Snapshot management
zfs snapshot poolname/dataset@snapname
zfs list -t snapshot
zfs rollback poolname/dataset@snapname
zfs destroy poolname/dataset@snapname
```
### Storage Locations by Use Case
| Use Case | Recommended Storage | Why |
|----------|---------------------|-----|
| VM OS disk | nvme-mirror1 (PVE) | Fastest IO |
| Database | nvme-mirror1/2 | Low latency |
| Media files | TrueNAS vault | Large capacity |
| Development | nvme-mirror2 | Fast, mid-tier |
| Containers | rpool | Good performance |
| Backups | TrueNAS or rpool | Large capacity |
| Archive | local-zfs2 (PVE2) | Cheap, can spin down |
---
## Investigation Needed
- [ ] Get complete TrueNAS dataset list
- [ ] Document NFS/SMB share configuration
- [ ] Inventory EMC enclosure drives (count, capacity, model)
- [ ] Document current pool usage percentages
- [ ] Set up monthly capacity reports
- [ ] Configure ZFS scrub schedules
- [ ] Set up storage health alerts
---
## Related Documentation
- [BACKUP-STRATEGY.md](BACKUP-STRATEGY.md) - Backup and snapshot strategy
- [EMC-ENCLOSURE.md](EMC-ENCLOSURE.md) - Storage enclosure maintenance
- [VMS.md](VMS.md) - VM storage assignments
- [NETWORK.md](NETWORK.md) - Storage network configuration
---
**Last Updated**: 2025-12-22