# Backup Strategy
## 🚨 Current Status: CRITICAL GAPS IDENTIFIED
This document outlines the backup strategy for the homelab infrastructure. **As of 2025-12-22, there are significant gaps in backup coverage that need to be addressed.**
## Executive Summary
### What We Have ✅
- **Syncthing**: File synchronization across 5+ devices
- **ZFS on TrueNAS**: Copy-on-write filesystem with snapshot capability (snapshot configuration not yet verified)
- **Proxmox**: Built-in backup capabilities (no backup jobs documented)
### What We DON'T Have 🚨
- ❌ No documented VM/CT backups
- ❌ No documented ZFS snapshot schedule
- ❌ No offsite backups
- ❌ No disaster recovery plan
- ❌ No tested restore procedures
- ❌ No configuration backups
**Risk Level**: HIGH - A catastrophic failure could result in significant data loss.
---
## Current State Analysis
### Syncthing (File Synchronization)
**What it is**: Real-time file sync across devices
**What it is NOT**: A backup solution
| Folder | Devices | Size | Protected? |
|--------|---------|------|------------|
| documents | Mac Mini, MacBook, TrueNAS, Windows PC, Phone | 11 GB | ⚠️ Sync only |
| downloads | Mac Mini, TrueNAS | 38 GB | ⚠️ Sync only |
| pictures | Mac Mini, MacBook, TrueNAS, Phone | Unknown | ⚠️ Sync only |
| notes | Mac Mini, MacBook, TrueNAS, Phone | Unknown | ⚠️ Sync only |
| config | Mac Mini, MacBook, TrueNAS | Unknown | ⚠️ Sync only |
**Limitations**:
- ❌ Accidental deletion → deleted everywhere
- ❌ Ransomware/corruption → spreads everywhere
- ❌ No point-in-time recovery
- ❌ No version history (unless file versioning enabled - not documented)
**Verdict**: Syncthing provides redundancy and availability, NOT backup protection.
### ZFS on TrueNAS (Potential Backup Target)
**Current Status**: ❓ Unknown - snapshots may or may not be configured
**Needs Investigation**:
```bash
# Check if snapshots exist
ssh truenas 'zfs list -t snapshot'
# Check if periodic snapshot tasks are configured
# (TrueNAS keeps these in its middleware, not in /etc/cron.d)
ssh truenas 'midclt call pool.snapshottask.query'
# Or check Periodic Snapshot Tasks in the TrueNAS UI
```
**If configured**, ZFS snapshots provide:
- ✅ Point-in-time recovery
- ✅ Protection against accidental deletion
- ✅ Fast rollback capability
- ⚠️ Still single location (no offsite protection)
### Proxmox VM/CT Backups
**Current Status**: ❓ Unknown - no backup jobs documented
**Needs Investigation**:
```bash
# Check backup configuration
ssh pve 'pvesh get /cluster/backup'
# Check if any backups exist
ssh pve 'ls -lh /var/lib/vz/dump/'
ssh pve2 'ls -lh /var/lib/vz/dump/'
```
**Critical VMs Needing Backup**:
| VM/CT | VMID | Priority | Notes |
|-------|------|----------|-------|
| TrueNAS | 100 | 🔴 CRITICAL | All storage lives here |
| Saltbox | 101 | 🟡 HIGH | Media stack, complex config |
| homeassistant | 110 | 🟡 HIGH | Home automation config |
| gitea-vm | 300 | 🟡 HIGH | Git repositories |
| trading-vm | 301 | 🟡 HIGH | AI trading platform |
| pihole | 200 | 🟢 MEDIUM | DNS config (easy to rebuild) |
| traefik | 202 | 🟢 MEDIUM | Reverse proxy config |
| lmdev1 | 111 | 🟢 LOW | Development (ephemeral) |
---
## Recommended Backup Strategy
### Tier 1: Local Snapshots (IMPLEMENT IMMEDIATELY)
**ZFS Snapshots on TrueNAS**
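Schedule automatic snapshots for all datasets. The table shows the tiers for `vault/documents`; apply the same tiers to the other critical datasets: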
| Dataset | Frequency | Retention |
|---------|-----------|-----------|
| vault/documents | Every 15 min | 1 hour |
| vault/documents | Hourly | 24 hours |
| vault/documents | Daily | 30 days |
| vault/documents | Weekly | 12 weeks |
| vault/documents | Monthly | 12 months |
**Implementation**:
```bash
# Recurring schedule: configure Periodic Snapshot Tasks in the TrueNAS UI
# One-off manual snapshot via UI (Storage → Snapshots → Add) or via CLI:
ssh truenas 'zfs snapshot vault/documents@daily-$(date +%Y%m%d)'
```
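The retention column implies that older snapshots get destroyed. TrueNAS periodic snapshot tasks normally handle this themselves. For manually named snapshots such as `daily-YYYYMMDD`, the selection step could be sketched as follows. The function name `snapshots_to_prune` is hypothetical, and it only prints candidates; it never destroys anything.

```shell
# Hypothetical helper: print the snapshots that fall outside the retention window.
# Reads snapshot names on stdin (one per line), e.g. from:
#   zfs list -H -r -t snapshot -o name vault/documents
# $1 = name prefix after the "@" (e.g. "daily-"), $2 = how many to keep.
snapshots_to_prune() {
    prefix=$1
    keep=$2
    # YYYYMMDD suffixes sort chronologically as plain text:
    # sort newest first, then skip the first $keep lines.
    grep "@${prefix}" | sort -r | tail -n "+$((keep + 1))"
}
```

Example: `ssh truenas 'zfs list -H -r -t snapshot -o name vault/documents' | snapshots_to_prune daily- 30` lists the daily snapshots beyond the 30 newest. Review the list before passing any name to `zfs destroy`.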
**Proxmox VM Backups**
Configure weekly backups to local storage:
```bash
# Create backup job via Proxmox UI:
# Datacenter → Backup → Add
# - Schedule: Weekly (Sunday 2 AM)
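# - Storage: any storage with the "VZDump backup file" content type enabled
#   (e.g. local; ZFS-pool storages such as local-zfs typically accept only disk images and containers)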
# - Mode: Snapshot (fast)
# - Retention: 4 backups
```
**Or via CLI**:
```bash
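# --storage must accept backups ("VZDump backup file" content type); adjust the name to your setup
ssh pve 'pvesh create /cluster/backup --schedule "sun 02:00" --storage local --mode snapshot --vmid 100,101,110,300 --prune-backups keep-last=4'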
```
### Tier 2: Offsite Backups (CRITICAL GAP)
**Option A: Cloud Storage (Recommended)**
Use **rclone** or **restic** to sync critical data to cloud:
| Provider | Cost | Pros | Cons |
|----------|------|------|------|
| Backblaze B2 | $6/TB/mo | Cheap, reliable | Egress fees |
| AWS S3 Glacier | $4/TB/mo | Very cheap storage | Slow retrieval |
| Wasabi | $6.99/TB/mo | No egress fees | Minimum 90-day retention |
**Implementation Example (Backblaze B2)**:
```bash
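# rclone normally ships with TrueNAS (it backs Cloud Sync Tasks); avoid `pkg install`
# on the appliance, since packages added to the base system are unsupported.
ssh truenas 'rclone version'
# Configure the B2 remote on TrueNAS (follow the prompts)
ssh -t truenas 'rclone config'
# Daily backup of critical folders (crontab entry, runs at 03:00).
# `sync` mirrors deletions to the bucket; --backup-dir keeps replaced and deleted files.
0 3 * * * rclone sync /mnt/vault/documents b2:homelab-backup/documents --transfers 4 --backup-dir b2:homelab-backup/archive/documents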
```
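Since restic is named as an alternative, here is a minimal sketch of a restic run against the same `homelab-backup` bucket. Unlike a plain sync, restic keeps versioned snapshots. The repository path `restic`, the retention counts, and the helper names `run` and `offsite_backup` are assumptions for illustration; the environment variables are the ones restic reads for B2.

```shell
# Sketch of a restic-based offsite run; DRY_RUN=1 prints the commands instead of running them.
# Assumed to be set in the environment beforehand (values are yours, not from this document):
#   B2_ACCOUNT_ID, B2_ACCOUNT_KEY  - Backblaze application key credentials
#   RESTIC_PASSWORD_FILE           - file holding the repository password
# Bucket name is taken from the rclone example; the "restic" path is an assumption.
RESTIC_REPOSITORY="${RESTIC_REPOSITORY:-b2:homelab-backup:restic}"
export RESTIC_REPOSITORY

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Back up the given paths, then apply retention.
# One-time setup (not run here): restic init
offsite_backup() {
    run restic backup "$@" || return 1
    # Keep 7 daily, 4 weekly, 12 monthly snapshots; --prune deletes unreferenced data.
    run restic forget --keep-daily 7 --keep-weekly 4 --keep-monthly 12 --prune
}
```

Running `DRY_RUN=1 offsite_backup /mnt/vault/documents` first shows exactly what would be executed before any data leaves the site.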
**Option B: Offsite TrueNAS Replication**
- Set up second TrueNAS at friend/family member's house
- Use ZFS replication to sync snapshots
- Requires: Static IP or Tailscale, trust
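
TrueNAS Replication Tasks would normally drive Option B from the UI. Underneath, it comes down to an incremental `zfs send` piped into `zfs receive` on the remote side. The sketch below only prints that pipeline; the helper name `replication_command`, the host `offsite-nas`, and the target dataset `backup/documents` are hypothetical.

```shell
# Hypothetical helper: print (not run) an incremental ZFS replication pipeline.
# $1 = source dataset, $2 = previous snapshot, $3 = new snapshot,
# $4 = remote SSH host, $5 = remote target dataset
replication_command() {
    # -i sends only the changes between the two snapshots;
    # receive -u leaves the received dataset unmounted on the target.
    printf "zfs send -i %s@%s %s@%s | ssh %s zfs receive -u %s\n" \
        "$1" "$2" "$1" "$3" "$4" "$5"
}
```

For example, `replication_command vault/documents daily-20251221 daily-20251222 offsite-nas backup/documents` prints the command that would ship one day of changes. The very first transfer has no previous snapshot and needs a full `zfs send` without `-i`.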
**Option C: USB Drive Rotation**
- Weekly backup to external USB drive
- Rotate 2-3 drives (one always offsite)
- Manual but simple
### Tier 3: Configuration Backups
**Proxmox Configuration**
```bash
# Backup /etc/pve (configs are already in cluster filesystem)
# But also backup to external location:
ssh pve 'tar czf /tmp/pve-config-$(date +%Y%m%d).tar.gz /etc/pve /etc/network/interfaces /etc/systemd/system/*.service'
# Copy to safe location
scp pve:/tmp/pve-config-*.tar.gz ~/Backups/proxmox/
```
**VM-Specific Configs**
- Traefik configs: `/etc/traefik/` on CT 202
- Saltbox configs: `/srv/git/saltbox/` on VM 101
- Home Assistant: `/config/` on VM 110
**Script to backup all configs**:
```bash
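#!/bin/bash
# Save as ~/bin/backup-homelab-configs.sh
set -euo pipefail
DATE=$(date +%Y%m%d)
BACKUP_DIR="$HOME/Backups/homelab-configs/$DATE"
mkdir -p "$BACKUP_DIR"
# Proxmox configs (tar runs remotely; the archive streams back over SSH)
ssh pve 'tar czf - /etc/pve /etc/network' > "$BACKUP_DIR/pve-config.tar.gz"
ssh pve2 'tar czf - /etc/pve /etc/network' > "$BACKUP_DIR/pve2-config.tar.gz"
# Traefik
ssh pve 'pct exec 202 -- tar czf - /etc/traefik' > "$BACKUP_DIR/traefik-config.tar.gz"
# Saltbox
ssh saltbox 'tar czf - /srv/git/saltbox' > "$BACKUP_DIR/saltbox-config.tar.gz"
# Home Assistant
# `qm guest exec` returns JSON (exit code, out-data) rather than a raw byte stream,
# so its output cannot be redirected into a tarball. This line assumes SSH access
# to VM 110 under a host alias `homeassistant` (hypothetical, e.g. via an SSH add-on).
ssh homeassistant 'tar czf - /config' > "$BACKUP_DIR/homeassistant-config.tar.gz"
echo "Configs backed up to $BACKUP_DIR"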
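A streamed archive can end up empty or truncated if the remote command fails partway, so it is worth checking the results after each run. The following is a minimal sketch of such a check; the function name `verify_archives` is hypothetical.

```shell
# Hypothetical check: every *.tar.gz in the given directory must be
# non-empty and listable by tar. Prints one line per archive and
# returns non-zero if any archive is bad or none are found.
verify_archives() {
    dir=$1
    status=0
    found=0
    for archive in "$dir"/*.tar.gz; do
        # Skip the literal pattern when the glob matches nothing.
        [ -e "$archive" ] || continue
        found=1
        if [ -s "$archive" ] && tar tzf "$archive" >/dev/null 2>&1; then
            echo "OK   $archive"
        else
            echo "BAD  $archive"
            status=1
        fi
    done
    [ "$found" -eq 1 ] || status=1
    return "$status"
}
```

Calling `verify_archives ~/Backups/homelab-configs/$(date +%Y%m%d)` after the script gives a quick pass/fail signal. It confirms the archives are readable, not that their contents are complete.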
---
## Disaster Recovery Scenarios
### Scenario 1: Single VM Failure
**Impact**: Medium
**Recovery Time**: 30-60 minutes
1. Restore from Proxmox backup:
```bash
ssh pve 'qmrestore /path/to/backup.vma.zst VMID'  # VMs; for containers: pct restore VMID /path/to/backup.tar.zst
```
2. Start VM and verify
3. Update IP if needed
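
Step 1 needs the path of the most recent backup. vzdump names its archives with a timestamp (for example `vzdump-qemu-101-2025_12_21-02_00_03.vma.zst`), so the newest one can be picked by sorting file names. The sketch below assumes zstd-compressed backups in a directory storage; the helper name `latest_backup` is hypothetical.

```shell
# Hypothetical helper: print the newest vzdump archive for a VMID in a dump directory.
# Relies on vzdump's timestamped names (vzdump-<type>-<vmid>-YYYY_MM_DD-HH_MM_SS.*.zst),
# which sort chronologically as plain text. Prints nothing if no archive matches.
# $1 = dump directory, $2 = VMID
latest_backup() {
    dump_dir=$1
    vmid=$2
    ls "$dump_dir"/vzdump-*-"$vmid"-*.zst 2>/dev/null | sort | tail -n 1
}
```

Run on the Proxmox host, `latest_backup /var/lib/vz/dump 101` prints the archive to hand to `qmrestore`. For a restore test, restoring to an unused VMID avoids touching the running VM.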
### Scenario 2: TrueNAS Failure
**Impact**: CATASTROPHIC (all storage lost)
**Recovery Time**: Unknown - NO PLAN
**Current State**: 🚨 NO RECOVERY PLAN
**Needed**:
- Offsite backup of critical datasets
- Documented ZFS pool creation steps
- Share configuration export
### Scenario 3: Complete PVE Server Failure
**Impact**: SEVERE
**Recovery Time**: 4-8 hours
**Current State**: ⚠️ PARTIALLY RECOVERABLE
**Needed**:
- VM backups stored on TrueNAS or PVE2
- Proxmox reinstall procedure
- Network config documentation
### Scenario 4: Complete Site Disaster (Fire/Flood)
**Impact**: TOTAL LOSS
**Recovery Time**: Unknown
**Current State**: 🚨 NO RECOVERY PLAN
**Needed**:
- Offsite backups (cloud or physical)
- Critical data prioritization
- Restore procedures
---
## Action Plan
### Immediate (Next 7 Days)
- [ ] **Audit existing backups**: Check if ZFS snapshots or Proxmox backups exist
```bash
ssh truenas 'zfs list -t snapshot'
ssh pve 'ls -lh /var/lib/vz/dump/'
```
- [ ] **Enable ZFS snapshots**: Configure via TrueNAS UI for critical datasets
- [ ] **Configure Proxmox backup jobs**: Weekly backups of critical VMs (100, 101, 110, 300)
- [ ] **Test restore**: Pick one VM, back it up, restore it to verify process works
### Short-term (Next 30 Days)
- [ ] **Set up offsite backup**: Choose provider (Backblaze B2 recommended)
- [ ] **Install backup tools**: rclone or restic on TrueNAS
- [ ] **Configure daily cloud sync**: Critical folders to cloud storage
- [ ] **Document restore procedures**: Step-by-step guides for each scenario
### Long-term (Next 90 Days)
- [ ] **Implement monitoring**: Alerts for backup failures
- [ ] **Quarterly restore test**: Verify backups actually work
- [ ] **Backup rotation policy**: Automate old backup cleanup
- [ ] **Configuration backup automation**: Weekly cron job
---
## Monitoring & Validation
### Backup Health Checks
```bash
# Check last ZFS snapshot
ssh truenas 'zfs list -t snapshot -o name,creation -s creation | tail -5'
# Check Proxmox backup status
ssh pve 'pvesh get /cluster/backup-info/not-backed-up'
# Check cloud sync status (if using rclone): compare local files against the bucket
ssh truenas 'rclone check /mnt/vault/documents b2:homelab-backup/documents --one-way'
```
### Alerts to Set Up
- Email alert if no snapshot created in 24 hours
- Email alert if Proxmox backup fails
- Email alert if cloud sync fails
- Weekly backup status report
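
The first alert (no snapshot in 24 hours) can be sketched as a small freshness check over `zfs list` output. With `-p`, `zfs list` prints creation times as epoch seconds, which makes the comparison simple. The function name `snapshot_is_fresh` is hypothetical, and how the alert gets delivered (email or otherwise) is left open.

```shell
# Hypothetical check: succeed only if the newest snapshot is younger than the limit.
# Reads "name<TAB>creation-epoch" lines on stdin, as printed by:
#   zfs list -H -p -t snapshot -o name,creation
# $1 = max age in hours, $2 = current time as epoch seconds (defaults to now).
snapshot_is_fresh() {
    max_age_hours=$1
    now=${2:-$(date +%s)}
    # Take the largest creation timestamp; no input counts as "no snapshot at all".
    newest=$(cut -f 2 | sort -n | tail -n 1)
    newest=${newest:-0}
    [ $((now - newest)) -le $((max_age_hours * 3600)) ]
}
```

A cron job could then run `ssh truenas 'zfs list -H -p -t snapshot -o name,creation' | snapshot_is_fresh 24 || echo "ALERT: no ZFS snapshot in the last 24 hours"` and forward the message through whatever notification channel is in place.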
---
## Cost Estimate
**Monthly Backup Costs**:
| Component | Cost | Notes |
|-----------|------|-------|
| Local storage (already owned) | $0 | Using existing TrueNAS |
| Proxmox backups (local) | $0 | Using existing storage |
| Cloud backup (1 TB) | $6-10/mo | Backblaze B2 or Wasabi |
| **Total** | **~$10/mo** | Minimal cost for peace of mind |
**One-time Costs**:
| Component | Cost | Notes |
|-----------|------|-------|
| External USB drives (3x 4TB) | ~$300 | Optional, for rotation backup |
---
## Related Documentation
- [STORAGE.md](STORAGE.md) - ZFS pool layouts and capacity
- [VMS.md](VMS.md) - VM inventory and prioritization
- [DISASTER-RECOVERY.md](#) - Recovery procedures (coming soon)
---
**Last Updated**: 2025-12-22
**Status**: 🚨 CRITICAL GAPS - IMMEDIATE ACTION REQUIRED