# Backup Strategy

## 🚨 Current Status: CRITICAL GAPS IDENTIFIED

This document outlines the backup strategy for the homelab infrastructure. **As of 2025-12-22, there are significant gaps in backup coverage that need to be addressed.**

## Executive Summary

### What We Have ✅

- **Syncthing**: File synchronization across 5+ devices
- **ZFS on TrueNAS**: Copy-on-write filesystem with snapshot capability (snapshots not yet configured)
- **Proxmox**: Built-in backup capabilities (not yet configured)

### What We DON'T Have 🚨

- ❌ No documented VM/CT backups
- ❌ No ZFS snapshot schedule
- ❌ No offsite backups
- ❌ No disaster recovery plan
- ❌ No tested restore procedures
- ❌ No configuration backups

**Risk Level**: HIGH - A catastrophic failure could result in significant data loss.

---

## Current State Analysis

### Syncthing (File Synchronization)

**What it is**: Real-time file sync across devices
**What it is NOT**: A backup solution

| Folder | Devices | Size | Protected? |
|--------|---------|------|------------|
| documents | Mac Mini, MacBook, TrueNAS, Windows PC, Phone | 11 GB | ⚠️ Sync only |
| downloads | Mac Mini, TrueNAS | 38 GB | ⚠️ Sync only |
| pictures | Mac Mini, MacBook, TrueNAS, Phone | Unknown | ⚠️ Sync only |
| notes | Mac Mini, MacBook, TrueNAS, Phone | Unknown | ⚠️ Sync only |
| config | Mac Mini, MacBook, TrueNAS | Unknown | ⚠️ Sync only |

**Limitations**:
- ❌ Accidental deletion → deleted everywhere
- ❌ Ransomware/corruption → spreads everywhere
- ❌ No point-in-time recovery
- ❌ No version history (unless file versioning enabled - not documented)

**Verdict**: Syncthing provides redundancy and availability, NOT backup protection.
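
If file versioning does turn out to be enabled, it softens the accidental-deletion risk somewhat. A quick check, assuming an SSH alias for the Mac Mini and the default macOS config path (both assumptions - Linux devices use `~/.config/syncthing/` or `~/.local/state/syncthing/`, and versioning is also visible per folder in the Syncthing web UI):

```bash
# Look for a versioning type on any folder; no match = versioning not configured
ssh mac-mini 'grep -B1 -A3 "versioning type=" "$HOME/Library/Application Support/Syncthing/config.xml"' \
  || echo "No file versioning configured"
```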

### ZFS on TrueNAS (Potential Backup Target)

**Current Status**: ❓ Unknown - snapshots may or may not be configured

**Needs Investigation**:
```bash
# Check if snapshots exist
ssh truenas 'zfs list -t snapshot'

# Check if automated snapshots are configured
ssh truenas 'cat /etc/cron.d/zfs-auto-snapshot' || echo "Not configured"

# Check snapshot schedule via TrueNAS API/UI
```

**If configured**, ZFS snapshots provide:
- ✅ Point-in-time recovery
- ✅ Protection against accidental deletion
- ✅ Fast rollback capability
- ⚠️ Still single location (no offsite protection)

### Proxmox VM/CT Backups

**Current Status**: ❓ Unknown - no backup jobs documented

**Needs Investigation**:
```bash
# Check backup configuration
ssh pve 'pvesh get /cluster/backup'

# Check if any backups exist
ssh pve 'ls -lh /var/lib/vz/dump/'
ssh pve2 'ls -lh /var/lib/vz/dump/'
```

**Critical VMs Needing Backup**:

| VM/CT | VMID | Priority | Notes |
|-------|------|----------|-------|
| TrueNAS | 100 | 🔴 CRITICAL | All storage lives here |
| Saltbox | 101 | 🟡 HIGH | Media stack, complex config |
| homeassistant | 110 | 🟡 HIGH | Home automation config |
| gitea-vm | 300 | 🟡 HIGH | Git repositories |
| pihole | 200 | 🟢 MEDIUM | DNS config (easy to rebuild) |
| traefik | 202 | 🟢 MEDIUM | Reverse proxy config |
| trading-vm | 301 | 🟡 HIGH | AI trading platform |
| lmdev1 | 111 | 🟢 LOW | Development (ephemeral) |

---

## Recommended Backup Strategy

### Tier 1: Local Snapshots (IMPLEMENT IMMEDIATELY)

**ZFS Snapshots on TrueNAS**

Schedule automatic snapshots for all datasets (example schedule shown for `vault/documents`):

| Dataset | Frequency | Retention |
|---------|-----------|-----------|
| vault/documents | Every 15 min | 1 hour |
| vault/documents | Hourly | 24 hours |
| vault/documents | Daily | 30 days |
| vault/documents | Weekly | 12 weeks |
| vault/documents | Monthly | 12 months |

**Implementation**:
```bash
# Via TrueNAS UI: Storage → Snapshots → Add
# Or via CLI:
ssh truenas 'zfs snapshot vault/documents@daily-$(date +%Y%m%d)'
```
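
If the TrueNAS UI scheduler isn't used, a cron job on TrueNAS could cover the daily tier. A minimal sketch, assuming TrueNAS SCALE (GNU userland) and the example dataset from the table above - adjust `DATASET` and `KEEP` to the real pools:

```bash
#!/bin/sh
# Daily snapshot with simple retention (sketch; run on TrueNAS via a cron job)
DATASET="vault/documents"
KEEP=30   # number of daily snapshots to retain

zfs snapshot "${DATASET}@daily-$(date +%Y%m%d)"

# Destroy daily snapshots older than the newest $KEEP (assumes GNU head/xargs)
zfs list -H -t snapshot -o name -s creation \
  | grep "^${DATASET}@daily-" \
  | head -n -"$KEEP" \
  | xargs -r -n1 zfs destroy
```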

**Proxmox VM Backups**

Configure weekly backups to local storage:

```bash
# Create backup job via Proxmox UI:
# Datacenter → Backup → Add
# - Schedule: Weekly (Sunday 2 AM)
# - Storage: local-zfs or nvme-mirror1
# - Mode: Snapshot (fast)
# - Retention: 4 backups
```

**Or via CLI**:
```bash
ssh pve 'pvesh create /cluster/backup --schedule "sun 02:00" --storage local-zfs --mode snapshot --prune-backups keep-last=4'
```
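
Before committing to a schedule, a one-off manual backup of a single VM is a cheap way to confirm the mode and compression settings work. A sketch using vzdump's default dump directory (note: if you point `--storage` at something else, that storage must allow the "backup" content type):

```bash
# One-off backup of the TrueNAS VM (VMID 100) to the default dump directory
ssh pve 'vzdump 100 --mode snapshot --compress zstd'

# Confirm the archive exists
ssh pve 'ls -lh /var/lib/vz/dump/ | tail -n 3'
```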

### Tier 2: Offsite Backups (CRITICAL GAP)

**Option A: Cloud Storage (Recommended)**

Use **rclone** or **restic** to sync critical data to the cloud:

| Provider | Cost | Pros | Cons |
|----------|------|------|------|
| Backblaze B2 | $6/TB/mo | Cheap, reliable | Egress fees |
| AWS S3 Glacier | $4/TB/mo | Very cheap storage | Slow retrieval |
| Wasabi | $6.99/TB/mo | No egress fees | Minimum 90-day retention |

**Implementation Example (Backblaze B2)**:
```bash
# Install on TrueNAS
ssh truenas 'pkg install rclone restic'

# Configure B2
rclone config  # Follow prompts for B2

# Daily backup of critical folders (cron entry)
0 3 * * * rclone sync /mnt/vault/documents b2:homelab-backup/documents --transfers 4
```
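
restic is mentioned above as an alternative to rclone; unlike a plain sync it gives encrypted, deduplicated, point-in-time snapshots in the bucket. A minimal sketch, assuming a B2 bucket named `homelab-backup` and placeholder credentials:

```bash
# Credentials for restic's B2 backend (values are placeholders)
export B2_ACCOUNT_ID="your-key-id"
export B2_ACCOUNT_KEY="your-application-key"
export RESTIC_REPOSITORY="b2:homelab-backup:restic"
export RESTIC_PASSWORD="a-long-repo-passphrase"

# One-time repository initialisation
restic init

# Daily backup plus retention (could replace the rclone cron line above)
restic backup /mnt/vault/documents
restic forget --keep-daily 30 --keep-weekly 12 --keep-monthly 12 --prune
```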

**Option B: Offsite TrueNAS Replication**

- Set up a second TrueNAS at a friend's or family member's house
- Use ZFS replication to sync snapshots (see the sketch below)
- Requires: static IP or Tailscale, and trust
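
The replication itself can be as simple as piping `zfs send` over SSH (or a Tailscale address). A sketch - the host alias `offsite-nas`, the `backup` pool, and the snapshot names are all assumptions:

```bash
# Initial full replication of a snapshot to the offsite box
zfs send -R vault/documents@weekly-20251222 | ssh offsite-nas zfs recv -F backup/documents

# Subsequent runs send only the increment between two snapshots
zfs send -R -i vault/documents@weekly-20251222 vault/documents@weekly-20251229 \
  | ssh offsite-nas zfs recv -F backup/documents
```

TrueNAS can also schedule this natively through its replication tasks UI instead of hand-rolled cron jobs.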

**Option C: USB Drive Rotation**

- Weekly backup to an external USB drive (see the example below)
- Rotate 2-3 drives (one always offsite)
- Manual but simple
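
For the rotation option, a plain rsync onto the mounted drive is enough. The mount point and source paths below are assumptions:

```bash
# Assumes the USB drive is mounted at /mnt/usb-backup on the machine holding the data
rsync -aHv --delete /mnt/vault/documents/ /mnt/usb-backup/documents/
rsync -aHv --delete /mnt/vault/pictures/ /mnt/usb-backup/pictures/

# Flush writes before unplugging
sync && umount /mnt/usb-backup
```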

### Tier 3: Configuration Backups

**Proxmox Configuration**

```bash
# Backup /etc/pve (configs are already in the cluster filesystem)
# But also back up to an external location:
ssh pve 'tar czf /tmp/pve-config-$(date +%Y%m%d).tar.gz /etc/pve /etc/network/interfaces /etc/systemd/system/*.service'

# Copy to a safe location
scp pve:/tmp/pve-config-*.tar.gz ~/Backups/proxmox/
```

**VM-Specific Configs**

- Traefik configs: `/etc/traefik/` on CT 202
- Saltbox configs: `/srv/git/saltbox/` on VM 101
- Home Assistant: `/config/` on VM 110

**Script to back up all configs**:
```bash
#!/bin/bash
# Save as ~/bin/backup-homelab-configs.sh

DATE=$(date +%Y%m%d)
BACKUP_DIR=~/Backups/homelab-configs/$DATE

mkdir -p "$BACKUP_DIR"

# Proxmox configs (tar runs on the remote host, output streams back over ssh)
ssh pve 'tar czf - /etc/pve /etc/network' > "$BACKUP_DIR/pve-config.tar.gz"
ssh pve2 'tar czf - /etc/pve /etc/network' > "$BACKUP_DIR/pve2-config.tar.gz"

# Traefik (CT 202, via pct exec on the PVE host)
ssh pve 'pct exec 202 -- tar czf - /etc/traefik' > "$BACKUP_DIR/traefik-config.tar.gz"

# Saltbox
ssh saltbox 'tar czf - /srv/git/saltbox' > "$BACKUP_DIR/saltbox-config.tar.gz"

# Home Assistant (VM 110)
# Note: `qm guest exec` wraps command output in JSON, so it cannot stream a tarball;
# back up /config over SSH to the VM instead (assumes an SSH alias/route to the VM exists)
ssh homeassistant 'tar czf - /config' > "$BACKUP_DIR/homeassistant-config.tar.gz"

echo "Configs backed up to $BACKUP_DIR"
```
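
To automate this (which also covers the "Configuration backup automation" item in the action plan below), make the script executable and add a weekly cron entry on the machine that has SSH access to the hosts - schedule and log path here are just examples:

```bash
chmod +x ~/bin/backup-homelab-configs.sh

# crontab -e: run every Sunday at 04:00 and keep a log
0 4 * * 0 $HOME/bin/backup-homelab-configs.sh >> $HOME/Backups/homelab-configs/backup.log 2>&1
```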

---

## Disaster Recovery Scenarios

### Scenario 1: Single VM Failure

**Impact**: Medium
**Recovery Time**: 30-60 minutes

1. Restore from Proxmox backup:
   ```bash
   ssh pve 'qmrestore /path/to/backup.vma.zst VMID'
   ```
2. Start VM and verify
3. Update IP if needed
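
This can be rehearsed without touching the original guest by restoring to an unused VMID. A sketch - the archive name and spare VMID 999 are hypothetical:

```bash
# List available archives, then restore one to a spare VMID for a test boot
ssh pve 'ls -lh /var/lib/vz/dump/'
ssh pve 'qmrestore /var/lib/vz/dump/vzdump-qemu-100-2025_12_22-02_00_00.vma.zst 999 --storage local-zfs'
ssh pve 'qm start 999'   # verify it boots, then: qm stop 999 && qm destroy 999
```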

### Scenario 2: TrueNAS Failure

**Impact**: CATASTROPHIC (all storage lost)
**Recovery Time**: Unknown - NO PLAN

**Current State**: 🚨 NO RECOVERY PLAN
**Needed**:
- Offsite backup of critical datasets
- Documented ZFS pool creation steps
- Share configuration export (see the note below)
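
The share and service configuration is covered by the TrueNAS config export (the "Save/Download Configuration" option in the UI). A hedged CLI equivalent - the database path below is an assumption and should be verified against the installed TrueNAS version:

```bash
# Pull a copy of the TrueNAS configuration database (path assumed; prefer the UI export if unsure)
scp truenas:/data/freenas-v1.db ~/Backups/truenas/truenas-config-$(date +%Y%m%d).db
```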

### Scenario 3: Complete PVE Server Failure

**Impact**: SEVERE
**Recovery Time**: 4-8 hours

**Current State**: ⚠️ PARTIALLY RECOVERABLE
**Needed**:
- VM backups stored on TrueNAS or PVE2
- Proxmox reinstall procedure
- Network config documentation

### Scenario 4: Complete Site Disaster (Fire/Flood)

**Impact**: TOTAL LOSS
**Recovery Time**: Unknown

**Current State**: 🚨 NO RECOVERY PLAN
**Needed**:
- Offsite backups (cloud or physical)
- Critical data prioritization
- Restore procedures

---

## Action Plan

### Immediate (Next 7 Days)

- [ ] **Audit existing backups**: Check if ZFS snapshots or Proxmox backups exist
  ```bash
  ssh truenas 'zfs list -t snapshot'
  ssh pve 'ls -lh /var/lib/vz/dump/'
  ```
- [ ] **Enable ZFS snapshots**: Configure via TrueNAS UI for critical datasets
- [ ] **Configure Proxmox backup jobs**: Weekly backups of critical VMs (100, 101, 110, 300)
- [ ] **Test restore**: Pick one VM, back it up, restore it to verify the process works

### Short-term (Next 30 Days)

- [ ] **Set up offsite backup**: Choose a provider (Backblaze B2 recommended)
- [ ] **Install backup tools**: rclone or restic on TrueNAS
- [ ] **Configure daily cloud sync**: Critical folders to cloud storage
- [ ] **Document restore procedures**: Step-by-step guides for each scenario

### Long-term (Next 90 Days)

- [ ] **Implement monitoring**: Alerts for backup failures
- [ ] **Quarterly restore test**: Verify backups actually work
- [ ] **Backup rotation policy**: Automate old backup cleanup
- [ ] **Configuration backup automation**: Weekly cron job

---

## Monitoring & Validation

### Backup Health Checks

```bash
# Check the most recent ZFS snapshots
ssh truenas 'zfs list -t snapshot -o name,creation -s creation | tail -5'

# Check Proxmox backup coverage (guests not included in any backup job)
ssh pve 'pvesh get /cluster/backup-info/not-backed-up'

# Check cloud sync status (if using rclone)
ssh truenas 'rclone ls b2:homelab-backup | wc -l'
```

### Alerts to Set Up

- Email alert if no snapshot created in 24 hours (see the sketch below)
- Email alert if Proxmox backup fails
- Email alert if cloud sync fails
- Weekly backup status report
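
For the first alert, a small cron-driven check is enough. A sketch - the dataset, host alias, and recipient address are assumptions, and it requires working outbound mail on the host that runs it:

```bash
#!/bin/sh
# Alert when the newest snapshot of vault/documents is older than 24 hours
LATEST_SNAP=$(ssh truenas "zfs list -H -t snapshot -o name -s creation | grep '^vault/documents@' | tail -1")

if [ -z "$LATEST_SNAP" ]; then
  AGE=$((48 * 3600))   # no snapshot at all counts as stale
else
  LATEST_TS=$(ssh truenas "zfs get -H -p -o value creation $LATEST_SNAP")
  AGE=$(( $(date +%s) - LATEST_TS ))
fi

if [ "$AGE" -gt 86400 ]; then
  echo "Newest snapshot of vault/documents is ${AGE}s old" \
    | mail -s "Backup alert: stale ZFS snapshots" admin@example.com
fi
```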

---

## Cost Estimate

**Monthly Backup Costs**:

| Component | Cost | Notes |
|-----------|------|-------|
| Local storage (already owned) | $0 | Using existing TrueNAS |
| Proxmox backups (local) | $0 | Using existing storage |
| Cloud backup (1 TB) | $6-10/mo | Backblaze B2 or Wasabi |
| **Total** | **~$10/mo** | Minimal cost for peace of mind |

**One-time**:

| Component | Cost | Notes |
|-----------|------|-------|
| External USB drives (3x 4TB) | ~$300 | Optional, for rotation backup |

---

## Related Documentation

- [STORAGE.md](STORAGE.md) - ZFS pool layouts and capacity
- [VMS.md](VMS.md) - VM inventory and prioritization
- [DISASTER-RECOVERY.md](#) - Recovery procedures (coming soon)

---

**Last Updated**: 2025-12-22
**Status**: 🚨 CRITICAL GAPS - IMMEDIATE ACTION REQUIRED