Auto-sync: 20260105-122831

Hutson
2026-01-05 12:28:33 -05:00
parent 56b82df497
commit eddd98c57f
17 changed files with 1770 additions and 27 deletions

GATEWAY.md Normal file

@@ -0,0 +1,339 @@
# UniFi Gateway (UCG-Fiber)
Documentation for the UniFi Cloud Gateway Fiber (10.10.10.1) - the primary network gateway and router.
## Overview
| Property | Value |
|----------|-------|
| **Device** | UniFi Cloud Gateway Fiber (UCG-Fiber) |
| **IP Address** | 10.10.10.1 |
| **SSH User** | root |
| **SSH Auth** | SSH key (`~/.ssh/id_ed25519`) |
| **Host Aliases** | `ucg-fiber`, `gateway` |
| **Firmware** | v4.4.9 (as of 2026-01-02) |
| **UniFi Core** | 4.4.19 |
| **RAM** | 2.9 GB (shared with UniFi apps) |
---
## SSH Access
SSH key authentication is configured. Use host aliases:
```bash
# Quick access
ssh ucg-fiber 'hostname'
ssh gateway 'free -m'
# Or use IP directly
ssh root@10.10.10.1 'uptime'
```
**Note**: The SSH key may need to be re-deployed after a firmware update if UniFi OS clears `authorized_keys`.
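
The host aliases could be defined with a stanza in `~/.ssh/config`. The helper below is a minimal sketch that assumes the values from the Overview table; the function name `add_gateway_alias` and the `SSH_CONFIG` variable are illustrative, and the actual configuration described in SSH-ACCESS.md may differ.

```bash
# Hypothetical helper: append a Host stanza for the gateway if it is missing.
# Values are taken from the Overview table; adjust to match SSH-ACCESS.md.
SSH_CONFIG="${SSH_CONFIG:-$HOME/.ssh/config}"

add_gateway_alias() {
  # Do nothing if the stanza is already present (keeps the helper idempotent).
  grep -q '^Host ucg-fiber gateway$' "$SSH_CONFIG" 2>/dev/null && return 0
  cat >> "$SSH_CONFIG" <<'EOF'

Host ucg-fiber gateway
    HostName 10.10.10.1
    User root
    IdentityFile ~/.ssh/id_ed25519
EOF
}
```

After running `add_gateway_alias`, `ssh ucg-fiber` and `ssh gateway` should both resolve to `root@10.10.10.1` with the listed key.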
---
## Monitoring Services
Two custom services run on the gateway: one reboots it after a prolonged internet outage, and the other records memory usage for later diagnosis.
### Internet Watchdog Service
**Purpose**: Auto-reboots gateway if internet connectivity is lost for 5+ minutes
**Location**: `/data/scripts/internet-watchdog.sh`
**How it works**:
1. Pings 1.1.1.1, 8.8.8.8, 208.67.222.222 every 60 seconds
2. If all three fail, increments failure counter
3. After 5 consecutive failures (~5 minutes), triggers reboot
4. Logs all activity to `/var/log/internet-watchdog.log`
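
The deployed script is not reproduced in this document. As a rough illustration of the four steps above, the logic could be sketched as follows. This is an assumption-laden sketch, not the contents of `/data/scripts/internet-watchdog.sh`: the function and variable names, the `ping` flags, and the "Rebooting gateway" log message are hypothetical; only the targets, interval, failure limit, log path, and the other log messages come from this page.

```bash
# Hypothetical sketch of the watchdog logic (not the deployed script).
TARGETS="${TARGETS:-1.1.1.1 8.8.8.8 208.67.222.222}"
MAX_FAILURES="${MAX_FAILURES:-5}"
INTERVAL="${INTERVAL:-60}"
LOG_FILE="${LOG_FILE:-/var/log/internet-watchdog.log}"
PING_CMD="${PING_CMD:-ping -c 1 -W 3}"   # assumed flags: one probe, 3 s timeout
REBOOT_CMD="${REBOOT_CMD:-reboot}"
FAILURES=0

log() {
  echo "$(date '+%Y-%m-%d %H:%M:%S') - $*" >> "$LOG_FILE"
}

# Succeeds as soon as any one target answers; fails only if all targets fail.
internet_up() {
  for target in $TARGETS; do
    $PING_CMD "$target" >/dev/null 2>&1 && return 0
  done
  return 1
}

# One check cycle: reset the counter on success, otherwise count the failure
# and reboot once MAX_FAILURES consecutive checks have failed.
check_once() {
  if internet_up; then
    [ "$FAILURES" -gt 0 ] && log "Internet restored after $FAILURES failures"
    FAILURES=0
    return 0
  fi
  FAILURES=$((FAILURES + 1))
  log "Internet check failed ($FAILURES/$MAX_FAILURES)"
  if [ "$FAILURES" -ge "$MAX_FAILURES" ]; then
    log "Rebooting gateway"
    $REBOOT_CMD
  fi
  return 0
}

# Main loop; a systemd unit would call this at the end of the script.
watchdog_loop() {
  log "Watchdog started"
  while true; do
    check_once
    sleep "$INTERVAL"
  done
}
```

Keeping one check cycle in its own function (`check_once`) makes the counter logic easy to exercise without waiting on the 60-second sleep or triggering a real reboot.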
**Commands**:
```bash
# Check service status
ssh ucg-fiber 'systemctl status internet-watchdog'
# View recent logs
ssh ucg-fiber 'tail -50 /var/log/internet-watchdog.log'
# Stop temporarily (if troubleshooting)
ssh ucg-fiber 'systemctl stop internet-watchdog'
# Restart
ssh ucg-fiber 'systemctl restart internet-watchdog'
```
**Log Format**:
```
2026-01-02 22:45:01 - Watchdog started
2026-01-02 22:46:01 - Internet check failed (1/5)
2026-01-02 22:47:01 - Internet restored after 1 failures
```
---
### Memory Monitor Service
**Purpose**: Logs memory usage and top processes every 10 minutes for diagnostics
**Location**: `/data/scripts/memory-monitor.sh`
**Log File**: `/data/logs/memory-history.log`
**How it works**:
1. Every 10 minutes, logs current memory usage (`free -m`)
2. Logs top 12 memory-consuming processes
3. Auto-rotates log when it exceeds 10MB (keeps one .old file)
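
The three steps above could be implemented along these lines. This is a minimal sketch under stated assumptions, not the contents of `/data/scripts/memory-monitor.sh`: the function names and the `MAX_BYTES`, `TOP_N`, and `INTERVAL` variables are hypothetical, while the log path, 10 MB limit, 10-minute interval, and section headers are taken from this page.

```bash
# Hypothetical sketch of the memory monitor (not the deployed script).
LOG_FILE="${LOG_FILE:-/data/logs/memory-history.log}"
MAX_BYTES="${MAX_BYTES:-10485760}"   # 10 MB rotation limit
TOP_N="${TOP_N:-12}"                 # number of processes to record
INTERVAL="${INTERVAL:-600}"          # 10 minutes

# Move the log to a single .old file once it exceeds MAX_BYTES.
rotate_if_large() {
  [ -f "$LOG_FILE" ] || return 0
  size=$(($(wc -c < "$LOG_FILE")))
  if [ "$size" -gt "$MAX_BYTES" ]; then
    mv -f "$LOG_FILE" "$LOG_FILE.old"
  fi
}

# Append one snapshot: timestamp header, free -m, and the top processes by RSS.
log_snapshot() {
  rotate_if_large
  {
    echo "========== $(date '+%Y-%m-%d %H:%M:%S') =========="
    echo "--- MEMORY ---"
    free -m
    echo "--- TOP MEMORY PROCESSES ---"
    # +1 accounts for the ps header line
    ps -eo pid,rss,comm --sort=-rss | head -n "$((TOP_N + 1))"
  } >> "$LOG_FILE"
}

# Main loop; a systemd unit would call this at the end of the script.
monitor_loop() {
  while true; do
    log_snapshot
    sleep "$INTERVAL"
  done
}
```

Rotating before each write keeps at most two files on disk (the live log and one `.old` copy), which bounds usage to roughly twice the limit.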
**Commands**:
```bash
# Check service status
ssh ucg-fiber 'systemctl status memory-monitor'
# View recent memory history
ssh ucg-fiber 'tail -100 /data/logs/memory-history.log'
# Check current memory usage
ssh ucg-fiber 'free -m'
# See top memory consumers right now
ssh ucg-fiber 'ps -eo pid,rss,comm --sort=-rss | head -12'
```
**Log Format**:
```
========== 2026-01-02 22:30:00 ==========
--- MEMORY ---
               total        used        free      shared  buff/cache   available
Mem:            2892        1890         102         456         899        1002
Swap:            512          88         424
--- TOP MEMORY PROCESSES ---
    PID    RSS COMMAND
   1234 327456 unifi-protect
   2345 252108 mongod
   3456 236544 java
...
```
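
To spot a downward trend before a freeze, the log can be reduced to one line per snapshot. The filter below is a small sketch that assumes entries look like the sample above (a `==========` timestamp header followed by `free -m` output); the function name `memory_trend` is illustrative.

```bash
# Hypothetical helper: read memory-history.log on stdin and print
# "<date> <time> <available MB>" for each snapshot.
memory_trend() {
  awk '
    /^=+ [0-9-]+ [0-9:]+ =+$/ { ts = $2 " " $3 }   # remember the snapshot timestamp
    /^Mem:/                   { print ts, $7 }     # column 7 of free -m is "available"
  '
}
```

For example, `ssh ucg-fiber 'cat /data/logs/memory-history.log' | memory_trend | tail -20` would list the available memory for the 20 most recent snapshots.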
---
## Known Memory Consumers
| Process | Typical Memory | Purpose |
|---------|----------------|---------|
| unifi-protect | ~320 MB | Camera/NVR management |
| mongod | ~250 MB | UniFi configuration database |
| java (controller) | ~230 MB | UniFi Network controller |
| postgres | ~180 MB | PostgreSQL database |
| unifi-core | ~150 MB | UniFi OS core |
| tailscaled | ~80 MB | Tailscale VPN |
**Total RAM**: ~2.9 GB
**Typical usage**: ~1.8-2.0 GB (leaves ~1 GB available)
**Warning threshold**: <500 MB available
**Critical**: <200 MB available or swap >50% used
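
These thresholds can be checked mechanically. The function below is a sketch, assuming the limits are measured against the `available` column of `free -m` (an interpretation, since the sample log shows only ~100 MB in the `free` column during normal operation); the name `classify_memory` is illustrative.

```bash
# Hypothetical helper: read `free -m` output on stdin and print
# ok, warning, or critical according to the thresholds on this page.
classify_memory() {
  awk '
    /^Mem:/  { avail = $7 }                        # "available" column, in MB
    /^Swap:/ { swap_total = $2; swap_used = $3 }
    END {
      swap_pct = (swap_total > 0) ? 100 * swap_used / swap_total : 0
      if (avail < 200 || swap_pct > 50) print "critical"
      else if (avail < 500)             print "warning"
      else                              print "ok"
    }'
}
```

Usage: `ssh ucg-fiber 'free -m' | classify_memory`.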
---
## Disabled Services
The following services were disabled to reduce memory usage:
| Service | Memory Saved | Reason Disabled |
|---------|--------------|-----------------|
| UniFi Connect | ~200 MB | Not needed (cameras use Protect) |
To re-enable if needed:
```bash
ssh ucg-fiber 'systemctl enable unifi-connect && systemctl start unifi-connect'
```
---
## Common Issues
### Gateway Freeze / Network Loss
**Symptoms**:
- All devices lose internet
- Cannot ping 10.10.10.1
- Physical reboot required
**Root Cause**: Memory exhaustion causing soft lockup
**Prevention**:
1. Internet watchdog auto-reboots after 5 min outage
2. Memory monitor logs help identify runaway processes
3. UniFi Connect disabled to free ~200 MB
**Post-Incident Analysis**:
```bash
# Check memory history for spike before freeze
ssh ucg-fiber 'grep -B5 "Swap:" /data/logs/memory-history.log | tail -50'
# Check watchdog logs
ssh ucg-fiber 'cat /var/log/internet-watchdog.log'
# Check system logs for errors
ssh ucg-fiber 'dmesg | tail -100'
ssh ucg-fiber 'journalctl -p err --since "1 hour ago"'
```
---
### High Memory Usage
**Check current state**:
```bash
ssh ucg-fiber 'free -m && echo "---" && ps -eo pid,rss,comm --sort=-rss | head -15'
```
**If swap is heavily used**:
```bash
# Check swap usage
ssh ucg-fiber 'cat /proc/swaps'
# List processes with more than ~10 MB swapped out, largest first
ssh ucg-fiber 'for pid in $(ls /proc | grep -E "^[0-9]+$"); do
  swap=$(grep VmSwap /proc/$pid/status 2>/dev/null | awk "{print \$2}");
  [ "$swap" -gt 10000 ] 2>/dev/null && echo "$pid: ${swap}kB - $(cat /proc/$pid/comm 2>/dev/null)";
done | sort -t: -k2 -rn | head -10'
```
**Consider reboot if**:
- Available memory <200 MB
- Swap usage >300 MB
- System becoming unresponsive
---
### Tailscale Issues
**Check Tailscale status**:
```bash
ssh ucg-fiber 'tailscale status'
```
**Common errors and fixes**:
| Error | Fix |
|-------|-----|
| `DNS resolution failed` | Check upstream DNS (Pi-hole at 10.10.10.10) |
| `TLS handshake failed` | Usually temporary; Tailscale auto-reconnects |
| `Not connected` | `ssh ucg-fiber 'tailscale up'` |
---
## Firmware Updates
**Check current version**:
```bash
ssh ucg-fiber 'ubnt-systool version'
```
**Update process**:
1. Check UniFi site for latest stable firmware
2. Download via UI or CLI
3. Schedule update during low-usage time
**After update**:
- Verify SSH key still works
- Check custom services still running
- Verify Tailscale reconnects
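
The post-update checks can be bundled into one function. This is a sketch under assumptions: the names `post_update_check`, `GATEWAY`, and `SSH_CMD` are illustrative, and it assumes `tailscale status` exits non-zero when Tailscale is not connected.

```bash
# Hypothetical helper: run the three post-update checks from a workstation.
GATEWAY="${GATEWAY:-ucg-fiber}"
# BatchMode makes ssh fail instead of prompting for a password if the key was wiped.
SSH_CMD="${SSH_CMD:-ssh -o BatchMode=yes -o ConnectTimeout=5}"

post_update_check() {
  rc=0
  # 1. SSH key login still works
  if ! $SSH_CMD "$GATEWAY" true; then
    echo "FAIL: SSH key login"
    return 1
  fi
  echo "OK: SSH key login"
  # 2. Custom services are running
  for svc in internet-watchdog memory-monitor; do
    if $SSH_CMD "$GATEWAY" "systemctl is-active --quiet $svc"; then
      echo "OK: $svc running"
    else
      echo "FAIL: $svc not running"
      rc=1
    fi
  done
  # 3. Tailscale has reconnected
  if $SSH_CMD "$GATEWAY" "tailscale status" >/dev/null 2>&1; then
    echo "OK: Tailscale connected"
  else
    echo "FAIL: Tailscale not connected"
    rc=1
  fi
  return "$rc"
}
```

If the first check fails, re-deploy the key before investigating the services.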
**Re-deploy SSH key if needed**:
```bash
ssh-copy-id -i ~/.ssh/id_ed25519 root@10.10.10.1
```
---
## Service Locations
| File | Purpose |
|------|---------|
| `/data/scripts/internet-watchdog.sh` | Watchdog script |
| `/data/scripts/memory-monitor.sh` | Memory monitor script |
| `/etc/systemd/system/internet-watchdog.service` | Watchdog systemd unit |
| `/etc/systemd/system/memory-monitor.service` | Memory monitor systemd unit |
| `/var/log/internet-watchdog.log` | Watchdog log |
| `/data/logs/memory-history.log` | Memory history log |
**Note**: `/data/` persists across firmware updates. `/var/log/` may not.
---
## Quick Reference Commands
```bash
# System status
ssh ucg-fiber 'uptime && free -m'
# Check both monitoring services
ssh ucg-fiber 'systemctl status internet-watchdog memory-monitor'
# Memory history (most recent snapshots; each 10-minute entry spans roughly 20 lines)
ssh ucg-fiber 'tail -60 /data/logs/memory-history.log'
# Watchdog activity
ssh ucg-fiber 'tail -20 /var/log/internet-watchdog.log'
# Network devices (ARP table)
ssh ucg-fiber 'cat /proc/net/arp'
# Tailscale status
ssh ucg-fiber 'tailscale status'
# System logs
ssh ucg-fiber 'journalctl -p warning --since "1 hour ago" | head -50'
```
---
## Backup Considerations
Scripts in `/data/scripts/` persist across firmware updates, but after a major update you may need to:
- Re-enable the systemd services
- Re-apply script permissions if they were reset
**Backup critical files**:
```bash
# Copy scripts locally for reference (quote the remote glob so the local shell does not expand it)
scp 'ucg-fiber:/data/scripts/*.sh' ~/Projects/homelab/data/scripts/
```
---
## Related Documentation
- [SSH-ACCESS.md](SSH-ACCESS.md) - SSH configuration and host aliases
- [NETWORK.md](NETWORK.md) - Network architecture
- [MONITORING.md](MONITORING.md) - Overall monitoring strategy
- [HOMEASSISTANT.md](HOMEASSISTANT.md) - Home Assistant integration
---
## Incident History
### 2025-12-27 to 2025-12-29: Gateway Freeze
**Timeline**:
- Dec 7: Firmware update to v4.4.9
- Dec 24: Last healthy system logs
- Dec 27-29: "No internet detected" errors in logs
- Dec 29+: Complete silence (gateway frozen)
- Jan 2 (2026): Physical reboot restored access
**Root Cause**: Memory exhaustion causing soft lockup (no crash dump saved)
**Resolution**:
- Deployed internet-watchdog service
- Deployed memory-monitor service
- Disabled UniFi Connect (~200 MB saved)
- Configured SSH key auth
---
**Last Updated**: 2026-01-02