# ZFS on CasaOS — Project Plan

**Thread:** #setting-up-zfs-on-casaOS
**Status:** Planning
**Created:** 2026-04-16

## Overview

Migrate from the current CasaOS setup to ZFS-backed storage with proper redundancy and an offsite backup strategy. Long-term play tied to the new rig build.

## Current Hardware

- **OS:** CasaOS on existing Linux box
- **Drives (11 total, all spinners):** 6×4TB, 3×3TB, 1×8TB, 1×10TB — 51TB raw
- **Current usage:** Under 6TB
- **8TB + 10TB:** Currently in CasaOS — Plex, old machine backups, Minecraft servers, non-critical stuff
- **SSDs available:** Smaller sizes (details TBD)
- **Note:** 2TB drive replaced with 4TB (2026-04-18 update)

## ZFS Topology (Agreed)

### Main Pool ("tank")

- **Vdev 1:** 5×4TB raidz2 → ~12TB usable, 2-drive redundancy (the vault — irreplaceable data)
- **Vdev 2:** 3×3TB raidz1 → ~6TB usable, 1-drive redundancy (media, Minecraft, replaceable stuff)

### Backup (8TB + 10TB)

- Standalone backup targets, NOT part of the main pool
- Format TBD: ZFS for zfs send/recv OR ext4/NTFS for SHTF portability
- Consider: one ZFS for automated snapshots, one NTFS/exFAT for "grab and run" scenarios

### 4TB Drive (replaced 2TB)

- Scratch/overflow/ISO storage
- Possible L2ARC candidate (probably not needed — working set likely fits in RAM)
- Could also serve as a hot spare for the raidz2 if desired

## Key Decisions Made

- **CasaOS stays** — not switching back to TrueNAS. Bare-metal access (OpenClaw) matters.
- **ZFS layered underneath CasaOS** — CasaOS sees mountpoints, doesn't manage ZFS
- **Cockpit + ZFS plugin** for GUI management of the ZFS layer
- **No dedup** — ever
- **Striped mirrors considered** but rejected for this drive mix (too much wasted capacity with mismatched sizes)
- **raidz2 on 5×4TB** chosen for the main vdev — 2-drive redundancy, best for the vault
- **raidz1 on 3×3TB** — acceptable risk for replaceable data

## Key ZFS Rules Learned

- Can't add drives to an existing raidz vdev — only replace drives or add new vdevs (single-disk raidz expansion only landed in OpenZFS 2.3+)
- Mixed-size drives in a vdev waste capacity (every drive is capped at the smallest drive's size)
- Vdev capacity expansion only kicks in after ALL drives in the vdev have been replaced with larger ones
- Striped mirrors are the only topology where swapping a single mirror pair gives an immediate capacity boost
- 50% capacity overhead on mirrored setups

## Offsite Backup Strategy (Long-Term)

### 3-2-1 Compliance

1. **Main copy:** raidz2 pool on the new rig
2. **Local backup:** 8/10TB standalone drives
3. **Offsite backup:** Old box relocated to the brewery after migration

### Brewery Infrastructure

- HA box (HAOS) — hands off
- N100 (Batocera) — hands off
- Older Mac — janky ZFS, skip
- Raspberry Pis — possible backup target but slow
- **Best option:** Old server box moved to the brewery as a dedicated backup target

### Brewery Backup Setup (Future)

- Fresh Linux install on the old box
- ZFS pool on the backup drives
- `zfs send/recv` over SSH for automated incremental snapshots
- Cron job or systemd timer

## Migration Roadmap

1. Build new rig (see AI thread for hardware planning)
2. Set up ZFS pools on the new rig with the planned topology (see the sketch after this list)
3. Migrate CasaOS + all services
4. Verify everything works on the new iron
5. Wipe the old box, fresh Linux + ZFS as backup target
6. Relocate the old box to the brewery
7. Set up zfs send/recv over SSH for nightly incremental backups (sketch at the end of this doc)
8. Decommission old setup with confidence
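To make the agreed topology concrete, here is a minimal sketch of the pool creation for roadmap step 2. The device paths are placeholders (use the real `/dev/disk/by-id/` names), `ashift=12` assumes 4K-sector drives, and `-f` is needed on the second vdev because mixing raidz2 and raidz1 redundancy in one pool makes `zpool add` complain.

```bash
# Hypothetical device paths; substitute the real /dev/disk/by-id/ names so the
# pool doesn't depend on sdX letters staying stable across reboots.

# Vdev 1: 5x4TB raidz2 (the vault)
zpool create -o ashift=12 tank \
  raidz2 \
    /dev/disk/by-id/ata-4TB-A /dev/disk/by-id/ata-4TB-B \
    /dev/disk/by-id/ata-4TB-C /dev/disk/by-id/ata-4TB-D \
    /dev/disk/by-id/ata-4TB-E

# Vdev 2: 3x3TB raidz1 (media / Minecraft / replaceable stuff).
# -f is required because the new vdev's redundancy (raidz1) doesn't match the
# existing raidz2 vdev, which is exactly the mixed layout agreed above.
zpool add -f tank \
  raidz1 \
    /dev/disk/by-id/ata-3TB-A /dev/disk/by-id/ata-3TB-B /dev/disk/by-id/ata-3TB-C

# Datasets CasaOS will see as plain mountpoints; ZFS stays invisible to it
zfs create -o mountpoint=/mnt/vault tank/vault
zfs create -o mountpoint=/mnt/media tank/media

zpool status tank
```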
## SHTF Portability Notes

- ZFS: Linux-readable only (no native Windows support, read-only on Mac)
- ext4: Linux native; Windows/Mac need extra tools
- NTFS/exFAT: Universal — any random box can read it
- Consider keeping one backup drive as NTFS for "grab and run" scenarios
- A Linux live USB can read ZFS/ext4 on any machine in a pinch

## Special ZFS Drive Uses (Reference)

| Type | Use | Drive Type | Verdict |
|---|---|---|---|
| L2ARC | Read cache | SSD only | Skip unless ARC hit rate is low |
| SLOG/ZIL | Sync write log | SSD with PLP | Only for NFS/VMs/databases |
| Special vdev | Metadata storage | SSD preferred | High risk if not mirrored, overkill for home |
| Dedup vdev | Dedup tables | SSD | NO. Just no. |

## SMART Monitoring (Critical for Old Drives)

### Why It Matters

- All drives are old with no RMA coverage — a failure means replacing it out of pocket
- SMART warnings give days to weeks of notice before total failure
- Running a degraded pool on old drives is risking data loss

### Key Metrics to Watch

| Attribute | Warning Threshold | Critical |
|---|---|---|
| Reallocated Sectors (5) | >0 | >100 |
| Current Pending Sectors (197) | >0 | >10 |
| Uncorrectable Errors (198) | >0 | any |
| Power-On Hours (9) | — | >50,000 |
| Temperature (194) | >40°C | >45°C |

### Setup

```bash
# Install smartmontools
sudo apt install smartmontools

# Enable and start smartd
sudo systemctl enable smartd
sudo systemctl start smartd

# Run short test (5-10 min)
smartctl -t short /dev/sdX

# Run long/comprehensive test (several hours on large drives)
smartctl -t long /dev/sdX

# Check results
smartctl -l selftest /dev/sdX
smartctl -a /dev/sdX
```

### smartd.conf Configuration

```
# Email alerts on smartd warnings
DEVICESCAN -a -m user@example.com -M daily

# Or per-drive with specific schedules
/dev/sda -a -m admin@example.com -M daily -s (S/../.././02|L/../../6/03)
# Short test daily at 2am, long test Saturdays at 3am
```

### Power Cycle Count (12) — Also Worth Tracking

- Counts how many times the drive has been powered on/off. A high number means a drive that has seen a lot of wear cycles
- Not a failure predictor on its own, but it tells you the drive's wear history

### Pre-Pool Drive Health Check

Before building the pool, run full SMART tests on all drives:

```bash
smartctl -t long /dev/sda
smartctl -t long /dev/sdb
# ... etc
```

Then check `smartctl -a /dev/sdX | grep -E '(Reallocated|Current_Pending|Uncorrectable)'`.

Drives with any reallocated or pending sectors should be considered questionable — use them for non-critical vdevs or retire them entirely.

### ZFS Integration

ZFS doesn't do its own SMART polling, but `zpool status` shows drive errors. A rising error count in `zpool status` alongside SMART warnings = replace that drive now.

```bash
# Check for ZFS errors
zpool status -v
```

## Pre-Pool Drive Testing (2026-04-18)

### Strategy

- 11 drives to test, 4 at a time (SATA port limitation)
- Sequential testing — run one batch, swap drives, run the next batch
- Two-phase testing per drive:
  1. SMART long test (several hours, depending on drive size)
  2. badblocks non-destructive scan (plan on a day or more per 4TB drive; it reads, rewrites, and restores every block)

### Phase 1: SMART Long Tests

Starts the long test on all drives in the current batch simultaneously.

```bash
#!/bin/bash
# drive-smart-test.sh
# Usage: ./drive-smart-test.sh /dev/sda /dev/sdb /dev/sdc /dev/sdd

DRIVES=("$@")
LOGDIR="/root/drive-health-logs"
mkdir -p "$LOGDIR"

for DRIVE in "${DRIVES[@]}"; do
  echo "[$(date)] Starting SMART long test on $DRIVE" | tee -a "$LOGDIR/test.log"
  # smartctl returns immediately; the test itself runs on the drive in the background
  smartctl -t long "$DRIVE" 2>&1 | tee -a "$LOGDIR/test.log"
done

echo "[$(date)] All SMART tests initiated. Check results with:"
echo "  smartctl -l selftest $DRIVE"
echo "  smartctl -a $DRIVE | grep -E '(Reallocated|Current_Pending|Uncorrectable|Power_On_Hours)'"
```
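Once the long tests finish, a quick way to check a whole batch against the thresholds in the Key Metrics table. This is a minimal sketch (`smart-thresholds.sh` is a made-up name, and the attribute names assume the standard ATA attribute table that `smartctl -A` prints; vendor-specific firmware may label them differently):

```bash
#!/bin/bash
# smart-thresholds.sh (hypothetical helper, not part of smartmontools)
# Flags drives that trip the warning thresholds from the Key Metrics table.
# Usage: ./smart-thresholds.sh /dev/sdb /dev/sde   (run as root)

for DRIVE in "$@"; do
  echo "=== $DRIVE ==="
  # In smartctl -A output, column 2 is the attribute name and column 10 the raw value
  smartctl -A "$DRIVE" | awk '
    $2 == "Reallocated_Sector_Ct"  && $10 > 0     { print "  WARN reallocated sectors:", $10 }
    $2 == "Current_Pending_Sector" && $10 > 0     { print "  WARN pending sectors:", $10 }
    $2 == "Offline_Uncorrectable"  && $10 > 0     { print "  WARN uncorrectable:", $10 }
    $2 == "Power_On_Hours"         && $10 > 50000 { print "  NOTE power-on hours:", $10 }
    $2 == "Temperature_Celsius"    && $10 > 40    { print "  NOTE temperature (C):", $10 }
  '
done
```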
### Phase 2: badblocks Scan (Sequential)

One drive at a time to avoid port contention.

```bash
#!/bin/bash
# drive-badblocks.sh
# Usage: ./drive-badblocks.sh /dev/sda
# The drive must NOT be mounted or part of an active pool while this runs.

DRIVE="$1"
LOGDIR="/root/drive-health-logs"

if [ -z "$DRIVE" ]; then
  echo "Usage: $0 /dev/sdX"
  exit 1
fi

mkdir -p "$LOGDIR"
LOGFILE="$LOGDIR/badblocks-$(basename "$DRIVE").log"

echo "[$(date)] Starting badblocks non-destructive scan on $DRIVE" | tee "$LOGFILE"
# -n = non-destructive read-write, -v = verbose, -s = show progress
# -b 4096 keeps the block count in range on the 8TB/10TB drives
badblocks -b 4096 -nvs "$DRIVE" 2>&1 | tee -a "$LOGFILE"
echo "[$(date)] badblocks complete on $DRIVE" | tee -a "$LOGFILE"
```

### Full Sequential Test Workflow

```bash
#!/bin/bash
# drive-test-batch.sh
# Run one batch of drives through the full testing pipeline
# (SMART long test, then badblocks). Pass the batch's drives as args.

DRIVES=("$@")
LOGDIR="/root/drive-health-logs"
LOG="$LOGDIR/batch-$(date +%Y%m%d).log"
mkdir -p "$LOGDIR"

for DRIVE in "${DRIVES[@]}"; do
  SERIAL=$(smartctl -a "$DRIVE" | grep 'Serial Number' | awk '{print $NF}')
  SIZE=$(smartctl -a "$DRIVE" | grep 'User Capacity' | awk '{print $5,$6}')
  echo "==========" | tee -a "$LOG"
  echo "Drive: $DRIVE | Serial: $SERIAL | Size: $SIZE" | tee -a "$LOG"
  echo "==========" | tee -a "$LOG"

  # Phase 1: SMART long test
  echo "[$(date)] Phase 1: SMART long test" | tee -a "$LOG"
  smartctl -t long "$DRIVE" 2>&1 | tee -a "$LOG"

  # Wait for the self-test to finish (poll every 60s).
  # Note: smartctl -H reports overall health immediately and does NOT wait for
  # a running self-test, so poll the self-test execution status instead.
  while smartctl -a "$DRIVE" | grep -q 'Self-test routine in progress'; do
    echo "[$(date)] Waiting for SMART test to complete..." | tee -a "$LOG"
    sleep 60
  done
  STATUS=$(smartctl -H "$DRIVE" | grep 'overall-health' | awk '{print $NF}')
  echo "[$(date)] SMART overall-health: $STATUS" | tee -a "$LOG"

  # Capture SMART attributes
  echo "[$(date)] Capturing SMART attributes" | tee -a "$LOG"
  smartctl -a "$DRIVE" > "$LOGDIR/smart-$(basename "$DRIVE")-$(date +%Y%m%d).log"

  # Phase 2: badblocks (only if SMART passed)
  if [ "$STATUS" = "PASSED" ]; then
    echo "[$(date)] Phase 2: badblocks non-destructive scan" | tee -a "$LOG"
    badblocks -b 4096 -nvs "$DRIVE" > "$LOGDIR/badblocks-$(basename "$DRIVE")-$(date +%Y%m%d).log" 2>&1
    echo "[$(date)] badblocks complete" | tee -a "$LOG"
  else
    echo "[$(date)] SKIPPING badblocks — SMART overall-health: $STATUS" | tee -a "$LOG"
  fi
done

echo "[$(date)] Batch complete. Results in $LOGDIR"
```
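Each batch runs unattended for many hours, so kick it off detached from the SSH session. A hypothetical invocation for batch 1 (sdb and sde from the availability list below):

```bash
# Batch 1 (sdb + sde): run as root, detached so an SSH drop doesn't kill it
mkdir -p /root/drive-health-logs
chmod +x drive-test-batch.sh
nohup ./drive-test-batch.sh /dev/sdb /dev/sde \
  > /root/drive-health-logs/batch1-console.log 2>&1 &

# Follow progress
tail -f /root/drive-health-logs/batch1-console.log
```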
### Current Drive Availability (2026-04-18 update)

- **2×4TB hot** (SATA power available): WDC WD40EFRX (sdb), MDD4000GSA (sde)
- **4×4TB cold** (no power connectors available yet) — swap in batches for testing
- **3×3TB, 8TB, 10TB** — also cold, same power limitation

### Drive Batch Schedule

| Batch | Drives | Status |
|---|---|---|
| 1 | 2×4TB (sdb, sde) | **Ready to test** |
| 2 | 2×4TB | Cold swap |
| 3 | 2×4TB | Cold swap |
| 4 | 1×3TB + 8TB + 10TB | Cold swap |
| 5 | 2×3TB | Cold swap |

### Pass/Fail Criteria

- **PASS:** SMART `Reallocated Sectors = 0`, `Current Pending Sectors = 0`, `Uncorrectable Errors = 0`, and badblocks finds 0 bad sectors
- **CONDITIONAL:** Any reallocated/pending sectors — demote to the non-critical vdev (3×3TB raidz1)
- **FAIL:** Any uncorrectable errors, badblocks bad sectors, or SMART health = FAILED — retire the drive

### Next Steps

- [ ] Get drive inventory (exact models, ages, health)
- [ ] Finalize new rig hardware (cross-ref AI thread)
- [ ] Decide backup drive format (ZFS vs NTFS)
- [ ] Plan CasaOS migration steps when the new rig is ready
- [ ] Set up SMART monitoring on all drives before pool creation
- [ ] Source a cold-spare 4TB drive (to keep on the shelf for old-drive replacement)
- [ ] Run pre-pool drive tests (batches 1-3)
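Referenced from roadmap step 7: a minimal sketch of the nightly incremental send/recv job. It assumes a local dataset `tank/vault`, a remote pool `backup` on the brewery box, and key-based SSH as root; all of these names are placeholders until the brewery box actually exists.

```bash
#!/bin/bash
# zfs-offsite-backup.sh: hypothetical sketch, not a finished tool.
# Assumes dataset tank/vault locally, host "brewery" reachable over SSH with
# key auth, and a pool named "backup" on the remote box.
set -euo pipefail

DATASET="tank/vault"
REMOTE="root@brewery"
TARGET="backup/vault"
SNAP="${DATASET}@auto-$(date +%Y%m%d)"

# Most recent existing snapshot (from the previous run), if any
PREV=$(zfs list -H -t snapshot -o name -s creation "$DATASET" 2>/dev/null | tail -n 1 || true)

zfs snapshot "$SNAP"

if [ -n "$PREV" ]; then
  # Incremental send since the last snapshot
  zfs send -i "$PREV" "$SNAP" | ssh "$REMOTE" zfs recv -F "$TARGET"
else
  # First run: full send
  zfs send "$SNAP" | ssh "$REMOTE" zfs recv -F "$TARGET"
fi

# Example cron entry (nightly at 01:30):
# 30 1 * * * /root/zfs-offsite-backup.sh >> /var/log/zfs-offsite-backup.log 2>&1
```

The same script could also feed the local 8/10TB backup drives by piping into a local `zfs recv` instead of SSH, if those drives end up formatted as ZFS.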