Rescuing a Corrupted LND Umbrel Node With ZFS Snapshots

Word of caution: you could lose your channels’ balance when restoring LND from a backup; be sure to read on!

For the better part of a year now, I’ve been running an LND node in a virtual machine (VM) with Umbrel handling the underlying setup and container orchestration.

Although Umbrel is not yet considered production-ready, I’ve found it to be quite reliable thanks to several additional measures I’ve put in place: an SSL proxy, backup scripts, and server-grade hardware. Even so, I recently encountered an issue that tested the reliability of my setup.

What happened?

I woke up to an alert from Uptime Kuma - another fantastic piece of software, by the way - that my Umbrel LND node had gone offline. Checking the Umbrel dashboard, I found the LND app stuck on a perpetually loading icon.

[Screenshot: umbrel-lnd-offline]

Troubleshooting

To resolve the issue, I SSH’ed into the Umbrel node, restarted all LND-related containers, and checked the logs. I discovered that the node’s data had been corrupted. This came as a surprise, given that my node’s underlying storage is a ZFS pool, which checksums every block and, on a mirror, should normally repair corruption from the healthy copy.

ubuntu@umbrel:
> $ sudo /path/to/umbrel/scripts/app restart lightning

Corruption hint:

ubuntu@umbrel:
> $ docker logs --follow lightning_lnd_1
2024-02-04 06:29:24.480 [INF] LTND: Version: 0.17.3-beta commit=v0.17.3-beta, build=production, logging=default, debuglevel=info
2024-02-04 06:29:24.480 [INF] LTND: Active chain: Bitcoin (network=mainnet)
2024-02-04 06:29:24.488 [INF] RPCS: RPC server listening on 0.0.0.0:10009
2024-02-04 06:29:24.524 [INF] RPCS: gRPC proxy started at 0.0.0.0:8080
2024-02-04 06:29:24.591 [INF] LTND: Opening the main database, this might take a few minutes...
2024-02-04 06:29:24.595 [INF] LTND: Opening bbolt database, sync_freelist=false, auto_compact=false
2024-02-04 06:29:24.713 [ERR] RPCS: [/lnrpc.Lightning/GetInfo]: waiting to start, RPC services not available
unexpected fault address 0x7ff62803f000
fatal error: fault
[signal SIGBUS: bus error code=0x2 addr=0x7ff62803f000 pc=0xa125ca]

goroutine 1 [running]:
runtime.throw({0x1e11a05?, 0x10?})
	runtime/panic.go:1077 +0x5c fp=0xc000982190 sp=0xc000982160 pc=0x43b13c
runtime.sigpanic()
	runtime/signal_unix.go:858 +0x116 fp=0xc0009821f0 sp=0xc000982190 pc=0x451fd6
go.etcd.io/bbolt.(*page).fastCheck(0x7ff62803f000, 0x7d3f)
	go.etcd.io/bbolt@v1.3.7/page.go:57 +0x2a fp=0xc0009822b8 sp=0xc0009821f0 pc=0xa125ca

Upon further investigation, I found that ZFS reported a permanent error for the channel.db file.

admin@linuxstorage
> $ zpool status -v                                                                                                                                                                                                                     
  pool: nvme-pool1
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 04:53:35 with 0 errors on Mon Jan  4 05:17:36 2024
config:

	NAME                                  STATE     READ WRITE CKSUM
	nvme-pool1                            ONLINE       0     0     0
	  mirror-0                            ONLINE       0     0     0
	    nvme-CT2000yyy                    ONLINE       0     0     0
	    nvme-CT2000zzz                    ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        /path/to/umbrel-data/app-data/lightning/data/lnd/data/graph/mainnet/channel.db
        nvme-pool1/path/to/umbrel-data@autosnap_2024-02-04_02:56:03_hourly:/app-data/lightning/data/lnd/data/graph/mainnet/channel.db
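The zpool output above also names the first snapshot that already contained the corrupted blocks. To see which snapshots still hold an intact copy of channel.db, you can simply read the file from each snapshot under the hidden .zfs/snapshot directory: ZFS returns an I/O error for any block it cannot verify and reconstruct. A rough sketch, using the anonymized paths from above:

for snap in /path/to/umbrel-data/.zfs/snapshot/*; do
    db="$snap/app-data/lightning/data/lnd/data/graph/mainnet/channel.db"
    # dd exits non-zero if the file is missing or ZFS hits an unrecoverable checksum error while reading
    if dd if="$db" of=/dev/null bs=1M status=none 2>/dev/null; then
        echo "readable:   $(basename "$snap")"
    else
        echo "unreadable: $(basename "$snap")"
    fi
done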

Remediation paths

When facing this issue, you have two options:

  1. Recover from your seed and the static channel backup (SCB), force closing all channels - this is the recommended approach.
  2. Restore channel.db from a backup or snapshot and hope for the best.
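For completeness, option 1 essentially means re-creating the wallet from its seed and then using the SCB to have all peers force close the channels. On Umbrel, lncli runs inside the lightning_lnd_1 container; a hedged sketch, with the channel.backup path purely illustrative (check where your setup actually writes it):

# Re-create the wallet from the 24-word seed (interactive prompt)
docker exec -it lightning_lnd_1 lncli create

# Ask all channel peers to force close, sweeping funds back on-chain.
# The backup path below is illustrative; point it at your actual channel.backup.
docker exec -it lightning_lnd_1 lncli restorechanbackup --multi_file /path/to/channel.backup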

It is crucial to note that restoring a non-current channel state is risky: if your node ever broadcasts an outdated state, a channel peer can treat it as an attempted breach and claim the entire channel balance as a penalty.

In this particular scenario, I nonetheless chose to proceed with option 2 (restoring from a backup or snapshot) for several reasons:

  1. The time window between my LND node going offline and the restoration was relatively short, minimizing the risk associated with restoring a non-current state of my channel balances.
  2. ZFS provided information on the first snapshot that included a corrupted version of channel.db, allowing me to pinpoint the time of the corruption with hourly precision (see the snapshot listing sketch after this list).
  3. I had a ZFS snapshot of the dataset taken just an hour before the corruption, making the restoration more likely to succeed.
  4. While other backup solutions, such as restic, could also have been used for restoration, they might not have offered the same recency as the ZFS snapshot in this case, and restoring from them would have been slightly more involved.
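To find that window, listing the dataset’s snapshots along with their creation times is enough; for example (pool and dataset names anonymized as above):

admin@linuxstorage
> $ zfs list -t snapshot -o name,creation -s creation nvme-pool1/path/to/umbrel-data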

Recovery Steps

Here’s how I restored my LND node using a ZFS snapshot:

  1. On the ZFS host, copy the last known healthy version of channel.db:
admin@linuxstorage
> $ cp /path/to/umbrel-data/.zfs/snapshot/autosnap_2024-02-04_01:48:47_daily/app-data/lightning/data/lnd/data/graph/mainnet/channel.db /path/to/umbrel-data/app-data/lightning/data/lnd/data/graph/mainnet
  2. On the Umbrel node, restart the LND containers:
ubuntu@umbrel:
> $ sudo /path/to/umbrel/scripts/app restart lightning
  3. Check the logs using docker logs --follow lightning_lnd_1 to ensure everything is running smoothly:
2024-02-04 06:41:16.993 [INF] LTND: Version: 0.17.3-beta commit=v0.17.3-beta, build=production, logging=default, debuglevel=info
2024-02-04 06:41:16.994 [INF] LTND: Active chain: Bitcoin (network=mainnet)
2024-02-04 06:41:17.099 [INF] RPCS: RPC server listening on 0.0.0.0:10009
2024-02-04 06:41:17.147 [INF] RPCS: gRPC proxy started at 0.0.0.0:8080
2024-02-04 06:41:17.148 [INF] LTND: Opening the main database, this might take a few minutes...
2024-02-04 06:41:17.150 [INF] LTND: Opening bbolt database, sync_freelist=false, auto_compact=false
2024-02-04 06:41:21.380 [ERR] RPCS: [/lnrpc.Lightning/GetInfo]: waiting to start, RPC services not available
[...]
2024-02-04 06:42:47.453 [INF] CHDB: Checking for schema update: latest_version=31, db_version=31
2024-02-04 06:42:47.454 [INF] CHDB: Checking for optional update: prune_revocation_log=false, db_version=empty
2024-02-04 06:42:47.507 [INF] LTND: Database(s) now open (time_to_open=1m30.356488446s)!
2024-02-04 06:42:47.508 [INF] LTND: We're not running within systemd or the service type is not 'notify'
2024-02-04 06:42:47.509 [INF] LTND: Waiting for wallet encryption password. Use `lncli create` to create a wallet, `lncli unlock` to unlock an existing wallet, or `lncli changepassword` to change the password of an existing wallet and unlock it.
2024-02-04 06:42:49.024 [ERR] RPCS: [/lnrpc.Lightning/GetInfo]: wallet locked, unlock it to enable full RPC access
[...]
2024-02-04 06:42:53.121 [INF] LNWL: Opened wallet

You should see “Opened wallet” in the logs, indicating that the issue has been resolved.
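Beyond the logs, it’s worth sanity-checking the node itself once it’s back up, for example with lncli inside the same container:

# Confirm the node is synced to chain and graph
# (depending on the image, lncli may need flags such as --lnddir or --network)
docker exec lightning_lnd_1 lncli getinfo

# Confirm the restored channels are listed and active
docker exec lightning_lnd_1 lncli listchannels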

By following these steps, I was able to rescue my LND Umbrel node from channel.db corruption using a ZFS snapshot.
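One loose end on the ZFS side: the permanent-error entry shown by zpool status does not disappear just because the corrupted file has been overwritten; the list typically only clears after subsequent scrubs complete. Something along these lines:

admin@linuxstorage
> $ sudo zpool scrub nvme-pool1
> $ zpool status -v nvme-pool1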

Remember, it’s essential to keep regular backups and snapshots of your Umbrel node data to minimize the risk of data loss, and with it loss of funds, in case of corruption. This is especially true for those running on a Raspberry Pi or even the official appliance, neither of which features multiple drives or RAID. Deploy ZFS and/or a daily backup job on the host!
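As a closing pointer: the autosnap_* names in the zpool output appear to follow sanoid’s naming convention, and a small sanoid policy is all it takes to keep hourly snapshots like the one that saved this node. A minimal sketch, assuming sanoid is installed and using the anonymized dataset name from above:

# /etc/sanoid/sanoid.conf (sketch)
[nvme-pool1/path/to/umbrel-data]
        use_template = production
        recursive = yes

[template_production]
        frequently = 0
        hourly = 36
        daily = 30
        monthly = 3
        yearly = 0
        autosnap = yes
        autoprune = yes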