How to fix a stuck ZFS log device
I ran into a troublesome ZFS bug several months ago where a pool with a log device became “stuck”. The ‘zpool remove’ command would complete but would not remove the device. This was a bad place to be in, because the device was no longer usable, could not be removed, and would most likely prevent the pool from ever being exported and reimported again. Someone else had posted on the zfs-discuss mailing list about the same problem and put me in contact with George Wilson, who in turn put me on the right track to a successful workaround. Log devices exhibiting this behavior need to have a specific field of a data structure zeroed out, and then the remove command will actually finish removing the device.
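For reference, the field that needs zeroing lives in the per-vdev statistics structure the kernel keeps for every device in a pool. The excerpt below is paraphrased from the vdev_stat_t definition in illumos’ sys/fs/zfs.h (the field names match the mdb output further down, but I’ve trimmed the comments and trailing members); the important detail is that vs_alloc is an unsigned 64-bit counter, so if the space accounting ever drives it below zero it wraps around to an absurdly large value.

/* Paraphrased excerpt of vdev_stat_t from illumos sys/fs/zfs.h */
typedef struct vdev_stat {
        hrtime_t  vs_timestamp;   /* time since vdev load */
        uint64_t  vs_state;       /* vdev state */
        uint64_t  vs_aux;         /* see vdev_aux_t */
        uint64_t  vs_alloc;       /* space allocated -- the field to zero */
        uint64_t  vs_space;       /* total capacity */
        uint64_t  vs_dspace;      /* deflated capacity */
        /* ... vs_rsize, vs_ops[], vs_bytes[], error counters, etc. ... */
} vdev_stat_t;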
The procedure for fixing this is:
Find the kernel virtual address (the hex value in the left column) that corresponds to the stuck log device in the right column.
# echo '::spa -v' | mdb -k
...
ffffff0946273d40 HEALTHY - /dev/dsk/c4t5000CCA000342E60d0s0
ffffff0946274380 HEALTHY - /dev/dsk/c7d0s0
- - - cache
ffffff094b069680 HEALTHY - /dev/dsk/c7d0s1
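Each address in the left column is a pointer to that device’s in-kernel vdev_t. The reason the next step can dump the statistics straight from that address is that vdev_t embeds the stats structure as a member named vdev_stat, roughly like this (paraphrased from illumos’ vdev_impl.h, with almost everything else omitted):

/* Paraphrased excerpt of struct vdev from illumos vdev_impl.h */
struct vdev {
        uint64_t     vdev_id;      /* child number within parent */
        uint64_t     vdev_guid;    /* unique ID for this vdev */
        /* ... many members omitted ... */
        vdev_stat_t  vdev_stat;    /* stats reported by zpool iostat */
        /* ... */
};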
Find the kernel virtual address (left column) of the vdev_stat.vs_alloc field.
# echo 'ffffff0946273d40::print -a vdev_t vdev_stat' | mdb -k
vdev_stat = {
ffffff0946273f70 vdev_stat.vs_timestamp = 0x52bcba2a6c
ffffff0946273f78 vdev_stat.vs_state = 0
ffffff0946273f80 vdev_stat.vs_aux = 0
ffffff0946273f88 vdev_stat.vs_alloc = 0xfffffffffffe0000
ffffff0946273f90 vdev_stat.vs_space = 0x222c000000
ffffff0946273f98 vdev_stat.vs_dspace = 0x222c000000
ffffff0946273fa0 vdev_stat.vs_rsize = 0
ffffff0946273fa8 vdev_stat.vs_ops = [ 0x5, 0x1c, 0x20b, 0, 0, 0x92 ]
ffffff0946273fd8 vdev_stat.vs_bytes = [ 0, 0x12a000, 0x640000, 0, 0, 0 ]
ffffff0946274008 vdev_stat.vs_read_errors = 0
ffffff0946274010 vdev_stat.vs_write_errors = 0
ffffff0946274018 vdev_stat.vs_checksum_errors = 0
ffffff0946274020 vdev_stat.vs_self_healed = 0
ffffff0946274028 vdev_stat.vs_scan_removing = 0x1
ffffff0946274030 vdev_stat.vs_scan_processed = 0
}
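The vs_alloc value is the smoking gun. Interpreted as a signed 64-bit number, 0xfffffffffffe0000 is -0x20000, meaning the allocation accounting has underflowed 128 KiB below zero; interpreted as unsigned, the way the userland tools read it, it is roughly 16 EiB, which is where bogus “16.0E” figures in zpool iostat output come from. A quick standalone C check of that arithmetic:

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

int main(void) {
        uint64_t vs_alloc = 0xfffffffffffe0000ULL;  /* value printed by mdb above */

        /* Signed view: the counter has wrapped 128 KiB below zero. */
        printf("signed:   %" PRId64 " bytes\n", (int64_t)vs_alloc);

        /* Unsigned view: what the tools report, roughly 16 EiB. */
        printf("unsigned: %" PRIu64 " bytes (~%.1f EiB)\n", vs_alloc,
            (double)vs_alloc / (1024.0 * 1024 * 1024 * 1024 * 1024 * 1024));

        return (0);
}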
Change the value. Because vs_alloc is a 64-bit field, you may need the 8-byte /Z write format to zero it completely; the 4-byte /W format only clears the low half of the field.
# echo 'ffffff0946273f88/Z0' | mdb -kw
0xffffff0946273f88: 0xffffffff00000000 = 0x0
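To see concretely why the 4-byte /W format is not enough here: on a little-endian x86 box a 4-byte write of zero leaves the high half of the old value intact, turning 0xfffffffffffe0000 into 0xffffffff00000000 (presumably why the old value echoed above differs from the vs_alloc printed earlier). A little C sketch of the two write sizes, assuming an x86 (little-endian) machine:

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
#include <string.h>

int main(void) {
        uint64_t vs_alloc = 0xfffffffffffe0000ULL;
        uint32_t zero32 = 0;

        /* Equivalent of an mdb /W0 write on little-endian x86:
         * only the low 4 bytes of the 64-bit field are cleared. */
        memcpy(&vs_alloc, &zero32, sizeof (zero32));
        printf("after a 4-byte (/W0) write: 0x%016" PRIx64 "\n", vs_alloc);

        /* Equivalent of an mdb /Z0 write: the whole field is cleared. */
        vs_alloc = 0;
        printf("after an 8-byte (/Z0) write: 0x%016" PRIx64 "\n", vs_alloc);

        return (0);
}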
Verify with ‘zpool iostat -v’ that the stuck device no longer reports a bogus allocated value, then re-run the remove.
# zpool iostat -v data
                             capacity     operations    bandwidth
pool                       alloc   free   read  write   read  write
-------------------------  -----  -----  -----  -----  -----  -----
...
  c4t5000CCA000342E60d0        -   137G      0      0      0      0
  c7d0s0                   16.0E  9.94G      0      0      0      0
cache                          -      -      -      -      -      -
  c7d0s1                   64.9G  7.78M     17     59   394K  7.07M
-------------------------  -----  -----  -----  -----  -----  -----
# zpool remove data c4t5000CCA000342E60d0
I experienced this problem on NexentaCore 3.1, which shares the same kernel bits as NexentaStor 3.x. Someone apparently filed a bug with Oracle (CR 7000154), but the information is unfortunately locked away behind their paywall. I haven’t looked recently, but at the time there was no Illumos bug filed, and as far as I know it still isn’t fixed. Hopefully the above information will help others experiencing this problem; however, please note that I am not responsible for any damage this might do to your data. Make sure you have things backed up, and be aware that writing to kernel memory with mdb in this manner can cause a kernel panic on a running system!