How to fix a stuck ZFS log device
I ran into a troublesome ZFS bug several months ago where a pool with a log device became “stuck”. The ‘zpool remove’ command would complete but would not remove the device. This was a bad place to be in, because the device was no longer usable, could not be removed, and would most likely prevent the pool from ever being exported and reimported again. Someone else had posted on the zfs-discuss mailing list about the same problem and put me in contact with George Wilson, who in turn put me on the right track to a successful workaround. Log devices exhibiting this behavior need to have a specific field of a data structure zeroed out, and then the remove command will actually finish removing the device.
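For reference, the field that needs zeroing lives in the per-vdev statistics structure the kernel keeps for every device in a pool. The excerpt below is paraphrased from the vdev_stat_t definition in illumos’ sys/fs/zfs.h (the field names match the mdb output further down, but I’ve trimmed the comments and trailing members); the important detail is that vs_alloc is an unsigned 64-bit counter, so if the space accounting ever drives it below zero it wraps around to an absurdly large value.

/* Paraphrased excerpt of vdev_stat_t from illumos sys/fs/zfs.h */
typedef struct vdev_stat {
        hrtime_t  vs_timestamp;   /* time since vdev load */
        uint64_t  vs_state;       /* vdev state */
        uint64_t  vs_aux;         /* see vdev_aux_t */
        uint64_t  vs_alloc;       /* space allocated -- the field to zero */
        uint64_t  vs_space;       /* total capacity */
        uint64_t  vs_dspace;      /* deflated capacity */
        /* ... vs_rsize, vs_ops[], vs_bytes[], error counters, etc. ... */
} vdev_stat_t;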
The procedure for fixing this is:
Find the kernel virtual address (the hex value in the left column) that corresponds to the stuck log device in the right column.
# echo '::spa -v' | mdb -k
...
ffffff0946273d40 HEALTHY - /dev/dsk/c4t5000CCA000342E60d0s0
ffffff0946274380 HEALTHY - /dev/dsk/c7d0s0
- - - cache
ffffff094b069680 HEALTHY - /dev/dsk/c7d0s1
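Each address in the left column is a pointer to that device’s in-kernel vdev_t. The reason the next step can dump the statistics straight from that address is that vdev_t embeds the stats structure as a member named vdev_stat, roughly like this (paraphrased from illumos’ vdev_impl.h, with almost everything else omitted):

/* Paraphrased excerpt of struct vdev from illumos vdev_impl.h */
struct vdev {
        uint64_t     vdev_id;      /* child number within parent */
        uint64_t     vdev_guid;    /* unique ID for this vdev */
        /* ... many members omitted ... */
        vdev_stat_t  vdev_stat;    /* stats reported by zpool iostat */
        /* ... */
};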
Find the kernel virtual address (left column) of the vdev_stat.vs_alloc field.
# echo 'ffffff0946273d40::print -a vdev_t vdev_stat' | mdb -k
vdev_stat = {
ffffff0946273f70 vdev_stat.vs_timestamp = 0x52bcba2a6c
ffffff0946273f78 vdev_stat.vs_state = 0
ffffff0946273f80 vdev_stat.vs_aux = 0
ffffff0946273f88 vdev_stat.vs_alloc = 0xfffffffffffe0000
ffffff0946273f90 vdev_stat.vs_space = 0x222c000000
ffffff0946273f98 vdev_stat.vs_dspace = 0x222c000000
ffffff0946273fa0 vdev_stat.vs_rsize = 0
ffffff0946273fa8 vdev_stat.vs_ops = [ 0x5, 0x1c, 0x20b, 0, 0, 0x92 ]
ffffff0946273fd8 vdev_stat.vs_bytes = [ 0, 0x12a000, 0x640000, 0, 0, 0 ]
ffffff0946274008 vdev_stat.vs_read_errors = 0
ffffff0946274010 vdev_stat.vs_write_errors = 0
ffffff0946274018 vdev_stat.vs_checksum_errors = 0
ffffff0946274020 vdev_stat.vs_self_healed = 0
ffffff0946274028 vdev_stat.vs_scan_removing = 0x1
ffffff0946274030 vdev_stat.vs_scan_processed = 0
}
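The vs_alloc value is the smoking gun. Interpreted as a signed 64-bit number, 0xfffffffffffe0000 is -0x20000, meaning the allocation accounting has underflowed 128 KiB below zero; interpreted as unsigned, the way the userland tools read it, it is roughly 16 EiB, which is where bogus “16.0E” figures in zpool iostat output come from. A quick standalone C check of that arithmetic:

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

int main(void) {
        uint64_t vs_alloc = 0xfffffffffffe0000ULL;  /* value printed by mdb above */

        /* Signed view: the counter has wrapped 128 KiB below zero. */
        printf("signed:   %" PRId64 " bytes\n", (int64_t)vs_alloc);

        /* Unsigned view: what the tools report, roughly 16 EiB. */
        printf("unsigned: %" PRIu64 " bytes (~%.1f EiB)\n", vs_alloc,
            (double)vs_alloc / (1024.0 * 1024 * 1024 * 1024 * 1024 * 1024));

        return (0);
}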
Change the value. Because vs_alloc is a 64-bit field, you may need the 8-byte /Z write format to zero it completely; the 4-byte /W format only clears the low half of the field.
# echo 'ffffff0946273f88/Z0' | mdb -kw
0xffffff0946273f88: 0xffffffff00000000 = 0x0
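To see concretely why the 4-byte /W format is not enough here: on a little-endian x86 box a 4-byte write of zero leaves the high half of the old value intact, turning 0xfffffffffffe0000 into 0xffffffff00000000 (presumably why the old value echoed above differs from the vs_alloc printed earlier). A little C sketch of the two write sizes, assuming an x86 (little-endian) machine:

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
#include <string.h>

int main(void) {
        uint64_t vs_alloc = 0xfffffffffffe0000ULL;
        uint32_t zero32 = 0;

        /* Equivalent of an mdb /W0 write on little-endian x86:
         * only the low 4 bytes of the 64-bit field are cleared. */
        memcpy(&vs_alloc, &zero32, sizeof (zero32));
        printf("after a 4-byte (/W0) write: 0x%016" PRIx64 "\n", vs_alloc);

        /* Equivalent of an mdb /Z0 write: the whole field is cleared. */
        vs_alloc = 0;
        printf("after an 8-byte (/Z0) write: 0x%016" PRIx64 "\n", vs_alloc);

        return (0);
}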
Verify with ‘zpool iostat -v’ that the stuck device no longer reports a bogus allocated value, then re-run the remove.
# zpool iostat -v data
                             capacity     operations    bandwidth
pool                       alloc   free   read  write   read  write
-------------------------  -----  -----  -----  -----  -----  -----
...
  c4t5000CCA000342E60d0        -   137G      0      0      0      0
  c7d0s0                   16.0E  9.94G      0      0      0      0
cache                          -      -      -      -      -      -
  c7d0s1                   64.9G  7.78M     17     59   394K  7.07M
-------------------------  -----  -----  -----  -----  -----  -----
# zpool remove data c4t5000CCA000342E60d0
I experienced this problem on NexentaCore 3.1, which shares the same kernel bits as NexentaStor 3.x. Someone apparently filed a bug with Oracle (CR 7000154), but the information is unfortunately locked away behind their paywall. I haven’t looked recently, but at the time there was no Illumos bug filed, and as far as I know it still isn’t fixed. Hopefully the above information will help others experiencing this problem; however, please note that I am not responsible for any damage this might do to your data. Make sure you have things backed up, and be aware that writing to kernel memory with mdb in this manner can cause a kernel panic on a running system!