How to tell which physical drives are failing in a degraded Proxmox ZFS pool

(Or any pool, really; doesn’t need to be Proxmox…)

Just recently, a server I commissioned 5 years ago began emailing me SMART drive errors. The RAIDZ-1 pool of four WD Red 3TB drives had a couple of drives going bad at once: bad news indeed, since RAIDZ-1 only tolerates the loss of a single drive.

In a case like this, it’s important to make sure your backups are current (or to make a backup in a hurry, if you haven’t already done so!). With the peace of mind of knowing that your data is safe even if you make a wrong move and blow up your zpool, replacing drives in a ZFS pool is supposed to be quite easy. But pulling a good drive out of an already degraded pool will probably take down the whole storage pool, so how do you know which physical drive is failing? How do you match each physical drive to its identifier in the ZFS pool?

1. View verbose zpool status

First, view the status of your ZFS pool with the “zpool status -v” command. Here’s what my output looked like. I know, not good! (Note: if you have more than one pool and only want to display the status of one, specify the pool name, e.g. “zpool status -v dpool”.)

root@pve:~# zpool status -v dpool
  pool: dpool
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: scrub repaired 1012K in 0 days 08:31:31 with 0 errors on Sun Oct 11 08:55:32 2020
config:

    NAME                                                   STATE     READ WRITE CKSUM
    dpool                                                  DEGRADED     0     0     0
      raidz1-0                                             DEGRADED    70     0     0
        wwn-0x50014ee2b667195e                             DEGRADED    75     0     0  too many errors
        wwn-0x50014ee2b6662089                             ONLINE       0     0     0
        wwn-0x50014ee2b666d820                             FAULTED     24     0     0  too many errors
        wwn-0x50014ee2b665e775                             ONLINE       0     0     0
    logs	
      ata-Samsung_SSD_860_EVO_250GB_S3YHNX0M117855W-part1  ONLINE       0     0     0
    cache
      sdb2                                                 ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        dpool/vms/vm-100-disk-0:<0x1>

Note the drive identifiers ZFS is using for the RAIDZ members, each beginning with “wwn-”. Popping open the cover of my Proxmox host and looking at the drives gets me the serial number of each:

[Photo: 4 hard drives in server showing serial numbers]
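On a pool with more drives than mine, it can help to boil the status output down to just the unhealthy devices. Here is a minimal sketch of that idea, not anything built into ZFS: the function name unhealthy_devices is made up for illustration, and it assumes the status layout shown above (a NAME header row, then one device per line with its state in the second column).

```shell
# Hypothetical helper: reads "zpool status" output on stdin and prints
# "<device> <state>" for every device that is not healthy.
# Assumes the layout shown in this post; the pool itself and grouping
# vdevs such as raidz1-0 or mirror-0 are skipped.
unhealthy_devices() {
    awk '
        $1 == "pool:"   { pool = $2 }             # remember the pool name
        $1 == "NAME"    { in_config = 1; next }   # device table starts here
        $1 == "errors:" { in_config = 0 }         # device table is over
        !in_config      { next }
        $1 == pool      { next }                  # skip the pool summary row
        $1 ~ /^(raidz|mirror|draid|spare|replacing)/ { next }
        $2 ~ /^(DEGRADED|FAULTED|UNAVAIL|REMOVED|OFFLINE)$/ { print $1, $2 }
    '
}
```

You would use it as “zpool status dpool | unhealthy_devices”. Fed the output above, it should list only the two wwn- devices marked DEGRADED and FAULTED.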

2. Correlate physical drives to their identifiers in the zpool

So without shutting down the server and pulling drives to view their big label stickers, how do we find out which of the drives are failing? Which drive serial numbers correspond to each WWN identifier in the zpool? Turns out it’s easy with smartctl. Let’s find out which disk is DEGRADED in the pool above:

root@pve:~# smartctl -a /dev/disk/by-id/wwn-0x50014ee2b667195e | grep Serial
Serial Number:    WD-WCC4N3VR2V07

All right, a quick comparison with the photo of the drives reveals that’s the bottom drive. Next, which one is FAULTED?

root@pve:~# smartctl -a /dev/disk/by-id/wwn-0x50014ee2b666d820 | grep Serial
Serial Number:    WD-WCC4N4HUCHLP

OK, so that’s the second drive down from the top. Now we know exactly which drives are causing the issue and need to be replaced. Actually, after 5 years of 24/7 service, hmmmm… likely they all ought to be replaced! In my case I took a backup, destroyed the pool altogether, and started over with a mirror of two 4TB drives. (I am coming to prefer mirrors over RAID-Z arrays.) For the curious, I opted for Seagate IronWolf drives this time; I’ll see how they treat me.

Obviously there are other methods to arrive at the same place. You can simply run “smartctl -a” on each drive in turn and look for the serial number and WWN among the information printed to the screen:

smartctl -a /dev/sda

and so forth. Whatever best suits your style!
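If you would rather not eyeball the full output for every drive, the same walk can be scripted. The sketch below is only an illustration: parse_identity and list_drive_identities are names I made up, and it assumes smartctl prints a “Serial Number:” line and an “LU WWN Device Id:” line with the WWN split into space-separated groups (e.g. “5 0014ee 2b667195e”), which may differ between drive types and smartctl versions.

```shell
# Hypothetical helper: reads "smartctl -i" output on stdin and prints
# the serial number followed by the WWN in the "wwn-0x..." form that
# appears under /dev/disk/by-id.
parse_identity() {
    awk -F': *' '
        /^Serial Number:/    { serial = $2 }
        /^LU WWN Device Id:/ { wwn = $2; gsub(/ /, "", wwn) }  # join the groups
        END {
            printf "%s %s\n", serial, (wwn == "" ? "(no WWN)" : "wwn-0x" wwn)
        }
    '
}

# Hypothetical wrapper: prints "<device> <serial> <wwn>" for each device
# given as an argument. Needs root and smartmontools installed.
list_drive_identities() {
    for dev in "$@"; do
        printf '%s ' "$dev"
        smartctl -i "$dev" | parse_identity
    done
}
```

Something like “list_drive_identities /dev/sd?” then gives one line per drive to compare against the zpool status output.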

Of course, physically pulling the drive and inspecting the top sticker will also reveal the serial number, WWN identifier, and so forth. But if all you can see is the serial number, and you want to find out which drive is failing without shutting down the server to pull drives, this method will get you there. And if your pool happens to list its drives by their /dev/sdX identifiers, you’ll definitely need a way to figure out which physical devices the pool config refers to (e.g. “smartctl -a /dev/sdc | grep WWN” or “smartctl -a /dev/sdd | grep Serial”, etc.).
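Another way to bridge /dev/sdX names and the wwn- names is to follow the symlinks under /dev/disk/by-id, since each one points at a kernel device node. A rough sketch, assuming a “readlink” that supports -f (as on Debian-based systems like Proxmox); aliases_for_device is a made-up name, and the optional second argument exists only so the function can be tried against a scratch directory:

```shell
# Hypothetical helper: prints every alias in a by-id style directory
# that resolves to the given device node.
#   $1 = device node, e.g. /dev/sdc
#   $2 = directory of symlinks (defaults to /dev/disk/by-id)
aliases_for_device() {
    target=$(readlink -f "$1")
    dir=${2:-/dev/disk/by-id}
    for link in "$dir"/*; do
        [ -L "$link" ] || continue                # only consider symlinks
        if [ "$(readlink -f "$link")" = "$target" ]; then
            printf '%s\n' "${link##*/}"           # print the alias name only
        fi
    done
}
```

For example, “aliases_for_device /dev/sdc” would print whichever ata- and wwn- names point at that disk.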
