ZFS on Western Digital EARS drives

Posted by Thoughts and Ramblings on Wednesday, June 16, 2010

According to the stats, my previous post was one of the more popular posts on this site. It was written in response to a question I was asking myself before building a NAS box at home. In looking at the components to use in building it, I came across another question: how does one fix the performance of ZFS on Western Digital’s green drives with model numbers ending in “EARS” (WD15EARS, WD20EARS, etc.)? I’ve split this post into sections, each with a bold title, so readers can skip to the parts that interest them most. I describe why WD changed their drives, why this is a problem, and what the solutions are. I hope you enjoy it.

Background

First, some background on how drives work internally: a hard drive stores data on its platters in blocks, or sectors. The sector size is typically 512 bytes for a magnetic disk. With each sector, the disk must store a start code and an Error Correction Code (ECC). The start code is necessary so the disk can find a particular sector, and the ECC is necessary to correct any errors in the data read from the disk. Western Digital decided to change from 512-byte sectors to 4096-byte sectors. This means that each new sector stores the same amount of data as 8 of the previous sectors. Since each sector needs a start code, there are now 7 fewer start codes per 4096 bytes of data. The ECC of a 4096-byte sector needs to be larger than the ECC of a 512-byte sector, but it doesn’t need to be 8 times as big. This is because ECC corrects errors more efficiently over larger blocks, so a lower-rate ECC (one that is a smaller percentage of the sector’s size) can achieve the same Bit Error Rate (BER). Interesting side note: hard drives store data so compactly, and so close to the limits of the hardware, that some corruption is expected in the stored data, and the drive relies on the ECC to correct it. In fact, the manufacturers intentionally design the drives to have a few corruptions, but few enough that the ECC can handle them. With fewer start codes and a lower ECC rate, larger sector sizes allow for a greater density of data on the drive itself.

Writing Sectors

When a computer does a read, it reads an entire sector at a time. When it does a write, it again writes an entire sector at a time. There is no mechanism to write only part of a sector. Western Digital’s EARS drives have 4096-byte sectors, but the drive claims it has 512-byte sectors (since some operating systems still in use cannot handle different sector sizes). This means the drive does some work internally to handle the difference. Often, when writing data to a hard drive, multiple consecutive sectors are written. If a computer writes 8 512-byte sectors in a row, all of which correspond to the same 4096-byte sector, then the drive can simply wait until all 8 sectors are written and write out the data as a single 4096-byte sector. This is what the EARS drives do when sectors are written. If not all 8 sectors are written, then the drive has to do a partial sector write. This is typically not the case since many file systems operate on 4096-byte blocks and thus write 4096 bytes at a time.
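
As an aside, you can see the 512-byte claim for yourself. On FreeBSD, diskinfo reports the sector size the drive advertises (the device name below is just an example):

# assumes the EARS drive is da1; adjust to your device
diskinfo -v da1
# the "sectorsize" line will read 512, matching the drive's claim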

Partial Sector Write

If not all 8 512-byte sectors in a 4096-byte sector are written, then the drive has to write part of a sector. So the drive must find the data on the platter and write only the part of the sector that has changed, right? Wrong. The problem is that the ECC was computed over the 4096 bytes that were there previously, so it must be updated. Due to the nature of how the ECC is computed, one cannot write a new ECC without first reading at least some of the data that was in the original sector. While it is possible to read the data to be overwritten and the old ECC and compute the new ECC from those, this can propagate errors beyond the ECC’s ability to correct. So the drive must read the 4096-byte sector, correct any data errors in it, substitute in the changed data, compute the new ECC, wait for that part of the platter to spin around again, and write the data to the drive. In comparison, if all 8 512-byte sectors were written, the drive need only compute the new ECC and write the data. So, if a single 512-byte sector is written, the drive must read 4096 bytes, wait for a rotation, and then write 4096 bytes. On a 7200 RPM drive, waiting for a single rotation takes 8.333 ms. If one assumes no smart scheduling, this results in a maximum throughput of about a megabyte every 2 seconds, which is reminiscent of drives from the previous decade. Obviously, intelligent scheduling is used, but even in the best-case scenario the drive’s performance is cut in half. From this, one can see that it is crucial to always write 4096 bytes at a time to get the best performance.
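
To sanity-check that “megabyte every 2 seconds” figure, here is the rough arithmetic, assuming every misaligned write pays about one full rotation:

60 s / 7200 rotations  ≈ 8.33 ms per rotation
4096 bytes / 8.33 ms   ≈ 490 KB/s, or roughly 1 MB every 2 seconds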

Alignment

As stated in the previous section, the best performance comes when 4096 bytes are written at a time, but the data must also be properly aligned. If 2048 bytes are written to the end of one physical sector and 2048 to the start of the next, this involves 2 reads and 2 writes of 4096 bytes each. So all writes must be aligned on 8-sector boundaries (sector 0, 8, 16, 24, 32, etc.). This is why Western Digital put a jumper on the drive to help with alignment, since some partitioning methods start the first partition on sector 63, which is not an 8-sector boundary. Those on Unix/Linux systems must pay special attention to this when formatting the drive.
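
On FreeBSD you can check where existing partitions start (the device name below is just an example); a start sector divisible by 8 is aligned:

gpart show da1

A first partition starting at sector 63 is misaligned; one starting at 64 or 2048 is fine.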

Solutions

The solutions depend on the file system you are using. If you know your file system uses 4096-byte blocks, then you only need to ensure that the file system is properly aligned on 8-sector boundaries. So, when partitioning the drive, ensure that the partition offsets are divisible by 8.
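
For example, on FreeBSD a properly aligned ZFS partition could be created with gpart. This is just a sketch, assuming the disk is da1 and a GPT scheme (a start of sector 2048 is divisible by 8):

gpart create -s gpt da1
gpart add -b 2048 -t freebsd-zfs da1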

ZFS

So, what does this have to do with ZFS? ZFS doesn’t have a pre-set block size. It uses variable-sized blocks depending on the amount of data it is writing. If it’s writing 1000 bytes, then it will write the minimum number of sectors necessary to fit that data, which in the case of 512-byte sectors is 2. This means that writing 1000 bytes requires reading and writing 4096 bytes, or possibly 8192 bytes, depending on alignment. So properly aligning the partition will not solve the issue with ZFS; a different solution is needed.

File System Agnostic Solution

The best solution is for the file system to think the drive has 4096-byte sectors. Since the drive will not claim this itself, it needs to be done in software (really, Western Digital, did you think reporting the true sector size was not worthy of a jumper? It would have made things much easier). Since ZFS had performance issues on Linux at the time due to its kernel’s license, I won’t cover it in depth; on Linux you can use the ashift parameter, which makes this task easier (see the update below). Otherwise, the primary place I’ve seen this issue raised is from those using FreeNAS on FreeBSD. Here there are two solutions, depending on whether you want your drives encrypted or not. Both solutions give a new device, which the rest of the OS sees as a normal drive, but one with 4096-byte sectors that are properly aligned. Warning: both of these methods destroy the data on the drive.

Not Encrypted

The drive can be presented as a drive with 4096-byte sectors by using gnop. Here is how one would create a raidz1 pool for da1-4:

for i in da1 da2 da3 da4; do gnop create -S 4096 $i; done
zpool create tank raidz1 da1.nop da2.nop da3.nop da4.nop

The for loop must be run on each reboot; otherwise, ZFS will not see the da*.nop devices.

Encrypted

The encryption setup is similar:

for i in da1 da2 da3 da4; do geli init -s 4096 $i; done
for i in da1 da2 da3 da4; do geli attach $i; done
zpool create tank raidz1 da1.eli da2.eli da3.eli da4.eli

Here the second for loop needs to be executed on each reboot. On FreeNAS, the encryption can be made simpler: set up the drives in the GUI to use encryption, then, after the encryption is set up, go into the shell and execute the two for loops, and then continue setting up the encrypted drives in the GUI to use ZFS.
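
After a reboot, reattaching the volumes and importing the pool would look something like this (assuming the pool is named tank as above; geli attach will prompt for the passphrase on each drive):

for i in da1 da2 da3 da4; do geli attach $i; done
zpool import tank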

Update: ZFS on Linux

ZFS on Linux has a parameter that can be applied during pool creation which accomplishes the 4096-byte sector size without jumping through a bunch of hoops. You simply add the ashift parameter on pool creation, as seen in their FAQ:

zpool create -o ashift=12 tank sda sdb sdc sdd
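
The same flag works with the raidz1 layout used earlier in this post (the device names here are just illustrative):

zpool create -o ashift=12 tank raidz1 sda sdb sdc sdd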

So, that’s how to handle ZFS on WD EARS drives. I’ve read reports that some, if not all, of what I described in the solutions will be in the next version of FreeNAS. Here’s hoping it will be.

BTW, the primary legacy operating system that cannot properly handle 4096-byte sectors is Windows XP.

Update

For those going the unencrypted route, I’d like to modify my suggestion above: use gnop to make one disk use 4096-byte sectors, create the pool, export it, destroy the gnop device, then import the pool:

gnop create -S 4096 da1
zpool create tank raidz1 da1.nop da2 da3 da4
zpool export tank
gnop destroy da1.nop
zpool import tank

With the above commands, only one drive is presented with 4096-byte sectors when the pool is created. ZFS/zpool is smart enough to create the pool with a sector size matching the largest sector size across the disks, and the sector size is part of the pool’s metadata. Additionally, through the export and import, zpool detects the device change from da1.nop back to da1, eliminating the need to do anything upon boot. The above is a one-time setup and never needs to be done again.
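
To verify that the trick worked, you can check the ashift recorded in the pool’s metadata (tank is the pool name from above); it should report 12, since 2^12 = 4096:

zdb -C tank | grep ashift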


Legacy Comments:

Remounting geli and ZFS partitions | Veloce - Jan 23, 2012

[…] I have 4 Western Digital disks with data – all in RAID-Z. And because they are Western Digital, they have 4K sectors and don’t work well with ZFS. So I also have encrypted them with geli – a standard trick. […]

Frederic Kinnaer - Sep 9, 2012

Hi, Thanks a lot for this useful information! I’ll be trying it out with FreeNAS 8.2.0. If I understand correctly, I would have to put NO jumper on the drives?

Graham Booker - Sep 9, 2012

Frederic, correct, you don’t need the jumper. In fact, setting the jumper when you are using an entire disk for ZFS will be detrimental to performance. If you are going the unencrypted route, I’d suggest one modification to what I wrote here originally. See the update I added after seeing your comment.

Francois - Nov 30, 2012

Hello, How would you do “Here the second for loop needs to be executed on each reboot.” before the ZFS mountpoints get automatically mounted? Regards

Graham Booker - Nov 30, 2012

Francois, there are many routes for attaching an encrypted volume after reboot and so I’d suggest you Google search those. Obviously, a fully automated manner without user interaction completely defeats the point of using encryption, so you’d be looking at manually doing something to mount the encrypted volume after each boot.

Kris - Jul 12, 2013

You can now use the ashift parameter during pool creation. An ashift=12 gives you 4096-byte block sizes. http://zfsonlinux.org/faq.html#HowDoesZFSonLinuxHandlesAdvacedFormatDrives

Graham Booker - Jul 12, 2013

Kris, yes, today you can use the ashift parameter. Unfortunately that parameter wasn’t available when this post was written. Also, at the time, the primary Linux ZFS implementation was FUSE, which was quite slow, which is why I skipped the platform in my explanations. Today, using the ashift parameter is certainly the way to go. In fact, I’d recommend doing it on non-EARS drives as well, because drives will likely shift to 4096-byte sectors in the future and a zpool’s ashift can’t be changed after it has been created.

Kris - Jul 13, 2013

Hi Graham, Thanks for the (quick) response. I meant no disrespect as this is an outstanding article. Your post is still a (top) relevant search result about using ZFS on WD green drives so I only meant to provide some more current information for anyone who stumbles across this page (like I did). I did not mean to come across as pretentious (if I did). This page provided one piece of a larger puzzle. I understand this article is now dated and as you said didn’t touch on linux because the primary means for using ZFS on linux at the time was fuse. Thanks again for taking the time to explain the WD green architecture and why aligning the ZFS file system is important.

Kris - Jul 17, 2013

To add further information: I ran into an issue with ZFS on Linux while adding multiple vdevs where the ashift parameter doesn’t get applied properly to all the vdevs. There is a work-around where they added the ashift parameter to the ‘zpool add’ and ‘zpool attach’ commands. See: https://github.com/zfsonlinux/zfs/issues/566

Graham Booker - Jul 15, 2013

Kris, I didn’t see your comment as disrespectful or pretentious but rather informative, so no worries. Since people still run across this post, I added a quick update for linux. For the record, I currently run my ZFS pool on linux since I was getting frustrated trying to run certain tools on FreeBSD (such as Handbrake and MakeMKV). When I created the backup filesystem after switching to linux, I used the ashift parameter.