ZFS on Western Digital EARS drives
Jun 16th, 2010 by Graham Booker
According to the stats, my previous post was one of the more popular on this site. It was written in response to a question I was asking myself before building a NAS box at home. In looking at the components to use in building it, I came across another question: how does one fix the performance of ZFS on Western Digital’s green drives with model numbers ending in “EARS” (WD15EARS, WD20EARS, etc.)? I’ve split this into sections, each with a bold title, so readers can skip to the parts that interest them most. I describe why WD changed their drives, why this is a problem, and what the solutions are. Hope you enjoy this.
Background
First, some background on how drives work internally: A hard drive stores data on the platters in blocks, or sectors. The sector size is typically 512 bytes for a magnetic disk. With each sector, the disk must store a start code and an Error Correction Code (ECC). The start code is necessary so the disk can find a particular sector, and the ECC is necessary to correct any errors in the data read from the disk.
Western Digital decided to change from 512-byte sectors to 4096-byte sectors. This means that one new sector can store the same amount of data as 8 of the previous sectors. Since each sector needs a start code, there are now 7 fewer start codes for each new sector.
The ECC of the 4096-byte sector needs to be larger than the ECC of a 512-byte sector, but it doesn’t need to be 8 times as big. This is because ECC is better at correcting errors with larger block sizes, so an ECC that takes up a smaller percentage of the sector can achieve the same Bit Error Rate (BER). Interesting side note: hard drives are designed to store data so compactly and so close to the limits of the hardware that data is expected to have some corruption as it is stored, and they rely on the ECC to correct the corruption. In fact, the manufacturers intentionally design the drives so that they will have a few corruptions, but few enough that the ECC can handle them.
With the reduction in the number of start codes and the proportionally smaller ECC, larger sector sizes allow for greater density of data on the drive itself.
Writing Sectors
When a computer does a read, it reads an entire sector at a time. When it does a write, it again writes an entire sector at a time. There is no mechanism to write only a part of a sector. Western Digital’s EARS drives have 4096-byte sectors, but the drive claims it has 512-byte sectors (since some operating systems still in use cannot handle different sector sizes). This means that the drive does some work internally to handle this difference. Often, when writing data to a hard drive, multiple consecutive sectors are written. If a computer writes 8 512-byte sectors in a row, which all correspond to the same 4096-byte sector, then the drive can simply wait until all 8 sectors are written and write out all the data as a single 4096-byte sector. This is what the EARS drives do when sectors are written. If not all 8 sectors are written, then the drive has to do a partial sector write. Partial writes are not the typical case, since many file systems operate on 4096-byte blocks and thus write 4096 bytes at a time.
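The mapping between the 512-byte sectors the drive reports and the 4096-byte sectors it actually stores can be sketched with a little shell arithmetic. This is only an illustration of the bookkeeping described above, not what the drive firmware really runs; the function names `physical_sector` and `write_kind` are made up for this example.

```shell
#!/bin/sh
# Sector sizes as described in the text.
LOGICAL=512                        # size the drive reports
PHYSICAL=4096                      # size the drive really uses
PER_PHYS=$((PHYSICAL / LOGICAL))   # 8 logical sectors per physical sector

# Print the physical sector that holds a given logical sector number.
physical_sector() {
    echo $(( $1 / PER_PHYS ))
}

# Print "full" if a write of COUNT (>= 1) logical sectors starting at
# logical sector START covers only whole physical sectors, else "partial".
write_kind() {
    start=$1
    count=$2
    if [ $((start % PER_PHYS)) -eq 0 ] && [ $((count % PER_PHYS)) -eq 0 ]; then
        echo full
    else
        echo partial
    fi
}

physical_sector 17   # logical sector 17 lives in physical sector 2
write_kind 16 8      # full: exactly covers physical sector 2
write_kind 16 3      # partial: only 3 of the 8 logical sectors are written
```

A "full" write can go straight to the platter, while a "partial" one triggers the extra work described in the next section.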
Partial Sector Write
If not all 8 512-byte sectors in a 4096-byte sector are written, then the drive has to write part of a sector. So, the drive must find the data on the drive, and write the part of the sector that has changed, right?
Wrong. The problem is the ECC was written with the 4096 bytes that were there previously, so it must be updated. Due to the nature of how the ECC is computed, one cannot write a new ECC without first reading at least some of the data that was in the original sector. While it is possible to read the data to be overwritten, the old ECC, and compute the new ECC, this can propagate errors beyond the ECC’s ability to correct. So, the drive must read the 4096-byte sector, correct any data errors in the sector, substitute in the changed data, compute the new ECC, wait for that part of the drive to spin around again, and write the data to the drive. In comparison, if all 8 512-byte sectors were written, then the drive need only compute the new ECC and write the data.
So, if a single 512-byte sector is written, the drive must read 4096 bytes, wait for a rotation, and then write 4096 bytes. On a 7200 RPM drive, waiting for a single rotation takes 8.333 ms. If one assumes no smart scheduling, so that the drive rewrites one 4096-byte sector per rotation, this results in a maximum throughput of roughly a megabyte every 2 seconds, which is reminiscent of drives from the previous decade. Obviously, intelligent scheduling is utilized, but even in the best case scenario, the drive’s performance will be cut in half. From this, one can see that it is crucial to always write 4096 bytes at a time to get the best performance.
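The worst-case figure can be checked with integer arithmetic. This is a simplified model that assumes exactly one read-modify-write completes per rotation and ignores seek time and any scheduling; the variable names are just for illustration.

```shell
#!/bin/sh
RPM=7200
PHYSICAL=4096   # bytes rewritten on the platter per partial write
LOGICAL=512     # bytes of new data carried by a single-sector write

# Time for one rotation in microseconds (60,000,000 us per minute / RPM).
rotation_us=$((60000000 / RPM))

# Assumed worst case: one read-modify-write finishes per rotation.
rotations_per_sec=$((RPM / 60))

# Bytes physically rewritten per second.
physical_bps=$((rotations_per_sec * PHYSICAL))

# Bytes of genuinely new data per second if every write is one 512-byte sector.
useful_bps=$((rotations_per_sec * LOGICAL))

echo "rotation time:    ${rotation_us} us"
echo "physical bytes/s: ${physical_bps}"
echo "useful bytes/s:   ${useful_bps}"
```

Under these assumptions the drive rewrites 491520 bytes per second, which is about a megabyte every 2 seconds, but only 61440 bytes per second of that is new data when each write is a lone 512-byte sector.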
Alignment
As stated in the previous section, the best performance comes when 4096 bytes are written at a time, but this data must also be properly aligned. If 2048 bytes are written to one sector and 2048 to the next, then this involves 2 reads and 2 writes of 4096 bytes each. So all writes must be aligned on 8-sector boundaries (sector 0, 8, 16, 24, 32, etc.). This is why Western Digital put in a jumper to help with the alignment, since some formatting methods start on sector 63, which is not on an 8-sector boundary. Those on Unix/Linux systems must pay special attention to this when formatting the drive.
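To see why a start at sector 63 hurts, here is a small sketch that counts how many 4096-byte sectors a write touches. The helper name `sectors_touched` is hypothetical, and sector numbers are the 512-byte ones the drive reports.

```shell
#!/bin/sh
PER_PHYS=8   # 512-byte logical sectors per 4096-byte physical sector

# Print the number of physical sectors touched by a write of COUNT (>= 1)
# logical sectors starting at logical sector START.
sectors_touched() {
    first=$(( $1 / PER_PHYS ))
    last=$(( ($1 + $2 - 1) / PER_PHYS ))
    echo $(( last - first + 1 ))
}

sectors_touched 0 8    # 1: an aligned 4096-byte write
sectors_touched 4 8    # 2: the 2048 + 2048 split described above
sectors_touched 63 8   # 2: first 4096-byte block of a partition starting at 63
sectors_touched 64 8   # 1: same block if the partition starts at 64 instead
```

With a partition starting at sector 63, every 4096-byte file system block lands across two physical sectors, so every write becomes two partial sector writes.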
Solutions
The solutions depend on the file system you are using. If you know your file system uses 4096-byte blocks, then you only need to ensure that the file system is properly aligned on 8-sector boundaries. So, when partitioning the drive, ensure that the partition offsets are divisible by 8.
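Checking a partition offset, or nudging it up to the next boundary, is simple arithmetic. The sketch below uses made-up helper names (`is_aligned`, `align_up`) and assumes the offset is given in 512-byte sectors, which is what most partitioning tools report.

```shell
#!/bin/sh
PER_PHYS=8   # 512-byte logical sectors per 4096-byte physical sector

# Succeed (exit status 0) if the given starting sector is divisible by 8.
is_aligned() {
    [ $(( $1 % PER_PHYS )) -eq 0 ]
}

# Print the given starting sector rounded up to the next 8-sector boundary.
align_up() {
    echo $(( ($1 + PER_PHYS - 1) / PER_PHYS * PER_PHYS ))
}

for start in 63 64 2048; do
    if is_aligned "$start"; then
        echo "$start: aligned"
    else
        echo "$start: misaligned, next boundary is $(align_up "$start")"
    fi
done
```

For the traditional starting sector of 63, the next usable boundary is sector 64.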
ZFS
So, what does this have to do with ZFS? ZFS doesn’t have a pre-set block size. It uses variable-sized blocks depending on the amount of data it is writing. If it’s writing 1000 bytes, then it will write the minimum number of sectors necessary to fit that data, which in the case of 512-byte sectors is 2. On an EARS drive, writing those 1000 bytes therefore requires reading and writing 4096 bytes, or maybe 8192 depending on alignment. As a result, properly aligning the partition will not solve the issue with ZFS. Here we need a different solution.
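The 1000-byte case works out as follows. This is only the arithmetic from the paragraph above, not anything ZFS itself exposes; `sectors_needed` and `bytes_rewritten` are names invented for this sketch.

```shell
#!/bin/sh
PHYSICAL=4096
PER_PHYS=8   # 512-byte logical sectors per 4096-byte physical sector

# Print how many sectors of SIZE bytes are needed to hold BYTES bytes
# (ceiling division). Usage: sectors_needed BYTES SIZE
sectors_needed() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# Print how many bytes the drive must read and rewrite for a write of
# COUNT (>= 1) logical sectors starting at logical sector START.
bytes_rewritten() {
    first=$(( $1 / PER_PHYS ))
    last=$(( ($1 + $2 - 1) / PER_PHYS ))
    echo $(( (last - first + 1) * PHYSICAL ))
}

sectors_needed 1000 512    # 2 sectors when ZFS believes sectors are 512 bytes
sectors_needed 1000 4096   # 1 sector when ZFS believes sectors are 4096 bytes
bytes_rewritten 0 2        # 4096: both logical sectors share a physical sector
bytes_rewritten 7 2        # 8192: the two logical sectors straddle a boundary
```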
File System Agnostic Solution
The best solution is for the file system to think the drive has 4096-byte sectors. Since the drive will not claim this, it needs to be done in software (really, Western Digital, did you think this was not worthy of a jumper on the drive? It would have made things much easier). Since ZFS has performance issues on Linux anyway due to its kernel’s license, I won’t cover it at all. The primary case where I’ve seen this issue raised is from those using FreeNAS on FreeBSD. Here, there are two solutions, depending on whether you want your drive encrypted or not. Both of these solutions give a new device, which the rest of the OS sees as a normal drive, but with 4096-byte sectors that are properly aligned. Warning: both of these methods destroy the data on the drive.
Not Encrypted
The drive can be presented as a drive with 4096-byte sectors by using gnop. Here is how one would create a raidz1 pool for da1-4:
for i in da1 da2 da3 da4; do gnop create -S 4096 $i; done
zpool create tank raidz1 da1.nop da2.nop da3.nop da4.nop
The for loop must be run on each reboot; otherwise ZFS will not see the da*.nop devices.
Encrypted
The encryption setup is similar:
for i in da1 da2 da3 da4; do geli init -s 4096 $i; done
for i in da1 da2 da3 da4; do geli attach $i; done
zpool create tank raidz1 da1.eli da2.eli da3.eli da4.eli
Here the second for loop needs to be executed on each reboot. On FreeNAS, the encryption can be made simpler by setting up the drives in the GUI to use encryption. After the encryption is set up, go into the shell and execute the two for loops, then continue setting up the encrypted drives in the GUI to use ZFS.
So, that’s how to handle ZFS on WD EARS drives. I’ve read reports that some, if not all, of what I described in the solutions will be in the next version of FreeNAS. Here’s hoping that it will.
BTW, the primary legacy Operating System that cannot properly handle the 4096-byte sector is Windows XP.