ZFS on Western Digital EARS drives

Jun 16th, 2010 by Graham Booker

According to the stats, my previous post is one of the more popular on this site. It was in response to a question I was asking myself before building a NAS box at home. In looking at the components to use in building it, I came across another question: how does one fix the performance of ZFS on Western Digital’s green drives with model numbers ending in “EARS” (WD15EARS, WD20EARS, etc.)? I’ve split this post into sections, each with a bold title, so readers can skip to the parts that interest them most: why WD changed their drives, why this is a problem, and what the solutions are. I hope you enjoy it.

Background
First, some background on how drives work internally: A hard drive stores data on the platters in blocks, or sectors. The sector size is typically 512 bytes for a magnetic disk. With each sector, the disk must store a start code and an Error Correction Code (ECC). The start code is necessary so the disk can find a particular sector, and the ECC is necessary to correct any errors in the data read from the disk.
Western Digital decided to change from 512 byte sectors to 4096 byte sectors. This means that the new sector can store the same amount of data as 8 of the previous sectors. Since each sector needs a start code, this means there are now 7 fewer start codes for each new sector.
The ECC of a 4096-byte sector needs to be larger than the ECC of a 512-byte sector, but it doesn’t need to be 8 times as big. This is because ECC corrects errors more efficiently over larger block sizes, so a lower-rate ECC (one that is a smaller percentage of the data size) can achieve the same Bit Error Rate (BER). Interesting side note: hard drives are designed to store data so compactly, and so close to the limits of the hardware, that the data is expected to pick up some corruption as it is stored, relying on the ECC to correct that corruption. In fact, the manufacturers intentionally design the drives so that they will have a few corruptions, but few enough that the ECC can handle them.
With the reduction in the number of start codes and the decrease in ECC rate, larger sector sizes allow for a greater density of data on the drive itself.
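To make the density argument concrete, here is a back-of-the-envelope calculation. The per-sector overhead figures below are illustrative assumptions for the sake of the arithmetic, not Western Digital’s actual numbers:

# hypothetical overhead per sector: 15 bytes of start code, plus ECC
# eight legacy 512-byte sectors: 8 * (512 + 15 + 50) = 4616 bytes of platter
# one 4096-byte sector with a larger (but not 8x) ECC: 4096 + 15 + 100 = 4211 bytes
echo "scale=3; 4616 / 4211" | bc    # ~1.096: roughly 9% more data in the same space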

Writing Sectors
When a computer does a read, it reads an entire sector at a time, and when it does a write, it likewise writes an entire sector at a time; there is no mechanism to write only part of a sector. Western Digital’s EARS drives have 4096-byte sectors, but the drive claims it has 512-byte sectors (since some operating systems still in use cannot handle different sector sizes). This means the drive does some work internally to handle the difference. Often, when writing data to a hard drive, multiple consecutive sectors are written. If a computer writes 8 512-byte sectors in a row, all of which correspond to the same 4096-byte sector, then the drive can simply wait until all 8 sectors are written and write the data out as a single 4096-byte sector. This is what the EARS drives do when sectors are written. If not all 8 sectors are written, then the drive has to do a partial sector write. This is typically not the case, since many file systems operate on 4096-byte blocks and thus write 4096 bytes at a time.
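As an aside, you can see the sector size a drive claims to have. On FreeBSD, diskinfo prints it (the device name and the output shown here are illustrative); an EARS drive reports 512 bytes even though its physical sectors are 4096 bytes:

diskinfo -v da1
# among the output is a line like:
#     512    # sectorsize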

Partial Sector Write
If not all 8 512-byte sectors in a 4096-byte sector are written, then the drive has to write part of a sector. So, the drive must find the data on the drive, and write the part of the sector that has changed, right?
Wrong. The problem is that the ECC was written over the 4096 bytes that were there previously, so it must be recomputed. Due to the nature of how the ECC is computed, one cannot write a new ECC without first reading at least some of the data that was in the original sector. While it is possible to read only the data to be overwritten and the old ECC and compute the new ECC from those, doing so can propagate errors beyond the ECC’s ability to correct. So the drive must read the 4096-byte sector, correct any errors in it, substitute in the changed data, compute the new ECC, wait for that part of the platter to spin around again, and write the data back to the drive. In comparison, if all 8 512-byte sectors are written, the drive need only compute the new ECC and write the data.
So, if a single 512-byte sector is written, the drive must read 4096 bytes, wait for a rotation, and then write 4096 bytes. On a 7200 RPM drive, waiting for a single rotation takes 8.333 ms. If one assumes no smart scheduling, this results in a maximum throughput of about a megabyte every 2 seconds, which is reminiscent of drives from the previous decade. Obviously, intelligent scheduling is employed, but even in the best-case scenario the drive’s performance is cut in half. From this, one can see that it is crucial to always write 4096 bytes at a time to get the best performance.
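For the curious, the arithmetic behind those numbers:

# 7200 RPM = 7200 / 60 = 120 rotations per second, so one rotation takes
# 1 / 120 s = 8.333 ms; if every 4096-byte read-modify-write costs a full
# rotation of waiting, the ceiling is:
echo "4096 * 120 / 1024" | bc    # 480 KB/s, about a megabyte every 2 seconds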

Alignment
As stated in the previous section, the best performance comes when 4096 bytes are written at a time, but that data must also be properly aligned. If 2048 bytes are written to the end of one physical sector and 2048 to the beginning of the next, this involves 2 reads and 2 writes of 4096 bytes each. So all writes must be aligned on 8-sector boundaries (sector 0, 8, 16, 24, 32, etc.). This is why Western Digital put a jumper on the drive to help with alignment, since some formatting methods start on sector 63, which is not an 8-sector boundary. Those on Unix/Linux systems must pay special attention to this when formatting the drive.

Solutions
The solutions depend on the file system you are using. If you know your file system uses 4096-byte blocks, then you only need to ensure that the file system is properly aligned on 8-sector boundaries. So, when partitioning the drive, ensure that the partition offsets are divisible by 8.
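On FreeBSD, for example, gpart can create and verify aligned partitions. A minimal sketch; the -a flag requires a reasonably recent gpart, and the device name is illustrative:

gpart create -s gpt da1              # GPT avoids the legacy sector-63 partition start
gpart add -t freebsd-zfs -a 4k da1   # start the partition on a 4096-byte boundary
gpart show da1                       # verify: the partition's starting sector is divisible by 8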

ZFS
So, what does this have to do with ZFS? ZFS doesn’t have a pre-set block size; it uses variable-sized blocks depending on the amount of data it is writing. If it’s writing 1000 bytes, it will write the minimum number of sectors necessary to fit that data, which, in the case of 512-byte sectors, is 2. On an EARS drive, that means writing 1000 bytes requires reading and writing 4096 bytes, or possibly 8192, depending on alignment. So properly aligning the partition will not solve the issue with ZFS; here we need a different solution.
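You can see the upper bound on ZFS’s block size, the recordsize property (using the pool name tank from the examples below); beneath that cap the block size varies with each write, which is why alignment alone cannot help:

zfs get recordsize tank    # blocks range from a single sector up to this size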

File System Agnostic Solution
The best solution is for the file system to believe the drive has 4096-byte sectors. Since the drive will not claim this, it needs to be done in software (really, Western Digital, did you think this was not worthy of a jumper on the drive? It would have made things much easier). ZFS had performance issues on Linux due to the kernel’s license, so I originally didn’t cover that platform; on Linux, you can now use the ashift parameter, which makes this task easier (see the update below). Otherwise, the primary case where I’ve seen this issue raised is from those using FreeNAS on FreeBSD. For FreeBSD there are two solutions, depending on whether or not you want your drives encrypted. Both of these solutions present a new device, which the rest of the OS sees as a normal drive, but one with properly aligned 4096-byte sectors. Warning: both of these methods destroy the data on the drive.
Not Encrypted
The drive can be presented as a drive with 4096-byte sectors by using gnop. Here is how one would create a raidz1 pool for da1-4:

# create a transparent .nop provider on top of each disk that reports 4096-byte sectors
for i in da1 da2 da3 da4; do gnop create -S 4096 $i; done
# build the raidz1 pool on the .nop devices so ZFS adopts the 4096-byte sector size
zpool create tank raidz da1.nop da2.nop da3.nop da4.nop

The for loop must be run on each reboot; otherwise, ZFS will not see the da*.nop devices.
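To confirm the trick worked, you can inspect the pool’s ashift, the base-2 logarithm of the sector size ZFS chose; 12 means 4096-byte sectors (the exact output varies by version):

zdb -C tank | grep ashift    # expect: ashift: 12
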
Encrypted
The encryption setup is similar:

# initialize GELI encryption on each disk with a 4096-byte sector size
for i in da1 da2 da3 da4; do geli init -s 4096 $i; done
# attach each disk (this prompts for the passphrase), creating the da*.eli devices
for i in da1 da2 da3 da4; do geli attach $i; done
# build the raidz1 pool on the .eli devices
zpool create tank raidz da1.eli da2.eli da3.eli da4.eli

Here, the second for loop needs to be executed on each reboot. On FreeNAS, this can be made simpler: set up the drives in the GUI to use encryption; after the encryption is set up, go into the shell and execute the two for loops above; then continue in the GUI to set up the encrypted drives with ZFS.
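As a sketch of what each boot then looks like (assuming a passphrase rather than boot-time key files, which would defeat the purpose of the encryption):

# run by hand after each boot; geli attach prompts for the passphrase
for i in da1 da2 da3 da4; do geli attach $i; done
# then import the pool if it was not picked up automatically
zpool import tank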

Update: ZFS On Linux
ZFS on Linux has a parameter that can be applied during pool creation which accomplishes the 4096-byte sector size without jumping through a bunch of hoops. You simply add the ashift parameter on pool creation, as seen in their FAQ. The ashift value is the base-2 logarithm of the sector size, so ashift=12 means 2^12 = 4096 bytes:

zpool create -o ashift=12 tank sda sdb sdc sdd
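If you want the same raidz1 layout as in the FreeBSD examples above, the parameter works the same way (the Linux device names here are for illustration):

zpool create -o ashift=12 tank raidz sda sdb sdc sdd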

So, that’s how to handle ZFS on WD EARS drives. I’ve read reports that some, if not all, of what I described in the solutions will be in the next version of FreeNAS. Here’s hoping it will be.

BTW, the primary legacy operating system that cannot properly handle 4096-byte sectors is Windows XP.

Update
For those going the unencrypted route, I’d like to modify my above suggestion: use gnop to make just one disk report 4096-byte sectors, create the pool, export it, destroy the gnop device, then import the pool:

# make just the first disk report 4096-byte sectors
gnop create -S 4096 da1
# create the pool; ZFS picks the largest sector size among the disks
zpool create tank raidz da1.nop da2 da3 da4
zpool export tank
# remove the .nop provider; on import, ZFS finds the pool on da1 directly
gnop destroy da1.nop
zpool import tank

With the above code, only one drive claims 4096-byte sectors when the pool is created. ZFS is smart enough to create the pool with a sector size matching the largest sector size across the disks, and that sector size is part of the pool’s metadata. Additionally, through the export and import, zpool detects the device change from da1.nop to da1, eliminating the need to do anything upon boot. The above is a one-time setup and never needs to be done again.
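After the import, a quick status check should show the pool using the bare device (the output below is illustrative and trimmed):

zpool status tank
#   tank       ONLINE
#     raidz1   ONLINE
#       da1    ONLINE    (no longer da1.nop)
#       da2    ONLINE
#       ...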

Tags: ZFS

Posted in General

10 Responses to “ZFS on Western Digital EARS drives”

  1. on 23 Jan 2012 at 9:43 pm, Remounting geli and ZFS partitions | Veloce

    […] I have 4 Western Digital disks with data – all in RAID-Z. And because they are Western Digital, they have 4K sectors and don’t work well with ZFS. So I also have encrypted them with geli – a standard trick. […]

  2. on 09 Sep 2012 at 12:07 pm, Frederic Kinnaer

    Hi,

    Thanks a lot for this useful information! I’ll be trying it out with FreeNAS 8.2.0. If I understand correctly, I would have to put NO jumper on the drives?

  3. on 09 Sep 2012 at 1:25 pm, Graham Booker

    Frederic, correct, you don’t need the jumper. In fact, setting the jumper when you are using an entire disk for ZFS will be detrimental to performance.

    If you are going the unencrypted route, I’d suggest one modification to what I wrote here originally. See the update I added after seeing your comment.

  4. on 30 Nov 2012 at 10:35 pm, Francois

    Hello,
    How would you do “Here the second for loop needs to be executed on each reboot.” before the ZFS mountpoints get automatically mounted?

    Regards

  5. on 30 Nov 2012 at 11:31 pm, Graham Booker

    Francois, there are many routes for attaching an encrypted volume after reboot and so I’d suggest you Google search those. Obviously, a fully automated manner without user interaction completely defeats the point of using encryption, so you’d be looking at manually doing something to mount the encrypted volume after each boot.

  6. on 12 Jul 2013 at 12:52 am, Kris

    You can now use the ashift parameter during pool creation. An ashift of 12 gives you 4096-byte block sizes.

    http://zfsonlinux.org/faq.html#HowDoesZFSonLinuxHandlesAdvacedFormatDrives

  7. on 12 Jul 2013 at 2:33 pm, Graham Booker

    Kris, yes, today you can use the ashift parameter. Unfortunately, that parameter wasn’t available when this post was written. Also, at the time, the primary ZFS on Linux was FUSE-based, which was quite slow, which is why I skipped that platform in my explanations. Today, using the ashift parameter is certainly the way to go. In fact, I’d recommend doing it on non-EARS drives as well, because drives will likely shift to 4096-byte sectors in the future, and a zpool can’t change its ashift after it has been created.

  8. on 13 Jul 2013 at 6:35 am, Kris

    Hi Graham,
    Thanks for the (quick) response. I meant no disrespect as this is an outstanding article.

    Your post is still a (top) relevant search result about using ZFS on WD green drives so I only meant to provide some more current information for anyone who stumbles across this page (like I did). I did not mean to come across as pretentious (if I did).

    This page provided one piece of a larger puzzle. I understand this article is now dated and as you said didn’t touch on linux because the primary means for using ZFS on linux at the time was fuse.

    Thanks again for taking the time to explain the WD green architecture and why aligning the ZFS file system is important.

  9. on 15 Jul 2013 at 3:04 pm, Graham Booker

    Kris, I didn’t see your comment as disrespectful or pretentious but rather informative, so no worries. Since people still run across this post, I added a quick update for linux. For the record, I currently run my ZFS pool on linux since I was getting frustrated trying to run certain tools on FreeBSD (such as Handbrake and MakeMKV). When I created the backup filesystem after switching to linux, I used the ashift parameter.

  10. on 17 Jul 2013 at 6:27 pm, Kris

    To add further information: I ran into an issue with ZFS on Linux while adding multiple vdevs where the ashift parameter doesn’t get applied properly to all the vdevs. There is a work-around where they added the ashift parameter to the ‘zpool add’ and ‘zpool attach’ commands.

    See:
    https://github.com/zfsonlinux/zfs/issues/566
