Record and Reverie

General things I find interesting

ZFS on different sized disks

Apr 7th, 2010 by Graham Booker

Note: Following this is not for the faint of heart. If you aren’t comfortable with partitioning, then don’t follow the steps here.
I’ve read many posts on how to handle ZFS/Raid-Z on differently sized disks. The goal is to gain the most disk space availability while still retaining the redundancy of surviving a single disk failure. The posts I’ve read either would achieve the theoretical capacity, or be capable of expansion, but not both. I devised a way to get both at the same time, and it’s relatively simple.

The problem is the following: A Raid-Z configuration uses n partitions, giving the user the capacity of n-1 of those partitions, with the nth providing the redundancy needed to survive a failure. If the n partitions are not the same size, with the smallest being x, only the first x bytes of each partition are used. One cannot remove a Raid-Z from an active pool without a backup/restore. One cannot add a disk to a Raid-Z without a backup/restore (maybe in the future). The only expansion that can be done is to replace the partitions with larger ones; once the smallest partition is increased, the available space increases.
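As a rough illustration of the sizing rule, here is a minimal Python sketch that computes the usable capacity of a single Raid-Z from its partition sizes. The function name `raidz_capacity` is made up for this example, and the sketch ignores the metadata and padding overhead a real pool would have.

```python
def raidz_capacity(partition_sizes):
    """Usable size of a single-parity Raid-Z built from the given partitions.

    Only the first x units of each partition are used, where x is the
    smallest partition, and one partition's worth goes to redundancy.
    Filesystem overhead is ignored (an assumption of this sketch).
    """
    if len(partition_sizes) < 2:
        raise ValueError("need at least two partitions")
    smallest = min(partition_sizes)
    return (len(partition_sizes) - 1) * smallest


# Four whole disks of different sizes in one Raid-Z: limited by the 125G disk.
print(raidz_capacity([125, 250, 500, 750]))  # 375
```

Putting whole disks of different sizes into one Raid-Z wastes everything beyond the smallest disk, which is the gap the rest of this post tries to close.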

For the purposes of demonstrating this technique, I will use an example with 4 disks:

  • 125G
  • 250G
  • 500G
  • 750G

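The biggest theoretical capacity of the array, while retaining single disk failure resiliency, is simply the total size of the disks minus the largest disk. This means that the capacity of the above example is 875G. So, how does one achieve this capacity? I propose the following structure using 3 raids: a green raid of four 125G partitions (one on each disk), a red raid of three 125G partitions (one on each of the 250G, 500G, and 750G disks), and a blue raid of two 250G partitions (one on each of the 500G and 750G disks).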

If you add the capacity in the above, you can see it is 3*125G + 2*125G + 250G = 875G, which is the theoretical capacity.
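The same numbers can be derived mechanically. This is a minimal sketch, assuming each raid uses a slice equal to the size step between one disk and the next smaller one, placed on every disk at least that large. The names `theoretical_capacity` and `layered_layout` are hypothetical, and sizes are in G.

```python
def theoretical_capacity(disk_sizes):
    """Total size minus the largest disk (single-failure resiliency)."""
    return sum(disk_sizes) - max(disk_sizes)


def layered_layout(disk_sizes):
    """Return a list of (slice_size, width, usable) tuples, one per raid.

    Assumption of this sketch: raid k uses a slice equal to the step between
    disk k and the next smaller disk, on every disk at least as big as disk k.
    The step up to the largest disk has width 1 and is left unused.
    """
    sizes = sorted(disk_sizes)
    raids = []
    previous = 0
    for index, size in enumerate(sizes):
        slice_size = size - previous
        width = len(sizes) - index
        previous = size
        if slice_size == 0 or width < 2:
            continue  # no new space at this step, or no partner for redundancy
        raids.append((slice_size, width, (width - 1) * slice_size))
    return raids


disks = [125, 250, 500, 750]
for slice_size, width, usable in layered_layout(disks):
    print(f"{width} x {slice_size}G -> {usable}G usable")
print(sum(usable for _, _, usable in layered_layout(disks)))  # 875
print(theoretical_capacity(disks))                            # 875
```

With equally sized disks the sketch collapses to a single raid, so the layering only pays off when the disks actually differ.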

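Suppose I wish to replace the smallest disk, 125G, with a new disk, say 1T. Using the same layout, all three raids would use 250G partitions: green on all four disks, red on the 500G, 750G, and 1T disks, and blue on the 750G and 1T disks.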


If you add the capacity in the above, you can see it is 3*250G + 2*250G + 250G = 1500G, which again is the theoretical capacity.

The question is, can I migrate from the first configuration to the last without a backup/restore process? The answer is YES, and data is moved/copied at most once.

For the purposes of demonstration, I’m going to show the expansion with 5 drives connected, but the expansion can be done by immediately replacing the smallest drive with the largest and relying on the redundancy to keep things intact in the process.

First, connect the 1T drive and partition it as 250G, 250G, 250G, leaving the rest free. Replace the first 125G partition on the 750G disk with the first 250G partition on the 1T disk:

Continue the replacement:

Repeat the same process with the 3 partitions on the 500G disk:

At this point, the move of the blue raid is complete. Depending on the sizes of the disks, the blue raid may be bigger, at which point the pool immediately increases in size. In this example, it does not change size.

Repeat with the 2 partitions on the 250G disk:

At this point, the red raid is complete. In this example, the red raid has changed from 250G to 500G, and the job isn’t complete yet!

Repeat for the final time with the partition on the 125G disk:

Now we are done, and can disconnect the 125G disk if not done already. In this example, the green raid has changed from 375G to 750G, yielding a total change of 875G to 1500G.
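The order of replacements can be written out as a short plan. This is only a sketch of the sequence described in this post: `migration_steps` is a hypothetical helper, raids are numbered from the widest (0, green in this example) to the narrowest, disks are identified by their size in G, and nothing here runs any ZFS commands.

```python
def migration_steps(old_sizes, new_size):
    """List (raid, source_disk, target_disk) moves, largest disk first.

    Assumes the layered layout from this post: with n disks sorted by size,
    disk i (0-based) holds one partition for each of raids 0..min(i, n-2).
    Every partition moves to the next larger disk; the partitions of the
    largest old disk move to the new disk. Disks are identified by size,
    so this sketch assumes all sizes are distinct.
    """
    sizes = sorted(old_sizes)
    count = len(sizes)
    targets = sizes[1:] + [new_size]
    steps = []
    for i in reversed(range(count)):
        for raid in range(min(i, count - 2) + 1):
            steps.append((raid, sizes[i], targets[i]))
    return steps


for raid, source, target in migration_steps([125, 250, 500, 750], 1000):
    print(f"raid {raid}: replace partition on {source}G disk "
          f"with partition on {target}G disk")
```

Every partition appears exactly once as a source, which is why data is moved at most once. In practice each line would presumably become one `zpool replace` of the old partition with the new one; that mapping is an assumption, since the post does not list commands.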

Note: There is one limitation with this structure. Ordering the disks in increasing size, disk(n) must be at least as big as 2*(disk(n-1))-disk(n-2). This essentially means that if a disk is xG bigger than its previous disk, then the next disk must be at least xG bigger than this disk. Since disk sizes tend to grow exponentially and the requirement is only at least linear growth, this shouldn’t be much of a problem.

The disk sizes in this example meet the requirement I listed above. The same requirement also sets the minimum size of the disk added in this example: since disk 4 is 250G bigger than disk 3, disk 5 must be at least 250G bigger than disk 4, or 1000G.
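The requirement is easy to check before buying a disk. A minimal sketch, with hypothetical function names and sizes in G:

```python
def meets_growth_requirement(disk_sizes):
    """Check disk(n) >= 2*disk(n-1) - disk(n-2) for disks in increasing size."""
    sizes = sorted(disk_sizes)
    return all(
        sizes[i] >= 2 * sizes[i - 1] - sizes[i - 2]
        for i in range(2, len(sizes))
    )


def minimum_next_disk(disk_sizes):
    """Smallest size a new largest disk may have under the same rule."""
    sizes = sorted(disk_sizes)
    if len(sizes) < 2:
        raise ValueError("need at least two existing disks")
    return 2 * sizes[-1] - sizes[-2]


print(meets_growth_requirement([125, 250, 500, 750]))  # True
print(minimum_next_disk([125, 250, 500, 750]))         # 1000
```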

This technique also allows adding another disk, under the same conditions. Consider adding a 1T disk and creating another Raid-Z:

I didn’t structure the partitions in nice pretty rows to demonstrate the fact that not all of the data needs to be moved (in this case, 750G of data is not moved). In this example, the capacity increased from 875G to 1625G, again the theoretical capacity.

Anyway, I thought of this while considering using ZFS/Raid-Z in a FreeNAS setup. I haven’t tested any of this; it’s theoretical only. What do you think?

Tags: ZFS

Posted in General

18 Responses to “ZFS on different sized disks”

  1. on 09 Apr 2010 at 3:39 am, vlhorton

    I think it is genius! Have you looked at the Sun open storage stuff?

  2. on 09 Apr 2010 at 1:20 pm, Graham Booker

    vlhorton,
    No, I hadn’t, but I did after seeing your suggestion. Since I’m looking for a storage solution for myself, Sun’s open storage is too expensive for what you get. In my case, I don’t need extremely high performance, but rather cheap storage.

  3. on 29 Mar 2011 at 6:39 am, Steffen

    Hi,
    Nice idea! Am I right that in the first example you basically end up with 3 raidzs? Do you combine these, or are these different partitions later?

    Thanks

  4. on 29 Mar 2011 at 1:32 pm, Graham Booker

    Steffen,
    Yes, it is three raidzs, on different partitions. It remains three raidzs that are part of the same pool.

  5. on 30 Mar 2011 at 6:54 pm, Steffen

    Thanks.

    Looks good, although my collection of discs is 3×500 and one 1.5 TB. I’ll probably try it out.

  6. on 30 Mar 2011 at 6:58 pm, Graham Booker

    With 3×0.5T + 1×1.5T, you can really only do a single raidz, with 0.5T on each disk. There’s no advantage to the technique in this post. If you replace a disk with a larger one, then you start to see an advantage.

  7. on 22 May 2011 at 1:03 am, Tobias

    I notice that it’s been a while since you wrote this post. Have you tested it yet? Perhaps in a virtual machine?

    Your idea seems nothing but brilliant, and I love this kind of thought experiment. The reason I found this post is that I am thinking of just a setup like this for myself: a FreeNAS node using the (differently sized) disks that I already have.

    Now, I’m sure that it can be looked up somewhere, but I think some info is hard to interpret, so I’ll ask: Can an existing RAIDZ partition be extended to a larger number of disks? E.g., in the first example above, could the blue partition be a three-way RAIDZ (given that another HDD was added, obviously), without destroying the existing RAIDZ? It would be rather neat to be able to add, instead of 1TB, 2x1TB which both hold one green, one red and one blue RAIDZ “slice” each.

  8. on 23 May 2011 at 12:13 am, Graham Booker

    Tobias,
    I have tested this inside a VM, but that is the extent of what I’ve tried.
    To answer your question, you cannot change the width of an existing RaidZ (called a vdev). You can only increase the size of each individual slice and add additional vdevs. You cannot decrease the size of a vdev, change its width, or remove a vdev from the pool.

  9. on 29 Aug 2012 at 1:38 am, zpool hasn't expanded - The UNIX and Linux Forums

    […] See ZFS on different sized disks. […]

  10. on 16 Dec 2012 at 1:06 pm, Amedee Van Gasse

    This is a clever config, but doesn’t ZFS prefer direct access to the disk over partitions? That’s how I understand the documentation.
    I am considering a similar partition scheme, but with ext4 or btrfs.

  11. on 17 Dec 2012 at 12:05 am, Graham Booker

    ZFS is perfectly happy to work in partitions. I think that using a whole disk allows a slight optimization, and ZFS can operate without a partition, unlike many other filesystems.

  12. on 05 Feb 2014 at 11:55 am, tom

    Graham,
    Interesting read. I’m looking at doing this for the first time for myself, as I’m running separate disks with no redundancy! (Although I have a manual backup.)

    I know this is an old post now, but it still seems relevant. In relation to the guy who has 3 x 500gb and 1 x 1.5tb, why can’t he use 500gb partitions and effectively use the 500gb drives as 1 drive against the 1.5tb?

    If the 1.5tb fails then the data is on the 3x500gb and if one of the 500gb fails then the data is on the 1.5tb?

    thx,

    tom,

  13. on 06 Feb 2014 at 7:48 pm, Graham Booker

    Tom, you are correct in banding together 3x500G into a 1.5T and then mirroring that with another 1.5T to achieve redundancy, but these two actions cannot be done in ZFS simultaneously. Doing a raidz across the 3x500G and 1.5T does achieve 1.5T of data that’s protected against a single drive failure, and leaves 1T on the large disk unused which could be re-purposed for other means.

  14. on 07 Feb 2014 at 7:57 am, tom

    Ah ok, thanks for your input and brilliant article. 🙂

  15. on 01 Feb 2015 at 1:33 am, Kirk

    How does this translate to RAIDZ3?

  16. on 02 Feb 2015 at 2:29 pm, Graham Booker

    Kirk:
    The techniques and conclusions listed here would expand to raidz3 and work mostly the same as they do with raidz1, just wider. The primary difference is that the smallest vdev would contain 4 disks and the total capacity would be the sum of the disks minus the 3 largest. I should stress, though, that if you are considering raidz3, it sounds like you are investing enough money that you’d be far better off buying equal sized disks and creating a single vdev.

  17. on 06 Feb 2017 at 10:21 am, Anonymous

    Doesn’t a raidz require 3 drives/partitions minimum? Your “blue” raid would be a mirror I guess?

  18. on 06 Feb 2017 at 2:06 pm, Graham Booker

    If one were to look at how the algorithm for raidz would work for a 2 disk vdev, it would behave identically to a mirror. So whether a 2 disk raidz is allowed or the user must specify a mirror, the end result is the same. So, yes, the “blue” vdev would be a mirror.

