
Google, for quite some time, has been redirecting clicks on links in their search results to www.google.com/url?…. While I don’t approve of such practices, I didn’t mind it so much since this is presumably an effort to improve their search results. That changed recently, when I noticed that my history in Safari was filled with entries containing that URL as the title. Since I often use the history to re-find a page with pertinent information, this borders on making the history useless. Note: I tend to cmd-click links so they show up in new tabs. If you just click the link, the title in the history is correct.

Edit: My solution that was here previously didn’t work properly. I’ve developed a Safari Extension to correct this, and am currently testing it. If it passes the tests, I’ll likely put it up here.

After its original post, I got a request for the code I used in my text compression technique. I’ve not gotten around to cleaning up the code and separating it from its test environment so it can be distributed separately. You can read more about it on its own page.

Sometime soon, I’ll get around to releasing my code for fetching old Escape Pod episodes that I hinted at earlier.

According to the stats, my previous post was one of the more popular on this site. It was a response to a question I was asking myself before building a NAS box at home. In looking at the components to use in building it, I came across another question: How does one fix the performance of ZFS on Western Digital’s green drives with model numbers ending in “EARS” (WD15EARS, WD20EARS, etc.)? I’ve split this into sections, each with a bold title, so readers can read the parts that are most interesting. I’ve described why WD changed their drives, why this is a problem, and what the solutions are. Hope you enjoy this.

Background
First, some background on how drives work internally: A hard drive stores data on the platters in blocks, or sectors. The sector size is typically 512 bytes for a magnetic disk. With each sector, the disk must store a start code and an Error Correction Code (ECC). The start code is necessary so the disk can find a particular sector, and the ECC is necessary to correct any errors in the data read from the disk.
Western Digital decided to change from 512 byte sectors to 4096 byte sectors. This means that the new sector can store the same amount of data as 8 of the previous sectors. Since each sector needs a start code, this means there are now 7 fewer start codes for each new sector.
The ECC of the 4096 byte sector needs to be larger than the ECC of a 512 byte sector, but it doesn’t need to be 8 times as big. This is because ECC is better at correcting errors with larger block sizes, and thus an ECC that is smaller as a percentage of the sector can achieve the same Bit Error Rate (BER). Interesting side note: Hard drives are designed to store data so compactly and so close to the limits of the hardware that data is expected to have some corruption as it is stored, and they rely on the ECC to correct the corruption. In fact, the manufacturers intentionally design the drives so that they will have a few corruptions, but few enough that the ECC can handle them.
With the reduction in the number of start codes and the decrease in relative ECC size, larger sector sizes allow for greater density of data on the drive itself.
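To put rough numbers on that, here is a small sketch of the fraction of each on-disk sector that is actual data. The overhead sizes are my assumptions for illustration only (15 bytes of start code and gaps, 50 bytes of ECC for a 512-byte sector, 100 bytes of ECC for a 4096-byte sector); they are not taken from Western Digital's specifications for the EARS drives.

```shell
#!/bin/sh
# Share of the on-disk sector that is user data, in parts per thousand:
#   data / (data + start code + ECC)
# The overhead sizes used below are assumed, illustrative values.
efficiency_permille() {
    data=$1; start=$2; ecc=$3
    echo $(( data * 1000 / (data + start + ecc) ))
}

echo "512-byte sectors:  $(efficiency_permille 512 15 50) per thousand"
echo "4096-byte sectors: $(efficiency_permille 4096 15 100) per thousand"
```

With these assumed overheads, the larger sector wastes far less of the platter, which is the whole point of the change.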

Writing Sectors
When a computer does a read, it reads an entire sector at a time. When it does a write, it, again, writes an entire sector at a time. There is no mechanism to write only a part of a sector. Western Digital’s EARS drives have 4096-byte sectors, but the drive claims it has 512-byte sectors (since some operating systems still in use cannot handle different sector sizes). This means that the drive does some work internally to handle this difference. Often, when writing data to a hard drive, multiple consecutive sectors are written. If a computer writes 8 512-byte sectors in a row, which all correspond to the same 4096-byte sector, then the drive can simply wait until all 8 sectors are written and write out all the data as a single 4096-byte sector. This is what the EARS drives do when sectors are written. If not all 8 sectors are written, then the drive has to do a partial sector write. This is typically not the case, since many file systems operate on 4096-byte blocks and thus write 4096 bytes at a time.
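The mapping between the 512-byte sectors the drive reports and the 4096-byte sectors it really has can be sketched as follows. The function names are mine, and this only models the arithmetic, not anything the drive firmware actually exposes.

```shell
#!/bin/sh
# Number of reported 512-byte sectors per real 4096-byte sector.
RATIO=8

# Real (physical) sector that holds a given reported (logical) sector.
physical_sector() {
    echo $(( $1 / RATIO ))
}

# Prints "full" if a write of COUNT logical sectors starting at START covers
# only whole physical sectors, and "partial" otherwise.
write_kind() {
    start=$1; count=$2
    if [ "$count" -gt 0 ] && [ $(( start % RATIO )) -eq 0 ] && [ $(( count % RATIO )) -eq 0 ]; then
        echo full
    else
        echo partial
    fi
}

physical_sector 63   # logical sector 63 lives in physical sector 7
write_kind 16 8      # eight sectors starting on a boundary: full
write_kind 16 3      # only three of the eight: partial
```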

Partial Sector Write
If not all 8 512-byte sectors in a 4096-byte sector are written, then the drive has to write part of a sector. So, the drive must find the data on the drive, and write the part of the sector that has changed, right?
Wrong. The problem is the ECC was written with the 4096 bytes that were there previously, so it must be updated. Due to the nature of how the ECC is computed, one cannot write a new ECC without first reading at least some of the data that was in the original sector. While it is possible to read the data to be overwritten and the old ECC, and compute the new ECC from those, this can propagate errors beyond the ECC’s ability to correct. So, the drive must read the 4096-byte sector, correct any data errors in the sector, substitute in the changed data, compute the new ECC, wait for that part of the drive to spin around again, and write the data to the drive. In comparison, if all 8 512-byte sectors were written, then the drive need only compute the new ECC and write the data.
So, if a single 512-byte sector is written, the drive must read 4096 bytes, wait for a rotation, and then write 4096 bytes. On a 7200 RPM drive, a single rotation takes 8.333 ms. If one assumes no smart scheduling, each 4096-byte write costs a full rotation, so at 120 rotations per second the maximum throughput is about 480 KB per second, or roughly a megabyte every 2 seconds, which is reminiscent of drives from the previous decade. Obviously, intelligent scheduling is utilized, but even in the best-case scenario, the drive’s performance will drop to half. From this, one can see that it is crucial to always write 4096 bytes at a time to get the best performance.
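The arithmetic behind those figures is easy to check. This is just a sketch of the worst case described above, assuming every 4096-byte write costs exactly one full rotation and nothing else; the function names are made up for the example.

```shell
#!/bin/sh
# Time for one full rotation, in microseconds, for a drive spinning at RPM.
rotation_us() {
    echo $(( 60000000 / $1 ))
}

# Worst-case throughput in bytes per second, assuming one 4096-byte sector
# is written per rotation (no smart scheduling at all).
worst_case_bytes_per_sec() {
    echo $(( $1 * 4096 / 60 ))
}

echo "rotation: $(rotation_us 7200) us"
echo "worst case: $(worst_case_bytes_per_sec 7200) bytes/s"
```

For 7200 RPM this gives 8333 microseconds per rotation and 491520 bytes per second, which is where the "megabyte every 2 seconds" comes from.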

Alignment
As stated in the previous section, the best performance comes when 4096 bytes is written at a time, but this data must also be properly aligned. If 2048 bytes is written to one sector and 2048 to the next, then this involves 2 reads and 2 writes of 4096 bytes each. So all writes must be aligned on 8-sector boundaries (sector 0, 8, 16, 24, 32, etc.). This is why Western Digital put in a jumper to help with the alignment, since some formatting methods start on sector 63, which is not an 8-sector boundary. Those on Unix/Linux systems must pay special attention to this when formatting the drive.
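To see the cost of misalignment, here is a sketch that counts how many 4096-byte sectors need the read, wait, and write cycle for a given write. It is a model of the reasoning above, not of the drive's real firmware, and rmw_count is a name I made up.

```shell
#!/bin/sh
# Count the 4096-byte physical sectors that need a read-modify-write when
# COUNT (> 0) logical 512-byte sectors are written starting at sector START.
rmw_count() {
    start=$1; count=$2
    end=$(( start + count ))            # first logical sector after the write
    first=$(( start / 8 ))              # first physical sector touched
    last=$(( (end - 1) / 8 ))           # last physical sector touched
    head_partial=0; tail_partial=0
    if [ $(( start % 8 )) -ne 0 ]; then head_partial=1; fi   # begins mid-sector
    if [ $(( end % 8 )) -ne 0 ]; then tail_partial=1; fi     # ends mid-sector
    if [ "$first" -eq "$last" ]; then
        # Only one physical sector is touched; it is partial if either end is.
        if [ $(( head_partial + tail_partial )) -gt 0 ]; then echo 1; else echo 0; fi
    else
        echo $(( head_partial + tail_partial ))
    fi
}

rmw_count 8 8    # aligned 4096-byte write: 0
rmw_count 4 8    # 2048 bytes in one sector, 2048 in the next: 2
rmw_count 63 8   # a 4096-byte block in a partition starting at sector 63: 2
```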

Solutions
The solutions depend on the file system you are using. If you know your file system uses 4096-byte blocks, then you only need to ensure that the file system is properly aligned on 8-sector boundaries. So, when partitioning the drive, ensure that the partition offsets are divisible by 8.
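Checking a partition offset is simple arithmetic. A minimal sketch, with offsets given in 512-byte sectors and helper names of my own choosing:

```shell
#!/bin/sh
# Succeeds if the starting sector is on an 8-sector (4096-byte) boundary.
is_aligned() {
    [ $(( $1 % 8 )) -eq 0 ]
}

# Smallest aligned starting sector that is >= the given sector.
next_aligned() {
    echo $(( ($1 + 7) / 8 * 8 ))
}

if is_aligned 63; then echo "63 is aligned"; else echo "63 is not aligned"; fi
echo "use sector $(next_aligned 63) instead"
```

So a partition that would have started on sector 63 should start on sector 64.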

ZFS
So, what does this have to do with ZFS? ZFS doesn’t have a pre-set block size. It uses variable sized blocks depending on the amount of data it is writing. If it’s writing 1000 bytes, then it will write the minimum number of sectors necessary to fit that data, which in the case of 512-byte sectors, is 2. This means that writing 1000 bytes requires reading and writing 4096 bytes or maybe 8192 depending on alignment. This means that properly aligning the partition will not solve the issue with ZFS. Here we need a different solution.
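The sector count in that example is a round-up division. As a sketch (it ignores ZFS metadata, parity, and compression, and sectors_needed is my own name):

```shell
#!/bin/sh
# Minimum number of sectors of size SECTOR needed to hold BYTES bytes.
sectors_needed() {
    bytes=$1; sector=$2
    echo $(( (bytes + sector - 1) / sector ))
}

echo "1000 bytes on 512-byte sectors:  $(sectors_needed 1000 512) sectors"
echo "1000 bytes on 4096-byte sectors: $(sectors_needed 1000 4096) sector"
```

With 512-byte sectors the write is 2 sectors, which is only a quarter of a real sector on the EARS drives. With 4096-byte sectors it is exactly one whole sector.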

File System Agnostic Solution
The best solution is for the file system to think the drive has 4096-byte sectors. Since the drive will not claim this, it needs to be done in software (really Western Digital, did you think this was not worthy of a jumper on the drive? It would have made things much easier). Since ZFS has performance issues on Linux anyway due to its kernel’s license, I won’t cover it at all. The primary case where I’ve seen this issue raised is from those using FreeNAS on FreeBSD. Here, there are two solutions, depending on whether you want your drive encrypted or not. Both of these solutions give a new device, which the rest of the OS sees as a normal drive, but the drive has 4096-byte sectors, which are properly aligned. Warning: both of these methods destroy the data on the drive.
Not Encrypted
The drive can be presented as a drive with 4096-byte sectors by using gnop. Here is how one would create a raidz1 pool for da1-4:
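# Create a .nop device with 4096-byte sectors on top of each drive
for i in da1 da2 da3 da4; do gnop create -S 4096 "$i"; done
# Build the raidz1 pool from the .nop devices
zpool create tank raidz1 da1.nop da2.nop da3.nop da4.nop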

The for loop must be run on each reboot; otherwise, ZFS will not see the da*.nop devices.
Encrypted
The encryption setup is similar:
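# Initialize encryption on each drive with 4096-byte sectors
for i in da1 da2 da3 da4; do geli init -s 4096 "$i"; done
# Attach each drive, which creates the .eli devices
for i in da1 da2 da3 da4; do geli attach "$i"; done
zpool create tank raidz1 da1.eli da2.eli da3.eli da4.eli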

Here the second for loop needs to be executed on each reboot. On FreeNAS, the encryption can be made simpler by setting up the drives in the GUI to use encryption; then, after the encryption is set up, go into the shell and execute the two for loops, then continue setting up the encrypted drive in the GUI to use ZFS.

So, that’s how to handle ZFS on WD EARS drives. I’ve read reports that some, if not all, of what I described in the solutions will be in the next version of FreeNAS. Here’s hoping that it will.

BTW, the primary legacy Operating System that cannot properly handle the 4096-byte sector is Windows XP.

I’ve read many posts on how to handle ZFS/Raid-Z on differently sized disks. The goal is to gain the most disk space availability while still retaining the redundancy of surviving a single disk failure. The posts I’ve read either would achieve the theoretical capacity, or be capable of expansion, but not both. I devised a way to get both at the same time, and it’s relatively simple.

The problem is the following: A Raid-Z configuration uses n partitions, giving the user the capacity of n-1 of those partitions, with the nth providing the redundancy to survive a failure. If the n partitions are not the same size, with the smallest being x, only the first x bytes of each partition are used. One cannot remove a Raid-Z from an active pool without a backup/restore. One cannot add a disk to a Raid-Z without a backup/restore (maybe in the future). The only expansion that can be done is to replace the partitions with larger ones, and once the smallest partition is increased, the available space increases.

For the purposes of demonstrating this technique, I will use an example with 4 disks:

  • 125G
  • 250G
  • 500G
  • 750G

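The biggest theoretical capacity of the array, while retaining single disk failure resiliency, is simply the total size of the array minus the largest disk. This means that the capacity of the above example is 875G. So, how does one achieve this capacity? I propose the following structure using 3 raids: a green raid of four 125G partitions (one on each disk), a red raid of three 125G partitions (on the 250G, 500G, and 750G disks), and a blue raid of two 250G partitions (on the 500G and 750G disks).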

If you add the capacity in the above, you can see it is 3*125G + 2*125G + 250G = 875G, which is the theoretical capacity.
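The same sum can be computed for any set of disks. This is only a sketch of the layering idea as I understand it: each raid uses the extra space a disk has over the next smaller disk, across that disk and all larger ones. The layout function is made up for this example and expects sizes in G, sorted in increasing order.

```shell
#!/bin/sh
# Usage: layout SIZE... (disk sizes in G, sorted in increasing order)
# Prints one line per raid and the total usable capacity.
layout() {
    n=$#        # number of disks that can take part in the current raid
    prev=0      # size of the previous (next smaller) disk
    total=0
    for size in "$@"; do
        part=$(( size - prev ))      # partition size for this raid
        # A raid needs at least 2 partitions; the top of the largest disk
        # has no partner and stays unused.
        if [ "$n" -ge 2 ] && [ "$part" -gt 0 ]; then
            usable=$(( (n - 1) * part ))
            echo "raid over $n disks: ${part}G partitions, ${usable}G usable"
            total=$(( total + usable ))
        fi
        prev=$size
        n=$(( n - 1 ))
    done
    echo "total: ${total}G"
}

layout 125 250 500 750
```

For the 4 disks in the example this prints the three raids (375G, 250G, and 250G usable) and a total of 875G.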

Suppose I wish to replace the smallest disk, 125G, with a new disk, say 1T. Using the same layout, I should see something like this:


If you add the capacity in the above, you can see it is 3*250G + 2*250G + 250G = 1500G, which again is the theoretical capacity.

The question is, can I migrate from the first configuration to the last without a backup/restore process? The answer is YES, and data is moved/copied at most once.

For the purposes of demonstration, I’m going to show the expansion with 5 drives connected, but the expansion can be done by immediately replacing the smallest drive with the largest and relying on the redundancy to keep things intact in the process.

First, connect the 1T drive and partition it as 250G, 250G, 250G, leaving the rest free. Replace the first 125G partition on the 750G disk with the first 250G partition on the 1T disk:

Continue the replacement:

Repeat the same process with the 3 partitions on the 500G disk:

At this point, the move of the blue raid is complete. Depending on the sizes of the disks, the blue raid may be bigger, at which point the pool immediately increases in size. In this example, it does not change size.

Repeat with the 2 partitions on the 250G disk:

At this point, the red raid is complete. In this example, the red raid has changed from 250G to 500G, and the job isn’t complete yet!

Repeat for the final time with the partition on the 125G disk:

Now we are done, and can disconnect the 125G disk if not done already. In this example, the green raid has changed from 375G to 750G, yielding a total change of 875G to 1500G.

Note: There is one limitation with this structure. Ordering the disks in increasing size, disk(n) must be at least as big as 2*(disk(n-1))-disk(n-2). This essentially means that if a disk is xG bigger than its previous disk, then the next disk must be at least xG bigger than this disk. Since disk sizes tend to grow exponentially, this assumption shouldn’t be much of a problem since the requirement is at least linear growth.

The disk sizes in this example meet the requirement I listed above. Given that requirement, the disk added in this example must be at least 1000G: since disk 4 is 250G bigger than disk 3, disk 5 must be at least 250G bigger than disk 4.
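The requirement is easy to check mechanically. A small sketch, with function names of my own and sizes given in G in increasing order:

```shell
#!/bin/sh
# Smallest allowed size for the next disk, given disk(n-2) and disk(n-1).
min_next_disk() {
    echo $(( 2 * $2 - $1 ))
}

# Check a sorted list of sizes against disk(n) >= 2*disk(n-1) - disk(n-2).
# Prints "ok", or the first size that breaks the rule.
check_sizes() {
    prev2=""; prev1=""
    for size in "$@"; do
        if [ -n "$prev2" ] && [ "$size" -lt $(( 2 * prev1 - prev2 )) ]; then
            echo "violation at ${size}G"
            return 1
        fi
        prev2=$prev1
        prev1=$size
    done
    echo ok
}

check_sizes 125 250 500 750    # the example disks: ok
min_next_disk 500 750          # smallest disk that can be added next: 1000
```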

This technique also allows adding another disk, under the same conditions. Consider adding a 1T disk and creating another Raid-Z:

I didn’t structure the partitions in nice pretty rows to demonstrate the fact that not all of the data needs to be moved (in this case, 750G of data is not moved). In this example, the capacity increased from 875G to 1625G, again the theoretical capacity.

Anyway, I thought of this while considering using ZFS/Raid-Z in a FreeNAS setup. I haven’t tested any of this; it’s theoretical only. What do you think?

About 3 weeks ago, Kevin (developer of NitoTV) and I decided it was a bit silly that we were each writing playback mechanisms on the AppleTV with little to no collaboration between us. So, we decided to write a Common Media Player Framework, which is licensed under the LGPL.

Kevin sent me the code he used for DVD playback inside NitoTV as a place to start. I stripped it down to a smaller piece, and started the framework. After I had it doing basic playback, I worked on overlays to provide feedback to the user. Now, hitting up and down changes the overlays between normal, chapter view, audio/subtitle selection, and zoom.

Chapter Overlay Screenshot

Also, I added a pretty menu using a blurred image as the resume menu. This is more consistent with Apple’s own playback.

Resume Overlay

In addition, I used my work on AC3 Passthrough in Perian to see if I could pull off the same in DVD Playback. The issue is the AppleTV claims to only have a device which can play uncompressed audio, not one that can send Dolby Digital to a decoder. So, I created an audio driver that claims to do exactly that and pipes the data through to the optical/HDMI port. It was a lot of fun getting that working, and then I added support for DTS.

Anyway, now Sapphire uses the common media player framework for DVD playback. In the future, we’ll add other playback mechanisms as well.
