Over the last year I was getting more and more curious/excited about OpenSolaris. Specifically, I got interested in ZFS – Sun’s new filesystem/volume manager.
So I finally got my act together and gave it a whirl.
Test system: Pentium 4, 3.0 GHz, in an MSI P4N SLI motherboard. Three ATA Seagate ST3300831A hard drives and one Maxtor 6L300R0 ATA drive (all nominally 300 gigs – see the previous post on slight capacity differences). One Western Digital WDC WD800JD-60LU SATA 80 gig hard drive. Solaris Express Community Release (SXCR) build 51.
Originally I started this project on SXCR 41, but back then I only had three of the 300 gig drives, and that was interfering with my plans for RAID 5 greatness. In the end the wait was worth it, as ZFS has been revved since then.
A bit about the MSI motherboard: I like it. For a PC system I like it a lot. It has two PCI slots, two full-length PCIe slots (16x), and one PCIe 1x slot. Technically it supports dual graphics cards (ATI CrossFire or Nvidia SLI capable), however in that case both full-length slots run at 8x; a single card runs at 16x. Two dual-channel IDE connectors, four SATA connectors, built-in high-end audio with SPDIF, a built-in GigE NIC based on a Marvell chipset/PHY, serial, parallel, and built-in IEEE 1394 (iLink/FireWire) with 3 ports (one on the back of the board, two more can be brought out). Plenty of USB 2.0 connectors (4 brought out on the back of the board, 6 more can be brought out from connector banks on the motherboard). Overall, pretty shiny.
My setup consists of the four IDE hard drives on the IDE bus, and the 80 gig WD on the SATA bus for the OS. The motherboard BIOS let me specify the SATA drive as the first boot device, so I took advantage of the offer.
Installation of SXCR was from an IDE DVD drive (a pair of the hard drives was unplugged for the duration).
SXCR recognized pretty much everything in the system except the built-in Marvell GigE NIC. Shit happens; I tossed in a PCI 3Com 3c509C NIC that I had kicking around and restarted. There was a bit of a hold-up with the SATA drive – Solaris didn’t recognize it and wanted the geometry (numbers of cylinders, heads and sectors) so that it could create an appropriate volume label. Luckily WD makes an identical drive in an IDE configuration, for which it actually provides the cylinders/heads/sectors information, so I plugged those numbers in, and format and fdisk cheered up.
Other than that, it was a normal Solaris install. I did a console/text install just because I’m a lot more familiar with those, however the Sapphire Radeon X550 PCIe video card was recognized, and the system happily boots into OpenWindows/CDE if you want it to.
So I proceeded to create a ZFS pool.
The first thing I wanted to check is how portable ZFS is. Specifically, Sun claims that it’s endianness-neutral (i.e. I can connect the same drives to a little-endian PC or a big-endian SPARC system, and as long as both run an OS that recognizes ZFS, things will work). I also wondered how it deals with device numbers. Traditionally Solaris is very picky about device IDs, and changing things like controllers or SCSI IDs on a system can be tricky.
Here I wanted to know if I could just create, say, a “travelling ZFS pool”: an external enclosure with a few SATA drives and an internal PCI SATA controller card, so that if things go wrong in a particular system, I could always unplug the drives, move them to a different system, and things would keep working. So I wanted to find out whether ZFS can deal with changes in device IDs.
For ZFS to work reliably, it wants whole drives. When given a whole drive, it writes an EFI disk label onto it, with a unique identifier. Note that certain PC motherboards choke on EFI disk labels and refuse to boot; luckily most of the time this is fixable with a BIOS update.
root@dara:/[03:00 AM]# uname -a
SunOS dara.NotBSD.org 5.11 snv_51 i86pc i386 i86pc
root@dara:/[03:00 AM]# zpool create raid1 raidz c0d0 c0d1 c1d0 c1d1
root@dara:/[03:01 AM]# zpool status
  pool: raid1
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        raid1       ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c0d0    ONLINE       0     0     0
            c0d1    ONLINE       0     0     0
            c1d0    ONLINE       0     0     0
            c1d1    ONLINE       0     0     0

errors: No known data errors
root@dara:/[03:02 AM]# zpool list
NAME                    SIZE    USED   AVAIL    CAP  HEALTH     ALTROOT
raid1                  1.09T    238K   1.09T     0%  ONLINE     -
root@dara:/[03:02 AM]# df -h /raid1
Filesystem             size   used  avail capacity  Mounted on
raid1                  822G    37K   822G     1%    /raid1
root@dara:/[03:02 AM]#
Here I created a raidz1 pool – the ZFS equivalent of RAID 5 with one parity disk, giving me (N-1)*[capacity of a drive]; raidz1 can survive the death of one hard drive. A ZFS pool can also be created with the raidz2 keyword, giving the equivalent of RAID 6 (two parity disks); such a configuration can survive the death of two disks.
Note the difference in capacity that zpool list and df report: zpool list shows the total size of all the disks, with no allowance for parity, while df shows the more traditional usable disk space. Using df will likely cause less confusion in normal operation.
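Out of curiosity, here is what a double-parity version of the same pool would look like. This is just a sketch, assuming the four drives were free to be reused; the pool name raid2 is made up and I haven’t captured any output:

zpool create raid2 raidz2 c0d0 c0d1 c1d0 c1d1   # two parity disks: (N-2) x drive capacity usable
zpool list raid2   # total size of all disks, parity included
zfs list raid2     # usable space, net of parity (same idea as df)
df -h /raid2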
So far so good.
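Coming back to that EFI label and unique identifier: zdb can dump the on-disk vdev labels, which carry the pool name, the pool GUID and a per-device GUID – the bits that let ZFS recognize its disks no matter which controller they end up on. A sketch, assuming ZFS put its data on slice 0 of the EFI-labelled disk (I haven’t captured the output here):

zdb -l /dev/dsk/c0d0s0   # dump the vdev labels ZFS wrote on the first drive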
Then I proceeded to create a large file on the ZFS pool:
root@dara:/raid1[03:04 AM]# time mkfile 10g reely_beeg_file

real    2m8.943s
user    0m0.062s
sys     0m5.460s
root@dara:/raid1[03:06 AM]# ls -la /raid1/reely_beeg_file
-rw------T   1 root     root     10737418240 Nov 10 03:06 /raid1/reely_beeg_file
root@dara:/raid1[03:06 AM]#
While this was running, I had zpool iostat -v raid1 10 going in a different window.
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
raid1        211M  1.09T      0    187      0  18.7M
  raidz1     211M  1.09T      0    187      0  18.7M
    c1d0        -      -      0    110      0  6.26M
    c1d1        -      -      0    110      0  6.27M
    c0d0        -      -      0    110      0  6.25M
    c0d1        -      -      0     94      0  6.23M
----------  -----  -----  -----  -----  -----  -----

               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
raid1       1014M  1.09T      0    601      0  59.5M
  raidz1    1014M  1.09T      0    601      0  59.5M
    c1d0        -      -      0    364      0  20.0M
    c1d1        -      -      0    363      0  20.0M
    c0d0        -      -      0    355      0  19.9M
    c0d1        -      -      0    301      0  19.9M
----------  -----  -----  -----  -----  -----  -----

[...]

               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
raid1       8.78G  1.08T      0    778    363  91.1M
  raidz1    8.78G  1.08T      0    778    363  91.1M
    c1d0        -      -      0    412      0  30.4M
    c1d1        -      -      0    411  5.68K  30.4M
    c0d0        -      -      0    411  5.68K  30.4M
    c0d1        -      -      0    383  5.68K  30.4M
----------  -----  -----  -----  -----  -----  -----
10 gigabytes written over 128 seconds. About 80 megabytes a second on continuous writes. I think I can live with that.
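For comparison, the same kind of sequential-write test could be done with plain dd instead of mkfile. Just a sketch – I didn’t capture output for this one, and the file name is made up:

time dd if=/dev/zero of=/raid1/dd_test_file bs=1048576 count=10240   # 10240 x 1 MB blocks = 10 GB of zeros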
Next I wanted to take md5 digests of some files on /raid1, export the pool, shut the system down, switch the IDE cables around, boot back up, re-import the pool, and re-run the md5 digests. This would simulate moving a disk pool to a different system and screwing up the disk ordering in the process.
root@dara:/[12:20 PM]# digest -a md5 /raid1/*
(/raid1/reely_beeg_file) = 2dd26c4d4799ebd29fa31e48d49e8e53
(/raid1/sunstudio11-ii-20060829-sol-x86.tar.gz) = e7585f12317f95caecf8cfcf93d71b3e
root@dara:/[12:23 PM]# zpool status
  pool: raid1
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        raid1       ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c0d0    ONLINE       0     0     0
            c0d1    ONLINE       0     0     0
            c1d0    ONLINE       0     0     0
            c1d1    ONLINE       0     0     0

errors: No known data errors
root@dara:/[12:23 PM]# zpool export raid1
root@dara:/[12:23 PM]# zpool status
no pools available
root@dara:/[12:23 PM]#
The system was shut down, the IDE cables were switched around, and the system was rebooted.
root@dara:/[02:09 PM]# zpool status
no pools available
root@dara:/[02:09 PM]# zpool import raid1
root@dara:/[02:11 PM]# zpool status
  pool: raid1
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        raid1       ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c1d0    ONLINE       0     0     0
            c1d1    ONLINE       0     0     0
            c0d0    ONLINE       0     0     0
            c0d1    ONLINE       0     0     0

errors: No known data errors
root@dara:/[02:11 PM]#
Notice that the order of the drives changed: it was c0d0 c0d1 c1d0 c1d1, and now it’s c1d0 c1d1 c0d0 c0d1.
root@dara:/[02:22 PM]# digest -a md5 /raid1/*
(/raid1/reely_beeg_file) = 2dd26c4d4799ebd29fa31e48d49e8e53
(/raid1/sunstudio11-ii-20060829-sol-x86.tar.gz) = e7585f12317f95caecf8cfcf93d71b3e
root@dara:/[02:25 PM]#
Same digests.
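That was the same-box version of the experiment. If I actually carried the pool to a different machine, the import side would look something like this (a sketch – I haven’t tried it on a second box yet):

zpool import            # scan attached disks and list pools available for import
zpool import raid1      # import by name (or by the numeric pool ID the listing shows)
zpool import -f raid1   # force it if the pool wasn't cleanly exported from the old box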
Oh, and a very neat feature… want to know what has been happening with your disk pools?
root@dara:/[02:12 PM]# zpool history raid1
History for 'raid1':
2006-11-10.03:01:56 zpool create raid1 raidz c0d0 c0d1 c1d0 c1d1
2006-11-10.12:19:47 zpool export raid1
2006-11-10.12:20:07 zpool import raid1
2006-11-10.12:39:49 zpool export raid1
2006-11-10.12:46:14 zpool import raid1
2006-11-10.14:09:54 zpool export raid1
2006-11-10.14:11:00 zpool import raid1
Yes, ZFS logs the last bunch of commands onto the zpool devices themselves. So even if you move the pool to a different system, the command history will still be with you.
Lastly, some versioning history for ZFS:
root@dara:/[02:19 PM]# zpool upgrade raid1
This system is currently running ZFS version 3.

Pool 'raid1' is already formatted using the current version.
root@dara:/[02:19 PM]# zpool upgrade -v
This system is currently running ZFS version 3.

The following versions are supported:

VER  DESCRIPTION
---  --------------------------------------------------------
 1   Initial ZFS version
 2   Ditto blocks (replicated metadata)
 3   Hot spares and double parity RAID-Z

For more information on a particular version, including supported releases, see:

http://www.opensolaris.org/os/community/zfs/version/N

Where 'N' is the version number.
root@dara:/[02:19 PM]#
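If a pool had been created under an older ZFS version, bringing it up to the current on-disk format is a one-liner. A sketch (with a caveat: once upgraded, the pool can no longer be read by releases that only support the older version):

zpool upgrade raid1   # upgrade just this pool
zpool upgrade -a      # or upgrade every pool on the system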