Configuring RAID on NetBSD

Setting up software-based RAID for the root partition using RAIDframe, with explanations of RAID terminology and usage.

By Peter Clark

Introduction

RAIDframe, as seen in NetBSD, is the mechanism that supplies support for software RAID devices. It was originally developed at the Parallel Data Laboratory at Carnegie Mellon University, where it serves as a framework for rapid prototyping of RAID structures and a testbed for ongoing RAID research. RAIDframe was integrated into NetBSD by Greg Oster, who has also made enhancements that are very useful in production environments, such as support for hot-adding components and for a root filesystem on RAID.

My experience with RAIDframe began with a project needing an affordable but powerful BSD-based web and database server. After coming across the NetBSD RAIDframe page, which mentions such things as "Support for root filesystems on any RAID level" and "The driver has currently been tested on i386 (very heavily tested)", I decided it would be the right tool for the job.

Some notable features include:

- Root filesystem on RAID

- RAID levels 0, 1, 4, and 5

  RAID is an acronym for "Redundant Array of Inexpensive Disks". The purpose of RAID is both to join more than one disk into a single logical device and to prevent a single physical disk failure from causing the entire logical device to fail.

  RAID 0 is also known as striping. It simply consists of joining two or more disks into a larger logical device. It does not provide any redundancy by itself, but it is commonly used to combine RAID sets utilizing other RAID levels, such as RAID 1 or RAID 5. A striped set of RAID 1 sets is known as RAID 0+1 or RAID 10. Likewise, a striped set of RAID 5 sets is known as RAID 0+5 or RAID 50.

  RAID 1 is also known as mirroring. Two disks are combined -- each contains the same data. If one disk fails, the other continues on and the logical device remains available.

  RAID 4 combines three or more disks into a single device, with one of the disks dedicated to parity. If one disk fails, the logical device will remain available, but with degraded performance. This RAID level is not commonly used because the single parity disk introduces a considerable bottleneck for writes.

  RAID 5 combines three or more disks into a single device, with the parity distributed across each drive in the set. If one disk fails, the logical device will remain available, but with degraded performance.

- Hot spares

  Hot spares are standby disks which take over for a failed disk in a RAID 1, 4, or 5 set. In the event of a single disk failure, the data that was on the failed disk is rebuilt from the remaining disks onto the hot spare. Once the rebuild is complete, the RAID functions normally with no degradation.

- Device independence (eg., not tied to only SCSI)

- No restrictions on RAID combinations, allowing, for instance, striped RAID 5

- On-demand failing of disks

  This allows for experimentation with failure events without physically disconnecting or destroying drives.

- On-demand parity regeneration

  Once a failed drive has been replaced with a new one, the RAID can be restored to normal operation while the system is running.

- On-demand parity reconstruction/copyback

  Once a failed drive has been replaced with a new one, the RAID can be restored to its original configuration by copying the data from a hot spare back to the disk that was replaced. This can be done while the system is running.

- Hot-adding of spare disks

  Hot spares can be added while the system is running.

RAID levels 0, 1, 4, and 5 have received a lot of testing.
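As a rough illustration of the capacity trade-offs (my own arithmetic, not taken from the RAIDframe documentation): with N disks of equal size S, RAID 0 yields N x S of usable space, RAID 1 yields S, and RAID 4 and RAID 5 yield (N - 1) x S, since the equivalent of one disk is consumed by parity. Three 18GB disks in a RAID 5, for example, give roughly 2 x 18GB = 36GB of usable space.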
Requirements

OS Version: NetBSD 1.5 is required for a root filesystem on RAID. (RAID is available in other NetBSD versions, but support for mounting the root partition from a RAID device is not present in earlier versions. The difference is autoconfiguration -- the kernel joins the components together into a RAID device before attempting to mount root.)

Note: As of this writing the latest formal release of NetBSD is version 1.4.2. NetBSD 1.5 is still in pre-release form. It can be found, along with installation instructions, at ftp://ftp.netbsd.org/pub/NetBSD/NetBSD-1.5_ALPHA/

Hardware: A system using either the i386 or alpha architecture is currently recommended for production use, since those have been the most heavily tested. Of course, the maintainer would be happy to hear your experiences on other ports and help out if necessary.

A minimum of two hard drives is required for RAID 0 (stripe) and RAID 1 (mirror), a minimum of three drives for RAID 4 (dedicated parity) and RAID 5 (distributed parity), and a minimum of four drives for RAID 10 (striped mirror). IDE, SCSI, and even Fibre Channel may be used, since RAIDframe is device- and bus-independent. Using only one disk per channel is highly recommended for IDE configurations, since IDE allows only one outstanding command per channel at a time.

Boot partition: Since the NetBSD bootloader does not currently know about RAIDframe, the kernel must reside on a non-RAID partition.

This article will cover the installation and configuration of NetBSD to use a root partition on RAID on the i386 architecture.

Getting Started

Of course, the first thing necessary for running RAIDframe on NetBSD is a machine running NetBSD. Being the conservative type, I've chosen to keep non-RAID partitions to provide a full system that can be used for emergency maintenance. This will also be the location of the kernel. However, if desired, the kernel can instead be booted from such devices as a floppy disk or CD-ROM.

I'll describe the installation using my own setup: an Athlon 850 with 256MB RAM, an Adaptec 29160 SCSI controller and three IBM 18GB Ultra160 SCSI hard drives. I've chosen to use 400MB for the non-RAID base system and have dedicated the rest to two RAID devices: a RAID 1 with a hot spare for swap, and a RAID 5 for the rest of the system. Giving the non-RAID system 400MB strikes a nice balance between not taking too much space from the RAID and having enough room for a base installation combined with the kernel source and pkgsrc.

The reason the swap partition gets its own RAID device is that NetBSD does not currently free the underlying device when a swap partition is turned off. This means that each time the system is booted, parity on any RAID device hosting a swap partition must be reconstructed. It's much more desirable for the system to spend under a minute rebuilding parity on a relatively small swap-sized RAID device than for it to rebuild parity on a multi-gigabyte filesystem.

1 - Install the base system

Install NetBSD version 1.5 according to its documentation, giving NetBSD the entire first disk. When specifying partitions with disklabel, choose a custom layout, with a 400MB partition for / (root) on the 'a' partition and '0' for each of the rest when prompted. Since this will be only the bootstrap and maintenance system, I've chosen to forego swap space, opting for it to reside on a RAID device instead.

At the distribution sets prompt, choose a distribution with the sets 'Kernel', 'Base', 'System (/etc)', 'Compiler', 'Manuals', 'Miscellaneous' and 'Text tools'.
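Before building the RAID-aware kernel, it may help to keep the target layout in mind. This is just my own summary of the plan described above; the exact partition sizes are worked out in step 3.

           sd0                        sd1                    sd2
  400MB    non-RAID / (boot and       unused (the label      unused (the label
           maintenance system)        is copied from sd0)    is copied from sd0)
  ~16GB    raid0 component            raid0 component        raid0 component
  ~512MB   raid1 component            raid1 component        raid1 hot spare

  raid0 (RAID 5 over sd0e, sd1e and sd2e) will hold the new root filesystem.
  raid1 (RAID 1 over sd0f and sd1f, with sd2f as a hot spare) will hold swap.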
2 - Build and install a RAIDframe-enabled kernel

Once the base installation is complete and the system has had its first boot, install the kernel sources. It's a good idea at this point to back up the installation kernel, /netbsd, to /netbsd-INSTALL. Now, build and install a custom kernel to match your system, with the following options added:

  pseudo-device raid 4    # RAIDframe disk device
  options RAID_AUTOCONFIG

The "pseudo-device raid 4" line includes support for the raid driver, with up to 4 raid devices. The "options RAID_AUTOCONFIG" line turns on support for component auto-detection and auto-configuration, and is necessary for root filesystem support.

Disk devices should be configured so that each NetBSD device name is tied to a specific SCSI ID, so disks don't get mixed up in the event of a failure. This isn't strictly necessary on RAID sets using autoconfiguration, since the RAIDframe component labels will distinguish one drive from another, but it is very helpful in keeping the physical disks and the devices NetBSD assigns to them in sync.

Configure scsibus0 to use the specified Adaptec controller (in this case 'ahc?'):

  scsibus0 at ahc?

And the devices attached to it:

  sd0 at scsibus0 target 0 lun ?    # SCSI disk drives
  sd1 at scsibus0 target 1 lun ?    # SCSI disk drives
  sd2 at scsibus0 target 2 lun ?    # SCSI disk drives

With the RAID-enabled kernel compiled and installed, reboot the system to use it:

  # shutdown -r now

If the boot fails, tell the system to boot the installation kernel at the boot prompt, to give the kernel configuration and compile another shot:

  boot sd0a:netbsd-INSTALL

3 - Re-label the drives that will be hosting the RAID devices

After the initial system installation, the partition portion of the disk label (as shown by disklabel sd0) looks like:

  #        size   offset    fstype   [fsize bsize   cpg]
    a:   819957       63    4.2BSD     1024  8192    16   # (Cyl.    0*- 345)
    c: 35843607       63    unused        0     0         # (Cyl.    0*- 15123*)
    d: 35843670        0    unused        0     0         # (Cyl.    0 - 15123*)

Create a prototype file for the disks that will be used for the RAID components:

  disklabel sd0 > /root/disklabel.sd0

Edit the prototype file to add the RAID partitions (the 'e' and 'f' partitions), using the remaining space on the device. In this case, I've decided on roughly 512MB of swap, so I've given the swap component (the 'f' partition) 1050672 sectors -- a little over 512MB, to leave some room for overhead -- and the main RAID component (the 'e' partition) the rest of the unused space on the disk, 33972978 sectors. Note that sectors are 512 byte blocks, so sector counts read as twice what the same sizes specified in kilobytes would.

    a:   819957       63    4.2BSD     1024  8192    16   # (Cyl.    0*- 345)
    c: 35843607       63    unused        0     0         # (Cyl.    0*- 15123*)
    d: 35843670        0    unused        0     0         # (Cyl.    0 - 15123*)
    e: 33972978   820020      RAID                        #
    f:  1050672 34792998      RAID                        #

I used the following method to end up with these numbers. First, I noted the sizes that are already known: the root partition is 819957 sectors, and it starts at offset 63, just after the sectors reserved for the MBR/partition table. Adding those together gives the offset of partition 'e': 63 + 819957 = 820020. Next, the swap component: 512MB is 524288 kilobytes, and multiplying by 2 to convert kilobytes into 512 byte sectors gives 1048576 sectors; I rounded that up slightly, to 1050672 sectors, to leave extra room for overhead. Partition 'e' then gets everything that remains: the total size of the drive, 35843670 sectors (from the 'd' partition), minus 63 (MBR/partition table), minus 819957 (the 'a' partition), minus 1050672 (swap), which leaves 33972978 sectors. Finally, the offset of partition 'f' is the offset of 'e' plus its size: 820020 + 33972978 = 34792998.
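If you'd rather not do that arithmetic by hand, expr(1) makes for a quick double-check (just a convenience of my own; any calculator will do):

  expr 63 + 819957                        # 820020   -> offset of 'e'
  expr 35843670 - 63 - 819957 - 1050672   # 33972978 -> size of 'e'
  expr 820020 + 33972978                  # 34792998 -> offset of 'f'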
Save the modified prototype file and make copies for the other devices:

  cp /root/disklabel.sd0 /root/disklabel.sd1
  cp /root/disklabel.sd0 /root/disklabel.sd2

Edit the two new copies to change the 'disk:' entry in the top portion of the label:

  # /dev/rsd0d:
  type: unknown
  disk: mydisk
  label:
  flags:

In /root/disklabel.sd1:

  disk: mydisk1

and in /root/disklabel.sd2:

  disk: mydisk2

Then, apply the labels:

  disklabel -R -r sd0 /root/disklabel.sd0
  disklabel -R -r sd1 /root/disklabel.sd1
  disklabel -R -r sd2 /root/disklabel.sd2

Whenever the disklabel is changed on a boot device, it is necessary to reinstall the boot sector, since writing the label clears it, rendering the system unbootable. To reinstall the boot sector, run installboot like so:

  /usr/mdec/installboot /usr/mdec/biosboot.sym /dev/rsd0a

4 - Create the configuration file for the RAID 5 set

From the raidctl(8) manpage:

There are 4 required sections of a configuration file, and 2 optional sections. Each section begins with a `START', followed by the section name, and the configuration parameters associated with that section. The first section is the `array' section, and it specifies the number of rows, columns, and spare disks in the RAID set. For example:

  START array
  1 3 0

indicates an array with 1 row, 3 columns, and 0 spare disks. Note that although multi-dimensional arrays may be specified, they are NOT supported in the driver.

The example in the manpage matches the requirements of creating a RAID 5, so I'll keep it unchanged.

The second section, the `disks' section, specifies the actual components of the device. For example:

  START disks
  /dev/sd0e
  /dev/sd1e
  /dev/sd2e

Note that it is imperative that the order of the components in the configuration file does not change between configurations of a RAID device.

These are the devices I've configured with disklabel, so I'll also keep this portion unchanged.

The next section, which is the `spare' section, is optional, and, if present, specifies the devices to be used as `hot spares' -- devices which are on-line, but are not actively used by the RAID driver unless one of the main components fail. A simple `spare' section might be:

  START spare
  /dev/sd3e

for a configuration with a single spare component. If no spare drives are to be used in the configuration, then the `spare' section may be omitted.

Since only three components are being used for this RAID device, this section will be omitted.

The next section is the `layout' section. This section describes the general layout parameters for the RAID device, and provides such information as sectors per stripe unit, stripe units per parity unit, stripe units per reconstruction unit, and the parity configuration to use.
This section might look like:

  START layout
  # sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level
  32 1 1 5

The sectors per stripe unit specifies, in blocks, the interleave factor; i.e., the number of contiguous sectors to be written to each component for a single stripe. Appropriate selection of this value (32 in this example) is the subject of much research in RAID architectures. The stripe units per parity unit and stripe units per reconstruction unit are normally each set to 1. While certain values above 1 are permitted, a discussion of valid values and the consequences of using anything other than 1 are outside the scope of this document. The last value in this section (5 in this example) indicates the parity configuration desired. Valid entries include:

  0 - RAID level 0. No parity, only simple striping.
  1 - RAID level 1. Mirroring.
  4 - RAID level 4. Striping across components, with parity stored on the last component.
  5 - RAID level 5. Striping across components, parity distributed across all components.

I'll use the example given here, since it matches my configuration. However, be sure to have a look at the stripe unit size performance analysis linked at the end of the article for a nice explanation and graph of the performance differences among the options.

The next required section is the `queue' section. This is most often specified as:

  START queue
  fifo 100

where the queuing method is specified as fifo (first-in, first-out), and the size of the per-component queue is limited to 100 requests. Other queuing methods may also be specified, but a discussion of them is beyond the scope of this document.

Rather than venturing into uncharted territory, this one can remain unchanged.

The final section, the `debug' section, is optional. For more details on this the reader is referred to the RAIDframe documentation discussed in the HISTORY section.

And, debug can be left out in the "we'll cross that bridge when we come to it" fashion. So, what remains is:

  START array
  1 3 0

  START disks
  /dev/sd0e
  /dev/sd1e
  /dev/sd2e

  START layout
  # sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level
  32 1 1 5

  START queue
  fifo 100

Now, save this into the file /etc/raid0.conf. Note that the raid0 in its name refers to raid device 0 rather than RAID level 0. The next RAID device will use /etc/raid1.conf, and more would continue in sequence as /etc/raid2.conf, /etc/raid3.conf, and so on.

5 - Initialize the raid0 device

First, raid0 needs to be configured. The -C option forces the configuration, which is what's needed here since the set is being newly created:

  raidctl -C /etc/raid0.conf raid0

This completes almost instantly, but it causes the kernel to print about a pageful of text, obscuring the shell prompt. Rather than waiting, just hit Enter to get a fresh prompt.

The next step is to initialize the component labels, using a numeric identifier unique to this RAID device:

  raidctl -I 123456 raid0

Now, initialize parity, with -v to display a status meter:

  raidctl -iv raid0

You'll be treated to a display along these lines:

  Initiating re-write of parity
  Parity Re-write status:
   6% |**                                     | ETA: 21:43 \

When the parity initialization is complete, the RAID is ready to use as a disk device.
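Before moving on to label the new device, it doesn't hurt to confirm that the set looks healthy -- a habit of mine rather than a required step:

  raidctl -s raid0    # each component should be listed as 'optimal'
  raidctl -p raid0    # the parity status should be reported as clean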
6 - Partition the raid0 device

To make the RAID ready for a filesystem (or to be used as a component of another RAID), it must be labeled. In this case, raid0 will hold one large root filesystem; the 512MB of swap (twice physical memory) will live on raid1, which is set up in step 9.

To create the prototype:

  disklabel raid0 > /root/disklabel.raid0

And edit the partition portion:

  #        size   offset    fstype   [fsize bsize   cpg]
    d: 67945792        0    4.2BSD        0     0     0   # (Cyl.    0 - 1061652)

to become:

  #        size   offset    fstype   [fsize bsize   cpg]
    a: 67945792        0    4.2BSD     1024  8192    16   # (Cyl.    0 - 1061652)
    d: 67945792        0    unused        0     0     0   # (Cyl.    0 - 1061652)

7 - Create the filesystem

  newfs /dev/rraid0a

It may be necessary to adjust the -c option of newfs. In this case the default, 16, caused newfs to refuse to create the filesystem, so I used:

  newfs -c 96 /dev/rraid0a

8 - Copy the base system to the RAID device

First, mount the RAID's root partition on /mnt:

  mount /dev/raid0a /mnt

Then, use a tar pipe to copy the installed root filesystem over to the RAID. The 'l' (ell) switch tells tar not to traverse filesystem boundaries, and 'p' to preserve permissions.

  ( cd / ; tar lpcf - . ) | ( cd /mnt ; tar xpf - )

9 - Configure the raid1 set

This is performed in much the same fashion as the configuration of raid0. First, the /etc/raid1.conf file must be created. Since this is a RAID 1 set with a hot spare, the "START array" section will need to specify 2 components and 1 hot spare:

  START array
  1 2 1

The 'f' partitions are used for the components of the RAID device:

  START disks
  /dev/sd0f
  /dev/sd1f

With the hot spare listed separately:

  START spare
  /dev/sd2f

The layout section needs '1' as the last parameter, which specifies the RAID level:

  START layout
  # sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level
  32 1 1 1

  START queue
  fifo 100

Thus, the completed /etc/raid1.conf is:

  START array
  1 2 1

  START disks
  /dev/sd0f
  /dev/sd1f

  START spare
  /dev/sd2f

  START layout
  32 1 1 1

  START queue
  fifo 100

Now, configure raid1 using its configuration file:

  raidctl -C /etc/raid1.conf raid1

Next, initialize the component labels, using a different number than the previous RAID set:

  raidctl -I 123457 raid1

Initialize parity:

  raidctl -iv raid1

Create a prototype disklabel to use for raid1:

  disklabel raid1 > /root/disklabel.raid1

Edit the partition section of the prototype:

  #        size   offset    fstype   [fsize bsize   cpg]
    d:  1050560        0    4.2BSD        0     0     0   # (Cyl.    0 - 32829)

specifying a swap partition:

  #        size   offset    fstype   [fsize bsize   cpg]
    b:  1050560        0      swap        0     0     0   # (Cyl.    0 - 32829)
    d:  1050560        0    4.2BSD        0     0     0   # (Cyl.    0 - 32829)

Then, apply the label:

  disklabel -R -r raid1 /root/disklabel.raid1

The raid1 set is now ready for /dev/raid1b to be used as a swap device.

10 - Edit the RAID 5's /etc/fstab file

At boot time, the /etc/fstab file is read from the filesystem mounted as root; it tells the system which filesystems to mount and which swap devices to enable. Since raid0 will be hosting the root partition and raid1 will be hosting the swap partition, the /etc/fstab on raid0 will need to be modified. With raid0's root partition currently mounted on /mnt, the file to edit is /mnt/etc/fstab.

The original:

  /dev/sd0a    /        ffs     rw  1 1
  /dev/sd0b    none     swap    sw  0 0
  /kern        /kern    kernfs  rw

becomes:

  /dev/raid0a  /        ffs     rw  1 1
  /dev/raid1b  none     swap    sw  0 0
  /kern        /kern    kernfs  rw
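At this point, a quick check that the copy and the fstab edit came out as intended can save some head-scratching on the first boot from RAID (this is just a precaution of my own, not a required step):

  cat /mnt/etc/fstab                   # should now reference /dev/raid0a and /dev/raid1b
  ls /mnt/dev/raid0a /mnt/dev/raid1b   # the raid device nodes should have come over with the tar copy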
11 - Turn on autoconfiguration

Make raid0 autoconfigurable and eligible to be the root partition. This overrides using the partition the kernel was booted from as the root device:

  raidctl -A root raid0

And make raid1 autoconfigurable, without root partition eligibility:

  raidctl -A yes raid1

12 - Reboot and enjoy

  shutdown -r now

Once the system has come up, log in and verify that the root filesystem is indeed on the raid0 device:

  $ df -k
  Filesystem   1K-blocks     Used     Avail Capacity  Mounted on
  /dev/raid0a   33286204   301047  31320846     0%    /
  kernfs               1        1         0   100%    /kern

And that swap is on the raid1 device:

  $ swapctl -l
  Device      512-blocks     Used    Avail Capacity  Priority
  /dev/raid1b    1050560        0  1050560     0%    0

General Operations

The administration of a RAID subsystem after setup is primarily done using the raidctl command. The status of a RAID device can be seen with the following command:

  raidctl -s raid0

For a healthy RAID set, it yields output such as:

  Components:
             /dev/sd0e: optimal
             /dev/sd1e: optimal
             /dev/sd2e: optimal
  No spares.
  Component label for /dev/sd0e:
     Row: 0 Column: 0 Num Rows: 1 Num Columns: 3
     Version: 2 Serial Number: 123456 Mod Counter: 707
     Clean: No Status: 0
     sectPerSU: 32 SUsPerPU: 1 SUsPerRU: 1
     RAID Level: 5 blocksize: 512 numBlocks: 35023552
     Autoconfig: Yes
     Root partition: Yes
     Last configured as: raid0
  ...
  Parity status: clean
  Reconstruction is 100% complete.
  Parity Re-write is 100% complete.
  Copyback is 100% complete.

For a RAID set with one failed component:

  Components:
             /dev/sd0e: optimal
             /dev/sd1e: failed
             /dev/sd2e: optimal
  ...

For a RAID set reconstructing after replacement of a failed component:

  Components:
             /dev/sd0e: optimal
             /dev/sd1e: reconstructing
             /dev/sd2e: optimal
  ...
  Parity status: clean
  Reconstruction is 7% complete.
  Parity Re-write is 100% complete.
  Copyback is 100% complete.

For a RAID set brought on-line after an unclean shutdown:

  Components:
             /dev/sd0e: optimal
             /dev/sd1e: optimal
             /dev/sd2e: optimal
  No spares.
  ...
  Parity status: DIRTY
  Reconstruction is 100% complete.
  Parity Re-write is 2% complete.
  Copyback is 100% complete.

To see only the Parity, Reconstruction, Parity Re-write and Copyback status:

  raidctl -S raid0

If a maintenance operation is in progress on one of these, a progress meter will also be attached to the terminal. The progress meter can be safely exited with control-c without interrupting the task. Output for a RAID set with dirty parity (eg., from an unclean shutdown) during a parity rebuild:

  Reconstruction is 100% complete.
  Parity Re-write is 9% complete.
  Copyback is 100% complete.
  Parity Re-write status:
   27% |**********                             | ETA: 19:25 \

In the event of a failure of a hot-swappable SCSI device, after replacing the component which failed, the new disk can be detected by executing:

  scsictl scsibus0 scan any any

Note that the replacement disk must be labeled before it can be reconstructed to -- eg., using the disklabel files from installation time:

  disklabel -R -r sd1 /root/disklabel.sd1
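One caveat worth repeating from the installation steps: if the disk being replaced is the boot disk (sd0 in this setup), writing the label clears the boot sector, so reinstall it before relying on the machine to reboot:

  /usr/mdec/installboot /usr/mdec/biosboot.sym /dev/rsd0a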
To rebuild onto a replaced component in place (without copying from a hot spare):

  raidctl -R /dev/sd1e raid0

To copy data back to a replaced component from the hot spare which took its place (in this setup, raid1 is the set with a hot spare):

  raidctl -B raid1

Booting with a non-RAID root partition for emergency maintenance (eg., to reset a forgotten root password, or to restore an important shared library after accidental deletion):

  boot -a

(This asks for the device to be mounted as root, instead of automatically choosing the root partition of the RAID.)

Unconfiguring a RAID set to start over again from scratch: first, make sure the device is unmounted, then clear the first part of it to get rid of the disklabel:

  dd if=/dev/zero of=/dev/rraid0d count=512

Then, unconfigure the device:

  raidctl -u raid0

Now you're free to create a new raid0.conf and reinitialize, unfettered by old configurations.
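Finally, RAIDframe's on-demand failing of disks (mentioned in the feature list at the top of this article) makes it possible to rehearse a failure before one happens for real. A rough sketch, using the raid1 set since it is the one with a hot spare:

  raidctl -F /dev/sd1f raid1   # mark sd1f as failed and begin reconstruction
                               # onto the hot spare, /dev/sd2f
  raidctl -S raid1             # watch the reconstruction progress
  raidctl -s raid1             # component status afterwards

Once you're satisfied, the set can be returned to its original layout with the relabel, rebuild and copyback commands shown above.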