Configuring RAID on NetBSD

Setting up software-based RAID for the root partition using RAIDframe, with explanations of RAID terminology and usage.

By Peter Clark

Introduction

RAIDframe, as seen in NetBSD, is the mechanism that supplies support for software RAID devices. It was originally developed at the Parallel Data Laboratory at Carnegie Mellon University, where it serves as a framework for rapid prototyping of RAID structures and a testbed for ongoing RAID research. RAIDframe was integrated into NetBSD by Greg Oster, who has also made enhancements that are very useful in production environments, such as support for hot-adding components and for a root filesystem on RAID.

My experience with RAIDframe began with a project needing an affordable but powerful BSD-based web and database server. After coming across the NetBSD RAIDframe page, which mentions such things as "Support for root filesystems on any RAID level" and "The driver has currently been tested on i386 (very heavily tested)", I decided it would be the right tool for the job.

Some notable features include:

- Root filesystem on RAID

- RAID levels 0, 1, 4, and 5

  RAID is an acronym for "Redundant Array of Inexpensive Disks". The purpose of RAID is both to join more than one disk into a single logical device and to prevent a single physical disk failure from causing the entire logical device to fail.

  RAID 0 is also known as striping. It simply consists of joining two or more disks into a larger logical device. It does not provide any redundancy by itself, but it is commonly used to combine RAID sets utilizing other RAID levels, such as RAID 1 or RAID 5. A striped set of RAID 1 sets is known as RAID 0+1 or RAID 10. Likewise, a striped set of RAID 5 sets is known as RAID 0+5 or RAID 50.

  RAID 1 is also known as mirroring. Two disks are combined -- each contains the same data. If one disk fails, the other continues on and the logical device remains available.

  RAID 4 combines three or more disks into a single device, with one of the disks dedicated to parity. If one disk fails, the logical device will remain available, but with degraded performance. This RAID level is not commonly used because the single parity disk introduces a considerable bottleneck for writes.

  RAID 5 combines three or more disks into a single device, with the parity distributed across each drive in the set. If one disk fails, the logical device will remain available, but with degraded performance.

- Hot spares

  Hot spares are standby disks which take over for a failed disk in a RAID 1, 4, or 5 set. In the event of a single disk failure, the data that was on the failed disk is rebuilt from the remaining disks onto the hot spare. Once the rebuild is complete, the RAID functions normally with no degradation.

- Device independence (eg., not tied to only SCSI)

- No restrictions on RAID combinations, allowing, for instance, striped RAID 5

- On-demand failing of disks

  This allows for experimentation with failure events without physically disconnecting or destroying drives.

- On-demand parity regeneration

  Once a failed drive has been replaced with a new one, the RAID can be restored to normal operation while the system is running.

- On-demand parity reconstruction/copyback

  Once a failed drive has been replaced with a new one, the RAID can be restored to its original configuration by copying the data from a hot spare back to the disk that was replaced. This can be done while the system is running.

- Hot-adding of spare disks

  Hot spares can be added while the system is running.

RAID levels 0, 1, 4, and 5 have received a lot of testing.
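As a rough illustration of the capacity trade-offs (my own arithmetic, not taken from the RAIDframe documentation): with N disks of equal size S, RAID 0 yields N x S of usable space, RAID 1 yields S, and RAID 4 and RAID 5 yield (N - 1) x S, since the equivalent of one disk is consumed by parity. Three 18GB disks in a RAID 5, for example, give roughly 2 x 18GB = 36GB of usable space.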
Requirements

OS Version: NetBSD 1.5 is required for a root filesystem on RAID. (RAID is available in other NetBSD versions, but support for mounting the root partition from a RAID device is not present in earlier versions. The difference is autoconfiguration -- the kernel joins the components together into a RAID device before attempting to mount root.)

Note: As of this writing the latest formal release of NetBSD is version 1.4.2. NetBSD 1.5 is still in pre-release form. It can be found, along with installation instructions, at ftp://ftp.netbsd.org/pub/NetBSD/NetBSD-1.5_ALPHA/

Hardware: A system using either the i386 or alpha architecture is currently recommended for production use, since those have been the most heavily tested. Of course, the maintainer would be happy to hear your experiences on other ports and help out if necessary.

A minimum of two hard drives is required for RAID 0 (stripe) and RAID 1 (mirror), a minimum of three drives for RAID 4 (dedicated parity) and RAID 5 (distributed parity), and a minimum of four drives for RAID 10 (striped mirror). IDE, SCSI, and even Fibre Channel may be used, since RAIDframe is device- and bus-independent. Using only one disk per channel is highly recommended for IDE configurations, since IDE allows only one outstanding command per channel at a time.

Boot partition: Since the NetBSD bootloader does not currently know about RAIDframe, the kernel must reside on a non-RAID partition.

This article will cover the installation and configuration of NetBSD to use a root partition on RAID on the i386 architecture.

Getting Started

Of course, the first thing necessary for running RAIDframe on NetBSD is a machine running NetBSD. Being the conservative type, I've chosen to keep non-RAID partitions to provide a full system that can be used for emergency maintenance. This will also be the location of the kernel. However, if desired, the kernel can instead be booted from such devices as a floppy disk or CD-ROM.

I'll describe the installation using my own setup: an Athlon 850 with 256MB RAM, an Adaptec 29160 SCSI controller and three IBM 18GB Ultra160 SCSI hard drives. I've chosen to use 400MB for the non-RAID base system and have dedicated the rest to two RAID devices: a RAID 1 with a hot spare for swap, and a RAID 5 for the rest of the system. Giving the non-RAID system 400MB strikes a nice balance between not taking too much space from the RAID and having enough room for a base installation combined with the kernel source and pkgsrc.

The reason the swap partition gets its own RAID device is that NetBSD does not currently free the underlying device when a swap partition is turned off. This means that each time the system is booted, parity on any RAID device hosting a swap partition must be reconstructed. It's much more desirable for the system to spend under a minute rebuilding parity on a relatively small swap-sized RAID device than for it to rebuild parity on a multi-gigabyte filesystem.

1 - Install the base system

Install NetBSD version 1.5 according to its documentation, giving NetBSD the entire first disk. When specifying partitions with disklabel, choose a custom layout, with a 400MB partition for / (root) on the 'a' partition and '0' for each of the rest when prompted. Since this will be only the bootstrap and maintenance system, I've chosen to forego swap space, opting for it to reside on a RAID device instead.

At the distribution sets prompt, choose a distribution with the sets 'Kernel', 'Base', 'System (/etc)', 'Compiler', 'Manuals', 'Miscellaneous' and 'Text tools'.
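Before building the RAID-aware kernel, it may help to keep the target layout in mind. This is just my own summary of the plan described above; the exact partition sizes are worked out in step 3.

           sd0                        sd1                    sd2
  400MB    non-RAID / (boot and       unused (the label      unused (the label
           maintenance system)        is copied from sd0)    is copied from sd0)
  ~16GB    raid0 component            raid0 component        raid0 component
  ~512MB   raid1 component            raid1 component        raid1 hot spare

  raid0 (RAID 5 over sd0e, sd1e and sd2e) will hold the new root filesystem.
  raid1 (RAID 1 over sd0f and sd1f, with sd2f as a hot spare) will hold swap.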
2 - Build and install a RAIDframe-enabled kernel

Once the base installation is complete and the system has had its first boot, install the kernel sources. It's a good idea at this point to back up the installation kernel, /netbsd, to /netbsd-INSTALL. Now, build and install a custom kernel to match your system, with the following options added:

  pseudo-device raid 4    # RAIDframe disk device
  options RAID_AUTOCONFIG

The "pseudo-device raid 4" line includes support for the raid driver, with up to 4 raid devices. The "options RAID_AUTOCONFIG" line turns on support for component auto-detection and auto-configuration, and is necessary for root filesystem support.

Disk devices should be configured so that each NetBSD device name is tied to a specific SCSI ID, so disks don't get mixed up in the event of a failure. This isn't strictly necessary on RAID sets using autoconfiguration, since the RAIDframe component labels will distinguish one drive from another, but it is very helpful in keeping the physical disks and the devices NetBSD assigns to them in sync.

Configure scsibus0 to use the specified Adaptec controller (in this case 'ahc?'):

  scsibus0 at ahc?

And the devices attached to it:

  sd0 at scsibus0 target 0 lun ?    # SCSI disk drives
  sd1 at scsibus0 target 1 lun ?    # SCSI disk drives
  sd2 at scsibus0 target 2 lun ?    # SCSI disk drives

With the RAID-enabled kernel compiled and installed, reboot the system to use it:

  # shutdown -r now

If the boot fails, tell the system to boot the installation kernel at the boot prompt, to give the kernel configuration and compile another shot:

  boot sd0a:netbsd-INSTALL

3 - Re-label the drives that will be hosting the RAID devices

After the initial system installation, the partition portion of the disk label (as shown by disklabel sd0) looks like:

  #        size   offset    fstype   [fsize bsize   cpg]
    a:   819957       63    4.2BSD     1024  8192    16   # (Cyl.    0*- 345)
    c: 35843607       63    unused        0     0         # (Cyl.    0*- 15123*)
    d: 35843670        0    unused        0     0         # (Cyl.    0 - 15123*)

Create a prototype file for the disks that will be used for the RAID components:

  disklabel sd0 > /root/disklabel.sd0

Edit the prototype file to add the RAID partitions (the 'e' and 'f' partitions), using the remaining space on the device. In this case, I've decided on roughly 512MB of swap, so I've given the swap component (the 'f' partition) 1050672 sectors -- a little over 512MB, to leave some room for overhead -- and the main RAID component (the 'e' partition) the rest of the unused space on the disk, 33972978 sectors. Note that sectors are 512 byte blocks, so sector counts read as twice what the same sizes specified in kilobytes would.

    a:   819957       63    4.2BSD     1024  8192    16   # (Cyl.    0*- 345)
    c: 35843607       63    unused        0     0         # (Cyl.    0*- 15123*)
    d: 35843670        0    unused        0     0         # (Cyl.    0 - 15123*)
    e: 33972978   820020      RAID                        #
    f:  1050672 34792998      RAID                        #

I used the following method to end up with these numbers. First, I noted the sizes that are already known: the root partition is 819957 sectors, and it starts at offset 63, just after the sectors reserved for the MBR/partition table. Adding those together gives the offset of partition 'e': 63 + 819957 = 820020. Next, the swap component: 512MB is 524288 kilobytes, and multiplying by 2 to convert kilobytes into 512 byte sectors gives 1048576 sectors; I rounded that up slightly, to 1050672 sectors, to leave extra room for overhead. Partition 'e' then gets everything that remains: the total size of the drive, 35843670 sectors (from the 'd' partition), minus 63 (MBR/partition table), minus 819957 (the 'a' partition), minus 1050672 (swap), which leaves 33972978 sectors. Finally, the offset of partition 'f' is the offset of 'e' plus its size: 820020 + 33972978 = 34792998.
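If you'd rather not do that arithmetic by hand, expr(1) makes for a quick double-check (just a convenience of my own; any calculator will do):

  expr 63 + 819957                        # 820020   -> offset of 'e'
  expr 35843670 - 63 - 819957 - 1050672   # 33972978 -> size of 'e'
  expr 820020 + 33972978                  # 34792998 -> offset of 'f'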
Save the modified prototype file and make copies for the other devices:

  cp /root/disklabel.sd0 /root/disklabel.sd1
  cp /root/disklabel.sd0 /root/disklabel.sd2

Edit the two new copies to change the 'disk:' entry in the top portion of the label:

  # /dev/rsd0d:
  type: unknown
  disk: mydisk
  label:
  flags:

In /root/disklabel.sd1:

  disk: mydisk1

and in /root/disklabel.sd2:

  disk: mydisk2

Then, apply the labels:

  disklabel -R -r sd0 /root/disklabel.sd0
  disklabel -R -r sd1 /root/disklabel.sd1
  disklabel -R -r sd2 /root/disklabel.sd2

Whenever the disklabel is changed on a boot device, it is necessary to reinstall the boot sector, since writing the label clears it, rendering the system unbootable. To reinstall the boot sector, run installboot like so:

  /usr/mdec/installboot /usr/mdec/biosboot.sym /dev/rsd0a

4 - Create the configuration file for the RAID 5 set

From the raidctl(8) manpage:

There are 4 required sections of a configuration file, and 2 optional sections. Each section begins with a `START', followed by the section name, and the configuration parameters associated with that section. The first section is the `array' section, and it specifies the number of rows, columns, and spare disks in the RAID set. For example:

  START array
  1 3 0

indicates an array with 1 row, 3 columns, and 0 spare disks. Note that although multi-dimensional arrays may be specified, they are NOT supported in the driver.

The example in the manpage matches the requirements of creating a RAID 5, so I'll keep it unchanged.

The second section, the `disks' section, specifies the actual components of the device. For example:

  START disks
  /dev/sd0e
  /dev/sd1e
  /dev/sd2e

Note that it is imperative that the order of the components in the configuration file does not change between configurations of a RAID device.

These are the devices I've configured with disklabel, so I'll also keep this portion unchanged.

The next section, which is the `spare' section, is optional, and, if present, specifies the devices to be used as `hot spares' -- devices which are on-line, but are not actively used by the RAID driver unless one of the main components fail. A simple `spare' section might be:

  START spare
  /dev/sd3e

for a configuration with a single spare component. If no spare drives are to be used in the configuration, then the `spare' section may be omitted.

Since only three components are being used for this RAID device, this section will be omitted.

The next section is the `layout' section. This section describes the general layout parameters for the RAID device, and provides such information as sectors per stripe unit, stripe units per parity unit, stripe units per reconstruction unit, and the parity configuration to use.
This section might look like:

  START layout
  # sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level
  32 1 1 5

The sectors per stripe unit specifies, in blocks, the interleave factor; i.e., the number of contiguous sectors to be written to each component for a single stripe. Appropriate selection of this value (32 in this example) is the subject of much research in RAID architectures. The stripe units per parity unit and stripe units per reconstruction unit are normally each set to 1. While certain values above 1 are permitted, a discussion of valid values and the consequences of using anything other than 1 are outside the scope of this document. The last value in this section (5 in this example) indicates the parity configuration desired. Valid entries include:

  0 - RAID level 0. No parity, only simple striping.
  1 - RAID level 1. Mirroring.
  4 - RAID level 4. Striping across components, with parity stored on the last component.
  5 - RAID level 5. Striping across components, parity distributed across all components.

I'll use the example given here, since it matches my configuration. However, be sure to have a look at the stripe unit size performance analysis linked at the end of the article for a nice explanation and graph of the performance differences among the options.

The next required section is the `queue' section. This is most often specified as:

  START queue
  fifo 100

where the queuing method is specified as fifo (first-in, first-out), and the size of the per-component queue is limited to 100 requests. Other queuing methods may also be specified, but a discussion of them is beyond the scope of this document.

Rather than venturing into uncharted territory, this one can remain unchanged.

The final section, the `debug' section, is optional. For more details on this the reader is referred to the RAIDframe documentation discussed in the HISTORY section.

And, debug can be left out in the "we'll cross that bridge when we come to it" fashion. So, what remains is:

  START array
  1 3 0

  START disks
  /dev/sd0e
  /dev/sd1e
  /dev/sd2e

  START layout
  # sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level
  32 1 1 5

  START queue
  fifo 100

Now, save this into the file /etc/raid0.conf. Note that the raid0 in its name refers to raid device 0 rather than RAID level 0. The next RAID device will use /etc/raid1.conf, and more would continue in sequence as /etc/raid2.conf, /etc/raid3.conf, and so on.

5 - Initialize the raid0 device

First, raid0 needs to be configured. The -C option forces the configuration, which is what's needed here since the set is being newly created:

  raidctl -C /etc/raid0.conf raid0

This completes almost instantly, but it causes the kernel to print about a pageful of text, obscuring the shell prompt. Rather than waiting, just hit Enter to get a fresh prompt.

The next step is to initialize the component labels, using a numeric identifier unique to this RAID device:

  raidctl -I 123456 raid0

Now, initialize parity, with -v to display a status meter:

  raidctl -iv raid0

You'll be treated to a display along these lines:

  Initiating re-write of parity
  Parity Re-write status:
   6% |**                                     | ETA: 21:43 \

When the parity initialization is complete, the RAID is ready to use as a disk device.
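Before moving on to label the new device, it doesn't hurt to confirm that the set looks healthy -- a habit of mine rather than a required step:

  raidctl -s raid0    # each component should be listed as 'optimal'
  raidctl -p raid0    # the parity status should be reported as clean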
6 - Partition the raid0 device

To make the RAID ready for a filesystem (or to be used as a component of another RAID), it must be labeled. In this case, raid0 will hold one large root filesystem; the 512MB of swap (twice physical memory) will live on raid1, which is set up in step 9.

To create the prototype:

  disklabel raid0 > /root/disklabel.raid0

And edit the partition portion:

  #        size   offset    fstype   [fsize bsize   cpg]
    d: 67945792        0    4.2BSD        0     0     0   # (Cyl.    0 - 1061652)

to become:

  #        size   offset    fstype   [fsize bsize   cpg]
    a: 67945792        0    4.2BSD     1024  8192    16   # (Cyl.    0 - 1061652)
    d: 67945792        0    unused        0     0     0   # (Cyl.    0 - 1061652)

7 - Create the filesystem

  newfs /dev/rraid0a

It may be necessary to adjust the -c option of newfs. In this case the default, 16, caused newfs to refuse to create the filesystem, so I used:

  newfs -c 96 /dev/rraid0a

8 - Copy the base system to the RAID device

First, mount the RAID's root partition on /mnt:

  mount /dev/raid0a /mnt

Then, use a tar pipe to copy the installed root filesystem over to the RAID. The 'l' (ell) switch tells tar not to traverse filesystem boundaries, and 'p' to preserve permissions.

  ( cd / ; tar lpcf - . ) | ( cd /mnt ; tar xpf - )

9 - Configure the raid1 set

This is performed in much the same fashion as the configuration of raid0. First, the /etc/raid1.conf file must be created. Since this is a RAID 1 set with a hot spare, the "START array" section will need to specify 2 components and 1 hot spare:

  START array
  1 2 1

The 'f' partitions are used for the components of the RAID device:

  START disks
  /dev/sd0f
  /dev/sd1f

With the hot spare listed separately:

  START spare
  /dev/sd2f

The layout section needs '1' as the last parameter, which specifies the RAID level:

  START layout
  # sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level
  32 1 1 1

  START queue
  fifo 100

Thus, the completed /etc/raid1.conf is:

  START array
  1 2 1

  START disks
  /dev/sd0f
  /dev/sd1f

  START spare
  /dev/sd2f

  START layout
  32 1 1 1

  START queue
  fifo 100

Now, configure raid1 using its configuration file:

  raidctl -C /etc/raid1.conf raid1

Next, initialize the component labels, using a different number than the previous RAID set:

  raidctl -I 123457 raid1

Initialize parity:

  raidctl -iv raid1

Create a prototype disklabel to use for raid1:

  disklabel raid1 > /root/disklabel.raid1

Edit the partition section of the prototype:

  #        size   offset    fstype   [fsize bsize   cpg]
    d:  1050560        0    4.2BSD        0     0     0   # (Cyl.    0 - 32829)

specifying a swap partition:

  #        size   offset    fstype   [fsize bsize   cpg]
    b:  1050560        0      swap        0     0     0   # (Cyl.    0 - 32829)
    d:  1050560        0    4.2BSD        0     0     0   # (Cyl.    0 - 32829)

Then, apply the label:

  disklabel -R -r raid1 /root/disklabel.raid1

The raid1 set is now ready for /dev/raid1b to be used as a swap device.

10 - Edit the RAID 5's /etc/fstab file

At boot time, the /etc/fstab file is read from the filesystem mounted as root; it tells the system which filesystems to mount and which swap devices to enable. Since raid0 will be hosting the root partition and raid1 will be hosting the swap partition, the /etc/fstab on raid0 will need to be modified. With raid0's root partition currently mounted on /mnt, the file to edit is /mnt/etc/fstab.

The original:

  /dev/sd0a    /        ffs     rw  1 1
  /dev/sd0b    none     swap    sw  0 0
  /kern        /kern    kernfs  rw

becomes:

  /dev/raid0a  /        ffs     rw  1 1
  /dev/raid1b  none     swap    sw  0 0
  /kern        /kern    kernfs  rw
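At this point, a quick check that the copy and the fstab edit came out as intended can save some head-scratching on the first boot from RAID (this is just a precaution of my own, not a required step):

  cat /mnt/etc/fstab                   # should now reference /dev/raid0a and /dev/raid1b
  ls /mnt/dev/raid0a /mnt/dev/raid1b   # the raid device nodes should have come over with the tar copy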
11 - Turn on autoconfiguration

Make raid0 autoconfigurable and eligible to be the root partition. This overrides using the partition the kernel was booted from as the root device:

  raidctl -A root raid0

And make raid1 autoconfigurable, without root partition eligibility:

  raidctl -A yes raid1

12 - Reboot and enjoy

  shutdown -r now

Once the system has come up, log in and verify that the root filesystem is indeed on the raid0 device:

  $ df -k
  Filesystem   1K-blocks     Used     Avail Capacity  Mounted on
  /dev/raid0a   33286204   301047  31320846     0%    /
  kernfs               1        1         0   100%    /kern

And that swap is on the raid1 device:

  $ swapctl -l
  Device      512-blocks     Used    Avail Capacity  Priority
  /dev/raid1b    1050560        0  1050560     0%    0

General Operations

The administration of a RAID subsystem after setup is primarily done using the raidctl command. The status of a RAID device can be seen with the following command:

  raidctl -s raid0

For a healthy RAID set, it yields output such as:

  Components:
             /dev/sd0e: optimal
             /dev/sd1e: optimal
             /dev/sd2e: optimal
  No spares.
  Component label for /dev/sd0e:
     Row: 0 Column: 0 Num Rows: 1 Num Columns: 3
     Version: 2 Serial Number: 123456 Mod Counter: 707
     Clean: No Status: 0
     sectPerSU: 32 SUsPerPU: 1 SUsPerRU: 1
     RAID Level: 5 blocksize: 512 numBlocks: 35023552
     Autoconfig: Yes
     Root partition: Yes
     Last configured as: raid0
  ...
  Parity status: clean
  Reconstruction is 100% complete.
  Parity Re-write is 100% complete.
  Copyback is 100% complete.

For a RAID set with one failed component:

  Components:
             /dev/sd0e: optimal
             /dev/sd1e: failed
             /dev/sd2e: optimal
  ...

For a RAID set reconstructing after replacement of a failed component:

  Components:
             /dev/sd0e: optimal
             /dev/sd1e: reconstructing
             /dev/sd2e: optimal
  ...
  Parity status: clean
  Reconstruction is 7% complete.
  Parity Re-write is 100% complete.
  Copyback is 100% complete.

For a RAID set brought on-line after an unclean shutdown:

  Components:
             /dev/sd0e: optimal
             /dev/sd1e: optimal
             /dev/sd2e: optimal
  No spares.
  ...
  Parity status: DIRTY
  Reconstruction is 100% complete.
  Parity Re-write is 2% complete.
  Copyback is 100% complete.

To see only the Parity, Reconstruction, Parity Re-write and Copyback status:

  raidctl -S raid0

If a maintenance operation is in progress on one of these, a progress meter will also be attached to the terminal. The progress meter can be safely exited with control-c without interrupting the task. Output for a RAID set with dirty parity (eg., from an unclean shutdown) during a parity rebuild:

  Reconstruction is 100% complete.
  Parity Re-write is 9% complete.
  Copyback is 100% complete.
  Parity Re-write status:
   27% |**********                             | ETA: 19:25 \

In the event of a failure of a hot-swappable SCSI device, after replacing the component which failed, the new disk can be detected by executing:

  scsictl scsibus0 scan any any

Note that the replacement disk must be labeled before it can be reconstructed to -- eg., using the disklabel files from installation time:

  disklabel -R -r sd1 /root/disklabel.sd1
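One caveat worth repeating from the installation steps: if the disk being replaced is the boot disk (sd0 in this setup), writing the label clears the boot sector, so reinstall it before relying on the machine to reboot:

  /usr/mdec/installboot /usr/mdec/biosboot.sym /dev/rsd0a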
To rebuild onto a replaced component in place (without copying from a hot spare):

  raidctl -R /dev/sd1e raid0

To copy data back to a replaced component from the hot spare which took its place (in this setup, raid1 is the set with a hot spare):

  raidctl -B raid1

Booting with a non-RAID root partition for emergency maintenance (eg., to reset a forgotten root password, or to restore an important shared library after accidental deletion):

  boot -a

(This asks for the device to be mounted as root, instead of automatically choosing the root partition of the RAID.)

Unconfiguring a RAID set to start over again from scratch: first, make sure the device is unmounted, then clear the first part of it to get rid of the disklabel:

  dd if=/dev/zero of=/dev/rraid0d count=512

Then, unconfigure the device:

  raidctl -u raid0

Now you're free to create a new raid0.conf and reinitialize, unfettered by old configurations.
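Finally, RAIDframe's on-demand failing of disks (mentioned in the feature list at the top of this article) makes it possible to rehearse a failure before one happens for real. A rough sketch, using the raid1 set since it is the one with a hot spare:

  raidctl -F /dev/sd1f raid1   # mark sd1f as failed and begin reconstruction
                               # onto the hot spare, /dev/sd2f
  raidctl -S raid1             # watch the reconstruction progress
  raidctl -s raid1             # component status afterwards

Once you're satisfied, the set can be returned to its original layout with the relabel, rebuild and copyback commands shown above.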