HDFS RAID

Overview

The HDFS RAID module provides a DistributedRaidFileSystem (DRFS) that is used along with an instance of the Hadoop DistributedFileSystem (DFS). A file stored in the DRFS (the source file) is divided into stripes, each consisting of several blocks. For each stripe, a number of parity blocks are stored in the parity file corresponding to the source file. This makes it possible to recompute blocks of the source file or the parity file when they are lost or corrupted.
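
As a rough illustration of this layout, the following self-contained Java sketch groups a file's blocks into stripes and counts the resulting parity blocks. The stripe length of 10 source blocks and the 4 parity blocks per stripe are assumptions chosen for the example only; the actual values come from the RAID configuration.

// Illustrative only: stripe length and parity count are example assumptions,
// not values mandated by HDFS RAID.
public class StripeLayoutExample {
  public static void main(String[] args) {
    int sourceBlocks = 23;     // number of blocks in the source file
    int stripeLength = 10;     // assumed source blocks per stripe
    int parityPerStripe = 4;   // assumed parity blocks per stripe

    int stripes = (sourceBlocks + stripeLength - 1) / stripeLength; // round up
    for (int s = 0; s < stripes; s++) {
      int first = s * stripeLength;
      int last = Math.min(first + stripeLength, sourceBlocks) - 1;
      System.out.printf("stripe %d: source blocks %d-%d -> %d parity blocks%n",
          s, first, last, parityPerStripe);
    }
    System.out.println("parity file holds " + (stripes * parityPerStripe) + " blocks");
  }
}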

The main benefit of the DRFS is the increased protection against data corruption it provides. Because of this increased protection, replication levels can be lowered while maintaining the same availability guarantees, which results in significant storage space savings.

Architecture and implementation

HDFS Raid consists of several software components:

DRFS client

The DRFS client is implemented as a layer on top of the DFS client that intercepts all incoming calls and passes them on to the underlying client. Whenever the underlying DFS throws a ChecksumException or a BlockMissingException (because the source file contains corrupt or missing blocks), the DRFS client catches the exception, locates the parity file for the affected source file, and recomputes the missing blocks before returning them to the application.

It is important to note that while the DRFS client recomputes missing blocks when reading corrupt files, it does not insert these blocks back into the file system; it discards them once the application request has been fulfilled. The BlockFixer daemon and the RaidShell tool can be used to persistently fix bad blocks.
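
Conceptually, the read path is a thin wrapper around the ordinary client: attempt the normal read and, only on failure, reconstruct the requested block from the surviving blocks of its stripe plus parity. The sketch below models that behaviour in plain Java; BlockStore, ParityDecoder and MissingBlockException are hypothetical stand-ins, not Hadoop APIs, and the recovered block is returned to the caller but never written back.

// Conceptual sketch of the DRFS client's fallback-on-read behaviour.
// All types and methods here are hypothetical stand-ins, not Hadoop APIs.
public class RaidReadSketch {

  static class MissingBlockException extends Exception {}

  interface BlockStore {
    byte[] readBlock(String file, int blockIndex) throws MissingBlockException;
  }

  interface ParityDecoder {
    // Rebuilds one block from the surviving blocks of the stripe plus parity.
    byte[] reconstruct(String file, int blockIndex);
  }

  private final BlockStore dfs;
  private final ParityDecoder decoder;

  RaidReadSketch(BlockStore dfs, ParityDecoder decoder) {
    this.dfs = dfs;
    this.decoder = decoder;
  }

  byte[] read(String file, int blockIndex) {
    try {
      return dfs.readBlock(file, blockIndex);       // normal read path
    } catch (MissingBlockException e) {
      // Fallback: recompute the block from parity; the result is handed to
      // the application but not inserted back into the file system.
      return decoder.reconstruct(file, blockIndex);
    }
  }
}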

RaidNode

The RaidNode periodically scans all paths for which the configuration specifies that they should be stored in the DRFS. For each path, it recursively inspects all files that have more than 2 blocks and selects those that have not been modified recently (by default, not within the last 24 hours). Once it has selected a source file, it iterates over all of its stripes and creates the appropriate number of parity blocks for each stripe. The parity blocks are then concatenated and stored as the parity file corresponding to this source file. Once the parity file has been created, the replication factor of the source file is lowered as specified in the configuration. The RaidNode also periodically deletes parity files that have become orphaned or outdated.
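
The selection rule described above can be summarised as a small predicate. The following sketch is illustrative only; FileInfo is a hypothetical stand-in for the file metadata HDFS actually provides.

// Sketch of the file-selection rule used when scanning a configured path.
// FileInfo is a hypothetical stand-in, not a Hadoop class.
public class RaidSelectionSketch {

  static class FileInfo {
    long blockCount;           // number of blocks in the file
    long modificationTimeMs;   // last modification time (epoch millis)
  }

  static final long MOD_TIME_WINDOW_MS = 24L * 60 * 60 * 1000; // default: 24 hours

  // A file is a candidate for raiding if it has more than 2 blocks and
  // has not been modified within the configured time window.
  static boolean shouldRaid(FileInfo f, long nowMs) {
    return f.blockCount > 2
        && (nowMs - f.modificationTimeMs) > MOD_TIME_WINDOW_MS;
  }
}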

There are currently two implementations of the RaidNode: the LocalRaidNode, which computes parity blocks locally at the RaidNode, and the DistRaidNode, which distributes the computation of parity blocks across the cluster using MapReduce jobs.

BlockFixer

The BlockFixer is a daemon that runs on the RaidNode and periodically inspects the health of the paths for which DRFS is configured. When a file with missing or corrupt blocks is encountered, these blocks are recomputed and inserted back into the file system.
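
The difference from the DRFS client is that the BlockFixer persists what it reconstructs. Below is a minimal sketch of that behaviour, again with hypothetical stand-in types rather than Hadoop APIs.

// Sketch of the BlockFixer's behaviour: reconstruct a missing block and
// write it back, so the repair is persistent. Types are hypothetical stand-ins.
public class BlockFixerSketch {

  interface BlockStore {
    boolean isMissing(String file, int blockIndex);
    void writeBlock(String file, int blockIndex, byte[] data);
  }

  interface ParityDecoder {
    byte[] reconstruct(String file, int blockIndex);
  }

  static void fixFile(String file, int blockCount, BlockStore store, ParityDecoder decoder) {
    for (int i = 0; i < blockCount; i++) {
      if (store.isMissing(file, i)) {
        byte[] recovered = decoder.reconstruct(file, i);
        store.writeBlock(file, i, recovered);  // persisted, unlike the DRFS client path
      }
    }
  }
}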

There are two implementations of the BlockFixer: the LocalBlockFixer, which recomputes corrupt blocks locally at the RaidNode, and the DistBlockFixer, which dispatches the recomputation of corrupt blocks to the cluster as MapReduce jobs.

RaidShell

The RaidShell is a tool that allows the administrator to maintain and inspect a DRFS. It supports commands for manually triggering the recomputation of bad data blocks and also allows the administrator to display a list of irrecoverable files (i.e., files for which too many data or parity blocks have been lost).

To validate the integrity of a file system, run RaidFSCK as follows:

$HADOOP_HOME/bin/hadoop org.apache.hadoop.raid.RaidShell -fsck [path]

This will print a list of corrupt files (i.e., files which have lost too many blocks and can no longer be fixed by Raid).

ErasureCode

(currently under development)

ErasureCode is the underlying component used by the BlockFixer and the RaidNode to generate parity blocks and to fix parity or source blocks. ErasureCode performs encoding and decoding. When encoding, ErasureCode takes several source bytes and generates a number of parity bytes. When decoding, ErasureCode regenerates the missing bytes (which can be parity or source bytes) from the remaining source and parity bytes.

The number of missing bytes that can be recovered is equal to the number of parity bytes created. For example, if we encode 10 source bytes into 3 parity bytes, we can recover any 3 missing bytes from the remaining 10 bytes.
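
The simplest case is the XOR code discussed below: one parity byte is the XOR of all source bytes in a group, so exactly one missing byte can be recovered by XOR-ing the parity byte with the surviving bytes. The following self-contained Java sketch demonstrates the idea; it is not the actual ErasureCode class.

// Minimal XOR erasure code: one parity byte per group of source bytes,
// so exactly one missing byte can be recovered. Not the actual ErasureCode class.
public class XorCodeExample {

  static byte encode(byte[] source) {
    byte parity = 0;
    for (byte b : source) {
      parity ^= b;
    }
    return parity;
  }

  // Recover the byte at missingIndex from the parity byte and the surviving source bytes.
  static byte decode(byte[] source, int missingIndex, byte parity) {
    byte recovered = parity;
    for (int i = 0; i < source.length; i++) {
      if (i != missingIndex) {
        recovered ^= source[i];
      }
    }
    return recovered;
  }

  public static void main(String[] args) {
    byte[] source = {10, 20, 30, 40};
    byte parity = encode(source);
    byte lost = source[2];                     // pretend source[2] is lost
    byte recovered = decode(source, 2, parity);
    System.out.println("recovered " + recovered + " (expected " + lost + ")");
  }
}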

There are two kinds of erasure codes implemented in Raid: the XOR code and the Reed-Solomon code. The difference between them is that the XOR code can create only a single parity byte, whereas the Reed-Solomon code can create any given number of parity bytes. As a result, the replication factor of the source file can be reduced to 1 when using Reed-Solomon without losing data safety. The downside of having only one replica of a block is that all reads of that block have to go to a single machine, reducing read parallelism. Reed-Solomon should therefore be used for data that is accessed infrequently.
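
To see where the storage savings come from, the sketch below compares effective storage overheads. The stripe length of 10, the single XOR parity block at replication 2, and the 4 Reed-Solomon parity blocks at replication 1 are assumptions chosen for illustration; the real values depend on the configuration.

// Illustrative storage-overhead comparison; all stripe and parity parameters
// here are example assumptions, not values fixed by HDFS RAID.
public class StorageOverheadExample {
  public static void main(String[] args) {
    int stripeLength = 10;     // assumed source blocks per stripe

    // Plain HDFS with 3-way replication: every block is stored three times.
    double plainHdfs = 3.0;

    // XOR code, assumed source replication 2 and one parity block per stripe
    // (also replicated twice): (2*10 source + 2*1 parity) / 10 source.
    double xor = (2.0 * stripeLength + 2.0) / stripeLength;

    // Reed-Solomon, assumed 4 parity blocks per stripe and source replication
    // reduced to 1: (10 source + 4 parity) / 10 source.
    double reedSolomon = (stripeLength + 4.0) / stripeLength;

    System.out.printf("3-way replication:           %.1fx%n", plainHdfs);
    System.out.printf("XOR, replication 2:          %.1fx%n", xor);
    System.out.printf("Reed-Solomon, replication 1: %.1fx%n", reedSolomon);
  }
}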

Using HDFS RAID

Installation

The entire code is packaged in a single jar file named hadoop-*-raid.jar. To use HDFS Raid, you need to put this jar on Hadoop's CLASSPATH. The easiest way to do this is to copy hadoop-*-raid.jar from $HADOOP_HOME/build/contrib/raid to $HADOOP_HOME/lib. Alternatively, you can modify $HADOOP_CLASSPATH (defined in conf/hadoop-env.sh) to include this jar.

Configuration

There is a single configuration file named raid.xml that describes the HDFS paths for which RAID should be used. It provides a list of directory/file patterns that need to be RAIDed, and quite a few options can be specified for each pattern. A sample of this file can be found in src/contrib/raid/conf/raid.xml. To apply the policies defined in raid.xml, a reference has to be added to hdfs-site.xml:

<property>
  <name>raid.config.file</name>
  <value>/mnt/hdfs/DFS/conf/raid.xml</value>
  <description>This is needed by the RaidNode </description>
</property>

In order to get clients to use the DRFS client (which transparently fixes bad blocks), add the following configuration property to hdfs-site.xml:

<property>
  <name>fs.hdfs.impl</name>
  <value>org.apache.hadoop.dfs.DistributedRaidFileSystem</value>
  <description>The FileSystem for hdfs: uris.</description>
</property>

Additional (optional) configuration

The following properties can be set in hdfs-site.xml to further tune the DRFS configuration:

Running DRFS

The DRFS provides support for administration at runtime without any downtime to cluster services. Paths to be raided can be added or removed without interrupting any load on the cluster. Changes to raid.xml are detected periodically (every few seconds) and new policies are applied immediately.

Designate one machine in your cluster to run the RaidNode software. You can run this daemon on any machine, irrespective of whether that machine is running any other Hadoop daemon. You can start the RaidNode by running the following on the selected machine:

nohup $HADOOP_HOME/bin/hadoop org.apache.hadoop.raid.RaidNode >> /xxx/logs/hadoop-root-raidnode-hadoop.xxx.com.log &

We also provide two scripts to start and stop the RaidNode more easily. Copy the scripts start-raidnode.sh and stop-raidnode.sh to the directory $HADOOP_HOME/bin on the machine where the RaidNode is to be deployed; you can then start or stop the RaidNode by calling these scripts directly on that machine. To deploy the RaidNode remotely, copy start-raidnode-remote.sh and stop-raidnode-remote.sh to $HADOOP_HOME/bin on the machine from which you want to trigger the remote deployment, and create a text file $HADOOP_HOME/conf/raidnode on the same machine containing the name of the machine where the RaidNode should be deployed. These scripts ssh to the specified machine and invoke start-raidnode.sh or stop-raidnode.sh there.

For easy maintenance, you might want to change start-mapred.sh on the JobTracker machine so that it automatically calls start-raidnode-remote.sh (and make a similar change to stop-mapred.sh so that it calls stop-raidnode-remote.sh).

To monitor the health of a DRFS, use the fsck command provided by the RaidShell.
