The HDFS RAID module provides a DistributedRaidFileSystem (DRFS) that is used along with an instance of the Hadoop DistributedFileSystem (DFS). A file stored in the DRFS (the source file) is divided into stripes consisting of several blocks. For each stripe, a number of parity blocks are stored in the parity file corresponding to this source file. This makes it possible to recompute blocks in the source file or parity file when they are lost or corrupted.
The main benefit of the DRFS is the increased protection against data corruption it provides. Because of this increased protection, replication levels can be lowered while maintaining the same availability guarantees, which results in significant storage space savings.
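The saving can be made concrete with a quick back-of-envelope calculation. The sketch below uses illustrative replication factors and stripe sizes, not values taken from this document:

```python
# Illustrative back-of-envelope calculation (example values, not Hadoop
# defaults): the effective bytes stored per source byte is the source
# replication plus the amortized cost of the replicated parity blocks.
def effective_overhead(src_replication, stripe_length,
                       parity_blocks_per_stripe, parity_replication):
    return src_replication + (parity_replication *
                              parity_blocks_per_stripe / stripe_length)

# Plain HDFS with 3x replication: 3.0x raw storage per source byte.
print(effective_overhead(3, 10, 0, 0))   # 3.0

# With raid: source replication lowered to 2, plus one 2x-replicated
# parity block per 10-block stripe: 2.2x instead of 3.0x.
print(effective_overhead(2, 10, 1, 2))   # 2.2
```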
Architecture and implementation
HDFS Raid consists of several software components:
- the DRFS client, which provides application access to the files in the DRFS and transparently recovers any corrupt or missing blocks encountered when reading a file,
- the RaidNode, a daemon that creates and maintains parity files for all data files stored in the DRFS,
- the BlockFixer, which periodically recomputes blocks that have been lost or corrupted,
- the RaidShell utility, which allows the administrator to manually trigger the recomputation of missing or corrupt blocks and to check for files that have become irrecoverably corrupted,
- the ErasureCode, which implements the encoding and decoding of the bytes in blocks.
The DRFS client is implemented as a layer on top of the DFS client that intercepts all incoming calls and passes them on to the underlying client. Whenever the underlying DFS throws a ChecksumException or a BlockMissingException (because the source file contains corrupt or missing blocks), the DRFS client catches the exception, locates the parity file for the current source file, and recomputes the missing blocks before returning them to the application.
It is important to note that while the DRFS client recomputes missing blocks when reading corrupt files it does not insert these missing blocks back into the file system. Instead, it discards them once the application request has been fulfilled. The BlockFixer daemon and the RaidShell tool can be used to persistently fix bad blocks.
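The recovery path can be sketched as a simple decorator around the underlying read. All names below are illustrative, not the actual Hadoop API:

```python
# Hypothetical sketch of the DRFS client read path: reads go to the
# underlying DFS first; on a corrupt/missing block, the client
# recomputes the block from parity and returns it to the caller.

class BlockMissingError(Exception):
    """Stand-in for the DFS client's BlockMissingException."""

def read_block(dfs_read, parity_decode, block_index):
    """Return the requested block, falling back to parity recovery."""
    try:
        return dfs_read(block_index)          # normal DFS read
    except BlockMissingError:
        # The recomputed block is returned to the application but is
        # NOT inserted back into the file system; persistent repair is
        # left to the BlockFixer or the RaidShell.
        return parity_decode(block_index)

def broken_dfs(i):
    raise BlockMissingError(i)                # simulate a lost block

print(read_block(broken_dfs, lambda i: b"recovered", 0))  # b'recovered'
```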
The RaidNode periodically scans all paths for which the configuration specifies that they should be stored in the DRFS. For each path, it recursively inspects all files that have more than 2 blocks and selects those that have not been recently modified (default is within the last 24 hours). Once it has selected a source file, it iterates over all its stripes and creates the appropriate number of parity blocks for each stripe. The parity blocks are then concatenated together and stored as the parity file corresponding to this source file. Once the parity file has been created, the replication factor for the corresponding source file is lowered as specified in the configuration. The RaidNode also periodically deletes parity files that have become orphaned or outdated.
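The grouping of a file's blocks into stripes can be sketched as follows, using the documented default stripe length of 5 (the helper name is hypothetical):

```python
# Sketch of how a source file's blocks are grouped into stripes.
# Each stripe gets its own parity block(s); the parity blocks are
# concatenated to form the parity file.
def stripes(num_blocks, stripe_length=5):
    """Return the block indices of each stripe of the file."""
    return [list(range(i, min(i + stripe_length, num_blocks)))
            for i in range(0, num_blocks, stripe_length)]

# A 12-block file yields two full stripes and one partial stripe.
print(stripes(12))  # [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9], [10, 11]]
```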
There are currently two implementations of the RaidNode:
- LocalRaidNode, which computes parity blocks locally at the RaidNode, and
- DistributedRaidNode, which dispatches MapReduce tasks to compute parity blocks.
The BlockFixer is a daemon that runs at the RaidNode and periodically inspects the health of the paths for which DRFS is configured. When a file with missing or corrupt blocks is encountered, these blocks are recomputed and inserted back into the file system.
There are two implementations of the BlockFixer:
- LocalBlockFixer, which recomputes bad blocks locally at the RaidNode, and
- DistBlockFixer, which dispatches MapReduce jobs to recompute blocks.
The RaidShell is a tool that allows the administrator to maintain and inspect a DRFS. It supports commands for manually triggering the recomputation of bad data blocks and also allows the administrator to display a list of irrecoverable files (i.e., files for which too many data or parity blocks have been lost).
To validate the integrity of a file system, run RaidFSCK as follows:
$HADOOP_HOME/bin/hadoop org.apache.hadoop.raid.RaidShell -fsck [path]
This will print a list of corrupt files (i.e., files which have lost too many blocks and can no longer be fixed by Raid). This check is currently under development.
ErasureCode is the underlying component used by the BlockFixer and the RaidNode to generate parity blocks and to fix parity or source blocks. ErasureCode performs both encoding and decoding: when encoding, it takes several source bytes and generates a number of parity bytes; when decoding, it regenerates the missing bytes (which can be parity or source bytes) from the remaining source and parity bytes.
The number of missing bytes that can be recovered is equal to the number of parity bytes created. For example, if we encode 10 source bytes into 3 parity bytes, we can recover any 3 missing bytes from the 10 remaining bytes.
There are two kinds of erasure codes implemented in Raid: the XOR code and the Reed-Solomon code. The difference between them is that XOR allows creating only one parity byte, while Reed-Solomon allows creating any given number of parity bytes. As a result, the replication of the source file can be reduced to 1 when using Reed-Solomon without losing data safety. The downside of having only one replica of a block is that all reads of that block have to go to a single machine, reducing read parallelism. Thus Reed-Solomon should be used for data that is accessed infrequently.
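As a toy illustration of the XOR code (not the actual ErasureCode API): a single parity byte per stripe suffices to rebuild any one missing byte.

```python
# Toy XOR erasure code over one stripe of source bytes.
def xor_encode(stripe):
    """Compute the single XOR parity byte for a stripe."""
    parity = 0
    for b in stripe:
        parity ^= b
    return parity

def xor_decode(survivors, parity):
    """Recover the one missing byte from the survivors and the parity."""
    missing = parity
    for b in survivors:
        missing ^= b
    return missing

stripe = [10, 20, 30, 40, 50]
parity = xor_encode(stripe)

# Lose stripe[2]; recover it from the other four bytes plus the parity.
survivors = stripe[:2] + stripe[3:]
print(xor_decode(survivors, parity))  # 30
```

Recovering more than one lost byte per stripe requires more parity, which is what the Reed-Solomon code provides.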
Using HDFS RAID
The entire code is packaged in the form of a single jar file named hadoop-*-raid.jar. To use HDFS Raid, you need to put this jar file on Hadoop's CLASSPATH. The easiest way to achieve this is to copy hadoop-*-raid.jar from $HADOOP_HOME/build/contrib/raid to $HADOOP_HOME/lib. Alternatively, you can modify $HADOOP_CLASSPATH (defined in conf/hadoop-env.sh) to include this jar.
There is a single configuration file named raid.xml that describes the HDFS paths for which RAID should be used. This provides a list of directory/file patterns that need to be RAIDed. There are quite a few options that can be specified for each pattern. A sample of this file can be found in src/contrib/raid/conf/raid.xml. To apply the policies defined in raid.xml, a reference has to be added to hdfs-site.xml:
<property>
  <name>raid.config.file</name>
  <value>/mnt/hdfs/DFS/conf/raid.xml</value>
  <description>This is needed by the RaidNode</description>
</property>
In order to get clients to use the DRFS client (which transparently fixes bad blocks), add the following configuration property to hdfs-site.xml:
<property>
  <name>fs.hdfs.impl</name>
  <value>org.apache.hadoop.dfs.DistributedRaidFileSystem</value>
  <description>The FileSystem for hdfs: uris.</description>
</property>
Additional (optional) configuration
The following properties can be set in hdfs-site.xml to further tune the DRFS configuration:
Specify the location where parity files should be stored:

<property>
  <name>hdfs.raid.locations</name>
  <value>hdfs://newdfs.data:8000/raid</value>
  <description>The location for parity files. If this is not defined, it defaults to /raid.</description>
</property>

Specify the parity stripe length in blocks:

<property>
  <name>hdfs.raid.stripeLength</name>
  <value>10</value>
  <description>The number of blocks in a file to be combined into a single raid parity block. The default value is 5. The higher this number, the greater the disk space you will save when you enable raid.</description>
</property>

Specify the size of HAR part-files:

<property>
  <name>raid.har.partfile.size</name>
  <value>4294967296</value>
  <description>The size of the HAR part files that store raid parity files. The default is 4GB. The higher the number, the fewer the files used to store the HAR archive.</description>
</property>

Specify the block placement policy for raid:

<property>
  <name>dfs.block.replicator.classname</name>
  <value>org.apache.hadoop.hdfs.server.namenode.BlockPlacementPolicyRaid</value>
  <description>The name of the class which specifies how to place blocks in HDFS. The class BlockPlacementPolicyRaid tries to avoid co-located replicas of the same stripe. This greatly reduces the probability of raid file corruption.</description>
</property>
Specify the fairshare scheduler pool that should be used to run jobs dispatched by the DistBlockFixer:
<property>
  <name>raid.mapred.fairscheduler.pool</name>
  <value>none</value>
  <description>The name of the fair scheduler pool to use.</description>
</property>
Specify which implementation of RaidNode to use (local or distributed):
<property>
  <name>raid.classname</name>
  <value>org.apache.hadoop.raid.DistRaidNode</value>
  <description>Specify which implementation of RaidNode to use (class name).</description>
</property>
Specify how frequently the RaidNode re-calculates out-dated or missing parity blocks:
<property>
  <name>raid.policy.rescan.interval</name>
  <value>5000</value>
  <description>Specify the periodicity in milliseconds after which all source paths are rescanned and parity blocks recomputed if necessary. By default, this value is 1 hour.</description>
</property>

By default, the DRFS assumes that the underlying file system is a DFS. To layer the DRFS over some other file system, define a property named fs.raid.underlyingfs.impl that specifies the name of the underlying class. For example, to layer the DRFS over an instance of NewFileSystem, use the following property:

<property>
  <name>fs.raid.underlyingfs.impl</name>
  <value>org.apache.hadoop.new.NewFileSystem</value>
  <description>Specify the filesystem that is layered immediately below the DistributedRaidFileSystem. By default, this value is DistributedFileSystem.</description>
</property>

Specify which implementation of the BlockFixer to use (local or distributed):

<property>
  <name>raid.blockfix.classname</name>
  <value>org.apache.hadoop.raid.LocalBlockFixer</value>
  <description>Specify the BlockFixer implementation to use. The default is org.apache.hadoop.raid.DistBlockFixer.</description>
</property>
The DRFS provides support for administration at runtime without any downtime to cluster services. It is possible to add/delete new paths to be raided without interrupting any load on the cluster. Changes to raid.xml are detected periodically (every few seconds) and new policies are applied immediately.
Designate one machine in your cluster to run the RaidNode software. You can run this daemon on any machine, irrespective of whether that machine is running any other Hadoop daemons. Start the RaidNode by running the following on the selected machine:
nohup $HADOOP_HOME/bin/hadoop org.apache.hadoop.raid.RaidNode >> /xxx/logs/hadoop-root-raidnode-hadoop.xxx.com.log &
We also provide two scripts to start and stop the RaidNode more easily. Copy the scripts start-raidnode.sh and stop-raidnode.sh to the directory $HADOOP_HOME/bin on the machine where the RaidNode is to be deployed. You can then start or stop the RaidNode by directly calling these scripts on that machine. To deploy the RaidNode remotely, copy start-raidnode-remote.sh and stop-raidnode-remote.sh to $HADOOP_HOME/bin on the machine from which you want to trigger the remote deployment, and create a text file $HADOOP_HOME/conf/raidnode on the same machine containing the name of the machine where the RaidNode should be deployed. These scripts ssh to the specified machine and invoke start-raidnode.sh or stop-raidnode.sh there.
For easy maintenance, you might want to change start-mapred.sh on the JobTracker machine so that it automatically calls start-raidnode-remote.sh (and make a similar change to stop-mapred.sh to call stop-raidnode-remote.sh).
To monitor the health of a DRFS, use the fsck command provided by the RaidShell.