Amazon S3 (Simple Storage Service) is a data storage service. You are billed monthly for storage and data transfer. Transfer between S3 and AmazonEC2 is free, which makes S3 attractive for Hadoop users who run clusters on EC2.

Hadoop provides two filesystems that use S3.

S3 Native FileSystem (URI scheme: s3n)
A native filesystem for reading and writing regular files on S3. The advantage of this filesystem is that you can access files on S3 that were written with other tools. Conversely, other tools can access files written using Hadoop. The disadvantage is the 5GB limit on file size imposed by S3. For this reason it is not suitable as a replacement for HDFS (which has support for very large files).
S3 Block FileSystem (URI scheme: s3)
A block-based filesystem backed by S3. Files are stored as blocks, just as they are in HDFS. This permits efficient implementation of renames. This filesystem requires you to dedicate a bucket to it: you should not use an existing bucket containing files, or write other files to the same bucket. The files stored by this filesystem can be larger than 5GB, but they are not interoperable with other S3 tools.

There are two ways that S3 can be used with Hadoop's Map/Reduce: either as a replacement for HDFS using the S3 block filesystem (i.e. using it as a reliable distributed filesystem with support for very large files), or as a convenient repository for data input to and output from MapReduce, using either S3 filesystem. In the second case HDFS is still used for the Map/Reduce phase. Note also that by using S3 as an input to MapReduce you lose the data locality optimization, which may be significant.
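
For the second pattern, a job simply names S3 URIs as its input and output while HDFS remains the default filesystem. As a rough sketch (the ID, SECRET, BUCKET and paths below are placeholders), the grep example used later on this page could read from and write to the native filesystem like this:

bin/hadoop jar hadoop-*-examples.jar grep s3n://ID:SECRET@BUCKET/input s3n://ID:SECRET@BUCKET/output 'dfs[a-z.]+'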

Setting up Hadoop to use S3 as a replacement for HDFS

Put the following in conf/hadoop-site.xml to set the default filesystem to be the S3 block filesystem:

<property>
  <name>fs.default.name</name>
  <value>s3://BUCKET</value>
</property>

<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value>ID</value>
</property>

<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>SECRET</value>
</property>

For the S3 native filesystem, just replace s3 with s3n in the above.
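
For reference, a minimal sketch of the native-filesystem variant (same placeholder ID, SECRET and BUCKET; the property names below follow the fs.s3n.* naming used by the native filesystem):

<property>
  <name>fs.default.name</name>
  <value>s3n://BUCKET</value>
</property>

<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>ID</value>
</property>

<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>SECRET</value>
</property>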

Note that you cannot use s3n as a replacement for HDFS in Hadoop versions prior to 0.19.0, since rename was not supported.

Alternatively, you can put the access key ID and the secret access key into a s3 (or s3n) URI as the user info:

<property>
  <name>fs.default.name</name>
  <value>s3://ID:SECRET@BUCKET</value>
</property>

Note that since the secret access key can contain slashes, you must remember to escape them by replacing each slash / with the string %2F. Keys specified in the URI take precedence over any specified using the properties fs.s3.awsAccessKeyId and fs.s3.awsSecretAccessKey.
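
For example, with a made-up secret key of ab/cd+ef, the URI would be written as:

<property>
  <name>fs.default.name</name>
  <value>s3://ID:ab%2Fcd+ef@BUCKET</value>
</property>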

Running the Map/Reduce demo in the Hadoop API Documentation using S3 is now a matter of running:

mkdir input
cp conf/*.xml input
bin/hadoop fs -put input input
bin/hadoop fs -ls input
bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
bin/hadoop fs -get output output
cat output/*

To run in distributed mode you only need to run the MapReduce daemons (JobTracker and TaskTrackers) - HDFS NameNode and DataNodes are unnecessary.

bin/start-mapred.sh
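
The TaskTrackers also need to be told where to find the JobTracker. A minimal sketch of that setting in conf/hadoop-site.xml, assuming a JobTracker running on jobtracker.example.com port 9001 (both placeholders):

<property>
  <name>mapred.job.tracker</name>
  <value>jobtracker.example.com:9001</value>
</property>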

Security

Your Amazon Secret Access Key is just that: secret. If it becomes known, you will have to go to the Security Credentials page and revoke it. Try to avoid printing it in logs or checking the XML configuration files into revision control.
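
One way to keep the key out of files that might be checked in is to pass it on the command line instead. As a sketch (placeholder credentials, hostnames and paths), tools that accept Hadoop's generic options can take the S3 properties via -D:

bin/hadoop distcp -D fs.s3.awsAccessKeyId=ID -D fs.s3.awsSecretAccessKey=SECRET hdfs://namenode:9000/data s3://BUCKET/data

Bear in mind that command-line arguments may still be visible to other users on the same machine via the process list.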

Running bulk copies in and out of S3

Support for the S3 block filesystem was added to the ${HADOOP_HOME}/bin/hadoop distcp tool in Hadoop 0.11.0 (see HADOOP-862). The distcp tool sets up a MapReduce job to run the copy. Using distcp, a cluster of many members can copy lots of data quickly. The number of map tasks is calculated by counting the number of files in the source: i.e. each map task is responsible for copying one file. Source and target may refer to disparate filesystem types. For example, the source might refer to the local filesystem or HDFS, with S3 as the target.

The distcp tool is useful for quickly prepping S3 for MapReduce jobs that use S3 for input, or for backing up the content of HDFS.

Here is an example copying a Nutch segment named 0070206153839-1998 at /user/nutch in HDFS to an S3 bucket named 'nutch' (let the S3 AWS_ACCESS_KEY_ID be 123 and the AWS_ACCESS_KEY_SECRET be 456):

% ${HADOOP_HOME}/bin/hadoop distcp hdfs://domU-12-31-33-00-02-DF:9001/user/nutch/0070206153839-1998 s3://123:456@nutch/

Flip the arguments if you want to run the copy in the opposite direction.
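
Using the same placeholder values, the reverse copy from S3 back into HDFS would look something like:

% ${HADOOP_HOME}/bin/hadoop distcp s3://123:456@nutch/0070206153839-1998 hdfs://domU-12-31-33-00-02-DF:9001/user/nutch/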

Other schemes supported by distcp are file (for the local filesystem) and http.
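
For instance (the paths here are placeholders, and the local path must be visible to the nodes that run the copy tasks), a local directory can be pushed to S3 with:

% ${HADOOP_HOME}/bin/hadoop distcp file:///tmp/logs s3://123:456@nutch/logs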

You'll likely encounter the following errors if you are running a stock Hadoop 0.11.X:

org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException: S3 PUT failed...We encountered an internal error. Please try again...

put: Input stream is not repeatable as 1048576 bytes have been written, exceeding the available buffer size of 131072

See HADOOP-882 for discussion of the above issues and workarounds/fixes.

S3 Block FileSystem Version Numbers

From release 0.13.0 onwards, the S3 block filesystem stores a version number in the file metadata. The table below lists the first Hadoop release for each version number.

Version number    Release
Unversioned       0.10.0
1                 0.13.0

There is a migration tool (org.apache.hadoop.fs.s3.MigrationTool) that can be used to migrate data to the latest version. Run it without arguments to get usage instructions.
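
The hadoop launcher script can run a class by its fully qualified name, so printing the usage message is a matter of something like:

bin/hadoop org.apache.hadoop.fs.s3.MigrationTool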
