= Filesystem Compatibility with Apache Hadoop =

See this link for Community Progress and Participation on these topics.

Apache Hadoop is built on a distributed filesystem, HDFS, the ''Hadoop Distributed File System'', capable of storing tens of petabytes of data. This filesystem is designed to work with Apache Hadoop from the ground up, with location-aware block placement, integration with the Hadoop tools, and both explicit and implicit testing.

Apache Hadoop also works with other filesystems: the platform-specific "local" filesystem, [[BlobStore|Blobstores]] such as Amazon S3 and Azure storage, and alternative distributed filesystems.

All such filesystems (including HDFS) must link up to Hadoop in one of two ways:

 1. The filesystem looks like a "native" filesystem, and is accessed as a local FS, perhaps with some filesystem-specific means of telling the MapReduce layer which TaskTracker is closest to the data.

 2. The filesystem provides an implementation of the `org.apache.hadoop.fs.FileSystem` class (and, in Hadoop v2, an implementation of the `FileContext` class).

Implementing the `FileSystem` class ensures that there is an API which applications such as MapReduce, Apache HBase, Apache Giraph and others can use, including third-party applications as well as code running in a MapReduce job that wishes to read or write data.
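
To make the second route concrete, below is a minimal sketch of such a plugin against the Hadoop 2 `FileSystem` API. The `MyFileSystem` class and `myfs:` scheme are hypothetical names invented for this example; a real plugin must map every abstract operation of `org.apache.hadoop.fs.FileSystem` onto the underlying store, whereas this sketch only stubs them out.

{{{#!java
// A minimal sketch of a Hadoop filesystem plugin. The class name and
// "myfs:" scheme are hypothetical; every abstract method of FileSystem
// must be overridden for the class to compile, and a real plugin must
// implement each one rather than failing.
package org.example.fs;

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;
import org.apache.hadoop.util.Progressable;

public class MyFileSystem extends FileSystem {
  private URI uri;
  private Path workingDir = new Path("/");

  @Override
  public void initialize(URI name, Configuration conf) throws IOException {
    super.initialize(name, conf);
    uri = URI.create(name.getScheme() + "://" + name.getAuthority());
  }

  @Override
  public URI getUri() {
    return uri;
  }

  @Override
  public FSDataInputStream open(Path f, int bufferSize) throws IOException {
    throw new IOException("not implemented in this sketch");
  }

  @Override
  public FSDataOutputStream create(Path f, FsPermission permission,
      boolean overwrite, int bufferSize, short replication, long blockSize,
      Progressable progress) throws IOException {
    throw new IOException("not implemented in this sketch");
  }

  @Override
  public FSDataOutputStream append(Path f, int bufferSize,
      Progressable progress) throws IOException {
    throw new IOException("not implemented in this sketch");
  }

  @Override
  public boolean rename(Path src, Path dst) throws IOException {
    throw new IOException("not implemented in this sketch");
  }

  @Override
  public boolean delete(Path f, boolean recursive) throws IOException {
    throw new IOException("not implemented in this sketch");
  }

  @Override
  public FileStatus[] listStatus(Path f) throws IOException {
    throw new IOException("not implemented in this sketch");
  }

  @Override
  public void setWorkingDirectory(Path dir) {
    workingDir = dir;
  }

  @Override
  public Path getWorkingDirectory() {
    return workingDir;
  }

  @Override
  public boolean mkdirs(Path f, FsPermission permission) throws IOException {
    throw new IOException("not implemented in this sketch");
  }

  @Override
  public FileStatus getFileStatus(Path f) throws IOException {
    throw new IOException("not implemented in this sketch");
  }
}
}}}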

The selection of which filesystem to use comes from the URI scheme used to refer to it: the prefix `hdfs:` on any file path means that it refers to an HDFS filesystem; `file:` to the local filesystem; `s3:` to Amazon S3; `ftp:` to FTP; `swift:` to OpenStack Swift; and so on.
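
For example, the following sketch (the namenode hostname and the paths are placeholders) obtains two different `FileSystem` implementations through the same call, purely on the basis of the URI scheme:

{{{#!java
// A sketch of scheme-based filesystem selection; the host and paths
// are placeholders. The scheme of the URI decides which implementation
// FileSystem.get() returns.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SchemeSelection {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // hdfs: selects the HDFS client, file: the local filesystem.
    FileSystem hdfs  = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf);
    FileSystem local = FileSystem.get(URI.create("file:///"), conf);

    // The application code that follows is identical for either store.
    System.out.println(hdfs.exists(new Path("/user/alice/data")));
    System.out.println(local.exists(new Path("/tmp")));
  }
}
}}}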

There are other filesystems that provide explicit integration with Hadoop through the relevant Java JAR files, native binaries and configuration parameters needed to add a new scheme to Hadoop, such as `fat32:`.
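
A sketch of what that wiring looks like, reusing the hypothetical `fat32:` scheme from above (the implementation class name is also invented); the same binding is more usually declared as the `fs.fat32.impl` property in `core-site.xml`:

{{{#!java
// A sketch of registering a new URI scheme; the "fat32" scheme and the
// implementation class are hypothetical. Hadoop looks up the property
// fs.<scheme>.impl to find the FileSystem class for a scheme.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class RegisterScheme {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.fat32.impl", "org.example.fs.Fat32FileSystem");

    // With the binding in place, fat32: URIs resolve to the plugin.
    FileSystem fs = FileSystem.get(URI.create("fat32://disk0/"), conf);
  }
}
}}}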

All providers of filesystem plugins do their utmost to make their filesystems compatible with Apache Hadoop. Ambiguities in the Hadoop APIs do not help here: many of the expectations of Hadoop applications are set not by the `FileSystem` API but by the behavior of HDFS itself, which makes it harder to distinguish "bug" from "feature" in the behavior of HDFS.

The Hadoop developers are (as of April 2013) attempting to [[https://issues.apache.org/jira/browse/HADOOP-9371|define the Semantics of the Hadoop FileSystem more rigorously]], as well as adding [[https://issues.apache.org/jira/browse/HADOOP-9258|better test coverage for the filesystem APIs]]. This will ensure that the filesystem implementations that ship with Hadoop (HDFS itself, and the classes that connect to other filesystems, currently `s3:`, `s3n:`, `file:`, `ftp:` and `webhdfs:`) stay consistent with each other, and compatible with existing applications.
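
One practical consequence for filesystem providers is that the shared test code can be run against a third-party implementation. Below is a sketch of that pattern using Hadoop's `FileSystemContractBaseTest`; `MyFileSystem` and the `myfs:` URI are the hypothetical names from the earlier sketch, and the exact base-class details may differ between Hadoop versions.

{{{#!java
// A sketch of running Hadoop's filesystem contract tests against a
// third-party implementation; MyFileSystem and the myfs: URI are
// hypothetical. The base class supplies the test cases; the subclass
// only has to provide a configured FileSystem instance.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystemContractBaseTest;

public class TestMyFileSystemContract extends FileSystemContractBaseTest {
  @Override
  protected void setUp() throws Exception {
    fs = new org.example.fs.MyFileSystem();
    fs.initialize(URI.create("myfs://test/"), new Configuration());
    super.setUp();
  }
}
}}}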

This formalisation of the API will also benefit anyone who wishes to provide a library that lets Hadoop applications work with their filesystem; such people have been very constructive in helping define the `FileSystem` APIs more rigorously.

The list below contains links to information about some of the additional FileSystems and Object Stores for Apache Hadoop.

Even if the filesystem is supported by a library for tight integration with Apache Hadoop, it may behave differently from what Hadoop and applications expect: this is something to explore with the supplier of the filesystem.

The Apache Software Foundation can make no assertion that a third-party filesystem is ''compatible'' with Apache Hadoop: these are claims by the vendors which the Apache Software Foundation does not validate.

What the ASF can do is warn that our own BlobStore filesystems (currently `s3:`, `s3n:` and `swift:`) are not complete replacements for `hdfs:`, as operations such as `rename()` are only emulated by copying and then deleting every file, so a directory rename is not atomic: a requirement of POSIX filesystems which some applications (such as MapReduce) currently depend on.
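
To illustrate why such an emulation cannot be atomic, here is a sketch (not the actual `s3n:` code) of a rename performed as copy-then-delete; a failure part-way through leaves data in both the source and the destination:

{{{#!java
// A sketch of how a blobstore emulates rename() by copying every file
// and then deleting the source; this is not the actual s3n: code.
// Another client can observe the half-copied state, and a crash in the
// middle leaves both trees present: the operation is not atomic.
import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class RenameByCopy {
  static boolean renameByCopy(FileSystem fs, Path src, Path dst)
      throws IOException {
    for (FileStatus stat : fs.listStatus(src)) {
      Path target = new Path(dst, stat.getPath().getName());
      if (stat.isDirectory()) {
        fs.mkdirs(target);                          // recreate the directory
        renameByCopy(fs, stat.getPath(), target);   // then recurse into it
      } else {
        FSDataInputStream in = fs.open(stat.getPath());
        FSDataOutputStream out = fs.create(target);
        IOUtils.copyBytes(in, out, 4096, true);     // copy one file's bytes
      }
    }
    return fs.delete(src, true);  // the source only disappears at the end
  }
}
}}}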

Similarly, the local `file:` filesystem behaves differently on different operating systems, especially regarding filename case and whether or not you can delete open files. If your intent is to write code that only ever works with the local filesystem, always test on the target platform.
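
A sketch of a probe for one such difference, filename case handling (the path names are arbitrary): on most Linux systems the second `exists()` check returns `false`, while on Windows and on a default OS X install it returns `true`.

{{{#!java
// A sketch probing whether the local filesystem is case-insensitive;
// the file names are arbitrary. Case-insensitive platforms (Windows,
// default OS X) will report that the lower-case name exists too.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LocalCaseProbe {
  public static void main(String[] args) throws Exception {
    FileSystem local =
        FileSystem.get(URI.create("file:///"), new Configuration());
    Path upper = new Path("/tmp/HadoopCaseProbe.txt");
    local.create(upper).close();                     // create with mixed case
    boolean caseInsensitive =
        local.exists(new Path("/tmp/hadoopcaseprobe.txt"));
    System.out.println("case-insensitive: " + caseInsensitive);
    local.delete(upper, false);                      // clean up the probe file
  }
}
}}}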
