This page is out of date. Please refer to http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/Compatibility.html henceforth. Thanks

Apache Hadoop Compatibility

The goal of this page is to describe the issues that affect compatibility between Hadoop releases for Hadoop developers, downstream projects and end users.

Here are some existing JIRAs and pages relevant to the topic:

  1. Describe the annotations an interface should have as per our existing interface classification scheme (see HADOOP-5073)

  2. Cover compatibility items that are beyond the scope of API classification, along the lines of those discussed in HADOOP-5071, focused on Hadoop v1.

  3. The Roadmap captures release policies; some of its content is out of date.

Note to downstream projects/users: If you are concerned about compatibility at any level, we strongly encourage you to follow the Hadoop developer mailing lists and to track JIRA issues that may concern you. You are also strongly advised to verify that your code works against beta releases of forthcoming Hadoop versions, as that is when identified regressions can be corrected rapidly - if you only test when a new final release ships, the time to fix is likely to be at least three months.

Compatibility types

This section describes the various types of compatibility.

Java API

Hadoop interfaces and classes are annotated to describe the intended audience and stability in order to maintain compatibility with previous releases. See HADOOP-5073 for more details.
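For illustration, here is a minimal sketch of how the annotations from HADOOP-5073 appear on a class; the class name is hypothetical, but the annotations and their package are the real Hadoop ones.

  import org.apache.hadoop.classification.InterfaceAudience;
  import org.apache.hadoop.classification.InterfaceStability;

  // Public + Stable: safe for downstream use; the API should only change
  // incompatibly at a major release. Other values include Private/
  // LimitedPrivate (audience) and Evolving/Unstable (stability).
  @InterfaceAudience.Public
  @InterfaceStability.Stable
  public class ExampleClient {
    // ...
  }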

Use cases

Semantics compatibility

Apache Hadoop strives to ensure that the behavior of APIs remains consistent over versions, though changes for correctness may result in changes in behavior. That is: if you relied on something which we consider to be a bug, it may get fixed.

We are in the process of specifying some APIs more rigorously, enhancing our test suites to verify compliance with the specification, effectively creating a formal specification for the subset of behaviors that can be easily tested. We welcome involvement in this process, from both users and implementors of our APIs.
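As a sketch of what such a compliance test looks like, the hypothetical JUnit test below verifies one specified FileSystem behavior against the local filesystem; the test class and path are illustrative, but the FileSystem API calls are real.

  import static org.junit.Assert.assertTrue;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.junit.Test;

  public class MkdirsContractSketch {
    @Test
    public void mkdirsCreatesMissingParents() throws Exception {
      // One specified behavior: mkdirs() creates all missing parent
      // directories (like 'mkdir -p') and the directory exists afterwards.
      FileSystem fs = FileSystem.getLocal(new Configuration());
      Path dir = new Path("/tmp/contract-sketch/a/b/c");
      assertTrue(fs.mkdirs(dir));
      assertTrue(fs.exists(dir));
      fs.delete(new Path("/tmp/contract-sketch"), true);
    }
  }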

Wire compatibility

Wire compatibility concerns the data being transmitted over the wire between components. Hadoop uses Protocol Buffers for most RPC communication. Preserving compatibility requires prohibiting modification of the required fields of the corresponding protocol buffer; optional fields may be added without breaking backwards compatibility. The protocols can be categorized as follows:

  1. Client-Server: communication between Hadoop clients and servers (e.g., the HDFS client to NameNode protocol, or the YARN client to ResourceManager protocol).

  2. Client-Server (Admin): the subset of client-server protocols used solely by administrative commands (e.g., the HAAdmin protocol); these affect only administrators, who can tolerate changes that end users cannot.

  3. Server-Server: communication between servers (e.g., the protocol between the DataNode and NameNode, or between the NodeManager and ResourceManager).
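To make the optional-field rule concrete, here is a hedged sketch, not actual Hadoop code: RequestProto and its traceId field are hypothetical stand-ins for generated protobuf classes.

  // Suppose a newer release extends a proto2 message definition with:
  //   optional string traceId = 2;
  // Old clients never set the field, so new servers must tolerate its
  // absence; adding or changing a *required* field would break them.
  static String traceIdOf(RequestProto request) {
    // proto2 generates has<Field>() accessors for optional fields.
    return request.hasTraceId()
        ? request.getTraceId()
        : "";   // fall back to the pre-existing behavior
  }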

Non-RPC communication should be considered as well, for example the use of HTTP to transfer an HDFS image as part of snapshotting or to transfer MapTask output.

Metrics/JMX

While Metrics API compatibility is governed by Java API compatibility, the actual metrics exposed by Hadoop need to remain compatible so that users can automate against them (scripts, monitoring tools, etc.). Adding new metrics is a compatible change; modifying existing metrics (e.g., changing the unit of measurement) or removing them breaks compatibility. Likewise, changes to JMX MBean object names break compatibility.
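The sketch below shows why: monitoring code hard-codes MBean object names and attribute names, so renaming either breaks it. The object name follows the Hadoop convention, but treat the specific attribute as an assumption; this would run inside the daemon's JVM, or be adapted to a remote JMXConnector.

  import java.lang.management.ManagementFactory;
  import javax.management.MBeanServer;
  import javax.management.ObjectName;

  public class MetricsProbe {
    public static void main(String[] args) throws Exception {
      MBeanServer server = ManagementFactory.getPlatformMBeanServer();
      // A script that queries this name/attribute pair breaks the moment
      // either is renamed, even if the Java APIs are untouched.
      ObjectName name =
          new ObjectName("Hadoop:service=NameNode,name=FSNamesystemState");
      Object capacity = server.getAttribute(name, "CapacityTotal");
      System.out.println("CapacityTotal = " + capacity);
    }
  }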

REST APIs

REST API compatibility covers both the requests (URLs) and the responses to each request (content, which may contain other URLs). Hadoop REST APIs are specifically meant for stable use by clients, even across releases. The exposed REST APIs include WebHDFS, the ResourceManager, the NodeManager, the MR Application Master, and the History Server.
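As a sketch of what that stability contract means in practice, the snippet below issues a WebHDFS GETFILESTATUS call; both the URL layout (/webhdfs/v1/<path>?op=...) and the JSON response are part of the contract. The host, port, and unsecured setup are assumptions for a local test cluster.

  import java.io.BufferedReader;
  import java.io.InputStreamReader;
  import java.net.HttpURLConnection;
  import java.net.URL;

  public class WebHdfsStatus {
    public static void main(String[] args) throws Exception {
      // 50070 is the default NameNode HTTP port in Hadoop 2; adjust for
      // your cluster.
      URL url = new URL(
          "http://localhost:50070/webhdfs/v1/tmp?op=GETFILESTATUS");
      HttpURLConnection conn = (HttpURLConnection) url.openConnection();
      conn.setRequestMethod("GET");
      try (BufferedReader in = new BufferedReader(
          new InputStreamReader(conn.getInputStream()))) {
        String line;
        while ((line = in.readLine()) != null) {
          System.out.println(line);   // a JSON FileStatus object
        }
      }
    }
  }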

CLI Commands

Users and admins use Command Line Interface (CLI) commands, either directly or via scripts, to access/modify data and run jobs/apps. Changing the path of a command, removing or renaming command-line options, changing the order of arguments, or changing a command's return code or output may break compatibility and adversely affect users.
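A brief sketch of how tools come to depend on this contract: the command path, its options, and its exit code are all relied upon. 'hadoop fs -test -e <path>' is a real command that returns 0 iff the path exists; the wrapper class here is hypothetical.

  import java.util.Arrays;

  public class CliProbe {
    public static void main(String[] args) throws Exception {
      // Renaming '-test', reordering its arguments, or changing its exit
      // codes would silently break callers like this one.
      Process p = new ProcessBuilder(
          Arrays.asList("hadoop", "fs", "-test", "-e", "/tmp"))
          .inheritIO()
          .start();
      int rc = p.waitFor();
      System.out.println("exists: " + (rc == 0));
    }
  }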

Directory Structure

User logs, job history and output are stored on disk, either local or on HDFS. Changing the directory structure of these user-accessible files breaks compatibility, even in cases where the original path is preserved via symbolic links (if, for example, the path is accessed by a servlet that is configured not to follow links).

Classpath

User applications (e.g., Java programs that are not MR jobs) built against Hadoop might add all Hadoop jars (including Hadoop's dependencies) to the application's classpath. Adding new dependencies or updating the version of existing dependencies may break user programs.

Environment Variables

Users and related projects often rely on the exported environment variables (e.g., HADOOP_CONF_DIR), so removing or renaming environment variables is an incompatible change.
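A minimal sketch of that reliance; the class name and the fallback path are illustrative, but the variable itself is the real one.

  public class ConfDirProbe {
    public static void main(String[] args) {
      // Downstream launchers commonly locate a cluster's configuration
      // this way; renaming or dropping HADOOP_CONF_DIR breaks them.
      String confDir = System.getenv("HADOOP_CONF_DIR");
      if (confDir == null) {
        confDir = "/etc/hadoop/conf";   // conventional, not guaranteed
      }
      System.out.println("Using Hadoop config from: " + confDir);
    }
  }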

Hadoop Configuration Files

Modifications to Hadoop configuration properties, both key names and the units of their values, can break compatibility. We assume that users who use Hadoop configuration objects to pass information to jobs ensure their properties do not conflict with the key prefixes defined by Hadoop. The following key prefixes are used by Hadoop daemons and should be avoided: hadoop, io, ipc, fs, net, file, ftp, s3, kfs, ha, dfs, mapred, mapreduce, and yarn.
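A short sketch of the convention using the real org.apache.hadoop.conf.Configuration API; the application-specific key is hypothetical.

  import org.apache.hadoop.conf.Configuration;

  public class JobConfigSketch {
    public static void main(String[] args) {
      Configuration conf = new Configuration();
      // Application-specific keys should live under your own prefix...
      conf.set("com.example.myapp.retries", "3");
      // ...never under a Hadoop prefix, where they may collide with (or
      // be shadowed by) daemon configuration:
      // conf.set("mapreduce.myapp.retries", "3");   // avoid
      System.out.println(conf.get("com.example.myapp.retries"));
    }
  }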

Data Formats

Hadoop uses particular formats to store data and metadata. Modifying these formats can interfere with rolling upgrades, and hence they require compatibility guarantees. For instance, modifying the IFile format would require re-execution of jobs in flight during a rolling upgrade. Preserving certain formats, such as HDFS metadata, allows data to be accessed and modified across releases.
