Apache Hadoop Compatibility
The goal of this page is to describe the issues that affect compatibility between Hadoop releases for Hadoop developers, downstream projects and end users.
Here are some existing relevant JIRAs and pages related to the topic:
- Describe the annotations an interface should have as per our existing interface classification scheme (see HADOOP-5073).
- Cover compatibility items that are beyond the scope of API classification, along the lines of those discussed in HADOOP-5071, focused on Hadoop v1.
- The Roadmap captures release policies; some of the content is out of date.
Note to downstream projects/users: If you are concerned about compatibility at any level, we strongly encourage you to follow the Hadoop developer mailing lists and to track JIRA issues that may concern you. You are also strongly advised to verify that your code works against beta releases of forthcoming Hadoop versions, as that is the time in which identified regressions can be corrected rapidly; if you only test when a new final release ships, the time to fix is likely to be at least three months.
This section describes the various types of compatibility.
Hadoop interfaces and classes are annotated to describe the intended audience and stability in order to maintain compatibility with previous releases. See HADOOP-5073 for more details.
InterfaceAudience: captures the intended audience. Possible values are Public (for outside users), LimitedPrivate (for other Hadoop components and closely related projects like HBase), and Private (for use within a component).
InterfaceStability: describes what types of interface changes are expected. Possible values are Stable, Evolving, Unstable, and Deprecated. See HADOOP-5073 for details.
- Public-Stable API compatibility is required to ensure end-user programs and downstream projects continue to work without any changes.
- LimitedPrivate-Stable API compatibility is required to allow upgrade of individual components across minor releases.
- Private-Stable API compatibility is required for rolling upgrades.
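The classification above can be sketched in code. The following is a self-contained toy: the annotation types are minimal stand-ins defined inline so the example compiles on its own, while Hadoop's real annotations live in the org.apache.hadoop.classification package (InterfaceAudience, InterfaceStability).

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;

// Minimal stand-ins for Hadoop's audience/stability annotations,
// defined here only so the sketch is self-contained.
@Retention(RetentionPolicy.RUNTIME)
@interface Public {}

@Retention(RetentionPolicy.RUNTIME)
@interface Stable {}

// A hypothetical Public-Stable class: end users may depend on it, and
// only backwards-compatible changes are expected across releases.
@Public
@Stable
class ExamplePublicApi {
    public String greet(String who) { return "hello " + who; }
}

public class AnnotationDemo {
    public static void main(String[] args) {
        // Tooling (and reviewers) can inspect the declared audience and
        // stability of a class via reflection.
        System.out.println("public=" +
            ExamplePublicApi.class.isAnnotationPresent(Public.class));
        System.out.println("stable=" +
            ExamplePublicApi.class.isAnnotationPresent(Stable.class));
    }
}
```

A class carrying Public and Stable is the strongest promise; anything annotated Private or Unstable may change between releases without notice.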
Apache Hadoop strives to ensure that the behavior of APIs remains consistent over versions, though changes for correctness may result in changes in behavior. That is: if you relied on something which we consider to be a bug, it may get fixed.
We are in the process of specifying some APIs more rigorously and enhancing our test suites to verify compliance, effectively creating a formal specification for the subset of behaviors that can be easily tested. We welcome involvement in this process, from both users and implementors of our APIs.
Wire compatibility concerns the data being transmitted over the wire between components. Hadoop uses protocol buffers for most RPC communication. Preserving compatibility requires prohibiting modification to the required fields of the corresponding protocol buffer. Optional fields may be added without breaking backwards compatibility. The protocols can be categorized as follows:
- Client-Server: communication between Hadoop clients and servers (e.g., the HDFS client to NameNode protocol, or the YARN client to ResourceManager protocol).
- Client-Server (Admin): it is worth distinguishing a subset of the Client-Server protocols used solely by administrative commands (e.g., the HA admin protocol), as these protocols may be changed with less impact than general Client-Server protocols.
- Server-Server: communication between servers (e.g., the protocol between the DataNode and NameNode, or between the NodeManager and ResourceManager).
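The required/optional rule above can be illustrated with a small protocol buffer sketch; the message and field names here are hypothetical, not actual Hadoop protocol definitions.

```protobuf
// Version 1 of a hypothetical RPC request message.
message GetBlockInfoRequest {
  required string block_id = 1;   // required: must never be modified or removed
}

// Version 2: adding an optional field preserves wire compatibility,
// because old clients and servers simply ignore the unknown field.
message GetBlockInfoRequest {
  required string block_id = 1;   // unchanged
  optional uint32 timeout_ms = 2; // new optional field: compatible
}
```

Changing the type or tag number of block_id, or removing it, would break every peer still speaking version 1 of the protocol.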
Non-RPC communication should be considered as well, for example using HTTP to transfer an HDFS image as part of snapshotting or transferring MapTask output.
While Metrics API compatibility is governed by Java API compatibility, the actual metrics exposed by Hadoop need to be compatible for users to be able to automate against them (with scripts, etc.). Adding new metrics is compatible; modifying existing metrics (e.g. changing the unit of measurement) or removing them breaks compatibility. Likewise, changes to JMX MBean object names also break compatibility.
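As a sketch of why MBean object names matter, the example below uses the JDK's javax.management API with an object name of the form Hadoop publishes for the NameNode's FSNamesystem bean. Monitoring scripts match on the domain and key properties verbatim, which is why renaming them is an incompatible change.

```java
import javax.management.ObjectName;

public class MBeanNameDemo {
    public static void main(String[] args) throws Exception {
        // A JMX object name in the shape Hadoop exposes; tools that scrape
        // JMX query for this exact domain and these exact key properties.
        ObjectName name =
            new ObjectName("Hadoop:service=NameNode,name=FSNamesystem");
        System.out.println(name.getDomain());            // Hadoop
        System.out.println(name.getKeyProperty("name")); // FSNamesystem
    }
}
```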
REST API compatibility corresponds to both the request (URLs) and responses to each request (content, which may contain other URLs). Hadoop REST APIs are specifically meant for stable use by clients. The following are the exposed REST APIs:
- WebHDFS (as supported by HttpFs) - Stable
- WebHDFS (as supported by HDFS) - Stable
- Servlets - JMX, conf
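As an illustration of the request side of REST compatibility, the sketch below builds a WebHDFS request URL. The host, port, and path are placeholders; the /webhdfs/v1 path prefix and the op query parameter are part of the stable WebHDFS API surface that clients depend on.

```java
public class WebHdfsUrlDemo {
    // Builds a WebHDFS LISTSTATUS request URL. Clients and scripts depend
    // on this URL layout staying stable across releases.
    static String listStatusUrl(String host, int port, String path) {
        return String.format(
            "http://%s:%d/webhdfs/v1%s?op=LISTSTATUS", host, port, path);
    }

    public static void main(String[] args) {
        System.out.println(
            listStatusUrl("namenode.example.com", 9870, "/user/alice"));
        // -> http://namenode.example.com:9870/webhdfs/v1/user/alice?op=LISTSTATUS
    }
}
```

Changing the path prefix, the operation names, or the structure of the JSON responses would break every such client, which is why these APIs are marked Stable.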
Users and admins use Command Line Interface (CLI) commands, either directly or via scripts, to access or modify data and run jobs and applications. Changing the path of a command, removing or renaming command-line options, changing the order of arguments, or changing a command's return code or output may break compatibility and adversely affect users.
User logs, job history, and output are stored on disk, either locally or on HDFS. Changing the directory structure of these user-accessible files breaks compatibility, even in cases where the original path is preserved via symbolic links (if, for example, the path is accessed by a servlet that is configured not to follow links).
User applications (e.g. Java programs which are not MR jobs) built against Hadoop might add all Hadoop jars (including Hadoop’s dependencies) to the application’s classpath. Adding new dependencies or updating the version of existing dependencies may break user programs.
Users and related projects often utilize the exported environment variables (eg HADOOP_CONF_DIR), therefore removing or renaming environment variables is an incompatible change.
Hadoop Configuration Files
Modifying Hadoop configuration properties, including key names and the units of values, can break compatibility. We assume that users who use Hadoop configuration objects to pass information to jobs ensure their properties do not conflict with the key-prefixes defined by Hadoop; key-prefixes used by Hadoop daemons should be avoided.
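For instance, a job-specific property can use an application-owned prefix so it cannot collide with keys Hadoop itself defines. The property name and value below are hypothetical; only the file layout is the standard Hadoop configuration format.

```xml
<configuration>
  <!-- Application-owned key: the com.example. prefix keeps it clear of
       key-prefixes reserved by Hadoop daemons. -->
  <property>
    <name>com.example.myapp.input.threshold</name>
    <value>42</value>
  </property>
</configuration>
```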
Hadoop uses particular formats to store data and metadata. Modifying these formats can interfere with rolling upgrades, and hence they require compatibility guarantees. For instance, modifying the IFile format would require re-execution of jobs in flight during a rolling upgrade. Preserving certain formats, such as HDFS metadata, allows access to and modification of data across releases.