How to Contribute to Hadoop Common

This page describes the mechanics of how to contribute software to Hadoop Common. For ideas about what you might contribute, please see the ProjectSuggestions page.

Setting up

Here are some things you will need to build and test Hadoop. Setting up a working Hadoop development environment takes some time, so before you begin coding, get the project to build and test locally first. This is how you can test your installation.

Software Configuration Management (SCM)

The ASF uses Apache Subversion ("SVN") for its SCM system. There are some excellent GUIs for this, and IDEs with tight SVN integration, but as all our examples are from the command line, it is convenient to have the command line tools installed and a basic understanding of them.

A lot of developers now use Git to keep their own (uncommitted-into-apache) code under SCM; the git command line tools aid with this. See GitAndHadoop for more details.

Integrated Development Environment (IDE)

You are free to use whatever IDE you prefer, or your favorite command line editor. Note that

Build Tools

To build the code, install Apache Maven and a Java Development Kit (JDK), as well as the programs needed to build Hadoop on Windows if that is your development platform.

These should also be on your PATH; test by executing mvn and javac respectively.
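The PATH check described above can be scripted. The following is a minimal sketch, not part of the Hadoop build; the helper name check_tools is made up for illustration, and it only relies on the standard command -v lookup.

```shell
#!/bin/sh
# Report whether each named tool can be found on the PATH.
# check_tools is a hypothetical helper, not something Hadoop ships.
check_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found:   $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  # Exit status is 0 only when every tool was found.
  return "$missing"
}

# Typical use for a Hadoop build environment:
#   check_tools mvn javac
```

A non-zero exit status means at least one tool still needs to be installed or added to the PATH.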

As the Hadoop builds use the external Maven repository to download artifacts, Maven needs to be set up with the proxy settings needed to make external HTTP requests. You will also need to be online for the first builds of every Hadoop project, so that the dependencies can all be downloaded.

Other items

Native libraries

On Linux, you need the tools to create the native libraries.

For RHEL (and hence also CentOS):

yum -y install lzo-devel zlib-devel gcc autoconf automake libtool

For Debian and Ubuntu:

apt-get -y install maven build-essential autoconf automake libtool cmake zlib1g-dev pkg-config libssl-dev

Hardware Setup

Getting the source code

First of all, you need the Hadoop source code. The official location for Hadoop is the Apache SVN repository. Git is also supported, and is useful if you want to make lots of local changes and keep those changes under some form of private or public revision control.

SVN Access

Get the source code on your local drive using SVN. Most development is done on the "trunk":

svn checkout http://svn.apache.org/repos/asf/hadoop/common/trunk/ hadoop-trunk

You may also want to develop against a specific release. To do so, visit http://svn.apache.org/repos/asf/hadoop/common/tags/ and find the release that you are interested in developing against. To check out this release, run:

svn checkout http://svn.apache.org/repos/asf/hadoop/common/tags/release-X.Y.Z/ hadoop-common-X.Y.Z
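If you check out several releases, the tag URL pattern above can be wrapped in a small helper. This is only a sketch; release_url is a hypothetical name, and the version you pass must match a tag that actually exists under the tags/ directory.

```shell
#!/bin/sh
# Print the SVN tag URL for a given release version, following the
# release-X.Y.Z pattern shown above. release_url is a hypothetical helper.
release_url() {
  echo "http://svn.apache.org/repos/asf/hadoop/common/tags/release-$1/"
}

# Usage, with X.Y.Z replaced by a real release number:
#   svn checkout "$(release_url X.Y.Z)" hadoop-common-X.Y.Z
```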

If you prefer to use Eclipse for development, there are instructions for setting up SVN access from within Eclipse at EclipseEnvironment.

Committers: Check out using https:// URLs instead.

Git Access

See GitAndHadoop

Building ProtocolBuffers (for 0.23+)

Hadoop 0.23+ must have Google's ProtocolBuffers for compilation to work. These are native binaries which need to be downloaded, compiled and then installed locally. See YARN Readme.

This is a good opportunity to get the GNU C/C++ toolchain installed, which is useful for working on the native code used in the HDFS project.

To install and use ProtocolBuffers

Linux

Install the protobuf packages, provided they are current enough (see the README file for the current version). If they are too old, uninstall any version you have and follow the instructions.

Local build and installation

Testing your Protocol Buffers installation

To test the installation, verify that protoc is on your PATH by running it with no arguments. You should see something like:

$ protoc
Missing input file.

You may see the error message

$ protoc
protoc: error while loading shared libraries: libprotobuf.so.7: cannot open shared object file: No such file or directory

This is a known issue for Linux, and is caused by a stale cache of libraries. Run ldconfig and try again.
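The two outcomes above can be told apart in a script. The following sketch classifies the text that protoc prints; diagnose_protoc and its three result labels are invented for illustration, and it matches only the two messages quoted in this section.

```shell
#!/bin/sh
# Classify the message printed by running protoc with no arguments.
# $1 = combined stdout/stderr of protoc. diagnose_protoc is a hypothetical helper.
diagnose_protoc() {
  case "$1" in
    *"error while loading shared libraries"*)
      # Stale library cache: run ldconfig (usually as root) and try again.
      echo "stale-library-cache" ;;
    *"Missing input file."*)
      # protoc started and only complained about the missing argument.
      echo "ok" ;;
    *)
      echo "unknown" ;;
  esac
}

# Usage:
#   diagnose_protoc "$(protoc 2>&1)"
```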

Making Changes

Before you start, send a message to the Hadoop developer mailing list, or file a bug report in Jira. Describe your proposed changes and check that they fit in with what others are doing and have planned for the project. Be patient; it may take folks a while to understand your requirements.

Modify the source code and add some (very) nice features using your favorite IDE.

But take care about the following points

Using Maven

Hadoop 0.23 and later is built using Apache Maven, version 3 or later.

Maven likes to download things, especially on the first run.

  1. Be online for that first build, on a good network
  2. To set the Maven proxy settings, see http://maven.apache.org/guides/mini/guide-proxies.html
  3. Because Maven doesn't pass proxy settings down to the Ant tasks it runs (see HDFS-2381), some parts of the Hadoop build may fail. The fix for this is to pass down the Ant proxy settings in the build. On Unix: mvn $ANT_OPTS; on Windows: mvn %ANT_OPTS%.
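As a rough illustration of the proxy steps above, the sketch below prints a proxies fragment for the Maven settings file (normally ~/.m2/settings.xml) and builds a matching ANT_OPTS value from the standard Java http.proxyHost and http.proxyPort properties. The function names, the proxy id and the host proxy.example.com are placeholders, not values used by the Hadoop project.

```shell
#!/bin/sh
# Print a <proxies> fragment to place inside <settings> in settings.xml.
# $1 = proxy host, $2 = proxy port. The id "default-proxy" is arbitrary.
maven_proxy_fragment() {
  cat <<EOF
<proxies>
  <proxy>
    <id>default-proxy</id>
    <active>true</active>
    <protocol>http</protocol>
    <host>$1</host>
    <port>$2</port>
  </proxy>
</proxies>
EOF
}

# Build the JVM proxy properties to pass down to the Ant tasks.
ant_proxy_opts() {
  echo "-Dhttp.proxyHost=$1 -Dhttp.proxyPort=$2"
}

# Example with a hypothetical proxy:
#   maven_proxy_fragment proxy.example.com 8080
#   ANT_OPTS="$(ant_proxy_opts proxy.example.com 8080)"
#   mvn $ANT_OPTS clean install
```

ANT_OPTS is left unquoted in the mvn call so that the shell splits it into two separate -D arguments.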

Generating a patch

Unit Tests

Please make sure that all unit tests succeed before constructing your patch and that no new javac compiler warnings are introduced by your patch.

For building Hadoop with Maven, use the following to run all unit tests and build a distribution. The -Ptest-patch profile will check that no new compiler warnings have been introduced by your patch.

mvn clean install -Pdist -Dtar -Ptest-patch

Any test failures can be found in the target/surefire-reports directory of the relevant module. You can also run this command in one of the hadoop-common, hadoop-hdfs, or hadoop-mapreduce directories to just test a particular subproject.
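To find the failing tests quickly, you can search the report directory. The sketch below is one possible approach; failed_reports is a hypothetical helper, and it assumes that the surefire text reports mark problems with the strings "<<< FAILURE!" or "<<< ERROR!", which may differ between surefire versions.

```shell
#!/bin/sh
# List surefire text reports that record a failure or an error.
# $1 = report directory (defaults to target/surefire-reports).
# Assumption: reports flag problems with "<<< FAILURE!" or "<<< ERROR!".
failed_reports() {
  dir="${1:-target/surefire-reports}"
  # Nothing to report if the directory does not exist yet.
  [ -d "$dir" ] || return 0
  grep -lE '<<< (FAILURE|ERROR)!' "$dir"/*.txt 2>/dev/null | sort
}

# Usage, from inside a module such as hadoop-common:
#   failed_reports
```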

For unit test development guidelines, see HowToDevelopUnitTests.

Javadoc

Please also check the javadoc.

mvn javadoc:javadoc
firefox target/site/api/index.html

Examine all public classes you've changed to see that documentation is complete, informative, and properly formatted. Your patch must not generate any javadoc warnings.

Creating a patch

Check to see what files you have modified with:

svn stat

Add any new files with:

svn add src/.../MyNewClass.java
svn add src/.../TestMyNewClass.java

In order to create a patch, type (from the base directory of hadoop):

svn diff > HADOOP-1234.patch

This will report all modifications done on Hadoop sources on your local disk and save them into the HADOOP-1234.patch file. Read the patch file. Make sure it includes ONLY the modifications required to fix a single issue.
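One quick way to review the scope of a patch is to list the files it touches. This sketch relies on svn diff introducing each file with an "Index: <path>" line; patch_files is a hypothetical helper name.

```shell
#!/bin/sh
# Print the path of every file touched by an "svn diff" patch, one per line.
# $1 = patch file. Each file section starts with a line "Index: <path>".
patch_files() {
  sed -n 's/^Index: //p' "$1"
}

# Usage:
#   patch_files HADOOP-1234.patch
```

If the list contains files unrelated to the issue, revert those changes and regenerate the patch.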

Please do not:

Please do:

If you need to rename files in your patch:

  1. Write a shell script that uses 'svn mv' to rename the original files.
  2. Edit files as needed (e.g., to change package names).
  3. Create a patch file with 'svn diff --no-diff-deleted --notice-ancestry'.
  4. Submit both the shell script and the patch file.

This way other developers can preview your change by running the script and then applying the patch.

Naming your patch

Patches for trunk should be named according to the Jira: jira-xyz.patch, e.g. hdfs-1234.patch.

Patches for a non-trunk branch should be named jira-xyz-branch.patch, e.g. hdfs-123-branch-0.20-security.patch. The branch name suffix should be the exact name of a Subversion branch under hadoop/common/branches/, such as "branch-0.20-security". This naming convention allows the pre-commit tests to automatically run against the correct branch (new capability coming soon; see HADOOP-7435).
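The naming rules above can be captured in a small helper. This is a sketch only; patch_name is a hypothetical function, and it lowercases the Jira key to match the examples in this section.

```shell
#!/bin/sh
# Build a patch file name from a Jira key and an optional branch name.
# $1 = Jira key (e.g. HDFS-1234), $2 = branch name, omitted for trunk.
patch_name() {
  key=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  branch="${2:-}"
  if [ -n "$branch" ]; then
    echo "${key}-${branch}.patch"
  else
    echo "${key}.patch"
  fi
}

# Usage:
#   svn diff > "$(patch_name HDFS-123 branch-0.20-security)"
```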

It's OK to upload a new patch to Jira with the same name as an existing patch; Jira will just make the previous patches grey. They're still listed, sorted by date. If you select the "Activity>All" tab then the different versions are linked in the comment stream, providing context.

Testing your patch

Before submitting your patch, you are encouraged to run the same tools that the automated Jenkins patch test system will run on your patch. This enables you to fix problems with your patch before you submit it. The dev-support/test-patch.sh script in the trunk directory will run your patch through the same checks that Jenkins currently does, except for executing the unit tests.

Run this command from a clean workspace (i.e. svn stat shows no modifications or additions) as follows:

dev-support/test-patch.sh /path/to/my.patch
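The clean-workspace precondition can be checked before the script is started. The sketch below treats any non-empty line of status output as a sign that the workspace is not clean; workspace_is_clean is a hypothetical helper that reads the output of svn stat from standard input.

```shell
#!/bin/sh
# Succeed only when the status listing read from stdin has no non-empty
# lines, i.e. the working copy shows no modifications or additions.
workspace_is_clean() {
  ! grep -q .
}

# Usage:
#   svn stat | workspace_is_clean && dev-support/test-patch.sh /path/to/my.patch
```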

At the end, you should get a message on your console that is similar to the comment added to Jira by Jenkins's automated patch test system, listing +1 and -1 results. For non-trunk patches (prior to HADOOP-7435 being implemented), please copy this results summary into the Jira as a comment. Generally you should expect a +1 overall in order to have your patch committed; exceptions will be made for false positives that are unrelated to your patch. The scratch directory (which defaults to the value of ${user.home}/tmp) will contain some output files that will be useful in determining the cause if issues were found in the patch.

Some things to note:

Run the same command with no arguments to see the usage options.

Applying a patch

To apply a patch that you either generated or found on Jira, run:

patch -p0 < cool_patch.patch

If you just want to check whether the patch applies, run patch with the --dry-run option:

patch -p0 --dry-run < cool_patch.patch

If you are an Eclipse user, you can apply a patch as follows:

  1. Right-click the project name in Package Explorer.
  2. Select Team -> Apply Patch.

Changes that span projects

You may find that you need to modify both the common project and MapReduce or HDFS. Or perhaps you have changed something in common, and need to verify that these changes do not break the existing unit tests for HDFS and MapReduce. Hadoop's build system integrates with a local Maven repository to support cross-project development. Use this general workflow for your development:

Contributing your work

Finally, patches should be attached to an issue report in Jira via the Attach File link on the issue's Jira. Please add a comment that asks for a code review following our code review checklist. Please note that the attachment should be granted license to ASF for inclusion in ASF works (as per the Apache License §5).

When you believe that your patch is ready to be committed, select the Submit Patch link on the issue's Jira. Submitted patches will be automatically tested against "trunk" by Hudson, the project's continuous integration engine. Upon test completion, Hudson will add a success ("+1") message or failure ("-1") to your issue report in Jira. If your issue contains multiple patch versions, Hudson tests the last patch uploaded.

Folks should run mvn clean install javadoc:javadoc checkstyle:checkstyle before selecting Submit Patch. Tests must all pass. Javadoc should report no warnings or errors. Checkstyle's error count should not exceed that listed at Checkstyle Errors. Hudson's tests are meant to double-check things, and not be used as a primary patch tester, which would create too much noise on the mailing list and in Jira. Submitting patches that fail Hudson testing is frowned on (unless the failure is not actually due to the patch).

If your patch involves performance optimizations, they should be validated by benchmarks that demonstrate an improvement.

If your patch creates an incompatibility with the latest major release, then you must set the Incompatible change flag on the issue's Jira and fill in the Release Note field with an explanation of the impact of the incompatibility and the necessary steps users must take.

If your patch implements a major feature or improvement, then you must fill in the Release Note field on the issue's Jira with an explanation of the feature that will be comprehensible by the end user.

Once a "+1" comment is received from the automated patch testing system and a code reviewer has set the Reviewed flag on the issue's Jira, a committer should then evaluate it within a few days and either: commit it; or reject it with an explanation.

Please be patient. Committers are busy people too. If no one responds to your patch after a few days, please send a friendly reminder. Please incorporate others' suggestions into your patch if you think they're reasonable. Finally, remember that even a patch that is not committed is useful to the community.

Should your patch receive a "-1" from the Hudson testing, select the Cancel Patch link on the issue's Jira, upload a new patch with the necessary fixes, and then select the Submit Patch link again.

Committers: for non-trivial changes, it is best to get another committer to review your patches before commit. Use the Submit Patch link like other contributors, and then wait for a "+1" from another committer before committing. Please also try to frequently review things in the patch queues:

Jira Guidelines

Please comment on issues in Jira, making your concerns known. Please also vote for issues that are a high priority for you.

Please refrain from editing descriptions and comments if possible, as edits spam the mailing list and clutter Jira's "All" display, which is otherwise very useful. Instead, preview descriptions and comments using the preview button (on the right) before posting them. Keep descriptions brief and save more elaborate proposals for comments, since descriptions are included in Jira's automatically sent messages. If you change your mind, note this in a new comment, rather than editing an older comment. The issue should preserve this history of the discussion.

Stay involved

Contributors should join the Hadoop mailing lists, in particular the commit list (to see changes as they are made), the dev list (to join discussions of changes) and the user list (to help others).

See Also

HowToContribute (last edited 2014-01-13 09:59:24 by LewisJohnMcgibbney)