Git And Hadoop

A lot of people use Git with Hadoop because they maintain their own patches against Hadoop, and Git helps them manage those changes.

This page tells you how to work with Git. See HowToContribute for instructions on building and testing Hadoop.

Key Git Concepts

This page assumes you are familiar with the key concepts of Git: repositories, commits, branches, merging, and remotes.

Checking out the source

You need a copy of git on your system. Some IDEs ship with Git support; this page assumes you are using the command line.

Create a local clone of the Apache repository. The Hadoop subprojects (common, HDFS, and MapReduce) live inside a single combined repository called hadoop-common.git.

git clone git://git.apache.org/hadoop-common.git

The total download is well over 100MB, so the initial checkout works best on a fast network. Once downloaded, Git works offline, though you will need to perform your initial builds online so that the build tools (Maven, Ivy, etc.) can download their dependencies.

Grafts for complete project history

The Hadoop project has undergone some movement in where its component parts have been versioned. Because of that, commands like git log --follow need a little help. To graft the history back together into a coherent whole, insert the following contents into hadoop-common/.git/info/grafts:

5128a9a453d64bfe1ed978cf9ffed27985eeef36 6c16dc8cf2b28818c852e95302920a278d07ad0c
6a3ac690e493c7da45bbf2ae2054768c427fd0e1 6c16dc8cf2b28818c852e95302920a278d07ad0c
546d96754ffee3142bcbbf4563c624c053d0ed0d 6c16dc8cf2b28818c852e95302920a278d07ad0c
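
For example, you can append these lines from the shell with a heredoc (a sketch; it assumes your clone lives in a directory called hadoop-common):

cd hadoop-common
cat >> .git/info/grafts <<'EOF'
5128a9a453d64bfe1ed978cf9ffed27985eeef36 6c16dc8cf2b28818c852e95302920a278d07ad0c
6a3ac690e493c7da45bbf2ae2054768c427fd0e1 6c16dc8cf2b28818c852e95302920a278d07ad0c
546d96754ffee3142bcbbf4563c624c053d0ed0d 6c16dc8cf2b28818c852e95302920a278d07ad0c
EOF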

You can then use commands like git blame --follow with success.

Forking onto GitHub

You can create your own fork of the ASF project and add whatever branches you like. GitHub prefers that you explicitly fork its copies of Hadoop.

  1. Create a GitHub login at http://github.com/ and add your public SSH keys.

  2. Go to http://github.com/apache and search for Hadoop and any other Apache projects you want (avro is handy alongside the others).

  3. For each project, fork it in the GitHub UI. This gives you your own repository URL, which you can then clone locally with git clone.

  4. For each patch, create a branch.

At the time of writing (December 2009), GitHub was updating its copy of the Apache repositories every hour. As the Apache repositories were updating every 15 minutes, provided these frequencies are retained, a GitHub-derived fork will be at worst 1 hour and 15 minutes behind the ASF's SVN repository. If you are actively developing on Hadoop, especially committing code into the SVN repository, that is too long; work off the Apache repositories instead.

  1. Clone the read-only repository from GitHub (their recommendation) or from Apache (the ASF's recommendation)
  2. In that clone, rename the remote to "apache": git remote rename origin apache

  3. Log in to http://github.com

  4. Create a new repository (e.g. hadoop-fork)
  5. In the existing clone, add the new repository:

    git remote add -f github git@github.com:MYUSERNAMEHERE/hadoop-common.git
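
You can confirm that both remotes are configured with:

 git remote -v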

This gives you a local repository with two remote repositories: "apache" and "github". Apache has the trunk branch, which you can update whenever you want to get the latest ASF version:

 git checkout trunk
 git pull apache

Your own branches can be merged with trunk and pushed out to GitHub. To generate a patch for submitting to a JIRA issue, check everything in to your specific branch, merge that with (a recently pulled) trunk, then diff the two:  git diff --no-prefix trunk > ../hadoop-patches/HADOOP-XYZ.patch
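
Put together, the sequence for a hypothetical issue HADOOP-1234 might look like this (the issue number and branch name are placeholders):

 git checkout trunk
 git pull apache
 git checkout HADOOP-1234
 git merge trunk
 git diff --no-prefix trunk > ../hadoop-patches/HADOOP-1234.patch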

If you are working deep in the code, it is convenient not only to have a directory full of patches for the JIRA issues, but to make that directory a Git repository of its own that is pushed to a remote server. Why? It helps you move patches from machine to machine without having to do all the updating and merging. From a pure-Git perspective this is wrong, as it loses history, but for a mixed Git/SVN workflow it doesn't matter so much.
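
A minimal sketch of that setup, assuming a hypothetical private repository named hadoop-patches under your own GitHub account:

 cd ../hadoop-patches
 git init
 git add *.patch
 git commit -m "work-in-progress patches"
 #hypothetical remote; substitute wherever you keep private repositories
 git remote add origin git@github.com:MYUSERNAMEHERE/hadoop-patches.git
 git push origin master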

Branching

Git makes it easy to branch. The recommended process for working with Apache projects is one branch per JIRA issue. That makes it easy to isolate development and track each change. It does mean that if you release your own branch, one that merges in more than one issue, you have to invest some effort in keeping everything merged in. Try not to make changes in different branches that are hard to merge, and learn your way around the git rebase command to handle changes across branches. Better yet: do not use rebase once you have created a chain of branches that each depend on each other.

Creating the branch

Creating a branch is quick and easy

#start off in the apache trunk
git checkout trunk
#create a new branch from trunk
git branch HDFS-775
#switch to it
git checkout HDFS-775
#show which branch you are in
git branch
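
As a shortcut, the branch can be created and switched to in one step:

git checkout -b HDFS-775 trunk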

Remember, this branch is local to your machine. Nobody else can see it until you push up your changes or generate a patch, or you make your machine visible over the network to interested parties.

Creating Patches for attachment to JIRA issues

Assuming your trunk repository is in sync with the Apache projects, you can use git diff to create a patch file. First, have a directory for your patches:

mkdir ../hadoop-patches

Then generate a patch file listing the differences between your trunk and your branch:

git diff --no-prefix trunk > ../hadoop-patches/HDFS-775-1.patch

The patch file is an extended version of the unified patch format used by other tools; type git help diff to get more details on it. Here is what the patch file in this example looks like:

cat ../hadoop-patches/HDFS-775-1.patch
diff --git src/java/org/apache/hadoop/hdfs/server/datanode/FSDataset.java src/java/org/apache/hadoop/hdfs/server/datanode/FSDataset.java
index 42ba15e..6383239 100644
--- src/java/org/apache/hadoop/hdfs/server/datanode/FSDataset.java
+++ src/java/org/apache/hadoop/hdfs/server/datanode/FSDataset.java
@@ -355,12 +355,14 @@ public class FSDataset implements FSConstants, FSDatasetInterface {
       return dfsUsage.getUsed();
     }

+    /**
+     * Calculate the capacity of the filesystem, after removing any
+     * reserved capacity.
+     * @return the unreserved number of bytes left in this filesystem. May be zero.
+     */
     long getCapacity() throws IOException {
-      if (reserved > usage.getCapacity()) {
-        return 0;
-      }
-
-      return usage.getCapacity()-reserved;
+      long remaining = usage.getCapacity() - reserved;
+      return remaining > 0 ? remaining : 0;
     }

     long getAvailable() throws IOException {

It is essential that patches for JIRA issues are generated with the --no-prefix option. Without it, an extra directory level appears in the paths, and the patch can only be applied with a patch -p1 call, which Hudson does not know to do. If you want your patches to take, this is what you have to do. You can of course test this yourself by running a command like patch -p0 < ../hadoop-patches/HDFS-775-1.patch in a copy of the SVN source tree to check that your patch applies.
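
If you only want to check that the patch applies cleanly without modifying the tree, GNU patch's --dry-run option can be used (the path below is the example patch from this page):

patch -p0 --dry-run < ../hadoop-patches/HDFS-775-1.patch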

Updating your patch

If your patch is not immediately accepted, do not be offended: it happens to us all. It does introduce a problem: your branches become out of date. You need to check out the latest Apache version, merge your branches with it, and then push the changes back to GitHub:

 git checkout trunk
 git pull apache
 git checkout mybranch
 git merge trunk
 git push github mybranch

Your branch is now up to date, and new diffs can be created and attached to the JIRA issue.

Deriving Branches from Branches

If you have one patch that depends upon another, you should have a separate branch for each one. Simply merge the changes from the first branch into the second, so that it is always kept up to date with the first changes. To create a patch file for submission as a JIRA patch, do a diff between the two branches, not against trunk.
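
For example, with hypothetical branches HDFS-775 and HDFS-776, where the second builds on the first:

#keep the dependent branch up to date with its parent
git checkout HDFS-776
git merge HDFS-775
#the JIRA patch is the difference between the two branches
git diff --no-prefix HDFS-775 > ../hadoop-patches/HDFS-776-1.patch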

Do not play with rebasing once you start doing this, as you will make merging a nightmare.

What to do when your patch is committed

Once your patch is committed into SVN, you do not need the branch any more. You can delete it straight away, but it is safer to verify that the patch has been completely merged in.

Pull down the latest trunk and verify that the patch branch is synchronized with it:

 git checkout trunk
 git pull apache
 git checkout mybranch
 git merge trunk
 git diff trunk

The output of the last command should be nothing: the two branches should be identical. You can then prove this to Git by switching back to the trunk branch and merging in the branch, an operation which will not change the source tree but will update Git's branch graph.

 git checkout trunk
 git merge mybranch

Now you can delete the branch without being warned by Git:

 git branch -d mybranch

Finally, propagate that deletion to your private GitHub repository:

 git push github :mybranch

This odd syntax says "push nothing to github/mybranch".
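
Newer versions of Git also accept a more readable form that does the same thing:

 git push github --delete mybranch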

Building with a Git repository

The information below this line is relevant for versions of Hadoop before 0.23.x and should be considered obsolete for later versions. It is probably out of date for Hadoop 0.22 as well.

Building the source

You need to tell all the Hadoop modules to pick up local JARs of the bits of Hadoop they depend on. You do this by making sure your Hadoop version does not match anything public, and by using the "internal" repository of locally published artifacts.

Create a build.properties file

Create a build.properties file. Do not do this in the Git directories; put it one level up, as it is going to be a shared file. This article assumes you are using Linux or another Unix, incidentally.

Make the file something like this:

#this is essential
resolvers=internal
#you can increment this number as you see fit
version=0.22.0-alpha-1
project.version=${version}
hadoop.version=${version}
hadoop-core.version=${version}
hadoop-hdfs.version=${version}
hadoop-mapred.version=${version}

The resolvers property tells Ivy to look in the local Maven artifact repository for versions of the Hadoop artifacts; if you don't set this, then only published JARs from the central repository will get picked up.

The version property, and its descendants, tell Hadoop which version of the artifacts to create and use. Set this to something different from (ideally ahead of) what is being published, to ensure that your own artifacts are picked up.

Next, symlink this file to every Hadoop module. Now a change in the file gets picked up by all three.

pushd common; ln -s ../build.properties build.properties; popd
pushd hdfs; ln -s ../build.properties build.properties; popd
pushd mapreduce; ln -s ../build.properties build.properties; popd

You are now all set up to build.

Build Hadoop

  1. In common/ run ant mvn-install

  2. In hdfs/ run ant mvn-install

  3. In mapreduce/ run ant mvn-install

This Ant target not only builds the JAR files, it copies them to the local ${user.home}/.m2 directory, where they will be picked up by the "internal" resolver. You can check that this is taking place by running ant ivy-report on a project and seeing where it gets its dependencies.

Warning: it is easy for old JAR versions to get cached and picked up. You will notice this early if something in hadoop-hdfs or hadoop-mapreduce doesn't compile, but if you are unlucky things will compile yet not work, because your updates are not being picked up. Run ant clean-cache to fix this.

By default, the trunks of the HDFS and MapReduce projects are set to grab the snapshot versions that are built and published into the Apache snapshot repository nightly. While this saves developers in those projects the complexity of having to build and publish the upstream artifacts themselves, it doesn't work if you do want to make changes to things like hadoop-common. You need to make sure the local projects are picking up what is being built locally.

To check this in the hadoop-hdfs project, generate the Ivy dependency reports using the internal resolver:

ant ivy-report -Dresolvers=internal

Then browse to the report page listed at the end of the build output, switch to the "common" tab, and look for the hadoop-common JAR. It should have a publication timestamp containing the date and time of your local build. For example, the string 20110211174419 means the date 2011-02-11 and the time 17:44:19. If an older version is listed, you probably have it cached in the Ivy cache; you can fix this by removing everything from the org.apache.hadoop corner of that cache:

rm -rf ~/.ivy2/cache/org.apache.hadoop

Rerun the ivy-report target and check that the publication date is current to verify that the version is now up to date.

Testing

Each project comes with lots of tests; run ant test to run them all, or ant test-core for just the core tests. If you have made changes to the build and the tests fail, it may be that the tests never worked on your machine. Build and test the unmodified source first, then keep an eye on both the main source and any branch you make. A good way to do this is to give a Continuous Integration server such as Hudson the job of checking out, building and testing both branches.
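
For example, from inside one of the project directories:

#run every test in the project
ant test
#run only the core tests
ant test-core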

Remember, the way Git works, your machine's own repository is something that other machines can fetch from. So, in theory, you could set up a Hudson server on another machine (or a VM) and have it pull and test against your local code. You will need to run it on a separate machine to prevent your own builds and tests from interfering with the Hudson runs.
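
One simple way to make your local repository fetchable by such a machine is Git's built-in read-only daemon; a sketch, assuming your clones live under ~/src:

 #serve every repository under ~/src read-only over the git:// protocol
 git daemon --base-path=$HOME/src --export-all --reuseaddr

The other machine can then clone and pull from git://<your host>/hadoop-common.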
