Bigtop - Apache Hadoop Ecosystem Packaging and Test

Abstract

Bigtop is a project for developing packaging and tests for the Apache Hadoop ecosystem.

Proposal

The primary goal of Bigtop is to build a community around the packaging and interoperability testing of Hadoop-related projects. This includes testing at various levels (packaging, platform, runtime, upgrade, etc.), developed by a community with a focus on the system as a whole rather than on individual projects.

This project will develop and release build, packaging and integration test code that depends upon official releases of the Apache Hadoop-related projects (HDFS, MapReduce, HBase, Hive, Pig, ZooKeeper, etc.). As bugs and other issues are found, we expect them to be fixed upstream.

Background

The initial packaging and test code for Bigtop was developed by Cloudera to package projects from the Apache Hadoop ecosystem and provide a consistent, interoperable framework.

Rationale

Hadoop defines itself as:

The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. Hadoop includes these subprojects:

* Hadoop Common: The common utilities that support the other Hadoop subprojects.
* HDFS: A distributed file system that provides high throughput access to application data.
* MapReduce: A software framework for distributed processing of large data sets on compute clusters.

There are also several other Hadoop-related projects at Apache. Some TLP examples include HBase, Hive, Mahout, ZooKeeper, and Pig. There are also several new projects in the Incubator such as HCatalog, Hama and Sqoop.

From a packaging and deployment perspective, the current loosely-coupled nature of the project has limitations:

  1. Insufficient building against trunk versions of dependent projects (in the style of Apache Gump).
  2. Insufficient testing against the trunk versions of dependent projects.
  3. No consistent packaging for the Linux servers which provide the main Hadoop datacenter platform.
  4. No functional testing against multi-machine clusters as part of the regular automated build process. This is due to a lack of a physical or virtual Hadoop cluster for testing, and not enough test suites designed to run against a live cluster with known datasets.

The intent of this project is to build a community where the projects are brought together, packaged, and tested for interoperability.

Projects such as Apache Whirr (incubating), which deploy and use a collection of Hadoop-related projects, would benefit from the interoperability testing done by Bigtop, rather than picking and testing project combinations themselves.

Initial Goals

Much of the code for Bigtop has been available from Cloudera under the Apache 2.0 license for over two years.

Some current goals include:

Bigtop’s release artifact would consist of a single tarball of packaging and test code that, when built, would produce source and binary Linux packages for the upstream projects.

Current Status

Meritocracy

Bigtop was originally developed and released as an open source packaging infrastructure, CDH, by Cloudera.

Community

The community is primarily the original developers at Cloudera; however, a number of contributions to the packaging specifications have been accepted from outside contributors. Growing a diverse community is the main reason to bring Bigtop to the Apache Incubator.

Core Developers

The core developers for the Bigtop project are:

Many of the committers to the Bigtop project have contributed to Hadoop or related Apache projects (Alejandro Abdelnur, Konstantin Boudnik, Eli Collins, Alan Gates, Patrick Hunt, Steve Loughran, Owen O'Malley, John Sichi, Michael Stack, Tom White) and are familiar with Apache principles and philosophy for community-driven software development.

Alignment

We expect projects in Bigtop to be drawn from Hadoop and related projects at Apache. Bigtop will complement these projects (Hadoop, Pig, Hive, HBase, etc.) by providing an environment where contributors interested in building more complex data processing pipelines can work together to integrate multiple projects into a well-tested whole.

Known Risks

Orphaned Products

The contributors are leading vendors of Hadoop-based technologies and have long standing in the Hadoop community. There is minimal risk of this work becoming non-strategic, and the contributors are confident that a larger community will form within the project in a relatively short space of time.

Inexperience with Open Source

All code developed for Bigtop has been open sourced under the Apache 2.0 license. Most committers to the Bigtop project are intimately familiar with the Apache model for open-source development and are experienced in working with new contributors.

Homogeneous Developers

The initial set of committers comes from a small set of organizations and from numerous existing Apache projects. We expect that, once approved for incubation, the project will attract new contributors from more organizations and will thus grow organically.

Reliance on Salaried Developers

It is expected that Bigtop will be developed on both salaried and volunteer time, although all of the initial developers will work on it mainly on salaried time.

Relationships with Other Apache Products

Bigtop depends upon other Apache projects, including Apache Hadoop, Apache HBase, Apache Hive, Apache Pig, Apache ZooKeeper, Apache Thrift, Apache Avro, and Apache Whirr. The build system uses Apache Ant and Apache Maven.

An Excessive Fascination with the Apache Brand

We would like Bigtop to become an Apache project to further foster a healthy community of contributors and consumers around interoperability, testing and packaging of Hadoop projects. Since Bigtop directly interacts with many Apache Hadoop-related projects and solves important problems for many Hadoop users, residing in the Apache Software Foundation will increase interaction with the larger community.

Documentation

Initial Source

Source and Intellectual Property Submission Plan

https://github.com/cloudera/bigtop

External Dependencies

The required external dependencies are all available under the Apache License or compatible licenses.

Cryptography

Bigtop doesn't use cryptography itself; however, Hadoop projects use standard APIs and tools for SSH and SSL communication where necessary.

Required Resources

Mailing lists

Subversion Directory

https://svn.apache.org/repos/asf/incubator/bigtop

Issue Tracking

JIRA BIGTOP (Bigtop)

Other Resources

The existing code already has unit and integration tests, so we would like a Jenkins instance to run them whenever a new patch is submitted. This can be added after project creation.

To test RPM and deb install/uninstall and upgrade, it is useful to have a set of virtual machine images in known states, and servers that can bring them up. It should be possible to use Apache Whirr to choreograph the VM setup/teardown, so these tests could be performed against VMs on developer desktops or on large-scale VM-hosting platforms. For the latter, VM hosting time would be appreciated.

Initial Committers

Affiliations

Sponsors

Champion

Nominated Mentors

Sponsoring Entity