Tez

Abstract

Tez is an effort to develop a generic application framework which can be used to process arbitrarily complex data-processing tasks and also a re-usable set of data-processing primitives which can be used by other projects.

Proposal

Tez is a proposal to develop a generic application which can be used to process complex data-processing task DAGs and runs natively on Apache Hadoop YARN. YARN is a generic resource-management system on which currently applications like MapReduce already exist. MapReduce is a specific, and constrained, DAG - which is not optimal for several frameworks like Apache Hive and Apache Pig. Furthermore, we propose to develop a re-usable set of libraries of data-processing primitives such as sorting, merging, data-shuffling, intermediate data management etc. which are necessary for Tez which we envision can be used directly by other projects.

Background

Apache Hadoop MapReduce has emerged as the assembly-language on which other frameworks like Apache Pig and Apache Hive have been built. However, it has been well accepted that MapReduce produces very constrained task DAGs for each job which results in Apache Pig and Apache Hive requiring multiple MapReduce jobs for several queries. By providing a more expressive DAG of tasks for a job, Tez attempts to provide significantly enhanced data-processing capabilities for projects like Apache Pig, Apache Hive, Cascading etc.

Rationale

There is an important gap that Tez fulfills in the Apache Hadoop ecosystem of allowing for more expressive task DAGs for data-processing applications such as Apache Pig, Apache Hive, Cascading etc.

With emergence of Apache Hadoop YARN, there is a strong need for a common DAG application which can then be shared by Apache Pig, Apache Hive, Cascading etc.

Initial Goals

The initial goals for this project are to specify the detailed requirements and architecture, and then develop the initial implementation including the DAG ApplicationMaster to run natively inside Apache Hadoop YARN.

Current Status

Significant work has been completed to identify the initial requirements and define the overall system architecture. There is a patch available in the internal Hortonworks git repository which can act as the initial seed.

Meritocracy

We plan to invest in supporting a meritocracy. We will discuss the requirements in an open forum. Several companies have already expressed interest in this project, and we intend to invite additional developers to participate. We will encourage and monitor community participation so that privileges can be extended to those that contribute.

Community

The need for a generic DAG application for data processing in the open source is tremendous, so there is a potential for a very large community. We believe that Tez's extensible architecture will further encourage community participation. Also, related Apache projects (eg, Pig, Hive) have very large and active communities, and we expect that over time Tez will also attract a large community.

Core Developers

The developers on the initial committers list include people very experienced in the Apache Hadoop ecosystem:

We realize that though we have significant employer diversity already, additional diversity is always better, and we will work aggressively to recruit developers from additional companies.

Alignment

The initial committers strongly believe that a standard task DAG application on Apache Hadoop YARN will gain broader adoption as an open source, community driven project, where the community can contribute not only to the core components, but also to a growing collection of applications which will be based on top of Tez. Our hope is that the Apache Hive, Apache Pig, Cascading and other communities will find tremendous value in Tez and will adopt it en masse.

Known Risks

Orphaned Products

The contributors are leading users and vendors in the Apache Hadoop ecosystem, with significant open source experience, so the risk of being orphaned is relatively low. The project could be at risk if vendors decided to change their strategies in the market. In such an event, the current committers plan to continue working on the project on their own time, though the progress will likely be slower. We plan to mitigate this risk by recruiting additional committers.

Inexperience with Open Source

The initial committers include veteran Apache members (Committers, PMC members and Apache Members) and other developers who have varying degrees of experience with open source projects. All have been involved with source code that has been released under an open source license, and several also have experience developing code with an open source development process.

Homogenous Developers

The initial committers are employed by a number of companies, including Cloudera, Facebook, Hortonworks, Microsoft, Twitter and Yahoo. We are committed to recruiting additional committers from other companies based on their contributions to the project even though we do have significant diversity already.

Reliance on Salaried Developers

It is expected that Tez development will occur on both salaried time and on volunteer time, after hours. The majority of initial committers are paid by their employer to contribute to this project. However, they are all passionate about the project, and we are confident that the project will continue even if no salaried developers contribute to the project. We are committed to recruiting additional committers including non-salaried developers.

Relationships with Other Apache Products

As mentioned in the Alignment section, Tez is closely integrated with Hadoop, Hive and Pig in a numerous ways. We look forward to collaborating with those communities, as well as other Apache communities.

An Excessive Fascination with the Apache Brand

Tez solves a real need for generic task DAG management in the Apache Hadoop ecosystem, something which has been addressed in a very ad hoc manner so far by multiple Apache projects. Our rationale for developing Tez as an Apache project is detailed in the Rationale section. We believe that the Apache brand and community process will help us attract more contributors to this project, and help establish ubiquitous APIs.

Documentation

http://wiki.apache.org/incubator/TezProposal

Initial Source

Available as a patch.

Cryptography

Tez will eventually support encryption on the wire. This is not one of the initial goals, and we do not expect Tez to be a controlled export item due to the use of encryption.

Required Resources

Mailing List

Subversion Directory

Git is the preferred source control system: git://git.apache.org/tez

Issue Tracking

JIRA Tez (TEZ)

Initial Committers

Affiliations

The initial committers are employees of Cloudera, Facebook, Hortonworks, Microsoft, Twitter and Yahoo Inc.

The nominated mentors are employees of Hortonworks, LinkedIn, NASA JPL and Microsoft.

Sponsors

Champion

Arun C Murthy <acmurthy at apache dot org>

Nominated Mentors

Sponsoring Entity

Incubator

TezProposal (last edited 2013-02-20 04:18:29 by ArunMurthy)