Drill

Abstract

Drill is a distributed system for interactive analysis of large-scale datasets, inspired by Google's Dremel.

Proposal

Drill is a distributed system for interactive analysis of large-scale datasets. Drill is similar to Google's Dremel, with the additional flexibility needed to support a broader range of query languages, data formats and data sources. It is designed to efficiently process nested data. A design goal is to scale to 10,000 servers or more and to process petabytes of data and trillions of records in seconds.

Background

Many organizations have the need to run data-intensive applications, including batch processing, stream processing and interactive analysis. In recent years open source systems have emerged to address the need for scalable batch processing (Apache Hadoop) and stream processing (Storm, Apache S4). In 2010 Google published a paper called "Dremel: Interactive Analysis of Web-Scale Datasets," describing a scalable system used internally for interactive analysis of nested data. No open source project has successfully replicated the capabilities of Dremel.

Rationale

There is a strong need in the market for low-latency interactive analysis of large-scale datasets, including nested data (eg, JSON, Avro, Protocol Buffers). This need was identified by Google and addressed internally with a system called Dremel.

In recent years open source systems have emerged to address the need for scalable batch processing (Apache Hadoop) and stream processing (Storm, Apache S4). Apache Hadoop, originally inspired by Google's internal MapReduce system, is used by thousands of organizations processing large-scale datasets. Apache Hadoop is designed to achieve very high throughput, but is not designed to achieve the sub-second latency needed for interactive data analysis and exploration. Drill, inspired by Google's internal Dremel system, is intended to address this need.

It is worth noting that, as explained by Google in the original paper, Dremel complements MapReduce-based computing. Dremel is not intended as a replacement for MapReduce and is often used in conjunction with it to analyze outputs of MapReduce pipelines or rapidly prototype larger computations. Indeed, Dremel and MapReduce are both used by thousands of Google employees.

Like Dremel, Drill supports a nested data model with data encoded in a number of formats such as JSON, Avro or Protocol Buffers. In many organizations nested data is the standard, so supporting a nested data model eliminates the need to normalize the data. With that said, flat data formats, such as CSV files, are naturally supported as a special case of nested data.
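To illustrate why flat data can be treated as a special case of nested data, here is a minimal Python sketch. It is not Drill code: the helper `max_depth` and the sample records are hypothetical and exist only for illustration. It parses one JSON record and one CSV row into the same in-memory shape and measures how deeply each is nested.

```python
import csv
import io
import json


def max_depth(value):
    """Return the nesting depth of a value: scalars are 0, each dict or list adds one level."""
    if isinstance(value, dict):
        return 1 + max((max_depth(v) for v in value.values()), default=0)
    if isinstance(value, list):
        return 1 + max((max_depth(v) for v in value), default=0)
    return 0


# A nested record, as it might arrive in JSON (sample data, made up for this sketch).
nested_record = json.loads(
    '{"name": "a", "links": {"forward": [20, 40], "backward": []}}'
)

# A flat record read from CSV: every field is a scalar, so there is a single level.
csv_text = "name,country\na,us\n"
flat_record = next(csv.DictReader(io.StringIO(csv_text)))

print(max_depth(nested_record))  # depth of the JSON record
print(max_depth(flat_record))    # depth of the CSV row
```

Both records are plain mappings from field names to values; the CSV row simply never nests below the top level, so code written against the nested model can handle it without a separate code path.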

The Drill architecture consists of four key components/layers:

It is worth noting that no open source project has successfully replicated the capabilities of Dremel, nor have any taken on the broader goals of flexibility (eg, pluggable query languages, data formats, data sources and execution engine operators/connectors) that are part of Drill.

Initial Goals

The initial goals for this project are to specify the detailed requirements and architecture, and then develop the initial implementation, including the execution engine and DrQL. Like Apache Hadoop, which was built to support multiple storage systems (through the FileSystem API) and file formats (through the InputFormat/OutputFormat APIs), Drill will be built to support multiple query languages, data formats and data sources. The initial implementation of Drill will support DrQL and a column-based format similar to Dremel's.

Current Status

Significant work has been completed to identify the initial requirements and define the overall system architecture. The next step is to implement the four components described in the Rationale section, and we intend to do that development as an Apache project.

Meritocracy

We plan to invest in supporting a meritocracy. We will discuss the requirements in an open forum. Several companies have already expressed interest in this project, and we intend to invite additional developers to participate. We will encourage and monitor community participation so that privileges can be extended to those that contribute. Also, Drill has an extensible/pluggable architecture that encourages developers to contribute various extensions, such as query languages, data formats, data sources and execution engine operators and connectors. While some companies will surely develop commercial extensions, we also anticipate that some companies and individuals will want to contribute such extensions back to the project, and we look forward to fostering a rich ecosystem of extensions.

Community

The need for an open source system for interactive analysis of large datasets is tremendous, so there is potential for a very large community. We believe that Drill's extensible architecture will further encourage community participation. Also, related Apache projects (eg, Hadoop) have very large and active communities, and we expect that over time Drill will also attract a large community.

Core Developers

The developers on the initial committers list include experienced distributed systems engineers:

We realize that additional employer diversity is needed, and we will work aggressively to recruit developers from additional companies.

Alignment

The initial committers strongly believe that a system for interactive analysis of large-scale datasets will gain broader adoption as an open source, community-driven project, where the community can contribute not only to the core components, but also to a growing collection of query languages and optimizers, data formats, data sources, and execution engine operators and connectors. Drill will integrate closely with Apache Hadoop. First, the data will live in Hadoop. That is, Drill will support Hadoop FileSystem implementations and HBase. Second, Hadoop-related data formats will be supported (eg, Apache Avro, RCFile). Third, MapReduce-based tools will be provided to produce column-based formats. Fourth, Drill tables can be registered in HCatalog. Finally, Hive is being considered as the basis of the DrQL implementation.

Known Risks

Orphaned Products

The contributors are leading vendors in this space, with significant open source experience, so the risk of being orphaned is relatively low. The project could be at risk if vendors decided to change their strategies in the market. In such an event, the current committers plan to continue working on the project on their own time, though the progress will likely be slower. We plan to mitigate this risk by recruiting additional committers.

Inexperience with Open Source

The initial committers include veteran Apache members (committers and PMC members) and other developers who have varying degrees of experience with open source projects. All have been involved with source code that has been released under an open source license, and several also have experience developing code with an open source development process.

Homogenous Developers

The initial committers are employed by a number of companies, including MapR Technologies, Concurrent and Drawn to Scale. We are committed to recruiting additional committers from other companies.

Reliance on Salaried Developers

It is expected that Drill development will occur on both salaried time and on volunteer time, after hours. The majority of initial committers are paid by their employer to contribute to this project. However, they are all passionate about the project, and we are confident that the project will continue even if no salaried developers contribute to the project. We are committed to recruiting additional committers including non-salaried developers.

Relationships with Other Apache Products

As mentioned in the Alignment section, Drill is closely integrated with Hadoop, Avro, Hive and HBase in numerous ways. For example, Drill data lives inside a Hadoop environment (Drill operates on in situ data). We look forward to collaborating with those communities, as well as other Apache communities.

An Excessive Fascination with the Apache Brand

Drill solves a real problem that many organizations struggle with, and has been proven within Google to be of significant value. The architecture is based on academic and industry research. Our rationale for developing Drill as an Apache project is detailed in the Rationale section. We believe that the Apache brand and community process will help us attract more contributors to this project, and help establish ubiquitous APIs. In addition, establishing consensus among users and developers of a Dremel-like tool is a key requirement for success of the project.

Documentation

Drill is inspired by Google's Dremel. Google has published a paper highlighting Dremel's innovative nested column-based data format and execution engine.

Initial Source

The requirement and design documents are currently stored in MapR Technologies' source code repository. They will be checked in as part of the initial code dump. Check out the attached slides.

Cryptography

Drill will eventually support encryption on the wire. This is not one of the initial goals, and we do not expect Drill to be a controlled export item due to the use of encryption.

Required Resources

Mailing List

Subversion Directory

Git is the preferred source control system: git://git.apache.org/drill

Issue Tracking

JIRA Drill (DRILL)

Initial Committers

Affiliations

The initial committers are employees of MapR Technologies, Drawn to Scale and Concurrent. The nominated mentors are employees of MapR Technologies, Lucid Imagination and Nokia.

Sponsors

Champion

Ted Dunning (tdunning at apache dot org)

Nominated Mentors

Sponsoring Entity

Incubator

DrillProposal (last edited 2012-08-09 04:26:11 by tshiran)