Apache Rheem Proposal
Rheem is a cross-platform data analytics tool. Its goal is to decouple the business logic of data analytics applications from concrete data processing platforms, such as Apache Hadoop or Apache Spark. Hence, it tames the complexity that arises from the “Cambrian explosion” of novel data analytics tools that we currently witness.
Rheem is a cross-platform system that provides an abstraction over data processing systems to free users from the burdens of: performing tedious and costly data migration and integration tasks to run their applications; and choosing the right data processing platforms for their applications. To achieve this, Rheem: (1) provides an abstraction on top of existing data processing platforms that allows users to specify their data analytics tasks in a form of a DAG of operators; (2) comes with a cross-platform optimizer for automating the selection of suitable/efficient platforms; and (3) and finally takes care of executing the optimized plan, including communication across platforms. In summary, Rheem has the following salient features:
- Flexible Data Model -- It considers a flexible and simple data model based on data quanta. A data quantum is an atomic processing unit in the system, that can represent a large spectrum of data formats, such as data points for a machine learning application, tuples for a database application, or RDF triples. Hence, Rheem is able to express a wide range of data analytics tasks.
Platform independence -- It provides a simple interface (currently Java and Scala) that is inspired by established programming models, such as that of Apache Spark and Apache Flink. Users represent their data analytic tasks as a DAG (Rheem plan), where vertices correspond to Rheem operators and edges represent data flows (data quanta flowing) among these operators. A Rheem operator defines a particular kind of data transformation over an input data quantum, ranging from basic functionality (e.g., transformations, filters, joins) to complex, extensible tasks (e.g., PageRank).
- Cross-platform execution -- Besides running a data analytic task on any data processing platform, it also comes with an optimizer that can decide to execute a single data analytic task using multiple data processing platforms. This allows for exploiting the capabilities of different data processing platforms to perform complex data analytic tasks more efficiently.
- Self-tuning UDF-based cost model -- Its optimizer uses a cost model fully based on UDFs. This not only enables Rheem to learn the cost functions of newly added data processing platforms, but also allows developers to tune the optimizer at will.
- Extensibility -- It treats data processing platforms as plugins to allow users (developers) to easily incorporate new data processing platform into the system. This is achieved by exposing the functionalities of data processing platforms as operators (execution operators). The same approach is followed at the Rheem interface, where users can also extend Rheem capabilities, i.e., the operators, easily.
We plan to work on the stability of all these features as well as extending Rheem with more advanced features. Furthermore, Rheem currently supports Apache Spark, Standalone Java, GraphChi, and Relational databases via JDBC. We plan to incorporate more data processing platforms, such as Apache Flink and Apache Hadoop.
Many organizations and companies collect or produce large variety of data to apply data analytics over them. This is because insights from data rapidly allow them to make better decisions. Thus, the pursuit for efficient and scalable data analytics as well as the one-size-does-not-fit-all philosophy has given rise to a plethora of data processing platforms. Examples of these specialized processing platforms range from DBMSs to MapReduce-like platforms.
However, today's data analytics are moving beyond the limits of a single data processing platform. More and more applications need to perform complex data analytics over several data processing platforms. For example, (i) IBM reported that North York hospital needs to process 50 diverse datasets, which are on a dozen different internal systems, (ii) oil & gas companies stated they need to process large amounts of data they produce everyday, e.g., a single oil company can produce more than 1.5TB of diverse (structured and unstructured) data per day, (iii) Fortune magazine stated that airlines need to analyze large datasets, which are produced by different departments, are of different data formats, and reside on multiple data sources, to produce global reports for decision makers, and (iv) Hewlett Packard has claimed that, according to its customer portfolio, business intelligence typically require a single analytics pipeline using different processing platforms at different parts of the pipeline. These are just few examples of emerging applications that require a diversity of data processing platforms.
Today, developers have to deal with this myriad of data processing platforms. That is, they have to choose the right data processing platform for their applications (or data analytic tasks) and to familiarize with the intricacies of the different platforms to achieve high efficiency and scalability. Several systems have also appeared with the goal of helping users to easily glue several platforms together, such as Apache Drill, PrestoDB, and Luigi. Nevertheless, all these systems still require quite good expertise from users to decide which data processing platforms to use for the data analytic task at hand. In consequence, great engineering effort is required to unify the data from various sources, to combine the processing capabilities of different platforms, and to maintain those applications, so as to unleash the full potential of the data. In the worst case, such applications are not built in the first place, as it seems too much of a daunting endeavor.
It is evident that there is an urgent need to release developers from the burden of knowing all the intricacies of choosings and glueing together data processing platforms for supporting their applications (data analytic tasks). Developers must focus only on the logics of their applications.
It is evident that there is an urgent need to release developers from the burden of knowing all the intricacies of choosing and glueing together data processing platforms for supporting their applications (data analytic tasks). Developers must focus only on the logics of their applications. Surprisingly, there is no open source system trying to satisfy this urgent need. Rheem aims at filling this gap. It copes with this urgent need by providing both a common interface over data processing platforms and an optimizer to execute a data analytic tasks on the right data processing platform(s) seamlessly. As Apache is the place where most of the important big data systems are, we then consider Apache as the right place for Rheem.
The current version of Rheem (v0.2.1) was co-developed by staff, students, and interns at the Qatar Computing Research Institute (QCRI) and the Hasso-Plattner Institute (HPI). The projects was initiated at and sponsored by QCRI in 2015 with the goal of freeing data scientists and developers from the intricacies of data processing platforms to support their analytic tasks. The first open source release of Rheem was made only one year and a half later, in June 13th of 2016, under the Apache Software License 2.0. Since then we have been very active and have made two more releases of Rheem, the latest release was done on November 30th of 2016. We plan to have our next release by June 2017.
All current Rheem developers are familiar with this development process at Apache and are currently trying to follow this meritocracy process as much as possible. For example, Rheem already follows a committer principle where any pull request is analyzed by at least one Rheem core developer. This was one of the reasons of choosing Apache for Rheem as we all want to encourage and keep this style of development for Rheem.
Rheem started as a pure research project at QCRI, but it quickly started developing into a community. People from HPI quickly joined our efforts almost from the very beginning to make this project a reality. Recently, the Pontifical Catholic University of Valparaiso (PUCV) in Chile has also joined our efforts for developing Rheem. Currently, we are intensively seeking to further develop both developer and user communities. To keep broaden the community, we plan to also exploit our ongoing academic collaborations with the MIT. For instance, Rheem is already being utilized for accessing multiple data sources in the context of a large data analytics project led by MIT and QCRI. We also believe that Rheem's extensible architecture (i.e., adding new operators and platforms) will further encourage community participation. Furthermore, we are currently collaborating with a large airline company in the middle-east in the context of Rheem. During incubation we plan to have Rheem adopted by this airline company and will explicitly seek more industrial participation.
The initial developers of the project are diverse, they are from three different institutions (QCRI, HPI, and PUCV). Although 50% of the developers are from QCRI, due to the origins of the project, we will work aggressively to change this during the incubation by recruiting more developers from other institutions.
We believe Apache is the most natural home for taking Rheem to the next level. Apache is currently hosting the most important big data systems. Hadoop, Spark, Flink, HBase, Hive, Tez, Reef, Storm, Drill, and Ignite are just some examples of these technologies. Rheem fills a significant gap -- it provides a common abstraction for all these platforms and decides on which platforms to run a single data analytic task -- that exist in the big data open source world. Rheem is now being developed following the Apache-style development model. Also, it is well-aligned with the Apache principle of building a community to impact the big data community.
Currently, Rheem is one of the important projects at QCRI at long term. As a result, a team of research scientists and engineers are working on full time basis on this project. Thus, the risk of Rheem orphaned is relatively very low. Still, people outside QCRI (from HPI) has also joined the project, which makes the risk of abandoning the project even lower. The PUCV in Chile is also beginning to contribute to the code base and to develop a declarative query language on top of Rheem. The project is constantly being monitored by email and frequent Skype meeting as well as by weekly meetings physically at QCRI for QCRI people. Additionally, at the end of each year, we meet to discuss the status of the project as well as to plan the most important aspects we should work on during the year after.
Inexperience with Open Source
Rheem quickly started being developed in open source between QCRI and HPI under the Apache Software License 2.0. The source code is available on Github. Also few of the initial committers have contributed to other open source projects:TODO
The initial committers are already geographically distributed among Qatar (QCRI), Germany (HPI), and Chile (PUCV). During incubation, one of our main goals is to increase the heterogeneity of the current community and we will work hard to achieve it.
Reliance on Salaried Developers
Rheem is already being developed by a mix of full time and volunteer time. Only 3 of the initial committers are working full time on this project. Although these three committers are paid to contribute to this project, they are all passionate about the project. The rest of initial committers are faculty, staff, and students working on the project on volunteer basis. So, we are confident that the project will not decrease its development pace. Furthermore, we are committed to recruit additional committers to significantly increase the development pace of the project.
Relationships with Other Apache Products
Rheem is somehow related to Apache Spark as its developing interface is inspired from Spark. In contrast to Spark, Rheem is not a data processing platform, but a mediator between user applications and data processing platforms. In this sense, Rheem is similar to the Apache Drill project, and Apache Beam. However, Rheem significantly differs from Apache Drill in two main aspects. First, Apache Drill provides only a common interface to query multiple data storages and hence users have to specify in their query the data to fetch. Then, Apache Drill translates the query to the processing platforms where the data is stored, e.g. into mongoDB query representation. In contrast, in Rheem, users only specify the data path and Rheem decides which is the best (performance-wise) data processing platforms to use to process such data. Second, the query interface in Apache Drill is SQL. Rheem uses an interface based on operators forming DAGs. In this latter point, we are currently developing a PIGLatin-like query language for Rheem. In addition, in contrast to Apache Beam, Rheem not only allows users to use multiple data processing platforms at the same time, but also it provides an optimizer to choose the most efficient platform for the task at hand. In Apache Beam, users have to specify an appropriate runner (platform). Apache Beam simply aims at abstracting the pipeline of the data processing.
Given these similarities with the two Apache projects mentioned above, we are looking forward to collaborating with those communities. Still, we are open and would also love to collaborate with other Apache communities as well.
An Excessive Fascination with the Apache Brand
Rheem solves a real problem that currently users and developers have to deal with at a high cost: monetary cost, high design and development efforts, and very time consuming. Therefore, we believe that Rheem can be successful in building a large community around it. We are convinced that the Apache brand and community process will significantly help us in building such a community and to establish the project in the long-term. We simply believe that ASF is the right home for Rheem to achieve this.
Further details, documentation, and publications related to Rheem can be found at http://da.qcri.org/rheem/
The current source code of Rheem resides in Github: https://github.com/rheem-ecosystem/rheem
Rheem depends on the following Apache projects:
- Hadoop Common
Rheem depends on the following other open source projects organized by license:
- Snake Yaml
Git is the preferred source control system: git://git.apache.org/repos/asf/incubator/rheem
The following list gives the planned initial committers (in alphabetical order):
Bertty Contreras-Rojas <email@example.com>
Yasser Idris <firstname.lastname@example.org>
Zoi Kaoudi <email@example.com>
Sebastian Kruse <Sebastian.Kruse@hpi.de>
Essam Mansour <firstname.lastname@example.org>
Wenceslao Palma-Muñoz <email@example.com>
Jorge-Arnulfo Quiane-Ruiz <firstname.lastname@example.org>
Anis Troudi <email@example.com>
- Qatar Computing Research Institute, HBKU, Qatar (QCRI)
- Yasser Idris
- Zoi Kaoudi
- Essam Mansour
- Jorge-Arnulfo Quiane-Ruiz
- Anis Troudi
- Hasso-Plattner Institute, Germany (HPI)
- Sebastian Kruse
- Pontifical Catholic University of Valparaiso, Chile (PUCV)
- Bertty Contreras-Rojas
- Wenceslao Palma-Muñoz
- Venkatesh Seetharam (venkatesh)
- Venkatesh Seetharam (venkatesh)
The Apache Incubator
- 1 - Papers detailing Rheem are available at