Apache Wayang Proposal

Abstract

Wayang is a cross-platform data processing system that aims to decouple the business logic of data analytics applications from concrete data processing platforms, such as Apache Flink or Apache Spark. It thereby tames the complexity that arises from the “Cambrian explosion” of novel data processing platforms that we are currently witnessing.

Note that the Wayang project was formerly known as the Rheem project; it was renamed because of trademark issues.

You can find the project web page at: https://rheem-ecosystem.github.io/

Proposal

Wayang is a cross-platform system that provides an abstraction over data processing platforms to free users from the burdens of (i) performing tedious and costly data migration and integration tasks to run their applications, and (ii) choosing the right data processing platforms for their applications. To achieve this, Wayang: (1) provides an abstraction on top of existing data processing platforms that allows users to specify their data analytics tasks in the form of a DAG of operators; (2) comes with a cross-platform optimizer that automates the selection of suitable and efficient platforms; and (3) takes care of executing the optimized plan, including communication across platforms. In summary, Wayang has the following salient features:

  • Flexible Data Model – It considers a flexible and simple data model based on data quanta. A data quantum is an atomic processing unit in the system that can represent a large spectrum of data formats, such as data points for a machine learning application, tuples for a database application, or RDF triples. Hence, Wayang is able to express a wide range of data analytics tasks.
  • Platform independence – It provides a simple interface (currently Java and Scala) that is inspired by established programming models, such as those of Apache Spark and Apache Flink. Users represent their data analytic tasks as a DAG (Wayang plan), where vertices correspond to Wayang operators and edges represent data flows (data quanta flowing) among these operators. A Wayang operator defines a particular kind of data transformation over an input data quantum, ranging from basic functionality (e.g., transformations, filters, joins) to complex, extensible tasks (e.g., PageRank).
  • Cross-platform execution – Besides running a data analytic task on any data processing platform, it also comes with an optimizer that can decide to execute a single data analytic task using multiple data processing platforms. This allows for exploiting the capabilities of different data processing platforms to perform complex data analytic tasks more efficiently.
  • Self-tuning UDF-based cost model – Its optimizer uses a cost model fully based on UDFs. This not only enables Wayang to learn the cost functions of newly added data processing platforms, but also allows developers to tune the optimizer at will.
  • Extensibility – It treats data processing platforms as plugins to allow users (developers) to easily incorporate new data processing platforms into the system. This is achieved by exposing the functionalities of data processing platforms as operators (execution operators). The same approach is followed at the Wayang interface, where users can also extend Wayang capabilities, i.e., the operators, easily.
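
The interplay between the cross-platform optimizer and the UDF-based cost model can be illustrated with a small sketch. The following Python snippet is not Wayang's API (its interfaces are Java and Scala); every name in it (Operator, choose_platforms, the cost functions, and the fixed switch_cost) is a hypothetical stand-in. It also simplifies heavily: the plan is a linear pipeline rather than a general DAG, the cardinality stays constant across operators, and moving data between two platforms has one fixed cost.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# A cost UDF maps an estimated input cardinality to an estimated cost.
CostFn = Callable[[int], int]


@dataclass(frozen=True)
class Operator:
    """Platform-agnostic operator with one cost UDF per platform (hypothetical)."""
    name: str
    costs: Dict[str, CostFn]


def choose_platforms(pipeline: List[Operator], cardinality: int,
                     switch_cost: int) -> Tuple[int, List[str]]:
    """Pick one platform per operator so that the sum of operator costs plus
    a fixed penalty for every hand-over between platforms is minimal."""
    # best[p] = (cheapest cost, assignment) among plans whose last operator runs on p
    best = {p: (fn(cardinality), [p]) for p, fn in pipeline[0].costs.items()}
    for op in pipeline[1:]:
        step = {}
        for p, fn in op.costs.items():
            # Cheapest way to arrive at platform p from any previous platform q.
            prev_cost, prev_plan = min(
                ((c + (0 if q == p else switch_cost), plan)
                 for q, (c, plan) in best.items()),
                key=lambda entry: entry[0],
            )
            step[p] = (prev_cost + fn(cardinality), prev_plan + [p])
        best = step
    return min(best.values(), key=lambda entry: entry[0])


# Made-up cost UDFs: "java" has no start-up overhead, "spark" scales better.
cheap = {"java": lambda n: n, "spark": lambda n: 100 + n // 5}
heavy = {"java": lambda n: 50 * n, "spark": lambda n: 500 + 2 * n}
pipeline = [Operator("source", cheap), Operator("filter", cheap),
            Operator("pagerank", heavy)]

print(choose_platforms(pipeline, 10, 10))     # (520, ['java', 'java', 'java'])
print(choose_platforms(pipeline, 100, 10))    # (910, ['java', 'java', 'spark'])
print(choose_platforms(pipeline, 10000, 10))  # (24700, ['spark', 'spark', 'spark'])
```

With these made-up numbers, the same logical plan runs entirely on one platform for very small or very large inputs and is split across two platforms in between, which is the kind of decision the features above delegate to the optimizer instead of the user.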

We plan to work on the stability of all these features as well as on extending Wayang with more advanced features. Furthermore, Wayang currently supports Apache Spark, Standalone Java, GraphChi, and relational databases (via JDBC). We plan to incorporate more data processing platforms, such as Apache Flink and Apache Hive.

Background

Many organizations and companies collect or produce a large variety of data and apply data analytics to it, because insights from data allow them to make better decisions quickly. Thus, the pursuit of efficient and scalable data analytics, together with the one-size-does-not-fit-all philosophy, has given rise to a plethora of data processing platforms. Examples of these specialized processing platforms range from DBMSs to MapReduce-like platforms.

However, today's data analytics are moving beyond the limits of a single data processing platform. More and more applications need to perform complex data analytics over several data processing platforms. For example, (i) IBM reported that North York hospital needs to process 50 diverse datasets, which are on a dozen different internal systems, (ii) oil & gas companies stated they need to process large amounts of data they produce every day, e.g., a single oil company can produce more than 1.5TB of diverse (structured and unstructured) data per day, (iii) Fortune magazine stated that airlines need to analyze large datasets, which are produced by different departments, are of different data formats, and reside on multiple data sources, to produce global reports for decision makers, and (iv) Hewlett Packard has claimed that, according to its customer portfolio, business intelligence typically requires a single analytics pipeline using different processing platforms at different parts of the pipeline. These are just a few examples of emerging applications that require a diversity of data processing platforms.

Today, developers have to deal with this myriad of data processing platforms. That is, they have to choose the right data processing platform for their applications (or data analytic tasks) and familiarize themselves with the intricacies of the different platforms to achieve high efficiency and scalability. Several systems have appeared with the goal of helping users to easily glue several platforms together, such as Apache Drill, PrestoDB, and Luigi. Nevertheless, all these systems still require considerable expertise from users to decide which data processing platforms to use for the data analytic task at hand. Consequently, great engineering effort is required to unify the data from various sources, to combine the processing capabilities of different platforms, and to maintain those applications, so as to unleash the full potential of the data. In the worst case, such applications are not built in the first place, as building them seems too daunting an endeavor.

Rationale

It is evident that there is an urgent need to release developers from the burden of knowing all the intricacies of choosing and gluing together data processing platforms to support their applications (data analytic tasks). Developers should be able to focus only on the logic of their applications. Surprisingly, to the best of our knowledge, there is no open source system trying to satisfy this need. Wayang aims at filling this gap by providing both a common interface over data processing platforms and an optimizer that seamlessly executes data analytic tasks on the right data processing platform(s). As Apache hosts most of the important big data systems, we consider Apache the right place for Wayang.

Current Status

The current version of Wayang (v0.5.0) was initially co-developed by staff, students, and interns at the Qatar Computing Research Institute (QCRI) and the Hasso-Plattner Institute (HPI). The project was initiated at and sponsored by QCRI in 2015 with the goal of freeing data scientists and developers from the intricacies of data processing platforms to support their analytic tasks. The first open source release of Wayang was made only a year and a half later, on June 13th, 2016, under the Apache License 2.0. Since then, we have made several releases; the latest one was made on January 23rd, 2019.

Meritocracy

All current Wayang developers are familiar with the development process at Apache and are trying to follow its meritocratic principles as much as possible. For example, Wayang already follows a committer principle where every pull request is reviewed by at least one Wayang core developer. This was one of the reasons for choosing Apache for Wayang, as we all want to encourage and keep this style of development.

Community

Wayang started as a pure research project, but a community quickly formed around it. People from HPI joined our efforts almost from the very beginning to make this project a reality. Recently, the Berlin Institute of Technology (TU Berlin) and the Pontifical Catholic University of Valparaiso (PUCV) in Chile have also joined our efforts for developing Wayang. A company called Scalytics has been created around Wayang. Currently, we are intensively seeking to further develop both the developer and the user communities. To keep broadening the community, we also plan to leverage our ongoing academic collaborations with multiple universities in Berlin and with the companies that we collaborate with. For instance, Wayang is already being used for accessing multiple data sources in the context of a large data analytics project led by TU Berlin and Huawei. We also believe that Wayang's extensible architecture (i.e., adding new operators and platforms) will further encourage community participation. During incubation we plan to have Wayang adopted by at least one company and will explicitly seek more industrial participation.

Core Developers

The initial developers of the project are diverse: they come from four different institutions (TU Berlin, Scalytics, PUCV, and HBKU). We will work actively to grow the community during incubation by recruiting more developers from other institutions.

Alignment

We believe Apache is the most natural home for taking Wayang to the next level. Apache is currently hosting the most important big data systems. Hadoop, Spark, Flink, HBase, Hive, Tez, Reef, Storm, Drill, and Ignite are just some examples of these technologies. Wayang fills a significant gap that exists in the big data open source world: it provides a common abstraction for all these platforms and decides on which platforms to run a single data analytic task. Wayang is now being developed following the Apache-style development model. It is also well aligned with the Apache principle of building a community to impact the big data community.

Known Risks

Orphaned Products

Currently, Wayang is the core technology behind Scalytics Inc. As a result, a team of two engineers is working full time on this project. Recently, three more developers have joined our efforts in building Wayang. Thus, the risk of Wayang becoming orphaned is very low. In addition, people outside Scalytics (from TU Berlin and HBKU) have joined the project, which makes the risk of the project being abandoned even lower. The PUCV in Chile is also beginning to contribute to the code base and to develop a declarative query language on top of Wayang. The project is constantly being monitored by email and frequent Skype meetings as well as by weekly meetings with Scalytics people. Additionally, at the end of each year, we meet to discuss the status of the project and to plan the most important aspects we should work on during the following year.

Inexperience with Open Source

Wayang moved to open source development early on, under the Apache License 2.0. The source code is available on GitHub. Also, a few of the initial committers have contributed to other open source projects, namely Hadoop and Flume.

Homogenous Developers

The initial committers are already geographically distributed among Chile, Germany, and Qatar. During incubation, one of our main goals is to increase the heterogeneity of the current community and we will work hard to achieve it.

Reliance on Salaried Developers

Wayang is already being developed by a mix of full-time and volunteer contributors. Only two of the initial committers (from Scalytics) are working full time on this project, so the project does not depend solely on salaried developers and we are confident that its development pace will not decrease. Furthermore, we are committed to recruiting additional committers to significantly increase the development pace of the project.

Relationships with Other Apache Products

Wayang is somewhat related to Apache Spark, as its development interface is inspired by Spark's. In contrast to Spark, Wayang is not a data processing platform, but a mediator between user applications and data processing platforms. In this sense, Wayang is similar to Apache Drill and Apache Beam. However, Wayang significantly differs from Apache Drill in two main aspects. First, Apache Drill provides only a common interface to query multiple data stores, and hence users have to specify in their query the data to fetch. Apache Drill then translates the query for the processing platforms where the data is stored, e.g., into the MongoDB query representation. In contrast, in Wayang, users only specify the data path and Wayang decides which data processing platforms are the best (performance-wise) to process such data. Second, the query interface in Apache Drill is SQL, whereas Wayang uses an interface based on operators forming DAGs. On this latter point, we are currently developing a Pig Latin-like query language for Wayang. In addition, in contrast to Apache Beam, Wayang not only allows users to use multiple data processing platforms at the same time, but also provides an optimizer to choose the most efficient platform for the task at hand. In Apache Beam, users have to specify an appropriate runner (platform).

Given these similarities with the two Apache projects mentioned above, we are looking forward to collaborating with those communities. We are also open to, and would welcome, collaboration with other Apache communities.

An Excessive Fascination with the Apache Brand

Wayang solves a real problem that users and developers currently have to deal with at a high cost: monetary expense, substantial design and development effort, and a large time investment. Therefore, we believe that Wayang can be successful in building a large community around it. We are convinced that the Apache brand and community process will significantly help us in building such a community and in establishing the project in the long term. We simply believe that the ASF is the right home for Wayang to achieve this.

Documentation

Further details, documentation, and publications related to Wayang can be found at https://docs.rheem.io/rheem/

Initial Source

The current source code of Wayang resides on GitHub:
https://github.com/rheem-ecosystem/rheem

External Dependencies

Wayang depends on the following Apache projects:

  • Maven
  • HDFS
  • Hadoop
  • Spark

Wayang depends on the following other open source projects organized by license:


Dependency                                    Licences
org.json.json                                 JSON (http://json.org/license.html)
SnakeYAML                                     Apache 2.0
Java Unified Expression Language API (Juel)   Apache 2.0
ProfileDB Instrumentation                     Apache 2.0
Gson                                          Apache 2.0
Hadoop                                        Apache 2.0
Scala                                         Apache 2.0
Antlr 4                                       BSD
Jackson                                       Apache 2.0
Junit 5                                       EPL 2.0
Mockito                                       MIT
Assertj                                       Apache 2.0
logback-classic                               EPL 1.0 / LGPL 2.1
slf4j                                         MIT
GNU Trove                                     LGPL 2.1
graphchi                                      Apache 2.0
SQLite JDBC                                   Apache 2.0
PostgreSQL                                    BSD 2-clause
jcommander                                    Apache 2.0
Koloboke Collections API                      Apache 2.0
Snappy Java                                   Apache 2.0
Apache Spark                                  Apache 2.0
HyperSQL Database                             BSD Modified (http://hsqldb.org/web/hsqlLicense.html)
Apache Giraph                                 Apache 2.0
Apache Flink                                  Apache 2.0
Apache Commons IO                             Apache 2.0
Apache Commons Lang                           Apache 2.0
Apache Maven                                  Apache 2.0



Required Resources

Mailing lists

Subversion Directory

Git is the preferred source control system:

git://git.apache.org/repos/asf/incubator/wayang

Issue Tracking

https://issues.apache.org/jira/browse/Wayang

Initial Committers

The following list gives the planned initial committers (in alphabetical order):

Affiliations

  • Scalytics Inc.
    • Bertty Contreras-Rojas
    • Rodrigo Pardo-Meza
    • Alexander Alten-Lorenz
  • Berlin Institute of Technology (TU Berlin)
    • Zoi Kaoudi
    • Haralampos Gavriilidis
    • Jorge-Arnulfo Quiane-Ruiz
  • Hamad Bin Khalifa University (HBKU)
    • Anis Troudi
  • Pontifical Catholic University of Valparaiso, Chile (PUCV)
    • Wenceslao Palma-Muñoz

Sponsors

Champion

  • (cdutz) Christofer Dutz

Nominated Mentors

  • (cdutz) Christofer Dutz
  • (larsgeorge) Lars George
  • (berndf) Bernd Fondermann
  • (jbonofre) Jean-Baptiste Onofré

Sponsoring Entity

The Apache Incubator
