Abstract

Kyuubi is a distributed multi-tenant Thrift JDBC/ODBC server for large-scale data management, processing, and analytics, built on top of Apache Spark and designed to support more engines (e.g., Apache Flink). It has been open-sourced by NetEase since 2018. We are aiming to make Kyuubi an "out-of-the-box" tool for data warehouses and data lakes.

Proposal

Kyuubi provides a pure SQL gateway through a Thrift JDBC/ODBC interface for end-users to manipulate large-scale data with pre-programmed and extendible Spark SQL engines. This "out-of-the-box" model minimizes the barriers and costs for end-users to use Spark at the client-side. At the server-side, the multi-tenant architecture of the Kyuubi server and engines gives administrators a way to achieve computing resource isolation, data security, high availability, high client concurrency, etc.

Background

In typical big data production environments, especially secured ones, all bundled services manage access control lists to restrict access to authorized users. For example, Hadoop YARN divides compute resources into queues. With Queue ACLs, it can identify and control which users/groups can take actions on particular queues. Similarly, HDFS ACLs control access to HDFS files by providing a way to set different permissions for specific users/groups.

Apache Spark is a unified analytics engine for large-scale data processing. It provides a Distributed SQL Engine, a.k.a. the Spark Thrift Server (STS), designed to be seamlessly compatible with HiveServer2 and to deliver even better performance.

HiveServer2 can identify and authenticate a caller, and then if the caller also has permissions for the YARN queue and HDFS files, the request succeeds. Otherwise, it fails. However, on the one hand, STS is a single Spark application, and the user and the queue to which STS belongs are uniquely determined at startup. Consequently, STS cannot leverage cluster managers such as YARN and Kubernetes for resource isolation and sharing, nor can it control access per caller, because a single user runs the whole system. On the other hand, the Thrift Server is coupled to the Spark driver's JVM process. This coupled architecture puts server stability at high risk and, because the server is stateful, makes it unable to handle high client concurrency or to apply high-availability measures such as load balancing.

Kyuubi extends the use of STS to a multi-tenant model based on a unified interface and relies on the concept of multi-tenancy to interact with cluster managers, gaining the ability of resource sharing/isolation and data security. The loosely coupled architecture of the Kyuubi server and engine dramatically improves client concurrency and the stability of the service itself.

You can find more information on Kyuubi at the existing open-source website: https://kyuubi.readthedocs.io/.

Rationale

Pure SQL users migrating from HiveServer2 to Spark SQL for better performance have a strong need for multi-tenancy support to achieve resource isolation and data security, along with client concurrency, service stability, and high availability.

To achieve these goals, Kyuubi introduces the following three foremost aspects:

  • Kyuubi Server
    • The server is a daemon process that handles concurrent connections and query requests and converts these requests into various operations against the Kyuubi engine to complete the responses to clients.
  • Kyuubi Engine
    • The engines, which are pre-programmed and extendible Spark SQL mini servers, handle all queries through Kyuubi servers.
    • There are two basic kinds of engines classified by the share level:
      • A CONNECTION type can be used within a JDBC connection.
      • A USER type can be shared across Kyuubi servers and JDBC connections of a particular user.
  • Service Discovery
    • Server Space: a namespace for Kyuubi servers to register and expose themselves to clients
    • Engine Space: namespaces isolated via tenant for Kyuubi engines to register and expose themselves to all Kyuubi servers.
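
The relationship between share levels and the engine space can be sketched roughly as follows. This is only an illustration: the path layout, the function name engine_space, and the "kyuubi" namespace are assumptions made for the example and are not taken from the Kyuubi codebase.

```python
from enum import Enum


class ShareLevel(Enum):
    CONNECTION = "CONNECTION"  # one engine per JDBC connection
    USER = "USER"              # one engine shared by all connections of a user


def engine_space(namespace: str, level: ShareLevel, user: str, connection_id: str) -> str:
    """Return a hypothetical discovery path under which an engine registers."""
    base = f"/{namespace}_{level.value}/{user}"
    if level is ShareLevel.CONNECTION:
        # A CONNECTION-level engine is private to a single connection.
        return f"{base}/{connection_id}"
    # A USER-level engine can be found by any server that handles this user.
    return base


# Two connections from the same user resolve to the same USER-level engine ...
a = engine_space("kyuubi", ShareLevel.USER, "alice", "conn-1")
b = engine_space("kyuubi", ShareLevel.USER, "alice", "conn-2")
# ... but to different CONNECTION-level engines.
c = engine_space("kyuubi", ShareLevel.CONNECTION, "alice", "conn-1")
d = engine_space("kyuubi", ShareLevel.CONNECTION, "alice", "conn-2")
```

Because the tenant is part of every path, engines that belong to different users never share a namespace in this sketch.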

Kyuubi, STS, and HiveServer2 are identical in terms of interfaces and protocols (see HiveServer2Overview-Protocol). Therefore, from the user's point of view, the way of use is unchanged. Users can keep their existing tools, such as Hive JDBC, Hive Beeline, Hue, and DBeaver, and talk to these services in the same way.
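
Since the protocol is shared with HiveServer2, a client would typically reach Kyuubi with an ordinary Hive JDBC URL. The helpers below are a small sketch of the two common URL shapes, a direct connection and a ZooKeeper-based discovery connection. The host names, the port 10009, and the "kyuubi" namespace are placeholder assumptions; the discovery parameters follow the Hive JDBC driver's URL convention.

```python
from typing import Sequence


def direct_url(host: str, port: int, database: str = "default") -> str:
    """Build a Hive JDBC URL that points at one specific server."""
    return f"jdbc:hive2://{host}:{port}/{database}"


def discovery_url(zk_quorum: Sequence[str], namespace: str, database: str = "default") -> str:
    """Build a Hive JDBC URL that lets the driver pick a server via ZooKeeper."""
    hosts = ",".join(zk_quorum)
    return (
        f"jdbc:hive2://{hosts}/{database}"
        f";serviceDiscoveryMode=zooKeeper;zooKeeperNamespace={namespace}"
    )


# Placeholder endpoints, for illustration only.
print(direct_url("kyuubi.example.com", 10009))
print(discovery_url(["zk1:2181", "zk2:2181"], "kyuubi"))
```

Either string could then be handed to an existing client such as Beeline, for example through its -u option.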

Kyuubi applies the multi-tenant feature based on the concept of Kyuubi engines, which are pre-programmed and extendible Spark applications.

Kyuubi isolates engines according to the tenants in the whole system. The tenant, a.k.a. user, is unified and end-to-end unique through a JDBC connection. The Kyuubi server will identify and authenticate the user and then retrieve from the engine space or create an engine belonging to this user. This user will also be used as the submitter for the engine, and it must have authority to use the resources from YARN, Kubernetes, or just the local machine, etc. Inside an engine, the engine's user, a.k.a. Spark User, will also be the same. When an engine runs queries received from the JDBC connection, the engine's user must also have rights to access the metadata and data.
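
The authenticate-then-retrieve-or-create flow described above might look roughly like the following toy model. All names here (Engine, Server, open_session) are hypothetical, the password table stands in for a real authentication backend, and a plain dictionary stands in for the engine space; nothing below reflects the actual Kyuubi implementation.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Engine:
    owner: str  # the user the engine was submitted as (the "Spark user")


class Server:
    def __init__(self, credentials: Dict[str, str], engine_space: Dict[str, Engine]):
        self.credentials = credentials    # user -> password, stand-in for real auth
        self.engine_space = engine_space  # shared registry, stand-in for discovery

    def open_session(self, user: str, password: str) -> Engine:
        # Step 1: identify and authenticate the caller.
        if self.credentials.get(user) != password:
            raise PermissionError(f"authentication failed for {user}")
        # Step 2: reuse the user's engine if one is already registered.
        engine = self.engine_space.get(user)
        if engine is None:
            # Step 3: otherwise launch one, submitted as that same user.
            engine = Engine(owner=user)
            self.engine_space[user] = engine
        return engine


credentials = {"alice": "a-secret", "bob": "b-secret"}
shared_space: Dict[str, Engine] = {}
server_a = Server(credentials, shared_space)
server_b = Server(credentials, shared_space)

e1 = server_a.open_session("alice", "a-secret")
e2 = server_b.open_session("alice", "a-secret")  # same engine via another server
e3 = server_a.open_session("bob", "b-secret")    # a separate engine for bob
```

In this model the engine owner always equals the session user, which is what would let cluster-manager and storage ACLs apply per tenant.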

Initial Goals

  • Move existing codebase, website, documentation, and mailing lists to Apache-hosted infrastructure.
  • Integrate with the Apache development process and infrastructure, and move our code review, build, and testing workflows into the context of the ASF
  • Grow and diversify the Kyuubi community

Current Status

The Kyuubi project was started at NetEase and maintained as a sub-module of our in-house Spark project (since 2.0.0). At that point, we needed a Spark SQL service to migrate thousands of Hive QL jobs from HiveServer2 for better performance. Meanwhile, we didn't want to lose features like fine-grained permission control, queue resource isolation, high availability, etc. We separated it into an independent project and open-sourced it under Apache License 2.0 in Dec 2017.

Meritocracy:

This proposal intends to start building a diverse developer and user community around Kyuubi following the ASF meritocracy model. Since Kyuubi was open-sourced, many enterprises have adopted Kyuubi to build up their multi-tenant Spark SQL services to replace HiveServer2. In return, we have received many issue reports and enhancements from them. Because Kyuubi is maintained under the NetEase account on GitHub and closely associated with the ASF's various big data projects, we've been asked many times by our users if the ASF could incubate it. The codebase is now mainly managed by a group of developers from NetEase, eBay, etc., and we also accept individual developers as core developers of Kyuubi. We will also try our best to encourage an environment that supports a meritocracy.

Community:

Kyuubi has been building a community around contributors and users to this framework for the last three years. And we believe that we can get a lot of help from the Apache Spark community too.

Core Developers:

  • Kent Yao. He is the founder of this project, Apache Spark/Submarine Committer, from NetEase Corp.
  • Ulysses You. He is an Apache Spark Contributor, from NetEase Corp.
  • Cheng Pan. He is the ClickHouse-Native-JDBC maintainer, an Apache Iceberg/Spark Contributor, and an individual open-source enthusiast.
  • Fei Wang. He is an Apache Spark Contributor, from eBay Inc.

Alignment:

Kyuubi is built upon Apache Spark and many other Apache projects such as Apache Hive, ZooKeeper, Hadoop, YARN, etc. The codebase of Kyuubi is already under Apache License Version 2.0. Meanwhile, our current core developers all have experience contributing to various Apache projects. These community connections help us focus on development practices that emphasize community engagement and align us naturally with the ASF path to meritocratic recognition.

Known Risks

Project Name

The project took its name, Kyuubi, from a character in a popular Japanese manga, Naruto. It is a nine-tailed fox spirit in Chinese and Japanese mythology. Kyuubi spreads the energy of fire, used here to symbolize the capability of the project. Meanwhile, its nine tails stand for the end-to-end multi-tenancy support of this project.

Based on our search results, the term Kyuubi is used as a trademark only under Class 5, Class 7, and Class 17, so it is perfectly legal to use it as our project name.

Orphaned products

There is some risk of the Kyuubi project being abandoned, as the current community is very young and small. We need to highlight this issue and reduce the risk as soon as possible during the Apache Incubation. Many organizations are using Kyuubi to build critical big data pipelines, and we can encourage them to help grow Kyuubi's community if it becomes an ASF project.

Inexperience with Open Source:

Many of the Kyuubi committers have experience working on open source projects. They are also active committers and contributors to other Apache projects.

Homogeneous Developers:

The current contributors work across various organizations, including NetEase, eBay, etc. We are committed to recruiting additional committers based on their contributions to the project.

Reliance on Salaried Developers

Salaried engineers from NetEase, eBay, etc. have contributed to the Kyuubi project to date, both on their salaried time and on volunteer time. They are all passionate about the project, and we are committed to recruiting additional committers, including non-salaried developers, and aim to diversify the Kyuubi user and contributor base further.

Relationships with Other Apache Products:

Kyuubi is currently closely integrated with Apache Spark, ZooKeeper, Curator, Hive, Thrift, and Commons in numerous ways.

Kyuubi inherits Hive's Hive Service RPC module to reuse the Thrift API to build the RPC environments between clients and Kyuubi servers, and between Kyuubi servers and engines internally. Clients can use the existing Hive JDBC/Beeline to talk to Kyuubi in the same way as Spark ThriftServer and HiveServer2. Kyuubi uses ZooKeeper and Curator to build a service registration and discovery mechanism for internal and external components. Kyuubi engines are pre-programmed Spark SQL applications that can fully support the pure SQL usages in Spark. They can run on any cluster manager, such as Kubernetes, Hadoop YARN, or Spark Standalone. In the future, the engines' pluggable design can support more Apache projects, such as Apache Flink.

An Excessive Fascination with the Apache Brand

The primary motivation for submitting Kyuubi to the ASF is to build a diverse and strong community and to gain stability for long-term development. We also wish to encourage diverse organizations to adopt Kyuubi and contribute to Kyuubi without any concerns about ownership or licensing.

Documentation

Since Kyuubi 1.0.0, the Kyuubi online documentation has been hosted on https://readthedocs.org/.

You can find the specific version of Kyuubi documentation listed below.

For 0.8 and earlier versions, on GitHub Pages.

Initial Source

The initial source code for Kyuubi is hosted at https://github.com/NetEase/kyuubi. 

Initial Source and Intellectual Property Submission Plan

As soon as Kyuubi is approved to join the Apache Incubator, our initial committers will submit iCLA(s), SGA, and CCLA(s). The codebase is already licensed under the Apache License 2.0.

External Dependencies

Apache License 2.0

  • commons-codec:commons-codec
  • org.apache.commons:commons-lang3
  • org.apache.curator:curator-client
  • org.apache.curator:curator-framework
  • org.apache.curator:curator-recipes
  • org.apache.curator:curator-test
  • com.google.guava:failureaccess
  • com.google.guava:guava
  • org.apache.hadoop:hadoop-client-api
  • org.apache.hadoop:hadoop-client-runtime
  • org.apache.hive:hive-service-rpc
  • org.apache.htrace:htrace-core4
  • com.fasterxml.jackson.core:jackson-annotations
  • com.fasterxml.jackson.core:jackson-core
  • com.fasterxml.jackson.core:jackson-databind
  • org.javassist:javassist
  • org.apache.thrift:libfb303
  • org.apache.thrift:libthrift
  • log4j:log4j
  • io.dropwizard.metrics:metrics-core
  • io.dropwizard.metrics:metrics-jmx
  • io.dropwizard.metrics:metrics-json
  • io.dropwizard.metrics:metrics-jvm
  • org.apache.zookeeper:zookeeper
  • spark-*-bin-*.tgz

BSD 3-Clause

  • org.scala-lang:scala-library

MIT License

  • org.slf4j:slf4j-api
  • org.slf4j:slf4j-log4j12
  • org.slf4j:jcl-over-slf4j

Required Resources

Mailing lists

Git Repositories:

Issue Tracking

We request the creation of an Apache-hosted JIRA.

Jira ID: KYUUBI

Initial Committers

Sponsors

Champion

Nominated Mentors

Sponsoring Entity

We expect that the Apache Incubator will sponsor this project.
