Abstract

Amoro is a Lakehouse management system built on open data lake formats like Apache Iceberg and Apache Paimon (Incubating). Working with compute engines including Apache Flink, Apache Spark, and Trino, Amoro brings pluggable, self-managed features to the Lakehouse to provide an out-of-the-box data warehouse experience, and helps data platforms and products easily build infra-decoupled, stream-and-batch-fused, and lake-native architectures.

Proposal

Amoro is a unified management system providing self-optimizing features to data lake tables with different table formats. It continuously monitors metadata changes in the tables, automatically triggers optimizing tasks to improve query performance, and automatically triggers maintenance operations such as snapshot expiration and file cleaning based on user configurations. Users do not need to maintain individual optimizing tasks; a simple configuration is enough to manage data lake tables. This "out-of-the-box" usage greatly lowers the barrier to using data lake tables and provides data self-optimizing features for scenarios such as streaming lake warehouses and cloud-native data warehouses. Amoro provides high availability, resource isolation, and elastic resource management for optimizing tasks, ensuring the availability of management services and reducing resource costs.

Background

Data lake table formats like Apache Iceberg, Apache Hudi, and Apache Paimon (Incubating) bring many new features to data lake storage, such as ACID transactions, real-time updates, and time travel, enabling richer application scenarios on data lakes. However, introducing these new features in scenarios such as streaming lake warehouses and cloud-native data warehouses is not easy. Users need to run additional maintenance tasks, such as compacting small files and cleaning up expired files, while reading and writing these tables. Failing to run maintenance tasks in a timely manner can cause a sharp drop in table read performance, while compacting files too frequently can result in huge resource costs.

Since 2019, NetEase Corp has been using Apache Iceberg extensively across multiple business lines and has developed the Arctic Lakehouse management system to provide self-optimizing features that greatly enhance the user experience of Apache Iceberg. 

NetEase Corp open-sourced Arctic in 2022 and renamed it to Amoro in 2023. In the year since it was open-sourced, Amoro has integrated more table formats, expanded its management functions, and been adopted by more than ten other companies.

Rationale

Amoro aims to automatically optimize tables in a timely manner using a reasonable amount of resources. To achieve this goal, Amoro is split into the following components:

  • Amoro Management Service (AMS): A centralized management service that continuously monitors data changes in all tables. When optimizing is needed, it schedules optimizing tasks on the tables and hands them over to the Optimizers for execution. It is also responsible for managing all Optimizers and automatically scaling and merging optimizer resources.
  • Optimizer: An execution node for optimizing tasks. It receives optimizing tasks from the AMS, executes them, and reports the execution results to AMS upon completion. It is stateless and can be deployed on a large scale. It provides multiple implementations based on Apache Flink and Apache Spark.
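
The division of labor between the two components can be illustrated with a small Python sketch. It is not Amoro's implementation: the class and method names (ManagementService, Optimizer, plan, poll, report) are hypothetical, the pull-based hand-off is an assumption, and the "compaction" is a placeholder string. It only shows a central service that queues tasks and a stateless worker that executes them and reports back.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict, List, Optional


@dataclass
class OptimizingTask:
    table: str
    files: List[str]


@dataclass
class ManagementService:
    """Hypothetical stand-in for AMS: plans tasks and collects results."""
    pending: Deque[OptimizingTask] = field(default_factory=deque)
    results: Dict[str, str] = field(default_factory=dict)

    def plan(self, table: str, small_files: List[str]) -> None:
        # Schedule a task only when the table actually needs optimizing.
        if small_files:
            self.pending.append(OptimizingTask(table, small_files))

    def poll(self) -> Optional[OptimizingTask]:
        # Assumption: workers pull tasks; None means nothing is queued.
        return self.pending.popleft() if self.pending else None

    def report(self, task: OptimizingTask, output_file: str) -> None:
        self.results[task.table] = output_file


class Optimizer:
    """Stateless worker: everything it needs arrives with the task."""

    def run_once(self, ams: ManagementService) -> bool:
        task = ams.poll()
        if task is None:
            return False
        # Placeholder for the real compaction work.
        output_file = f"{task.table}/compacted-{len(task.files)}-files"
        ams.report(task, output_file)
        return True


def demo() -> Dict[str, str]:
    ams = ManagementService()
    ams.plan("db.orders", ["f1", "f2", "f3"])
    ams.plan("db.users", [])  # nothing to compact, so no task is queued
    worker = Optimizer()
    while worker.run_once(ams):
        pass
    return ams.results
```

Because the worker keeps no state between calls, any number of Optimizer instances could drain the same queue, which is the property that makes large-scale deployment straightforward.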


Considering the cost and benefit of optimizing tasks in the case of real-time writes on the table, Amoro also divides file compaction tasks on the table into multiple levels:

  • Minor Optimizing: Compact fragment files smaller than 16MB, which may trigger very frequently, possibly every few minutes.
  • Major Optimizing: Compact files to the target size and clean up redundant files such as Apache Iceberg delete files. It may trigger frequently, possibly every hour.
  • Full Optimizing: Globally organize files in the table and complete data sorting. It triggers infrequently, usually once a day.
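
As a rough illustration of how the three levels might be selected, the Python sketch below picks a level from the time elapsed since each level last ran. Only the 16MB fragment threshold comes from the description above. The concrete intervals (5 minutes, 1 hour, 1 day), the function names, and the rule that a minor pass needs at least two fragment files are assumptions made for the example, not Amoro's actual trigger logic.

```python
from typing import List, Optional

MB = 1024 * 1024
FRAGMENT_THRESHOLD = 16 * MB  # from the text: fragment files are smaller than 16MB

# Illustrative intervals only (assumed values for "every few minutes",
# "every hour", and "once a day").
MINOR_INTERVAL_S = 5 * 60
MAJOR_INTERVAL_S = 60 * 60
FULL_INTERVAL_S = 24 * 60 * 60


def fragment_files(file_sizes: List[int]) -> List[int]:
    """Return the sizes of files small enough to count as fragments."""
    return [size for size in file_sizes if size < FRAGMENT_THRESHOLD]


def choose_level(
    file_sizes: List[int],
    since_minor_s: int,
    since_major_s: int,
    since_full_s: int,
) -> Optional[str]:
    """Pick the optimizing level that is due, or None if nothing is due."""
    # Check the most expensive level first so a due full pass
    # supersedes the cheaper ones.
    if since_full_s >= FULL_INTERVAL_S:
        return "full"
    if since_major_s >= MAJOR_INTERVAL_S:
        return "major"
    # Assumption: compacting needs at least two fragment files.
    if since_minor_s >= MINOR_INTERVAL_S and len(fragment_files(file_sizes)) > 1:
        return "minor"
    return None
```

For example, a table holding files of 4MB, 8MB, and 200MB whose last minor pass ran ten minutes ago would be assigned a minor pass, since two of its files fall under the fragment threshold.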

Initial Goals

  • Move existing codebase, website, and documentation to Apache-hosted infrastructure.
  • Integrate with the Apache development process and infrastructure, and move our code review, build, and testing workflows into the context of the ASF.
  • Grow and diversify the Amoro community.
  • Integrate with other ASF projects.

Current Status

Meritocracy:

Amoro was started at NetEase in 2019 with the project name "Arctic" and open-sourced in 2022. Since then, Amoro has gained strong interest from numerous companies and individuals. Roadmaps, issues, and design docs are accessible to everyone and discussed openly among developers. Amoro already has active contributors from different organizations.

We value meritocracy and we understand that it is the basis for an open community that encourages multiple companies and individuals to contribute and be invested in the project’s future. We will try our best to build an environment that supports meritocracy. We believe the community and project will grow better if we run in the Apache Way.

Community:

Amoro has built an open-source community with 62 developers and released 8 versions in the past year.

Core Developers:

  • Jinsong Zhou. He is the founder of this project, from NetEase (GitHub ID: zhoujinsong)
  • Nathan Ma. He is the chief architect as well as a developer of the project, from NetEase (GitHub ID: majin1102)
  • Silei Jin. He is a developer of the project, from DtDream (GitHub ID: Hellojinsilei)
  • Xu Bai. He is a developer of the project, from Cisco Webex (GitHub ID: XBaith)
  • Qishang Zhong. He is a developer of the project, from Qichacha (GitHub ID: zhongqishang)
  • Yuanfeng Hu. He is a developer of the project, from Huya (GitHub ID: huyuanfeng2018)
  • ZhenYu Chen. He is a developer of the project, from HuoLaLa (GitHub ID: cyz006)
  • Gang Huang. He is a developer of the project, from Dmall (GitHub ID: tcodehuber)
  • Tao Wang. He is a developer of the project, from NetEase (GitHub ID: wangtaohz)
  • Yongxiang Zhang. He is a developer of the project, from NetEase (GitHub ID: baiyangtx)
  • Zeyu Wang. He is a developer of the project, from NetEase (GitHub ID: hameizi)
  • Ting Lu. He is a developer of the project, from NetEase (GitHub ID: hzluting)
  • Jianmin Huang. He is a developer of the project, from NetEase (GitHub ID: HuangFru)
  • Dayang Shi. He is a developer of the project, from NetEase (GitHub ID: shidayang)
  • Xianxun Ye. He is a developer of the project, from NetEase (GitHub ID: YesOrNo828)

Alignment:

Amoro is built on Apache Iceberg, Apache Paimon (Incubating), and many other Apache projects such as Apache Flink, Apache Spark, and Apache Hive. The codebase of Amoro is already under Apache License Version 2.0. Meanwhile, our current core developers all have experience contributing to various Apache projects. These community connections help us focus on development practices that emphasize community engagement, which naturally aligns us with the ASF path to meritocratic recognition.

Known Risks

Project Name

Amoro is a self-created word inspired by the word "Love". We have checked and believe the name Amoro is suitable. Our trademark research found no other projects using this name.

Orphaned Products

More than ten users have already deployed Amoro in production environments. The developers maintain a healthy community, and the risk of the project being abandoned is minimal. We are actively growing the community and will continue to increase its vitality to attract more contributors.

Inexperience with Open Source:

The Amoro project has been managed in a completely open-source manner for one year, and the developers in the community also have experience, although not in-depth, contributing to other Apache projects like Apache Iceberg, Apache Paimon (Incubating), and Apache Flink. We will continue to learn how to engage in open-source work by working with our mentors and the Apache community.

Length of Incubation:

We expect to enter incubation within two months and graduate in about one to two years.

Homogeneous Developers:

The contributors are from various organizations, including NetEase, Cisco, Qichacha, DtDream, etc. At this stage, we admit that the Amoro community could do with more diversity. We need to pay more attention to creating a more diverse community by nominating committers based on their contributions to the project.

Reliance on Salaried Developers:

Most of the developers are paid by their employers to contribute to this project. These developers come from several companies that have already used Amoro in their production environment, so we are very confident that they will continue to contribute to the Amoro community.

Relationships with Other Apache Products:

We have integrated with Apache Iceberg and Apache Paimon (Incubating). We plan to have better integration with other projects in the Apache ecosystem (mainly big data projects).

An Excessive Fascination with the Apache Brand:

The primary motivation for submitting Amoro to the ASF is to build a diverse and strong community and to gain stability for long-term development. We also wish to encourage diverse organizations to adopt Amoro and contribute to Amoro without any concerns about ownership or licensing.

Documentation

Documentation can be found at https://amoro.netease.com/docs/latest/.

Initial Source

The initial source code for Amoro is hosted at https://github.com/NetEase/amoro

Source and Intellectual Property Submission Plan

As soon as Amoro is approved to join the Apache Incubator, our initial committers will submit iCLAs and NetEase will sign the SGA. We will ask the top 30 contributors to sign iCLAs for IP clearance. The codebase is already licensed under the Apache License 2.0.

External Dependencies:

ASF projects

  • Apache Iceberg
  • Apache Paimon
  • Apache Spark
  • Apache Flink
  • Apache Hive

Apache License 2.0

  • cglib:cglib
  • com.alibaba:fastjson
  • com.fasterxml.jackson.core:jackson-core
  • com.fasterxml.jackson.core:jackson-databind
  • com.github.ben-manes.caffeine:caffeine
  • com.google.code.gson:gson
  • com.google.inject:guice
  • io.airlift:bootstrap
  • io.airlift:concurrent
  • io.airlift:configuration
  • io.airlift:event
  • io.airlift:json
  • io.airlift:log
  • io.airlift:units
  • io.dropwizard.metrics:metrics-core
  • io.javalin:javalin
  • io.netty:netty-all
  • io.trino:trino-memory-context
  • io.trino:trino-plugin-toolkit
  • javax.inject:inject
  • org.apache.commons:commons-dbcp2
  • org.apache.commons:commons-lang3
  • org.apache.commons:commons-pool2
  • org.apache.curator:curator-framework
  • org.apache.curator:curator-recipes
  • org.apache.derby:derby
  • org.apache.hadoop:hadoop-client
  • org.apache.kyuubi:kyuubi-hive-jdbc-shaded
  • org.apache.logging.log4j:log4j-1.2-api
  • org.apache.logging.log4j:log4j-api
  • org.apache.logging.log4j:log4j-core
  • org.apache.logging.log4j:log4j-slf4j-impl
  • org.apache.lucene:lucene-core
  • org.apache.orc:orc-core
  • org.apache.parquet:parquet-avro
  • org.apache.parquet:parquet-hadoop
  • org.apache.pulsar:pulsar-client-all
  • org.apache.thrift:libthrift
  • org.apache.zookeeper:zookeeper
  • org.mybatis:mybatis
  • org.rocksdb:rocksdbjni
  • org.yaml:snakeyaml
  • org.roaringbitmap:RoaringBitmap
  • software.amazon.awssdk:dynamodb
  • software.amazon.awssdk:glue
  • software.amazon.awssdk:kms
  • software.amazon.awssdk:s3
  • software.amazon.awssdk:sts
  • software.amazon.awssdk:url-connection-client
  • org.scala-lang:scala-library
  • org.scala-lang:scala-compiler


MIT License

  • org.slf4j:slf4j-api
  • args4j:args4j


BSD 2-clause

  • org.postgresql:postgresql


BSD

  • com.esotericsoftware.kryo:kryo


Cryptography:

Amoro does not currently include any cryptography-related code.

Required Resources

Mailing lists:

Git Repositories:

Issue Tracking:

The community would like to continue using GitHub Issues.

Other Resources:

The community has already chosen GitHub Actions as its continuous integration tool.

Initial Committers

@zhoujinsong initiated a discussion in the Amoro community titled "Who is willing to act as the initial committer?" (https://github.com/NetEase/amoro/discussions/2104). So far, fifteen contributors with notable contributions have expressed interest.

Sponsors

Champion:

Justin Mclean (jmclean@apache.org)

Nominated Mentors:

Justin Mclean (jmclean@apache.org)

Zhongyi Tan (jerrytan@apache.org)

Yu Li (liyu@apache.org)

Xinyu Zhou (yukon@apache.org)

Kent Yao (yao@apache.org)

Sponsoring Entity:

We expect the Apache Incubator to sponsor this project.
