XTable Proposal

Abstract

XTable is an omni-directional converter for table formats that facilitates interoperability across data processing systems and query engines.

Currently, XTable supports widely adopted open-source table formats such as Apache Hudi, Apache Iceberg, and Delta Lake.

Proposal

XTable seeks to create an open-source and universal converter across table formats for the data lake. Its purpose is to convert between these formats transparently and continuously, enabling seamless interoperability across engines.

The implementation exploits the two-layer storage design common to all these table formats: a data layer and a metadata layer. The data layer across these formats usually relies on a widely adopted file format such as Apache Parquet or Apache ORC. Thus, format interoperability is achieved by replicating only the metadata, which is small compared to the data.
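
To make the metadata-only idea concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not XTable's actual model or API: the names DataFile, TableSnapshot, and to_target_metadata are hypothetical, and the dictionaries it produces are toy stand-ins for real Iceberg or Delta Lake metadata. The point it shows is that each target's metadata only references the existing data files, which are never read or rewritten.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class DataFile:
    """Reference to an existing data file (hypothetical type)."""
    path: str          # location of a Parquet/ORC file that already exists
    record_count: int  # row count taken from the source metadata


@dataclass(frozen=True)
class TableSnapshot:
    """Hypothetical format-neutral view of a table at one point in time."""
    schema: Dict[str, str]  # column name -> type name
    files: List[DataFile]


def to_target_metadata(snapshot: TableSnapshot, target_format: str) -> dict:
    """Build a toy metadata document for the target format.

    Only file references are emitted; no data file is opened or copied.
    """
    return {
        "format": target_format,
        "schema": dict(snapshot.schema),
        "files": [
            {"path": f.path, "records": f.record_count} for f in snapshot.files
        ],
    }


# One snapshot of the source table, described purely by metadata.
snapshot = TableSnapshot(
    schema={"id": "long", "name": "string"},
    files=[
        DataFile("s3://bucket/table/part-0.parquet", 100),
        DataFile("s3://bucket/table/part-1.parquet", 250),
    ],
)

# Two target "formats" generated from the same snapshot.
iceberg_like = to_target_metadata(snapshot, "ICEBERG")
delta_like = to_target_metadata(snapshot, "DELTA")
```

Both results point at the same two Parquet paths, so the cost of exposing the table in an additional format is proportional to the size of the metadata rather than the size of the data.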

Although XTable has an internal model to represent tables, XTable is NOT a new table format. As mentioned above, its focus is on the omni-directional conversion across formats.

Another important aspect of XTable is extensibility. XTable supports Delta Lake, Apache Hudi, and Apache Iceberg. However, we suspect new table formats will continue emerging, e.g., Apache Paimon (incubating). Supporting a new format simply involves implementing a few interfaces, which we believe will facilitate the expansion of supported source and target formats in the future.
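
The extensibility claim can be pictured with a small sketch in Python. The names here (SourceReader, TargetWriter, InMemorySource, ToyTarget, sync) are hypothetical and are not XTable's real interfaces; the sketch only assumes that a new format plugs in by implementing a reader for the source side and a writer for the target side.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class SourceReader(ABC):
    """Hypothetical interface: lists the data files of a source table."""

    @abstractmethod
    def read_files(self, table_path: str) -> List[str]:
        ...


class TargetWriter(ABC):
    """Hypothetical interface: emits target-format metadata for those files."""

    @abstractmethod
    def write_metadata(self, table_path: str, files: List[str]) -> Dict[str, object]:
        ...


class InMemorySource(SourceReader):
    """Toy source backed by a dict instead of real table metadata."""

    def __init__(self, tables: Dict[str, List[str]]):
        self._tables = tables

    def read_files(self, table_path: str) -> List[str]:
        return list(self._tables[table_path])


class ToyTarget(TargetWriter):
    """Toy target that returns its metadata instead of writing it to storage."""

    def __init__(self, name: str):
        self.name = name

    def write_metadata(self, table_path: str, files: List[str]) -> Dict[str, object]:
        return {"format": self.name, "table": table_path, "files": list(files)}


def sync(source: SourceReader, targets: List[TargetWriter], table_path: str) -> List[Dict[str, object]]:
    """Read the source once, then produce metadata for every target."""
    files = source.read_files(table_path)
    return [target.write_metadata(table_path, files) for target in targets]


source = InMemorySource({"/lake/orders": ["part-0.parquet", "part-1.parquet"]})
results = sync(source, [ToyTarget("ICEBERG"), ToyTarget("DELTA")], "/lake/orders")
```

Under this assumption, supporting an additional format means adding one more SourceReader or TargetWriter implementation; the sync function itself does not change.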

Rationale

Table formats are becoming increasingly popular for providing consistency and isolation guarantees, as well as performance enhancements, in data lakes. Various table formats have distinct advantages, and different engines and platforms have adopted one or more of them.

Not all data processing systems or platforms have standardized on the same table format. This diversity in table formats can lead to limitations when users need to consume data generated by one engine by using a different engine, resulting in a "lock-in" effect.

Furthermore, as these table formats may be tailored to specific use cases (e.g., slow-changing tables, fast streaming ingestion), we anticipate the emergence of new table formats. One such example is Apache Paimon (incubating), a project recently accepted for incubation at the ASF.

For these reasons, we believe it makes sense to develop an independent converter that can work across different table formats. There are technical challenges to address, and the project will need a strong focus on testing and ongoing adaptation as these formats evolve. Still, a converter is feasible because these formats differ primarily in how they represent metadata rather than data, and data would be far more costly to convert.

Initial Goals

  • Rename the existing codebase from OneTable to XTable.
  • Move the codebase, website, documentation, and mailing lists to an Apache-hosted infrastructure.
  • Integrate with the Apache development process.
  • Ensure all dependencies are compliant with Apache License version 2.0.
  • Incrementally develop and release per Apache guidelines.

Current Status

XTable was recently open sourced under the Apache License, Version 2.0. The source code is currently hosted on GitHub (https://github.com/onetable-io/onetable), and this repository will seed the Apache git repository after being renamed to XTable.

Meritocracy:

We are fully committed to open, transparent, and merit-based interactions with our community. Our reason for entering the incubation process is to follow Apache's best practices that emphasize meritocracy. Several individuals from different organizations have shown interest in this project, and we intend to invite more developers to participate. We'll actively support and oversee community participation, ensuring that privileges are given to those who contribute meaningfully.

Community:

We see a demand for an open-source omni-directional table format converter, and there is an opportunity to engage a broad and diverse community. This potential arises from the existence of other open-source table formats (such as Delta Lake, Apache Hudi, Apache Iceberg, Apache Paimon (incubating)) and the thriving communities associated with engines that rely on these formats.

One important attribute for XTable's success is to ensure it is equally effective across all table formats without creating distinctions between them. We recognize that in-depth knowledge about each format primarily resides within the respective communities that develop them. Thus, as part of our incubation discussions, we are reaching out to the developer mailing lists of different table format communities, i.e., Apache Hudi, Apache Iceberg, Delta Lake, and Apache Paimon (incubating), inviting them to share their feedback on the proposal and participate in XTable's community.

Core Developers:

XTable was initially developed at Onehouse and is under active development. The initial list of committers includes individuals with extensive experience in the Apache ecosystem, including ASF members, PMC members, and Committers from projects such as Apache Hadoop, Apache Calcite, Apache Hive, Apache ORC, Apache Geode, and Apache Heron (incubating).

Alignment:

XTable’s evolution and ongoing maintenance necessitate close collaboration with communities built around the various table formats and engines. Most of these associated projects (such as Apache Iceberg, Apache Hudi, Apache Parquet, Apache Avro, and Apache ORC) are already under the ASF.

Known Risks

Project Name

We have checked and believe that XTable is an appropriate name. We searched for XTable in the USPTO database and did not find an identical name.

xtable is a package/function in R to export tables to LaTeX or HTML.

(The initial proposal was to keep the project's name as OneTable. However, following discussions in the incubator mailing list, we decided to switch to XTable to avoid any significant associations and potential confusion with specific corporations/products in this space, particularly those employing some of the initial committers.)

Orphaned Products

The risk of orphaned products is relatively low. XTable's development began with Onehouse, and several companies have expressed interest in incorporating it into their platforms. Furthermore, the project includes contributors with significant open-source experience. Developers from both the community and these companies are dedicated to advancing XTable. We are actively managing the project and will work on expanding the community's engagement to welcome additional contributors.

Inexperience with Open Source:

The initial committers include veteran Apache members (Committers, PMC members and ASF Members) and other developers who have varying degrees of experience with open-source projects. All have been involved with source code that has been released under an open-source license, and several also have experience developing code with an open-source development process.

Length of Incubation:

We expect that XTable can graduate from the incubator in 2 years or less.

Homogenous Developers:

The initial committers are employed by various companies, such as Onehouse, Microsoft, Google, Walmart, Adobe, Cloudera, and Dremio. By becoming part of the Apache Incubator, we aim to connect with like-minded individuals who share our enthusiasm for promoting the Apache way. We are dedicated to welcoming new committers from different organizations based on their contributions to the project.

Reliance on Salaried Developers:

We expect XTable development to occur on both salaried and volunteer time. Most of the initial committers are paid by their employers to contribute to this project. However, they are all passionate about the project, and we are both confident and hopeful that, by building a community around the project, development will continue even if no salaried developers contribute to it.

Relationships with Other Apache Products:

XTable is deeply integrated with other Apache projects. XTable provides source and target connectors for Apache Hudi and Apache Iceberg. XTable has a dependency on Apache Spark for the integration with the Delta Lake table format. XTable also relies on Apache Hadoop for its connectors to cloud and other types of storage systems.

Lastly, XTable integrates with Apache Log4j for logging and relies on Apache Commons libraries (CLI, Lang) to provide some of its functionality.

An Excessive Fascination with the Apache Brand:

While we expect the Apache brand will increase XTable's visibility and potentially attract more contributors, our decision to start this project is based on the factors mentioned in the Rationale section.

We believe XTable will benefit from collaboration and building a diverse community of developers and committers. We also hope that it will be embraced by other Apache communities, such as Apache Hudi, Apache Iceberg, and Apache Paimon (incubating).

Documentation

Information on XTable can be found at https://github.com/onetable-io/onetable/blob/main/README.md.

Initial Source

The initial source code for XTable is hosted at https://github.com/onetable-io/onetable under the Apache License, version 2.0.

Source and Intellectual Property Submission Plan

All code currently hosted in the GitHub repository will be contributed.

External Dependencies:

Apache License 2.0

BSD 3-clause

Eclipse Distribution License 2.0

MIT License

Cryptography:

XTable does not currently include any cryptography-related code.

Required Resources

Mailing lists:

  • private@xtable.incubator.apache.org
  • dev@xtable.incubator.apache.org
  • commits@xtable.incubator.apache.org

Git Repositories:

Upon entering incubation, we want to move the existing repo to the Apache Software Foundation:

https://github.com/onetable-io/onetable -> https://github.com/apache/incubator-xtable

Issue Tracking:

The community would like to continue using GitHub Issues.

Initial Committers

  • Tim Brown (tim@onehouse.ai) - Onehouse
  • Vamshi Gudavarthi (vamshi@onehouse.ai) - Onehouse
  • Vinish Reddy Pannala (vinishreddypannala@onehouse.ai) - Onehouse
  • Vinoth Chandar (vinoth@apache.org) - Onehouse
  • Ashvin Agrawal (ashvin@apache.org) - Microsoft
  • Jesus Camacho Rodriguez (jcamacho@apache.org) - Microsoft
  • Anoop Johnson (anoopkj@google.com) - Google
  • Baljinder Singh (Baljinder.Singh1@walmart.com) - Walmart
  • Hitesh Shah (hitesh@apache.org) - Adobe
  • Stamatis Zampetakis (zabetak@apache.org) - Cloudera
  • Jean-Baptiste Onofré (jbonofre@apache.org) - Dremio

Sponsors

Champion:

Jesus Camacho Rodriguez (jcamacho@apache.org)

Nominated Mentors:

  • Jesus Camacho Rodriguez (jcamacho@apache.org)
  • Hitesh Shah (hitesh@apache.org)
  • Stamatis Zampetakis (zabetak@apache.org)
  • Jean-Baptiste Onofré (jbonofre@apache.org)

Sponsoring Entity:

The Incubator.

6 Comments

  1. It would be great if the initial committers could include active members of the other communities involved in table formats. For example, maybe Ryan Blue, Eduard Tudenhoefner, Tathagata Das, Venki Korukanti, and Jean-Baptiste Onofré could be approached for inclusion. It seems much likelier that this project would be successful with individuals involved in all projects as opposed to only those involved in Hudi.

    1. I agree with Jacques Nadeau 

      I would be more than happy to be mentor and/or champion on this proposal. Please let me know if it makes sense to you.

    2. Thanks Jacques Nadeau. I align with the overall sentiment, as I believe is highlighted in the proposal draft as well. The goal is to foster a neutral, inclusive community that grows over time, which is why positioning OneTable as an independent project makes sense. Just to note, as mentioned in the draft, the project is already open sourced under the Apache license (within a separate OneTable GitHub organization) and anyone interested is encouraged to contribute.

      Jean-Baptiste Onofré, I've included you as a mentor for the project, happy to have you on board. Looking forward to hearing from others.

      (FYI I plan to send the email to the incubator mailing list for discussion of the proposal later this week.)

  2. Regarding the naming discussion, maybe it would be a good idea to start the name search process ASAP to avoid having to rename the project later on. I didn't see an entry for XTable in the Jira search, so I suppose it has not started yet. I guess it is fine to do it after the VOTE, but we should not defer it too far into the future.

    1. Thanks Stamatis. Based on the guidelines described here, I was going to run the name search process after the project is accepted for incubation, but before we start to request resources.