Abstract

Hop is short for the Hop Orchestration Platform. Written completely in Java, it aims to provide a wide range of data orchestration tools, including a visual development environment, servers, metadata analysis, auditing services and so on. As a platform, Hop also aims to be a reusable library that other software can easily embed.

Proposal

Hop provides all the tools to build, maintain and deploy data orchestration, ETL and data integration solutions. For example, Hop allows you to diagram a data flow that propagates changes from a database via Apache Kafka to a data warehouse and deploy it as an Apache Beam pipeline. The core concepts of Hop are Pipelines and Workflows.

Pipelines do the core data manipulation work (read, manipulate, write data). The main items of work in pipelines are transforms. A pipeline consists of two or more (usually many) transforms that each perform a granular piece of work. The transforms in a pipeline run in parallel, and together create a powerful data processing tool.
Workflows take care of the orchestration of actions: execute pipelines, run child workflows, environment checks, preparation, problem alerting and so on.

If these terms sound familiar it’s because they are taken from the Apache Beam and Apache Airflow projects.
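
To make the idea of transforms running in parallel concrete, here is a minimal sketch in plain Java. It is not the Hop API: the class name PipelineSketch, the method names and the END marker are all hypothetical, and a real pipeline engine handles metadata, errors and back-pressure that are left out here. Each transform runs in its own thread and passes rows to the next one through a queue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Function;

// Hypothetical sketch: none of these names come from the Hop API.
class PipelineSketch {

    // Marker object that tells the next transform no more rows will arrive.
    private static final Object END = new Object();

    // Starts one thread that reads rows from 'in', applies 'work' and writes to 'out'.
    static Thread startTransform(String name, BlockingQueue<Object> in,
                                 BlockingQueue<Object> out, Function<Object, Object> work) {
        Thread t = new Thread(() -> {
            try {
                for (Object row = in.take(); row != END; row = in.take()) {
                    out.put(work.apply(row));
                }
                out.put(END); // pass the end marker downstream
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, name);
        t.start();
        return t;
    }

    // Runs two transforms in parallel; the calling thread feeds rows in and collects the output.
    static List<String> run(List<String> input) {
        BlockingQueue<Object> source = new LinkedBlockingQueue<>();
        BlockingQueue<Object> trimmed = new LinkedBlockingQueue<>();
        BlockingQueue<Object> upper = new LinkedBlockingQueue<>();

        Thread trim = startTransform("trim", source, trimmed, r -> ((String) r).trim());
        Thread upperCase = startTransform("upper", trimmed, upper, r -> ((String) r).toUpperCase());

        List<String> result = new ArrayList<>();
        try {
            for (String row : input) {
                source.put(row);
            }
            source.put(END);
            for (Object row = upper.take(); row != END; row = upper.take()) {
                result.add((String) row);
            }
            trim.join();
            upperCase.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("pipeline interrupted", e);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of(" kettle ", " hop "))); // prints [KETTLE, HOP]
    }
}
```

Because every transform has its own thread, the second transform can already process the first rows while the first transform is still reading later ones. A workflow, by contrast, would call something like run(...) as one step among other actions.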

The main components of the Hop platform are:

hop-gui: a visual data orchestration IDE
hop-run: a CLI tool to run workflows or pipelines
hop-config: a CLI tool to configure Hop and its components
hop-server: a light-weight web server to run and monitor workflows and pipelines
hop-translator: a tool for translating the various parts of the Hop tools (i18n).
hop-web: a thin client version of hop-gui for web browsers and mobile devices

The cornerstone of the Hop platform is extensibility: all major components of the platform are designed to be pluggable. This allows any possible missing functionality to be created in a short amount of time.

Background

The Hop Orchestration Platform has its origins in the Kettle community. Kettle was acquired by Pentaho, and after Pentaho’s acquisition by Hitachi in 2015 the community struck out to solve problems less aligned with Hitachi’s interests.

Rationale

In the Hop community, we have always aimed to function as a meritocracy, where contributions are accepted based on merit, and individuals gain status in the community based on their contributions (coding and otherwise). We’re proud to have a diverse group of people doing all the required things in a project: development, documentation, tutorials, architecture, testing, graphics design and much more. Bringing the project under the Apache Software Foundation would allow us to continue and grow, but also give our users confidence about the governance, IP status, and future of the project.

ASF Preparation Phase

The very first goal of project Hop is to find a good way to cooperate on development across wide geographical, economic and social spectra. To make this possible, real changes were needed to a codebase which is essentially 20 years old. Most of these changes have been tackled. We think it’s fair to say that Hop is now a new platform, even though it shares a common background with Kettle because it partly started from the Kettle code base.

Here are a few of the key focus areas we’re trying to safeguard going forward:

Plugins: lightweight plugins for all major functionality. This makes it possible to extend Hop or reduce Hop in size. It also allows people to implement or change functionality with minimal coding. In other words it makes it easier to contribute.
Maintain an open and responsive community where every concern, feedback and contribution is welcome.
Maintain a clear focus on data orchestration user requirements, not on “industry trends”
Documentation: we set up a version-controlled “adoc” system with automated builds which is open, controlled and reviewed. This is incredibly important for every Hop user and developer.
Testing and stability: we want to massively increase stability by implementing integration tests beyond the standard Java unit testing because of the dynamic nature of data orchestration work. We still have a long way to go. This work will never be finished. It’s a clear and important goal nevertheless.
Simplicity: things are complex enough. We follow the example of projects like Apache Spark and Flink: for example, “hop-run.sh” does exactly what the name says without the need to dive into documentation. As much as possible we make things self-evident and re-use existing terminology.

For a list of the changes you can look at the monthly roundups which have been compiled since February 2020. They document the hard work of our community so far:

http://www.project-hop.org/news/roundup-2020-02/

http://www.project-hop.org/news/roundup-2020-03/

http://www.project-hop.org/news/roundup-2020-04/

http://www.project-hop.org/news/roundup-2020-05/

http://www.project-hop.org/news/roundup-2020-06/

http://www.project-hop.org/news/roundup-2020-08/

Goals

Here are a few more details and specifics of things we still want to take on going forward:

Add more plugin metadata to Transform and Action plugins as well as their supported engines. This will make it easier to refine the user interface and improve the user experience by giving to-the-point feedback on which operations are supported and required. Example metadata to add: extra version and build information, dependencies, tags and labels (replacing categories), keywords, documentation links, input and output capabilities, engine capabilities and so on.
SWT: While the Eclipse SWT project is still supported we want to make a list of all the commonly used API calls and stick to those with our own API. This will help the development of hop-web and allow us to possibly more easily migrate to different user interfaces later on.
Integration testing: every transform and action should have an integration test before it is released to ensure quality. Java unit testing has proven insufficient to guard against regressions in backward compatibility, stability and functionality. We need to do better.
Apache VFS: Hop makes extensive use of this API to handle files. As such we want to implement the various drivers for gs://, hdfs://, s3:// through standard Hop plugins, making it easier to choose which protocols to support.
Variables & Parameters: make this experience more intuitive, clean up the underlying API and add more options to the various user interfaces responsible for setting and passing variables and parameters.
Make Hop-Web an integral part of the Apache Hop project removing the code duplication (fork) we’re dealing with now. This includes the need to improve various user interfaces which were designed for non-web clients.
Make best practices and governance functionality an integral part of the API of the project:

Data sets and unit testing (already done)
Environments and lifecycle management (partly done)
Git support (partly done)
Auditing and lineage
Software policies and enforcement thereof
Configuration management (partly done)
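
As an illustration of the plugin metadata goal listed above, the following sketch shows one way such metadata could be declared and queried in plain Java. Everything in it is an assumption made for the example: the TransformInfo annotation, its attributes, the ExampleFilter class and the engine names are hypothetical and do not describe the actual Hop plugin API.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.Arrays;

// Hypothetical sketch: annotation and attribute names are illustrative, not the Hop API.
class PluginMetadataSketch {

    // Metadata a plugin could carry: version, tags, keywords, docs link, supported engines.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    @interface TransformInfo {
        String id();
        String version() default "";
        String[] tags() default {};
        String[] keywords() default {};
        String documentationUrl() default "";
        String[] supportedEngines() default {};
    }

    // An example plugin class annotated with made-up metadata values.
    @TransformInfo(
        id = "ExampleFilter",
        version = "0.1",
        tags = {"flow"},
        keywords = {"filter", "rows"},
        supportedEngines = {"local", "beam"})
    static class ExampleFilter {}

    // Returns true when the plugin class declares support for the given engine.
    static boolean supportsEngine(Class<?> plugin, String engine) {
        TransformInfo info = plugin.getAnnotation(TransformInfo.class);
        return info != null && Arrays.asList(info.supportedEngines()).contains(engine);
    }

    public static void main(String[] args) {
        System.out.println(supportsEngine(ExampleFilter.class, "beam"));  // prints true
        System.out.println(supportsEngine(ExampleFilter.class, "spark")); // prints false
    }
}
```

With metadata of this kind available at runtime, a user interface could hide or flag transforms that the selected engine does not support instead of failing at execution time.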

Current Status

Meritocracy

With Project Hop, we actively work to foster the existing community and encourage community contributions. As of September 1st 2020 we had received over 250 pull requests, had around 600 tickets in our JIRA platform (many of which were created by community members) and had active discussions in our Mattermost chat platform with over 80 members.

Over the last half year we started to ask users on our chat server for specific feedback on terminology, features and so on. It’s been a wonderfully positive experience to have in-depth discussions on complex issues with industry experts. We look forward to moving these discussions and votes to an Apache mailing list.

Community

Hop is developed, extended and maintained by a global community of users and developers. The Hop community is what has driven its development and growth.

The particular history of Hop has generated a lot of interest in the project and has already led to a number of contributions of code, documentation and translations.

Core Developers

We have a diverse group of core developers with people joining on a regular basis. Matt Casters, Rodrigo Haces and David Rosenblum are part-time developers on Hop, salaried by Neo4j. Bart Maertens, Hans Van Akelyen and Yannick Mols are part-time Hop developers paid by know.bi. Doug and Gretchen Moran were Pentaho employees. Along with Rafael Valenzuela, Dan Keeley, Jason Chu, Sergio Ramazzina and many others, they have been consultants and community members for over a decade and joined the Hop community in the last year or two.

Alignment

We want to anchor and safeguard our development and community building efforts for the future. We strongly believe that as an Apache project this can be achieved in the best possible way. The Hop project also started to align with projects like Apache Beam, Spark and Flink in its use of terminology, tools, manner of configuration and so on. As mentioned elsewhere in this document Hop is a large user of other Apache projects and libraries and we believe that becoming an Apache project is mutually beneficial. Specifically for Apache Beam we believe that providing a visual pipeline development tool can be of great value.

Known Risks

While the Kettle code base from which we started is already released under the Apache License 2.0, proper attribution to Hitachi Vantara needs to happen.

We have no knowledge of existing patents on any part of the Kettle codebase.

To further reduce the risk of any discussion on naming, the Hop team decided to rename the project, its tools (also to make them more self-evident), the Java API and even the main concepts (Transformations are now called Pipelines, in line with Apache Beam naming conventions).

Orphaned products

There is little risk that the project will become orphaned. The list of active developers is large, and consists of a mix of developers who have been working on the code for several years and recent arrivals in the community.

Inexperience with Open Source

The project team has a long history in open source and has contributed to Apache licensed open source projects, mostly in the Kettle ecosystem such as Kettle itself and the many plugins and projects surrounding it. The experience gained there has allowed us to quickly set up all required build tools and processes. In its fairly short history, Hop has been advocating open source in all aspects of the project. Our submission to the Apache Software Foundation is a logical extension of our commitment to open source software.

Licensing

The original source code we started from (see below) has been open source since December 2005, initially under the Lesser GPL but since January 2012 entirely under the Apache License version 2.0. All Hop code has been scanned for compliance with the Apache License 2.0. We integrated Apache Rat with our build process.

Heterogeneous Developers

Hop is built, developed and maintained by a global community of developers. Input comes from a large group of developers and users from all over the world. At this moment more than 7 companies contribute to Hop through their developers, along with a number of individuals and consultants.

Reliance on Salaried Developers

Hop developers are a mix of volunteers, enthusiasts and people working for an employer. There is also a group of consultants who want to be involved in Hop because it allows them to do projects with it. They are in fact our most important users and developers since they provide valuable feedback from the trenches.

Relationships with Other Apache Products

Hop is a heavy user of Apache software libraries.

Apache Commons usage:

commons-beanutils
commons-cli
commons-codec
commons-collections
commons-collections4
commons-compiler
commons-compress
commons-configuration
commons-database-model
commons-dbcp
commons-digester
commons-el
commons-httpclient
commons-io
commons-lang and commons-lang3
commons-logging
commons-math and commons-math3
commons-net
commons-pool
commons-validator
commons-vfs2

Other libraries:

Apache Batik : for the front-end SVG drawing
Apache Xerces (XSLT, XML processing)

Other usage of Apache projects related to Hop (plugins):

Apache Avro
Apache Beam w/ Apache Spark, Apache Flink, …
Apache Cassandra
Apache CouchDB
Apache Derby
Apache Flume
Apache Hadoop
Apache Hive
Apache Kafka
Apache Solr
Apache Subversion
Apache Zookeeper

For the build process

Apache Maven
Jenkins

An Excessive Fascination with the Apache Brand

With this proposal we are not seeking attention or publicity. Rather, we firmly believe in Hop, visual data pipeline development and the ability to treat the developed data pipelines (ETL) as software code. While the original Hop code has been open source for about 15 years, we believe putting code on GitHub can only go so far. We see the Apache community, processes, and mission as critical for ensuring Hop is truly community-driven, positively impactful, and innovative open source software. We believe Hop is a great fit for the Apache Software Foundation due to its focus on visual data processing and its relationships to existing ASF projects.

Documentation

Over the years, the community has contributed extensive documentation to wiki.pentaho.com. Over time, areas of the available information have become incomplete or outdated. Most of this documentation has been reviewed, updated and will be contributed to the Apache foundation with the Hop source code. Documentation for the extensive new functionality that was added to Hop in recent months is being written.

We consider documentation to be a core piece of the Hop platform and will treat documentation as any other item of code.

Initial Source

While there isn’t a Java class in Hop which is unchanged from its origins, we should mention that we selected this source code to form the base of Apache Hop:

https://github.com/pentaho/pentaho-kettle/tree/8.2.0.7-R

We merged various changes from the WebSpoon fork found over here:

https://github.com/HiromuHota/pentaho-kettle

Various community-driven Kettle plugins were written to bypass bugs, slow down code rot and implement missing features. They were merged into Hop from these locations:

https://github.com/mattcasters/kettle-debug-plugin (better debugging)

https://github.com/mattcasters/kettle-beam (Apache Beam support)

https://github.com/mattcasters/pentaho-pdi-dataset (Unit Testing)

https://github.com/mattcasters/kettle-needful-things (Bug fixes & workarounds)

https://github.com/mattcasters/kettle-environment (Environment management)

The Hop repositories are currently hosted at:

https://github.com/project-hop/

Hop: source code for the Hop project
Hop-doc: technical documentation for the Hop project
Hop-website: Hop website and content repository
Hop-docker: Docker containers, Kubernetes

Source and Intellectual Property Submission Plan

The originating source code is already licensed under the Apache License 2.0.

For all contributions we have an agreement in place: https://cla-assistant.io/project-hop/hop

External Dependencies

Over the course of the last year we removed non-essential dependencies as much as possible and replaced them by interfaces and plugin types. We did this to simplify the architecture.

It’s important to note that all external dependencies are licensed under the Apache License 2.0 or an Apache-compatible license. As we grow the Hop community we will configure our build process to validate that all contributions and dependencies are licensed under the Apache License 2.0 or an Apache-compatible license.

Cryptography

Required Resources

Mailing lists

We currently use a mix of email and Mattermost. We will migrate our existing mailing lists to the following:

dev@hop.incubator.apache.org

user@hop.incubator.apache.org

private@hop.incubator.apache.org

commits@hop.incubator.apache.org

Git Repository

The Hop code is currently in git and we’d like to keep it that way. We request a git repository for incubator-hop with mirroring to GitHub.

Issue Tracking

We request the creation of an Apache-hosted JIRA.

Jira ID: HOP

Other Resources

To allow other projects to use Hop as a library we would love to publish artifacts on a Maven server like maven.apache.org.

Initial Committers

Nicholas Adment <nadment@gmail.com>
Hans Van Akelyen <hans.van.akelyen@know.bi>
Lokke Bruyndonckx <lokke.bruyndonckx@know.bi>
Matt Casters <matt.casters@neo4j.com>
Jason Chu <jianjunchu@gmail.com>
Peter Fabricius <info@peter-fabricius.de>
Rodrigo Haces <rodrigo.haces@neo4j.com>
Dave Henry <dshenry99@gmail.com>
Hiromu Hota <hiromu.hota@gmail.com>
Brandon Jackson <usbrandon@gmail.com>
Dan Keeley <dan@dankeeley.co.uk>
Bart Maertens <bart.maertens@know.bi>
Yannick Mols <yannick.mols@know.bi>
Doug Moran <doug@dougandgretchen.com>
Gretchen Moran <gretchen@dougandgretchen.com>
Sergio Ramazzina <sergio.ramazzina@serasoft.it>
Maria Carina Roldan <maria.carina.roldan@gmail.com>
David Rosenblum <david.rosenblum@neo4j.com>
Rafael Valenzuela <ravamo@gmail.com>

Affiliations

Neo4J

Matt Casters
Rodrigo Haces
David Rosenblum

Know.bi

Bart Maertens
Hans Van Akelyen
Lokke Bruyndonckx
Yannick Mols

eHealth Africa

Doug & Gretchen Moran

Schemetrica

Dave Henry

Beijing Auphi Data Co

Jason Chu

Serasoft Italy

Sergio Ramazzina

Hitachi Research

Hiromu Hota
