Fluo is a distributed system for incrementally processing large data sets stored in Accumulo.
Fluo is a distributed transaction and notification system that enables the incremental processing of large data sets. Its transaction system allows for concurrent, cross-node updates to data stored in Accumulo. Its notification system enables developers to write code to be executed when observed data changes. Fluo provides a core API to perform transactional updates using minimalistic get/set methods. Fluo also provides a higher order recipes API that builds on the core API to support more complex methods for transactional updates.
Several frameworks exist for batch (e.g., Spark, MapReduce) and stream (e.g., Storm, Spark Streaming) processing of data. While batch and stream processing have strong use cases, they are not suited for joining incoming data in real time to a large existing data set. To fill this need, Google developed an incremental processing system called Percolator and described it in the paper "Large-scale Incremental Processing Using Distributed Transactions and Notifications", USENIX (2010), http://research.google.com/pubs/pub36726.html
Fluo fills the need for cross-row (and cross-node) transactions in Accumulo by providing it with an open source implementation of Percolator. Fluo also fills a gap in Accumulo’s ability to incrementally process data. In addition, Fluo provides a novel recipes API that offers higher-level abstractions for transactional updates.
Fluo currently exists as an open source project on GitHub and has been in active development since 2013. The project has made an alpha release and two beta releases. The major features of Fluo outlined in this proposal have been implemented. Several example Fluo applications have been created and run successfully on clusters (up to 24 nodes).
The Fluo project operates as a meritocracy and will continue to do so because we feel that a project with a diverse set of committers will thrive. Therefore, we welcome new contributors and encourage them on their path to committership.
Fluo is currently being used by a subset of the Accumulo community. The initial developers have been responsive to external contributions through pull requests and issues on GitHub. Once Fluo releases a stable, production-ready 1.0 version, we expect this community to grow. To encourage growth, we have created a project website with documentation, given talks at Meetups and the Accumulo Summit, and engaged with new users on GitHub and the Fluo mailing list.
The project was started by Keith Turner (an Apache Member and a committer and PMC member on Gora and Accumulo) in 2013, and development has primarily consisted of his and Mike Walch’s continued efforts. Additional developers have contributed over time, which has led to new committers.
Fluo is closely linked to the Accumulo community, and fits well within the larger Hadoop ecosystem at Apache. Fluo utilizes several Apache projects, such as Accumulo, YARN, Twill, and ZooKeeper. Enabling closer collaboration between these communities through its coexistence within the ASF would help further drive the success of them all.
In addition to our technical ties to other ASF projects, our development philosophy aligns with Apache philosophies. Based on our experience with existing Apache projects, we are interested in establishing formal governance with a PMC and community bylaws, which we feel would best be done within Apache.
Fluo could be orphaned if the project fails to gain adoption and the core developers abandon their interest (this is not anticipated). This risk can be mitigated by attracting more committers and developing further documentation to ease adoption.
Fluo has been an open source project on GitHub from the start of its development. Several Fluo developers are committers on other ASF projects as well as open source projects outside ASF, and understand open source development.
The initial committers work for different employers. We hope to add more developers from other employers and industries.
While most of the initial committers are paid to work on Fluo, there have been many contributions from developers working independently.
Fluo uses Accumulo, Hadoop (HDFS & YARN), Twill, ZooKeeper, Curator, Thrift, and various Commons libraries. During development, contributions have been made to some of these Apache projects to better support Fluo use cases.
While we recognize the impact of the Apache brand, we feel that Fluo would fit well in Apache because of its relationship to other Apache projects and because we share the ASF values of meritocracy and community over code.
Information about Fluo can be found on the project website at http://fluo.io/. This includes:
The initial source code is publicly available as an open source project on GitHub at https://github.com/fluo-io/fluo.
Supplemental repositories also exist on GitHub at https://github.com/fluo-io and some of those will become part of the initial code base (perhaps in separate repositories).
All of Fluo’s source code is available under the Apache License, Version 2.0.
The Fluo logo was designed for and contributed to the Fluo project, for use by the project. The contributors would like it to remain the logo of the project within the ASF and will grant any necessary rights to the ASF, while continuing to use the logo on Fluo-related historical sites and project pages (such as Fluo’s current GitHub site).
Fluo has made it a point from its beginning to use dependencies which are compatible with the expectations of an ASF project. The following are its current dependencies, grouped by license.
Apache License, Version 2.0
BSD License (2-Clause)
Eclipse Public License - v 1.0
MIT License (Expat)
none