Apache DataSketches Proposal

Abstract

DataSketches.GitHub.io is an open source, high-performance library of stochastic streaming algorithms commonly called "sketches" in the data sciences. Sketches are small, stateful programs that process massive data as a stream and can provide approximate answers, with mathematical guarantees, to computationally difficult queries orders-of-magnitude faster than traditional, exact methods.

...

The DataSketches library was started in 2012 as internal Yahoo

Footnote
In 2017 Verizon acquired Yahoo and merged it with previously acquired AOL. The merged entity was originally called Oath, Inc., but has recently been renamed Verizon Media, Inc., a wholly-owned subsidiary of Verizon, Inc. Since Yahoo is the more recognized name, references in this document to Yahoo, are also a reference to Verizon Media, Inc.

project to dramatically reduce time and resources required for distinct (unique) counting. An extensive search on the Internet at the time yielded a number of theoretical papers on stochastic streaming algorithms with pseudocode examples, but we did not find any usable open-source code of the quality we felt we needed for our internal production systems. So we started a small project (one person) to develop our own sketches working directly from published theoretical papers.

...

Daniel Anderson, Pryce Bevan, Kevin J. Lang, Edo Liberty, Lee Rhodes, and Justin Thaler. A high-performance algorithm for identifying frequent items in data streams. In ACM IMC 2017.
Anirban Dasgupta, Kevin J. Lang, Lee Rhodes, and Justin Thaler. A framework for estimating stream expression cardinalities. In *EDBT/ICDT Proceedings ‘16, pages 6:1–6:17, 2016.
Mina Ghashami, Edo Liberty, Jeff M. Phillips. Efficient Frequent Directions Algorithm for Sparse Matrices. In ACM SIGKDD Proceedings ‘16, pages 845-854, 2016.
Zohar S. Karnin, Kevin J. Lang, and Edo Liberty. Optimal quantile approximation in streams. In IEEE FOCS Proceedings ‘16, pages 71–78, 2016.
Kevin J Lang. Back to the future: an even more nearly optimal cardinality estimation algorithm. arXiv preprint https://arxiv.org/abs/1708.06839, 2017.
Edo Liberty. Simple and deterministic matrix sketching. In ACM KDD Proceedings ‘13, pages 581– 588, 2013 (Received Best-Paper Award, KDD is one of the largest data mining focused conferences in the world).
Edo Liberty, Michael Mitzenmacher, Justin Thaler, and Jonathan Ullman. Space lower bounds for itemset frequency sketches. In ACM PODS Proceedings ‘16, pages 441–454, 2016.
Michael Mitzenmacher, Thomas Steinke, and Justin Thaler. Hierarchical heavy hitters with the space saving algorithm. In SIAM ALENEX Proceedings ‘12, pages 160–174, 2012.
Edo Liberty and Zohar Karnin. Discrepancy, Coresets, and Sketches in Machine Learning. To be published.
Arik Rinberg, Alexander Spiegelman, Edward Bortnikov, Eshcar Hillel, Idit Keidar, Hadar Serviansky, Fast Concurrent Data Sketches. https://arxiv.org/abs/1902.10995 - under submission

International Keynotes and Tutorials on Sketching by the Team

Edo Liberty - Southern Data Conference - Upcoming
Edo Liberty - Information Theory and Applications (ITA) 2019
Edo Liberty - TMA Experts Summit 2018
Edo Liberty - Streaming Quantiles - Shonan Workshop on Processing Big Data Streams 2017
Edo Liberty - MLConf 2014, NYC - Introducing the Streaming Computation Model
Edo Liberty and Jelani Nelson - Full tutorial at KDD 2012 (Slides)
Lee Rhodes - DataSketches, a Required Toolkit for Analysis of Big Data. Hadoop Summit 2015
Lee Rhodes - DataSketches. Alan Turing Institute, Invited talk. May 2017.

The Rationale for Sketches

...

Streaming. To be truly streaming a sketch can only “touch” each item of the stream once. There is no second chance.
Amortized per item processing time is constant. Sketch algorithms are designed so that the processing time per item is essentially independent of n, the number of items (or size) of the input stream. There may be specific instances where upon receiving a new item, the sketch needs to resize or rebuild its internal data structures to accommodate more items, but overall, these events are rare, so that when amortized over n, the computational cost is effectively O(1) with a small hidden constant.
Small Size. Relative to the size of the input stream, sketches retain a small amount of information about the state of the stream observed so far. Sketch sizes are often orders-of-magnitude smaller, typically just kilobytes in size, even for very long input streams. The smaller the retained information in the sketch the faster it can be examined or processed.
Sublinear space growth. The sublinear property means that as the number of items in the input stream, n, grows, the size of the sketch must not grow proportionately, it must grow much less than that. To be useful, sketch algorithms should grow logarithmically or sub-logarithmically with the size of the input stream, or not at all. Not only must the sketch start small, it must remain small as n increases.
Mergeability. In order to be useful in large distributed systems, sketches must be mergeable. This means that the internal data structures of two sketches can be combined in a linear or “additive” fashion. Let A and B be streams and + indicate concatenation:
- sketch.merge(sketch(A), sketch(B)) ≅ sketch(A + B).
Some sketches extend mergeability to include operations other than concatenation or union, such as set intersection and set difference. Mergeability enables input streams to be automatically processed in a distributed and parallel manner: split the stream up arbitrarily across many machines, process each piece separately, and merge the resulting sketches.
Mathematically proven error properties. It is important that results obtained from a sketch have clearly defined and useful error properties. For example, a sketch with a specific configuration might specify that with 95% confidence, the exact answer will be within plus-or-minus 1% of the estimated result produced by the sketch. In order to make such a statement the sketch must essentially be stream independent, which is to say that the sketch must deliver on its claimed error properties independent of the length, the order, the range of values, and the distribution of the items in the input stream. There is always the possibility that with certain stream lengths, orders, or distributions, that the actual sketch error might be considerably better, even zero, but the claimed error properties still hold.
- In order to make claims like this, the theory behind sketches must include rigorous mathematical proofs that demonstrate that, if implemented correctly, the sketch will produce results with the claimed error properties regardless of the input stream.
- The sketch is essentially a complex state machine and combined with some arbitrary input stream defines a stochastic process. We then apply probabilistic methods to interpret the states of the stochastic process in order to extract useful information about the input stream itself. The resulting information will be approximate, but we also use additional probabilistic methods to extract an estimate of the likely probability distribution of error.
- The error of the sketch result for a given input stream is a random variable, and “mathematically proven error properties” means that this random variable has a probability distribution with a mean and a variance that is well-understood. More generally, the estimate returned by a sketch will be within some error ε of the true answer to any query, with a specified statistical confidence. The definition of this ε error is determined by the mathematics underlying the specific sketch and can be either additive (i.e., the sketch returns the correct result in the range result ± ε, or multiplicative (i.e., the sketch returns the correct result in the range result * (1 ± ε)).
- The mathematical proofs should also include the merge process and associated algorithms. It is important to note that there must be no error penalty for merging. I.e., the error properties of the merged sketch must be no worse than the error of a single sketch produced directly from the concatenated streams.

...

Sketching is a rather distinct field in that the underlying theory of these algorithms is usually only taught in graduate-level elective courses and only at a few universities that have professors that have experience in this field (see Appendix). For example, a freshly-minted Ph.D. in Machine Learning, does not imply that they have any exposure to this area. Even course-work in algorithms does not imply that this area is taught because it is an advanced topic. There are specific academic workshops on streaming sub-linear algorithms, but no dedicated conferences, yet.

The Rationale for Apache DataSketches

Other open source implementations of sketch algorithms can be found on the Internet. However, we have not yet found any open source implementations that are as comprehensive, engineered with the quality required for production systems, and with usable and guaranteed error properties. Large Internet companies, such as Google and Facebook, have published papers on sketching, however, their implementations of their published algorithms are proprietary and not available as open source.

...

Binary compatibility across language, platform and history. Binary compatibility means that the stored image of a sketch can be fully interpreted and used by the same type sketch in a different language (e.g., C++, Python) or on a different platform. Our guarantee is that sketches that were produced by the earliest versions of our code can still be read and interpreted by the latest versions of our code. This is critically important for systems that might store years worth of sketches, because it is vastly more efficient than attempting to store years worth of raw data. We have found that this property is even vastly more important than backward compatibility of the APIs. Unfortunately, APIs do have to change and evolve, and while we try hard to avoid this, it sometimes is required.
Accommodations for specific system architecture or language requirements. Through our work with the Druid team we learned the importance of being able to operate sketches off the java heap. As a result, the sketches that we have currently integrated into Druid’s aggregation functions have this off-heap (or Direct) capability. By operate we mean that the sketch is able to be updated, queried, and merged without having to be deserialized on to the Java heap first. Our work with PostgreSQL (C++) team has taught us the importance of enabling user specification of malloc() and free() which can be customized to the environment.

We believe that having DataSketches as an Apache project will provide an immediate, worthwhile, and substantial contribution to the open source community, will have a better opportunity to provide a meaningful contribution to both the science and engineering of sketching algorithms, and integrate with other Apache projects. In addition, this is a significant opportunity for Apache to be the "go-to" destination for users that want to leverage this exciting technology.

Apache DataSketches as a Top-Level Project

Because successful development and implementation of high-performance sketches involves knowledge of advanced mathematics and statistics, there might be a tendency to associate the Apache DataSketches project with Apache Commons-Math or Apache Commons-Statistics. This, I believe, would be a mistake for a couple of reasons.

Language Support. The Apache Commons-Math, Apache Commons-Statistics, and Apache Commons-Lang libraries are exclusively Java libraries by definition. The DataSketches library supports multiple languages (So far: Java, C++, Python).
Visibility to data processing platform developers. Sketching is a relatively new field in the arsenal of tools available to system developers. Burying this project under the commons math or commons statistics may make it harder to find. We want to encourage synergy with the various platforms to learn to leverage this technology and to provide feedback to us on capabilities in the design of the sketches themselves.
Sketches solve difficult computational problems that are desirable queries in large data processing systems, such as unique counts, quantiles, CDFs, PMFs, Histograms, Heavy-hitters (TopN), etc. And they solve these problems in a mergeable and streaming way, which makes them suitable for real-time queries.

Initial Goals

We are breaking our initial goals into short-term (2-6 months) and intermediate to longer-term (6 months to 2 years):

...

Understanding and adapting to the Apache development process and structures.
Start refactoring codebase and move various DataSketches repositories code to Apache Git repository.
Continue development of new features, functions, and fixes.
Specific sub-projects (e.g., C++ and Python) will continue to be developed and expanded.

The intermediate to longer term goals include:

Completing the design and implementation of the C++ sketches to complement what is already available in Java, and the Python wrappers of those C++ sketches.
Expanding the C++ build framework to include Windows and the popular Linux variants.
Continued engagement with the scientific research community on the development of new algorithms for computationally difficult problems that heretofore have not had a sketching solution.

Current Status

The DataSketches GitHub project has been quite successful. As of this writing (Feb, 2019) the number of downloads measured by the Nexus Repository Manager at oss.sonatype.org has grown by nearly a factor of 10 over the past year to about 55 thousand per month. The DataSketches/sketches-core repository has about 560 stars and 141 forks, which is pretty good for a highly specialized library.

...

Eshcar Hillel: Senior Research Scientist, Yahoo Labs, Israel. Interests: distributed systems, scalable systems and platforms for big data processing, concurrent algorithms and data structures,
Kevin Lang: (*) Distinguished Research Scientist, Yahoo Labs, Sunnyvale, California. Interests: algorithms, theoretical and applied mathematics, encoding and compression theory, theoretical and applied performance optimization.
Edo Liberty: (*) Director of Research, Head of Amazon AI Labs, Palo Alto, California. Manages the algorithms group at Amazon AI. We build scalable machine learning systems and algorithms which are used both internally and externally by customers of SageMaker, AWS's flagship machine learning platform.
Jon Malkin: (*) Senior Scientist, Yahoo Labs, Sunnyvale. Interests: Computational advertising, machine learning, speech recognition, data-driven analysis, large scale experimentation, big data, stream/complex event processing
Justin Thaler: (*) Assistant Professor, Department of Computer Science, Georgetown University, Washington D.C. Interests: algorithms and computational complexity, complexity theory, quantum algorithms, private data analysis, and learning theory, developing efficient streaming and sketching algorithms

Engineers That Love Science

Roman Leventov: Senior Software Engineer, Metamarkets / Snap. Interests: design and implementation of data storing and data processing (distributed) systems, performance optimization, CPU performance, mechanical sympathy, JVM performance, API design, databases, (concurrent) data structures, memory management, garbage collection algorithms, language design and runtimes (their tradeoffs), distributed systems (cloud) efficiency, Linux, code quality, code transformation, pure functional programming models, Haskell.
Lee Rhodes: (*) Distinguished Architect, lead developer and founder of the DataSketches project, Yahoo, Sunnyvale, California. Interests: streaming algorithms, mathematics, computer science, high quality and high performance code for the analysis of massive data, bridging the divide between theory and practice.
Alexander Saydakov: (*) Senior Software Engineer, Yahoo, Sunnyvale, California. Interests: applied mathematics, computer science, big data, distributed systems.

Introduction to Additional Interested Contributors

...

Build
- Apache Maven
Integrations and adaptors for the following projects naturally have them as dependencies
- Apache Hive
- Apache Pig
- Apache Druid
- Apache Spark
Additional dependencies for the above integrations and adaptors include
- Apache Hadoop
- Apache Commons (Math)

There is no other Apache project that we are aware of that duplicates the functionality of the DataSketches library.

...

DataSketches website: datasketches.github.io.
DataSketches website repository: github.com/DataSketches/DataSketches.github.io

We will need an apache website for this documentation similar to

https://datasketches.apache.org

Initial Source

The initial source for DataSketches which we will submit to the Apache Foundation will include a number of repositories which are currently hosted under the GitHub.com/datasketches organization:

...

Java Production Code
- sketches-core: This repository has the core sketching classes, which are leveraged by some of the other repositories. This repository has no external dependencies outside of the DataSketches/memory repository, Java and TestNG for unit tests. This code is versioned and the latest release can be obtained from Maven Central.
- memory: Low level, high-performance memory data-structure management primarily for off-heap. This code is versioned and the latest release can be obtained from Maven Central.
- sketches-android: This is a new repository dedicated to sketches designed to be run in a mobile client, such as a cell phone or IoT device. It should be considered experimental. It is not currently versioned or published to Maven Central.
- sketches-hive: This repository contains Hive UDFs and UDAFs for use within Hadoop grid environments. This code has dependencies on sketches-core as well as Hadoop and Hive. Users of this code are advised to use Maven to bring in all the required dependencies. This code is versioned and the latest release can be obtained from Maven Central.
- sketches-pig: This repository contains Pig User Defined Functions (UDF) for use within Hadoop grid environments. This code has dependencies on sketches-core as well as Hadoop and Pig. Users of this code are advised to use Maven to bring in all the required dependencies. This code is versioned and the latest release can be obtained from Maven Central.
- sketches-vector: This is a new repository dedicated to sketches for vector and matrix operations. It is still somewhat experimental. It is versioned and published to Maven Central.
Java Non-Production Code
- characterization: This relatively new repository is for code that we use to characterize the accuracy and speed performance of the sketches in the library and is constantly being updated. Examples of the job command files used for various tests can be found in the src/main/resources directory. Some of these tests can run for hours depending on its configuration. This code is not versioned and not published to Maven Central.
- experimental: This repository is an experimental staging area for code that may eventually end up in another repository. This code is not versioned and not published to Maven Central.
- sketches-misc: Demos and other code not related to production deployment. We have no plans to publish this to Maven Central in the future.
C++ and Python Production Code
- sketches-core-cpp: This is the C+/Python companion to the Java sketches-core. These implementations are binary compatible with their counterparts in Java. In other words, a sketch created and stored in C+ can be opened and read in Java and visa-versa. This site also has our Python adaptors that basically wrap the C++ implementations, making the high performance C++ implementations available from Python. This code will be versioned.
- sketches-postgres: This site provides the postgres-specific adaptors that wrap the C++ implementations making them available to the Postgres database users. This code will be versioned.
C++ and Python Non-Production Code
- characterization-cpp: This is the C++/Python companion to the Java characterization repository. This code will not be versioned.
- experimental-cpp: This repository is an experimental staging area for C++ code that will eventually end up in another repository. This code will not be versioned.
Command-Line Tools - Non Production Code

...

sketches-cmd
homebrew-sketches
homebrew-sketches-cmd

These projects have always been Apache 2.0 licensed. We intend to bundle all of these repositories since they are all complementary and should be maintained in one project. Prior to our submission, we will combine all of these projects into a new git repository.

...

Java Characterization Dependencies The characterization code is not production run-time code and for the user to inspect and run if they wish. Because this repository contains characterization tests for algorithms from external sources by definition it has dependencies on those external sources.
Java System Integration Adaptor Dependencies: The code in the sketches-pig, sketches-hive repositories by definition rely on Apache Pig, Apache Hive and Apache Hadoop code.
C++ and Python Repositories: This is still evolving, but we have tried to limit the dependencies to the C and C++ Standard and Boost libraries.
C++ System Integration Adaptor Dependencies: So far we only have an adaptor for PostgreSQL which has dependencies on PostgreSQL.
Ruby / Homebrew Command-Line Tool: This has dependencies on Ruby and Homebrew code (for Mac systems).
Android-based Sketches: So far, this only has dependencies on Java and no other external dependencies.

Required Resources

Mailing Lists

...

dev@datasketches.incubator.apache.org
user@datasketches.incubator.apache.org
private@datasketches.incubator.apache.org
commits@datasketches.incubator.apache.org
issues@datasketches.incubator.apache.org
builds@datasketches.incubator.apache.org
notifications@datasketches.incubator.apache.org

Source Control

The DataSketches team currently uses Git and would like to continue to do so. We request a Git repository for DataSketches with mirroring to GitHub enabled similar the following:

Issue Tracking

We request the creation of an Apache-hosted JIRA. The DataSketches project is currently using the public GitHub issue tracker and the public Google Groups forum/sketches-user for issue tracking and discussions. We will migrate and combine from these two sources to the Apache JIRA.

...

The Apache Incubator **** This is our 1st choice ****
Apache Druid. The incubating Apache Druid project might also be a logical sponsor. However, DataSketches has applications in many areas of computing outside of Druid so our preference and recommendation is that DataSketches would ultimately be a top-level Apache project.

...

Appendix

Academic Classes on Streaming and Sketching Algorithms

...

Amit Chakrabarti's course at Dartmouth (Data Stream Algorithms, last offered 2015): https://www.cs.dartmouth.edu/~ac/Teach/CS35-Fall15/
Andrew McGregor's course at UMass Amherst: https://people.cs.umass.edu/~mcgregor/courses/CS711S18/index.html. This is course sometimes called "More Advanced Algorithms", and sometimes called "Data Stream Algorithms.
Justin Thaler’s course at Georgetown, called "Streaming Algorithms": http://people.cs.georgetown.edu/jthaler/COSC548.html.
Paul Beame's course at University of Washington, called Sublinear (and Streaming) Algorithms: https://courses.cs.washington.edu/courses/cse522/14sp/lectures/index.html.
Jelani Nelson's course at Harvard, called "Sketching Algorithms for Big Data": https://www.sketchingbigdata.org/fall17/.

All of the courses above are taught by theorists, and with the possible exception of Professor Thaler’s course, don't cover many of the algorithms or concepts most central to the Data Sketches library.

...

Page tree

Versions Compared

Old Version 32

New Version 104

Key

Apache DataSketches Proposal

Abstract

International Keynotes and Tutorials on Sketching by the Team

The Rationale for Sketches

The Rationale for Apache DataSketches

Apache DataSketches as a Top-Level Project

Initial Goals

Current Status

Engineers That Love Science

Introduction to Additional Interested Contributors

Initial Source

Required Resources

Mailing Lists

Source Control

Issue Tracking

Appendix

Academic Classes on Streaming and Sketching Algorithms

Page tree

Page History

Versions Compared

Old Version 32

New Version 104

Key

Apache DataSketches Proposal

Abstract

International Keynotes and Tutorials on Sketching by the Team

The Rationale for Sketches

The Rationale for Apache DataSketches

Apache DataSketches as a Top-Level Project

Initial Goals

Current Status

Engineers That Love Science

Introduction to Additional Interested Contributors

Initial Source

Required Resources

Mailing Lists

Source Control

Issue Tracking

Appendix

Academic Classes on Streaming and Sketching Algorithms