Abstract
Heart (Highly Extensible & Accumulative RDF Table) will develop a planet-scale RDF data store and a distributed processing engine based on Hadoop & HBase.
Proposal
Heart will develop a Hadoop subsystem consisting of an RDF data store and a distributed processing engine that use HBase and MapReduce to store and process RDF data.
Background
We can store very sparse RDF data in a single HBase table with as many columns as the data needs. For example, we might make one row for each RDF subject and store all of its properties and their values as columns in that row. This avoids costly self-joins when a query asks several questions about the same subject, which makes such queries efficient to process, although self-joins are still needed to answer RDF path queries.
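The row-per-subject layout can be illustrated with a small in-memory sketch. A plain Python dict stands in for the HBase table here; the sample triples and the function names `build_table`, `same_subject_query`, and `path_query` are hypothetical and are not part of Heart.

```python
from collections import defaultdict

# Hypothetical sample triples (subject, property, value); not from the proposal.
TRIPLES = [
    ("ex:alice", "foaf:name", "Alice"),
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:bob", "foaf:name", "Bob"),
    ("ex:bob", "foaf:mbox", "bob@example.org"),
]

def build_table(triples):
    """Group triples into one row per subject, one column per property."""
    table = defaultdict(dict)
    for subject, prop, value in triples:
        # A property may repeat, so each column holds a list of values.
        table[subject].setdefault(prop, []).append(value)
    return dict(table)

def same_subject_query(table, subject, props):
    """Answer several questions about one subject with a single row lookup."""
    row = table.get(subject, {})
    return {p: row.get(p, []) for p in props}

def path_query(table, subject, first_prop, second_prop):
    """Follow a two-step path; the second hop is another lookup (a self-join)."""
    results = []
    for middle in table.get(subject, {}).get(first_prop, []):
        results.extend(table.get(middle, {}).get(second_prop, []))
    return results

table = build_table(TRIPLES)
print(same_subject_query(table, "ex:alice", ["foaf:name", "foaf:knows"]))
print(path_query(table, "ex:alice", "foaf:knows", "foaf:name"))
```

Note that the row for `ex:alice` simply has no `foaf:mbox` column, which is how a sparse table avoids storing nulls.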
We can further accelerate query performance by using MapReduce for parallel, distributed query processing.
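As a rough illustration of how a query could be split into map and reduce phases, the sketch below counts, for a given property, how many subjects point at each value. It runs sequentially in one process and only mimics the MapReduce data flow; the partition contents and the names `map_row` and `run_job` are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical table partitions, each mapping subject -> {property: [values]}.
PARTITIONS = [
    {"ex:alice": {"foaf:knows": ["ex:bob", "ex:carol"]}},
    {
        "ex:bob": {"foaf:knows": ["ex:carol"]},
        "ex:carol": {"foaf:name": ["Carol"]},
    },
]

def map_row(subject, row, prop):
    """Map phase: emit (value, 1) for every value of the property in one row."""
    for value in row.get(prop, []):
        yield value, 1

def run_job(partitions, prop):
    """Simulate map, shuffle, and reduce over all partitions."""
    grouped = defaultdict(list)
    # On a cluster, each partition would be mapped on a different node.
    for partition in partitions:
        for subject, row in partition.items():
            for key, value in map_row(subject, row, prop):
                grouped[key].append(value)  # shuffle: group by key
    # Reduce phase: sum the counts collected for each key.
    return {key: sum(values) for key, values in grouped.items()}

counts = run_job(PARTITIONS, "foaf:knows")
print(counts)
```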
Rationale
Heart Data Loader
Heart Data Loader (HDL) reads RDF data from a file and organizes it into an HBase table in a way that allows efficient query processing. In HBase, we can store everything in a single table. The sparsity of RDF data is not a problem, because HBase, which is a column-oriented store and adopts various compression techniques, is very good at dealing with nulls in the table.
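A minimal sketch of the loading step follows, assuming the input is a simplified N-Triples file in which every object is either an IRI or a plain literal. The regular expression, the sample data, and the function names `parse_line` and `load` are hypothetical; a dict again stands in for the HBase table, and no real HBase API is used.

```python
import io
import re

# Matches "<subject> <property> <object-iri> ." or "<subject> <property> "literal" ."
# Typed or language-tagged literals and blank nodes are not handled in this sketch.
TRIPLE_RE = re.compile(
    r'^\s*<([^>]*)>\s+<([^>]*)>\s+(?:<([^>]*)>|"((?:[^"\\]|\\.)*)")\s*\.\s*$'
)

# Hypothetical sample data, not taken from the proposal.
SAMPLE = """\
# people
<http://example.org/alice> <http://xmlns.com/foaf/0.1/name> "Alice" .
<http://example.org/alice> <http://xmlns.com/foaf/0.1/knows> <http://example.org/bob> .
<http://example.org/bob> <http://xmlns.com/foaf/0.1/name> "Bob" .
"""

def parse_line(line):
    """Return (subject, property, value), or None if the line does not match."""
    match = TRIPLE_RE.match(line)
    if match is None:
        return None
    subject, prop, iri, literal = match.groups()
    return subject, prop, (iri if iri is not None else literal)

def load(lines):
    """Build a row per subject; only properties that occur become columns."""
    table = {}
    for line in lines:
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blank lines and comments
        triple = parse_line(line)
        if triple is None:
            raise ValueError("unsupported line: " + line.strip())
        subject, prop, value = triple
        table.setdefault(subject, {}).setdefault(prop, []).append(value)
    return table

table = load(io.StringIO(SAMPLE))
for subject, row in table.items():
    print(subject, row)
```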
Heart Query Processor
Heart Query Processor (HQP) executes RDF queries on RDF data stored in an HBase table. It translates RDF queries into HBase API calls or MapReduce jobs, then gathers the results and returns them to the user.
Query processing steps are as follows:
{{{SPARQL query -> Parse tree -> Logical operator tree -> Physical operator tree -> Execution}}}
Implementation of each step may proceed as an individual issue.
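The processing steps above can be sketched for a toy query language. This example assumes a tiny SPARQL subset with one selected variable, one triple pattern, and a constant property; the tuple-based tree shapes, the `get` and `scan` operator names, and all function names are invented for illustration and do not describe HQP's actual design.

```python
import re

# Hypothetical in-memory stand-in for the table: subject -> {property: [values]}.
TABLE = {
    "ex:alice": {"foaf:name": ["Alice"], "foaf:knows": ["ex:bob"]},
    "ex:bob": {"foaf:name": ["Bob"]},
}

QUERY_RE = re.compile(
    r"SELECT\s+(\?\w+)\s+WHERE\s*\{\s*(\S+)\s+(\S+)\s+(\S+)\s*\.?\s*\}",
    re.IGNORECASE,
)

def parse(query):
    """Query text -> parse tree (terms must be separated by whitespace)."""
    match = QUERY_RE.match(query.strip())
    if match is None:
        raise ValueError("unsupported query")
    var, s, p, o = match.groups()
    pattern = tuple(term.strip("<>") for term in (s, p, o))
    if var not in (pattern[0], pattern[2]):
        raise ValueError("selected variable does not occur in the pattern")
    return {"select": var, "pattern": pattern}

def to_logical(tree):
    """Parse tree -> logical operator tree: project over a pattern match."""
    return ("project", tree["select"], ("match", tree["pattern"]))

def to_physical(logical):
    """Pick a row get when the subject is constant, else a full table scan."""
    _, var, (_, pattern) = logical
    access = "scan" if pattern[0].startswith("?") else "get"
    return ("project", var, (access, pattern))

def execute(physical, table):
    """Run the physical plan against the in-memory table."""
    _, var, (access, (s, p, o)) = physical
    rows = [(s, table.get(s, {}))] if access == "get" else list(table.items())
    results = []
    for subject, row in rows:
        for value in row.get(p, []):
            binding = {}
            if s.startswith("?"):
                binding[s] = subject
            if o.startswith("?"):
                binding[o] = value
            elif o != value:
                continue  # constant object that does not match this value
            results.append(binding[var])
    return results

def run(query, table=TABLE):
    return execute(to_physical(to_logical(parse(query))), table)

print(run("SELECT ?name WHERE { <ex:alice> <foaf:name> ?name . }"))
print(run("SELECT ?s WHERE { ?s <foaf:name> ?n . }"))
```

Keeping each stage as a separate function mirrors the idea that each step can be implemented and tracked independently.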
Heart Data Materializer
Heart Data Materializer (HDM) pre-computes RDF path queries and stores the results in an HBase table. Later, HQP uses this materialized data to process RDF path queries efficiently.
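A minimal sketch of the idea, assuming two-step paths and a separate result table keyed by subject with one column per materialized path. The column naming scheme `first/second` and the function names are hypothetical; dicts stand in for both tables.

```python
# Hypothetical base table: subject -> {property: [values]}.
TABLE = {
    "ex:alice": {"foaf:name": ["Alice"], "foaf:knows": ["ex:bob", "ex:carol"]},
    "ex:bob": {"foaf:name": ["Bob"]},
    "ex:carol": {"foaf:name": ["Carol"], "foaf:knows": ["ex:bob"]},
}

def materialize_path(table, first_prop, second_prop):
    """Pre-compute the two-step path for every subject that has results."""
    column = first_prop + "/" + second_prop  # assumed column naming scheme
    materialized = {}
    for subject, row in table.items():
        values = []
        for middle in row.get(first_prop, []):
            # This inner lookup is the self-join that is paid once, up front.
            values.extend(table.get(middle, {}).get(second_prop, []))
        if values:
            materialized[subject] = {column: values}
    return materialized

def lookup_path(materialized, subject, first_prop, second_prop):
    """Answer the path query later with a single row lookup."""
    column = first_prop + "/" + second_prop
    return materialized.get(subject, {}).get(column, [])

paths = materialize_path(TABLE, "foaf:knows", "foaf:name")
print(lookup_path(paths, "ex:alice", "foaf:knows", "foaf:name"))
```

The trade-off in such a scheme is extra storage and the need to refresh the materialized table when the base data changes.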
Current Status
This is a new project.
Meritocracy
The initial developers are very familiar with meritocratic open source development, both at Apache and elsewhere. Apache was chosen specifically because the initial developers want to encourage this style of development for the project.
Community
Heart seeks to develop developer and user communities during incubation.
Core Developers
The initial set of committers includes folks from the Hadoop & HBase communities. We have varying degrees of experience with Apache-style open source development, ranging from none to ASF Members.
Alignment
The developers of Heart want to work with the Apache Software Foundation specifically because Apache has proven to provide a strong foundation and set of practices for developing standards-based infrastructure and server components.
Known Risks
Orphaned products
Inexperience with Open Source
Homogeneous Developers
Reliance on Salaried Developers
Relationships with Other Apache Products
Heart has a strong relationship with Apache Hadoop & HBase. Being part of Apache could help foster closer collaboration among the three projects.
An Excessive Fascination with the Apache Brand
We believe in the processes, systems, and framework Apache has put in place. The brand is excellent, but it is not the only reason why we wish to come to Apache.
Documentation
Initial Source
The initial source will consist of the current HQL and a Java-based RDF query language parser.
External Dependencies
- Hadoop (HDFS, Map/Reduce) License: Apache License, 2.0
- HBase (Sparse Matrix Table) License: Apache License, 2.0
Required Resources
- Developer and user mailing lists
- A Subversion repository
- A JIRA issue tracker
Initial Committers
Sponsors
Edward J. Yoon (edwardyoon@apache.org)
Nominated Mentors
Sponsoring Entity
The Apache Incubator.