Abstract

Heart (Highly Extensible & Accumulative RDF Table) will develop a planet-scale RDF data store and a distributed processing engine based on Hadoop and HBase.

Proposal

Heart will develop a Hadoop subsystem that provides an RDF data store and a distributed processing engine, using HBase and MapReduce to store and process RDF data.

Background

We can store very sparse RDF data in a single HBase table with as many columns as the data needs. For example, we might make one row for each RDF subject and store all of its properties and their values as columns in that row. This avoids costly self-joins for queries that ask about several properties of the same subject, which makes such queries efficient to process, although self-joins are still needed to answer RDF path queries.
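The layout described above can be sketched in a few lines. This is only an in-memory illustration that uses a Python dictionary in place of an HBase table; the sample triples and the function names (`build_subject_table`, `star_query`, `path_query`) are hypothetical and not part of Heart.

```python
from collections import defaultdict

# Hypothetical sample triples (subject, property, value); names are illustrative only.
TRIPLES = [
    ("ex:alice", "foaf:name", "Alice"),
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:bob", "foaf:name", "Bob"),
]


def build_subject_table(triples):
    """Group triples into one row per subject: {subject: {property: [values]}}."""
    table = defaultdict(lambda: defaultdict(list))
    for subject, prop, value in triples:
        table[subject][prop].append(value)
    return {subject: dict(columns) for subject, columns in table.items()}


def star_query(table, subject, properties):
    """Read several properties of one subject with a single row lookup (no join)."""
    row = table.get(subject, {})
    return {prop: row.get(prop, []) for prop in properties}


def path_query(table, subject, first, second):
    """Follow subject -first-> x -second-> y; each hop is another row lookup."""
    results = []
    for middle in table.get(subject, {}).get(first, []):
        results.extend(table.get(middle, {}).get(second, []))
    return results


table = build_subject_table(TRIPLES)
```

Here `star_query(table, "ex:alice", ["foaf:name", "foaf:knows"])` touches one row only, while `path_query(table, "ex:alice", "foaf:knows", "foaf:name")` has to look up a second row, which corresponds to the self-join mentioned above.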

We can further accelerate query performance by using MapReduce for parallel, distributed query processing.
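As a rough illustration of the idea, the sketch below simulates the map, shuffle, and reduce phases in a single Python process. It does not use Hadoop; the sample rows, the query ("who is known by whom"), and all function names are assumptions made for this example only.

```python
from collections import defaultdict

# Hypothetical rows in the subject-per-row layout: (subject, {property: [values]}).
ROWS = [
    ("ex:alice", {"foaf:knows": ["ex:bob", "ex:carol"]}),
    ("ex:bob", {"foaf:knows": ["ex:carol"]}),
    ("ex:carol", {"foaf:name": ["Carol"]}),
]


def knows_mapper(subject, columns):
    """Emit (known person, subject) for every foaf:knows value in a row."""
    for friend in columns.get("foaf:knows", []):
        yield friend, subject


def collect_reducer(key, values):
    """Gather all subjects that know the person given by key."""
    return sorted(values)


def run_job(rows, mapper, reducer):
    """Run map, shuffle, and reduce sequentially (a real job would distribute them)."""
    # Map: each row is processed independently, so this step can run in parallel.
    pairs = [pair for subject, columns in rows for pair in mapper(subject, columns)]
    # Shuffle: group intermediate values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Reduce: one call per key.
    return {key: reducer(key, values) for key, values in groups.items()}


known_by = run_job(ROWS, knows_mapper, collect_reducer)
```

Because the mapper looks at one row at a time, the rows could be split across many workers without changing the result.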

Rationale

Heart Data Loader

Heart Data Loader (HDL) reads RDF data from a file and organizes it into an HBase table in a way that allows efficient query processing. In HBase, we can store everything in a single table. The sparsity of RDF data is not a problem, because HBase is a column-oriented store that adopts various compression techniques and is very good at dealing with nulls in the table.
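A loader of this kind could look roughly like the sketch below. It assumes the input is in a small subset of N-Triples (IRIs in angle brackets and plain quoted literals, with no escapes, language tags, datatypes, or blank nodes) and it fills a Python dictionary rather than an HBase table. The input format and the names `parse_line` and `load` are assumptions, not a description of HDL itself.

```python
import re

# Matches "<subject> <predicate> <object-IRI> ." or "<subject> <predicate> "literal" ."
# Simplified on purpose: no escapes, language tags, datatypes, or blank nodes.
TRIPLE_RE = re.compile(r'^\s*<([^>]*)>\s+<([^>]*)>\s+(<[^>]*>|"[^"]*")\s*\.\s*$')


def parse_line(line):
    """Return (subject, predicate, object) or None if the line is not supported."""
    match = TRIPLE_RE.match(line)
    if match is None:
        return None
    # The object keeps its <> or "" so IRIs and literals stay distinguishable.
    return match.groups()


def load(lines):
    """Build {subject: {predicate: [objects]}} from an iterable of text lines."""
    table = {}
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        triple = parse_line(line)
        if triple is None:
            raise ValueError(f"unsupported line: {line!r}")
        subject, predicate, obj = triple
        table.setdefault(subject, {}).setdefault(predicate, []).append(obj)
    return table
```

Subjects that lack a given property simply have no entry for it, so sparse data costs nothing extra in this representation.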

Heart Query Processor

Heart Query Processor (HQP) executes RDF queries on RDF data stored in an HBase table. It translates RDF queries into HBase API calls or MapReduce jobs, then gathers the results and returns them to the user.

Query processing steps are as follows:

{{{SPARQL query -> Parse tree -> Logical operator tree -> Physical operator tree -> Execution}}}

Implementation of each step may proceed as an individual issue.
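To make the steps concrete, here is a heavily simplified sketch for a single triple pattern rather than a full SPARQL query. The parsing rule (three whitespace-separated terms, variables starting with `?`), the two physical operators (`row_get` and `table_scan`), and the dictionary used in place of an HBase table are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TriplePattern:
    """Stands in for both the parse tree and the logical operator in this sketch."""
    subject: str
    predicate: str
    obj: str


def is_var(term):
    return term.startswith("?")


def parse(pattern_text):
    """Parse 'S P O' into a TriplePattern (terms must not contain whitespace)."""
    parts = pattern_text.split()
    if len(parts) != 3:
        raise ValueError("expected exactly three terms")
    return TriplePattern(*parts)


def plan(pattern):
    """Pick a physical operator: a single row get if the subject is bound, else a scan."""
    return "table_scan" if is_var(pattern.subject) else "row_get"


def execute(table, pattern):
    """Run the chosen operator against {subject: {predicate: [objects]}}."""
    if plan(pattern) == "row_get":
        rows = [(pattern.subject, table.get(pattern.subject, {}))]
    else:
        rows = list(table.items())
    results = []
    for subject, columns in rows:
        for predicate, values in columns.items():
            if not is_var(pattern.predicate) and predicate != pattern.predicate:
                continue
            for value in values:
                if is_var(pattern.obj) or value == pattern.obj:
                    results.append((subject, predicate, value))
    return results
```

For example, `execute(table, parse("ex:alice foaf:knows ?who"))` reads one row, while a pattern with a variable subject falls back to scanning every row.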

Heart Data Materializer

Heart Data Materializer (HDM) pre-computes RDF path queries and stores the results in an HBase table. Later, HQP uses the materialized data to process RDF path queries efficiently.
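The following sketch shows one way such pre-computation might work for a two-hop path. It operates on a Python dictionary instead of an HBase table, and the function name `materialize_path` and the sample data are hypothetical.

```python
# Hypothetical data in the subject-per-row layout: {subject: {property: [values]}}.
TABLE = {
    "ex:alice": {"foaf:knows": ["ex:bob", "ex:carol"]},
    "ex:bob": {"foaf:name": ["Bob"]},
    "ex:carol": {"foaf:name": ["Carol"]},
}


def materialize_path(table, first, second):
    """Pre-compute subject -first-> x -second-> y and return {subject: [y, ...]}."""
    materialized = {}
    for subject, columns in table.items():
        ends = []
        for middle in columns.get(first, []):
            ends.extend(table.get(middle, {}).get(second, []))
        if ends:
            materialized[subject] = ends  # store only subjects that have the path
    return materialized


# Computed once; later path queries become a single lookup in this result.
friend_names = materialize_path(TABLE, "foaf:knows", "foaf:name")
```

After this step, answering "names of the people Alice knows" is a single lookup, `friend_names["ex:alice"]`, with no join at query time. The trade-off is that the materialized table must be refreshed when the underlying data changes.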

Current Status

This is a new project.

Meritocracy

The initial developers are very familiar with meritocratic open source development, both at Apache and elsewhere. Apache was chosen specifically because the initial developers want to encourage this style of development for the project.

Community

Heart seeks to develop developer and user communities during incubation.

Core Developers

The initial set of committers includes folks from the Hadoop and HBase communities. We have varying degrees of experience with Apache-style open source development, ranging from none to ASF Members.

Alignment

The developers of Heart want to work with the Apache Software Foundation specifically because Apache has proven to provide a strong foundation and set of practices for developing standards-based infrastructure and server components.

Known Risks

Orphaned products

Inexperience with Open Source

Homogenous Developers

Reliance on Salaried Developers

Relationships with Other Apache Products

Heart has a strong relationship with Apache Hadoop and HBase. Being part of Apache could foster closer collaboration among the three projects.

An Excessive Fascination with the Apache Brand

We believe in the processes, systems, and framework Apache has put in place. The brand is excellent, but it is not the only reason why we wish to come to Apache.

Documentation

Initial Source

The initial source will consist of the current HQL and a Java-based RDF query language parser.

External Dependencies

Required Resources

Initial Committers

Sponsors

Nominated Mentors

Sponsoring Entity

The Apache Incubator.

HeartProposal (last edited 2009-09-20 23:05:56 by localhost)