File Format Design Doc
This page describes a planned replacement file format for SSTables.
Requirements
- Compression
- Block based corruption recovery
- Space efficient when un-compressed: remove redundancy
- Random access to the middle of nested, wide rows
- Range tombstones (for range/slice deletes)
Influences
- Google Dremel [1] - Arbitrarily nested, field-oriented serialization
- Hive RCFile [2] - Column-group-oriented storage
Current Implementation
To avoid overloading words too heavily, we will call a series of ordered rows in an SSTable a 'span'. Since SSTables contain data for a single column family in sorted order, spans share these properties as well. One way to visualize a simple span with depth 3 (aka, containing super columns) is to arrange it as a table:
row key |
name1 |
name2 |
value |
cheese |
brie |
flavor |
3.4 |
cheese |
gouda |
flavor |
5.6 |
cheese |
gouda |
origin |
france |
cheese |
swiss |
flavor |
2.6 |
row key |
name1 |
name2 |
value |
fruit |
apple |
flavor |
4.2 |
fruit |
pear |
flavor |
4.9 |
fruit |
pear |
origin |
china |
This representation of a span involves a lot of redundancy in representing whether the parent of a particular column has changed. If we remove that particular redundancy, we get a tree structure:
row key |
name1 |
name2 |
value |
cheese |
brie |
flavor |
3.4 |
|
gouda |
flavor |
5.6 |
|
|
origin |
france |
|
swiss |
flavor |
2.6 |
row key |
name1 |
name2 |
value |
fruit |
apple |
flavor |
4.2 |
|
pear |
flavor |
4.9 |
|
|
origin |
china |
The current implementation of SSTables lays data out on disk in approximately this way: data for rows is stored contiguously. In relation to the table representation above, we divide the tree into pieces using horizontal "chunks". One must seek to the root of the tree for a row in order to read the row index and determine which chunk the next level of the tree is stored in.
Additionally, only the first level of the tree is indexed: in order to find a particular column at the level labeled "name2", you would need to deserialize all columns at that level, which makes large super columns impractical.
Finally, there is a second type of redundancy that the current design does not tackle: the column names at level "name2" are frequently repeated, but since rows are stored independently, we don't normalize those values. For narrow rows (like those shown), removing this redundancy will be our largest win.
Metadata
Metadata is currently implemented such that column parents have metadata that covers their entire range: this means that you cannot delete arbitrary slices, only exact keys or names.
Proposed Implementation
Because we will be storing multiple columns per SSTable, our design will bear the most similarity to RCFile (rather than the column-per-file approach taken in Dremel). But because we allow for nesting via super columns (and hopefully with a more flexible representation in the future), we need to take hints from Dremel's serialization to allow for efficient storage of parent and null information.
Vertical chunks
Rather than slicing the span into chunks horizontally, we will use vertical chunks (and break particularly wide rows into multiple spans):
row key |
cheese |
fruit |
name1 |
brie |
gouda |
swiss |
apple |
pear |
name2 |
flavor |
flavor |
origin |
flavor |
flavor |
flavor |
origin |
value |
3.4 |
5.6 |
france |
2.6 |
4.2 |
4.9 |
china |
This representation achieves some of the benefits for compression shown in the RCFile paper: similar values are stored much closer together. But we have lost some information!: Using the tables above, it is impossible to determine which fields at level "name1" are cheeses, and which are fruits.
Parent representation
We need to store parent information, and one method comes from Dremel's clever representation for arbitrary nesting. We add a single boolean to each tuple that toggles to represent parent changes:
row key |
parent_change |
cheese |
0 |
fruit |
0 |
name1 |
parent_change |
brie |
0 |
gouda |
0 |
swiss |
0 |
apple |
1 |
pear |
1 |
name2 |
parent_change |
flavor |
0 |
flavor |
1 |
origin |
1 |
flavor |
0 |
flavor |
1 |
flavor |
0 |
origin |
0 |
value |
parent_change |
3.4 |
0 |
5.6 |
1 |
france |
0 |
2.6 |
1 |
4.2 |
0 |
4.9 |
1 |
china |
0 |
The parent change flags can be represented compactly using a bitmap, and type information can be stored using a byte per value.
Metadata
Cassandra also needs to encode metadata about tuples and ranges of tuples in order to represent creation and deletion timestamps. For both value tuples and range tuples, a varying number (depending on value and range type) of timestamps will need to be encoded.
Types
deleted - If a value has been deleted at any point in time, we need to record that action with timestamps: marking a value deleted indicates that these timestamps exist
standard - A standard value has a timestamp to indicate when it was created
expiring - An expiring value has additional timestamps to indicate when it should be garbage collected
counter - Counter values have an additional timestamp to indicate the last time the counter was deleted
range_* - Types that indicate the beginning or end of a range. range_begin* types store two timestamps
begin|end - Indicate that a tuple is stored to signify a split in the metadata for a parent
(begin|end)_null - Indicate that a range is beginning or ending, but that it is unbounded, and hence no tuple is stored
Range Metadata
Ranges are stored as two values embedded in the array of tuples in a chunk, and marked with a range_* type as described above.
name1 |
parent_change |
type |
brie |
0 |
standard |
gouda |
0 |
standard |
havarti |
0 |
range_begin |
muenster |
0 |
range_end |
swiss |
0 |
standard |
apple |
1 |
standard |
pear |
1 |
standard |
This example adds type information and range metadata for values at level "name1" between 'havarti' and 'muenster': the chunk for the "name1" level stores a pair of tuples for the 'cheese' parent. The end result is that the span stores a tombstone from ('cheese', 'havarti', <empty>) to ('cheese', 'muenster', null), where <empty> is the smallest value, and null is the largest value.
In cases where parents are too wide to fit in a single span, range metadata is also used to indicate that only a portion of a parent is present in a given span. If a span ends with a non-null range_end marker, then it should be assumed that more spans are present.
Value Metadata
Unlike range metadata, value (column) metadata will only be stored in the last chunk of a span (since the entire span is used to represent the path to the column, you don't really care about metadata until you've built the full path to the value).
Schema
To iterate quickly, the initial versions of the file format will be implemented using Avro. The current Avro schema:
/**
* The largest record in an SSTable: consecutive Chunks represent a
* 'span' of tuples, where each chunk represents a level of nesting.
*/
record Chunk {
/** The index of the chunk in the span (mostly for readability). */
int index;
/** Values stored at this level. */
array<bytes> values;
/**
* Packed metadata for values.
* type: deleted, expiring, standard, range_*
*/
bytes types;
/** parent: a single bit to represent parent changes */
bytes parents;
/* TODO: Should offset or rl encode. */
/**
* A variable number of timestamps depending on value type.
*/
array<long> timestamps;
}
Disk Layout
Chunk
|
bytes |
magic |
16 |
encoded_length |
4 |
encoded |
variable |
hash |
4 |
Roadmap
Implementation has started on this design at https://github.com/stuhood/cassandra/tree/file-format
v1
Was implemented last year, and bears no similarity to this design.
v2
The first posted version of the patch:
Will not implement metadata packing (group varint or bitmapped types/nulls)
Will not implement field reordering
Will not modify the SSTable index (meaning we will only be able to store 1 row per span)
https://issues.apache.org/jira/browse/CASSANDRA-2319 implements most of the index changes necessary
- Will be based on Avro