File Format Design Doc

This page describes a planned replacement file format for SSTables.

Requirements

Influences


Current Implementation

To avoid overloading words too heavily, we will call a series of ordered rows in an SSTable a 'span'. Since SSTables contain data for a single column family in sorted order, spans share these properties as well. One way to visualize a simple span with depth 3 (aka, containing super columns) is to arrange it as a table:

row key

name1

name2

value

cheese

brie

flavor

3.4

cheese

gouda

flavor

5.6

cheese

gouda

origin

france

cheese

swiss

flavor

2.6

row key

name1

name2

value

fruit

apple

flavor

4.2

fruit

pear

flavor

4.9

fruit

pear

origin

china

This representation of a span involves a lot of redundancy in representing whether the parent of a particular column has changed. If we remove that particular redundancy, we get a tree structure:

row key

name1

name2

value

cheese

brie

flavor

3.4

gouda

flavor

5.6

origin

france

swiss

flavor

2.6

row key

name1

name2

value

fruit

apple

flavor

4.2

pear

flavor

4.9

origin

china

The current implementation of SSTables lays data out on disk in approximately this way: data for rows is stored contiguously. In relation to the table representation above, we divide the tree into pieces using horizontal "chunks". One must seek to the root of the tree for a row in order to read the row index and determine which chunk the next level of the tree is stored in.

Additionally, only the first level of the tree is indexed: in order to find a particular column at the level labeled "name2", you would need to deserialize all columns at that level, which makes large super columns impractical.

Finally, there is a second type of redundancy that the current design does not tackle: the column names at level "name2" are frequently repeated, but since rows are stored independently, we don't normalize those values. For narrow rows (like those shown), removing this redundancy will be our largest win.

Metadata

Metadata is currently implemented such that column parents have metadata that covers their entire range: this means that you cannot delete arbitrary slices, only exact keys or names.


Proposed Implementation

Because we will be storing multiple columns per SSTable, our design will bear the most similarity to RCFile (rather than the column-per-file approach taken in Dremel). But because we allow for nesting via super columns (and hopefully with a more flexible representation in the future), we need to take hints from Dremel's serialization to allow for efficient storage of parent and null information.

Vertical chunks

Rather than slicing the span into chunks horizontally, we will use vertical chunks (and break particularly wide rows into multiple spans):

row key

cheese

fruit

name1

brie

gouda

swiss

apple

pear

name2

flavor

flavor

origin

flavor

flavor

flavor

origin

value

3.4

5.6

france

2.6

4.2

4.9

china

This representation achieves some of the benefits for compression shown in the RCFile paper: similar values are stored much closer together. But we have lost some information!: Using the tables above, it is impossible to determine which fields at level "name1" are cheeses, and which are fruits.

Parent representation

We need to store parent information, and one method comes from Dremel's clever representation for arbitrary nesting. We add a single boolean to each tuple that toggles to represent parent changes:

row key

parent_change

cheese

0

fruit

0

name1

parent_change

brie

0

gouda

0

swiss

0

apple

1

pear

1

name2

parent_change

flavor

0

flavor

1

origin

1

flavor

0

flavor

1

flavor

0

origin

0

value

parent_change

3.4

0

5.6

1

france

0

2.6

1

4.2

0

4.9

1

china

0

The parent change flags can be represented compactly using a bitmap, and type information can be stored using a byte per value.

Metadata

Cassandra also needs to encode metadata about tuples and ranges of tuples in order to represent creation and deletion timestamps. For both value tuples and range tuples, a varying number (depending on value and range type) of timestamps will need to be encoded.

Types

Range Metadata

Ranges are stored as two values embedded in the array of tuples in a chunk, and marked with a range_* type as described above.

name1

parent_change

type

brie

0

standard

gouda

0

standard

havarti

0

range_begin

muenster

0

range_end

swiss

0

standard

apple

1

standard

pear

1

standard

This example adds type information and range metadata for values at level "name1" between 'havarti' and 'muenster': the chunk for the "name1" level stores a pair of tuples for the 'cheese' parent. The end result is that the span stores a tombstone from ('cheese', 'havarti', <empty>) to ('cheese', 'muenster', null), where <empty> is the smallest value, and null is the largest value.

In cases where parents are too wide to fit in a single span, range metadata is also used to indicate that only a portion of a parent is present in a given span. If a span ends with a non-null range_end marker, then it should be assumed that more spans are present.

Value Metadata

Unlike range metadata, value (column) metadata will only be stored in the last chunk of a span (since the entire span is used to represent the path to the column, you don't really care about metadata until you've built the full path to the value).


Schema

To iterate quickly, the initial versions of the file format will be implemented using Avro. The current Avro schema:

   1 /**
   2  * The largest record in an SSTable: consecutive Chunks represent a
   3  * 'span' of tuples, where each chunk represents a level of nesting.
   4  */
   5 record Chunk {
   6  /** The index of the chunk in the span (mostly for readability). */
   7  int index;
   8 
   9  /** Values stored at this level. */
  10  array<bytes> values;
  11 
  12  /**
  13   * Packed metadata for values.
  14   * type: deleted, expiring, standard, range_*
  15   */
  16  bytes types;
  17 
  18  /** parent: a single bit to represent parent changes */
  19  bytes parents;
  20 
  21  /* TODO: Should offset or rl encode. */
  22  /**
  23   * A variable number of timestamps depending on value type.
  24   */
  25  array<long> timestamps;
  26 }

Disk Layout

Chunk

bytes

magic

16

encoded_length

4

encoded

variable

hash

4

Roadmap

Implementation has started on this design at https://github.com/stuhood/cassandra/tree/file-format

v1

Was implemented last year, and bears no similarity to this design.

v2

The first posted version of the patch:

References

stats