
Cassandra Data Model and Operations

This page was created by someone trying to understand Cassandra. Until it has been reviewed and blessed by someone who really knows Cassandra, read it at your own risk.

This page is an alternate attempt at capturing the Cassandra data model and its operations. The descriptions below show the original Thrift API (as of 0.3) as well as a simplified notation borrowed from the Bright Yellow Cow blog entry (http://www.brightyellowcow.com/blog/Evaluating-the-API-of-Cassandra-BigTable-.html), i.e. using [] to mean 'list of' and ( , ) for tuple construction.

Simple Column families

A column family has a name and an arbitrary number of columns; each column is a (name, value, timestamp) tuple. Columns may be sorted by name or by time, which affects range operations on them. In pseudo-notation:

No Format
family -> [(name, value, timestamp)]

Since each (top-level) row has an arbitrary set of columns in each column family, we can really think of this as a two-dimensional map:

No Format
family -> [(key1, key2, value, timestamp)]

In the Thrift API all this is defined as:

No Format
struct column_t {
   1: string                        columnName,
   2: binary                        value,
   3: i64                           timestamp,
}

typedef map< string, list<column_t>  > column_family_map
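
As a rough illustration of the two-dimensional view, the following Python sketch models a column family as a nested dictionary. It is an in-memory toy, not Cassandra client code: the names Column, family and flatten, and the sample keys, are made up for this example.

```python
from typing import NamedTuple


class Column(NamedTuple):
    """Mirrors the three fields of column_t: name, value, timestamp."""
    name: str
    value: bytes
    timestamp: int


# family -> [(key1, key2, value, timestamp)], stored here as
# {row key (key1): {column name (key2): Column}}
family = {
    "alice": {
        "email": Column("email", b"alice@example.com", 1),
        "city": Column("city", b"Paris", 2),
    },
    "bob": {
        "email": Column("email", b"bob@example.com", 1),
    },
}


def flatten(cf):
    """Return the family as a list of (key1, key2, value, timestamp) tuples."""
    return [
        (key1, col.name, col.value, col.timestamp)
        for key1, columns in cf.items()
        for col in columns.values()
    ]


rows = flatten(family)
```

Each row key maps to its own set of columns, so two rows in the same family need not share any column names.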

insert

Insert a column.

No Format
insert(family, key1, key2, value, timestamp)

I believe the block_for parameter makes the call wait until N replicas have acknowledged the write. From the Thrift API:

No Format
void insert(1:string tablename, 2:string key, 3:string columnFamily_column, 4:binary cellData,
            5:i64 timestamp, 6:i32 block_for=0)
throws (1: InvalidRequestException ire, 2: UnavailableException ue),

remove

Remove a column.

No Format
remove(family, key1, key2, timestamp)

The timestamp specifies exactly which insertion is removed (the column could have been re-inserted "later"). From the Thrift API:

No Format
void remove(1:string tablename, 2:string key, 3:string columnFamily_column, 4:i64 timestamp,
            5:i32 block_for=0)
throws (1: InvalidRequestException ire, 2: UnavailableException ue),
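
To make the role of the timestamp concrete, here is a small in-memory sketch of one plausible reading of insert and remove. It assumes (this is not confirmed by the API listing above) that a write or remove only takes effect when its timestamp is at least as new as the stored column's. It ignores replication and block_for entirely and keeps no record of past removals.

```python
def insert(family, key1, key2, value, timestamp):
    """Store the column unless a newer version is already present (assumed rule)."""
    row = family.setdefault(key1, {})
    current = row.get(key2)
    if current is None or current[1] <= timestamp:
        row[key2] = (value, timestamp)


def remove(family, key1, key2, timestamp):
    """Drop the column only if the stored version is not newer than timestamp."""
    row = family.get(key1, {})
    current = row.get(key2)
    if current is not None and current[1] <= timestamp:
        del row[key2]


family = {}
insert(family, "alice", "email", b"old@example.com", 10)
insert(family, "alice", "email", b"new@example.com", 20)  # re-inserted "later"
remove(family, "alice", "email", 10)  # aimed at the older insertion only
survivor = family["alice"]["email"]
```

In this toy model the remove with timestamp 10 leaves the column written at timestamp 20 in place, which matches the remark that the column could have been re-inserted "later".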

get_column

Retrieve a specific column for a key.

No Format
get_column(family, key1, key2) -> (key2, value, timestamp)

From the Thrift API:

No Format
column_t       get_column(1:string tablename, 2:string key, 3:string columnFamily_column)
throws (1: InvalidRequestException ire, 2: NotFoundException nfe),

get_slice

Retrieve all columns for a key:

No Format
get_slice(family, key1) -> [(key2, value, timestamp)]

plus start/count parameters allow pagination of the results. From the Thrift API:

No Format
list<column_t> get_slice(1:string tablename, 2:string key, 3:string columnFamily_column,
                         4:i32 start=-1, 5:i32 count=-1)
throws (1: InvalidRequestException ire, 2: NotFoundException nfe),
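
The start/count pagination can be sketched over a plain dictionary as follows. This is a toy model, not the server implementation: it assumes columns are name-sorted, and that -1 means "from the beginning" for start and "no limit" for count, which the listing above suggests through its defaults but does not state.

```python
def get_slice(family, key1, start=-1, count=-1):
    """Return name-sorted (key2, value, timestamp) tuples for key1, paginated."""
    row = family.get(key1, {})
    columns = [(name, value, ts) for name, (value, ts) in sorted(row.items())]
    offset = 0 if start < 0 else start  # assumption: -1 means "from the start"
    if count < 0:                       # assumption: -1 means "no limit"
        return columns[offset:]
    return columns[offset:offset + count]


family = {
    "alice": {
        "city": (b"Paris", 2),
        "email": (b"a@example.com", 1),
        "zip": (b"75001", 3),
    }
}
page = get_slice(family, "alice", start=1, count=1)
everything = get_slice(family, "alice")
```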

get_slice_by_name_range

Retrieve a range of columns for a key:

No Format
get_slice_by_name_range(family, key1, key2_start, key2_end) -> [(key2, value, timestamp)]

plus a count parameter allows limiting the result. From the Thrift API:

No Format
list<column_t> get_slice_by_name_range(1:string tablename, 2:string key, 3:string columnFamily,
                                       4:string start, 5:string end, 6:i32 count=-1)
throws (1: InvalidRequestException ire, 2: NotFoundException nfe),
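
A name-range slice could look like this over a toy dictionary. Two things are assumptions of the sketch rather than facts from the API: both ends of the range are treated as inclusive, and names are compared as plain strings.

```python
def get_slice_by_name_range(family, key1, start, end, count=-1):
    """Return columns of key1 whose names fall in [start, end], name-sorted."""
    row = family.get(key1, {})
    selected = [
        (name, value, ts)
        for name, (value, ts) in sorted(row.items())
        if start <= name <= end  # assumption: both bounds are inclusive
    ]
    return selected if count < 0 else selected[:count]


family = {
    "alice": {
        "a1": (b"1", 1),
        "a2": (b"2", 1),
        "b1": (b"3", 1),
        "c1": (b"4", 1),
    }
}
in_range = get_slice_by_name_range(family, "alice", "a2", "b9")
first_only = get_slice_by_name_range(family, "alice", "a2", "b9", count=1)
```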

get_slice_by_names

Retrieve a specific set of columns for a key:

No Format
get_slice_by_names(family, key1, [key2_1, key2_2, ..., key2_N]) -> [(key2, value, timestamp)]

From the Thrift API:

No Format
list<column_t> get_slice_by_names(1:string tablename, 2:string key, 3:string columnFamily, 4:list<string> columnNames)
throws (1: InvalidRequestException ire, 2: NotFoundException nfe),

get_slice_from

Retrieve columns for a key starting from a specific column.

No Format
get_slice_from(family, key1, key2_start) -> [(key2, value, timestamp)]

plus an ascending/descending flag and a count determine the direction and limit of the enumeration. From the Thrift API:

No Format
list<column_t> get_slice_from(1:string tablename, 2:string key, 3:string columnFamily_column,
                              4:bool isAscending, 5:i32 count)
throws (1: InvalidRequestException ire, 2: NotFoundException nfe),
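
The direction flag and count can be illustrated with a small sketch. It follows the pseudo-notation (family, key1, key2_start as separate arguments) rather than the Thrift signature, and it assumes the starting column itself is included in the result.

```python
def get_slice_from(family, key1, key2_start, is_ascending, count):
    """Enumerate up to count columns of key1, starting at key2_start."""
    row = family.get(key1, {})
    names = sorted(row, reverse=not is_ascending)
    if is_ascending:
        names = [n for n in names if n >= key2_start]  # walk towards larger names
    else:
        names = [n for n in names if n <= key2_start]  # walk towards smaller names
    return [(n, row[n][0], row[n][1]) for n in names[:count]]


family = {"k": {"a": (b"1", 1), "b": (b"2", 1), "c": (b"3", 1), "d": (b"4", 1)}}
forward = get_slice_from(family, "k", "b", True, 2)
backward = get_slice_from(family, "k", "c", False, 5)
```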

get_columns_since

Retrieves columns for a key starting from a specific timestamp.

No Format
get_columns_since(family, key1, key2, timestamp) -> [(key2, value, timestamp)]

From the Thrift API:

No Format
list<column_t> get_columns_since(1:string tablename, 2:string key, 3:string columnFamily_column, 4:i64 timeStamp)
throws (1: InvalidRequestException ire, 2: NotFoundException nfe),
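
As a sketch, filtering by timestamp over a toy dictionary might look like this. It assumes the given timestamp is inclusive and returns the oldest match first. The key2 argument of the pseudo-notation is left out because its role is not explained on this page.

```python
def get_columns_since(family, key1, timestamp):
    """Return columns of key1 written at or after timestamp, oldest first."""
    row = family.get(key1, {})
    recent = [
        (name, value, ts)
        for name, (value, ts) in row.items()
        if ts >= timestamp  # assumption: the bound is inclusive
    ]
    return sorted(recent, key=lambda col: col[2])


family = {
    "alice": {
        "city": (b"Paris", 5),
        "email": (b"a@example.com", 12),
        "zip": (b"75001", 9),
    }
}
recent = get_columns_since(family, "alice", 9)
```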

get_column_count

Return the number of columns for a key.

No Format
get_column_count(family, key1, key2) -> count

From the Thrift API:

No Format
i32 get_column_count(1:string tablename, 2:string key, 3:string columnFamily_column)
throws (1: InvalidRequestException ire),

batch_insert

Insert a batch of columns for a key.

No Format
batch_insert(family, key1, [(key2, value, timestamp)])

From the Thrift API:

No Format
struct batch_mutation_t {
   1: string                        table,
   2: string                        key,
   3: column_family_map             cfmap,
}

void     batch_insert(1: batch_mutation_t batchMutation, 2:i32 block_for=0)
throws (1: InvalidRequestException ire, 2: UnavailableException ue),
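
The shape of batch_mutation_t is easier to see with data in it. The sketch below builds an equivalent Python dictionary and applies it to an in-memory store. The store layout, the table name "Users" and the family names are invented for the example, and block_for is ignored.

```python
def batch_insert(store, batch_mutation):
    """Apply every column in batch_mutation['cfmap'] to one row of the store."""
    table = store.setdefault(batch_mutation["table"], {})
    key1 = batch_mutation["key"]
    for family_name, columns in batch_mutation["cfmap"].items():
        row = table.setdefault(family_name, {}).setdefault(key1, {})
        for key2, value, timestamp in columns:
            row[key2] = (value, timestamp)


store = {}
mutation = {
    "table": "Users",
    "key": "alice",
    # column_family_map: family name -> list of (name, value, timestamp)
    "cfmap": {
        "profile": [("email", b"alice@example.com", 1), ("city", b"Paris", 1)],
        "stats": [("logins", b"42", 1)],
    },
}
batch_insert(store, mutation)
```

Because cfmap is a map keyed by column family name, a single mutation can carry columns for several families of the same row key.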

Super Column

A super column family has a name and an arbitrary number of super columns; each super column has an arbitrary number of columns. "Currently" super columns are always name-sorted, and their subcolumns are always time-sorted. In pseudo-notation:

No Format
super_family -> [(super_column, [(column_name, value, timestamp)])]

It is tempting but inaccurate to think of this as a three-dimensional map:

No Format
super_family -> [(key1, key2, key3, value, timestamp)]

What's more accurate is to continue thinking of this as a two-dimensional map, just like regular column families, but where the values are really sets of name-value pairs (plus timestamps to be accurate). So it's really like this:

No Format
Simple column families:
  column_family -> [(key1, key2, value, timestamp)]
Super column families:
  column_family -> [(key1, key2, [(key3, value, timestamp)])]

In the Thrift API all this is defined as:

No Format
struct superColumn_t {
   1: string           name,
   2: list<column_t>   columns,
}

typedef map< string, list<superColumn_t>  > superColumn_family_map
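
The "two-dimensional map whose values are sets of columns" view can be sketched in Python as a dictionary nested one level deeper than for a simple family. The sample data and the helper name get_super_column are made up, and returning the oldest subcolumn first is an assumption about what "time-sorted" means.

```python
super_family = {
    "alice": {                      # key1: row key
        "address": {                # key2: super column name
            "city": (b"Paris", 2),  # key3: subcolumn name -> (value, timestamp)
            "zip": (b"75001", 1),
        },
        "contact": {
            "email": (b"alice@example.com", 3),
        },
    },
}


def get_super_column(sf, key1, key2):
    """Return (key2, [(key3, value, timestamp)]) with subcolumns time-sorted."""
    subcolumns = sf[key1][key2]
    ordered = sorted(
        ((key3, value, ts) for key3, (value, ts) in subcolumns.items()),
        key=lambda col: col[2],  # assumption: oldest subcolumn first
    )
    return (key2, ordered)


address = get_super_column(super_family, "alice", "address")
```

The whole list of subcolumns comes back as the value of the (key1, key2) cell, which is why the three-dimensional picture is misleading.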

get_superColumn

Retrieves a super column from a column family for a key.

No Format
get_superColumn(super_family, key1, key2) -> (key2, [(key3, value, timestamp)])

From the Thrift API:

No Format
superColumn_t get_superColumn(1:string tablename, 2:string key, 3:string columnFamily)
throws (1: InvalidRequestException ire, 2: NotFoundException nfe),

Note that the 3rd argument should really be called columnFamily_superColumnName.

get_slice_super

Retrieve the super columns in a super column family for a key.

No Format
get_slice_super(super_family, key1) -> [(key2, [(key3, value, timestamp)])]

The start/count parameters allow pagination of the results. From the Thrift API:

No Format
list<superColumn_t> get_slice_super(1:string tablename, 2:string key, 3:string columnFamily_superColumnName,
                                    4:i32 start=-1, 5:i32 count=-1)
throws (1: InvalidRequestException ire),

Note that the 3rd argument should really be called columnFamily.

get_slice_super_by_names

Retrieve a set of super columns in a super column family.

No Format
get_slice_super_by_names(family, key1, [key2_1, key2_2, ..., key2_N]) -> [(key2, [(key3, value, timestamp)])]

From the Thrift API:

No Format
list<superColumn_t> get_slice_super_by_names(1:string tablename, 2:string key, 3:string columnFamily,
                                             4:list<string> superColumnNames)
throws (1: InvalidRequestException ire),

batch_insert_superColumn

Insert a super column.

No Format
batch_insert_superColumn(family, key1, key2, [(key3, value, timestamp)])

From the Thrift API:

No Format
struct batch_mutation_super_t {
   1: string                        table,
   2: string                        key,
   3: superColumn_family_map        cfmap,
}

void batch_insert_superColumn(1:batch_mutation_super_t batchMutationSuper, 2:i32 block_for=0)
throws (1: InvalidRequestException ire, 2: UnavailableException ue),

Other operations

get_key_range

Retrieve the list of keys that exist in a range. A key exists if at least one column in one column family exists for the key. A list of column families can be passed into the call to restrict the search to columns in those families.

No Format
get_key_range([family_1, family_2, ..., family_N], key1_start, key1_end) -> [key1_1, key1_2, ..., key1_M]

From the Thrift API:

No Format
# range query: returns matching keys
list<string> get_key_range(1:string tablename, 2:list<string> columnFamilies=[], 3:string startWith="", 4:string stopAt="", 
                             5:i32 maxResults=1000)
throws (1: InvalidRequestException ire),
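
The "a key exists if at least one column exists" rule can be sketched over an in-memory table as follows. Several details are assumptions made for the example: keys are compared as plain strings, both bounds are inclusive, and an empty stop_at means "no upper bound". The table layout {family: {key: {column: ...}}} is also invented.

```python
def get_key_range(table, column_families=(), start_with="", stop_at="",
                  max_results=1000):
    """Return sorted keys in the range that have at least one column."""
    families = column_families or table.keys()  # empty list: search every family
    keys = set()
    for name in families:
        for key, columns in table.get(name, {}).items():
            if columns:  # a key only exists if it has at least one column
                keys.add(key)
    in_range = [
        k for k in sorted(keys)
        if k >= start_with and (stop_at == "" or k <= stop_at)
    ]
    return in_range[:max_results]


table = {
    "profile": {
        "alice": {"email": (b"a@example.com", 1)},
        "bob": {"email": (b"b@example.com", 1)},
        "carol": {},  # no columns, so this key does not "exist"
    },
    "stats": {"dave": {"logins": (b"42", 1)}},
}
all_keys = get_key_range(table)
profile_from_b = get_key_range(table, ["profile"], start_with="b")
```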

touch

Intended to force index information for the key into the cache, but it is buggy and is to be deprecated.

No Format
touch(key1)

From the Thrift API:

No Format
oneway void touch(1:string key, 2:bool fData),
