Differences between revisions 31 and 32
Revision 31 as of 2013-03-28 23:56:30
Size: 7809
Comment:
Revision 32 as of 2013-07-11 16:04:16
Size: 1139
Comment: Rip out obsolete material in favor of DS docs and Patrick's presentations
Line 2: Line 2:
The Cassandra data model is designed for distributed data on a very large scale. It trades ACID-compliant data practices for important advantages in performance, availability, and operational manageability. Here are some resources for learning about the data model:

Cassandra is a partitioned row store, where rows are organized into tables with a required primary key.
Line 4: Line 4:
 * DataStax (formerly Riptano) reference documentation on the Cassandra data model:
  * [[http://www.datastax.com/docs/0.7/data_model/index|Cassandra data model (version 0.7)]]
  * [[http://www.datastax.com/docs/1.1/ddl/index|Cassandra data model (version 1.1)]]
 * [[http://maxgrinev.com/2010/07/09/a-quick-introduction-to-the-cassandra-data-model/|An Introduction to the data model]] by Max Grinev.
The first component of a table's primary key is the partition key; within a partition, rows are clustered by the remaining columns of the PK. Other columns may be indexed independent of the PK.
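
As a rough illustration of the partition key / clustering split, here is a minimal Python sketch. It is not Cassandra code: the `Table` class, the `sensor_id` and `reading_time` column names, and the in-memory dictionary are all assumptions invented for this example.

```python
import bisect
from collections import defaultdict


class Table:
    """Toy model of a partitioned row store (illustrative only)."""

    def __init__(self, partition_key, clustering_columns):
        self.partition_key = partition_key
        self.clustering_columns = tuple(clustering_columns)
        # partition key value -> list of (clustering tuple, row), kept sorted
        self.partitions = defaultdict(list)

    def insert(self, row):
        pk = row[self.partition_key]
        ck = tuple(row[c] for c in self.clustering_columns)
        partition = self.partitions[pk]
        keys = [k for k, _ in partition]
        i = bisect.bisect_left(keys, ck)
        if i < len(partition) and partition[i][0] == ck:
            partition[i] = (ck, row)  # same primary key: overwrite the row
        else:
            partition.insert(i, (ck, row))  # keep clustering order

    def read_partition(self, pk):
        # Rows come back in clustering order, whatever the insertion order was.
        return [row for _, row in self.partitions.get(pk, [])]


readings = Table("sensor_id", ["reading_time"])
readings.insert({"sensor_id": "s1", "reading_time": 20, "temp": 21.5})
readings.insert({"sensor_id": "s1", "reading_time": 10, "temp": 20.0})
readings.insert({"sensor_id": "s2", "reading_time": 5, "temp": 18.0})
```

Reading partition `s1` returns its two rows ordered by `reading_time`, while `s2` lives in a separate partition that could sit on a different node.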
Line 9: Line 6:
The basic concepts are:

This allows pervasive denormalization to "pre-build" resultsets at update time, rather than doing expensive joins across the cluster.
Line 11: Line 8:
 * Cluster: the machines (nodes) in a logical Cassandra instance. Clusters can contain multiple keyspaces.
 * Keyspace: a namespace for !ColumnFamilies, typically one per application.
 * !ColumnFamilies contain multiple columns, each of which has a name, value, and a timestamp, and which are referenced by row keys.
 * !SuperColumns can be thought of as columns that themselves have subcolumns.
DataStax has a good introduction to data modeling in Cassandra here:
Line 16: Line 10:
We'll start from the bottom up, moving from the leaves of Cassandra's data structure (columns) up to the root of the tree (the cluster).

 * [[http://www.datastax.com/documentation/cassandra/1.2/index.html#cassandra/ddl/ddl_anatomy_table_c.html#concept_ds_qqw_1dy_zj|http://www.datastax.com/documentation/cassandra/1.2/index.html#cassandra/ddl/ddl_anatomy_table_c.html]]
Line 18: Line 12:
= Columns =
The column is the lowest/smallest increment of data. It's a tuple (triplet) that contains a name, a value and a timestamp.
For more detail, see Patrick !McFadin's three-part series:
Line 21: Line 14:
Here's the thrift interface definition of a Column

{{{
struct Column {
  1: binary name,
  2: binary value,
  3: i64 timestamp,
}
}}}
And here's a column represented in JSON-ish notation:

{{{
{
  "name": "emailAddress",
  "value": "foo@bar.com",
  "timestamp": 123456789
}
}}}
All values are supplied by the client, including the 'timestamp'. This means that clocks on the clients should be synchronized (synchronizing clocks across the Cassandra servers is useful as well), as these timestamps are used for conflict resolution. In many cases the 'timestamp' is not used in client applications, and it becomes convenient to think of a column as a name/value pair. For the remainder of this document, 'timestamps' will be elided for readability. It is also worth noting that the name and value are binary values, although in many applications they are UTF-8 serialized strings.

Timestamps can be anything you like, but microseconds since 1970 is a convention. Whatever you use, it must be consistent across the application, otherwise earlier changes may overwrite newer ones.
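
To make the consequence of inconsistent timestamps concrete, here is a minimal Python sketch of timestamp-based conflict resolution. The `resolve` and `now_micros` helpers are hypothetical names for this example, and the tie-breaking rule (the incoming write wins) is a simplification, not a statement of what the server does.

```python
import time


def now_micros():
    # Microseconds since the Unix epoch, the convention mentioned above.
    return int(time.time() * 1_000_000)


def resolve(existing, incoming):
    """Keep whichever column version carries the larger timestamp.

    Ties go to the incoming write; that choice is an assumption of this sketch.
    """
    if existing is None or incoming["timestamp"] >= existing["timestamp"]:
        return incoming
    return existing


stored = None
stored = resolve(stored, {"name": "emailAddress", "value": "old@bar.com",
                          "timestamp": 2_000_000})
# A client whose clock runs behind writes later in wall-clock time,
# but supplies a smaller timestamp, so its update is discarded.
stored = resolve(stored, {"name": "emailAddress", "value": "new@bar.com",
                          "timestamp": 1_000_000})
```

After both writes, `stored` still holds `old@bar.com`: the newer change lost because its timestamp was not consistent with the earlier one.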

= Column Families =
A column family is a container for rows, analogous to the table in a relational system. Each row in a column family can be referenced by its key.

Column families have a configurable ordering applied to the columns within each row, which affects the behavior of the get_slice call in the thrift API. Out of the box ordering implementations include ASCII, UTF-8, Long, UUID (lexical or time), Date, combinations of these using CompositeType, and others.
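
The effect of the column ordering on slices can be sketched in plain Python. The `slice_columns` function below is a hypothetical stand-in, not the signature of the Thrift get_slice call, and the `key` argument only loosely mimics choosing a comparator such as a string type versus a long type.

```python
import bisect


def slice_columns(row, start, finish, key=lambda name: name):
    """Return (name, value) pairs whose names fall in [start, finish]
    under the ordering defined by `key`."""
    ordered = sorted(row.items(), key=lambda item: key(item[0]))
    sort_keys = [key(name) for name, _ in ordered]
    lo = bisect.bisect_left(sort_keys, key(start))
    hi = bisect.bisect_right(sort_keys, key(finish))
    return ordered[lo:hi]


row = {"10": "j", "9": "i", "100": "x"}
as_text = slice_columns(row, "1", "2")            # names compared as strings
as_long = slice_columns(row, "9", "10", key=int)  # names compared as integers
```

Under string ordering the names sort as "10", "100", "9", so a slice from "1" to "2" returns "10" and "100". Under integer ordering the same row sorts as 9, 10, 100, and a slice from 9 to 10 returns "9" and "10".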

= Rows =
In Cassandra, each column family is stored in a separate file, and the file is sorted in row (i.e. key) major order. Related columns, those that you'll access together, should be kept within the same column family.

The row key determines which machine the data is stored on. Thus, for each key you can have data from multiple column families associated with it. However, these are logically distinct, which is why the Thrift interface is oriented around accessing one !ColumnFamily per key at a time. (TODO given this, is the following JSON more confusing than helpful?)

A JSON representation of the rowkey -> column families -> column structure is

{{{
{
   "row_key1":{
      "Users":{
         "emailAddress":{"name":"emailAddress", "value":"foo@bar.com"},
         "webSite":{"name":"webSite", "value":"http://bar.com"}
      },
      "Stats":{
         "visits":{"name":"visits", "value":"243"}
      }
   },
   "row_key2":{
      "Users":{
         "emailAddress":{"name":"emailAddress", "value":"user2@bar.com"},
         "twitter":{"name":"twitter", "value":"user2"}
      }
   }
}
}}}
Note that the key "row_key1" identifies data in two different column families, "Users" and "Stats". This does not imply that data from these column families is related. The semantics of having data for the same key in two different column families is entirely up to the application. Also note that within the "Users" column family, "row_key1" and "row_key2" have different column names defined. This is perfectly valid in Cassandra. In fact there may be a virtually unlimited set of column names defined, which leads to fairly common use of the column name as a piece of runtime populated data. This is unusual in storage systems, particularly if you're coming from the RDBMS world.

= Keyspaces =
A keyspace is the first dimension of the Cassandra hash, and is the container for column families. Keyspaces are of roughly the same granularity as a schema or database (i.e. a logical collection of tables) in the RDBMS world. They are the configuration and management point for column families, and are also the structure on which batch inserts are applied.

= Super Columns =
So far we've covered "normal" columns and rows. Cassandra also supports super columns: columns whose values are columns; that is, a super column is a (sorted) associative array of columns.

One can thus think of columns and super columns in terms of maps: A row in a regular column family is basically a sorted map of column names to column values; a row in a super column family is a sorted map of super column names to maps of column names to column values.

A JSON description of this layout:

{{{
{
  "mccv": {
    "Tags": {
      "cassandra": {
        "incubator": {"incubator": "http://incubator.apache.org/cassandra/"},
        "jira": {"jira": "http://issues.apache.org/jira/browse/CASSANDRA"}
      },
      "thrift": {
        "jira": {"jira": "http://issues.apache.org/jira/browse/THRIFT"}
      }
    }
  }
}
}}}
Here my column family is "Tags". I have two super columns defined here, "cassandra" and "thrift". Within these I have specific named bookmarks, each of which is a column.

Just like normal columns, super columns are sparse: each row may contain as many or as few as it likes; Cassandra imposes no restrictions.
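
The "sorted map of sorted maps" view of a super column family row can be sketched with nested Python dictionaries. The `iter_subcolumns` helper is a hypothetical name for this example; plain dicts are unsorted, so the sketch sorts on iteration to imitate the ordering.

```python
def iter_subcolumns(row):
    """Yield (super column, column, value) in sorted order,
    imitating a sorted map of super columns to sorted maps of columns."""
    for super_name in sorted(row):
        for column_name in sorted(row[super_name]):
            yield super_name, column_name, row[super_name][column_name]


# The "Tags" row for key "mccv" from the JSON above, with names mapped to values.
tags_row = {
    "thrift": {"jira": "http://issues.apache.org/jira/browse/THRIFT"},
    "cassandra": {
        "jira": "http://issues.apache.org/jira/browse/CASSANDRA",
        "incubator": "http://incubator.apache.org/cassandra/",
    },
}
ordered = list(iter_subcolumns(tags_row))
```

Iteration visits "cassandra" before "thrift", and within "cassandra" visits "incubator" before "jira", regardless of the order in which the entries were written.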

= Range queries =
Cassandra's partitioning scheme is pluggable, and a new partitioner can be written with a relatively small amount of code. Out of the box, Cassandra provides the hash-based RandomPartitioner and a ByteOrderedPartitioner. RandomPartitioner gives you pretty good load balancing with no further work required. ByteOrderedPartitioner, on the other hand, lets you perform range queries on the keys you have stored, but requires choosing node tokens carefully or active load balancing. Systems that only support hash-based partitioning cannot perform range queries efficiently.
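
The trade-off between the two partitioners can be sketched in Python. This is a simplified model, not the server's token computation: `random_token`, `byte_ordered_token`, and `key_range` are hypothetical helpers, and the sample keys are made up.

```python
import hashlib


def random_token(key: bytes) -> int:
    # Hash-based placement: the key's MD5 digest read as an integer.
    return int.from_bytes(hashlib.md5(key).digest(), "big")


def byte_ordered_token(key: bytes) -> bytes:
    # Order-preserving placement: the key bytes themselves.
    return key


def key_range(sorted_keys, start, end):
    """Keys in [start, end), scanned from a list already in token order."""
    return [k for k in sorted_keys if start <= k < end]


keys = [b"video:1", b"user:carol", b"user:alice", b"user:bob"]
by_bytes = sorted(keys, key=byte_ordered_token)  # neighbours share a prefix
by_hash = sorted(keys, key=random_token)         # neighbours are unrelated

users = key_range(by_bytes, b"user:", b"user;")
```

In `by_bytes` all `user:` keys are adjacent, so a range query touches one contiguous run of tokens. In `by_hash` the same keys are scattered across the token space, which balances load but means a key range maps to no contiguous region.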

= Modeling your application =
Unlike with relational systems, where you model entities and relationships and then just add indexes to support whatever queries become necessary, with Cassandra you need to think about what queries you want to support efficiently ahead of time, and model appropriately. Since there are no automatically-provided indexes, the ratio of !ColumnFamilies to queries will be much closer to one-to-one than the ratio of tables to queries in a relational design. Don't be afraid to denormalize accordingly; Cassandra is much, much faster at writes than relational systems, without giving up speed on reads.
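
A minimal Python sketch of the "one !ColumnFamily per query" idea follows. The two dictionaries stand in for column families, and the names `videos_by_id`, `videos_by_user`, and `add_video` are assumptions invented for this example.

```python
from collections import defaultdict

# Two hypothetical "column families", one per query the application needs.
videos_by_id = {}                   # query: look up a video by its id
videos_by_user = defaultdict(dict)  # query: list the videos a user uploaded


def add_video(video_id, user, title):
    """Write the same fact to every structure that a query will read from."""
    videos_by_id[video_id] = {"user": user, "title": title}
    videos_by_user[user][video_id] = title


add_video("v1", "alice", "Intro")
add_video("v2", "alice", "Part 2")
add_video("v3", "bob", "Other")
```

Each query is now a single key lookup: `videos_by_user["alice"]` already holds the result set, built at write time instead of being assembled by a join at read time. The cost is an extra write per query pattern.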

See the CassandraLimitations page for other things to keep in mind when designing a model.

= Attribution =
Thanks to phatduckk and asenchi for coming up with examples, text, and reviewing concepts.
 1. [[http://www.youtube.com/watch?v=px6U2n74q3g|The Data Model is Dead; Long live the Data Model]]
 1. [[http://www.youtube.com/watch?v=qphhxujn5Es|Become a Super Modeler]]
 1. [[http://www.youtube.com/watch?v=HdJlsOZVGwM&list=PLqcm6qE9lgKJzVvwHprow9h7KMpb5hcUU&index=11|The World's Next Top Data Model]]

Introduction

Cassandra is a partitioned row store, where rows are organized into tables with a required primary key.

The first component of a table's primary key is the partition key; within a partition, rows are clustered by the remaining columns of the PK. Other columns may be indexed independent of the PK.

This allows pervasive denormalization to "pre-build" resultsets at update time, rather than doing expensive joins across the cluster.

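DataStax has a good introduction to data modeling in Cassandra here:

  * http://www.datastax.com/documentation/cassandra/1.2/index.html#cassandra/ddl/ddl_anatomy_table_c.html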

For more detail, see Patrick McFadin's three-part series:

  1. The Data Model is Dead; Long live the Data Model

  2. Become a Super Modeler

  3. The World's Next Top Data Model

DataModel (last edited 2014-07-22 16:34:28 by JonathanEllis)