Differences between revisions 5 and 6
Revision 5 as of 2009-11-16 19:08:18
Size: 6496
Editor: 78-86-128-147
Revision 6 as of 2009-11-16 19:19:31
Size: 6499
Editor: 78-86-128-147
Deletions are marked like this. Additions are marked like this.
Line 1: Line 1:
Line 188: Line 186:
=== Consistency in a BASE world === == Consistency ==
See also the [[API|API]] documentation.

This is an overview of Cassandra architecture aimed at Cassandra users.

Developers should probably look at the Developers links on the wiki's front page

Information is mainly based on J Ellis OSCON 09 presentation

Why Cassandra

  • MySQL drives too many random I/Os
  • File-based solutions require far too many locks

The new face of data

  • Scale out, not up
  • Online load balancing, cluster growth
  • Flexible schema
  • Key-oriented queries
  • CAP-aware

CAP theorem

The CAP theorem (Brewer00) states that you have to pick two of Consistency, Availability, Partition tolerance. (You can't have the three at the same time).

Cassandra values Availability and Partitioning tolerance (AP). Tradeoffs between consistency and latency are tunable in Cassandra. You can get strong consistency with Cassandra (with an increased latency). But, you can't get row locking: that is a definite win for HBase.

Note: Hbase values Consistency and Partitioning tolerance (CP)

History and approaches

Two famous papers

  • Bigtable: A distributed storage system for structured data, 2006
  • Dynamo: amazon's highly available keyvalue store, 2007

Two approaches

  • Bigtable: "How can we build a distributed db on top of GFS?"
  • Dynamo: "How can we build a distributed hash table appropriate for the data center?"

Cassandra 10,000 ft summary

  • Dynamo partitioning and replication
  • Log-structured ColumnFamily data model similar to Bigtable's

Cassandra highlights

  • High availability
  • Incremental scalability
  • Eventually consistent
  • Tunable tradeoffs between consistency and latency
  • Minimal administration
  • No SPF (Single Point of Failure)

p2p distribution model -- which drives the consistency model -- means there is no single point of failure.

Keys distribution and Partition

Dynamo architecture & Lookup

In a ring of nodes A, B, C, D, E, F and G Nodes B, C and D store keys in the range (a,b) including key k

You can decide where the key should go in Cassandra using the IntitialToken parameter for your Partitioner, see Storage Configuration

Architecture details

  • O(1) node lookup
  • Explicit replication
  • Eventually consistent

Architecture layers

Core Layer

Middle Layer

Top Layer

Messaging service
Gossip Failure detection
Cluster state

Commit log

Hinted handoff
Read repair
Admin tools


Any node Partitioner Commitlog, memtable SSTable Compaction Wait for W responses

Write model:

There are two write modes:

  • Quorum write: blocks until quorum is reached

  • Async write: sends request to any node. That node will push the data to appropriate nodes but return to client immediately

If node down, then write to another node with a hint saying where it should be written two. Harvester every 15 min goes through and find hints and moves the data to the appropriate node

Write path

At write time,

  • you first write to a disk commit log (sequential)

  • After write to log it is sent to the appropriate nodes
  • Each node receiving write first records it in a local log, then makes update to appropriate memtables (one for each column family). A Memtable is Cassandra's in-memory representation of key/value pairs before the data gets flushed to disk as an SSTable.

  • Memtables are flushed to disk when:

    • Out of space
    • Too many keys (128 is default)
    • Time duration (client provided – no cluster clock)
  • When memtables written out two files go out:
    • Data File (SSTable). A SSTable (terminology borrowed from Google) stands for Sorted Strings Table and is a file of key/value string pairs, sorted by keys.

    • Index File (SSTable Index). (Similar to Hadoop MapFile / Tfile)

      • (Key, offset) pairs (points into data file)
      • Bloom filter (all keys in data file)
  • When a commit log has had all its column families pushed to disk, it is deleted
  • Compaction: Data files accumulate over time. Periodically data files are merged sorted into a new file (and creates new index)

    • Merge keys
    • Combine columns
    • Discard tombstones

Write properties

  • No reads
  • No seeks
  • Fast

  • Atomic within ColumnFamily

  • Always writable


Deletion marker (tombstone) necessary to suppress data in older SSTables, until compaction Read repair complicates things a little Eventually consistent complicates things more Solution: configurable delay before tombstone GC, after which tombstones are not repaired


Read path

  • Any node
  • Partitioner
  • Wait for R responses
  • Wait for N -­ R responses in the background and perform read repair

Cassandra read properties

  • Read multiple SSTables
  • Slower than writes (but still fast)
  • Seeks can be mitigated with more RAM
  • Scales to billions of rows


See also the API documentation.

If R=read replica count W=write replica count N=replication factor Q=QUORUM (Q = N / 2 + 1)

  • If W + R > N, you will have consistency

  • W=1, R=N
  • W=N, R=1
  • W=Q, R=Q where Q = N / 2 + 1

Cassandra provides consistency when R + W > N (read replica count + write replica count > replication factor).

You get consistency if R + W > N, where R is the number of records to read, W is the number of records to write, and N is the replication factor. A ConsistencyLevel of ONE means R or W is 1. A ConsistencyLevel of QUORUM means R or W is ceiling((N+1)/2). A ConsistencyLevel of ALL means R or W is N. So if you want to write with a ConsistencyLevel of ONE and then get the same data when you read, you need to read with ConsistencyLevel ALL.

Cassandra vs MySQL with 50GB of data



~300ms write

~0.12ms write

~350ms read

~15ms read

ArchitectureOverview (last edited 2015-12-04 04:08:37 by MichaelEdge)