(WORK IN PROGRESS!)

This is an overview of the Cassandra architecture, aimed at Cassandra users.

Developers should probably look at the Developers links on the wiki's front page.

Information is mainly based on Jonathan Ellis's OSCON '09 presentation.

Motivation

  • Scaling reads to a relational database is hard
  • Scaling writes to a relational database is virtually impossible
    ... and when you do, it usually isn't relational anymore

The new face of data

  • Scale out, not up
  • Online load balancing, cluster growth
  • Flexible schema
  • Key-oriented queries
  • CAP-aware

CAP theorem

  • Pick two of Consistency, Availability, Partition tolerance

Two famous papers

  • Bigtable: A distributed storage system for structured data, 2006
  • Dynamo: Amazon's highly available key-value store, 2007

Two approaches

  • Bigtable: "How can we build a distributed db on top of GFS?"
  • Dynamo: "How can we build a distributed hash table appropriate for the data center?"

10,000 ft summary

  • Dynamo partitioning and replication
  • Log-structured ColumnFamily data model similar to Bigtable's

Cassandra highlights

  • High availability
  • Incremental scalability
  • Eventually consistent
  • Tunable tradeoffs between consistency and latency
  • Minimal administration
  • No SPOF (single point of failure)

Dynamo architecture & lookup

Architecture details

  • O(1) node lookup
  • Explicit replication
  • Eventually consistent
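The O(1) lookup and explicit replication come from Dynamo-style consistent hashing: each node owns a token on a ring, a key hashes to a ring position, and its replicas are the next n nodes clockwise. A minimal sketch, assuming MD5 tokens and invented node addresses (not Cassandra's actual partitioner code):

{{{
import hashlib
from bisect import bisect

class ToyRing:
    """Toy consistent-hash ring: sorted (token, node) pairs."""

    def __init__(self, nodes):
        self.ring = sorted((self._token(n), n) for n in nodes)

    @staticmethod
    def _token(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def replicas(self, key, n=3):
        """Walk clockwise from the key's token. Routing is O(1) in
        Dynamo terms because every node knows the entire ring."""
        i = bisect(self.ring, (self._token(key), ''))
        return [self.ring[(i + k) % len(self.ring)][1] for k in range(n)]

ring = ToyRing(['10.0.0.1', '10.0.0.2', '10.0.0.3', '10.0.0.4'])
print(ring.replicas('users:jsmith'))   # the three nodes that store this key
}}}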

Architecture layers

  • Messaging service
  • Gossip
  • Failure detection
  • Cluster state
  • Partitioner
  • Replication
  • Commit log
  • Memtable
  • SSTable
  • Indexes
  • Compaction
  • Tombstones
  • Hinted handoff
  • Read repair
  • Bootstrap
  • Monitoring
  • Admin tools

Writes

Write flow: any node → partitioner → commit log, memtable → SSTable → compaction; wait for W responses.

Write model:

There are two write modes:

  • Quorum write: blocks until quorum is reached

  • Async write: sends the request to any node. That node pushes the data to the appropriate nodes but returns to the client immediately

If a node is down, the write goes to another live node with a hint saying where it should have been written (hinted handoff). A harvester goes through every 15 minutes, finds hints, and moves the data to the appropriate node.
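A minimal sketch of the two write modes and of hinted handoff, assuming an invented Node class and harvester loop (stand-ins for Cassandra's messaging internals, not its real API):

{{{
import random

class Node:
    def __init__(self, name, up=True):
        self.name, self.up = name, up
        self.data, self.hints = {}, []        # stored rows, parked hints

def write(replicas, live_nodes, key, value, mode='quorum', w=2):
    """'quorum' blocks until w acks; 'async' returns regardless."""
    acks = 0
    for node in replicas:
        if node.up:
            node.data[key] = value
            acks += 1
        else:
            # Hinted handoff: park the write on any live node, tagged
            # with the node it was really meant for.
            random.choice(live_nodes).hints.append((node, key, value))
    if mode == 'quorum' and acks < w:
        raise IOError('only %d of %d required acks' % (acks, w))

def harvest(live_nodes):
    """The periodic (~15 min) harvester: deliver hints whose target
    node has come back up."""
    for node in live_nodes:
        remaining = []
        for target, key, value in node.hints:
            if target.up:
                target.data[key] = value
            else:
                remaining.append((target, key, value))
        node.hints = remaining
}}}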

Write path

At write time,

  • You first write to a disk commit log (sequential)
  • After the write is logged, it is sent to the appropriate nodes
  • Each node receiving the write first records it in a local log, then updates the appropriate memtables (one per column family). A memtable is Cassandra's in-memory representation of key/value pairs before the data gets flushed to disk as an SSTable.

  • Memtables are flushed to disk when:

    • Out of space
    • Too many keys (128 is default)
    • Time duration (client provided – no cluster clock)
  • When memtables are written out, two files go out (see the sketch after this list):
    • Data file (SSTable). An SSTable (terminology borrowed from Google) stands for Sorted Strings Table and is a file of key/value string pairs, sorted by keys.
    • Index file (SSTable index). (Similar to Hadoop MapFile / TFile)
      • (Key, offset) pairs (pointing into the data file)
      • Bloom filter (of all keys in the data file)
  • When a commit log has had all its column families pushed to disk, it is deleted
  • Compaction: data files accumulate over time. Periodically, data files are merge-sorted into a new file (and a new index is created):
    • Merge keys
    • Combine columns
    • Discard tombstones
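A condensed sketch of the flush step referenced above: the memtable is an in-memory map, and flushing emits a sorted data file plus an index file holding (key, offset) pairs and the key set. The file layout (JSON lines) and the plain key list standing in for the bloom filter are simplifications, not Cassandra's on-disk format:

{{{
import json

class Memtable:
    """In-memory key/value map for one column family."""

    def __init__(self, key_limit=128):        # 128 keys: the default flush trigger
        self.rows, self.key_limit = {}, key_limit

    def put(self, key, value):
        self.rows[key] = value
        return len(self.rows) >= self.key_limit   # caller flushes when True

    def flush(self, path):
        """Write the SSTable data file and its companion index file."""
        index = []                             # (key, byte offset) pairs
        with open(path + '-Data.txt', 'w') as data:
            for key in sorted(self.rows):      # SSTable = Sorted Strings Table
                index.append((key, data.tell()))
                data.write(json.dumps([key, self.rows[key]]) + '\n')
        with open(path + '-Index.txt', 'w') as f:
            # A sorted key list stands in for the bloom filter; real bloom
            # filters are probabilistic and far smaller than the key set.
            json.dump({'index': index, 'bloom': sorted(self.rows)}, f)
        self.rows.clear()
}}}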

Remove

  • A deletion marker (tombstone) is necessary to suppress data in older SSTables until compaction (illustrated below)
  • Read repair complicates things a little
  • Eventual consistency complicates things more
  • Solution: a configurable delay before tombstone GC, after which tombstones are not repaired
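A toy illustration of the tombstone rules above, assuming newest-timestamp-wins and schematic (timestamp, value) tuples rather than Cassandra's real column format:

{{{
TOMBSTONE = object()   # deletion marker

def compact(sstables, gc_grace, now):
    """Merge SSTables: the newest timestamp wins per key; tombstones
    suppress older values and are GC'd once past the grace delay."""
    merged = {}
    for table in sstables:
        for key, (ts, value) in table.items():
            if key not in merged or ts > merged[key][0]:
                merged[key] = (ts, value)
    return {k: (ts, v) for k, (ts, v) in merged.items()
            if not (v is TOMBSTONE and now - ts > gc_grace)}

old = {'row1': (10, 'hello'), 'row2': (11, 'world')}
new = {'row1': (50, TOMBSTONE)}                    # row1 deleted at ts=50
print(compact([old, new], gc_grace=30, now=100))   # only row2 survives
}}}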

Cassandra write properties

  • No reads
  • No seeks
  • Fast
  • Atomic within a ColumnFamily
  • Always writable

Read path

Read flow: any node → partitioner → wait for R responses → wait for the remaining N − R responses in the background and perform read repair.
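A sketch of that read flow with read repair, modelling each replica as a dict of key -> (timestamp, value) and doing the "background" repair inline for brevity:

{{{
def read(replicas, key, r=2):
    """Return the freshest value once r replicas answer; push that
    value back to any stale replicas (read repair)."""
    responses = [(node[key], node) for node in replicas if key in node]
    if len(responses) < r:
        raise IOError('fewer than %d replicas answered' % r)
    (ts, value), _ = max(responses, key=lambda resp: resp[0][0])
    for (old_ts, _old), node in responses:
        if old_ts < ts:
            node[key] = (ts, value)            # repair the stale replica
    return value

a, b, c = {'k': (1, 'stale')}, {'k': (2, 'fresh')}, {'k': (2, 'fresh')}
print(read([a, b, c], 'k'), a['k'])            # 'fresh'; replica a repaired
}}}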

Cassandra read properties

  • Read multiple SSTables
  • Slower than writes (but still fast)
  • Seeks can be mitigated with more RAM
  • Scales to billions of rows

Consistency in a BASE world

If W + R > N, you will have consistency. Typical choices:

  • W = 1, R = N
  • W = N, R = 1
  • W = Q, R = Q, where Q = N / 2 + 1 (integer division: a majority quorum)
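The reason W + R > N gives consistency: any W write acknowledgements and any R read responses must share at least one replica, so some read response always carries the latest write. A brute-force check of that equivalence for a hypothetical N = 3:

{{{
from itertools import combinations

N = 3
nodes = range(N)
for W in range(1, N + 1):
    for R in range(1, N + 1):
        # Does every possible write-set intersect every possible read-set?
        always_overlap = all(set(ws) & set(rs)
                             for ws in combinations(nodes, W)
                             for rs in combinations(nodes, R))
        assert always_overlap == (W + R > N), (W, R)
print('consistent exactly when W + R > N')
}}}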

vs MySQL with 50GB of data

  • MySQL: ~300ms write, ~350ms read
  • Cassandra: ~0.12ms write, ~15ms read
