Differences between revisions 1 and 17 (spanning 16 versions)
Revision 1 as of 2009-11-13 16:21:07
Size: 3166
Editor: 78-86-128-147
Comment:
Revision 17 as of 2015-12-04 04:08:37
Size: 5236
Editor: MichaelEdge
Comment:
This is an overview of Cassandra architecture aimed at Cassandra users.

Developers should probably look at the Developers links on the wiki's front page.

The information is mainly based on Jonathan Ellis's OSCON 2009 presentation.

Why Cassandra

  • MySQL drives too many random I/Os
  • File-based solutions require far too many locks

The new face of data

  • Scale out, not up
  • Online load balancing, cluster growth
  • Flexible schema
  • Key-oriented queries
  • CAP-aware

CAP theorem

The CAP theorem (Brewer) states that you have to pick two of Consistency, Availability, and Partition tolerance: you can't have all three at the same time and still get acceptable latency.

Cassandra values Availability and Partition tolerance (AP). Tradeoffs between consistency and latency are tunable in Cassandra: you can get strong consistency (at the cost of increased latency), but you can't get row locking; that is a definite win for HBase.

Note: HBase values Consistency and Partition tolerance (CP).

History and approaches

Two famous papers

  • Bigtable: A distributed storage system for structured data, 2006
  • Dynamo: Amazon's highly available key-value store, 2007

Two approaches

  • Bigtable: "How can we build a distributed db on top of GFS?"
  • Dynamo: "How can we build a distributed hash table appropriate for the data center?"

Cassandra 10,000 ft summary

  • Dynamo partitioning and replication
  • Log-structured ColumnFamily data model similar to Bigtable's

Cassandra highlights

  • High availability
  • Incremental scalability
  • Eventually consistent
  • Tunable tradeoffs between consistency and latency
  • Minimal administration
  • No SPF (Single Point of Failure)

The p2p distribution model -- which drives the consistency model -- means there is no single point of failure.

Key distribution and partitioning

Dynamo architecture & Lookup

In a ring of nodes A, B, C, D, E, F and G, nodes B, C and D store keys in the range (a, b), including key k.

You can decide where a key should go in Cassandra using the InitialToken parameter for your Partitioner; see Storage Configuration.
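The ring lookup described above can be sketched in a few lines of Python. The tokens, node names, and function names here are illustrative for this sketch, not Cassandra's actual API: each node owns a token, and a key is stored on the first node whose token is greater than or equal to the key's token, plus the next RF - 1 nodes around the ring.

```python
from bisect import bisect_left

# Illustrative token ring: tokens and node names are made up for this sketch.
ring = {100: "A", 200: "B", 300: "C", 400: "D", 500: "E", 600: "F", 700: "G"}

def replicas(key_token, ring, rf=3):
    """Return the rf nodes responsible for key_token: the first node whose
    token is >= the key's token, then the next rf - 1 nodes on the ring."""
    tokens = sorted(ring)
    i = bisect_left(tokens, key_token) % len(tokens)   # wrap past the last token
    return [ring[tokens[(i + j) % len(tokens)]] for j in range(rf)]

# A key with token 150 falls between A's and B's tokens, so B owns it
# and C and D hold the replicas, matching the ring example in the text.
print(replicas(150, ring))   # prints ['B', 'C', 'D']
```

Choosing InitialToken per node is what fixes the `ring` mapping above; the lookup itself is then a deterministic walk around the ring.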

Architecture details

  • O(1) node lookup
  • Explicit replication
  • Eventually consistent

Architecture layers

Core Layer

  • Messaging service
  • Gossip
  • Failure detection
  • Cluster state
  • Partitioner
  • Replication

Middle Layer

  • Commit log
  • Memtable
  • SSTable
  • Indexes
  • Compaction

Top Layer

  • Tombstones
  • Hinted handoff
  • Read repair
  • Bootstrap
  • Monitoring
  • Admin tools

Writes

The write path:

  • Any node
  • Partitioner
  • Commit log, memtable
  • SSTable
  • Compaction
  • Wait for W responses

Write properties: no reads, no seeks, fast, atomic within a ColumnFamily, always writable.

Find details on the Cassandra Write Path here.
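The log-structured write path can be sketched as a toy store; the class and limits below are illustrative, not Cassandra's actual classes. Writes append to a commit log and update an in-memory memtable (no reads, no seeks), and the memtable is flushed to an immutable, sorted SSTable when it grows too large.

```python
class Node:
    """Toy log-structured store: append to the commit log, update the
    memtable, flush the memtable to an immutable SSTable when it is full."""

    def __init__(self, memtable_limit=3):
        self.commit_log = []          # durable, append-only (no seeks)
        self.memtable = {}            # in-memory, sorted only on flush
        self.sstables = []            # immutable "on-disk" tables, newest first
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))   # 1. append to the commit log
        self.memtable[key] = value             # 2. update the memtable (no read)
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # 3. write the memtable out as a sorted, immutable SSTable
        self.sstables.insert(0, dict(sorted(self.memtable.items())))
        self.memtable = {}

node = Node()
for k, v in [("a", 1), ("b", 2), ("c", 3), ("a", 9)]:
    node.write(k, v)
# The first three writes fill and flush one SSTable; the overwrite of "a"
# sits in the memtable until the next flush.
```

Compaction (not sketched here) later merges SSTables so that the newest version of each key wins.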

Reads

The read path:

  • Any node
  • Partitioner
  • Wait for R responses
  • Wait for N - R responses in the background and perform read repair

Read properties: reads touch multiple SSTables and are slower than writes (but still fast); seeks can be mitigated with more RAM; reads scale to billions of rows.

For comparison, with 50GB of data MySQL takes ~300ms per write and ~350ms per read, versus ~0.12ms per write and ~15ms per read for Cassandra.

Find details on the Cassandra Read Path here.
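A minimal sketch of a read with read repair, assuming each replica stores a (value, timestamp) pair and the highest timestamp wins. The names are invented for this sketch, and the "background" comparison of the remaining N - R replicas is done synchronously here for simplicity.

```python
def read(replicas, key, r=2):
    """Answer from the first r replica responses; repair any stale replica
    with the newest version seen (a synchronous stand-in for read repair)."""
    responses = {name: store.get(key, (None, -1)) for name, store in replicas.items()}
    answered = sorted(responses.items())[:r]              # the first R responses
    value, ts = max((v for _, v in answered), key=lambda v: v[1])
    # Read repair: push the newest version to every replica holding an older one.
    for name, store in replicas.items():
        if responses[name][1] < ts:
            store[key] = (value, ts)
    return value

stores = {
    "B": {"k": ("new", 2)},   # up to date
    "C": {"k": ("old", 1)},   # stale
    "D": {"k": ("old", 1)},   # stale
}
value = read(stores, "k", r=2)   # returns "new"; C and D get repaired
```

With R = 2 the answer already includes the newest version; the stale replicas converge afterwards, which is the eventual-consistency behaviour described below.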

Remove

A deletion marker (tombstone) is necessary to suppress data in older SSTables until compaction. Read repair complicates things a little, and eventual consistency complicates things more. The solution is a configurable delay before tombstone GC, after which tombstones are not repaired.
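A sketch of tombstone handling during compaction, assuming each SSTable maps keys to (value, timestamp) pairs and tables are merged newest-first. The names and the scalar grace period are illustrative, not Cassandra's actual configuration.

```python
GC_GRACE = 10         # illustrative grace period before tombstones may be GC'd

TOMBSTONE = object()  # deletion marker written in place of a value

def compact(sstables, now):
    """Merge SSTables listed newest-first: the newest version of each key
    wins; tombstones suppress older data and are discarded only once they
    are older than the GC grace period."""
    merged = {}
    for table in sstables:                    # newest table first
        for key, version in table.items():
            merged.setdefault(key, version)   # first version seen = newest wins
    return {k: (v, ts) for k, (v, ts) in merged.items()
            if not (v is TOMBSTONE and now - ts > GC_GRACE)}

tables = [
    {"a": (TOMBSTONE, 95), "b": (2, 96)},    # newer SSTable: "a" was deleted
    {"a": (1, 50), "c": (3, 60)},            # older SSTable
]
recent = compact(tables, now=100)   # tombstone for "a" kept (within grace)
later = compact(tables, now=200)    # tombstone for "a" discarded, "a" is gone
```

The grace delay is what prevents a replica that missed the delete from resurrecting the old value via read repair.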

Consistency

See also the API documentation.

Consistency describes how and whether a system is left in a consistent state after an operation. In distributed data systems like Cassandra, this usually means that once a writer has written, all readers will see that write.

In contrast to the strong consistency used in most relational databases (ACID, for Atomicity, Consistency, Isolation, Durability), Cassandra is at the other end of the spectrum (BASE, for Basically Available, Soft-state, Eventual consistency). Cassandra's weak consistency comes in the form of eventual consistency, which means the database eventually reaches a consistent state. Because the data is replicated, the latest version of a value sits on some node in the cluster while older versions are still on other nodes, but eventually all nodes will see the latest version.

More specifically, with

  • R = read replica count
  • W = write replica count
  • N = replication factor
  • Q = QUORUM (Q = N / 2 + 1)

you will have consistency if W + R > N. Consistent combinations include:

  • W=1, R=N
  • W=N, R=1
  • W=Q, R=Q

Cassandra provides consistency when R + W > N, where R is the number of replicas read, W is the number of replicas written to, and N is the replication factor. A ConsistencyLevel of ONE means R or W is 1. A ConsistencyLevel of QUORUM means R or W is ceiling((N+1)/2). A ConsistencyLevel of ALL means R or W is N. So if you write with a ConsistencyLevel of ONE and want reads to be guaranteed to see that data, you need to read with ConsistencyLevel ALL.


ArchitectureOverview (last edited 2015-12-04 04:08:37 by MichaelEdge)