The configuration file is parsed by DatabaseDescriptor (which also defines the default values for options that have them).
CassandraServer translates each client request into the corresponding internal request; StorageProxy then does the actual work for that request, and CassandraServer returns the result to the client.
AbstractReplicationStrategy controls which nodes hold the secondary, tertiary, etc. replicas of each key range. The node holding the primary replica is determined by the data's Token (among other variables). If the replication strategy is rack-unaware, the remaining replicas are stored on the next N-1 nodes in Token order. If the strategy is rack-aware, one replica is first placed on a node in a different rack, and the other N-2 replicas (assuming N > 2) are stored in the same rack as the primary, on the next N-2 nodes in Token order.
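The rack-unaware placement described above can be sketched as a walk along a sorted token ring. This is an illustrative simplification (integer tokens, no real endpoint metadata), not Cassandra's actual AbstractReplicationStrategy code:

```java
import java.util.*;

// Hypothetical sketch of rack-unaware replica placement: the key's primary
// node is the first node whose token is >= the key's token, and the other
// N-1 replicas are simply the next nodes along the ring, wrapping around.
public class RackUnawarePlacement {
    // ring is sorted ascending; returns the tokens of the N replica nodes
    static List<Integer> getReplicaTokens(List<Integer> ring, int keyToken, int n) {
        int start = 0;
        while (start < ring.size() && ring.get(start) < keyToken)
            start++; // find the primary: first node token >= key token
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < Math.min(n, ring.size()); i++)
            result.add(ring.get((start + i) % ring.size())); // wrap around the ring
        return result;
    }

    public static void main(String[] args) {
        List<Integer> ring = Arrays.asList(10, 20, 30, 40);
        System.out.println(getReplicaTokens(ring, 25, 3)); // [30, 40, 10]
    }
}
```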
MessagingService handles the internal connection pooling and runs internal commands on the appropriate stage (via a multithreaded ExecutorService). Stages are managed by StageManager; there are read, write, and stream stages, among others. (Streaming is used when large amounts of SSTable data need to be moved, e.g. during Cassandra bootstrap or when Tokens are reassigned.) (Translator's note: a "stage" here is neither a phase nor a platform; for the precise meaning, see the Cassandra source code or the SEDA literature.) The internal commands that run on the stages are defined in StorageService; see "registerVerbHandlers".
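The stage idea can be illustrated as a registry of named thread pools to which commands are submitted by stage name. The names and structure here are an assumption for illustration, not StageManager's actual API:

```java
import java.util.*;
import java.util.concurrent.*;

// Minimal sketch of SEDA-style stages: each stage is a named thread pool,
// and internal commands are submitted to the stage matching their verb.
public class StageSketch {
    static final Map<String, ExecutorService> stages = new HashMap<>();
    static {
        stages.put("READ", Executors.newFixedThreadPool(4));
        stages.put("WRITE", Executors.newFixedThreadPool(4));
        stages.put("STREAM", Executors.newSingleThreadExecutor());
    }

    // run a command on the named stage, returning a handle to its result
    static Future<String> submit(String stage, Callable<String> task) {
        return stages.get(stage).submit(task);
    }

    public static void main(String[] args) throws Exception {
        Future<String> f = submit("READ", () -> "row data");
        System.out.println(f.get()); // row data
        stages.values().forEach(ExecutorService::shutdown);
    }
}
```

Sizing each pool independently is the point of the design: a flood of writes cannot starve the read stage of threads.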
StorageProxy gets the nodes responsible for replicas of the keys from the ReplicationStrategy, then sends RowMutation messages to them.
On the destination node, RowMutationVerbHandler uses Table.apply to hand the write first to CommitLog.java, then to the Memtable for the appropriate ColumnFamily.
When a Memtable is full, it gets sorted and written out as an SSTable asynchronously by ColumnFamilyStore.switchMemtable.
When enough SSTables exist, they are merged by ColumnFamilyStore.doFileCompaction.
- Making this concurrency-safe without blocking writes or reads while we remove the old SSTables from the list and add the new one is tricky, because naive approaches require waiting for all readers of the old sstables to finish before deleting them (since we can't know if they have actually started opening the file yet; if they have not and we delete the file first, they will error out). The approach we have settled on is to not actually delete old SSTables synchronously; instead we register a phantom reference with the garbage collector, so when no references to the SSTable exist it will be deleted. (We also write a compaction marker to the file system so if the server is restarted before that happens, we clean out the old SSTables at startup time.)
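The phantom-reference trick above can be demonstrated with the standard `java.lang.ref` machinery. The class and method names below are illustrative stand-ins, not Cassandra's actual types: the point is only that the file is deleted after the GC proves no reader object remains reachable:

```java
import java.io.File;
import java.lang.ref.PhantomReference;
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;

// Sketch: instead of deleting a compacted SSTable's file synchronously, tie
// a PhantomReference to the in-memory reader; when the GC enqueues it, no
// reader can still be opening the file, so deletion is safe.
public class PhantomCleanup {
    static final ReferenceQueue<Object> QUEUE = new ReferenceQueue<>();

    // phantom reference linking a reader object to its on-disk file
    static class SSTableRef extends PhantomReference<Object> {
        final File file;
        SSTableRef(Object reader, File file) {
            super(reader, QUEUE);
            this.file = file;
        }
    }

    // poll until the GC proves the reader is unreachable, then delete the file
    static boolean deleteWhenUnreferenced(SSTableRef ref) throws InterruptedException {
        for (int i = 0; i < 50; i++) {
            System.gc(); // hint; once the reader is collected, ref is enqueued
            Reference<?> r = QUEUE.remove(100);
            if (r == ref)
                return ref.file.delete();
        }
        return false; // reader is still strongly reachable somewhere
    }

    public static void main(String[] args) throws Exception {
        File f = File.createTempFile("sstable", ".db");
        SSTableRef ref = new SSTableRef(new Object(), f); // no strong ref kept to the "reader"
        System.out.println(deleteWhenUnreferenced(ref) && !f.exists());
    }
}
```

The compaction marker mentioned above covers the remaining gap: if the process dies before the GC runs, leftover files are cleaned up at startup instead.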
StorageProxy gets the nodes responsible for replicas of the keys from the ReplicationStrategy, then sends read messages to them.
This may be a SliceFromReadCommand, a SliceByNamesReadCommand, or a RangeSliceReadCommand, depending on the query.
On the data node, ReadVerbHandler gets the data from CFS.getColumnFamily or CFS.getRangeSlice and sends it back as a ReadResponse.
- The row is located by doing a binary search on the index in SSTableReader.getPosition
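The index lookup can be pictured as a binary search over sorted (key, file offset) pairs. This is a toy stand-in for SSTableReader.getPosition, with illustrative names and an in-memory index:

```java
import java.util.*;

// Sketch of locating a row: binary-search the sorted key index, and return
// the matching file offset where the row's data begins.
public class IndexSearch {
    static long getPosition(List<String> keys, List<Long> offsets, String key) {
        int i = Collections.binarySearch(keys, key); // keys must be sorted
        return i >= 0 ? offsets.get(i) : -1;         // -1: key not in this SSTable
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("apple", "kiwi", "mango");
        List<Long> offsets = Arrays.asList(0L, 4096L, 9182L);
        System.out.println(getPosition(keys, offsets, "kiwi")); // 4096
    }
}
```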
For single-row requests, we use a QueryFilter subclass to pick the data from the Memtable and SSTables that we are looking for. The Memtable read is straightforward. The SSTable read is a little different depending on which kind of request it is:
- If we are reading a slice of columns, we use the row-level column index to find where to start reading, and deserialize block-at-a-time (where "block" is the group of columns covered by a single index entry) so we can handle the "reversed" case without reading vast amounts into memory
- If we are reading a group of columns by name, we still use the column index to locate each column, but first we check the row-level bloom filter to see if we need to do anything at all
- The column readers provide an Iterator interface, so the filter can easily stop when it's done, without reading more columns than necessary
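The "check the bloom filter before doing anything at all" step can be shown with a toy filter. This is a deliberately simplified illustration (fixed size, two ad-hoc hash functions), not Cassandra's actual filter implementation:

```java
import java.util.BitSet;

// Toy row-level bloom filter: a negative answer proves the column is absent,
// so the reader can skip the index lookup entirely; a positive answer may be
// a false positive and still requires the real lookup.
public class BloomSketch {
    static final int BITS = 1024;
    final BitSet bits = new BitSet(BITS);

    void add(String name) {
        bits.set(Math.floorMod(name.hashCode(), BITS));
        bits.set(Math.floorMod(name.hashCode() * 31 + 7, BITS));
    }

    boolean mightContain(String name) {
        return bits.get(Math.floorMod(name.hashCode(), BITS))
            && bits.get(Math.floorMod(name.hashCode() * 31 + 7, BITS));
    }

    public static void main(String[] args) {
        BloomSketch filter = new BloomSketch();
        filter.add("col1");
        System.out.println(filter.mightContain("col1")); // true: must check the index
    }
}
```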
Since we need to potentially merge columns from multiple SSTable versions, the reader iterators are combined through a ReducingIterator, which takes an iterator of uncombined columns as input, and yields combined versions as output
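The reducing step can be sketched over a flat sorted list instead of real iterators; the Column shape and "highest timestamp wins" rule below are an illustrative simplification of what ReducingIterator does with column versions from multiple SSTables:

```java
import java.util.*;

// Sketch of reducing: input columns are sorted by name, possibly with
// several versions of the same name (one per SSTable); output has exactly
// one column per name, keeping the version with the newest timestamp.
public class ReduceColumns {
    static class Column {
        final String name; final String value; final long timestamp;
        Column(String n, String v, long t) { name = n; value = v; timestamp = t; }
    }

    static List<Column> reduce(List<Column> sorted) {
        List<Column> out = new ArrayList<>();
        for (Column c : sorted) {
            Column last = out.isEmpty() ? null : out.get(out.size() - 1);
            if (last != null && last.name.equals(c.name)) {
                if (c.timestamp > last.timestamp)
                    out.set(out.size() - 1, c); // newer version wins
            } else {
                out.add(c); // first version of a new name
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Column> in = Arrays.asList(
            new Column("a", "old", 1), new Column("a", "new", 2),
            new Column("b", "x", 5));
        for (Column c : reduce(in))
            System.out.println(c.name + "=" + c.value); // a=new, b=x
    }
}
```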
If a quorum read was requested, StorageProxy waits for a majority of nodes to reply and makes sure the answers match before returning. Otherwise, it returns the data reply as soon as it gets it, and checks the other replies for discrepancies in the background in StorageService.doConsistencyCheck. This is called "read repair," and also helps achieve consistency sooner.
As an optimization, StorageProxy only asks the closest replica for the actual data; the other replicas are asked only to compute a hash of the data.
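The digest comparison at the coordinator can be sketched as hashing the data reply and comparing it to the digest replies; MD5 and the method names here are illustrative choices, not necessarily what StorageProxy uses:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Arrays;

// Sketch of a digest read: only the closest replica ships the full row; the
// others ship a hash. The coordinator hashes the data reply and compares;
// a mismatch means the replicas disagree and read repair is needed.
public class DigestRead {
    static byte[] digest(String row) throws Exception {
        return MessageDigest.getInstance("MD5")
                .digest(row.getBytes(StandardCharsets.UTF_8));
    }

    static boolean repliesMatch(String dataReply, byte[] digestReply) throws Exception {
        return Arrays.equals(digest(dataReply), digestReply);
    }

    public static void main(String[] args) throws Exception {
        byte[] remote = digest("row-v1");                   // digest from another replica
        System.out.println(repliesMatch("row-v1", remote)); // true: replicas agree
        System.out.println(repliesMatch("row-v2", remote)); // false: trigger read repair
    }
}
```

The saving is in bandwidth: N-1 replicas send a few bytes of hash instead of the full row.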
based on "Efficient reconciliation and flow control for anti-entropy protocols:" http://www.cs.cornell.edu/home/rvr/papers/flowgossip.pdf
See ArchitectureGossip for more details
based on "The Phi accrual failure detector:" http://vsedach.googlepages.com/HDY04.pdf
The idea of dividing work into "stages" with separate thread pools comes from the famous SEDA paper: http://www.eecs.harvard.edu/~mdw/papers/seda-sosp01.pdf
Crash-only design is another broadly applied principle. Valerie Henson's LWN article is a good introduction
Cassandra's distribution is closely related to the one presented in Amazon's Dynamo paper. Read repair, adjustable consistency levels, hinted handoff, and other concepts are discussed there. This is required background material: http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html. The related article on eventual consistency is also relevant. Jeff Darcy's article on Availability and Partition Tolerance explains the underlying principle of CAP better than most.
Cassandra's on-disk storage model is loosely based on sections 5.3 and 5.4 of the Bigtable paper.
Facebook's Cassandra team authored a paper on Cassandra for LADIS 09: http://www.cs.cornell.edu/projects/ladis2009/papers/lakshman-ladis2009.pdf. Most of the information there is applicable to Apache Cassandra (the main exception is the integration of ZooKeeper).