The most recently written data always goes to the memtable first, while older data is flushed to disk and held in the operating system's file system cache. In other words, when it comes to memory, the more the better; 1GB is the recommended minimum. How much performance improvement you will see depends on the size of your hot data set, but 4GB is typically needed, and on high-end hardware, such as production clusters, each node commonly carries 16GB to 32GB or more.
Commit logs receive every write made to a Cassandra node and have the potential to block client operations, but they are only ever read on node start-up. SSTable (data file) writes, on the other hand, occur asynchronously, but SSTables are read to satisfy client look-ups. SSTables are also periodically merged and rewritten in a process called compaction. Another important difference between the commit log and SSTables is that commit log segments are purged after the corresponding data has been flushed to disk as an SSTable, so CommitLogDirectory only holds unflushed data, while the directories in DataFileDirectories store all of the data written to a node.
So to summarize: if you use a separate device for your CommitLogDirectory, it needn't be large, but it should be fast enough to receive all of your writes (as appends, i.e., sequential I/O). Then, use one or more devices for DataFileDirectories and make sure they are large enough to house all of your data, and fast enough both to satisfy reads that are not cached in memory and to keep up with flushing and compaction.
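A storage-conf.xml fragment illustrating this layout might look like the following; the paths are purely illustrative, only the element names come from the configuration discussed above:

```xml
<!-- Commit log on a small, fast dedicated device: sequential appends only -->
<CommitLogDirectory>/mnt/fastdisk/cassandra/commitlog</CommitLogDirectory>

<!-- Data files on one or more large volumes that serve random reads -->
<DataFileDirectories>
    <DataFileDirectory>/mnt/bigraid/cassandra/data</DataFileDirectory>
</DataFileDirectories>
```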
As covered in MemtableSSTable, compactions can in the worst case temporarily require free space equal to 100% of the in-use space on a single volume (that is, in a data file directory). So if you are going to be approaching 50% or more of your disks' capacity, you should raid0 your data directory volumes. B. Todd Burruss adds on the mailing list, "With the file sizes we're talking about with cassandra and other database products, the [raid] stripe size doesn't seem to matter. Mine is set to 128k, which produced the same results as 16k and 256k." In addition to giving you capacity for compactions, raid0 will help smooth out I/O hotspots within a single sstable.
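To make the headroom requirement concrete, here is a small sketch (a hypothetical helper, not part of Cassandra) that checks whether a single data volume can absorb a worst-case compaction, which may temporarily need as much free space as the data already in use:

```python
def has_compaction_headroom(volume_capacity_bytes, in_use_bytes):
    """Worst case, compaction needs free space equal to 100% of the
    in-use space on the volume, so in-use data must stay at or below
    50% of the volume's capacity."""
    return in_use_bytes * 2 <= volume_capacity_bytes

# A 2 TB volume holding 1.2 TB of SSTables is past the 50% line:
print(has_compaction_headroom(2 * 10**12, 1.2 * 10**12))  # -> False
# Striping two such volumes together with raid0 restores headroom:
print(has_compaction_headroom(4 * 10**12, 1.2 * 10**12))  # -> True
```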
On ext2/ext3 the maximum file size is 2TB, even on a 64-bit kernel. On ext4 that goes up to 16TB. Since Cassandra can use almost half your disk space on a single file, if you are raiding large disks together you may want to use XFS instead, particularly if you are using a 32-bit kernel. XFS file size limits are 16TB max on a 32-bit kernel, and basically unlimited on 64-bit.
Several heavy users of Cassandra deploy in the cloud, e.g. CloudKick on Rackspace Cloud Servers and SimpleGeo on Amazon EC2. The general consensus in the community seems to be that Rackspace's VMs offer better performance for Cassandra because of CPU bursting, raided local disks, and separate public/private network interfaces.