Using Cassandra for large data sets (lots of data per node)
This page aims to give some advice on the issues one may need to consider when using Cassandra for large data sets (meaning hundreds of gigabytes or terabytes per node). The intent is not to make original claims, but to collect in one place some issues that are operationally relevant. Reading the other parts of the wiki is highly recommended in order to fully understand the issues involved.
This is a work in progress. If you find information out of date (e.g., a JIRA ticket referenced has been resolved but this document has not been updated), please help by editing the page or e-mailing cassandra-user.
Note that not all of these issues are specific to Cassandra (for example, any storage system is subject to the trade-offs of cache sizes relative to active set size, and IOPS will always be strongly correlated with the percentage of requests that penetrate caching layers).
Unless otherwise noted, the points refer to Cassandra 0.7 and above.
- Disk space usage in Cassandra can vary fairly suddenly over time. If you have significant amounts of data such that available disk space is not significantly higher than usage, consider:
- Compaction of a column family can as much as double the disk space used by that column family (in the case of a major compaction with no deletions). If your data is predominantly made up of a single column family, or a select few, then doubling the disk space for one CF may be significant compared to your total disk usage.
- Repair operations can increase disk space demands (particularly in 0.6, less so in 0.7; TODO: provide actual maximum growth and what it depends on).
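The compaction headroom point above can be turned into a quick back-of-the-envelope check. The following is a minimal sketch, assuming the pessimistic case described (a major compaction with no deletions, so input and output coexist on disk) and a single compaction at a time. The column family names, sizes and disk capacity are hypothetical, and repair overhead is not modelled.

```python
def worst_case_disk_usage(cf_sizes_gb):
    """Return peak disk usage (GB) if the largest column family is major-compacted.

    Assumes no deletions, so the compacted output is as large as the input
    and both copies exist on disk until the old sstables are removed.
    """
    total = sum(cf_sizes_gb.values())
    largest = max(cf_sizes_gb.values())
    return total + largest


# Hypothetical node with one dominant column family.
sizes = {"events": 600.0, "users": 40.0, "sessions": 60.0}
disk_capacity_gb = 1000.0

peak = worst_case_disk_usage(sizes)
print(f"steady state: {sum(sizes.values()):.0f} GB, peak: {peak:.0f} GB")
print("fits" if peak <= disk_capacity_gb else "does NOT fit during compaction")
```

In this made-up example the node looks comfortable at 70% usage, yet a major compaction of the dominant column family would not fit.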
- As your data set becomes larger and larger (assuming significantly larger than memory), you become more and more dependent on caching to elide I/O operations. As you plan and test your capacity, keep in mind that:
- The Cassandra row cache is in the JVM heap and is unaffected by compactions and repair operations (it remains warm). This is a plus, but the down-side is that the row cache is not very memory efficient compared to the operating system page cache.
- For 0.6.8 and below, the key cache is affected by compaction because it is per-sstable, and compaction moves data to new sstables.
This was fixed/improved in CASSANDRA-1878, for 0.6.9 and 0.7.0.
- The operating system's page cache is affected by compaction and repair operations. If you are relying on the page cache to keep the active set in memory, you may see significant degradation in performance as a result of compaction and repair operations.
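To see why a drop in cache effectiveness matters so much, it can help to work through the arithmetic. This is only a sketch: the read rate, hit ratios and the one-seek-per-miss figure are illustrative assumptions, not measurements of any Cassandra deployment.

```python
def disk_seeks_per_second(read_rate, cache_hit_ratio, seeks_per_miss=1.0):
    """Disk seeks/second generated by reads that miss every cache layer."""
    return read_rate * (1.0 - cache_hit_ratio) * seeks_per_miss


# Hypothetical node serving 5,000 reads/s at different overall hit ratios.
for hit in (0.99, 0.95, 0.80):
    seeks = disk_seeks_per_second(5000, hit)
    print(f"hit ratio {hit:.0%} -> ~{seeks:,.0f} disk seeks/s")
```

A seemingly small drop in hit ratio, for example after a compaction has evicted part of the active set from the page cache, multiplies the load that reaches the disks.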
- Prior to 0.7.1 (fixed in CASSANDRA-1555), if you had column families with more than 143 million row keys in them, bloom filter false positive rates would be likely to go up because of implementation concerns that limited the maximum size of a bloom filter. See ArchitectureInternals for information on how bloom filters are used. The negative effect of hitting this limit is that reads will start taking additional seeks to disk as the row count increases. Note that the effect you are seeing at any given moment will depend on when compaction was last run, because the bloom filter limit is per-sstable. It matters at the column family level because, after a major compaction, the entire column family will be in a single sstable.
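The relationship between a capped filter size and the false positive rate can be illustrated with the standard bloom filter approximation p = (1 - e^(-kn/m))^k, where n is the number of keys, m the number of bits and k the number of hash functions. In the sketch below the bit cap and hash count are assumptions chosen for illustration only; they are not taken from the Cassandra source.

```python
import math


def bloom_false_positive_rate(n_keys, m_bits, k_hashes):
    """Standard approximation: p = (1 - exp(-k * n / m)) ** k."""
    return (1.0 - math.exp(-k_hashes * n_keys / m_bits)) ** k_hashes


# Assumed for illustration only: a filter capped at 2**31 bits, 10 hashes.
MAX_BITS = 2 ** 31
K = 10

for n in (50_000_000, 143_000_000, 300_000_000, 600_000_000):
    p = bloom_false_positive_rate(n, MAX_BITS, K)
    print(f"{n:>11,d} keys -> false positive rate ~ {p:.6f}")
```

With the bit count fixed, every additional key raises the false positive rate, which is why reads take more unnecessary seeks as the row count grows past the limit.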
- Compaction is currently not concurrent, so only a single compaction runs at a time. This means that sstable counts may spike during larger compactions as several smaller sstables are written while a large compaction is happening. This can cause additional seeks on reads.
- Consider the choice of file system. Removal of large files is notoriously slow and seek-bound on e.g. ext2/ext3. Consider xfs or ext4. This affects the background unlink():ing of sstables that happens every now and then, and also affects start-up time: if there are sstables pending removal when a node is starting up, they are removed as part of the start-up process, so it may be detrimental if removing a terabyte of sstables takes an hour (the numbers are ballpark figures, not accurately measured, and depend on circumstances).
- Adding nodes is a slow process if each node is responsible for a large amount of data. Plan for this; do not try to throw additional hardware at a cluster at the last minute.
- Cassandra will read through sstable index files on start-up, doing what is known as "index sampling". This is used to keep a subset (currently and by default, 1 out of 100) of keys and their on-disk location in the index, in memory. See ArchitectureInternals. This means that the larger the index files are, the longer it takes to perform this sampling. Thus, for very large indexes (typically when you have a very large number of keys) the index sampling on start-up may be a significant issue.
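The idea behind index sampling can be sketched in a few lines. This is a simplified illustration and not Cassandra's implementation: it assumes a sorted in-memory list standing in for an index file, string keys compared lexically, and a hypothetical fixed entry size of 32 bytes.

```python
import bisect


def build_sample(index_entries, interval=100):
    """Keep every `interval`-th (key, offset) pair from a sorted index."""
    return index_entries[::interval]


def scan_start(sample, key):
    """Return the index-file offset from which to start scanning for `key`.

    Picks the last sampled key <= `key`; returns None if `key` sorts
    before the first sampled key (it cannot be in this index).
    """
    keys = [k for k, _ in sample]
    i = bisect.bisect_right(keys, key) - 1
    return sample[i][1] if i >= 0 else None


# Hypothetical index: 1,000 sorted keys, each entry 32 bytes long on disk.
index = [(f"key{n:06d}", n * 32) for n in range(1000)]
sample = build_sample(index, interval=100)

print(len(sample))                       # entries held in memory
print(scan_start(sample, "key000250"))   # where to start reading the index
```

Building the sample requires a pass over every index entry, which is why the start-up cost grows with the size of the index files even though only a small fraction of entries is kept.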
- A negative side-effect of a large row cache is start-up time. The periodic saving of the row cache information only saves the keys that are cached; the data has to be pre-fetched on start-up. On a large data set, this is probably going to be seek-bound, and the time it takes to warm up the row cache will be linear with respect to the row cache size (assuming sufficiently large amounts of data that the seek-bound I/O is not subject to optimization by disks).
Potential future improvement: CASSANDRA-1625.
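The linear warm-up cost is easy to estimate roughly. The sketch below assumes one seek per cached row and a disk that sustains about 150 random reads per second; both figures are illustrative assumptions and will differ with hardware and data layout.

```python
def warmup_seconds(cached_rows, seeks_per_row=1.0, disk_iops=150.0):
    """Estimate seek-bound warm-up time: total seeks divided by seeks/second."""
    return cached_rows * seeks_per_row / disk_iops


for rows in (100_000, 1_000_000, 10_000_000):
    minutes = warmup_seconds(rows) / 60.0
    print(f"{rows:>10,d} cached rows -> ~{minutes:,.0f} minutes")
```

Under these assumptions, a tenfold larger row cache takes ten times as long to warm up.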
- The total number of rows per node correlates directly with the size of bloom filters and sampled index entries. Expect the base memory requirement of a node to increase linearly with the number of keys (assuming the average row key size remains constant). If you are not using caching at all (e.g. you are doing analysis type workloads), expect these two to be the two biggest consumers of memory.
- You can decrease the memory use due to index sampling by changing the index sampling interval in cassandra.yaml.
You should soon be able to tweak the bloom filter sizes too, once CASSANDRA-3497 is done.
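The linear relationship between key count and base memory can be sketched as follows. All constants here (bloom filter bits per key, per-sample overhead, average key size) are illustrative assumptions rather than measured Cassandra values; the point is the shape of the growth and the effect of a larger sampling interval.

```python
def base_memory_bytes(n_keys, avg_key_bytes, index_interval=100,
                      bloom_bits_per_key=15, per_sample_overhead=32):
    """Rough per-node memory for bloom filters plus sampled index entries.

    All defaults are illustrative assumptions, not measured Cassandra values.
    """
    bloom = n_keys * bloom_bits_per_key / 8
    samples = (n_keys / index_interval) * (avg_key_bytes + per_sample_overhead)
    return bloom + samples


GIB = 1024 ** 3
for n in (100_000_000, 1_000_000_000):
    for interval in (100, 400):
        mem = base_memory_bytes(n, avg_key_bytes=32, index_interval=interval)
        print(f"{n:>13,d} keys, interval {interval}: ~{mem / GIB:.2f} GiB")
```

Raising the sampling interval shrinks only the index sample portion; the bloom filter portion is unaffected, and a sparser sample means more index entries to scan per lookup.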