Using Cassandra for large data sets (lots of data per node)

This page aims to give some advice on the issues one may need to consider when using Cassandra for large data sets (meaning hundreds of gigabytes or terabytes per node). The intent is not to make original claims, but to collect in one place the issues that are operationally relevant. Reading other parts of the wiki is highly recommended in order to fully understand the issues involved.

This is a work in progress. If you find out-of-date information (e.g., a referenced JIRA ticket has been resolved but this document has not been updated), please help by editing this page or e-mailing cassandra-user.

Note that not all of these issues are specific to Cassandra. For example, any storage system is subject to the trade-off between cache size and active set size, and IOPS will always be strongly correlated with the percentage of requests that penetrate the caching layers.
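As a rough illustration of the caching point above, the following sketch estimates how many reads reach the disk for a given cache hit ratio. The function name, the request rate and the hit ratios are made up for illustration, and the model assumes exactly one disk read per cache miss, which real systems will not match precisely.

```python
def disk_reads_per_second(requests_per_second: float, cache_hit_ratio: float) -> float:
    """Estimate reads per second that miss the cache and go to disk.

    Assumption (not from the wiki): each cache miss costs exactly one disk read.
    """
    if not 0.0 <= cache_hit_ratio <= 1.0:
        raise ValueError("cache_hit_ratio must be between 0 and 1")
    return requests_per_second * (1.0 - cache_hit_ratio)


# Hypothetical load of 10,000 reads per second at three cache hit ratios.
for hit_ratio in (0.99, 0.95, 0.90):
    reads = round(disk_reads_per_second(10_000, hit_ratio))
    print(f"hit ratio {hit_ratio:.2f}: about {reads} disk reads/s")
```

Under these assumptions, a drop in hit ratio from 95% to 90% doubles the disk read load, which is why the ratio of cache size to active set size matters so much as the data per node grows.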

Unless otherwise noted, the points refer to Cassandra 0.7 and above.

LargeDataSetConsiderations (last edited 2011-12-27 18:33:40 by PeterSchuller)