Limitations

CQL

  • No join or subquery support, and limited support for aggregation. This is by design (see http://www.datastax.com/docs/1.2/ddl/index), to force you to denormalize into partitions that can be efficiently queried from a single replica, instead of having to gather data from across the entire cluster.

  • Ordering is done per-partition (see http://www.datastax.com/docs/1.2/ddl/table#compound-keys-and-clustering), and is specified at table creation time, as the sketch after this list shows. Again, this is to enforce good application design; sorting thousands or millions of rows can be fast in development, but sorting billions in production is a bad idea.
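
A minimal sketch of both points in CQL (the table and column names are hypothetical): the partition key groups every row that will be queried together onto a single replica set, and the sort order within a partition is fixed when the table is created.

    CREATE TABLE timeline (
        user_id   text,       -- partition key: one user's entries live together on one replica set
        posted_at timestamp,  -- clustering column: rows are stored sorted by this within the partition
        body      text,
        PRIMARY KEY (user_id, posted_at)
    ) WITH CLUSTERING ORDER BY (posted_at DESC);

A query that names a single partition (WHERE user_id = ...) reads pre-sorted rows from one replica; there is no way to re-sort across partitions at query time.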

Storage engine

  • All data for a single partition must fit (on disk) on a single machine in the cluster. Because partition keys alone are used to determine the nodes responsible for replicating their data, the amount of data associated with a single key has this upper bound.
  • A single column value may not be larger than 2GB; in practice, "single digits of MB" is a more reasonable limit, since there is no streaming or random access of blob values. (A chunking workaround is sketched after this list.)
  • Collection values may not be larger than 64KB.
  • The maximum number of cells (rows x columns) in a single partition is 2 billion.
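
For the blob-size bullet above, a common workaround is to split a large value into chunks well under the practical per-column limit, store one chunk per clustering row, and reassemble on read. A minimal sketch in CQL (names are hypothetical):

    CREATE TABLE file_chunks (
        file_id  text,  -- one file per partition
        chunk_no int,   -- clustering column: chunks come back in order
        data     blob,  -- keep each chunk in the single-digit-MB range
        PRIMARY KEY (file_id, chunk_no)
    );

    -- Read the file back in order, chunk by chunk:
    SELECT chunk_no, data FROM file_chunks WHERE file_id = 'backup-2013-05-20';

Note that the partition-size bullet still applies: all chunks of one file share a partition, so the whole file must fit on a single machine.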

Limitations that will be gone soon

  • There is no cursor support, so large result sets must be manually paged (an example follows below). Cursor support is scheduled for 2.0 (https://issues.apache.org/jira/browse/CASSANDRA-4415).
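
A minimal illustration of manual paging in CQL, reusing the hypothetical timeline table from above: fetch a fixed-size page, then restart the next query just past the last clustering value seen.

    -- First page: the 100 newest entries for one user
    SELECT posted_at, body FROM timeline
     WHERE user_id = 'alice' LIMIT 100;

    -- Next page: resume strictly below the last posted_at returned
    SELECT posted_at, body FROM timeline
     WHERE user_id = 'alice' AND posted_at < '2013-05-01 00:00:00' LIMIT 100;

To page across partitions rather than within one, the same pattern applies to the partition key via token(): WHERE token(user_id) > token('alice').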
