Differences between revisions 124 and 125
Revision 124 as of 2011-07-19 20:18:28
Size: 34397
Editor: aus
Comment: document what i've found about ubuntu/ec2/jna/memlock "task blocked for more than 120 seconds" problems
Revision 125 as of 2011-07-19 21:37:42
Size: 34377
Editor: aus
Comment: note that problem also existed on bare metal
Line 44: Line 44:
 * [[#ubuntu_ec2_hangs|I'm using Ubuntu on EC2 with JNA, and holy crap weird things keep hanging and stalling and printing scary tracebacks in dmesg!]]  * [[#ubuntu_hangs|I'm using Ubuntu with JNA, and holy crap weird things keep hanging and stalling and printing scary tracebacks in dmesg!]]
Line 479: Line 479:

== I'm using Ubuntu on EC2 with JNA, and holy crap weird things keep hanging and stalling and blocking and printing scary tracebacks in dmesg! ==

== I'm using Ubuntu with JNA, and holy crap weird things keep hanging and stalling and blocking and printing scary tracebacks in dmesg! ==
Line 489: Line 490:
It does not seem that anyone has had the time to track this down to the real root cause, but it does seem that upgrading the linux-image-virtual package and rebooting your instances fixes it. There is likely some bug in several of the virtual/xen kernel builds distributed by Ubuntu which is fixed in later versions. Versions of linux-image-*-virtual which are known not to have this problem include: It does not seem that anyone has had the time to track this down to the real root cause, but it does seem that upgrading the linux-image package and rebooting your instances fixes it. There is likely some bug in several of the kernel builds distributed by Ubuntu which is fixed in later versions. Versions of linux-image-* which are known not to have this problem include:

Frequently asked questions

Why can't I make Cassandra listen on 0.0.0.0 (all my addresses)?

Cassandra is a gossip-based distributed system. ListenAddress is also the "contact me here" address, i.e., the address it tells other nodes to reach it at. Telling other nodes "contact me on any of my addresses" is a bad idea; if different nodes in the cluster pick different addresses for you, Bad Things happen.

If you don't want to manually specify an IP to ListenAddress for each node in your cluster (understandable!), leave it blank and Cassandra will use InetAddress.getLocalHost() to pick an address. Then it's up to you or your ops team to make things resolve correctly (/etc/hosts, DNS, etc).

One exception to this process is JMX, which by default binds to 0.0.0.0 (Java bug 6425769).

See CASSANDRA-256 and CASSANDRA-43 for more gory details.
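To check what a blank ListenAddress would resolve to on a given machine, you can call InetAddress.getLocalHost() yourself. The following is a minimal sketch; the class name ListenAddressCheck and the wording of the warning are made up for illustration.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class ListenAddressCheck {
    /** Formats an address and flags loopback results, which other nodes cannot reach. */
    static String describe(InetAddress addr) {
        String line = addr.getHostName() + " -> " + addr.getHostAddress();
        return addr.isLoopbackAddress() ? line + " (loopback: fix /etc/hosts or DNS)" : line;
    }

    public static void main(String[] args) throws UnknownHostException {
        // The same JDK call the FAQ says Cassandra uses when ListenAddress is blank.
        System.out.println(describe(InetAddress.getLocalHost()));
    }
}
```

If this prints a loopback address, other nodes will most likely not be able to contact the node, and name resolution on the host needs fixing first.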

What ports does Cassandra use?

By default, Cassandra uses 7000 for cluster communication, 9160 for clients (Thrift), and 8080 for JMX. These are all editable in the configuration file or bin/cassandra.in.sh (for JVM options). All ports are TCP. See also RunningCassandra.

Why does Cassandra slow down after doing a lot of inserts?

This is a symptom of memory pressure, resulting in a storm of GC operations as the JVM frantically tries to free enough heap to continue to operate. Eventually, the server will crash from OutOfMemory; usually, but not always, it will be able to log this final error before the JVM terminates.

You can increase the amount of memory the JVM uses, or decrease the insert threshold before Cassandra flushes its memtables. See MemtableThresholds for details.

Setting your cache sizes too large can result in memory pressure.

What happens to existing data in my cluster when I add new nodes?

Starting a new node with the -b [bootstrap] option will cause it to contact other nodes in the cluster to copy the right data to itself.

In Cassandra 0.5 and above, there is an "AutoBootstrap" option in the config file. When enabled, using the "-b" option is unnecessary, because new nodes will automatically bootstrap themselves when they start up for the first time. Even with AutoBootstrap it is recommended that you always specify the InitialToken, because letting the node pick an initial token will almost certainly result in an unbalanced ring. If you are building the initial cluster you certainly don't want to leave InitialToken blank. Pick the tokens such that the ring will be balanced afterward and explicitly set them on each node. See token selection in the operations wiki.

Unless you know precisely what you're doing and are aware of how the Cassandra internals work, you should never introduce a new empty node to your cluster with AutoBootstrap disabled. In version 0.7 under write load this will cause writes to be sent to the new node before the schema arrives from another member of the cluster. It would also indicate to clients that the new node is responsible for servicing reads for data that it definitely doesn't have.

In Cassandra 0.4 and below, it is recommended that you manually specify a value for "InitialToken" in the config file of a new node.
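As a rough illustration of picking balanced tokens, the sketch below spaces N tokens evenly around the ring. It assumes the RandomPartitioner with a token range of 0 to 2^127; the class name BalancedTokens is hypothetical, and other partitioners need a different scheme.

```java
import java.math.BigInteger;

public class BalancedTokens {
    /** Returns nodeCount evenly spaced tokens in [0, 2^127). */
    static BigInteger[] tokens(int nodeCount) {
        // Assumption: RandomPartitioner token space of size 2^127.
        BigInteger ringSize = BigInteger.ONE.shiftLeft(127);
        BigInteger[] result = new BigInteger[nodeCount];
        for (int i = 0; i < nodeCount; i++) {
            result[i] = ringSize.multiply(BigInteger.valueOf(i))
                                .divide(BigInteger.valueOf(nodeCount));
        }
        return result;
    }

    public static void main(String[] args) {
        // Usage: java BalancedTokens 4
        for (BigInteger token : tokens(Integer.parseInt(args[0]))) {
            System.out.println(token);
        }
    }
}
```

Each printed value would go into the InitialToken setting of one node.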

Can I add/remove/rename Column Families on a working cluster?

Yes, but it's important that you do it correctly.

  1. Empty the commitlog with "nodetool drain".
  2. Shut down Cassandra and verify that there is no remaining data in the commitlog.
  3. Delete the sstable files (-Data.db, -Index.db, and -Filter.db) for any CFs removed, and rename the files for any CFs that were renamed.
  4. Make necessary changes to your storage-conf.xml.
  5. Start Cassandra back up and your edits should take effect.

see also: CASSANDRA-44

Does it matter which node a Thrift or higher-level client connects to?

No, any node in the cluster will work; Cassandra nodes proxy your request as needed. This leaves the client with a number of options for end point selection:

  1. You can maintain a list of contact nodes (all or a subset of the nodes in the cluster), and configure your clients to choose among them.
  2. Use round-robin DNS and create a record that points to a set of contact nodes (recommended).
  3. Use the get_string_property("token map") RPC to obtain an up-to-date list of the nodes in the cluster and cycle through them.

  4. Deploy a load-balancer, proxy, etc.

When using a higher-level client you should investigate which, if any, options are implemented by your higher-level client to help you distribute your requests across nodes in a cluster.
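If your client offers nothing built in, the first option (a fixed list of contact nodes) can be sketched as a simple round-robin picker. The class name RoundRobinHosts and the addresses are placeholders, and a real client would also want to skip nodes that are currently down.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class RoundRobinHosts {
    private final List<String> hosts;
    private final AtomicInteger next = new AtomicInteger();

    RoundRobinHosts(List<String> hosts) {
        if (hosts.isEmpty()) {
            throw new IllegalArgumentException("need at least one contact node");
        }
        this.hosts = hosts;
    }

    /** Returns the next contact node, wrapping around at the end of the list. */
    String pick() {
        // floorMod keeps the index valid even after the counter overflows.
        int index = Math.floorMod(next.getAndIncrement(), hosts.size());
        return hosts.get(index);
    }

    public static void main(String[] args) {
        RoundRobinHosts pool =
                new RoundRobinHosts(Arrays.asList("10.0.0.1", "10.0.0.2", "10.0.0.3"));
        for (int n = 0; n < 4; n++) {
            System.out.println(pool.pick());
        }
    }
}
```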

What kind of hardware should I run Cassandra on?

See CassandraHardware.

What are SSTables and Memtables?

See MemtableSSTable and MemtableThresholds.

Why is it so hard to work with TimeUUIDType in Java?

TimeUUIDs are difficult to use from Java clients because java.util.UUID does not support generating version 1 (time-based) UUIDs. Here is one way to work with them and Cassandra:

Use the UUID generator from: http://johannburkard.de/software/uuid/

Below are three methods that are quite useful in working with the uuids as they come in and out of Cassandra.

Generate a new UUID to use in a TimeUUIDType sorted column family.

        /**
         * Gets a new time uuid.
         * @return the time uuid
         */
        public static java.util.UUID getTimeUUID() {
                return java.util.UUID.fromString(new com.eaio.uuid.UUID().toString());
        }

When you read out of Cassandra you're getting a byte[] that needs to be converted into a TimeUUID, and since java.util.UUID doesn't seem to have a simple way of doing this, pass it through the eaio UUID class again.

        /**
         * Returns an instance of uuid.
         * @param uuid the 16 raw bytes read from Cassandra
         * @return the java.util.UUID
         */
        public static java.util.UUID toUUID(byte[] uuid) {
            assert uuid.length == 16;
            long msb = 0;
            long lsb = 0;
            // Bytes 0-7 hold the most significant bits, big-endian.
            for (int i = 0; i < 8; i++)
                msb = (msb << 8) | (uuid[i] & 0xff);
            // Bytes 8-15 hold the least significant bits.
            for (int i = 8; i < 16; i++)
                lsb = (lsb << 8) | (uuid[i] & 0xff);

            com.eaio.uuid.UUID u = new com.eaio.uuid.UUID(msb, lsb);
            return java.util.UUID.fromString(u.toString());
        }

When you want to actually place the UUID into the Column then you'll want to convert it like this. This method is often used in conjunction with the getTimeUUID() mentioned above.

        /**
         * As byte array.
         * @param uuid the uuid
         * @return the byte[]
         */
        public static byte[] asByteArray(java.util.UUID uuid) {
            long msb = uuid.getMostSignificantBits();
            long lsb = uuid.getLeastSignificantBits();
            byte[] buffer = new byte[16];

            for (int i = 0; i < 8; i++) {
                buffer[i] = (byte) (msb >>> 8 * (7 - i));
            }
            // Shift by 8 * (15 - i) so the count stays in 0..56 rather than relying on shift wraparound.
            for (int i = 8; i < 16; i++) {
                buffer[i] = (byte) (lsb >>> 8 * (15 - i));
            }

            return buffer;
        }

Further, it is often useful to create a TimeUUID object from some time other than the present: for example, to use as the lower bound in a SlicePredicate to retrieve all columns whose TimeUUID comes after time X. Most libraries don't provide this functionality, probably because this breaks the "Universal" part of UUID: this should give you pause! Never assume such a UUID is unique: use it only as a marker for a specific time.

With those disclaimers out of the way, if you feel a need to create a TimeUUID based on a specific date, here is some code that will work:

        public static java.util.UUID uuidForDate(java.util.Date d) {
            // Magic number obtained from #cassandra's thobbs, who
            // claims to have stolen it from a Python library.
            final long NUM_100NS_INTERVALS_SINCE_UUID_EPOCH = 0x01b21dd213814000L;

            // Milliseconds since the Unix epoch, converted to 100 ns intervals since the UUID epoch.
            long origTime = d.getTime();
            long time = origTime * 10000 + NUM_100NS_INTERVALS_SINCE_UUID_EPOCH;
            long timeLow = time &       0xffffffffL;
            long timeMid = time &   0xffff00000000L;
            long timeHi = time & 0xfff000000000000L;
            // Version 1 layout: time_low, time_mid, then the version nibble and time_hi.
            long upperLong = (timeLow << 32) | (timeMid >> 16) | (1 << 12) | (timeHi >> 48);
            return new java.util.UUID(upperLong, 0xC000000000000000L);
        }
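Going the other way, recovering the wall-clock time from a TimeUUID, uses the same offset in reverse. This is a small sketch with a made-up class name (TimeUuidClock); it relies only on java.util.UUID.timestamp(), which works for version 1 UUIDs.

```java
import java.util.Date;
import java.util.UUID;

public class TimeUuidClock {
    // 100 ns intervals between the UUID epoch (1582-10-15) and the Unix epoch.
    static final long NUM_100NS_INTERVALS_SINCE_UUID_EPOCH = 0x01b21dd213814000L;

    /** Returns the Unix time in milliseconds encoded in a version 1 UUID. */
    static long toMillis(UUID uuid) {
        // UUID.timestamp() throws UnsupportedOperationException for other UUID versions.
        return (uuid.timestamp() - NUM_100NS_INTERVALS_SINCE_UUID_EPOCH) / 10000;
    }

    public static void main(String[] args) {
        // A version 1 UUID whose timestamp is exactly the Unix epoch.
        UUID epoch = UUID.fromString("13814000-1dd2-11b2-8000-000000000000");
        System.out.println(new Date(toMillis(epoch)));
    }
}
```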

I delete data from Cassandra, but disk usage stays the same. What gives?

Data you write to Cassandra gets persisted to SSTables. Since SSTables are immutable, the data can't actually be removed when you perform a delete. Instead, a marker (also called a "tombstone") is written to indicate the value's new status. Never fear though: on the first compaction that occurs after GCGraceSeconds (hint: storage-conf.xml) have expired, the data will be expunged completely and the corresponding disk space recovered. See DistributedDeletes for more detail.

Why are reads slower than writes?

Unlike all major relational databases and some NoSQL systems, Cassandra does not use b-trees and in-place updates on disk. Instead, it uses a sstable/memtable model like Bigtable's: writes to each ColumnFamily are grouped together in an in-memory structure before being flushed (sorted and written to disk). This means that writes cost no random I/O, compared to a b-tree system which not only has to seek to the data location to overwrite, but also may have to seek to read different levels of the index if it outgrows disk cache!

The downside is that on a read, Cassandra has to (potentially) merge row fragments from multiple sstables on disk. We think this is a tradeoff worth making, first because scaling writes has always been harder than scaling reads, and second because as your data corpus grows Cassandra's read disadvantage narrows vs b-tree systems that have to do multiple seeks against a large index. See MemtableSSTable for more details.
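The tradeoff can be seen in a toy model. The sketch below is not Cassandra's implementation: it stores whole values rather than row fragments, keeps "sstables" in memory, and never compacts. It only shows why a write touches one in-memory structure while a read may have to look in several places. The class name ToyLsmStore is invented.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.SortedMap;
import java.util.TreeMap;

public class ToyLsmStore {
    private final int flushThreshold;
    private TreeMap<String, String> memtable = new TreeMap<>();
    // Newest flushed table first; a table is never modified after the flush.
    private final Deque<SortedMap<String, String>> sstables = new ArrayDeque<>();

    ToyLsmStore(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    /** A write only touches memory; a full memtable is flushed as one sorted unit. */
    void put(String key, String value) {
        memtable.put(key, value);
        if (memtable.size() >= flushThreshold) {
            sstables.addFirst(memtable);
            memtable = new TreeMap<>();
        }
    }

    /** A read checks the memtable, then every flushed table from newest to oldest. */
    String get(String key) {
        if (memtable.containsKey(key)) {
            return memtable.get(key);
        }
        for (SortedMap<String, String> table : sstables) {
            if (table.containsKey(key)) {
                return table.get(key);
            }
        }
        return null;
    }

    int tablesOnDisk() {
        return sstables.size();
    }
}
```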

Why does nodeprobe ring only show one entry, even though my nodes logged that they see each other joining the ring?

This happens when you have the same token assigned to each node. Don't do that.

Most often this bites people who deploy by installing Cassandra on a VM (especially when using the Debian package, which auto-starts Cassandra after installation, thus generating and saving a token), then cloning that VM to other nodes.

The easiest fix is to wipe the data and commitlog directories, thus making sure that each node will generate a random token on the next restart.

Why do deleted keys show up during range scans?

Because get_range_slice says, "apply this predicate to the range of rows given," meaning, if the predicate result is empty, we have to include an empty result for that row key. It is perfectly valid to perform such a query returning empty column lists for some or all keys, even if no deletions have been performed.

So to special case leaving out result entries for deletions, we would have to check the entire rest of the row to make sure there is no undeleted data anywhere else either (in which case leaving the key out would be an error).

This is what we used to do with the old get_key_range method, but the performance hit turned out to be unacceptable.

See DistributedDeletes for more on how deletes work in Cassandra.

Can I change the ReplicationFactor on a live cluster?

Yes, but it will require restarting and running repair manually to change the replica count of existing data.

  • Alter the ReplicationFactor for the desired keyspace(s) in the storage configuration on each node in the cluster.

  • Restart Cassandra on each node in the cluster.

If you're reducing the ReplicationFactor:

  • Run "nodetool cleanup" on the cluster to remove surplus replicated data. Cleanup runs on a per-node basis.

If you're increasing the ReplicationFactor:

  • Run "nodetool repair" to run an anti-entropy repair on the cluster. Repair runs on a per-node basis. This is an intensive process that will result in adverse cluster performance. It's highly recommended to do rolling repairs, as an attempt to repair the entire cluster at once will most likely swamp it and may even kill it.

Can I Store BLOBs in Cassandra?

Currently Cassandra isn't optimized specifically for large file or BLOB storage. However, files of around 64 MB and smaller can be easily stored in the database without splitting them into smaller chunks. The limit comes primarily from the fact that Cassandra's public API is based on Thrift, which offers no streaming abilities; any value written or fetched has to fit into memory. Other non-Thrift interfaces may solve this problem in the future, but there are currently no plans to change Thrift's behavior. When planning applications that require storing BLOBs, you should also consider these attributes of Cassandra:

  • The main limitation on a column and super column size is that all the data for a single key and column must fit (on disk) on a single machine (node) in the cluster. Because keys alone are used to determine the nodes responsible for replicating their data, the amount of data associated with a single key has this upper bound. This is an inherent limitation of the distribution model.
  • When large columns are created and retrieved, that column's data is loaded into RAM, which can get resource intensive quickly. Consider loading 200 rows with columns that store 10 MB image files each into RAM. That small result set would consume about 2 GB of RAM. Clearly, as more and more large columns are loaded, RAM would start to get consumed quickly. This can be worked around, but will take some upfront planning and testing to get a workable solution for most applications. You can find more information regarding this behavior here: memtables, and a possible solution in 0.7 here: CASSANDRA-16.

  • Please refer to the notes in the Cassandra limitations section for more information: Cassandra Limitations
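For values too large to hold comfortably in memory, a common workaround is to split them on the client side and store each piece in its own column. The helper below is only a sketch of the splitting step; the class name BlobChunker is hypothetical, and how the pieces are named and stored (for example, one column per chunk index) is left to the application.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BlobChunker {
    /** Splits data into pieces of at most chunkSize bytes, preserving order. */
    static List<byte[]> split(byte[] data, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<byte[]> chunks = new ArrayList<>();
        int offset = 0;
        while (offset < data.length) {
            // The last chunk may be shorter than chunkSize.
            int length = Math.min(chunkSize, data.length - offset);
            chunks.add(Arrays.copyOfRange(data, offset, offset + length));
            offset += length;
        }
        return chunks;
    }
}
```

Reassembly is the reverse: read the chunks in order and concatenate them.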

Nodetool says "Connection refused to host:" for any remote host. What gives?

Nodetool relies on JMX, which in turn relies on RMI, which in turn sets up its own listeners and connectors as needed on each end of the exchange. Normally all of this happens behind the scenes transparently, but incorrect name resolution for either the host connecting, or the one being connected to, can result in crossed wires and confusing exceptions.

If you are not using DNS, then make sure that your /etc/hosts files are accurate on both ends. If that fails try passing the -Djava.rmi.server.hostname=$IP option to the JVM at startup (where $IP is the address of the interface you can reach from the remote machine).

How can I iterate over all the rows in a ColumnFamily?

Simple but slow: Use get_range_slices, start with the empty string, and after each call use the last key read as the start key in the next iteration.

Better: use HadoopSupport.
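The simple approach can be sketched as a paging loop. The RangeFetcher interface here is a hypothetical stand-in for a get_range_slices call that returns row keys only; it is not part of any Cassandra client library. Because the start key is inclusive, every page after the first begins with a key that was already seen and must be skipped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;
import java.util.stream.Collectors;

public class RangePager {
    /** Hypothetical stand-in for get_range_slices: keys from startKey on, at most count. */
    interface RangeFetcher {
        List<String> fetch(String startKey, int count);
    }

    /** Visits every key once, restarting each page at the last key of the previous one. */
    static List<String> allKeys(RangeFetcher fetcher, int pageSize) {
        if (pageSize < 2) {
            throw new IllegalArgumentException("pageSize must be at least 2");
        }
        List<String> seen = new ArrayList<>();
        String start = "";
        while (true) {
            List<String> page = fetcher.fetch(start, pageSize);
            // Skip the repeated first key on every page after the first.
            boolean repeated = !seen.isEmpty() && !page.isEmpty() && page.get(0).equals(start);
            seen.addAll(page.subList(repeated ? 1 : 0, page.size()));
            if (page.size() < pageSize) {
                return seen; // a short page means the end of the range
            }
            start = page.get(page.size() - 1);
        }
    }

    public static void main(String[] args) {
        // In-memory stand-in for a column family, ordered like a sorted key range.
        TreeSet<String> rows = new TreeSet<>(Arrays.asList("k1", "k2", "k3", "k4", "k5"));
        RangeFetcher inMemory = (start, count) ->
                rows.tailSet(start, true).stream().limit(count).collect(Collectors.toList());
        System.out.println(allKeys(inMemory, 2));
    }
}
```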

Why were none of the keyspaces described in storage-conf.xml loaded?

Prior to 0.7, Cassandra loaded a set of static keyspaces defined in a storage-conf.xml file. CASSANDRA-44 added the ability to modify schema dynamically on a live cluster. Part of this change required that we ignore the schema defined in storage-conf.xml. Additionally, 0.7 converts to YAML-based configuration.

If you have an existing storage-conf.xml file, you will first need to convert it to YAML using the bin/config-converter tool, which can generate a cassandra.yaml file from a storage-conf.xml file. Once you have a cassandra.yaml, it is possible to do a one-time load of the schema it defines. 0.7 adds a loadSchemaFromYAML method to StorageServiceMBean (triggered via JMX: see https://issues.apache.org/jira/browse/CASSANDRA-1001 ) which will load the schema defined in cassandra.yaml, but this is a one-time operation. A node that has had its schema defined via loadSchemaFromYAML will load its schema from the system table on subsequent restarts, which means that any further changes to the schema need to be made using the system_* thrift operations (see API).

It is recommended that you only perform schema updates on one node and let Cassandra propagate changes to the rest of the cluster. If you try to perform the same updates simultaneously on multiple nodes, you run the risk of introducing inconsistent migrations, which will lead to a confused cluster.

See LiveSchemaUpdates for more information.

Is there a GUI admin tool for Cassandra?

Insert operation throws InvalidRequestException with message "A long is exactly 8 bytes"

You are probably using the LongType column sorter in your column family. LongType assumes that the numbers stored in column names are exactly 64 bits (8 bytes) long and in big-endian format. Example code showing how to pack an integer for storing into Cassandra and unpack it again in PHP:

        /**
         * Takes a PHP integer and packs it into a 64-bit (8 bytes) long big-endian binary representation.
         * @param  $x integer
         * @return string eight bytes long binary representation of the integer in big-endian order.
         */
        public static function pack_longtype($x) {
                return pack('C8', ($x >> 56) & 0xff, ($x >> 48) & 0xff, ($x >> 40) & 0xff, ($x >> 32) & 0xff,
                                ($x >> 24) & 0xff, ($x >> 16) & 0xff, ($x >> 8) & 0xff, $x & 0xff);
        }

        /**
         * Takes an eight bytes long big-endian binary representation of an integer and unpacks it to a PHP integer.
         * @param  $x
         * @return php integer
         */
        public static function unpack_longtype($x) {
                $a = unpack('C8', $x);
                return ($a[1] << 56) + ($a[2] << 48) + ($a[3] << 40) + ($a[4] << 32) + ($a[5] << 24) + ($a[6] << 16) + ($a[7] << 8) + $a[8];
        }
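Java clients can get the same 8-byte big-endian encoding from java.nio.ByteBuffer, which is big-endian by default. A minimal sketch, with a made-up class name (LongTypeCodec):

```java
import java.nio.ByteBuffer;

public class LongTypeCodec {
    /** Packs a long into the 8-byte big-endian form that LongType expects. */
    static byte[] pack(long value) {
        // A freshly allocated ByteBuffer uses big-endian byte order.
        return ByteBuffer.allocate(8).putLong(value).array();
    }

    /** Unpacks an 8-byte big-endian value back into a long. */
    static long unpack(byte[] bytes) {
        if (bytes.length != 8) {
            throw new IllegalArgumentException("A long is exactly 8 bytes");
        }
        return ByteBuffer.wrap(bytes).getLong();
    }
}
```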

Cassandra says "ClusterName mismatch: oldClusterName != newClusterName" and refuses to start

To prevent operator errors, Cassandra stores the name of the cluster in its system table. If you need to rename a cluster for some reason, it is safe to remove system/LocationInfo* after forcing a compaction on all ColumnFamilies (with the old cluster name) if you've specified the node's token in the config file, or if you don't care about preserving the node's token (for instance in single node clusters.)

Are batch_mutate operations atomic?

As a special case, mutations against a single key are atomic but not isolated. Reads which occur during such a mutation may see part of the write before they see the whole thing. More generally, batch_mutate operations are not atomic. batch_mutate allows grouping operations on many keys into a single call in order to save on the cost of network round-trips. If batch_mutate fails in the middle of its list of mutations, no rollback occurs and the mutations that have already been applied stay applied. The client should typically retry the batch_mutate operation.

Is Hadoop (i.e. Map/Reduce, Pig, Hive) supported?

For the latest on Hadoop-Cassandra integration, see HadoopSupport.

Can a Cassandra cluster be multi-tenant?

There is work being done to support more multi-tenant capabilities such as scheduling and auth. For more information, see MultiTenant.

Who is using Cassandra and for what?

For information on who is using Cassandra and what they are using it for, see CassandraUsers.

Are there any ODBC drivers for Cassandra?


Are there ways to do logging directly to Cassandra?

For information on logging directly to Cassandra, see LoggingToCassandra.

On RHEL nodes are unable to join the ring

Check whether SELinux is on; if it is, turn it off.

Is there an authentication/authorization mechanism for Cassandra?

Yes. For details, see ExtensibleAuth.

How do I bulk load data into Cassandra?

See BulkLoading

Why aren't range slices/sequential scans giving me the expected results?

You're probably using the RandomPartitioner. This is the default because it avoids hotspots, but it means your rows are ordered by the MD5 of the row key rather than lexicographically by the raw key bytes.

You can start out with a start key and end key of [empty] and use the row count argument instead, if your goal is paging the rows. To get the next page, start from the last key you got in the previous page.

You can also use intra-row ordering of column names to get ordered results within a row; with appropriate row 'bucketing,' you often don't need the rows themselves to be ordered.
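To see why the row order looks random, you can sort a few keys by their MD5 digest. The sketch below approximates the RandomPartitioner by treating the absolute value of the digest as the token; that detail is an assumption for illustration, and the class name Md5RowOrder is invented.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class Md5RowOrder {
    /** Approximates a RandomPartitioner token: the MD5 digest as a non-negative integer. */
    static BigInteger token(String rowKey) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(rowKey.getBytes(StandardCharsets.UTF_8));
            return new BigInteger(digest).abs();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is required on every Java platform", e);
        }
    }

    /** Returns the keys in the order a range scan would visit them under this model. */
    static List<String> inRingOrder(List<String> rowKeys) {
        List<String> sorted = new ArrayList<>(rowKeys);
        sorted.sort(Comparator.comparing(Md5RowOrder::token));
        return sorted;
    }

    public static void main(String[] args) {
        System.out.println(inRingOrder(Arrays.asList("user1", "user2", "user3", "user4")));
    }
}
```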

How do I unsubscribe from the email list?

Send an email to user-unsubscribe@cassandra.apache.org

I compacted, so why did space used not decrease?

SSTables that are obsoleted by a compaction are deleted asynchronously when the JVM performs a GC. You can force a GC from jconsole if necessary, but Cassandra will force one itself if it detects that it is low on space. A compaction marker is also added to obsolete sstables so they can be deleted on startup if the server does not perform a GC before being restarted. Read more on this subject here.

Why does top report that Cassandra is using a lot more memory than the Java heap max?

Cassandra uses mmap to do zero-copy reads. That is, we use the operating system's virtual memory system to map the sstable data files into the Cassandra process' address space. This will "use" virtual memory; i.e. address space, and will be reported by tools like top accordingly, but on 64 bit systems virtual address space is effectively unlimited so you should not worry about that.

What matters from the perspective of "memory use" in the sense as it is normally meant, is the amount of data allocated on brk() or mmap'd /dev/zero, which represent real memory used. The key issue is that for a mmap'd file, there is never a need to retain the data resident in physical memory. Thus, whatever you do keep resident in physical memory is essentially just there as a cache, in the same way as normal I/O will cause the kernel page cache to retain data that you read/write.

The difference between normal I/O and mmap() is that in the mmap() case the memory is actually mapped to the process, thus affecting the virtual size as reported by top. The main argument for using mmap() instead of standard I/O is the fact that reading entails just touching memory - in the case of the memory being resident, you just read it - you don't even take a page fault (so no overhead in entering the kernel and doing a semi-context switch). This is covered in more detail here.

I'm getting java.io.IOException: Cannot run program "ln" when trying to snapshot or update a keyspace

Updating a keyspace first takes a snapshot. This involves creating hardlinks to the existing SSTables, but Java has no native way to create hard links, so it must fork "ln". When forking, there must be as much memory free as the parent process, even though the child isn't going to use it all. Because Java is a large process, this is problematic. The solution is to install Java Native Access so it can create the hard links itself.

How does Cassandra decide which nodes have what data?

The set of nodes (a single node, or several) responsible for any given piece of data is determined by:

  • The row key (data is partitioned on row key)
  • The replication factor (decides how many nodes are in the replica set for a given row)

  • The replication strategy (decides which nodes are part of said replica set)

In the case of the SimpleStrategy, replicas are placed on succeeding nodes in the ring. The first node is determined by the partitioner and the row key, and the remainder are placed on the succeeding nodes. In the case of NetworkTopologyStrategy, placement is affected by data-center and rack awareness, and the placement will depend on how nodes in different racks or data centers are placed in the ring.

It is important to understand that Cassandra does not alter the replica set for a given row key based on changing characteristics like current load, which nodes are up or down, or which node your client happens to talk to.
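The SimpleStrategy case can be modelled in a few lines. This is a toy sketch, not Cassandra's code: tokens are plain integers, the class name ToyRing is invented, and it assumes a node owns the keys whose token is at most its own token and greater than its predecessor's.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class ToyRing {
    private final TreeMap<BigInteger, String> nodesByToken = new TreeMap<>();

    void addNode(BigInteger token, String node) {
        nodesByToken.put(token, node);
    }

    /** First replica is the node at or after keyToken; the rest follow it around the ring. */
    List<String> replicasFor(BigInteger keyToken, int replicationFactor) {
        List<String> replicas = new ArrayList<>();
        if (nodesByToken.isEmpty()) {
            return replicas;
        }
        BigInteger token = nodesByToken.ceilingKey(keyToken);
        if (token == null) {
            token = nodesByToken.firstKey(); // wrap past the highest token
        }
        int wanted = Math.min(replicationFactor, nodesByToken.size());
        while (replicas.size() < wanted) {
            replicas.add(nodesByToken.get(token));
            token = nodesByToken.higherKey(token);
            if (token == null) {
                token = nodesByToken.firstKey();
            }
        }
        return replicas;
    }
}
```

Note that nothing in replicasFor depends on load or on which nodes are up, which matches the point above: the replica set is a function of the key, the ring, and the replication settings only.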

I have a row or key cache hit rate of 0.XX123456789 reported by JMX. Is that XX% or 0.XX% ?


Commit Log gets very big. Cassandra does not delete "old" commit logs. Why?

You probably have one or more Column Families with very low throughput. These will typically not be flushed by crossing the throughput or operations thresholds, causing old commit segments to be retained until the memtable_flush_after_min threshold has been crossed. The default value for this threshold is 60 minutes and may be decreased via cassandra-cli by doing:

 update column family XXX with memtable_flush_after=YY; 

where YY is a number of minutes.

What are seeds?

If you configure your nodes to refer to some node as a seed, nodes in your ring tend to send Gossip messages to seeds more often (refer to ArchitectureGossip for details) than to non-seeds. In other words, seeds work as hubs of the Gossip network. With seeds, each node can detect status changes of other nodes quickly.

Seeds are also consulted by new nodes on bootstrap to learn about the other nodes in the ring. When you add a new node to the ring, you need to specify at least one live seed to contact. Once a node joins the ring, it learns about the other nodes, so it doesn't need a seed on subsequent boots.

If you configure a node as a seed, it doesn't auto_bootstrap. So if you want to add a node to the ring and make it a seed, let it auto_bootstrap first, and then make it a seed.

Does single seed mean single point of failure?

If you are using a replicated CF on the ring, having only one seed doesn't mean a single point of failure. The ring can operate or boot without the seed. However, it will need more time to spread status changes of nodes over the ring. It is recommended to have multiple seeds in a production system.

Why can't I call jmx method X on jconsole? (ex. getNaturalEndpoints)

Some JMX operations can't be called from jconsole because the buttons are inactive for them. Jconsole doesn't support array arguments, so operations that take an array as an argument can't be invoked from it. You need to write a JMX client to call such operations, or use an array-capable JMX monitoring tool.
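A small JMX client needs only the standard javax.management classes. The sketch below connects to a node and lists the operations of one MBean; the class name JmxOperationLister is invented, and the MBean object name is taken from the command line rather than assumed (copy it from jconsole's MBeans tab). Port 8080 is the default JMX port mentioned earlier in this FAQ.

```java
import javax.management.MBeanOperationInfo;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class JmxOperationLister {
    /** Builds the standard RMI-based JMX service URL for a host and port. */
    static String serviceUrl(String host, int port) {
        return "service:jmx:rmi:///jndi/rmi://" + host + ":" + port + "/jmxrmi";
    }

    public static void main(String[] args) throws Exception {
        // Usage: java JmxOperationLister <host> <port> <mbean-object-name>
        JMXServiceURL url = new JMXServiceURL(serviceUrl(args[0], Integer.parseInt(args[1])));
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbeans = connector.getMBeanServerConnection();
            ObjectName name = new ObjectName(args[2]);
            for (MBeanOperationInfo op : mbeans.getMBeanInfo(name).getOperations()) {
                System.out.println(op.getReturnType() + " " + op.getName());
            }
            // To call an operation, array arguments included, use
            // mbeans.invoke(name, operationName, argumentValues, argumentTypeNames).
        } finally {
            connector.close();
        }
    }
}
```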

What's the maximum key size permitted?

The key (and column names) must be under 64K bytes.

Routing is O(N) of the key size and querying and updating are O(N log N). In practice these factors are usually dwarfed by other overhead, but some users with very large "natural" keys use their hashes instead to cut down the size.
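Replacing a large natural key with its hash can be sketched as follows. The choice of SHA-256 and the class name KeyHasher are assumptions for illustration; any stable digest works. Since a hash cannot be reversed, you would typically also store the natural key in a column if you need it back.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class KeyHasher {
    /** Returns the SHA-256 digest of a natural key as 64 lowercase hex characters. */
    static String hashedKey(String naturalKey) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(naturalKey.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is required on every Java platform", e);
        }
    }
}
```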

I'm using Ubuntu with JNA, and holy crap weird things keep hanging and stalling and blocking and printing scary tracebacks in dmesg!

We have come across several different, but similar, sets of symptoms that might match what you're seeing. They might all have the same root cause; it's not clear. One common piece is messages like this in dmesg:

INFO: task (some_taskname):(some_pid) blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

It does not seem that anyone has had the time to track this down to the real root cause, but it does seem that upgrading the linux-image package and rebooting your instances fixes it. There is likely some bug in several of the kernel builds distributed by Ubuntu which is fixed in later versions. Versions of linux-image-* which are known not to have this problem include:

  • linux-image-2.6.38-10-virtual (2.6.38-10.46) (Ubuntu 11.04/Natty Narwhal)
  • linux-image-2.6.35-24-virtual (2.6.35-24.42) (Ubuntu 10.10/Maverick Meerkat)

Uninstalling libjna-java, or recompiling Cassandra with the mlockall() call in CLibrary.tryMlockall() commented out, also makes at least some forms of this problem go away, but that's a much less desirable fix.

If you have more information on the problem and better ways to avoid it, please do update this space.

FAQ (last edited 2016-08-15 15:30:03 by JonathanEllis)