Frequently asked questions

Why can't I make Cassandra listen on 0.0.0.0 (all my addresses)?

Cassandra is a gossip-based distributed system. ListenAddress is also the "contact me here" address, i.e., the address a node tells other nodes to reach it at. Telling other nodes "contact me on any of my addresses" is a bad idea; if different nodes in the cluster pick different addresses for you, Bad Things happen.

If you don't want to manually specify an IP for ListenAddress on each node in your cluster (understandable!), leave it blank and Cassandra will use InetAddress.getLocalHost() to pick an address. Then it's up to you or your ops team to make things resolve correctly (/etc/hosts, DNS, etc).
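To see what a blank ListenAddress would resolve to on a given machine, a small check along these lines can help. This is only a sketch: the class and method names (LocalHostCheck, describeLocalHost) are made up for illustration, and it only reports what InetAddress.getLocalHost() returns; it does not read any Cassandra configuration.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

final class LocalHostCheck {
    /** Describes the address InetAddress.getLocalHost() resolves to on this machine. */
    public static String describeLocalHost() {
        try {
            InetAddress addr = InetAddress.getLocalHost();
            // A loopback result means other nodes could not reach this address.
            String note = addr.isLoopbackAddress()
                    ? " (loopback: fix /etc/hosts or DNS)"
                    : "";
            return addr.getHostName() + " -> " + addr.getHostAddress() + note;
        } catch (UnknownHostException e) {
            return "hostname does not resolve: " + e.getMessage();
        }
    }
}
```

Printing LocalHostCheck.describeLocalHost() on each node before starting Cassandra shows whether name resolution points at an address the rest of the cluster can reach.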

One exception to this process is JMX, which by default binds to 0.0.0.0 (Java bug 6425769).

See CASSANDRA-256 and CASSANDRA-43 for more gory details.

What ports does Cassandra use?

By default, Cassandra uses 7000 for cluster communication (7001 if SSL is enabled), 9160 for clients (Thrift), and 7199 for JMX. The internode communication and Thrift ports are configurable in cassandra.yaml, and the JMX port is configurable in cassandra-env.sh (through JVM options). All ports are TCP. See also RunningCassandra.

What happens to existing data in my cluster when I add new nodes?

When a new node joins a cluster, it will automatically contact the other nodes in the cluster and copy the right data to itself.

What kind of hardware should I run Cassandra on?

See [CassandraHardware].

What are SSTables and Memtables?

See MemtableSSTable.

Why is it so hard to work with TimeUUIDType in Java?

TimeUUIDs are difficult to use from Java clients because java.util.UUID does not support generating version 1 (time-based) UUIDs. Here is one way to work with them and Cassandra:

Use the UUID generator from: http://johannburkard.de/software/uuid/. See Time based UUID Notes

Below are three methods that are quite useful in working with the uuids as they come in and out of Cassandra.

Generate a new UUID to use in a TimeUUIDType sorted column family.

        /**
         * Gets a new time uuid.
         *
         * @return the time uuid
         */
        public static java.util.UUID getTimeUUID()
        {
                return java.util.UUID.fromString(new com.eaio.uuid.UUID().toString());
        }

When you read out of Cassandra you're getting a byte[] that needs to be converted into a TimeUUID. Since java.util.UUID doesn't seem to have a simple way of doing this, pass it through the eaio UUID class again.

        /**
         * Converts the 16-byte form read from Cassandra into a java.util.UUID.
         *
         * @param uuid the 16 bytes of the uuid, most significant byte first
         * @return the java.util.UUID
         */
        public static java.util.UUID toUUID( byte[] uuid )
        {
            assert uuid.length == 16;
            long msb = 0;
            long lsb = 0;
            // The first 8 bytes form the most significant long, the last 8 the least.
            for (int i=0; i<8; i++)
                msb = (msb << 8) | (uuid[i] & 0xff);
            for (int i=8; i<16; i++)
                lsb = (lsb << 8) | (uuid[i] & 0xff);

            com.eaio.uuid.UUID u = new com.eaio.uuid.UUID(msb,lsb);
            return java.util.UUID.fromString(u.toString());
        }

When you want to actually place the UUID into the Column, you'll want to convert it like this. This method is often used in conjunction with the getTimeUUID() mentioned above.

        /**
         * As byte array.
         *
         * @param uuid the uuid
         *
         * @return the byte[]
         */
        public static byte[] asByteArray(java.util.UUID uuid)
        {
            long msb = uuid.getMostSignificantBits();
            long lsb = uuid.getLeastSignificantBits();
            byte[] buffer = new byte[16];

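            // Bytes 0-7 hold the most significant long, big-endian.
            for (int i = 0; i < 8; i++) {
                    buffer[i] = (byte) (msb >>> 8 * (7 - i));
            }
            // Bytes 8-15 hold the least significant long, big-endian.
            for (int i = 8; i < 16; i++) {
                    buffer[i] = (byte) (lsb >>> 8 * (15 - i));
            }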

            return buffer;
        }
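As an alternative sketch, the same two conversions can be written with java.nio.ByteBuffer, which is big-endian by default, and the (long, long) constructor of java.util.UUID. The class and method names here (UuidBytes, toBytes, fromBytes) are illustrative and not part of any Cassandra client API.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

final class UuidBytes {
    /** Serializes a UUID as 16 big-endian bytes, most significant long first. */
    public static byte[] toBytes(UUID uuid) {
        return ByteBuffer.allocate(16)
                .putLong(uuid.getMostSignificantBits())
                .putLong(uuid.getLeastSignificantBits())
                .array();
    }

    /** Rebuilds a UUID from the 16-byte form produced by toBytes. */
    public static UUID fromBytes(byte[] bytes) {
        if (bytes.length != 16) {
            throw new IllegalArgumentException("expected 16 bytes, got " + bytes.length);
        }
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        // Arguments are evaluated left to right: first long is the most significant.
        return new UUID(buf.getLong(), buf.getLong());
    }
}
```

This version never goes through a string representation, so it avoids the eaio round trip when all you need is the byte layout.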

Further, it is often useful to create a TimeUUID object from some time other than the present: for example, to use as the lower bound in a SlicePredicate to retrieve all columns whose TimeUUID comes after time X. Most libraries don't provide this functionality, probably because this breaks the "Universal" part of UUID: this should give you pause! Never assume such a UUID is unique: use it only as a marker for a specific time.

With those disclaimers out of the way, if you feel a need to create a TimeUUID based on a specific date, here is some code that will work:

        public static java.util.UUID uuidForDate(Date d)
        {
            /*
             * Number of 100 ns intervals between the UUID epoch (1582-10-15)
             * and the Unix epoch. Magic number obtained from #cassandra's thobbs,
             * who claims to have stolen it from a Python library.
             */
            final long NUM_100NS_INTERVALS_SINCE_UUID_EPOCH = 0x01b21dd213814000L;

            long origTime = d.getTime();
            long time = origTime * 10000 + NUM_100NS_INTERVALS_SINCE_UUID_EPOCH;
            long timeLow = time &       0xffffffffL;
            long timeMid = time &   0xffff00000000L;
            long timeHi = time & 0xfff000000000000L;
            long upperLong = (timeLow << 32) | (timeMid >> 16) | (1 << 12) | (timeHi >> 48) ;
            return new java.util.UUID(upperLong, 0xC000000000000000L);
        }
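Going the other way, the time embedded in a version 1 UUID can be read back with java.util.UUID.timestamp(), which returns the count of 100 ns intervals since the UUID epoch. The sketch below repeats uuidForDate so that it is self-contained; the class name TimeUuidDates and the method dateForUuid are illustrative names, not an existing API.

```java
import java.util.Date;
import java.util.UUID;

final class TimeUuidDates {
    // 100 ns intervals between the UUID epoch (1582-10-15) and the Unix epoch.
    private static final long UUID_EPOCH_OFFSET = 0x01b21dd213814000L;

    /** Same bit layout as uuidForDate above: a time marker, not a unique id. */
    public static UUID uuidForDate(Date d) {
        long time = d.getTime() * 10000 + UUID_EPOCH_OFFSET;
        long timeLow = time & 0xffffffffL;
        long timeMid = time & 0xffff00000000L;
        long timeHi = time & 0xfff000000000000L;
        long upperLong = (timeLow << 32) | (timeMid >> 16) | (1 << 12) | (timeHi >> 48);
        return new UUID(upperLong, 0xC000000000000000L);
    }

    /** Recovers the millisecond timestamp embedded in a version 1 UUID. */
    public static Date dateForUuid(UUID uuid) {
        // timestamp() throws UnsupportedOperationException for non-version-1 UUIDs.
        return new Date((uuid.timestamp() - UUID_EPOCH_OFFSET) / 10000);
    }
}
```

The division by 10000 discards any sub-millisecond part, so a UUID generated by a real time-based generator maps back to the millisecond it was created in.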

I delete data from Cassandra, but disk usage stays the same. What gives?

Data you write to Cassandra gets persisted to SSTables. Since SSTables are immutable, the data can't actually be removed when you perform a delete. Instead, a marker (also called a "tombstone") is written to indicate the value's new status. Never fear though: on the first compaction that occurs between the data and the tombstone, the data will be expunged completely and the corresponding disk space recovered. See DistributedDeletes for more detail.

Why does nodetool ring only show one entry, even though my nodes logged that they see each other joining the ring?

This happens when you have the same token assigned to each node. Don't do that.

Most often this bites people who deploy by installing Cassandra on a VM (especially when using the Debian package, which auto-starts Cassandra after installation, thus generating and saving a token), then cloning that VM to other nodes.

The easiest fix is to wipe the data and commitlog directories, thus making sure that each node will generate a random token on the next restart.

Can I change the ReplicationFactor on a live cluster?

Yes, but it will require running repair to change the replica count of existing data.

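If you're reducing the ReplicationFactor: run nodetool cleanup on the cluster to remove the surplus replicated data.

If you're increasing the ReplicationFactor: run nodetool repair on the cluster so that the new replicas receive the existing data.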

Can I Store BLOBs in Cassandra?

Currently Cassandra isn't optimized specifically for large file or BLOB storage. However, files of around 64MB and smaller can be easily stored in the database without splitting them into smaller chunks. This is primarily due to the fact that Cassandra's public API is based on Thrift, which offers no streaming abilities; any value written or fetched has to fit into memory. Other non-Thrift interfaces may solve this problem in the future, but there are currently no plans to change Thrift's behavior. When planning applications that require storing BLOBs, you should also consider these attributes of Cassandra:

Nodetool says "Connection refused to host: 127.0.1.1" for any remote host. What gives?

Nodetool relies on JMX, which in turn relies on RMI, which in turn sets up its own listeners and connectors as needed on each end of the exchange. Normally all of this happens behind the scenes transparently, but incorrect name resolution for either the host connecting, or the one being connected to, can result in crossed wires and confusing exceptions.

If you are not using DNS, then make sure that your /etc/hosts files are accurate on both ends. If that fails, try setting the -Djava.rmi.server.hostname=<public name> JVM option near the bottom of cassandra-env.sh to an interface that you can reach from the remote machine.

How can I iterate over all the rows in a ColumnFamily?

Use a CQL client ([ClientOptions]) and Cassandra 2.0. The cursor support in 2.0 means you can just write "SELECT * FROM foo" and paging through the resultset will be handled automatically.

Alternatively, you may wish to use HadoopSupport.

Is there a GUI admin tool for Cassandra?

Cassandra says "ClusterName mismatch: oldClusterName != newClusterName" and refuses to start

To prevent operator errors, Cassandra stores the name of the cluster in its system table. If you need to rename a cluster for some reason, perform these steps on each node:

  1. Start the cassandra-cli connected locally to this node.

  2. Run the following:
    a. use system;
    b. set LocationInfo[utf8('L')][utf8('ClusterName')]=utf8('<new cluster name>');
    c. exit;
  3. Run nodetool flush on this node.

  4. Update cluster_name in the cassandra.yaml file to the same name used in step 2b.
  5. Restart the node.

Once all nodes have had this operation performed and been restarted, nodetool ring should show all nodes as UP.

Are batch_mutate operations atomic?

Since Cassandra 1.2, CQL batches are atomic by default (http://www.datastax.com/dev/blog/atomic-batches-in-cassandra-1-2). Thrift API users must call atomic_batch_mutate instead of batch_mutate if they want this behavior.

Is Hadoop (i.e. Map/Reduce, Pig, Hive) supported?

For the latest on Hadoop-Cassandra integration, see HadoopSupport.

Can a Cassandra cluster be multi-tenant?

Most users of Cassandra stand up a cluster for each application or related set of applications as it is much simpler to tune and troubleshoot. There has been work done to support more multi-tenant capabilities such as scheduling and auth. For more information, see MultiTenant. However the well-trodden path is definitely single-tenant.

Who is using Cassandra and for what?

See http://planetcassandra.org/Company/ViewCompany?IndustryId=-1.

Are there any ODBC drivers for Cassandra?

Yes: http://www.datastax.com/dev/blog/using-the-datastax-odbc-driver-for-apache-cassandra

Are there ways to do logging directly to Cassandra?

For information on logging directly to Cassandra, see LoggingToCassandra.

On RHEL nodes are unable to join the ring

Check if selinux is on; if it is, turn it OFF.

Is there an authentication/authorization mechanism for Cassandra?

Yes. For details, see ExtensibleAuth.

How do I bulk load data into Cassandra?

See BulkLoading

How do I unsubscribe from the email list?

Send an email to user-unsubscribe@cassandra.apache.org

Why does top report that Cassandra is using a lot more memory than the Java heap max?

Cassandra uses mmap to do zero-copy reads. That is, we use the operating system's virtual memory system to map the sstable data files into the Cassandra process' address space. This will "use" virtual memory; i.e. address space, and will be reported by tools like top accordingly, but on 64 bit systems virtual address space is effectively unlimited so you should not worry about that.

What matters from the perspective of "memory use" in the sense as it is normally meant, is the amount of data allocated on brk() or mmap'd /dev/zero, which represent real memory used. The key issue is that for a mmap'd file, there is never a need to retain the data resident in physical memory. Thus, whatever you do keep resident in physical memory is essentially just there as a cache, in the same way as normal I/O will cause the kernel page cache to retain data that you read/write.

The difference between normal I/O and mmap() is that in the mmap() case the memory is actually mapped to the process, thus affecting the virtual size as reported by top. The main argument for using mmap() instead of standard I/O is the fact that reading entails just touching memory - in the case of the memory being resident, you just read it - you don't even take a page fault (so no overhead in entering the kernel and doing a semi-context switch). This is covered in more detail here.

I'm getting java.io.IOException: Cannot run program "ln" when trying to snapshot or update a keyspace

Updating a keyspace first takes a snapshot. This involves creating hardlinks to the existing SSTables, but Java has no native way to create hard links, so it must fork "ln". When forking, there must be as much memory free as the parent process, even though the child isn't going to use it all. Because Java is a large process, this is problematic. The solution is to install Java Native Access so it can create the hard links itself.

How does Cassandra decide which nodes have what data?

The set of nodes (a single node, or several) responsible for any given piece of data is determined by:

In the case of the SimpleStrategy, replicas are placed on succeeding nodes in the ring. The first node is determined by the partitioner and the row key, and the remainder are placed on the succeeding nodes. In the case of NetworkTopologyStrategy, placement is affected by data-center and rack awareness, and will depend on how nodes in different racks or data centers are placed in the ring.

It is important to understand that Cassandra does not alter the replica set for a given row key based on changing characteristics like current load, which nodes are up or down, or which node your client happens to talk to.
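The SimpleStrategy placement described above can be sketched as a walk around a sorted ring of tokens. This is a toy model, not Cassandra's implementation: the class SimpleRing and its methods are hypothetical, each node owns exactly one token, and the key's token is passed in directly rather than computed by a partitioner.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

final class SimpleRing {
    // token -> node name, kept sorted so the map can be walked in ring order
    private final TreeMap<Long, String> ring = new TreeMap<>();

    public void addNode(long token, String node) {
        ring.put(token, node);
    }

    /** Returns the replicas for a key token: the owning node, then its successors. */
    public List<String> replicasFor(long keyToken, int replicationFactor) {
        List<String> replicas = new ArrayList<>();
        if (ring.isEmpty()) {
            return replicas;
        }
        // The first replica is the node with the smallest token >= keyToken,
        // wrapping around to the lowest token when none is larger.
        Long start = ring.ceilingKey(keyToken);
        if (start == null) {
            start = ring.firstKey();
        }
        int wanted = Math.min(replicationFactor, ring.size());
        // Walk clockwise from the starting token, then wrap to the head of the ring.
        for (String node : ring.tailMap(start, true).values()) {
            if (replicas.size() < wanted) replicas.add(node);
        }
        for (String node : ring.headMap(start, false).values()) {
            if (replicas.size() < wanted) replicas.add(node);
        }
        return replicas;
    }
}
```

Note that the result depends only on the tokens in the ring and the key's token, which matches the point above: load, node liveness, and the client's choice of coordinator play no part in the replica set.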

I have a row or key cache hit rate of 0.XX123456789 reported by JMX. Is that XX% or 0.XX% ?

XX%

What are seeds?

Seeds are used during startup to discover the cluster.

If you configure your nodes to refer to some node as a seed, nodes in your ring tend to send Gossip messages to seeds more often (refer to ArchitectureGossip for details) than to non-seeds. In other words, seeds work as hubs of the Gossip network. With seeds, each node can detect status changes of other nodes quickly.

Seeds are also used by new nodes on bootstrap to learn about other nodes in the ring. When you add a new node to the ring, you need to specify at least one live seed to contact. Once a node joins the ring, it learns about the other nodes, so it doesn't need a seed on subsequent boots.

Newer versions of Cassandra persist the cluster topology, making seeds less important than they were in the 0.6.X series, where they were used on every startup.

You can make a node a seed at any time. There is nothing special about seed nodes: if you list a node in the seed list, it is a seed.

Seeds do not auto bootstrap (i.e., if a node has itself in its seed list it will not automatically transfer data to itself). If you want a node to do that, bootstrap it first and then add it to seeds later. If you have no data (new install) you do not have to worry about bootstrap or autobootstrap at all.

Recommended usage of seeds:

Does single seed mean single point of failure?

The ring can operate or boot without a seed; however, you will not be able to add new nodes to the cluster. It is recommended to configure multiple seeds in production systems.

Why can't I call jmx method X on jconsole? (ex. getNaturalEndpoints)

Some JMX operations can't be called from jconsole because their buttons are inactive. Jconsole doesn't support array arguments, so operations that take an array as an argument can't be invoked from it. To call such operations, you need to write a JMX client or use an array-capable JMX monitoring tool.
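A minimal JMX client for this purpose might look like the sketch below, which uses only the standard javax.management API. The class name JmxArrayCall and its helper methods are illustrative. The MBean name, operation name, and parameter signature for any Cassandra operation are assumptions you must check against your version, since operation signatures have changed between releases.

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

final class JmxArrayCall {
    /** Opens a JMX connection using the standard RMI service URL form. */
    public static JMXConnector connect(String host, int port) throws java.io.IOException {
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + host + ":" + port + "/jmxrmi");
        return JMXConnectorFactory.connect(url);
    }

    /** Invokes an MBean operation; array arguments are passed like any other Object. */
    public static Object invoke(MBeanServerConnection conn, String mbean, String operation,
                                Object[] args, String[] signature) {
        try {
            return conn.invoke(new ObjectName(mbean), operation, args, signature);
        } catch (Exception e) {
            throw new IllegalStateException("JMX call failed: " + operation, e);
        }
    }
}
```

In use, you would call connect(host, 7199), obtain the connection with getMBeanServerConnection(), and pass the MBean name (for example org.apache.cassandra.db:type=StorageService) together with the operation's exact parameter types. Array types are named by their JVM class name, e.g. long[].class.getName().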

What's the maximum key size permitted?

The key (and column names) must be under 64K bytes.

Routing is O(N) of the key size and querying and updating are O(N log N). In practice these factors are usually dwarfed by other overhead, but some users with very large "natural" keys use their hashes instead to cut down the size.

I'm using Ubuntu with JNA, and weird things keep hanging and stalling and blocking and printing scary tracebacks in dmesg!

We have come across several different, but similar, sets of symptoms that might match what you're seeing. They might all have the same root cause; it's not clear. One common piece is messages like this in dmesg:

INFO: task (some_taskname):(some_pid) blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

It does not seem that anyone has had the time to track this down to the real root cause, but it does seem that upgrading the linux-image package and rebooting your instances fixes it. There is likely some bug in several of the kernel builds distributed by Ubuntu which is fixed in later versions. Versions of linux-image-* which are known not to have this problem include:

Uninstalling libjna-java, or recompiling Cassandra with the mlockall() call in CLibrary.tryMlockall() commented out, also makes at least some forms of this problem go away, but that is a much less desirable fix.

If you have more information on the problem and better ways to avoid it, please do update this space.

What are schema disagreement errors and how do I fix them?

Prior to Cassandra 1.1 and 1.2, schema updates assumed that schema changes were made one at a time. If you make multiple changes at the same time, some nodes can end up with a different schema than others. (Before 0.7.6, this could also be caused by cluster system clocks being substantially out of sync with each other.)

To fix schema disagreements, you need to force the disagreeing nodes to rebuild their schema. Here's how:

Open the cassandra-cli and run: 'connect localhost/9160;', then 'describe cluster;'. You'll see something like this:

[default@unknown] describe cluster;
Cluster Information:
   Snitch: org.apache.cassandra.locator.SimpleSnitch
   Partitioner: org.apache.cassandra.dht.RandomPartitioner
   Schema versions:
75eece10-bf48-11e0-0000-4d205df954a7: [192.168.1.9, 192.168.1.25]
5a54ebd0-bd90-11e0-0000-9510c23fceff: [192.168.1.27]

Note which schemas are in the minority and mark down those IPs -- in the above example, 192.168.1.27. Log in to each of those machines and cleanly stop the Cassandra service/process, typically by running:

At the end of this process the commit log directory (/var/lib/cassandra/commitlog) should contain only a single small file.

Remove the Schema* and Migration* sstables inside of your system keyspace (/var/lib/cassandra/data/system, if you're using the defaults).

After starting Cassandra again, this node will notice the missing information and pull in the correct schema from one of the other nodes. In version 1.0.X and before the schema is applied one mutation at a time. While it is being applied the node may log messages, such as the one below, that a Column Family cannot be found. These messages can be ignored.

ERROR [MutationStage:1] 2012-05-18 16:23:15,664 RowMutationVerbHandler.java (line 61) Error in row mutation
org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find cfId=1012

To confirm everything is on the same schema, verify that 'describe cluster;' only returns one schema version.

Why do I see "... messages dropped.." in the logs?

This is a symptom of load shedding -- Cassandra defending itself against more requests than it can handle.

Internode messages that are received by a node but are not processed within rpc_timeout are dropped rather than processed, as the coordinator node will no longer be waiting for a response. If the coordinator node does not receive Consistency Level responses before the rpc_timeout, it will return a TimedOutException to the client. If the coordinator receives Consistency Level responses, it will return success to the client.

For MUTATION messages this means that the mutation was not applied to all replicas it was sent to. The inconsistency will be repaired by Read Repair or Anti Entropy Repair.

For READ messages this means a read request may not have completed.

Load shedding is part of the Cassandra architecture. If this is a persistent issue, it is generally a sign of an overloaded node or cluster.

Cassandra dies with "java.lang.OutOfMemoryError: Map failed"

If Cassandra is dying specifically with the "Map failed" message, it means the OS is denying Java the ability to lock more memory. On Linux, this typically means memlock is limited. Check /proc/<pid of cassandra>/limits to verify this and raise it (e.g., via ulimit in bash). You may also need to increase vm.max_map_count. Note that the Debian and Red Hat packages handle this for you automatically.

Why should I avoid order-preserving partitioners?

See [Partitioners].

What happens if two updates are made with the same timestamp?

Updates must be commutative, since they may arrive in different orders on different replicas. As long as Cassandra has a deterministic way to pick the winner, the one selected is as valid as any other, and the specifics should be treated as an implementation detail. That said, in the case of a timestamp tie, Cassandra follows two rules: first, deletes take precedence over inserts/updates. Second, if there are two updates, the one with the lexically larger value is selected.
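The tie-breaking rules above can be sketched as a small reconcile function. This is a simplified model rather than Cassandra's code: the Reconcile and Cell types are hypothetical, and "lexically larger" is assumed here to mean an unsigned byte-by-byte comparison of the values.

```java
final class Reconcile {
    /** One version of a column: timestamp, tombstone flag, and value bytes. */
    static final class Cell {
        final long timestamp;
        final boolean tombstone;
        final byte[] value;

        Cell(long timestamp, boolean tombstone, byte[] value) {
            this.timestamp = timestamp;
            this.tombstone = tombstone;
            this.value = value;
        }
    }

    /** Picks the winner of two versions of the same column. */
    public static Cell reconcile(Cell a, Cell b) {
        if (a.timestamp != b.timestamp) {
            return a.timestamp > b.timestamp ? a : b;   // newer write wins
        }
        if (a.tombstone != b.tombstone) {
            return a.tombstone ? a : b;                 // tie: delete beats insert/update
        }
        if (a.tombstone) {
            return a;                                   // two deletes are equivalent
        }
        return compareUnsigned(a.value, b.value) >= 0 ? a : b; // tie: larger value wins
    }

    // Lexical comparison of byte arrays, treating each byte as unsigned.
    private static int compareUnsigned(byte[] x, byte[] y) {
        int n = Math.min(x.length, y.length);
        for (int i = 0; i < n; i++) {
            int cmp = (x[i] & 0xff) - (y[i] & 0xff);
            if (cmp != 0) return cmp;
        }
        return x.length - y.length;
    }
}
```

Because the outcome depends only on the contents of the two versions and not on argument order, every replica arrives at the same winner regardless of the order in which the updates reach it.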

Why does bootstrapping a new node fail with a "Stream failed" error?

Two main possibilities:

  1. the GC may be creating long pauses disrupting the streaming process
  2. compactions happening in the background hold streaming long enough that the TCP connection fails

In the first case, regular GC tuning advice applies. In the second case, you need to set TCP keepalive to a lower value (the default is very high on Linux). Try running the following:

$ sudo /sbin/sysctl -w net.ipv4.tcp_keepalive_time=60 net.ipv4.tcp_keepalive_intvl=60 net.ipv4.tcp_keepalive_probes=5

To make those settings permanent, add them to your /etc/sysctl.conf file.

Note: GCE's firewall will always interrupt TCP connections that are inactive for more than 10 min. Running the above command is highly recommended in that environment.

FAQ (last edited 2014-10-24 22:10:35 by mriou)