Frequently asked questions

Why can't I make Cassandra listen on 0.0.0.0 (all my addresses)?

Cassandra is a gossip-based distributed system. ListenAddress is also the "contact me here" address, i.e., the address it tells other nodes to reach it at. Telling other nodes "contact me on any of my addresses" is a bad idea; if different nodes in the cluster pick different addresses for you, Bad Things happen.

If you don't want to manually specify an IP as the ListenAddress for each node in your cluster (understandable!), leave it blank and Cassandra will use InetAddress.getLocalHost() to pick an address. Then it's up to you or your ops team to make things resolve correctly (/etc/hosts, DNS, etc).

One exception to this process is JMX, which by default binds to 0.0.0.0 (Java bug 6425769).

See CASSANDRA-256 and CASSANDRA-43 for more gory details.

What ports does Cassandra use?

By default, Cassandra uses 7000 for cluster communication (7001 if SSL is enabled), 9160 for clients (Thrift), and 7199 for JMX. The internode communication and Thrift ports are configurable in cassandra.yaml, and the JMX port is configurable in cassandra-env.sh (through JVM options). All ports are TCP. See also RunningCassandra.
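For reference, a sketch of where those settings live, shown with the default values (exact key names can vary between versions, so treat this as illustrative). In cassandra.yaml:

    storage_port: 7000
    ssl_storage_port: 7001
    rpc_port: 9160

and in cassandra-env.sh:

    JMX_PORT="7199"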

Why does Cassandra slow down after doing a lot of inserts?

This is a symptom of memory pressure, resulting in a storm of GC operations as the JVM frantically tries to free enough heap to continue to operate. Eventually, the server will crash from OutOfMemory; usually, but not always, it will be able to log this final error before the JVM terminates.

You can increase the amount of memory the JVM uses, or decrease the insert threshold before Cassandra flushes its memtables. See MemtableThresholds for details.

Setting your cache sizes too large can result in memory pressure.

What happens to existing data in my cluster when I add new nodes?

When a new node joins a cluster, it will automatically contact the other nodes in the cluster and copy the right data to itself.

In general, you should set the initial_token config option in cassandra.yaml before starting a new node. Otherwise, a suboptimal token may be selected automatically, leading to an unbalanced ring. See token selection in the operations wiki.

Does it matter which node a Thrift or higher-level client connects to?

No, any node in the cluster will work; Cassandra nodes proxy your request as needed. This leaves the client with a number of options for endpoint selection:

  1. You can maintain a list of contact nodes (all or a subset of the nodes in the cluster), and configure your clients to choose among them.
  2. Use round-robin DNS and create a record that points to a set of contact nodes (recommended).
  3. Use the describe_ring(keyspace) Thrift RPC call to obtain an up-to-date list of the nodes in the cluster and cycle through them.

  4. Deploy a load-balancer, proxy, etc.

When using a higher-level client you should investigate which, if any, options are implemented by your higher-level client to help you distribute your requests across nodes in a cluster.
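For option 1, a minimal round-robin sketch in Java (the class name and host list are placeholders; real clients usually layer failure detection on top of this):

        import java.util.List;
        import java.util.concurrent.atomic.AtomicInteger;

        /** Cycles through a fixed list of contact nodes. */
        public class ContactNodes
        {
                private final List<String> hosts;
                private final AtomicInteger next = new AtomicInteger(0);

                public ContactNodes(List<String> hosts)
                {
                        this.hosts = hosts;
                }

                /** Returns the next contact node to connect to, round-robin. */
                public String nextHost()
                {
                        // Math.abs guards against int overflow of the counter
                        return hosts.get(Math.abs(next.getAndIncrement() % hosts.size()));
                }
        }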

What kind of hardware should I run Cassandra on?

See CassandraHardware.

What are SSTables and Memtables?

See MemtableSSTable and MemtableThresholds.

Why is it so hard to work with TimeUUIDType in Java?

TimeUUIDs are difficult to use from Java clients because java.util.UUID does not support generating version 1 (time-based) UUIDs. Here is one way to work with them and Cassandra:

Use the UUID generator from: http://johannburkard.de/software/uuid/. See Time based UUID Notes

Below are three methods that are quite useful when working with UUIDs as they move in and out of Cassandra.

Generate a new UUID to use in a TimeUUIDType sorted column family.

        /**
         * Gets a new time uuid.
         *
         * @return the time uuid
         */
        public static java.util.UUID getTimeUUID()
        {
                return java.util.UUID.fromString(new com.eaio.uuid.UUID().toString());
        }

When you read out of Cassandra you get back a byte[] that needs to be converted into a TimeUUID. Since java.util.UUID doesn't offer a direct way to parse raw bytes, pass it through the eaio UUID class again.

        /**
         * Converts the raw 16 bytes of a uuid, as read from Cassandra,
         * back into a java.util.UUID.
         *
         * @param uuid the raw uuid bytes
         * @return the java.util.UUID
         */
        public static java.util.UUID toUUID( byte[] uuid )
        {
                assert uuid.length == 16;
                long msb = 0;
                long lsb = 0;
                for (int i = 0; i < 8; i++)
                        msb = (msb << 8) | (uuid[i] & 0xff);
                for (int i = 8; i < 16; i++)
                        lsb = (lsb << 8) | (uuid[i] & 0xff);

                com.eaio.uuid.UUID u = new com.eaio.uuid.UUID(msb, lsb);
                return java.util.UUID.fromString(u.toString());
        }

When you want to actually place the UUID into the Column, convert it like this. This method is often used in conjunction with the getTimeUUID() method mentioned above.

        /**
         * Converts a java.util.UUID into the 16-byte representation
         * Cassandra expects.
         *
         * @param uuid the uuid
         * @return the byte[]
         */
        public static byte[] asByteArray(java.util.UUID uuid)
        {
                long msb = uuid.getMostSignificantBits();
                long lsb = uuid.getLeastSignificantBits();
                byte[] buffer = new byte[16];

                for (int i = 0; i < 8; i++)
                        buffer[i] = (byte) (msb >>> 8 * (7 - i));
                for (int i = 8; i < 16; i++)
                        buffer[i] = (byte) (lsb >>> 8 * (15 - i));

                return buffer;
        }
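Putting the three helpers together, a quick round-trip sanity check (assuming the methods above are in scope):

        java.util.UUID original = getTimeUUID();
        byte[] raw = asByteArray(original);    // the form you store in Cassandra
        java.util.UUID restored = toUUID(raw); // the form you get back out
        assert original.equals(restored);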

Further, it is often useful to create a TimeUUID object from some time other than the present: for example, to use as the lower bound in a SlicePredicate to retrieve all columns whose TimeUUID comes after time X. Most libraries don't provide this functionality, probably because this breaks the "Universal" part of UUID: this should give you pause! Never assume such a UUID is unique: use it only as a marker for a specific time.

With those disclaimers out of the way, if you feel a need to create a TimeUUID based on a specific date, here is some code that will work:

        /**
         * Creates a time UUID whose timestamp corresponds to the given date.
         * Only use the result as a time marker (e.g. a slice bound), never
         * as a unique identifier: the clock-sequence and node bits are constant.
         *
         * @param d the date to embed in the uuid
         * @return a version 1 java.util.UUID for that instant
         */
        public static java.util.UUID uuidForDate(java.util.Date d)
        {
                // Offset between the UUID epoch (1582-10-15) and the Unix epoch
                // (1970-01-01), in 100ns intervals. Magic number obtained from
                // #cassandra's thobbs, who claims to have stolen it from a
                // Python library.
                final long NUM_100NS_INTERVALS_SINCE_UUID_EPOCH = 0x01b21dd213814000L;

                long origTime = d.getTime();
                // convert milliseconds to 100ns intervals and rebase to the UUID epoch
                long time = origTime * 10000 + NUM_100NS_INTERVALS_SINCE_UUID_EPOCH;
                long timeLow = time &       0xffffffffL;
                long timeMid = time &   0xffff00000000L;
                long timeHi  = time & 0xfff000000000000L;
                // assemble time_low | time_mid | version (1) | time_hi
                long upperLong = (timeLow << 32) | (timeMid >> 16) | (1 << 12) | (timeHi >> 48);
                // lower 64 bits: variant bits set, clock sequence and node all zero
                return new java.util.UUID(upperLong, 0xC000000000000000L);
        }

I delete data from Cassandra, but disk usage stays the same. What gives?

Data you write to Cassandra gets persisted to SSTables. Since SSTables are immutable, the data can't actually be removed when you perform a delete; instead, a marker (also called a "tombstone") is written to indicate the value's new status. Never fear, though: on the first compaction that includes both the data and its tombstone, the data will be expunged completely and the corresponding disk space recovered. See DistributedDeletes for more detail.

Why does nodetool ring only show one entry, even though my nodes logged that they see each other joining the ring?

This happens when you have the same token assigned to each node. Don't do that.

Most often this bites people who deploy by installing Cassandra on a VM (especially when using the Debian package, which auto-starts Cassandra after installation, thus generating and saving a token), then cloning that VM to other nodes.

The easiest fix is to wipe the data and commitlog directories, thus making sure that each node will generate a random token on the next restart.
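For example, with the default directory locations (check data_file_directories and commitlog_directory in your cassandra.yaml before running anything like this), stop Cassandra on the node and run:

    rm -rf /var/lib/cassandra/data/* /var/lib/cassandra/commitlog/*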

Why do deleted keys show up during range scans?

Because get_range_slice says, "apply this predicate to the range of rows given," meaning, if the predicate result is empty, we have to include an empty result for that row key. It is perfectly valid to perform such a query returning empty column lists for some or all keys, even if no deletions have been performed.

So to special-case omitting result entries for deletions, we would have to check the entire rest of the row to make sure there is no undeleted data anywhere else (in which case leaving the key out would be an error).

This is what we used to do with the old get_key_range method, but the performance hit turned out to be unacceptable.

See DistributedDeletes for more on how deletes work in Cassandra.

Can I change the ReplicationFactor on a live cluster?

Yes, but it will require running repair to change the replica count of existing data.

If you're reducing the ReplicationFactor:

  • Run "nodetool cleanup" on the cluster to remove surplus replicated data. Cleanup runs on a per-node basis.

If you're increasing the ReplicationFactor:

  • Run "nodetool repair" to run an anti-entropy repair on the cluster. Repair runs on a per-replica set basis. This is an intensive process that may result in adverse cluster performance. It's highly recommended to do rolling repairs, as an attempt to repair the entire cluster at once will most likely swamp it.

Can I Store BLOBs in Cassandra?

Currently Cassandra isn't optimized for large file or BLOB storage. However, files of around 64 MB and smaller can be stored in the database without splitting them into smaller chunks. This is primarily because Cassandra's public API is based on Thrift, which offers no streaming ability: any value written or fetched has to fit in memory. Other, non-Thrift interfaces may solve this problem in the future, but there are currently no plans to change Thrift's behavior. When planning applications that require storing BLOBs, you should also consider these attributes of Cassandra:

  • The main limitation on column and supercolumn size is that all the data for a single key and column must fit (on disk) on a single machine (node) in the cluster. Because keys alone determine the nodes responsible for replicating their data, the amount of data associated with a single key has this upper bound. This is an inherent limitation of the distribution model.
  • When large columns are created and retrieved, the column's data is loaded into RAM, which can become resource-intensive quickly. Consider loading 200 rows whose columns each store a 10 MB image: that single result set would consume about 2 GB of RAM. This can be worked around, but it will take some upfront planning and testing to arrive at a workable solution for most applications. You can find more information regarding this behavior here: memtables, and a possible solution in 0.7 here: CASSANDRA-16.
  • Please refer to the notes in the Cassandra limitations section for more information: Cassandra Limitations

Nodetool says "Connection refused to host: 127.0.1.1" for any remote host. What gives?

Nodetool relies on JMX, which in turn relies on RMI, which in turn sets up its own listeners and connectors as needed on each end of the exchange. Normally all of this happens behind the scenes transparently, but incorrect name resolution for either the host connecting, or the one being connected to, can result in crossed wires and confusing exceptions.

If you are not using DNS, then make sure that your /etc/hosts files are accurate on both ends. If that fails, try setting the -Djava.rmi.server.hostname=<public name> JVM option near the bottom of cassandra-env.sh to an interface that you can reach from the remote machine.
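For example, near the bottom of cassandra-env.sh (the address shown is a placeholder for your node's publicly reachable IP):

    JVM_OPTS="$JVM_OPTS -Djava.rmi.server.hostname=192.168.1.10"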

How can I iterate over all the rows in a ColumnFamily?

Simple but slow: Use get_range_slices, start with the empty string, and after each call use the last key read as the start key in the next iteration.
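A rough sketch of that loop against the raw Thrift API (0.7-style signatures; client is assumed to be an open, keyspace-bound Cassandra.Client; the column family name, page size, and process() handler are placeholders; exception handling omitted):

        ColumnParent parent = new ColumnParent("MyColumnFamily");
        SlicePredicate predicate = new SlicePredicate();
        predicate.setSlice_range(new SliceRange(
                java.nio.ByteBuffer.wrap(new byte[0]),   // all columns,
                java.nio.ByteBuffer.wrap(new byte[0]),
                false, 1000));                           // up to 1000 per row

        final int pageSize = 100;
        java.nio.ByteBuffer start = java.nio.ByteBuffer.wrap(new byte[0]);
        while (true)
        {
                KeyRange range = new KeyRange();
                range.setStart_key(start);
                range.setEnd_key(java.nio.ByteBuffer.wrap(new byte[0]));
                range.setCount(pageSize);
                java.util.List<KeySlice> batch =
                        client.get_range_slices(parent, predicate, range, ConsistencyLevel.ONE);
                // every batch after the first starts with the previous batch's last key, so skip it
                int from = (start.remaining() == 0) ? 0 : 1;
                for (int i = from; i < batch.size(); i++)
                        process(batch.get(i));
                if (batch.size() < pageSize)
                        break;                           // last page reached
                start = batch.get(batch.size() - 1).key;
        }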

Most clients support an easy way to do this. For example, pycassa's get_range(), and phpcassa's get_range() return an iterator that fetches the next batch of rows automatically. Hector has an example of how to do this.

Better: use HadoopSupport.

Why were none of the keyspaces described in storage-conf.xml loaded?

Prior to 0.7, Cassandra loaded a set of static keyspaces defined in a storage-conf.xml file. CASSANDRA-44 added the ability to modify schema dynamically on a live cluster. Part of this change required that we ignore the schema defined in storage-conf.xml. Additionally, 0.7 converts to YAML-based configuration.

If you have an existing storage-conf.xml file, you will first need to convert it to YAML using the bin/config-converter tool, which can generate a cassandra.yaml file from a storage-conf.xml file. Once you have a cassandra.yaml, it is possible to do a one-time load of the schema it defines. 0.7 adds a loadSchemaFromYAML method to StorageServiceMBean (triggered via JMX: see https://issues.apache.org/jira/browse/CASSANDRA-1001 ) which will load the schema defined in cassandra.yaml, but this is a one-time operation. A node that has had its schema defined via loadSchemaFromYAML will load its schema from the system table on subsequent restarts, which means that any further changes to the schema need to be made using the system_* thrift operations (see API).

It is recommended that you only perform schema updates on one node and let cassandra propagate changes to the rest of the cluster. If you try to perform the same updates simultaneously on multiple nodes, you run the risk of introducing inconsistent migrations, which will lead to a confused cluster.

See LiveSchemaUpdates for more information.

Is there a GUI admin tool for Cassandra?

Cassandra says "ClusterName mismatch: oldClusterName != newClusterName" and refuses to start

To prevent operator errors, Cassandra stores the name of the cluster in its system table. If you need to rename a cluster for some reason, you can:

Perform these steps on each node:

  1. Start the cassandra-cli connected locally to this node.
  2. Run the following:
    1. use system;
    2. set LocationInfo[utf8('L')][utf8('ClusterName')]=utf8('<new cluster name>');
    3. exit;
  3. Run nodetool flush on this node.
  4. Update cluster_name in cassandra.yaml to the new cluster name you set in step 2.
  5. Restart the node.

Once all nodes have had this operation performed and been restarted, nodetool ring should show all nodes as UP.

Are batch_mutate operations atomic?

As a special case, mutations against a single key are atomic but not isolated. Reads which occur during such a mutation may see part of the write before they see the whole thing. More generally, batch_mutate operations are not atomic. batch_mutate allows grouping operations on many keys into a single call in order to save on the cost of network round-trips. If batch_mutate fails in the middle of its list of mutations, no rollback occurs and the mutations that have already been applied stay applied. The client should typically retry the batch_mutate operation.
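Because Cassandra writes are idempotent as long as the client-supplied timestamps are unchanged, a simple retry wrapper is usually enough. A sketch (mutationMap is the Map<ByteBuffer, Map<String, List<Mutation>>> you built for batch_mutate; the retry count is arbitrary; other Thrift exceptions omitted):

        for (int attempt = 0; attempt < 3; attempt++)
        {
                try
                {
                        client.batch_mutate(mutationMap, ConsistencyLevel.QUORUM);
                        break; // success: any re-applied mutations were no-ops
                }
                catch (TimedOutException e)
                {
                        // safe to replay the whole batch, since the timestamps
                        // inside the mutations have not changed; if the loop
                        // exhausts its retries, surface the failure to the caller
                }
        }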

Is Hadoop (i.e. Map/Reduce, Pig, Hive) supported?

For the latest on Hadoop-Cassandra integration, see HadoopSupport.

Can a Cassandra cluster be multi-tenant?

There is work being done to support more multi-tenant capabilities such as scheduling and auth. For more information, see MultiTenant.

Who is using Cassandra and for what?

See CassandraUsers.

Are there any ODBC drivers for Cassandra?

No.

Are there ways to do logging directly to Cassandra?

For information on logging directly to Cassandra, see LoggingToCassandra.

On RHEL nodes are unable to join the ring

Check if SELinux is on; if it is, turn it off.

Is there an authentication/authorization mechanism for Cassandra?

Yes. For details, see ExtensibleAuth.

How do I bulk load data into Cassandra?

See BulkLoading.

Why aren't range slices/sequential scans giving me the expected results?

You're probably using the RandomPartitioner. This is the default because it avoids hotspots, but it means your rows are ordered by the md5 of the row key rather than lexicographically by the raw key bytes.

You can start out with a start key and end key of [empty] and use the row count argument instead, if your goal is paging the rows. To get the next page, start from the last key you got in the previous page. This is what the Cassandra Hadoop RecordReader does, for instance.

You can also use intra-row ordering of column names to get ordered results within a row; with appropriate row 'bucketing,' you often don't need the rows themselves to be ordered.

How do I unsubscribe from the email list?

Send an email to user-unsubscribe@cassandra.apache.org

Why does top report that Cassandra is using a lot more memory than the Java heap max?

Cassandra uses mmap to do zero-copy reads. That is, we use the operating system's virtual memory system to map the sstable data files into the Cassandra process' address space. This will "use" virtual memory; i.e. address space, and will be reported by tools like top accordingly, but on 64 bit systems virtual address space is effectively unlimited so you should not worry about that.

What matters, from the perspective of "memory use" in the sense it is normally meant, is the amount of data allocated via brk() or mmap'd /dev/zero, which represents real memory used. The key issue is that for an mmap'd file, there is never a need to retain the data resident in physical memory. Thus, whatever you do keep resident in physical memory is essentially just there as a cache, in the same way as normal I/O will cause the kernel page cache to retain data that you read/write.

The difference between normal I/O and mmap() is that in the mmap() case the memory is actually mapped to the process, thus affecting the virtual size as reported by top. The main argument for using mmap() instead of standard I/O is the fact that reading entails just touching memory - in the case of the memory being resident, you just read it - you don't even take a page fault (so no overhead in entering the kernel and doing a semi-context switch). This is covered in more detail here.

I'm getting java.io.IOException: Cannot run program "ln" when trying to snapshot or update a keyspace

Updating a keyspace first takes a snapshot. This involves creating hardlinks to the existing SSTables, but Java has no native way to create hard links, so it must fork "ln". When forking, there must be as much memory free as the parent process, even though the child isn't going to use it all. Because Java is a large process, this is problematic. The solution is to install Java Native Access so it can create the hard links itself.

How does Cassandra decide which nodes have what data?

The set of nodes (a single node, or several) responsible for any given piece of data is determined by:

  • The row key (data is partitioned on row key)
  • The replication factor (decides how many nodes are in the replica set for a given row)

  • The replication strategy (decides which nodes are part of said replica set)

In the case of SimpleStrategy, replicas are placed on succeeding nodes in the ring: the first node is determined by the partitioner and the row key, and the remainder are placed on the nodes that follow it. In the case of NetworkTopologyStrategy, placement is affected by data-center and rack awareness, and will depend on how nodes in different racks or data centers are placed in the ring.
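To make the SimpleStrategy case concrete, here is an illustrative sketch (not Cassandra's actual implementation) of "walk the ring clockwise from the row's token and take the next RF distinct nodes":

        /**
         * Illustrative only: given a ring of token -> node (one token per
         * node, pre-vnodes style) and a row's token, return the nodes that
         * would replicate that row under SimpleStrategy.
         */
        public static java.util.List<String> replicasFor(java.math.BigInteger rowToken,
                        java.util.SortedMap<java.math.BigInteger, String> ring, int replicationFactor)
        {
                java.util.List<String> replicas = new java.util.ArrayList<String>();
                int wanted = Math.min(replicationFactor, ring.size());
                // start at the first node whose token is >= the row's token...
                java.util.Iterator<String> iter = ring.tailMap(rowToken).values().iterator();
                while (replicas.size() < wanted)
                {
                        if (!iter.hasNext())
                                iter = ring.values().iterator(); // ...wrapping around the ring
                        String node = iter.next();
                        if (!replicas.contains(node))
                                replicas.add(node);
                }
                return replicas;
        }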

It is important to understand that Cassandra does not alter the replica set for a given row key based on changing characteristics like current load, which nodes are up or down, or which node your client happens to talk to.

I have a row or key cache hit rate of 0.XX123456789 reported by JMX. Is that XX% or 0.XX% ?

XX%

What are seeds?

Seeds are used during startup to discover the cluster.

If you configure your nodes to refer to some node as a seed, nodes in your ring tend to send gossip messages to seeds more often than to non-seeds (refer to ArchitectureGossip for details). In other words, seeds act as hubs of the gossip network, so each node can detect status changes of other nodes quickly.

Seeds are also referred to by new nodes on bootstrap to learn about the other nodes in the ring. When you add a new node to the ring, you need to specify at least one live seed to contact. Once a node joins the ring it learns about the other nodes, so it doesn't need the seed list on subsequent boots.

Newer versions of Cassandra persist the cluster topology, making seeds less important than they were in the 0.6.X series, where they were used on every startup.

You can make a node a seed at any time. There is nothing special about seed nodes: if you list a node in the seed list, it is a seed.

Seeds do not auto-bootstrap (i.e., if a node has itself in its seed list, it will not automatically transfer data to itself). If you want a node to do that, bootstrap it first and then add it to the seed list later. If you have no data (a new install) you do not have to worry about bootstrap or autobootstrap at all.

Recommended usage of seeds:

  • pick two (or more) nodes per data center as seed nodes.
  • sync the seed list to all your nodes

Does single seed mean single point of failure?

If you are using a replicated CF on the ring, a single seed does not constitute a single point of failure: the ring can operate and boot without it. However, status changes of nodes will take longer to spread across the ring, so it is recommended to have multiple seeds in a production system.

Why can't I call jmx method X on jconsole? (ex. getNaturalEndpoints)

Some JMX operations can't be called from jconsole because their buttons are inactive. Jconsole doesn't support array arguments, so operations that take an array as an argument can't be invoked from jconsole. You need to write a JMX client, or use an array-capable JMX monitoring tool, to call such operations.

What's the maximum key size permitted?

The key (and column names) must be under 64K bytes.

Routing is O(N) of the key size and querying and updating are O(N log N). In practice these factors are usually dwarfed by other overhead, but some users with very large "natural" keys use their hashes instead to cut down the size.
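For example, a hypothetical helper that substitutes the MD5 of an oversized natural key (the method name is made up; any stable hash works):

        /** Row key for a natural key too large to use directly. */
        public static byte[] rowKeyFor(String largeNaturalKey) throws Exception
        {
                java.security.MessageDigest md5 = java.security.MessageDigest.getInstance("MD5");
                return md5.digest(largeNaturalKey.getBytes("UTF-8")); // 16 bytes, well under 64K
        }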

I'm using Ubuntu with JNA, and holy crap weird things keep hanging and stalling and blocking and printing scary tracebacks in dmesg!

We have come across several different, but similar, sets of symptoms that might match what you're seeing. They might all have the same root cause; it's not clear. One common piece is messages like this in dmesg:

INFO: task (some_taskname):(some_pid) blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

It does not seem that anyone has had the time to track this down to the real root cause, but it does seem that upgrading the linux-image package and rebooting your instances fixes it. There is likely some bug in several of the kernel builds distributed by Ubuntu which is fixed in later versions. Versions of linux-image-* which are known not to have this problem include:

  • linux-image-2.6.38-10-virtual (2.6.38-10.46) (Ubuntu 11.04/Natty Narwhal)
  • linux-image-2.6.35-24-virtual (2.6.35-24.42) (Ubuntu 10.10/Maverick Meerkat)

Uninstalling libjna-java, or recompiling Cassandra with the mlockall() call in CLibrary.tryMlockall() commented out, also makes at least some variants of this problem go away, but that is a much less desirable fix.

If you have more information on the problem and better ways to avoid it, please do update this space.

What are schema disagreement errors and how do I fix them?

Prior to Cassandra 1.1, schema updates assume that schema changes are done one at a time. If you make multiple changes at the same time, you can cause some nodes to end up with a different schema than others. (Before 0.7.6, this can also be caused by cluster system clocks being substantially out of sync with each other.)

To fix schema disagreements, you need to force the disagreeing nodes to rebuild their schema. Here's how:

Open the cassandra-cli and run: 'connect localhost/9160;', then 'describe cluster;'. You'll see something like this:

[default@unknown] describe cluster;
Cluster Information:
   Snitch: org.apache.cassandra.locator.SimpleSnitch
   Partitioner: org.apache.cassandra.dht.RandomPartitioner
   Schema versions:
75eece10-bf48-11e0-0000-4d205df954a7: [192.168.1.9, 192.168.1.25]
5a54ebd0-bd90-11e0-0000-9510c23fceff: [192.168.1.27]

Note which schemas are in the minority and mark down those IPs; in the above example, 192.168.1.27. Log in to each of those machines and cleanly stop the Cassandra service/process, typically by running:

  • nodetool disablethrift
  • nodetool disablegossip
  • nodetool drain
  • 'sudo service cassandra stop' or 'kill <pid>'.

At the end of this process the commit log directory (/var/lib/cassandra/commitlog) should contain only a single small file.

Remove the Schema* and Migration* sstables inside of your system keyspace (/var/lib/cassandra/data/system, if you're using the defaults).

After starting Cassandra again, this node will notice the missing information and pull in the correct schema from one of the other nodes. In versions 1.0.x and before, the schema is applied one mutation at a time; while it is being applied, the node may log messages, such as the one below, saying that a column family cannot be found. These messages can be ignored.

ERROR [MutationStage:1] 2012-05-18 16:23:15,664 RowMutationVerbHandler.java (line 61) Error in row mutation
org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find cfId=1012

To confirm everything is on the same schema, verify that 'describe cluster;' only returns one schema version.

Why do I see "... messages dropped.." in the logs?

Internode messages which are received by a node but are not processed within rpc_timeout are dropped rather than processed, since by then the coordinator node will no longer be waiting for a response. If the coordinator does not receive the number of responses required by the consistency level before rpc_timeout, it returns a TimedOutException to the client; if it does receive them, it returns success.

For MUTATION messages this means that the mutation was not applied to all replicas it was sent to. The inconsistency will be repaired by Read Repair or Anti Entropy Repair.

For READ messages this means a read request may not have completed.

Load shedding is part of the Cassandra architecture; if this is a persistent issue it is generally a sign of an overloaded node or cluster.

Why does the 0.8 cli not assume keys are strings anymore?

Prior to 0.8, there was no type metadata available for row keys, and the cli treated all keys as strings. This made the cli unusable for the many applications whose row keys were numeric, UUID, or other non-string data.

0.8 added key_validation_class to the ColumnFamily definition, similarly to the existing comparator for column names, and column_metadata validation_class for column values. This both lets clients know the expected data type, and rejects updates with non-conformant values.

To preserve application compatibility, the default key_validation_class is BytesType, i.e., "anything goes." The CLI expects bytes to be provided in hex.

If all your keys are of the same type, you should add that information to the CF metadata, e.g., "update column family <cf> with key_validation_class = 'UTF8Type'". If you have heterogeneous keys, you can tell the cli what type to use on a case-by-case basis, as in, "assume <cf> keys as utf8".

Cassandra dies with "java.lang.OutOfMemoryError: Map failed"

If Cassandra is dying specifically with the "Map failed" message, it means the OS is denying Java the ability to lock more memory. On Linux, this typically means memlock is limited. Check /proc/<pid of cassandra>/limits to verify this and raise it (e.g., via ulimit in bash). You may also need to increase vm.max_map_count. Note that the Debian and RedHat packages handle this for you automatically.

Why should I avoid order-preserving partitioners?

See Partitioners.
