Frequently asked questions

Why can't I make Cassandra listen on 0.0.0.0 (all my addresses)?

Cassandra is a gossip-based distributed system. ListenAddress is also the "contact me here" address, i.e., the address it tells other nodes to reach it at. Telling other nodes "contact me on any of my addresses" is a bad idea; if different nodes in the cluster pick different addresses for you, Bad Things happen.

If you don't want to manually specify an IP for ListenAddress on each node in your cluster (understandable!), leave it blank and Cassandra will use InetAddress.getLocalHost() to pick an address. Then it's up to you or your ops team to make things resolve correctly (/etc/hosts, DNS, etc.).
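
If you want to check which address InetAddress.getLocalHost() resolves to on a given machine, a small sketch like the following can help (the class name is just for illustration):

        import java.net.InetAddress;

        /** Prints the address Cassandra would pick when ListenAddress is left blank. */
        public class WhichAddress
        {
            public static void main(String[] args) throws Exception
            {
                System.out.println(InetAddress.getLocalHost().getHostAddress());
            }
        }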

One exception to this process is JMX, which by default binds to 0.0.0.0 (Java bug 6425769).

See CASSANDRA-256 and CASSANDRA-43 for more gory details.

What ports does Cassandra use?

By default, Cassandra uses 7000 for cluster communication, 9160 for clients (Thrift), and 8080 for JMX. These are all editable in the configuration file or bin/cassandra.in.sh (for JVM options). All ports are TCP. See also RunningCassandra.

Why does Cassandra slow down after doing a lot of inserts?

This is a symptom of memory pressure, resulting in a storm of GC operations as the JVM frantically tries to free enough heap to continue to operate. Eventually, the server will crash from OutOfMemory; usually, but not always, it will be able to log this final error before the JVM terminates.

You can increase the amount of memory the JVM uses, or decrease the insert threshold before Cassandra flushes its memtables. See MemtableThresholds for details.

Setting your cache sizes too large can result in memory pressure.

What happens to existing data in my cluster when I add new nodes?

Starting a new node with the -b [bootstrap] option will cause it to contact other nodes in the cluster to copy the right data to itself.

In Cassandra 0.5 and above, there is an AutoBootstrap option in the config file. When enabled, the "-b" option is unnecessary, because new nodes will automatically bootstrap themselves when they start up for the first time. Even with AutoBootstrap, it is recommended that you always specify the InitialToken, because automatic token selection will almost certainly result in an unbalanced ring. If you are building the initial cluster, you certainly don't want to leave InitialToken blank: pick the tokens such that the ring will be balanced afterward, and explicitly set them on each node. See token selection in the operations wiki.
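
As a minimal sketch of balanced token assignment (assuming RandomPartitioner, whose token space spans 0 to 2^127), evenly spaced tokens for an N-node cluster can be computed like this; the class name is illustrative:

        import java.math.BigInteger;

        /** Prints one evenly spaced InitialToken per node: token_i = i * 2**127 / N. */
        public class TokenCalculator
        {
            public static void main(String[] args)
            {
                int nodeCount = Integer.parseInt(args[0]);
                BigInteger range = BigInteger.valueOf(2).pow(127);
                for (int i = 0; i < nodeCount; i++)
                {
                    BigInteger token = range.multiply(BigInteger.valueOf(i))
                                            .divide(BigInteger.valueOf(nodeCount));
                    System.out.println("node " + i + ": InitialToken = " + token);
                }
            }
        }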

Unless you know precisely what you're doing and are aware of how the Cassandra internals work, you should never introduce a new, empty node to your cluster with autobootstrap disabled. In version 0.7, under write load, writes will be sent to the new node before the schema arrives from another member of the cluster. It would also indicate to clients that the new node is responsible for servicing reads of data that it definitely doesn't have.

In Cassandra 0.4 and below, it is recommended that you manually specify a value for "InitialToken" in the config file of a new node.

Can I add/remove/rename Column Families on a working cluster?

Yes, but it's important that you do it correctly.

  1. Empty the commitlog with "nodetool drain."
  2. Shutdown Cassandra and verify that there is no remaining data in the commitlog.
  3. Delete the sstable files (-Data.db, -Index.db, and -Filter.db) for any CFs removed, and rename the files for any CFs that were renamed.
  4. Make necessary changes to your storage-conf.xml.
  5. Start Cassandra back up and your edits should take effect.

See also: CASSANDRA-44

Does it matter which node a Thrift or higher-level client connects to?

No, any node in the cluster will work; Cassandra nodes proxy your request as needed. This leaves the client with a number of options for endpoint selection:

  1. You can maintain a list of contact nodes (all or a subset of the nodes in the cluster), and configure your clients to choose among them.
  2. Use round-robin DNS and create a record that points to a set of contact nodes (recommended).
  3. Use the get_string_property("token map") RPC to obtain an up-to-date list of the nodes in the cluster and cycle through them.
  4. Deploy a load-balancer, proxy, etc.

When using a higher-level client, investigate which of these options, if any, it implements to help you distribute requests across the nodes in a cluster.
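
As an illustration of option 1, host selection can be as simple as cycling through a fixed list; this sketch is not any particular client library's API:

        import java.util.List;
        import java.util.concurrent.atomic.AtomicInteger;

        /** Cycles through a fixed list of contact nodes in round-robin order. */
        public class ContactNodes
        {
            private final List<String> hosts; // e.g. ["10.0.0.1", "10.0.0.2"]
            private final AtomicInteger next = new AtomicInteger();

            public ContactNodes(List<String> hosts)
            {
                this.hosts = hosts;
            }

            /** Returns the next host to connect to, wrapping at the end of the list. */
            public String nextHost()
            {
                int i = (next.getAndIncrement() & Integer.MAX_VALUE) % hosts.size();
                return hosts.get(i);
            }
        }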

What kind of hardware should I run Cassandra on?

See CassandraHardware.

What are SSTables and Memtables?

See MemtableSSTable and MemtableThresholds.

Why is it so hard to work with TimeUUIDType in Java?

TimeUUIDs are difficult to use from Java clients because java.util.UUID does not support generating version 1 (time-based) UUIDs. Here is one way to work with them in Cassandra:

Use the UUID generator from: http://johannburkard.de/software/uuid/

Below are three methods that are quite useful for working with UUIDs as they go in and out of Cassandra.

Generate a new UUID to use in a TimeUUIDType sorted column family.

        /**
         * Gets a new time uuid.
         *
         * @return the time uuid
         */
        public static java.util.UUID getTimeUUID()
        {
                return java.util.UUID.fromString(new com.eaio.uuid.UUID().toString());
        }

When you read out of Cassandra you get a byte[] that needs to be converted back into a TimeUUID. Reassemble the two long halves from the 16 bytes and hand them to java.util.UUID:

        /**
         * Converts the 16 raw bytes read from Cassandra back into a
         * java.util.UUID by reassembling the two longs.
         *
         * @param uuid the raw 16 bytes of the UUID
         * @return the corresponding java.util.UUID
         */
        public static java.util.UUID toUUID( byte[] uuid )
        {
                assert uuid.length == 16;
                long msb = 0;
                long lsb = 0;
                for (int i = 0; i < 8; i++)
                        msb = (msb << 8) | (uuid[i] & 0xff);
                for (int i = 8; i < 16; i++)
                        lsb = (lsb << 8) | (uuid[i] & 0xff);
                // java.util.UUID can be built directly from the two longs.
                return new java.util.UUID(msb, lsb);
        }

When you want to actually place the UUID into a Column, you'll want to convert it to bytes like this. This method is often used in conjunction with the getTimeUUID() mentioned above.

        /**
         * As byte array.
         *
         * @param uuid the uuid
         *
         * @return the byte[]
         */
        public static byte[] asByteArray(java.util.UUID uuid)
        {
            long msb = uuid.getMostSignificantBits();
            long lsb = uuid.getLeastSignificantBits();
            byte[] buffer = new byte[16];

            for (int i = 0; i < 8; i++) {
                    buffer[i] = (byte) (msb >>> 8 * (7 - i));
            }
            for (int i = 8; i < 16; i++) {
                    buffer[i] = (byte) (lsb >>> 8 * (7 - i));
            }

            return buffer;
        }

Further, it is often useful to create a TimeUUID object from some time other than the present: for example, to use as the lower bound in a SlicePredicate to retrieve all columns whose TimeUUID comes after time X. Most libraries don't provide this functionality, probably because this breaks the "Universal" part of UUID: this should give you pause! Never assume such a UUID is unique: use it only as a marker for a specific time.

With those disclaimers out of the way, if you feel a need to create a TimeUUID based on a specific date, here is some code that will work:

        public static java.util.UUID uuidForDate(Date d)
        {
            /*
             * Magic number obtained from #cassandra's thobbs, who
             * claims to have stolen it from a Python library.
             */
            final long NUM_100NS_INTERVALS_SINCE_UUID_EPOCH = 0x01b21dd213814000L;

            long origTime = d.getTime();
            long time = origTime * 10000 + NUM_100NS_INTERVALS_SINCE_UUID_EPOCH;
            long timeLow = time &       0xffffffffL;
            long timeMid = time &   0xffff00000000L;
            long timeHi = time & 0xfff000000000000L;
            long upperLong = (timeLow << 32) | (timeMid >> 16) | (1 << 12) | (timeHi >> 48) ;
            return new java.util.UUID(upperLong, 0xC000000000000000L);
        }
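
As a quick sanity check of the bit-packing above: java.util.UUID.timestamp() returns the 60-bit time value of a version 1 UUID, so reversing the arithmetic should recover the original millisecond time. A minimal check, using uuidForDate as defined above:

        Date d = new Date();
        java.util.UUID u = uuidForDate(d);
        long millis = (u.timestamp() - 0x01b21dd213814000L) / 10000;
        assert millis == d.getTime(); // the packing round-trips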

I delete data from Cassandra, but disk usage stays the same. What gives?

Data you write to Cassandra gets persisted to SSTables. Since SSTables are immutable, the data can't actually be removed when you perform a delete; instead, a marker (also called a "tombstone") is written to indicate the value's new status. Never fear, though: on the first compaction that occurs after GCGraceSeconds (hint: storage-conf.xml) has expired, the data will be expunged completely and the corresponding disk space recovered. See DistributedDeletes for more detail.

Why are reads slower than writes?

Unlike all major relational databases and some NoSQL systems, Cassandra does not use b-trees and in-place updates on disk. Instead, it uses a sstable/memtable model like Bigtable's: writes to each ColumnFamily are grouped together in an in-memory structure before being flushed (sorted and written to disk). This means that writes cost no random I/O, compared to a b-tree system which not only has to seek to the data location to overwrite, but also may have to seek to read different levels of the index if it outgrows disk cache!

The downside is that on a read, Cassandra has to (potentially) merge row fragments from multiple sstables on disk. We think this is a tradeoff worth making, first because scaling writes has always been harder than scaling reads, and second because as your data corpus grows Cassandra's read disadvantage narrows vs b-tree systems that have to do multiple seeks against a large index. See MemtableSSTable for more details.

Why does nodeprobe ring only show one entry, even though my nodes logged that they see each other joining the ring?

This happens when you have the same token assigned to each node. Don't do that.

Most often this bites people who deploy by installing Cassandra on a VM (especially when using the Debian package, which auto-starts Cassandra after installation, thus generating and saving a token), then cloning that VM to other nodes.

The easiest fix is to wipe the data and commitlog directories, thus making sure that each node will generate a random token on the next restart.

Why do deleted keys show up during range scans?

Because get_range_slice says, "apply this predicate to the range of rows given," meaning, if the predicate result is empty, we have to include an empty result for that row key. It is perfectly valid to perform such a query returning empty column lists for some or all keys, even if no deletions have been performed.

So to special case leaving out result entries for deletions, we would have to check the entire rest of the row to make sure there is no undeleted data anywhere else either (in which case leaving the key out would be an error).

This is what we used to do with the old get_key_range method, but the performance hit turned out to be unacceptable.

See DistributedDeletes for more on how deletes work in Cassandra.

Can I change the ReplicationFactor on a live cluster?

Yes, but it will require restarting and running repair manually to change the replica count of existing data.

  • Alter the ReplicationFactor for the desired keyspace(s) in the storage configuration on each node in the cluster.

  • Restart cassandra on each node in the cluster.

If you're reducing the ReplicationFactor:

  • Run "nodetool cleanup" on the cluster to remove surplus replicated data. Cleanup runs on a per-node basis.

If you're increasing the ReplicationFactor:

  • Run "nodetool repair" to run an anti-entropy repair on the cluster. Repair runs on a per-node basis. This is an intensive process that will result in adverse cluster performance. It's highly recommended to do rolling repairs, as an attempt to repair the entire cluster at once will most likely swamp it and may even kill it.

Can I store BLOBs in Cassandra?

Currently Cassandra isn't optimized specifically for large file or BLOB storage. However, files of around 64 MB and smaller can easily be stored in the database without splitting them into smaller chunks. This is primarily because Cassandra's public API is based on Thrift, which offers no streaming abilities; any value written or fetched has to fit in memory. Other, non-Thrift interfaces may solve this problem in the future, but there are currently no plans to change Thrift's behavior. When planning applications that require storing BLOBs, you should also consider these attributes of Cassandra:

  • The main limitation on column and super column size is that all the data for a single key and column must fit (on disk) on a single machine (node) in the cluster. Because keys alone are used to determine the nodes responsible for replicating their data, the amount of data associated with a single key has this upper bound. This is an inherent limitation of the distribution model.
  • When large columns are created and retrieved, that column's data is loaded into RAM, which can get resource-intensive quickly. Consider loading 200 rows whose columns each store a 10 MB image: that small result set would consume about 2 GB of RAM. Clearly, as more and more large columns are loaded, RAM is consumed quickly. This can be worked around (see the sketch after this list), but it will take some upfront planning and testing to arrive at a workable solution for most applications. You can find more information regarding this behavior here: memtables, and a possible solution in 0.7 here: CASSANDRA-16.
  • Please refer to the notes in the Cassandra limitations section for more information: Cassandra Limitations
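
As a sketch of the chunking workaround mentioned above (the class below is purely illustrative, not a Cassandra API), a large blob can be split into fixed-size pieces and written as one column per chunk under a single row key:

        import java.util.ArrayList;
        import java.util.Arrays;
        import java.util.List;

        /** Splits a blob into fixed-size chunks, e.g. one column per chunk under one row key. */
        public class BlobChunker
        {
            public static List<byte[]> chunk(byte[] blob, int chunkSize)
            {
                List<byte[]> chunks = new ArrayList<byte[]>();
                for (int off = 0; off < blob.length; off += chunkSize)
                {
                    int end = Math.min(off + chunkSize, blob.length);
                    chunks.add(Arrays.copyOfRange(blob, off, end));
                }
                return chunks;
            }
        }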

Nodetool says "Connection refused to host: 127.0.1.1" for any remote host. What gives?

Nodetool relies on JMX, which in turn relies on RMI, which in turn sets up its own listeners and connectors as needed on each end of the exchange. Normally all of this happens transparently behind the scenes, but incorrect name resolution for either the host connecting, or the one being connected to, can result in crossed wires and confusing exceptions.

If you are not using DNS, make sure that your /etc/hosts files are accurate on both ends. If that fails, try passing the -Djava.rmi.server.hostname=$IP option to the JVM at startup (where $IP is the address of the interface you can reach from the remote machine).

How can I iterate over all the rows in a ColumnFamily?

Simple but slow: use get_range_slices; start with the empty string, and after each call use the last key read as the start key in the next iteration.
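
Here is a sketch of that pattern in Java; the sorted set stands in for the cluster's key order, and against a real cluster each page would come from a get_range_slices call instead. Note that the start key is returned again on every page after the first, so skip it to avoid processing rows twice:

        import java.util.ArrayList;
        import java.util.List;
        import java.util.SortedSet;
        import java.util.TreeSet;

        public class RowPager
        {
            public static void main(String[] args)
            {
                // Stand-in for the cluster's key order; 25 rows of sample data.
                SortedSet<String> rows = new TreeSet<String>();
                for (int i = 0; i < 25; i++)
                    rows.add(String.format("key%02d", i));

                int pageSize = 10;
                String start = "";   // empty start key = beginning of the range
                boolean first = true;
                while (true)
                {
                    // One page: up to pageSize keys at or after start
                    // (this loop stands in for a get_range_slices call).
                    List<String> page = new ArrayList<String>();
                    for (String key : rows.tailSet(start))
                    {
                        if (page.size() == pageSize)
                            break;
                        page.add(key);
                    }
                    for (String key : page)
                    {
                        if (!first && key.equals(start))
                            continue;   // skip the repeated start key
                        System.out.println(key);
                    }
                    if (page.size() < pageSize)
                        break;                          // short page: no more rows
                    start = page.get(page.size() - 1);  // resume from the last key read
                    first = false;
                }
            }
        }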

Better: use HadoopSupport.

Why were none of the keyspaces described in storage-conf.xml loaded?

Prior to 0.7, Cassandra loaded a set of static keyspaces defined in a storage-conf.xml file. CASSANDRA-44 added the ability to modify the schema dynamically on a live cluster. Part of this change required that we ignore the schema defined in storage-conf.xml. Additionally, 0.7 converts to YAML-based configuration.

If you have an existing storage-conf.xml file, you will first need to convert it to YAML using the bin/config-converter tool, which can generate a cassandra.yaml file from a storage-conf.xml file. Once you have a cassandra.yaml, it is possible to do a one-time load of the schema it defines. 0.7 adds a loadSchemaFromYAML method to StorageServiceMBean (triggered via JMX: see https://issues.apache.org/jira/browse/CASSANDRA-1001 ) which will load the schema defined in cassandra.yaml, but this is a one-time operation. A node that has had its schema defined via loadSchemaFromYAML will load its schema from the system table on subsequent restarts, which means that any further changes to the schema need to be made using the system_* thrift operations (see API).
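
For reference, here is a sketch of invoking that operation through the standard JMX client API; the MBean object name and the port are assumptions here, so verify them with jconsole against your build:

        import javax.management.MBeanServerConnection;
        import javax.management.ObjectName;
        import javax.management.remote.JMXConnector;
        import javax.management.remote.JMXConnectorFactory;
        import javax.management.remote.JMXServiceURL;

        public class LoadSchema
        {
            public static void main(String[] args) throws Exception
            {
                // 8080 is the default JMX port mentioned elsewhere in this FAQ.
                JMXServiceURL url = new JMXServiceURL(
                        "service:jmx:rmi:///jndi/rmi://localhost:8080/jmxrmi");
                JMXConnector connector = JMXConnectorFactory.connect(url);
                try
                {
                    MBeanServerConnection mbs = connector.getMBeanServerConnection();
                    // Object name assumed; confirm with jconsole.
                    ObjectName ss = new ObjectName("org.apache.cassandra.service:type=StorageService");
                    mbs.invoke(ss, "loadSchemaFromYAML", new Object[0], new String[0]);
                }
                finally
                {
                    connector.close();
                }
            }
        }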

It is recommended that you only perform schema updates on one node and let cassandra propagate changes to the rest of the cluster. If you try to perform the same updates simultaneously on multiple nodes, you run the risk of introducing inconsistent migrations, which will lead to a confused cluster.

See LiveSchemaUpdates for more information.

Is there a GUI admin tool for Cassandra?

The closest is chiton, a GTK data browser.

Another is cassandra-gui (http://code.google.com/p/cassandra-gui), a Swing data browser.

Insert operation throws InvalidRequestException with message "A long is exactly 8 bytes"

You are probably using the LongType column sorter in your column family. LongType assumes that the numbers stored in column names are exactly 64 bits (8 bytes) long, in big-endian format. Here is example PHP code for packing an integer for storage in Cassandra and unpacking it when read back:

        /**
         * Takes a PHP integer and packs it into a 64-bit (8-byte) big-endian binary representation.
         * @param  $x integer
         * @return string eight-byte binary representation of the integer in big-endian order.
         */
        public static function pack_longtype($x) {
                return pack('C8', ($x >> 56) & 0xff, ($x >> 48) & 0xff, ($x >> 40) & 0xff, ($x >> 32) & 0xff,
                                ($x >> 24) & 0xff, ($x >> 16) & 0xff, ($x >> 8) & 0xff, $x & 0xff);
        }

        /**
         * Takes an eight-byte big-endian binary representation of an integer and unpacks it to a PHP integer.
         * @param  $x string
         * @return integer
         */
        public static function unpack_longtype($x) {
                $a = unpack('C8', $x);
                return ($a[1] << 56) + ($a[2] << 48) + ($a[3] << 40) + ($a[4] << 32) + ($a[5] << 24) + ($a[6] << 16) + ($a[7] << 8) + $a[8];
        }

Cassandra says "ClusterName mismatch: oldClusterName != newClusterName" and refuses to start

To prevent operator errors, Cassandra stores the name of the cluster in its system table. If you need to rename a cluster for some reason, it is safe to remove system/LocationInfo* after forcing a compaction on all ColumnFamilies (while the old cluster name is still in place), provided you've specified the node's token in the config file, or you don't care about preserving the node's token (for instance, in single-node clusters).

Are batch_mutate operations atomic?

As a special case, mutations against a single key are atomic but not isolated. Reads which occur during such a mutation may see part of the write before they see the whole thing. More generally, batch_mutate operations are not atomic. batch_mutate allows grouping operations on many keys into a single call in order to save on the cost of network round-trips. If batch_mutate fails in the middle of its list of mutations, no rollback occurs and the mutations that have already been applied stay applied. The client should typically retry the batch_mutate operation.

Is Hadoop (i.e. Map/Reduce, Pig, Hive) supported?

For the latest on Hadoop-Cassandra integration, see HadoopSupport.

Can a Cassandra cluster be multi-tenant?

There is work being done to support more multi-tenant capabilities such as scheduling and auth. For more information, see MultiTenant.

Who is using Cassandra and for what?

For information on who is using Cassandra and what they are using it for, see CassandraUsers.

Are there any ODBC drivers for Cassandra?

No.

Are there ways to do logging directly to Cassandra?

For information on logging directly to Cassandra, see LoggingToCassandra.

On RHEL, nodes are unable to join the ring

Check whether SELinux is enabled; if it is, turn it off.

Is there an authentication/authorization mechanism for Cassandra?

Yes. For details, see ExtensibleAuth.

How do I bulk load data into Cassandra?

See BulkLoading.

Why aren't range slices/sequential scans giving me the expected results?

You're probably using the RandomPartitioner. This is the default because it avoids hotspots, but it means your rows are ordered by the md5 of the row key rather than lexicographically by the raw key bytes.

You can start out with a start key and end key of [empty] and use the row count argument instead, if your goal is paging the rows. To get the next page, start from the last key you got in the previous page.

You can also use intra-row ordering of column names to get ordered results within a row; with appropriate row 'bucketing,' you often don't need the rows themselves to be ordered.
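
As a small illustration of bucketing, time-series rows are often keyed by day, so the row key only needs to be computable, not ordered; the key format here is just an example:

        import java.text.SimpleDateFormat;
        import java.util.Date;

        /** Buckets rows by day so column-level ordering (e.g. TimeUUIDType) does the sorting. */
        public class BucketKeys
        {
            public static String bucketFor(Date d)
            {
                return "events-" + new SimpleDateFormat("yyyy-MM-dd").format(d);
            }

            public static void main(String[] args)
            {
                System.out.println(bucketFor(new Date())); // e.g. events-2011-03-05
            }
        }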

How do I unsubscribe from the email list?

Send an email to user-unsubscribe@cassandra.apache.org

I compacted, so why do I still have all my SSTables?

SSTables that are obsoleted by a compaction are deleted asynchronously when the JVM performs a GC. You can force a GC from jconsole if necessary, but Cassandra will force one itself if it detects that it is low on space. A compaction marker is also added to obsolete SSTables so they can be deleted on startup if the server does not perform a GC before being restarted. Read more on this subject at MemtableSSTable.
