The official documentation has moved to http://docs.couchdb.org — The transition is not 100% complete, but http://docs.couchdb.org should be seen as having the latest info. In some cases, the wiki still has some more or older info on certain topics inside CouchDB.


Replication

See also the official documentation for the replication and replicator database topics.

Overview

Replication is an incremental, one-way process involving two databases (a source and a destination).

The aim of replication is that, at the end of the process, all active documents in the source database are also in the destination database, and all documents that were deleted in the source database are also deleted from the destination database (if they existed there).

The replication process only copies the latest revision of a document, so previous revisions that exist only in the source database are not copied to the destination database.

Changes made to the source after a replication has completed will not automatically propagate to the target. See “Continuous Replication” below.

One-shot Replication

One-shot replication is triggered by sending a POST request to the _replicate URL.

The body is JSON with the following allowed fields:

* source - Required. Identifies the database to copy revisions from. Can be a string containing a local database name or a remote database URL, or an object whose url property contains the database name or URL.

* target - Required. Identifies the database to copy revisions to. Same format and interpretation as source.

* cancel - Include this property with a value of true to cancel an existing replication between the specified source and target.

* continuous - A value of true makes the replication continuous (see below for details).

* create_target - A value of true tells the replicator to create the target database if it doesn't exist yet.

* doc_ids - Array of document IDs; if given, only these documents will be replicated.

* filter - Name of a filter function that can choose which revisions get replicated.

* proxy - Proxy server URL.

* query_params - Object containing properties that are passed to the filter function.

The source and target fields indicate the databases that documents will be copied from and to, respectively. Use just the name for a local database, or the full URL for a remote database. A local-to-remote replication is called a push, and remote-to-local is called a pull. Local-to-local or even remote-to-remote replications are also allowed, but rarer. For example:

POST /_replicate HTTP/1.1

{"source":"example-database","target":"http://example.org/example-database"}

If your local CouchDB instance is secured by an admin account, you need to use the full URL format, including credentials, for the local database:

POST /_replicate HTTP/1.1

{"source":"http://example.org/example-database","target":"http://admin:password@127.0.0.1:5984/example-database"}

The target database has to exist and is not implicitly created. Add create_target:true to the JSON object to create the target database (remote or local) prior to replication. The names of the source and target databases do not have to be the same.
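
For example, the following request (a sketch using hypothetical database names) pushes a local database to a remote database that is created on the fly:

POST /_replicate HTTP/1.1

{"source":"example-database","target":"http://example.org/new-database","create_target":true}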

Cancel replication

Before 1.2.0

A replication triggered by POSTing to /_replicate/ can be canceled by POSTing the exact same JSON object but with the additional "cancel" property set to the boolean true value.

POST /_replicate HTTP/1.1
{"source":"example-database", "target":"http://example.org/example-database", "cancel": true}

Notice: the request which initiated the replication will fail with error 500 (shutdown).

From 1.2.0 onward

Starting from CouchDB version 1.2.0, the original replication object no longer needs to be known. Instead a simple JSON object with the fields "replication_id" (a string) and "cancel" (set to the boolean true value) is enough. The names _local_id and id are aliases to replication_id. The replication ID can be obtained from the original replication request (if it's a continuous replication), from _active_tasks or from the log. Example:

$ curl -H 'Content-Type: application/json' -X POST http://localhost:5984/_replicate -d ' {"source": "http://myserver:5984/foo", "target": "bar", "create_target": true, "continuous": true} '
{"ok":true,"_local_id":"0a81b645497e6270611ec3419767a584+continuous+create_target"}

$ curl -H 'Content-Type: application/json' -X POST http://localhost:5984/_replicate -d ' {"replication_id": "0a81b645497e6270611ec3419767a584+continuous+create_target", "cancel": true} '
{"ok":true,"_local_id":"0a81b645497e6270611ec3419767a584+continuous+create_target"}

Continuous replication

To make a replication continuous, add the "continuous":true parameter to the JSON body, for example:

POST /_replicate HTTP/1.1

{"source":"http://example.org/example-database","target":"http://admin:password@127.0.0.1:5984/example-database", "continuous":true}

CouchDB can persist continuous replications over a server restart. For more, see the _replicator database below.

Filtered Replication

Sometimes you don't want to transfer all documents from source to target. You can include one or more filter functions in a design document on the source and then tell the replicator to use them.

A filter function takes two arguments (the document to be replicated and the replication request) and returns true or false. If the result is true, then the document is replicated.

  function(doc, req) {
    if (doc.type && doc.type == "foo") {
      return true;
    } else {
      return false;
    }
  }

Note: this naive filter will lead to problems if used as-is with a source database in which documents are deleted -- deletions will not be replicated (the document submitted to the filter will no longer have a type field, so the filter will return false), and if a new document with the same _id is created later, it will not get replicated. A symptom of this issue is an incrementing missing_revisions_found counter in a replication process. Deleted documents do have a _deleted field, though, so a better filter might be:

  function(doc, req) {
    if (doc._deleted) {
      return true;
    }
    if (doc.type && doc.type == "foo") {
      return true;
    } else {
      return false;
    }
  }

Note: When using filtered replication you should not use the DELETE method to remove documents, but instead use PUT and add a _deleted:true field to the document, preserving the fields required for the filter. Your Document Update Handler should make sure these fields are always present. This will ensure that the filter propagates deletions properly.
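
As a sketch of such a deletion (the document ID and revision are hypothetical; _rev must be the document's current revision), note that the type field is kept so the filter still matches:

curl -X PUT http://example.org/example-database/mydoc -H 'Content-Type: application/json' -d '{"_rev": "3-825cb35de44c433bfb2df415563a19de", "type": "foo", "_deleted": true}'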

Filters live under the top-level "filters" key of the design document:

  {
    "_id":"_design/myddoc",
    "filters": {
      "myfilter": "function goes here"
    }
  }
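
As a sketch (assuming the source database is the remote example-database used above, and using the improved filter from earlier as the function body), such a design document could be created like this:

curl -X PUT http://example.org/example-database/_design/myddoc -H 'Content-Type: application/json' -d '{"filters": {"myfilter": "function(doc, req) { if (doc._deleted) { return true; } if (doc.type && doc.type == \"foo\") { return true; } return false; }"}}'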

Invoke them as follows:

{"source":"http://example.org/example-database","target":"http://admin:password@127.0.0.1:5984/example-database", "filter":"myddoc/myfilter"}

You can even pass arguments to them via query_params:

{"source":"http://example.org/example-database","target":"http://admin:password@127.0.0.1:5984/example-database", "filter":"myddoc/myfilter", "query_params": {"key":"value"}}

Inside the filter function, the values passed in query_params are available through the second argument as req.query (for example, req.query.key).

Named Document Replication

Sometimes you only want to replicate a specific set of documents. For this simple case you do not need to write a filter function. Simply list the document IDs in the doc_ids field:

{"source":"http://example.org/example-database","target":"http://admin:password@127.0.0.1:5984/example-database", "doc_ids":["foo","bar","baz"]}

Replicating through a proxy

Pass a "proxy" argument in the replication data to have replication go through an HTTP proxy:

POST /_replicate HTTP/1.1

{"source":"example-database","target":"http://example.org/example-database", "proxy":"http://localhost:8888"}

Authentication

The remote database may require authentication, especially if it's the target because the replicator will need to write to it. The easiest way to authenticate is to put a username and password into the URL; the replicator will use these for HTTP Basic auth:

{"source":"https://myusername:mypassword@example.net:5984/db", "target":"local-db"}

The password will not be visible to other users, even if they inspect the document in the _replicator database, but it's still stored in plaintext in the database file.

OAuth

CouchDB supports OAuth 1 authentication, but not yet (as of CouchDB 1.2) OAuth 2.

To replicate with OAuth authentication, use the form in which the source or target property is an object instead of a direct URL string. Then add the OAuth tokens to the object as shown:

{"source": "example-database",
 "target": {
    "url": "http://example.org/example-database",
    "auth": {
        "oauth": {
            "consumer_secret": "...", "consumer_key": "...", "token_secret": "...", "token": "..."
        }
    }
 }
}

Username Workaround (older CouchDBs only)

In some older versions of CouchDB, if the remote username or password contains a special character like an @ sign, CouchDB will not handle these properly. You can work around this by making the source or target property an object, and adding a headers property to add a custom Authorization: header.

For example this may not work (assuming username "bob@example.com" has password "password"), even though the URL is properly formatted:

POST /_replicate HTTP/1.1

{"source":"https://bob%40example.com:password@example.net:5984/db", "target":"local-db"}

(In this case a broken CouchDB will encode the username as "bob%40example.com" instead of "bob@example.com" when submitting authorization to the remote source.)

To work around the issue, use a JSON object for source or target, instead of a string, as follows:

POST /_replicate HTTP/1.1

{"source":{"url":"https://example.net:5984/db","headers":{"Authorization":"Basic Ym9iQGV4YW1wbGUuY29tOnBhc3N3b3Jk"}}, "target":"local-db"}

where the base64 string following the word "Basic" is the output of:

echo -n 'bob@example.com:password' | base64

Replicator database

Since CouchDB 1.1.0, a system database named _replicator can be used to manage replications. Replications triggered by POSTing to /_replicate are not managed by this system database. Currently, the most reliable documentation about it can be found at:

https://gist.github.com/832610 (from the author)
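
As a sketch (assuming an admin account on localhost and hypothetical database names; the document ID my_rep is arbitrary), a persistent replication can be started by creating a document in _replicator, and canceled later by deleting that document:

curl -X PUT http://admin:password@localhost:5984/_replicator/my_rep -H 'Content-Type: application/json' -d '{"source": "http://example.org/example-database", "target": "example-database", "create_target": true, "continuous": true}'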

Since CouchDB 1.2.0, special security restrictions are in place for the replicator database:

  • Users can only update replication documents they created themselves.
  • Users reading replication documents created by other users will not see the passwords and OAuth tokens used for authenticating replications.
  • For all authenticated users, their username gets added to the owner field of the replication document.
  • Only server and database admins can create design docs and access views.
  • Server and database admins can update any replication document.

New features introduced in CouchDB 1.2.0

CouchDB 1.2 ships with a new replicator implementation. Besides offering performance improvements, more resilience, and better logging/reporting, it introduces new configuration parameters. These parameters can be specified globally in the [replicator] section of the configuration files (default.ini or, preferably, local.ini):

* worker_processes - The number of processes the replicator uses (per replication) to transfer documents from the source to the target database. Higher values can imply better throughput (due to more parallelism of network and disk IO) at the expense of more memory and, eventually, CPU. Default value is 4.

* worker_batch_size - Workers process batches of the size defined by this parameter (the size corresponds to the number of _changes feed rows). Larger batch sizes can offer better performance, while lower values mean that checkpointing is done more frequently. Default value is 500.

* http_connections - The maximum number of HTTP connections per replication. For push replications, the effective number of HTTP connections used is min(worker_processes + 1, http_connections). For pull replications, the effective number of connections used corresponds to this parameter's value. Default value is 20.

* connection_timeout - The maximum period of inactivity for a connection in milliseconds. If a connection is idle for this period of time, its current request will be retried. Default value is 30000 milliseconds (30 seconds).

* retries_per_request - The maximum number of retries per request. The replicator waits for a short period before each retry, and this period doubles between consecutive attempts (0.25 seconds before the first retry, then 0.5, 1, 2, ... seconds, never exceeding 5 minutes). The default value of this parameter is 10 attempts.

* socket_options - A list of options to pass to the connection sockets. The available options can be found in the documentation for the Erlang function setopts/2 of the inet module. Default value is [{keepalive, true}, {nodelay, false}].

* verify_ssl_certificates - Whether or not the replicator should validate peer SSL certificates. Default value is false.

* ssl_certificate_max_depth - The maximum allowed depth for peer SSL certificates. This option only has an effect if verify_ssl_certificates is enabled. Default value is 3.

* cert_file, key_file, password - These options allow the replicator to authenticate to the other peer with an SSL certificate. The first is a path to a certificate in PEM format, the second is a path to a file containing the PEM encoded private key, and the third is the password needed to access the key file if it is password protected. By default these options are disabled.
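
These parameters can also be changed at runtime through the configuration API; a sketch, assuming an admin account on localhost (values set this way should be picked up by replications started afterwards):

curl -X PUT http://admin:password@localhost:5984/_config/replicator/worker_processes -d '"8"'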

All these options, except for the ones related to peer authentication with SSL certificates, can also be set per replication by simply including them in the replication object/document. Example:

POST /_replicate HTTP/1.1

{
    "source": "example-database",
    "target": "http://example.org/example-database",
    "connection_timeout": 60000,
    "retries_per_request": 20,
    "http_connections": 30
}

When a replication is started, CouchDB logs the values of its parameters. Example:

[info] [<0.152.0>] Replication `"1447443f5d0837538c771c3af68518eb+create_target"` is using:
        4 worker processes
        a worker batch size of 500
        30 HTTP connections
        a connection timeout of 60000 milliseconds
        20 retries per request
        socket options are: [{keepalive,true},{nodelay,false}]
        source start sequence 9243679
[info] [<0.128.0>] starting new replication `1447443f5d0837538c771c3af68518eb+create_target` at <0.152.0> (`my_database` -> `http://www.server.com:5984/my_database_copy/`)

As for monitoring progress, the active tasks API was enhanced to report additional information for replication tasks. Example:

$ curl http://localhost:5984/_active_tasks
[
    {
        "pid": "<0.1303.0>",
        "replication_id": "e42a443f5d08375c8c7a1c3af60518fb+create_target",
        "checkpointed_source_seq": 17333,
        "continuous": false,
        "doc_write_failures": 0,
        "docs_read": 17833,
        "docs_written": 17833,
        "missing_revisions_found": 17833,
        "progress": 3,
        "revisions_checked": 17833,
        "source": "http://fdmanana.iriscouch.com/test_db/",
        "source_seq": 551202,
        "started_on": 1316229471,
        "target": "test_db",
        "type": "replication",
        "updated_on": 1316230082
    }
]
