Overview
The main features of the data store are:
Space saving: only one copy per unique object is kept
Fast copy: only the identifier is copied
Storing and reading do not block others
Objects in the data store are immutable
Garbage collection is used to purge unused objects
Hot backup is supported
Requirements
Jackrabbit 1.4 is required; the data store is not available in previous releases.
A Bundle persistence manager is required, or any other persistence manager that supports data stores. Currently the SimpleDbPersistenceManager does not support the data store, meaning large objects are still saved multiple times if it is used.
How to configure the file data store
To use the file based data store, add this to your repository.xml after the <Repository> start tag:
<DataStore class="org.apache.jackrabbit.core.data.FileDataStore"/>
File Data Store Configuration Options
This is a full configuration using the default values:
<DataStore class="org.apache.jackrabbit.core.data.FileDataStore">
<param name="path" value="${rep.home}/repository/datastore"/>
<param name="minRecordLength" value="100"/>
</DataStore>
Database Data Store
Here is a possible configuration using the database data store:
<DataStore class="org.apache.jackrabbit.core.data.db.DbDataStore">
<param name="url" value="jdbc:postgresql:test"/>
<param name="user" value="sa"/>
<param name="password" value="sa"/>
<param name="databaseType" value="postgresql"/>
<param name="driver" value="org.postgresql.Driver"/>
<param name="minRecordLength" value="1024"/>
<param name="maxConnections" value="2"/>
<param name="copyWhenReading" value="true"/>
<param name="tablePrefix" value=""/>
</DataStore>
There is a limitation on minRecordLength: the maximum value is around 32000. The reason is that Java's writeUTF does not support strings whose encoded form is longer than 64 KB.
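The 64 KB limit mentioned above can be observed directly with java.io.DataOutputStream.writeUTF, which throws a UTFDataFormatException once the encoded string exceeds 65535 bytes. The following is a minimal, standalone sketch; the class and method names (WriteUtfLimit, fitsInWriteUTF, repeat) are made up for illustration and are not part of Jackrabbit.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;
import java.util.Arrays;

public class WriteUtfLimit {

    /** Returns true if DataOutputStream.writeUTF accepts the string. */
    static boolean fitsInWriteUTF(String s) {
        try (DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream())) {
            out.writeUTF(s);
            return true;
        } catch (UTFDataFormatException e) {
            // Thrown when the encoded form is longer than 65535 bytes.
            return false;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Hypothetical helper: builds a string of 'count' copies of 'c'. */
    static String repeat(char c, int count) {
        char[] chars = new char[count];
        Arrays.fill(chars, c);
        return new String(chars);
    }

    public static void main(String[] args) {
        // ASCII characters are encoded as one byte each.
        System.out.println(fitsInWriteUTF(repeat('a', 65535))); // true
        System.out.println(fitsInWriteUTF(repeat('a', 65536))); // false
    }
}
```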
FAQ
Q: Can I disable the data store? A: Only if there are no elements in the data store. If it is not empty, you need to copy the data to a new repository.
Q: When I use the database data store I get the message: 'Table or view does not exists'. A: Maybe the data store table already exists in another schema. When starting the repository, the database data store checks if the table already exists (using a database meta data call), and will create the table if not. If the table exists, but is in another schema, the table is not created, but accessing it may fail (if the other schema is not in the schema search path for this user).
Clustering is supported if you use a shared file system, such as SAN or NFS (Windows file sharing works as well). You need to set the data store path of all cluster nodes to the same location.
Blob Store: When the data store is enabled, the blob store is not used. The data store solves the same problems as the blob store, and more. Therefore, the blob store is now deprecated; however, it will be supported for quite some time.
Transaction: transactional semantics are guaranteed.
There is only one data store per repository (not one per Workspace).
Backup: It is very easy to back up the data store: just copy all files. Files are never modified, only renamed from a temporary file to the live file, and they are deleted only when no longer used (and only by the garbage collector). Backup can be incremental. Backup at runtime (hot backup) is supported.
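Because stored files are never modified, an incremental backup only needs to copy the files that are not yet present in the backup location. The following is a minimal sketch of that idea using java.nio.file; the class name DataStoreBackup and the method copyNewFiles are hypothetical and not part of Jackrabbit, and the sketch does not try to recognise temporary files that are still being written.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class DataStoreBackup {

    /**
     * Copies every regular file under source that is missing under target,
     * keeping the relative directory layout. Returns the number of files copied.
     */
    static int copyNewFiles(Path source, Path target) {
        try {
            List<Path> files;
            try (Stream<Path> walk = Files.walk(source)) {
                files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
            }
            int copied = 0;
            for (Path file : files) {
                Path dest = target.resolve(source.relativize(file).toString());
                if (Files.exists(dest)) {
                    continue; // files are immutable, so an existing copy is up to date
                }
                Files.createDirectories(dest.getParent());
                Files.copy(file, dest);
                copied++;
            }
            return copied;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling copyNewFiles repeatedly with the same target copies only the files added since the previous run.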
The main advantages of the data store over the blob store are: unlike the blob store, the data store keeps only one copy per object, even if it is used multiple times. The data store detects if the same object is already stored and will only store a link to the existing object. The data store can be shared across multiple workspaces, and even across multiple repositories if required. Data store operations (read and write) don't block other users because they are done outside the persistence manager. Multiple data store operations can be done at the same time.
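The "one copy per object" behaviour can be illustrated with a small content-addressed store. This is only a sketch under an assumption: it derives the identifier from a SHA-1 digest of the content, which the text above does not state. The class ContentAddressedStore and its methods are hypothetical and keep everything in memory.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

public class ContentAddressedStore {

    // identifier -> content; one entry per unique content
    private final Map<String, byte[]> records = new HashMap<>();

    /** Stores the data if it is not yet present and returns its identifier. */
    public String add(byte[] data) {
        String id = digestHex(data);
        records.putIfAbsent(id, data.clone());
        return id;
    }

    /** Returns a copy of the stored data, or null if the identifier is unknown. */
    public byte[] get(String id) {
        byte[] data = records.get(id);
        return data == null ? null : data.clone();
    }

    /** Number of unique objects currently stored. */
    public int size() {
        return records.size();
    }

    /** Assumption for this sketch: the identifier is the SHA-1 digest in hex. */
    static String digestHex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Adding the same bytes twice returns the same identifier and leaves size() unchanged, so a "copy" only needs to copy the identifier.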
Migration: currently there is no special mechanism to migrate data from a blob store to a data store. The only known way to convert is to export the data, and re-import into a new repository.
How does it work
When adding a binary object, Jackrabbit checks its size. If it is larger than minRecordLength, it is added to the data store; otherwise it is kept in memory. This is done very early (possibly when calling Property.setValue(stream)). Only the unique data identifier is stored in the persistence manager (except for in-memory objects, where the data itself is stored). When updating a value, the old value is kept in the data store (potentially becoming garbage) and the new value is added. There is no update operation.
The current implementation still stores temporary files in some situations, for example in the RMI client. Those cases will be changed to use the data store directly where it makes sense.
Very small objects (where it does not make sense to create a file) are stored in the persistence manager (in-place).
Objects in the data store are only removed when they are not reachable (that is, objects referenced in the cache or in memory are not collected). There is no 'update' operation, only 'add new entry'. Data is added before the transaction is committed. Additions are globally atomic; cluster nodes can share the same data store. Even different repositories can share the same store, as long as garbage collection is done correctly.
Overview:
Objects are usually stored early in the data store, even before the transaction is committed. Only the identifier is stored in the persistence manager. The blob store is not used any longer (except for backward compatibility). When using the RMI client, large objects are not stored directly in the data store; instead they are first transferred to the server.
Running data store garbage collection
Running the garbage collection is currently a manual process. You can run this as a separate thread concurrently to your application:
import org.apache.jackrabbit.core.SessionImpl;
import org.apache.jackrabbit.core.data.GarbageCollector;
...
GarbageCollector gc;
SessionImpl si = (SessionImpl) session;
gc = si.createDataStoreGarbageCollector();
// optional (if you want to implement a progress bar / output):
gc.setScanEventListener(this);
// scan the repository for objects that are still in use
gc.scan();
gc.stopScan();
// delete old data
gc.deleteUnused();
If multiple repositories use the same data store, the deleteUnused() method must not be used. Instead, the process is:
* Write down the current time = X
* Run gc.scan() on each repository
* Manually delete files with last modified date older than X
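The last step of this procedure can be sketched in plain Java. The sketch assumes that gc.scan() refreshes the last modified date of every file that is still in use, which the steps above imply but do not spell out. The class SharedDataStoreCleanup and the method deleteOlderThan are hypothetical names, not part of Jackrabbit.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class SharedDataStoreCleanup {

    /**
     * Deletes regular files under root whose last modified time is older
     * than cutoffMillis (the time X). Returns the number of files deleted.
     */
    static int deleteOlderThan(Path root, long cutoffMillis) {
        try {
            List<Path> files;
            try (Stream<Path> walk = Files.walk(root)) {
                files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
            }
            int deleted = 0;
            for (Path file : files) {
                if (Files.getLastModifiedTime(file).toMillis() < cutoffMillis) {
                    Files.delete(file);
                    deleted++;
                }
            }
            return deleted;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In use, X would be recorded with System.currentTimeMillis() before the first scan starts, and deleteOlderThan would be called on the shared data store directory only after the scans of all repositories have finished.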
How to write a new data store implementation
New implementations are welcome! An S3 data store (http://en.wikipedia.org/wiki/Amazon_S3) would be cool. A caching data store would be great as well (items that are used a lot are stored in a fast file system, others in a slower one).
Future ideas
Theoretically the data store could be split to different directories / hard drives. Content that is accessed more often could be moved to a faster disk, and less used data could eventually be moved to slower / cheaper disk. That would be an extension of the 'memory hierarchy' (see also http://en.wikipedia.org/wiki/Memory_hierarchy). Of course this wouldn't limit the space used per workspace, but would improve system performance if done right. Maybe we need to do that anyway in the near future to better support solid state disk.
Other feature requests:
A replicating data store
Currently the FileDataStore creates a lot of directories (and files). If possible the number of directories (and maybe files) should be reduced to improve performance.
Fulltext search and meta data extraction could be done when storing the object (only once per object) and stored next to the object.
The client should first send the checksum and size of large objects when storing something (import, adding or updating data); in many cases the actual data does not need to be sent.
Speed up garbage collection. One idea is to use 'back references' for larger objects: each larger object would know the set of nodes that reference it. This would be an 'append only' set, meaning that at runtime links are only added, not removed. Only the garbage collection process removes links. The garbage collection would first update links for large objects (this process could stop at the first link that still exists). That way large objects can be removed quickly if they are not used any more. Afterwards, objects with a low use count should be scanned. This algorithm wouldn't necessarily reduce the total garbage collection time, but it would free up space more quickly.