More About WebDB
Michael Cafarella, February 15, 2004
This is the second of two documents that describe the Nutch WebDB. Part one covered the single-processor WebDB; we now turn to the distributed version.
We would like the distributed WebDB to be updated across multiple disks and machines simultaneously. The completed WebDB can exist across many disks, but we assume a single machine can mount and read all of them. The distributed WebDB exists so we can distribute the burdens of update over many machines.
The machines in a distributed WebDBWriter installation are connected either by NFS or by ssh/scp access to one another. (Our NFS implementation is a little more advanced right now.)
The distributed WebDBReader just loads in files from multiple locations. We aren't too worried about distributed reading; it can be implemented by WebDBReader clients instead of in the db itself.
Multiple WebDBWriter processes will each receive separate, possibly overlapping, sets of WebDBWriter API calls to apply to the db. It's OK for the WebDB to know in advance how many WebDBWriter processes will be involved in any update operation. All of these processes run concurrently, and the db update is not finished until every one of them completes.
Since we know the number of writers ahead of time, it's easy to partition the WebDB keyspace into k regions, one for each WebDBWriter process. MD5 keys can be evenly partitioned according to their first (log k) bits. For URL keys, we derived from real data a set of breakpoint URLs that divide the keys evenly.
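The MD5 side of this partitioning rule can be sketched as follows. This is an illustration only, not the Nutch source; the class and method names are invented, and it assumes k is a power of two no larger than 65536.

```java
import java.math.BigInteger;
import java.security.MessageDigest;

// Illustrative sketch (not Nutch code): assign a URL's MD5 key to one of k
// writers by taking the top log2(k) bits of the 128-bit digest.
public class KeyPartition {
    // Assumes k is a power of two, k <= 65536.
    static int partitionFor(String url, int k) throws Exception {
        byte[] md5 = MessageDigest.getInstance("MD5").digest(url.getBytes("UTF-8"));
        int bits = Integer.numberOfTrailingZeros(k);  // log2(k)
        // The first 16 bits of the digest, as an unsigned int.
        int top = ((md5[0] & 0xff) << 8) | (md5[1] & 0xff);
        // Keep only the top log2(k) bits; result is in [0, k).
        return top >>> (16 - bits);
    }

    public static void main(String[] args) throws Exception {
        int k = 4;
        System.out.println(partitionFor("http://nutch.org/", k));
    }
}
```

Because MD5 output is effectively uniform, the k regions receive roughly equal shares of the keys without any tuning, which is exactly why the URL keys (which are not uniform) need the data-derived breakpoints instead.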
It would be great if each WebDBWriter process could simply work on its own part of the db, wholly ignorant of the other processes. Unfortunately, that's not possible. A WebDBWriter can receive an API call that touches a record in another Writer's keyspace partition. Even worse, remember that every stage of WebDBWriter execution can emit edits to be processed in later stages. Each of these edits might involve a record in a different Writer's keyspace partition.
Take a moment to recall how the single-processor WebDBWriter applies changes to the WebDB, as described in part one: it accumulates API calls as files of edits, sorts them, and then merges the sorted edits against the db's sorted tables in a series of stages.
So a distributed WebDBWriter cannot labor in blissful ignorance. It needs to communicate with other writers, particularly when writing out edits to be performed. We modify the above single-processor algorithm to account for this inter-writer protocol.
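The spooling of cross-partition edits into files can be sketched as below. Everything here is hypothetical: the class, the "edits-from-X-to-Y" naming scheme, and the plain-text edit records are invented for illustration, with local files standing in for files shared between machines.

```java
import java.io.*;

// Hypothetical sketch (not Nutch code): a writer spools each edit that
// belongs to another partition into a file addressed to that partition's
// writer, to be shipped over and merged in a later stage.
public class EditRouter {
    private final PrintWriter[] outs;  // one pending-edits file per target partition

    EditRouter(File dir, int k, int myPartition) throws IOException {
        outs = new PrintWriter[k];
        for (int p = 0; p < k; p++)
            outs[p] = new PrintWriter(new FileWriter(
                new File(dir, "edits-from-" + myPartition + "-to-" + p)));
    }

    // Append an edit line to the file bound for the partition that owns its key.
    void emit(int targetPartition, String edit) {
        outs[targetPartition].println(edit);
    }

    // Closing the files marks them ready to be carried to the other writers.
    void close() {
        for (PrintWriter w : outs) w.close();
    }

    public static void main(String[] args) throws IOException {
        File dir = new File(System.getProperty("java.io.tmpdir"));
        EditRouter r = new EditRouter(dir, 2, 0);
        r.emit(1, "ADD_PAGE http://example.com/");  // an edit owned by partition 1
        r.close();
    }
}
```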
The distributed db system works as follows: at each stage, a writer appends any edit that belongs to another partition to a file addressed to that partition's writer; before performing a stage's merge, each writer waits for and folds in the edit files the other writers have emitted for it.
Reading from a distributed WebDB is comparatively simple. The WebDB now exists in k separate pieces. When the distributed WebDBReader needs to examine a certain key, it checks only the segment whose partition covers that key. When asked to enumerate all entries, it enumerates each of the k segments in order.
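The two read paths can be sketched in miniature. This is a toy model, not the WebDBReader itself: HashMaps stand in for the sorted on-disk segments, and hashCode stands in for the MD5/URL partition rule.

```java
import java.util.*;

// Toy sketch (not Nutch code) of a reader over k segments.
public class SegmentedReader {
    private final List<Map<String, String>> segments;

    SegmentedReader(int k) {
        segments = new ArrayList<>();
        for (int i = 0; i < k; i++) segments.add(new HashMap<>());
    }

    // The partition rule every reader and writer must agree on.
    private int segmentFor(String key) {
        return Math.floorMod(key.hashCode(), segments.size());
    }

    void put(String key, String value) {
        segments.get(segmentFor(key)).put(key, value);
    }

    // Point lookup: only the one owning segment is consulted.
    String get(String key) {
        return segments.get(segmentFor(key)).get(key);
    }

    // Full scan: walk each of the k segments in order.
    List<String> allKeys() {
        List<String> out = new ArrayList<>();
        for (Map<String, String> seg : segments) out.addAll(seg.keySet());
        return out;
    }

    public static void main(String[] args) {
        SegmentedReader r = new SegmentedReader(4);
        r.put("http://nutch.org/", "page-record");
        System.out.println(r.get("http://nutch.org/"));  // prints "page-record"
    }
}
```

Note that a point lookup touches one segment while enumeration touches all k, which is why distributed reading adds so little complexity compared to distributed writing.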
It's worthwhile to spend a little time on how the distributed WebDBWriter's inter-process communication works. No process ever opens a socket to communicate directly with another. Rather, all data is communicated via files. However, since the WebDBWriters exist over many machines and filesystems, these files need to be copied back and forth.
We do that via the NutchDistributedFileSystem. Really, "filesystem" is overstating the case a little bit. Rather it's a mechanism for a shared file namespace, with some automatic file copying between machines that announce interest in that namespace.
Every object under control of a NutchDistributedFileSystem machine group is represented with a "NutchFile" object. A NutchFile is named using three different parameters.
A NutchFile object can be resolved to a "real-world" disk File with the help of a local NutchDistributedFileSystem object. Each machine in a NutchDistributedFileSystem machine group has a NutchDistributedFileSystem object that handles configuration and other services. One such config value is the place on a local disk where the "root" of the NutchDistributedFileSystem is found. The disk File embodiment of a NutchFile is a combination of that root, the shareGroupName, and the name.
(Of course, not all sharegroups' files will be present on each machine. That depends on the (sharegroup<->machine) mapping.)
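The resolution step described above reduces to simple path construction. The sketch below is illustrative only: the real NutchFile class surely differs, and "dbwriters" and "edits.0" are made-up example values; only the root + shareGroupName + name composition comes from the text.

```java
import java.io.File;

// Illustrative sketch (not Nutch code): resolve a NutchFile's logical name
// to a real disk File as local root + sharegroup name + file name.
public class NutchFileResolver {
    static File resolve(File localRoot, String shareGroupName, String name) {
        return new File(new File(localRoot, shareGroupName), name);
    }

    public static void main(String[] args) {
        // "dbwriters" and "edits.0" are invented example values.
        File f = resolve(new File("/nutch/fsroot"), "dbwriters", "edits.0");
        System.out.println(f.getPath());
    }
}
```

Because only the root differs from machine to machine, the same logical NutchFile name resolves to a valid local path on every machine in the group that carries the sharegroup.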
The NutchDistributedFileSystem also takes care of file moves, deletes, locking, and guaranteed atomicity.
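One standard way a filesystem like this can guarantee atomicity is the write-then-rename idiom sketched below. This is a general technique, not necessarily how Nutch implements it, and the names here are invented.

```java
import java.io.IOException;
import java.nio.file.*;

// Sketch of atomic file publication: write under a temporary name, then
// rename into place, so readers never observe a half-written file.
public class AtomicPublish {
    static void publish(Path dir, String name, byte[] contents) throws IOException {
        Path tmp = dir.resolve(name + ".tmp");
        Files.write(tmp, contents);
        // The rename is atomic on POSIX filesystems: the final name either
        // doesn't exist yet or refers to the complete file.
        Files.move(tmp, dir.resolve(name), StandardCopyOption.ATOMIC_MOVE);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("ndfs");
        publish(dir, "edits.done", "complete".getBytes());
        System.out.println(Files.exists(dir.resolve("edits.done")));  // prints "true"
    }
}
```

Atomic publication matters here because the writers poll for each other's edit files; a reader must never pick up a file that a remote copy has only partially delivered.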
It should be clear that the NutchDistributedFileSystem can be implemented across any group of machines that have mutual remote-access rights (e.g., ssh/scp). It can also be used across machines that share mutual NFS mounts.