Upgrading nutch from 0.8.x to 0.9.0-dev (current svn trunk)

Hadoop 0.7.x deprecates the use of UTF8 class for storing String-s, instead providing Text. As a part of this upgrade all places in Nutch that used UTF8 have been changed to use Text - this change would have to be done sooner or later.

However, this means that all previously created data is no longer compatible with the new tools, which expect data to use Text class instead of UTF8 class.

Now, quickly before you panic ... there is an upgrade path to save your precious data, please read on. ;)

If your data is easily re-created from scratch, I recommend doing this, it might be quicker.

Otherwise, please follow these steps to upgrade your data to the new formats (note: this does not require re-fetching):

CrawlDb upgrade

Use the new tool convdb <old_db> <new_db> [-withMetadata] to convert your existing CrawlDb from old format using <UTF8, CrawlDatum> to the new format using <Text, CrawlDatum>. Optionally, you can also replace all UTF8 metadata keys to use Text (normally not needed).

Segment upgrade

At the moment you can only upgrade non-parsed segments. Please follow these steps:

   for i in segments/2007* ; do
      (cd $i && rm -rf crawl_parse parse_data parse_text)

   mkdir converted
   for i in segments/2007* ; do
      nutch mergesegs converted $i

(Of course, you can use this opportunity to actually merge some segments and/or re-slice them - the above command creates exactly one converted segment for one input segment).

   for i in converted/* ; do
      nutch parse $i

LinkDb upgrade

There is no option to upgrade your linkdb - but you can easily re-create it from parsed segments, using 'nutch invertlinks' command. Be sure to remove the old linkdb first! (otherwise the tool will attempt to merge new data with old data and things will explode).

Index upgrade

Theoretically, if you rename your converted segments to have exactly the same names as the old segments, you shouldn't need to rebuild your indexes. However, this part of the upgrade process wasn't tested - so to be safe it's better to re-index converted segments.

Your suggestions, comments, success (or horror) stories are appreciated. Good luck!

Upgrading_from_0.8.x_to_0.9 (last edited 2009-09-20 23:10:16 by localhost)