Based on some conversations on list:

We've gathered some requirements for a Debug Tool that would let users know precisely what decisions Nutch makes while it navigates the URL space. So far, here's what we have, primarily from Ken Krugler, along with the others (Markus Jelsma, Chris Mattmann, Lewis John McGibbney) participating in the above-referenced thread:

It should be possible to generate information that would have answered all of the "is it X" questions that came up during a user's crawl, e.g. (see the sketch after this list for one way such decisions might be recorded):

  1. which URLs were put on the fetch list versus skipped
  2. which fetched documents were truncated
  3. which URLs in a parsed page were skipped due to the max-outlinks-per-page limit
  4. which URLs were filtered out by the regex, prefix, suffix, or domain URL filters
  5. which URLs were excluded by robots.txt directives
  6. which URLs were mapped to another URL (e.g. by normalization)

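As a rough illustration of item 4, the sketch below shows one way per-URL filter decisions could be captured for a debug report. It is a minimal sketch only, not a proposed implementation: it assumes Nutch's URLFilter contract, where filter(String) returns the (possibly rewritten) URL or null on rejection; the FilterTrace and Decision names and the output format are hypothetical and exist only for this example.

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.nutch.net.URLFilter;

public class FilterTrace {

  /** One accept/reject decision made by a single filter for a single URL. */
  public static class Decision {
    public final String filterName;
    public final String input;
    public final String output;   // null means the filter rejected the URL

    public Decision(String filterName, String input, String output) {
      this.filterName = filterName;
      this.input = input;
      this.output = output;
    }

    @Override
    public String toString() {
      return filterName + ": " + input + " -> "
          + (output == null ? "REJECTED" : output);
    }
  }

  /**
   * Runs the URL through each filter in order, recording every decision,
   * so a debug tool could report exactly which filter dropped or rewrote it.
   */
  public static List<Decision> trace(String url, List<URLFilter> filters) {
    List<Decision> decisions = new ArrayList<>();
    String current = url;
    for (URLFilter f : filters) {
      String result = f.filter(current);
      decisions.add(new Decision(f.getClass().getSimpleName(), current, result));
      if (result == null) {
        break;             // URL was rejected; later filters never see it
      }
      current = result;    // a filter may pass the URL through (possibly modified)
    }
    return decisions;
  }
}
```

A debug tool could emit one such trace line per URL per filter, which would answer "why did this URL never get fetched?" without the user having to re-run the filter chain by hand.
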
Please add more requirements and discussion here.