BasicUrlNormalizer Notes

The Basic URL Normalizer class normalizes a URL in several ways.

  1. Trims white space from both ends of the URL (java.lang.String.trim()).
  2. May lowercase the protocol (java.net.URL).
  3. If the protocol is http or ftp:
    1. Lowercases the host.
    2. Removes the port if it is the protocol's default.
    3. Adds a trailing slash if no file is specified.
    4. Removes any reference (fragment) text.
    5. Removes any relative path segments.
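
The steps above can be sketched roughly as follows. This is only an illustration and not the Nutch source: the class name UrlNormalizeSketch and the helper removeDotSegments are hypothetical, and the sketch assumes that "relative paths" means "." and ".." path segments, that the query string is kept unchanged, and that any user-info part can be dropped.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Locale;

// Hypothetical illustration of the listed steps; this is not the Nutch source.
final class UrlNormalizeSketch {

    static String normalize(String urlString) {
        // Step 1: String.trim() strips white space from both ends.
        String trimmed = urlString.trim();
        URL url;
        try {
            url = new URL(trimmed);
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Malformed URL: " + trimmed, e);
        }

        // Step 2: lowercase the protocol explicitly instead of relying on java.net.URL.
        String protocol = url.getProtocol().toLowerCase(Locale.ROOT);

        // Step 3 applies only to http and ftp; other protocols are returned as parsed.
        if (!protocol.equals("http") && !protocol.equals("ftp")) {
            return url.toString();
        }

        String host = url.getHost().toLowerCase(Locale.ROOT);          // 3.1
        int port = url.getPort();                                      // -1 if absent
        String path = url.getPath();
        if (path.isEmpty()) {
            path = "/";                                                // 3.3
        }
        path = removeDotSegments(path);                                // 3.5

        StringBuilder out = new StringBuilder(protocol).append("://").append(host);
        if (port != -1 && port != url.getDefaultPort()) {
            out.append(':').append(port);                              // 3.2
        }
        out.append(path);
        if (url.getQuery() != null) {
            out.append('?').append(url.getQuery());                    // assumption: query kept
        }
        return out.toString();                                         // 3.4: reference never appended
    }

    // Resolves "." and ".." segments of an absolute path (assumed meaning of step 3.5).
    static String removeDotSegments(String path) {
        String[] segments = path.split("/", -1);
        Deque<String> kept = new ArrayDeque<>();
        boolean trailingSlash = false;
        for (int i = 1; i < segments.length; i++) {   // index 0 is the empty text before the leading "/"
            String segment = segments[i];
            if (segment.equals(".") || segment.equals("..")) {
                if (segment.equals("..")) {
                    kept.pollLast();                  // step up one level; no-op at the root
                }
                trailingSlash = (i == segments.length - 1);
            } else {
                kept.addLast(segment);
            }
        }
        String joined = "/" + String.join("/", kept);
        return (trailingSlash && !kept.isEmpty()) ? joined + "/" : joined;
    }
}
```

Under these assumptions, UrlNormalizeSketch.normalize("http://Example.COM:80/a/./b/../c#frag") returns "http://example.com/a/c". The output is assembled by hand rather than through a java.net.URL constructor so that each listed step stays visible in the code.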

For example:

would be rewritten:

Notes

Other than trimming white space and the normalization performed by java.net.URL, protocols other than http and ftp are not normalized further.

org.apache.nutch.net.BasicUrlNormalizer (last edited 2009-09-20 23:09:31 by localhost)