Differences between revisions 1 and 2
Revision 1 as of 2006-03-07 16:51:16
Size: 791
Editor: JeffRitchie
Comment: adding page
Revision 2 as of 2009-09-20 23:09:31
Size: 791
Editor: localhost
Comment: converted to 1.6 markup

BasicUrlNormalizer Notes

The BasicUrlNormalizer class normalizes a URL in several ways:

  1. Trims leading and trailing white space from the URL (java.lang.String.trim()).
  2. May lowercase the protocol (java.net.URL).
  3. If the protocol is http or ftp:
    1. Lowercases the host.
    2. Removes the port if it is the default for the protocol.
    3. Adds a trailing slash if no file is specified.
    4. Removes any reference text (the part after "#").
    5. Removes any relative path segments, such as "../".

For example:

  • http://wiKI.apache.ORG:80/somedirectory/../DevelopmentCommandLineOptions

would be rewritten as:

  • http://wiki.apache.org/DevelopmentCommandLineOptions
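The steps listed above can be sketched roughly as follows. This is an illustration only, not the Nutch source: the class name BasicNormalizationSketch and the method normalize are hypothetical, and the sketch parses with java.net.URI instead of java.net.URL so that it can use URI.normalize() to resolve relative path segments. It also drops any user info, which the real class may handle differently.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical illustration of the steps listed above; not the Nutch implementation.
final class BasicNormalizationSketch {

    static String normalize(String urlString) {
        // Step 1: String.trim() strips white space from both ends.
        String trimmed = urlString.trim();
        URI uri = URI.create(trimmed);

        // Step 2: lowercase the protocol (done by hand here; assumption of this sketch).
        String scheme = uri.getScheme();
        if (scheme == null) {
            return trimmed;
        }
        scheme = scheme.toLowerCase(Locale.ROOT);

        // Step 3 applies only to http and ftp; everything else is returned as trimmed.
        boolean isHttp = scheme.equals("http");
        if ((!isHttp && !scheme.equals("ftp")) || uri.getHost() == null) {
            return trimmed;
        }

        // 3.1: lowercase the host.
        String host = uri.getHost().toLowerCase(Locale.ROOT);

        // 3.2: keep the port only when it is not the protocol default.
        int port = uri.getPort();
        int defaultPort = isHttp ? 80 : 21;

        // 3.5: URI.normalize() resolves "." and ".." path segments.
        String path = uri.normalize().getRawPath();

        // 3.3: add a trailing slash when no file is specified.
        if (path == null || path.isEmpty()) {
            path = "/";
        }

        StringBuilder out = new StringBuilder(scheme).append("://").append(host);
        if (port != -1 && port != defaultPort) {
            out.append(':').append(port);
        }
        out.append(path);
        if (uri.getRawQuery() != null) {
            out.append('?').append(uri.getRawQuery());
        }
        // 3.4: the reference text (fragment) is simply never appended.
        return out.toString();
    }

    public static void main(String[] args) {
        // prints http://wiki.apache.org/DevelopmentCommandLineOptions
        System.out.println(normalize(
            "http://wiKI.apache.ORG:80/somedirectory/../DevelopmentCommandLineOptions"));
    }
}
```

In this sketch, a URL with any other protocol, for example https://Example.COM:443/a/../b, comes back unchanged apart from the trimming.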

Notes

Apart from trimming white space and the normalization performed by java.net.URL, URLs with protocols other than http and ftp are not normalized any further.

org.apache.nutch.net.BasicUrlNormalizer (last edited 2009-09-20 23:09:31 by localhost)