Attachment 'crawl-urlfilter.txt'

# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


# The url filter file used by the crawl command.

# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept hosts in MY.DOMAIN.NAME (set to apache.org here)
+^http://([a-z0-9]*\.)*apache\.org/

# skip everything else
-.
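
The first-match rule described in the file's comments can be illustrated with a small Python sketch. This is not Nutch code. It assumes each pattern is applied as an unanchored search, which is how the comments read, and it uses Python's `re` module in place of the Java regex engine. The names `parse_rules` and `is_accepted` are hypothetical. The rule text mirrors the patterns listed above, with the dot in the host name escaped.

```python
import re

# Rules in the same order as the filter file (comments omitted for brevity).
RULES_TEXT = r"""
-^(file|ftp|mailto):
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
-[?*!@=]
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
+^http://([a-z0-9]*\.)*apache\.org/
-.
"""


def parse_rules(text):
    """Turn filter-file text into a list of (include, compiled_regex) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        # Blank lines and '#' comments carry no rule.
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with '+' or '-': {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def is_accepted(url, rules):
    """Return the verdict of the first matching rule; ignore the URL if none match."""
    for include, regex in rules:
        if regex.search(url):
            return include
    return False


RULES = parse_rules(RULES_TEXT)

if __name__ == "__main__":
    for url in (
        "http://lucene.apache.org/nutch/",
        "http://lucene.apache.org/logo.png",
        "http://lucene.apache.org/search?q=nutch",
        "http://apache.org/a/b/a/c/a/",
        "http://example.com/",
    ):
        print(url, "->", "included" if is_accepted(url, RULES) else "ignored")
```

Because the first match wins, rule order matters: the suffix, query-character, and loop rules appear before the `+` host rule, so they reject a URL even when it is on an accepted host.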
