IndexStructure

The Index Structure

The index structure produced by indexing is shown below:

FieldName      Stored  Index                                   IndexingFilter       Comment
-------------  ------  --------------------------------------  -------------------  ------------------------------------------------
boost          YES     NotIndexed                              Indexer
digest         YES     NotIndexed                              Indexer
lang           YES     UnTokenized                             language-identifier
segment        YES     NotIndexed                              Indexer
tstamp         YES     Tokenized                               Indexer
anchor         NO      Tokenized                               index-basic
title          YES     Tokenized                               index-basic          also by index-more
site           NO      UnTokenized                             index-basic
host           NO      Tokenized                               index-basic          hostname
url            YES     Tokenized                               index-basic
content        NO      Tokenized                               index-basic          content
lastModified   YES     NotIndexed                              index-more
date           NO      UnTokenized                             index-more
contentLength  YES     NotIndexed                              index-more
type           NO      UnTokenized                             index-more           contentType,primaryType,subType (all mime-types)
primaryType    YES     UnTokenized                             index-more           primaryType (mime-type)
subType        YES     UnTokenized                             index-more           subType (mime-type)
domain         NO      Tokenized                               index-domain         see http://issues.apache.org/jira/browse/NUTCH-445
tld            YES     UnTokenized / NotStored(based on conf)  tld                  see http://issues.apache.org/jira/browse/NUTCH-439
category       NO      UnTokenized                             index-url-category   see http://issues.apache.org/jira/browse/NUTCH-386
subcollection  YES     Tokenized                               subcollection        see subcollection plugin
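
To make the Stored and Index columns of the table above easier to reason about, here is a minimal Python sketch that models a few of the rows as plain data. It assumes the column values carry their usual Lucene meanings: a stored field can be read back from a hit, a NotIndexed field cannot be searched, an UnTokenized field is indexed as a single term, and a Tokenized field is analyzed into terms. The names FieldSpec, FIELDS, searchable, retrievable and exact_match_only are hypothetical and are not part of Nutch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    name: str     # "FieldName" column
    stored: bool  # "Stored" column: YES -> True, NO -> False
    index: str    # "Index" column: NotIndexed, UnTokenized or Tokenized
    source: str   # "IndexingFilter" column


# A subset of the table rows, copied as listed (illustrative only).
FIELDS = [
    FieldSpec("boost", True, "NotIndexed", "Indexer"),
    FieldSpec("lang", True, "UnTokenized", "language-identifier"),
    FieldSpec("anchor", False, "Tokenized", "index-basic"),
    FieldSpec("title", True, "Tokenized", "index-basic"),
    FieldSpec("site", False, "UnTokenized", "index-basic"),
    FieldSpec("url", True, "Tokenized", "index-basic"),
    FieldSpec("content", False, "Tokenized", "index-basic"),
    FieldSpec("lastModified", True, "NotIndexed", "index-more"),
]


def searchable(fields):
    """Names of fields a query can match (anything that is indexed)."""
    return [f.name for f in fields if f.index != "NotIndexed"]


def retrievable(fields):
    """Names of fields whose original value can be read back from a hit."""
    return [f.name for f in fields if f.stored]


def exact_match_only(fields):
    """Names of fields indexed as one whole term (no tokenization)."""
    return [f.name for f in fields if f.index == "UnTokenized"]


if __name__ == "__main__":
    print("searchable:      ", searchable(FIELDS))
    print("retrievable:     ", retrievable(FIELDS))
    print("exact match only:", exact_match_only(FIELDS))
```

Under these assumptions, a field such as content can be searched but not displayed from the index, while boost can be read back but never matched by a query.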


Jira issues about indexing and IndexingFilterPlugins are:


The index plugins to include are:

last edited 2007-02-16 09:59:31 by EnisSoztutar