Differences between revisions 22 and 23
Revision 22 as of 2015-07-16 15:01:39
Size: 4291
Comment:
Revision 23 as of 2015-07-16 15:13:20
Size: 3982
Comment:
Line 2: Line 2:
The index structure formed after indexing is shown below :
||'''Field Name''' ||'''Stored''' ||'''Index''' ||'''Plugin/Class''' ||'''Comment''' ||||'''version''' ||
|| || || || || ||'''1.x''' ||'''2.x''' ||
||id ||YES ||Indexed, Un-Tokenized ||[[http://nutch.apache.org/apidocs/apidocs-1.8/org/apache/nutch/indexer/IndexerMapReduce.html|IndexerMapReduce]]/[[http://nutch.apache.org/apidocs/apidocs-2.2.1/org/apache/nutch/indexer/IndexUtil.html|IndexUtil]] ||'''URL''' used as '''ID''' to update and delete documents ||X ||X ||
||boost ||YES ||Not Indexed ||various scoring plugins ||Adds a '''score''' value field to a particular document. This is allocated based upon its importance within the webgraph. ||? ||? ||
||digest ||YES ||Not Indexed ||org.apache.nutch.indexer.IndexerMapReduce.java ||Adds a '''message digest''' field to a document. Can be MD5 over content and headers or more sophisticated text profile of the content. ||? ||? ||
||lang ||YES ||Un-Tokenized ||language-identifier ||Add a '''lang''', language field to a document. ||? ||? ||
||segment ||YES ||Not Indexed ||org.apache.nutch.indexer.IndexerMapReduce.java ||Adds the originating '''segment''' field to the document, used to identify the most recent segment in which this document was fetched. ||? ||? ||
||tstamp ||YES ||Tokenized || /!\ NEEDS COMMENT /!\ ||Adds a '''timestamp''' field of the most recent time this document was fetched ||? ||? ||
||cc:license ||YES ||Indexed, Tokenized ||creativecommons ||Adds the entire license as '''cc:license=xxx''' and '''attributes''' extracted of the license url ||? ||? ||
||cc:meta ||YES ||Indexed, Tokenized ||creativecommons ||Adds the license location as '''cc:meta=xxx''' ||? ||? ||
||cc:type ||YES ||Indexed,Tokenized ||creativecommons ||Adds the work type as '''cc:type=xxx''' ||? ||? ||
||anchor ||NO ||Tokenized ||index-anchor ||Indexing filter that indexes all inbound '''anchor text''' for a document. ||? ||? ||
||title ||YES ||Tokenized ||index-basic ||Adds basic searchable '''title field''' to a document. Also indexed by index-more ||? ||? ||
||host ||NO ||Tokenized ||index-basic ||Adds basic searchable '''hostname field''' to a document. ||? ||? ||
||url ||YES ||Tokenized ||index-basic ||Adds basic searchable '''URL field''' to a document. ||? ||? ||
||content ||NO ||Tokenized ||index-basic ||Adds basic searchable '''content field''' to a document. ||? ||? ||
||lastModified ||NO ||Indexed, Un-Tokenized ||index-more ||Adds some time related meta info in the form of '''last-modified''' if present. ||? ||? ||
||date ||NO ||Indexed, Un-Tokenized ||index-more ||Index date as last-modified, or, if that's not present, uses fetch time. ||? ||? ||
||contentLength ||NO ||Indexed, Un-Tokenized ||index-more || /!\ NEEDS COMMENT /!\ ||? ||? ||
||type ||NO ||Indexed, Un-Tokenized ||index-more ||Adds contentType, primaryType, subType (all mime-types) ||? ||? ||
||primaryType ||NO ||Indexed, Un-Tokenized ||index-more ||primaryType (mime-type) ||? ||? ||
||subType ||NO ||Indexed, Un-Tokenized ||index-more ||subType (mime-type) ||? ||? ||
||tld ||YES ||Un-Tokenized / NotStored(based on conf) ||tld ||Adds a '''top level domain''' field to the document. ||? ||? ||
||subcollection ||YES ||Tokenized ||subcollection ||For Comprehensive description see src/java/org/apache/nutch/collection/'''package.html''' ||? ||? ||
||urlmeta ||NO ||Indexed, Un-Tokenized ||urlmeta ||Adds any specified '''url metadata tags''' to the document in the index. ||? ||? ||
Line 3: Line 29:
The index structure formed after indexing is shown below :
Line 5: Line 30:
||'''Field Name'''||'''Stored'''||'''Index'''|| '''Plugin/Class''' ||'''Comment'''||<-2> '''version'''||
|| || || || || || '''1.x''' || '''2.x''' ||
|| id || YES || Indexed, Un-Tokenized || [[http://nutch.apache.org/apidocs/apidocs-1.8/org/apache/nutch/indexer/IndexerMapReduce.html|IndexerMapReduce]]/[[http://nutch.apache.org/apidocs/apidocs-2.2.1/org/apache/nutch/indexer/IndexUtil.html|IndexUtil]] || '''URL''' used as '''ID''' to update and delete documents || X || X ||
|| boost || YES || Not Indexed || various scoring plugins || Adds a '''score''' value field to a particular document. This is allocated based upon its importance within the webgraph. || ? || ? ||
|| digest || YES || Not Indexed || org.apache.nutch.indexer.IndexerMapReduce.java || Adds a '''message digest''' field to a document. Can be MD5 over content and headers or more sophisticated text profile of the content. || ? || ? ||
|| lang || YES || Un-Tokenized || language-identifier || Add a '''lang''', language field to a document.|| ? || ? ||
|| segment || YES || Not Indexed || org.apache.nutch.indexer.IndexerMapReduce.java || Adds the originating '''segment''' field to the document, used to identify the most recent segment in which this document was fetched. || ? || ? ||
|| tstamp || YES || Tokenized || /!\ NEEDS COMMENT /!\ || Adds a '''timestamp''' field of the most recent time this document was fetched || ? || ? ||
|| cc:license || YES || Indexed, Tokenized || creativecommons || Adds the entire license as '''cc:license=xxx''' and '''attributes''' extracted of the license url|| ? || ? ||
|| cc:meta || YES || Indexed, Tokenized || creativecommons || Adds the license location as '''cc:meta=xxx''' || ? || ? ||
|| cc:type || YES || Indexed,Tokenized || creativecommons || Adds the work type as '''cc:type=xxx'''|| ? || ? ||
|| anchor || NO || Tokenized || index-anchor || Indexing filter that indexes all inbound '''anchor text''' for a document.|| ? || ? ||
|| title || YES || Tokenized || index-basic || Adds basic searchable '''title field''' to a document. Also indexed by index-more || ? || ? ||
|| host || NO || Tokenized || index-basic || Adds basic searchable '''hostname field''' to a document. || ? || ? ||
|| url || YES || Tokenized || index-basic || Adds basic searchable '''URL field''' to a document. || ? || ? ||
|| content || NO || Tokenized || index-basic || Adds basic searchable '''content field''' to a document. || ? || ? ||
|| lastModified || NO || Indexed, Un-Tokenized || index-more || Adds some time related meta info in the form of '''last-modified''' if present. || ? || ? ||
|| date || NO || Indexed, Un-Tokenized || index-more || Index date as last-modified, or, if that's not present, uses fetch time. || ? || ? ||
|| contentLength || NO || Indexed, Un-Tokenized || index-more || /!\ NEEDS COMMENT /!\ || ? || ? ||
|| type || NO || Indexed, Un-Tokenized || index-more || Adds contentType, primaryType, subType (all mime-types) || ? || ? ||
|| primaryType || NO || Indexed, Un-Tokenized || index-more || primaryType (mime-type) || ? || ? ||
|| subType || NO || Indexed, Un-Tokenized || index-more || subType (mime-type) || ? || ? ||
|| tld || YES || Un-Tokenized / NotStored(based on conf) || tld || Adds a '''top level domain''' field to the document. || ? || ? ||
|| subcollection || YES || Tokenized || subcollection || For Comprehensive description see src/java/org/apache/nutch/collection/'''package.html''' || ? || ? ||
|| urlmeta || NO || Indexed, Un-Tokenized || urlmeta || Adds any specified '''url metadata tags''' to the document in the index.|| ? || ? ||

Line 31: Line 33:
Jira Issues about indexing and IndexingFilterPlugins are
Jira Issues about indexing and IndexingFilterPlugins are
Line 35: Line 37:
 * [[index-replace plugin]]
 * [[IndexReplace|index-replace plugin]]
Line 38: Line 40:
The index plugins to include are :
Line 39: Line 42:
The index plugins to include are :

index-(anchor | basic | more | static | replace ) | tld | subcollection | creativecommons | language-identifier | urlmeta
 . index-(anchor | basic | more | static | replace ) | tld | subcollection | creativecommons | language-identifier | urlmeta

The Index Structure

The index structure formed after indexing is shown below:

||'''Field Name''' ||'''Stored''' ||'''Index''' ||'''Plugin/Class''' ||'''Comment''' ||<-2> '''version''' ||
|| || || || || ||'''1.x''' ||'''2.x''' ||
||id ||YES ||Indexed, Un-Tokenized ||IndexerMapReduce/IndexUtil ||URL used as ID to update and delete documents ||X ||X ||
||boost ||YES ||Not Indexed ||various scoring plugins ||Adds a score value field to a particular document. This is allocated based upon its importance within the webgraph. ||? ||? ||
||digest ||YES ||Not Indexed ||org.apache.nutch.indexer.IndexerMapReduce.java ||Adds a message digest field to a document. Can be MD5 over content and headers or more sophisticated text profile of the content. ||? ||? ||
||lang ||YES ||Un-Tokenized ||language-identifier ||Add a lang, language field to a document. ||? ||? ||
||segment ||YES ||Not Indexed ||org.apache.nutch.indexer.IndexerMapReduce.java ||Adds the originating segment field to the document, used to identify the most recent segment in which this document was fetched. ||? ||? ||
||tstamp ||YES ||Tokenized || /!\ NEEDS COMMENT /!\ ||Adds a timestamp field of the most recent time this document was fetched ||? ||? ||
||cc:license ||YES ||Indexed, Tokenized ||creativecommons ||Adds the entire license as cc:license=xxx and attributes extracted of the license url ||? ||? ||
||cc:meta ||YES ||Indexed, Tokenized ||creativecommons ||Adds the license location as cc:meta=xxx ||? ||? ||
||cc:type ||YES ||Indexed, Tokenized ||creativecommons ||Adds the work type as cc:type=xxx ||? ||? ||
||anchor ||NO ||Tokenized ||index-anchor ||Indexing filter that indexes all inbound anchor text for a document. ||? ||? ||
||title ||YES ||Tokenized ||index-basic ||Adds basic searchable title field to a document. Also indexed by index-more ||? ||? ||
||host ||NO ||Tokenized ||index-basic ||Adds basic searchable hostname field to a document. ||? ||? ||
||url ||YES ||Tokenized ||index-basic ||Adds basic searchable URL field to a document. ||? ||? ||
||content ||NO ||Tokenized ||index-basic ||Adds basic searchable content field to a document. ||? ||? ||
||lastModified ||NO ||Indexed, Un-Tokenized ||index-more ||Adds some time related meta info in the form of last-modified if present. ||? ||? ||
||date ||NO ||Indexed, Un-Tokenized ||index-more ||Index date as last-modified, or, if that's not present, uses fetch time. ||? ||? ||
||contentLength ||NO ||Indexed, Un-Tokenized ||index-more || /!\ NEEDS COMMENT /!\ ||? ||? ||
||type ||NO ||Indexed, Un-Tokenized ||index-more ||Adds contentType, primaryType, subType (all mime-types) ||? ||? ||
||primaryType ||NO ||Indexed, Un-Tokenized ||index-more ||primaryType (mime-type) ||? ||? ||
||subType ||NO ||Indexed, Un-Tokenized ||index-more ||subType (mime-type) ||? ||? ||
||tld ||YES ||Un-Tokenized / NotStored(based on conf) ||tld ||Adds a top level domain field to the document. ||? ||? ||
||subcollection ||YES ||Tokenized ||subcollection ||For Comprehensive description see src/java/org/apache/nutch/collection/package.html ||? ||? ||
||urlmeta ||NO ||Indexed, Un-Tokenized ||urlmeta ||Adds any specified url metadata tags to the document in the index. ||? ||? ||
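
To make the table above easier to query, a few of its rows can be transcribed into a small data structure. The following is only an illustrative Python sketch: the `IndexField` class and the helper functions are hypothetical names introduced here, not part of Nutch, and only six rows are copied, so the results say nothing about the remaining fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexField:
    """One row of the index structure table (hypothetical helper type)."""
    name: str
    stored: bool   # the "Stored" column: YES -> True, NO -> False
    index: str     # the "Index" column, copied verbatim
    source: str    # the "Plugin/Class" column


# A subset of rows transcribed from the table; not the full list.
FIELDS = [
    IndexField("id", True, "Indexed, Un-Tokenized", "IndexerMapReduce/IndexUtil"),
    IndexField("boost", True, "Not Indexed", "various scoring plugins"),
    IndexField("title", True, "Tokenized", "index-basic"),
    IndexField("content", False, "Tokenized", "index-basic"),
    IndexField("anchor", False, "Tokenized", "index-anchor"),
    IndexField("date", False, "Indexed, Un-Tokenized", "index-more"),
]


def stored_fields(fields):
    """Names of the fields marked YES in the Stored column."""
    return [f.name for f in fields if f.stored]


def fields_from(fields, source):
    """Names of the fields attributed to the given plugin or class."""
    return [f.name for f in fields if f.source == source]


print(stored_fields(FIELDS))
print(fields_from(FIELDS, "index-basic"))
```

In this subset, `content` and `anchor` are searchable (tokenized) but not stored, while `boost` is stored but not indexed.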


Jira Issues about indexing and IndexingFilterPlugins are


The index plugins to include are:

  • index-(anchor | basic | more | static | replace ) | tld | subcollection | creativecommons | language-identifier | urlmeta
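
The expression above reads like a regular expression over plugin ids, in the way Nutch's `plugin.includes` property is written. As a rough check of which ids it covers, the Python sketch below strips the spaces and tests a few ids against it. Matching the whole id (rather than a substring) is an assumption made here for illustration, `is_included` is a hypothetical helper, and `parse-html` is used only as an example of an id that is not in the list.

```python
import re

# The expression from the text with the spaces removed, so that it is a
# valid regular expression. Whole-id matching is an assumption of this sketch.
INDEX_PLUGINS = (
    r"index-(anchor|basic|more|static|replace)"
    r"|tld|subcollection|creativecommons|language-identifier|urlmeta"
)


def is_included(plugin_id, pattern=INDEX_PLUGINS):
    """Return True if the entire plugin id matches the include expression."""
    return re.fullmatch(pattern, plugin_id) is not None


for plugin_id in ["index-basic", "index-static", "tld", "urlmeta", "parse-html"]:
    print(plugin_id, is_included(plugin_id))
```

A real `plugin.includes` value normally lists other plugin families (protocol, parse, scoring and so on) alongside these; the expression here covers only the indexing-related part quoted on this page.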

IndexStructure (last edited 2015-07-16 15:13:20 by PeterCiuffetti)