Pluggable Indexing

The index command (which runs org.apache.nutch.indexer.IndexingJob) takes the content from one or more segments and passes it to all enabled IndexWriter plugins, which send the documents to Solr, Elasticsearch, or other index back-ends.


Nutch 1.x


Usage: Indexer (<crawldb> | -nocrawldb) (<segment> ... | -dir <segments>) [general options]

Index given segments using configured indexer plugins

The CrawlDb is optional but it is required to send deletion requests for duplicates
and to read the proper document score/boost/weight passed to the indexers.

Required arguments:

        <crawldb>       path to CrawlDb, or
        -nocrawldb      flag to indicate that no CrawlDb shall be used

        <segment> ...   path(s) to segment, or
        -dir <segments> path to segments/ directory,
                        (all subdirectories are read as segments)

General options:

        -linkdb <linkdb>        use LinkDb to index anchor texts of incoming links
        -params k1=v1&k2=v2...  parameters passed to indexer plugins
                                (via property indexer.additional.params)

        -noCommit       do not call the commit method of indexer plugins
        -deleteGone     send deletion requests for 404s, redirects, duplicates
        -filter         skip documents with URL rejected by configured URL filters
        -normalize      normalize URLs before indexing
        -addBinaryContent       index raw/binary content in field `binaryContent`
        -base64         use Base64 encoding for binary content
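
As an illustration of the usage text above, the following sketch assembles a typical invocation. The crawl layout (`crawl/crawldb`, `crawl/linkdb`, `crawl/segments`), the launcher path `bin/nutch`, and the variable names `NUTCH_BIN`, `CRAWL_DIR`, and `INDEX_CMD` are assumptions chosen for the example, not values prescribed by Nutch; adjust them to your installation.

```shell
#!/bin/sh
# Assumed locations (hypothetical, override via environment variables).
NUTCH_BIN="${NUTCH_BIN:-bin/nutch}"
CRAWL_DIR="${CRAWL_DIR:-crawl}"

# Argument order follows the usage text: CrawlDb first, then the segments,
# then the general options.
set -- index "$CRAWL_DIR/crawldb" -dir "$CRAWL_DIR/segments" \
       -linkdb "$CRAWL_DIR/linkdb" -deleteGone -filter -normalize

# Print the command so it can be reviewed before anything is sent to the index.
INDEX_CMD="$NUTCH_BIN $*"
echo "$INDEX_CMD"

# Run the job only if the launcher script is present and executable.
if [ -x "$NUTCH_BIN" ]; then
  "$NUTCH_BIN" "$@"
fi
```

When adding `-params`, quote the value (for example `-params 'k1=v1&k2=v2'`), since an unquoted `&` would be interpreted by the shell as a background operator.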

IndexWriter plugins must be enabled via the property plugin.includes. See IndexWriter for how to configure these plugins.
