How to run Jobs using the Nutch REST service

Introduction

This tutorial shows how REST calls can be made to the NutchServer to run various jobs like Inject, Generate, Fetch, etc.

Instructions to start Nutch Server

Follow the steps below to start an instance of the Nutch Server on localhost.

  1. :~$ cd runtime/local

  2. :~$ bin/nutch startserver -port <port_number> -host <host_name>

If the host/port options are not specified, the server starts on localhost:8081 by default.
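
For example, to start the server on port 8899 and verify that it is responding (the GET /admin status endpoint is an assumption here; adjust to your Nutch version):

:~$ bin/nutch startserver -port 8899

# Check that the service responds (assumes the /admin status endpoint):
:~$ curl http://localhost:8899/admin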

Jobs

Currently the service supports running the following jobs: Inject, Generate, Fetch, Parse, Index, Updatedb, Invertlinks, Dedup and Readdb. A new job is created by issuing a POST request to /job/create with the following JSON data:
POST /job/create
{
   "type":"job type",
   "confId":"default",
   "args":{"someParam":"someValue"}
}
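
From the shell, such a request can be issued with curl; for example, submitting the inject job described in the next section (assuming the server runs on the default localhost:8081):

:~$ curl -X POST -H "Content-Type: application/json" \
     -d '{"type":"INJECT","confId":"default","crawlId":"crawl01","args":{"url_dir":"url/"}}' \
     http://localhost:8081/job/create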

Inject Job

To run the inject job, call POST /job/create with the following:
POST /job/create
{
    "type":"INJECT",
    "confId":"default",
    "crawlId":"crawl01",
    "args": {"url_dir":"url/"}
}
The args object contains one key, url_dir, which should correspond to the path of the directory where the seed file is stored. The response of the request is a JSON output:
{
   "confId":"default",
   "args":{"url_dir":"url/"},
   "crawlId":"crawl01",
   "msg":"OK",
   "id":"default-INJECT-635077497",
   "state":"RUNNING",
   "type":"INJECT",
   "result":null
}
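
The id field identifies the submitted job and the state field reports its progress. The current state of all jobs can be checked with a GET request (the GET /job endpoint is assumed here from the Nutch REST API):

# List all jobs with their current state (e.g. RUNNING):
:~$ curl http://localhost:8081/job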

Generate Job

To run the generate job, call POST /job/create with the following:
POST /job/create
{
    "type":"GENERATE",
    "confId":"default",
    "crawlId":"crawl01",
    "args": {}
}
The args object accepts the keys force, topN, numFetchers, adddays, noFilter, noNorm and maxNumSegments, which should be set to appropriate values.

A description of these parameters can be found in the documentation of the command-line generate job.
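
For example, to generate a segment containing at most the 1000 top-scoring URLs (the value below is illustrative):

POST /job/create
{
    "type":"GENERATE",
    "confId":"default",
    "crawlId":"crawl01",
    "args": {"topN":"1000"}
}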

The response of the request is a JSON output:

{
    "confId":"default",
    "args":{},
    "crawlId":"crawl01",
    "msg":"OK",
    "id":"default-GENERATE-274614034",
    "state":"RUNNING",
    "type":"GENERATE",
    "result":null
}

Fetch Job

To run the fetch job, call POST /job/create with the following:
POST /job/create
{
    "type":"FETCH",
    "confId":"default",
    "crawlId":"crawl01",
    "args": {}
}
The args object accepts the keys threads and noParsing, which should be set to appropriate values.

A description of these parameters can be found in the documentation of the command-line fetch job.
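
For example, to fetch with 10 threads (the value is illustrative):

POST /job/create
{
    "type":"FETCH",
    "confId":"default",
    "crawlId":"crawl01",
    "args": {"threads":"10"}
}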

The response of the request is a JSON output:

{
     "confId":"default",
     "args":{},
     "crawlId":"crawl01",
     "msg":"idle",
     "id":"default-FETCH-99398319",
     "state":"IDLE",
     "type":"FETCH",
     "result":null
}

Parse Job

To run the parse job, call POST /job/create with the following:
POST /job/create
{
    "type":"PARSE",
    "confId":"default",
    "crawlId":"crawl01",
    "args": {"noFilter":"true"}
}
The args object accepts the keys noFilter and noNormalize, which should be set to appropriate values.

A description of these parameters can be found in the documentation of the command-line parse job.

The response of the request is a JSON output:

{
     "confId":"default",
     "args":{"noFilter":"true"},
     "crawlId":"crawl01",
     "msg":"OK",
     "id":"default-PARSE-1413156163",
     "state":"IDLE",
     "type":"PARSE",
     "result":null
}

Index Job

To run the index job, call POST /job/create with the following:
POST /job/create
{
    "type":"INDEX",
    "confId":"new-config",
    "crawlId":"crawl01",
    "args": {}
}

Before running the index job, the user needs to configure an indexer. A user-defined indexer (e.g. Solr, Elasticsearch) can be configured using the configuration endpoint. A detailed description of how to configure and run the index job can be found here.
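As a sketch, a named configuration such as new-config can be created through the configuration endpoint before submitting the index job. The POST /config/create payload shape shown below, and the solr.server.url property for a Solr setup, are assumptions; adapt them to your indexer:

# Create a configuration named "new-config" (payload shape is an assumption;
# solr.server.url is shown for a Solr indexer running locally):
:~$ curl -X POST -H "Content-Type: application/json" \
     -d '{"configId":"new-config","force":"true","params":{"solr.server.url":"http://localhost:8983/solr/nutch"}}' \
     http://localhost:8081/config/create
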

The args object accepts the keys crawldb, linkdb, params, dir, segments, noCommit, deleteGone, filter and normalize.

The response of the request is a JSON output:

{
    "confId":"new-config",
    "args":{},
    "crawlId":"crawl01",
    "msg":"OK",
    "id":"default-INDEX-572647647",
    "state":"RUNNING",
    "type":"INDEX",
    "result":null
}

Updatedb Job

To run the updatedb job, call POST /job/create with the following:
POST /job/create
{
    "type":"UPDATEDB",
    "confId":"default",
    "crawlId":"crawl01",
    "args": {}
}
The args object accepts the keys force, normalize, filter and noAdditions, which should be set to appropriate values.

A description of these parameters can be found in the documentation of the command-line updatedb job.

The response of the request is a JSON output:

{
    "confId":"default",
    "args":{"crawldb":"crawl/crawldb","segments":"crawl/segments/20150331153517"},
    "crawlId":null,
    "msg":"OK",
    "id":"default-UPDATEDB-1250603698",
    "state":"RUNNING",
    "type":"UPDATEDB",
    "result":null
}

Invertlinks Job

To run the invertlinks job, call POST /job/create with the following:
POST /job/create
{
    "type":"INVERTLINKS",
    "confId":"default",
    "crawlId":"crawl01",
    "args": {}
}

The args object accepts the keys force, noNormalize and noFilter, which should be set to appropriate values.

A description of these parameters can be found in the documentation of the command-line invertlinks job.

The response of the request is a JSON output:

{
    "confId":"default",
    "args":{},
    "crawlId":"crawl01",
    "msg":"OK",
    "id":"default-INVERTLINKS-572647647",
    "state":"RUNNING",
    "type":"INVERTLINKS",
    "result":null
}

Dedup Job

To run the dedup job, call POST /job/create with the following:
POST /job/create
{
    "type":"DEDUP",
    "confId":"default",
    "crawlId":"crawl01",
    "args": {}
}

The response of the request is a JSON output:

{
    "confId":"default",
    "args":{"crawldb":"crawl/crawldb"},
    "crawlId":"crawl01",
    "msg":"OK",
    "id":"default-DEDUP-1394212503",
    "state":"RUNNING",
    "type":"DEDUP",
    "result":null
}
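
Putting it together, a complete crawl cycle can be driven from the shell by submitting each job in turn and polling its state before moving on. The sketch below assumes a GET /job/<id> endpoint that returns the job JSON shown above, that jq is installed, and that the server runs on the default localhost:8081:

#!/bin/bash
# Minimal sketch: run one crawl cycle through the Nutch REST service.
# Assumes GET /job/<id> returns the job JSON shown above and jq is installed.
SERVER=http://localhost:8081

run_job() {
  # Submit the job and capture its id from the response.
  id=$(curl -s -X POST -H "Content-Type: application/json" \
       -d "$1" "$SERVER/job/create" | jq -r '.id')
  # Poll until the job leaves the RUNNING state.
  while [ "$(curl -s "$SERVER/job/$id" | jq -r '.state')" = "RUNNING" ]; do
    sleep 5
  done
}

run_job '{"type":"INJECT","confId":"default","crawlId":"crawl01","args":{"url_dir":"url/"}}'
run_job '{"type":"GENERATE","confId":"default","crawlId":"crawl01","args":{}}'
run_job '{"type":"FETCH","confId":"default","crawlId":"crawl01","args":{}}'
run_job '{"type":"PARSE","confId":"default","crawlId":"crawl01","args":{}}'
run_job '{"type":"UPDATEDB","confId":"default","crawlId":"crawl01","args":{}}'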
