Nutch 2.x REST API

Introduction

This page documents the Nutch 2.x REST API and provides a UML graphic of its architecture.

It explains the logic behind the API and details the REST calls it accepts. Read it in conjunction with the documentation for the bin/nutch nutchserver command.

REST API Calls

Administration

The responsible class is AdminResource. This endpoint reports the server status and manages the server's state.

Get server status

GET /admin

The response contains the server startup date, the available configuration names, the job history and the currently running jobs.

{
   "startDate":1403716000012,
   "configuration":[
      "default"
   ],
   "jobs":[

   ],
   "runningJobs":[

   ]
}
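
As a rough sketch of how a client might read this response, the Python below parses the sample body and converts startDate to a UTC datetime. It assumes startDate is milliseconds since the Unix epoch, which the size of the sample value suggests but this page does not state. parse_status is a hypothetical helper name, not part of Nutch.

```python
import json
from datetime import datetime, timezone

# Sample body copied from the response shown above.
SAMPLE_STATUS = """
{
   "startDate":1403716000012,
   "configuration":["default"],
   "jobs":[],
   "runningJobs":[]
}
"""

def parse_status(body):
    """Parse a GET /admin response body into a plain dict.

    Assumption: startDate is milliseconds since the Unix epoch.
    """
    status = json.loads(body)
    started = datetime.fromtimestamp(status["startDate"] / 1000, tz=timezone.utc)
    return {
        "started": started,
        "configurations": status["configuration"],
        "job_count": len(status["jobs"]),
        "running_count": len(status["runningJobs"]),
    }

status = parse_status(SAMPLE_STATUS)
print(status["started"].date(), status["configurations"])
```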

Stop server

A running server can be stopped using /admin/stop. The optional force parameter stops the server even if tasks are still running.

GET /admin/stop
GET /admin/stop?force=true

Response

Stopping in 5 seconds.
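
The two request forms can be produced by one small helper, sketched here in Python. The base URL (localhost, port 8081) is an assumption about where the server was started, and stop_url is a hypothetical name. The helper only builds the URL and sends nothing.

```python
from urllib.parse import urlencode

# Assumption: the server listens on localhost:8081; adjust to your setup.
BASE_URL = "http://localhost:8081"

def stop_url(base_url=BASE_URL, force=False):
    """Return the URL for GET /admin/stop, adding force=true only when asked."""
    url = base_url.rstrip("/") + "/admin/stop"
    if force:
        url += "?" + urlencode({"force": "true"})
    return url

print(stop_url())
print(stop_url(force=True))
```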

Jobs

The responsible class is JobResource. This endpoint is for job management.

Listing jobs

GET /job

The response contains a list of all jobs (running and history).

[
   {
      "id":"job-id-5977",
      "type":"FETCH",
      "confId":"default",
      "args":null,
      "result":null,
      "state":"FINISHED",
      "msg":"",
      "crawlId":"crawl-01"
   },
   {
      "id":"job-id-5978",
      "type":"PARSE",
      "confId":"default",
      "args":null,
      "result":null,
      "state":"RUNNING",
      "msg":"",
      "crawlId":"crawl-01"
   }
]
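
Because running and finished jobs arrive in one list, a client will usually filter on the state field. A minimal Python sketch follows, using a compacted copy of the sample above. jobs_in_state is a hypothetical helper, and only the two state values shown on this page are used.

```python
import json

# Compacted copy of the sample job list shown above.
SAMPLE_JOBS = """
[
   {"id":"job-id-5977","type":"FETCH","confId":"default","args":null,
    "result":null,"state":"FINISHED","msg":"","crawlId":"crawl-01"},
   {"id":"job-id-5978","type":"PARSE","confId":"default","args":null,
    "result":null,"state":"RUNNING","msg":"","crawlId":"crawl-01"}
]
"""

def jobs_in_state(body, state):
    """Return the ids of jobs whose "state" field equals the given value."""
    return [job["id"] for job in json.loads(body) if job["state"] == state]

running = jobs_in_state(SAMPLE_JOBS, "RUNNING")
print(running)
```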

Get job info

GET /job/job-id-5977

Response

   {
      "id":"job-id-5977",
      "type":"FETCH",
      "confId":"default",
      "args":null,
      "result":null,
      "state":"FINISHED",
      "msg":"",
      "crawlId":"crawl-01"
   }
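
A common use of this call is polling until a job completes. The Python sketch below shows one way to do that. wait_for_job and fake_fetch are hypothetical names, and the HTTP call is replaced by a stand-in so the example runs without a server. Only "FINISHED" is treated as a final state by default, since it is the only completed state this page shows. Pass other values through done_states if your server reports them.

```python
import time

def wait_for_job(fetch_job, job_id, done_states=("FINISHED",),
                 interval=1.0, max_polls=60):
    """Poll a job until its state is in done_states or max_polls is reached.

    fetch_job is any callable that takes a job id and returns the decoded
    JSON object from GET /job/{id}. Returns the last job object seen.
    """
    job = fetch_job(job_id)
    for _ in range(max_polls - 1):
        if job["state"] in done_states:
            break
        time.sleep(interval)
        job = fetch_job(job_id)
    return job

# Stand-in for a real HTTP call: reports RUNNING twice, then FINISHED.
_states = iter(["RUNNING", "RUNNING", "FINISHED"])

def fake_fetch(job_id):
    return {"id": job_id, "state": next(_states)}

final = wait_for_job(fake_fetch, "job-id-5977", interval=0)
print(final["state"])
```

The max_polls limit keeps the loop from running forever if the job never reaches one of the expected states.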

Stop job

GET /job/job-id-5977/stop

Response

  true

Kill job

GET /job/job-id-5977/abort

Response

  true

Create job

Creates a job with the given parameters. Specify either a job type (the "type" field) or a jobClassName.

POST /job/create
   {
      "crawlId":"crawl-01",
      "type":"FETCH",
      "confId":"default",
      "args":{"someParam":"someValue"}
   }

POST /job/create
   {
      "crawlId":"crawl-01",
      "jobClassName":"org.apache.nutch.fetcher.FetcherJob",
      "confId":"default",
      "args":{"someParam":"someValue"}
   }

The response is the id of the created job.

    job-id-43243
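
The request bodies above can be assembled programmatically. The Python sketch below builds, but does not send, a POST /job/create request. The base URL and the JSON Content-Type header are assumptions, and create_job_request is a hypothetical name. It reads "either ... or" as exactly one of the two fields, which is an interpretation rather than something this page states.

```python
import json
from urllib.request import Request

# Assumption: the server listens on localhost:8081; adjust to your setup.
BASE_URL = "http://localhost:8081"

def create_job_request(crawl_id, conf_id="default", job_type=None,
                       job_class_name=None, args=None, base_url=BASE_URL):
    """Build (but do not send) a POST /job/create request."""
    if (job_type is None) == (job_class_name is None):
        raise ValueError("specify exactly one of job_type or job_class_name")
    body = {"crawlId": crawl_id, "confId": conf_id}
    if job_type is not None:
        body["type"] = job_type
    else:
        body["jobClassName"] = job_class_name
    if args is not None:
        body["args"] = args
    return Request(
        base_url.rstrip("/") + "/job/create",
        data=json.dumps(body).encode("utf-8"),
        # Assumption: the server expects a JSON content type.
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = create_job_request("crawl-01", job_type="FETCH",
                         args={"someParam": "someValue"})
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Passing the returned object to urllib.request.urlopen would send it to a running server.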

Configuration

Configuration list

GET /config

The response contains the names of the available configurations.

  ["default","custom-config"]

Configuration parameters

GET /config/{configuration name}

Examples:
GET /config/default
GET /config/custom-config

The response contains the parameters with their values.

  {
   "anchorIndexingFilter.deduplicate":"false",
   "crawl.gen.delay":"604800000",
   "db.fetch.interval.default":"2592000",
   "db.fetch.interval.max":"7776000",
   "db.fetch.retry.max":"3",
   ....
   ....
   }

Get property value

GET /config/{configuration name}/{property}

Examples:
GET /config/default/db.fetch.retry.max
GET /config/custom-config/crawl.gen.delay

The response contains the parameter's value as a string.

    604800000
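
The three read-only configuration calls differ only in how many path segments follow /config. The Python sketch below builds all three URLs and converts a returned string value to an integer. config_url and as_int are hypothetical helpers, and the base URL is an assumption.

```python
from urllib.parse import quote

# Assumption: the server listens on localhost:8081; adjust to your setup.
BASE_URL = "http://localhost:8081"

def config_url(name=None, prop=None, base_url=BASE_URL):
    """Return the URL for /config, /config/{name} or /config/{name}/{prop}."""
    if prop is not None and name is None:
        raise ValueError("a property needs a configuration name")
    # Percent-encode each segment so unusual names cannot break the path.
    parts = ["config"] + [quote(p, safe="") for p in (name, prop) if p is not None]
    return base_url.rstrip("/") + "/" + "/".join(parts)

def as_int(value):
    """Convert a property value, which the API returns as a string, to int."""
    return int(value.strip())

print(config_url())
print(config_url("default"))
print(config_url("custom-config", "crawl.gen.delay"))
print(as_int("604800000"))
```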

Create configuration

Creates a new Nutch configuration with the given parameters. If the force field is true, an existing configuration with the same name is overridden. Otherwise it is left unchanged.

POST /config/{configuration name}

Examples:
POST /config/new-config
   {
      "configId":"new-config",
      "force":"true",
      "params":{"anchorIndexingFilter.deduplicate":"false",... }
   }

The response is the id of the created configuration.

    new-config
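
A Python sketch of building this request follows. It does not send anything. The base URL and the JSON Content-Type header are assumptions, and create_config_request is a hypothetical name. The force flag is sent as the string "true" or "false" because the example above quotes it. Whether a JSON boolean is also accepted is not stated here.

```python
import json
from urllib.parse import quote
from urllib.request import Request

# Assumption: the server listens on localhost:8081; adjust to your setup.
BASE_URL = "http://localhost:8081"

def create_config_request(config_id, params, force=False, base_url=BASE_URL):
    """Build (but do not send) a POST /config/{configuration name} request."""
    body = {
        "configId": config_id,
        # Sent as a string to match the example body on this page.
        "force": "true" if force else "false",
        "params": params,
    }
    return Request(
        base_url.rstrip("/") + "/config/" + quote(config_id, safe=""),
        data=json.dumps(body).encode("utf-8"),
        # Assumption: the server expects a JSON content type.
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = create_config_request(
    "new-config", {"anchorIndexingFilter.deduplicate": "false"}, force=True)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```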

Delete configuration

DELETE /config/{configuration name}

Examples:
DELETE /config/new-config

Update property value

PUT /config/{property name}/
value={value}

Examples:
PUT /config/anchorIndexingFilter.deduplicate
value=true

Database

The responsible class is DbResource. This endpoint retrieves data from the database.

Run query

Examples:

POST /db
   {
      
   }

POST /db
   {
      "fields": ["headers"]
   }

POST /db
   {
      "batchId": "batch-id"
   }

POST /db
   {
      "startKey":"http://google.com",
      "endKey":"http://yahoo.com",
      "isKeysReversed":"false"
   }


POST /db
   {
      "startKey":"com.google",
      "endKey":"com.yahoo",
      "isKeysReversed":"true"
   }
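
Every field in the query body is optional, so a builder that omits whatever is not supplied covers all of the examples above. The Python sketch below is one way to write it. db_query is a hypothetical name, and the field names are taken from the examples.

```python
import json

def db_query(fields=None, batch_id=None, start_key=None, end_key=None,
             keys_reversed=None):
    """Build the JSON body for POST /db, leaving out anything not supplied."""
    query = {}
    if fields is not None:
        query["fields"] = list(fields)
    if batch_id is not None:
        query["batchId"] = batch_id
    if start_key is not None:
        query["startKey"] = start_key
    if end_key is not None:
        query["endKey"] = end_key
    if keys_reversed is not None:
        # The examples above pass this flag as a string, so this does too.
        query["isKeysReversed"] = "true" if keys_reversed else "false"
    return json.dumps(query)

print(db_query())
print(db_query(fields=["headers"]))
print(db_query(start_key="com.google", end_key="com.yahoo", keys_reversed=True))
```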

The response contains data from the database, restricted to the requested fields.

{
   "values":[
      {
         "headers":{

         },
         "status":0,
         "markers":{

         },
         "modifiedTime":0,
         "score":0.0,
         "prevModifiedTime":0,
         "url":"http://google.com",
         "__g__dirty":"\\x00\\x00\\x00\\x00",
         "fetchInterval":0,
         "prevFetchTime":0,
         "inlinks":{

         },
         "retriesSinceFetch":0,
         "outlinks":{

         },
         "fetchTime":0,
         "metadata":{

         }
      }
   ]
}
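
Rows are returned under the "values" key. The Python sketch below pulls selected fields out of each row, using a shortened copy of the sample above. rows is a hypothetical helper, and fields missing from a row are skipped rather than treated as errors.

```python
import json

# Shortened copy of the sample response shown above.
SAMPLE_RESPONSE = """
{
   "values":[
      {"url":"http://google.com","status":0,"score":0.0,"fetchTime":0,
       "headers":{},"markers":{},"inlinks":{},"outlinks":{},"metadata":{}}
   ]
}
"""

def rows(body, *fields):
    """Return one dict per entry in "values", keeping only the named fields.

    Fields missing from a row are skipped rather than raising.
    """
    values = json.loads(body)["values"]
    return [{f: row[f] for f in fields if f in row} for row in values]

print(rows(SAMPLE_RESPONSE, "url", "status"))
```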

REST API improvement proposals

Nutch Jobs

UML Graphic

The Unified Modeling Language (UML) is a general-purpose modeling language in the field of software engineering, which is designed to provide a standard way to visualize the design of a system.

The graphic below displays the REST API architecture and describes the classes as well as their roles and context within the API operation.

API.png

Some comments about class roles in Nutch API.


NutchRESTAPI (last edited 2014-06-25 21:26:08 by FjodorVershinin)