Nutch 2.x REST API

Introduction

This page documents the Nutch 2.x REST API and provides a UML graphic of its architecture.

It explains the logic behind the API and details the REST calls that can be made to it. Read it in conjunction with the documentation on the bin/nutch nutchserver command.

REST API Calls

Administration

The responsible class is AdminResource. This endpoint reports the server status and manages the server's state.

Get server status

GET /admin

The response contains the server startup date, the available configuration names, the job history and the currently running jobs.

{
   "startDate":1403716000012,
   "configuration":[
      "default"
   ],
   "jobs":[

   ],
   "runningJobs":[

   ]
}
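
The status payload is plain JSON, so a client can decode it with any JSON library. The following is a minimal Python sketch, not part of Nutch. It assumes that startDate is a Java-style timestamp in milliseconds since the Unix epoch (the page does not state the unit), and parse_status is a hypothetical helper name.

```python
import json
from datetime import datetime, timezone

# Sample body modeled on the GET /admin response above.
SAMPLE_STATUS = '''
{
   "startDate":1403716000012,
   "configuration":["default"],
   "jobs":[],
   "runningJobs":[]
}
'''

def parse_status(body):
    """Decode a /admin response (hypothetical helper, not part of Nutch)."""
    status = json.loads(body)
    # Assumption: startDate is milliseconds since the Unix epoch.
    seconds, millis = divmod(status["startDate"], 1000)
    started = datetime.fromtimestamp(seconds, tz=timezone.utc)
    started = started.replace(microsecond=millis * 1000)
    return {
        "started": started,
        "configurations": status["configuration"],
        # The server is considered idle when no job is currently running.
        "idle": not status["runningJobs"],
    }

summary = parse_status(SAMPLE_STATUS)
print(summary["started"].isoformat(), summary["configurations"], summary["idle"])
```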

Stop server

A running server can be stopped using /admin/stop. The optional force parameter stops the server even if tasks are still running.

GET /admin/stop
GET /admin/stop?force=true

Response

Stopping in 5 seconds.
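
As a small illustration, the two stop URLs can be built as follows. This is only a sketch: the base URL http://localhost:8081 is an assumption about the deployment, and stop_url is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Assumption: the server listens here; adjust to your deployment.
BASE_URL = "http://localhost:8081"

def stop_url(force=False, base_url=BASE_URL):
    """Return the URL for GET /admin/stop, adding force=true only on request."""
    url = base_url.rstrip("/") + "/admin/stop"
    if force:
        url += "?" + urlencode({"force": "true"})
    return url

print(stop_url())
print(stop_url(force=True))
```

The URL could then be fetched with any HTTP client, for example urllib.request.urlopen.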

Jobs

The responsible class is JobResource. This endpoint is used for job management.

Listing jobs

GET /job

The response contains a list of all jobs (running and history).

[
   {
      "id":"job-id-5977",
      "type":"FETCH",
      "confId":"default",
      "args":null,
      "result":null,
      "state":"FINISHED",
      "msg":"",
      "crawlId":"crawl-01"
   },
   {
      "id":"job-id-5978",
      "type":"PARSE",
      "confId":"default",
      "args":null,
      "result":null,
      "state":"RUNNING",
      "msg":"",
      "crawlId":"crawl-01"
   }
]
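
Because running and finished jobs arrive in one list, a client usually has to separate them by the state field. A minimal Python sketch of that step, using a sample body modeled on the response above; jobs_by_state is a hypothetical helper name.

```python
import json

# Sample body modeled on the GET /job response above.
SAMPLE_JOBS = '''
[
   {"id":"job-id-5977","type":"FETCH","confId":"default","args":null,
    "result":null,"state":"FINISHED","msg":"","crawlId":"crawl-01"},
   {"id":"job-id-5978","type":"PARSE","confId":"default","args":null,
    "result":null,"state":"RUNNING","msg":"","crawlId":"crawl-01"}
]
'''

def jobs_by_state(body):
    """Group job ids by their "state" field (hypothetical helper)."""
    grouped = {}
    for job in json.loads(body):
        grouped.setdefault(job["state"], []).append(job["id"])
    return grouped

print(jobs_by_state(SAMPLE_JOBS))
```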

Get job info

GET /job/job-id-5977

Response

   {
      "id":"job-id-5977",
      "type":"FETCH",
      "confId":"default",
      "args":null,
      "result":null,
      "state":"FINISHED",
      "msg":"",
      "crawlId":"crawl-01"
   }

Stop job

GET /job/job-id-5977/stop

Response

  true

Kill job

GET /job/job-id-5977/abort

Response

  true

Create job

Creates a job with the given parameters. Specify either a JobType (the type field) or a jobClassName.

POST /job/create
   {
      "crawlId":"crawl-01",
      "type":"FETCH",
      "confId":"default",
      "args":{"someParam":"someValue"}
   }

POST /job/create
   {
      "crawlId":"crawl-01",
      "jobClassName":"org.apache.nutch.fetcher.FetcherJob",
      "confId":"default",
      "args":{"someParam":"someValue"}
   }

The response is the id of the created job.

    job-id-43243
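
The request body can be assembled programmatically. The sketch below builds the POST request without sending it. Several things are assumptions rather than documented behavior: the base URL, the JSON Content-Type header, and the choice to reject requests that set both or neither of type and jobClassName (a stricter reading of the rule above). The name create_job_request is hypothetical.

```python
import json
from urllib.request import Request

# Assumption: the server listens here; adjust to your deployment.
BASE_URL = "http://localhost:8081"

def create_job_request(crawl_id, conf_id="default", job_type=None,
                       job_class_name=None, args=None, base_url=BASE_URL):
    """Build (but do not send) a POST /job/create request."""
    # This sketch insists on exactly one of the two job selectors.
    if (job_type is None) == (job_class_name is None):
        raise ValueError("specify exactly one of job_type or job_class_name")
    payload = {"crawlId": crawl_id, "confId": conf_id, "args": args or {}}
    if job_type is not None:
        payload["type"] = job_type
    else:
        payload["jobClassName"] = job_class_name
    return Request(
        base_url + "/job/create",
        data=json.dumps(payload).encode("utf-8"),
        # Assumption: the server expects a JSON content type.
        headers={"Content-Type": "application/json"},
        method="POST",
    )

request = create_job_request("crawl-01", job_type="FETCH",
                             args={"someParam": "someValue"})
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the request to urllib.request.urlopen would send it; the response body would then be the new job id.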

Configuration

List configurations

GET /config

The response contains the names of the available configurations.

  ["default","custom-config"]

Configuration parameters

GET /config/{configuration name}

Examples:
GET /config/default
GET /config/custom-config

The response contains the parameters with their values.

  {
   "anchorIndexingFilter.deduplicate":"false",
   "crawl.gen.delay":"604800000",
   "db.fetch.interval.default":"2592000",
   "db.fetch.interval.max":"7776000",
   "db.fetch.retry.max":"3",
   ....
   ....
   }

Get property value

GET /config/{configuration name}/{property}

Examples:
GET /config/default/db.fetch.retry.max
GET /config/custom-config/crawl.gen.delay

The response contains the parameter's value as a string.

    604800000

Create configuration

Creates a new Nutch configuration with the given parameters. If the force field is true, an already existing configuration is overwritten; otherwise it is left unchanged.

POST /config/{configuration name}

Examples:
POST /config/new-config
   {
      "configId":"new-config",
      "force":"true",
      "params":{"anchorIndexingFilter.deduplicate":"false",... }
   }

The response is the id of the created configuration.

    new-config
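
A sketch of how a client might assemble this request in Python, without sending it. The base URL and the JSON Content-Type header are assumptions, sending "false" as a string when force is not set is a guess that mirrors the "true" string in the example, and create_config_request is a hypothetical helper name.

```python
import json
from urllib.parse import quote
from urllib.request import Request

# Assumption: the server listens here; adjust to your deployment.
BASE_URL = "http://localhost:8081"

def create_config_request(config_id, params, force=False, base_url=BASE_URL):
    """Build (but do not send) a POST /config/{configuration name} request."""
    payload = {
        "configId": config_id,
        # The example above sends force as the string "true";
        # the string "false" for the opposite case is an assumption.
        "force": "true" if force else "false",
        "params": dict(params),
    }
    return Request(
        # Percent-encode the name so it is safe as a single path segment.
        base_url + "/config/" + quote(config_id, safe=""),
        data=json.dumps(payload).encode("utf-8"),
        # Assumption: the server expects a JSON content type.
        headers={"Content-Type": "application/json"},
        method="POST",
    )

config_request = create_config_request(
    "new-config", {"anchorIndexingFilter.deduplicate": "false"}, force=True)
print(config_request.get_method(), config_request.full_url)
print(config_request.data.decode("utf-8"))
```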

Delete configuration

DELETE /config/{configuration name}

Examples:
DELETE /config/new-config

Update property value

PUT /config/{property name}/
value={value}

Examples:
PUT /config/anchorIndexingFilter.deduplicate
value=true

Database

The responsible class is DbResource. This endpoint retrieves data from the database.

Run query

Examples:

POST /db
   {
      
   }

POST /db
   {
      "fields": ["headers"]
   }

POST /db
   {
      "batchId": "batch-id"
   }

POST /db
   {
      "startKey":"http://google.com",
      "endKey":"http://yahoo.com",
      "isKeysReversed":"false"
   }


POST /db
   {
      "startKey":"com.google",
      "endKey":"com.yahoo",
      "isKeysReversed":"true"
   }

The response contains data from the database, restricted to the requested fields.

{
   "values":[
      {
         "headers":{

         },
         "status":0,
         "markers":{

         },
         "modifiedTime":0,
         "score":0.0,
         "prevModifiedTime":0,
         "url":"http://google.com",
         "__g__dirty":"\\x00\\x00\\x00\\x00",
         "fetchInterval":0,
         "prevFetchTime":0,
         "inlinks":{

         },
         "retriesSinceFetch":0,
         "outlinks":{

         },
         "fetchTime":0,
         "metadata":{

         }
      }
   ]
}
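
The query examples above differ only in which optional keys they set, so a client can build the body from keyword arguments and read the matching rows from the values array. A minimal Python sketch; build_db_filter and urls_from_response are hypothetical helper names, and sending isKeysReversed as a string follows the examples rather than any documented rule.

```python
import json

def build_db_filter(fields=None, batch_id=None, start_key=None,
                    end_key=None, keys_reversed=None):
    """Return the JSON body for POST /db, omitting every unset option."""
    query = {}
    if fields:
        query["fields"] = list(fields)
    if batch_id is not None:
        query["batchId"] = batch_id
    if start_key is not None:
        query["startKey"] = start_key
    if end_key is not None:
        query["endKey"] = end_key
    if keys_reversed is not None:
        # The examples above pass this flag as the string "true" or "false".
        query["isKeysReversed"] = "true" if keys_reversed else "false"
    return json.dumps(query)

def urls_from_response(body):
    """Collect the "url" field of every record in a /db response."""
    return [record["url"] for record in json.loads(body)["values"]]

# Trimmed sample modeled on the response above.
SAMPLE_DB_RESPONSE = '{"values":[{"headers":{},"status":0,"url":"http://google.com"}]}'

print(build_db_filter(start_key="com.google", end_key="com.yahoo", keys_reversed=True))
print(urls_from_response(SAMPLE_DB_RESPONSE))
```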

REST API improvement proposals

  • Naming of RAMConfManager and RAMJobManager
  • crawlId support in DbFilter
  • Remove jobs from status, add jobHistory
  • Check whether the POST method is suitable for /db requests.

Nutch Jobs

UML Graphic

The Unified Modeling Language (UML) is a general-purpose modeling language in the field of software engineering, which is designed to provide a standard way to visualize the design of a system.

The graphic below displays the REST API architecture and describes the classes as well as their role and context within the API operation.

API.png

Some comments about class roles in the Nutch API:

  • NutchServer - entry point. Parses command-line parameters and configures the Restlet application through the JAX-RS API.

  • AbstractResource - abstract JAX-RS resource. Other JAX-RS resources extend it in order to get references to ConfManager, JobManager and NutchServer.

  • JobFactory - factory class, which creates job objects based on JobType or class name.

  • DbReader - manages connections to the web store, processes the filter and runs the Gora query.

  • DbIterator - navigates through selected data, skips non-relevant records

  • DbPageConverter - converts database record into Nutch API model object

  • NutchServerPoolExecutor - manages running jobs and the job history.

  • RAMConfManager - manages Nutch configurations in memory

  • RAMJobManager - stores job info in memory and handles job execution

NutchRESTAPI (last edited 2014-06-25 21:26:08 by FjodorVershinin)