Nutch 1.x REST API v1.0

Introduction

This page documents the Nutch 1.x REST API v1.0.

It describes the REST calls that the Nutch 1.x REST API supports. Many of the API points are adapted from those provided by the Nutch 2.x REST API. One motivation for the REST API is integration with D3, to visualize how a Nutch crawl is working.

Instructions to start Nutch Server

Follow the steps below to start an instance of the Nutch Server on localhost.

1. :~$ cd runtime/local
2. :~$ bin/nutch startserver -port <port_number>

If the port option is not specified, the server starts on port 8081 by default.
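Every call on this page is made against the host and port chosen when the server was started. The following is a minimal Python sketch of how a client might derive the base URL and check that something is listening there. The function names are illustrative, and plain `http` on `localhost` is an assumption based on the instructions above.

```python
import socket

DEFAULT_PORT = 8081  # default used by the server when -port is omitted


def server_base_url(host="localhost", port=DEFAULT_PORT):
    """Return the base URL under which the REST endpoints are served.

    Assumption: the server speaks plain HTTP.
    """
    return f"http://{host}:{port}"


def is_listening(host="localhost", port=DEFAULT_PORT, timeout=1.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    print(server_base_url())
    print("listening:", is_listening())
```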

The different API calls that can be made are listed below.

REST API Calls

Administration

This API point reports the server status and manages the server's state.

Get server status

No Format
GET /admin

The response contains the server startup date, the available configuration names, the job history and the currently running jobs.

No Format
{ "startDate":1424572500000, "configuration":[ "default" ], "jobs":[ ], "runningJobs":[ ] }
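As an illustration, the status response shown above could be summarized on the client side as follows. This is only a sketch: `summarize_status` is a hypothetical helper, and it assumes that `startDate` is a Unix timestamp in milliseconds, which the sample value suggests but this page does not state.

```python
import json
from datetime import datetime, timezone

# Sample body copied from the response shown above.
SAMPLE_STATUS = (
    '{ "startDate":1424572500000, "configuration":[ "default" ], '
    '"jobs":[ ], "runningJobs":[ ] }'
)


def summarize_status(body):
    """Turn the raw /admin response into a small summary dict."""
    status = json.loads(body)
    # Assumption: startDate is milliseconds since the Unix epoch.
    started = datetime.fromtimestamp(status["startDate"] / 1000, tz=timezone.utc)
    return {
        "started": started.isoformat(),
        "configurations": status["configuration"],
        "job_count": len(status["jobs"]),
        "running_count": len(status["runningJobs"]),
    }


if __name__ == "__main__":
    print(summarize_status(SAMPLE_STATUS))
```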

Stop server

A running server can be stopped using /admin/stop.

No Format
POST /admin/stop

Response

No Format
Ok

Configuration

Configuration list

No Format
GET /config

Response contains names of available configurations.

No Format
["default","custom-config"]

Configuration parameters

No Format
GET /config/{configuration name}

Examples:
GET /config/default
GET /config/custom-config

Response contains parameters with values

No Format
{
   "anchorIndexingFilter.deduplicate":"false",
   "crawl.gen.delay":"604800000",
   "db.fetch.interval.default":"2592000",
   "db.fetch.interval.max":"7776000",
   ....
   ....
}
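The two read-only configuration calls above differ only in their path. A minimal Python sketch using the standard library might build them like this; `BASE_URL` and `config_request` are illustrative names, and the default port is assumed.

```python
from urllib.parse import quote
from urllib.request import Request

BASE_URL = "http://localhost:8081"  # assumption: server started on the default port


def config_request(name=None):
    """Build a GET request for /config or /config/{configuration name}."""
    path = "/config"
    if name is not None:
        # Percent-encode the name so it is safe to use as a path segment.
        path += "/" + quote(name, safe="")
    return Request(BASE_URL + path, method="GET")


if __name__ == "__main__":
    for req in (config_request(), config_request("default")):
        print(req.get_method(), req.full_url)
```

Passing the returned object to `urllib.request.urlopen` and decoding the body with `json.load` would then yield the list of names or the parameter map, provided a server is running.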

Create configuration

Creates new Nutch configuration with given parameters.

No Format
POST /config/create

Examples:
POST /config/create
{
   "configId":"new-config",
   "params":{"anchorIndexingFilter.deduplicate":"false",... }
}

The response is the ID of the created configuration.

No Format
new-config
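A client could assemble the create call along the following lines. This is a sketch rather than a reference implementation: `create_config_request` is a hypothetical helper, and sending the body with a `Content-Type` of `application/json` is an assumption, since this page does not name the expected content type.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:8081"  # assumption: server started on the default port


def create_config_request(config_id, params):
    """Build the POST /config/create request for a new configuration."""
    payload = {"configId": config_id, "params": params}
    return Request(
        BASE_URL + "/config/create",
        data=json.dumps(payload).encode("utf-8"),
        # Assumption: the server expects a JSON request body.
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = create_config_request(
        "new-config", {"anchorIndexingFilter.deduplicate": "false"}
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```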

Get property value

No Format
GET /config/{configuration name}/{property}

Examples:
GET /config/default/anchorIndexingFilter.deduplicate

Response contains parameter's value as string

No Format
false

Set property value

No Format
PUT /config/{configuration name}/{property}

Examples:
PUT /config/default/http.agent.name

Response contains parameter's value as string

No Format
NUTCH_SOLR
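Reading and writing a single property use the same path with different HTTP methods. The sketch below builds both requests. The helper names are illustrative, and, because this page does not show the PUT request body, sending the new value as a plain-text body is purely an assumption that should be checked against the server.

```python
from urllib.parse import quote
from urllib.request import Request

BASE_URL = "http://localhost:8081"  # assumption: server started on the default port


def property_url(config, prop):
    """Return the URL /config/{configuration name}/{property}."""
    return f"{BASE_URL}/config/{quote(config, safe='')}/{quote(prop, safe='')}"


def get_property_request(config, prop):
    """Build the GET request that reads one property value."""
    return Request(property_url(config, prop), method="GET")


def set_property_request(config, prop, value):
    """Build the PUT request that sets one property value.

    Assumption: the new value travels as a plain-text request body.
    """
    return Request(
        property_url(config, prop),
        data=value.encode("utf-8"),
        headers={"Content-Type": "text/plain"},
        method="PUT",
    )


if __name__ == "__main__":
    req = set_property_request("default", "http.agent.name", "NUTCH_SOLR")
    print(req.get_method(), req.full_url, req.data)
```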

Delete configuration

No Format
DELETE /config/{configuration name}

Examples:
DELETE /config/new-config

Jobs

This point allows job management: creating jobs, retrieving job information and killing jobs. For a complete tutorial, see How to run Jobs using the REST service.

Listing all jobs

No Format
GET /job

Response contains list of all jobs (running and history)

No Format

[
   {
      "id":"job-id-5977",
      "type":"FETCH",
      "confId":"default",
      "args":null,
      "result":null,
      "state":"FINISHED",
      "msg":"",
      "crawlId":"crawl-01"
   },
   {
      "id":"job-id-5978",
      "type":"PARSE",
      "confId":"default",
      "args":null,
      "result":null,
      "state":"RUNNING",
      "msg":"",
      "crawlId":"crawl-01"
   }
]
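Since the listing mixes running and historical jobs, a client will often want to filter it by state. A small Python sketch, using the two sample jobs from the response above (`jobs_in_state` is an illustrative helper, not part of the API):

```python
import json

# The two jobs from the sample response above, as a valid JSON array.
JOBS_JSON = """
[
   {"id": "job-id-5977", "type": "FETCH", "confId": "default", "args": null,
    "result": null, "state": "FINISHED", "msg": "", "crawlId": "crawl-01"},
   {"id": "job-id-5978", "type": "PARSE", "confId": "default", "args": null,
    "result": null, "state": "RUNNING", "msg": "", "crawlId": "crawl-01"}
]
"""


def jobs_in_state(jobs, state):
    """Return the ids of all jobs whose "state" field equals the given state."""
    return [job["id"] for job in jobs if job["state"] == state]


if __name__ == "__main__":
    jobs = json.loads(JOBS_JSON)
    print("running:", jobs_in_state(jobs, "RUNNING"))
    print("finished:", jobs_in_state(jobs, "FINISHED"))
```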

Get job info

No Format
GET /job/job-id-5977

Response

No Format
{ "id":"job-id-5977", "type":"FETCH", "confId":"default", "args":null, "result":null, "state":"FINISHED", "msg":"", "crawlId":"crawl01" }
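The job info call lends itself to polling until a job has finished. The sketch below keeps the HTTP part out of the loop by taking a `fetch_job` callable, which is expected to perform `GET /job/{id}` and return the decoded JSON object. All names are hypothetical. Only the states `RUNNING` and `FINISHED` appear on this page, so the set of final states is left as a parameter instead of being guessed.

```python
import time


def wait_for_job(fetch_job, job_id, done_states=("FINISHED",),
                 interval=1.0, max_polls=60, sleep=time.sleep):
    """Poll a job until its state is in done_states.

    fetch_job(job_id) must return the job info as a dict with a "state" key.
    Raises TimeoutError if the job is not done after max_polls polls.
    """
    for _ in range(max_polls):
        job = fetch_job(job_id)
        if job["state"] in done_states:
            return job
        sleep(interval)
    raise TimeoutError(f"job {job_id} not done after {max_polls} polls")


if __name__ == "__main__":
    # Stand-in for a real HTTP call: reports RUNNING twice, then FINISHED.
    states = iter(["RUNNING", "RUNNING", "FINISHED"])
    job = wait_for_job(lambda jid: {"id": jid, "state": next(states)},
                       "job-id-5977", interval=0.0)
    print(job)
```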

Stop job

No Format
POST /job/job-id-5977/stop

Response

No Format
true

Kill job

No Format
GET /job/job-id-5977/abort

Response

No Format
true

Create job

Creates a job with the given parameters. Specify either the job type (such as INJECT, GENERATE, FETCH or PARSE) or the jobClassName.

No Format

POST /job/create
   {
      "crawlId":"crawl01",
      "type":"FETCH",
      "confId":"default",
      "args":{"someParam":"someValue"}
   }

POST /job/create
   {
      "crawlId":"crawl01",
      "jobClassName":"org.apache.nutch.fetcher.FetcherJob",
      "confId":"default",
      "args":{"someParam":"someValue"}
   }

The response is the ID of the created job.

No Format
job-id-43243
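The request body for job creation can be assembled with a small helper that enforces the choice between job type and class name. This is a sketch under one assumption: it reads "either ... or" strictly and rejects payloads that set both or neither, which may be stricter than the server itself. `build_job_payload` is an illustrative name.

```python
import json


def build_job_payload(crawl_id, conf_id="default", job_type=None,
                      job_class_name=None, args=None):
    """Build the JSON body for POST /job/create.

    Assumption: exactly one of job_type or job_class_name must be given.
    """
    if (job_type is None) == (job_class_name is None):
        raise ValueError("specify exactly one of job_type or job_class_name")
    payload = {"crawlId": crawl_id, "confId": conf_id, "args": args or {}}
    if job_type is not None:
        payload["type"] = job_type
    else:
        payload["jobClassName"] = job_class_name
    return payload


if __name__ == "__main__":
    body = build_job_payload("crawl01", job_type="FETCH",
                             args={"someParam": "someValue"})
    print(json.dumps(body))
```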

Seed List creation

The /seed/create endpoint creates a seed list and returns the temporary path of the file created. Pass this path to the url_dir parameter of the INJECT job.

No Format
POST /seed/create
{
   "name":"name-of-seedlist",
   "seedUrls":["http://www.example.com",....]
}

Response is the file directory path

No Format
/var/folders/m9/hsls1krx12x968plt2brlhr00000gn/T/1443721976324-0
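The hand-off from seed list creation to the INJECT job can be sketched as two payload builders: one for `/seed/create`, and one that places the returned path in the `url_dir` argument of a `/job/create` body. The function names are hypothetical, and the shape of the INJECT payload is assumed to follow the job creation format described on this page.

```python
import json


def seed_payload(name, urls):
    """Build the JSON body for POST /seed/create."""
    return {"name": name, "seedUrls": list(urls)}


def inject_payload(crawl_id, seed_dir, conf_id="default"):
    """Build a POST /job/create body for an INJECT job.

    seed_dir is the path returned by /seed/create; it is passed as url_dir.
    """
    return {
        "crawlId": crawl_id,
        "type": "INJECT",
        "confId": conf_id,
        "args": {"url_dir": seed_dir},
    }


if __name__ == "__main__":
    print(json.dumps(seed_payload("name-of-seedlist", ["http://www.example.com"])))
    # Path copied from the sample response above.
    seed_dir = "/var/folders/m9/hsls1krx12x968plt2brlhr00000gn/T/1443721976324-0"
    print(json.dumps(inject_payload("crawl01", seed_dir)))
```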

Database

This point provides access to information stored in the CrawlDb.

No Format
POST /db/crawldb
{
   "type":"stats",
   "confId":"default",
   "crawlId":"crawl01",
   "args":{"someParam":"someValue"}
}

Besides stats, the type parameter accepts the values dump, topN and url. Their corresponding arguments can be found here.

The response contains information from the CrawlDbReader.java class. For the request above, the JSON response would look like this:

No Format

  {
      "retry 0":"8350",
      "minScore":"0.0",
      "retry 1":"96",
      "status":{ 
                "3":{"count":"21","statusValue":"db_gone"},
                "2":{"count":"594","statusValue":"db_fetched"},
                "1":{"count":"7721","statusValue":"db_unfetched"},
                "5":{"count":"86","statusValue":"db_redir_perm"},
                "4":{"count":"24","statusValue":"db_redir_temp"}
                },
      "totalUrls":"8446",
      "maxScore":"0.528",
      "avgScore":"0.029593771"
  }

Note: If a type other than stats (such as dump, topN or url) is used, the response is a file (application/octet-stream).
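All numeric values in the stats response arrive as strings, so a client needs to convert them before doing arithmetic. The following Python sketch parses the sample stats response above and tallies URLs per status; `status_counts` is an illustrative helper, not part of the API.

```python
import json

# Sample stats response copied from above.
STATS_JSON = """
{
    "retry 0": "8350",
    "minScore": "0.0",
    "retry 1": "96",
    "status": {
        "3": {"count": "21", "statusValue": "db_gone"},
        "2": {"count": "594", "statusValue": "db_fetched"},
        "1": {"count": "7721", "statusValue": "db_unfetched"},
        "5": {"count": "86", "statusValue": "db_redir_perm"},
        "4": {"count": "24", "statusValue": "db_redir_temp"}
    },
    "totalUrls": "8446",
    "maxScore": "0.528",
    "avgScore": "0.029593771"
}
"""


def status_counts(stats):
    """Map each statusValue name to its URL count as an integer."""
    return {
        entry["statusValue"]: int(entry["count"])
        for entry in stats["status"].values()
    }


if __name__ == "__main__":
    stats = json.loads(STATS_JSON)
    counts = status_counts(stats)
    for name, count in sorted(counts.items(), key=lambda kv: -kv[1]):
        print(f"{name:15} {count}")
    print("total", int(stats["totalUrls"]))
```

In this sample, the per-status counts add up to the reported totalUrls value, which makes a convenient sanity check on the parsed data.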

More

Description of more API points coming soon.
