Nutch 1.x REST API v1.0

Introduction

This page documents the Nutch 1.x REST API v1.0.

...

Follow the steps below to start an instance of the Nutch Server on localhost.

Starting Nutch Server

% cd runtime/local
% ./bin/nutch startserver -port <port_number>

If the -port option is omitted, the server starts on port 8081 by default.

The different API calls that can be made are listed below.

...

The /admin endpoint reports the server status and manages the server's state.

Get server status

No Format
GET /admin

The response contains the server startup date, available configuration names, job history and currently running jobs.

No Format
{ "startDate":1424572500000, "configuration":[ "default" ], "jobs":[ ], "runningJobs":[ ] }
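As a rough illustration of consuming this response, the Python sketch below parses the sample body shown above and converts startDate into a timestamp. It assumes startDate is milliseconds since the Unix epoch, which the sample value is consistent with but this page does not state. The helper name parse_status is hypothetical, and the commented-out lines only suggest how the body might be fetched from a server on localhost:8081.

```python
import json
from datetime import datetime, timezone


def parse_status(body: str) -> dict:
    """Summarise a GET /admin response body (hypothetical helper)."""
    status = json.loads(body)
    # Assumption: startDate is milliseconds since the Unix epoch.
    started = datetime.fromtimestamp(status["startDate"] / 1000, tz=timezone.utc)
    return {
        "started": started,
        "configurations": status["configuration"],
        "job_count": len(status["jobs"]),
        "running_count": len(status["runningJobs"]),
    }


# Sample body copied from the response above.
sample = '{ "startDate":1424572500000, "configuration":[ "default" ], "jobs":[ ], "runningJobs":[ ] }'
summary = parse_status(sample)
print(summary["started"].isoformat(), summary["configurations"])

# Against a live server (not executed here):
# from urllib.request import urlopen
# with urlopen("http://localhost:8081/admin") as resp:
#     print(parse_status(resp.read().decode("utf-8")))
```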

Stop server

A running server can be stopped using /admin/stop.

No Format
GET /admin/stop

Response

No Format
Stopping in server on port 8081

Configuration

Configuration's list

No Format
GET /config

The response contains the names of the available configurations.

No Format
["default","custom-config"]

Configuration parameters

No Format
GET /config/{configuration name}
Examples:
GET /config/default
GET /config/custom-config

The response contains the configuration's parameters and their values.

No Format
{ "anchorIndexingFilter.deduplicate":"false", "crawl.gen.delay":"604800000", "db.fetch.interval.default":"2592000", "db.fetch.interval.max":"7776000", .... .... }

Create configuration

Creates a new Nutch configuration with the given parameters.

No Format
POST /config/create
Examples:
POST /config/create
{ "configId":"new-config", "params":{"anchorIndexingFilter.deduplicate":"false",... } }

# curl
curl -X POST -H "Content-Type: application/json" http://localhost:8081/config/create -d '{"configId":"new-config", "params":{"anchorIndexingFilter.deduplicate":"false"}}'

The response is the ID of the created configuration.

No Format
new-config
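The same request can be assembled programmatically. The sketch below uses Python's standard urllib.request to build, without sending, a POST to /config/create with the JSON body from the curl example. The helper name build_create_config_request and the BASE_URL constant are illustrative assumptions, not part of Nutch.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:8081"  # default port mentioned on this page


def build_create_config_request(config_id: str, params: dict) -> Request:
    """Build (but do not send) a POST /config/create request."""
    body = json.dumps({"configId": config_id, "params": params}).encode("utf-8")
    return Request(
        BASE_URL + "/config/create",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_create_config_request(
    "new-config", {"anchorIndexingFilter.deduplicate": "false"}
)
print(req.get_method(), req.full_url)

# To send it to a running server (not executed here):
# from urllib.request import urlopen
# with urlopen(req) as resp:
#     print(resp.read().decode("utf-8"))  # the page says this is the new config's id
```

Building the request separately from sending it keeps the payload easy to inspect before it reaches the server.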

Get property value

No Format
GET /config/{configuration name}/{property}
Examples:
GET /config/default/anchorIndexingFilter.deduplicate

The response contains the parameter's value as a string.

No Format
false

Set property value

No Format
PUT /config/{configuration name}/{property}
Examples:
PUT /config/default/http.agent.name

The response contains the parameter's value as a string.

No Format
NUTCH_SOLR

Delete configuration

No Format
DELETE /config/{configuration name}
Examples:
DELETE /config/new-config

Jobs

This endpoint allows job management, including creating jobs, retrieving job information and killing jobs. For a complete tutorial, see How to run Jobs using the REST service.

Listing all jobs

No Format
GET /job

curl -X GET -H 'Content-Type: application/json' -i http://localhost:8081/job

The response contains a list of all jobs (running and historical).

No Format
[
   {
      "id":"job-id-5977",
      "type":"FETCH",
      "confId":"default",
      "args":null,
      "result":null,
      "state":"FINISHED",
      "msg":"",
      "crawlId":"crawl-01"
   },
   {
      "id":"job-id-5978",
      "type":"PARSE",
      "confId":"default",
      "args":null,
      "result":null,
      "state":"RUNNING",
      "msg":"",
      "crawlId":"crawl-01"
   }
]
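A client will often want only the jobs that are still running. The Python sketch below filters a job list by its state field, using the two sample jobs from this section written as a valid JSON array and trimmed to the fields used. The helper name running_job_ids is hypothetical, and only the FINISHED and RUNNING states seen in the sample are assumed; other states may exist.

```python
import json


def running_job_ids(body: str) -> list:
    """Return the ids of jobs whose state is RUNNING (hypothetical helper)."""
    return [job["id"] for job in json.loads(body) if job["state"] == "RUNNING"]


# The two sample jobs from this section, abbreviated to the fields used here.
sample = """
[
  {"id": "job-id-5977", "type": "FETCH", "state": "FINISHED", "crawlId": "crawl-01"},
  {"id": "job-id-5978", "type": "PARSE", "state": "RUNNING", "crawlId": "crawl-01"}
]
"""
print(running_job_ids(sample))
```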

Get job info

No Format
GET /job/job-id-5977

Response

No Format
{ "id":"job-id-5977", "type":"FETCH", "confId":"default", "args":null, "result":null, "state":"FINISHED", "msg":"", "crawlId":"crawl01" }

Stop job

No Format
POST /job/job-id-5977/stop

Response

No Format
true

Kill job

No Format
GET /job/job-id-5977/abort

Create job

Creates a job with the given parameters. Specify either the job type (such as INJECT, GENERATE, FETCH or PARSE) or jobClassName.

No Format
curl -X POST -H 'Content-Type: application/json' -i http://localhost:8081/job/create --data '{"crawlId":"crawl01", "type":"INJECT", "confId":"default", "args": {"url_dir":"seedFiles/seed-1641959745623", "crawldb": "crawldb"}}'

The response object is shown below.

No Format
{
  "id": "crawl01-default-INJECT-1877363907",
  "type": "INJECT",
  "confId": "default",
  "args": {
    "url_dir": "seedFiles/seed-1641959745623",
    "crawldb": "crawldb"
  },
  "result": null,
  "state": "RUNNING",
  "msg": "OK",
  "crawlId": "crawl01"
}
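The rule that a job needs either a type or a jobClassName can be checked on the client before the request is sent. The sketch below builds the JSON body for the INJECT request in this section and rejects a payload that names neither. The helper build_job_payload is hypothetical, and whether the server accepts both fields together is not stated on this page, so the sketch only guards against both being absent.

```python
import json
from typing import Optional


def build_job_payload(crawl_id: str, conf_id: str, args: dict,
                      job_type: Optional[str] = None,
                      job_class_name: Optional[str] = None) -> str:
    """Build the JSON body for POST /job/create (hypothetical helper)."""
    if job_type is None and job_class_name is None:
        raise ValueError("specify job_type or job_class_name")
    payload = {"crawlId": crawl_id, "confId": conf_id, "args": args}
    if job_type is not None:
        payload["type"] = job_type
    if job_class_name is not None:
        payload["jobClassName"] = job_class_name
    return json.dumps(payload)


body = build_job_payload(
    "crawl01", "default",
    {"url_dir": "seedFiles/seed-1641959745623", "crawldb": "crawldb"},
    job_type="INJECT",
)
print(body)
```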

Seed Lists

Create seed list

The /seed/create endpoint enables the user to create a seed list and returns the path of the file created. This path should be passed to the url_dir parameter of the INJECT job.

No Format
curl -X POST -H 'Content-Type: application/json' -i http://localhost:8081/seed/create --data '{"name":"test", "seedUrls":["https://nutch.apache.org"]}'

The response is the relative path of the created seed list. Note that this path is relative to where the Nutch server was started. Seed lists that are created are persistent: they remain on disk even when the Nutch server is not running.

No Format
seedFiles/seed-1641959745623
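To show how the returned path feeds into an INJECT job, the sketch below builds the two JSON bodies involved: one for /seed/create and one for /job/create that reuses the seed path as url_dir. Both helper names are hypothetical, the crawldb value is copied from the job example on this page, and stripping whitespace from the returned path is a defensive assumption about the response body.

```python
import json


def seed_create_body(name: str, urls: list) -> str:
    """JSON body for POST /seed/create (hypothetical helper)."""
    return json.dumps({"name": name, "seedUrls": urls})


def inject_job_body(seed_path: str, crawl_id: str,
                    conf_id: str = "default", crawldb: str = "crawldb") -> str:
    """JSON body for POST /job/create that injects a seed list."""
    return json.dumps({
        "crawlId": crawl_id,
        "type": "INJECT",
        "confId": conf_id,
        # Assumption: the response may carry surrounding whitespace.
        "args": {"url_dir": seed_path.strip(), "crawldb": crawldb},
    })


seed_body = seed_create_body("test", ["https://nutch.apache.org"])
# The path below stands in for the response of POST /seed/create.
job_body = inject_job_body("seedFiles/seed-1641959745623\n", "crawl01")
print(seed_body)
print(job_body)
```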

Get seed lists

The /seed endpoint retrieves any seed lists that were created during the current server runtime.

As of Nutch 1.18, seed lists generated by previous server runtime sessions are not available after the server is shut down and restarted.

Database

This endpoint provides access to information stored in the CrawlDb.

No Format
POST /db/crawldb with the following body:
{ "type":"stats", "confId":"default", "crawlId":"crawl01", "args":{"someParam":"someValue"} }

Besides stats, the type parameter accepts dump, topN and url. Their corresponding arguments can be found here.

The response contains information from the CrawlDbReader.java class. For the request above, the JSON response would look like this:

No Format
  {
      "retry 0":"8350",
      "minScore":"0.0",
      "retry 1":"96",
      "status":{
                "3":{"count":"21","statusValue":"db_gone"},
                "2":{"count":"594","statusValue":"db_fetched"},
                "1":{"count":"7721","statusValue":"db_unfetched"},
                "5":{"count":"86","statusValue":"db_redir_perm"},
                "4":{"count":"24","statusValue":"db_redir_temp"}
                },
      "totalUrls":"8446",
      "maxScore":"0.528",
      "avgScore":"0.029593771"
  }

Note: If any type other than stats (such as dump, topN or url) is used, the response will be a file (application/octet-stream).
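Every number in the stats response arrives as a string, so a client has to convert values before doing arithmetic. The sketch below turns the status map from the sample stats response into integer counts keyed by status name and checks that they add up to totalUrls. The helper name status_counts is hypothetical, and the field layout is taken only from the sample on this page.

```python
import json


def status_counts(stats: dict) -> dict:
    """Map each statusValue to its integer count (hypothetical helper)."""
    return {
        entry["statusValue"]: int(entry["count"])
        for entry in stats["status"].values()
    }


# Sample stats response from this section, trimmed to the fields used here.
sample = json.loads("""
{
  "status": {
    "3": {"count": "21", "statusValue": "db_gone"},
    "2": {"count": "594", "statusValue": "db_fetched"},
    "1": {"count": "7721", "statusValue": "db_unfetched"},
    "5": {"count": "86", "statusValue": "db_redir_perm"},
    "4": {"count": "24", "statusValue": "db_redir_temp"}
  },
  "totalUrls": "8446"
}
""")
counts = status_counts(sample)
print(counts)
print(sum(counts.values()) == int(sample["totalUrls"]))
```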
