How to run Jobs using the Nutch REST service
Introduction
This tutorial shows how REST calls can be made to the NutchServer to run various jobs like Inject, Generate, Fetch, etc.
Instructions to start Nutch Server
Follow the steps below to start an instance of the Nutch Server on localhost.
1. :~$ cd runtime/local
2. :~$ bin/nutch startserver -port <port_number> -host <host_name>
If the host/port option is not specified, the server starts on localhost:8081 by default.
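Once the server is started it can be sanity-checked from any HTTP client. The sketch below (Python, standard library only) sends a GET to the server root; the host and port are assumptions matching the defaults above, and the request is wrapped in a try/except so the script degrades gracefully when no server is listening.

```python
import urllib.request
import urllib.error

# Default host/port used by `bin/nutch startserver` (see the steps above).
BASE_URL = "http://localhost:8081"

def server_is_up(base_url=BASE_URL, timeout=5):
    """Return True if an HTTP server answers on base_url."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status < 500
    except (urllib.error.URLError, OSError):
        return False

if __name__ == "__main__":
    print("server reachable:", server_is_up())
```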
Jobs
Currently the service supports running the following jobs: Inject, Generate, Fetch, Parse, Index, Updatedb, Invertlinks, Dedup and Readdb. Any new job can be created by issuing a POST request to /job/create with the following JSON data:

POST /job/create
{
  "type": "job type",
  "confId": "default",
  "args": {"someParam": "someValue"}
}
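As a sketch of how such a request might be issued programmatically, the helper below builds the /job/create body and POSTs it using only the Python standard library. The function names are our own, not part of Nutch, and the payload shape simply mirrors the example above.

```python
import json
import urllib.request

def build_job_payload(job_type, conf_id="default", crawl_id=None, args=None):
    """Build the JSON body expected by POST /job/create."""
    payload = {"type": job_type, "confId": conf_id, "args": args or {}}
    if crawl_id is not None:
        payload["crawlId"] = crawl_id
    return payload

def create_job(base_url, job_type, conf_id="default", crawl_id=None, args=None):
    """POST the payload to /job/create and return the parsed JSON response."""
    body = json.dumps(build_job_payload(job_type, conf_id, crawl_id, args)).encode("utf-8")
    req = urllib.request.Request(
        base_url + "/job/create",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

For example, create_job("http://localhost:8081", "INJECT", crawl_id="crawl01", args={"url_dir": "url/"}) would issue an inject request against a running server.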
Inject Job
To run the inject job, call POST /job/create with the following:

POST /job/create
{
  "type": "INJECT",
  "confId": "default",
  "crawlId": "crawl01",
  "args": {"url_dir": "url/"}
}
The args contains one key, url_dir, which should be the path of the directory where the seed file is stored. The response of the request is a JSON output:
{
  "confId": "default",
  "args": {"url_dir": "url/"},
  "crawlId": "crawl01",
  "msg": "OK",
  "id": "default-INJECT-635077497",
  "state": "RUNNING",
  "type": "INJECT",
  "result": null
}
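The id and state fields of the response can be used to track a long-running job. The sketch below polls the job status until it leaves the RUNNING state; it assumes status is exposed at GET /job/<id>, which is our reading of the REST API, so verify the path against your Nutch version.

```python
import json
import time
import urllib.request

def is_finished(job_info):
    """True once a job status document reports a non-RUNNING state."""
    return job_info.get("state") != "RUNNING"

def wait_for_job(base_url, job_id, poll_seconds=2, max_polls=30):
    """Poll job status until the job finishes.

    Assumes status is exposed at GET /job/<id>; verify this path
    against your Nutch version.
    """
    for _ in range(max_polls):
        with urllib.request.urlopen(f"{base_url}/job/{job_id}", timeout=10) as resp:
            info = json.loads(resp.read().decode("utf-8"))
        if is_finished(info):
            return info
        time.sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} still RUNNING after {max_polls} polls")
```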
Generate Job
To run the generate job, call POST /job/create with the following:

POST /job/create
{
  "type": "GENERATE",
  "confId": "default",
  "crawlId": "crawl01",
  "args": {}
}
The args may contain the keys force, topN, numFetchers, adddays, noFilter, noNorm and maxNumSegments, set to appropriate values. The description of these parameters can be found here. The response of the request is a JSON output:
{
  "confId": "default",
  "args": {},
  "crawlId": "crawl01",
  "msg": "OK",
  "id": "default-GENERATE-274614034",
  "state": "RUNNING",
  "type": "GENERATE",
  "result": null
}
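As a hypothetical illustration of the args listed above, a generate request that caps the segment size might look like the payload below; the values are examples chosen for illustration, not defaults.

```python
# Illustrative GENERATE payload; the arg values are examples, not defaults.
generate_request = {
    "type": "GENERATE",
    "confId": "default",
    "crawlId": "crawl01",
    "args": {
        "topN": "1000",          # select at most the 1000 top-scoring URLs
        "numFetchers": "2",      # partition the segment for two fetchers
        "maxNumSegments": "1",   # emit at most one segment
    },
}
```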
Fetch Job
To run the fetch job, call POST /job/create with the following:

POST /job/create
{
  "type": "FETCH",
  "confId": "default",
  "crawlId": "crawl01",
  "args": {}
}
The args may contain the keys threads and noParsing, set to appropriate values. The description of these parameters can be found here. The response of the request is a JSON output:
{
  "confId": "default",
  "args": {},
  "crawlId": "crawl01",
  "msg": "idle",
  "id": "default-FETCH-99398319",
  "state": "IDLE",
  "type": "FETCH",
  "result": null
}
Parse Job
To run the parse job, call POST /job/create with the following:

POST /job/create
{
  "type": "PARSE",
  "confId": "default",
  "crawlId": "crawl01",
  "args": {"noFilter": "true"}
}
The args may contain the keys noFilter and noNormalize, set to appropriate values. The description of these parameters can be found here. The response of the request is a JSON output:
{
  "confId": "default",
  "args": {"noFilter": "true"},
  "crawlId": "crawl01",
  "msg": "OK",
  "id": "default-PARSE-1413156163",
  "state": "IDLE",
  "type": "PARSE",
  "result": null
}
Index Job
To run the index job, call POST /job/create with the following:

POST /job/create
{
  "type": "INDEX",
  "confId": "new-config",
  "crawlId": "crawl01",
  "args": {}
}
Before running the index job, the user needs to configure an indexer. A user-defined indexer (e.g. Solr, Elasticsearch) can be configured using the configuration endpoint. A detailed description of how to configure and run the index job can be found here. The args may contain the keys crawldb, linkdb, params, dir, segments, noCommit, deleteGone, filter and normalize. The response of the request is a JSON output:
{
  "confId": "new-config",
  "args": {},
  "crawlId": "crawl01",
  "msg": "OK",
  "id": "default-INDEX-572647647",
  "state": "RUNNING",
  "type": "INDEX",
  "result": null
}
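The configuration endpoint mentioned above can be exercised in the same style as /job/create. The sketch below creates a named configuration carrying an indexer property; the /config/create path and payload shape follow our reading of the REST API, and the Solr URL is a placeholder, so verify both against your Nutch version.

```python
import json
import urllib.request

def create_config(base_url, config_id, params):
    """POST a named configuration to /config/create (path per our reading of the API)."""
    body = json.dumps({"configId": config_id, "params": params}).encode("utf-8")
    req = urllib.request.Request(
        base_url + "/config/create",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.read().decode("utf-8")

# Example payload for a Solr-backed indexer; the property value is a placeholder.
new_config = {
    "configId": "new-config",
    "params": {"solr.server.url": "http://localhost:8983/solr/nutch"},
}
```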
Updatedb Job
To run the updatedb job, call POST /job/create with the following:

POST /job/create
{
  "type": "UPDATEDB",
  "confId": "default",
  "crawlId": "crawl01",
  "args": {}
}
The args may contain the keys force, normalize, filter and noAdditions, set to appropriate values. The description of these parameters can be found here. The response of the request is a JSON output:
{
  "confId": "default",
  "args": {"crawldb": "crawl/crawldb", "segments": "crawl/segments/20150331153517"},
  "crawlId": null,
  "msg": "OK",
  "id": "default-UPDATEDB-1250603698",
  "state": "RUNNING",
  "type": "UPDATEDB",
  "result": null
}
Invertlinks Job
To run the invertlinks job, call POST /job/create with the following:

POST /job/create
{
  "type": "INVERTLINKS",
  "confId": "default",
  "crawlId": "crawl01",
  "args": {}
}
The args may contain the keys force, noNormalize and noFilter, set to appropriate values. The description of these parameters can be found here. The response of the request is a JSON output:
{
  "confId": "default",
  "args": {},
  "crawlId": "crawl01",
  "msg": "OK",
  "id": "default-INVERTLINKS-572647647",
  "state": "RUNNING",
  "type": "INVERTLINKS",
  "result": null
}
Dedup Job
To run the dedup job, call POST /job/create with the following:

POST /job/create
{
  "type": "DEDUP",
  "confId": "default",
  "crawlId": "crawl01",
  "args": {}
}
The response of the request is a JSON output:

{
  "confId": "default",
  "args": {"crawldb": "crawl/crawldb"},
  "crawlId": "crawl01",
  "msg": "OK",
  "id": "default-DEDUP-1394212503",
  "state": "RUNNING",
  "type": "DEDUP",
  "result": null
}