How to run Jobs using the Nutch REST service
Introduction
This tutorial shows how REST calls can be made to the NutchServer to run various jobs like Inject, Generate, Fetch, etc.
Instructions to start Nutch Server
Follow the steps below to start an instance of the Nutch Server on localhost.
1. :~$ cd runtime/local
2. :~$ bin/nutch startserver -port <port_number> -host <host_name>
If the host/port option is not specified, the server starts on localhost:8081 by default.
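Once the server is started it can be sanity-checked from any HTTP client. The sketch below (Python, standard library only) sends a GET to the server root; the host and port are assumptions matching the defaults above, and the request is wrapped in a try/except so the script degrades gracefully when no server is listening.

```python
import urllib.request
import urllib.error

# Default host/port used by `bin/nutch startserver` (see the steps above).
BASE_URL = "http://localhost:8081"

def server_is_up(base_url=BASE_URL, timeout=5):
    """Return True if an HTTP server answers on base_url."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status < 500
    except (urllib.error.URLError, OSError):
        return False

if __name__ == "__main__":
    print("server reachable:", server_is_up())
```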
Jobs
Currently the service supports running the following jobs: Inject, Generate, Fetch, Parse, Index, Updatedb, Invertlinks, Dedup and Readdb. Any new job can be created by issuing a POST request to /job/create with the following JSON data:

POST /job/create
{
  "type": "job type",
  "confId": "default",
  "args": {"someParam": "someValue"}
}
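As a sketch of how such a request might be issued programmatically, the helper below builds the /job/create body and POSTs it using only the Python standard library. The function names are our own, not part of Nutch, and the payload shape simply mirrors the example above.

```python
import json
import urllib.request

def build_job_payload(job_type, conf_id="default", crawl_id=None, args=None):
    """Build the JSON body expected by POST /job/create."""
    payload = {"type": job_type, "confId": conf_id, "args": args or {}}
    if crawl_id is not None:
        payload["crawlId"] = crawl_id
    return payload

def create_job(base_url, job_type, conf_id="default", crawl_id=None, args=None):
    """POST the payload to /job/create and return the parsed JSON response."""
    body = json.dumps(build_job_payload(job_type, conf_id, crawl_id, args)).encode("utf-8")
    req = urllib.request.Request(
        base_url + "/job/create",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

For example, create_job("http://localhost:8081", "INJECT", crawl_id="crawl01", args={"url_dir": "url/"}) would issue an inject request against a running server.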
Inject Job
To run the inject job, call POST /job/create with the following:

POST /job/create
{
  "type": "INJECT",
  "confId": "default",
  "crawlId": "crawl01",
  "args": {"url_dir": "url/"}
}
The args contains one key, url_dir, which should be the path of the directory where the seed file is stored. The response of the request is a JSON output:
{
  "confId": "default",
  "args": {"url_dir": "url/"},
  "crawlId": "crawl01",
  "msg": "OK",
  "id": "default-INJECT-635077497",
  "state": "RUNNING",
  "type": "INJECT",
  "result": null
}
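The id and state fields of the response can be used to track a long-running job. The sketch below polls the job status until it leaves the RUNNING state; it assumes status is exposed at GET /job/<id>, which is our reading of the REST API, so verify the path against your Nutch version.

```python
import json
import time
import urllib.request

def is_finished(job_info):
    """True once a job status document reports a non-RUNNING state."""
    return job_info.get("state") != "RUNNING"

def wait_for_job(base_url, job_id, poll_seconds=2, max_polls=30):
    """Poll job status until the job finishes.

    Assumes status is exposed at GET /job/<id>; verify this path
    against your Nutch version.
    """
    for _ in range(max_polls):
        with urllib.request.urlopen(f"{base_url}/job/{job_id}", timeout=10) as resp:
            info = json.loads(resp.read().decode("utf-8"))
        if is_finished(info):
            return info
        time.sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} still RUNNING after {max_polls} polls")
```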
Generate Job
To run the generate job, call POST /job/create with the following:

POST /job/create
{
  "type": "GENERATE",
  "confId": "default",
  "crawlId": "crawl01",
  "args": {}
}
The args may contain the keys force, topN, numFetchers, adddays, noFilter, noNorm and maxNumSegments, set to appropriate values. The description of these parameters can be found here. The response of the request is a JSON output:
{
  "confId": "default",
  "args": {},
  "crawlId": "crawl01",
  "msg": "OK",
  "id": "default-GENERATE-274614034",
  "state": "RUNNING",
  "type": "GENERATE",
  "result": null
}
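As a hypothetical illustration of the args listed above, a generate request that caps the segment size might look like the payload below; the values are examples chosen for illustration, not defaults.

```python
# Illustrative GENERATE payload; the arg values are examples, not defaults.
generate_request = {
    "type": "GENERATE",
    "confId": "default",
    "crawlId": "crawl01",
    "args": {
        "topN": "1000",          # select at most the 1000 top-scoring URLs
        "numFetchers": "2",      # partition the segment for two fetchers
        "maxNumSegments": "1",   # emit at most one segment
    },
}
```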
Fetch Job
To run the fetch job, call POST /job/create with the following:

POST /job/create
{
  "type": "FETCH",
  "confId": "default",
  "crawlId": "crawl01",
  "args": {}
}
The args may contain the keys threads and noParsing, set to appropriate values. The description of these parameters can be found here. The response of the request is a JSON output:
{
  "confId": "default",
  "args": {},
  "crawlId": "crawl01",
  "msg": "idle",
  "id": "default-FETCH-99398319",
  "state": "IDLE",
  "type": "FETCH",
  "result": null
}
Parse Job
To run the parse job, call POST /job/create with the following:

POST /job/create
{
  "type": "PARSE",
  "confId": "default",
  "crawlId": "crawl01",
  "args": {"noFilter": "true"}
}
The args may contain the keys noFilter and noNormalize, set to appropriate values. The description of these parameters can be found here. The response of the request is a JSON output:
{
  "confId": "default",
  "args": {"noFilter": "true"},
  "crawlId": "crawl01",
  "msg": "OK",
  "id": "default-PARSE-1413156163",
  "state": "IDLE",
  "type": "PARSE",
  "result": null
}
Index Job
To run the index job, call POST /job/create with the following:

POST /job/create
{
  "type": "INDEX",
  "confId": "new-config",
  "crawlId": "crawl01",
  "args": {}
}
Before running the index job, the user needs to configure an indexer. A user-defined indexer (e.g. Solr, Elasticsearch) can be configured using the configuration endpoint. A detailed description of how to configure and run the index job can be found here. The args may contain the keys crawldb, linkdb, params, dir, segments, noCommit, deleteGone, filter and normalize. The response of the request is a JSON output:
{
  "confId": "new-config",
  "args": {},
  "crawlId": "crawl01",
  "msg": "OK",
  "id": "default-INDEX-572647647",
  "state": "RUNNING",
  "type": "INDEX",
  "result": null
}
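The configuration endpoint mentioned above can be exercised in the same style as /job/create. The sketch below creates a named configuration carrying an indexer property; the /config/create path and payload shape follow our reading of the REST API, and the Solr URL is a placeholder, so verify both against your Nutch version.

```python
import json
import urllib.request

def create_config(base_url, config_id, params):
    """POST a named configuration to /config/create (path per our reading of the API)."""
    body = json.dumps({"configId": config_id, "params": params}).encode("utf-8")
    req = urllib.request.Request(
        base_url + "/config/create",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.read().decode("utf-8")

# Example payload for a Solr-backed indexer; the property value is a placeholder.
new_config = {
    "configId": "new-config",
    "params": {"solr.server.url": "http://localhost:8983/solr/nutch"},
}
```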
Updatedb Job
To run the updatedb job, call POST /job/create with the following:

POST /job/create
{
  "type": "UPDATEDB",
  "confId": "default",
  "crawlId": "crawl01",
  "args": {}
}
The args may contain the keys force, normalize, filter and noAdditions, set to appropriate values. The description of these parameters can be found here. The response of the request is a JSON output:
{
  "confId": "default",
  "args": {"crawldb": "crawl/crawldb", "segments": "crawl/segments/20150331153517"},
  "crawlId": null,
  "msg": "OK",
  "id": "default-UPDATEDB-1250603698",
  "state": "RUNNING",
  "type": "UPDATEDB",
  "result": null
}
Invertlinks Job
To run the invertlinks job, call POST /job/create with the following:

POST /job/create
{
  "type": "INVERTLINKS",
  "confId": "default",
  "crawlId": "crawl01",
  "args": {}
}
The args may contain the keys force, noNormalize and noFilter, set to appropriate values. The description of these parameters can be found here. The response of the request is a JSON output:
{
  "confId": "default",
  "args": {},
  "crawlId": "crawl01",
  "msg": "OK",
  "id": "default-INVERTLINKS-572647647",
  "state": "RUNNING",
  "type": "INVERTLINKS",
  "result": null
}
Dedup Job
To run the dedup job, call POST /job/create with the following:

POST /job/create
{
  "type": "DEDUP",
  "confId": "default",
  "crawlId": "crawl01",
  "args": {}
}
The response of the request is a JSON output:

{
  "confId": "default",
  "args": {"crawldb": "crawl/crawldb"},
  "crawlId": "crawl01",
  "msg": "OK",
  "id": "default-DEDUP-1394212503",
  "state": "RUNNING",
  "type": "DEDUP",
  "result": null
}