Nutch 1.x REST API v1.0
Introduction
This page documents the Nutch 1.x REST API v1.0.
...
Follow the steps below to start an instance of the Nutch Server on localhost.
1. :~$ cd runtime/local
2. :~$ bin/nutch startserver -port <port_number> (if the -port option is omitted, the server starts on port 8081 by default)
The different API calls that can be made are listed below.
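As a rough illustration of how a client might address the server started above, the following Python sketch builds request objects for it. The helper names (`base_url`, `build_request`, `send`) are hypothetical, the default port 8081 is taken from the step above, and the JSON content type is an assumption rather than something this page specifies.

```python
import json
import urllib.request

DEFAULT_PORT = 8081  # port used when -port is not given (see the step above)


def base_url(port=DEFAULT_PORT, host="localhost"):
    """Return the root URL of a locally running Nutch server."""
    return f"http://{host}:{port}"


def build_request(method, path, payload=None, port=DEFAULT_PORT):
    """Build (but do not send) an HTTP request for the given API path."""
    data = None
    headers = {}
    if payload is not None:
        # Assumption: request bodies are sent as JSON.
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(
        base_url(port) + path, data=data, headers=headers, method=method
    )


def send(request):
    """Send a prepared request and return the response body as text."""
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

With a server running, `send(build_request("GET", "/admin"))` would be expected to return the status document described in the next section.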
...
This API point is used to get the server status and manage the server's state.
Get server status
{
No Format |
---|
GET /admin |
...
The response contains the server startup date, available configuration names, job history and currently running jobs.
{
No Format |
---|
{ "startDate":1424572500000, "configuration":[ "default" ], "jobs":[ ], "runningJobs":[ ] } |
...
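To show how a client might read the status response, here is a minimal Python sketch that parses the sample body above. The function name `summarize_status` is hypothetical, and treating `startDate` as milliseconds since the Unix epoch is an assumption based on the magnitude of the sample value.

```python
import json
from datetime import datetime, timezone

# Sample body copied from the response shown above.
SAMPLE = (
    '{ "startDate":1424572500000, "configuration":[ "default" ], '
    '"jobs":[ ], "runningJobs":[ ] }'
)


def summarize_status(body):
    """Parse a GET /admin response body into a small summary dict."""
    status = json.loads(body)
    # Assumption: startDate is milliseconds since the Unix epoch.
    started = datetime.fromtimestamp(status["startDate"] / 1000, tz=timezone.utc)
    return {
        "started": started.isoformat(),
        "configurations": status["configuration"],
        "job_history": len(status["jobs"]),
        "running_jobs": len(status["runningJobs"]),
    }


summary = summarize_status(SAMPLE)
print(summary)
```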
Stop server
A running server can be stopped using /admin/stop.
{
No Format |
---|
POST /admin/stop |
}
Response
{
No Format |
---|
Ok |
}
Configuration
Configuration's list
...
No Format |
---|
GET /config |
}
The response contains the names of the available configurations.
{
No Format |
---|
["default","custom-config"] |
...
Configuration parameters
...
No Format |
---|
GET /config/{configuration name} Examples: GET /config/default GET /config/custom-config |
...
The response contains the parameters with their values.
{
No Format |
---|
{ "anchorIndexingFilter.deduplicate":"false", "crawl.gen.delay":"604800000", "db.fetch.interval.default":"2592000", "db.fetch.interval.max":"7776000", .... .... } |
...
Create configuration
Creates a new Nutch configuration with the given parameters.
{
No Format |
---|
POST /config/create Examples: POST /config/create { "configId":"new-config", "params":{"anchorIndexingFilter.deduplicate":"false",... } } |
}
The response is the created configuration's id.
{
No Format |
---|
new-config |
...
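The create call above could be prepared from Python roughly as follows. This is a sketch only: `BASE_URL` assumes a server on the default port, `create_config_request` is a hypothetical helper, and the JSON content type header is an assumption.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8081"  # assumption: server on the default port


def create_config_request(config_id, params):
    """Build the POST /config/create request shown above (not sent here)."""
    body = json.dumps({"configId": config_id, "params": params}).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + "/config/create",
        data=body,
        # Assumption: the server expects a JSON content type.
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = create_config_request(
    "new-config", {"anchorIndexingFilter.deduplicate": "false"}
)
# To send it: urllib.request.urlopen(request).read().decode("utf-8")
# The response body should then be the id of the new configuration.
```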
Get property value
{
No Format |
---|
GET /config/{configuration name}/{property} Examples: GET /config/default/anchorIndexingFilter.deduplicate |
}
The response contains the parameter's value as a string.
{
No Format |
---|
false |
...
Set property value
{
No Format |
---|
PUT /config/{configuration name}/{property} Examples: PUT /config/default/http.agent.name |
...
The response contains the parameter's value as a string.
{
No Format |
---|
NUTCH_SOLR |
...
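Reading and writing a single property might look like the Python sketch below. The helper names are hypothetical, `BASE_URL` assumes the default port, and sending the new value as a plain-text request body is an assumption: this page shows the PUT path and the response but not the request body.

```python
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:8081"  # assumption: server on the default port


def property_url(config_name, prop):
    """Return the URL of one property inside one configuration."""
    return "{}/config/{}/{}".format(
        BASE_URL,
        urllib.parse.quote(config_name, safe=""),
        urllib.parse.quote(prop, safe=""),
    )


def get_property_request(config_name, prop):
    """Build GET /config/{configuration name}/{property}."""
    return urllib.request.Request(property_url(config_name, prop), method="GET")


def set_property_request(config_name, prop, value):
    """Build PUT /config/{configuration name}/{property}."""
    return urllib.request.Request(
        property_url(config_name, prop),
        # Assumption: the new value travels as the plain-text body.
        data=value.encode("utf-8"),
        headers={"Content-Type": "text/plain"},
        method="PUT",
    )


put_request = set_property_request("default", "http.agent.name", "NUTCH_SOLR")
```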
Delete configuration
{
No Format |
---|
DELETE /config/{configuration name} Examples: DELETE /config/new-config |
...
Jobs
This API point allows job management, including creating a job, retrieving job information and killing a job. For a complete tutorial, see How to run Jobs using the REST service.
Listing all jobs
...
No Format |
---|
GET /job |
}
The response contains a list of all jobs (running and history).
{
No Format |
---|
[ { "id":"job-id-5977", "type":"FETCH", "confId":"default", "args":null, "result":null, "state":"FINISHED", "msg":"", "crawlId":"crawl-01" }, { "id":"job-id-5978", "type":"PARSE", "confId":"default", "args":null, "result":null, "state":"RUNNING", "msg":"", "crawlId":"crawl-01" } ] |
}
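A client will often want to separate running jobs from finished ones. The Python sketch below groups job ids by their `state` field, using the two jobs from the listing above as sample data (with the two objects separated by a comma so the text parses as JSON). The function name `jobs_by_state` is hypothetical.

```python
import json

# The two jobs from the listing above, as a valid JSON array.
SAMPLE = """[
  {"id": "job-id-5977", "type": "FETCH", "confId": "default", "args": null,
   "result": null, "state": "FINISHED", "msg": "", "crawlId": "crawl-01"},
  {"id": "job-id-5978", "type": "PARSE", "confId": "default", "args": null,
   "result": null, "state": "RUNNING", "msg": "", "crawlId": "crawl-01"}
]"""


def jobs_by_state(body):
    """Group job ids from a GET /job response body by their state."""
    grouped = {}
    for job in json.loads(body):
        grouped.setdefault(job["state"], []).append(job["id"])
    return grouped


grouped = jobs_by_state(SAMPLE)
print(grouped.get("RUNNING", []))
```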
Get job info
...
No Format |
---|
GET /job/job-id-5977 |
}
Response
{
No Format |
---|
{ "id":"job-id-5977", "type":"FETCH", "confId":"default", "args":null, "result":null, "state":"FINISHED", "msg":"", "crawlId":"crawl01" } |
...
Stop job
...
No Format |
---|
POST /job/job-id-5977/stop |
...
Response
{
No Format |
---|
true |
}
Kill job
...
No Format |
---|
GET /job/job-id-5977/abort |
...
Creates a job with the given parameters. You should specify either a job type (such as INJECT, GENERATE, FETCH or PARSE) or a jobClassName.
{
No Format |
---|
POST /job/create { "crawlId":"crawl01", "type":"FETCH", "confId":"default", "args":{"someParam":"someValue"} } POST /job/create { "crawlId":"crawl01", "jobClassName":"org.apache.nutch.fetcher.FetcherJob", "confId":"default", "args":{"someParam":"someValue"} } |
...
The response is the created job's id.
{
No Format |
---|
job-id-43243 |
}
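The rule that a job is identified by either a job type or a jobClassName can be checked on the client before the request is sent. The Python sketch below builds the body for POST /job/create; the function name `job_payload` is hypothetical, and reading "either" as "exactly one of the two" is an interpretation rather than something this page states.

```python
import json


def job_payload(crawl_id, conf_id="default", job_type=None,
                job_class_name=None, args=None):
    """Build the JSON-serializable body for POST /job/create."""
    # Interpretation: exactly one of job_type / job_class_name must be set.
    if (job_type is None) == (job_class_name is None):
        raise ValueError("specify exactly one of job_type or job_class_name")
    payload = {"crawlId": crawl_id, "confId": conf_id, "args": args or {}}
    if job_type is not None:
        payload["type"] = job_type
    else:
        payload["jobClassName"] = job_class_name
    return payload


body = json.dumps(
    job_payload("crawl01", job_type="FETCH", args={"someParam": "someValue"})
)
```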
Seed List creation
The /seed/create endpoint creates a seed list and returns the temporary path of the file created. This path should be passed as the url_dir parameter of the INJECT job.
...
Response is the file directory path
{
No Format |
---|
/var/folders/m9/hsls1krx12x968plt2brlhr00000gn/T/1443721976324-0 |
}
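To illustrate how the returned path feeds into the INJECT job, the Python sketch below places it in the `url_dir` argument of a job-creation body. The function name `inject_job_payload` is hypothetical, and the crawl and configuration ids are placeholder values reused from the examples on this page.

```python
import json


def inject_job_payload(seed_dir, crawl_id, conf_id="default"):
    """Build a POST /job/create body that injects the given seed directory."""
    return {
        "crawlId": crawl_id,
        "type": "INJECT",
        "confId": conf_id,
        # The path returned by /seed/create goes into url_dir.
        "args": {"url_dir": seed_dir},
    }


# Path copied from the sample response above.
seed_dir = "/var/folders/m9/hsls1krx12x968plt2brlhr00000gn/T/1443721976324-0"
body = json.dumps(inject_job_payload(seed_dir, "crawl01"))
```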
Database
This API point provides access to information stored in the CrawlDb.
{
No Format |
---|
POST /db/crawldb with following { "type":"stats", "confId":"default", "crawlId":"crawl01", "args":{"someParam":"someValue"} } |
}
Besides stats, the type parameter accepts the values dump, topN and url. Their corresponding arguments can be found here.
The response contains information from the CrawlDbReader.java class. For the request above, the JSON response would look like this:
{
No Format |
---|
{ "retry 0":"8350", "minScore":"0.0", "retry 1":"96", "status":{ "3":{"count":"21","statusValue":"db_gone"}, "2":{"count":"594","statusValue":"db_fetched"}, "1":{"count":"7721","statusValue":"db_unfetched"}, "5":{"count":"86","statusValue":"db_redir_perm"}, "4":{"count":"24","statusValue":"db_redir_temp"} }, "totalUrls":"8446", "maxScore":"0.528", "avgScore":"0.029593771" } |
...
Note: If a type other than stats (such as dump, topN or url) is used, the response will be a file (application/octet-stream).
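All numeric values in the stats response arrive as strings, so a client has to convert them before doing arithmetic. The Python sketch below does this for the sample response above; the function name `status_counts` is hypothetical.

```python
import json

# Sample body copied from the stats response shown above.
SAMPLE = """{
  "retry 0": "8350", "minScore": "0.0", "retry 1": "96",
  "status": {
    "3": {"count": "21", "statusValue": "db_gone"},
    "2": {"count": "594", "statusValue": "db_fetched"},
    "1": {"count": "7721", "statusValue": "db_unfetched"},
    "5": {"count": "86", "statusValue": "db_redir_perm"},
    "4": {"count": "24", "statusValue": "db_redir_temp"}
  },
  "totalUrls": "8446", "maxScore": "0.528", "avgScore": "0.029593771"
}"""


def status_counts(body):
    """Return (total URLs, {status name: count}) from a stats response."""
    stats = json.loads(body)
    counts = {
        entry["statusValue"]: int(entry["count"])
        for entry in stats["status"].values()
    }
    return int(stats["totalUrls"]), counts


total, counts = status_counts(SAMPLE)
fetched_share = counts["db_fetched"] / total
print(f"{fetched_share:.1%} of {total} URLs fetched")
```

In this sample the per-status counts add up to the reported `totalUrls`, which makes a convenient sanity check on the parsed data.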
...