...
Follow the steps below to start an instance of the Nutch Server on localhost.
...
Code Block | ||||||||
---|---|---|---|---|---|---|---|---|
| ||||||||
% cd runtime/local
% ./bin/nutch startserver -port <port_number>
[If the -port option is omitted, the server starts on port 8081 by default.] |
The different API calls that can be made are listed below.
...
Get server status
No Format |
---|
GET /admin
|
Response contains the server startup date, available configuration names, job history and currently running jobs.
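As an illustration of the status call above, here is a minimal Python sketch using only the standard library. The helper names `build_status_request` and `fetch_status` are hypothetical, the base URL assumes the default port 8081, and decoding the body as JSON is an assumption about the response format rather than something this page states.

```python
import json
import urllib.request

# Assumption: the server runs locally on the default port 8081.
BASE_URL = "http://localhost:8081"


def build_status_request(base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) the GET /admin request."""
    return urllib.request.Request(f"{base_url}/admin", method="GET")


def fetch_status(base_url: str = BASE_URL, timeout: float = 5.0) -> dict:
    """Send GET /admin and decode the body, assuming it is JSON."""
    request = build_status_request(base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a server running, `fetch_status()` should return the status document described above; without one, `urlopen` raises `urllib.error.URLError`.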
...
A running server can be stopped using /admin/stop.
No Format |
---|
GET /admin/stop |
Response
No Format |
---|
Stopping in server on port 8081 |
Configuration
Configuration list
No Format |
---|
GET /config
|
Response contains names of available configurations.
No Format |
---|
["default","custom-config"]
|
Configuration parameters
No Format |
---|
GET /config/{configuration name}
Examples:
GET /config/default
GET /config/custom-config
|
...
Creates a new Nutch configuration with the given parameters.
No Format |
---|
POST /config/create
Examples:
POST /config/create
{
  "configId":"new-config",
  "params":{"anchorIndexingFilter.deduplicate":"false",... }
}
# curl
curl -X POST -H "Content-Type: application/json" http://localhost:8081/config/create -d '{"configId":"new-config", "params":{"anchorIndexingFilter.deduplicate":"false"}}' |
The response is the ID of the created configuration.
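The same request can be assembled programmatically. The following is a rough Python sketch that mirrors the curl example; `build_config_create_request` is a hypothetical helper name, and the base URL assumes the default port 8081. It only constructs the request, so nothing is sent until it is passed to `urllib.request.urlopen`.

```python
import json
import urllib.request


def build_config_create_request(config_id, params,
                                base_url="http://localhost:8081"):
    """Build (but do not send) a POST /config/create request.

    config_id and params map to the "configId" and "params" fields
    shown in the curl example; base_url is an assumption.
    """
    body = json.dumps({"configId": config_id, "params": params})
    return urllib.request.Request(
        f"{base_url}/config/create",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

For example, `build_config_create_request("new-config", {"anchorIndexingFilter.deduplicate": "false"})` yields a request equivalent to the curl command above.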
...
Listing all jobs
No Format |
---|
curl -X GET -H 'Content-Type: application/json' -i http://localhost:8081/job |
Response contains a list of all jobs (running and historical).
...
Creates a job with the given parameters. Specify either a job type (such as INJECT, GENERATE, FETCH, PARSE, etc.) or a jobClassName.
No Format |
---|
curl -X POST -H 'Content-Type: application/json' -i http://localhost:8081/job/create --data '{"crawlId":"crawl01", "type":"INJECT", "confId":"default", "args": {"url_dir":"seedFiles/seed-1641959745623", "crawldb": "crawldb"}}' |
Response object is provided below
No Format |
---|
{
  "id": "crawl01-default-INJECT-1877363907",
  "type": "INJECT",
  "confId": "default",
  "args": { "url_dir": "seedFiles/seed-1641959745623", "crawldb": "crawldb" },
  "result": null,
  "state": "RUNNING",
  ...
  "msg": "OK",
  ...
  "crawlId": "crawl01"
} |
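To make the "job type or jobClassName" rule concrete, here is a small Python sketch that builds the request body for /job/create and inspects a response object. The function names are hypothetical, and rejecting a payload that has neither field is this sketch's own reading of the rule above, not documented server behaviour. The `state` field and the `RUNNING` value are taken from the example response.

```python
def build_job_payload(crawl_id, conf_id, args,
                      job_type=None, job_class_name=None):
    """Build the JSON-serialisable body for POST /job/create.

    Assumption: at least one of job_type ("type") or
    job_class_name ("jobClassName") must be supplied.
    """
    if job_type is None and job_class_name is None:
        raise ValueError("specify either a job type or a jobClassName")
    payload = {"crawlId": crawl_id, "confId": conf_id, "args": args}
    if job_type is not None:
        payload["type"] = job_type
    if job_class_name is not None:
        payload["jobClassName"] = job_class_name
    return payload


def is_running(job_response):
    """Return True if a decoded job response reports state RUNNING."""
    return job_response.get("state") == "RUNNING"
```

The payload can be serialised with `json.dumps` and sent as the `--data` body of the curl call shown earlier.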
Seed Lists
Create seed list
The /seed/create endpoint lets the user create a seed list and returns the temporary path of the file created. This path should be passed as the url_dir parameter of the INJECT job.
No Format |
---|
curl -X POST -H 'Content-Type: application/json' -i http://localhost:8081/seed/create --data '{"name":"test", "seedUrls":["https://nutch.apache.org",....]}' |
The response is the relative path of the seed file directory. Note that it is relative to the directory in which the Nutch server was started. Seed lists created this way are persistent: they remain on disk even when the Nutch server is not running.
No Format |
---|
seedFiles/seed-1641959745623 |
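Since the returned path is meant to feed the url_dir parameter of the INJECT job, the hand-off can be sketched in Python as follows. `inject_args_from_seed_response` is a hypothetical helper, and both the `crawldb` default and the assumption that the response body is the bare path (possibly with surrounding whitespace) come from the examples on this page rather than from a formal specification.

```python
def inject_args_from_seed_response(response_body, crawldb="crawldb"):
    """Turn the /seed/create response body into INJECT job args.

    Assumption: the body is the relative seed directory path as plain
    text, e.g. "seedFiles/seed-1641959745623".
    """
    seed_path = response_body.strip()
    if not seed_path:
        raise ValueError("empty seed path in /seed/create response")
    return {"url_dir": seed_path, "crawldb": crawldb}
```

The resulting dictionary can be used as the `args` value of an INJECT job request to /job/create.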
Get seed lists
The /seed endpoint retrieves any seed lists that were created during the current server runtime.
As of Nutch 1.18, seed lists generated by previous server runtime sessions are not available after the server is shut down and restarted.
Database
This endpoint provides access to information stored in the CrawlDb.
...