...

Follow the steps below to start an instance of the Nutch Server on localhost.

Code Block

% cd runtime/local
% ./bin/nutch startserver -port <port_number>

If the -port option is omitted, the server starts on port 8081 by default.

...

Get server status

No Format
GET /admin

The response contains the server startup date, available configuration names, job history, and currently running jobs.
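As a minimal sketch of calling this endpoint from a script, the Python snippet below builds the GET /admin request with the standard library. The base URL assumes localhost and the default port 8081. The helper names are illustrative, and treating the response body as JSON is an assumption rather than something this page states.

```python
import json
import urllib.request

# Assumption: server started on localhost with the default port 8081.
BASE_URL = "http://localhost:8081"


def admin_status_request(base_url=BASE_URL):
    """Build (without sending) the GET /admin request."""
    return urllib.request.Request(base_url + "/admin", method="GET")


def send_json(request, timeout=10.0):
    """Send a request and decode the body as JSON.

    Assumption: the server replies with a JSON body.
    This call needs a running Nutch server.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a server running, `send_json(admin_status_request())` would return the decoded status object.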

...

A running server can be stopped using /admin/stop.

No Format
GET /admin/stop

Response

No Format
Stopping in server on port 8081

Configuration

Configuration's list

No Format
GET /config

The response contains the names of the available configurations.

No Format
["default","custom-config"]
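A small sketch of how a client might consume this list: it decodes the JSON array shown above and derives the per-configuration request path for each name. The function names are hypothetical, and percent-encoding the name is a defensive assumption, not a documented requirement.

```python
import json
from urllib.parse import quote


def parse_config_names(body):
    """Decode the GET /config response body into a list of names."""
    names = json.loads(body)
    if not isinstance(names, list) or not all(isinstance(n, str) for n in names):
        raise ValueError("expected a JSON array of strings")
    return names


def config_path(name):
    """Return the request path for one configuration, e.g. /config/default."""
    # Percent-encode the name so unusual characters stay inside one path segment.
    return "/config/" + quote(name, safe="")


names = parse_config_names('["default","custom-config"]')
paths = [config_path(n) for n in names]
```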

Configuration parameters

No Format
GET /config/{configuration name}

Examples:
GET /config/default
GET /config/custom-config

...

Creates a new Nutch configuration with the given parameters.

No Format


POST /config/create

Examples:
POST /config/create
   {
      "configId":"new-config",
      "params":{"anchorIndexingFilter.deduplicate":"false",... }
   }

# curl
curl -X POST -H "Content-Type: application/json" http://localhost:8081/config/create -d '{"configId":"new-config", "params":{"anchorIndexingFilter.deduplicate":"false"}}'

The response is the ID of the created configuration.
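The same request as the curl example can be sketched in Python with the standard library. The snippet only builds the request, so it runs without a server. The base URL assumes the default port 8081, and `create_config_request` is an illustrative name.

```python
import json
import urllib.request

# Assumption: server started on localhost with the default port 8081.
BASE_URL = "http://localhost:8081"


def create_config_request(config_id, params, base_url=BASE_URL):
    """Build (without sending) the POST /config/create request."""
    body = json.dumps({"configId": config_id, "params": params}).encode("utf-8")
    return urllib.request.Request(
        base_url + "/config/create",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = create_config_request(
    "new-config", {"anchorIndexingFilter.deduplicate": "false"}
)
```

Passing `request` to `urllib.request.urlopen` against a running server would submit it.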

...

Listing all jobs

No Format
curl -X GET -H 'Content-Type: application/json' -i http://localhost:8081/job

The response contains a list of all jobs (running and history).

...

Creates a job with the given parameters. Specify either the job type (such as INJECT, GENERATE, FETCH, or PARSE) or jobClassName.

No Format
curl -X POST -H 'Content-Type: application/json' -i http://localhost:8081/job/create --data '{"crawlId":"crawl01", "type":"INJECT", "confId":"default", "args": {"url_dir":"seedFiles/seed-1641959745623", "crawldb": "crawldb"}}'

The response object is shown below.

No Format
{
  "id": "crawl01-default-INJECT-1877363907",
  "type": "INJECT",
  "confId": "default",
  "args": {
    "url_dir": "seedFiles/seed-1641959745623",
    "crawldb": "crawldb"
  },
  "result": null,
  "state": "RUNNING",
  "msg": "OK",
  "crawlId": "crawl01"
}
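To illustrate the request body and the response fields, here is a hedged Python sketch. `build_job_payload` and `job_summary` are hypothetical helpers. Requiring exactly one of the job type or jobClassName is one reading of the rule stated above, not confirmed server behaviour.

```python
import json


def build_job_payload(crawl_id, conf_id, args, job_type=None, job_class_name=None):
    """Return the JSON-serialisable body for POST /job/create."""
    # Assumption: exactly one of job_type / job_class_name is supplied.
    if (job_type is None) == (job_class_name is None):
        raise ValueError("specify exactly one of job_type or job_class_name")
    payload = {"crawlId": crawl_id, "confId": conf_id, "args": args}
    if job_type is not None:
        payload["type"] = job_type
    else:
        payload["jobClassName"] = job_class_name
    return payload


def job_summary(response_body):
    """Pick the id and state out of a /job/create response body."""
    job = json.loads(response_body)
    return job["id"], job["state"]


payload = build_job_payload(
    "crawl01",
    "default",
    {"url_dir": "seedFiles/seed-1641959745623", "crawldb": "crawldb"},
    job_type="INJECT",
)
body = json.dumps(payload)  # string to send as the request body
```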

Seed Lists

Create seed list

The /seed/create endpoint lets the user create a seed list and returns the path of the file created. This path should be passed to the url_dir parameter of the INJECT job.

No Format
curl -X POST -H 'Content-Type: application/json' -i http://localhost:8081/seed/create --data '{"name":"test", "seedUrls":["https://nutch.apache.org"]}'

The response is the relative path of the seed directory. Note that it is relative to the directory from which the Nutch server was started. Seed lists are persistent: they remain on disk even when the Nutch server is not running.

No Format
seedFiles/seed-1641959745623
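The hand-off from seed creation to injection can be sketched as follows: one helper builds the /seed/create body, and another turns the returned path into the args of an INJECT job. Both helper names are hypothetical. Stripping whitespace from the returned path and defaulting the crawldb argument to "crawldb" (the value used in the job example on this page) are assumptions.

```python
import json


def seed_create_payload(name, urls):
    """Return the JSON-serialisable body for POST /seed/create."""
    urls = list(urls)
    if not urls:
        raise ValueError("at least one seed URL is required")
    return {"name": name, "seedUrls": urls}


def inject_args(seed_path, crawldb="crawldb"):
    """Turn the path returned by /seed/create into INJECT job args."""
    # Assumption: the response body is the bare path, possibly with a newline.
    return {"url_dir": seed_path.strip(), "crawldb": crawldb}


seed_body = json.dumps(seed_create_payload("test", ["https://nutch.apache.org"]))
args = inject_args("seedFiles/seed-1641959745623\n")
```

The resulting `args` dictionary matches the "args" object expected by the INJECT job.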

Get seed lists

The /seed endpoint retrieves any seed lists that were created during the current server runtime.

As of Nutch 1.18, seed lists generated by previous server sessions are not available after the server is shut down and restarted.

Database

This endpoint provides access to information stored in the CrawlDb.

...
