...

Configuration's list



GET /config


The response contains the names of the available configurations.

...

Configuration parameters



GET /config/{configuration name}

Examples:
GET /config/default
GET /config/custom-config
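
The two GET endpoints above can be called with curl in the same way as the other examples on this page. The following is a minimal sketch, assuming the server listens on localhost:8081 as elsewhere on this page; the names NUTCH_URL, config_url and get_config are hypothetical helpers introduced here for illustration and are not part of Nutch.

```shell
#!/bin/sh
# Assumption: the Nutch server address used by the other examples on this page.
NUTCH_URL="${NUTCH_URL:-http://localhost:8081}"

# Print the URL of the configuration list (no argument)
# or of one named configuration (one argument).
config_url() {
  if [ -n "${1:-}" ]; then
    printf '%s/config/%s' "$NUTCH_URL" "$1"
  else
    printf '%s/config' "$NUTCH_URL"
  fi
}

# Fetch the list of configurations, or the parameters of one configuration.
# The response is printed unmodified; its format is not assumed here.
get_config() {
  curl -s -X GET -H 'Content-Type: application/json' "$(config_url "${1:-}")"
}
```

Calling `get_config` with no argument requests /config, and `get_config default` requests /config/default.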

...

Creates a new Nutch configuration with the given parameters.


POST /config/create

Examples:
POST /config/create
   {
      "configId":"new-config",
      "params":{"anchorIndexingFilter.deduplicate":"false",... }
   }

# curl
curl -X POST -H "Content-Type: application/json" http://localhost:8081/config/create -d '{"configId":"new-config", "params":{"anchorIndexingFilter.deduplicate":"false"}}' 


The response is the ID of the created configuration.
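
When the configuration ID or parameter comes from a script variable, the request body from the curl example above can be assembled with printf. This is a sketch only: config_payload and create_config are hypothetical helper names, the server address is the one assumed throughout this page, and the values are inserted without JSON escaping, so they must not contain quotes or backslashes.

```shell
#!/bin/sh
# Assumption: the Nutch server address used by the other examples on this page.
NUTCH_URL="${NUTCH_URL:-http://localhost:8081}"

# Build the JSON body for POST /config/create with a single parameter.
# $1 = configuration ID, $2 = parameter name, $3 = parameter value.
# No JSON escaping is performed on the arguments.
config_payload() {
  printf '{"configId":"%s", "params":{"%s":"%s"}}' "$1" "$2" "$3"
}

# Send the request; the response body is printed unmodified.
create_config() {
  curl -s -X POST -H 'Content-Type: application/json' \
    "$NUTCH_URL/config/create" -d "$(config_payload "$1" "$2" "$3")"
}
```

For example, `create_config new-config anchorIndexingFilter.deduplicate false` sends the same body as the curl command above.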

...

Listing all jobs



GET /job
curl -X GET -H 'Content-Type: application/json' -i http://localhost:8081/job 


The response contains a list of all jobs, both running and historical.

...

Creates a job with the given parameters. Specify either a job type (such as INJECT, GENERATE, FETCH or PARSE) or a jobClassName.

curl -X POST -H 'Content-Type: application/json' -i http://localhost:8081/job/create --data '{"crawlId":"crawl01", "type":"INJECT", "confId":"default", "args": {"url_dir":"seedFiles/seed-1641959745623", "crawldb": "crawldb"}}'

The response object is shown below:

{
  "id": "crawl01-default-INJECT-1877363907",
  "type": "INJECT",
  "confId": "default",
  "args": {
    "url_dir": "seedFiles/seed-1641959745623",
    "crawldb": "crawldb"
  },
  "result": null,
  "state": "RUNNING",

...

"msg": "OK",
  "crawlId": "crawl01"
}

Seed Lists

Create seed list


...

The /seed/create endpoint creates a seed list and returns the path of the created file. Pass this path to the url_dir parameter of the INJECT job.

POST /seed/create
curl -X POST -H 'Content-Type: application/json' -i http://localhost:8081/seed/create --data '{"name":"test", "seedUrls":["https://nutch.apache.org",....]}'

The response is the relative path of the seed directory. The path is relative to the directory from which the Nutch server was started. Seed lists are persistent: they remain on disk even when the Nutch server is not running.

seedFiles/seed-1641959745623
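
Because the returned path is meant to be passed to the url_dir parameter of an INJECT job, the two calls can be chained in a script. The sketch below rests on several assumptions: the server listens on localhost:8081, the response body of /seed/create is the bare path and nothing else, and the sample values crawl01, default and crawldb from this page are appropriate. inject_payload and create_and_inject are hypothetical helper names, not part of Nutch.

```shell
#!/bin/sh
# Assumption: the Nutch server address used by the other examples on this page.
NUTCH_URL="${NUTCH_URL:-http://localhost:8081}"

# Build the JSON body for an INJECT job from a seed directory path ($1).
# crawl01, default and crawldb are the sample values used on this page.
# No JSON escaping is performed on the path.
inject_payload() {
  printf '{"crawlId":"crawl01", "type":"INJECT", "confId":"default", "args": {"url_dir":"%s", "crawldb": "crawldb"}}' "$1"
}

# Create a seed list, then submit an INJECT job that reads it.
# Assumption: /seed/create answers with the bare relative path.
create_and_inject() {
  seed_dir=$(curl -sS -f -X POST -H 'Content-Type: application/json' \
    "$NUTCH_URL/seed/create" \
    --data '{"name":"test", "seedUrls":["https://nutch.apache.org"]}') || return 1
  curl -sS -f -X POST -H 'Content-Type: application/json' \
    "$NUTCH_URL/job/create" --data "$(inject_payload "$seed_dir")"
}
```

Keeping the payload construction in its own function makes it possible to inspect the request body, for example with `inject_payload seedFiles/seed-1641959745623`, before anything is sent to the server.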

Get seed lists

The /seed endpoint returns the seed lists that were created during the current server session.

As of Nutch 1.18, seed lists generated during previous server sessions are not available once the server has been shut down and restarted.

Database

This endpoint provides access to information stored in the CrawlDb.

...