...

e.g. http://www.xyz.org/ nutch.score=10 nutch.fetchInterval=2592000 userType=open_source


Nutch 1.x

Usage:


bin/nutch inject [-D...] <crawldb> <url_dir> [-overwrite|-update] [-noFilter] [-noNormalize] [-filterNormalizeAll]

  <crawldb>     Path to a crawldb directory. If it does not exist, a new one is created.
  <url_dir>     Path to URL file or directory with URL file(s) containing URLs to be injected.
                A URL file should have one URL per line, optionally followed by custom metadata.
                Blank lines and lines starting with '#' are ignored. Custom metadata must
                be of the form 'key=value', with entries separated by tabs.
                Below are reserved metadata keys:

                        nutch.score: A custom score for a URL
                        nutch.fetchInterval: A custom fetch interval (in seconds) for a URL
                        nutch.fetchInterval.fixed: A custom fetch interval (in seconds) for a URL that is not changed by AdaptiveFetchSchedule

                Example:
                 http://www.apache.org/
                 http://www.nutch.org/ \t nutch.score=10 \t nutch.fetchInterval=2592000 \t userType=open_source

 -overwrite     Overwrite existing crawldb records with the injected records. Takes precedence over '-update'
 -update        Update existing crawldb records with the injected records. Old metadata is preserved

 -noNormalize   Do not normalize URLs before injecting
 -noFilter      Do not apply URL filters to injected URLs
 -filterNormalizeAll
                Normalize and filter all URLs including the URLs of existing CrawlDb records

 -D...          set or overwrite configuration property (property=value)
 -Ddb.update.purge.404=true
                remove URLs with status gone (404) from CrawlDb
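
The seed file format and argument order described above can be sketched in Python. This is only an illustration under stated assumptions: the helper names (`seed_line`, `parse_seed_line`, `build_inject_command`) are hypothetical and not part of Nutch, the parser is a simplified reading of the format rather than Nutch's own implementation, and the paths `crawl/crawldb` and `urls` in the usage note are placeholders.

```python
def seed_line(url, metadata=None):
    """Format one seed entry: the URL followed by tab-separated key=value pairs."""
    fields = [url]
    for key, value in (metadata or {}).items():
        fields.append(f"{key}={value}")
    return "\t".join(fields)


def parse_seed_line(line):
    """Return (url, metadata), or None for blank lines and '#' comments.

    Simplified illustration of the documented format, not Nutch's parser.
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    url, *pairs = line.split("\t")
    metadata = {}
    for pair in pairs:
        pair = pair.strip()
        if "=" in pair:
            key, value = pair.split("=", 1)
            metadata[key] = value
    return url.strip(), metadata


def build_inject_command(crawldb, url_dir, overwrite=False, update=False,
                         properties=None):
    """Assemble the argument list in the order given by the usage line."""
    cmd = ["bin/nutch", "inject"]
    # -D options come first, as in: inject [-D...] <crawldb> <url_dir> ...
    for key, value in (properties or {}).items():
        cmd.append(f"-D{key}={value}")
    cmd += [crawldb, url_dir]
    # -overwrite takes precedence over -update, so emit at most one of them.
    if overwrite:
        cmd.append("-overwrite")
    elif update:
        cmd.append("-update")
    return cmd
```

For instance, `seed_line("http://www.nutch.org/", {"nutch.score": 10, "userType": "open_source"})` yields a line with the URL and two metadata entries separated by tabs, and `build_inject_command("crawl/crawldb", "urls", update=True)` yields an argument list that could be passed to `subprocess.run` from the Nutch installation directory.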

<crawldb>: The directory containing the crawldb

...