Tool/Object

Metric Group

Metric Name

Description

Usage and Comments

CleaningJob

CleaningJobStatus

Deleted documents

The total count of DB_GONE (HTTP 404) and/or DB_DUPLICATE documents ultimately deleted from the indexer(s).

This metric is useful for determining whether filtering or duplicate detection needs to happen further upstream prior to indexing. Ideally DB_GONE and DB_DUPLICATE documents would not make it into production indices in the first place.

CrawlDbFilter

CrawlDB filter

Gone records removed

The total count of DB_GONE (HTTP 404) records deleted from the CrawlDB during an update.

See NUTCH-1101 for more details.

CrawlDB filter

Orphan records removed

The total count of orphaned pages, i.e. pages which no longer have any other pages linking to them, deleted from the CrawlDB during an update.

See NUTCH-1932 for more details.

CrawlDB filter

URLs filtered

The total count of filtered pages, i.e. pages which did not pass one or more URLFilter implementation(s), deleted from the CrawlDB during an update.

This metric is generally useful for determining the overall effectiveness of URLFilter plugins over time.

POSSIBLE IMPROVEMENT: This metric could be improved if an association could be made between the URL which was filtered and the URLFilter which filtered it. This would facilitate aggregating URL filtering results by URLFilter.

CrawlDbReducer

CrawlDB status

CrawlDatum.getStatusName(CrawlDatum().getStatus())

With each URL able to have only one state at any given point in time, this metric facilitates aggregated counts of the different types of CrawlDatum states for a given CrawlDB.

The state of any given URL will change as the URL transitions through a crawl cycle. Available URL states are defined in the CrawlDatum e.g., STATUS_DB_UNFETCHED, STATUS_DB_FETCHED, STATUS_FETCH_SUCCESS, etc. In practice, CrawlDatum statuses are defined as byte values but accessed programmatically via static final constants.

This metric can be used to identify the presence of undesired CrawlDatum statuses for given URLs e.g., STATUS_DB_GONE. Such an event could then trigger a cleaning/pruning operation.
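As a rough illustration of how these per-status counts might be consumed, the sketch below aggregates status names and flags when a cleaning run may be worthwhile. The status strings, the `needs_cleaning` helper and its threshold are hypothetical and are not part of Nutch.

```python
from collections import Counter


def count_statuses(records):
    """Aggregate (url, status_name) pairs into per-status totals."""
    return Counter(status for _url, status in records)


def needs_cleaning(counts, gone_status="db_gone", threshold=0):
    """Return True when the number of gone records exceeds the threshold.

    Both the status name and the threshold are illustrative assumptions.
    """
    return counts.get(gone_status, 0) > threshold


# Hypothetical sample of CrawlDB records: each URL has exactly one status.
records = [
    ("https://example.org/", "db_fetched"),
    ("https://example.org/a", "db_unfetched"),
    ("https://example.org/b", "db_gone"),
    ("https://example.org/c", "db_fetched"),
]
counts = count_statuses(records)
```

A check such as `needs_cleaning(counts)` could then be used to decide whether to schedule a cleaning/pruning operation.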

DeduplicationJob

DeduplicationJobStatus

Documents marked as duplicate

The total number of duplicate documents in the CrawlDB.

The process of identifying (near) duplicate documents is of vital importance within the context of a search engine. The precision of any given information retrieval system can be negatively impacted if (near) duplicates are not identified and handled correctly. This does not always mean removing them; for example, (near) duplicates may be important for versioning purposes. In most cases, however, it is preferred to identify and remove (near) duplicate records.

The Deduplication algorithm in Nutch groups fetched URLs with the same digest and marks all of them as duplicate except the one with the highest score (based on the score in the crawldb, which is not necessarily the same as the score indexed). If two (or more) documents have the same score, then the document with the latest timestamp is kept. If the documents have the same timestamp then the one with the shortest URL is kept.

A duplicate record will have a CrawlDatum status of CrawlDatum.STATUS_DB_DUPLICATE.
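The selection rule described above (highest score, then latest timestamp, then shortest URL) can be sketched as follows. This is a minimal illustration in Python, not the Nutch implementation; the `Doc` type and its field names are assumptions made for the example.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Doc:
    """Hypothetical stand-in for a fetched CrawlDB record."""
    url: str
    digest: str
    score: float
    fetch_time: int  # e.g. epoch seconds


def keep_key(doc):
    # Sort order: highest score first, then latest timestamp, then shortest URL.
    return (-doc.score, -doc.fetch_time, len(doc.url))


def mark_duplicates(docs):
    """Return the set of URLs that would be marked as duplicates."""
    duplicates = set()
    by_digest = sorted(docs, key=lambda d: d.digest)
    for _digest, group in groupby(by_digest, key=lambda d: d.digest):
        ranked = sorted(group, key=keep_key)
        # The first record in each digest group is kept; the rest are duplicates.
        duplicates.update(d.url for d in ranked[1:])
    return duplicates


docs = [
    Doc("https://example.org/a", "d1", 1.0, 100),
    Doc("https://example.org/a?x=1", "d1", 1.0, 100),
    Doc("https://example.org/b", "d1", 0.5, 200),
    Doc("https://example.org/c", "d2", 0.1, 50),
]
duplicates = mark_duplicates(docs)
```

In this sample, the three records sharing digest `d1` compete: the two with the top score tie on timestamp, so the shorter URL is kept and the other two are marked.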

DomainStatistics

N/A

MyCounter.EMPTY_RESULT

The total count of empty (probably problematic) URL records for a given host, domain, suffix or top-level domain.

It is possible that the DomainStatistics tool may identify an empty record for a given URL. This may happen regardless of whether the tool is invoked to retrieve host, domain, suffix or top-level domain statistics. When this happens, it is likely that some investigation would take place to understand why. For example, the CrawlDbReader could be invoked with the -url command line argument to further debug/detail what CrawlDatum data exists.

N/A

MyCounter.FETCHED

The total count of fetched URL records for a given host, domain, suffix or top-level domain.

This metric is particularly useful for quickly drilling down through large datasets to determine, for example, how much 'coverage' has been achieved for a given host, domain, suffix or top-level domain. This figure can be compared to a website administrator's total.

N/A

MyCounter.NOT_FETCHED

The total count of unfetched URL records for a given host, domain, suffix or top-level domain.

This metric is particularly useful for quickly drilling down through large datasets to determine, for example, how much 'coverage' still has to be achieved for a given host, domain, suffix or top-level domain. When combined with the fetched figure and compared to a website administrator's total it can provide useful insight.
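As a small worked example of combining the FETCHED and NOT_FETCHED counters, the sketch below computes two coverage ratios. The function name and the `site_total` parameter (the administrator-supplied page count) are assumptions for illustration only.

```python
def coverage(fetched, not_fetched, site_total=None):
    """Return (crawl_coverage, site_coverage) as fractions.

    crawl_coverage: share of known URLs that have been fetched.
    site_coverage:  share of the administrator's reported total that has
                    been fetched, or None when no total is available.
    """
    known = fetched + not_fetched
    crawl_coverage = fetched / known if known else 0.0
    site_coverage = fetched / site_total if site_total else None
    return crawl_coverage, site_coverage


# Hypothetical counter values for a single host.
crawl_cov, site_cov = coverage(fetched=750, not_fetched=250, site_total=2000)
```

Here three quarters of the known URLs for the host have been fetched, but that is well under half of the total the administrator reports, which suggests many pages have not yet been discovered.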

Fetcher

FetcherStatus

bytes_downloaded

The total bytes of fetched data acquired across the Fetcher Mapper task(s).

Over time, this can be used to benchmark how much data movement is occurring over the Nutch crawl network.

POSSIBLE IMPROVEMENT: This metric could be improved if a correlation could be made between the volume of data and the source it came from whether that be a given host, domain, suffix or top-level domain.

FetcherStatus

hitByThrougputThreshold

A total count of the URLs dropped across all fetch queues due to throughput dropping below the threshold too many times.

This aspect of the Nutch Fetcher configuration is designed to prevent slow fetch queues from stalling the overall fetcher throughput. However, it usually delays the fetching of the URLs that are dropped because of low throughput, as they are shelved until a future fetch operation.

The specific configuration setting is:

<property>
  <name>fetcher.throughput.threshold.pages</name>
  <value>-1</value>
  <description>The threshold of minimum pages per second. If the fetcher downloads less
  pages per second than the configured threshold, the fetcher stops, preventing slow queue's
  from stalling the throughput. This threshold must be an integer. This can be useful when
  fetcher.timelimit.mins is hard to determine. The default value of -1 disables this check.
  </description>
</property>

A more thorough understanding of Fetcher configuration relating to (slow) throughput requires an understanding of the following configuration settings as well:

<property>
  <name>fetcher.throughput.threshold.retries</name>
  <value>5</value>
  <description>The number of times the fetcher.throughput.threshold.pages is allowed to be exceeded.
  This settings prevents accidental slow downs from immediately killing the fetcher thread.
  </description>
</property>

<property>
  <name>fetcher.throughput.threshold.check.after</name>
  <value>5</value>
  <description>The number of minutes after which the throughput check is enabled.</description>
</property>
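To show how the three throughput settings above might interact, here is a simplified sketch in Python. It is an assumption-laden illustration rather than the Fetcher's actual code: the `ThroughputMonitor` class and its method are hypothetical, and the exact comparison the Fetcher uses may differ.

```python
class ThroughputMonitor:
    """Hypothetical model of the throughput threshold check."""

    def __init__(self, threshold_pages=-1, max_retries=5, check_after_mins=5):
        self.threshold_pages = threshold_pages    # fetcher.throughput.threshold.pages
        self.max_retries = max_retries            # fetcher.throughput.threshold.retries
        self.check_after_mins = check_after_mins  # fetcher.throughput.threshold.check.after
        self.low_count = 0

    def should_drop(self, elapsed_mins, pages_last_sec):
        """Return True once throughput has been too low too many times."""
        if self.threshold_pages == -1:
            return False  # check disabled (the default)
        if elapsed_mins < self.check_after_mins:
            return False  # check not yet enabled
        if pages_last_sec < self.threshold_pages:
            self.low_count += 1
        return self.low_count >= self.max_retries


monitor = ThroughputMonitor(threshold_pages=2, max_retries=3, check_after_mins=5)
```

Under this model, the URLs remaining in the fetch queues when `should_drop` first returns True are the ones that would be counted by hitByThrougputThreshold.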

POSSIBLE IMPROVEMENT: It would be advantageous to understand which URLs from which hosts in the queue(s) were resulting in slow throughput. This would facilitate investigation into why this was happening.

FetcherStatus

hitByTimeLimit

A total count of the URLs dropped across all fetch queues due to the fetcher execution time limit being exceeded.

This metric is valuable for quantifying the number of URLs which have effectively been timebombed, i.e. shelved for future fetching due to overall fetcher runtime exceeding a predefined timeout.

Although by default the Fetcher never times out (the configuration is set to -1), if a timeout is preferred then the following configuration property can be edited:

<property>
  <name>fetcher.timelimit.mins</name>
  <value>-1</value>
  <description>This is the number of minutes allocated to the fetching.
  Once this value is reached, any remaining entry from the input URL list is skipped 
  and all active queues are emptied. The default value of -1 deactivates the time limit.
  </description>
</property>

POSSIBLE IMPROVEMENT: It could be useful to record the fact that a URL was staged due to it being hit by the timeout limit. This could possibly be stored in the CrawlDatum metadata.

Also see NUTCH-2910.

FetcherThread

FetcherStatus

AboveExceptionThresholdInQueue

The total count of URLs purged across all fetch queues as a result of the maximum number of protocol-level exceptions (e.g. timeouts) per queue being exceeded.

This metric is useful for quantifying the number of URLs shelved for future fetching due to anomalies occurring during fetcher execution exceeding a predefined ceiling.

Although by default the Fetcher never enforces this behaviour (the configuration is set to -1), if this is changed then this total count will become a useful metric to track. Further information on the configuration parameter can be seen below:

<property>
  <name>fetcher.max.exceptions.per.queue</name>
  <value>-1</value>
  <description>The maximum number of protocol-level exceptions (e.g. timeouts) per
  host (or IP) queue. Once this value is reached, any remaining entries from this
  queue are purged, effectively stopping the fetching from this host/IP. The default
  value of -1 deactivates this limit.
  </description>
</property>
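The purge behaviour described by fetcher.max.exceptions.per.queue can be modelled roughly as below. The `FetchQueue` class here is a hypothetical simplification written for this page, not the Nutch class of a similar name.

```python
class FetchQueue:
    """Hypothetical per-host (or per-IP) queue with an exception ceiling."""

    def __init__(self, urls, max_exceptions=-1):
        self.urls = list(urls)
        self.max_exceptions = max_exceptions  # fetcher.max.exceptions.per.queue
        self.exceptions = 0

    def record_exception(self):
        """Register a protocol-level exception.

        Returns the number of URLs purged from the queue (0 if none).
        """
        self.exceptions += 1
        limit_active = self.max_exceptions != -1
        if limit_active and self.exceptions >= self.max_exceptions:
            purged = len(self.urls)
            self.urls.clear()
            return purged
        return 0


queue = FetchQueue(["u1", "u2", "u3"], max_exceptions=2)
first = queue.record_exception()
second = queue.record_exception()
```

In this sketch, the sum of the values returned by `record_exception` across all queues corresponds to what AboveExceptionThresholdInQueue reports.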

POSSIBLE IMPROVEMENT: It could be useful to record the fact that a URL was staged due to it being hit by the exception limit. Additionally, it could be useful to write metadata to all URLs which contributed towards the limit being met. This could possibly be stored in the CrawlDatum metadata.

FetcherStatus

FetchItem.notCreated.redirect

A total count of URLs across all fetch queues for which following the redirect(s) produced no result.

Essentially, each FetcherThread attempts to understand and eliminate all redirect options, e.g. a duplicate redirect URL, before giving up on a redirect URL entirely. In the case of a redirect URL for which no logical fetch outcome can be produced, e.g. the FetchItem is null, redirecting is simply deactivated as it is impossible to continue.

FetcherStatus

outlinks_detected

A total count of detected outlinks for all fetch items (URLs) across all fetch queues.

This metric can be used to estimate the number of URLs to be fetched in the next fetch phase. If, for example, resources were being allocated within the client at configuration/compile time (rather than dynamically at runtime) this could be used to inform resource reservations, utilization and data partitioning logic.

FetcherStatus

outlinks_following

From outlinks_detected (see directly above), this is a total count of URLs which will be followed.

This metric value may be the same as outlinks_detected or it may be less. This ultimately depends on a few things:

whether the fetcher is configured to follow external outlinks
whether a given URL is already followed
whether a given URL is already fetched

FetcherStatus

ProtocolStatus.getName()

Total counts of all the different fetch finish statuses for all URLs.

For a comprehensive collection of the various fetch finish statuses to expect, check out the private static final HashMap<Integer, String> codeToName defined within ProtocolStatus. This metric is useful for understanding, from across your CrawlDb, the status of certain URLs. Via different tools, you can begin to investigate further.

FetcherStatus

redirect_count_exceeded

Total count of all URLs which have exceeded the maximum configured number of redirects.

See the following configuration property in nutch-default.xml:

<property>
  <name>http.redirect.max</name>
  <value>0</value>
  <description>The maximum number of redirects the fetcher will follow when
  trying to fetch a page. If set to negative or 0, fetcher won't immediately
  follow redirected URLs, instead it will record them for later fetching.
  </description>
</property>

This metric is useful for understanding how many URLs are possibly skipped due to a large number of redirects.

Also see the following property:

<property>
  <name>http.redirect.max.exceeded.skip</name>
  <value>false</value>
  <description>
    Whether to skip the last URL in a redirect chain when when redirects
    are followed (http.redirect.max > 0) and the maximum number of redirects
    in a chain is exceeded (redirect_count > http.redirect.max).
    If not skipped the redirect target URLs are stored as `linked`
    and fetched in one of the following cycles. See also NUTCH-2748.
  </description>
</property>
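Read together, the two redirect properties above suggest a decision along the following lines. This is a sketch based only on the property descriptions; the function name and the returned action labels are hypothetical.

```python
def handle_redirect(redirect_count, redirect_max=0, skip_when_exceeded=False):
    """Decide what to do with a redirect target.

    redirect_count:     redirects seen so far in the chain, including this one
    redirect_max:       http.redirect.max
    skip_when_exceeded: http.redirect.max.exceeded.skip
    """
    if redirect_max <= 0:
        # Redirects are never followed immediately, only recorded.
        return "record_for_later"
    if redirect_count <= redirect_max:
        return "follow_now"
    # Maximum exceeded: this is the case counted by redirect_count_exceeded.
    return "skip" if skip_when_exceeded else "record_for_later"


default_action = handle_redirect(1)
```

With the default values, every redirect target is recorded for a later fetch cycle, so redirect_count_exceeded only becomes relevant once http.redirect.max is set above 0.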

FetcherStatus

redirect_deduplicated

Total count of duplicate (and hence not fetched) redirected URLs.

No fetching takes place for this class of redirect URLs as they are duplicates of other redirect URLs already fetched.

FetcherStatus

robots_denied

Total count of all URLs not fetched due to being denied by robots.txt rules.

By default Nutch is configured to respect and comply with robots.txt rules for any given site. It is useful to know how many URLs may not be fetched from a given site due to robots.txt compliance.

FetcherStatus

robots_denied_maxcrawldelay

Total count of URLs which are skipped due to the robots.txt crawl delay being above a configured maximum.

See the following configuration property for a detailed explanation of this metric:

<property>
 <name>fetcher.max.crawl.delay</name>
 <value>30</value>
 <description>
 If the Crawl-Delay in robots.txt is set to greater than this value (in
 seconds) then the fetcher will skip this page, generating an error report.
 If set to -1 the fetcher will never skip such pages and will wait the
 amount of time retrieved from robots.txt Crawl-Delay, however long that
 might be.
 </description>
</property>

Essentially, a delay of 5 seconds is used for fetching requests to the same host unless a crawl delay is specified within the robots.txt. Also see:

<property>
  <name>fetcher.server.delay</name>
  <value>5.0</value>
  <description>The number of seconds the fetcher will delay between 
   successive requests to the same server. Note that this might get
   overridden by a Crawl-Delay from a robots.txt and is used ONLY if 
   fetcher.threads.per.queue is set to 1.
   </description>
</property>
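The interplay between fetcher.server.delay, a robots.txt Crawl-Delay and fetcher.max.crawl.delay can be summarised with a short sketch. The function below is hypothetical and assumes fetcher.threads.per.queue is 1, as the property description requires.

```python
def effective_delay(robots_crawl_delay=None, server_delay=5.0, max_crawl_delay=30):
    """Return the delay in seconds, or None when the URL would be skipped.

    robots_crawl_delay: Crawl-Delay from robots.txt, or None if absent
    server_delay:       fetcher.server.delay
    max_crawl_delay:    fetcher.max.crawl.delay (-1 means never skip)
    """
    if robots_crawl_delay is None:
        return server_delay
    if max_crawl_delay != -1 and robots_crawl_delay > max_crawl_delay:
        # This is the case counted by robots_denied_maxcrawldelay.
        return None
    return robots_crawl_delay


no_robots_delay = effective_delay()
```

A site asking for a 60 second Crawl-Delay would be skipped under the default maximum of 30, but honoured in full if the maximum is set to -1.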

ParserStatus

ParseStatus.majorCodes[p.getData().getStatus().getMajorCode()]

Total count of major codes (see right) from parsing URLs.

ParseStatus defines three major categories for the result of a URL parse operation, i.e. notparsed, success and failed. This metric is useful for debugging how many parse operations failed for a given crawl cycle. Subsequent parse attempts can then be made or the URL record can be handled appropriately.

Generator

EXPR_REJECTED

Total count of URLs rejected by Jexl expression(s) during the generate phase.

This metric is useful for determining the impact that Jexl expressions (provided via the Generator CLI) have on filtering URLs. The expressions are evaluated during the Generator Map phase.

All of the configuration which drives this metric is read or set from Java code and not explicitly defined in nutch-default.xml.

Generator

HOSTS_AFFECTED_PER_HOST_OVERFLOW

Total count of host(s) or domain(s) affected by the number of URLs exceeding a configured fetchlist size threshold.

This configuration property is essentially turned off by default, i.e. there is no defined maximum number of URLs per fetchlist. However, if a maximum is defined, then it is useful to know how many hosts or domains included in fetchlists have more URLs than allowed. In these cases additional URLs won't be included in the fetchlist but are bumped on to future crawling cycles.

The configuration property below will directly drive this metric.

<property>
  <name>generate.max.count</name>
  <value>-1</value>
  <description>The maximum number of URLs in a single
  fetchlist. -1 if unlimited. The URLs are counted according
  to the value of the parameter generate.count.mode.
  </description>
</property>
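To make the overflow counters concrete, the sketch below applies a per-host cap while building a fetchlist and reports both how many URLs were skipped and how many hosts were affected. It is a simplified illustration assuming generate.count.mode is 'host'; the `build_fetchlist` function is hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def build_fetchlist(urls, max_count=-1):
    """Apply a per-host cap (generate.max.count) to a list of URLs.

    Returns (selected_urls, urls_skipped, hosts_affected).
    """
    per_host = Counter()
    overflow_hosts = set()
    selected = []
    skipped = 0
    for url in urls:
        host = urlsplit(url).hostname
        if max_count != -1 and per_host[host] >= max_count:
            overflow_hosts.add(host)
            skipped += 1
            continue
        per_host[host] += 1
        selected.append(url)
    return selected, skipped, len(overflow_hosts)


urls = [
    "https://a.example/1",
    "https://a.example/2",
    "https://a.example/3",
    "https://b.example/1",
]
selected, urls_skipped, hosts_affected = build_fetchlist(urls, max_count=2)
```

In this model `urls_skipped` plays the role of URLS_SKIPPED_PER_HOST_OVERFLOW and `hosts_affected` the role of HOSTS_AFFECTED_PER_HOST_OVERFLOW.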

Generator

INTERVAL_REJECTED

Total count of records rejected due to the retry or fetch interval being above a configured threshold.

This configuration property is essentially turned off by default, i.e. there is no minimum defined retry interval. This metric is useful for understanding the impact that changing that has on URL filtering.

The configuration property below drives this metric.

<property>
  <name>generate.min.interval</name>
  <value>-1</value>
  <description>Select only entries with a retry interval lower than
  generate.min.interval. A value of -1 disables this check.</description>
</property>

Generator

MALFORMED_URL

Total count of malformed URLs filtered.

In the Generator, malformed URLs are discovered either by

a URL normalizer implementation. This is turned on by default but can be toggled on or off within the Generator CLI or programmatically
attempting to extract either the URL host or domain (depending on which one is configured). See the following property

<property>
  <name>generate.count.mode</name>
  <value>host</value>
  <description>Determines how the URLs are counted for generate.max.count.
  Default value is 'host' but can be 'domain'. Note that we do not count 
  per IP in the new version of the Generator.
  </description>
</property>

Generator

SCHEDULE_REJECTED

Total count of URLs not suitable for selection in a given crawl cycle due to the fetch time being later than the current time.

The metric description covers the default fetch schedule case but this can change depending on the actual implementation. Of specific interest is FetchSchedule#shouldFetch(...) which explains further. This metric can be useful for comparing implementations of FetchSchedule.

Generator

SCORE_TOO_LOW

Total count of filtered URL entries with a score lower than a configured threshold.

The configuration parameter which drives this metric is:

<property>
  <name>generate.min.score</name>
  <value>0</value>
  <description>Select only entries with a score larger than
  generate.min.score.</description>
</property>

The default value for this configuration property means that all entries should be selected.

The metric can be useful to determine if a configured minimum value is too high and filters too many URLs from being included in fetchlists.

Generator

STATUS_REJECTED

Total count of URLs filtered by a CrawlDatum status filter.

The following configuration property is used to filter URLs directly by their CrawlDatum status:

<property>
  <name>generate.restrict.status</name>
  <value></value>
  <description>Select only entries of this status, see
  https://issues.apache.org/jira/browse/NUTCH-1248</description>
</property>

As an indication of the status keys which can be used, see CrawlDatum.statNames.

This metric is useful to simply see how effective the status filters are.

Generator

URLS_SKIPPED_PER_HOST_OVERFLOW

Total count of URLs skipped by the number of URLs exceeding a configured fetchlist size threshold.

This configuration property is essentially turned off by default, i.e. there is no defined maximum number of URLs per fetchlist. However, if a maximum is defined, then it is useful to know how many URLs are skipped. In these cases additional URLs won't be included in the fetchlist but are bumped on to future crawling cycles.

The configuration property below will directly drive this metric.

<property>
  <name>generate.max.count</name>
  <value>-1</value>
  <description>The maximum number of URLs in a single
  fetchlist. -1 if unlimited. The URLs are counted according
  to the value of the parameter generate.count.mode.
  </description>
</property>

IndexerMapReduce

IndexerStatus

deleted (duplicates)

IndexerStatus

deleted (IndexingFilter)

IndexerStatus

deleted (gone)

IndexerStatus

deleted (redirects)

IndexerStatus

deleted (robots=noindex)

IndexerStatus

errors (IndexingFilter)

IndexerStatus

errors (ScoringFilter)

IndexerStatus

indexed (add/update)

IndexerStatus

skipped (IndexingFilter)

IndexerStatus

skipped (not modified)

Injector

injector

urls_filtered

Total count of seed URLs filtered by the Injector.

URL normalization and then filtering operations are executed within the Injector Map task(s). Both are turned off by default in nutch-default.xml; however, if these values are not set, the Injector turns normalization and filtering on by default.

This metric is useful to determine the impact that normalization and filtering have on the injection of seeds contained within seed lists. For more information on configuration see:

<property>
    <name>crawldb.url.normalizers</name>
    <value>false</value>
    <description>
	!Temporary, can be overwritten with the command line!
	Normalize URLs when updating crawldb
    </description>
</property>

<property>
    <name>crawldb.url.filters</name>
    <value>false</value>
    <description>
	!Temporary, can be overwritten with the command line!
	Filter URLS when updating crawldb
    </description>
</property>

injector

urls_injected

Total count of seed URLs injected by the Injector.

A useful metric for simply counting by how many URLs the CrawlDb has grown by the end of an Injector invocation.

injector

urls_merged

Total count of seed URLs merged with an existing CrawlDatum record.

This metric is useful for seeing how many existing URL records were affected by a given seed list within the Injector reduce task(s). Several configuration settings are used to determine what those effects are:

<property>
  <name>db.injector.overwrite</name>
  <value>false</value>
  <description>Whether existing records in the CrawlDB will be overwritten
  by injected records.
  </description>
</property>

<property>
  <name>db.injector.update</name>
  <value>false</value>
  <description>If true existing records in the CrawlDB will be updated with
  injected records. Old meta data is preserved. The db.injector.overwrite
  parameter has precedence.
  </description>
</property>

You should also consult the Injector.InjectorReducer#reduce documentation as it adequately describes the Injector merging algorithm.
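As a reading aid for the two properties above, the following sketch models how db.injector.overwrite and db.injector.update might combine when a seed URL already exists in the CrawlDB. It is based only on the property descriptions, so the details are assumptions: records are plain dictionaries, and in update mode only metadata is merged here, whereas the actual reducer may update other fields as well.

```python
def merge(existing, injected, overwrite=False, update=False):
    """Return the record that would be written for a URL.

    existing:  current CrawlDB record as a dict, or None if the URL is new
    injected:  record built from the seed list
    overwrite: db.injector.overwrite (takes precedence over update)
    update:    db.injector.update
    """
    if existing is None or overwrite:
        return dict(injected)
    if update:
        merged = dict(existing)
        # Old metadata is preserved; injected metadata is added on top.
        merged["metadata"] = {
            **existing.get("metadata", {}),
            **injected.get("metadata", {}),
        }
        return merged
    # Default: the existing record is left untouched.
    return dict(existing)


existing = {"status": "db_fetched", "score": 1.0, "metadata": {"a": "1"}}
injected = {"status": "injected", "score": 2.0, "metadata": {"b": "2"}}
updated = merge(existing, injected, update=True)
```

In every branch where `existing` is not None the seed URL counts as merged, regardless of which record is ultimately written.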

injector

urls_purged_404

Total count of deleted/purged URLs due to an existing CrawlDatum.STATUS_DB_GONE status.

injector

urls_purged_filter

Total count of deleted/purged URLs filtered by one or more filters and/or normalizers.

ParseSegment

ParserStatus

ParseStatus.majorCodes[parseStatus.getMajorCode()]

QueueFeeder

FetcherStatus

filtered

(also QueueFeeder)

FetcherStatus

AboveExceptionThresholdInQueue

ResolverThread

UpdateHostDb

checked_hosts

UpdateHostDb

existing_known_host

UpdateHostDb

existing_unknown_host

UpdateHostDb

new_known_host

UpdateHostDb

new_unknown_host

UpdateHostDb

purged_unknown_host

UpdateHostDb

rediscovered_host

UpdateHostDb

Long.toString(datum.numFailures()) + "_times_failed"

SitemapProcessor

Sitemap

existing_sitemap_entries

Sitemap

failed_fetches

Sitemap

filtered_records

Sitemap

filtered_sitemaps_from_hostname

Sitemap

new_sitemap_entries

Sitemap

sitemaps_from_hostname

Sitemap

sitemap_seeds

UpdateHostDbMapper

UpdateHostDb

filtered_records

UpdateHostDbReducer

UpdateHostDb

total_hosts

(also UpdateHostDbReducer)

UpdateHostDb

skipped_not_eligible

WebGraph

WebGraph.outlinks

added links

(also WebGraph)

WebGraph.outlinks

removed links

WARCExporter

exception

WARCExporter

invalid URI

WARCExporter

missing content

WARCExporter

missing metadata

WARCExporter

omitted empty response

WARCExporter

records generated
