
Tool/Object | Metric Group | Metric Name | Description | Usage and Comments
CleaningJob | CleaningJobStatus | Deleted documents | The total count of DB_GONE (HTTP 404) and/or DB_DUPLICATE documents ultimately deleted from the indexer(s). | This metric is useful for determining whether filtering or duplicate detection needs to happen further upstream, prior to indexing. Ideally, DB_GONE and DB_DUPLICATE documents would not make it into production indices in the first place.
CrawlDbFilter | CrawlDB filter | Gone records removed | The total count of DB_GONE (HTTP 404) records deleted from the CrawlDB during an update.

See NUTCH-1101 for more details.

CrawlDB filter | Orphan records removed | The total count of orphaned pages, i.e., pages which no longer have any other pages linking to them, deleted from the CrawlDB during an update.

See NUTCH-1932 for more details.

CrawlDB filter | URLs filtered | The total count of filtered pages, i.e., pages which did not pass one or more URLFilter implementation(s), deleted from the CrawlDB during an update.

This metric is generally useful for determining the overall effectiveness of URLFilter plugins over time.

POSSIBLE IMPROVEMENT: This metric could be improved if an association could be made between the URL that was filtered and the URLFilter which filtered it. This would facilitate aggregating URL filtering results by URLFilter.
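
For reference, the sketch below illustrates the URLFilter contract behind these rejections; the class name and rejection rule are hypothetical, not a stock plugin. Any per-filter attribution along the lines of the improvement above would have to be layered on top of this interface.

// A minimal sketch of the org.apache.nutch.net.URLFilter contract:
// filter() returns the URL (possibly rewritten) to accept it, or null
// to reject it. Rejected URLs are what the "URLs filtered" counter
// tallies during a CrawlDB update. The class name and rule below are
// purely hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.URLFilter;

public class BlockQueryStringFilter implements URLFilter {

  private Configuration conf;

  @Override
  public String filter(String urlString) {
    // Hypothetical rule: reject any URL carrying a query string.
    return urlString.contains("?") ? null : urlString;
  }

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  @Override
  public Configuration getConf() {
    return conf;
  }
}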

CrawlDbReducer | CrawlDB status

CrawlDatum.getStatusName(CrawlDatum().getStatus())

With each URL able to have only one state at any given point in time, this metric facilitates aggregated counts of the different types of CrawlDatum states for a given CrawlDB.

The state of any given URL will change as the URL transitions through a crawl cycle. Available URL states are defined in the CrawlDatum class, e.g., STATUS_DB_UNFETCHED, STATUS_DB_FETCHED, STATUS_FETCH_SUCCESS, etc. Practically, CrawlDatum statuses are defined as byte values but accessed programmatically using static final constants.

This metric can be used to identify the presence of undesired CrawlDatum statuses for given URLs, e.g., STATUS_DB_GONE. Such an event could then trigger a cleaning/pruning operation.
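
As a rough illustration of how such per-status counts are produced, the following is a sketch assuming the CrawlDatum API with its byte status constants and getStatusName() helper; it is not the actual CrawlDbReducer code.

// A minimal sketch: each CrawlDatum's byte status is translated to its
// readable name (e.g., the name for STATUS_DB_FETCHED) and used as the
// counter name within the "CrawlDB status" group, yielding one
// aggregated count per status.
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.nutch.crawl.CrawlDatum;

public class CrawlDbStatusCountSketch {

  // Hypothetical helper invoked once per reduced CrawlDatum record.
  static void countStatus(Reducer<Text, CrawlDatum, Text, CrawlDatum>.Context context,
                          CrawlDatum datum) {
    String statusName = CrawlDatum.getStatusName(datum.getStatus());
    context.getCounter("CrawlDB status", statusName).increment(1);
  }
}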

DeduplicationJob | DeduplicationJobStatus | Documents marked as duplicate | The total number of duplicate documents in the CrawlDB.

The process of identifying (near) duplicate documents is of vital importance within the context of a search engine. The precision of any given information retrieval system can be negatively impacted if (near) duplicates are not identified and handled correctly. This does not always mean removing them; for example, (near) duplicates may be important for versioning purposes. In most cases, however, it is preferred to identify and remove (near) duplicate records.

The Deduplication algorithm in Nutch groups fetched URLs with the same digest and marks all of them as duplicate except the one with the highest score (based on the score in the crawldb, which is not necessarily the same as the score indexed). If two (or more) documents have the same score, then the document with the latest timestamp is kept. If the documents have the same timestamp then the one with the shortest URL is kept.

A duplicate record will have a CrawlDatum status of CrawlDatum.STATUS_DB_DUPLICATE.
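
The selection rules above can be sketched roughly as follows. This is an illustrative helper over two entries sharing the same digest, not the actual DeduplicationJob reducer; the Entry holder type is hypothetical, while the CrawlDatum accessors are the standard ones.

// A minimal sketch of the duplicate-selection rules described above:
// among documents with the same digest, keep the highest CrawlDB score,
// break ties by the latest fetch time, then by the shortest URL. The
// entry not kept would be marked CrawlDatum.STATUS_DB_DUPLICATE.
import org.apache.nutch.crawl.CrawlDatum;

public class DedupSelectionSketch {

  // Hypothetical holder pairing a URL with its CrawlDatum.
  static class Entry {
    String url;
    CrawlDatum datum;
  }

  /** Returns the entry to keep out of two entries sharing the same digest. */
  static Entry keep(Entry a, Entry b) {
    if (a.datum.getScore() != b.datum.getScore()) {
      return a.datum.getScore() > b.datum.getScore() ? a : b;
    }
    if (a.datum.getFetchTime() != b.datum.getFetchTime()) {
      return a.datum.getFetchTime() > b.datum.getFetchTime() ? a : b;
    }
    return a.url.length() <= b.url.length() ? a : b;
  }
}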

DomainStatistics

N/A | MyCounter.EMPTY_RESULT | The total count of empty (probably problematic) URL records for a given host, domain, suffix or top-level domain. | It is possible that the DomainStatistics tool may identify an empty record for a given URL. This may happen regardless of whether the tool is invoked to retrieve host, domain, suffix or top-level domain statistics. When this discovery event occurs, it is likely that some investigation would take place to understand why. For example, the CrawlDbReader could be invoked with the -url command line argument to further debug/detail what CrawlDatum data exists.
N/A | MyCounter.FETCHED | The total count of fetched URL records for a given host, domain, suffix or top-level domain. | This metric is particularly useful for quickly drilling down through large datasets to determine, for example, how much 'coverage' has been achieved for a given host, domain, suffix or top-level domain. This figure can be compared to a website administrator's total.
N/A | MyCounter.NOT_FETCHED | The total count of unfetched URL records for a given host, domain, suffix or top-level domain. | This metric is particularly useful for quickly drilling down through large datasets to determine, for example, how much 'coverage' still has to be achieved for a given host, domain, suffix or top-level domain. When combined with the fetched figure and compared to a website administrator's total, it can provide useful insight.

Fetcher
FetcherStatus | bytes_downloaded | The total bytes of fetched data acquired across the Fetcher Mapper task(s).

Over time, this can be used to benchmark how much data movement is occurring over the Nutch crawl network.

POSSIBLE IMPROVEMENT: This metric could be improved if a correlation could be made between the volume of data and the source it came from, whether that be a given host, domain, suffix or top-level domain.
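
As one way such figures could feed a data-driven benchmark, the sketch below reads the counter from a completed Hadoop job via the standard counters API; the job variable is assumed to be the finished fetch job, and the group and counter names follow this table.

// A minimal sketch, assuming the standard Hadoop MapReduce counters API:
// once a fetch job has completed, its aggregated FetcherStatus counters
// can be read and, for example, logged or pushed to a monitoring system.
import java.io.IOException;

import org.apache.hadoop.mapreduce.Job;

public class FetcherBytesSketch {

  // 'job' is assumed to be the completed fetch Job instance.
  static long bytesDownloaded(Job job) throws IOException {
    return job.getCounters()
        .findCounter("FetcherStatus", "bytes_downloaded")
        .getValue();
  }
}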

FetcherStatus | hitByThrougputThreshold | A total count of the URLs dropped across all fetch queues due to throughput dropping below the threshold too many times.

This aspect of the Nutch Fetcher configuration is designed to prevent slow fetch queues from stalling the overall fetcher throughput. However, it usually has the effect of increasing the latency of the affected URLs actually being fetched: URLs dropped because of low throughput are essentially shelved until a future fetch operation.

The specific configuration setting is:

<property>
  <name>fetcher.throughput.threshold.pages</name>
  <value>-1</value>
  <description>The threshold of minimum pages per second. If the fetcher downloads less
  pages per second than the configured threshold, the fetcher stops, preventing slow queue's
  from stalling the throughput. This threshold must be an integer. This can be useful when
  fetcher.timelimit.mins is hard to determine. The default value of -1 disables this check.
  </description>
</property>

A more thorough understanding of Fetcher behaviour relating to (slow) throughput also requires familiarity with the following configuration settings:

<property>
  <name>fetcher.throughput.threshold.retries</name>
  <value>5</value>
  <description>The number of times the fetcher.throughput.threshold.pages is allowed to be exceeded.
  This settings prevents accidental slow downs from immediately killing the fetcher thread.
  </description>
</property>

<property>
  <name>fetcher.throughput.threshold.check.after</name>
  <value>5</value>
  <description>The number of minutes after which the throughput check is enabled.</description>
</property>

POSSIBLE IMPROVEMENT: It would be advantageous to understand which URLs from which hosts in the queue(s) were resulting in slow throughput. This would facilitate investigation into why this was happening.

FetcherStatus | hitByTimeLimit | A total count of the URLs dropped across all fetch queues due to the fetcher execution time limit being exceeded.

This metric is valuable for quantifying the number of URLs which have effectively been shelved for future fetching because the overall fetcher runtime exceeded a predefined time limit.

Although by default the Fetcher never times out (the configuration value is -1), if a time limit is preferred then the following configuration property can be edited:

<property>
  <name>fetcher.timelimit.mins</name>
  <value>-1</value>
  <description>This is the number of minutes allocated to the fetching.
  Once this value is reached, any remaining entry from the input URL list is skipped 
  and all active queues are emptied. The default value of -1 deactivates the time limit.
  </description>
</property>

POSSIBLE IMPROVEMENT: It could be useful to record the fact that a URL was shelved because it was hit by the time limit. This could possibly be stored in the CrawlDatum metadata, for example as sketched below.
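
Purely to illustrate that suggestion, here is a sketch of tagging the CrawlDatum via its metadata map; the key name is hypothetical, and CrawlDatum metadata is assumed to be a Hadoop MapWritable of Writable key/value pairs.

// A minimal sketch of the improvement suggested above: marking a
// CrawlDatum whose fetch was skipped because the fetcher time limit was
// hit, so the event survives into the CrawlDB. The key name is hypothetical.
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;

public class TimeLimitMarkerSketch {

  private static final Text HIT_BY_TIME_LIMIT = new Text("_hitByTimeLimit_");

  static void markHitByTimeLimit(CrawlDatum datum, long timestampMillis) {
    datum.getMetaData().put(HIT_BY_TIME_LIMIT, new Text(Long.toString(timestampMillis)));
  }
}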

Also see NUTCH-2910.

FetcherThread

FetcherStatus | AboveExceptionThresholdInQueue
FetcherStatus | FetchItem.notCreated.redirect
FetcherStatus | outlinks_detected
FetcherStatus | outlinks_following
FetcherStatus | ProtocolStatus.getName()
FetcherStatus | redirect_count_exceeded
FetcherStatus | redirect_deduplicated
FetcherStatus | robots_denied
FetcherStatus | robots_denied_maxcrawldelay
ParserStatus | ParseStatus.majorCodes[p.getData().getStatus().getMajorCode()]

Generator

Generator | EXPR_REJECTED
Generator | HOSTS_AFFECTED_PER_HOST_OVERFLOW
Generator | INTERVAL_REJECTED
Generator | MALFORMED_URL
Generator | SCHEDULE_REJECTED
Generator | SCORE_TOO_LOW
Generator | STATUS_REJECTED
Generator | URLS_SKIPPED_PER_HOST_OVERFLOW

IndexerMapReduce

IndexerStatus | deleted (duplicates)
IndexerStatus | deleted (IndexingFilter)
IndexerStatus | deleted (gone)
IndexerStatus | deleted (redirects)
IndexerStatus | deleted (robots=noindex)
IndexerStatus | errors (IndexingFilter)
IndexerStatus | errors (ScoringFilter)
IndexerStatus | indexed (add/update)
IndexerStatus | skipped (IndexingFilter)
IndexerStatus | skipped (not modified)

Injector

injector | urls_filtered
injector | urls_injected
injector | urls_merged
injector | urls_purged_404
injector | urls_purged_filter

ParseSegment | ParserStatus | ParseStatus.majorCodes[parseStatus.getMajorCode()]

QueueFeeder | FetcherStatus | filtered
(also QueueFeeder) | FetcherStatus | AboveExceptionThresholdInQueue

ResolverThread

UpdateHostDb | checked_hosts
UpdateHostDb | existing_known_host
UpdateHostDb | existing_unknown_host
UpdateHostDb | new_known_host
UpdateHostDb | new_unknown_host
UpdateHostDb | purged_unknown_host
UpdateHostDb | rediscovered_host
UpdateHostDb | Long.toString(datum.numFailures()) + "_times_failed"

SitemapProcessor

Sitemap | existing_sitemap_entries
Sitemap | failed_fetches
Sitemap | filtered_records
Sitemap | filtered_sitemaps_from_hostname
Sitemap | new_sitemap_entries
Sitemap | sitemaps_from_hostname
Sitemap | sitemap_seeds

UpdateHostDbMapper | UpdateHostDb | filtered_records

UpdateHostDbReducer | UpdateHostDb | total_hosts
(also UpdateHostDbReducer) | UpdateHostDb | skipped_not_eligible

WebGraph | WebGraph.outlinks | added links
(also WebGraph) | WebGraph.outlinks | removed links

WARCExporter

WARCExporter | exception
WARCExporter | invalid URI
WARCExporter | missing content
WARCExporter | missing metadata
WARCExporter | omitted empty response
WARCExporter | records generated

Conclusion

This document aims to provide a detailed account of Nutch application metrics such that a data-driven approach can be adopted to better manage Nutch operations. Several cells in the Usage and Comments column of the above table offer areas for POSSIBLE IMPROVEMENT. These suggestions are targeted at Nutch crawler administrators and developers interested in improving Nutch metrics.