...
| Tool/Object | Metric Group | Metric Name | Description | Usage and Comments |
|---|---|---|---|---|
| CleaningJob | CleaningJobStatus | Deleted documents | The total count of DB_GONE (HTTP 404) and/or DB_DUPLICATE documents ultimately deleted from the indexer(s). | This metric is useful for determining whether filtering or duplicate detection needs to happen further upstream, prior to indexing. Ideally, DB_GONE and DB_DUPLICATE documents would not make it into production indices in the first place. |
| CrawlDbFilter | CrawlDB filter | Gone records removed | The total count of DB_GONE (HTTP 404) records deleted from the CrawlDB during an update. | |
| | CrawlDB filter | Orphan records removed | The total count of orphaned pages, i.e., pages which no longer have any other pages linking to them, deleted from the CrawlDB during an update. | |
| | CrawlDB filter | URLs filtered | The total count of filtered pages, e.g., pages which did not pass one or more URLFilter implementation(s), deleted from the CrawlDB during an update. | This metric is generally useful for determining the overall effectiveness of URLFilter plugins over time. It could be improved if an association could be made between the filtered URL and the URLFilter which filtered it; this would facilitate aggregating URL-filtering results by URLFilter. |
| CrawlDbReducer | CrawlDB status | CrawlDatum.getStatusName(CrawlDatum().getStatus()) | With each URL able to have only one state at any given point in time, this metric facilitates aggregated counts of the different CrawlDatum states in a given CrawlDB. | The state of any given URL changes as the URL transitions through a crawl cycle. Available URL states are defined in CrawlDatum, e.g., STATUS_DB_UNFETCHED, STATUS_DB_FETCHED, STATUS_FETCH_SUCCESS, etc. In practice, CrawlDatum statuses are defined as byte values but accessed programmatically through static final constants. This metric can be used to identify the presence of undesired CrawlDatum statuses for given URLs, e.g., STATUS_DB_GONE; such an event could then trigger a cleaning/pruning operation (see the status-check sketch below the table). |
| DeduplicationJob | DeduplicationJobStatus | Documents marked as duplicate | The total number of duplicate documents in the CrawlDB. | Identifying (near) duplicate documents is of vital importance within the context of a search engine; the precision of any information retrieval system can be negatively impacted if (near) duplicates are not identified and handled correctly. This does not always mean removing them; (near) duplicates may, for example, be important for versioning purposes. In most cases, however, it is preferable to identify and remove (near) duplicate records. The deduplication algorithm in Nutch groups fetched URLs with the same digest and marks all of them as duplicates except the one with the highest score (based on the score in the CrawlDB, which is not necessarily the same as the score indexed). If two (or more) documents have the same score, the document with the latest timestamp is kept; if the documents also have the same timestamp, the one with the shortest URL is kept (see the comparator sketch below the table). A duplicate record will have a CrawlDatum status of CrawlDatum.STATUS_DB_DUPLICATE. |
| DomainStatistics | N/A | MyCounter.EMPTY_RESULT | | |
| | N/A | MyCounter.FETCHED | | |
| | N/A | MyCounter.NOT_FETCHED | | |
| Fetcher | FetcherStatus | bytes_downloaded | | |
| | FetcherStatus | hitByThrougputThreshold | | |
| | FetcherStatus | hitByTimeLimit | | |
| FetcherThread | FetcherStatus | AboveExceptionThresholdInQueue | | |
| | FetcherStatus | FetchItem.notCreated.redirect | | |
| | FetcherStatus | outlinks_detected | | |
| | FetcherStatus | outlinks_following | | |
| | FetcherStatus | ProtocolStatus.getName() | | |
| | FetcherStatus | redirect_count_exceeded | | |
| | FetcherStatus | redirect_deduplicated | | |
| | FetcherStatus | robots_denied | | |
| | FetcherStatus | robots_denied_maxcrawldelay | | |
| | ParserStatus | ParseStatus.majorCodes[p.getData().getStatus().getMajorCode()] | | |
| Generator | Generator | EXPR_REJECTED | | |
| | Generator | HOSTS_AFFECTED_PER_HOST_OVERFLOW | | |
| | Generator | INTERVAL_REJECTED | | |
| | Generator | MALFORMED_URL | | |
| | Generator | SCHEDULE_REJECTED | | |
| | Generator | SCORE_TOO_LOW | | |
| | Generator | STATUS_REJECTED | | |
| | Generator | URLS_SKIPPED_PER_HOST_OVERFLOW | | |
| IndexerMapReduce | IndexerStatus | deleted (duplicates) | | |
| | IndexerStatus | deleted (IndexingFilter) | | |
| | IndexerStatus | deleted (gone) | | |
| | IndexerStatus | deleted (redirects) | | |
| | IndexerStatus | deleted (robots=noindex) | | |
| | IndexerStatus | errors (IndexingFilter) | | |
| | IndexerStatus | errors (ScoringFilter) | | |
| | IndexerStatus | indexed (add/update) | | |
| | IndexerStatus | skipped (IndexingFilter) | | |
| | IndexerStatus | skipped (not modified) | | |
| Injector | injector | urls_filtered | | |
| | injector | urls_injected | | |
| | injector | urls_merged | | |
| | injector | urls_purged_404 | | |
| | injector | urls_purged_filter | | |
| ParseSegment | ParserStatus | ParseStatus.majorCodes[parseStatus.getMajorCode()] | | |
| QueueFeeder | FetcherStatus | filtered | | |
| | FetcherStatus | AboveExceptionThresholdInQueue | | |
| ResolverThread | UpdateHostDb | checked_hosts | | |
| | UpdateHostDb | existing_known_host | | |
| | UpdateHostDb | existing_unknown_host | | |
| | UpdateHostDb | new_known_host | | |
| | UpdateHostDb | new_unknown_host | | |
| | UpdateHostDb | purged_unknown_host | | |
| | UpdateHostDb | rediscovered_host | | |
| | UpdateHostDb | Long.toString(datum.numFailures()) + "_times_failed" | | |
| SitemapProcessor | Sitemap | existing_sitemap_entries | | |
| | Sitemap | failed_fetches | | |
| | Sitemap | filtered_records | | |
| | Sitemap | filtered_sitemaps_from_hostname | | |
| | Sitemap | new_sitemap_entries | | |
| | Sitemap | sitemaps_from_hostname | | |
| | Sitemap | sitemap_seeds | | |
| UpdateHostDbMapper | UpdateHostDb | filtered_records | | |
| UpdateHostDbReducer | UpdateHostDb | total_hosts | | |
| | UpdateHostDb | skipped_not_eligible | | |
| WebGraph | WebGraph.outlinks | added links | | |
| | WebGraph.outlinks | removed links | | |
| WARCExporter | WARCExporter | exception | | |
| | WARCExporter | invalid URI | | |
| | WARCExporter | missing content | | |
| | WARCExporter | missing metadata | | |
| | WARCExporter | omitted empty response | | |
| | WARCExporter | records generated | | |
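
The status-check sketch referenced in the CrawlDbReducer row above: a minimal illustration of how the byte-valued CrawlDatum status constants and `CrawlDatum.getStatusName(...)` might be used to flag CrawlDB entries for cleaning. The `needsPruning` helper is a hypothetical name for illustration, not part of Nutch.

```java
import org.apache.nutch.crawl.CrawlDatum;

public class CrawlDatumStatusCheck {

  /**
   * Hypothetical helper: flags CrawlDB entries whose status suggests they
   * should be cleaned or pruned rather than kept in production indices.
   */
  static boolean needsPruning(CrawlDatum datum) {
    byte status = datum.getStatus();
    // Same call the "CrawlDB status" metric group uses to label its counters.
    System.out.println("status: " + CrawlDatum.getStatusName(status));
    return status == CrawlDatum.STATUS_DB_GONE
        || status == CrawlDatum.STATUS_DB_DUPLICATE;
  }
}
```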
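The comparator sketch referenced in the DeduplicationJob row: a self-contained rendering of the tie-break order described above (highest score, then latest timestamp, then shortest URL). The `Doc` record and `KEEP_FIRST` comparator are illustrative stand-ins, not Nutch's actual deduplication classes.

```java
import java.util.Comparator;
import java.util.List;

public class DedupTieBreakSketch {

  /** Hypothetical stand-in for the per-document fields the tie-break inspects. */
  record Doc(String url, float score, long fetchTime) {}

  // Sorts a group of same-digest documents so the one to KEEP comes first:
  // highest CrawlDB score, then latest fetch timestamp, then shortest URL.
  static final Comparator<Doc> KEEP_FIRST =
      Comparator.comparingDouble(Doc::score).reversed()
          .thenComparing(Comparator.comparingLong(Doc::fetchTime).reversed())
          .thenComparingInt(d -> d.url().length());

  public static void main(String[] args) {
    List<Doc> sameDigest = List.of(
        new Doc("https://example.org/page?session=abc123", 1.0f, 2000L),
        new Doc("https://example.org/page", 1.0f, 2000L), // kept: shortest URL
        new Doc("https://example.org/other", 0.5f, 3000L));

    Doc kept = sameDigest.stream().min(KEEP_FIRST).orElseThrow();
    System.out.println("kept: " + kept.url());
    // Every other document in the group would get CrawlDatum.STATUS_DB_DUPLICATE.
  }
}
```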
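All of the metrics above are Hadoop MapReduce counters keyed by the Metric Group / Metric Name pairs shown in the table, so they can also be read programmatically once a job completes. A minimal sketch, assuming you hold a reference to the finished `org.apache.hadoop.mapreduce.Job` (e.g., from a custom crawl driver); the class and method names here are illustrative:

```java
import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;

public class FetcherCounterReport {

  /** Prints two of the FetcherStatus counters from the table for a finished job. */
  static void report(Job job) throws Exception {
    Counters counters = job.getCounters();

    // The group/name strings match the "Metric Group" / "Metric Name" columns.
    Counter bytes  = counters.findCounter("FetcherStatus", "bytes_downloaded");
    Counter denied = counters.findCounter("FetcherStatus", "robots_denied");

    System.out.println("bytes_downloaded = " + bytes.getValue());
    System.out.println("robots_denied    = " + denied.getValue());
  }
}
```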
...