...
| Tool/Object | Metric Group | Metric Name | Description | Usage and Comments |
|---|---|---|---|---|
| CleaningJob | CleaningJobStatus | Deleted documents | The total count of DB_GONE (HTTP 404) and/or DB_DUPLICATE documents ultimately deleted from the indexer(s). | This metric is useful for determining whether filtering or duplicate detection needs to happen further upstream, prior to indexing. Ideally, DB_GONE and DB_DUPLICATE documents would not make it into production indices in the first place. |
| CrawlDbFilter | CrawlDB filter | Gone records removed | The total count of DB_GONE (HTTP 404) records deleted from the CrawlDB during an update. | |
| | CrawlDB filter | Orphan records removed | The total count of orphaned pages, i.e., pages which no longer have any other pages linking to them, deleted from the CrawlDB during an update. | |
| | CrawlDB filter | URLs filtered | The total count of filtered pages, e.g., pages which did not pass one or more URLFilter implementation(s), deleted from the CrawlDB during an update. | This metric is generally useful for determining the overall effectiveness of URLFilter plugins over time. It could be improved if an association could be made between the filtered URL and the URLFilter which filtered it; this would facilitate aggregating URL-filtering results by URLFilter. |
| CrawlDbReducer | CrawlDB status | CrawlDatum.getStatusName(CrawlDatum().getStatus()) | With each URL able to have only one state at any given point in time, this metric facilitates aggregated counts of the different CrawlDatum states in a given CrawlDB. | The state of any given URL changes as the URL transitions through a crawl cycle. Available URL states are defined in CrawlDatum, e.g., STATUS_DB_UNFETCHED, STATUS_DB_FETCHED, STATUS_FETCH_SUCCESS, etc. In practice, CrawlDatum statuses are defined as byte values but accessed programmatically through static final constants. This metric can be used to identify the presence of undesired CrawlDatum statuses for given URLs, e.g., STATUS_DB_GONE; such an event could then trigger a cleaning/pruning operation (see the status-check sketch below the table). |
| DeduplicationJob | DeduplicationJobStatus | Documents marked as duplicate | The total number of duplicate documents in the CrawlDB. | Identifying (near) duplicate documents is of vital importance within the context of a search engine; the precision of any information retrieval system can be negatively impacted if (near) duplicates are not identified and handled correctly. This does not always mean removing them; (near) duplicates may, for example, be important for versioning purposes. In most cases, however, it is preferable to identify and remove (near) duplicate records. The deduplication algorithm in Nutch groups fetched URLs with the same digest and marks all of them as duplicates except the one with the highest score (based on the score in the CrawlDB, which is not necessarily the same as the score indexed). If two (or more) documents have the same score, the document with the latest timestamp is kept; if the documents also have the same timestamp, the one with the shortest URL is kept (see the comparator sketch below the table). A duplicate record will have a CrawlDatum status of CrawlDatum.STATUS_DB_DUPLICATE. |
| DomainStatistics | N/A | MyCounter.EMPTY_RESULT | | |
| | N/A | MyCounter.FETCHED | | |
| | N/A | MyCounter.NOT_FETCHED | | |
| Fetcher | FetcherStatus | bytes_downloaded | | |
| | FetcherStatus | hitByThrougputThreshold | | |
| | FetcherStatus | hitByTimeLimit | | |
| FetcherThread | FetcherStatus | AboveExceptionThresholdInQueue | | |
| | FetcherStatus | FetchItem.notCreated.redirect | | |
| | FetcherStatus | outlinks_detected | | |
| | FetcherStatus | outlinks_following | | |
| | FetcherStatus | ProtocolStatus.getName() | | |
| | FetcherStatus | redirect_count_exceeded | | |
| | FetcherStatus | redirect_deduplicated | | |
| | FetcherStatus | robots_denied | | |
| | FetcherStatus | robots_denied_maxcrawldelay | | |
| | ParserStatus | ParseStatus.majorCodes[p.getData().getStatus().getMajorCode()] | | |
| Generator | Generator | EXPR_REJECTED | | |
| | Generator | HOSTS_AFFECTED_PER_HOST_OVERFLOW | | |
| | Generator | INTERVAL_REJECTED | | |
| | Generator | MALFORMED_URL | | |
| | Generator | SCHEDULE_REJECTED | | |
| | Generator | SCORE_TOO_LOW | | |
| | Generator | STATUS_REJECTED | | |
| | Generator | URLS_SKIPPED_PER_HOST_OVERFLOW | | |
| IndexerMapReduce | IndexerStatus | deleted (duplicates) | | |
| | IndexerStatus | deleted (IndexingFilter) | | |
| | IndexerStatus | deleted (gone) | | |
| | IndexerStatus | deleted (redirects) | | |
| | IndexerStatus | deleted (robots=noindex) | | |
| | IndexerStatus | errors (IndexingFilter) | | |
| | IndexerStatus | errors (ScoringFilter) | | |
| | IndexerStatus | indexed (add/update) | | |
| | IndexerStatus | skipped (IndexingFilter) | | |
| | IndexerStatus | skipped (not modified) | | |
| Injector | injector | urls_filtered | | |
| | injector | urls_injected | | |
| | injector | urls_merged | | |
| | injector | urls_purged_404 | | |
| | injector | urls_purged_filter | | |
| ParseSegment | ParserStatus | ParseStatus.majorCodes[parseStatus.getMajorCode()] | | |
| QueueFeeder | FetcherStatus | filtered | | |
| | FetcherStatus | AboveExceptionThresholdInQueue | | |
| ResolverThread | UpdateHostDb | checked_hosts | | |
| | UpdateHostDb | existing_known_host | | |
| | UpdateHostDb | existing_unknown_host | | |
| | UpdateHostDb | new_known_host | | |
| | UpdateHostDb | new_unknown_host | | |
| | UpdateHostDb | purged_unknown_host | | |
| | UpdateHostDb | rediscovered_host | | |
| | UpdateHostDb | Long.toString(datum.numFailures()) + "_times_failed" | | |
| SitemapProcessor | Sitemap | existing_sitemap_entries | | |
| | Sitemap | failed_fetches | | |
| | Sitemap | filtered_records | | |
| | Sitemap | filtered_sitemaps_from_hostname | | |
| | Sitemap | new_sitemap_entries | | |
| | Sitemap | sitemaps_from_hostname | | |
| | Sitemap | sitemap_seeds | | |
| UpdateHostDbMapper | UpdateHostDb | filtered_records | | |
| UpdateHostDbReducer | UpdateHostDb | total_hosts | | |
| | UpdateHostDb | skipped_not_eligible | | |
| WebGraph | WebGraph.outlinks | added links | | |
| | WebGraph.outlinks | removed links | | |
| WARCExporter | WARCExporter | exception | | |
| | WARCExporter | invalid URI | | |
| | WARCExporter | missing content | | |
| | WARCExporter | missing metadata | | |
| | WARCExporter | omitted empty response | | |
| | WARCExporter | records generated | | |
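
The status-check sketch referenced in the CrawlDbReducer row above: a minimal illustration of how the byte-valued CrawlDatum status constants and `CrawlDatum.getStatusName(...)` might be used to flag CrawlDB entries for cleaning. The `needsPruning` helper is a hypothetical name for illustration, not part of Nutch.

```java
import org.apache.nutch.crawl.CrawlDatum;

public class CrawlDatumStatusCheck {

  /**
   * Hypothetical helper: flags CrawlDB entries whose status suggests they
   * should be cleaned or pruned rather than kept in production indices.
   */
  static boolean needsPruning(CrawlDatum datum) {
    byte status = datum.getStatus();
    // Same call the "CrawlDB status" metric group uses to label its counters.
    System.out.println("status: " + CrawlDatum.getStatusName(status));
    return status == CrawlDatum.STATUS_DB_GONE
        || status == CrawlDatum.STATUS_DB_DUPLICATE;
  }
}
```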
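The comparator sketch referenced in the DeduplicationJob row: a self-contained rendering of the tie-break order described above (highest score, then latest timestamp, then shortest URL). The `Doc` record and `KEEP_FIRST` comparator are illustrative stand-ins, not Nutch's actual deduplication classes.

```java
import java.util.Comparator;
import java.util.List;

public class DedupTieBreakSketch {

  /** Hypothetical stand-in for the per-document fields the tie-break inspects. */
  record Doc(String url, float score, long fetchTime) {}

  // Sorts a group of same-digest documents so the one to KEEP comes first:
  // highest CrawlDB score, then latest fetch timestamp, then shortest URL.
  static final Comparator<Doc> KEEP_FIRST =
      Comparator.comparingDouble(Doc::score).reversed()
          .thenComparing(Comparator.comparingLong(Doc::fetchTime).reversed())
          .thenComparingInt(d -> d.url().length());

  public static void main(String[] args) {
    List<Doc> sameDigest = List.of(
        new Doc("https://example.org/page?session=abc123", 1.0f, 2000L),
        new Doc("https://example.org/page", 1.0f, 2000L), // kept: shortest URL
        new Doc("https://example.org/other", 0.5f, 3000L));

    Doc kept = sameDigest.stream().min(KEEP_FIRST).orElseThrow();
    System.out.println("kept: " + kept.url());
    // Every other document in the group would get CrawlDatum.STATUS_DB_DUPLICATE.
  }
}
```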
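All of the metrics above are Hadoop MapReduce counters keyed by the Metric Group / Metric Name pairs shown in the table, so they can also be read programmatically once a job completes. A minimal sketch, assuming you hold a reference to the finished `org.apache.hadoop.mapreduce.Job` (e.g., from a custom crawl driver); the class and method names here are illustrative:

```java
import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;

public class FetcherCounterReport {

  /** Prints two of the FetcherStatus counters from the table for a finished job. */
  static void report(Job job) throws Exception {
    Counters counters = job.getCounters();

    // The group/name strings match the "Metric Group" / "Metric Name" columns.
    Counter bytes  = counters.findCounter("FetcherStatus", "bytes_downloaded");
    Counter denied = counters.findCounter("FetcherStatus", "robots_denied");

    System.out.println("bytes_downloaded = " + bytes.getValue());
    System.out.println("robots_denied    = " + denied.getValue());
  }
}
```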
...