...
Tool/Object | Metric Group | Metric Name | Description | Usage and Comments |
---|---|---|---|---|
CleaningJob | CleaningJobStatus | Deleted documents | The total count of DB_GONE (HTTP 404) and/or DB_DUPLICATE documents ultimately cleaned (deleted) from the indexer(s). | This metric is useful for determining whether filtering or duplicate detection needs to happen further upstream, prior to indexing. Ideally, DB_GONE and DB_DUPLICATE documents would not make it into production indices in the first place. |
CrawlDbFilter | CrawlDB filter | Gone records removed | The total count of DB_GONE (HTTP 404) records deleted from the CrawlDB during an update. | See |
CrawlDbFilter | CrawlDB filter | Orphan records removed | The total count of orphaned pages, i.e. pages that no other pages link to any more, deleted from the CrawlDB during an update. | See https://issues.apache.org/jira/browse/NUTCH-1932 for more details. |
CrawlDbFilter | CrawlDB filter | URLs filtered | | |
CrawlDbReducer | CrawlDB status | CrawlDatum.getStatusName(CrawlDatum().getStatus()) | | |
DeduplicationJob | DeduplicationJobStatus | Documents marked as duplicate | | |
DomainStatistics | N/A | MyCounter.EMPTY_RESULT | | |
DomainStatistics | N/A | MyCounter.FETCHED | | |
DomainStatistics | N/A | MyCounter.NOT_FETCHED | | |
Fetcher | FetcherStatus | bytes_downloaded | | |
Fetcher | FetcherStatus | hitByThrougputThreshold | | |
Fetcher | FetcherStatus | hitByTimeLimit | | |
FetcherThread | FetcherStatus | AboveExceptionThresholdInQueue | | |
FetcherThread | FetcherStatus | FetchItem.notCreated.redirect | | |
FetcherThread | FetcherStatus | outlinks_detected | | |
FetcherThread | FetcherStatus | outlinks_following | | |
FetcherThread | FetcherStatus | ProtocolStatus.getName() | | |
FetcherThread | FetcherStatus | redirect_count_exceeded | | |
FetcherThread | FetcherStatus | redirect_deduplicated | | |
FetcherThread | FetcherStatus | robots_denied | | |
FetcherThread | FetcherStatus | robots_denied_maxcrawldelay | | |
FetcherThread | ParserStatus | ParseStatus.majorCodes[p.getData().getStatus().getMajorCode()] | | |
Generator | Generator | EXPR_REJECTED | | |
Generator | Generator | HOSTS_AFFECTED_PER_HOST_OVERFLOW | | |
Generator | Generator | INTERVAL_REJECTED | | |
Generator | Generator | MALFORMED_URL | | |
Generator | Generator | SCHEDULE_REJECTED | | |
Generator | Generator | SCORE_TOO_LOW | | |
Generator | Generator | STATUS_REJECTED | | |
Generator | Generator | URLS_SKIPPED_PER_HOST_OVERFLOW | | |
IndexerMapReduce | IndexerStatus | deleted (duplicates) | | |
IndexerMapReduce | IndexerStatus | deleted (IndexingFilter) | | |
IndexerMapReduce | IndexerStatus | deleted (gone) | | |
IndexerMapReduce | IndexerStatus | deleted (redirects) | | |
IndexerMapReduce | IndexerStatus | deleted (robots=noindex) | | |
IndexerMapReduce | IndexerStatus | errors (IndexingFilter) | | |
IndexerMapReduce | IndexerStatus | errors (ScoringFilter) | | |
IndexerMapReduce | IndexerStatus | indexed (add/update) | | |
IndexerMapReduce | IndexerStatus | skipped (IndexingFilter) | | |
IndexerMapReduce | IndexerStatus | skipped (not modified) | | |
Injector | injector | urls_filtered | | |
Injector | injector | urls_injected | | |
Injector | injector | urls_merged | | |
Injector | injector | urls_purged_404 | | |
Injector | injector | urls_purged_filter | | |
ParseSegment | ParserStatus | ParseStatus.majorCodes[parseStatus.getMajorCode()] | | |
QueueFeeder | FetcherStatus | filtered | | |
(also QueueFeeder) | FetcherStatus | AboveExceptionThresholdInQueue | | |
ResolverThread | UpdateHostDb | checked_hosts | | |
ResolverThread | UpdateHostDb | existing_known_host | | |
ResolverThread | UpdateHostDb | existing_unknown_host | | |
ResolverThread | UpdateHostDb | new_known_host | | |
ResolverThread | UpdateHostDb | new_unknown_host | | |
ResolverThread | UpdateHostDb | purged_unknown_host | | |
ResolverThread | UpdateHostDb | rediscovered_host | | |
ResolverThread | UpdateHostDb | Long.toString(datum.numFailures()) + "_times_failed" | | |
SitemapProcessor | Sitemap | existing_sitemap_entries | | |
SitemapProcessor | Sitemap | failed_fetches | | |
SitemapProcessor | Sitemap | filtered_records | | |
SitemapProcessor | Sitemap | filtered_sitemaps_from_hostname | | |
SitemapProcessor | Sitemap | new_sitemap_entries | | |
SitemapProcessor | Sitemap | sitemaps_from_hostname | | |
SitemapProcessor | Sitemap | sitemap_seeds | | |
UpdateHostDbMapper | UpdateHostDb | filtered_records | | |
UpdateHostDbReducer | UpdateHostDb | total_hosts | | |
(also UpdateHostDbReducer) | UpdateHostDb | skipped_not_eligible | | |
WebGraph | WebGraph.outlinks | added links | | |
(also WebGraph) | WebGraph.outlinks | removed links | | |
WARCExporter | WARCExporter | exception | | |
WARCExporter | WARCExporter | invalid URI | | |
WARCExporter | WARCExporter | missing content | | |
WARCExporter | WARCExporter | missing metadata | | |
WARCExporter | WARCExporter | omitted empty response | | |
WARCExporter | WARCExporter | records generated | | |
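
Each row in the table identifies a metric by a (Metric Group, Metric Name) pair, and some names are Java expressions evaluated at runtime, such as `Long.toString(datum.numFailures()) + "_times_failed"`. The following is a minimal plain-Java sketch of that addressing scheme, offered only as an illustration. The `CounterSketch` class and its `increment`, `get` and `timesFailedName` methods are hypothetical names invented here; they are not part of the Nutch or Hadoop APIs.

```java
import java.util.Map;
import java.util.TreeMap;

/**
 * Hypothetical stand-in for a counter registry keyed by (metric group, metric name).
 * Illustration only; this is NOT the Hadoop or Nutch counter API.
 */
public class CounterSketch {
    // group -> (name -> count); TreeMap keeps the printed report sorted.
    private final Map<String, Map<String, Long>> groups = new TreeMap<>();

    /** Adds delta to the counter identified by group and name, creating it on first use. */
    public void increment(String group, String name, long delta) {
        groups.computeIfAbsent(group, g -> new TreeMap<>()).merge(name, delta, Long::sum);
    }

    /** Returns the current value, or 0 if the counter was never incremented. */
    public long get(String group, String name) {
        Map<String, Long> names = groups.get(group);
        if (names == null) {
            return 0L;
        }
        return names.getOrDefault(name, 0L);
    }

    /** Builds a runtime-computed name in the style of the UpdateHostDb "_times_failed" row. */
    public static String timesFailedName(long numFailures) {
        return Long.toString(numFailures) + "_times_failed";
    }

    public static void main(String[] args) {
        CounterSketch counters = new CounterSketch();
        // Static name: the same (group, name) pair always hits the same counter.
        counters.increment("FetcherStatus", "robots_denied", 1);
        counters.increment("FetcherStatus", "robots_denied", 2);
        // Computed name: each distinct failure count yields its own counter.
        counters.increment("UpdateHostDb", timesFailedName(3), 1);

        // Print one "group | name | value" line per counter.
        counters.groups.forEach((group, names) ->
            names.forEach((name, value) ->
                System.out.println(group + " | " + name + " | " + value)));
    }
}
```

Because computed names create one counter per distinct value, a report for such a row may contain several entries rather than one.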
...