...
| Tool/Object | Metric Group | Metric Name | Description | Usage and Comments |
|---|---|---|---|---|
| CleaningJob | CleaningJobStatus | Deleted documents | The total count of DB_GONE (HTTP 404) and/or DB_DUPLICATE documents ultimately deleted from the indexer(s). | This metric is useful for determining whether filtering or duplicate detection needs to happen further upstream, prior to indexing. Ideally, DB_GONE and DB_DUPLICATE documents would not make it into production indices in the first place. |
| CrawlDbFilter | CrawlDB filter | Gone records removed | The total count of DB_GONE (HTTP 404) records deleted from the CrawlDB during an update. | See |
| | CrawlDB filter | Orphan records removed | The total count of orphaned pages (pages that no other page links to any longer) deleted from the CrawlDB during an update. | See https://issues.apache.org/jira/browse/NUTCH-1932 See |
| | CrawlDB filter | URLs filtered | | |
| CrawlDbReducer | CrawlDB status | CrawlDatum.getStatusName(CrawlDatum().getStatus()) | | |
| DeduplicationJob | DeduplicationJobStatus | Documents marked as duplicate | | |
| DomainStatistics | N/A | MyCounter.EMPTY_RESULT | | |
| | N/A | MyCounter.FETCHED | | |
| | N/A | MyCounter.NOT_FETCHED | | |
| Fetcher | FetcherStatus | bytes_downloaded | | |
| | FetcherStatus | hitByThrougputThreshold | | |
| | FetcherStatus | hitByTimeLimit | | |
| FetcherThread | FetcherStatus | AboveExceptionThresholdInQueue | | |
| | FetcherStatus | FetchItem.notCreated.redirect | | |
| | FetcherStatus | outlinks_detected | | |
| | FetcherStatus | outlinks_following | | |
| | FetcherStatus | ProtocolStatus.getName() | | |
| | FetcherStatus | redirect_count_exceeded | | |
| | FetcherStatus | redirect_deduplicated | | |
| | FetcherStatus | robots_denied | | |
| | FetcherStatus | robots_denied_maxcrawldelay | | |
| | ParserStatus | ParseStatus.majorCodes[p.getData().getStatus().getMajorCode()] | | |
| Generator | Generator | EXPR_REJECTED | | |
| | Generator | HOSTS_AFFECTED_PER_HOST_OVERFLOW | | |
| | Generator | INTERVAL_REJECTED | | |
| | Generator | MALFORMED_URL | | |
| | Generator | SCHEDULE_REJECTED | | |
| | Generator | SCORE_TOO_LOW | | |
| | Generator | STATUS_REJECTED | | |
| | Generator | URLS_SKIPPED_PER_HOST_OVERFLOW | | |
| IndexerMapReduce | IndexerStatus | deleted (duplicates) | | |
| | IndexerStatus | deleted (IndexingFilter) | | |
| | IndexerStatus | deleted (gone) | | |
| | IndexerStatus | deleted (redirects) | | |
| | IndexerStatus | deleted (robots=noindex) | | |
| | IndexerStatus | errors (IndexingFilter) | | |
| | IndexerStatus | errors (ScoringFilter) | | |
| | IndexerStatus | indexed (add/update) | | |
| | IndexerStatus | skipped (IndexingFilter) | | |
| | IndexerStatus | skipped (not modified) | | |
| Injector | injector | urls_filtered | | |
| | injector | urls_injected | | |
| | injector | urls_merged | | |
| | injector | urls_purged_404 | | |
| | injector | urls_purged_filter | | |
| ParseSegment | ParserStatus | ParseStatus.majorCodes[parseStatus.getMajorCode()] | | |
| QueueFeeder | FetcherStatus | filtered | | |
| (also QueueFeeder) | FetcherStatus | AboveExceptionThresholdInQueue | | |
| ResolverThread | UpdateHostDb | checked_hosts | | |
| | UpdateHostDb | existing_known_host | | |
| | UpdateHostDb | existing_unknown_host | | |
| | UpdateHostDb | new_known_host | | |
| | UpdateHostDb | new_unknown_host | | |
| | UpdateHostDb | purged_unknown_host | | |
| | UpdateHostDb | rediscovered_host | | |
| | UpdateHostDb | Long.toString(datum.numFailures()) + "_times_failed" | | |
| SitemapProcessor | Sitemap | existing_sitemap_entries | | |
| | Sitemap | failed_fetches | | |
| | Sitemap | filtered_records | | |
| | Sitemap | filtered_sitemaps_from_hostname | | |
| | Sitemap | new_sitemap_entries | | |
| | Sitemap | sitemaps_from_hostname | | |
| | Sitemap | sitemap_seeds | | |
| UpdateHostDbMapper | UpdateHostDb | filtered_records | | |
| UpdateHostDbReducer | UpdateHostDb | total_hosts | | |
| (also UpdateHostDbReducer) | UpdateHostDb | skipped_not_eligible | | |
| WebGraph | WebGraph.outlinks | added links | | |
| (also WebGraph) | WebGraph.outlinks | removed links | | |
| WARCExporter | WARCExporter | exception | | |
| | WARCExporter | invalid URI | | |
| | WARCExporter | missing content | | |
| | WARCExporter | missing metadata | | |
| | WARCExporter | omitted empty response | | |
| | WARCExporter | records generated | | |
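
The metric groups and names in the table can be collected from a job's counter output for monitoring. The following is a minimal sketch, not part of Nutch itself. It assumes the counters arrive as a plain-text dump in which each group name sits on its own line, followed by `name=value` lines. The function name `parse_counters` and the sample values are hypothetical and exist only for illustration.

```python
import re
from collections import defaultdict

# Hypothetical counter dump; the numbers are made up for illustration only.
SAMPLE_LOG = """\
\tFetcherStatus
\t\tbytes_downloaded=10485760
\t\trobots_denied=12
\tCleaningJobStatus
\t\tDeleted documents=42
"""


def parse_counters(text):
    """Parse a counter dump into {metric group: {metric name: value}}.

    Assumption: a line ending in '=<integer>' is a counter, and any other
    non-empty line starts a new metric group.
    """
    counters = defaultdict(dict)
    group = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        # Split on the LAST '=' so names such as
        # 'deleted (robots=noindex)' stay intact.
        name, sep, value = line.rpartition("=")
        if sep and re.fullmatch(r"-?\d+", value):
            if group is not None:
                counters[group][name] = int(value)
        else:
            group = line
    return dict(counters)


if __name__ == "__main__":
    for group, metrics in parse_counters(SAMPLE_LOG).items():
        for name, value in metrics.items():
            print(f"{group} | {name} | {value}")
```

Splitting on the last `=` matters because some metric names in the table, such as `deleted (robots=noindex)`, contain an equals sign themselves.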
...