...
Tool/Object | Metric Group | Metric Name | Description | Usage and Comments
---|---|---|---|---
CleaningJob | CleaningJobStatus | Deleted documents | The total count of DB_GONE (HTTP 404) and/or DB_DUPLICATE documents ultimately deleted from the indexer(s). | This metric is useful for determining whether filtering or duplicate detection needs to happen further upstream, prior to indexing. Ideally, DB_GONE and DB_DUPLICATE documents would not make it into production indices in the first place.
CrawlDbFilter | CrawlDB filter | Gone records removed | The total count of DB_GONE (HTTP 404) records deleted from the CrawlDB during an update. | See
 | CrawlDB filter | Orphan records removed | The total count of orphaned pages, i.e. pages which no longer have any other pages linking to them, deleted from the CrawlDB during an update. | See
 | CrawlDB filter | URLs filtered | The total count of filtered pages, i.e. pages which did not pass one or more URLFilter implementation(s), deleted from the CrawlDB during an update. | This metric is generally useful for determining the overall effectiveness of URLFilter plugins over time. It could be improved if an association could be made between the URL which was filtered and the URLFilter which filtered it; this would facilitate aggregating URL-filtering results by URLFilter.
CrawlDbReducer | CrawlDB status | CrawlDatum.getStatusName(CrawlDatum().getStatus()) | With each URL able to have only one state at any given point in time, this metric facilitates aggregated counts of the different types of CrawlDatum states for a given CrawlDB. | The state of any given URL changes as the URL transitions through a crawl cycle. Available URL states are defined in CrawlDatum, e.g. STATUS_DB_UNFETCHED, STATUS_DB_FETCHED, STATUS_FETCH_SUCCESS, etc. Practically, CrawlDatum statuses are defined as byte signatures but accessed programmatically using static final constants. This metric can be used to identify the presence of undesired CrawlDatum statuses for given URLs, e.g. STATUS_DB_GONE; such an event could then trigger a cleaning/pruning operation.
DeduplicationJob | DeduplicationJobStatus | Documents marked as duplicate | |
DomainStatistics | N/A | MyCounter.EMPTY_RESULT | |
 | N/A | MyCounter.FETCHED | |
 | N/A | MyCounter.NOT_FETCHED | |
Fetcher | FetcherStatus | bytes_downloaded | |
 | FetcherStatus | hitByThrougputThreshold | |
 | FetcherStatus | hitByTimeLimit | |
FetcherThread | FetcherStatus | AboveExceptionThresholdInQueue | |
 | FetcherStatus | FetchItem.notCreated.redirect | |
 | FetcherStatus | outlinks_detected | |
 | FetcherStatus | outlinks_following | |
 | FetcherStatus | ProtocolStatus.getName() | |
 | FetcherStatus | redirect_count_exceeded | |
 | FetcherStatus | redirect_deduplicated | |
 | FetcherStatus | robots_denied | |
 | FetcherStatus | robots_denied_maxcrawldelay | |
 | ParserStatus | ParseStatus.majorCodes[p.getData().getStatus().getMajorCode()] | |
Generator | Generator | EXPR_REJECTED | |
 | Generator | HOSTS_AFFECTED_PER_HOST_OVERFLOW | |
 | Generator | INTERVAL_REJECTED | |
 | Generator | MALFORMED_URL | |
 | Generator | SCHEDULE_REJECTED | |
 | Generator | SCORE_TOO_LOW | |
 | Generator | STATUS_REJECTED | |
 | Generator | URLS_SKIPPED_PER_HOST_OVERFLOW | |
IndexerMapReduce | IndexerStatus | deleted (duplicates) | |
 | IndexerStatus | deleted (IndexingFilter) | |
 | IndexerStatus | deleted (gone) | |
 | IndexerStatus | deleted (redirects) | |
 | IndexerStatus | deleted (robots=noindex) | |
 | IndexerStatus | errors (IndexingFilter) | |
 | IndexerStatus | errors (ScoringFilter) | |
 | IndexerStatus | indexed (add/update) | |
 | IndexerStatus | skipped (IndexingFilter) | |
 | IndexerStatus | skipped (not modified) | |
Injector | injector | urls_filtered | |
 | injector | urls_injected | |
 | injector | urls_merged | |
 | injector | urls_purged_404 | |
 | injector | urls_purged_filter | |
ParseSegment | ParserStatus | ParseStatus.majorCodes[parseStatus.getMajorCode()] | |
QueueFeeder | FetcherStatus | filtered | |
 | FetcherStatus | AboveExceptionThresholdInQueue | |
ResolverThread | UpdateHostDb | checked_hosts | |
 | UpdateHostDb | existing_known_host | |
 | UpdateHostDb | existing_unknown_host | |
 | UpdateHostDb | new_known_host | |
 | UpdateHostDb | new_unknown_host | |
 | UpdateHostDb | purged_unknown_host | |
 | UpdateHostDb | rediscovered_host | |
 | UpdateHostDb | Long.toString(datum.numFailures()) + "_times_failed" | |
SitemapProcessor | Sitemap | existing_sitemap_entries | |
 | Sitemap | failed_fetches | |
 | Sitemap | filtered_records | |
 | Sitemap | filtered_sitemaps_from_hostname | |
 | Sitemap | new_sitemap_entries | |
 | Sitemap | sitemaps_from_hostname | |
 | Sitemap | sitemap_seeds | |
UpdateHostDbMapper | UpdateHostDb | filtered_records | |
UpdateHostDbReducer | UpdateHostDb | total_hosts | |
 | UpdateHostDb | skipped_not_eligible | |
WebGraph | WebGraph.outlinks | added links | |
 | WebGraph.outlinks | removed links | |
WARCExporter | WARCExporter | exception | |
 | WARCExporter | invalid URI | |
 | WARCExporter | missing content | |
 | WARCExporter | missing metadata | |
 | WARCExporter | omitted empty response | |
 | WARCExporter | records generated | |
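The metrics above are exposed as Hadoop counters, which the job client prints at the end of each job as a group heading followed by indented `name=value` lines. As a minimal sketch of post-processing that output into a per-group map (the class name `CounterLogParser` and the simplified log layout are illustrative assumptions, not part of Nutch or Hadoop):

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CounterLogParser {

    /**
     * Parse counter-style log output: an unindented line starts a new
     * metric group; indented "name=value" lines are counters in that group.
     */
    public static Map<String, Map<String, Long>> parse(List<String> lines) {
        Map<String, Map<String, Long>> counters = new LinkedHashMap<>();
        String group = null;
        for (String line : lines) {
            if (line.isEmpty()) {
                continue;
            }
            if (!Character.isWhitespace(line.charAt(0))) {
                // group heading, e.g. "FetcherStatus"
                group = line.trim();
                counters.putIfAbsent(group, new LinkedHashMap<>());
            } else if (group != null) {
                // counter line, e.g. "    bytes_downloaded=1048576"
                String[] kv = line.trim().split("=", 2);
                if (kv.length == 2) {
                    counters.get(group).put(kv[0], Long.parseLong(kv[1].trim()));
                }
            }
        }
        return counters;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "FetcherStatus",
            "    bytes_downloaded=1048576",
            "    robots_denied=42",
            "injector",
            "    urls_injected=1000");
        Map<String, Map<String, Long>> c = parse(sample);
        System.out.println(c.get("FetcherStatus").get("bytes_downloaded")); // prints 1048576
    }
}
```

A collector like this could, for example, watch the `CrawlDB status` group for growth in STATUS_DB_GONE counts across crawl cycles and trigger a CleaningJob run when a threshold is crossed.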
...