...

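| Tool/Object | Metric Group | Metric Name | Description | Usage and Comments |
| --- | --- | --- | --- | --- |
| CleaningJob | CleaningJobStatus | Deleted documents | The total count of DB_GONE (HTTP 404) and/or DB_DUPLICATE documents ultimately cleaned (deleted) from the indexer(s). | This metric is useful for determining whether filtering or duplicate detection needs to happen further upstream prior to indexing. Ideally DB_GONE and DB_DUPLICATE documents would not make it into production indices in the first place. |
| CrawlDbFilter | CrawlDB filter | Gone records removed | The total count of DB_GONE (HTTP 404) records purged during a CrawlDb update. | See NUTCH-1101 (ASF JIRA) for more details. |
| CrawlDbFilter | CrawlDB filter | Orphan records removed | | |
| CrawlDbFilter | CrawlDB filter | URLs filtered | | |
| CrawlDbReducer | CrawlDB status | CrawlDatum.getStatusName(CrawlDatum().getStatus()) | | |
| DeduplicationJob | DeduplicationJobStatus | Documents marked as duplicate | | |
| DomainStatistics | N/A | MyCounter.EMPTY_RESULT | | |
| DomainStatistics | N/A | MyCounter.FETCHED | | |
| DomainStatistics | N/A | MyCounter.NOT_FETCHED | | |
| Fetcher | FetcherStatus | bytes_downloaded | | |
| Fetcher | FetcherStatus | hitByThrougputThreshold | | |
| Fetcher | FetcherStatus | hitByTimeLimit | | |
| FetcherThread | FetcherStatus | AboveExceptionThresholdInQueue | | |
| FetcherThread | FetcherStatus | FetchItem.notCreated.redirect | | |
| FetcherThread | FetcherStatus | outlinks_detected | | |
| FetcherThread | FetcherStatus | outlinks_following | | |
| FetcherThread | FetcherStatus | ProtocolStatus.getName() | | |
| FetcherThread | FetcherStatus | redirect_count_exceeded | | |
| FetcherThread | FetcherStatus | redirect_deduplicated | | |
| FetcherThread | FetcherStatus | robots_denied | | |
| FetcherThread | FetcherStatus | robots_denied_maxcrawldelay | | |
| FetcherThread | ParserStatus | ParseStatus.majorCodes[p.getData().getStatus().getMajorCode()] | | |
| Generator | Generator | EXPR_REJECTED | | |
| Generator | Generator | HOSTS_AFFECTED_PER_HOST_OVERFLOW | | |
| Generator | Generator | INTERVAL_REJECTED | | |
| Generator | Generator | MALFORMED_URL | | |
| Generator | Generator | SCHEDULE_REJECTED | | |
| Generator | Generator | SCORE_TOO_LOW | | |
| Generator | Generator | STATUS_REJECTED | | |
| Generator | Generator | URLS_SKIPPED_PER_HOST_OVERFLOW | | |
| IndexerMapReduce | IndexerStatus | deleted (duplicates) | | |
| IndexerMapReduce | IndexerStatus | deleted (IndexingFilter) | | |
| IndexerMapReduce | IndexerStatus | deleted (gone) | | |
| IndexerMapReduce | IndexerStatus | deleted (redirects) | | |
| IndexerMapReduce | IndexerStatus | deleted (robots=noindex) | | |
| IndexerMapReduce | IndexerStatus | errors (IndexingFilter) | | |
| IndexerMapReduce | IndexerStatus | errors (ScoringFilter) | | |
| IndexerMapReduce | IndexerStatus | indexed (add/update) | | |
| IndexerMapReduce | IndexerStatus | skipped (IndexingFilter) | | |
| IndexerMapReduce | IndexerStatus | skipped (not modified) | | |
| Injector | injector | urls_filtered | | |
| Injector | injector | urls_injected | | |
| Injector | injector | urls_merged | | |
| Injector | injector | urls_purged_404 | | |
| Injector | injector | urls_purged_filter | | |
| ParseSegment | ParserStatus | ParseStatus.majorCodes[parseStatus.getMajorCode()] | | |
| QueueFeeder | FetcherStatus | filtered | | |
| QueueFeeder | FetcherStatus | AboveExceptionThresholdInQueue | | |
| ResolverThread | UpdateHostDb | checked_hosts | | |
| ResolverThread | UpdateHostDb | existing_known_host | | |
| ResolverThread | UpdateHostDb | existing_unknown_host | | |
| ResolverThread | UpdateHostDb | new_known_host | | |
| ResolverThread | UpdateHostDb | new_unknown_host | | |
| ResolverThread | UpdateHostDb | purged_unknown_host | | |
| ResolverThread | UpdateHostDb | rediscovered_host | | |
| ResolverThread | UpdateHostDb | Long.toString(datum.numFailures()) + "_times_failed" | | |
| SitemapProcessor | Sitemap | existing_sitemap_entries | | |
| SitemapProcessor | Sitemap | failed_fetches | | |
| SitemapProcessor | Sitemap | filtered_records | | |
| SitemapProcessor | Sitemap | filtered_sitemaps_from_hostname | | |
| SitemapProcessor | Sitemap | new_sitemap_entries | | |
| SitemapProcessor | Sitemap | sitemaps_from_hostname | | |
| SitemapProcessor | Sitemap | sitemap_seeds | | |
| UpdateHostDbMapper | UpdateHostDb | filtered_records | | |
| UpdateHostDbReducer | UpdateHostDb | total_hosts | | |
| UpdateHostDbReducer | UpdateHostDb | skipped_not_eligible | | |
| WebGraph | WebGraph.outlinks | added links | | |
| WebGraph | WebGraph.outlinks | removed links | | |
| WARCExporter | WARCExporter | exception | | |
| WARCExporter | WARCExporter | invalid URI | | |
| WARCExporter | WARCExporter | missing content | | |
| WARCExporter | WARCExporter | missing metadata | | |
| WARCExporter | WARCExporter | omitted empty response | | |
| WARCExporter | WARCExporter | records generated | | |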
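
Each metric listed above is identified by a Metric Group plus a Metric Name, and some names are built at runtime (for example `Long.toString(datum.numFailures()) + "_times_failed"`). The following is a minimal sketch of that two-level naming scheme, assuming a plain in-memory map in place of the job's real counter facility. `MetricTally`, `increment`, and `get` are hypothetical names that do not appear in Nutch or Hadoop, and the sample values are made up for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for job counters; not a Nutch or Hadoop class.
final class MetricTally {
    // group -> (name -> count); TreeMap gives a stable, sorted report order.
    private final Map<String, Map<String, Long>> counters = new TreeMap<>();

    // Adds delta to the counter identified by (group, name), creating it on first use.
    public void increment(String group, String name, long delta) {
        counters.computeIfAbsent(group, g -> new TreeMap<>())
                .merge(name, delta, Long::sum);
    }

    // Returns the current value, or 0 if the counter was never incremented.
    public long get(String group, String name) {
        Map<String, Long> byName = counters.get(group);
        return byName == null ? 0L : byName.getOrDefault(name, 0L);
    }

    public static void main(String[] args) {
        MetricTally tally = new MetricTally();

        // Fixed names, taken from the table above.
        tally.increment("CrawlDB filter", "Gone records removed", 1);
        tally.increment("CrawlDB filter", "Gone records removed", 1);
        tally.increment("FetcherStatus", "bytes_downloaded", 2048);

        // Dynamic name, mirroring the "<n>_times_failed" pattern in the table.
        int numFailures = 3; // illustrative value, not real crawl data
        tally.increment("UpdateHostDb", Long.toString(numFailures) + "_times_failed", 1);

        // Print one tab-separated line per (group, name) pair.
        tally.counters.forEach((group, byName) ->
                byName.forEach((name, count) ->
                        System.out.println(group + "\t" + name + "\t" + count)));
    }
}
```

Keying by group first keeps related metrics, such as the `FetcherStatus` entries, together when the tally is printed. A runtime-built name simply becomes another key within its group.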

...