Table Ordering Logic

The table is sorted as follows:

  1. by the Tool/Object column, alphabetically
  2. by Metric Group, alphabetically within a given tool
  3. by Metric Name, alphabetically within a given metric group
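
The three-level ordering above can be expressed as a chained comparator. The following is a minimal sketch in plain Java. The `MetricRow` and `MetricTableOrdering` classes are hypothetical names introduced for illustration and are not part of Nutch, and the case-insensitive comparison is an assumption, since this page only says "alphabetically".

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** One table row (hypothetical helper type, not a Nutch class). */
final class MetricRow {
    private final String tool;
    private final String group;
    private final String name;

    MetricRow(String tool, String group, String name) {
        this.tool = tool;
        this.group = group;
        this.name = name;
    }

    String getTool() { return tool; }
    String getGroup() { return group; }
    String getName() { return name; }
}

/** Sorts rows by tool, then metric group, then metric name. */
final class MetricTableOrdering {
    // Assumption: "alphabetically" is taken to mean case-insensitive order.
    static final Comparator<MetricRow> TABLE_ORDER =
        Comparator.comparing(MetricRow::getTool, String.CASE_INSENSITIVE_ORDER)
            .thenComparing(MetricRow::getGroup, String.CASE_INSENSITIVE_ORDER)
            .thenComparing(MetricRow::getName, String.CASE_INSENSITIVE_ORDER);

    /** Returns a sorted copy and leaves the input list untouched. */
    static List<MetricRow> sorted(List<MetricRow> rows) {
        List<MetricRow> copy = new ArrayList<>(rows);
        copy.sort(TABLE_ORDER);
        return copy;
    }
}
```

Each later key is only consulted when all earlier keys compare equal, which matches the "within a given tool" and "within a given metric group" wording of the list.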


| Tool/Object | Metric Group | Metric Name | Description | Usage and Comments |
|---|---|---|---|---|
| CleaningJob | CleaningJobStatus | Deleted documents | The total count of DB_GONE (HTTP 404) and/or DB_DUPLICATE documents ultimately deleted from the indexer(s). | This metric is useful for determining whether filtering or duplicate detection needs to happen further upstream, prior to indexing. Ideally, DB_GONE and DB_DUPLICATE documents would not make it into production indices in the first place. |
| CrawlDbFilter | CrawlDB filter | Gone records removed | The total count of DB_GONE (HTTP 404) records deleted from the CrawlDB during an update. | See NUTCH-1101 for more details. |
| CrawlDbFilter | CrawlDB filter | Orphan records removed | The total count of orphaned pages, i.e. pages which no other pages link to any more, deleted from the CrawlDB during an update. | See NUTCH-1932 for more details. |
| CrawlDbFilter | CrawlDB filter | URLs filtered | The total count of filtered pages, i.e. pages which didn't pass one or more URLFilter implementation(s), deleted from the CrawlDB during an update. | This metric is generally useful for determining the overall effectiveness of URLFilter plugins over time. It could be improved if an association could be made between the URL that was filtered and the URLFilter which filtered it. This would facilitate aggregating URL filtering results by URLFilter. |
| CrawlDbReducer | CrawlDB status | CrawlDatum.getStatusName(CrawlDatum().getStatus()) | | |
| DeduplicationJob | DeduplicationJobStatus | Documents marked as duplicate | | |
| DomainStatistics | N/A | MyCounter.EMPTY_RESULT | | |
| DomainStatistics | N/A | MyCounter.FETCHED | | |
| DomainStatistics | N/A | MyCounter.NOT_FETCHED | | |
| Fetcher | FetcherStatus | bytes_downloaded | | |
| Fetcher | FetcherStatus | hitByThrougputThreshold | | |
| Fetcher | FetcherStatus | hitByTimeLimit | | |
| FetcherThread | FetcherStatus | AboveExceptionThresholdInQueue | | |
| FetcherThread | FetcherStatus | FetchItem.notCreated.redirect | | |
| FetcherThread | FetcherStatus | outlinks_detected | | |
| FetcherThread | FetcherStatus | outlinks_following | | |
| FetcherThread | FetcherStatus | ProtocolStatus.getName() | | |
| FetcherThread | FetcherStatus | redirect_count_exceeded | | |
| FetcherThread | FetcherStatus | redirect_deduplicated | | |
| FetcherThread | FetcherStatus | robots_denied | | |
| FetcherThread | FetcherStatus | robots_denied_maxcrawldelay | | |
| FetcherThread | ParserStatus | ParseStatus.majorCodes[p.getData().getStatus().getMajorCode()] | | |
| Generator | Generator | EXPR_REJECTED | | |
| Generator | Generator | HOSTS_AFFECTED_PER_HOST_OVERFLOW | | |
| Generator | Generator | INTERVAL_REJECTED | | |
| Generator | Generator | MALFORMED_URL | | |
| Generator | Generator | SCHEDULE_REJECTED | | |
| Generator | Generator | SCORE_TOO_LOW | | |
| Generator | Generator | STATUS_REJECTED | | |
| Generator | Generator | URLS_SKIPPED_PER_HOST_OVERFLOW | | |
| IndexerMapReduce | IndexerStatus | deleted (duplicates) | | |
| IndexerMapReduce | IndexerStatus | deleted (IndexingFilter) | | |
| IndexerMapReduce | IndexerStatus | deleted (gone) | | |
| IndexerMapReduce | IndexerStatus | deleted (redirects) | | |
| IndexerMapReduce | IndexerStatus | deleted (robots=noindex) | | |
| IndexerMapReduce | IndexerStatus | errors (IndexingFilter) | | |
| IndexerMapReduce | IndexerStatus | errors (ScoringFilter) | | |
| IndexerMapReduce | IndexerStatus | indexed (add/update) | | |
| IndexerMapReduce | IndexerStatus | skipped (IndexingFilter) | | |
| IndexerMapReduce | IndexerStatus | skipped (not modified) | | |
| Injector | injector | urls_filtered | | |
| Injector | injector | urls_injected | | |
| Injector | injector | urls_merged | | |
| Injector | injector | urls_purged_404 | | |
| Injector | injector | urls_purged_filter | | |
| ParseSegment | ParserStatus | ParseStatus.majorCodes[parseStatus.getMajorCode()] | | |
| QueueFeeder | FetcherStatus | filtered | | |
| (also QueueFeeder) | FetcherStatus | AboveExceptionThresholdInQueue | | |
| ResolverThread | UpdateHostDb | checked_hosts | | |
| ResolverThread | UpdateHostDb | existing_known_host | | |
| ResolverThread | UpdateHostDb | existing_unknown_host | | |
| ResolverThread | UpdateHostDb | new_known_host | | |
| ResolverThread | UpdateHostDb | new_unknown_host | | |
| ResolverThread | UpdateHostDb | purged_unknown_host | | |
| ResolverThread | UpdateHostDb | rediscovered_host | | |
| ResolverThread | UpdateHostDb | Long.toString(datum.numFailures()) + "_times_failed" | | |
| SitemapProcessor | Sitemap | existing_sitemap_entries | | |
| SitemapProcessor | Sitemap | failed_fetches | | |
| SitemapProcessor | Sitemap | filtered_records | | |
| SitemapProcessor | Sitemap | filtered_sitemaps_from_hostname | | |
| SitemapProcessor | Sitemap | new_sitemap_entries | | |
| SitemapProcessor | Sitemap | sitemaps_from_hostname | | |
| SitemapProcessor | Sitemap | sitemap_seeds | | |
| UpdateHostDbMapper | UpdateHostDb | filtered_records | | |
| UpdateHostDbReducer | UpdateHostDb | total_hosts | | |
| (also UpdateHostDbReducer) | UpdateHostDb | skipped_not_eligible | | |
| WebGraph | WebGraph.outlinks | added links | | |
| (also WebGraph) | WebGraph.outlinks | removed links | | |
| WARCExporter | WARCExporter | exception | | |
| WARCExporter | WARCExporter | invalid URI | | |
| WARCExporter | WARCExporter | missing content | | |
| WARCExporter | WARCExporter | missing metadata | | |
| WARCExporter | WARCExporter | omitted empty response | | |
| WARCExporter | WARCExporter | records generated | | |
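
Every row in the table identifies a metric by a Metric Group and a Metric Name, and a few names are built at runtime, for example `Long.toString(datum.numFailures()) + "_times_failed"`. The following is a minimal sketch of that group/name addressing in plain Java. It is a stand-in for illustration only: `CounterSketch` and its methods are hypothetical and do not reproduce the Nutch or Hadoop counter classes. In a Hadoop job the equivalent call would typically be `context.getCounter(group, name).increment(1)`.

```java
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory model of counters addressed by (group, name). */
final class CounterSketch {
    // Sorted maps keep groups and names in alphabetical order when listed.
    private final Map<String, Map<String, Long>> groups = new TreeMap<>();

    /** Adds delta to the counter, creating the group and name on first use. */
    void increment(String group, String name, long delta) {
        groups.computeIfAbsent(group, g -> new TreeMap<>())
              .merge(name, delta, Long::sum);
    }

    /** Returns the current value, or 0 if the counter was never incremented. */
    long get(String group, String name) {
        return groups.getOrDefault(group, Collections.<String, Long>emptyMap())
                     .getOrDefault(name, 0L);
    }

    /** Builds a runtime counter name in the style of the "_times_failed" row. */
    static String timesFailedName(long numFailures) {
        return Long.toString(numFailures) + "_times_failed";
    }
}
```

Because such names depend on runtime values, the set of counters in a group is not fixed in advance, so any dashboard or aggregation built on them has to discover names rather than hard-code them.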

Conclusion