Repository Scanning in Archiva

Archiva scans each repository periodically to determine what has changed in it.

On the first scan, the entire repository is scanned.
On subsequent scans, only content that is new or changed since the last scan is picked up.
The scan is required to pick up content that arrives in the repository by non-monitored means. Content that arrives by monitored means needs no scan:
- Content that arrives via a WebDAV PUT is automatically processed.
- Content that arrives via a Proxy Request is automatically processed.
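
The incremental behaviour described above can be sketched as a directory walk that keeps only files modified after the previous scan. This is a minimal illustration, not Archiva's scanner: the class name `IncrementalScan`, the method `changedSince`, and the use of the file's last-modified time as the change signal are all assumptions made for the example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Instant;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class IncrementalScan {

    /**
     * Returns the regular files under root that were modified strictly after lastScan.
     * Passing Instant.MIN as lastScan behaves like a first (full) scan.
     */
    public static List<Path> changedSince(Path root, Instant lastScan) throws IOException {
        try (Stream<Path> files = Files.walk(root)) {
            return files
                .filter(Files::isRegularFile)
                .filter(p -> modifiedAfter(p, lastScan))
                .sorted()
                .collect(Collectors.toList());
        }
    }

    private static boolean modifiedAfter(Path file, Instant lastScan) {
        try {
            return Files.getLastModifiedTime(file).toInstant().isAfter(lastScan);
        } catch (IOException e) {
            // Hypothetical policy: if the timestamp cannot be read, scan the file again.
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        Path root = Path.of(args.length > 0 ? args[0] : ".");
        for (Path p : changedSince(root, Instant.MIN)) {
            System.out.println(p);
        }
    }
}
```

A caller would record the start time of each scan and pass it as `lastScan` on the next run, so that files written while a scan is in progress are not missed.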

The Scan Lifecycle.

All content falls into one of three categories: CONSUMED, IGNORED, or UNKNOWN.
- CONSUMED content is content that is managed by Archiva.
- IGNORED content consists of generated content or transient content.
- UNKNOWN content is whatever falls through the cracks of the two categories above. Typically, this means the content doesn't conform to the repository structure, or is otherwise unrecognized.

The lifecycle of a scan is as follows.

Perform a SCAN with an inclusion filter of "**/*" and an exclusion filter containing those elements predetermined to be IGNORED.
On identification of a file, attempt to resolve it to an Artifact object.
1. If a valid Artifact object is created, flag the file as CONSUMED and store the Artifact in the database.
2. If the file cannot be converted to an Artifact object, flag it as UNKNOWN and create a report entry in the ARTIFACT_HEALTH database table.
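
The lifecycle above can be approximated by a small classifier that checks the exclusion patterns first, then the consumed patterns, and falls back to UNKNOWN. This is only a sketch under stated assumptions: `ScanClassifier`, `classify`, and `toRegex` are hypothetical names, the pattern lists are small subsets of the ones on this page, and a pattern match stands in for the real step of resolving the file to an Artifact object.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class ScanClassifier {

    public enum Category { CONSUMED, IGNORED, UNKNOWN }

    // Small subsets of the pattern lists on this page, for illustration only.
    private static final List<Pattern> IGNORED =
        compile("**/.htaccess", "**/KEYS", "**/*.sh", "**/.svn/**/*");
    private static final List<Pattern> CONSUMED =
        compile("**/*.pom", "**/*.jar", "**/*.sha1", "**/maven-metadata.xml");

    private static List<Pattern> compile(String... patterns) {
        return Arrays.stream(patterns).map(ScanClassifier::toRegex).collect(Collectors.toList());
    }

    /**
     * Converts a simplified Ant-style pattern to a regex:
     * a double star followed by a slash matches zero or more directories,
     * a single star matches within one path segment.
     */
    static Pattern toRegex(String pattern) {
        StringBuilder re = new StringBuilder();
        int i = 0;
        while (i < pattern.length()) {
            if (pattern.startsWith("**/", i)) {
                re.append("(?:.*/)?");
                i += 3;
            } else if (pattern.startsWith("**", i)) {
                re.append(".*");
                i += 2;
            } else if (pattern.charAt(i) == '*') {
                re.append("[^/]*");
                i++;
            } else {
                re.append(Pattern.quote(String.valueOf(pattern.charAt(i))));
                i++;
            }
        }
        return Pattern.compile(re.toString());
    }

    private static boolean matchesAny(List<Pattern> patterns, String path) {
        return patterns.stream().anyMatch(p -> p.matcher(path).matches());
    }

    /** Classifies a repository-relative path that uses '/' as its separator. */
    public static Category classify(String relativePath) {
        if (matchesAny(IGNORED, relativePath)) {
            return Category.IGNORED;   // excluded before any consumption is attempted
        }
        if (matchesAny(CONSUMED, relativePath)) {
            return Category.CONSUMED;  // stand-in for "a valid Artifact object was created"
        }
        return Category.UNKNOWN;       // would become an ARTIFACT_HEALTH report entry
    }
}
```

Checking the exclusion list first mirrors the scan description: a file under a `.svn` directory is IGNORED even if its name would otherwise match a consumed pattern.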

CONSUMED Files

| Include Pattern | Type | Consumed By |
|---|---|---|
| **/*.pom | MavenProject | Convert to Project Model. Save Model to Database. Auto Convert embedded <repositories>. Auto Convert embedded <pluginRepositories>. Lucene XML contents. Lucene Effective POM contents. |
| **/*.jar | Artifact (jar) | Convert to Artifact Model. Generate Missing Hashcodes. Compute JDK Revision. Determine Sealed. Save Model to Database. Lucene Archive TOC. Lucene Classnames. Lucene Public Methods. |
| **/*.ear | Artifact (ear) | (same as jar) |
| **/*.war | Artifact (war) | (same as jar) |
| **/*.car | Artifact (car) | (same as jar) |
| **/*.sar | Artifact (sar) | (same as jar) |
| **/*.mar | Artifact (mar) | (same as jar) |
| **/*.rar | Artifact (rar) | (same as jar) |
| **/*.dtd | Artifact (dtd) | Convert to Artifact Model. Generate Missing Hashcodes. Save Model to Database. Lucene DTD contents. |
| **/*.tld | Artifact (tld) | Convert to Artifact Model. Generate Missing Hashcodes. Save Model to Database. Lucene TLD contents. |
| **/*.tar.gz | Artifact (distribution) | Convert to Artifact Model. Generate Missing Hashcodes. Save Model to Database. Lucene Archive TOC. |
| **/*.tar.bz2 | Artifact (distribution) | (same as *.tar.gz) |
| **/*.zip | Artifact (distribution) | (same as *.tar.gz) |
| **/*.sha1 | Hashcode | Report on Saved Hashcode vs. Actual Hashcode. |
| **/*.md5 | Hashcode | Report on Saved Hashcode vs. Actual Hashcode. |
| **/*.asc | Signature | Report on signature validation. |
| **/maven-metadata.xml | Repository Metadata | Convert to Repository Model. Cross Validate listed versions to available versions in repository. Save Model to Database. Lucene XML contents. |
| **/*-site.xml | Site Metadata | Lucene file contents. |
| **/*.xml | Xml Content | Lucene file contents. |
| **/*.html | Html Content | Lucene file contents. |
| **/*.block | Auto-Xml/Text Content | Lucene file contents. |
| **/*.config | Auto-Xml/Text Content | Lucene file contents. |
| **/*.xsd | Xml Content | Lucene file contents. |
| **/*.txt | Text Content | Lucene file contents. |
| **/*.TXT | Text Content | Lucene file contents. |
| **/*.bar | Binary Content | - no direct consumption - |
| **/*.nbm | Binary Content | - no direct consumption - |
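
The hashcode report for the *.sha1 and *.md5 entries above amounts to recomputing the digest of the artifact and comparing it with the value saved next to it. The following is a minimal sketch of that comparison, not Archiva's consumer: `ChecksumCheck`, `hexDigest`, and `matches` are hypothetical names, and the handling of a "hash filename" layout in the checksum file is an assumption about how such files are commonly written.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class ChecksumCheck {

    /** Computes the lowercase hex digest of a file; algorithm is e.g. "SHA-1" or "MD5". */
    public static String hexDigest(Path file, String algorithm)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance(algorithm);
        byte[] digest = md.digest(Files.readAllBytes(file));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /** Returns true when the saved hashcode equals the actual hashcode of the artifact. */
    public static boolean matches(Path artifact, Path checksumFile, String algorithm)
            throws IOException, NoSuchAlgorithmException {
        String saved = Files.readString(checksumFile, StandardCharsets.UTF_8).trim();
        // Assumption: some tools write "<hash>  <filename>"; keep only the first token.
        saved = saved.split("\\s+")[0];
        return saved.equalsIgnoreCase(hexDigest(artifact, algorithm));
    }
}
```

Reading the whole file into memory keeps the sketch short; a real consumer handling large distributions would more likely stream the file through the digest.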

IGNORED Content

Content in this category is never indexed, nor reported as bad or unknown. It exists on disk solely for the benefit of the client using Archiva.

| Pattern | Reason |
|---|---|
| **/.htaccess | Web server specific content control mechanism. |
| **/KEYS | GPG Signatures File. Not used by Archiva directly. |
| **/*.rb | Ruby script file. |
| **/*.sh | Shell script file. |
| **/.svn/**/* | Subversion Control Directory. |
| **/.DAV/**/* | DAV Server Control Directory. |

UNKNOWN / BAD Content

Content that does not fit into the above categories is automatically placed into this category.
However, some UNKNOWN / BAD Content is well understood, and can have a 'Quick Fix' associated with it.

| Pattern | Type | Quick Fix Option |
|---|---|---|
| **/*.bak | Backup File | Remove from repository |
| **/*~ | Backup File | Remove from repository |
| **/*- | Backup File | Remove from repository |
| **/*.distribution-tgz | Distribution Artifact from M1 | Rename to *.tar.gz |
| **/*.distribution-zip | Distribution Artifact from M1 | Rename to *.zip |
| **/*.plugin | Plugin from M1 | Rename to *.jar |
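
The quick fixes listed above can be pictured as a lookup from file name suffix to a suggested action. This is an illustrative sketch only: `QuickFix` and `suggest` are hypothetical names, the suffixes are taken from the list above, and the code merely proposes an action as text rather than touching the repository.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

public class QuickFix {

    // M1 suffix -> replacement suffix, taken from the quick-fix list on this page.
    private static final Map<String, String> RENAMES = new LinkedHashMap<>();
    static {
        RENAMES.put(".distribution-tgz", ".tar.gz");
        RENAMES.put(".distribution-zip", ".zip");
        RENAMES.put(".plugin", ".jar");
    }

    /** Suggests a quick fix for a repository-relative path, or empty if none is known. */
    public static Optional<String> suggest(String path) {
        // Backup files: the only suggested fix is removal.
        if (path.endsWith(".bak") || path.endsWith("~") || path.endsWith("-")) {
            return Optional.of("Remove from repository");
        }
        // M1 leftovers: suggest the equivalent M2 file name.
        for (Map.Entry<String, String> e : RENAMES.entrySet()) {
            if (path.endsWith(e.getKey())) {
                String base = path.substring(0, path.length() - e.getKey().length());
                return Optional.of("Rename to " + base + e.getValue());
            }
        }
        return Optional.empty();
    }
}
```

Returning an `Optional` keeps the distinction between UNKNOWN content that has a well-understood fix and content that simply needs manual attention.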
