This is a brief but hopefully useful set of notes on Archiva. I put this up to organize my thoughts while working on MRM-409, and it might be of help to other developers too.

Repository Scanning and Indexing

Assumption:
- default-archiva.xml is used for the configuration.

Classes

Below are some of the important classes of Repository Scanning:

Class	Implements	What Does it do?
DefaultRepositoryScanner	RepositoryScanner	Makes use of plexus-utils' DirectoryWalker to scan the repository
RepositoryScannerInstance	DirectoryWalkListener	Listener that sets the trigger to start the consumers.
RepositoryContentStatistics	generated by modello	Contains the stats (duration, no. of files discovered, etc.) of the repository scan.
TriggerBeginScanClosure	Closure (commons-collections)	Signals to the consumer(s) that the repository scanning will begin.
DefaultBidirectionalRepositoryLayout	BidirectionalRepositoryLayout	Default bidirectional layout used by m2 repositories.
ArchivaArtifact		Archiva artifact object
ArchivaArtifactModel	generated by modello	Contains the detailed attributes of an Archiva artifact, such as groupId, artifactId, version, checksums, etc.
FileContentRecord	LuceneRepositoryContentRecord	Contains the contents of the artifact to be indexed.

Repository Content Consumers (KnownRepositoryContentConsumer)

This is configured in archiva.xml, under <repositoryScanning>.

Class	Role Hint	What does it do?
ValidateChecksumConsumer	validate-checksum	Validate checksum files.
LegacyConverterArtifactConsumer	artifact-legacy-to-default-converter	Converts legacy artifacts to m2 artifacts.
ArtifactMissingChecksumConsumer	create-missing-checksums	Creates checksum if it is missing.
AutoRemoveConsumer	auto-remove	Removes files in the repository being scanned if the file type matches any of the configured file types to be removed.
AutoRenameConsumer	auto-rename
ArtifactUpdateDatabaseConsumer	update-db-artifact	Save the artifact (in the form of ArchivaArtifact) to the database.
IndexContentConsumer	index-content	Processes the artifact's content into a FileContentRecord that is used for indexing.
RepositoryPurgeConsumer	repository-purge	Removes old snapshots from the repository either by the number of days old or by the retention count. (See Repository Purge section below)
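
To make the role of these consumers more concrete, here is a minimal sketch of how a scanner might hand files to them: each discovered path is tested against include and exclude patterns, and only included paths reach every consumer's processFile(...). SketchConsumer and ScanSketch are hypothetical names, and the patterns are matched with the JDK's glob matcher rather than the plexus-utils DirectoryWalker that Archiva uses, so treat this as an illustration and not as Archiva's code.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified stand-in for a repository content consumer.
interface SketchConsumer {
    void processFile(String relativePath);
}

class ScanSketch {
    // Assumption: patterns are JDK glob patterns such as "**/*.jar".
    static PathMatcher matcher(String glob) {
        return FileSystems.getDefault().getPathMatcher("glob:" + glob);
    }

    // A path is included only if no exclude matches and at least one include matches.
    static boolean isIncluded(String relativePath, List<String> includes, List<String> excludes) {
        for (String exclude : excludes) {
            if (matcher(exclude).matches(Paths.get(relativePath))) {
                return false;
            }
        }
        for (String include : includes) {
            if (matcher(include).matches(Paths.get(relativePath))) {
                return true;
            }
        }
        return false; // in neither list: treated as excluded
    }

    // Passes every included file to every consumer; returns the number of files processed.
    static int scan(List<String> discovered, List<String> includes,
                    List<String> excludes, List<SketchConsumer> consumers) {
        int processed = 0;
        for (String file : discovered) {
            if (!isIncluded(file, includes, excludes)) {
                continue;
            }
            for (SketchConsumer consumer : consumers) {
                consumer.processFile(file);
            }
            processed++;
        }
        return processed;
    }

    public static void main(String[] args) {
        List<SketchConsumer> consumers = new ArrayList<>();
        consumers.add(path -> System.out.println("consumed " + path));
        int count = scan(Arrays.asList("org/foo/foo-1.0.jar", "org/foo/notes.txt"),
                Arrays.asList("**/*.jar"), Arrays.asList("**/*.bak"), consumers);
        System.out.println(count + " file(s) processed");
    }
}
```

In this sketch the consumers are plain callbacks, which is enough to show the order of the checks; the real consumers are looked up by role hint from the configuration.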

The Process

User clicks 'Scan Repository Now' in the Repositories page.
Repository scanning is triggered.
Start scanning:
- DefaultRepositoryScanner gathers the consumers (KnownContentConsumers and InvalidContentConsumers) from the config file. RepositoryScannerInstance is added as a DirectoryWalkListener to the plexus-utils DirectoryWalker. Start of scan is fired.
- Every discovered file is checked against the configured include and exclude patterns. If it matches neither, it is excluded. If it is included, it is processed by the consumers. Each consumer performs a different action in its processFile(...) method.
Saving the artifact to the database is performed in the ArtifactUpdateDatabaseConsumer. An ArchivaArtifact, which has an ArchivaArtifactModel attribute, is constructed. The attributes of the ArchivaArtifactModel are gathered from the artifact itself e.g. groupId, artifactId, version came from the artifact's filepath.
Indexing the artifact happens in the IndexContentConsumer, which builds an index record containing the details of the artifact plus its contents. Please note that in the default-archiva.xml, the bundled files are not included in the indexable-content fileType pattern.¹
Once the repository scanning is finished, the scan statistics (number of files discovered, the consumers used, duration of the scan, the repository scanned, etc.) are displayed in the console.
User performs a search:
- User types the query string and hits the Search button.
- Archiva then searches its indices for the query string and returns the search results.
- The user can click on an artifact to browse it. What the user actually browses is the pom. At the back-end, Archiva checks if the project model is already in the database. If it is not, then Archiva constructs the ArchivaProjectModel object and saves it to the database.¹ Once it is in the database, the pom info or artifact is displayed.

¹ This causes the problem of different values when the actual pom file is read. The pom file may be invalid (e.g. it might have different versions as in the case of commons-dbcp-1.0 in MRM-376) and wasn't detected when it was added to the database (MRM-409).
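
Since the groupId, artifactId and version are taken from the artifact's file path, a small sketch of that derivation may help show why they can disagree with what the pom itself declares. This assumes the default m2 layout (groupId directories, then artifactId, then version, then the file). PathCoordinatesSketch is a hypothetical name; in Archiva this job belongs to the repository layout classes, and the sketch does not reproduce them.

```java
import java.util.Arrays;

class PathCoordinatesSketch {
    final String groupId;
    final String artifactId;
    final String version;

    PathCoordinatesSketch(String groupId, String artifactId, String version) {
        this.groupId = groupId;
        this.artifactId = artifactId;
        this.version = version;
    }

    // Assumes a path of the form <group dirs>/<artifactId>/<version>/<file name>.
    static PathCoordinatesSketch fromPath(String relativePath) {
        String[] parts = relativePath.split("/");
        if (parts.length < 4) {
            throw new IllegalArgumentException("Not a default-layout path: " + relativePath);
        }
        int n = parts.length;
        // Everything before the artifactId directory forms the dotted groupId.
        String groupId = String.join(".", Arrays.copyOfRange(parts, 0, n - 3));
        return new PathCoordinatesSketch(groupId, parts[n - 3], parts[n - 2]);
    }

    public static void main(String[] args) {
        PathCoordinatesSketch c = fromPath("commons-dbcp/commons-dbcp/1.0/commons-dbcp-1.0.jar");
        System.out.println(c.groupId + ":" + c.artifactId + ":" + c.version);
    }
}
```

Nothing here reads the pom, so a pom that declares a different version than its directory would go unnoticed at this stage.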

Finding an Artifact

The user browses for an artifact he/she wants to locate in the repositories.
Archiva calculates the checksum for the artifact to be searched.
The database is searched for the matching checksum using the ArtifactsByChecksumConstraint (search all artifacts where the calculated checksum matches either a SHA1 or MD5 checksum of an artifact in the database)
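
The lookup above can be sketched as follows: compute both digests of the file and report every stored artifact whose SHA1 or MD5 matches. ChecksumSearchSketch and StoredArtifact are hypothetical names, and the "database" is just an in-memory list; the real query goes through ArtifactsByChecksumConstraint.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;

class ChecksumSearchSketch {
    // Hypothetical in-memory stand-in for an artifact row in the database.
    static class StoredArtifact {
        final String name;
        final String sha1;
        final String md5;

        StoredArtifact(String name, String sha1, String md5) {
            this.name = name;
            this.sha1 = sha1;
            this.md5 = md5;
        }
    }

    // Returns the lowercase hex digest of the data for the given algorithm.
    static String hex(String algorithm, byte[] data) {
        try {
            MessageDigest digest = MessageDigest.getInstance(algorithm);
            StringBuilder sb = new StringBuilder();
            for (byte b : digest.digest(data)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns the names of stored artifacts whose SHA1 or MD5 matches the content.
    static List<String> findByChecksum(byte[] content, List<StoredArtifact> database) {
        String sha1 = hex("SHA-1", content);
        String md5 = hex("MD5", content);
        List<String> matches = new ArrayList<>();
        for (StoredArtifact artifact : database) {
            if (sha1.equals(artifact.sha1) || md5.equals(artifact.md5)) {
                matches.add(artifact.name);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        byte[] content = "abc".getBytes(StandardCharsets.UTF_8);
        List<StoredArtifact> database = new ArrayList<>();
        database.add(new StoredArtifact("example-1.0.jar", hex("SHA-1", content), hex("MD5", content)));
        System.out.println(findByChecksum(content, database));
    }
}
```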

Registry Listeners

A RegistryListener (plexus-registry) is an interface that receives a notification for every change in the Registry. There are a handful of classes in Archiva that implement it and perform some processing every time there's a change in the configuration.

Class	What does it do?
DuplicateArtifactsConsumer	Checks for duplicate artifacts using the SHA1 checksum.
LocationArtifactsConsumer	Validates if the location of the artifact in the repository is correct based on the groupId, artifactId and version specified in the pom.
ArtifactMissingChecksumConsumer	Create missing checksum for the artifact.
ArtifactUpdateDatabaseConsumer
AutoRemoveConsumer
IndexArtifactConsumer	Index the artifact checksums for 'Find Artifact' functionality. It stores the data as hashcodes in the index (HashcodesRecord).
IndexContentConsumer
ProjectModelToDatabaseConsumer	Update database with project model info.
ActiveManagedRepositories	Provides a real-time listing of the active managed repositories within Archiva.
ConfigurationSynchronization	Synchronizes the repositories in the configuration file with the database.
DefaultArchivaConfiguration	Configuration holder that retrieves the configuration from the registry.
DefaultCrossRepositorySearch	Searches across repositories in the Lucene indices. It filters the repositories down to those that are managed and indexed.
DefaultRepositoryProxyConnectors	Handlers for potential repository proxy connectors.
DefaultArchivaTaskScheduler	Default scheduling component for Archiva.
BidirectionalRepositoryLayoutFactory	Creates a BidirectionalRepositoryLayout.
RepositoryProjectModelResolverFactory	Creates ProjectModelResolver objects.
RepositoryServlet
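
As a rough picture of the listener mechanism the classes above rely on, the sketch below shows a registry that notifies its listeners after each change, and a component that reacts only to the keys it cares about. SketchRegistry, SketchRegistryListener, RepositoryAwareComponent and the property names are all hypothetical; this is a simplified observer pattern and not the plexus-registry API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified listener; not the plexus-registry interface.
interface SketchRegistryListener {
    void afterConfigurationChange(String propertyName, Object propertyValue);
}

class SketchRegistry {
    private final Map<String, Object> values = new HashMap<>();
    private final List<SketchRegistryListener> listeners = new ArrayList<>();

    void addChangeListener(SketchRegistryListener listener) {
        listeners.add(listener);
    }

    // Stores the value, then notifies every registered listener of the change.
    void setValue(String propertyName, Object value) {
        values.put(propertyName, value);
        for (SketchRegistryListener listener : listeners) {
            listener.afterConfigurationChange(propertyName, value);
        }
    }

    Object getValue(String propertyName) {
        return values.get(propertyName);
    }
}

// Example component that refreshes its state only when a repository setting changes.
class RepositoryAwareComponent implements SketchRegistryListener {
    int reloads = 0;

    @Override
    public void afterConfigurationChange(String propertyName, Object propertyValue) {
        // Assumption: repository settings live under a "repositories." prefix.
        if (propertyName.startsWith("repositories.")) {
            reloads++;
        }
    }

    public static void main(String[] args) {
        SketchRegistry registry = new SketchRegistry();
        RepositoryAwareComponent component = new RepositoryAwareComponent();
        registry.addChangeListener(component);
        registry.setValue("repositories.internal.location", "/tmp/repo");
        System.out.println("reloads: " + component.reloads);
    }
}
```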

Database Scanning

The database is scanned and specific consumers process these artifacts.

Database Consumers

There are 2 types of database consumers:

Unprocessed consumers - consumers for those artifacts already in the index that haven't been processed yet, meaning the details about the artifact are not yet processed and stored in the database
Cleanup consumers - consumers for cleaning up the database

These consumers are configured in archiva.xml, under <databaseScanning>. Below are the different types of Database Consumers:

Class	Role Hint	Type	What does it do?
ProjectModelToDatabaseConsumer	update-db-project	unprocessed consumer	Gets the details of the artifact from the pom and saves them into the database (as a project model)
DatabaseCleanupArtifactConsumer	not-present-remove-db-artifact	cleanup consumer	Cleans the database of artifacts that are no longer in the repository
DatabaseCleanupProjectConsumer	not-present-remove-db-project	cleanup consumer	Cleans the database of project models of artifacts that are no longer in the repository
DatabaseCleanupLuceneConsumer	not-present-remove-indexed	cleanup consumer	Cleans up the index of artifacts that are no longer in the repository
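
The two consumer types can be pictured as two passes over the stored artifacts: one that handles records not yet processed, and one that drops records whose files are gone from the repository. The sketch below is only an illustration under that assumption; DatabaseScanSketch and ArtifactRecord are hypothetical names, and the "database" is an in-memory list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

class DatabaseScanSketch {
    // Hypothetical in-memory stand-in for an artifact row.
    static class ArtifactRecord {
        final String path;
        boolean processed;

        ArtifactRecord(String path, boolean processed) {
            this.path = path;
            this.processed = processed;
        }
    }

    // "Unprocessed" pass: handle each record not yet processed and mark it as done.
    static int processUnprocessed(List<ArtifactRecord> database) {
        int handled = 0;
        for (ArtifactRecord record : database) {
            if (!record.processed) {
                record.processed = true; // a real consumer would read the pom here
                handled++;
            }
        }
        return handled;
    }

    // "Cleanup" pass: remove records whose file is no longer in the repository.
    static int cleanup(List<ArtifactRecord> database, Predicate<String> existsInRepository) {
        int before = database.size();
        database.removeIf(record -> !existsInRepository.test(record.path));
        return before - database.size();
    }

    public static void main(String[] args) {
        List<ArtifactRecord> database = new ArrayList<>();
        database.add(new ArtifactRecord("a/a/1.0/a-1.0.jar", false));
        database.add(new ArtifactRecord("b/b/1.0/b-1.0.jar", false));
        Set<String> onDisk = new HashSet<>(Arrays.asList("a/a/1.0/a-1.0.jar"));
        System.out.println("processed: " + processUnprocessed(database));
        System.out.println("removed: " + cleanup(database, onDisk::contains));
    }
}
```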

Repository Purge

Removes old snapshots from the managed repository based on one of two criteria: By Number of Days Old or By Retention Count. There is also the option to enable or disable the cleanup of released snapshots from the repository.

Classes

Below are the classes for Repository Purge:

Class	Implements	What Does it do?
RepositoryPurgeConsumer	KnownContentConsumer	Consumer for removing old snapshots from the managed repository
DaysOldRepositoryPurge	RepositoryPurge	Remove old snapshots by the number of days old.
RetentionCountRepositoryPurge	RepositoryPurge	Remove old snapshots while retaining a specific number of them.
CleanupReleasedSnapshotsRepositoryPurge	RepositoryPurge	Remove old snapshots that have already been released.
ArtifactFilenameFilter	FilenameFilter (java.io)	Filters the filenames from the directory listing by checking if they match a specific filename.

Configuration (for Archiva Users)

To enable repository purge, add "repository-purge" in the <knownContentConsumers> section of the archiva.xml. The RepositoryPurgeConsumer will be executed when repository scanning is started.
The user can choose whether to purge snapshots older than a specific number of days OR to purge snapshots while retaining a specific number of them. This is configured through the "Repository Purge By Days Older Than" and "Repository Purge By Retention Count" fields in the Add/Edit Repository page. By default, these have the values "100" and "2" respectively. If "Repository Purge By Days Older Than" is NOT EQUAL TO 0 (zero), then it is the criterion used for the repository purge. Otherwise, if it is EQUAL TO 0 (zero), the "Repository Purge By Retention Count" criterion is used instead.
To enable/disable the cleanup of released snapshots in the repository, the user can opt to check or uncheck the "Delete Released Snapshots" option in the Add/Edit Repository page.
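
The selection rule described above is small enough to write down directly. The sketch below returns the purge implementations that would run for a given pair of settings, and also shows the kind of age test the days-based criterion implies. PurgePlanSketch and its method names are hypothetical, and the strict "older than" comparison is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

class PurgePlanSketch {
    // Lists which purge implementations apply for the given repository settings.
    static List<String> plan(boolean deleteReleasedSnapshots, int daysOlder) {
        List<String> steps = new ArrayList<>();
        if (deleteReleasedSnapshots) {
            steps.add("CleanupReleasedSnapshotsRepositoryPurge");
        }
        // A non-zero daysOlder selects the days-based purge; zero falls back to retention count.
        steps.add(daysOlder != 0 ? "DaysOldRepositoryPurge" : "RetentionCountRepositoryPurge");
        return steps;
    }

    // Assumption: "older" means strictly more than daysOlder days since last modification.
    static boolean isOlderThan(long lastModifiedMillis, long nowMillis, int daysOlder) {
        return nowMillis - lastModifiedMillis > TimeUnit.DAYS.toMillis(daysOlder);
    }

    public static void main(String[] args) {
        System.out.println(plan(true, 100));
        System.out.println(plan(false, 0));
    }
}
```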

The Process

RepositoryPurgeConsumer is executed during repository scanning. Only those "artifact" file types are consumed (<fileType> with "artifact" id in archiva.xml).
The consumer will check if the deleteReleasedSnapshots field (in RepositoryConfiguration) is enabled. If so, then it will execute CleanupReleasedSnapshotsRepositoryPurge.
- CleanupReleasedSnapshotsRepositoryPurge will remove all released snapshots from the repository. For example: 1.2, 1.3-SNAPSHOT and 1.3 exist for artifactX in the repo. 1.3-SNAPSHOT will be removed since 1.3 already exists (therefore it has already been released). All metadata files are updated based on the remaining versions of the artifact in the repository.
The consumer will also check the value of the daysOlder field in the configuration of the repository being scanned. If it is not set to 0 (zero), then the consumer will execute the DaysOldRepositoryPurge. Otherwise, it would execute the RetentionCountRepositoryPurge.
- DaysOldRepositoryPurge checks when the discovered SNAPSHOT artifact was last modified and if it is older by X (daysOlder value) days then the artifact will be removed from the repository.
- RetentionCountRepositoryPurge, on the other hand, checks whether the number of "unique versioned" snapshot artifacts in the directory where the discovered artifact resides is GREATER THAN the retentionCount value. If so, the oldest snapshot artifacts (including associated poms, source jars, javadoc jars, etc.) are removed until the total number of unique versioned artifacts is EQUAL TO the retention count. For example, the discovered artifact is ../artifactX/2.0-SNAPSHOT/artifactX-2.0-SNAPSHOT.jar. RetentionCountRepositoryPurge will get a list of the files in the ../artifactX/2.0-SNAPSHOT directory. Let's say ../artifactX/2.0-SNAPSHOT has the following contents: artifactX-2.0-1111111-1.jar, artifactX-2.0-1111111-1.pom, artifactX-2.0-1111100-2.jar, artifactX-2.0-1111100-2.pom, artifactX-2.0-SNAPSHOT.jar and artifactX-2.0-SNAPSHOT.pom. If the retention count is 2, then artifactX-2.0-1111111-1.jar and artifactX-2.0-1111111-1.pom are removed from the repo and the 2 newest artifacts (and their associated files, in this case the poms) are retained.
For all these RepositoryPurge implementations, all removed artifacts from the repository are also removed from the database.¹

¹ There is an open issue related to this, please see MRM-455. Aside from this, there is also an open issue regarding the index update after repo purge MRM-454.
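
The retention-count rule from the process above can be sketched as follows. Files are grouped by their unique snapshot version, the versions are ordered from oldest to newest, and every file belonging to the surplus oldest versions is selected for removal. RetentionCountSketch and SnapshotFile are hypothetical names. Ordering by last-modified time and passing the unique version in explicitly are assumptions made to keep the sketch short; it does not parse file names the way Archiva does.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class RetentionCountSketch {
    // Hypothetical description of one file in a snapshot directory.
    static class SnapshotFile {
        final String fileName;
        final String uniqueVersion; // e.g. "2.0-1111111-1" or "2.0-SNAPSHOT"
        final long lastModified;

        SnapshotFile(String fileName, String uniqueVersion, long lastModified) {
            this.fileName = fileName;
            this.uniqueVersion = uniqueVersion;
            this.lastModified = lastModified;
        }
    }

    // Returns the file names to remove so that only retentionCount unique versions remain.
    static List<String> filesToPurge(List<SnapshotFile> files, int retentionCount) {
        // Newest last-modified time seen for each unique version.
        Map<String, Long> newestByVersion = new HashMap<>();
        for (SnapshotFile file : files) {
            newestByVersion.merge(file.uniqueVersion, file.lastModified, Long::max);
        }
        if (newestByVersion.size() <= retentionCount) {
            return new ArrayList<>(); // nothing to purge
        }
        // Sort versions oldest first, then mark the surplus ones for removal.
        List<String> versions = new ArrayList<>(newestByVersion.keySet());
        versions.sort((a, b) -> Long.compare(newestByVersion.get(a), newestByVersion.get(b)));
        Set<String> surplus = new HashSet<>(versions.subList(0, versions.size() - retentionCount));

        List<String> toPurge = new ArrayList<>();
        for (SnapshotFile file : files) {
            if (surplus.contains(file.uniqueVersion)) {
                toPurge.add(file.fileName); // jar, pom and other files of that version
            }
        }
        return toPurge;
    }

    public static void main(String[] args) {
        List<SnapshotFile> files = new ArrayList<>();
        files.add(new SnapshotFile("artifactX-2.0-1111111-1.jar", "2.0-1111111-1", 100L));
        files.add(new SnapshotFile("artifactX-2.0-1111100-2.jar", "2.0-1111100-2", 200L));
        files.add(new SnapshotFile("artifactX-2.0-SNAPSHOT.jar", "2.0-SNAPSHOT", 300L));
        System.out.println(filesToPurge(files, 2));
    }
}
```

Grouping by version before deleting is what keeps a jar and its pom together: they are either both retained or both removed.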
