Definition
Goals
- independent metadata, not dependent on Maven
- avoid Maven metadata limitations (both of POM and maven-metadata.xml)
- extensibility
- multiple representations, space efficient
- caching of used information, which avoids the need for a database for non-query requests and serves as the cached artifact state that other systems (Lucene, database, etc.) are based on
- ability to store metadata separately from the original repository as well as within it; one metadata repo can represent multiple source repos, possibly on a different server, with updates pushed to another server over JMS/REST
Attach arbitrary metadata to artifacts
- other repository information, including some current repository configuration elements from archiva.xml
- Maven metadata as it is handled now (the POM remains both an artifact and metadata)
- OSGi information extracted from the JAR (eg package import, export)
- Ivy metadata so we could bridge those repos and vice-versa
- references to continuous build results
- references to historical coverage, test, PMD, etc results
- allowing users to add their own metadata types and attach them
- Maven archetype information
- Maven plugin information
- indices (Archiva format, Nexus format)
- resolved dependencies, not just declared dependencies
- DOAP (http://trac.usefulinc.com/doap)
- signatures can be included
- RAT information
- License information
- APT/RPM information
- Gump (http://gump.apache.org/metadata/index.html)
- Eclipse p2 Installable Units (http://wiki.eclipse.org/Installable_Units)
- Buckminster CSPEC (http://wiki.eclipse.org/Buckminster_Component_Specification)
Proposal
Research
Mercury
Mercury was examined for reading and writing, but its metadata still seems bound up in the Maven way of doing things, and it is not immediately obvious how an alternate implementation could be made without significant refactoring. There are also open questions as to whether it can be used for the repository layer in its current form. It may be worth revisiting in the future; the best investigation point for now is the use of the Jetty HTTP client for outgoing requests.
Basic artifact model is in place but doesn't contain a way to annotate with additional metadata.
Atom, AtomPub
Since everything is resource based AtomPub is a suitable format for publishing information about resources and exposing alternate representations as a REST-based API.
Atom can suitably describe the metadata through the use of extensions; however, this is only of marginal benefit. It would make a suitable external representation, and can envelop the existing metadata format described here while expressing the common elements.
Both appear to be good enhancements to Archiva, but are not required as part of the foundation.
Kepler
Kepler adequately describes the goals of the metadata, particularly through facets (an original inspiration); however, the current representation may be too verbose. It will be used to describe the initial model, and similar considerations will be taken into account when storing the data.
Examples:
- http://wiki.eclipse.org/Kepler_Project
- http://wiki.eclipse.org/Org.eclipse.jdt.junit_in_Kepler
- http://wiki.eclipse.org/Org.eclipse.equinox.common_in_Kepler
Design
Definitions
Artifact (Resource)
An artifact is a single managed resource within the repository. It is always a single file and can be treated independently as far as Archiva is concerned. It may express relationships to other artifacts.
In the case of Maven, a POM is an artifact as much as the JAR that it corresponds to; however, by processing the POM, Archiva will populate information in the JAR's metadata. As much as possible, Archiva will attempt to store this common information efficiently.
Artifact Collection
An artifact collection is any collection of artifacts as defined above. In general, these will be related in some way, although this is not required. A collection will often be represented together in the API that uses the artifacts, but may not necessarily be represented by the physical storage of the artifacts or metadata.
Collections can contain other collections to facilitate other types of aggregation.
For example, a new Product type could be defined that aggregates projects within a Maven multi-module project and has information related to a group of individual projects.
Build
A build is an artifact collection that represents a unit of concurrently built artifacts with a matching version.
A build may represent a release or a "snapshot" - the permanence and history is tracked by other associated metadata. The build number must be unique for a project, but can take any form (1.1.3, 1.2-SNAPSHOT, timestamped snapshot, subversion revision number, incremental build number). Other pieces of version information can be attached as metadata even if not the primary identifier (eg, 1.1.3 is subversion rev XYZ).
If a build already exists then, depending on the policy, a history of information may be kept, the build may be replaced, or the addition to the repository may be rejected. This accommodates normal release versions, non-unique snapshots and builds over time.
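The three outcomes for an existing build could be modelled roughly as follows. This is only a sketch: the names ExistingBuildPolicy, BuildStore, add and history are hypothetical, and a build is reduced to a string payload keyed by its build identifier.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical policy names; the proposal only says that a history may be
// kept, the build may be replaced, or the addition may be rejected.
enum ExistingBuildPolicy { KEEP_HISTORY, REPLACE, REJECT }

// Hypothetical in-memory store for one project:
// build id -> retained versions of that build, oldest first.
final class BuildStore {
    private final Map<String, List<String>> builds = new HashMap<>();
    private final ExistingBuildPolicy policy;

    BuildStore(ExistingBuildPolicy policy) {
        this.policy = policy;
    }

    // Returns true if the build was stored, false if the policy rejected it.
    boolean add(String buildId, String content) {
        List<String> versions = builds.get(buildId);
        if (versions == null) {
            // First time this build id is seen: always accepted.
            versions = new ArrayList<>();
            builds.put(buildId, versions);
        } else if (policy == ExistingBuildPolicy.REJECT) {
            return false;
        } else if (policy == ExistingBuildPolicy.REPLACE) {
            versions.clear();
        }
        versions.add(content);
        return true;
    }

    // All retained versions of a build, oldest first (empty if unknown).
    List<String> history(String buildId) {
        return new ArrayList<>(builds.getOrDefault(buildId, new ArrayList<>()));
    }
}
```

Under these assumptions a release repository might use REJECT, a non-unique snapshot repository REPLACE, and a repository tracking builds over time KEEP_HISTORY.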
Project
A project is a collection of 0 or more builds, and a set of associated metadata universal to all builds of the project. This metadata may just represent the latest state and can change over time (and it may be revisioned accordingly).
Repository
A repository is a collection of 0 or more projects and a set of metadata about the repository itself. This metadata may just represent the latest state and can change over time (and it may be revisioned accordingly).
The repository is a canonical representation of all artifacts contained within it, however it may not represent a physical storage unit. A physical repository may at any time only contain a portion of the logical contents of the repository.
Identifiers
Each artifact must be uniquely identified by the following components:
- the unique identifier of the artifact within a build
- the unique identifier of the build within a project
- the unique identifier of the project within a repository
- the unique identifier of the repository
How each identifier is determined is up to the implementation of the repository.
For Maven 2, it is as follows:
- project = groupId.artifactId
- build = version
- artifact = filename
Note: identifiers do not need to be able to be reverse engineered into their source components since that metadata is also stored (eg, groupId and artifactId).
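The Maven 2 mapping above can be sketched in a few lines. The class and method names here are hypothetical, and the filename rule (artifactId-version[-classifier].extension) is an assumption based on the usual Maven 2 naming; timestamped snapshots, for example, would need different handling.

```java
// Hypothetical helper; only the mapping itself comes from the text above.
final class Maven2Identifiers {
    private Maven2Identifiers() {}

    // project = groupId.artifactId
    static String projectId(String groupId, String artifactId) {
        return groupId + "." + artifactId;
    }

    // build = version
    static String buildId(String version) {
        return version;
    }

    // artifact = filename, assumed here to follow
    // artifactId-version[-classifier].extension
    static String artifactId(String artifactId, String version,
                             String classifier, String extension) {
        String suffix = (classifier == null || classifier.isEmpty())
                ? ""
                : "-" + classifier;
        return artifactId + "-" + version + suffix + "." + extension;
    }
}
```

For groupId junit, artifactId junit and version 3.8.1, this gives the project identifier junit.junit, the build identifier 3.8.1 and the artifact identifier junit-3.8.1.jar.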
Repository identifiers should be the base of a URI. This is not required to be, but is recommended to be, the base URI of the canonical location of the repository.
It is the responsibility of the creator of the repository to ensure the location is sufficiently unique. Any security measures based on the repository must take into account that the URI may be arbitrary.
Each artifact also may have a UUID generated, however this is only recommended if the canonical store of the repository is being operated on so that UUIDs do not change over time or differ between identical repositories. For repositories that will coordinate changes over multiple locations it is recommended that a master be identified to generate UUIDs for published artifacts that are kept permanently.
Each repository should have a scheme that can use either the identifier or required metadata (eg, the Maven identifier) to determine a URI for the artifact.
Therefore, an artifact has two potential unique references:
<RepositoryURI>/<unique-path-to-artifact-in-repository-scheme>
urn:uuid:<uuid>
These are not expected to be needed in initial iterations.
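For illustration, the two reference forms could be built as below. The class and method names are hypothetical, and the path passed in is assumed to have already been produced by the repository's own scheme.

```java
import java.net.URI;
import java.util.UUID;

// Hypothetical helper for the two reference forms described above.
final class ArtifactReferences {
    private ArtifactReferences() {}

    // <RepositoryURI>/<unique-path-to-artifact-in-repository-scheme>
    static URI byLocation(URI repositoryUri, String pathInRepository) {
        String base = repositoryUri.toString();
        if (!base.endsWith("/")) {
            base = base + "/";
        }
        String path = pathInRepository.startsWith("/")
                ? pathInRepository.substring(1)
                : pathInRepository;
        // Plain concatenation keeps the full repository base even when it
        // has no trailing slash.
        return URI.create(base + path);
    }

    // urn:uuid:<uuid>
    static URI byUuid(UUID uuid) {
        return URI.create("urn:uuid:" + uuid);
    }
}
```

With the repository URI http://repo1.maven.org/maven2/ and a Maven 2 layout path such as commons-io/commons-io/1.3.1/commons-io-1.3.1.jar (an assumed layout, not defined in this proposal), byLocation simply joins the two with a single slash.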
Common model
Metadata will be described through a common object model, derived from the elements of the Maven POM and metadata as well as common elements of other systems, with a bare minimum provided for identification and the rest left to additional models provided through an extension mechanism.
The model is versioned, but is expected to be forwards and backwards compatible. Any unrecognised elements should be ignored, and deprecated elements should be retained though they may be migrated to a new internal representation.
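The rule that unrecognised elements are ignored could look like this in a reader. This is a sketch using the JDK's DOM parser: TolerantReader and readProject are hypothetical names, and the set of known elements is taken from the Project type in the model below, not from any existing Archiva code.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical reader that keeps the elements it knows and skips the rest.
final class TolerantReader {
    // Elements this (hypothetical) reader version understands.
    private static final Set<String> KNOWN =
            new HashSet<>(Arrays.asList("id", "name", "description"));

    private TolerantReader() {}

    static Map<String, String> readProject(String xml) {
        Element project;
        try {
            project = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Unreadable metadata", e);
        }
        Map<String, String> fields = new LinkedHashMap<>();
        NodeList children = project.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            boolean known = child.getNodeType() == Node.ELEMENT_NODE
                    && KNOWN.contains(child.getNodeName());
            if (known) {
                fields.put(child.getNodeName(), child.getTextContent().trim());
            }
            // Anything else (unknown elements, whitespace) is skipped
            // rather than treated as an error.
        }
        return fields;
    }
}
```

A file written by a newer version of the model, containing an element this reader has never heard of, is then still readable for the elements it does understand.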
Most of the model is provided by facets that are individually versioned, and should similarly be forwards and backwards compatible. Should an incompatible change need to be made, a facet should be replaced by an entirely new version.
Representations on disk and over the wire will be documented but not intrinsic to the design of the model.
The metadata should be able to be easily translated into other formats such as POM, DOAP, Ivy from a single internal representation (including extensions).
The model will be represented as a simple, persisted Java model to begin with. maven-shared-model should be considered as a candidate for future use cases such as model conversion and any needs for inheritance and merging.
Basic Information
    Facet {
        created : Date
        updated : Date
    }

    Repository {
        uri : String
        name : String
        // unversioned
        facets : RepositoryFacet[0..*]
        // projects are not stored in the metadata
    }

    abstract RepositoryFacet : Facet {
    }

    Project {
        id : String
        created : Date
        updated : Date // if omitted, use created. Not all elements can be updated
        name : String
        description : String
        facets : ProjectFacet[0..*]
        builds : ProjectBuild[0..*]
    }

    abstract ProjectFacet : Facet {
    }

    Organization : ProjectFacet {
        name : String
        websiteUrl : String
    }

    ArtifactCollection {
        artifacts : Artifact[1..*]
    }

    ProjectBuild : ArtifactCollection {
        id : String
        created : Date
        updated : Date
        label : String // public label for the build, if omitted it matches id
        facets : ProjectFacet[0..*]
        relationships : Relationship[0..*]
    }

    Artifact {
        id : String
        created : Date
        updated : Date // should match last file modification timestamp
        sha1 : String
        uuid : String // optional
        facets : ArtifactFacet[0..*]
    }

    abstract ArtifactFacet : Facet {
    }

    Relationship {
        created : Date
        updated : Date
        optional : boolean
        facets : RelationshipFacet[0..*]
    }

    ArtifactRelationship : Relationship {
        projectId : String
        releaseId : String
        artifactId : String
    }

    BuildRangeRelationship : Relationship {
        projectId : String
        releaseRange : String
    }

    UUIDArtifactRelationship : Relationship {
        uuid : String
    }

    abstract RelationshipFacet : Facet {
    }
Maven
    MavenIdentifier : ProjectFacet {
        groupId : String
        artifactId : String
    }

    MavenResolvedDependencyTree : ProjectFacet {
        href : String // {artifact.id}-tree.xml
    }
Licensing
    LicensingFacet : ProjectFacet {
        licenses : License[1..*] // allowed to choose any one of the following to use
    }

    License {
        name : String
        url : String
    }
Collaboration Information
    MailingListsFacet : ProjectFacet {
        mailingLists : MailingList[1..*]
    }

    MailingList {
        name : String
        unsubscribeEmailAddress : String
        subscribeEmailAddress : String
        postEmailAddress : String
        archiveUrls : String[0..*]
    }

    ParticipantsFacet : ProjectFacet {
        developers : Developer[0..*]
        contributors : Contributor[0..*]
    }

    Contributor {
        name : String
        emailAddress : String
        timezone : String
    }

    Developer : Contributor {
        // unique within project namespace, may have various
        // applications such as unix ID/subversion ID
        id : String
    }
Source Information
    SourceStructureFacet : ProjectFacet {
        sourceDirectory : String
        testSourceDirectory : String
    }
Maven Plugin Information
    MavenPluginFacet : ProjectFacet {
        prefix : String
        goals : String[0..*]
    }
Maven Archetype Information
MavenArchetypeFacet : ProjectFacet { ... }
Repository Indexes
    ArchivaRepositoryIndexFacet : RepositoryFacet {
        path : String
        lastIndexUpdate : Date
    }

    NexusRepositoryIndexFacet : RepositoryFacet {
        path : String
    }
OSGi
    OSGiMetadataFacet : ProjectFacet {
        importedPackages : String[0..*]
        exportedPackages : String[0..*]
        ...
    }
Ivy
IvyRelationship : Relationship { ... }
Relocation
    RelocationRelationship : Relationship {
        previousProjectId : String
    }
Signatures
    PGPSignatureFacet : ArtifactFacet {
        username : String
        href : String // default to {artifact.id}.asc
    }
Build results
    BuildResultFacet : ProjectFacet {
        duration : long
        outputHref : String
        // OS details, etc
    }

    PMDResultFacet : ProjectFacet {
        numErrors : long
        resultHref : String
    }
Implementation
The toolkit for manipulating the metadata model should be independent of Archiva for use in other applications.
The API should be designed as a virtual repository interface to isolate all manipulation of the metadata directory, and the corresponding repository manipulation.
Storage in Archiva
All metadata is to be stored outside of the physical repository storage for repositories where that exists (eg maven2). At present, all repository types will be considered to have one canonical repository and one derived metadata directory that is updated and repaired based on inconsistencies with the original storage.
In future, it may be possible to have additional repository types where the metadata repository is the canonical storage and the artifact storage is kept separate (perhaps on a different server entirely), with the possibility of cleaning up the storage based on the metadata repository state.
This will not affect Archiva's ability to define a virtual repository layout that can be used for requesting resources despite the backend storage.
The Archiva metadata directory is not required to be able to be operated on by concurrent processes, so minimal locking and efficient caching can be employed.
Metadata Repository Format
The format for the metadata will be independent of the type of repository it represents.
The metadata directory will appear as follows:
- metadata.xml - repository metadata
- projectId/metadata.xml - project metadata
- projectId/buildId/metadata.xml - build/artifact metadata
Each file may be merged with those in its parent directories to give a complete metadata picture for any given artifact. The format will be identical in each directory to make merging easier.
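The layout and merge can be sketched as follows. MetadataLayout, filesFor and merge are hypothetical names, metadata is simplified to flat key/value pairs, and the rule that a more specific level overrides a more general one is an assumption; the proposal does not specify conflict handling.

```java
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for locating and combining the three metadata levels.
final class MetadataLayout {
    static final String FILE_NAME = "metadata.xml";

    private MetadataLayout() {}

    // Files contributing to one build's metadata, most general first:
    // repository, then project, then build.
    static List<Path> filesFor(Path metadataRoot, String projectId, String buildId) {
        List<Path> files = new ArrayList<>();
        files.add(metadataRoot.resolve(FILE_NAME));
        Path projectDir = metadataRoot.resolve(projectId);
        files.add(projectDir.resolve(FILE_NAME));
        files.add(projectDir.resolve(buildId).resolve(FILE_NAME));
        return files;
    }

    // Assumed merge rule: levels are applied in order, so values from a
    // later (more specific) level replace those from an earlier one.
    static Map<String, String> merge(List<Map<String, String>> levels) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (Map<String, String> level : levels) {
            merged.putAll(level);
        }
        return merged;
    }
}
```

Because the format is identical at each level, the same reader can be applied to every file returned by filesFor before the results are merged.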
Note that the metadata version is used across versions of the application, and is not the same as the version of Archiva.
Extensions may choose to store their model external to the metadata and reference it through an identifier. For example, see the dependency tree example earlier.
Each facet of the metadata will be timestamped inside the file so as not to rely entirely on the filesystem timestamp (though it may be used for additional efficiency in some use cases).
This will allow more efficient partial updates, for example:
- adding/updating/removing metadata for plugins that were not previously present when an artifact was processed, and using a phased approach to initial population
- publishing changesets for the repository with a minimal amount of information for more efficiency
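As a rough illustration of how per-facet timestamps support partial updates, a changeset can be computed by comparing each facet's updated time with the time of the last publication. FacetChangeset and changedSince are hypothetical names, facets are reduced to a name and a timestamp, and java.time.Instant is used here in place of the Date type in the model.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper that selects the facets to include in a changeset.
final class FacetChangeset {
    private FacetChangeset() {}

    // facetUpdated maps a facet name to its "updated" timestamp.
    // Returns the names of facets updated strictly after `since`,
    // in the iteration order of the given map.
    static List<String> changedSince(Map<String, Instant> facetUpdated, Instant since) {
        List<String> changed = new ArrayList<>();
        for (Map.Entry<String, Instant> entry : facetUpdated.entrySet()) {
            if (entry.getValue().isAfter(since)) {
                changed.add(entry.getKey());
            }
        }
        return changed;
    }
}
```

Only the facets returned would need to be rewritten or published, which keeps changesets small when a single plugin adds or refreshes its own facet.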
Repository Example
    <repository xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                xmlns="http://archiva.apache.org/metadata/repository/1.0"
                xmlns:nexus="http://www.eclipse.org/metadata/repository/facets/nexus">
      <uri>http://repo1.maven.org/maven2/</uri>
      <name>Maven Central Repository</name>
      <facet xsi:type="nexus:nexusIndex" created="2007-10-01T14:28:10.000+08:00">
        <nexus:path>.index</nexus:path>
      </facet>
    </repository>
Project Example
    <repository xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                xmlns="http://archiva.apache.org/metadata/repository/1.0"
                xmlns:mavenidentifiers="http://www.eclipse.org/metadata/repository/project/mavenidentifiers">
      <project created="2007-10-01T14:28:10.000+08:00">
        <id>org.apache.commons.commons-io</id>
        <name>Commons IO</name>
        <description>Commons-IO contains utility classes, stream implementations, file filters, and endian classes.</description>
      </project>
    </repository>
Build Example
    <repository xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                xmlns="http://archiva.apache.org/metadata/repository/1.0"
                xmlns:mavenidentifiers="http://www.eclipse.org/metadata/repository/project/mavenidentifiers">
      <project>
        <id>org.apache.commons.commons-io</id>
        <builds>
          <build>
            <id>1.3.1</id>
            <facet xsi:type="mavenidentifiers:mavenIdentifiers" created="2007-10-01T14:28:10.000+08:00">
              <mavenidentifiers:groupId>commons-io</mavenidentifiers:groupId>
              <mavenidentifiers:artifactId>commons-io</mavenidentifiers:artifactId>
            </facet>
            <relationships>
              <relationship xsi:type="artifactRelationship">
                <optional>false</optional>
                <projectId>junit.junit</projectId>
                <releaseId>3.8.1</releaseId>
                <artifactId>junit-3.8.1.jar</artifactId>
              </relationship>
            </relationships>
            <artifacts>
              <artifact updated="2007-10-01T14:28:10.000+08:00">
                <id>commons-io-1.3.1.jar</id>
                <sha1>2e55c05d3386889af97caae4517ac9df</sha1>
              </artifact>
              <artifact updated="2007-10-01T14:28:10.000+08:00">
                <id>commons-io-1.3.1.pom</id>
                <sha1>e3a7d29f7784a5b151cc40fe8a7270a9</sha1>
              </artifact>
            </artifacts>
          </build>
        </builds>
      </project>
    </repository>
Future enhancements
Elements that change may be tracked via a history within a particular metadata file for versioning that does not match the version of artifacts.
Additional Security Considerations
Existing security infrastructure is in place (eg, PGP signatures of individual artifacts).
It will be possible to sign the metadata itself to build trust over the content of the repository rather than individual artifacts. This is left for a separate, later proposal.