Definition

Goals

independent metadata, not dependent on Maven
avoid Maven metadata limitations (both of POM and maven-metadata.xml)
extensibility
multiple representations, space efficient
caching of frequently used information, which avoids the need for a database for non-query requests and provides the cached artifact state that other systems (Lucene, database, etc.) are based on
ability to store metadata separately from the original repository as well as within it; one metadata repository can represent multiple source repositories, possibly on a different server, with updates pushed to another server over JMS/REST

Attach arbitrary metadata to artifacts

other repository information, including some current repository configuration elements from archiva.xml
Maven metadata as it does now (POM, remains as an artifact and metadata)
OSGi information extracted from the JAR (eg package import, export)
Ivy metadata so we could bridge those repos and vice-versa
references to continuous build results
references to historical coverage, test, PMD, etc results
allowing users to add their own metadata types and attach them
Maven archetype information
Maven plugin information
indices (Archiva format, Nexus format)
resolved dependencies, not just declared dependencies
DOAP (http://trac.usefulinc.com/doap)
signatures can be included
RAT information
License information
APT/RPM information
Gump (http://gump.apache.org/metadata/index.html)
Eclipse p2 Installable Units (http://wiki.eclipse.org/Installable_Units)
Buckminster CSPEC (http://wiki.eclipse.org/Buckminster_Component_Specification)

Proposal

Research

Mercury

Mercury has been examined for reading and writing, but its metadata still seems bound to the Maven way of doing things, and it is not immediately obvious how an alternate implementation could be made without significant refactoring. There are also open questions as to whether it can be used for the repository layer in its current form. It may be worth revisiting in the future; the best investigation point for now is its use of the Jetty HTTP client for outgoing requests.

Basic artifact model is in place but doesn't contain a way to annotate with additional metadata.

Atom, AtomPub

Since everything is resource based AtomPub is a suitable format for publishing information about resources and exposing alternate representations as a REST-based API.

Atom can suitably describe the metadata through the use of extensions, however it is only of marginal benefit. It would make a suitable external representation, and can wrap the existing metadata format described here while expressing the common elements.

Both appear like good enhancements to Archiva, but not required as part of the foundation.

Kepler

Kepler adequately describes the goals of the metadata, particularly through facets (which were an original inspiration), however the current representation may be too verbose. It will be used to describe the initial model, and similar considerations will be taken into account when storing the data.

Examples:

Design

Definitions

Artifact (Resource)

An artifact is a single managed resource within the repository. It is always a single file and can be treated independently as far as Archiva is concerned. It may express relationships to other artifacts.

In the case of Maven, a POM is an artifact as much as the JAR that it corresponds to, however by processing the POM, Archiva will populate information in the JAR's metadata. As much as possible Archiva will attempt to store this common information efficiently.

Artifact Collection

An artifact collection is any collection of artifacts as defined above. In general, these will be related in some way, however this is not required. A collection will often be represented together in the API that uses the artifacts, but is not necessarily reflected in the physical storage of the artifacts or metadata.

Collections can contain other collections to facilitate other types of aggregation.

For example, a new Product type could be defined that aggregates projects within a Maven multi-module project and has information related to a group of individual projects.

Build

A build is an artifact collection that represents a unit of concurrently built artifacts with a matching version.

A build may represent a release or a "snapshot"; its permanence and history are tracked by other associated metadata. The build number must be unique for a project, but can take any form (1.1.3, 1.2-SNAPSHOT, timestamped snapshot, subversion revision number, incremental build number). Other pieces of version information can be attached as metadata even if not the primary identifier (eg, 1.1.3 is subversion rev XYZ).

If a build already exists, then depending on the policy a history of information may be kept, the build may be replaced, or the addition to the repository may be rejected. This accommodates normal release versions, non-unique snapshots and builds over time.
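
The three outcomes for an existing build can be sketched as a small in-memory store. This is only an illustration under stated assumptions: the names ExistingBuildPolicy, BuildStore and the use of a plain string per stored revision are hypothetical and not part of Archiva.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical policy names for the three behaviours described above.
enum ExistingBuildPolicy { KEEP_HISTORY, REPLACE, REJECT }

// Hypothetical in-memory store: build id -> revisions of that build's metadata.
final class BuildStore {
    private final Map<String, List<String>> revisionsByBuildId = new HashMap<>();
    private final ExistingBuildPolicy policy;

    BuildStore(ExistingBuildPolicy policy) {
        this.policy = policy;
    }

    // Returns true if the build was stored, false if the policy rejected it.
    boolean add(String buildId, String revision) {
        List<String> revisions = revisionsByBuildId.get(buildId);
        if (revisions == null) {
            // First time this build id is seen: always accepted.
            revisions = new ArrayList<>();
            revisions.add(revision);
            revisionsByBuildId.put(buildId, revisions);
            return true;
        }
        switch (policy) {
            case REJECT:
                return false;
            case REPLACE:
                revisions.clear();
                revisions.add(revision);
                return true;
            case KEEP_HISTORY:
                revisions.add(revision);
                return true;
            default:
                throw new IllegalStateException("Unknown policy: " + policy);
        }
    }

    List<String> revisions(String buildId) {
        List<String> revisions = revisionsByBuildId.get(buildId);
        return revisions == null
                ? Collections.<String>emptyList()
                : Collections.unmodifiableList(revisions);
    }
}
```

A release repository would typically use REJECT, while a non-unique snapshot repository could use REPLACE or KEEP_HISTORY.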

Project

A project is a collection of 0 or more builds, and a set of associated metadata universal to all builds of the project. This metadata may just represent the latest state and can change over time (and it may be revisioned accordingly).

Repository

A repository is a collection of 0 or more projects and a set of metadata about the repository itself. This metadata may just represent the latest state and can change over time (and it may be revisioned accordingly).

The repository is a canonical representation of all artifacts contained within it, however it may not represent a physical storage unit. A physical repository may at any time only contain a portion of the logical contents of the repository.

Identifiers

Each artifact must be uniquely identified by the following components:

the unique identifier of the artifact within a build
the unique identifier of the build within a project
the unique identifier of the project within a repository
the unique identifier of the repository

How each identifier is determined is up to the implementation of the repository.

For Maven 2, it is as follows:

project = groupId.artifactId
build = version
artifact = filename

Note: identifiers do not need to be able to be reverse engineered into their source components since that metadata is also stored (eg, groupId and artifactId).
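
As a minimal sketch of the Maven 2 mapping above, the identifiers could be derived as follows. The class and method names are hypothetical, and the file name pattern (artifactId-version, an optional classifier, then the extension) is the conventional Maven 2 layout assumed here rather than something this proposal defines.

```java
// Hypothetical helper; not part of Archiva or Maven.
final class MavenIdentifiers {
    private MavenIdentifiers() {
    }

    // project = groupId.artifactId
    static String projectId(String groupId, String artifactId) {
        return groupId + "." + artifactId;
    }

    // build = version
    static String buildId(String version) {
        return version;
    }

    // artifact = filename, assuming artifactId-version[-classifier].extension
    static String artifactId(String artifactId, String version,
                             String classifier, String extension) {
        StringBuilder name = new StringBuilder(artifactId).append('-').append(version);
        if (classifier != null && !classifier.isEmpty()) {
            name.append('-').append(classifier);
        }
        return name.append('.').append(extension).toString();
    }
}
```

For example, junit 3.8.1 maps to project junit.junit, build 3.8.1 and artifact junit-3.8.1.jar.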

Repository identifiers should be the base of a URI. This is not required to be, but is recommended to be, the base URI of the canonical location of the repository.

e.g. http://repo1.maven.org/maven2/

It is the responsibility of the creator of the repository to ensure the location is sufficiently unique. Any security measures based on the repository must take into account that the URI may be arbitrary.

Each artifact also may have a UUID generated, however this is only recommended if the canonical store of the repository is being operated on so that UUIDs do not change over time or differ between identical repositories. For repositories that will coordinate changes over multiple locations it is recommended that a master be identified to generate UUIDs for published artifacts that are kept permanently.

Each repository should have a scheme that can use either the identifier or required metadata (eg, the Maven identifier) to determine a URI for the artifact.

Therefore, an artifact has two potential unique references:

<RepositoryURI>/<unique-path-to-artifact-in-repository-scheme>
urn:uuid:<uuid>

These are not expected to be needed in initial iterations.
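
For illustration only, the two reference forms could be built with the standard java.net.URI and java.util.UUID classes. The class name ArtifactReferences and its methods are hypothetical, and the path passed in is assumed to come from whatever scheme the repository defines.

```java
import java.net.URI;
import java.util.UUID;

// Hypothetical helper for the two reference forms described above.
final class ArtifactReferences {
    private ArtifactReferences() {
    }

    // <RepositoryURI>/<unique-path-to-artifact-in-repository-scheme>
    static URI locationReference(URI repositoryUri, String pathInRepository) {
        // URI.resolve only appends to the base when the base ends with '/'
        // and the path is relative, so normalise both sides first.
        String base = repositoryUri.toString();
        if (!base.endsWith("/")) {
            base += "/";
        }
        String path = pathInRepository;
        while (path.startsWith("/")) {
            path = path.substring(1);
        }
        return URI.create(base).resolve(path);
    }

    // urn:uuid:<uuid>
    static URI uuidReference(UUID uuid) {
        return URI.create("urn:uuid:" + uuid);
    }
}
```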

Common model

Metadata will be described through a common object model, derived from the elements of the Maven POM and metadata as well as common elements of other systems, with a bare minimum provided for identification and the rest left to additional models provided through an extension mechanism.

The model is versioned, but is expected to be forwards and backwards compatible. Any unrecognised elements should be ignored, and deprecated elements should be retained though they may be migrated to a new internal representation.
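
One way to honour the "ignore unrecognised elements" rule when reading metadata is to pick out only the elements a given model version knows about. The following is a rough sketch using the JDK's DOM parser; the class name LenientProjectReader and the set of known element names are assumptions made for the example, not a defined Archiva API.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical reader that keeps known project elements and skips the rest.
final class LenientProjectReader {
    // Element names this (assumed) model version understands.
    private static final Set<String> KNOWN =
            new HashSet<>(Arrays.asList("id", "name", "description"));

    private LenientProjectReader() {
    }

    static Map<String, String> read(String projectXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(projectXml)));
            Map<String, String> values = new LinkedHashMap<>();
            NodeList children = doc.getDocumentElement().getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                // Unknown elements are skipped rather than treated as errors.
                if (child.getNodeType() == Node.ELEMENT_NODE
                        && KNOWN.contains(child.getLocalName())) {
                    values.put(child.getLocalName(), child.getTextContent().trim());
                }
            }
            return values;
        } catch (Exception e) {
            throw new IllegalArgumentException("Unreadable metadata", e);
        }
    }
}
```

A writer that needs to round-trip files produced by a newer version would have to preserve the skipped elements as well; this sketch only covers reading.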

Most of the model is provided by facets that are individually versioned, and should similarly be forwards and backwards compatible. Should an incompatible change need to be made, a facet should be replaced by an entirely new version.

Representations on disk and over the wire will be documented but not intrinsic to the design of the model.

The metadata should be able to be easily translated into other formats such as POM, DOAP, Ivy from a single internal representation (including extensions).

The model will be represented as a simple, persisted Java model to begin with. maven-shared-model should be considered as a candidate for future use cases such as model conversion and any needs for inheritance and merging.

Basic Information

Facet {
    created : Date
    updated : Date
}

Repository {
    uri : String
    name : String // unversioned
    facets : RepositoryFacet[0..*]
    // projects are not stored in the metadata
}

abstract RepositoryFacet : Facet {
}

Project {
    id : String
    created : Date
    updated : Date // if omitted, use created. Not all elements can be updated
    name : String
    description : String
    facets : ProjectFacet[0..*]
    builds : ProjectBuild[0..*]
}

abstract ProjectFacet : Facet {
}

Organization : ProjectFacet {
    name : String
    websiteUrl : String
}

ArtifactCollection {
    artifacts : Artifact[1..*]
}

ProjectBuild : ArtifactCollection {
    id : String
    created : Date
    updated : Date
    label : String // public label for the build, if omitted it matches id
    facets : ProjectFacet[0..*]
    relationships : Relationship[0..*]
}

Artifact {
    id : String
    created : Date
    updated : Date // should match last file modification timestamp
    sha1 : String
    uuid : String // optional
    facets : ArtifactFacet[0..*]
}

abstract ArtifactFacet : Facet {
}

Relationship {
    created : Date
    updated : Date
    optional : boolean
    facets : RelationshipFacet[0..*]
}

ArtifactRelationship : Relationship {
    projectId : String
    releaseId : String
    artifactId : String
}

BuildRangeRelationship : Relationship {
    projectId : String
    releaseRange : String
}

UUIDArtifactRelationship : Relationship {
    uuid : String
}

abstract RelationshipFacet : Facet {
}

Maven

MavenIdentifier : ProjectFacet {
    groupId : String
    artifactId : String
}

MavenResolvedDependencyTree : ProjectFacet {
    href : String // {artifact.id}-tree.xml
}

Licensing

LicensingFacet : ProjectFacet {
    licenses : License[1..*] // allowed to choose any one of the following to 
                             // use
}

License {
    name : String
    url : String
}

Collaboration Information

MailingListsFacet : ProjectFacet {
    mailingLists : MailingList[1..*]
}

MailingList {
    name : String
    unsubscribeEmailAddress : String
    subscribeEmailAddress : String
    postEmailAddress : String
    archiveUrls : String[0..*]
}

ParticipantsFacet : ProjectFacet {
    developers : Developer[0..*]
    contributors : Contributor[0..*]
}

Contributor {
    name : String
    emailAddress : String
    timezone : String
}

Developer : Contributor {
    id : String // unique within project namespace, may have various 
                // applications such as unix ID/subversion ID
}

Source Information

SourceStructureFacet : ProjectFacet {
    sourceDirectory : String
    testSourceDirectory : String
}

Maven Plugin Information

MavenPluginFacet : ProjectFacet {
    prefix : String
    goals : String[0..*]
}

Maven Archetype Information

MavenArchetypeFacet : ProjectFacet {
    ...
}

Repository Indexes

ArchivaRepositoryIndexFacet : RepositoryFacet {
    path : String
    lastIndexUpdate : Date
}

NexusRepositoryIndexFacet : RepositoryFacet {
    path : String
}

OSGi

OSGiMetadataFacet : ProjectFacet {
    importedPackages : String[0..*]
    exportedPackages : String[0..*]
    ...
}

Ivy

IvyRelationship : Relationship {
    ...
}

Relocation

RelocationRelationship : Relationship {
    previousProjectId : String
}

Signatures

PGPSignatureFacet : ArtifactFacet {
    username : String
    href : String // default to {artifact.id}.asc
}
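
As a rough idea of how a facet such as this might look in the simple Java model, including the default href, consider the sketch below. All class, field and method names are illustrative assumptions; the proposal does not prescribe a Java API.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Illustrative Java counterparts of the model elements above.
abstract class Facet {
    Instant created;
    Instant updated;
}

abstract class ArtifactFacet extends Facet {
}

final class PgpSignatureFacet extends ArtifactFacet {
    String username;
    String href; // optional; when absent the default below applies

    // Defaults to {artifact.id}.asc when no explicit href is stored.
    String resolveHref(Artifact artifact) {
        return href != null ? href : artifact.id + ".asc";
    }
}

final class Artifact {
    final String id;
    final List<ArtifactFacet> facets = new ArrayList<>();

    Artifact(String id) {
        this.id = id;
    }

    // Returns the first facet of the requested type, or null if none is attached.
    <T extends ArtifactFacet> T facet(Class<T> type) {
        for (ArtifactFacet facet : facets) {
            if (type.isInstance(facet)) {
                return type.cast(facet);
            }
        }
        return null;
    }
}
```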

Build results

BuildResultFacet : ProjectFacet {
    duration : long
    outputHref : String
    // OS details, etc
}

PMDResultFacet : ProjectFacet {
    numErrors : long
    resultHref : String
}

Implementation

The toolkit for manipulating the metadata model should be independent of Archiva for use in other applications.

The API should be designed as a virtual repository interface to isolate all manipulation of the metadata directory, and the corresponding repository manipulation.

Storage in Archiva

All metadata is to be stored outside of the physical repository storage for repositories where that exists (eg maven2). At present, all repository types will be considered to have one canonical repository and one derived metadata directory that is updated and repaired based on inconsistencies with the original storage.

In future, it may be possible to have additional repository types where the metadata repository is the canonical storage and the artifact storage is kept separate (perhaps on a different server entirely), with the possibility of cleaning up the storage based on the metadata repository state.

This will not affect Archiva's ability to define a virtual repository layout that can be used for requesting resources despite the backend storage.

The Archiva metadata directory is not required to support concurrent processes operating on it, so minimal locking and efficient caching can be employed.

Metadata Repository Format

The format for the metadata will be independent of the type of repository it represents.

The metadata directory will appear as follows:

metadata.xml - repository metadata
projectId/metadata.xml - project metadata
projectId/buildId/metadata.xml - build/artifact metadata

Each may be merged with the parent directory for a complete metadata picture for any given artifact. The format will be identical in each directory to make merging easier.
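
The files that contribute to the complete picture for one build can be listed from the layout above. This sketch only computes the paths, ordered from least to most specific; the class and method names are hypothetical, and the actual merge rules are left open because the proposal does not specify them.

```java
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for locating the metadata files described above.
final class MetadataLayout {
    private MetadataLayout() {
    }

    // Repository, project and build metadata files, in merge order.
    static List<Path> metadataChain(Path metadataRoot, String projectId, String buildId) {
        Path projectDir = metadataRoot.resolve(projectId);
        return Arrays.asList(
                metadataRoot.resolve("metadata.xml"),
                projectDir.resolve("metadata.xml"),
                projectDir.resolve(buildId).resolve("metadata.xml"));
    }
}
```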

Note that the metadata version is used across versions of the application, and is not the same as the version of Archiva.

Extensions may choose to store their model external to the metadata and reference it through an identifier. For example, see the dependency tree example earlier.

Each facet of the metadata will be timestamped inside the file so as not to rely entirely on the filesystem timestamp (though it may be used for additional efficiency in some use cases).

This will allow more efficient partial updates, for example:

adding/updating/removing metadata for plugins that were not previously present when an artifact was processed, and using a phased approach to initial population
publishing changesets for the repository with a minimal amount of information for more efficiency
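
A changeset could, for instance, be limited to the facets whose timestamp is newer than the last publication. The sketch below assumes facets are keyed by a name and carry an ISO-8601 timestamp like the ones in the examples that follow; the class name FacetChanges and the facet names used with it are hypothetical.

```java
import java.time.OffsetDateTime;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical helper that selects facets updated after a given point in time.
final class FacetChanges {
    private FacetChanges() {
    }

    static List<String> changedSince(Map<String, OffsetDateTime> facetTimestamps,
                                     OffsetDateTime since) {
        List<String> changed = new ArrayList<>();
        for (Map.Entry<String, OffsetDateTime> entry : facetTimestamps.entrySet()) {
            // isAfter compares the instants, so differing offsets are handled.
            if (entry.getValue().isAfter(since)) {
                changed.add(entry.getKey());
            }
        }
        Collections.sort(changed); // stable order for publishing
        return changed;
    }
}
```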

Repository Example

<repository xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xmlns="http://archiva.apache.org/metadata/repository/1.0"
  xmlns:nexus="http://www.eclipse.org/metadata/repository/facets/nexus">
  <uri>http://repo1.maven.org/maven2/</uri>
  <name>Maven Central Repository</name>
  <facet xsi:type="nexus:nexusIndex" created="2007-10-01T14:28:10.000+08:00">
    <nexus:path>.index</nexus:path>
  </facet>
</repository>

Project Example

<repository xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xmlns="http://archiva.apache.org/metadata/repository/1.0"
  xmlns:mavenidentifiers="http://www.eclipse.org/metadata/repository/project/mavenidentifiers">
  <project created="2007-10-01T14:28:10.000+08:00">
    <id>org.apache.commons.commons-io</id>
    <name>Commons IO</name>
    <description>Commons-IO contains utility classes, stream implementations, file filters, and endian classes.</description>
  </project>
</repository>

Build Example

<repository xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xmlns="http://archiva.apache.org/metadata/repository/1.0"
  xmlns:mavenidentifiers="http://www.eclipse.org/metadata/repository/project/mavenidentifiers">
  <project>
    <id>org.apache.commons.commons-io</id>
    <builds>
      <build>
        <id>1.3.1</id>
        <facet xsi:type="mavenidentifiers:mavenIdentifiers" created="2007-10-01T14:28:10.000+08:00">
          <mavenidentifiers:groupId>commons-io</mavenidentifiers:groupId>
          <mavenidentifiers:artifactId>commons-io</mavenidentifiers:artifactId>
        </facet>
        <relationships>
          <relationship xsi:type="artifactRelationship">
            <optional>false</optional>
            <projectId>junit.junit</projectId>
            <releaseId>3.8.1</releaseId>
            <artifactId>junit-3.8.1.jar</artifactId>
          </relationship>
        </relationships>
        <artifacts>
          <artifact updated="2007-10-01T14:28:10.000+08:00">
            <id>commons-io-1.3.1.jar</id>
            <sha1>2e55c05d3386889af97caae4517ac9df</sha1>
          </artifact>
          <artifact updated="2007-10-01T14:28:10.000+08:00">
            <id>commons-io-1.3.1.pom</id>
            <sha1>e3a7d29f7784a5b151cc40fe8a7270a9</sha1>
          </artifact>
        </artifacts>
      </build>
    </builds>
  </project>
</repository>

Future enhancements

Elements that change may be tracked via a history within a particular metadata file for versioning that does not match the version of artifacts.

Additional Security Considerations

Existing security infrastructure is in place (eg, PGP signatures of individual artifacts).

It will be possible to sign the metadata itself to build trust over the content of the repository rather than individual artifacts. This is left for a separate, later proposal.

Discussions

http://markmail.org/message/dordp67rfu5ozgkb
