Definition

Goals

  • independent metadata, not dependent on Maven
  • avoid Maven metadata limitations (both of POM and maven-metadata.xml)
  • extensibility
  • multiple representations, space efficient
  • caching of frequently used information, which avoids the need for a database for non-query requests and serves as the cached artifact state that other systems (Lucene, database, etc) are based on
  • ability to store metadata separately from the original repository as well as within it; one metadata repository can represent multiple source repositories, possibly on a different server, with updates pushed to another server over JMS/REST

Attach arbitrary metadata to artifacts

  • other repository information, including some current repository configuration elements from archiva.xml
  • Maven metadata as it does now (the POM remains both an artifact and metadata)
  • OSGi information extracted from the JAR (eg package import, export)
  • Ivy metadata so we could bridge those repos and vice-versa
  • references to continuous build results
  • references to historical coverage, test, PMD, etc results
  • allowing users to add their own metadata types and attach them
  • Maven archetype information
  • Maven plugin information
  • indices (Archiva format, Nexus format)
  • resolved dependencies, not just declared dependencies
  • DOAP (http://trac.usefulinc.com/doap)
  • signatures can be included
  • RAT information
  • License information
  • APT/RPM information
  • Gump (http://gump.apache.org/metadata/index.html)
  • Eclipse p2 Installable Units (http://wiki.eclipse.org/Installable_Units)
  • Buckminster CSPEC (http://wiki.eclipse.org/Buckminster_Component_Specification)

Proposal

Research

Mercury

This has been looked at for reading and writing, but the metadata still seems wound up in the Maven way of things, and it is not immediately obvious how an alternate implementation could be made without significant refactoring. There are also open questions as to whether it can be used for the repository layer in its current form. It may be suitable in the future; the best investigation point for now is the use of the Jetty HTTP client for outgoing requests.

Basic artifact model is in place but doesn't contain a way to annotate with additional metadata.

Atom, AtomPub

Since everything is resource-based, AtomPub is a suitable format for publishing information about resources and exposing alternate representations as a REST-based API.

Atom can suitably describe the metadata through the use of extensions; however, it is only of marginal benefit. It would make a suitable external representation, and can envelop the existing metadata format described here while expressing the common elements.

Both appear to be good enhancements to Archiva, but are not required as part of the foundation.

Kepler

Kepler adequately describes the goals of the metadata, particularly through facets (an original inspiration); however, the current representation may be too verbose. It will be used to describe the initial model, and similar considerations will be taken into account when storing the data.

Examples:

Design

Definitions

Artifact (Resource)

An artifact is a single managed resource within the repository. It is always a single file and can be treated independently as far as Archiva is concerned. It may express relationships to other artifacts.

In the case of Maven, a POM is an artifact as much as the JAR that it corresponds to; however, by processing the POM, Archiva will populate information in the JAR's metadata. As much as possible, Archiva will attempt to store this common information efficiently.

Project

A project is a collection of project versions, and a set of associated metadata universal to all builds of the project. This metadata may just represent the latest state and can change over time (and it may be revisioned accordingly).

Repository

A repository is a collection of 0 or more projects and a set of metadata about the repository itself. This metadata may just represent the latest state and can change over time (and it may be revisioned accordingly).

The repository is a canonical representation of all artifacts contained within it, however it may not represent a physical storage unit. A physical repository may at any time only contain a portion of the logical contents of the repository.

Identifiers

Each artifact must be uniquely identified by the following components:

  • the unique identifier of the artifact within a project version
  • the unique identifier of the project version within a project
  • the unique identifier of the project within a repository
  • the unique identifier of the project namespace within a repository
  • the unique identifier of the repository

How each identifier is determined is up to the implementation of the repository.

For Maven 2, it is as follows:

  • namespace = groupId
  • project = artifactId
  • project version = version
  • artifact = filename
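
The Maven 2 mapping above can be sketched in Java as follows. This is only an illustration: the class and method names are hypothetical, and the file name rule assumes the default Maven 2 convention of artifactId-version[-classifier].extension.

```java
// Sketch only: MavenIdentifiers and its methods are hypothetical names.
final class MavenIdentifiers {
    final String namespace;      // groupId
    final String project;        // artifactId
    final String projectVersion; // version
    final String artifact;       // filename

    private MavenIdentifiers(String namespace, String project,
                             String projectVersion, String artifact) {
        this.namespace = namespace;
        this.project = project;
        this.projectVersion = projectVersion;
        this.artifact = artifact;
    }

    // Applies the mapping listed above, one Maven coordinate per identifier.
    static MavenIdentifiers of(String groupId, String artifactId,
                               String version, String filename) {
        return new MavenIdentifiers(groupId, artifactId, version, filename);
    }

    // Assumed default Maven 2 file name: artifactId-version[-classifier].extension
    static String defaultFilename(String artifactId, String version,
                                  String classifier, String extension) {
        String suffix = (classifier == null || classifier.isEmpty()) ? "" : "-" + classifier;
        return artifactId + "-" + version + suffix + "." + extension;
    }

    public static void main(String[] args) {
        MavenIdentifiers ids = of("commons-io", "commons-io", "1.3.1",
                defaultFilename("commons-io", "1.3.1", null, "jar"));
        System.out.println(ids.namespace + " / " + ids.project + " / "
                + ids.projectVersion + " / " + ids.artifact);
    }
}
```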

Note: identifiers do not need to be able to be reverse engineered into their source components since that metadata is also stored (eg, groupId and artifactId).

Repository identifiers should be the base of a URI. This is not required to be, but is recommended to be, the base URI of the canonical location of the repository.

It is the responsibility of the creator of the repository to ensure the location is sufficiently unique. Any security measures based on the repository must take into account that the URI may be arbitrary.

Each artifact also may have a UUID generated, however this is only recommended if the canonical store of the repository is being operated on so that UUIDs do not change over time or differ between identical repositories. For repositories that will coordinate changes over multiple locations it is recommended that a master be identified to generate UUIDs for published artifacts that are kept permanently.

Each repository should have a scheme that can use either the identifier or required metadata (eg, the Maven identifier) to determine a URI for the artifact.

Therefore, an artifact has two potential unique references:

  • <RepositoryURI>/<unique-path-to-artifact-in-repository-scheme>
  • urn:uuid:<uuid>

These are not expected to be needed in initial iterations.
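
As a rough sketch of the two reference forms, assuming the standard Maven 2 layout (dots in the groupId become path separators) as the repository scheme. The class and method names are hypothetical, and the UUID in main is an arbitrary sample value.

```java
import java.net.URI;
import java.util.UUID;

// Sketch only: ArtifactReferences is a hypothetical helper.
final class ArtifactReferences {

    // Maven 2 layout: groupId dots become slashes, then artifactId/version/filename.
    static String maven2Path(String groupId, String artifactId,
                             String version, String filename) {
        return groupId.replace('.', '/') + "/" + artifactId + "/" + version + "/" + filename;
    }

    // <RepositoryURI>/<unique-path-to-artifact-in-repository-scheme>
    static URI locationReference(URI repositoryUri, String pathInRepository) {
        // URI.resolve only appends cleanly when the base ends with '/'
        String base = repositoryUri.toString();
        URI normalised = base.endsWith("/") ? repositoryUri : URI.create(base + "/");
        return normalised.resolve(pathInRepository);
    }

    // urn:uuid:<uuid>
    static URI uuidReference(UUID uuid) {
        return URI.create("urn:uuid:" + uuid);
    }

    public static void main(String[] args) {
        URI repo = URI.create("http://repo1.maven.org/maven2/");
        String path = maven2Path("commons-io", "commons-io", "1.3.1", "commons-io-1.3.1.jar");
        System.out.println(locationReference(repo, path));
        System.out.println(uuidReference(UUID.fromString("123e4567-e89b-12d3-a456-426614174000")));
    }
}
```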

Common model

Metadata will be described through a common object model, derived from the elements of the Maven POM and metadata as well as common elements of other systems, with a bare minimum provided for identification and the rest left to additional models provided through an extension mechanism.

The model is versioned, but is expected to be forwards and backwards compatible. Any unrecognised elements should be ignored, and deprecated elements should be retained though they may be migrated to a new internal representation.
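
One way to honour the "ignore unrecognised elements" rule is for readers to pick out only the elements they know and skip the rest. The following is a minimal sketch using the JDK's DOM parser; the class name, the set of known elements and the sample element in main are assumptions for illustration, not part of the proposed format.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Sketch only: TolerantProjectReader is a hypothetical reader.
final class TolerantProjectReader {
    // Elements this (assumed) model version understands.
    private static final Set<String> KNOWN =
            new HashSet<>(Arrays.asList("id", "name", "description"));

    static Map<String, String> readKnownFields(String projectXml) {
        Map<String, String> fields = new LinkedHashMap<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Element project = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(projectXml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
            NodeList children = project.getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                if (child.getNodeType() == Node.ELEMENT_NODE
                        && KNOWN.contains(child.getLocalName())) {
                    fields.put(child.getLocalName(), child.getTextContent().trim());
                }
                // unknown elements, whitespace and comments are skipped silently
            }
        } catch (Exception e) {
            throw new IllegalArgumentException("metadata is not well-formed XML", e);
        }
        return fields;
    }

    public static void main(String[] args) {
        String xml = "<project><id>org.apache.commons.commons-io</id>"
                + "<name>Commons IO</name><addedInLaterVersion>x</addedInLaterVersion></project>";
        System.out.println(readKnownFields(xml));
    }
}
```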

Most of the model is provided by facets that are individually versioned, and should similarly be forwards and backwards compatible. Should a compatibility change need to be made, a facet should be replaced by an entirely new version.

Representations on disk and over the wire will be documented but not intrinsic to the design of the model.

The metadata should be able to be easily translated into other formats such as POM, DOAP, Ivy from a single internal representation (including extensions).

The model will be represented as a simple, persisted Java model to begin with. maven-shared-model should be considered as a candidate for future use cases such as model conversion and any needs for inheritance and merging.

Basic Information

Facet {
    created : Date
    updated : Date
}

Repository {
    uri : String
    name : String // unversioned
    facets : RepositoryFacet[0..*]
    // projects are not stored in the metadata
}

abstract RepositoryFacet : Facet {
}

Project {
    id : String
    created : Date
    updated : Date // if omitted, use created. Not all elements can be updated
    name : String
    description : String
    facets : ProjectFacet[0..*]
    builds : ProjectBuild[0..*]
}

abstract ProjectFacet : Facet {
}

Organization : ProjectFacet {
    name : String
    websiteUrl : String
}

ArtifactCollection {
    artifacts : Artifact[1..*]
}

ProjectBuild : ArtifactCollection {
    id : String
    created : Date
    updated : Date
    label : String // public label for the build, if omitted it matches id
    facets : ProjectFacet[0..*]
    relationships : Relationship[0..*]
}

Artifact {
    id : String
    created : Date
    updated : Date // should match last file modification timestamp
    sha1 : String
    uuid : String // optional
    facets : ArtifactFacet[0..*]
}

abstract ArtifactFacet : Facet {
}

Relationship {
    created : Date
    updated : Date
    optional : boolean
    facets : RelationshipFacet[0..*]
}

ArtifactRelationship : Relationship {
    projectId : String
    releaseId : String
    artifactId : String
}

BuildRangeRelationship : Relationship {
    projectId : String
    releaseRange : String
}

UUIDArtifactRelationship : Relationship {
    uuid : String
}

abstract RelationshipFacet : Facet {
}
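
A subset of the definitions above could be rendered as a simple Java model along these lines. This is a sketch under assumptions rather than the intended implementation: public fields stand in for accessors, java.util.Date is used for the Date type, and the getLabel() fallback and the constructor check are one reading of the "if omitted it matches id" comment and the [1..*] multiplicity.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Date;
import java.util.List;

abstract class Facet {
    Date created;
    Date updated;
}

abstract class ProjectFacet extends Facet { }

abstract class ArtifactFacet extends Facet { }

class Artifact {
    final String id;
    Date created;
    Date updated; // should match last file modification timestamp
    String sha1;
    String uuid;  // optional
    final List<ArtifactFacet> facets = new ArrayList<>();

    Artifact(String id) {
        this.id = id;
    }
}

class ProjectBuild {
    final String id;
    Date created;
    Date updated;
    String label; // public label for the build
    final List<Artifact> artifacts;
    final List<ProjectFacet> facets = new ArrayList<>();

    ProjectBuild(String id, List<Artifact> artifacts) {
        // ArtifactCollection declares artifacts as [1..*]
        if (artifacts == null || artifacts.isEmpty()) {
            throw new IllegalArgumentException("a build needs at least one artifact");
        }
        this.id = id;
        this.artifacts = Collections.unmodifiableList(new ArrayList<>(artifacts));
    }

    // If the label is omitted it matches the id.
    String getLabel() {
        return label != null ? label : id;
    }
}
```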

Maven

MavenIdentifier : ProjectFacet {
    groupId : String
    artifactId : String
}

MavenResolvedDependencyTree : ProjectFacet {
    href : String // {artifact.id}-tree.xml
}

Licensing

LicensingFacet : ProjectFacet {
    licenses : License[1..*] // allowed to choose any one of the following to 
                             // use
}

License {
    name : String
    url : String
}

Collaboration Information

MailingListsFacet : ProjectFacet {
    mailingLists : MailingList[1..*]
}

MailingList {
    name : String
    unsubscribeEmailAddress : String
    subscribeEmailAddress : String
    postEmailAddress : String
    archiveUrls : String[0..*]
}

ParticipantsFacet : ProjectFacet {
    developers : Developer[0..*]
    contributors : Contributor[0..*]
}

Contributor {
    name : String
    emailAddress : String
    timezone : String
}

Developer : Contributor {
    id : String // unique within project namespace, may have various 
                // applications such as unix ID/subversion ID
}

Source Information

SourceStructureFacet : ProjectFacet {
    sourceDirectory : String
    testSourceDirectory : String
}

Maven Plugin Information

MavenPluginFacet : ProjectFacet {
    prefix : String
    goals : String[0..*]
}

Maven Archetype Information

MavenArchetypeFacet : ProjectFacet {
    ...
}

Repository Indexes

ArchivaRepositoryIndexFacet : RepositoryFacet {
    path : String
    lastIndexUpdate : Date
}

NexusRepositoryIndexFacet : RepositoryFacet {
    path : String
}

OSGi

OSGiMetadataFacet : ProjectFacet {
    importedPackages : String[0..*]
    exportedPackages : String[0..*]
    ...
}

Ivy

IvyRelationship : Relationship {
    ...
}

Relocation

RelocationRelationship : Relationship {
    previousProjectId : String
}

Signatures

PGPSignatureFacet : ArtifactFacet {
    username : String
    href : String // default to {artifact.id}.asc
}

Build results

BuildResultFacet : ProjectFacet {
    duration : long
    outputHref : String
    // OS details, etc
}

PMDResultFacet : ProjectFacet {
    numErrors : long
    resultHref : String
}

Implementation

The toolkit for manipulating the metadata model should be independent of Archiva for use in other applications.

The API should be designed as a virtual repository interface to isolate all manipulation of the metadata directory, and the corresponding repository manipulation.
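
To illustrate the intent, a virtual repository interface might look something like the sketch below. Every name here (MetadataStore, its methods, the in-memory implementation) is hypothetical; the point is only that callers work through the interface and never touch the metadata directory directly, so a file-backed implementation could be substituted later.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical interface: the only way callers read or change metadata.
interface MetadataStore {
    void addArtifact(String projectId, String buildId, String artifactId);

    List<String> getArtifactIds(String projectId, String buildId);

    void removeBuild(String projectId, String buildId);
}

// In-memory stand-in; a file-backed version would implement the same interface.
final class InMemoryMetadataStore implements MetadataStore {
    private final Map<String, List<String>> artifactsByBuild = new LinkedHashMap<>();

    private static String key(String projectId, String buildId) {
        return projectId + "/" + buildId;
    }

    @Override
    public void addArtifact(String projectId, String buildId, String artifactId) {
        artifactsByBuild
                .computeIfAbsent(key(projectId, buildId), k -> new ArrayList<>())
                .add(artifactId);
    }

    @Override
    public List<String> getArtifactIds(String projectId, String buildId) {
        List<String> ids = artifactsByBuild.get(key(projectId, buildId));
        return ids == null
                ? Collections.<String>emptyList()
                : Collections.unmodifiableList(ids);
    }

    @Override
    public void removeBuild(String projectId, String buildId) {
        artifactsByBuild.remove(key(projectId, buildId));
    }
}
```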

Storage in Archiva

All metadata is to be stored outside of the physical repository storage for repositories where that exists (eg maven2). At present, all repository types will be considered to have one canonical repository and one derived metadata directory that is updated and repaired based on inconsistencies with the original storage.

In future, it may be possible to have additional repository types where the metadata repository is the canonical storage and the artifact storage is kept separate (perhaps on a different server entirely), with the possibility of cleaning up the storage based on the metadata repository state.

This will not affect Archiva's ability to define a virtual repository layout that can be used for requesting resources despite the backend storage.

The Archiva metadata directory is not required to support concurrent processes operating on it, so minimal locking and efficient caching can be employed.

Metadata Repository Format

The format for the metadata will be independent of the type of repository it represents.

The metadata directory will appear as follows:

  • metadata.xml - repository metadata
  • projectId/metadata.xml - project metadata
  • projectId/buildId/metadata.xml - build/artifact metadata

Each may be merged with the parent directory for a complete metadata picture for any given artifact. The format will be identical in each directory to make merging easier.
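
The layout above implies a fixed chain of files to consult for any artifact, from the repository level down to the build. A minimal sketch of resolving that chain, with hypothetical class and method names; how the files are actually merged is left open here.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Sketch only: MetadataLayout is a hypothetical helper.
final class MetadataLayout {
    static final String FILE_NAME = "metadata.xml";

    // Returns the files to merge, parent first: repository, project, build.
    static List<Path> mergeChain(Path metadataRoot, String projectId, String buildId) {
        List<Path> chain = new ArrayList<>();
        chain.add(metadataRoot.resolve(FILE_NAME));
        chain.add(metadataRoot.resolve(projectId).resolve(FILE_NAME));
        chain.add(metadataRoot.resolve(projectId).resolve(buildId).resolve(FILE_NAME));
        return chain;
    }

    public static void main(String[] args) {
        for (Path p : mergeChain(Paths.get("metadata"), "org.apache.commons.commons-io", "1.3.1")) {
            System.out.println(p);
        }
    }
}
```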

Note that the metadata version is used across versions of the application, and is not the same as the version of Archiva.

Extensions may choose to store their model external to the metadata and reference it through an identifier. For example, see the MavenResolvedDependencyTree facet earlier, which references the tree through an href.

Each facet of the metadata will be timestamped inside the file so as not to rely entirely on the filesystem timestamp (though it may be used for additional efficiency in some use cases).

This will allow more efficient partial updates, for example:

  • adding/updating/removing metadata for plugins that were not previously present when an artifact was processed, and using a phased approach to initial population
  • publishing changesets for the repository with a minimal amount of information for more efficiency
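
As an illustration of how per-facet timestamps could drive partial updates, the sketch below decides whether a facet needs to be regenerated by comparing its timestamp with the artifact's updated timestamp. The class name and the rule itself (a missing or older facet timestamp means refresh) are assumptions; the timestamp format is the one used in the examples in this document.

```java
import java.time.OffsetDateTime;

// Sketch only: FacetFreshness and its rule are hypothetical.
final class FacetFreshness {

    // facetTimestamp may be null when the facet has never been populated.
    static boolean needsRefresh(String facetTimestamp, String artifactUpdated) {
        if (facetTimestamp == null) {
            return true; // facet not present yet, eg a newly installed plugin
        }
        OffsetDateTime facetTime = OffsetDateTime.parse(facetTimestamp);
        OffsetDateTime artifactTime = OffsetDateTime.parse(artifactUpdated);
        // compares instants, so differing UTC offsets are handled
        return facetTime.isBefore(artifactTime);
    }

    public static void main(String[] args) {
        System.out.println(needsRefresh(null, "2007-10-01T14:28:10.000+08:00"));
        System.out.println(needsRefresh("2007-10-01T14:28:10.000+08:00",
                "2007-10-02T09:00:00.000+08:00"));
    }
}
```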

Repository Example

<repository xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xmlns="http://archiva.apache.org/metadata/repository/1.0"
  xmlns:nexus="http://www.eclipse.org/metadata/repository/facets/nexus">
  <uri>http://repo1.maven.org/maven2/</uri>
  <name>Maven Central Repository</name>
  <facet xsi:type="nexus:nexusIndex" created="2007-10-01T14:28:10.000+08:00">
    <nexus:path>.index</nexus:path>
  </facet>
</repository>

Project Example

<repository xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xmlns="http://archiva.apache.org/metadata/repository/1.0"
  xmlns:mavenidentifiers="http://www.eclipse.org/metadata/repository/project/mavenidentifiers">
  <project created="2007-10-01T14:28:10.000+08:00">
    <id>org.apache.commons.commons-io</id>
    <name>Commons IO</name>
    <description>Commons-IO contains utility classes, stream implementations, file filters, and endian classes.</description>
  </project>
</repository>

Build Example

<repository xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xmlns="http://archiva.apache.org/metadata/repository/1.0"
  xmlns:mavenidentifiers="http://www.eclipse.org/metadata/repository/project/mavenidentifiers">
  <project>
    <id>org.apache.commons.commons-io</id>
    <builds>
      <build>
        <id>1.3.1</id>
        <facet xsi:type="mavenidentifiers:mavenIdentifiers" created="2007-10-01T14:28:10.000+08:00">
          <mavenidentifiers:groupId>commons-io</mavenidentifiers:groupId>
          <mavenidentifiers:artifactId>commons-io</mavenidentifiers:artifactId>
        </facet>
        <relationships>
          <relationship xsi:type="artifactRelationship">
            <optional>false</optional>
            <projectId>junit.junit</projectId>
            <releaseId>3.8.1</releaseId>
            <artifactId>junit-3.8.1.jar</artifactId>
          </relationship>
        </relationships>
        <artifacts>
          <artifact updated="2007-10-01T14:28:10.000+08:00">
            <id>commons-io-1.3.1.jar</id>
            <sha1>2e55c05d3386889af97caae4517ac9df</sha1>
          </artifact>
          <artifact updated="2007-10-01T14:28:10.000+08:00">
            <id>commons-io-1.3.1.pom</id>
            <sha1>e3a7d29f7784a5b151cc40fe8a7270a9</sha1>
          </artifact>
        </artifacts>
      </build>
    </builds>
  </project>
</repository>
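
For reference, a SHA-1 digest is 160 bits, so a hex-encoded sha1 value is 40 characters long; the 32-character values in the example above should be read as placeholders. A minimal sketch of computing the value with the JDK's MessageDigest, assuming lowercase hex encoding (the class name is hypothetical):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch only: Sha1Hex is a hypothetical helper.
final class Sha1Hex {

    // Returns the lowercase hex SHA-1 digest of the given content.
    static String sha1Hex(byte[] content) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-1");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(content)) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is one of the algorithms every Java platform must provide
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sha1Hex("abc".getBytes(StandardCharsets.UTF_8)));
    }
}
```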

Future enhancements

Elements that change may be tracked via a history within a particular metadata file for versioning that does not match the version of artifacts.

Additional Security Considerations

Existing security infrastructure is in place (eg, PGP signatures of individual artifacts).

It will be possible to sign the metadata itself to build trust over the content of the repository rather than individual artifacts. This is left for a separate, later proposal.

Discussions

1 Comment

  1. Here's some information from back in the KEPLER days.

    http://joakim.erdfelt.com/kepler/common_project_model_freemind.html

    and

    http://joakim.erdfelt.com/kepler/common_models_matrix.html

    It's likely a bit out of date now, but at least it should allow us to form a clearer picture of the model(s) we are likely to encounter.