General description of current (2.x) resolution

(this is a super-simplified description that focuses on background relevant to the specific problem being analyzed)

Today Maven roughly follows this recursive process to resolve dependencies:
Read and interpolate the current pom (or current dependency). Then pull out the list of dependencies and repositories from the pom. The repositories found are added to the existing list, which comes from the superpom (central), any repositories in settings.xml, and any pulled from previous poms during this Maven execution. These repositories are then used to resolve the next level of dependencies. Repeat.
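
The loop described above can be sketched as a toy model. This is only an illustration of the accumulate-as-you-walk behavior, not Maven's actual code: the `Pom` class, the `lookup` callback, the `HOSTED` map and all artifact and repository names are hypothetical, and repositories are assumed to be consulted in list order.

```python
from dataclasses import dataclass, field


@dataclass
class Pom:
    """Toy stand-in for an interpolated pom (hypothetical model)."""
    artifact: str
    repositories: list = field(default_factory=list)  # repository names declared in this pom
    dependencies: list = field(default_factory=list)  # artifacts this pom depends on


def resolve(root, lookup, initial_repos):
    """Walk the tree depth-first, accumulating repositories as poms are read."""
    repos = list(initial_repos)  # superpom (central) plus settings.xml
    resolved = []                # artifacts in the order they were resolved

    def visit(pom):
        # Repositories declared by this pom join the shared list and stay there.
        for repo in pom.repositories:
            if repo not in repos:
                repos.append(repo)
        for dep in pom.dependencies:
            if dep in resolved:
                continue
            dep_pom = lookup(dep, repos)
            if dep_pom is None:
                raise LookupError(f"{dep} not found in {repos}")
            resolved.append(dep)
            visit(dep_pom)  # repeat for the next level

    visit(root)
    return resolved, repos


# Hypothetical hosting: lib-a lives in central and points at repo-x for lib-b.
HOSTED = {
    "central": {"lib-a": Pom("lib-a", repositories=["repo-x"], dependencies=["lib-b"])},
    "repo-x": {"lib-b": Pom("lib-b")},
}


def lookup(artifact, repos):
    """Return the pom from the first repository in the list that serves the artifact."""
    for repo in repos:
        pom = HOSTED.get(repo, {}).get(artifact)
        if pom is not None:
            return pom
    return None


resolved, repos = resolve(Pom("app", dependencies=["lib-a"]), lookup, ["central"])
print(resolved)  # ['lib-a', 'lib-b']
print(repos)     # ['central', 'repo-x']
```

Note that `repo-x` is still in the list after `lib-b` has been found, so it takes part in every later lookup in the same run.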

If the artifact being resolved is a plugin, then only "pluginRepositories" are considered for resolution. Repositories and plugin repositories are filtered based on their declared snapshot/release flags and on whether the artifact being located is a snapshot or a release.
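
As a rough sketch of that filtering step: the `Repo` class, its field names and the example URLs below are assumptions made for illustration, and the only Maven convention relied on is that snapshot versions end in `-SNAPSHOT`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Repo:
    """Hypothetical model of one repository declaration."""
    url: str
    plugin: bool     # True if declared under pluginRepositories
    releases: bool   # release artifacts enabled
    snapshots: bool  # snapshot artifacts enabled


def candidate_repos(repos, is_plugin, version):
    """Keep only the repositories that may serve this artifact."""
    is_snapshot = version.endswith("-SNAPSHOT")
    return [
        r for r in repos
        if r.plugin == is_plugin
        and (r.snapshots if is_snapshot else r.releases)
    ]


REPOS = [
    Repo("https://repo.example.org/releases", plugin=False, releases=True, snapshots=False),
    Repo("https://repo.example.org/snapshots", plugin=False, releases=False, snapshots=True),
    Repo("https://repo.example.org/plugins", plugin=True, releases=True, snapshots=False),
]

print([r.url for r in candidate_repos(REPOS, False, "1.0-SNAPSHOT")])
# ['https://repo.example.org/snapshots']
print([r.url for r in candidate_repos(REPOS, True, "2.1")])
# ['https://repo.example.org/plugins']
```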

Benefits of the current approach

  1. If dependencies of your build aren't contained in central, you can add a repository entry to your pom and Maven will find them. The practical benefit of this is two-fold:
    1. Someone with a normal Maven install can check out and build your code without knowing ahead of time where your snapshots or any of your dependencies are located. It just works.
    2. Someone _depending_ on your artifacts doesn't have to know where your downstream dependencies are. (Note: this may be a bad thing in some cases because of repository pollution and non-deterministic resolution based on repos declared elsewhere.)

Problems with current approach

  1. Having repositories in the poms is usually bad. Released poms are immutable, but URLs change over time. This eventually leads to poms that point to incorrect URLs, so dependencies can no longer be located automatically.
  2. The repositories introduced by dependencies and transitive dependencies "pollute" the build. Once they are added to the list of repos, they stay there for the remainder of the build, and Maven doesn't know which dependencies the pom author intended to introduce via the repository declaration. A very concrete problem is that there is no guarantee that a given artifact present in two different repositories is the same in both. People take Apache artifacts and modify them, so if a different repository is used in a subsequent calculation you might get something different. Even with OSGi, where a bundle can theoretically be sourced from any repository, the behavior of that bundle is not guaranteed to be the same. People want to know that the set of artifacts they retrieved from a given set of repositories is more or less immutable, and to guarantee that absolutely you need a central authority to provide signatures. We know from practice that people change things all the time and it has dire effects on users.
  3. The current implementation of pluginRepositories is troublesome and doesn't do what was intended. Currently the dependencies of a plugin are resolved via the regular repositories (since they aren't plugins themselves). The practical effect is that every pluginRepo you introduce must also be introduced as a regular repo. The valid use case this attempted to solve is separating the dependencies needed by Maven plugins from those needed by the build itself. You will often want different policies for things used by Maven plugins than for things allowed into your build (think GPL artifacts).
  4. Walking the tree and discovering repositories as you go makes it difficult to do a bounded SAT range calculation. Each decision may introduce yet more repositories that could have affected previous calculations. It's not that this is unsolvable; it just takes too much time to be practical for a system like Maven. Waiting as long as P2 takes to figure itself out would be unacceptable in standard Maven CLI use.
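
To make Problem #2 concrete, here is a minimal sketch of how the same coordinates can yield different bytes depending on which repository happened to be discovered first. The repository names, coordinates and contents are invented for the example, and it assumes repositories are consulted in discovery order.

```python
import hashlib

# Hypothetical repositories serving the same coordinates with different bytes.
CONTENTS = {
    "repo-a": {"org.example:lib:1.0": b"original build"},
    "repo-b": {"org.example:lib:1.0": b"locally patched build"},
}


def first_match(coords, repo_order):
    """Return (repo, sha1) for the first repository in repo_order serving coords."""
    for repo in repo_order:
        data = CONTENTS.get(repo, {}).get(coords)
        if data is not None:
            return repo, hashlib.sha1(data).hexdigest()
    return None


# The discovery order depends on which poms were read earlier in the build.
run_1 = first_match("org.example:lib:1.0", ["repo-a", "repo-b"])
run_2 = first_match("org.example:lib:1.0", ["repo-b", "repo-a"])

print(run_1[0], run_2[0])    # repo-a repo-b
print(run_1[1] == run_2[1])  # False
```

Comparing checksums like this only detects the mismatch. Deciding which copy is the right one still needs some authority, which is the point about signatures above.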

Requirements of a final 3.x solution

  1. Maintain the ability for a user to check out your code, run mvn install, and have it work with no prior setup on their part.
  2. Be able to depend on some jar and not worry about any repositories required for transitive resolution (i.e. discover the repositories transitively as dependencies are processed). This is controversial and may be eliminated. First, it contributes to Problem #4 above in that SAT can't be done on a bounded list of repositories. Second, it doesn't normally work behind a repository manager, because the list of repos is usually controlled in the repo manager and autodiscovery is intentionally blocked, usually via a mirrorOf * that overrides the repos Maven finds in the poms.
  3. Be able to separate the dependencies needed by Maven plugins from those needed by the build. This covers not only where they are resolved from, but also how they are stored locally, to prevent cross-contamination.
  4. Repository identification: at this point we are pretty much in agreement that the URL should be the unique identifier for a repository. People who care about what they are publishing need to use canonical repositories like Maven central, guarantee the existence of their repositories, or have decent pointers. In a fully distributed system, the relocation mechanism we have does not work without a master to manage relocations.
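
For requirement #4, a minimal sketch of what "the URL is the identifier" could look like when the same repository is declared under several ids. The normalization rules (lower-casing scheme and host, dropping a trailing slash) and the ids are assumptions chosen for the example, not an agreed design.

```python
from urllib.parse import urlsplit, urlunsplit


def repo_key(url):
    """Normalize a repository URL so trivially different spellings compare equal."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    # Assumption: query string and fragment are not part of a repository's identity.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, "", ""))


def dedupe(declared):
    """Map each normalized URL to the first id it was declared under."""
    by_url = {}
    for repo_id, url in declared:
        by_url.setdefault(repo_key(url), repo_id)
    return by_url


# Hypothetical ids; the same repository appears twice under different ids.
DECLARED = [
    ("central", "http://repo2.maven.org/maven2/"),
    ("maven-central", "http://REPO2.maven.org/maven2"),
    ("sonatype-releases", "http://repository.sonatype.org/content/repositories/releases/"),
]

unique = dedupe(DECLARED)
print(len(unique))  # 2
print(unique["http://repo2.maven.org/maven2"])  # central
```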

Proposed Solution Details

To be determined and populated AFTER the benefits / drawbacks and specifically requirements are gathered.

7 Comments

  1. Another related issue is the identification of repositories. There is no standard for ids and Maven doesn't compare urls, which leads to the same repository appearing several times in the list under different ids.

  2. I agree with Arnaud. There is no URL to Id mapping, which is what causes everyone to create a new Id. URLs do change over time, but much more slowly than the Ids.

    The mirrorOf element would still be useful in this case. For those repos that have moved you can configure a mirror for the incorrect URL to point to the correct URL.

    Problem #4 of the current approach for bounded SAT calculation: isn't this theoretically unbounded but bounded in practice? I.e. there is only ever going to be a limited number of repositories defined?

    Problem #2 - can you explain why "polluting" the build is bad? What problems are caused when repositories are introduced that the pom author did not intend?
    If artifact versions are defined then it should not matter if the artifact is retrieved from repo A or repo B - as long as it is the same artifact.

  3. Arnaud, Bae, I addressed the unique repository API in 4) for the requirements of a Maven 3.x system. Bae, I addressed answers to your questions in #2 and #4 above. If these are sufficient I will remove the comments. I am going to try and merge comments from the Wiki and mailing list into the proposal so people don't have to walk over 50 comments to see what's been processed or not. Please let me know. I want to keep the noise out of the proposal as much as possible.

  4. Fair enough. Comments may be cleaned up.

    For 4) I'm still not convinced that the bounds are very large. This may be a symptom of the fact that people don't use their own repos that much.
    E.g. my work has 12 external repos defined in its repo manager. A couple of them are snapshots.

    Doesn't this just mean for the initial check it will need to go and locate these 12 repos (maybe recursively) and after finding no more repos do the SAT calculation?
    Then on subsequent invocations no more repos are added so there is no penalty.
    With something like a nexus index available on each repo this reduces the burden further?

    Jason, I must be missing something obvious if you are saying it takes too long.

  5. For repo identification, we already employ some sort of "metadata". It uses the URL; other properties like "id" and "name" are just "proposals" that other tools can accept and use, or that users can edit as it suits them (a kind of default value that you can edit if you want). Some examples:

    hosted repo: http://repo2.maven.org/maven2/.meta/repository-metadata.xml http://repository.sonatype.org/content/repositories/releases/.meta/repository-metadata.xml

    proxy repo: http://repository.sonatype.org/content/repositories/central/.meta/repository-metadata.xml

    group repo: http://repository.sonatype.org/content/repositories/public/.meta/repository-metadata.xml

    More docs about this: https://docs.sonatype.com/display/NX/Repository+Metadata+%28proposal%29

    This is a "proposal", with a very basic implementation in Nexus SVN. Nexus already uses it to resolve the "published" (downstream) mirror URLs automatically. There is no magic here (or in the implementation); it is mostly just a model whose sole purpose is to make a repo "self-described". Planned future features include a transparent "nearest mirror by geoip" resolution service, etc.

  6. Problem #2 is definitely a problem for me. A transitive dependency should not be able to introduce new repositories into the build because this makes the build less predictable. The more repositories that are introduced, the more chance that one of them could be unavailable or renamed. In my opinion, the <repositories> section should be removed from the POM and should be specified only in settings.xml. In order to make builds easier, maybe a settings.xml file could be added alongside the POM. This would keep the distinction clear, but would still allow for a simple svn checkout and build with no additional setup.

    Regarding requirement #4, one issue with using the repository URL as the ID can be seen in MANTTASKS-142. I don't have a problem with using the URL as an ID, as long as something else is used for the names of local repo metadata files.

  7. Unknown User (kenyth)

    First, I'm relatively new to Maven 2 and I used to use ivy in my projects.

    >In my opinion, the <repositories> section should be removed from the POM and should be specified only in settings.xml.
    >In order to make builds easier, maybe a settings.xml file could be added alongside the POM.

    This is exactly how ivy works.

    And I've a question to ask since I can't find any related documentation except for this page.

    Does "./.meta/repository-metadata.xml" (the .meta directory and the underlying xml file) matter in deciding whether the "repository" a URL points to is a real Maven repository? For my weekend project I set up a very simple internal repository based on a file:// URL. I've tried adding and removing the ".meta" directory, and artifact resolution works either way.

    So this is confusing me. If this is the case then what is the ".meta" directory for? FYI, a native ivy repository works without a counterpart of the ".meta" directory.

    Thanks in advance.