Status

Current state: Under discussion

Discussion thread: https://mail-archives.apache.org/mod_mbox/lucene-dev/201912.mbox/%3cC75FB0F5-3397-4175-B4D6-8E31120C795A@gmail.com%3e

JIRAs: Before carrying this forward, please look at the discussions at these JIRAs.

Released: None

Motivation

There are certain situations in which modifying an existing index is necessary and re-indexing from the system-of-record is either undesirable or impossible. This is a proposal to allow certain "safe" operations to be added to Solr, possibly as a package or contrib out of the box.

Public Interfaces

TBD.

Proposed Changes

The process can be summarized as follows:

  • Provide a merge policy that pre-configures “safe” operations.
    • TBD: how to configure? Expand the collection property that is used to specify the custom merge policy?
  • Allow a collection property to be specified that overrides the merge policy used by a particular collection.
  • The merge policy must rewrite all segments, respecting the maximum segment size. As the segments are rewritten, the transformations will be applied.
    • TBD: does it make sense to merge segments to maxSegSizeMB by default?
    • This is really a forceMerge that doesn’t skip segments with no deleted documents.
  • This is not a Lucene-level operation in the sense that it must be “schema aware”. Any transformations that can be performed by inspection should be. For instance:
    • The new schema may have docValues added to 5 fields. They should all be added in one pass.
    • We need to be cautious about what we support. For instance, removing fields not found in the schema has too many ways it could go wrong.
  • Nice to have: The ability to see progress per collection and per cluster
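
As a rough illustration of the "rewrite all segments, respecting the maximum segment size" step, here is a minimal sketch in Python. It is not Lucene code: `Segment`, `plan_rewrite`, and `max_seg_size_mb` are hypothetical names, and the greedy grouping is just one possible strategy. The sketch also assumes the still-open TBD above is answered with "yes, merge up to the maximum size". The point it shows is that, unlike a normal forceMerge, no segment is skipped, even one with no deleted documents or one already at the size limit.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Segment:
    # Hypothetical stand-in for a Lucene segment; only the fields the sketch needs.
    name: str
    size_mb: float
    deleted_docs: int = 0


def plan_rewrite(segments: List[Segment], max_seg_size_mb: float) -> List[List[Segment]]:
    """Group EVERY segment into a merge so that each one gets rewritten.

    Segments are taken largest first and packed greedily; a group is closed
    as soon as adding the next segment would exceed max_seg_size_mb.
    A segment at or above the limit ends up in a merge of its own,
    which still forces it to be rewritten.
    """
    merges: List[List[Segment]] = []
    current: List[Segment] = []
    current_size = 0.0
    for seg in sorted(segments, key=lambda s: s.size_mb, reverse=True):
        if current and current_size + seg.size_mb > max_seg_size_mb:
            merges.append(current)
            current, current_size = [], 0.0
        current.append(seg)
        current_size += seg.size_mb
    if current:
        merges.append(current)
    return merges
```

With segments of 5000, 3000, 1500, and 400 MB and a 5000 MB limit, this plans two merges: the 5000 MB segment alone, and the other three together.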

Compatibility, Deprecation, and Migration Plan

  • There should be no compatibility issues here: the index changes must be safe if they're done at all. "Safe" here means the result is identical to an index written by the usual indexing process with the same changes in place.
  • This will allow users to perform various maintenance operations without having to re-index from the original content, which is sometimes impossible.
  • Should this idea be adopted, we need to define exactly how to include it. A package? (who maintains?) Part of core Solr?

Test Plan

Test plans should be built from existing test plans for functionality. For instance:

  • We currently have many test plans for ensuring docValues work correctly. We could build a test that starts with an index without docValues, does the rewrite, and runs selected docValues tests.
  • Similarly for raw searching. Build an index in a test case that has a stored but not indexed field. Search on that field, finding zero documents. Do the rewrite and search again, this time finding documents.
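
The second test can be sketched against a toy in-memory index. This is only meant to show the shape of the assertion sequence (search finds nothing, rewrite, search finds the document). `ToyIndex` and its methods are hypothetical and stand in for a real Solr test harness, which would be Java.

```python
class ToyIndex:
    """Tiny in-memory model: stored values plus postings for indexed fields only."""

    def __init__(self, indexed_fields):
        self.indexed_fields = set(indexed_fields)
        self.docs = []       # stored field values, one dict per document
        self.postings = {}   # (field, term) -> set of document ids

    def _index(self, doc_id, field, value):
        # Whitespace tokenization with lowercasing, purely for illustration.
        for term in str(value).lower().split():
            self.postings.setdefault((field, term), set()).add(doc_id)

    def add(self, doc):
        doc_id = len(self.docs)
        self.docs.append(dict(doc))
        for field, value in doc.items():
            if field in self.indexed_fields:
                self._index(doc_id, field, value)

    def search(self, field, term):
        return sorted(self.postings.get((field, term.lower()), set()))

    def rewrite(self, newly_indexed_fields):
        """Build postings for fields that were stored but not indexed."""
        for field in newly_indexed_fields:
            self.indexed_fields.add(field)
            for doc_id, doc in enumerate(self.docs):
                if field in doc:
                    self._index(doc_id, field, doc[field])


index = ToyIndex(indexed_fields=["title"])
index.add({"title": "Solr guide", "body": "custom merge policy"})

before = index.search("body", "merge")   # field is stored only: no hits
index.rewrite(["body"])
after = index.search("body", "merge")    # field is now indexed: one hit
```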

Design Goals

In order to work well with large collections, the following criteria must be met:

  • OOB, we will only support “safe” transformations.
  • Solr must be able to modify collections individually, even collections that share a configset.
  • Solr must be able to use a custom merge policy.
    • NOTE: this is aimed at very sophisticated users; use at your own risk!
  • We should be able to do multiple transformations in a single pass. For instance, add docValues to several fields at once.
  • TBD whether/how to support stand-alone.
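
To make the "multiple transformations in a single pass" goal concrete, here is a minimal sketch, assuming a toy model in which each document is a dict with `stored` and `doc_values` maps and a schema is a dict of field properties. The function names are hypothetical. The toy takes the new docValues from a stored copy of the field, which is an assumption of the sketch and not necessarily how a real implementation would recover the values. The idea shown is that the list of changes is derived by comparing the old and new schema, and all of them are applied while each document is visited once.

```python
def plan_transforms(old_schema, new_schema):
    """Return the existing fields whose docValues flag flips from off to on."""
    return sorted(
        name
        for name, props in new_schema.items()
        if name in old_schema
        and props.get("docValues")
        and not old_schema[name].get("docValues")
    )


def rewrite_segment(docs, fields_needing_dv):
    """Visit each document once and add every missing docValues entry."""
    rewritten = []
    for doc in docs:
        doc_values = dict(doc.get("doc_values", {}))
        for field in fields_needing_dv:
            # Toy assumption: the value is recoverable from the stored copy.
            if field in doc["stored"]:
                doc_values[field] = doc["stored"][field]
        rewritten.append({"stored": dict(doc["stored"]), "doc_values": doc_values})
    return rewritten
```

A chained design, where each transformation does a single operation, would fit the same loop: the inner `for` would iterate over transformation objects instead of field names.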

Supported (safe) operations

Some operations are safe as well as frequently requested. Here are a few ideas.

  • Add or remove docValues for indexed fields.
  • Remove data associated with unused fields.
  • Totally remove a field.
  • Transform singleValued <-> multiValued.
    • multiValued->singleValued is possible, but would need rules about what to do if there were more than one value defined for a field.
  • Adding field(s) that could be computed from values available in the index.
  • Indexing data originally added as stored="true", indexed="false".
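
The multiValued->singleValued case needs an explicit rule for fields that hold more than one value. A minimal sketch of what such rules might look like follows; the rule names (`first`, `last`, `min`, `max`, `fail`) and the function `to_single_valued` are hypothetical, not something the proposal has settled on.

```python
def to_single_valued(values, rule="first"):
    """Collapse a list of field values to one value according to a named rule.

    Returns None for an empty list. The "fail" rule refuses to guess and
    raises when more than one value is present.
    """
    if not values:
        return None
    if len(values) == 1:
        return values[0]
    if rule == "first":
        return values[0]
    if rule == "last":
        return values[-1]
    if rule == "min":
        return min(values)
    if rule == "max":
        return max(values)
    if rule == "fail":
        raise ValueError(f"{len(values)} values found for a single-valued field")
    raise ValueError(f"unknown rule: {rule}")
```

Defaulting to something like `fail` would arguably be the most conservative choice for an out-of-the-box "safe" operation, since it never silently discards data.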

Unsupported functionality

This process allows the index to be transformed in any way whatsoever. Any OOB functionality must be safe enough to be confident the index will be functional and correct. Therefore, the project will not support any functionality that is a priori unsafe. Some examples:

  • “Spoofing” the Lucene version number to allow Lucene to open an index touched by version X-2.
  • Upgrading an index X -> X+1 -> X+2
  • Changing any value in the index computed from f(x) where “x” is not available.
  • Performing this while indexing is active, see "Prior art".
  • Since any alteration in the underlying indexes can be done, if someone wishes to code up a custom unsafe operation they're responsible for the consequences.

Prior art

Andrzej Bialecki and Erick Erickson implemented a version of this for an older Solr instance. We’ve gotten permission to open-source that work, and it produced a number of lessons learned:

  • Indexing at the same time as these operations are performed is difficult to support; we were unable to figure out the reasons in the time allotted.
  • The use-case we worked on was adding docValues to fields that didn’t have them, which was built on existing work.
  • The problem space quickly explodes and we don’t think Solr should go overboard here. For instance:
    • We could enrich the documents by adding data from outside Solr. While this is possible, trying to build in something that, say, reached out to a DB and pulled in data would be DIH all over again.
    • This is certainly an expert-level process, not to be undertaken lightly.
  • Our work was exclusively for SolrCloud. How to expand to stand-alone is TBD.

Rejected Alternatives

These may not qualify as "rejected"; they are more a discussion of whether this is a good idea at all.

  • The primitive types default to docValues="true", so the likelihood of supporting that transformation is much less by the time we'd release this. If that's the primary motivation, is this worth the effort?
  • Is it sufficient/reasonable to provide just the infrastructure (overriding merge policies per collection) and the specific transformations are the user's responsibility?
  • Should we provide individual transformations or allow configuration of all "safe" operations in one go? Complexity vs. efficiency argument here. Perhaps we can allow transformations to chain together so each one does a single operation but all of them can be executed in a single pass?