...

Current state: Under Discussion

Discussion threads:

...

...

...


JIRA: SOLR-14726, many others, TBD

Released: TBD (target 9.0)

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast). Confluence supports inline comments that can also be used.

...


Motivation

As Solr has grown, the examples have become a mix of ancient documents, kitchen-sink additions with a complicated, and often confusing, interplay of definitions, and left-over configurations that conflict with or are out of sync with the documentation. The "more info" links mostly point into the legacy wiki, which is two generations of redirects behind the current Reference Guide. Solutions were introduced with a fix-the-pain approach, which has also created magic paths or pushed demonstration configurations into consolidated defaults. New features are often not demonstrated, because adding a new example requires understanding the existing ones.

...

  1. Go through the default configuration files line by line.
    1. Ensure that any documentation and explanation not yet in the Reference Guide is moved there. Delete any significant passages and replace them with Ref Guide links to ensure a single source of truth (SOLR-11875, SOLR-14841, SOLR-14834)
    2. Delete any default blocks that do not use parameter substitutions; point instead to the RefGuide for the section and to the API for the real defaults, as appropriate
    3. Delete legacy sections that 'no longer work' (e.g. jmx, possibly EditorialMarkerFactory)
    4. Delete workaround explanations for those migrating from versions prior to Solr 7? (Document them in the RefGuide?)
  2. Review the current state of directory layouts
    1. Compare:
      1. Out-of-the-box for default install
      2. Out-of-the-box example install and hacks (e.g. in bin/solr)
      3. Service install scripts
      4. Docker setup (SOLR-11245)
      5. Existing issues: SOLR-13035, SOLR-6671
    2. Clarify naming for locations of:
      1. Static O/S global part of running solr
      2. Writable O/S global part of running solr (only pid file or more?)
      3. Server/Node level information (start.in.sh? logs? configsets? solr.xml) - there may be several of these on a physical server, such as in the cloud example. Or put all those in solr.home and have cores one level lower under coreRootDirectory (in solr.xml, but see SOLR-14097)
      4. Collection/Core level information (core.properties)
      5. Individual directories per core (conf, data) - some of these already can be in other locations
  3. Refactor example directory and associated commands to reduce magic
    1. This mainly affects log configuration and logging directory locations, and figuring out what the directory above solr home is
    2. May also involve exploring configsets and environmental override directories
  4. Create new examples (SOLR-10329, testable? SOLR-11352)
    1. Create a base learning config that is either based on the default or is an even simpler one of its own (SOLR-13652)
    2. Set up a new dataset (https://www.fakenamegenerator.com can generate 100k records with many interesting fields under a CC license, https://creativecommons.org/licenses/by-sa/3.0/us/, similar to the CC license already used by the films example)
      1. Split records into different formats to demonstrate XML, CSV, multiple JSONs, nested records, etc
    3. Create a number of additive configurations+examples that augment the base configuration to demonstrate specific features with point precision
    4. Move non-essential schema definitions (e.g. languages) from the default into an alternative schema (new kitchen-sink). Whether it should be copy/paste XML or API commands is To Be Explored (SOLR-11033)
    5. Update documentation to use new examples to demonstrate features that used to use older configsets
    6. Use short names for analyzer/filter/tokenizer wherever possible (SOLR-13691) - make sure they are easily discoverable in documentation as well
  5. Rewrite the Getting Started guide to focus on the simplest path through
    1. Start from standalone mode
    2. Explain what is happening with cross-references for more details (teach troubleshooting skills early)
    3. Use the API as much as possible, but not at the cost of readability/comprehension
    4. Demonstrate recent APIs/features
    5. Build up to the cloud example
  6. Bigger changes that need further discussion
    1. Delete ALL DIH examples in bulk - DONE (JIRA TBC SOLR-14066, SOLR-14783)
    2. Delete Tika configuration and refer to the manual for configuration and warning (JIRA TBC SOLR-13973)
    3. Move schemaless mode into learning chain (JIRA TBC SOLR-14701, SOLR-11741)
    4. Delete (refactor) techproducts example and its files (but what about tests?)
    5. Delete Velocity example (discussed somewhere else? SOLR-14065)
    6. V2 vs V1 API for examples (V2 is not available for standalone mode in 8.6.1)
    7. post tool vs curl
    8. Interplay with Admin UI changes in progress (e.g. how much to leverage/demonstrate it)
    9. Neither default nor techproducts is a realistic production schema - a whole separate but related discussion (Jira exists?)
    10. It seems that even though Velocity/DIH/others have been deprecated, they have not actually been removed from code/documentation for 9.0 yet. Are there Jiras for that already?
  7. Other cleanup
    1. Fix the dead/legacy wiki.apache.org links (SOLR-14834)
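
As a starting point for the line-by-line review and the dead-link cleanup, the legacy links could be located mechanically. The following is a minimal sketch, not an agreed tool: the function names and the list of file suffixes are hypothetical, and files without a suffix (such as managed-schema) would need to be added by name.

```python
import re
from pathlib import Path

# Matches links into the legacy wiki host named in the cleanup item above.
LEGACY_WIKI = re.compile(r"https?://wiki\.apache\.org/[^\s\"'<>)]*")


def find_legacy_links(text):
    """Return (line_number, url) pairs for every legacy wiki link in text."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for match in LEGACY_WIKI.finditer(line):
            hits.append((number, match.group(0)))
    return hits


def scan_tree(root, suffixes=(".xml", ".adoc", ".txt", ".sh")):
    """Map each matching file under root to its legacy links.

    The suffix list is an assumption for illustration only.
    """
    report = {}
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            text = path.read_text(encoding="utf-8", errors="replace")
            hits = find_legacy_links(text)
            if hits:
                report[str(path)] = hits
    return report


if __name__ == "__main__":
    sample = "<!-- see http://wiki.apache.org/solr/SolrConfigXml -->\n<config/>\n"
    print(find_legacy_links(sample))
```

Running `scan_tree` against a checkout would give a per-file list of lines to replace with Ref Guide links; deciding the replacement target for each link remains a manual step.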

Compatibility, Deprecation, and Migration Plan

...

It may be possible to create just a minimal learning schema and/or a couple of examples, but this would still not address the problem that, once a person tries to add new functionality or test new features, those are not supported. Nor would it address kitchen-sink production deploys.

Related previous explorations and feature tests

Learning vs Production vs kitchen sink setup

Learning config

  • Should be as small as possible and still load in both standalone and cloud configurations
  • Every line should have a purpose and be explained with RefGuide references
  • managed-schema should be ordered for reading comprehension (fieldType, related fields, uniqueKey declaration next to ID)
  • Additional examples should layer on top of learning schema to demonstrate different features
  • schemaless mode (to be rewritten to be learning mode) is a separate example
  • Related issues:
    • SOLR-13652
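
One way to picture an example that layers on top of the learning schema is as a small set of Schema API commands applied to the base configset. The sketch below only builds the JSON body that would be POSTed to `/solr/<collection>/schema`; it sends nothing. The helper name, the layer name, and the field and type names are hypothetical, and the assumption that commands are applied in the order given should be verified against the Reference Guide.

```python
import json


def schema_commands(field_types=(), fields=()):
    """Build one Schema API request body: field types first, then fields.

    Python dicts keep insertion order, so the type definition is emitted
    before the fields that use it, mirroring the reading order above.
    """
    payload = {}
    if field_types:
        payload["add-field-type"] = list(field_types)
    if fields:
        payload["add-field"] = list(fields)
    return payload


# Hypothetical layer: one text type (using short analyzer names) and one field.
NAME_SEARCH_LAYER = schema_commands(
    field_types=[{
        "name": "text_basic",
        "class": "solr.TextField",
        "analyzer": {
            "tokenizer": {"name": "standard"},
            "filters": [{"name": "lowercase"}],
        },
    }],
    fields=[{"name": "full_name", "type": "text_basic", "stored": True}],
)

if __name__ == "__main__":
    print(json.dumps(NAME_SEARCH_LAYER, indent=2))
```

Keeping each layer as data like this would leave open the question raised earlier of whether examples ship as copy/paste XML or as API commands, since the same definitions could be rendered either way.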

Production config

  • managed-schema should be minimal to allow users to include what is actually needed
  • solrconfig.xml
    • should be fairly comprehensive, but obscure defaults and detailed explanations should live in the RefGuide. From experience, nobody updates the schema files unless forced to (they still point to the wiki)
    • there should be some easy way to tell where in the solrconfig.xml nested structure a new configuration needs to go (or focus on configoverlay and the Config API if it is fixed)
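
To illustrate the Config API alternative, the sketch below builds a `set-property` body of the kind POSTed to `/solr/<collection>/config` and shows how a dotted property name relates to nesting. It is a hedged illustration only: the helper names are hypothetical, nothing is sent to Solr, and not every dotted property corresponds to a nested element in solrconfig.xml (some map to attributes).

```python
import json


def set_property_command(properties):
    """Wrap dotted property names in a Config API set-property command."""
    return {"set-property": dict(properties)}


def as_nested(properties):
    """Expand dotted names into a nested dict to visualize the structure."""
    tree = {}
    for dotted, value in properties.items():
        *parents, leaf = dotted.split(".")
        node = tree
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return tree


if __name__ == "__main__":
    props = {"updateHandler.autoCommit.maxTime": 15000}
    print(json.dumps(set_property_command(props)))
    print(json.dumps(as_nested(props), indent=2))
```

The dotted name carries the location, so a user does not have to find the right place in the XML by hand; whether that is enough depends on the overlay behaviour being fixed, as noted above.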

Kitchen sink config

  • Is there a point in having a kitchen sink config that is basically a reference of field type definitions? That's where all the language variants could go.
  • managed-schema points
    • having a kitchen-sink default configset allows us to include inline comments that make no sense in either the production or the learning schema, as their files may get rewritten on use
    • may be write-locked to clearly indicate it is not for real use
    • kitchen sink may be the only one with commented-out analyzer lines


Lessons learned

From DIH Cleanup (SOLR-14783)

  • To get DIH to work, we had to add permissions to solr/server/etc/security.policy, which is very low-level. Is it going to be an issue? Do we need a way for packages to explain such needs on install? Are there more examples like that? Also, it is great that somebody commented it properly; otherwise it would just be sitting there forever