Status

Current state: Under Discussion

Discussion thread: here (<- link to https://mail-archives.apache.org/mod_mbox/lucene-dev/)

JIRA: Unable to render Jira issues macro, execution error. , many others, TBD

Released: TBD (target 9.0)

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast). Confluence supports inline comments that can also be used.

Motivation

As Solr has grown, the examples have become a mix of ancient documents, kitchen-sink additions with a complicated (and often confusing) interplay of definitions, and left-over configurations that conflict with or are out of sync with the documentation. The "more info" links mostly point into the legacy wiki, which is two generations of redirects behind the current Reference Guide. Solutions were introduced with a fix-the-pain approach, which has also created magic paths or pushed demonstration configurations into consolidated defaults. New features are often not demonstrated, because adding a new example requires understanding the existing ones.

The default configuration files have grown to ridiculous sizes, with much of that size being commented-out, out-of-date defaults and explanations (that should be, or already are, in the Reference Guide) or comments that will go away on the first API-driven rewrite:

Line Counts	as shipped	No comments	% reduction
solrconfig.xml	1226	212	82%
managed-schema	1031	523	49%
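
The "No comments" column can be approximated by stripping XML comments and blank lines before counting. The following is a minimal sketch under assumptions: it is not the script that produced the table, the function names are hypothetical, and whether the original counts also excluded blank lines is not stated. Rounding the reduction down to a whole percent is a choice made here because it agrees with the figures in the table.

```python
import re

# Matches XML comments, including ones that span several lines.
XML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def count_lines(xml_text):
    """Return (total_lines, lines_left_after_removing_comments_and_blanks)."""
    total = len(xml_text.splitlines())
    without_comments = XML_COMMENT.sub("", xml_text)
    remaining = sum(1 for line in without_comments.splitlines() if line.strip())
    return total, remaining


def reduction_percent(total, remaining):
    """Percentage of lines removed, rounded down to a whole number."""
    return (total - remaining) * 100 // total


if __name__ == "__main__":
    # Tiny illustrative sample; a real run would read solrconfig.xml from disk.
    sample = """<config>
  <!-- A comment that
       spans two lines -->
  <luceneMatchVersion>9.0</luceneMatchVersion>

  <!-- <requestHandler name="/old" class="solr.SearchHandler"/> -->
</config>
"""
    total, remaining = count_lines(sample)
    print(total, remaining, reduction_percent(total, remaining))
```

To measure a real file, pass the result of `open(path, encoding="utf-8").read()` to `count_lines`.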

For solrconfig.xml, this bloated configuration is both confusing for people trying to identify significant configuration entries and potentially dangerous: for example, remote streaming was enabled by default until recently.

For managed-schema, all comments go away on the first rewrite, making them completely unsuitable for any significant educational purposes.
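
To make the "first rewrite" concrete: a modification sent through the Schema API is the kind of API-driven change that rewrites managed-schema. The sketch below only builds such a request and does not send it. The core name `learning`, the field `first_name`, and the helper names are hypothetical and chosen for illustration; the `add-field` command and the `/schema` endpoint are part of Solr's Schema API.

```python
import json
import urllib.request


def add_field_command(name, field_type, stored=True):
    """Build a Schema API 'add-field' command body."""
    return {"add-field": {"name": name, "type": field_type, "stored": stored}}


def build_request(base_url, core, command):
    """Prepare (but do not send) a POST to the core's /schema endpoint."""
    return urllib.request.Request(
        url=f"{base_url}/solr/{core}/schema",
        data=json.dumps(command).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical core and field names; nothing is sent by this script.
    command = add_field_command("first_name", "string")
    request = build_request("http://localhost:8983", "learning", command)
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
    # To apply the change against a running Solr instance:
    # urllib.request.urlopen(request)
```

Comparing managed-schema before and after sending such a request is one way to observe the comment loss described above.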

Similarly, our out-of-the-box filesystem layout is a legacy, incrementally grown setup that differs from the lessons we have learned in docker/service/3rd party/production layouts (logical locations for solr.in.sh, logs, pid, solr.home, live vs non-live directories). We also have magic syntax around creating examples that hides just enough internal machinery to make it very hard to run those examples multiple times and to understand what happened when things go wrong. This especially hits those who try to run multiple Solr instances on the same machine.

The examples themselves are out of date, demonstrate legacy features (techproducts), and in some cases (films) have become less viable because the external source of interesting data has disappeared. The examples also do not provide enough records or fields to show advanced Solr capabilities, or even basic nested ones. One example (schemaless) is no longer different from a standard core created with one or two commands, apart from the behind-the-scenes logging magic that will not work outside of the example directory.

Some of the other examples are going away as part of other initiatives (DIH). Others demonstrate features that we strongly advise against in production, and we spend a lot of time on the mailing list advising people to undo what they learned from our own default schemas (Tika integration, schemaless mode as the default chain).

Finally, the recent attempt to write a getting started guide with an initial focus on the cloud setup may have made it harder to comprehend what Solr is actually doing and, again because of the magical nature of the examples directory, harder to reproduce.

Altogether, this leaves new users confused about getting examples running, understanding what they are actually running, learning about the latest features of Solr, and knowing how to apply what they learned from example configurations to their own. They are also going into production with kitchen-sink configurations that everybody is afraid to modify.

Public Interfaces

This will affect all the examples. It may affect some of the directories, startup scripts, documentation, and tests.

Proposed Changes

Go through the default configuration files line by line.
1. Ensure that any documentation and explanation not yet in the Reference Guide is moved there. Delete any significant passages and replace them with Ref Guide links to ensure a single source of truth
2. Delete any default blocks that do not use parameter substitutions and point them to RefGuide for the section and to the API to get the real defaults as appropriate
3. Delete legacy sections that 'no longer work' (e.g. jmx, possibly EditorialMarkerFactory)
4. Delete workaround explanations for those migrating from versions prior to Solr 7? (Document them in the Ref Guide?)
Review the current state of directory layouts
1. Compare:
  1. Out-of-the-box for default install
  2. Out-of-the-box example install and hacks (e.g. in bin/solr)
  3. serviceinstall scripts
  4. docker setup ( Unable to render Jira issues macro, execution error. )
  5. Existing issues: Unable to render Jira issues macro, execution error. Unable to render Jira issues macro, execution error.
2. Clarify naming for locations of:
  1. Static O/S global part of running solr
  2. Writable O/S global part of running solr (only pid file or more?)
  3. Server/Node level information (start.in.sh?, logs? configsets? solr.xml) - there may be several of these on a physical server, such as in the cloud example
  4. Collection/Core level information (core.properties)
  5. Individual directories per core (conf, data) - some of these already can be in other locations
Refactor example directory and associated commands to reduce magic
1. This mainly affects log configuration and logging directory locations and figuring out what is the directory above solr home
2. May also involve exploration about configsets and environmental override directories
Create new examples
1. Create a base learning config that is either based on the default or is an even simpler one of its own
2. Set up a new dataset. https://www.fakenamegenerator.com can generate 100k records with many interesting fields under a CC license (https://creativecommons.org/licenses/by-sa/3.0/us/), similar to the CC license already used by the films example
  1. Split records into different formats to demonstrate XML, CSV, multiple JSONs, nested records, etc
3. Create a number of additive configurations+examples, that augment base configuration to demonstrate specific features with point precision
4. Move non-essential schema definitions (e.g. languages) from the default into an alternative schema (new kitchen-sink). Whether this should be copy/paste XML or API commands is to be explored
5. Update documentation to use new examples to demonstrate features that used to use older configsets
6. Use short names for analyzer/filter/tokenizer wherever possible ( Unable to render Jira issues macro, execution error. ) - make sure they are easily discoverable in documentation as well
Rewrite the Getting Started guide to focus on the simplest path through
1. Start from standalone mode
2. Explain what is happening with cross-references for more details (teach troubleshooting skills early)
3. Use API as much as possible, but not at a cost of readability/comprehension
4. Demonstrate recent APIs/features
5. Build up to the cloud example
Bigger changes that need further discussion
1. Delete ALL DIH examples in bulk ( Unable to render Jira issues macro, execution error. , Unable to render Jira issues macro, execution error. )
2. Delete Tika configuration and refer to the manual for configuration and warning ( Unable to render Jira issues macro, execution error. )
3. Move schemaless mode into learning chain ( Unable to render Jira issues macro, execution error. Unable to render Jira issues macro, execution error. )
4. Delete (refactor) techproducts example and its files (but what about tests?)
5. Delete Velocity example ( Unable to render Jira issues macro, execution error. )
6. V2 vs V1 API for examples (V2 is not available for standalone mode in 8.6.1)
7. post tool vs curl
8. Interplay with Admin UI changes in progress (e.g. how much to leverage/demonstrate it)
9. Neither default nor techproducts is a realistic production schema - a whole separate but related discussion (Jira exists?)
10. It seems that even though Velocity/DIH/others have been deprecated, they have not actually been removed from code/documentation for 9.0 yet. Are there Jiras for that already?
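
The "split records into different formats" step under "Create new examples" could be sketched roughly as follows. This is an illustration under assumptions, not a proposed implementation: the records, field names, and the `addresses` child key are hypothetical and not taken from any real dataset. The XML output follows Solr's `<add><doc><field name="...">` update format. How Solr indexes the nested shape depends on the schema and Solr version, and is not shown here.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical records; field names are illustrative only.
RECORDS = [
    {"id": "1", "first_name": "Ada", "city": "London"},
    {"id": "2", "first_name": "Grace", "city": "New York"},
]


def to_csv(records):
    """Serialize records as CSV with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(records[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


def to_json(records):
    """Serialize records as a JSON array of flat documents."""
    return json.dumps(records, indent=2)


def to_solr_xml(records):
    """Serialize records in Solr's XML update format."""
    add = ET.Element("add")
    for record in records:
        doc = ET.SubElement(add, "doc")
        for name, value in record.items():
            field = ET.SubElement(doc, "field", name=name)
            field.text = value
    return ET.tostring(add, encoding="unicode")


def to_nested(records, child_key="addresses"):
    """Move the city into a list of child documents to illustrate a nested shape."""
    nested = []
    for record in records:
        parent = {k: v for k, v in record.items() if k != "city"}
        parent[child_key] = [{"id": f"{record['id']}-addr", "city": record["city"]}]
        nested.append(parent)
    return nested


if __name__ == "__main__":
    print(to_csv(RECORDS))
    print(to_json(RECORDS))
    print(to_solr_xml(RECORDS))
    print(json.dumps(to_nested(RECORDS), indent=2))
```

Generating every format from one source list keeps the example files in sync, so the same records can be used to compare indexing paths.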

Compatibility, Deprecation, and Migration Plan

Existing users will only be affected when they look at examples again to learn additional features
The directory locations may change, but possibly in a very minor way. If 3rd party tools hardcode paths, this may need a call-out
Tests use both the default and techproducts schemas. They would need to be migrated

Security considerations

This proposal should either not affect security or possibly improve it.

Test Plan

All existing tests should run. Additional tests may be needed?

Rejected Alternatives

The current status is broken in 100 different small ways. The discussions and attempts to fix them are happening in parallel efforts, but they approach the problem from a functional (rather than critical path) point of view. Because these are separate efforts, their priority and impact are often not fully appreciated without a higher-level critical path discussion.

It may be possible to create just a minimal learning schema and/or a couple of examples, but this would still not address the problem that, once a person tries to add new functionality or test new features, they are not supported. Nor would it address kitchen-sink production deploys.

Related previous explorations and feature tests

Lessons learned

From DIH Cleanup ( Unable to render Jira issues macro, execution error. )

To get DIH to work, we had to add permissions into solr/server/etc/security.policy, which is very low-level. Is it going to be an issue? Do we need a way for packages to explain such needs on install? Are there more examples like that? Also, it is great that somebody commented it properly; otherwise it would just be sitting there forever.

SIP-10 Improve Getting Started experience