Overview


The main scope of this work is what I call a remaster of Apache Solr/SolrCloud. The current work is based on a commit point from July 2020 (master/9.x) with a small number of backports since.

The main work in making SolrCloud solid, scalable, and performant is a systematic introspection and review of the system. This is an immense task, and it is difficult to balance coverage, correctness, being systematic, avoiding regressions, etc. My solution is to use the common thread of performance and of test / bench / stress test results, each attacked from multiple angles, as a guide. Pinning the code to a single commit point for a period of time is also a crucial component given the size of the task and its related difficulties, as well as the number of developers and the amount of fairly independent code churn.

This approach starts with a fresh head space assuming three things:

  1. Modern hardware is extremely fast and more and more parallel.
  2. SolrCloud is supposed to be a lightweight, mostly event driven system.
  3. Most of our tests and dev testing are done with very light data that Lucene can gobble up in no time.

Early on, a lot of low-hanging fruit pops up when you start applying these assumptions and watching them fail.

The work requires digging, improvements, profilers, logs, debuggers, etc. One thing often leads to another as the work goes on. As you become familiar with the tests and their run times are pushed down, it also becomes apparent which tests should have run times or resource usage similar to comparable tests but for some reason stand out. Investigating those outliers often leads to bugs, inefficiencies, good improvements, etc. An intense and repeating loop of improvement, fixing, and introspection steadily pushes the system towards greater stability, efficiency, and performance. Mix in a huge amount of test debugging and hardening.

The driving overall philosophy around tests is to work towards an extremely fast and solid core test run, a much more rigorous and inclusive “Jenkins” or “Nightly” run, and, at some point, other higher-level benchmarks and stress tests to help prevent production regressions.

An extremely fast core test run that also has high coverage is extremely developer friendly over time (faster iterations, shorter logs, fewer interactions, fewer resources, and often less forgiving) and provides a reliable, solid baseline for debugging failures in the more complex Nightly or higher-level test/stress/bench runs.
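As a concrete illustration of that split, here is a minimal sketch of how the two runs might be invoked, assuming the branch keeps the standard Gradle test setup from master; the exact task and property names here are assumptions and may differ on the branch:

# Fast core test run against the Solr subproject
./gradlew -p solr test

# Heavier "Nightly"-style run; depending on the build, the flag may need to be passed with -P instead of -D
./gradlew -p solr test -Dtests.nightly=true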

Some of the goals of the Solr Reference Branch are:

  • Convince the developer to trust the tests.
  • Make SolrCloud highly stable and scalable.
  • Move to a world where developers are much more confident in larger changes without needing production feedback.
  • Put Solr standard test runs in competition with the best open source distributed java tests out there in terms of coverage, stability, and speed.
  • Put the system in a place that can be transitioned to a more modular and understandable architecture (largely this is initially accomplished via the above, though I do have some experience playing around more in this area over the past few years - that work is out of scope here).

Some of the key focuses have been:

  • Test performance and reliability.
  • Lifecycle, lifecycle, lifecycle.
  • Resource usage (Threads, connections, and on and on).
  • Logging for understanding and debugging.
  • Locking and blocking.

Some of the guiding principles have been:

  • If it was added for "theory safety," to cover my original mistakes, or out of over-defensiveness due to lack of experience/knowledge, remove it. Some of it was reasonable at the time, some of it is embarrassing.
  • If it is not based on the modern Collections API (no legacy-mode style behavior), remove it.
  • Removing slow behavior, polling, and blocking clears the way so that more meaningful faults and problems can be seen and monitored more easily.
  • More parallel when parallel makes sense.
  • All features and functions should be able to pass pretty much all of our base tests in some form in almost no time flat.
  • Low end hardware should not only work with our tests, but work very well.

Locations

Getting Started

Building the Ref Branch in a stable, known, repeatable Docker environment

git clone https://github.com/apache/lucene-solr.git --branch reference_impl --single-branch reference_impl \
&& docker build -f reference_impl/solr/reference-branch/docker/solr-build/Dockerfile -t solr-ref-branch . \
&& docker run --name solr-ref-branch-1 solr-ref-branch \
&& docker cp solr-ref-branch-1:/opt/solr/ . \
&& ls -la solr/

Building the Ref Branch in your local environment
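A minimal sketch of a local build, assuming the branch keeps the standard Gradle wrapper and lifecycle tasks from master (exact task names may differ; run ./gradlew tasks to see what the branch actually provides):

git clone https://github.com/apache/lucene-solr.git --branch reference_impl --single-branch reference_impl \
&& cd reference_impl \
&& ./gradlew assemble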

Building a basic, stripped down Ref Branch Docker image

Building a Ref Branch development or testing cluster on your machine
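A minimal sketch using the stock bin/solr CLI from a packaged build; this assumes standard SolrCloud startup behavior and does not cover any branch-specific cluster tooling:

# Start a first node in SolrCloud mode (embedded ZooKeeper on port 9983)
bin/solr start -c -p 8983

# Start a second node on another port, pointing at the same ZooKeeper
bin/solr start -c -p 7574 -z localhost:9983

# Create a small test collection spread across both nodes
bin/solr create -c test -s 2 -rf 1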

Some Notable Changes

Some Notable Limitations

Editorial Notes Around Remaining Work



