For the elasticity changes, compactions need to be able to run against tablets that are hosted on a tablet server as well as those that are not hosted.  Currently most compaction code is in the tablet server, and Accumulo 2.1 only supports compacting tablets that are hosted on a tablet server. This document outlines how compaction functionality might move from the tablet server to the manager.

In Accumulo 2.1 the tablet server uses in-memory locking and in-memory data structures for data related to compactions; this will be replaced with storing the data in the metadata table and using conditional mutations to safely update it.  Each tablet server constantly scans the tablets it is hosting to look for tablets that need to compact; this will be replaced by the manager scanning the metadata table.  At a high level, those are the changes that need to be made.  The rest of this document explores the needed changes in more detail.

Concepts

The following are some concepts that are relevant to the rest of the document.

  • System compaction : A compaction initiated by Accumulo on a tablet to reduce the number of files in the tablet.  System compactions use the per table compaction ratio by default.
  • User compaction :  A compaction initiated by a user that can include iterators and can select a subset of a tablet's files. User compactions can perform data maintenance tasks like age-off, data correction, data removal, etc.
  • Selector compaction : A compaction initiated by a compaction selector configured on a table.  These support compacting tablets for reasons other than reducing the number of files, for example compacting tablets that have a large ratio of delete entries; see TooManyDeletesSelector.java.
  • Compaction Job :  A set of files to compact for a tablet plus a priority and destination queue. The priority is used to determine which compaction job to run next in the destination queue.
  • Selected files :  A user compaction must select files within a tablet to compact. The following are important concepts for selected files.
    • Creating the initial set of selected files for a tablet must be done when no other compactions are running.  It is important to have a mechanism that prevents system compactions from starving user file selection for a tablet.  This is done by preventing new system compactions from starting, waiting for any currently running system compactions to complete, and then selecting the set of files. In 2.1 this coordination is done in tserver memory.
    • The selected set of files for a tablet is remembered. In 2.1 this is done in tserver memory.
    • One or more compaction jobs will run over the selected files.  Compactions on disjoint sets of selected files could run concurrently.
    • Accumulo will keep scheduling compaction jobs until all selected files are processed.
    • System compactions will never include files in the selected set of files for a tablet.
    • System compactions can concurrently process tablet files that are not in the selected set of files for a tablet.  This allows system compactions to process files that arrive after a user compaction selects files, which is important for long running compactions.
    • One reason to select a set of files for a user compaction is so that it can be broken into multiple compaction jobs and still complete. We do not want to incorporate new files that arrived after the user compaction started as this could lead to a situation where a user compaction never completes for a tablet.
    • User compactions must compact down to one file.  So if a compaction job compacts a subset of the selected files,  then it will be an intermediate job and its output file will replace its input files in the selected set of files.
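The bookkeeping for intermediate jobs described above can be sketched as follows. This is a minimal illustration, assuming a simple in-memory set of file names stands in for the selected files metadata; the real implementation would update the metadata table with conditional mutations.

```java
import java.util.Set;
import java.util.TreeSet;

public class SelectedFiles {

    private final Set<String> selected = new TreeSet<>();

    public SelectedFiles(Set<String> initial) {
        selected.addAll(initial);
    }

    // Called when a compaction job over a subset of the selected files
    // completes: the output file replaces the job's input files.
    public void commitJob(Set<String> jobInputs, String outputFile) {
        if (!selected.containsAll(jobInputs)) {
            throw new IllegalStateException("job inputs are not all selected");
        }
        selected.removeAll(jobInputs);
        if (selected.isEmpty()) {
            // All selected files were compacted down to one file: the user
            // compaction is complete and the selection is nullified.
            return;
        }
        // Intermediate job: its output joins the selected set and will be
        // an input to a later job.
        selected.add(outputFile);
    }

    public Set<String> getSelected() {
        return Set.copyOf(selected);
    }
}
```

Running intermediate jobs over disjoint subsets of the selected set and feeding their outputs back in is what lets a user compaction be broken into multiple jobs while still guaranteeing it eventually compacts down to one file.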

Example compaction scenario

The following is an example of a compaction scenario that could unfold in Accumulo 2.1.  In Accumulo 1.x only a single compaction could ever run per tablet, so if a user compaction took 5 hours it would prevent any system compactions of files that arrived after the user compaction started.  The scenario below shows how Accumulo 2.1 can handle concurrent system and user compactions.  This scenario must be handled by Accumulo 4.0, so it is presented to help understand the compaction changes proposed for 4.0.

  1.  Tablet A has 200 files F1...F200
  2. A system compaction job for tablet A is running on 30 files, F171...F200.
  3. A user compaction is initiated on tablet A and wants to select files.  However it must wait for the system compaction to complete.  The user compaction places a hold on the tablet that prevents future system compactions from starting.
  4. The system compaction job for tablet A of files F171...F200 completes producing file F201.  The tablet now has files F1...F170,F201.
  5. The thread that looks for compaction work asks about starting another system compaction for tablet A, however the hold placed by the user compaction prevents it.
  6. The user compaction selects files F1...F170,F201 for tablet A.  These files are no longer candidates for system compactions.
  7. The compaction planner plugin queues 5 jobs to compact the files selected for tablet A.  The jobs cover all selected files and each job has a disjoint set of files.
  8. 40 new files F301...F340 arrive for tablet A while the user compaction is running.  These new files are candidates for system compactions.
  9. A new system compaction job is queued for tablet A on files F321...F340.
  10. The five user compaction jobs complete creating files F401...F405.  User compaction must compact down to one file.  These five jobs were intermediate compactions.  The selected set of files for tablet A becomes F401...F405.
  11. A user compaction job for tablet A is queued for files F401...F405.
  12. The system compaction for tablet A on files F321...F340 completes producing file F341.
  13. A system compaction for tablet A on files F301...F320,F341 is queued.  These are files outside the selected set.
  14. The user compaction for tablet A on files F401...F405 completes producing F406.  Now that a single file was produced from the initial set of selected files, the user compaction is marked as complete in the metadata table and the set of selected files is nullified.
  15. There are no selected files for tablet A now, so system compactions can incorporate any of the tablet's files.

Compaction metadata

In order to plan compaction jobs, the following information is needed per tablet.

  • A flag indicating if a user compaction is waiting to select files.
  • The tablets current set of files. 
  • The current set of running compactions.
  • The current set of selected files.
  • Current tablet configuration.
  • User compaction specific configuration.

In 2.1 most of the above information (excluding configuration) is stored in tablet server memory.  Since compaction functionality is being moved out of the tablet server, this information will need to be stored elsewhere.  The two candidates are manager memory or the metadata table.  Storing this information in manager memory could lead to out of memory conditions; the tablet servers had a lot of aggregate memory and could do this, but a single manager does not.  Therefore it seems the best place to store this information is the metadata table.

In 2.1 there are two types of compaction jobs, internal and external.  An internal compaction is one that runs inside the tablet server.  An external compaction is one that runs outside of the tablet server in a compactor process. Internal compactions will be going away as a concept and this will likely require a breaking change to the SPI.  We could deprecate SPI items related to this in 3.x as a heads up, but users may not really be able to make a change in 3.x based on the deprecation unless they wanted to switch to only external compactions.  Information about running external compactions already exists in a tablet's metadata and this could be directly leveraged in 4.0.  For 4.0 the following information would need to be added to the tablet metadata.

  • A column indicating a user compaction is waiting to select files.  With conditional mutations this column can prevent any system compactions from starting after the column is set.
  • A column that holds the selected set of files for a tablet.

Storing the selected set of files in the metadata table has an advantage over storing it in memory (tserver or manager).  Currently when a tablet server dies, the selection process starts over for a tablet.  There is not a correctness problem with starting over, but it could lead to redundant work.  Storing this data in the metadata table avoids redundant work in the case of faults.  Accumulo 1.x and 2.x both have the redundant work problem for user compactions in the case of a tserver process dying.

In order to handle concurrent user compactions, the selected set of files in the metadata table will probably need to store the FATE transaction id with the set of selected files.  This allows a user compaction to know if a set of selected files in the metadata table is from itself or another concurrent user compaction.
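A minimal sketch of how the selected files value might carry the FATE transaction id, so a user compaction can tell whether an existing selection belongs to it or to another concurrent user compaction. The type and field names are illustrative, not the actual metadata schema.

```java
import java.util.Set;

// Hypothetical value stored in the tablet's selected-files column: the
// selecting FATE transaction's id plus the files it selected.
record Selection(long fateTxId, Set<String> files) {

    // A user compaction FATE op owns the selection only if the stored
    // transaction id matches its own; otherwise another concurrent user
    // compaction selected these files and this op must wait.
    boolean ownedBy(long txId) {
        return fateTxId == txId;
    }
}
```

A conditional mutation writing this column would fail if another transaction's selection is already present, which is exactly the safe behavior wanted for concurrent user compactions.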

Queueing

Currently compaction jobs ready to run are stored in one or more priority queues on tablet servers.  Users can configure multiple thread pools (or external compaction groups) to run compactions and each one has a priority queue.  Compaction jobs are placed on these priority queues.  Tablet servers have more memory in aggregate than the manager, therefore it is reasonable to assume they can store all the compaction jobs in memory.  However the manager cannot make this assumption, and we will need an in-memory priority queue that supports the following functionality.

  • Only stores the top N compaction jobs, with the following behavior for adding new jobs.
    • When the queue size is less than N, add the job.
    • When the queue is full, do the following.
      • If the job being added has a lower priority than the lowest priority job in the queue, ignore the add request.
      • If the job being added has a higher priority than the lowest priority job, drop the lowest priority job and add the new one.
  • Supports the ability to delete any job in the priority queue.  When Accumulo asks the planner plugin to create compaction jobs, if it creates different jobs than are currently queued then the currently queued jobs are deleted before adding the new ones to the queues.
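The queue behavior above can be sketched with a TreeSet ordered by priority. This is a minimal single-threaded illustration; the job identity here is just an id string, a stand-in for the real compaction job (files + priority + queue), and a production version would need thread safety.

```java
import java.util.Comparator;
import java.util.TreeSet;

public class TopNQueue {

    public record Job(String id, int priority) {}

    private final int maxSize;
    // Ordered so first() is the lowest priority job; ties are broken by id
    // so distinct jobs with equal priority both stay in the set.
    private final TreeSet<Job> jobs = new TreeSet<>(
        Comparator.comparingInt(Job::priority).thenComparing(Job::id));

    public TopNQueue(int maxSize) {
        this.maxSize = maxSize;
    }

    // Add a job, keeping only the top N jobs by priority.
    public boolean add(Job job) {
        if (jobs.size() < maxSize) {
            return jobs.add(job);
        }
        // Queue is full: only admit jobs that beat the lowest priority.
        if (job.priority() <= jobs.first().priority()) {
            return false; // ignore the add request
        }
        jobs.pollFirst(); // drop the lowest priority job
        return jobs.add(job);
    }

    // Compactors take the highest priority job.
    public Job poll() {
        return jobs.pollLast();
    }

    // Delete a queued job whose files no longer match the planner output.
    public boolean remove(Job job) {
        return jobs.remove(job);
    }

    public int size() {
        return jobs.size();
    }
}
```

Both eviction of the lowest priority job and deletion of an arbitrary job are O(log n) with this structure, which is the requirement that makes a plain binary-heap PriorityQueue (O(n) arbitrary removal) less attractive.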


Guava's MinMaxPriorityQueue seemed like it may be a candidate for implementing the above priority queue, however its javadoc says the following, so it may not be a good candidate.

If you only access one end of the queue, and do use a maximum size, this class will perform significantly worse than a PriorityQueue with manual eviction above the maximum size. In many cases Ordering.leastOf(java.lang.Iterable<E>, int) may work for your use case with significantly improved (and asymptotically superior) performance. 

Plugins/SPI

The existing plugins related to compaction may not need to change much because the functionality they support is still needed.  We still need to delegate the following functionality.

  • Deciding which compaction service a table uses.
  • Selecting files for user compactions.
  • Breaking compaction work into jobs (files+priority+queue).
  • Configuring the compression and other output file settings for a compaction.

The existing plugins can provide this functionality, however they may execute in a different place (manager vs tserver, for example).  Some of these plugins are configured via tablet server properties, so the plugins would need to be configured via a different prefix.  We are wondering if we should move away from manager and tserver prefixes in Accumulo configuration and towards functional prefixes like compaction and split.  That could be attempted with these changes, moving the system plugin configuration related to compactions from a tserver prefix to a "compaction" prefix (instead of a manager prefix).

Runtime

The following is a possible execution path for how a compaction might be found and run.

  1. In the manager the TabletGroupWatcher sets up a batch scanner with the TabletManagementIterator to look for tablets that need compaction, split, merge, or assignment.  Using the batch scanner, this iterator can run on many tservers in parallel, allowing tablet inspection to execute in parallel.  Note the TabletManagementIterator is completely new in the elasticity branch, see #3409.
  2. The TabletManagementIterator calls the CompactionDispatcher and CompactionPlanner plugins for each tablet to see if any compaction jobs are produced.  If any jobs are produced, then the tablet is returned to the manager as needing compaction.  This assumes the selected set of files is stored in the metadata table.
  3. On the manager, the compaction jobs are recomputed for each tablet returned from the TabletManagementIterator as needing compaction.
  4. If the tablet has any jobs in the designated priority queue that differ from the jobs produced, then the old jobs in the queue are removed and the new ones enqueued.
  5. Once the job is in a compaction queue, compactors can take the highest priority job. The manager will need to create an entry for the compaction running on a compactor using a conditional mutation, this is currently done in the tserver (using in memory locks instead of conditional mutations).
  6. Once a compactor is finished with a compaction it will need to report this to the manager.  The manager can use a conditional mutation to commit the compaction.  The manager will need to reliably inform hosted tablets that the tablet's files changed, like bulk import does in the elasticity branch; this notification can not be lost AND must be delivered before the user compaction API call returns.
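The commit step above can be sketched as a compare-and-update on the tablet's file set. Here a plain in-memory set stands in for the tablet metadata, and the precondition check stands in for a conditional mutation; the real code would use Accumulo's ConditionalWriter against the metadata table.

```java
import java.util.Set;
import java.util.TreeSet;

public class CompactionCommit {

    private final Set<String> tabletFiles = new TreeSet<>();

    public CompactionCommit(Set<String> files) {
        tabletFiles.addAll(files);
    }

    // Commit a finished compaction only if all of its input files are
    // still present in the tablet (the "condition"); otherwise reject,
    // as a failed conditional mutation would.
    public synchronized boolean commit(Set<String> inputs, String output) {
        if (!tabletFiles.containsAll(inputs)) {
            return false; // condition failed; compaction must not commit
        }
        tabletFiles.removeAll(inputs);
        tabletFiles.add(output);
        return true;
    }

    public Set<String> getFiles() {
        return Set.copyOf(tabletFiles);
    }
}
```

The condition is what replaces the 2.1 in-memory locks: two racing commits for overlapping files cannot both succeed, because the loser's inputs are no longer all present.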

User compactions will need to select files in the FATE op and then wait for the above process to actually complete the compactions.  The following are the actions the user compaction FATE op would take.

  1. For each tablet in the compaction range, do the following until all tablets have a set of selected files.
    • If no compactions are running then select the set of files to compact and write this to the metadata table.  This would require calling the CompactionSelector plugin, which could be resource intensive (in terms of manager CPU, memory, and network) if it decides to do something like examine summary information stored in a file.  This metadata update can be made using a conditional mutation to properly handle race conditions with compactions starting during file selection.
    • If compactions are running then write a flag to the tablet that prevents new compactions from starting.  This flag can be written using a conditional mutation to avoid race conditions.  Come back to this tablet later.
  2. As soon as a tablet has a selected set of files from the step above, the manager can start scheduling compaction jobs to process the selected set via the TabletGroupWatcher.  The FATE op needs to wait for all tablets in the range to compact, using the current mechanism that tablets use, which is a one-up compaction counter.
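The selection loop in step 1 might look roughly like the following. This is a sketch only: the Tablet fields stand in for metadata table reads, and the two writes (selection and hold flag) stand in for conditional mutations; the selected files here are just copied rather than produced by the CompactionSelector plugin.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class SelectionLoop {

    // Minimal stand-in for per-tablet compaction metadata.
    public static class Tablet {
        int runningCompactions;
        boolean selectionHold;     // flag blocking new system compactions
        Set<String> selectedFiles; // null until selection happens
        final Set<String> files;

        public Tablet(Set<String> files, int running) {
            this.files = files;
            this.runningCompactions = running;
        }
    }

    // One pass over the tablets in the compaction range; returns the
    // tablets that must be revisited because compactions were running.
    public static List<Tablet> selectFiles(List<Tablet> tablets) {
        List<Tablet> retry = new ArrayList<>();
        for (Tablet t : tablets) {
            if (t.selectedFiles != null) {
                continue; // this tablet already has a selection
            }
            if (t.runningCompactions == 0) {
                // Would call the CompactionSelector plugin and write the
                // selection to the metadata table conditionally.
                t.selectedFiles = Set.copyOf(t.files);
            } else {
                // Conditionally write a hold that stops new system
                // compactions from starting; come back to this tablet later.
                t.selectionHold = true;
                retry.add(t);
            }
        }
        return retry;
    }
}
```

The FATE op would repeat passes until `selectFiles` returns an empty list; the hold written on the first pass guarantees the running-compaction count only decreases, so each tablet eventually becomes selectable.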

Selector compactions

The tablet server is inefficient when processing tablets with a selector plugin configured; it just keeps asking the plugin whether it wants to select files.  Tablet servers have a lot more CPU to do this processing and memory to cache file summary information.  The manager needs to be more efficient when processing these.  The manager will need to avoid constantly running the selection plugin and instead only run it when one or more of the following conditions are met.

  • The tablet's set of files has changed since the last time the plugin was run.
  • The tablet's selector configuration has changed since the last time the plugin was run.

A hash of the above information could be recorded in the metadata table for each tablet.  The TabletManagementIterator could detect changes in the hash and notify the manager which could then run the compaction selection plugin and record a new hash in the metadata table.
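Change detection could be as simple as hashing the sorted file names together with the selector configuration. The hashing scheme below is illustrative, not a proposed wire format; any stable digest over both inputs would do.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;
import java.util.SortedSet;

public class SelectorHash {

    // Digest of the tablet's files plus its selector configuration; a
    // change in either produces a different hash, signaling that the
    // selection plugin should be run again for this tablet.
    public static String hash(SortedSet<String> files, String selectorConfig) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            for (String f : files) {
                md.update(f.getBytes(StandardCharsets.UTF_8));
                md.update((byte) 0); // separator avoids concatenation ambiguity
            }
            md.update(selectorConfig.getBytes(StandardCharsets.UTF_8));
            return HexFormat.of().formatHex(md.digest());
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The TabletManagementIterator would compare this hash against the one recorded in the tablet's metadata and only report the tablet when they differ.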

This 2.1 feature probably could not be supported in the manager without changes to make it more efficient.  2.1 itself could probably also benefit from making the feature more efficient.

Selector compactions would have a selected set of files stored in the metadata table like user compactions.

