
Background

On-Demand tablet hosting and Resource Groups, described in A More Cost Efficient Accumulo, are two new features that aim to lower the cost of using Accumulo and increase the efficiency of compute resources. On-Demand tablet hosting could allow users to reduce their overall TabletServer footprint if they determine that they do not need immediate access to the data in the tablets of a table. Resource Groups will enable users to assign the workloads for tables to specific groups of compute resources. Metrics will enable third-party schedulers (e.g., HPA, KEDA) to dynamically scale the compute resources within a resource group as demand changes. These features will require that users analyze their application requirements and perform some level of resource planning.

Given the analysis and planning required, we assume that resource group definitions will change infrequently. 

The diagram on the left is an example layout of an Accumulo instance with three resource groups.

  • The resource group on the left is the default resource group, which is mandatory. The default resource group must, at a minimum, contain one TabletServer and one Compactor so that the root and metadata tables can be hosted and compacted. Tablets for other tables are hosted in the TabletServers in the default group unless configured otherwise*. Applications using the default resource group can insert data via BatchWriter or BulkImport and can access their data using the IMMEDIATE consistency level on the Scanner. Because there are TabletServers in this group, users can assign ranges of their tables to always be hosted instead of being hosted on-demand (on-demand hosting is the new default behavior in 4.0). The Compactors in this resource group will perform major compactions for all tables that have not been configured to use a different group**.

  • The resource group in the middle contains no TabletServers, only ScanServers and Compactors. Users would configure a group like this when their application does not need immediate consistency for their data. Applications using a resource group like this would only insert data via BulkImport and would access their data using the EVENTUAL consistency level on the Scanner*** (see the example following this list). Tablets will never be hosted in this resource group because there are no TabletServers. The Compactors in this resource group will perform major compactions for all tables that have been configured to use this group.

  • The resource group on the right contains a mix of TabletServers and ScanServers. Users would configure a group like this when their application requires immediate consistency in some cases and eventual consistency in others.

* See A More Cost Efficient Accumulo, Resource Groups, bullet #1
** See A More Cost Efficient Accumulo, Resource Groups, bullet #3
*** See A More Cost Efficient Accumulo, Resource Groups, bullet #2
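
As a concrete illustration of the access pattern for the middle group, below is a minimal client sketch that scans with the EVENTUAL consistency level so that reads can be serviced by ScanServers. The instance name, ZooKeeper connect string, credentials, and table name are placeholders.

    import java.util.Map.Entry;

    import org.apache.accumulo.core.client.Accumulo;
    import org.apache.accumulo.core.client.AccumuloClient;
    import org.apache.accumulo.core.client.Scanner;
    import org.apache.accumulo.core.client.ScannerBase.ConsistencyLevel;
    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.security.Authorizations;

    public class EventualScanExample {
      public static void main(String[] args) throws Exception {
        // Placeholder connection details for illustration only.
        try (AccumuloClient client = Accumulo.newClient()
            .to("myInstance", "zkhost:2181").as("user", "passwd").build();
            Scanner scanner = client.createScanner("myTable", Authorizations.EMPTY)) {
          // EVENTUAL allows the scan to be serviced by ScanServers in the table's
          // resource group instead of requiring the tablet to be hosted.
          scanner.setConsistencyLevel(ConsistencyLevel.EVENTUAL);
          for (Entry<Key,Value> entry : scanner) {
            System.out.println(entry.getKey() + " -> " + entry.getValue());
          }
        }
      }
    }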


The Single Manager Problem

With the introduction of On-Demand tablets, most tablet management functions (split, merge, delete, compact, etc.) had to be moved out of the TabletServer and into the Manager. Prior to Accumulo 4.0, TabletServers constantly scanned the tablets under their control looking for tablets that needed to be compacted, split, etc. This activity was highly parallelized. With this functionality moved to the Manager, there is now a single process continually scanning all of the metadata looking for work, using the TabletManagementIterator on the metadata tablets to offload as much of the computation as possible. User-initiated tablet management functions are still FATE operations, but they run within the Manager instead of being delegated to the TabletServer hosting the tablet. Additionally, the CompactionCoordinator functionality was moved into the Manager. Because of the increased tablet management, FATE, and CompactionCoordinator workloads, there is a concern that a single Manager will not be able to manage an Accumulo instance like it could in earlier Accumulo versions.


Multiple Managers - Attempt #1

PR #3262 is an incomplete solution for allowing multiple active Manager processes. One of the Managers would grab the lock, making it the primary Manager, and be responsible for executing FATE operations and acting as the CompactionCoordinator. The primary and other Managers would determine which tables they are responsible for managing via some stable algorithm that recomputes the distribution as Managers are added and removed. Recent comments on PR #3262 suggest that the primary Manager should perhaps calculate the distribution of tables/tablets and publish it for the other Managers to consume.
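
To make "some stable algorithm" more concrete, one option would be rendezvous (highest-random-weight) hashing, which keeps most table-to-Manager assignments unchanged as Managers are added or removed. The sketch below is illustrative only and is not the approach taken in PR #3262; the Manager addresses and table IDs are placeholders and the hash is simplistic.

    import java.util.List;
    import java.util.Set;

    /**
     * Illustrative rendezvous hashing: a table is managed by the live Manager that
     * produces the highest score for the (manager, table) pair. Adding or removing
     * a Manager only moves the tables whose highest-scoring Manager changed.
     */
    public class TableDistributionSketch {

      public static String ownerOf(String tableId, Set<String> liveManagers) {
        String owner = null;
        long best = Long.MIN_VALUE;
        for (String managerId : liveManagers) {
          long score = score(managerId, tableId);
          if (owner == null || score > best
              || (score == best && managerId.compareTo(owner) < 0)) {
            best = score;
            owner = managerId;
          }
        }
        return owner;
      }

      // Simple 64-bit mix of the pair; a real implementation would use a stronger hash.
      private static long score(String managerId, String tableId) {
        long h = 1125899906842597L;
        for (char c : (managerId + "/" + tableId).toCharArray()) {
          h = 31 * h + c;
        }
        h ^= (h >>> 33);
        h *= 0xff51afd7ed558ccdL;
        h ^= (h >>> 33);
        return h;
      }

      public static void main(String[] args) {
        Set<String> managers = Set.of("manager1:9999", "manager2:9999", "manager3:9999");
        for (String tableId : List.of("1", "2", "a", "b")) {
          System.out.println("table " + tableId + " -> " + ownerOf(tableId, managers));
        }
      }
    }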

Positives with Attempt #1

  1. With the primary Manager only advertising the FATE and CompactionCoordinator Thrift services, clients only need to communicate with a single Manager.

Negatives with Attempt #1

  1. FATE and CompactionCoordinator operations are not currently distributed, although there is a WIP PR (3964) for distributing FATE.
  2. As noted, the current distribution algorithm may not be stable; depending on the final implementation, it could introduce a layer of complexity with ZooKeeper or RPCs.
  3. There is an issue with the Monitor and multiple Managers: does the Monitor need to communicate with all of them?
  4. Do we also have a single GC problem? We have talked about partitioning the GC process, potentially by namespace, to reduce memory pressure on the GC process and to make it faster.

If the FATE distribution problem is solved, then it may be possible to create a FATE server component. The Fate class in the Manager contains the TransactionRunners and is well modularized, so it would be trivial to create a FateServer from it. The Manager would retain the FATE API and create transactions, but their execution could be scaled independently.
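
A rough sketch of what that split might look like is below. The types are hypothetical stand-ins (they are not the real Fate or TransactionRunner classes), and the actual division of responsibilities would depend on how FATE distribution is solved; the idea is simply that the Manager seeds transactions into a shared store that a separately scalable FateServer polls.

    import java.util.Optional;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    /**
     * Hypothetical sketch of a standalone FATE server: the Manager keeps the API that
     * creates and seeds transactions, while a separately scalable process reserves
     * runnable transactions from a shared store and executes them.
     */
    public class HypotheticalFateServer {

      /** Stand-in for a shared FATE transaction store (e.g. backed by ZooKeeper). */
      interface TransactionStore {
        /** Reserve the next runnable transaction, if any, so no other runner picks it up. */
        Optional<Runnable> reserveRunnable();
      }

      private final TransactionStore store;
      private final ExecutorService runners;

      HypotheticalFateServer(TransactionStore store, int numRunners) {
        this.store = store;
        this.runners = Executors.newFixedThreadPool(numRunners);
      }

      /** Poll loop: reserve transactions and hand them to the runner pool. */
      void run() {
        while (!runners.isShutdown()) {
          store.reserveRunnable().ifPresentOrElse(runners::submit, () -> sleepQuietly(1000));
        }
      }

      private static void sleepQuietly(long millis) {
        try {
          Thread.sleep(millis);
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      }
    }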


Possible Solution #2

One possible solution to the Single Manager problem is to push the Management layer (Monitor, Manager, GC) into the resource group. From a management perspective this looks like several different Accumulo databases, but it is not; they are different groups of Accumulo server processes serving data from the same Accumulo instance. This may or may not solve the single Manager problem, depending on the number of tablets that the Manager has to manage within a single resource group. However, it does solve the Monitor issue and the GC and CompactionCoordinator partitioning problems. To make this work, I believe that:

  1. There would need to be a table property for the resource group. When the Manager for a group starts up, it would find all of the tables with the same group name and manage them.
  2. The Accumulo client code would also need to determine the group name for the table that it needs to operate on, so that it communicates with the correct Manager (a sketch using custom table properties follows the ZooKeeper example below).
  3. The ZooKeeper paths for some components may need to change. For example, the paths for FATE transactions do not contain any information identifying which resource group a transaction is related to:

    /accumulo/01af27af-ad10-416e-96e2-5fc35217022d/fate
    /accumulo/01af27af-ad10-416e-96e2-5fc35217022d/fate/tx_1fbc9ea5cc0b473f
    /accumulo/01af27af-ad10-416e-96e2-5fc35217022d/fate/tx_1fbc9ea5cc0b473f/TX_NAME
    /accumulo/01af27af-ad10-416e-96e2-5fc35217022d/fate/tx_1fbc9ea5cc0b473f/repo_0000000000
    /accumulo/01af27af-ad10-416e-96e2-5fc35217022d/fate/tx_1fbc9ea5cc0b473f/repo_0000000002
    /accumulo/01af27af-ad10-416e-96e2-5fc35217022d/fate/tx_1fbc9ea5cc0b473f/repo_0000000003
    /accumulo/01af27af-ad10-416e-96e2-5fc35217022d/fate/tx_1fbc9ea5cc0b473f/repo_0000000004

    It may not be beneficial to put the resource group name into the ZK path as cleanup will need to be done for old resource group names.
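
Item #1 and item #2 could be prototyped today with custom table properties before any first-class property exists. The property name below (table.custom.resource.group) and the group name are hypothetical choices for illustration, as are the connection details.

    import java.util.Map;

    import org.apache.accumulo.core.client.Accumulo;
    import org.apache.accumulo.core.client.AccumuloClient;

    public class ResourceGroupPropertySketch {
      // Hypothetical property name; a real implementation would likely define a first-class property.
      static final String RESOURCE_GROUP_PROP = "table.custom.resource.group";

      public static void main(String[] args) throws Exception {
        try (AccumuloClient client = Accumulo.newClient()
            .to("myInstance", "zkhost:2181").as("user", "passwd").build()) {

          // Item #1: assign a table to a resource group. A group's Manager could scan
          // table configurations at startup to find the tables it should manage.
          client.tableOperations().setProperty("myTable", RESOURCE_GROUP_PROP, "group2");

          // Item #2: the client looks up the group for a table so that it can contact
          // the Manager for that group.
          Map<String,String> conf = client.tableOperations().getConfiguration("myTable");
          String group = conf.getOrDefault(RESOURCE_GROUP_PROP, "default");
          System.out.println("myTable is managed by resource group: " + group);
        }
      }
    }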

Possible Solution #3

Modify the functional services in the Manager so that they know how to distribute themselves, and have multiple Manager processes create these services and link them. This was partially described in this document. For example, the following are services that could run in a Manager process and know how to distribute work and coordinate with other instances of themselves running in other Manager processes. How each of the services below distributes itself is encapsulated within the service.

  • Tablet management: In PR 3262, work was started on distributing the tablet management component of the Manager.
  • FATE: In PR 3964, work was started on distributing FATE as a WIP commit.
  • Compaction coordination: It would probably be worthwhile to create a WIP PR that experiments with distributing the compaction coordinator functionality.

It would be very helpful to reorganize the Manager code as described in 4005 in order to make progress on this possible solution. Implementing 4005 would be a nice improvement to the maintainability of the Accumulo Manager even if this solution is not pursued.

These functional services depend on each other. For example, the tablet management functionality finds tablets that need to compact and then passes that information to the compaction coordinator. With a distributed Manager, this could happen in the following way:

  1. Tablet management service 1 (TS1) running on manager process 1 (MP1) is scanning a subset of the metadata table and finds a tablet that needs compaction.
  2. TS1 passes this tablet to compaction coordinator service 1 (CCS1) running in the same process.
  3. CCS1 examines the tablet and determines that the queue for that compaction is owned by compaction coordinator service 2 (CCS2). CCS2 is running in manager process 2 (MP2).
  4. CCS1 sends an RPC to CCS2 asking it to add a compaction job to a queue it owns (sketched after this list).
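
The ownership check and forwarding in steps 3 and 4 could look roughly like the sketch below. The CompactionCoordinatorService interface, queue-ownership map, and RPC client are hypothetical; none of these types exist in Accumulo today.

    import java.util.Map;

    /**
     * Illustrative sketch of a compaction coordinator service that knows how to
     * distribute itself. Callers (e.g. the local tablet management service) always
     * talk to the local instance; the local instance decides whether to enqueue the
     * job itself or forward it to the peer that owns the queue.
     */
    public class DistributedCompactionCoordinatorSketch {

      /** The interface every caller uses, whether the work is handled locally or remotely. */
      interface CompactionCoordinatorService {
        void addJob(String queueName, String compactionJob);
      }

      /** Hypothetical RPC client to a coordinator instance in another Manager process. */
      interface CoordinatorRpcClient {
        void addJob(String managerAddress, String queueName, String compactionJob);
      }

      static class LocalCoordinator implements CompactionCoordinatorService {
        private final String selfAddress;
        private final Map<String,String> queueOwners; // queue name -> address of the owning Manager
        private final CoordinatorRpcClient rpc;

        LocalCoordinator(String selfAddress, Map<String,String> queueOwners, CoordinatorRpcClient rpc) {
          this.selfAddress = selfAddress;
          this.queueOwners = queueOwners;
          this.rpc = rpc;
        }

        @Override
        public void addJob(String queueName, String compactionJob) {
          String owner = queueOwners.getOrDefault(queueName, selfAddress);
          if (owner.equals(selfAddress)) {
            // This coordinator owns the queue, so the job is queued locally.
            System.out.println("Queued " + compactionJob + " on " + queueName);
          } else {
            // Steps 3 and 4: another Manager owns the queue, forward the job via RPC.
            // The caller never knows or cares that forwarding happened.
            rpc.addJob(owner, queueName, compactionJob);
          }
        }
      }
    }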

In the example above, TS1 only interacts with CCS1; it does not know or care that CCS1 passed the data to CCS2 running in another Manager process. This illustrates the concept that each functional service knows how to distribute itself. Below is another example of how this could work.

  1. A user-initiated compaction is running in FATE instance 2 (FATE2), which is running in MP2.
  2. The FATE operation has selected files on each tablet and notifies tablet management service 2 (TS2) to examine tablets in range X,Y for table Z.
  3. TS2 looks at the request and determines that TS3 is responsible for that range.
  4. TS2 sends an RPC to TS3 running on MP3 to examine range X,Y for table Z.
  5. TS3 finds multiple tablets that need to compact and notifies CCS3.

It's also possible that some services in the Manager will not be distributed. In this case each Manager process could still have an instance of that service that always forwards to the single instance. For example, assume that compaction coordination is not a distributed service and is only running in one Manager process, MP2. One possible way to implement this would be to have an instance of the compaction coordinator service in every other Manager process that does nothing but forward. Then the following could happen.

  1. TS1 running on MP1 finds a tablet that needs compaction.
  2. TS1 passes this tablet to CCS1 running in the same process.
  3. CCS1 does not do anything itself; it forwards everything to CCS2. All CCS1 knows how to do is find the primary and forward.

Even though there is only a single compaction coordinator, hopefully all services can use the same compaction coordinator interface to interact with it, whether locally or remotely.
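
A forwarding-only implementation of that same (hypothetical) interface might look like the sketch below: every Manager process has an instance, but all it knows how to do is find the primary and forward.

    import java.util.function.Supplier;

    /**
     * Illustrative forwarding stub for a service that is not distributed. It implements
     * the same interface as the real coordinator, but every call is forwarded to the
     * Manager process that currently runs the single real instance.
     */
    public class ForwardingCoordinatorSketch {

      interface CompactionCoordinatorService {
        void addJob(String queueName, String compactionJob);
      }

      interface CoordinatorRpcClient {
        void addJob(String managerAddress, String queueName, String compactionJob);
      }

      static class ForwardingCoordinator implements CompactionCoordinatorService {
        private final Supplier<String> primaryLocator; // e.g. reads the primary's address from ZooKeeper
        private final CoordinatorRpcClient rpc;

        ForwardingCoordinator(Supplier<String> primaryLocator, CoordinatorRpcClient rpc) {
          this.primaryLocator = primaryLocator;
          this.rpc = rpc;
        }

        @Override
        public void addJob(String queueName, String compactionJob) {
          // All this stub knows how to do is find the primary and forward the call.
          rpc.addJob(primaryLocator.get(), queueName, compactionJob);
        }
      }
    }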
