This document serves multiple purposes:

  1. This is a living document that describes the decisions that we have made in support of the goals. It is expected to change as decisions are made or changed.
  2. This document is intended to help get people up to speed on what is happening in the elasticity branch.
  3. On release, this document serves as the basis for a blog post.


Accumulo's tables are partitioned into smaller, more manageable pieces called tablets. Partitioning a table into multiple parts allows Accumulo to interact with the table in parallel, which makes some operations faster. Prior to version 2.1.0, a tablet had to be hosted by a Tablet Server before anything could interact with it. Because the Accumulo server-side components do not know which tablets a client will interact with, Accumulo tries to keep all tablets hosted for all online tables. This can be costly for users that don't own their own hardware, as they need enough tablet servers running to keep all of the tablets hosted all the time. To make Accumulo more cost efficient we need to remove this constraint (having tablets hosted all the time). Doing so involves some trade-offs with respect to consistency and latency. This document provides an overview of the goals, design, and major changes being made to Accumulo in support of this goal.

Goal

Our goal is to provide a more cost efficient Accumulo that:

  1. Has minimal changes to the client API
  2. Allows the user to choose how their tablets are hosted
  3. Provides sufficient information for the user to spin up or down more server resources
  4. Enables many operations to work on tablets that are not hosted on a tablet server

Design

The high-level design is to modify Accumulo to allow for tablets in an online table to be unhosted, but still usable for operations that don't require immediate consistency. This requires:

  1. A mechanism for users to specify which tablets should not be hosted
  2. A mechanism for tablets to be hosted when immediate consistency operations are performed
  3. Tablet management functions (bulk import, split, merge, etc.) to be moved out of the Tablet Server so that these functions continue to happen on all tablets for online tables
  4. A mechanism for users to assign table operations to compute resources


Accumulo Consistency Guarantees

In earlier versions Accumulo prioritized data consistency with immediate consistency being the only option. To achieve this consistency, Accumulo:

  • Assigned tablets to one TabletServer
  • Ensured that writes to a tablet are directed to its TabletServer
  • Ensured that writes to a tablet are durable (written to a WAL, recovered upon failure)
  • Ensured that reads from a tablet are consistent (read from in-memory map and tablet files)
  • Ensured that reads from a tablet wait until it is fully online (hosted, recovered, etc.)

In version 2.1.0 we introduced eventually consistent reads using the ScanServer component (see https://accumulo.apache.org/docs/2.x/getting-started/design#scan-server-experimental for more information).

Major Changes

Below is a summary of the major changes in the elasticity branch, which is currently targeting a version 4.0.0 release.


OnDemand Tablets

Tablets now have an availability attribute that must be one of the following.

  • HOSTED : these tablets are always hosted and immediately rehosted when there are tablet server faults.
  • ONDEMAND : these tablets are not hosted by default.  Any immediately consistent API operation against these tablets that requires hosting will cause the tablet to be temporarily hosted.
  • UNHOSTED : these tablets are never hosted.  Any API operation against these tablets that requires hosting will cause an exception in the client.

The default value for a tablet's availability is ONDEMAND and this value is assigned to all user table tablets on an upgrade to version 4.0.0. The root and metadata tables have an availability of HOSTED which cannot be changed by the user.  Users can use the setavailability and getavailability shell commands or new client API methods to change or view a tablet's availability.
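The three availability values can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions: the class AvailabilitySketch, its nested Availability enum, and its methods are hypothetical names invented for illustration and are not Accumulo's client API. The model only captures the behavior described above: HOSTED tablets are hosted up front, ONDEMAND tablets are hosted when an operation requires it, and UNHOSTED tablets cause an exception.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model (not Accumulo code) of how a tablet's availability affects hosting.
class AvailabilitySketch {
    // Hypothetical stand-in for the three availability values described above.
    enum Availability { HOSTED, ONDEMAND, UNHOSTED }

    private final Map<String, Availability> availability = new HashMap<>();
    private final Set<String> hosted = new HashSet<>();

    // Registers a tablet; only HOSTED tablets are hosted up front.
    void addTablet(String tablet, Availability a) {
        availability.put(tablet, a);
        if (a == Availability.HOSTED) {
            hosted.add(tablet);
        }
    }

    boolean isHosted(String tablet) {
        return hosted.contains(tablet);
    }

    // Models an operation that requires hosting, such as an immediate scan.
    void requireHosting(String tablet) {
        Availability a = availability.get(tablet);
        if (a == null) {
            throw new IllegalArgumentException("unknown tablet: " + tablet);
        }
        switch (a) {
            case HOSTED:
                break; // already hosted, nothing to do
            case ONDEMAND:
                hosted.add(tablet); // hosted because an operation needs it
                break;
            case UNHOSTED:
                throw new IllegalStateException("tablet " + tablet + " is UNHOSTED");
        }
    }

    public static void main(String[] args) {
        AvailabilitySketch model = new AvailabilitySketch();
        model.addTablet("t1", Availability.ONDEMAND);
        System.out.println("hosted before: " + model.isHosted("t1"));
        model.requireHosting("t1");
        System.out.println("hosted after: " + model.isHosted("t1"));
    }
}
```

The model leaves out unloading of on-demand tablets and any interaction with table state.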

Immediate Consistency Operations

Operations that require immediate consistency use a BatchWriter for inserting Mutations into tables (aka Live Ingest) or use Scanner or BatchScanner with a consistency level of IMMEDIATE (aka Immediate Scans).

The TabletServer uses a pluggable OnDemandTabletUnloader to determine which on-demand tablets to unload and when. A default implementation exists which will unload a tablet when it has been inactive for a configurable amount of time. The table below (Allowed Tablet Operation by Table State and Tablet Availability) describes which operations are allowed based on table state and tablet availability.
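The inactivity-based unloading described above can be sketched as follows. This is an illustration only, not the OnDemandTabletUnloader interface or its default implementation: the class IdleUnloadSketch, the method tabletsToUnload, and the representation of last-access times as a map of milliseconds are all assumptions made for the example.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model (not Accumulo code) of unloading on-demand tablets after inactivity.
class IdleUnloadSketch {
    // Returns the tablets whose last access is at least maxIdle before nowMillis.
    static List<String> tabletsToUnload(Map<String, Long> lastAccessMillis,
                                        long nowMillis, Duration maxIdle) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastAccessMillis.entrySet()) {
            long idleMillis = nowMillis - entry.getValue();
            if (idleMillis >= maxIdle.toMillis()) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical tablets with made-up last-access timestamps.
        Map<String, Long> lastAccess = new TreeMap<>();
        lastAccess.put("tablet-a", 0L);
        lastAccess.put("tablet-b", 590_000L);
        long now = 600_000L;
        System.out.println(tabletsToUnload(lastAccess, now, Duration.ofMinutes(5)));
    }
}
```

Because the unloader is pluggable, other policies (for example, ones that also consider memory pressure) could replace the idle-time check; the document does not describe any beyond the default.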

Because tablets are now hosted on-demand, the tablet management functions have been dispersed to other components. A majority of the functions will now be performed by the Manager. The CompactionCoordinator server component that was introduced in version 2.1.0 has been merged into the Manager. Compactor processes will now perform all Major Compactions for all tables regardless of hosting state (see https://accumulo.apache.org/blog/2021/07/08/external-compactions.html for more information on CompactionCoordinator and Compactor).


Allowed Tablet Operation by Table State and Tablet Availability

Table State                     | ENABLED | ENABLED   | ENABLED  | DISABLED
Tablet Availability             | HOSTED  | ONDEMAND  | UNHOSTED |
--------------------------------+---------+-----------+----------+---------
Live Ingest                     | yes     | yes       | no       | no
Bulk Import v2                  | yes     | yes [1]   | yes      | no
Immediate Scans                 | yes     | yes       | no       | no
Eventual Scans                  | yes     | yes [1]   | yes      | no
Map Reduce direct read of files | no      | no        | no       | no
Split                           | yes     | yes [1]   | yes      | no
Merge                           | yes     | yes [1]   | yes      | no
Compact                         | yes     | yes [1]   | yes      | no
Clone                           | yes     | yes [1]   | yes      | yes
Export                          | no      | no        | no       | yes
Summaries                       | yes     | yes       | yes      | no
Flush                           | yes     | yes [1,2] | no       | no
Balance                         | yes     | yes [2]   | no       | no
Set/Get Table Properties        | yes     | yes       | yes      | no

[1] This action will not cause the tablet to be hosted
[2] This action will only occur on currently hosted tablets

Multiple Managers

The OnDemand Tablets feature requires that tablet management functions move out of the TabletServer. These functions have moved into the Manager and have been integrated into the process that the Manager uses to determine if tablets need to be assigned or balanced. This process now also checks whether tablets need to be split, merged, or compacted based on the table's configuration. User-initiated tablet management functions are still FaTE operations, but they run within the Manager instead of being delegated to the TabletServer hosting the tablet. Because a single Manager is now doing more work, we may need to support multiple active Managers (instead of an active and a backup Manager). This work is currently in https://github.com/apache/accumulo/pull/3262.



Resource Groups

Having the ability to dedicate resources to tables for various operations like compactions, eventual scans, and immediate scans offers users the ability to dynamically react to changing needs.  Accumulo implements this through a combination of server processes that can be started with a resource group option and user-configurable plugins that map tablet execution needs to resource groups. The Compactor, Scan Server, and Tablet Server each support a group property: compactor.group, sserver.group, and tserver.group, respectively. These properties can be passed to the process as an argument when it is started (e.g. -o tserver.group=App1). These group properties on the server side are used in the following manner:

  1. The TableLoadBalancer gets the table configuration property table.custom.assignment.group and will pass only the subset of Tablet Servers with the matching tserver.group to the table's tablet balancer.
  2. The Accumulo client, when performing an eventually consistent scan and using the ConfigurableScanServerSelector, will use a Scan Server that has a sserver.group that matches the group name in the client configuration.
  3. The Manager will use a Compactor that has a compactor.group that matches the Compaction Service configuration for the table. See DefaultCompactionPlanner for more information.

These properties will allow the user to create a group of Compactors, Scan Servers, and/or Tablet Servers to support the needs of one or more tables. Using the metrics emitted by Accumulo, the user should then be able to scale the servers in each group up or down to support the application's needs. Below is an example of what could be done with this capability.

  1. A compactor resource group CRG1 is created that auto scales up to 10 compactors. This group is intended for all tables to use. This is done by starting compactor processes with a resource group option in a Kubernetes auto scaled pod.
  2. Another compactor resource group CRG2 is created that auto scales up to 100 compactors.  This group is intended for important tables.
  3. Tables T1 and T2 are configured to use CRG2 by editing the Accumulo compaction plugin configuration.
  4. Later it is determined that table T3, which is using CRG1, is falling behind in its compaction needs and this is impacting query performance.  Configuration is adjusted to have T3 use CRG2, giving it access to more resources, which may cost more.
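The group matching used throughout this section (a table or client names a group, and only server processes started with that group are considered) can be sketched as a simple filter. This is a minimal illustration, not Accumulo code: the class ResourceGroupSketch, its Server type, and the addresses are hypothetical, and the real balancer, scan server selector, and compaction planner plugins do considerably more than this.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model (not Accumulo code) of selecting servers by resource group.
class ResourceGroupSketch {
    // A server process and the group it was started with (hypothetical type).
    static final class Server {
        final String address;
        final String group;

        Server(String address, String group) {
            this.address = address;
            this.group = group;
        }
    }

    // Returns only the servers whose group matches the requested group.
    static List<Server> serversInGroup(List<Server> all, String group) {
        List<Server> matching = new ArrayList<>();
        for (Server server : all) {
            if (server.group.equals(group)) {
                matching.add(server);
            }
        }
        return matching;
    }

    public static void main(String[] args) {
        // Made-up addresses; CRG1 and CRG2 are the group names from the example above.
        List<Server> compactors = Arrays.asList(
            new Server("host1:9133", "CRG1"),
            new Server("host2:9133", "CRG2"),
            new Server("host3:9133", "CRG2"));
        for (Server server : serversInGroup(compactors, "CRG2")) {
            System.out.println(server.address);
        }
    }
}
```

In this model, moving table T3 from CRG1 to CRG2 amounts to changing the group name passed to the filter, which mirrors the configuration-only change described in step 4.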

Because the root and metadata tables have an availability of HOSTED, at least one Manager, one Tablet Server, and one Compactor server process need to be running. This is the new minimum required footprint for a functioning Accumulo database. Given the distribution of functions (see the Tablet Function Distribution table below) and the tablet availabilities, different deployment strategies can be used depending on the user's application requirements. For example, if you have an application that does not need immediate consistency, then its tables can use the UNHOSTED availability and the application can insert data using Bulk Import and read data using Scan Servers. On the other hand, if you have an application that requires low latency and immediate consistency, then its tables can use the HOSTED availability and the application can insert data using the BatchWriter and read data from the Tablet Servers.

Tablet Function Distribution

Tablet Operation | Supporting Server Components
-----------------+-----------------------------
Live Ingest      | Tablet Server
Bulk Import v1   | Tablet Server
Bulk Import v2   | Manager
Immediate Scans  | Tablet Server
Eventual Scans   | Scan Server
Split            | Manager
Merge            | Manager
Compact          | Manager + Compactor
Clone            | Manager
Export           | Manager
Summaries        | Tablet Server
Flush            | Tablet Server

Metadata consistency model

Prior to the elasticity changes, Accumulo's metadata consistency model was that only a hosted tablet could edit a tablet's metadata.  One important item in the tablet metadata is the tablet's list of files. For example, when a bulk import operation wanted to add files to a tablet, it would find the hosted tablet and request that it update the tablet metadata to add the files. This old model is not workable for the goals of elasticity because it requires tablets to be hosted to make any changes to tablet metadata. A new model was adopted in elasticity that uses conditional mutations to edit a tablet's metadata. This model allows any server process in the cluster to safely edit a tablet's metadata. For the case when a tablet's metadata is edited while the tablet is hosted, a new reliable refresh mechanism was also added.  This mechanism ensures that hosted tablets read any metadata updates. For example, when a compaction finishes it will use a conditional mutation to update the tablet's set of files. If the tablet is hosted, it will also queue a refresh request that will work even in the case of faults. The refresh request ensures that scans on the tablet see the tablet's new set of files.
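The idea behind conditional mutations is that an update is applied only if the metadata still looks the way the writer expects, so concurrent writers cannot silently overwrite each other. The sketch below models that idea in memory. It is an illustration under stated assumptions, not Accumulo's implementation: the class ConditionalUpdateSketch and its methods are hypothetical, and the condition used here (the whole file set must be unchanged) is a simplification, since this document does not specify the actual conditions Accumulo checks.

```java
import java.util.Set;
import java.util.TreeSet;

// Toy model (not Accumulo code) of conditionally updating a tablet's file set.
class ConditionalUpdateSketch {
    private Set<String> files = new TreeSet<>();

    // Returns a copy of the current file set.
    synchronized Set<String> read() {
        return new TreeSet<>(files);
    }

    // Applies the update only if the stored file set still equals what the
    // caller last read. Returns false if another writer changed it first.
    synchronized boolean conditionalUpdate(Set<String> expected, Set<String> updated) {
        if (!files.equals(expected)) {
            return false;
        }
        files = new TreeSet<>(updated);
        return true;
    }

    public static void main(String[] args) {
        ConditionalUpdateSketch tablet = new ConditionalUpdateSketch();

        // Two writers read the same (empty) starting state.
        Set<String> seenByBulkImport = tablet.read();
        Set<String> seenByOtherWriter = tablet.read();

        Set<String> afterBulkImport = new TreeSet<>(seenByBulkImport);
        afterBulkImport.add("F1");
        System.out.println("bulk import applied: "
            + tablet.conditionalUpdate(seenByBulkImport, afterBulkImport));

        // The second writer's view is now stale, so its update is rejected
        // and it must re-read and retry.
        Set<String> staleUpdate = new TreeSet<>(seenByOtherWriter);
        staleUpdate.add("F2");
        System.out.println("stale update applied: "
            + tablet.conditionalUpdate(seenByOtherWriter, staleUpdate));
    }
}
```

A rejected writer re-reads the metadata and tries again, which is what makes it safe for any server process to attempt an edit. The refresh mechanism for hosted tablets is not modeled here.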

These refresh requests are guaranteed to be processed before any user API operation returns. For example, if a bulk import or compaction is initiated using Accumulo's public API, then after the API operation returns an immediate scan is guaranteed to see the effect of that operation. This refresh model preserves the consistency behavior that Accumulo had for these operations prior to the elasticity changes.  Below is an example of this for compactions.

  1. A user compaction that filters data using iterators is initiated through the public API.
  2. Files are compacted, tablet metadata is updated, and refresh requests are processed.
  3. The API operation completes and returns.
  4. An immediate scan is started and it will not see any of the data that was filtered by the compaction.

Below is an example of this for bulk import.

  1. A bulk import is initiated using the public API.
  2. Tablet metadata is updated to add files and refresh requests are processed.
  3. The API operation completes and returns.
  4. An immediate scan is started and it will see any data that was added via the bulk import.



