IDIEP-101

Author / Sponsor: Alexander, Mirza Aliev, Sergey Utsel

Status: IN PROGRESS


Motivation

Apache Ignite replicates data across multiple cluster nodes in order to achieve high availability and performance. Data partitioning is controlled by the affinity function, which determines both the mapping of keys to partitions and of partitions to nodes. Please consider the article on data distribution [1] as a prerequisite for this proposal. Specifying an affinity function along with a replica factor and partition count is sometimes not enough: explicit, fine-grained tuning is required in order to control what data goes where. Distribution zones provide the aforementioned configuration possibilities, which makes it possible to achieve the following goals:

  • The ability to trigger data rebalance upon adding and removing cluster nodes.
  • The ability to delay rebalance until the topology stabilizes.
  • The ability to conveniently specify data distribution adjustment rules for tables.

Description

Let’s start by introducing the data nodes term - a concept that defines the list of cluster nodes that is subsequently passed to the affinity function to calculate assignments:

Data Nodes is one of the data distribution control constructs. It is evaluated per intersection of a table (or group of tables linked to a distribution zone) and a storage engine, and it exclusively defines the set of nodes that is passed to the affinity function, i.e. the nodes on which the storage-specific partitions of this table (or group of tables) can be located.

Further, unless otherwise stated, table is used as shorthand for the intersection of a table (or group of tables linked to a distribution zone) and a storage engine.

Thus, a new attribute is added to the table state:

  • primaryDistributionZone - a link to the distribution zone that defines the data nodes set for the primary storage.

and a new entity, DistributionZone, is introduced to hold, manage and atomically switch the dataNodes[] attribute.

The dataNodes attribute is evaluated according to the data nodes configuration rules described below.
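
For illustration only, here is a minimal sketch of what such an entity could hold; all class, field and method names below are assumptions made for this example, not the actual Ignite 3 API:

// Illustrative sketch only: the names below are assumptions, not the actual Ignite 3 API.
import java.util.List;

class DistributionZone {
    private final String name;

    // Auto-adjust timeouts in seconds; -1 means "not configured".
    private final long scaleUpTimeoutSeconds;
    private final long scaleDownTimeoutSeconds;

    // Optional filter expression restricting which nodes may become data nodes.
    private final String dataNodesFilter;

    // Data nodes evaluated for this zone; the reference is switched atomically.
    private volatile List<String> dataNodes;

    DistributionZone(String name, long scaleUpTimeoutSeconds, long scaleDownTimeoutSeconds,
            String dataNodesFilter, List<String> initialDataNodes) {
        this.name = name;
        this.scaleUpTimeoutSeconds = scaleUpTimeoutSeconds;
        this.scaleDownTimeoutSeconds = scaleDownTimeoutSeconds;
        this.dataNodesFilter = dataNodesFilter;
        this.dataNodes = List.copyOf(initialDataNodes);
    }

    // The affinity function receives this list to calculate partition assignments.
    List<String> dataNodes() {
        return dataNodes;
    }

    // Atomic switch of the data nodes set, e.g. when an auto-adjust timer fires.
    void switchDataNodes(List<String> newDataNodes) {
        this.dataNodes = List.copyOf(newDataNodes);
    }
}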

Distribution zones configuration syntax

CREATE ZONE
	{ database_name.schema_name.distribution_zone_name | schema_name.distribution_zone_name | distribution_zone_name }
	[WITH
		[
			<data_nodes_auto_adjust> |
			DATA_NODES_FILTER = filter |
			(<data_nodes_auto_adjust>, DATA_NODES_FILTER = filter)
		],
		[PARTITIONS = partitions],
		[REPLICAS = replicas],
		[AFFINITY_FUNCTION = function]
	]
[;]

<data_nodes_auto_adjust> ::= [
	DATA_NODES_AUTO_ADJUST_SCALE_UP = scale_up_value |
	DATA_NODES_AUTO_ADJUST_SCALE_DOWN = scale_down_value |
	(DATA_NODES_AUTO_ADJUST_SCALE_UP = scale_up_value & DATA_NODES_AUTO_ADJUST_SCALE_DOWN = scale_down_value) |
	DATA_NODES_AUTO_ADJUST = auto_adjust_value
]

A few examples:

Ex 1.1: CREATE ZONE zone1 WITH DATA_NODES_AUTO_ADJUST_SCALE_UP = 300, PARTITIONS = 1024, REPLICAS = 3, AFFINITY_FUNCTION = rendezvous; // Scale up only.

Ex 1.2: CREATE ZONE zone2 WITH DATA_NODES_AUTO_ADJUST_SCALE_UP = 300, DATA_NODES_AUTO_ADJUST_SCALE_DOWN = 300; // Both scale up and scale down. The default partition count, replica count and affinity function will be used.

Ex 1.3: CREATE ZONE zone3 WITH DATA_NODES_AUTO_ADJUST = 100; // The same value for both scale up and scale down. The default partition count, replica count and affinity function will be used.

Similar syntax is expected for altering and dropping distribution zones. Let's now look at how a distribution zone is bound to a table:

CREATE TABLE ... WITH PRIMARY_ZONE = primary_zone

Similar syntax is expected for altering a table in order to change its primary zone.

Distribution zones configuration internals

Let's now take a closer look at the two lexemes that were introduced:

  • DATA_NODES_AUTO_ADJUST_<SCALE_UP | SCALE_DOWN> - specifies the timeout, in seconds, between a topology event and the table.dataNodes switch. As mentioned above, in many cases an immediate reaction to a topology event is not preferable because there may be an intention to wait for the topology to stabilize, e.g. waiting for all cluster nodes to start on a cluster restart. For more details on node start and join, please check IEP-77: Node Join Protocol and Initialization for Ignite 3 [2].

There are three auto adjust timeouts:

    1. DATA_NODES_AUTO_ADJUST_SCALE_UP - reacts to nodes being added.
    2. DATA_NODES_AUTO_ADJUST_SCALE_DOWN - reacts to nodes leaving.
    3. DATA_NODES_AUTO_ADJUST - the same timeout for both scale up and scale down.
  • DATA_NODES_FILTER - a filter that determines the set of nodes suitable for a given table.

Let’s check the AUTO_ADJUST rules in more detail.

DATA_NODES_AUTO_ADJUST rules

The easiest way to understand the auto_adjust semantics is to take a look at a few examples. Let’s start with the simplest one - a scale_up example:

Time -1:

    start Node A;
    start Node B;
    start Node C;
    CREATE ZONE zone1 WITH DATA_NODES_AUTO_ADJUST_SCALE_UP = 300;
    CREATE TABLE Accounts ... WITH PRIMARY_ZONE = zone1;

The user starts three Ignite nodes A, B, C and creates table Accounts, specifying a scale up auto adjust timeout of 300 seconds. The Accounts table is created on the current topology, meaning that <Accounts>.dataNodes = [A,B,C].

Time 0:

    start Node D ->
        Node D join/validation ->
        D enters logical topology ->
        logicalTopology.onNodeAdded(Node D) ->
        scale_up_auto_adjust(300) timer is scheduled for the <Accounts> table.

At 0 seconds the user starts one more Ignite node D, which joins the cluster. On entering the logical topology the onNodeAdded event is fired. This event schedules a 300-second timer for table Accounts, after which the dataNodes of that table (transitively, through its distribution zone) will be recalculated from [A,B,C] to [A,B,C,D].

Time 250:

    start Node E ->
        scale_up_auto_adjust(300) timer is rescheduled for the <Accounts> table.

At 250 seconds one more node is added; that reschedules the scale_up_auto_adjust timer for another 300 seconds.

Time 550:

    scale_up_auto_adjust fired ->
        set table.<Accounts>.dataNodes = [A, B, C, D, E]

At 550 seconds scale_up_auto_adjust fires, which leads to <Accounts>.dataNodes recalculation by attaching the nodes that were added to the logical topology - Nodes D and E in the given example.

Time 600:

    start Node F ->
        the <Accounts> table schedules scale_up_auto_adjust(300);

At 600 seconds one more node is added; there are no active scale_up_auto_adjust timers, so this event schedules a new one.
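
To make the rescheduling behaviour above concrete, here is a minimal sketch of a per-zone scale-up timer; the class and method names are illustrative assumptions, not the actual Ignite 3 implementation:

// Illustrative sketch only: a per-zone scale-up timer that is rescheduled on every node-added event.
// Class and method names are assumptions, not the actual Ignite 3 API.
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

class ScaleUpScheduler {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final long scaleUpTimeoutSeconds;
    private ScheduledFuture<?> pendingScaleUp;

    ScaleUpScheduler(long scaleUpTimeoutSeconds) {
        this.scaleUpTimeoutSeconds = scaleUpTimeoutSeconds;
    }

    // Called on every logicalTopology.onNodeAdded event for the zone.
    synchronized void onNodeAdded(Runnable recalculateDataNodes) {
        // If a scale-up timer is already pending, cancel it: the timeout restarts from the latest event.
        if (pendingScaleUp != null && !pendingScaleUp.isDone())
            pendingScaleUp.cancel(false);

        // When the timer finally fires, dataNodes is recalculated once for all accumulated added nodes.
        pendingScaleUp = scheduler.schedule(recalculateDataNodes, scaleUpTimeoutSeconds, TimeUnit.SECONDS);
    }
}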

Now it’s time to extend the example above with a node that leaves the cluster topology:

Time -1:

    start Node A;
    start Node B;
    start Node C;
    CREATE ZONE zone1 WITH DATA_NODES_AUTO_ADJUST_SCALE_UP = 300, DATA_NODES_AUTO_ADJUST_SCALE_DOWN = 300_000;
    CREATE TABLE Accounts ... WITH PRIMARY_ZONE = zone1;

The user starts three Ignite nodes A, B, C and creates table Accounts, specifying a scale up auto adjust timeout of 300 seconds and a scale down auto adjust timeout of 300_000 seconds. The Accounts table is created on the current topology, meaning that <Accounts>.dataNodes = [A,B,C].

Time 0:

    start Node D ->
        Node D join/validation ->
        D enters logical topology ->
        logicalTopology.onNodeAdded(Node D) ->
        scale_up_auto_adjust(300) timer is scheduled for the <Accounts> table.

At 0 seconds the user starts one more Ignite node D, which joins the cluster. On entering the logical topology the onNodeAdded event is fired. This event schedules a 300-second timer for table Accounts, after which the dataNodes of that table will be recalculated from [A,B,C] to [A,B,C,D].

Time 100:

    stop Node C ->
        scale_down_auto_adjust(300_000) timer is scheduled for the <Accounts> table.

At 100 seconds the user stops Node C (or it painfully dies). TableManager detects the onNodeLeft(Node C) event and starts a 300_000-second scale_down timer for table <Accounts>. Please note that the node leaving does not affect the scale_up timer.

Time 250:

    start Node E ->
        scale_up_auto_adjust(300) timer is rescheduled for the <Accounts> table.

At 250 seconds Node E is added, which reschedules the scale_up_auto_adjust timer for another 300 seconds. The important part here is that adding a node does not change the scale_down timer, only the scale_up one.

Time 550:

    scale_up_auto_adjust fired ->
        set table.<Accounts>.dataNodes = [A, B, C, D, E]

At 550 seconds scale_up_auto_adjust fires, which leads to <Accounts>.dataNodes recalculation by attaching the nodes that were added to the logical topology - Nodes D and E in the given example. Please note that even though Node C is no longer in the logical topology, it still keeps its place in <Accounts>.dataNodes.

Time 300_100:

    scale_down_auto_adjust fired ->
        set table.<Accounts>.dataNodes = [A, B, D, E]

At 300_100 seconds the scale_down_auto_adjust timer fires, which leads to removing Node C from <Accounts>.dataNodes.

At this point we’ve covered DATA_NODES_AUTO_ADJUST_SCALE_UP, DATA_NODES_AUTO_ADJUST_SCALE_DOWN and their combination. Let's take a look at the last remaining auto adjust property - DATA_NODES_AUTO_ADJUST. The reason it exists is to eliminate excessive rebalancing when the user intends to use the same value for both scale_up and scale_down. As we saw in the example above, node-added and node-removed events fall into their corresponding frames, each with a dedicated timer: one for expanding the topology (adding nodes) and another for narrowing it (removing nodes), which in turn leads to two rebalances - one per frame. If the user wants to put both types of events (adding and removing nodes) into a single frame with only one dataNodes recalculation and one rebalance, the DATA_NODES_AUTO_ADJUST property should be used.
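
As a rough illustration of this single-frame behaviour, here is a minimal sketch of how one shared DATA_NODES_AUTO_ADJUST timer could collapse both kinds of topology events into a single dataNodes recalculation; again, all names here are assumptions rather than the actual implementation:

// Illustrative sketch only: one shared auto-adjust timer for both node-added and node-left events,
// so a mixed burst of topology changes results in a single dataNodes switch and a single rebalance.
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

class AutoAdjustScheduler {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final long autoAdjustTimeoutSeconds;
    private final Set<String> pendingTopologyChanges = new HashSet<>();
    private ScheduledFuture<?> pendingAdjust;

    AutoAdjustScheduler(long autoAdjustTimeoutSeconds) {
        this.autoAdjustTimeoutSeconds = autoAdjustTimeoutSeconds;
    }

    // Both onNodeAdded and onNodeLeft funnel into the same timer.
    synchronized void onTopologyEvent(String nodeName, Consumer<Set<String>> recalculateDataNodes) {
        pendingTopologyChanges.add(nodeName);

        // Any new event restarts the single timer.
        if (pendingAdjust != null && !pendingAdjust.isDone())
            pendingAdjust.cancel(false);

        pendingAdjust = scheduler.schedule(() -> {
            Set<String> changes;
            synchronized (this) {
                changes = Set.copyOf(pendingTopologyChanges);
                pendingTopologyChanges.clear();
            }
            // One recalculation (and hence one rebalance) covers all accumulated changes.
            recalculateDataNodes.accept(changes);
        }, autoAdjustTimeoutSeconds, TimeUnit.SECONDS);
    }
}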

DATA_NODES_FILTER rule

Often, a user may want to distribute table data on a specific set of nodes that satisfy a particular requirement. E.g. in the case of a geo-distributed cluster, a user may want to place the table closer to the business logic applications in a particular region, or, on the contrary, stretch it over several regions. For hot tables, a user may want to spread data across “fast” nodes, for example nodes with SSD disks.

In order to move further we’ll need to introduce two new terms:

  • Node Attributes
  • Data Nodes Filter

Node Attributes - the union of a semantically meaningful set of key-value node properties specified by the user and dedicated node attributes such as the node name.

Data Nodes Filter - an expression that specifies the rules that a node should match in order to be included in the zone's data nodes.

Let’s assume that there are three nodes with the following attributes:

  • Node A: Attributes = ["Node A", "EU", "SSD"]
  • Node B: Attributes = ["Node B", "COMPUTE_ONLY", "HDD"]
  • Node C: Attributes = ["Node C", "US", "SSD"]

Let’s now check several filters:

  • DATA_NODES_FILTER = '("US" || "EU") && "SSD"' - all nodes whose attributes (tags in this case) contain ("US" or "EU") and "SSD" match, meaning that Node A and Node C match the filter.
  • DATA_NODES_FILTER = '"Node B" || "Node C"' - surprisingly, nodes B and C match the filter.

We might use several approaches for filter syntax and semantics:

  • Regexp.
  • Simple predicate rules (see the sketch after this list)
    • lists, e.g. “US”, “EU”
    • groups ()
    • or ||
    • and &&
    • not !
    • contains()
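
As mentioned above, a predicate-based filter could look as follows. This is a minimal, self-contained sketch of the intended matching semantics against the node attributes from the example; it is not the proposed parser or evaluator, and all names are illustrative:

// Illustrative sketch only: the filter is modeled directly as a predicate over a node's attribute set;
// the actual filter grammar and parser are out of scope here.
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

class DataNodesFilterExample {
    public static void main(String[] args) {
        List<Set<String>> nodeAttributes = List.of(
            Set.of("Node A", "EU", "SSD"),
            Set.of("Node B", "COMPUTE_ONLY", "HDD"),
            Set.of("Node C", "US", "SSD"));

        // Equivalent of DATA_NODES_FILTER = '("US" || "EU") && "SSD"'.
        Predicate<Set<String>> region = attrs -> attrs.contains("US") || attrs.contains("EU");
        Predicate<Set<String>> ssd = attrs -> attrs.contains("SSD");
        Predicate<Set<String>> filter = region.and(ssd);

        // Prints the attribute sets of Node A and Node C - the nodes matching the filter.
        nodeAttributes.stream().filter(filter).forEach(System.out::println);
    }
}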

Data nodes evaluation rules

Compatibility Matrix

                        | AUTO_ADJUST_SCALE_UP | AUTO_ADJUST_SCALE_DOWN | AUTO_ADJUST  | FILTER
AUTO_ADJUST_SCALE_UP    | -                    | compatible             | incompatible | compatible
AUTO_ADJUST_SCALE_DOWN  | compatible           | -                      | incompatible | compatible
AUTO_ADJUST             | incompatible         | incompatible           | -            | compatible
FILTER                  | compatible           | compatible             | compatible   | -
As you can see, the only incompatible combinations are DATA_NODES_AUTO_ADJUST with DATA_NODES_AUTO_ADJUST_SCALE_UP, with DATA_NODES_AUTO_ADJUST_SCALE_DOWN, or with both of them.

Override rules

In addition to specifying DATA_NODES properties on zone creation, it is also possible to update them with ALTER ZONE queries. The set of properties specified in the next request overrides the corresponding previously specified properties.

CREATE ZONE Accounts WITH
	DATA_NODES_AUTO_ADJUST_SCALE_UP = 300,
	DATA_NODES_AUTO_ADJUST_SCALE_DOWN = 300_000;

ALTER ZONE Accounts WITH
	DATA_NODES_AUTO_ADJUST_SCALE_UP = 500;

This will only override the DATA_NODES_AUTO_ADJUST_SCALE_UP property; the SCALE_DOWN one will still be 300_000.

ALTER ZONE Accounts WITH DATA_NODES_AUTO_ADJUST = 1000;

This will override both DATA_NODES_AUTO_ADJUST_SCALE_UP and DATA_NODES_AUTO_ADJUST_SCALE_DOWN because DATA_NODES_AUTO_ADJUST is not compatible with them.

ALTER ZONE Accounts WITH DATA_NODES_FILTER = '("US" || "EU") && "SSD"';

This will add the filter without affecting the DATA_NODES_AUTO_ADJUST properties.
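
A possible reading of these override rules, sketched in code for clarity: properties present in the ALTER request replace their previous values, and setting DATA_NODES_AUTO_ADJUST resets the incompatible per-direction timeouts (the reverse direction is an assumption inferred from the compatibility matrix). All names here are hypothetical, not the actual configuration code:

// Illustrative sketch only: override resolution for zone auto-adjust properties.
// UNSET marks a property that is not configured; names are hypothetical.
class ZoneAutoAdjust {
    static final long UNSET = -1;

    long scaleUp = UNSET;
    long scaleDown = UNSET;
    long autoAdjust = UNSET;

    // Applies an ALTER ZONE request; a null argument means "not mentioned in the request".
    void alter(Long newScaleUp, Long newScaleDown, Long newAutoAdjust) {
        if (newAutoAdjust != null) {
            // DATA_NODES_AUTO_ADJUST is incompatible with the per-direction timeouts, so they are reset.
            autoAdjust = newAutoAdjust;
            scaleUp = UNSET;
            scaleDown = UNSET;
            return;
        }

        if (newScaleUp != null)
            scaleUp = newScaleUp;
        if (newScaleDown != null)
            scaleDown = newScaleDown;

        // Assumed here: any per-direction timeout resets the combined one, since they are incompatible.
        if (newScaleUp != null || newScaleDown != null)
            autoAdjust = UNSET;
    }
}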

Retrieve distribution zones configuration

It should be possible to retrieve the distribution zone configuration both through table views and through a new SQL command, DESCRIBE:

  • DESCRIBE TABLE <tableName>;
  • DESCRIBE TABLE GROUP <tableGroupName>;

Risks and Assumptions

Some aspects of node attributes configuration aren't well designed yet. Besides that, manual setting of data nodes is intentionally skipped. There is a corresponding extension point in the aforementioned design that will allow the user to specify the set of nodes explicitly using a special APPLY keyword; however, it is currently not clear whether it is really needed.

Discussion Links

// Links to discussions on the devlist, if applicable.

Reference Links

[1] https://ignite.apache.org/docs/latest/data-modeling/data-partitioning

[2] https://cwiki.apache.org/confluence/display/IGNITE/IEP-77%3A+Node+Join+Protocol+and+Initialization+for+Ignite+3

Tickets
