Current Data Replication Design:

When an index is created, each NC creates a LocalResource object per index partition on that NC. Each LocalResource object has a unique resource ID generated by a ResourceIdFactory. Each NC maintains its own ResourceIdFactory, whose counter is initialized by traversing all LocalResource objects on disk during NC startup. In other words, the LocalResource objects hold the (index name <-> resource ID) translation on that NC.

We use this translation primarily when logging operations against an index of a dataset: instead of serializing the index name in the log record, we serialize its unique resource ID. For this reason, the LocalResource objects must be replicated to the remote replicas so that the translation is still available in case of failures.
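As a rough illustration of this bookkeeping, the sketch below (all class names are hypothetical, not the actual AsterixDB classes) shows an id factory seeded from the LocalResource objects found on disk at startup, plus the (index name <-> resource ID) translation they provide:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for a LocalResource: one per index partition on this NC.
class LocalResourceSketch {
    final long resourceId;   // unique id used in log records instead of the index name
    final String indexName;  // e.g. "TinySocial_DataVerse/Twitter_Idx"

    LocalResourceSketch(long resourceId, String indexName) {
        this.resourceId = resourceId;
        this.indexName = indexName;
    }
}

// Hypothetical stand-in for the per-NC ResourceIdFactory described above.
class ResourceIdFactorySketch {
    private final AtomicLong nextId;

    // Seed the counter from the LocalResource objects found on disk during NC
    // startup, so a new index never reuses the id of an existing one on this NC.
    ResourceIdFactorySketch(Iterable<LocalResourceSketch> onDiskResources) {
        long max = -1;
        for (LocalResourceSketch r : onDiskResources) {
            max = Math.max(max, r.resourceId);
        }
        nextId = new AtomicLong(max + 1);
    }

    long createId() {
        return nextId.getAndIncrement();
    }
}

// The (index name <-> resource ID) translation held by the LocalResource objects.
class ResourceTranslationSketch {
    private final Map<Long, String> idToName = new ConcurrentHashMap<>();
    private final Map<String, Long> nameToId = new ConcurrentHashMap<>();

    void register(LocalResourceSketch r) {
        idToName.put(r.resourceId, r.indexName);
        nameToId.put(r.indexName, r.resourceId);
    }

    String indexNameOf(long resourceId) {
        return idToName.get(resourceId);
    }

    Long resourceIdOf(String indexName) {
        return nameToId.get(indexName);
    }
}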

Example:

A log record on NC1 could look like this:

LSN | Job ID | Type          | Resource ID | Dataset ID | PKValue
100 | 3      | ENTITY_COMMIT | 3           | 1          | 12345

With the introduction of data replication, new fields had to be added, and that log record now looks like this:

Source | Node ID | LSN | Job ID | Type          | Resource ID | Dataset ID | PKValue
Local  | NC1     | 100 | 3      | ENTITY_COMMIT | 3           | 1          | 12345

This record would be logged on the remote replicas NC2 and NC3 as (assuming a replication factor of 3):

Source | Node ID | LSN | Job ID | Type          | Resource ID | Dataset ID | PKValue
Remote | NC1     | X   | 3      | ENTITY_COMMIT | 3           | 1          | 12345

Here X is whatever the log tail offset happens to be on NC2 or NC3, since each NC assigns LSNs from its own log.
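As a rough sketch (the field names below are illustrative, not the actual AsterixDB log serialization), the replicated log record carries the two extra fields like this:

// Illustrative sketch of the extended log record fields, not the actual
// on-disk log format. Source tells locally generated records apart from
// records received from a remote primary; Node ID names that primary.
class ReplicationLogRecordSketch {
    enum Source { LOCAL, REMOTE }

    Source source;    // LOCAL on the generating NC, REMOTE on its replicas
    String nodeId;    // NC that originally generated the record, e.g. "NC1"
    long lsn;         // assigned by each NC from its own log tail, hence "X" on NC2/NC3
    int jobId;        // e.g. 3
    String type;      // e.g. "ENTITY_COMMIT"
    long resourceId;  // resolved to an index via the replicated LocalResource objects
    int datasetId;    // e.g. 1
    byte[] pkValue;   // primary key value, e.g. 12345
}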

Now, when NC1 crashes and later comes back up, it performs remote recovery by asking NC2 or NC3 for its replicated LocalResource objects as well as its log records. Using this information, NC1 can replay the logs, recover its lost state, and reconnect to the cluster, which makes the cluster active again.
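A minimal sketch of that remote recovery flow, reusing the hypothetical classes from the sketches above (the real AsterixDB recovery manager also handles checkpoints, flushed LSNs, and partially committed jobs):

import java.util.List;

// Hypothetical client for pulling replicated state from a remote replica.
interface ReplicaClientSketch {
    List<LocalResourceSketch> getLocalResourcesOf(String nodeId);
    List<ReplicationLogRecordSketch> getLogsOf(String nodeId);
}

class RemoteRecoverySketch {
    private final ResourceTranslationSketch translation = new ResourceTranslationSketch();

    void recover(String failedNodeId, ReplicaClientSketch replica) {
        // 1. Pull the replicated LocalResource objects so resource ids in the
        //    remote logs can be translated back to the indexes they reference.
        for (LocalResourceSketch r : replica.getLocalResourcesOf(failedNodeId)) {
            translation.register(r);
        }
        // 2. Pull the log records that were generated by the failed NC and
        //    redo them against the corresponding local indexes.
        for (ReplicationLogRecordSketch log : replica.getLogsOf(failedNodeId)) {
            redo(log);
        }
        // 3. At this point the NC has recovered its lost state and can rejoin
        //    the cluster, making it active again.
    }

    private void redo(ReplicationLogRecordSketch log) {
        String indexName = translation.indexNameOf(log.resourceId);
        // ... apply the logged operation to that index (omitted in this sketch)
    }
}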

Road to Fault Tolerance:

What if NC1 never comes back and we would like NC2 to take over its workload?

In this case, NC2 needs to read the LocalResource objects that belong to NC1 and replay NC1's logs. However, this is a problem because the resource IDs in those logs have different translations on NC2 (they reference different indexes there). We could overcome this by issuing new LocalResource objects on NC2 for the NC1 indexes. But that means we would also need to replicate these new LocalResource objects to NC3, which already has corresponding LocalResource objects and logs for the same indexes that it received from NC1 before the failure.

We could probably design a complicated way around this issue. However, when NC2 takes over for NC1 and starts generating new logs and LSM disk components for these newly created indexes, NC3 would not know where to replicate them: do they belong to the old NC1 indexes or to the newly created NC2 indexes?

This can be broken down into two problems:

Problem 1:

New resource IDs need to be generated for a failed node's indexes, which would overwrite the old IDs on the other replicas and cause confusion about which index a log record refers to.

Proposed Solution:

Assign a globally unique resource ID per index across all NCs, coordinated by the CC. This guarantees that a node taking over can simply keep using the resource ID that was originally assigned to the index.
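A minimal sketch of what such a CC-owned id source could look like (hypothetical names; in practice NCs would request ids from the CC over its RPC layer):

import java.util.concurrent.atomic.AtomicLong;

// Hypothetical cluster-wide resource id source owned by the CC. NCs ask the CC
// for ids instead of using their local ResourceIdFactory, so an index keeps the
// same resource id no matter which NC serves it after a failover.
class GlobalResourceIdFactorySketch {
    private final AtomicLong nextId;

    // Seeded once on the CC, e.g. from the maximum resource id reported by all
    // NCs when the cluster starts up.
    GlobalResourceIdFactorySketch(long maxExistingId) {
        nextId = new AtomicLong(maxExistingId + 1);
    }

    // Invoked (via a request to the CC) whenever any NC creates a new index.
    long createGlobalId() {
        return nextId.getAndIncrement();
    }
}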

 

Problem 2:

There is no cluster-wide unique storage path for an index.

Example:

NC1 generates LSM_Component1 locally as:

Iodevice1/AsterixStorage/TinySocial_DataVerse/Twitter_Idx/LSM_Component1

This component is replicated on NC2 and NC3 as:

Iodevice1/AsterixStorage/NC1_Replica/TinySocial_DataVerse/Twitter_Idx/LSM_Component1

After NC1 fails and NC2 takes over that index and generates a new LSM_Component2, where should it store it and where should NC3 replicate it?

This would be a wrong path on NC2:

Iodevice1/AsterixStorage/TinySocial_DataVerse/Twitter_Idx/LSM_Component2 

and this would be a wrong path on NC3:

Iodevice1/AsterixStorage/NC2_Replica/TinySocial_DataVerse/Twitter_Idx/LSM_Component2

 

Proposed Solution:

Assign cluster-wide unique partition IDs in AsterixDB and store each index's partition ID in its LocalResource object. If we have 3 NCs, each with 2 IO devices, the partitions would look like this:

NC ID | IO Device ID | Partition ID
NC1   | 0            | 0
NC1   | 1            | 1
NC2   | 0            | 2
NC2   | 1            | 3
NC3   | 0            | 4
NC3   | 1            | 5
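A sketch of this assignment, assuming a fixed node order and the same number of IO devices on every node (names are illustrative; in practice the CC would own this mapping):

import java.util.HashMap;
import java.util.Map;

// Partition ids are assigned as ncIndex * ioDevicesPerNode + ioDeviceId, which
// reproduces the table above. The map also records each partition's home NC.
class ClusterPartitionsSketch {
    // partition id -> home NC id, e.g. 0 -> "NC1", 3 -> "NC2"
    final Map<Integer, String> partitionToNode = new HashMap<>();

    ClusterPartitionsSketch(String[] nodeIds, int ioDevicesPerNode) {
        for (int nc = 0; nc < nodeIds.length; nc++) {
            for (int dev = 0; dev < ioDevicesPerNode; dev++) {
                int partitionId = nc * ioDevicesPerNode + dev;
                partitionToNode.put(partitionId, nodeIds[nc]);
            }
        }
    }
}

// With nodeIds = {"NC1", "NC2", "NC3"} and 2 IO devices per node this yields
// exactly the table above: NC1 -> {0, 1}, NC2 -> {2, 3}, NC3 -> {4, 5}.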

 

This would make the above example look something like this:

NC1 generates LSM_Component1 locally as:

Iodevice1/AsterixStorage/partition_1/TinySocial_DataVerse/Twitter_Idx/LSM_Component1

This component is replicated on NC2 and NC3 as:

Iodevice1/AsterixStorage/partition_1/TinySocial_DataVerse/Twitter_Idx/LSM_Component1

After NC1 fails and NC2 takes over that index and generates a new LSM_Component2, the component is stored and replicated under the same partition in which it was generated, on both NC2 and NC3:

Iodevice1/AsterixStorage/partition_1/TinySocial_DataVerse/Twitter_Idx/LSM_Component2
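The storage path could then be derived from the partition id stored in the index's LocalResource object instead of from the hosting node, for example (sketch of a possible layout, mirroring the paths above):

// Hypothetical helper that builds a partition-relative storage path. The same
// string identifies the index on the primary and on every replica, regardless
// of which NC currently hosts the partition.
class StoragePathSketch {
    static String indexPath(String ioDevice, int partitionId, String dataverse, String index) {
        return ioDevice + "/AsterixStorage/partition_" + partitionId + "/" + dataverse + "/" + index;
    }

    public static void main(String[] args) {
        // Prints: Iodevice1/AsterixStorage/partition_1/TinySocial_DataVerse/Twitter_Idx
        System.out.println(indexPath("Iodevice1", 1, "TinySocial_DataVerse", "Twitter_Idx"));
    }
}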

 

This way, query plans could reference an index on any of these nodes just by knowing its partition ID.

This might also eliminate the need to serialize the Node ID in log records; it could be replaced by the Partition ID, since the mapping between nodes and partitions is known.
