ID      | IEP-59
Author  |
Sponsor | Anton Vinogradov
Created | 14.10.2020
Status  | DRAFT
Many use-cases build on observing and processing changed records. These use-cases include, but are not limited to:

For now, such scenarios are hard to implement in Ignite. The only solution that can help with them, for now, is a Continuous Query (CQ). Disadvantages of the CQ in the described scenarios:

The convenient solution should be:
Ignite CDC is a new utility that should be run on the server node host. The CDC utility watches for the appearance of WAL archive segments. When a segment is archived, the utility iterates over it using the existing WAL Iterator and notifies the CDC Consumer of each record from the segment.
Public API:

```java
/** Consumer of data change events. */
@IgniteExperimental
public interface ChangeDataCaptureConsumer {
    /** Starts the consumer. */
    public void start();

    /**
     * Handles entry change events.
     * If this method returns {@code true} then the current offset will be stored
     * and ongoing notifications after CDC application fail/restart will be started from it.
     *
     * @param events Entry change events.
     * @return {@code True} if the current offset should be saved on the disk
     * to continue from it in case of any failure or restart.
     */
    public boolean onEvents(Iterator<ChangeDataCaptureEvent> events);

    /**
     * Stops the consumer.
     * This method can be invoked only after {@link #start()}.
     */
    public void stop();
}

/**
 * Event of a single entry change.
 * An instance presents the new value of the modified entry.
 *
 * @see IgniteCDC
 * @see ChangeDataCaptureConsumer
 */
@IgniteExperimental
public interface ChangeDataCaptureEvent extends Serializable {
    /** @return Key for the changed entry. */
    public Object key();

    /** @return Value for the changed entry or {@code null} in case of entry removal. */
    @Nullable public Object value();

    /**
     * @return {@code True} if the event was fired on the primary node for the partition containing this entry.
     * @see <a href="
     * https://ignite.apache.org/docs/latest/configuring-caches/configuring-backups#configuring-partition-backups">
     * Configuring partition backups.</a>
     */
    public boolean primary();

    /**
     * Ignite splits a dataset into smaller chunks to distribute them across the cluster.
     * {@link ChangeDataCaptureConsumer} implementations can use {@link #partition()} to split change processing
     * in the same way as it is done for the cache.
     *
     * @return Partition number.
     * @see Affinity#partition(Object)
     * @see Affinity#partitions()
     * @see <a href="https://ignite.apache.org/docs/latest/data-modeling/data-partitioning">Data partitioning</a>
     * @see <a href="https://ignite.apache.org/docs/latest/data-modeling/affinity-collocation">Affinity collocation</a>
     */
    public int partition();

    /** @return Version of the entry. */
    public CacheEntryVersion version();

    /**
     * @return Cache ID.
     * @see org.apache.ignite.internal.util.typedef.internal.CU#cacheId(String)
     * @see CacheView#cacheId()
     */
    public int cacheId();
}

/**
 * Entry event order.
 * Two concurrent updates of the same entry can be ordered based on {@link CacheEntryVersion} comparison.
 * A greater value means that the event occurred later.
 */
@IgniteExperimental
public interface CacheEntryVersion extends Comparable<CacheEntryVersion>, Serializable {
    /**
     * Order of the update. The value is an incremental counter value. The scope of the counter is the node.
     * @return Version order.
     */
    public long order();

    /** @return Order of the node on which this version was assigned. */
    public int nodeOrder();

    /**
     * Cluster id is a value to distinguish updates in case the user wants to aggregate and sort updates from several
     * Ignite clusters. {@code clusterId} can be set for the node using
     * {@link GridCacheVersionManager#dataCenterId(byte)}.
     *
     * @return Cluster id.
     */
    public byte clusterId();

    /** @return Topology version plus the number of seconds from the start time of the first grid node. */
    public int topologyVersion();

    /**
     * If the source of the update is the "local" cluster, then {@code null} will be returned.
     * If the update comes from another cluster using {@link IgniteInternalCache#putAllConflict(Map)},
     * then the entry version for the other cluster is returned.
     *
     * @return Replication version.
     * @see IgniteInternalCache#putAllConflict(Map)
     * @see IgniteInternalCache#removeAllConflict(Map)
     */
    public CacheEntryVersion otherClusterVersion();
}
```
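For illustration, a minimal consumer built on this API could look like the sketch below. This is a hypothetical example, not part of the proposal: `CdcEventStub` is a simplified stand-in for `ChangeDataCaptureEvent` so the snippet compiles without Ignite on the classpath, and `CountingCdcConsumer` is an invented name.

```java
import java.util.Iterator;

/** Simplified stand-in for ChangeDataCaptureEvent (only the fields this sketch needs). */
interface CdcEventStub {
    Object key();
    Object value();
    boolean primary();
}

/** Hypothetical consumer: applies only primary-copy events and commits the offset after every batch. */
class CountingCdcConsumer {
    long applied;

    /** Mirrors ChangeDataCaptureConsumer#onEvents: the return value asks Ignite to persist the current offset. */
    public boolean onEvents(Iterator<CdcEventStub> events) {
        while (events.hasNext()) {
            CdcEventStub evt = events.next();

            // Each update is delivered on the primary node and on every backup; process the primary copy only.
            if (evt.primary())
                applied++;
        }

        return true; // Save the offset so that a restart resumes after this batch.
    }

    /** Helper to build a stub event. */
    static CdcEventStub event(Object key, Object val, boolean primary) {
        return new CdcEventStub() {
            @Override public Object key() { return key; }
            @Override public Object value() { return val; }
            @Override public boolean primary() { return primary; }
        };
    }
}
```

Filtering on `primary()` is the simplest way to process each change exactly once when backups are configured; a consumer that wants failover coverage could instead deduplicate by `version()`.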
Risks and Assumptions
Capturing data changes from WAL archives in a separate process makes the lag between the moment a CDC event happens and the moment the consumer is notified about it relatively big. It is proposed to provide the ability to capture data and notify consumers directly from the Ignite process. This minimizes the lag at the cost of additional memory usage.
Enable OnlineCDC on cluster:
1. Put the cluster into ACTIVE_READ_ONLY mode.
2. Run ignite-cdc.sh in BACKUP mode.
3. Put the cluster into ACTIVE state.

Note that ignite-cdc.sh can be run in 2 modes - BACKUP and ACTIVE:
- BACKUP is used as a backup process for OnlineCDC; such a process may fetch the CDC configuration from IgniteConfiguration. The use case is async replication between master and stand-by clusters.
- ACTIVE is used as an independent process that doesn't rely on OnlineCDC and has its own configuration. The use case is filling a cold data lake.

Ignite node restart after failure:
Stop OnlineCDC and use ignite-cdc instead:
Stop both CDC - Online and ignite-cdc:
Ignite:
- IgniteConfiguration#OnlineCdcConfiguration - CdcConsumer, keepBinary.
- DataStorageConfiguration#onlineCdcBufSize - by default (walSegments * walSegmentSize); that is 640 MB by default now.
- DataRegionConfiguration#cdcMode - BACKGROUND, ONLINE (default is BACKGROUND):
  - BACKGROUND - makes hard links of archived segments into the cdc directory, which is watched by the background ignite-cdc process.
  - ONLINE - OnlineCDC enabled + still does the BACKGROUND mode job.

ignite-cdc:
- CdcConfiguration#mode - ACTIVE, BACKUP (default is ACTIVE if OnlineCDC is not configured, and BACKUP otherwise).

control.sh:
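A hypothetical wiring of the settings above might look as follows. This is a non-runnable configuration sketch: the setter names (`setOnlineCdcConfiguration`, `setOnlineCdcBufSize`, `setCdcMode`) follow the proposed property names but do not exist in current Ignite releases, and `MyCdcConsumer` is a placeholder for a user-defined consumer.

```java
IgniteConfiguration cfg = new IgniteConfiguration();

// Proposed: register the CDC consumer directly in the node configuration.
OnlineCdcConfiguration cdcCfg = new OnlineCdcConfiguration();
cdcCfg.setConsumer(new MyCdcConsumer()); // Placeholder ChangeDataCaptureConsumer implementation.
cdcCfg.setKeepBinary(true);

cfg.setOnlineCdcConfiguration(cdcCfg);

DataStorageConfiguration storageCfg = new DataStorageConfiguration();

// Proposed default: walSegments * walSegmentSize, i.e. 640 MB with current defaults.
storageCfg.setOnlineCdcBufSize(640L * 1024 * 1024);

// ONLINE enables OnlineCDC and still performs the BACKGROUND-mode job.
storageCfg.getDefaultDataRegionConfiguration().setCdcMode(CdcMode.ONLINE);

cfg.setDataStorageConfiguration(storageCfg);
```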
Note that there is some confusion in the use of the word "segment":
- WAL segment is a file in the WAL archive; its size is configured with DataStorageConfiguration#walSegmentSize.
- ReadSegment is a slice of the mmap WAL segment. It contains WAL records to sync with the actual file. The size of this segment differs from time to time; its maximum can be configured with DataStorageConfiguration#walBuffSize.

On Ignite start during memory restore (in the main thread):
- If DataRegionConfiguration#cdcMode == ONLINE, then create CdcProcessor.
- CdcProcessor reads from the Metastorage the last persisted CdcConsumerState.
- If CdcState#enabled is false, then skip initialization.
- If CdcState == null, then initialize.
- GridCacheDatabaseSharedManager#performBinaryMemoryRestore.
Entrypoint for WALRecords to be captured by CDC. Options are:
- FileWriteAheadLogManager#log(WALRecord).

It is proposed to use the first option.
CdcWorker is a thread responsible for collecting WAL records, transforming them to CdcEvents and submitting them to a CdcConsumer. The worker collects records in a queue.
Capturing from the buffer (wal-sync-thread):
Otherwise, stop online CDC:
- write CdcConsumerState with (enabled=false, last sent WALPointer);
- write StopOnlineCdcRecord into WAL (use the prepared CdcConsumerState).

It's also possible to stop Online CDC using a command in control.sh. In this case it also writes StopOnlineCdcRecord.
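The start-up decision described earlier (initialize when no state has been persisted yet, skip when the persisted state is disabled) can be sketched with a self-contained stand-in for CdcConsumerState. All names here are placeholders for illustration; a numeric field plays the role of the last sent WALPointer.

```java
/** Simplified stand-in for the proposed CdcConsumerState persisted in the Metastorage. */
class CdcStateStub {
    final boolean enabled;
    final long lastPointer; // Stand-in for the last sent WALPointer.

    CdcStateStub(boolean enabled, long lastPointer) {
        this.enabled = enabled;
        this.lastPointer = lastPointer;
    }

    /**
     * Start-up decision for online CDC:
     * - no persisted state yet (first start): initialize;
     * - persisted state with enabled=false (written by the stop procedure): skip initialization;
     * - persisted state with enabled=true: initialize and resume from lastPointer.
     */
    static boolean shouldInitialize(CdcStateStub persisted) {
        if (persisted == null)
            return true;

        return persisted.enabled;
    }
}
```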
Body loop (cdc-worker-thread):
- write OnlineCdcRecord record to WAL with the WALPointer.

```java
class OnlineCdcRecord extends WALRecord {
    private WALPointer last;
}

class StopOnlineCdcRecord extends WALRecord {
    private WALPointer last;
}
```
Handling of OnlineCdcRecord and StopOnlineCdcRecord by ignite-cdc:
- OnlineCdcRecord - clears obsolete links from the CDC directory.
- StopOnlineCdcRecord - switches to ACTIVE mode, starts capturing from the last WALPointer (from the previous OnlineCdcRecord).

```java
class CdcWorker {
    private final CdcConsumer consumer;

    private final long checkFreq;

    // Invoked in wal-sync-thread.
    public void offer(ReadSegment seg) {
        // Check capacity, adding segment to the queue.
    }

    // online-cdc-thread
    public void body() {
        // Polling queue, push to CdcConsumer, writing CdcState to MetaStorage.
    }
}
```
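As an illustration only, the skeleton above could be fleshed out along the following lines. The types here are simplified stand-ins (a String plays the role of ReadSegment, `java.util.function.Consumer` the role of CdcConsumer), and the capacity check models the proposed onlineCdcBufSize limit; when the buffer overflows, the proposal stops online CDC and falls back to the background mode.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.Consumer;

/** Simplified sketch of the proposed CdcWorker: a bounded queue between wal-sync-thread and cdc-worker-thread. */
class CdcWorkerSketch {
    private final Consumer<String> consumer; // Stand-in for CdcConsumer.
    private final int capacity;              // Stand-in for the onlineCdcBufSize limit.
    private final Queue<String> queue = new ArrayDeque<>();

    CdcWorkerSketch(Consumer<String> consumer, int capacity) {
        this.consumer = consumer;
        this.capacity = capacity;
    }

    /** Invoked in wal-sync-thread: reject the segment when the buffer is full instead of blocking WAL sync. */
    public synchronized boolean offer(String segment) {
        if (queue.size() >= capacity)
            return false; // Overflow: here the real worker would stop online CDC and fall back to BACKGROUND mode.

        return queue.add(segment);
    }

    /** One iteration of the cdc-worker-thread body: drain the queue and push segments to the consumer. */
    public synchronized int drainOnce() {
        int processed = 0;

        for (String seg; (seg = queue.poll()) != null; processed++)
            consumer.accept(seg); // The real worker would transform WAL records to CdcEvents first.

        return processed;
    }
}
```

A real implementation would run `drainOnce` in a loop on a dedicated thread and periodically persist CdcConsumerState; the sketch keeps both sides synchronous so the back-pressure behavior is easy to see.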
https://dev.mysql.com/doc/refman/8.0/en/mysqlbinlog.html
https://debezium.io/documentation/reference/1.2/architecture.html
https://jdbc.postgresql.org/documentation/head/replication.html
https://www.oracle.com/middleware/technologies/goldengate.html