ID	IEP-96
Author	Alex Plehanov
Sponsor
Created	26 Oct 2022
Status	DRAFT

Motivation

During execution of SQL queries, SQL engine can load and keep in memory a lot of rows preventing GC to collect them. Such nodes as spools or sort nodes keep in memory all rows from the input until query is finished. To If there are too many objects, or the objects are too large, the JVM can easily run out of memory. To prevent nodes from running out of memory when executing SQL queries, available to SQL engine memory can be limited with memory quotas.

Description

It's proposed to create two configurable memory quota for SQL engine:

Memory quota for SQL engine (for all queries)
Memory quota for each query

Since, there is no information on node about memory consumption by query from another nodes, all these quotas affects only local node data (for example, each query can consume up to per-query quota on each node). Query should be cancelled if one of theese quotas is exceeded.

Affected execution nodes

Here is a list of fields (and corresponding execution nodes) which collect and store data and can keep up to O(n) rows in memory:

HashAggregateNode.Grouping#groups (HashAggregateNode)
Accumulators.AggAccumulator#buf (HashAggregateNode, SortAggregateNode)
TableSpoolNode#rows (TableSpoolNode)
AbstractSetOpNode.Grouping#groups(MinusNode, IntersectNode)
RuntimeSortedIndex#rows (IndexSpoolNode)
RuntimeHashIndex#rows (IndexSpoolNode)
CollectNode.MapCollector#outBuf (CollectNode)
CollectNode.ArrayCollector#outBuf (CollectNode)
SortNode#rows (SortNode)
SortNode#reversed (SortNode)
NestedLoopJoinNode#rightMaterialized (NestedLoopJoinNode)

All these nodes should be modified to track consumed memory.

Proposed implementation

It's proposed to create 3 levels of memory trackers:

Global memory tracker - control total memory usage by SQL queries on a cluster node.
Per-query memory tracker (perhaps we can start even with per-fragment memory tracker instead of per-query tracker to simplify implementation, since ExecutionContext currently is bounded to the fragment) - control memory usage by a single SQL query/fragment.
Per-execution-node memory tracker - tracks memory usage by a query execution node.

First and second trackers are configurable, third tracker is for internal usage.

Tracker on each level stores amount of memory, allocated by the tracked element and pass this information to the upper level tracker. When tracked element releases the rows (one by one or entirely), corresponding changes should be also reflected to the upper level tracker.

Proposed tracker interfaces, for global memory tracker and query memory tracker:

MemoryTracker

public interface MemoryTracker {
    public void onMemoryAllocated(long size);
    public void onMemoryReleased(long size);
    public void reset();
}

For execution node memory tracker:

RowTracker

public interface RowTracker<Row> {
    public void onRowAdded(Row row);
    public void onRowRemoved(Row row);
    public void reset();
}

Query memory tracker and execution node trackers are single threaded, global memory tracker can be called from the different threads. To reduce contention to upper level trackers track events can be batched on lower level trackers.

The main challenge in this task is implementation of accurate enough and fast enough size calculation instrument.

Size calculation

Object size calculation in Java world is a little bit tricky.

Exact shallow object size can be calculated using tools like JOL [1] or Instrumentation [2], but these tools require java agent to be configured and barely can be used in runtime in production.

Object size can also be estimated without java agent using some assumptions about JVM internals (object header size, alignment, pointer size). For the most of popular JVMs such an approach gives precise results, but for some JVMs or environments result can be not so accurate. Examples of tools, that use such approach: [3], [4], [5].

To calculate object graph size (deep object size) in common case reflection and recursive reference fields traverse must be used, taking into account already visited objects.

In our case we need rather fast than precise tool. We can use cached shallow size for most frequently used classes (classes natively supported by calcite type system and perhaps classes, supported by marshaller, which we can find in _KEY or _VAL fields) and shortcuts to calculate deep object size without using reflection (but still with recursive traversal for some classes, like collections). For other classas (rare case, for example when object marshalled with OptimizedMarshaller appears in _KEY or _VAL column) recursive traversal through reflection is requred. Alternatively, we can skip calculation for such a classes and use some constant value.

Risks and Assumptions

Performance drop due to ineffective row size calculation
Issues due to inaccurate row size calculation (OOM in case of underestimate, false positive query cancelation in case of overestimate)
Double counting of the same row (or subset of columns of row) in different execution nodes (for example, when one row collecting node is below another row collecting node)
Double counting of the same column value in different rows (for example, when large object is used as dynamic parameter and referenced by each row, like in this query: "SELECT a, ? FROM tbl ORDER BY a ")