Purpose

The purpose of this article is to explain some details about how the Geode Querying Mechanism, also known as OQL, works under the hood. The goal is to provide some UML diagrams and explanations to improve the knowledge of the reader, along with the main classes and interactions involved in the process to better understand and troubleshoot issues within the component.

Prerequisites

Basic UML Knowledge.
Geode OQL Knowledge.

Main Classes

The most important classes and methods are shown in the next two UML diagrams; it's important to note that the main entry points are the DefaultQuery and DefaultQueryService classes.

The second diagram contains the hierarchy for the actual classes that execute the logic related to the query itself, the most important (or most commonly used) one is CompiledSelect.

Creating and Compiling the Query

To create a query, the user executes the method QueryService.newQuery(String queryString) and, depending on some internal conditions and the actual place where the method is executed (client / peer / server), an instance of DefaultQueryService or ProxyQueryService (with its ProxyCache and InternalPool) will be used. The DefaultQuery object then will be instantiated and the queryString will be parsed using the QCompiler and the QueryExecutionContext.

The queryString is internally parsed using three internal classes: OQLParser, OQLLexer and OQLLexerTokenTypes. These classes are automatically generated through ANTLR (ANother Tool for Language Recognition) based on the oql.g file, the main definition for the Geode OQL Grammar. As a side note, ANTLR is a vast and complex framework on its own, the details are outside of the scope of the current article but, in short, it is a powerful parser generator for reading, processing, executing, and translating structured text or binary files; within Geode it is used to generate a parser to build and walk the parsed tree from a grammar definition.

Within the AST generated, each node is an instance of the CompiledValue class (see previous class diagrams for the full list of possible CompiledValue instances) and each instance is, at the same time, a smaller tree on its own. The tree's root node is in general an instance of the CompiledSelect class, which represents the SELECT clause, and within this class resides the logic to iterate through the nodes, decide whether an index can be used or not, etc. To illustrate a little bit more about the CompiledValue class hierarchy, and using simple examples, an instance of CompiledComparison will be created whenever the queryString has something like WHERE field = value, an instance of RangeJunction will be created whenever the queryString has something like aValue < field < otherValue, and so on.

The actual structure of the tree, on the other hand, is strictly dependent on the operands and the order on which the queryString was parsed, that's the reason why sometimes the performance obtained by an OQL execution could drastically change by just modifying the position of the filters within the WHERE clause in the queryString. The internal order of the nodes within the tree structure, moreover, is likely to change in runtime while the query is being evaluated, depending on the amount of data within the node and the buckets, the indexes (if any) being used, etc; so is worth noting that the same queryString could end up being a different tree on different members within the same cluster. This restructuring of the tree happens mainly within the organizeOperands method of the corresponding CompiledValue instances.

Query Execution per Region Type

The execution of the the query differs according to the underlying region type, within this section the main idea will be to provide details about how the query is executed, which classes are involved and how the overall flow process works.

The logic related to the client execution and the orchestration of the messaging itself, on the other hand, is the same no matter the type of the region. That said, and to avoid polluting the core interactions executed on server side depending on the region type, the following sequence diagrams show only the overall flow between client and server for a query execution.

Replicated Region

When using a REPLICATED region the same data is hosted across all members for which the region has been defined, so it's enough to execute the query on one single member instead of distributing the operation among all of them. A client connected to the distributed system through a locator member will get the connection details from a server member hosting the region and, using this connection, will execute the query only on that particular server. In short, one single member will be used to execute the whole query.

A sequence diagram showing the main interactions is shown below.

Partitioned Region

The execution of a query involving a PARTITIONED region is more complex that its counterpart: one node will act as the query coordinator, basically orchestrating the entire query execution across all nodes hosting buckets for the PARTITIONED region. It's worth noticing that any member can become the coordinator for a query execution and there's no logic associated with the election, it's just the member to which the client is connected while executing the query.

The coordinator node determines where the buckets are hosted and sends a QueryMessage to these members so they can execute the query on their local data set. Internally, the algorithm takes the full list of buckets and elects the members hosting the majority of them to reduce the amount of servers that have to execute the query locally and, thus, the amount of distributed messages that have to be sent and waited for. That said, it's important to note that it doesn't matter whether the member hosts a primary or secondary copy of the bucket, the query is executed on both and doesn't make any distinction.

Once every targeted member executes the query locally, it replies with the results to the coordinator node, which in turns aggregates all of the results and apply further operations if needed (ordering, filtering, retrying, etc.). These extra operations and aggregations are not executed until all replies come back. The main implementation resides within the PRQueryProcessor and PartitionedRegionQueryEvaluator classes.

Two sequence diagrams showing the main interactions are shown below, the full details about the distribution of the message are left out for the sake of simplicity and, for the same reason, there are two diagrams instead of just one: the first one corresponds to the execution on the coordinator node, and the second one to the execution on another node that participates in the query (assuming that the coordinator node doesn't host all of the buckets for the partitioned region).

Indexes

The purpose of this section is to give a brief overview about the different indexes types and how they work under the hood, it won't be as extensive as it should because the main topic of this article is OQL. Shortly, a new article will be created to explain indexes in detail.

That said, indexes are designed to significantly improve the query execution performance. Whenever a query is executed without the usage of an index, the query engine needs to iterate through every object within the collection. If, instead, an index that matches part or all of the query specification is available to use, the query engine iterates only over the indexed set, greatly improving the query time.

Indexes are created per node and they are not persisted, so every time a member with indexes defined starts up, it needs to recreate the actual index data locally. The creation process, on the other hand, works differently depending on the region type. When working with a REPLICATED region, the index must be manually created on every member hosting the region (this is automatically executed when creating the index through gfsh). When working with a PARTITIONED region, on the other hand, the index must be created on every single bucket, the actual internal index structure living within the bucket only contains the indexed set for this particular bucket, knowing nothing about the rest of the region data. Due to this, the index creation for this type of region is automatically and internally distributed to every member hosting buckets for the region, no matter the method used (gfsh, API call, etc.). Moreover, whenever a rebalance operation is executed for a PARTITIONED region with indexes defined, all entries are removed from the internal index structure and the index itself is automatically recreated for the bucket on the new member. In summary: the index data doesn't move along with the bucket, it gets recreated on the new member.

It's also worth noting that the usage of indexes might have a negative impact on regular region operations, basically because all associated indexes need to be updated as well during the execution. This scenario, however, is quite rare and the user must have hundreds or thousands of indexes defined to notice it.

There are 4 types of index (hash has been recently deprecated, so there's no point in explaining it here), each one will be briefly covered below.

Range/Functional Index

This type of index gets created whenever the expression has more than one iterator, is the most memory consuming of all types and can't be used with OVERFLOW nor OFF-HEAP.

The internal implementation stores a tuple containing the full copy of the actual value within the region (Region.Entry.getValue) plus the indexed value itself. Thus, and depending on the amount of repetitive values within the region being indexed, one entry can be stored more than once within the index data.

As an example, assuming that:

There is a courses region containing instances of the Course domain class.
The Course domain class has an internal List of students, each one with (at least) a name attribute.
For whatever reason, someone decides that creating an index on the name field of the Student class is a good idea (it would be more appropriate to use the surname as there will be far less repetitions).
The courses region contains the following entries: Course[id=1, students = { Hal, Clark, Lois, Victor }]; Course[id=2, students = { Lois, Hal }]; Course[id=3, students = { Lois, Hal, Victor }]; Course[id=4, students = { Bruce, Lois }].

The internal map-like structure containing the index data will look similar to the following (each Course is a full copy of the original value, the table only shows the id for simplicity):

*KEY*	*VALUE*
*Hal*	{Couse[id=1], Hal}; {Couse[id=2], Hal}; {Couse[id=3], Hal}
*Clark*	{Couse[id=1], Clark}
*Lois*	{Couse[id=1], Lois}; {Couse[id=2], Lois}; {Couse[id=3], Lois}; {Couse[id=4], Lois}
Victor	{Couse[id=1], Victor}; {Couse[id=3], Victor}
*Bruce*	{Couse[id=4], Bruce}

At this point there will be 5 identical instances Course[id=1] held in memory: one being the actual value in the region, plus four used within the internal index structure. It's easy to note that the memory footprint can greatly increase as the amount of repetitions increases as well, and that's because of what was explained before: the index stores an actual copy of the value, not just a reference.

While executing a query that uses this type of index, the query engine itself doesn't need to go and recover the value from the region, it's already stored within the index data and, so, the lookups are really quick.

Compact Range/Functional Index

This type of index gets created whenever the expression has one single iterator, can be used with OVERFLOW and OFF-HEAP.

Unlike its counterpart, this index doesn't store internally a full copy of the original value (Region.Entry.getValue), it stores a reference to the original value (Region.Entry) inserted into the region. This, of course, greatly decreases the memory footprint used, but it also requires the index to be updated synchronously with the region operations. Most importantly, the actual value must always be returned from the region itself by the query engine during the execution, since the index data only contains a reference. For this reason, validation logic should be executed at all times to make sure that the actual value within the Region.Entry in memory hasn't changed while the query was being executed (primarily to avoid returning wrong results).

Primary Key Index

This type of index is mainly used to make the query engine aware of a relationship between a particular field within the value and the actual key itself within a region, it is a good way to improve the query performance when data is partitioned using a key or a field value. The memory footprint is totally negligible and only equality comparisons can be made.

While executing a query that uses this type of index, the query engine basically executes a Region.get(key) to obtain the actual value to be returned.

As an example, assuming that:

There is a students region containing instances of the Student domain class.
The Student class contains an attribute called number, which is also used as the key when inserting objects into the region, like region.put(student.getNumber(), student).
A primary key index has been created on the region, and the expression for the index has been configured as number.

At this point, when the query engine executes the OQL "SELECT * FROM /students st WHERE st.number = `X`", the index will end up just calling Region.get(X) to retrieve the actual value.

Map Index

As it name describes, this type of index is used on Map fields, internally is basically a wrapper of the Range Index (both regular and compact) implementation.

The map-like structure used internally just contains a key representing the actual indexed key on the Map from the domain class, and the value is the instance of the real index being used (RangeIndex or CompactRangeIndex, the preceding one will be created whenever possible).

As an example, assuming that:

There is a points region containing instances of the Point domain class.
The Point class has a coordinate field, of Map type. The map uses two keys: "x" and "y".
A map index has been created on the points region, and the expression for the index has been configured as coordinate[*].

At this point, when the query engine executes the OQL "SELECT * FROM /points p WHERE p.coordinate['x'] = '1' OR p.coordinate['y'] = '1'", the map index will detect that both keys are indexed (x and y), will retrieve the corresponding RangeIndex or CompactRangeIndex from the internal structure, and will finally execute the lookup using these instances.

References

ANother Tool for Language Recognition

Space shortcuts

Page tree

Purpose

Prerequisites

Main Classes

Creating and Compiling the Query

Query Execution per Region Type

Replicated Region

Partitioned Region

Indexes

Range/Functional Index

Compact Range/Functional Index

Primary Key Index

Map Index

References

Space shortcuts

Page tree

Querying and Indexes Internals

Purpose

Prerequisites

Main Classes

Creating and Compiling the Query

Query Execution per Region Type

Replicated Region

Partitioned Region

Indexes

Range/Functional Index

Compact Range/Functional Index

Primary Key Index

Map Index

References