Background

Because Kylin4 depends heavily on Spark SQL, it is worth building a deeper understanding of Spark SQL.

Definition

Catalyst is an execution-agnostic framework to represent and manipulate a dataflow graph, i.e. trees of relational operators and expressions.

The main abstraction in Catalyst is TreeNode that is then used to build trees of Expressions or QueryPlans.
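
To make the tree idea concrete, here is a minimal sketch in plain Scala with no Spark dependency. It is a toy model, not Spark's actual classes: the names `CatalystSketch`, `Expr`, `Literal`, `Attr`, `Add` and `foldConstants` are hypothetical, and a `Map[String, Int]` stands in for an input row. It shows a recursive node type, an expression that evaluates against a row, and a bottom-up rewrite rule of the kind Catalyst applies to trees.

```scala
// Toy model of the Catalyst tree idea. These are NOT Spark's classes.
object CatalystSketch {

  // A node knows its children and how to rebuild itself with new children.
  abstract class TreeNode[T <: TreeNode[T]] { self: T =>
    def children: Seq[T]
    def withNewChildren(newChildren: Seq[T]): T

    // Bottom-up rewrite: transform the children first, then try the rule on this node.
    def transformUp(rule: PartialFunction[T, T]): T = {
      val rebuilt: T =
        if (children.isEmpty) self
        else withNewChildren(children.map(_.transformUp(rule)))
      rule.applyOrElse(rebuilt, (n: T) => n)
    }
  }

  // An expression evaluates to a value given an input row (here a plain Map).
  sealed abstract class Expr extends TreeNode[Expr] {
    def eval(row: Map[String, Int]): Int
  }

  case class Literal(value: Int) extends Expr {
    def children: Seq[Expr] = Nil
    def withNewChildren(newChildren: Seq[Expr]): Expr = this
    def eval(row: Map[String, Int]): Int = value
  }

  case class Attr(name: String) extends Expr {
    def children: Seq[Expr] = Nil
    def withNewChildren(newChildren: Seq[Expr]): Expr = this
    def eval(row: Map[String, Int]): Int = row(name)
  }

  case class Add(left: Expr, right: Expr) extends Expr {
    def children: Seq[Expr] = Seq(left, right)
    def withNewChildren(newChildren: Seq[Expr]): Expr = Add(newChildren(0), newChildren(1))
    def eval(row: Map[String, Int]): Int = left.eval(row) + right.eval(row)
  }

  // A rewrite rule: fold the addition of two literals into a single literal.
  val foldConstants: PartialFunction[Expr, Expr] = {
    case Add(Literal(a), Literal(b)) => Literal(a + b)
  }

  val tree: Expr = Add(Attr("x"), Add(Literal(1), Literal(2)))
  val optimized: Expr = tree.transformUp(foldConstants)
}
```

Here `optimized` is `Add(Attr("x"), Literal(3))`: the rule rewrote the inner subtree, while evaluating either tree against the same row gives the same value.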

Core Contract

Name	Role	Comment
SparkSession	Entry Point to Spark SQL	`SparkSession` is the entry point to Spark SQL and one of the first objects you create when developing a Spark SQL application. You create one with the `SparkSession.builder` method, which gives access to the Builder API used to configure the session.
Dataset	Structured Query with Data Encoder	Dataset is a strongly-typed data structure in Spark SQL that represents a structured query.
Catalyst	Tree Manipulation Framework	Catalyst is an execution-agnostic framework to represent and manipulate a dataflow graph, i.e. trees of relational operators and expressions.
TreeNode	Node in Catalyst Tree	`TreeNode` is a recursive data structure that can have zero or more children that are again `TreeNodes`.
Expression	Executable Node in Catalyst Tree	`Expression` is an executable node (in a Catalyst tree) that can evaluate a result value given input values, i.e. can produce a JVM object per `InternalRow`.
QueryPlan	Structured Query Plan	`QueryPlan` is the part of Catalyst used to build a tree of relational operators for a structured query. In Scala, `QueryPlan` is an abstract class that is the base class of `LogicalPlan` and `SparkPlan` (for logical and physical plans, respectively).
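
The rows above can be tied together in a short sketch. It assumes Spark SQL is on the classpath and runs in local mode; the `Sale` case class, the `CoreContractSketch` object, the application name and the sample data are hypothetical and exist only for illustration. It creates a `SparkSession`, builds a typed `Dataset`, and prints the logical and physical plans behind it, each of which is a `TreeNode`.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record type, defined at top level so Spark can derive an encoder for it.
case class Sale(city: String, amount: Long)

object CoreContractSketch {

  // Builds a small typed Dataset and filters it; the filter becomes a node in the query plan.
  def bigSales(spark: SparkSession): Dataset[Sale] = {
    import spark.implicits._
    Seq(Sale("Shanghai", 10L), Sale("Beijing", 25L)).toDS().filter($"amount" > 20L)
  }

  def main(args: Array[String]): Unit = {
    // SparkSession is the entry point; local[*] is an assumption for running on one machine.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("core-contract-sketch")
      .getOrCreate()

    val qe = bigSales(spark).queryExecution

    // Logical plans (LogicalPlan) and the physical plan (SparkPlan) are all QueryPlans.
    println(qe.analyzed.treeString)
    println(qe.optimizedPlan.treeString)
    println(qe.executedPlan.treeString)

    // QueryPlan exposes its output attributes; TreeNode exposes its children.
    println(qe.optimizedPlan.output.map(_.name))
    println(qe.optimizedPlan.children.size)

    spark.stop()
  }
}
```

The exact shape of the printed plans depends on the Spark version, so no expected output is shown here.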

Core Diagram

Framework UML Diagram

PlantUML

abstract class TreeNode << BASIC >> {
  // TreeNode is a recursive data structure that can have zero or more children that are again TreeNodes.

  + children: Seq[BaseType]
  + verboseString: String
}

abstract class Expression  {
  // only required methods that have no implementation
  + dataType: DataType
  + doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode
  + eval(input: InternalRow = EmptyRow): Any
  + nullable: Boolean
}

abstract class QueryPlan  {
  def output: Seq[Attribute]
  def validConstraints: Set[Expression]
}

TreeNode <|-- Expression
TreeNode <|-- QueryPlan

Credit

...
