Background

Because Kylin 4 depends heavily on Spark SQL, it is worth understanding Spark SQL in more depth.

Definition

Catalyst is an execution-agnostic framework to represent and manipulate a dataflow graph, i.e. trees of relational operators and expressions.

The main abstraction in Catalyst is TreeNode, which is used to build trees of Expressions or QueryPlans.
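To make the tree idea concrete, here is a minimal sketch of a Catalyst-style tree with a bottom-up rewrite rule. The names `Node`, `Lit`, `Plus`, `withChildren` and `foldConstants` are hypothetical and exist only for this illustration; Spark's real `TreeNode` is considerably richer, although it offers comparable operations such as `withNewChildren` and `transformUp`.

```scala
// Hypothetical, simplified model of a Catalyst-style tree; not Spark's actual classes.
sealed trait Node {
  def children: Seq[Node]

  // Rebuild this node with a new list of children.
  def withChildren(newChildren: Seq[Node]): Node

  // Apply `rule` bottom-up: first to every child, then to the rebuilt node itself.
  def transformUp(rule: PartialFunction[Node, Node]): Node = {
    val rebuilt = withChildren(children.map(_.transformUp(rule)))
    rule.applyOrElse(rebuilt, identity[Node])
  }
}

// Leaf node: an integer literal with no children.
case class Lit(value: Int) extends Node {
  val children: Seq[Node] = Nil
  def withChildren(newChildren: Seq[Node]): Node = this
}

// Binary node: addition of two sub-trees.
case class Plus(left: Node, right: Node) extends Node {
  def children: Seq[Node] = Seq(left, right)
  def withChildren(newChildren: Seq[Node]): Node = Plus(newChildren(0), newChildren(1))
}

object TreeSketch {
  // Constant folding: a Plus of two literals collapses into a single literal.
  val foldConstants: PartialFunction[Node, Node] = {
    case Plus(Lit(a), Lit(b)) => Lit(a + b)
  }

  def main(args: Array[String]): Unit = {
    val tree = Plus(Lit(1), Plus(Lit(2), Lit(3)))
    println(tree.transformUp(foldConstants)) // prints Lit(6)
  }
}
```

The rule only describes a local pattern; the recursion in `transformUp` is what applies it across the whole tree, which is the general shape of a tree-rewriting optimizer.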

Core Components

Name	Target
SQL Parser Framework	The SQL Parser Framework in Spark SQL uses ANTLR to translate SQL text into a data type, Expression, `TableIdentifier` or LogicalPlan.
Catalyst Framework	Catalyst is an execution-agnostic framework to represent and manipulate a dataflow graph, i.e. trees of relational operators and expressions.
Tungsten Execution Backend	Project Tungsten aims to improve Spark execution by optimizing Spark jobs for CPU and memory efficiency (as opposed to network and disk I/O, which are considered fast enough). Tungsten focuses on the hardware architecture of the platform Spark runs on (JVM, LLVM, GPU, NVRAM and so on). It offers three optimization features: off-heap memory management, which uses a binary in-memory data representation (the Tungsten row format) and manages memory explicitly; cache locality, i.e. cache-aware computation over a cache-aware layout for high cache hit rates; and whole-stage code generation (CodeGen).
Monitor	The SQL tab in the web UI shows SQLMetrics per physical operator of a structured query's physical plan. The tab is available under the `/SQL` URL, e.g. http://localhost:4040/SQL/. By default it lists all SQL query executions; once a query is selected, it shows the details of that structured query execution.
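The components in the table can be observed on a single query. The sketch below assumes a local Spark installation on the classpath; the application name, the `local[*]` master and the query itself are illustrative choices. `explain(true)` prints the parsed, analyzed and optimized logical plans followed by the physical plan, and in typical Spark 2.x/3.x output the operators fused by whole-stage code generation carry a `*(n)` prefix.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object PlanPhasesSketch {
  // Illustrative query: keep even ids and scale them.
  def buildQuery(spark: SparkSession): DataFrame =
    spark.range(100).filter("id % 2 = 0").selectExpr("id * 10 AS scaled")

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only.
    val spark = SparkSession.builder()
      .appName("plan-phases-sketch")
      .master("local[*]")
      .getOrCreate()

    val query = buildQuery(spark)

    // Prints the logical plans (parsed, analyzed, optimized) and the physical plan.
    query.explain(true)

    // The same stages are available programmatically through QueryExecution.
    val qe = query.queryExecution
    println(qe.logical)       // logical plan before analysis
    println(qe.analyzed)      // logical plan after name and type resolution
    println(qe.optimizedPlan) // logical plan after Catalyst optimization rules
    println(qe.executedPlan)  // physical plan that will actually run

    spark.stop()
  }
}
```

While the application is running, the same query also appears in the SQL tab of the web UI described in the Monitor row.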

...

Core Contract

Core Interface

Catalyst is an execution-agnostic framework to represent and manipulate a dataflow graph, i.e. trees of relational operators and expressions.

Name	Contract	Comment
SparkSession	Entry Point to Spark SQL	`SparkSession` is the entry point to Spark SQL and one of the very first objects you create while developing a Spark SQL application. You create a `SparkSession` with the SparkSession.builder method, which gives access to the Builder API used to configure the session.
Dataset	Structured Query with Data Encoder	`Dataset` is a strongly-typed data structure in Spark SQL that represents a structured query.
Catalyst	Tree Manipulation Framework	Catalyst represents and manipulates trees of relational operators and expressions; its main abstraction is TreeNode.
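The following sketch shows the first two contracts together: a session created through `SparkSession.builder`, and a strongly-typed `Dataset` built from a case class. The `Sale` type, its sample rows, the application name and the `local[*]` master are assumptions made for the example only.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record type used only for this illustration.
case class Sale(region: String, amount: Double)

object DatasetSketch {
  // Returns the sales whose amount exceeds `threshold` as a typed Dataset.
  def largeSales(spark: SparkSession, threshold: Double): Dataset[Sale] = {
    import spark.implicits._ // brings Encoder[Sale] and toDS() into scope
    val sales = Seq(Sale("east", 120.0), Sale("west", 80.0), Sale("east", 45.5)).toDS()
    // Typed filter: the lambda receives Sale objects and is checked at compile time.
    sales.filter(_.amount > threshold)
  }

  def main(args: Array[String]): Unit = {
    // The Builder API configures the session before it is created (or reused).
    val spark = SparkSession.builder()
      .appName("dataset-sketch")
      .master("local[*]")
      .getOrCreate()

    largeSales(spark, 50.0).show()
    spark.stop()
  }
}
```

The encoder brought in by `spark.implicits._` is what lets Spark convert between `Sale` objects and its internal row format, which is why the contract is described as a structured query with a data encoder.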

Parser Framework

Name	Contract	Comment
ParserInterface	Base of SQL Parser	`ParserInterface` is the abstraction of SQL parsers that can convert (parse) the textual representation of SQL statements into Expressions, LogicalPlans, TableIdentifiers, FunctionIdentifiers, StructTypes, and DataTypes.
AbstractSqlParser	Base SQL Parsing Infrastructure	`AbstractSqlParser` is the base of ParserInterfaces that use an AstBuilder to parse SQL statements and convert them to Spark SQL entities, i.e. DataType, StructType, Expression, LogicalPlan and `TableIdentifier`.
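As a rough illustration of the parser contracts above, the sketch below calls `CatalystSqlParser`, a concrete `AbstractSqlParser` shipped in the `spark-catalyst` module. It is an internal API, so exact behaviour may differ between Spark versions; the SQL strings and table names are made up for the example.

```scala
import org.apache.spark.sql.catalyst.parser.CatalystSqlParser

object ParserSketch {
  def main(args: Array[String]): Unit = {
    // SQL text -> Expression (still unresolved: no catalog lookup happens here)
    val expr = CatalystSqlParser.parseExpression("price * 2 + 1")
    println(expr)

    // SQL text -> LogicalPlan (also unresolved)
    val plan = CatalystSqlParser.parsePlan("SELECT id FROM orders WHERE id > 10")
    println(plan)

    // SQL text -> TableIdentifier
    val table = CatalystSqlParser.parseTableIdentifier("sales_db.orders")
    println(s"database=${table.database}, table=${table.table}")

    // SQL text -> DataType and StructType
    println(CatalystSqlParser.parseDataType("array<int>"))
    println(CatalystSqlParser.parseTableSchema("id INT, name STRING"))
  }
}
```

Parsing needs no running `SparkSession`: the parser only turns text into trees, and resolving names against the catalog is left to the later analysis phase.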

Catalyst Framework

Name	Contract	Comment
TreeNode	Node in Catalyst Tree	`TreeNode` is a recursive data structure that can have one or many children that are again `TreeNodes`.
Expression	Executable Node in Catalyst Tree	`Expression` is an executable node (in a Catalyst tree) that can evaluate a result value given input values, i.e. it can produce a JVM object per `InternalRow`.
QueryPlan	Structured Query Plan	`QueryPlan` is the part of Catalyst used to build a tree of relational operators of a structured query. In Scala terms, `QueryPlan` is an abstract class that is the base class of LogicalPlan and SparkPlan (for logical and physical plans, respectively).
Catalog	Metastore Management Interface	`Catalog` is the interface for managing a metastore (aka metadata catalog) of relational entities (e.g. databases, tables, functions, table columns and temporary views). `Catalog` is available through the SparkSession.catalog property.
Attribute	Base of leaf named expressions	`Attribute` is the base of leaf named expressions.
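The `Expression` contract ("produce a JVM object per `InternalRow`") can be tried out directly, without a running session. This sketch builds a small expression tree from Catalyst's internal classes `Add`, `BoundReference` and `Literal` and evaluates it against one row; these are internal APIs whose constructors have gained optional parameters across Spark versions, so treat it as an illustration rather than a stable recipe.

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{Add, BoundReference, Literal}
import org.apache.spark.sql.types.IntegerType

object ExpressionSketch {
  // Expression tree for "integer column 0 of the input row, plus 1".
  val plusOne = Add(BoundReference(0, IntegerType, nullable = false), Literal(1))

  def main(args: Array[String]): Unit = {
    // An Expression is a TreeNode, so its operands are reachable as children.
    println(plusOne.children)

    // dataType is one of the members every Expression must provide.
    println(plusOne.dataType)

    // eval interprets the tree against a single InternalRow.
    println(plusOne.eval(InternalRow(41)))
  }
}
```

`eval` is the interpreted path; `doGenCode` is the counterpart used by whole-stage code generation to emit Java source for the same computation.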

...

Physical Operator

Name	Contract	Comment
Exchange	Base for Unary Physical Operators that Exchange Data	`Exchange` is the base of unary physical operators that exchange data among multiple threads or processes.
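One simple way to see an `Exchange` in a physical plan is to force a shuffle. The sketch below repartitions a small range and inspects the plan text; the partition count, application name and `local[*]` master are arbitrary, and the exact plan rendering varies between Spark versions, so the code only looks for the operator name.

```scala
import org.apache.spark.sql.SparkSession

object ExchangeSketch {
  // Text rendering of the physical plan for a query that repartitions its input.
  def planText(spark: SparkSession): String =
    spark.range(1000).repartition(4).queryExecution.executedPlan.toString

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("exchange-sketch")
      .master("local[*]")
      .getOrCreate()

    val text = planText(spark)
    println(text)
    // repartition redistributes rows across partitions, which is planned as an Exchange.
    println(s"contains Exchange: ${text.contains("Exchange")}")

    spark.stop()
  }
}
```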

...

Core Diagram

...

Framework UML Diagram

High-level Interface

PlantUML

abstract class TreeNode << BASIC >> {
   // TreeNode is a recursive data structure that can have one or many children that are again TreeNodes.
  
  -children : Seq[BaseType]
  -verboseString: String
}

abstract class Expression  {
  // only required methods that have no implementation
  + dataType: DataType
  + doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode
  + eval(input: InternalRow = EmptyRow): Any
  + nullable: Boolean
}

abstract class QueryPlan  {
  def output: Seq[Attribute]
  def validConstraints: Set[Expression]
}

abstract class LeafExpression {

  + children: Seq[Expression] = Nil
}

abstract class NamedExpression {
  + foldable: Boolean = false

  + name: String
  + exprId: ExprId
  + qualifiedName: String = (qualifier :+ name).mkString(".")

  + qualifier: Seq[String]

  + toAttribute: Attribute
  + metadata: Metadata = Metadata.empty
  + newInstance(): NamedExpression
  + typeSuffix = ..
}

abstract class Attribute {

  + references: AttributeSet = AttributeSet(this)

  + withNullability(newNullability: Boolean): Attribute
  + withQualifier(newQualifier: Seq[String]): Attribute
  + withName(newName: String): Attribute
  + withMetadata(newMetadata: Metadata): Attribute
  + withExprId(newExprId: ExprId): Attribute

  + toAttribute: Attribute = this
  + newInstance(): Attribute

}

' Layer 1
TreeNode <|-- Expression
TreeNode <|-- QueryPlan

Expression <|-- NamedExpression
Expression <|-- LeafExpression

LeafExpression <|-- Attribute
NamedExpression <|-- Attribute

...


Credit

...
