
Apache Kylin : Analytical Data Warehouse for Big Data


Welcome to Kylin Wiki.



Background

Since Kylin 4 depends heavily on Spark SQL, it is worth having a deeper understanding of Spark SQL.

Definition

Catalyst is an execution-agnostic framework to represent and manipulate a dataflow graph, i.e. trees of relational operators and expressions.

The main abstraction in Catalyst is TreeNode that is then used to build trees of Expressions or QueryPlans.

Core Components

SQL Parser Framework: uses ANTLR to translate a SQL text into a DataType, Expression, TableIdentifier or LogicalPlan.

Catalyst Framework: an execution-agnostic framework to represent and manipulate a dataflow graph, i.e. trees of relational operators and expressions (see Definition above).

Tungsten Execution Backend: optimizes Spark jobs for CPU and memory efficiency, as described below.

The goal of Project Tungsten is to improve Spark execution by optimizing Spark jobs for CPU and memory efficiency (as opposed to network and disk I/O which are considered fast enough). Tungsten focuses on the hardware architecture of the platform Spark runs on, including but not limited to JVM, LLVM, GPU, NVRAM, etc. It does so by offering the following optimization features:

  1. Off-Heap Memory Management using binary in-memory data representation aka Tungsten row format and managing memory explicitly,

  2. Cache Locality which is about cache-aware computations with cache-aware layout for high cache hit rates,

  3. Whole-Stage Code Generation (aka CodeGen).
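For a quick look at whole-stage code generation in action, operators prefixed with an asterisk in the physical plan run inside a single generated Java function. A minimal sketch, assuming a SparkSession named spark (as provided by spark-shell):

// Operators marked with '*' (and a codegen stage id) run inside one whole-stage-codegen stage
spark.range(10).selectExpr("id * 2 AS doubled").explain()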

Monitor

The SQL tab in the web UI shows SQLMetrics per physical operator in a structured query's physical plan.

You can access the SQL tab under /SQL URL, e.g. http://localhost:4040/SQL/.

By default, it displays all SQL query executions. However, after a query has been selected, the SQL tab displays the details for the structured query execution.



Contract/Interface

Core Interface

SparkSession: Entry Point to Spark SQL

SparkSession is the entry point to Spark SQL. It is one of the very first objects you create while developing a Spark SQL application.

As a Spark developer, you create a SparkSession using the SparkSession.builder method (that gives you access to Builder API that you use to configure the session).
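A minimal sketch of building a session (the application name and master are placeholder values):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("spark-sql-demo")   // hypothetical application name
  .master("local[*]")          // run locally with all cores; omit when the cluster manager sets it
  .getOrCreate()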

Dataset: Structured Query with Data Encoder

Dataset is a strongly-typed data structure in Spark SQL that represents a structured query.
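A minimal sketch of a strongly-typed Dataset; the Person case class is made up for illustration, and the encoders come from spark.implicits:

case class Person(name: String, age: Long)

import spark.implicits._
val people = Seq(Person("Ann", 30), Person("Bob", 25)).toDS()
people.filter(_.age > 26).show()   // typed lambda over Person, not untyped Rows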

Catalog: Metastore Management Interface

Catalog is the interface for managing a metastore (aka metadata catalog) of relational entities (e.g. database(s), tables, functions, table columns and temporary views).

Catalog is available using SparkSession.catalog property.
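A minimal sketch of querying the metastore through the Catalog interface, assuming a SparkSession named spark:

spark.catalog.listDatabases().show()
spark.catalog.listTables().show()
spark.catalog.tableExists("some_table")   // hypothetical table name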


Parser Framework

ParserInterface: Base of SQL Parser

ParserInterface is the abstraction of SQL parsers that can convert (parse) the textual representation of SQL statements into DataTypes, StructTypes, Expressions, LogicalPlans, TableIdentifiers and FunctionIdentifiers.

AbstractSqlParser: Base SQL Parsing Infrastructure

AbstractSqlParser is the base of ParserInterfaces that use an AstBuilder to parse SQL statements and convert them to Spark SQL entities, i.e. DataType, StructType, Expression, LogicalPlan and TableIdentifier.
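A minimal sketch of calling the session's parser directly; sessionState is an unstable developer API, and the SQL text is made up:

// Parse SQL text into a LogicalPlan without executing it
val plan = spark.sessionState.sqlParser.parsePlan("SELECT id FROM t WHERE id > 1")
println(plan.numberedTreeString)

// Parse a single expression into an Expression tree
val expr = spark.sessionState.sqlParser.parseExpression("id + 1")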

Catalyst Framework

TreeNode: Node in Catalyst Tree

TreeNode is a recursive data structure that can have one or many children that are again TreeNodes.

Expression: Executable Node in Catalyst Tree

Expression is an executable node (in a Catalyst tree) that can evaluate a result value given input values, i.e. can produce a JVM object per InternalRow.
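A minimal sketch of an Expression tree and its eager evaluation; lit comes from org.apache.spark.sql.functions, and Column.expr exposes the underlying Catalyst Expression:

import org.apache.spark.sql.functions.lit

val e = ((lit(1) + lit(2)) * lit(3)).expr
println(e.numberedTreeString)   // the tree of child expressions (a TreeNode)
println(e.eval())               // a literal-only tree evaluates without input rows: 9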

QueryPlan: Structured Query Plan

QueryPlan is part of Catalyst to build a tree of relational operators of a structured query.

In Scala terms, QueryPlan is an abstract class that is the base class of LogicalPlan and SparkPlan (for logical and physical plans, respectively).

LogicalPlan: Logical Relational Operator with Children and Expressions / Logical Query Plan

LogicalPlan is an extension of the QueryPlan contract for logical operators to build a logical query plan (i.e. a tree of logical operators).

A logical query plan is a tree of nodes of logical operators that in turn can have (trees of) Catalyst expressions. In other words, there are at least two trees at every level (operator).
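A minimal sketch of inspecting the logical trees behind a query, assuming a SparkSession named spark:

spark.range(5).createOrReplaceTempView("t")

val qe = spark.sql("SELECT id * 2 AS doubled FROM t").queryExecution
println(qe.logical.numberedTreeString)         // parsed logical plan
println(qe.optimizedPlan.numberedTreeString)   // after the analyzer and optimizer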

SparkPlan: Physical Operators in Physical Query Plan of Structured Query

SparkPlan is the contract of physical operators to build a physical query plan (aka query execution plan).

SparkPlan contract requires that a concrete physical operator implements doExecute method.
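Continuing the sketch above, QueryExecution also exposes the physical trees; calling execute() on a SparkPlan (which invokes doExecute under the covers) yields an RDD[InternalRow]:

println(qe.sparkPlan.numberedTreeString)     // physical plan selected by the planner
println(qe.executedPlan.numberedTreeString)  // after preparations such as whole-stage codegen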

QueryPlanner: Converting Logical Plan to Physical Trees

QueryPlanner plans a logical plan for execution, i.e. converts a logical plan to one or more physical plans using strategies.

abstract class QueryPlanner[PhysicalPlan <: TreeNode[PhysicalPlan]] {
  // Collects physical plans that still contain placeholders for unplanned logical subtrees
  def collectPlaceholders(plan: PhysicalPlan): Seq[(PhysicalPlan, LogicalPlan)]
  // Removes candidate plans that are known to be inferior to the alternatives
  def prunePlans(plans: Iterator[PhysicalPlan]): Iterator[PhysicalPlan]
  // Planning strategies that map logical operators to physical operators
  def strategies: Seq[GenericStrategy[PhysicalPlan]]
}
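SparkPlanner is the concrete QueryPlanner that Spark SQL uses. A minimal sketch of listing its strategies; sessionState is an unstable developer API:

spark.sessionState.planner.strategies.foreach(s => println(s.getClass.getSimpleName))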
InternalRow: Binary Row Format

InternalRow is also called Catalyst row or Spark SQL row.

UnsafeRow is a concrete InternalRow.
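A minimal sketch of observing InternalRows at the boundary between Spark SQL and RDDs, assuming a SparkSession named spark:

// queryExecution.toRdd exposes the query result as an RDD[InternalRow]
val rows = spark.range(3).queryExecution.toRdd.collect()
rows.foreach(r => println(r.getLong(0)))   // read column 0 as a Long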

Attribute: Base of Leaf Named Expressions

Attribute is the base of leaf named expressions.
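A minimal sketch: the output of an analyzed plan is a Seq[Attribute], assuming a SparkSession named spark:

val attrs = spark.range(1).queryExecution.analyzed.output
attrs.foreach(a => println(s"${a.name}: ${a.dataType} (exprId=${a.exprId})"))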


CodegenSupport: Physical Operators with Java Code Generation

CodegenSupport is the contract of physical operators that want to support Java code generation and participate in the Whole-Stage Java Code Generation (Whole-Stage CodeGen).

trait CodegenSupport extends SparkPlan {
  // only required properties (vals and methods) that have no implementation
  // the others follow
  def doProduce(ctx: CodegenContext): String
  def inputRDDs(): Seq[RDD[InternalRow]]

  // ...except the following that throws an UnsupportedOperationException by default
  def doConsume(ctx: CodegenContext, input: Seq[ExprCode], row: ExprCode): String
}
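A minimal sketch of dumping the Java source that whole-stage codegen produces for a query; the debug package is a developer API, and a SparkSession named spark is assumed:

import org.apache.spark.sql.execution.debug._

// Prints each whole-stage-codegen subtree together with its generated Java code
spark.range(10).selectExpr("id + 1").debugCodegen()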

BaseRelation: Collection of Tuples with Schema

BaseRelation is the contract of relations (aka collections of tuples) with a known schema.

"Data source", "relation" and "table" are often used as synonyms.

abstract class BaseRelation {
  // only required properties (vals and methods) that have no implementation
  // the others follow
  def schema: StructType
  def sqlContext: SQLContext
}
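A minimal, hypothetical sketch of a custom relation; the class name and schema are made up, and a practical data source would also mix in a scan trait such as TableScan:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, TableScan}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

class OneColumnRelation(override val sqlContext: SQLContext) extends BaseRelation with TableScan {
  // the known schema: a single non-nullable long column
  override def schema: StructType = StructType(Seq(StructField("id", LongType, nullable = false)))
  // TableScan adds the ability to produce the tuples themselves
  override def buildScan(): RDD[Row] = sqlContext.sparkContext.range(0, 3).map(Row(_))
}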

Physical Operator

Exchange: Base for Unary Physical Operators that Exchange Data

Exchange is the base of unary physical operators that exchange data among multiple threads or processes.
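A minimal sketch of a query whose physical plan contains an Exchange; the aggregation forces a shuffle, assuming a SparkSession named spark:

// groupBy must co-locate equal keys, so the plan shows an
// Exchange hashpartitioning between the partial and final aggregates
spark.range(100).selectExpr("id % 10 AS bucket")
  .groupBy("bucket").count()
  .explain()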



Core Diagram


Framework UML Diagram


High Level Interface

«BASIC» TreeNode
  // TreeNode is a recursive data structure that can have one or many children that are again TreeNodes.
  children: Seq[BaseType]
  verboseString: String

Expression (extends TreeNode)
  // only required methods that have no implementation
  dataType: DataType
  nullable: Boolean
  doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode
  eval(input: InternalRow = EmptyRow): Any

QueryPlan (extends TreeNode)
  output: Seq[Attribute]
  validConstraints: Set[Expression]

LogicalPlan (extends QueryPlan)

LeafExpression (extends Expression)
  children: Seq[Expression] = Nil

NamedExpression (extends Expression)
  foldable: Boolean = false
  name: String
  exprId: ExprId
  qualifier: Seq[String]
  toAttribute: Attribute
  metadata: Metadata = Metadata.empty
  typeSuffix: String
  qualifiedName: String = (qualifier :+ name).mkString(".")
  newInstance(): NamedExpression

Attribute (extends LeafExpression with NamedExpression)
  toAttribute: Attribute = this
  references: AttributeSet = AttributeSet(this)
  withNullability(newNullability: Boolean): Attribute
  withQualifier(newQualifier: Seq[String]): Attribute
  withName(newName: String): Attribute
  withMetadata(newMetadata: Metadata): Attribute
  withExprId(newExprId: ExprId): Attribute
  newInstance(): Attribute

Credit

All rights reserved to jaceklaskowski.