Apache Kylin : Analytical Data Warehouse for Big Data
Background
Because Kylin 4 depends heavily on Spark SQL, it is worth building a deeper understanding of Spark SQL.
Definition
Catalyst is an execution-agnostic framework to represent and manipulate a dataflow graph, i.e. trees of relational operators and expressions.
The main abstraction in Catalyst is TreeNode, which is used to build trees of Expressions or QueryPlans.
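To make the tree idea concrete, here is a minimal sketch of a TreeNode-like structure with a bottom-up rewrite. It is a toy model, not Spark's code: the `Expr`, `Literal`, `Column` and `Add` types and the `foldConstants` rule are hypothetical, and only the method names `transformUp` and `withNewChildren` are borrowed from Catalyst, in simplified form.

```scala
// Toy model of a Catalyst-style tree; an assumption for illustration, not Spark's TreeNode.
object TreeNodeSketch {
  // A node knows its children and how to rebuild itself with new children.
  sealed trait Expr {
    def children: Seq[Expr]
    def withNewChildren(newChildren: Seq[Expr]): Expr

    // Bottom-up rewrite: transform the children first, then apply the rule to this node.
    def transformUp(rule: PartialFunction[Expr, Expr]): Expr = {
      val rebuilt = withNewChildren(children.map(_.transformUp(rule)))
      rule.applyOrElse(rebuilt, (e: Expr) => e)
    }
  }

  // Leaf nodes: no children, so rebuilding returns the node itself.
  case class Literal(value: Int) extends Expr {
    def children: Seq[Expr] = Nil
    def withNewChildren(newChildren: Seq[Expr]): Expr = this
  }

  case class Column(name: String) extends Expr {
    def children: Seq[Expr] = Nil
    def withNewChildren(newChildren: Seq[Expr]): Expr = this
  }

  // Binary node with exactly two children.
  case class Add(left: Expr, right: Expr) extends Expr {
    def children: Seq[Expr] = Seq(left, right)
    def withNewChildren(newChildren: Seq[Expr]): Expr =
      Add(newChildren(0), newChildren(1))
  }

  // Hypothetical rule in the spirit of constant folding: add two literals eagerly.
  val foldConstants: PartialFunction[Expr, Expr] = {
    case Add(Literal(a), Literal(b)) => Literal(a + b)
  }

  def main(args: Array[String]): Unit = {
    // x + (1 + 2): only the inner Add matches the rule.
    val tree = Add(Column("x"), Add(Literal(1), Literal(2)))
    println(tree.transformUp(foldConstants))
  }
}
```

Because the rule is a partial function, nodes it does not match are left untouched. This lets a rewrite be written as a small pattern match rather than a full tree traversal.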
Core Components
Name | Target |
---|---|
SQL Parser Framework | SQL Parser Framework in Spark SQL uses ANTLR to translate a SQL text to a data type, Expression, TableIdentifier or LogicalPlan. |
Catalyst Framework | Catalyst is an execution-agnostic framework to represent and manipulate a dataflow graph, i.e. trees of relational operators and expressions. |
Tungsten Execution Backend | The goal of Project Tungsten is to improve Spark execution by optimizing Spark jobs for CPU and memory efficiency (as opposed to network and disk I/O which are considered fast enough). Tungsten focuses on the hardware architecture of the platform Spark runs on, including but not limited to JVM, LLVM, GPU, NVRAM, etc. It does so by offering the following optimization features:
|
...
Core Contract
Name | Contract | Comment |
---|---|---|
SparkSession | Entry Point to Spark SQL |
As a Spark developer, you create a |
Dataset | Structured Query with Data Encoder | Dataset is a strongly-typed data structure in Spark SQL that represents a structured query. |
Catalyst | Tree Manipulation Framework | Catalyst is an execution-agnostic framework to represent and manipulate a dataflow graph, i.e. trees of relational operators and expressions. |
TreeNode | Node in Catalyst Tree |
|
Expression | Executable Node in Catalyst Tree |
|
QueryPlan | Structured Query Plan |
Scala-specific, |
Catalog | Metastore Management Interface |
|
Attribute | Base of leaf named expressions |
|
Physical Operator
Name | Contract | Comment |
---|---|---|
Exchange | Base for Unary Physical Operators that Exchange Data |
|
...
Core Diagram
...
Framework UML Diagram
High level Interface
PlantUML |
---|
abstract class TreeNode << BASIC >> {
// TreeNode is a recursive data structure that can have one or many children that are again TreeNodes.
-children : Seq[BaseType]
-verboseString: String
}
abstract class Expression {
// only required methods that have no implementation
+ dataType: DataType
+ doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode
+ eval(input: InternalRow = EmptyRow): Any
+ nullable: Boolean
}
abstract class QueryPlan {
def output: Seq[Attribute]
def validConstraints: Set[Expression]
}
abstract class LeafExpression {
+ children: Seq[Expression] = Nil
}
abstract class NamedExpression {
// We should never fold named expressions in order to not remove the alias.
+ foldable: Boolean = false
+ name: String
+ exprId: ExprId
// Returns a dot separated fully qualified name for this attribute.
+ qualifiedName: String = (qualifier :+ name).mkString(".")
// Optional qualifier: an empty Seq, a Seq with the table name or table alias, or a Seq with database name and table name.
+ qualifier: Seq[String]
+ toAttribute: Attribute
// Returns the metadata when an expression is a reference to another expression with metadata.
+ metadata: Metadata = Metadata.empty
// Returns a copy of this expression with a new exprId.
+ newInstance(): NamedExpression
+ typeSuffix = ..
}
abstract class Attribute {
+ references: AttributeSet = AttributeSet(this)
+ withNullability(newNullability: Boolean): Attribute
+ withQualifier(newQualifier: Seq[String]): Attribute
+ withName(newName: String): Attribute
+ withMetadata(newMetadata: Metadata): Attribute
+ withExprId(newExprId: ExprId): Attribute
+ toAttribute: Attribute = this
+ newInstance(): Attribute
}
' Layer 1
TreeNode <|-- Expression
TreeNode <|-- QueryPlan
Expression <|-- NamedExpression
Expression <|-- LeafExpression
LeafExpression <|-- Attribute
NamedExpression <|-- Attribute
|
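The Expression, LeafExpression, NamedExpression and Attribute layering in the diagram can be sketched in plain Scala as follows. This is a simplified illustration under stated assumptions: `SimpleAttribute` is a hypothetical class, `exprId` is reduced to a `Long`, and most members from the diagram are omitted. LeafExpression and NamedExpression are modeled as traits here, since a Scala class can extend only one superclass but can mix in several traits.

```scala
// Toy mirror of the diagram's hierarchy; an assumption for illustration, not Spark's classes.
object AttributeSketch {
  abstract class Expression {
    def children: Seq[Expression]
    def nullable: Boolean
  }

  // Leaf expressions have no children.
  trait LeafExpression extends Expression {
    final def children: Seq[Expression] = Nil
  }

  // Named expressions carry a name, an id and an optional qualifier.
  trait NamedExpression extends Expression {
    def name: String
    def exprId: Long // simplified stand-in for ExprId
    def qualifier: Seq[String]
    // Dot-separated fully qualified name, e.g. "db.table.column".
    def qualifiedName: String = (qualifier :+ name).mkString(".")
  }

  // Hypothetical attribute: both a leaf and a named expression.
  // The with* methods return modified copies instead of mutating.
  case class SimpleAttribute(
      name: String,
      nullable: Boolean = true,
      exprId: Long = 0L,
      qualifier: Seq[String] = Nil)
    extends Expression with LeafExpression with NamedExpression {

    def withQualifier(newQualifier: Seq[String]): SimpleAttribute =
      copy(qualifier = newQualifier)
    def withName(newName: String): SimpleAttribute =
      copy(name = newName)
    def withNullability(newNullability: Boolean): SimpleAttribute =
      copy(nullable = newNullability)
  }

  def main(args: Array[String]): Unit = {
    val price = SimpleAttribute("price").withQualifier(Seq("db", "sales"))
    println(price.qualifiedName)
  }
}
```

Returning copies from the `with*` methods keeps the nodes immutable, so a tree that shares an attribute is not affected when a rewrite changes its qualifier or name.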
Credit
All rights reserved to jaceklaskowski.
...