Apache Kylin : Analytical Data Warehouse for Big Data
1. Background: Why Kylin on Parquet
- HBase is not a columnar store and has no secondary indexes;
- HBase is a poor fit for cloud deployment and auto-scaling;
- Parquet is an open-source columnar file format;
- Cloud-native: works with most file systems, including HDFS, S3, Blob storage, OSS, etc.;
- Integrates well with Hadoop, Hive, Spark, Impala, and others;
- Supports custom indexes;
- Mature and stable.
2. Parquet file layouts on HDFS
Storage layout is important for I/O optimization
Prune as much as possible before reading any file
- Filter by folder, file name, etc.
Each cuboid uses a dedicated folder
Cube
- Segment A
  - Cuboid-1111
    - part-0000-XXX.snappy.parquet
    - part-0001-XXX.snappy.parquet
  - Cuboid-1001
    - part-0000-XXX.snappy.parquet
    - part-0001-XXX.snappy.parquet
- Segment B
  - Cuboid-1111
    - part-0000-XXX.snappy.parquet
    - ...
Advantages
- Filtering by folder is sufficient for pruning
- Cuboids can be added or removed dynamically without affecting the others
Disadvantage
- A cube with many cuboids produces many folders
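The folder-based pruning described in this section can be illustrated with a small sketch. It is not Kylin's code: the helper names (`cuboid_dir`, `candidate_files`, `build_demo_layout`) are hypothetical, and the demo uses empty placeholder files on the local file system rather than HDFS. It only shows that choosing the segment and cuboid folders up front means files of other cuboids are never listed or opened.

```python
import tempfile
from pathlib import Path
from typing import List


def cuboid_dir(cube_root: Path, segment: str, cuboid_id: str) -> Path:
    """Return the dedicated folder of one cuboid inside one segment."""
    return cube_root / segment / f"Cuboid-{cuboid_id}"


def candidate_files(cube_root: Path, segments: List[str], cuboid_id: str) -> List[Path]:
    """List only the Parquet files under the chosen segment/cuboid folders."""
    files: List[Path] = []
    for segment in segments:
        folder = cuboid_dir(cube_root, segment, cuboid_id)
        if folder.is_dir():  # a segment may not contain this cuboid
            files.extend(sorted(folder.glob("*.parquet")))
    return files


def build_demo_layout(cube_root: Path) -> None:
    """Create empty placeholder files mirroring the layout shown above."""
    layout = {
        "Segment A": {"1111": 2, "1001": 2},
        "Segment B": {"1111": 1},
    }
    for segment, cuboids in layout.items():
        for cuboid_id, parts in cuboids.items():
            folder = cuboid_dir(cube_root, segment, cuboid_id)
            folder.mkdir(parents=True)
            for i in range(parts):
                (folder / f"part-{i:04d}-XXX.snappy.parquet").touch()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "Cube"
        build_demo_layout(root)
        for path in candidate_files(root, ["Segment A"], "1001"):
            print(path.relative_to(root))
```

A query that is answered by Cuboid-1001 in Segment A touches only the two part files in that folder, regardless of how many other cuboids the cube contains.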
3. Dimension/measure layouts in Parquet
- Layout of dimensions and measures in Parquet files
Given dimensions d1, d2, d3 and measures m1, m2, a Parquet file like the following is generated.
Columns 1, 2, and 3 correspond to dimensions d1, d2, and d3, respectively
Columns 110000 and 110001 correspond to measures m1 and m2, respectively
Parquet file schema:
1: OPTIONAL INT64 R:0 D:1
2: REQUIRED INT64 R:0 D:0
3: OPTIONAL INT64 R:0 D:1
110000: OPTIONAL INT64 R:0 D:1
110001: OPTIONAL INT64 R:0 D:1
How dimensions and measures are ordered
- In a Parquet file, dimension columns always come first and measure columns last
- There is no required order among the dimensions themselves or among the measures themselves
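The column naming and ordering rule can be sketched as follows. The starting values (dimensions from 1, measures from 110000) are read off the example schema above; whether Kylin always assigns IDs this way is an assumption, and `column_ids` and `dimensions_first` are hypothetical helpers, not Kylin functions.

```python
from typing import List

# Assumption: both starting IDs are taken from the example schema above.
DIMENSION_ID_START = 1
MEASURE_ID_START = 110000


def column_ids(num_dimensions: int, num_measures: int) -> List[str]:
    """Return Parquet column names: all dimensions first, then all measures."""
    dims = [str(DIMENSION_ID_START + i) for i in range(num_dimensions)]
    measures = [str(MEASURE_ID_START + i) for i in range(num_measures)]
    return dims + measures


def dimensions_first(columns: List[str]) -> bool:
    """Check that no dimension column appears after a measure column."""
    seen_measure = False
    for name in columns:
        is_measure = int(name) >= MEASURE_ID_START
        if seen_measure and not is_measure:
            return False
        seen_measure = seen_measure or is_measure
    return True


if __name__ == "__main__":
    print(column_ids(3, 2))
```

For three dimensions and two measures this yields the five column names of the example schema, in the same order.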
Parquet file split
- parquet.block.size defaults to 128 MB
4. Data types mapping in Parquet
- How is data encoded into Parquet?
- Kylin no longer needs to encode columns
- Parquet encodes the columns that need it
- All data types can be accurately mapped to Parquet
- Complex types (StructType, ArrayType, MapType) are supported through ParquetWriteSupport
- Other types use a direct mapping transformation
Type | Spark | Parquet |
---|---|---|
Numeric types | ByteType | INT32 |
Numeric types | ShortType | INT32 |
Numeric types | IntegerType | INT32 |
Numeric types | LongType | INT64 |
Numeric types | FloatType | FLOAT |
Numeric types | DoubleType | DOUBLE |
Numeric types | DecimalType | INT32, INT64, BINARY, FIXED_LEN_BYTE_ARRAY |
String type | StringType | BINARY |
Binary type | BinaryType | BINARY |
Boolean type | BooleanType | BOOLEAN |
Datetime type | TimestampType | INT96 |
Datetime type | DateType | INT32 |
- How computed columns are stored
- Bitmap: BINARY
- TopN: BINARY
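The mapping table and the two computed-column entries above can be written as a plain lookup. This is only a restatement of the table in code form; the dictionary and function names are hypothetical, and for DecimalType the sketch returns every physical type the table lists rather than guessing which one applies to a given precision.

```python
from typing import Dict, Tuple

# Spark SQL type name -> Parquet physical type(s), as listed in the table above.
SPARK_TO_PARQUET: Dict[str, Tuple[str, ...]] = {
    "ByteType": ("INT32",),
    "ShortType": ("INT32",),
    "IntegerType": ("INT32",),
    "LongType": ("INT64",),
    "FloatType": ("FLOAT",),
    "DoubleType": ("DOUBLE",),
    "DecimalType": ("INT32", "INT64", "BINARY", "FIXED_LEN_BYTE_ARRAY"),
    "StringType": ("BINARY",),
    "BinaryType": ("BINARY",),
    "BooleanType": ("BOOLEAN",),
    "TimestampType": ("INT96",),
    "DateType": ("INT32",),
}

# Computed columns listed above are stored as opaque binary values.
COMPUTED_TO_PARQUET: Dict[str, str] = {"Bitmap": "BINARY", "TopN": "BINARY"}


def parquet_types(spark_type: str) -> Tuple[str, ...]:
    """Return the Parquet physical type(s) the table lists for a Spark type."""
    try:
        return SPARK_TO_PARQUET[spark_type]
    except KeyError:
        raise ValueError(f"no mapping listed for {spark_type}") from None


if __name__ == "__main__":
    print(parquet_types("LongType"))
```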
5. How to build Cube into Parquet
- Reduced build steps
- From ten to twenty steps down to two steps
- Build Engine
- Simple and clear architecture
- Spark is the only build engine
- All builds are done via Spark
- Spark parameters are adjusted adaptively
- Dimension dictionaries are no longer needed
- Supported measures
- Sum
- Count
- Min
- Max
- TopN
- Bitmap
- Hllc
- Cube into parquet
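To make the idea of pre-aggregating a cuboid concrete, here is a minimal sketch of four of the supported measures (Sum, Count, Min, Max) computed for one combination of dimensions. The actual build runs on Spark; this plain-Python version, including the `build_cuboid` name and the row format, is purely illustrative and assumed.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Row = Dict[str, int]


def build_cuboid(
    rows: Iterable[Row], dimensions: List[str], measure: str
) -> Dict[Tuple[int, ...], Dict[str, int]]:
    """Group rows by the given dimensions and pre-aggregate one measure column."""
    groups: Dict[Tuple[int, ...], List[int]] = defaultdict(list)
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        groups[key].append(row[measure])
    return {
        key: {
            "sum": sum(values),
            "count": len(values),
            "min": min(values),
            "max": max(values),
        }
        for key, values in groups.items()
    }


if __name__ == "__main__":
    source_rows = [
        {"d1": 1, "d2": 10, "m1": 5},
        {"d1": 1, "d2": 20, "m1": 7},
        {"d1": 2, "d2": 10, "m1": 3},
    ]
    # The cuboid over (d1) has one aggregated row per distinct d1 value.
    print(build_cuboid(source_rows, ["d1"], "m1"))
```

Each distinct combination of dimensions passed to `build_cuboid` corresponds to one cuboid; a query grouped by those dimensions can then read the aggregated rows instead of the source rows.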
6. How to query with Parquet
- Query Engine: Sparder
- Uses Spark as the computation engine
- Distributed query engine, which avoids putting all the load on a single node
- Unified computation engine for building and querying
- Query speed improves substantially
- Can integrate more Spark features and ecosystem components
- Basic process of Sparder query
- Parser => Convert SQL to an AST
- Validation => Verify the validity of the SQL against the metadata
- Optimizer => Generate a logical plan according to optimization rules
- Kylin's adaptation => Convert AST nodes to rel nodes (various classes ending with Rel, such as FilterRel)
- Spark plan => Convert rel nodes to a Spark plan
- Query execution => Read cube data based on the generated Spark plan
- What optimizations does Kylin apply when reading Parquet cube data?
- Segment pruning
- Shardby
- Parquet page index
- Projection pushdown
- Predicate pushdown
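Two of the optimizations listed above can be sketched in a few lines: segment pruning, and skipping data by min/max statistics, which is the general idea behind predicate pushdown and the Parquet page index. This is a conceptual illustration, not Kylin's or Spark's implementation. The `Segment` and `ChunkStats` types are hypothetical, and treating a segment as a half-open time range is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Segment:
    name: str
    start: int  # inclusive lower bound of the segment's time range (assumed)
    end: int    # exclusive upper bound (assumed)


def prune_segments(segments: List[Segment], query_start: int, query_end: int) -> List[Segment]:
    """Keep only segments whose range overlaps [query_start, query_end)."""
    return [s for s in segments if s.start < query_end and query_start < s.end]


@dataclass(frozen=True)
class ChunkStats:
    """Min/max statistics of one column within a chunk of data."""
    min_value: int
    max_value: int


def may_contain(stats: ChunkStats, value: int) -> bool:
    """Return False only when the chunk provably holds no row equal to value."""
    return stats.min_value <= value <= stats.max_value


if __name__ == "__main__":
    segments = [Segment("Segment A", 0, 100), Segment("Segment B", 100, 200)]
    print([s.name for s in prune_segments(segments, 150, 180)])
    print(may_contain(ChunkStats(10, 20), 25))
```

Both checks are conservative: a segment or chunk is skipped only when it cannot contain matching rows, so pruning never changes the query result.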
7. Performance
Build
- TPC-H is used as the test dataset
- The detailed data is as follows
Query
- TPC-H 50: average time 3.93 s
- TPC-H 100: average time 5.95 s
- TPC-H 500: average time 11.33 s
- TPC-H 1000: average time 18.92 s
8. Next step