
Apache Kylin : Analytical Data Warehouse for Big Data



1. Background: Why Kylin on Parquet

  • HBase is not columnar storage and has no secondary index
  • HBase is not well suited to cloud deployment and auto-scaling

  • Parquet is an open-source columnar file format
  • Cloud-native: works with most file systems, including HDFS, S3, Blob store, OSS, etc.
  • Integrates well with Hadoop, Hive, Spark, Impala, and others
  • Supports custom indexes
  • Mature and stable

2. Parquet file layouts on HDFS

  • Storage layout is important for I/O optimizations

  • Do as much pruning as possible before reading the files


    • Filter by folder, file name, etc
  • Each Cuboid uses a dedicated folder

  • Cube

    • Segment A
      • Cuboid-1111
        • part-0000-XXX.snappy.parquet
        • part-0001-XXX.snappy.parquet
      • Cuboid-1001
        • part-0000-XXX.snappy.parquet
        • part-0001-XXX.snappy.parquet
    • Segment B
      • Cuboid-1111
        • part-0000-XXX.snappy.parquet
        • ...
  • Advantages

    • Filtering by folder is good enough (see the read sketch after this list)
    • Cuboids can be added or removed dynamically without affecting the others
  • Disadvantage

    • Many folders when the cube has many cuboids
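
A minimal Spark read sketch of this folder-level pruning, assuming the layout above; the cube directory, segment, and cuboid id are placeholder values, not paths taken from Kylin itself:

    import org.apache.spark.sql.SparkSession

    // Hypothetical layout following the tree above:
    //   <cubeDir>/<segment>/<cuboidId>/part-*.snappy.parquet
    val spark = SparkSession.builder().appName("cuboid-folder-pruning-sketch").getOrCreate()
    val cubeDir  = "hdfs:///kylin/cube_sales"   // placeholder path
    val segment  = "SegmentA"
    val cuboidId = "1111"

    // "Filter by folder": only the chosen cuboid's files are listed and scanned;
    // the other cuboid folders are never touched.
    val cuboidDf = spark.read.parquet(s"$cubeDir/$segment/$cuboidId")
    cuboidDf.printSchema()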

3. Dimension/measure layouts in Parquet

  • Dimension and measure layout in Parquet files
    If a cuboid has dimensions d1, d2, d3 and measures m1, m2, a Parquet file like the following is generated.
    Columns 1, 2, and 3 correspond to dimensions d1, d2, and d3, respectively.
    Columns 110000 and 110001 correspond to measures m1 and m2, respectively.
Parquet file schema:
    1:           OPTIONAL INT64 R:0 D:1
    2:           REQUIRED INT64 R:0 D:0
    3:           OPTIONAL INT64 R:0 D:1
    110000:      OPTIONAL INT64 R:0 D:1
    110001:      OPTIONAL INT64 R:0 D:1
  • How the order of dimensions and measures is handled

    • In a Parquet file, the columns are always ordered with dimensions first and measures last (see the write sketch after this list)
    • There is no particular order among the dimensions or among the measures
  • Parquet file split

    • parquet.block.size (the Parquet row-group size) defaults to 128 MB
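
A minimal write sketch of this layout, assuming Spark; the column ids, data values, and output path are illustrative only, not taken from Kylin's implementation:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder().appName("cuboid-layout-sketch").getOrCreate()
    import spark.implicits._

    // Flat data: three dimensions (d1, d2, d3) and two measures (m1, m2).
    val flat = Seq((1L, 10L, 100L, 5L, 7L), (2L, 20L, 200L, 3L, 9L))
      .toDF("d1", "d2", "d3", "m1", "m2")

    // Rename to the column-id convention above: dimensions first ("1", "2", "3"),
    // measures last ("110000", "110001").
    val cuboid = flat.select(
      col("d1").as("1"), col("d2").as("2"), col("d3").as("3"),
      col("m1").as("110000"), col("m2").as("110001"))

    // parquet.block.size controls the Parquet row-group size (128 MB by default).
    spark.sparkContext.hadoopConfiguration.setInt("parquet.block.size", 128 * 1024 * 1024)
    cuboid.write
      .option("compression", "snappy")
      .parquet("/tmp/cube/SegmentA/1111")   // placeholder output path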

4. Data types mapping in Parquet

  • How is the data encoded into Parquet?
    • Kylin no longer needs to encode columns itself
    • Parquet encodes the columns as needed
  • All data types can be accurately mapped to Parquet
    • Supported via ParquetWriteSupport
      • StructType, ArrayType, MapType
    • Direct mapping (see the table below and the sketch at the end of this section)
Type          | Spark SQL type | Parquet type
--------------|----------------|--------------------------------------------
Numeric types | ByteType       | INT32
Numeric types | ShortType      | INT32
Numeric types | IntegerType    | INT32
Numeric types | LongType       | INT64
Numeric types | FloatType      | FLOAT
Numeric types | DoubleType     | DOUBLE
Numeric types | DecimalType    | INT32, INT64, BINARY, FIXED_LEN_BYTE_ARRAY
String type   | StringType     | BINARY
Binary type   | BinaryType     | BINARY
Boolean type  | BooleanType    | BOOLEAN
Datetime type | TimestampType  | INT96
Datetime type | DateType       | INT32
  • How computed columns are stored
    • Bitmap: Binary
    • TopN: Binary
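
A minimal write/read sketch of the direct mapping, assuming Spark; the output path is a placeholder, and the exact physical types depend on the Spark/Parquet versions and writer configuration:

    import java.sql.{Date, Timestamp}
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("type-mapping-sketch").getOrCreate()
    import spark.implicits._

    // One row covering several Spark SQL types from the table above.
    val df = Seq(
      (1, 1L, 1.5f, 2.5d, "abc", true,
       Timestamp.valueOf("2020-01-01 00:00:00"), Date.valueOf("2020-01-01"))
    ).toDF("i", "l", "f", "d", "s", "b", "ts", "dt")

    // Spark's Parquet writer maps these to INT32, INT64, FLOAT, DOUBLE,
    // BINARY (UTF8), BOOLEAN, INT96 (default for timestamps) and INT32 (DATE).
    df.write.mode("overwrite").parquet("/tmp/type_mapping_sketch")   // placeholder path

    // Reading back recovers the logical Spark types from the Parquet schema.
    spark.read.parquet("/tmp/type_mapping_sketch").printSchema()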

5. How to build Cube into Parquet

  • Reduced build steps
    • From ten to twenty steps down to just two steps
  • Build Engine
    • Simple and clear architecture
    • Spark as the only build engine
    • All builds are done via Spark
    • Spark parameters are adjusted adaptively
    • Dimension dictionaries are no longer needed
  • Supported measures
    • Sum
    • Count
    • Min
    • Max
    • TopN
    • Bitmap
    • Hllc
  • Cube data is written into Parquet (see the build sketch below)
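
A minimal two-step build sketch, assuming Spark; the table names, dimension subsets, cuboid ids, and output path are placeholders rather than Kylin's actual metadata:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().appName("cube-build-sketch").getOrCreate()

    // Step 1: materialize the flat table by joining the fact and dimension sources.
    val flatTable = spark.table("fact_sales").join(spark.table("dim_product"), "product_id")

    // Step 2: aggregate the flat table once per cuboid and write each cuboid
    // into its own folder, following the layout from section 2.
    val outputDir = "hdfs:///kylin/cube_sales/SegmentA"   // placeholder path
    val cuboids = Seq(
      "1111" -> Seq("d1", "d2", "d3"),   // illustrative cuboid id and dimension subset
      "1001" -> Seq("d1")
    )
    cuboids.foreach { case (cuboidId, dims) =>
      flatTable
        .groupBy(dims.map(col): _*)
        .agg(sum("m1").as("m1_sum"), count(lit(1)).as("row_count"))
        .write.mode("overwrite")
        .parquet(s"$outputDir/$cuboidId")
    }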

6. How to query with Parquet

  • Query Engine: Sparder
    • Uses Spark as the computation engine
    • A distributed query engine that avoids single-node pressure
    • A unified computation engine for both building and querying
    • Query speed increases substantially
    • Can integrate more Spark features and ecosystem components

  

  • Basic process of a Sparder query (a plain-Spark illustration of the last two steps follows this list)
    1. Parser => SQL text to an AST
    2. Validation => Further verify the validity of the SQL based on metadata
    3. Optimizer => Generate the logical plan according to optimization rules
    4. Kylin's Adaptation => Convert the AST nodes to rel nodes (various classes ending with Rel, such as FilterRel)
    5. Spark Plan => Convert the rel nodes to a Spark plan
    6. Query Execution => Read cube data based on the generated Spark plan
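
For the Spark side of this pipeline (steps 5 and 6), plain Spark SQL can illustrate how a logical plan becomes the physical plan that reads the cuboid files; this is only an analogy using a hypothetical cuboid path, not Kylin's actual Calcite-to-Spark translation:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("plan-stages-sketch").getOrCreate()

    // Register a cuboid file as a temp view so SQL can run against it.
    spark.read.parquet("hdfs:///kylin/cube_sales/SegmentA/1111")   // placeholder path
      .createOrReplaceTempView("cuboid_1111")

    // explain(true) prints the parsed, analyzed and optimized logical plans plus
    // the physical Spark plan that is finally executed against the Parquet files.
    spark.sql("SELECT `1`, SUM(`110000`) FROM cuboid_1111 GROUP BY `1`").explain(true)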

  

  • What optimizations does Kylin apply when reading Parquet cube data? (see the pushdown sketch after this list)
    • Segment pruning
    • Shard-by column
    • Parquet page index
    • Project pushdown
    • Predicate pushdown
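
A minimal pushdown sketch, assuming Spark reading a cuboid laid out as in section 3; the path and predicate are placeholders, and the column-index setting is the standard parquet-mr configuration key rather than anything Kylin-specific:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("pushdown-sketch").getOrCreate()

    // Enable Parquet column/page index filtering (parquet-mr 1.11+ configuration key).
    spark.sparkContext.hadoopConfiguration
      .setBoolean("parquet.filter.columnindex.enabled", true)

    val cuboid = spark.read.parquet("hdfs:///kylin/cube_sales/SegmentA/1111")   // placeholder path

    // Selecting only two columns exercises project pushdown (column pruning);
    // the filter on a dimension column exercises predicate pushdown, so row groups
    // and pages whose statistics rule out `1` = 42 are skipped entirely.
    val pruned = cuboid.select("`1`", "`110000`").where("`1` = 42")
    pruned.explain()   // PushedFilters should list the predicate on column "1"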
        

7. Performance

  • Build

    • Use TPC-H as the dataset for the build test
    • The detailed data is as follows
  • Query

    • TPC-H 50, average time 3.93 s
    • TPC-H 100, average time 5.95 s
    • TPC-H 500, average time 11.33 s
    • TPC-H 1000, average time 18.92 s

8. Next step
