
Apache Kylin : Analytical Data Warehouse for Big Data



1. Background: Why Kylin on Parquet

  • HBase is not columnar storage and has no secondary index
  • HBase is not well suited to cloud deployment and auto-scaling

  • Parquet is an open-source columnar file format
  • Cloud-native: works with most file systems, including HDFS, S3, Blob store, OSS, etc.
  • Integrates well with Hadoop, Hive, Spark, Impala, and others
  • Supports custom indexes
  • Mature and stable

2. Parquet file layouts on HDFS

  • Storage layout is important for I/O optimizations

  • Do as much pruning as possible before reading the files


    • Filter by folder, file name, etc
  • Each Cuboid uses a dedicated folder

  • Cube

    • Segment A
      • Cuboid-1111
        • part-0000-XXX.snappy.parquet
        • part-0001-XXX.snappy.parquet
      • Cuboid-1001
        • part-0000-XXX.snappy.parquet
        • part-0001-XXX.snappy.parquet
    • Segment B
      • Cuboid-1111
        • part-0000-XXX.snappy.parquet
        • ...
  • Advantages

    • Filtering by folder is good enough (see the read sketch after this list)
    • Cuboids can be added or removed dynamically without affecting the others
  • Disadvantage

    • Many folders when the cube has many cuboids
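
A minimal Spark read sketch of this folder-level pruning, assuming the layout above; the cube directory, segment, and cuboid id are placeholder values, not paths taken from Kylin itself:

    import org.apache.spark.sql.SparkSession

    // Hypothetical layout following the tree above:
    //   <cubeDir>/<segment>/<cuboidId>/part-*.snappy.parquet
    val spark = SparkSession.builder().appName("cuboid-folder-pruning-sketch").getOrCreate()
    val cubeDir  = "hdfs:///kylin/cube_sales"   // placeholder path
    val segment  = "SegmentA"
    val cuboidId = "1111"

    // "Filter by folder": only the chosen cuboid's files are listed and scanned;
    // the other cuboid folders are never touched.
    val cuboidDf = spark.read.parquet(s"$cubeDir/$segment/$cuboidId")
    cuboidDf.printSchema()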

3. Dimension/measure layouts in Parquet

  • Dimension and measure layout in Parquet files
    If a cuboid has dimensions d1, d2, d3 and measures m1, m2, a Parquet file like the following is generated.
    Columns 1, 2, and 3 correspond to dimensions d1, d2, and d3, respectively.
    Columns 110000 and 110001 correspond to measures m1 and m2, respectively.
Parquet file schema:
    1:           OPTIONAL INT64 R:0 D:1
    2:           REQUIRED INT64 R:0 D:0
    3:           OPTIONAL INT64 R:0 D:1
    110000:      OPTIONAL INT64 R:0 D:1
    110001:      OPTIONAL INT64 R:0 D:1
  • How the order of dimensions and measures is handled

    • In a Parquet file, the columns are always ordered with dimensions first and measures last (see the write sketch after this list)
    • There is no particular order among the dimensions or among the measures
  • Parquet file split

    • parquet.block.size (the Parquet row-group size) defaults to 128 MB
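
A minimal write sketch of this layout, assuming Spark; the column ids, data values, and output path are illustrative only, not taken from Kylin's implementation:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder().appName("cuboid-layout-sketch").getOrCreate()
    import spark.implicits._

    // Flat data: three dimensions (d1, d2, d3) and two measures (m1, m2).
    val flat = Seq((1L, 10L, 100L, 5L, 7L), (2L, 20L, 200L, 3L, 9L))
      .toDF("d1", "d2", "d3", "m1", "m2")

    // Rename to the column-id convention above: dimensions first ("1", "2", "3"),
    // measures last ("110000", "110001").
    val cuboid = flat.select(
      col("d1").as("1"), col("d2").as("2"), col("d3").as("3"),
      col("m1").as("110000"), col("m2").as("110001"))

    // parquet.block.size controls the Parquet row-group size (128 MB by default).
    spark.sparkContext.hadoopConfiguration.setInt("parquet.block.size", 128 * 1024 * 1024)
    cuboid.write
      .option("compression", "snappy")
      .parquet("/tmp/cube/SegmentA/1111")   // placeholder output path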

4. Data types mapping in Parquet

  • How is the data encoded into Parquet?
    • Kylin no longer needs to encode columns itself
    • Parquet encodes the columns as needed
  • All data types can be accurately mapped to Parquet
    • Supported via ParquetWriteSupport
      • StructType, ArrayType, MapType
    • Direct mapping (see the table below and the sketch at the end of this section)
Type          | Spark SQL type | Parquet type
--------------|----------------|--------------------------------------------
Numeric types | ByteType       | INT32
Numeric types | ShortType      | INT32
Numeric types | IntegerType    | INT32
Numeric types | LongType       | INT64
Numeric types | FloatType      | FLOAT
Numeric types | DoubleType     | DOUBLE
Numeric types | DecimalType    | INT32, INT64, BINARY, FIXED_LEN_BYTE_ARRAY
String type   | StringType     | BINARY
Binary type   | BinaryType     | BINARY
Boolean type  | BooleanType    | BOOLEAN
Datetime type | TimestampType  | INT96
Datetime type | DateType       | INT32
  • How computed columns are stored
    • Bitmap: Binary
    • TopN: Binary
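
A minimal write/read sketch of the direct mapping, assuming Spark; the output path is a placeholder, and the exact physical types depend on the Spark/Parquet versions and writer configuration:

    import java.sql.{Date, Timestamp}
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("type-mapping-sketch").getOrCreate()
    import spark.implicits._

    // One row covering several Spark SQL types from the table above.
    val df = Seq(
      (1, 1L, 1.5f, 2.5d, "abc", true,
       Timestamp.valueOf("2020-01-01 00:00:00"), Date.valueOf("2020-01-01"))
    ).toDF("i", "l", "f", "d", "s", "b", "ts", "dt")

    // Spark's Parquet writer maps these to INT32, INT64, FLOAT, DOUBLE,
    // BINARY (UTF8), BOOLEAN, INT96 (default for timestamps) and INT32 (DATE).
    df.write.mode("overwrite").parquet("/tmp/type_mapping_sketch")   // placeholder path

    // Reading back recovers the logical Spark types from the Parquet schema.
    spark.read.parquet("/tmp/type_mapping_sketch").printSchema()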

5. How to build Cube into Parquet

  • Reduced build steps
    • From ten to twenty steps down to just two steps
  • Build Engine
    • Simple and clear architecture
    • Spark as the only build engine
    • All builds are done via Spark
    • Spark parameters are adjusted adaptively
    • Dimension dictionaries are no longer needed
  • Supported measures
    • Sum
    • Count
    • Min
    • Max
    • TopN
    • Bitmap
    • Hllc
  • Cube data is written into Parquet (see the build sketch below)
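
A minimal two-step build sketch, assuming Spark; the table names, dimension subsets, cuboid ids, and output path are placeholders rather than Kylin's actual metadata:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().appName("cube-build-sketch").getOrCreate()

    // Step 1: materialize the flat table by joining the fact and dimension sources.
    val flatTable = spark.table("fact_sales").join(spark.table("dim_product"), "product_id")

    // Step 2: aggregate the flat table once per cuboid and write each cuboid
    // into its own folder, following the layout from section 2.
    val outputDir = "hdfs:///kylin/cube_sales/SegmentA"   // placeholder path
    val cuboids = Seq(
      "1111" -> Seq("d1", "d2", "d3"),   // illustrative cuboid id and dimension subset
      "1001" -> Seq("d1")
    )
    cuboids.foreach { case (cuboidId, dims) =>
      flatTable
        .groupBy(dims.map(col): _*)
        .agg(sum("m1").as("m1_sum"), count(lit(1)).as("row_count"))
        .write.mode("overwrite")
        .parquet(s"$outputDir/$cuboidId")
    }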

6. How to query with Parquet

  • Query Engine: Sparder
    • Uses Spark as the computation engine
    • A distributed query engine that avoids single-node pressure
    • A unified computation engine for both building and querying
    • Query speed increases substantially
    • Can integrate more Spark features and ecosystem components

  

  • Basic process of a Sparder query (a plain-Spark illustration of the last two steps follows this list)
    1. Parser => SQL text to an AST
    2. Validation => Further verify the validity of the SQL based on metadata
    3. Optimizer => Generate the logical plan according to optimization rules
    4. Kylin's Adaptation => Convert the AST nodes to rel nodes (various classes ending with Rel, such as FilterRel)
    5. Spark Plan => Convert the rel nodes to a Spark plan
    6. Query Execution => Read cube data based on the generated Spark plan
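
For the Spark side of this pipeline (steps 5 and 6), plain Spark SQL can illustrate how a logical plan becomes the physical plan that reads the cuboid files; this is only an analogy using a hypothetical cuboid path, not Kylin's actual Calcite-to-Spark translation:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("plan-stages-sketch").getOrCreate()

    // Register a cuboid file as a temp view so SQL can run against it.
    spark.read.parquet("hdfs:///kylin/cube_sales/SegmentA/1111")   // placeholder path
      .createOrReplaceTempView("cuboid_1111")

    // explain(true) prints the parsed, analyzed and optimized logical plans plus
    // the physical Spark plan that is finally executed against the Parquet files.
    spark.sql("SELECT `1`, SUM(`110000`) FROM cuboid_1111 GROUP BY `1`").explain(true)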

  

  • What optimizations does Kylin apply when reading Parquet cube data? (see the pushdown sketch after this list)
    • Segment pruning
    • Shard-by column
    • Parquet page index
    • Project pushdown
    • Predicate pushdown
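
A minimal pushdown sketch, assuming Spark reading a cuboid laid out as in section 3; the path and predicate are placeholders, and the column-index setting is the standard parquet-mr configuration key rather than anything Kylin-specific:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("pushdown-sketch").getOrCreate()

    // Enable Parquet column/page index filtering (parquet-mr 1.11+ configuration key).
    spark.sparkContext.hadoopConfiguration
      .setBoolean("parquet.filter.columnindex.enabled", true)

    val cuboid = spark.read.parquet("hdfs:///kylin/cube_sales/SegmentA/1111")   // placeholder path

    // Selecting only two columns exercises project pushdown (column pruning);
    // the filter on a dimension column exercises predicate pushdown, so row groups
    // and pages whose statistics rule out `1` = 42 are skipped entirely.
    val pruned = cuboid.select("`1`", "`110000`").where("`1` = 42")
    pruned.explain()   // PushedFilters should list the predicate on column "1"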
        

7. Performance

  • Build

    • Use TPC-H as the dataset for the build test
    • The detailed data is as follows
  • Query

    • TPC-H 50, average time 3.93 s
    • TPC-H 100, average time 5.95 s
    • TPC-H 500, average time 11.33 s
    • TPC-H 1000, average time 18.92 s

8. Next step
