Apache Kylin : Analytical Data Warehouse for Big Data

Data source:

  • Run pre-checks before designing a cube:
    • Check data quality, including data types and null values;
    • Clean the data before loading it into Kylin;
    • Collect statistics, including data scale and dimension cardinalities;
    • Pay attention to very low cardinality columns (such as booleans or gender) and very high cardinality columns (such as email, UUID, timestamp, or numeric values);
  • Use a Hive view for flexibility (if the logic is likely to change)
    • Define calculations and conversions in the Hive view, then use the view as the fact table in Kylin;
    • A logic change then affects only the Hive view; there is no need to re-create the model or cube;
    • If the physical table is partitioned in Hive, also define the view as partitioned. 
  • Use a partitioned Hive table for incremental loading
    • This avoids a full table scan on each build;
  • Use the Hive partition column as the Kylin model's partition column
    • When Kylin loads data from Hive, Hive can then use the time condition to skip unrelated partitions;
  • Never delete a column from a Hive table;
    • If a column that is in use is deleted from Hive, the Kylin model becomes invalid;
    • Adding a new column is always okay.
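
The pre-checks above (null values, dimension cardinalities) can be sketched in a few lines of Python. This is only an illustration on an in-memory sample, not part of Kylin: the function names `profile_columns` and `flag_columns` and the thresholds `low` and `high_ratio` are hypothetical, and in practice the same statistics would more likely be collected with queries against the Hive table.

```python
from collections import defaultdict


def profile_columns(rows):
    """Return {column: {"nulls": int, "cardinality": int}} for a list of dict rows."""
    nulls = defaultdict(int)
    distinct = defaultdict(set)
    for row in rows:
        for col, value in row.items():
            if value is None:
                nulls[col] += 1
            else:
                distinct[col].add(value)
    columns = set(nulls) | set(distinct)
    return {
        col: {"nulls": nulls[col], "cardinality": len(distinct[col])}
        for col in sorted(columns)
    }


def flag_columns(profile, total_rows, low=2, high_ratio=0.9):
    """Flag columns whose cardinality is very low or close to the row count.

    `low` and `high_ratio` are illustrative thresholds (assumptions),
    not values recommended by Kylin.
    """
    flags = {}
    for col, stats in profile.items():
        cardinality = stats["cardinality"]
        if cardinality <= low:
            flags[col] = "very low cardinality"
        elif total_rows and cardinality / total_rows >= high_ratio:
            flags[col] = "very high cardinality"
    return flags
```

Running `profile_columns` on a sample of the fact table and passing the result to `flag_columns` gives a short list of columns that deserve a closer look before they are used as dimensions.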

Lookup tables:

  • Keep lookup tables small;
    • Kylin will load the lookup tables into memory;
    • Usually, a lookup table should have fewer than 1 million rows and be smaller than 100 MB;
    • Remove unnecessary columns from lookup tables (physically, or via Hive view);
  • Handle very large lookup tables (say > 10 million rows)
    • Make a flat table in Hive, and then use the flat table as the fact table in Kylin;
    • Or, when defining the data model, check the option "Not building snapshot for this table".
  • Lookup tables should be append-only;
    • Do not delete historical records;
    • If some rows are updated, the affected cube segments may need to be refreshed;
  • Avoid using a Hive view as a lookup table;
    • Kylin loads the lookup table as a snapshot, and a view takes time to materialize;
    • That said, Kylin does still support a view as a lookup table;
  • Lookup table PKs should be unique;
    • Hive does not enforce this, but Kylin checks it during the build;
  • The join condition should have the form "fact.fkCol=lookup.pkCol"
    • Kylin doesn't support non-equi joins or constant joins;
    • A composite PK is okay.
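
Because Hive does not enforce primary-key uniqueness, it can help to check a lookup table before the build does. The following is a minimal sketch on in-memory rows, assuming the table fits in memory (which the size guidance above implies); `find_duplicate_keys`, `check_lookup_table` and the `max_rows` default are hypothetical names chosen for this example, not Kylin APIs.

```python
from collections import Counter


def find_duplicate_keys(rows, pk_cols):
    """Return the sorted primary-key tuples that occur more than once.

    `pk_cols` may list several columns, so composite keys are supported.
    """
    counts = Counter(tuple(row[col] for col in pk_cols) for row in rows)
    return sorted(key for key, n in counts.items() if n > 1)


def check_lookup_table(rows, pk_cols, max_rows=1_000_000):
    """Return a list of human-readable problems; an empty list means no issue found.

    `max_rows` mirrors the "< 1 million rows" rule of thumb and is an
    adjustable assumption, not a hard Kylin limit.
    """
    problems = []
    if len(rows) > max_rows:
        problems.append(f"{len(rows)} rows exceeds the limit of {max_rows}")
    duplicates = find_duplicate_keys(rows, pk_cols)
    if duplicates:
        problems.append(f"duplicate primary keys: {duplicates}")
    return problems
```

For a lookup table keyed on two columns, calling `check_lookup_table(rows, ["country", "region"])` reports any key pair that appears on more than one row.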