Model/cube design best practices

Data source:

Run pre-checks before designing a cube:
- Check data quality, including data types and null values;
- Clean the data before loading it into Kylin;
- Collect statistics, including table scale and dimension cardinalities;
- Pay attention to very low cardinality columns (booleans, gender) and very high cardinality columns (such as email, UUID, timestamp, numeric values);
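
The pre-checks above can be sketched on a sample of rows before the data goes anywhere near Kylin. This is a minimal illustration, not part of Kylin: the function names and the thresholds (`low`, `high_ratio`) are assumptions chosen for the example, and on a real table the same counts would normally come from Hive queries.

```python
from collections import defaultdict


def profile_columns(rows):
    """Return null count and distinct-value count per column for a list of dict rows."""
    nulls = defaultdict(int)
    distinct = defaultdict(set)
    for row in rows:
        for col, value in row.items():
            if value is None:
                nulls[col] += 1
            else:
                distinct[col].add(value)
    columns = set(nulls) | set(distinct)
    return {
        col: {"nulls": nulls[col], "cardinality": len(distinct[col])}
        for col in columns
    }


def flag_extremes(profile, row_count, low=2, high_ratio=0.9):
    """Label each column as 'very high', 'very low', or 'normal' cardinality.

    The thresholds are hypothetical: a column counts as 'very high' when its
    distinct count is at least high_ratio * row_count, and as 'very low' when
    it has at most `low` distinct values.
    """
    flags = {}
    for col, stats in profile.items():
        if stats["cardinality"] >= high_ratio * row_count:
            flags[col] = "very high"
        elif stats["cardinality"] <= low:
            flags[col] = "very low"
        else:
            flags[col] = "normal"
    return flags
```

Columns flagged at either extreme are the ones to review first when choosing dimensions.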
Use a Hive view for flexibility (if the logic is likely to change)
- Define calculations and conversions in the Hive view, then use the view as the fact table in Kylin;
- A logic change then only affects the Hive view; there is no need to re-create the model/cube;
- If the physical table is partitioned in Hive, also define the view as partitioned.
Use a partitioned Hive table for incremental loading
- This avoids a full table scan on each build
Use the Hive partition column as the Kylin model's partition column
- So that when Kylin loads data from Hive, Hive can use the time condition to skip unrelated partitions;
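
To make the "time condition" concrete, here is a small sketch of a half-open date-range filter on the partition column, the kind of predicate that lets Hive prune partitions for one incremental build. The helper name `segment_filter` and the column name `part_dt` are hypothetical, and the exact SQL Kylin generates may differ.

```python
from datetime import date


def segment_filter(partition_column, start, end):
    """Build a half-open [start, end) date filter on the partition column."""
    if start >= end:
        raise ValueError("start must be earlier than end")
    return (
        f"{partition_column} >= '{start.isoformat()}' "
        f"AND {partition_column} < '{end.isoformat()}'"
    )


# Example: the filter for a one-week segment.
weekly = segment_filter("part_dt", date(2024, 1, 1), date(2024, 1, 8))
```

Because the interval is half-open, consecutive segments do not overlap or leave gaps.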
Never delete a column from a Hive table;
- If a column is in use but deleted from Hive, the Kylin model becomes invalid;
- Adding a new column is always okay.
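
The column rule can be checked before a Hive schema change is applied. The sketch below is only an illustration with a hypothetical helper name: it assumes you already have the list of columns the model uses and the list of columns in the table, and it compares names case-insensitively because Hive column names are case-insensitive.

```python
def missing_columns(model_columns, hive_columns):
    """Return model columns that no longer exist in the Hive table (case-insensitive)."""
    available = {c.lower() for c in hive_columns}
    return sorted(c for c in model_columns if c.lower() not in available)
```

An empty result means the change is safe for the model; added columns never show up as a problem.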

Lookup tables:

Keep lookup tables small;
- Kylin loads lookup tables into memory;
- Usually, a lookup table should have < 1 million rows and be < 100 MB in size;
- Remove unnecessary columns from lookup tables (physically, or via Hive view);
Handle very large lookup tables (say > 10 million rows)
- Make a flat table in Hive, and then use the flat table as the fact table in Kylin;
- Or, when defining the data model, check the option "Not building snapshot for this table".
Lookup tables should be append-only;
- Do not delete historical records;
- If some rows are updated, the cube segments may need to be refreshed;
Avoid using a Hive view as a lookup table;
- Kylin loads the lookup table as a snapshot, and a view takes time to materialize;
- That said, Kylin still supports a view as a lookup table;
Lookup table PKs should be unique;
- Hive does not enforce this, but Kylin checks it during the build;
Join condition should be "fact.fkCol=lookup.pkCol"
- Kylin doesn't support non-equi joins or constant joins;
- A composite PK is okay;
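
Since Hive does not enforce primary keys, it can help to look for duplicate lookup keys before a build fails on them. This is a minimal sketch over rows held in memory, with a hypothetical helper name; it treats the key as a tuple so composite PKs work too. On a large table the equivalent check would be a grouped count in Hive.

```python
from collections import Counter


def duplicate_keys(rows, pk_columns):
    """Return {key_tuple: count} for every PK value that appears more than once."""
    counts = Counter(tuple(row[c] for c in pk_columns) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}
```

An empty dict means the chosen PK columns are unique in the sample.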

Space shortcuts