Apache Kylin : Analytical Data Warehouse for Big Data
Background
What is Realtime OLAP in Kylin
Kylin v3.0.0 will release the real-time OLAP feature. Powered by the newly added streaming receiver cluster, Kylin can query streaming data with sub-second latency. You can check this tech blog for the overall design and core concepts.
If you want a step-by-step tutorial, please check this tech blog.
In this article, we will introduce how to update segments and set the timezone for derived time columns in a realtime OLAP cube.
Sample Event
This sample event comes from my Python script. It carries some additional fields such as event_time, which stands for the timestamp of the event.
We assume that events come from countries in different timezones. For example, “2019-12-09 08:44:50.000-0500” indicates that the event uses the America/New_York timezone. You may have some events which come from Asia/Shanghai as well.
Say we have a Kafka message which looks like this:
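To make the offset notation concrete, here is a minimal Python sketch that parses the timestamp string quoted above and prints the same instant in UTC and in UTC+8. It uses only the standard library. The names parse_event_time, RAW_EVENT_TIME and SHANGHAI are illustrative and are not part of Kylin or of the original script. A fixed +08:00 offset stands in for Asia/Shanghai.

```python
from datetime import datetime, timedelta, timezone

# Timestamp string quoted in the text; the trailing -0500 is the UTC offset.
RAW_EVENT_TIME = "2019-12-09 08:44:50.000-0500"


def parse_event_time(raw: str) -> datetime:
    """Parse 'YYYY-MM-DD HH:MM:SS.mmm+hhmm' text into an offset-aware datetime."""
    return datetime.strptime(raw, "%Y-%m-%d %H:%M:%S.%f%z")


event = parse_event_time(RAW_EVENT_TIME)

# Assumption: a fixed +08:00 offset is used in place of the named zone Asia/Shanghai.
UTC = timezone.utc
SHANGHAI = timezone(timedelta(hours=8))

# The instant is the same; only the wall-clock rendering differs.
print(event.astimezone(UTC).isoformat())       # 2019-12-09T13:44:50+00:00
print(event.astimezone(SHANGHAI).isoformat())  # 2019-12-09T21:44:50+08:00
```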
Question
When performing realtime OLAP analysis with Kylin, you may have concerns including:
- Will events in different timezones cause incorrect query results?
- How can I correct the data when Kafka messages contain values that are not what I want, say a misspelled dimension value?
- How can I retrieve long-late messages which have been dropped?
- My query only hits a small time range. How should I write the filter condition to make sure unused segments are pruned/skipped from the scan?
Quick Answer
For the first question, you can always get the correct result in the timezone of your location by setting kylin.stream.event.timezone=GMT+N for all Kylin processes. By default, UTC is used for derived time columns.
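The effect of the timezone setting on derived time columns can be illustrated with a small Python sketch. This is not Kylin's implementation. It only truncates an event time to the minute, hour and day in a chosen GMT+N offset, which is roughly what columns such as MINUTE_START, HOUR_START and DAY_START represent. The function derived_time_columns and the second event (late) are hypothetical.

```python
from datetime import datetime, timedelta, timezone

FMT = "%Y-%m-%d %H:%M:%S"


def derived_time_columns(event_time: datetime, gmt_offset_hours: int) -> dict:
    """Truncate an offset-aware event time to minute/hour/day in GMT+N."""
    local = event_time.astimezone(timezone(timedelta(hours=gmt_offset_hours)))
    minute_start = local.replace(second=0, microsecond=0)
    hour_start = minute_start.replace(minute=0)
    day_start = hour_start.replace(hour=0)
    return {
        "MINUTE_START": minute_start.strftime(FMT),
        "HOUR_START": hour_start.strftime(FMT),
        "DAY_START": day_start.strftime(FMT),
    }


# Event time from the text: 2019-12-09 08:44:50.000-0500
sample = datetime(2019, 12, 9, 8, 44, 50, tzinfo=timezone(timedelta(hours=-5)))
# Hypothetical second event, late in the UTC day
late = datetime(2019, 12, 9, 20, 30, 0, tzinfo=timezone.utc)

for offset in (0, 8):
    print(f"GMT+{offset}", derived_time_columns(sample, offset))
    print(f"GMT+{offset}", derived_time_columns(late, offset))
```

With offset 0 (UTC) the hypothetical late event falls on 2019-12-09, while with offset 8 it falls on 2019-12-10. A daily aggregation would therefore count it on a different day depending on the configured timezone.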
For the second and third questions, you cannot update/append segments of a normal streaming cube, but you can do so for a streaming cube in lambda mode. All you need to prepare is a Hive table which is mapped to your Kafka event.
For the fourth question, you can achieve this by adding a derived time column such as MINUTE_START or DAY_START to your filter condition.
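The idea behind this answer can be sketched conceptually in Python. This is an illustration of range-based segment pruning, not Kylin's actual code. It assumes each segment covers a half-open time range and that the filter on a derived time column translates into a time range as well. Segment, segments_to_scan and the table name in the comment are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Segment:
    name: str
    start: datetime  # inclusive
    end: datetime    # exclusive


def segments_to_scan(segments, filter_start, filter_end):
    """Keep only segments whose [start, end) overlaps [filter_start, filter_end)."""
    return [s for s in segments if s.start < filter_end and filter_start < s.end]


# Three hypothetical hourly segments on 2019-12-09: 08:00, 09:00 and 10:00.
day = datetime(2019, 12, 9)
segments = [
    Segment(f"seg_{h:02d}", day + timedelta(hours=h), day + timedelta(hours=h + 1))
    for h in (8, 9, 10)
]

# A filter such as (hypothetical table name):
#   SELECT ... FROM my_streaming_table
#   WHERE HOUR_START >= '2019-12-09 09:00:00' AND HOUR_START < '2019-12-09 10:00:00'
# corresponds to the time range [09:00, 10:00).
hit = segments_to_scan(segments, day + timedelta(hours=9), day + timedelta(hours=10))
print([s.name for s in hit])  # ['seg_09']
```

Without a time filter, every segment would overlap the query range and nothing could be skipped.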