Apache Kylin : Analytical Data Warehouse for Big Data

What is Realtime OLAP in Kylin

Kylin v3.0.0 will release the real-time OLAP feature. With the newly added streaming receiver cluster, Kylin can query streaming data with sub-second latency. See this tech blog for the overall design and core concepts.

If you want a step-by-step tutorial, please check this tech blog.
This article introduces how to update segments and how to set the timezone for derived time columns in a realtime OLAP cube.

Sample Event

This sample event comes from my Python script. It carries some additional fields such as event_time, which stands for the timestamp of the event.
We assume that events come from countries in different timezones: "2019-12-09 08:44:50.000-0500" indicates that the event uses the America/New_York timezone. You may have events that come from Asia/Shanghai as well.
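To make the timezone point concrete, here is a minimal Python sketch (not part of Kylin) that parses event time strings carrying a UTC offset and converts them to epoch milliseconds. The function name `to_epoch_millis` and the second sample string (the same instant written with a +08:00 offset) are assumptions made for illustration only.

```python
from datetime import datetime, timedelta, timezone

# Format of the sample strings: date, time with milliseconds, UTC offset.
# Parsing "+08:00" (offset with a colon) via %z needs Python 3.7 or later.
EVENT_TIME_FORMAT = "%Y-%m-%d %H:%M:%S.%f%z"
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(event_time: str) -> int:
    """Convert an offset-qualified event time string to epoch milliseconds."""
    parsed = datetime.strptime(event_time, EVENT_TIME_FORMAT)
    # Integer division of timedeltas avoids floating point rounding.
    return (parsed - EPOCH) // timedelta(milliseconds=1)


# The string from the text above (New York offset) ...
new_york = "2019-12-09 08:44:50.000-0500"
# ... and a hypothetical string for the same instant as seen from Shanghai.
shanghai = "2019-12-09 21:44:50.000+08:00"

same_instant = to_epoch_millis(new_york) == to_epoch_millis(shanghai)
print(same_instant)
```

The two strings differ in wall-clock time but map to the same epoch value, which is why the timezone only matters once time columns are derived from the timestamp.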

Say we have a Kafka message that looks like this:

SampleEvent
{
    "content_list":[
        "22",
        "22",
        "22"
    ],
    "act_type":"click",
    "event_ts_2":1600877255000,
    "event_ts":1600877255000,
    "user_detail":{
        "devide_type":7,
        "location":{
            "city":"shenzhen"
        },
        "network_type":3
    },
    "video_id":22,
    "event_date_2":"2020-09-23 16:07:35.000+08:00",
    "str_minute":"7",
    "video_type":"3c8416",
    "play_times":22,
    "pageview_id":"3c84cf9d-b8fb-3dec-8b8c-f510c4b6fd097",
    "active_minutes":50.0208,
    "device_brand":"vivo",
    "str_minute_second":"16_7_35",
    "play_duration":37.6584,
    "event_date":"2020-09-23 16:07:35.000+08:00",
    "page_id":"page_22",
    "str_second":"35",
    "uid":2
}

Question

When performing realtime OLAP analysis with Kylin, you may have concerns such as:

  1. Will events in different timezones cause incorrect query results?
  2. How can I correct the data when Kafka messages contain values that are not what I want, say a misspelled dimension value?
  3. How can I recover very late messages that have been dropped?
  4. My query only hits a small time range. How should I write the filter condition to make sure unused segments are pruned/skipped from the scan?

Quick Answer

For the first question, you can always get the correct result in the timezone of your location by setting kylin.stream.event.timezone=GMT+N for all Kylin processes. By default, UTC is used for derived time columns.
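The effect of the timezone setting can be illustrated with a small Python sketch. It is only a model of the idea, not Kylin's implementation: the function `derived_time_columns` is hypothetical, HOUR_START is assumed to sit alongside the MINUTE_START and DAY_START columns, and the output formats are chosen for readability. It truncates the `event_ts` value from the sample event once in UTC and once in GMT+8.

```python
from datetime import datetime, timedelta, timezone


def derived_time_columns(event_ts_ms: int, utc_offset_hours: int = 0) -> dict:
    """Truncate an epoch-millisecond timestamp to minute, hour and day
    boundaries in a fixed-offset timezone (illustration only)."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    local = datetime.fromtimestamp(event_ts_ms / 1000, tz=tz)
    minute_start = local.replace(second=0, microsecond=0)
    hour_start = minute_start.replace(minute=0)
    day_start = hour_start.replace(hour=0)
    fmt = "%Y-%m-%d %H:%M:%S"
    return {
        "MINUTE_START": minute_start.strftime(fmt),
        "HOUR_START": hour_start.strftime(fmt),
        "DAY_START": day_start.strftime("%Y-%m-%d"),
    }


# event_ts taken from the sample event.
utc_cols = derived_time_columns(1600877255000)      # default: UTC
gmt8_cols = derived_time_columns(1600877255000, 8)  # as with GMT+8

print(utc_cols["DAY_START"])   # 2020-09-23
print(gmt8_cols["DAY_START"])  # 2020-09-24
```

The same event falls on different days depending on the timezone, so a query grouped or filtered by day returns different results unless the timezone matches the one you expect.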

For the second and third questions: you cannot update/append segments to a normal streaming cube, but you can update/append segments to a streaming cube in lambda mode. All you need to prepare is a Hive table that is mapped to your Kafka event.

For the fourth question, you can achieve this by adding derived time columns such as MINUTE_START or DAY_START to your filter condition.
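As a rough sketch of such a filter, the Python helper below builds a SQL string that bounds both DAY_START and MINUTE_START. The helper name `build_range_query`, the table name `SAMPLE_EVENT_TABLE`, and the literal formats used for the time values are assumptions for illustration; adapt them to your own cube and check the column types in your model.

```python
from datetime import datetime


def build_range_query(table: str, start: datetime, end: datetime) -> str:
    """Build a SQL query restricted to [start, end) using derived time columns."""
    day_fmt = "%Y-%m-%d"
    minute_fmt = "%Y-%m-%d %H:%M:%S"
    # Align both bounds to whole minutes, matching MINUTE_START granularity.
    start = start.replace(second=0, microsecond=0)
    end = end.replace(second=0, microsecond=0)
    return "\n".join([
        "SELECT ACT_TYPE, COUNT(*) AS EVENT_COUNT",
        f"FROM {table}",
        # Coarse bound on the day, then a finer bound on the minute.
        f"WHERE DAY_START >= '{start.strftime(day_fmt)}'",
        f"  AND DAY_START <= '{end.strftime(day_fmt)}'",
        f"  AND MINUTE_START >= '{start.strftime(minute_fmt)}'",
        f"  AND MINUTE_START < '{end.strftime(minute_fmt)}'",
        "GROUP BY ACT_TYPE",
    ])


# Hypothetical table name; one hour of events on 2020-09-23.
query = build_range_query(
    "SAMPLE_EVENT_TABLE",
    datetime(2020, 9, 23, 16, 0),
    datetime(2020, 9, 23, 17, 0),
)
print(query)
```

Bounding the time range on the derived columns, rather than on a raw timestamp or string field, is what gives the query engine a chance to skip segments outside the range.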

Lambda Architecture