You are viewing an old version of this page. View the current version.

Compare with Current View Page History

Version 1 Next »

Status

Current state: under discussion

Discussion thread: 

JIRA

Released: 1.13.0

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

Currently some temporal function behaviors are wired to users. 

  • When users use a PROCTIME() in SQL, the return value of PROCTIME() has a timezone offset with the wall-clock time in users' local time zone, users need to add their local time zone offset manually to get expected local timestamp(e.g: Users in Germany need to +1h to get expected local timestamp). 
  • Users can not use CURRENT_DATE/CURRENT_TIME/CURRENT_TIMESTAMP  to get wall-clock timestamp in local time zone, and thus they need write UDF in their SQL just for implementing a simple filter like WHERE date_col =  CURRENT_DATE. 
  • Another common case  is the time window  with day interval based on PROCTIME(), user plan to put all data from one day into the same window, but the window is assigned using timestamp in UTC+0 timezone rather than the session timezone which leads to the window starts with an offset(e.g: Users in China need to add -8h in their business sql start and then +8h when output the result, the conversion like a magic for users). 

These problems come from the fact that lots of time-related functions like PROCTIME(), NOW(), CURRENT_DATE, CURRENT_TIME and CURRENT_TIMESTAMP are returning time values based on UTC+0 time zone.

This FLIP aims to consistent the timestamp function behavior and eventually improve the usability.

Public Interfaces 

As we knew some functions' behavior is wrong currently, but after we correct these function, the legacy behavior should still work in old code.

This FLIP introduce an option for compatibility consideration, given two option names, I prefer the first one, the second one is like Impala style[1].

table.exec.fallback-legacy-time-function= false 

table.exec.use-utc-for-unixtime-conversion = false    

The default value is false which means the return values of function

  • CURRENT_DATE
  • CURRENT_TIME
  • CURRENT_TIMESTAMP
  • NOW()
  • PROCTIME()

consider the local timezone. Users can set the option to 'true', thus these functions would keep the legacy behavior.

And when the option also influence the result of CAST conversion between NUMERIC TYPE and TIMESTAMP, by default,the CAST conversion is not supported anymore,  users can use TO_TIMESTAMP(milliseconds)  function to get a TIMESTAMP in local time zone from a java epoch milliseconds.  Users can set the option to 'true', thus these cast conversion would keep the legacy behavior.

Proposed Changes

1. Correct five time function behavior

I invested all Flink time-related functions current behavior and compared with other DB vendors like Pg,Presto, Hive, Spark, Snowflake,  I made an excel [2] to organize them well.

To correct our current behavior, we need to make the function return type clear, especially for timestamp type, i.e.

TIMESTAMP/TIMESTAMP WITHOUT TIME ZONE
TIMESTAMP WITH LOCAL TIME ZONE
TIMESTAMP WITH TIME ZONE.
In order to understand these types better, I wrote a document[3], Flink also keeps same semantics for three timestamp types compareing with Hadoop ecosystem.

From my investigation, to correct this time functions' behavior, we have server options
(1) change the function return type
(2) change the function return value
(3) change them both.
All of those way are valid because SQL:2011 does not specify the function return type and every SQL engine vendor has its own implementation[2], for exampe CURRENT_TIMESTAMP.

This FLIP proposed option(2) which only change the return value,



Correct time-related function behavior as following, assume the current wall-clock is 2020-12-29 07:52:52 in Beijing time(UTC+8).

function

existed problem

current behavior

proposed changes

CURRENT_DATE

returns UTC date, but user expects current date in session time zone

return type: DATE

#session timezone: UTC

2020-12-28

#session timezone: UTC+8

2020-12-28

 return current date in session time zone, the return type should be DATE

#session timezone: UTC

2020-12-28

#session timezone: UTC+8

2020-12-29

CURRENT_TIME

returns UTC time, but user expects current time in session time zone

return type:  TIME 

#session timezone: UTC

23:52:52

#session timezone: UTC+8

23:52:52

return current time in session time zone, the return type should be TIME

#session timezone: UTC

23:52:52

#session timezone: UTC+8

07:52:52

CURRENT_TIMESTAMP

returns UTC timestamp, but user expects current timestamp in session time zone

return type:  TIMESTAMP 

#session timezone: UTC

2020-12-28T23:52:52

#session timezone: UTC+8

2020-12-28T23:52:52

return current timestamp in session time zone, the return type should be TIMESTAMP

#session timezone: UTC

2020-12-28T23:52:52

#session timezone: UTC+8

2020-12-29T07:52:52

NOW()

returns UTC timestamp, but user expects current timestamp in session time zone

return type:  TIMESTAMP 

#session timezone: UTC

2020-12-28T23:52:52

#session timezone: UTC+8

2020-12-28T23:52:52

return current timestamp in session time zone, the return type should be TIMESTAMP

#session timezone: UTC

2020-12-28T23:52:52

#session timezone: UTC+8

2020-12-29T07:52:52

PROCTIME()

returns UTC timestamp, but user expects current timestamp in session time zone

return type:  TIMESTAMP *PROCTIME*

#session timezone: UTC

2020-12-28T23:52:52

#session timezone: UTC+8

2020-12-28T23:52:52

return current timestamp in session time zone for PROCTIME(), the return type should be TIMESTAMP  *PROCTIME*

#session timezone: UTC

2020-12-28T23:52:52

#session timezone: UTC+8

2020-12-29T07:52:52


After the proposal, the function NOW(), CURRENT_TIMESTAMP and LOCALTIMESTAMP become  synonyms, the function CURRENT_TIME and LOCALTIME become synonyms, you can also lookup all time function behaviors in reference [2]. 


2. Correct CAST conversion between NUMERIC TYPE  and TIMESTAMP type

Currently, the following CAST conversion behaviors are wrong which does not consider the session time zone. It should use session time zone when cast between NUMERIC type and TIMESTAMP type if there’re strong requirements to support this , the numeric type include TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE, they keep same conversion behavior, for example the cast between BIGINT and TIMESTAMP. Although we didn’t expose this feature in the document, some users may use them.


As these cast conversions have wrong behavior and problematic semantics,  a better way to replace these functionalities is the TO_TIMESTAMP(milliseconds) function. 

function

current behavior

existed problem

proposed changes

CAST(44 AS TIMESTAMP) 


TIMESTAMP(6) NOT NULL

#session timezone: UTC

1970-01-01T00:00:44 

#session timezone: UTC+8

1970-01-01T00:00:44 

The time in bigint represents unixtime semantic, which always represents the elapsed seconds since java epoch(1970-01-01 00:00:00 UTC+0), when convert to a timestamp we should consider local time zone

support function  TO_TIMESTAMP(mills)

TO_TIMESTAMP(44000)

TIMESTAMP(3)

#session timezone: UTC

1970-01-01T00:00:44.000 

#session timezone: UTC+8

1970-01-01T08:00:44.000 

CAST(TIMESTAMP ‘1970-01-01 00:00:44’ AS BIGINT) 

BIGINT NOT NULL

#session timezone: UTC

44 

#session timezone: UTC+8

44

The inverse conversion of above, this conversion is used rarely.


UNIX_TIMESTAMP(TIMESTAMP ‘1970-01-01 00:00:44’)

#session timezone: UTC

1970-01-01T00:00:44.000 

#session timezone: UTC+8

1970-01-01T00:00:44.000 



After the proposal is finished, the above user cases will work smoothly.  Assume users' local time zone is UTC+8, the wall-clock is 2020-12-29 07:52:52.

  • user case 1: the query 

SELECT NOW(), PROCTIME(), CURRENT_DATE, CURRENT_TIME, CURRENT_TIMESTAMP;

will output:

| 2020-12-29 07:52:52 | 2020-12-29 07:52:52 | 2020-12-29 | 07:52:52 | 2020-12-29 07:52:52|


  • user case 2: the query 

SELECT 

   TUMBLE_START(proctime, INTERVAL ‘1’ DAY),

   TUMBLE_END(proctime, INTERVAL ‘1’ DAY),

   count(userId) as cnt

       FROM userLog

       GROUP BY TUMBLE_WINDOW(proctime, INTERVAL ‘1’ DAY)

 will output:

| 2020-12-29 00:00:00 | 2020-12-30 00:00:00 | 100|

  • user case 3: int the query 

SELECT *

       FROM userLog

      where date >= CURRENT_DATE

records earlier than 2020-12-29 will not be output.


General Implementations

1.Change the codegen implementations for above five functions/cast conversions according to the value of introduced table option: table.exec.fallback-legacy-time-function.enable 

2.Consider the session time zone offset when materializing a PROCTIME() attribute, including PROCTIME_MATERIALIZE function and the processing timestamp used to register timer in the window operator, because after change the return value of PROCTIME()function,

the window based on processing time should be triggered by the changed processing time value.

3. When converting a Table contains rowtime from/to DataStream, the timestamp in StreamRecord will exchange with Table’s row time column. We need to consider the time zone offset just like the cast conversion between the TIMESTAMP and BIGINT.  This way can make the

both datastream and SQL/API users feel intuitive.

Compatibility, Deprecation, and Migration Plan

  • Compatibility

This is an incompatible change, we introduce SQL/API option  table.exec.fallback-legacy-time-function for compacting current wrong behavior, and set it to ‘false’. If users want to keep the legacy behavior, they need to set it to ’true’ manually. 

  • Deprecation
  • Migration Plan


Test Plan

Will add plan tests, unit tests, window operator harness tests as well as IT tests. 

Rejected Alternatives

(1) Change the return type of function CURRENT_TIMESTAMP/NOW()/PROCETIME() to TIMESTAMP WITH TIME ZONE. This proposal needs to introduce a new type TIMESTAMP WITH TIME ZONE, and we think there are no enough benefits. If we do this, the return type of function CURRENT_TIME must be TIME WITH TIME ZONE for consistent consideration, we need to introduce another type.

(2)Change the return type of function  CURRENT_TIMESTAMP/NOW()/PROCETIME() to TIMESTAMP WITH LOCAL TIME ZONE. This proposal will lead to a embarrassed situation for function CURRENT_TIME, because no DB vendor use type TIME WITH LOCAL TIME ZONE yet, and the time semantic of TIMESTAMP WITH LOCAL TIME ZONE is instant which is too complex to understand for normal users.


References:

  1. https://docs.cloudera.com/documentation/enterprise/latest/topics/impala_timezone.html

      2. https://docs.google.com/document/d/1iY3eatV8LBjmF0gWh2JYrQR0FlTadsSeuCsksOVp_iA/edit?usp=sharing

      3. https://docs.google.com/spreadsheets/d/1T178krh9xG-WbVpN7mRVJ8bzFnaSJx3l-eg1EWZe_X4/edit?usp=sharing

  • No labels