Intro
Executing statements periodically can be usefull in
- Pulling informations from external systems
- Periodically updating column statistics
- Rebuilding materialized views
Overview
- The metastore maintains the scheduled queries in the metastore database
- Hiveserver(s) periodically polls the metastore for a scheduled query to be executed
- During execution informations about ongoing/finished executions are kept in the metastore
Scheduled queries were added in Hive 4.0 (HIVE-21884)
Hive has it’s scheduled query interface built into the language itself for easy access:
Maintaining scheduled queries
Create Scheduled query syntax
CREATE SCHEDULED QUERY <scheduled_query_name>
<scheduleSpecification>
[<executedAsSpec> ]
[<enableSpecification>]
<definedAsSpec>
Alter Scheduled query syntax
ALTER SCHEDULED QUERY <scheduled_query_name>
(<scheduleSpec>|<executedAsSpec>|<enableSpecification>|<definedAsSpec>);
Drop syntax
DROP SCHEDULED QUERY <scheduled_query_name>;
scheduleSpecification syntax
Schedules can be specified using CRON expressions or for common cases there is a simpler form; in any case the schedule is stored as Quartz cron expression.
CRON based schedule syntax
CRON <quartz_schedule_expression>
where quartz_schedule_expression is quoted schedule in the Quartz format
https://www.freeformatter.com/cron-expression-generator-quartz.html
For example the CRON '0 */10 * * * ? *'
expression will fire every 10 minutes.
EVERY based schedule syntax
To give a more readable way to declare schedules EVERY can be used.
EVERY [<integer>] (SECOND|MINUTE|HOUR) [(OFFSET BY|AT) <timeOrDate>]
the format makes it possible to declare schedules in a more readable way:
EVERY 2 MINUTES
EVERY HOUR AT '0:07:30'
EVERY DAY AT '11:35:30'
ExecutedAs syntax
EXECUTED AS <user_name>
Scheduled queries are executed as the declaring user by default; but people with admin privileges might be able to change the executing user.
enableSpecification syntax
(ENABLE[D] | DISABLE[D])
Can be used to enable/disable a schedule.
For CREATE SCHEDULED QUERY statements the default behaviour is set by the configuration key hive.scheduled.queries.create.as.enabled
Defined AS syntax
[DEFINED] AS <hiveQuery>
The “query” is a single statement expression to be scheduled for execution.
System tables/views
Informations about scheduled queries/executions can be obtain by using the information_schema or the sysdb - recommended way is to use the information_schema; sysdb is tables are there to build the information_schema level views - and for debugging.
information_schema.scheduled_queries
Suppose we have a scheduled query defined by:
create scheduled query sc1 cron '0 */10 * * * ? *' as select 1;
Let’s take a look at it in the information_schema.scheduled_queries table by using
select * from information_schema.scheduled_queries;
I will transpose the resultset to describe each column
scheduled_query_id | 1 | Internally, every scheduled query also has a numeric id |
schedule_name | sc1 | The name of the schedule |
enabled | true | True if the schedule is enabled |
cluster_namespace | hive | The namespace thes scheduled query belongs to |
schedule | 0 */10 * * * ? * | The schedule described in QUARTZ cron format |
user | dev | The owner/executor of the query |
query | select 1 | The query being scheduled |
next_execution | 2020-01-29 16:50:00 | Technical column; shows when the next execution should happen |
(schedule_name,cluster_namespace) is unique
information_schema.scheduled_executions
This view can be used to get information about recent scheduled query executions.
select * from information_schema.scheduled_executions;
One record in this view has the following informations:
scheduled_execution_id | 13 | Every scheduled query execution has a unique numeric id |
schedule_name | sc1 | The schedule name to which this execution belongs |
executor_query_id | dev_20200131103008_c9a39b8d-e26b-44cd-b8ae-9d054204dc07 | The query id assigned by the execution engine for the given scheduled execution |
state | FINISHED | State of the execution; can be |
start_time | 2020-01-31 10:30:06 | Start time of execution |
end_time | 2020-01-31 10:30:08 | End time of execution |
elapsed | 2 | (computed) end_time-start_time |
error_message | NULL | In case the query is FAILED the error message is shown here |
last_update_time | NULL | During execution the last update time the executor provided informations about the state |
Execution states
INITED | The scheduled execution record is created at the time an executor is assigned to run it; |
EXECUTING | Queries in executing state are being processed by the executor; during this phase the executor reports the progress of the query in intervals defined by: hive.scheduled.queries.executor.progress.report.interval |
FAILED | The query execution stoped by an error code(or an exception) when this state is set the error_message is also filled. |
FINISHED | The query finished without problems |
TIMED_OUT | An execution is considered timed out when it’s being executed for more than metastore.scheduled.queries.execution.timeout. The scheduled queries maintenance task checks for any timed out executions. |
How long are execution informations are retained?
The scheduled query maintenance task removes older than metastore.scheduled.queries.execution.max.age entries.
Configuration
Hive metastore related configuration
- metastore.scheduled.queries.enabled (default: true)
Controls the metastore side support for scheduled queries; forces all HMS scheduled query related endpoints to return with an error - metastore.scheduled.queries.execution.timeout (default: 2 minutes)
In case a scheduled execution is not updated for at least this amount of time; it’s state will be changed to TIMED_OUT by the cleaner task - metastore.scheduled.queries.execution.maint.task.frequency (default: 1 minute)
Interval of scheduled query maintenance task. Which removes executions above max age; and marks executions as TIMED_OUT if the condition is met - metastore.scheduled.queries.execution.max.age (default: 30 days)
Maximal age of a scheduled query execution entry before it is removed.
HiveServer2 related configuration
- hive.scheduled.queries.executor.enabled (default: true)
Controls whether HS2 will run scheduled query executor. - hive.scheduled.queries.namespace (default: "hive")
Sets the scheduled query namespace to be used. New scheduled queries are created in this namespace; and execution is also bound to the namespace - hive.scheduled.queries.executor.idle.sleep.time (default: 1 minute)
Time to sleep between querying for the presence of a scheduled query. - hive.scheduled.queries.executor.progress.report.interval (default: 1 minute)
While scheduled queries are in flight; a background update happens periodically to report the actual state of the query. - hive.scheduled.queries.create.as.enabled (default: true)
This option sets the default behaviour of newly created scheduled queries. - hive.security.authorization.scheduled.queries.supported (default: false)
Enable this if the configured authorizer is able to handle scheduled query related calls.
Examples
create table t (a integer);
create scheduled query sc1 cron ‘0 */10 * * * ? *’ as insert into t values (1);
Example 1 - general usage of schedules
create table t (a integer);
-- create scheduled queries
create scheduled query sc1 cron ‘0 */10 * * * ? *’ as insert into t values (1);
-- depending on hive.scheduled.queries.create.as.enabled the query might get create in disabled mode
-- it can be enabled using:
alter scheduled query sc1 enabled;
-- inspect scheduled queries using the information_schema
create scheduled query sc2 cron ‘0 */10 * * * ? *’ as analyze table t compute statistics for columns;