sad
Executing statements periodically can be usefull in
- Pulling informations from external systems
- Periodically updating column statistics
- Rebuilding materialized views
Overview
- The metastore maintains the scheduled queries in the metastore database
- The Hiveserver periodically polls the metastore for a scheduled query to be executed
- During execution informations about ongoing/finished executions are kept in the metastore to help
Scheduled queries were added in Hive 4.0 (HIVE-21884)
Hive has it’s scheduled query interface built into the language itself for easy access:
Maintaining scheduled queries
Create Scheduled query syntax
CREATE SCHEDULED QUERY <scheduled_query_name>
<scheduleSpec>
[<executedAsSpec>]
[<enableSpecification>]
<definedAsSpec>
Alter Scheduled query syntax
ALTER SCHEDULED QUERY <scheduled_query_name>
(<scheduleSpec>|<executedAsSpec>|<enableSpecification>|<definedAsSpec>);
Drop syntax
DROP SCHEDULED QUERY <scheduled_query_name>;
D;
scheduleSpec syntax
CRON quartz_schedule_expression
quartz_schedule_expression
The schedule have to be specified in the Quartz format
https://www.freeformatter.com/cron-expression-generator-quartz.html
executedAsSpec syntax
EXECUTED AS user_name
Scheduled queries are executed as the declaring user by default; but people with admin privileges might be able to change the executing user.
enableSpecification syntax
(ENABLE[D] | DISABLE[D])
Can be used to disable a schedule.
definedAsSpec syntax
[DEFINED] AS query
The “query” is a single statement expression to be scheduled for execution.
System Tables
Informations about scheduled queries/executions can be obtain by using the information_schema or the sysdb - recommended way is to use the information_schema; sysdb is tables are there to build the information_schema level views - and for debugging.
information_schema.scheduled_queries
Suppose we have a scheduled query defined by:
create scheduled query sc1 cron '0 */10 * * * ? *' as select 1;
Let’s take a look at it in the information_schema.scheduled_queries table by using
select * from information_schema.scheduled_queries;
I will transpose the resultset to describe each column
scheduled_query_id | 1 | Internally, every scheduled query also has a numeric id |
schedule_name | sc1 | The name of the schedule |
enabled | true | True if the schedule is enabled |
cluster_namespace | hive | The namespace thes scheduled query belongs to |
schedule | 0 */10 * * * ? * | The schedule described in QUARTZ cron format |
user | dev | The owner/executor of the query |
query | select 1 | The query being scheduled |
next_execution | 2020-01-29 16:50:00 | Technical column; shows when the next execution should happen |
Note: (schedule_name,cluster_namespace) is unique
information_schema.scheduled_executions
select * from information_schema.scheduled_executions;
FIXME
Configuration
Hive metastore related configuration
- metastore.scheduled.queries.enabled (default: true)
Controls the metastore side support for scheduled queries; forces all HMS scheduled query related endpoints to return with an error - metastore.scheduled.queries.execution.timeout (default: 2 minutes)
In case a scheduled execution is not updated for at least this amount of time; it’s state will be changed to TIMED_OUT by the cleaner task
- metastore.scheduled.queries.execution.maint.task.frequency (default: 1 minute)
Interval of scheduled query maintenance task. Which removes executions above max age; and marks executions as TIMED_OUT if the condition is met
- metastore.scheduled.queries.execution.max.age (default: 30 days)
Maximal age of a scheduled query execution entry before it is removed.
HiveServer2 related configuration
- hive.scheduled.queries.executor.enabled (default: true)
Controls whether HS2 will run scheduled query executor. - hive.scheduled.queries.namespace (default: "hive")
Sets the scheduled query namespace to be used. New scheduled queries are created in this namespace; and execution is also bound to the namespace - hive.scheduled.queries.executor.idle.sleep.time (default: 1 minute)
Time to sleep between querying for the presence of a scheduled query. - hive.scheduled.queries.executor.progress.report.interval (default: 1 minute)
While scheduled queries are in flight; a background update happens periodically to report the actual state of the query. - hive.scheduled.queries.create.as.enabled (default: true)
This option sets the default behaviour of newly created scheduled queries. - hive.security.authorization.scheduled.queries.supported (default: false)
Enable this if the configured authorizer is able to handle scheduled query related calls.
Example
create table t (a integer);
create scheduled query sc1 cron ‘0 */10 * * * ? *’ as insert into t values (1);
Example 1 - general usage of schedules
create table t (a integer);
-- create scheduled queries
create scheduled query sc1 cron ‘0 */10 * * * ? *’ as insert into t values (1);
-- depending on hive.scheduled.queries.create.as.enabled the query might get create in disabled mode
-- it can be enabled using:
alter scheduled query sc1 enabled;
-- inspect scheduled queries using the information_schema
create scheduled query sc2 cron ‘0 */10 * * * ? *’ as analyze table t compute statistics for columns;