Objectives
Traditionally, one of the most powerful techniques used to accelerate query processing in data warehouses is the pre-computation of relevant summaries or materialized views.
The initial implementation focuses on introducing materialized views and automatic query rewriting based on those materializations in the project. In particular, materialized views can be stored natively in Hive or in other systems such as Druid using custom storage handlers, and they can seamlessly exploit new Hive features such as LLAP acceleration. The optimizer then relies on Apache Calcite to automatically produce full and partial rewritings for a large set of query expressions comprising projection, filter, join, and aggregation operations.
In this document, we provide details about materialized view creation and management in Hive, describe the current coverage of the rewriting algorithm with some examples, and explain how Hive controls important aspects of the life cycle of the materialized views such as the freshness of their data.
Management of materialized views in Hive
In this section, we present the main operations currently available in Hive for materialized view management.
Materialized views creation
The syntax to create a materialized view in Hive is very similar to the CTAS statement syntax and supports common features such as partition columns, custom storage handlers, and table properties.
|
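A minimal sketch of a creation statement, assuming a hypothetical transactional source table sales(sale_date, amount). The table, view, and column names are illustrative only and are not taken from the Hive documentation:

```sql
-- Hypothetical example: sales and sales_by_day_mv are illustrative names.
CREATE MATERIALIZED VIEW IF NOT EXISTS sales_by_day_mv
COMMENT 'Daily revenue summary'
STORED AS ORC
TBLPROPERTIES ('transactional'='true')
AS
SELECT sale_date, SUM(amount) AS total_amount
FROM sales
GROUP BY sale_date;
```

As with CTAS, the query after AS defines both the schema and the initial contents of the materialized view.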
When a materialized view is created, its contents will be automatically populated by the results of executing the query in the statement. The materialized view creation statement is atomic, meaning that the materialized view is not seen by other users until all query results are populated.
By default, materialized views are usable for query rewriting by the optimizer, while the DISABLE REWRITE option can be used to alter this behavior at materialized view creation time.
The default values for the SerDe and storage format, used when they are not specified in the materialized view creation statement (they are optional), are set through the configuration properties hive.materializedview.serde and hive.materializedview.fileformat, respectively.
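As a sketch, these defaults could be set at session level as follows. The values shown (ORC and its SerDe class) are assumptions chosen for illustration; check the defaults of your Hive version before relying on them:

```sql
-- Storage format used when the creation statement has no STORED AS clause.
SET hive.materializedview.fileformat=ORC;
-- SerDe used when the creation statement has no ROW FORMAT clause.
SET hive.materializedview.serde=org.apache.hadoop.hive.ql.io.orc.OrcSerde;
```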
Materialized views can be stored in external systems, e.g., Druid, using custom storage handlers. For instance, the following statement creates a materialized view that is stored in Druid:
Example:
|
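A minimal sketch of such a statement, assuming a hypothetical source table src that already exposes a timestamp column named __time (the Druid storage handler expects a column with that name). The view name and the remaining columns are illustrative:

```sql
-- Hypothetical example: src, druid_wiki_mv, and the column list are assumptions.
CREATE MATERIALIZED VIEW druid_wiki_mv
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
AS
SELECT `__time`, page, `user`, c_added, c_removed
FROM src;
```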
Other operations for materialized view management
Currently, we support the following operations that help manage materialized views in Hive:
|
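The statements below sketch the kind of management operations available, using a hypothetical view named sales_by_day_mv in a hypothetical database named analytics. The list is illustrative rather than exhaustive:

```sql
-- List the materialized views in a database, optionally filtered by a pattern.
SHOW MATERIALIZED VIEWS IN analytics;
SHOW MATERIALIZED VIEWS IN analytics 'sales*';

-- Show metadata about a specific materialized view.
DESCRIBE FORMATTED analytics.sales_by_day_mv;

-- Drop a materialized view.
DROP MATERIALIZED VIEW analytics.sales_by_day_mv;
```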
The functionality of these operations will be extended in the future and more operations may be added.
Materialized view-based query rewriting
Once a materialized view has been created, the optimizer will be able to exploit its definition semantics to automatically rewrite incoming queries using materialized views, and hence, accelerate query execution.
The rewriting algorithm can be enabled and disabled globally using the hive.materializedview.rewriting configuration property (default value is true). In addition, users can selectively enable or disable individual materialized views for rewriting. Recall that, by default, materialized views are enabled for rewriting at creation time. To alter that behavior, the following statement can be used:
|
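A short sketch of both controls, assuming a hypothetical materialized view named sales_by_day_mv:

```sql
-- Global switch for materialized view-based rewriting.
SET hive.materializedview.rewriting=true;

-- Exclude one materialized view from rewriting, then include it again.
ALTER MATERIALIZED VIEW sales_by_day_mv DISABLE REWRITE;
ALTER MATERIALIZED VIEW sales_by_day_mv ENABLE REWRITE;
```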
The rewriting algorithm is part of Apache Calcite and supports queries containing TableScan, Project, Filter, Join, and Aggregate operators. More information about the rewriting coverage can be found in the Apache Calcite documentation on materialized views. The following examples briefly illustrate different rewritings.
Example 1
Consider the database schema created by the following DDL statements:
|
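A minimal sketch of a schema that fits this example, assuming two transactional tables named emps and depts. The exact column lists and types are assumptions:

```sql
-- Employees, with the department they belong to and their hire date.
CREATE TABLE emps (
  empid INT,
  deptno INT,
  name VARCHAR(256),
  salary FLOAT,
  hire_date TIMESTAMP)
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- Departments referenced by emps.deptno.
CREATE TABLE depts (
  deptno INT,
  deptname VARCHAR(256),
  locationid INT)
STORED AS ORC
TBLPROPERTIES ('transactional'='true');
```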
Assume we frequently want to obtain information about employees hired after 2016, over periods of different granularity, together with their departments. We may create the following materialized view:
|
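A sketch of such a materialized view, assuming tables emps(empid, deptno, hire_date, ...) and depts(deptno, deptname, ...). The view is named mv to match the name used later in this example:

```sql
-- Keeps hire_date so that queries over narrower periods can filter on it.
CREATE MATERIALIZED VIEW mv
AS
SELECT empid, deptname, hire_date
FROM emps JOIN depts
  ON (emps.deptno = depts.deptno)
WHERE hire_date >= '2016-01-01';
```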
Then, the following query extracting information about employees that were hired in Q1 2018 is issued to Hive:
|
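A sketch of such a query, assuming the same hypothetical emps and depts tables:

```sql
-- Employees hired in Q1 2018 and their departments.
SELECT empid, deptname
FROM emps
JOIN depts
  ON (emps.deptno = depts.deptno)
WHERE hire_date >= '2018-01-01'
  AND hire_date <= '2018-03-31';
```

The requested period lies entirely within the range covered by a view that keeps employees hired from 2016 onwards, which is what makes a rewriting possible.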
Hive will be able to rewrite the incoming query using the materialized view, including a compensation predicate on top of the scan over the materialization. Though the rewriting happens at the algebraic level, to illustrate this example we include the SQL statement equivalent to the rewriting over mv that Hive uses to answer the incoming query:
|
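Assuming mv exposes the columns empid, deptname, and hire_date, the equivalent statement could look like this sketch:

```sql
-- The join is already materialized; only the compensation predicate remains.
SELECT empid, deptname
FROM mv
WHERE hire_date >= '2018-01-01'
  AND hire_date <= '2018-03-31';
```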
Example 2
For the second example, consider the star schema based on the SSB benchmark created by the following DDL statements:
|
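A reduced sketch of an SSB-style schema with informational constraints. Only a handful of columns per table are shown, and the column selection, types, and constraint names are assumptions:

```sql
CREATE TABLE customer (
  c_custkey BIGINT,
  c_nation STRING,
  PRIMARY KEY (c_custkey) DISABLE RELY)
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

CREATE TABLE dates (
  d_datekey BIGINT,
  d_year INT,
  PRIMARY KEY (d_datekey) DISABLE RELY)
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

CREATE TABLE part (
  p_partkey BIGINT,
  p_brand1 STRING,
  PRIMARY KEY (p_partkey) DISABLE RELY)
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

CREATE TABLE supplier (
  s_suppkey BIGINT,
  s_nation STRING,
  PRIMARY KEY (s_suppkey) DISABLE RELY)
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- Fact table: every foreign key column is NOT NULL and references a dimension.
CREATE TABLE lineorder (
  lo_orderkey BIGINT,
  lo_linenumber INT,
  lo_custkey BIGINT NOT NULL DISABLE RELY,
  lo_partkey BIGINT NOT NULL DISABLE RELY,
  lo_suppkey BIGINT NOT NULL DISABLE RELY,
  lo_orderdate BIGINT NOT NULL DISABLE RELY,
  lo_extendedprice DOUBLE,
  lo_discount DOUBLE,
  lo_revenue DOUBLE,
  lo_supplycost DOUBLE,
  PRIMARY KEY (lo_orderkey, lo_linenumber) DISABLE RELY,
  CONSTRAINT fk_cust FOREIGN KEY (lo_custkey) REFERENCES customer(c_custkey) DISABLE RELY,
  CONSTRAINT fk_part FOREIGN KEY (lo_partkey) REFERENCES part(p_partkey) DISABLE RELY,
  CONSTRAINT fk_supp FOREIGN KEY (lo_suppkey) REFERENCES supplier(s_suppkey) DISABLE RELY,
  CONSTRAINT fk_date FOREIGN KEY (lo_orderdate) REFERENCES dates(d_datekey) DISABLE RELY)
STORED AS ORC
TBLPROPERTIES ('transactional'='true');
```

The constraints are declared with DISABLE because Hive does not enforce them; RELY only tells the optimizer that it may trust them.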
As you can observe, we declare multiple integrity constraints for the database, using the RELY keyword so they are visible to the optimizer. Now assume we want to create a materialization that denormalizes the database contents (consider dims to be the set of dimensions that we will be querying often):
|
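A sketch of such a materialization over SSB-style tables. The columns d_year, c_nation, s_nation, and p_brand1 stand in for dims, and the aliases d_price and profit are assumptions introduced for illustration:

```sql
CREATE MATERIALIZED VIEW mv2
AS
SELECT d_year, c_nation, s_nation, p_brand1,   -- stand-ins for dims
       lo_revenue,
       lo_extendedprice * lo_discount AS d_price,
       lo_revenue - lo_supplycost AS profit
FROM customer, dates, lineorder, part, supplier
WHERE lo_orderdate = d_datekey
  AND lo_partkey = p_partkey
  AND lo_suppkey = s_suppkey
  AND lo_custkey = c_custkey;
```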
The materialized view above may accelerate queries that execute joins among the different tables in the database. For instance, consider the following query:
|
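A sketch of such a query, assuming the standard SSB column names and joining only the fact table with the date dimension:

```sql
SELECT SUM(lo_extendedprice * lo_discount)
FROM lineorder, dates
WHERE lo_orderdate = d_datekey
  AND d_year = 2013;
```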
Though the query does not use all tables present in the materialized view, it may be answered using the materialized view because the joins in mv2 preserve all the rows in the lineorder table (we know this because of the integrity constraints). Hence, the materialized view-based rewriting produced by the algorithm would be the following:
|
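Assuming mv2 exposes the year as d_year and the product lo_extendedprice * lo_discount under a column named d_price (both names are assumptions), the rewriting could be sketched as:

```sql
-- No join is needed: the dimension attribute and the product are precomputed.
SELECT SUM(d_price)
FROM mv2
WHERE d_year = 2013;
```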
Example 3
For the third example, consider the database schema with a single table that stores the edit events produced by a given website:
|
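A minimal sketch of such a table, assuming it is named wiki. The column names and types are illustrative:

```sql
-- One row per edit event.
CREATE TABLE wiki (
  `time` TIMESTAMP,
  page STRING,
  `user` STRING,
  characters_added BIGINT,
  characters_removed BIGINT)
STORED AS ORC
TBLPROPERTIES ('transactional'='true');
```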
For this example, we will use Druid to store the materialized view. Assume we want to execute queries over the table, but we are not interested in any information about the events at a time granularity finer than one minute. We may create the following materialized view, which rolls up the events by the minute:
|
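A sketch of such a materialized view, assuming a source table wiki with columns time, page, characters_added, and characters_removed. The aliases c_added and c_removed are assumptions:

```sql
CREATE MATERIALIZED VIEW mv3
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
AS
SELECT floor(`time` TO MINUTE) AS `__time`,   -- Druid expects a __time column
       page,
       SUM(characters_added) AS c_added,
       SUM(characters_removed) AS c_removed
FROM wiki
GROUP BY floor(`time` TO MINUTE), page;
```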
Then, assume we need to answer the following query that extracts the number of characters added per month:
|
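A sketch of such a query, assuming the same hypothetical wiki table:

```sql
SELECT floor(`time` TO MONTH),
       SUM(characters_added) AS c_added
FROM wiki
GROUP BY floor(`time` TO MONTH);
```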
Hive will be able to rewrite the incoming query using mv3 by rolling up the data of the materialized view to month granularity and projecting the information needed for the query result:
|
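Assuming mv3 stores the per-minute timestamp as __time and the per-minute sum as c_added (both names are assumptions), the rewriting could be sketched as:

```sql
-- Per-minute partial sums are summed again to obtain per-month totals.
SELECT floor(`__time` TO MONTH),
       SUM(c_added)
FROM mv3
GROUP BY floor(`__time` TO MONTH);
```

The roll-up is valid because SUM can be computed from partial sums and every minute belongs to exactly one month.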
Materialized view maintenance
When data in the source tables used by a materialized view changes, e.g., new data is inserted or existing data is modified, we will need to refresh the contents of the materialized view to keep it up-to-date with those changes. Currently, the rebuild operation for a materialized view needs to be triggered by the user. In particular, the user should execute the following statement:
|
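For a hypothetical materialized view named sales_by_day_mv, the rebuild could be triggered as in this sketch:

```sql
ALTER MATERIALIZED VIEW sales_by_day_mv REBUILD;
```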
Hive supports incremental view maintenance, i.e., refreshing only the data that was affected by the changes in the original source tables. Incremental view maintenance decreases the execution time of the rebuild step. In addition, it preserves the LLAP cache for existing data in the materialized view.
By default, Hive will attempt to rebuild a materialized view incrementally, falling back to a full rebuild if that is not possible. The current implementation only supports incremental rebuild when there were INSERT operations over the source tables, while UPDATE and DELETE operations will force a full rebuild of the materialized view.
To execute incremental maintenance, the following conditions should be met:
- The materialized view should only use transactional tables, either micromanaged (insert-only) or ACID.
- If the materialized view definition contains a Group By clause, the materialized view should be stored in an ACID table, since it needs to support the MERGE operation. For materialized view definitions consisting of Scan-Project-Filter-Join, this restriction does not exist.
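The conditions above can be illustrated with the following sketch. All table, view, and column names are hypothetical, and the statements only show the table properties involved:

```sql
-- Full ACID source table.
CREATE TABLE sales (
  sale_date DATE,
  amount DOUBLE)
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- The definition contains GROUP BY, so the view itself is stored as an ACID table.
CREATE MATERIALIZED VIEW sales_by_day_mv
STORED AS ORC
TBLPROPERTIES ('transactional'='true')
AS
SELECT sale_date, SUM(amount) AS total_amount
FROM sales
GROUP BY sale_date;

-- Only INSERT operations touch the source table before the rebuild.
INSERT INTO sales VALUES ('2018-01-15', 10.0);
ALTER MATERIALIZED VIEW sales_by_day_mv REBUILD;
```

An insert-only (micromanaged) source table would instead be declared with 'transactional'='true' together with 'transactional_properties'='insert_only'.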
A rebuild operation acquires an exclusive write lock over the materialized view, i.e., for a given materialized view, only one rebuild operation can be executed at a given time.
Materialized view lifecycle
By default, once the contents of a materialized view are stale, the materialized view will not be used for automatic query rewriting.
However, on some occasions it may be fine to accept stale data, e.g., if the materialized view uses non-transactional tables, so that we cannot verify whether its contents are outdated, but we still want to use the automatic rewriting. For those occasions, we can combine a rebuild operation that runs periodically, e.g., every 5 minutes, with a definition of the required freshness of the materialized view data using the hive.materializedview.rewriting.time.window configuration parameter, for instance:
|
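A sketch of such a setting. The value of 10 minutes is an assumption chosen to be longer than a 5-minute rebuild period:

```sql
-- Accept materialized view contents whose last rebuild is at most 10 minutes old.
SET hive.materializedview.rewriting.time.window=10min;
```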
The parameter value can also be overridden for a specific materialized view by setting it as a table property when the materialization is created.