Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

The syntax to create a materialized view in Hive is very similar to the CTAS statement syntax, supporting common features such as partition columns, custom storage handler, or passing table properties.

Code Block
sql
sql
CREATE MATERIALIZED VIEW [IF NOT EXISTS] [db_name.]materialized_view_name
  [DISABLE REWRITE]
  [COMMENT materialized_view_comment]
  [
PARTITIONED ON (col_name, ...)]
  [ROWCLUSTERED FORMATON row_format]
    [STORED AS file_format]
      (col_name, ...) | DISTRIBUTED ON (col_name, ...) SORTED ON (col_name, ...)]
  [
    [ROW FORMAT row_format]
    [STORED AS file_format]
      | STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]
  ]
  [LOCATION hdfs_path]
  [TBLPROPERTIES (property_name=property_value, ...)]
AS
<query>;

...

Consider the database schema created by the following DDL statements:

Code Block
sql
sql
<DDL statements>

Assume we want to obtain frequently information about employees that were hired in different period granularities after 2016 and their departments. We may create the following materialized view:

...

CREATE MATERIALIZED VIEW mv1
AS
SELECT empid, deptname, hire_date
FROM emps JOIN depts
  ON (emps.deptno = depts.deptno)
WHERE hire_date >= '2016-01-01';
CREATE TABLE emps (
  empid INT,
  deptno INT,
  name VARCHAR(256),
  salary FLOAT,
  hire_date TIMESTAMP)
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

CREATE TABLE depts (
  deptno INT,
  deptname VARCHAR(256),
  locationid INT)
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

Assume we want to obtain frequently information about employees that were hired in different period granularities after 2016 and their departments. We may create the following materialized view:

Code Block
sql
sql
CREATE MATERIALIZED VIEW mv1
AS
SELECT empid, deptname, hire_date
FROM emps JOIN depts
  ON (emps.deptno = depts.deptno)
WHERE hire_date >= '2016-01-01';

Then, the following query extracting information about employees that were hired in Q1 2018 is issued to Hive

Then, the following query extracting information about employees that were hired in Q1 2018 is issued to Hive:

...

SELECT empid, deptname
FROM emps
JOIN depts
  ON (emps.deptno = depts.deptno)
WHERE hire_date >= '2018-01-01'
    AND hire_date <= '2018-03-31';

Hive will be able to rewrite the incoming query using the materialized view, including a compensation predicate on top of the scan over the materialization. Though the rewriting happens at the algebraic level, to illustrate this example, we include the SQL statement equivalent to the rewriting using the mv used by Hive to answer the incoming query:

Code Block
sql
sql
SELECT empid, deptname
FROM mv1
WHERE hire_date >= '2018-01-01'
    AND hire_date <= '2018-03-31';

...

For the second example, consider the database schema created by the following DDL statements:

...

<DDL statements>
 deptname
FROM emps
JOIN depts
  ON (emps.deptno = depts.deptno)
WHERE hire_date >= '2018-01-01'
    AND hire_date <= '2018-03-31';

Hive will be able to rewrite the incoming query using the materialized view, including a compensation predicate on top of the scan over the materialization. Though the rewriting happens at the algebraic level, to illustrate this example, we include the SQL statement equivalent to the rewriting using the mv used by Hive to answer the incoming query:

Code Block
sql
sql
SELECT empid, deptname
FROM mv1
WHERE hire_date >= '2018-01-01'
    AND hire_date <= '2018-03-31';

Example 2

For the second example, consider the star schema based on the SSB benchmark created by the following DDL statements:

Code Block
sql
sql
CREATE TABLE `customer`(
  `c_custkey` BIGINT, 
  `c_name` STRING, 
  `c_address` STRING, 
  `c_city` STRING, 
  `c_nation` STRING, 
  `c_region` STRING, 
  `c_phone` STRING, 
  `c_mktsegment` STRING,
  PRIMARY KEY (`c_custkey`) DISABLE RELY)
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

CREATE TABLE `dates`(
  `d_datekey` BIGINT, 
  `d_date` STRING, 
  `d_dayofweek` STRING, 
  `d_month` STRING, 
  `d_year` INT, 
  `d_yearmonthnum` INT, 
  `d_yearmonth` STRING, 
  `d_daynuminweek` INT,
  `d_daynuminmonth` INT,
  `d_daynuminyear` INT,
  `d_monthnuminyear` INT,
  `d_weeknuminyear` INT,
  `d_sellingseason` STRING,
  `d_lastdayinweekfl` INT,
  `d_lastdayinmonthfl` INT,
  `d_holidayfl` INT,
  `d_weekdayfl`INT,
  PRIMARY KEY (`d_datekey`) DISABLE RELY)
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

CREATE TABLE `part`(
  `p_partkey` BIGINT, 
  `p_name` STRING, 
  `p_mfgr` STRING, 
  `p_category` STRING, 
  `p_brand1` STRING, 
  `p_color` STRING, 
  `p_type` STRING, 
  `p_size` INT, 
  `p_container` STRING,
  PRIMARY KEY (`p_partkey`) DISABLE RELY)
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

CREATE TABLE `supplier`(
  `s_suppkey` BIGINT, 
  `s_name` STRING, 
  `s_address` STRING, 
  `s_city` STRING, 
  `s_nation` STRING, 
  `s_region` STRING, 
  `s_phone` STRING,
  PRIMARY KEY (`s_suppkey`) DISABLE RELY)
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

CREATE TABLE `lineorder`(
  `lo_orderkey` BIGINT, 
  `lo_linenumber` int, 
  `lo_custkey` BIGINT not null DISABLE RELY,
  `lo_partkey` BIGINT not null DISABLE RELY,
  `lo_suppkey` BIGINT not null DISABLE RELY,
  `lo_orderdate` BIGINT not null DISABLE RELY,
  `lo_ordpriority` STRING, 
  `lo_shippriority` STRING, 
  `lo_quantity` DOUBLE, 
  `lo_extendedprice` DOUBLE, 
  `lo_ordtotalprice` DOUBLE, 
  `lo_discount` DOUBLE, 
  `lo_revenue` DOUBLE, 
  `lo_supplycost` DOUBLE, 
  `lo_tax` DOUBLE, 
  `lo_commitdate` BIGINT, 
  `lo_shipmode` STRING,
  PRIMARY KEY (`lo_orderkey`) DISABLE RELY,
  CONSTRAINT fk1 FOREIGN KEY (`lo_custkey`) REFERENCES `customer_n1`(`c_custkey`) DISABLE RELY,
  CONSTRAINT fk2 FOREIGN KEY (`lo_orderdate`) REFERENCES `dates_n0`(`d_datekey`) DISABLE RELY,
  CONSTRAINT fk3 FOREIGN KEY (`lo_partkey`) REFERENCES `ssb_part_n0`(`p_partkey`) DISABLE RELY,
  CONSTRAINT fk4 FOREIGN KEY (`lo_suppkey`) REFERENCES `supplier_n0`(`s_suppkey`) DISABLE RELY)
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

As you can observe, we declare multiple integrity constraints for the database, using the RELY keyword so they are visible to the optimizer. Now assume we want to create a materialization that denormalizes the database contentscontents (consider dims to be the set of dimensions that we will be querying often):

Code Block
sql
sql
CREATE MATERIALIZED VIEW mv2
AS
SELECT <dims>,
    lo_revenue,
    lo_extendedprice * lo_discount AS d_price,
    lo_revenue - lo_supplycost
FROM customer, dates, lineorder, part, supplier
WHERE lo_orderdate = d_datekey
    AND lo_partkey = p_partkey
    AND lo_suppkey = s_suppkey
    AND lo_custkey = c_custkey;

...

For the third example, consider the database schema with a single table that stores the edit events produced by a given website:website:

Code Block
sql
sql
CREATE TABLE `wiki` (
  `time` TIMESTAMP, 
  `page` STRING, 
  `user` STRING, 
  `characters_added` BIGINT,
  `characters_removed` BIGINT)
STORED AS ORC
TBLPROPERTIES ('transactional'='true');
Code Block
sqlsql
<DDL statements>

For this example, we will use Druid to store the materialized view. Assume we want to execute queries over the table, however we are not interested on any information about the events at a higher time granularity level than a minute. We may create the following materialized view that rolls up the events by the minute:

Code Block
sql
sql
CREATE MATERIALIZED VIEW mv3
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
AS
SELECT floor(time to minute)) as `__time`, page,
    SUM(characters_added) AS c_added,
    SUM(characters_removed) AS c_removed
FROM wiki
GROUP BY floor(time to minute), page;

...

Materialized view maintenance

When data in the sources source tables used by a materialized view changes, e.g., new data is inserted or existing data is modified, we will need to refresh the contents of the materialized view to keep it up-to-date with those changes. Currently, the rebuild operation for a materialized view needs to be triggered by the user. In particular, the user should execute the following statement:

Code Block
sql
sql
ALTER MATERIALIZED VIEW [db_name.]materialized_view_name REBUILD;

Hive supports incremental view maintenance, i.e., only refresh data that was affected by the changes in the original source tables. Incremental view maintenance will decrease the rebuild step execution time. In addition, it will preserve LLAP cache for existing data in the materialized view.Hive supports incremental view maintenance, i.e., only refresh data that was affected by the changes in the original source tables. Incremental view maintenance will decrease the rebuild step execution time. In addition, it will preserve LLAP cache for existing data in the materialized view. To execute incremental view maintenance, the

By default, Hive will attempt to rebuild a materialized view incrementally, falling back to full rebuild if it is not possible. Current implementation only supports incremental rebuild when there were INSERT operations over the source tables, while UPDATE and DELETE operations will force a full rebuild of the materialized view.

To execute incremental maintenance, following conditions should be met:

  • The materialized view should only use transactional tables, either micromanaged or ACID.
  • If the materialized view definition contains a Group By clause, the materialized view should be stored in an ACID table, since it needs to support MERGE operation. For materialized view definitions consisting of Scan-Project-Filter-Join, this restriction does not exist.  

A rebuild operation acquires an exclusive write lock over the materialized view, i.e., for a given materialized view, only one rebuild operation can be executed at a given timeBy default, Hive will attempt to rebuild a materialized view incrementally, falling back to full rebuild if it is not possible. Current implementation only supports incremental rebuild when there were INSERT operations over the source tables, while UPDATE and DELETE operations will force a full rebuild of the materialized view.

Materialized view lifecycle

...

Jira
serverASF JIRA
columnskey,summary,type,created,updated,due,assignee,reporter,priority,status,resolution
maximumIssues20
jqlQueryproject = Hive AND component = "Materialized views" and resolution = Unresolved ORDER BY key ASC
serverId5aa69414-a9e9-3523-82ec-879b028fb15b
This wiki will be updated with information about materialized views before Apache Hive 3.0 is released.