Join Optimization

Table of Contents

For a general discussion of Hive joins including syntax, examples, and restrictions, see the Joins wiki doc.

...

Star Schema Example

No Format


Select count(*) cnt
From store_sales ss
     join household_demographics hd on (ss.ss_hdemo_sk = hd.hd_demo_sk)
     join time_dim t on (ss.ss_sold_time_sk = t.t_time_sk)
     join store s on (s.s_store_sk = ss.ss_store_sk)
Where
     t.t_hour = 8
     t.t_minute >= 30
     hd.hd_dep_count = 2
order by cnt;

...

Hive supports MAPJOINs, which are well suited for this scenario – at least for dimensions small enough to fit in memory. Before release 0.11, a MAPJOIN could be invoked either through an optimizer hint:

No Format
select /+ MAPJOIN(time_dim) / count(*) from store_sales join time_dim on (ss_sold_time_sk = t_time_sk)

or via auto join conversion:

No Format
set hive.auto.convert.join=true; select count(*) from store_sales join time_dim on (ss_sold_time_sk = t_time_sk)

...

The following query will produce two separate map-only jobs when executed:

No Format
select /+ MAPJOIN(time_dim, date_dim) / count(*) from store_sales join time_dim on (ss_sold_time_sk = t_time_sk) join date_dim on (ss_sold_date_sk = d_date_sk) where t_hour = 8 and d_year = 2002

...

When auto join is enabled, there is no longer a need to provide the map-join hints in the query. The auto join option can be enabled with two configuration parameters:

No Format
set hive.auto.convert.join.noconditionaltask = true; set hive.auto.convert.join.noconditionaltask.size = 10000;

The default for hive.auto.convert.join.noconditionaltask is false true which means auto conversion is enabled. (Originally the default was false – see HIVE-3784 – but it was changed to true by HIVE-4146 before Hive 0.11.0 was released.)

false which means auto conversion is disabled.

...

For example, the previous query just becomes:

No Format
select count(*) from store_sales join time_dim on (ss_sold_time_sk = t_time_sk) join date_dim on (ss_sold_date_sk = d_date_sk) where t_hour = 8 and d_year = 2002

...

The following configuration settings enable the conversion of an SMB to a map-join SMB:

No Format
set hive.auto.convert.sortmerge.join=true; set hive.optimize.bucketmapjoin = true; set hive.optimize.bucketmapjoin.sortedmerge = true; set hive.auto.convert.sortmerge.join.noconditionaltask=true;

There is an option to set the big table selection policy using the following configuration:

No Format
set hive.auto.convert.sortmerge.join.bigtable.selection.policy = org.apache.hadoop.hive.ql.optimizer.TableSizeBasedBigTableSelectorForAutoSMJ;

...

The available selection policies are:

No Format


org.apache.hadoop.hive.ql.optimizer.AvgPartitionSizeBasedBigTableSelectorForAutoSMJ (default)
org.apache.hadoop.hive.ql.optimizer.LeftmostBigTableSelectorForAutoSMJ
org.apache.hadoop.hive.ql.optimizer.TableSizeBasedBigTableSelectorForAutoSMJ

...

Space shortcuts

Child pages

Versions Compared

Old Version 5

New Version 6

Key

Join Optimization

Star Schema Example

Space shortcuts

Child pages

Page History

Versions Compared

Old Version 5

New Version 6

Key

Join Optimization

Star Schema Example