Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: update star join enhancements & task-side hashtables (thanks to Vikram Dixit for review)

...

The optimizer enhancements in Hive 0.11 focus on efficient processing of the joins needed in star schema configurations. The initial work is was limited to star schema joins where all dimension tables after filtering and projecting fit into memory at the same time. ( Scenarios where only some of the dimension tables fit into memory will be handled in future work.)are now implemented as well.

The join optimizations are can be grouped into three parts:

...

  • Execute chains of mapjoins in the operator tree in a single map-only job, when maphints are used.
  • Extend optimization to the auto-conversion case (generating an appropriate backup plan when optimizing).

The following sections describe each of these optimizer enhancements.

Generate Hash Tables on the Task Side

Pros and Cons of Client-Side Hash Tables

Generating the hashtable (or multiple hashtables for multitable joins) on the client machine has drawbacks. (The client machine is the host that is used to run the Hive client and submit jobs.)

  • Data locality: The client machine typically is not a data node. All the data accessed is remote and has to be read via the network.
  • Specs: For the same reason, it is not clear what the specifications of the machine running this processing will be. It might have limitations in memory, hard drive, or CPU that the task nodes do not have.
  • HDFS upload: The data has to be brought back to the cluster and replicated via the distributed cache to be used by task nodes.

Pre-processing the hashtables on the client machine also has some benefits:

  • What is stored in the distributed cache is likely to be smaller than the original table (filter and projection).
  • In contrast, loading hashtables directly on the task nodes using the distributed cache means larger objects in the cache, potentially reducing opportunities for using MAPJOIN.
Task-Side Generation of Hash Tables

When the hashtables are generated completely on the task side, all task nodes have to access the original data source to generate the hashtable. Since in the normal case this will happen in parallel it will not affect latency, but Hive has a concept of storage handlers and having many tasks access the same external data source (HBase, database, etc.) might overwhelm or slow down the source.

...

Further Options for Optimization

...

  • Generate in-memory hashtable completely on the task side. (Future work.)

The following sections describe each of these optimizer enhancements

...

.

Optimize Chains of Map Joins

...

The names describe their uses. This is especially useful for the fact-fact join (query 82 in the TPC DS benchmark).

Generate Hash Tables on the Task Side

Future work will make it possible to generate in-memory hashtables completely on the task side.

Pros and Cons of Client-Side Hash Tables

Generating the hashtable (or multiple hashtables for multitable joins) on the client machine has drawbacks. (The client machine is the host that is used to run the Hive client and submit jobs.)

  • Data locality: The client machine typically is not a data node. All the data accessed is remote and has to be read via the network.
  • Specs: For the same reason, it is not clear what the specifications of the machine running this processing will be. It might have limitations in memory, hard drive, or CPU that the task nodes do not have.
  • HDFS upload: The data has to be brought back to the cluster and replicated via the distributed cache to be used by task nodes.

Pre-processing the hashtables on the client machine also has some benefits:

  • What is stored in the distributed cache is likely to be smaller than the original table (filter and projection).
  • In contrast, loading hashtables directly on the task nodes using the distributed cache means larger objects in the cache, potentially reducing opportunities for using MAPJOIN.
Task-Side Generation of Hash Tables

When the hashtables are generated completely on the task side, all task nodes have to access the original data source to generate the hashtable. Since in the normal case this will happen in parallel it will not affect latency, but Hive has a concept of storage handlers and having many tasks access the same external data source (HBase, database, etc.) might overwhelm or slow down the source.

...

Further Options for Optimization

...

  1. Increase the replication factor on dimension tables.
  2. Use the distributed cache to hold dimension tables.