
Table of Contents

1. Introduction

1.1 Motivation

1.2 Design Principle

1.3 Comparison with Shark and Spark SQL

1.4 Other Considerations

2. High-Level Functionality

2.1 A New Execution Engine

2.2 Spark Configuration

2.3 Miscellaneous Functionality

3. Hive-Level Design

3.1 Query Planning

3.2 Job Execution

3.3 Design Considerations

Table as RDD

SparkWork

SparkTask

Shuffle, Group, and Sort

Join

Number of Tasks

Local MapReduce Tasks

Semantic Analysis and Logical Optimizations

Job Diagnostics

Counters and Metrics

Explain Statements

Hive Variables

Union

Concurrency and Thread Safety

Build Infrastructure

Mini Spark Cluster

Testing

3.4 Potentially Required Work from Spark

4. Summary

5. Acknowledgement

1. Introduction

We propose modifying Hive to add Spark as a third execution backend (HIVE-7292), parallel to MapReduce and Tez.

...

More information about Spark can be found here:

...

While sortByKey provides no grouping, it is easy to group the keys because rows with the same key arrive consecutively. On the other hand, groupByKey clusters the values for each key into a collection, which naturally fits MapReduce's reducer interface.
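To make the claim concrete, the sketch below shows why a sortByKey-style output is easy to group: since equal keys are consecutive, one linear pass recovers (key, values) groups without any hash table. This is an illustrative model in Python, not Hive or Spark code; the function name `group_sorted` is ours.

```python
from itertools import groupby
from operator import itemgetter

def group_sorted(pairs):
    """Yield (key, [values]) groups from (key, value) pairs that are
    already sorted by key, as a sortByKey output would be.

    Because equal keys are consecutive, a single linear pass suffices;
    this is the grouping a reducer interface expects."""
    for key, kvs in groupby(pairs, key=itemgetter(0)):
        yield key, [v for _, v in kvs]

# A sorted stream, as the reduce side would see it after sortByKey.
sorted_pairs = [("a", 1), ("a", 2), ("b", 3), ("c", 4), ("c", 5)]
groups = list(group_sorted(sorted_pairs))
# → [('a', [1, 2]), ('b', [3]), ('c', [4, 5])]
```

In contrast, a groupByKey-style transformation would hand over the collections directly, at the cost of materializing each group.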

...

Implementing join in the MapReduce world is rather complicated, as Hive's codebase demonstrates. Hive has reduce-side join as well as map-side join (including map-side hash lookup and map-side sorted merge). We will keep Hive's join implementations. However, extra attention needs to be paid to the shuffle behavior (key generation, partitioning, sorting, etc.), since Hive relies extensively on MapReduce's shuffling to implement reduce-side join. It is expected that Spark is, or will be, able to provide flexible control over shuffling, as pointed out in the previous section (Shuffle, Group, and Sort).

See: Hive on Spark: Join Design Master for detailed design.

Number of Tasks

As specified above, Spark transformations such as partitionBy will be used to connect map-side operations to reduce-side operations. The number of partitions can optionally be given for those transformations, which essentially dictates the number of reducers.
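The relationship between the partition count and the number of reducers can be sketched as follows. This is a simplified Python model of hash partitioning (in the spirit of Spark's default HashPartitioner), not actual Hive or Spark code; the function name `partition_by` is ours.

```python
def partition_by(pairs, num_partitions):
    """Split (key, value) pairs into num_partitions buckets by key hash,
    a simplified model of hash partitioning during the shuffle.

    All pairs with the same key land in the same bucket, and each
    bucket is processed by one reduce task, so choosing num_partitions
    fixes how many reducers run."""
    buckets = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        buckets[hash(key) % num_partitions].append((key, value))
    return buckets

# Ten records partitioned for three reducers.
pairs = [(i, i * i) for i in range(10)]
buckets = partition_by(pairs, 3)
# len(buckets) == 3: one bucket per reduce task.
```

Setting the partition count too low limits reduce-side parallelism, while setting it too high produces many small tasks; tuning this number is one of the design considerations carried over from MapReduce.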

...