
Table of Contents

1. Introduction

1.1 Motivation

1.2 Design Principle

1.3 Comparison with Shark and Spark SQL

1.4 Other Considerations

2. High-Level Functionality

2.1 A New Execution Engine

2.2 Spark Configuration

2.3 Miscellaneous Functionality

3. Hive-Level Design

3.1 Query Planning

3.2 Job Execution

3.3 Design Considerations

Table as RDD

SparkWork

SparkTask

Shuffle, Group, and Sort

Join

Number of Tasks

Local MapReduce Tasks

Semantic Analysis and Logical Optimizations

Job Diagnostics

Counters and Metrics

Explain Statements

Hive Variables

Union

Concurrency and Thread Safety

Build Infrastructure

Mini Spark Cluster

Testing

3.4 Potentially Required Work from Spark

4. Summary

5. Acknowledgement

1. Introduction

We propose modifying Hive to add Spark as a third execution backend (HIVE-7292), parallel to MapReduce and Tez.

...

More information about Spark can be found here:

...

While sortByKey provides no grouping, it is easy to group the keys because rows with the same key arrive consecutively. On the other hand, groupByKey clusters the values for each key into a collection, which naturally fits MapReduce's reducer interface.
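To make the claim concrete, the sketch below shows why a sortByKey-style output is easy to group: since equal keys are consecutive, one linear pass recovers (key, values) groups without any hash table. This is an illustrative model in Python, not Hive or Spark code; the function name `group_sorted` is ours.

```python
from itertools import groupby
from operator import itemgetter

def group_sorted(pairs):
    """Yield (key, [values]) groups from (key, value) pairs that are
    already sorted by key, as a sortByKey output would be.

    Because equal keys are consecutive, a single linear pass suffices;
    this is the grouping a reducer interface expects."""
    for key, kvs in groupby(pairs, key=itemgetter(0)):
        yield key, [v for _, v in kvs]

# A sorted stream, as the reduce side would see it after sortByKey.
sorted_pairs = [("a", 1), ("a", 2), ("b", 3), ("c", 4), ("c", 5)]
groups = list(group_sorted(sorted_pairs))
# → [('a', [1, 2]), ('b', [3]), ('c', [4, 5])]
```

In contrast, a groupByKey-style transformation would hand over the collections directly, at the cost of materializing each group.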

...

Implementing join in the MapReduce world is rather complicated, as Hive's codebase demonstrates. Hive has reduce-side join as well as map-side join (including map-side hash lookup and map-side sorted merge). We will keep Hive's join implementations. However, extra attention needs to be paid to the shuffle behavior (key generation, partitioning, sorting, etc.), since Hive relies extensively on MapReduce's shuffling to implement reduce-side join. It is expected that Spark is, or will be, able to provide flexible control over shuffling, as pointed out in the previous section (Shuffle, Group, and Sort).

See: Hive on Spark: Join Design Master for detailed design.

Number of Tasks

As specified above, Spark transformations such as partitionBy will be used to connect map-side operations to reduce-side operations. The number of partitions can optionally be given for those transformations, which essentially dictates the number of reducers.
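The relationship between the partition count and the number of reducers can be sketched as follows. This is a simplified Python model of hash partitioning (in the spirit of Spark's default HashPartitioner), not actual Hive or Spark code; the function name `partition_by` is ours.

```python
def partition_by(pairs, num_partitions):
    """Split (key, value) pairs into num_partitions buckets by key hash,
    a simplified model of hash partitioning during the shuffle.

    All pairs with the same key land in the same bucket, and each
    bucket is processed by one reduce task, so choosing num_partitions
    fixes how many reducers run."""
    buckets = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        buckets[hash(key) % num_partitions].append((key, value))
    return buckets

# Ten records partitioned for three reducers.
pairs = [(i, i * i) for i in range(10)]
buckets = partition_by(pairs, 3)
# len(buckets) == 3: one bucket per reduce task.
```

Setting the partition count too low limits reduce-side parallelism, while setting it too high produces many small tasks; tuning this number is one of the design considerations carried over from MapReduce.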

...