Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Minor Edits - changed #LLAP to LLAP

...

The diagram below shows an example execution with #LLAPLLAP. Tez AM orchestrates overall execution. The initial stage of the query is pushed into #LLAPLLAP, and large shuffle is performed in their own containers. Multiple queries and applications can access #LLAP LLAP concurrently.

Image RemovedImage Added

Persistent daemon

To facilitate caching and JIT optimization, and to eliminate most of the startup costs, a daemon runs on the worker nodes on the cluster. The daemon handles I/O, caching, and query fragment execution.

  • These nodes will be are stateless. Any request to an #LLAP LLAP node contains the data location and metadata. It processes local and remote locations; locality is the caller’s responsibility (YARN).
  • Recovery/resiliency. Failure and recovery is simplified because any data node can still be used to process any fragment of the input data. The Tez AM can thus simply rerun failed fragments on the cluster.
  • Communication between nodes. #LLAP LLAP nodes are able to share data (e.g., fetching partitions, broadcasting fragments). This is realized with the same mechanisms used in Tez.

Execution Engine

#LLAP LLAP will work within existing, process-based Hive execution to preserve the scalability and versatility of Hive. It will not replace the existing execution model but enhance it.

  • The daemons are optional. Hive will continue to work without them and will also be able to bypass them even if they are deployed and operational. Feature parity with regard to language features will be maintained.
  • External orchestration and execution engines. #LLAP LLAP is not an execution engine (like MR or Tez). Overall execution will be scheduled and monitored by existing Hive execution engine such as Tez; transparently over both #LLAP LLAP nodes, as well as regular containers. Obviously, #LLAP LLAP level of support will depend on each individual execution engine (starting with Tez). MapReduce support is not planned, but other engines may be added later. Other frameworks like Pig will also have the choice of using #LLAP LLAP daemons.
  • Partial execution. The result of the work performed by an #LLAP LLAP daemon can either form part of the result of Hive query, or be passed on to external Hive tasks, depending on the query.
  • Resource management. YARN will remain responsible for the management and allocation of resources. The YARN container delegation model will be used for users to transfer allocated resources to #LLAPLLAP. To avoid the limitations of JVM memory settings, we will keep cached data, as well as large buffers for processing (e.g., group by, joins), off-heap. This way, the daemon can use a small amount of memory, and additional resources (i.e., CPU and memory) will be assigned based on workload.

...

For partial execution as described above, #LLAP LLAP nodes will execute “query fragments” such as filters, projections, data transformations, partial aggregates, sorting, bucketing, hash joins/semi-joins, etc. Only Hive code and blessed UDFs will be accepted in #LLAPLLAP. No code will be localized and executed on the fly. This is done for stability and security reasons.

  • Parallel execution. The node will allow parallel execution for multiple query fragments from different queries and sessions.
  • Interface. Users can access #LLAP LLAP nodes directly via client API. They will be able to specify relational transformations and read data via record-oriented streams.

...

YARN will be used to obtain resources for different workloads. Once resources (CPU, memory, etc) have been obtained from YARN for a specific workload, the execution engine can choose to delegate these resources to #LLAPLLAP, or to launch Hive executors in separate processes. Resource enforcement via YARN has the advantage of ensuring that nodes do not get overloaded, either by #LLAP LLAP or by other containers. The daemons themselves will be under YARN’s control.

ACID Support

#LLAP LLAP will be aware of transactions. The merging of delta files to produce a certain state of the tables will be performed before the data is placed in cache.

Multiple versions are possible and the request will specify which version is to be used. This has the benefit of doing the merge async and only once for cached data, thus avoiding the hit on the operator pipeline.

Security

#LLAP LLAP servers are a natural place to enforce access control at a more fine-grained level than “per file”. Since the daemons know which columns and records are processed, policies on these objects can be enforced. This is not intended to replace the current mechanisms, but rather to enhance and open them up to other applications as well.

...

Hive Contributor Meetup Presentation

Try Hive LLAP

 

 

 

 

 

Save

Save

Save

Save