Abstract

At the 2018 DataWorks conference in Berlin, Hotels.com presented Waggle Dance, a tool for federating multiple Hive clusters and providing the illusion of a unified data catalog from disparate instances. We’ve been running Waggle Dance in production for well over a year and it has formed a critical part of our data platform architecture and infrastructure.

We believe that this type of functionality will be of increasing importance as Hadoop and Hive workloads migrate to the cloud. While Waggle Dance is one solution, significant benefits could be realized if these kinds of abilities were an integral part of the Hive platform.

This proposal outlines why such a feature is needed in Hive, the benefits gained by offering it as a built-in feature, and a possible implementation. Our proposed implementation draws inspiration from the remote table features found in some traditional RDBMSes, which may already be familiar to you.

The cloud, a rich source of new architectures

Waggle Dance was created to overcome the problems arising from architectural patterns that emerged in cloud deployments of Hive. Unlike the monolithic deployment model typical of on-premises deployments, in the cloud users are able to trivially create multiple cluster instances, either by design or unintentionally through organic growth. This ability to create new clusters in the cloud provides new levels of flexibility and agility to organizations. They are no longer bound to inelastic resource pools, specific platform versions, and lengthy upgrade cycles. Different parts of an organization are free to innovate technologically at their own pace.

However, Hive is designed around the monolithic cluster architecture and provides no means of cross-cluster access to datasets. One Hive cluster cannot concurrently access local datasets and those in another cluster. This results in a set of data silos in the cloud, inhibiting data exploration, discovery, and sharing, and ultimately limiting an organization's ability to realize the full potential of its datasets.

How Waggle Dance works

Waggle Dance provides a federated view of these disparately located datasets, allowing users in one cluster to explore and access datasets in multiple other clusters. Waggle Dance operates as a request-routing metastore proxy service. It implements the Hive metastore Thrift API, so that as far as a workload (ETL, query, analytics tool) is concerned, it is communicating directly with a Hive metastore service instance. Waggle Dance connects to multiple metastore services located in otherwise unconnected clusters. It routes and transforms metadata requests from the workload to the appropriate metastore, using database identifiers encoded in the request payloads, and returns the responses from the different metastores to the workload. These responses typically contain data file paths that the workload then uses to consume data. Typically these paths point to a region-wide object store such as S3, and hence the underlying dataset can be read fairly seamlessly between clusters.

Problems with Waggle Dance

We’ve been successfully using Waggle Dance in production to federate many Hive instances, often bridging different AWS accounts in the same region. It has enabled the sharing of large datasets across different organizational units and has restored the ability to easily explore, discover, and share datasets as we once did on our on-premises, organization-wide Hive cluster. Our user groups include data scientists, analysts, and engineers who interact with the platform using a diverse range of tools including Hive, Spark, Cascading, HS2+JDBC (+ many clients), Qubole, and Databricks.

While we are very happy with the functionality that Waggle Dance provides, there are some drawbacks:

Additional infrastructure requirements

To deliver Waggle Dance as a highly available service we must operate multiple additional load-balanced virtual servers in a cross-availability-zone arrangement. These servers have a similar memory footprint to our HMS instances because they are required to serialize and deserialize the Thrift objects passing between workloads and federated metastore instances. This Waggle Dance infrastructure creates an additional operational burden on our data platform.

Thrift API limits integration opportunities

The HMS Thrift API has proved to be a very convenient integration and interception point for Waggle Dance. However, we are seeing trends and possible needs for alternative means of integrating data catalog services with Hive:

  • Amazon’s Glue catalog service uses a custom HiveMetaStoreClientFactory implementation to deliver metadata to Hive and other workloads, side-stepping any need for Thrift API compatibility. Currently this precludes us from federating Glue catalogs into our unified catalog.
  • Additionally, the recently announced iceberg table format from Netflix provides many metastore-like attributes in a highly scalable design, yet requires minimal metadata-serving infrastructure other than S3. It’s not hard to imagine that this table format could also be integrated with Hive, in part via a HiveMetaStoreClientFactory implementation. What is clear is that a Thrift API is redundant in this case and so could not be used as a federation endpoint for Waggle Dance.
  • While the HMS Thrift API has provided excellent backward compatibility across releases, it has not been perfect in all cases. With some new Hive releases, we’ve had to implement workarounds in Waggle Dance so that we can support federation between different versions of Hive. We anticipate that Waggle Dance would prove quite brittle in the event of any significant breaking change in the HMS Thrift API, and would then need to adopt a different approach or be redesigned.

Federation management is externalised

Because Waggle Dance is a service separate from the HMS, any configuration relating to the metastores and databases we wish to expose in a federated view must be managed separately, via YAML configuration files that are loaded by Waggle Dance on start-up. This creates an administration overhead, and prevents Hive users from creating and managing federated access to the datasets that they require.

Name overloads

As our multiple cloud-based Hive clusters are independent, they have no shared global schema namespace. Consequently it is possible for database and/or table name overloads to occur when federating multiple metastores; some may well include database or table names that are already in use in one or more of the other metastores in the federation. Waggle Dance handles this by either applying a prefix to remote database names, or by allowing administrators to carefully curate a list of databases that should be exposed locally, so that overloads can at least be excluded. However, both approaches have drawbacks. Prefixing creates localized database names, and consequently encourages the creation of non-portable Hive SQL scripts. The curation of imported databases is an operational burden, and is not useful when one wants to access two databases with the same name in a unified view.
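
To illustrate the prefixing approach, consider a hypothetical federation in which a remote metastore has been assigned the prefix uk_ (all database, table, and column names here are illustrative). A database named sales exists in both the local cluster and the remote one, and the prefix is what keeps the two addressable in a single query:

    -- 'sales' is the local database; 'uk_sales' is the identically named database
    -- in the federated remote metastore, disambiguated by the 'uk_' prefix.
    SELECT l.booking_id, r.booking_id
    FROM sales.bookings l
    JOIN uk_sales.bookings r ON l.booking_id = r.booking_id;

Because uk_sales exists only where this particular federation and prefix are configured, any script written against it is tied to that deployment, which is the portability concern noted above.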

Federation with remote databases and tables

While Waggle Dance is working well for us, its design was highly influenced by our need to quickly prove a concept and unblock groups of users. We therefore intentionally avoided any design that might require changes to the HMS and, by implication, extended release cycles. However, in the data warehousing world there are already good examples of features that solve these kinds of federation problems. These features are often tightly integrated into the data warehousing platform, giving users a seamless experience. For example, the Oracle platform includes features such as database links and remote tables that allow catalog entities in a remote database server to be projected into, and accessed from, a local instance.
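
As a point of reference, here is a minimal sketch of the Oracle approach (all names are hypothetical): a database link is defined once, after which tables in the remote instance can be addressed with an @link suffix as if they were local:

    CREATE DATABASE LINK sales_dwh
      CONNECT TO report_user IDENTIFIED BY report_password
      USING 'sales_dwh_tns';

    SELECT * FROM orders@sales_dwh;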

An implementation of 'remotes' in Hive

We therefore propose that the concept of remotes be added to Hive. Practically, this would encapsulate and deliver the proven functionality and utility of Waggle Dance while simultaneously overcoming the deficiencies of the Waggle Dance design. Before exploring the full scope of this idea, let's consider the anatomy of the most typical use case from a user's perspective: creating a link to a table in a remote cluster to enable local access:

    CREATE REMOTE TABLE local_db.local_tbl
    CONNECTED TO remote_db.remote_tbl
    VIA 'org.apache.hadoop.hive.metastore.ThriftHiveMetastoreClientFactory'
    WITH TBLPROPERTIES (
      'hive.metastore.uris' = 'thrift://remote-hms:9083'
    );

Once created, the user can expect to access the table remote_db.remote_tbl, located in a remote Hive cluster, as if it were a cluster-local entity, using the synonym local_db.local_tbl.

  • Firstly notice that this can be an entirely user-driven process, managed from within a Hive client. If desired, the creation of remotes could possibly be authorized by appropriate GRANTs.
  • The user is free to choose the local synonym and target database for the table. This provides a mechanism to re-map names to avoid issues of overloads, or to make names 'contextually relevant' to the domain of the local cluster. However, an option would be provided to simply adopt the remote names when no re-mapping is desired, yielding a simpler CREATE REMOTE TABLE (CRT) statement.
  • The remote table that will be exposed locally is declared via a CONNECTED TO stanza. The fully qualified table name should exist in the target remote Hive cluster.
  • The statement includes a VIA stanza that describes the client implementation that will be used to connect to the remote metadata catalog. This is the fully qualified name of a class that implements org.apache.hadoop.hive.ql.metadata.HiveMetaStoreClientFactory that, along with its dependencies, must be on the classpath. This provides far greater scope for integrations with other metadata catalogs such as AWS Glue and Netflix iceberg, and also provides a migration path for HMS endpoints based around protocols other than Thrift. It also allows the inclusion of additional transport features with appropriate implementations. For example, Waggle Dance provides the option of multi-hop SSH tunneling for HMS Thrift transports, a useful feature when federating data catalogs across accounts and networks.
  • The aforementioned client factories can be configured using the table properties declared in the remote table definition. In this instance they are used to declare the hive.metastore.uris of the HMS in which the remote table resides. However, one can imagine that different client implementations might support a variety of implementation-specific properties; in our SSH tunneling example, we might wish to declare the tunnel configuration.

Once the remote table has been created, the user is free to interact with it as they would any other table (we assume read-only access at this time). As we demonstrated with Waggle Dance, users are able to efficiently and transparently select, join, and describe tables from multiple distributed clusters in a single, local, unified view.
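
For example (table and column names are illustrative), the synonym created above can be joined against an ordinary cluster-local table without any special syntax:

    -- local_db.bookings is a regular local table; local_db.local_tbl is the
    -- synonym for remote_db.remote_tbl created by the CRT statement above.
    SELECT b.booking_date, r.property_name
    FROM local_db.bookings b
    JOIN local_db.local_tbl r ON b.property_id = r.property_id;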

The HiveMetaStoreClientFactory abstraction

The proposed implementation leans heavily on the org.apache.hadoop.hive.ql.metadata.HiveMetaStoreClientFactory and HiveMetaStoreClient abstractions for delivering different metadata catalog client implementations. At this time, this appears to be a construct introduced by Amazon into their EMR platform for the purpose of integrating with their AWS Glue data catalog; this is achieved by specifying the relevant class name under the hive-site.xml key hive.metastore.client.factory.class. It is unclear whether this abstraction and its associated changes will be contributed back to the Apache Hive project. However, irrespective of Amazon's intentions, this or something similar will need to be introduced to allow the integration of different catalog connectors.

Other use cases

Our first example dealt with the simple federation of a single table from one remote metastore. The following examples aim to describe the greater scope of the implementation.

Remote databases

Waggle Dance actually federates databases, and hence sets of tables. We could achieve a similar feat with a CREATE REMOTE DATABASE (CRD) statement. This would expose all tables in the remote database to the local Hive cluster:

    CREATE REMOTE DATABASE local_db_name
    CONNECTED TO remote_db_name
    VIA 'org.apache.hadoop.hive.metastore.ThriftHiveMetastoreClientFactory'
    WITH DBPROPERTIES (
      'hive.metastore.uris' = 'thrift://remote-hms:9083'
    );
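
Once created, every table in the remote database becomes addressable under the local database synonym; for example (table name illustrative):

    SHOW TABLES IN local_db_name;
    SELECT COUNT(*) FROM local_db_name.some_remote_tbl;
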
Statement defaults

The CRT and CRD statements can be simplified if we assume some sensible defaults. Here we assume that if the VIA stanza is not supplied, we default to the HMS Thrift implementation, and that if the CONNECTED TO stanza is omitted, the remote database name is assumed to be equal to the user-supplied local name:

    CREATE REMOTE DATABASE db_name
    WITH DBPROPERTIES (
      'hive.metastore.uris' = 'thrift://remote-hms:9083'
    );

Now, for a remote table, we can also derive the local database name from the user's currently selected database, and expect that the remote table name is equal to the user-supplied local name:

    CREATE REMOTE TABLE tbl_name
    WITH TBLPROPERTIES (
      'hive.metastore.uris' = 'thrift://remote-hms:9083'
    );

SSH Tunneling and bastion hosts

With a suitable connector, remotes could be configured to use an SSH tunnel to access a remote Hive metastore in cases where network restrictions prevent a direct connection from the local cluster to the machine running the Thrift Hive metastore service. An SSH tunnel consists of one or more hops or jump-boxes; the connection between each pair of nodes requires a user and a private key to establish the SSH connection.

    CREATE REMOTE TABLE tbl_name
    VIA 'org.apache.hadoop.hive.metastore.SSHThriftHiveMetastoreClientFactory'
    WITH TBLPROPERTIES (
      'hive.metastore.uris' = 'thrift://metastore.domain:9083',
      'ssh.tunnel.route' = 'bastionuser@bastion-host.domain -> user@cluster-node.domain',
      'ssh.tunnel.private.keys' = '/home/user/.ssh/bastionuser-key-pair.pem,/home/user/.ssh/user-key-pair.pem',
      'ssh.tunnel.known.hosts' = '/home/user/.ssh/known_hosts'
    );

Non-Thrift catalog integrations

Using different HiveMetaStoreClientFactory implementations, we can import database and table entities from other catalog implementations, or from HMS endpoints that use alternative protocols such as REST or gRPC. Consider these illustrative examples:

AWS Glue
    CREATE REMOTE TABLE tbl_name
    VIA 'com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory'
    WITH TBLPROPERTIES (
      -- Glue endpoint configuration
    );

Netflix iceberg
    CREATE REMOTE TABLE tbl_name
    VIA 'xxx.yyy.iceberg.hive.zzz.IcebergTableHiveClientFactory'
    WITH TBLPROPERTIES (
      'iceberg.table.path' = 'an-atomic-store:/tables/tbl_name'
    );

Behaviour of DESCRIBE and SHOW operations

On executing DESCRIBE operations on remote tables and databases, we envisage that the user is returned the description from the remote catalog, with the remote connection configuration appended.
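
As a sketch of how this might look for the table created in our first example (the exact layout of the appended connection details is an open question):

    DESCRIBE FORMATTED local_db.local_tbl;
    -- ... column and storage descriptors as returned by the remote catalog ...
    -- Remote connection parameters appended by the local cluster:
    --   connected_to         remote_db.remote_tbl
    --   via                  org.apache.hadoop.hive.metastore.ThriftHiveMetastoreClientFactory
    --   hive.metastore.uris  thrift://remote-hms:9083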

Summary

The cloud provides a very different environment for the provisioning of Hive clusters. Clusters may be created in minutes by ordinary users, compared with the months that might be required of an operations team building a physical cluster. It is therefore not uncommon, and should be expected, that a single organization will become dependent on multiple clusters. With Hive's current monolithic architecture, these form data silos, which in turn create a barrier to data discovery, exploration, and sharing, decreasing the potential value that organizations can derive from their datasets. With Waggle Dance, we have shown that cross-Hive-cluster federation is both a possible and an effective solution to this problem. However, significant benefits could be gained if such a feature were a core offering of the Hive platform.

The proposed solution of Hive 'remotes' offers feature parity with Waggle Dance while overcoming its deficiencies. It also offers opportunities to integrate with more recent HMS alternatives in a complementary manner.

FAQ

Why not replicate tables between clusters instead?

We could instead replicate tables between our siloed clusters in the cloud. However, there are some significant drawbacks. Replication creates large operational overheads: processes must be deployed, configured, and managed, and greater storage costs are incurred for each new replica. Finally, one needs to ensure that all replicas are in sync, and to detect and communicate information concerning any divergent tables.

How is authentication applied?

Waggle Dance has rudimentary support for the propagation of identity to remote federated metastores: it does this by simply passing along the current UGI principal in the Thrift requests. Clearly, this principal needs to be known in the target HMS. The proposed remoting feature could provide similar functionality. At this time Waggle Dance does not support Kerberos; however, we see no technical reason why this could not be implemented in both Waggle Dance and the remote tables feature.

How is authorization applied?

Currently, the only comprehensive authorization scheme that Hive offers is applied in the HiveServer2 component, not in the HMS. Additionally, HMS-resident authorization schemes (file-based) do not function on the file stores offered by the cloud providers targeted by the remotes feature. Therefore, authorization appears to be a client concern and is not in the scope of this proposal.

What access patterns are supported?

Waggle Dance is primarily used for read-only access to tables in remote Hive clusters, with full control of tables in the local cluster. This proposal assumes a similar set of constraints. However, one can imagine that it might be useful to write to AWS Glue or Netflix iceberg tables for the purposes of incremental migration to those platforms.

Will it work with ACID tables?

Remote tables should work in the context of read-only access. To read ACID tables, one needs only the ValidTxnList from the remote metastore and access to the set of base and delta files. Writing to remote ACID tables does not seem practical, as there is no global transaction manager in this architecture. Note that at this time ACID does not function reliably on S3, although this capability has been promised.
