...
This document applies only to the Metastore in Hive 3.0 and later releases. For Hive 0.x, 1.x, and 2.x releases, please see the Metastore Administration document.
...
The definitions of Hive objects such as databases, tables, and functions are stored in the Metastore. Depending on how the system is configured, statistics and authorization records may also be stored there. Hive and other execution engines use this data at runtime to determine how to efficiently execute user queries.
...
The Metastore can be configured to embed the Apache Derby RDBMS or to connect to an external RDBMS. The Metastore itself can be embedded entirely in a user process or run as a service for other processes to connect to. Each of these options is discussed in turn below.
Changes From Hive 2 to Hive 3
Beginning in Hive 3.0, the Metastore can be run without the rest of Hive being installed. It is provided as a separate release so that non-Hive systems can easily integrate with it. (It is, however, still included in the Hive release for convenience.) Making the Metastore a standalone service involved changing a number of configuration parameter names and tool names. All of the old configuration parameters and tool names still work, in order to maximize backwards compatibility. This document covers both the old and new names. New functionality will be added only under the new names.
...
Parameter | Hive 2.0 Parameter | Default Value | Description |
---|---|---|---|
metastore.warehouse.dir | hive.metastore.warehouse.dir | | URI of the default location for tables in the default catalog and database. |
metastore.authorization.storage.checks | hive.metastore.authorization.storage.checks | false | Should the metastore do authorization checks against the underlying storage? For example for a drop-partition it would disallow the drop if the user does not have permissions to delete the corresponding directory from the storage. |
datanucleus.schema.autoCreateAll | datanucleus.schema.autoCreateAll | false | Auto creates the necessary schema in the RDBMS at startup if one does not exist. Set this to false after creating it once. To enable auto create also set hive.metastore.schema.verification=false. Auto creation is not recommended in production; run |
metastore.schema.verification | hive.metastore.schema.verification | true | Enforce metastore schema version consistency. When set to true: verify that the version information stored in the Metastore schema is compatible with the version of the Metastore jar. Also disable automatic schema migration. Users are required to manually migrate the schema after a Hive upgrade, which ensures proper metastore schema migration. |
metastore.hmshandler.retry.attempts | hive.hmshandler.retry.attempts | 10 | The number of times to retry a call to the metastore when there is a connection error. |
metastore.hmshandler.retry.interval | hive.hmshandler.retry.interval | 2 sec | Time between retry attempts. |
metastore.log4j.file | hive.log4j.file | none | Log4j configuration file. If unset, the Metastore looks for metastore-log4j2.properties in $METASTORE_HOME/conf. |
metastore.stats.autogather | hive.stats.autogather | true | Whether to automatically gather basic statistics during insert commands. |
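As an illustration of how the old and new names in the table relate, the following minimal Python sketch resolves a parameter by its new name and falls back to the Hive 2 name. It is not the Metastore's own lookup code: the `lookup` helper and the `OLD_NAME_FOR` mapping are hypothetical, and the rule that the new name wins when both are set is an assumption made for this example.

```python
# New (Hive 3) names mapped to their old (Hive 2) names, copied from the table above.
OLD_NAME_FOR = {
    "metastore.warehouse.dir": "hive.metastore.warehouse.dir",
    "metastore.schema.verification": "hive.metastore.schema.verification",
    "metastore.hmshandler.retry.attempts": "hive.hmshandler.retry.attempts",
    "metastore.hmshandler.retry.interval": "hive.hmshandler.retry.interval",
    "metastore.stats.autogather": "hive.stats.autogather",
}


def lookup(conf, new_name, default=None):
    """Return the value for new_name, falling back to its Hive 2 name.

    Assumption for this sketch: the new name takes precedence when both are set.
    """
    if new_name in conf:
        return conf[new_name]
    old_name = OLD_NAME_FOR.get(new_name)
    if old_name is not None and old_name in conf:
        return conf[old_name]
    return default


# A configuration that still uses a Hive 2 name is resolved through the new name.
legacy_conf = {"hive.stats.autogather": "false"}
print(lookup(legacy_conf, "metastore.stats.autogather", default="true"))
```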
...
RDBMS
Option 1: Embedding Derby
The Metastore can be run with Apache Derby embedded. This is the default configuration. However, it is not intended for use beyond simple testing. In this configuration only one client can use the Metastore, and changes are not durable beyond the life of the client (since it uses an in-memory version of Derby).
Option 2: External RDBMS
For any durable, multi-user installation, an external RDBMS should be used to store Metastore objects. The Metastore connects to an external RDBMS via JDBC. Any jars required by the JDBC driver for your RDBMS should be placed in METASTORE_HOME/lib or explicitly passed on the command line. The following values need to be configured to connect the Metastore to an RDBMS. (Note: these configuration parameters did not change between Hive 2 and 3.)
Configuration Parameter | Comment |
---|---|
javax.jdo.option.ConnectionURL | Connection URL for the JDBC driver |
javax.jdo.option.ConnectionDriverName | JDBC driver class |
javax.jdo.option.ConnectionUserName | User name to connect to the RDBMS with, often 'hive' is used |
javax.jdo.option.ConnectionPassword | Password to connect to the RDBMS with. The Metastore uses Hadoop's CredentialProvider API so this does not have to be stored in clear text in your configuration file. |
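The connection parameters above are ordinary configuration properties. As a rough sketch, the Python below renders a set of them in the `<configuration>`/`<property>` XML layout used by Hadoop-style site files such as metastore-site.xml. The `render_site_xml` helper, the host name, and the schema name are hypothetical values chosen for illustration. The password is left out on purpose, since it can be supplied through Hadoop's CredentialProvider API instead of clear text.

```python
import xml.etree.ElementTree as ET


def render_site_xml(properties):
    """Render name/value pairs as a Hadoop-style configuration XML document."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    # ET.indent only exists on Python 3.9+; older versions emit a single line.
    if hasattr(ET, "indent"):
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


# Hypothetical MySQL host and schema; substitute the values for your RDBMS.
site_xml = render_site_xml({
    "javax.jdo.option.ConnectionURL": "jdbc:mysql://db.example.com:3306/metastore",
    "javax.jdo.option.ConnectionDriverName": "com.mysql.jdbc.Driver",
    "javax.jdo.option.ConnectionUserName": "hive",
})
print(site_xml)
```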
Supported RDBMSs
As the Metastore uses DataNucleus to communicate with the RDBMS, theoretically any storage option supported by DataNucleus would work with the Metastore. However, we only test and recommend the following:
RDBMS | Minimum Version | javax.jdo.option.ConnectionURL | javax.jdo.option.ConnectionDriverName |
---|---|---|---|
MS SQL Server | 2008 R2 | jdbc:sqlserver://<HOST>:<PORT>;DatabaseName=<SCHEMA> | com.microsoft.sqlserver.jdbc.SQLServerDriver |
MySQL | 5.6.17 | jdbc:mysql://<HOST>:<PORT>/<SCHEMA> | com.mysql.jdbc.Driver |
MariaDB | 5.5 | jdbc:mysql://<HOST>:<PORT>/<SCHEMA> | org.mariadb.jdbc.Driver |
Oracle* | 11g | jdbc:oracle:thin:@//<HOST>:<PORT>/xe | oracle.jdbc.OracleDriver |
Postgres | 9.1.13 | jdbc:postgresql://<HOST>:<PORT>/<SCHEMA> | org.postgresql.Driver |
<HOST> = The host the RDBMS is on.
<PORT> = The port the RDBMS is listening for JDBC connections on.
<SCHEMA> = The schema (or database) that the Metastore stores its tables in.
*The Oracle values shown are for Oracle's thin JDBC client. If you are using a different client the ConnectionURL and ConnectionDriverName values will differ.
Special Note: When using Postgres you should set the configuration parameter metastore.try.direct.sql.ddl (previously hive.metastore.try.direct.sql.ddl) to false, to avoid failures in certain operations.
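To show how the table's placeholders and the Postgres note fit together, here is a small Python sketch that fills in the URL templates and adds the Postgres-specific setting. The `connection_properties` helper and the short keys (`mssql`, `mysql`, `mariadb`, `postgres`) are hypothetical labels for this example, not names the Metastore recognizes. Oracle is omitted because its URL in the table does not follow the same `<SCHEMA>` pattern.

```python
# URL templates and driver classes copied from the supported RDBMS table.
SUPPORTED = {
    "mssql": ("jdbc:sqlserver://{host}:{port};DatabaseName={schema}",
              "com.microsoft.sqlserver.jdbc.SQLServerDriver"),
    "mysql": ("jdbc:mysql://{host}:{port}/{schema}", "com.mysql.jdbc.Driver"),
    "mariadb": ("jdbc:mysql://{host}:{port}/{schema}", "org.mariadb.jdbc.Driver"),
    "postgres": ("jdbc:postgresql://{host}:{port}/{schema}", "org.postgresql.Driver"),
}


def connection_properties(rdbms, host, port, schema):
    """Build the JDBC-related Metastore properties for one supported RDBMS."""
    url_template, driver = SUPPORTED[rdbms]
    props = {
        "javax.jdo.option.ConnectionURL": url_template.format(
            host=host, port=port, schema=schema),
        "javax.jdo.option.ConnectionDriverName": driver,
    }
    if rdbms == "postgres":
        # Per the special note above, direct SQL DDL should be disabled on Postgres.
        props["metastore.try.direct.sql.ddl"] = "false"
    return props


# Hypothetical host, port, and schema.
for name, value in connection_properties(
        "postgres", "db.example.com", 5432, "metastore").items():
    print(f"{name} = {value}")
```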
Installing and Upgrading the Metastore Schema
The Metastore provides the schematool utility to work with the Metastore schema in the RDBMS. For a full list of options see the -help option of the tool. The following summarizes what the tool can do. In most cases schematool can read the configuration from the metastore-site.xml file, though the configuration can also be passed as options on the command line.
-initSchema: install a new schema. This should be used when first setting up a Metastore.
-upgradeSchema: upgrade to the newly installed version. For 3.0, upgrades can be done from 1.2, 2.0, 2.1, 2.2, and 2.3 to 3.0. If you need to upgrade from before 1.2, use an older version of Hive's schematool to first upgrade your schema to 1.2, then use the current Metastore version to upgrade to 3.0.
-createUser: create the Metastore user and schema. This does not install the tables; it only creates the database user and schema. This will likely not work in a production environment because you probably will not have permission to create users and schemas, so you will likely need your DBA to do this for you.
-validate: check that your Metastore schema is correct for its recorded version.
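The options above could be assembled into a command line as in the following Python sketch, which only builds the argument list and does not run anything. The `schematool_command` helper is hypothetical. The -dbType option, which names the RDBMS type, is not described in this section, so confirm it and its accepted values against the tool's -help output for your release.

```python
# The schema actions summarized above.
ACTIONS = ("-initSchema", "-upgradeSchema", "-createUser", "-validate")


def schematool_command(action, db_type=None, executable="schematool"):
    """Build (but do not run) a schematool argument list for one action."""
    if action not in ACTIONS:
        raise ValueError(f"unsupported schematool action: {action}")
    command = [executable]
    if db_type is not None:
        # -dbType is not covered in this section; verify it with -help.
        command += ["-dbType", db_type]
    command.append(action)
    return command


# First-time setup followed by a validation pass, printed as shell-style lines.
for action in ("-initSchema", "-validate"):
    print(" ".join(schematool_command(action, db_type="postgres")))
```

The resulting list can be handed to `subprocess.run` once the executable path and connection settings are known to be correct for your installation.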
Running the Metastore
Embedded Mode
The Metastore can be embedded directly into a process as a library. This is often done with HiveServer2 to avoid an additional network hop for metadata operations. It can also be done when using the Hive CLI or any other process. This mode is the default and will be used anytime the configuration parameter metastore.uris is not set.
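The rule that embedded mode applies whenever metastore.uris is unset can be pictured with this minimal Python sketch. The `metastore_mode` helper is hypothetical, the example URI is a placeholder, and treating an empty or whitespace-only value as unset is an assumption of this example rather than documented behavior.

```python
def metastore_mode(conf):
    """Classify a configuration as 'embedded' or 'remote' based on metastore.uris."""
    # Assumption: an empty or whitespace-only value counts as "not set".
    uris = conf.get("metastore.uris", "").strip()
    return "remote" if uris else "embedded"


print(metastore_mode({}))
print(metastore_mode({"metastore.uris": "thrift://metastore.example.com:9083"}))
```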
Except in the case of HiveServer2, using this mode does raise a few concerns. First, having many clients will put a burden on the backing RDBMS since each client will have its own set of connections.
Security Considerations
Metastore Server
...
Security: EXECUTE_SET_UGI, metastore.authorization.storage.checks
Setting up Caching: CACHED*, CATALOGS_TO_CACHE & AGGREGATE_STATS_CACHE*
...