...
1. Hive as a table storage layer. This is the use case for users of Hive's HCatalog API, such as Apache Pig, MapReduce, and some Massively Parallel Processing (MPP) databases (Cloudera Impala, Facebook Presto, Spark SQL, etc.). In this case, Hive provides a table abstraction and metadata for files on storage (typically HDFS). These users have direct access to HDFS and the metastore server (which provides an API for metadata access). HDFS access is authorized through HDFS permissions. Metadata access needs to be authorized using Hive configuration.
2. Hive as a SQL query engine. This is one of the most common use cases of Hive. This is the 'Hive view' of SQL users and BI tools. This use case has the following two subcategories:
   a. Hive command line users. These users have direct access to HDFS and the Hive metastore, which makes this use case similar to use case 1. Note that the Hive CLI will be officially deprecated soon in favor of Beeline.
   b. ODBC/JDBC and other HiveServer2 API users (Beeline CLI is an example). For these users, all data and metadata access happens through HiveServer2. They don't have direct access to HDFS or the metastore.
...
In use cases 1 and 2a, users have direct access to the data, so Hive configuration does not control data access. HDFS permissions act as one source of truth for table storage access. By enabling Storage Based Authorization in the Metastore Server, you can use this single source of truth and have a consistent data and metadata authorization policy. To control access to metadata objects such as Databases, Tables and Partitions, Storage Based Authorization checks whether you have permission on the corresponding directories on the file system. You can also protect access through HiveServer2 (use case 2b above) by ensuring that queries run as the end user (the hive.server2.enable.doAs option should be "true" in the HiveServer2 configuration; this is the default value).
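As a rough illustration of enabling Storage Based Authorization on the metastore side, the fragment below shows the kind of entries typically placed in hive-site.xml on the Metastore Server. This is a sketch, not a complete or authoritative configuration: the property names and class names are those commonly documented for Hive's Storage Based Authorization, and they should be checked against the documentation for the Hive version in use.

```xml
<!-- Sketch only: verify property and class names against your Hive version. -->
<property>
  <!-- Runs authorization checks before metastore operations are executed -->
  <name>hive.metastore.pre.event.listeners</name>
  <value>org.apache.hadoop.hive.ql.security.authorization.AuthorizationPreEventListener</value>
</property>
<property>
  <!-- Authorizes metadata access based on file system permissions -->
  <name>hive.security.metastore.authorization.manager</name>
  <value>org.apache.hadoop.hive.ql.security.authorization.StorageBasedAuthorizationProvider</value>
</property>
<property>
  <!-- Determines the identity of the user calling the metastore -->
  <name>hive.security.metastore.authenticator.manager</name>
  <value>org.apache.hadoop.hive.ql.security.HadoopDefaultMetastoreAuthenticator</value>
</property>
```

With entries like these in place, a metadata operation on a table should succeed only if the calling user has the matching permission on the table's directory in the file system.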
Note that HDFS ACLs (available in Apache Hadoop 2.4 onwards) give you a lot of flexibility in controlling access to the file system, which in turn provides more flexibility with Storage Based Authorization. This functionality is available as of Hive 0.14 (HIVE-7583).
When relying on Storage Based Authorization to restrict access, you still need to enable one of the security options 2 or 3 listed below, or use FallbackHiveAuthorizer, to protect actions within the HiveServer2 instance.
Fall Back Authorizer
Fall Back Authorizer requires Hive 2.3.4, 3.1.1, or later.
An admin needs to specify the following entries in hiveserver2-site.xml:
<property>
<name>hive.security.authorization.enabled</name>
<value>true</value>
</property>
<property>
<name>hive.security.authorization.manager</name>
<value>org.apache.hadoop.hive.ql.security.authorization.plugin.fallback.FallbackHiveAuthorizerFactory</value>
</property>
FallbackHiveAuthorizerFactory will do the following to mitigate the above-mentioned threat:
- Disallow local file locations in SQL statements except for admin
- Allow "set" only for selected whitelisted parameters
- Disallow dfs commands except for admin
- Disallow "ADD JAR" statement
- Disallow "COMPILE" statement
- Disallow "TRANSFORM" statement
2 SQL Standards Based Authorization in HiveServer2
Although Storage Based Authorization can provide access control at the level of Databases, Tables and Partitions, it cannot control authorization at finer levels such as columns and views, because the access control provided by the file system is at the level of directories and files. A prerequisite for fine grained access control is a data server that is able to provide just the columns and rows that a user needs (or has) access to. In the case of file system access, the whole file is served to the user. HiveServer2 satisfies this condition, as it has an API that understands rows and columns (through the use of SQL), and is able to serve just the columns and rows that your SQL query asked for.
SQL Standards Based Authorization (introduced in Hive 0.13.0, HIVE-5837) can be used to enable fine grained access control. It is based on the SQL standard for authorization, and uses the familiar grant/revoke statements to control access. It needs to be enabled through HiveServer2 configuration.
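The fragment below sketches what enabling this mode in hiveserver2-site.xml might look like. It is an illustration under assumptions rather than a definitive setup: the property and class names are those commonly documented for Hive's SQL Standards Based Authorization, the admin user name "hiveadmin" is hypothetical, and the exact set of required properties should be confirmed against the documentation for your Hive version. Unlike Storage Based Authorization, this mode is usually documented with hive.server2.enable.doAs set to "false", so that queries run as the HiveServer2 user.

```xml
<!-- Sketch only: verify property and class names against your Hive version. -->
<property>
  <name>hive.security.authorization.enabled</name>
  <value>true</value>
</property>
<property>
  <!-- Authorizer that implements the SQL standards based model -->
  <name>hive.security.authorization.manager</name>
  <value>org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory</value>
</property>
<property>
  <!-- Uses the user authenticated by HiveServer2 as the session user -->
  <name>hive.security.authenticator.manager</name>
  <value>org.apache.hadoop.hive.ql.security.SessionStateUserAuthenticator</value>
</property>
<property>
  <!-- Queries run as the HiveServer2 user, not as the end user -->
  <name>hive.server2.enable.doAs</name>
  <value>false</value>
</property>
<property>
  <!-- "hiveadmin" is a hypothetical user name used for illustration -->
  <name>hive.users.in.admin.role</name>
  <value>hiveadmin</value>
</property>
```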
Note that for use case 2a (Hive command line), SQL Standards Based Authorization is disabled. Secure access control is not possible for the Hive command line using an access control policy in Hive, because users have direct access to HDFS and can easily bypass the SQL Standards Based Authorization checks or even disable them altogether. Disabling this mode avoids giving users a false sense of security.
3
...
Authorization using Apache Ranger & Sentry
Apache Ranger and Apache Sentry are Apache projects that use plugins provided by Hive to perform authorization.
The policies are maintained in repositories managed by those projects.
You also get many advanced features by using them. For example, with Ranger you can view and manage policies through a web interface, view auditing information, and apply dynamic row and column level access control (including column masking) based on runtime attributes.
4 Old Default Hive Authorization (Legacy Mode)
Hive Old Default Authorization (the default before Hive 2.0.0) is the authorization mode that has been available in earlier versions of Hive. However, this mode does not have a complete access control model, leaving many security gaps unaddressed. For example, the permissions needed to grant privileges to a user are not defined, and any user can grant themselves access to a table or database.
...
- Hive Default Authorization - deprecated authorization mode / Legacy Mode
- also see the design document and Security