
...

Hive has a powerful data model that allows users to map logical tables and partitions onto physical directories located on HDFS file systems. As was mentioned earlier, one of the bedrock design principles of this data model is that Hive does not track the individual files located in these directories, and instead delegates this task to the HDFS NameNode. The primary motivation for this restriction is that it allows the Metastore to scale by reducing the file system metadata load. However, problems arise when we try to reconcile this data model with an authorization model that depends on the underlying file system permissions, and which consequently can't ignore the permissions applied to individual files located in those directories.
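As a concrete illustration, the following minimal Java sketch (the database default, the table foo, and a reachable Metastore are all assumptions) asks the Metastore for a table's storage location. Note that what comes back is a single directory path, not a list of data files:

    import org.apache.hadoop.hive.conf.HiveConf;
    import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
    import org.apache.hadoop.hive.metastore.api.Table;

    public class ShowTableLocation {
      public static void main(String[] args) throws Exception {
        HiveMetaStoreClient msc = new HiveMetaStoreClient(new HiveConf());
        Table t = msc.getTable("default", "foo");
        // The Metastore records one directory location per table (and per
        // partition). The data files inside that directory are tracked only
        // by the HDFS NameNode, e.g. /hive/warehouse/foo/part-00000.
        System.out.println(t.getSd().getLocation());
        msc.close();
      }
    }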

HCatalog's Storage Based Authorization model is explained in more detail in the HCatalog documentation, but the following set of quotes provides a good high-level overview:

... when a file system is used for storage, there is a directory corresponding to a database or a table. With this authorization model, the read/write permissions a user or group has for this directory determine the permissions a user has on the database or table.
...
For example, an alter table operation would check if the user has permissions on the table directory before allowing the operation, even if it might not change anything on the file system.
...
When the database or table is backed by a file system that has a Unix/POSIX-style permissions model (like HDFS), there are read(r) and write(w) permissions you can set for the owner user, group and ‘other’. The file system’s logic for determining if a user has permission on the directory or file will be used by Hive.

There are several problems with this approach, the first of which is hinted at by the inconsistency in the preceding quote. To determine whether a particular user has read permission on table foo, HCatalog's HdfsAuthorizationProvider class checks whether the user has read permission on the corresponding HDFS directory /hive/warehouse/foo that contains the table's data. However, in HDFS, read permission on a directory only grants the ability to list the contents of that directory – it has no effect on your ability to read the files contained in the directory.
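The gap is easy to demonstrate with the Hadoop FileSystem API. In the hedged sketch below (the path /hive/warehouse/foo mirrors the example above; everything else is an assumption), listing the table directory succeeds as long as the caller has read (and execute) permission on the directory itself, while opening each data file is checked against that file's own mode bits:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.security.AccessControlException;

    public class DirVsFileRead {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path tableDir = new Path("/hive/warehouse/foo");
        // Listing is authorized against the directory's permissions only.
        for (FileStatus stat : fs.listStatus(tableDir)) {
          // Opening is authorized against the individual file's permissions,
          // which a directory-level check never examines.
          try (FSDataInputStream in = fs.open(stat.getPath())) {
            System.out.println("listed and readable:  " + stat.getPath());
          } catch (AccessControlException e) {
            System.out.println("listed, NOT readable: " + stat.getPath());
          }
        }
      }
    }

A directory-level authorization check can therefore grant "read" on a table whose data files the user cannot actually read.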

Execution container

HCatalog includes a subproject, Templeton, which exposes two sets of REST APIs: one for accessing Hive metadata and one for launching and managing MapReduce jobs. A metadata REST API is something we want for AccessServer, but HCatalog is not the right place for job management: Templeton copied Oozie's code for job submission and management, and we think users should instead submit jobs through Oozie's own REST APIs. The HCatalog plan was to implement JDBC and ODBC on top of the Templeton job-control REST API. That would be significant work (we already have JDBC and ODBC drivers for HiveServer2 that can be used for Pig as well), and it would not allow for interactive JDBC or ODBC usage since Templeton executes each instruction as an Oozie job.
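For reference, a call to Templeton's metadata API looks roughly like the following sketch (the host, port, user, and table name are assumptions; the /templeton/v1/ddl path follows the Templeton documentation):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class TempletonDescribeTable {
      public static void main(String[] args) throws Exception {
        // Fetch the description of table "foo" in database "default".
        URL url = new URL("http://templeton-host:50111/templeton/v1/"
            + "ddl/database/default/table/foo?user.name=alice");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        BufferedReader in = new BufferedReader(
            new InputStreamReader(conn.getInputStream()));
        String line;
        while ((line = in.readLine()) != null) {
          System.out.println(line); // JSON description of the table
        }
        in.close();
      }
    }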

...

  • We will need to provide Pig-specific implementations of the metadata operations defined in the CLIService API, e.g. GetTables, GetSchemas, GetTypeInfo, etc. In some cases we will be able to reuse the Hive version of these operations without modification (e.g. GetSchemas). Other metadata operations such as GetTables can be based on the corresponding Hive versions, but must be modified to filter out catalog objects such as indexes and views that Pig does not support (see the sketch after this list).
  • The Pig version of the ExecuteStatementOperation will likely require the most effort to implement. This class will function as an adaptor between the AccessServer Session API and an instance of the PigServer class.
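As an example of the filtering the GetTables bullet describes, here is a hedged sketch (PigGetTablesOperation is a hypothetical name; the real class would plug into the CLIService Operation hierarchy) that lists only the tables Pig can read, skipping views and index tables:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.hive.metastore.IMetaStoreClient;
    import org.apache.hadoop.hive.metastore.api.Table;

    public class PigGetTablesOperation {
      private final IMetaStoreClient msc;

      public PigGetTablesOperation(IMetaStoreClient msc) {
        this.msc = msc;
      }

      // Return the tables in dbName that Pig can actually read. Views and
      // index tables are filtered out because they have no storage that
      // HCatLoader can scan.
      public List<String> run(String dbName) throws Exception {
        List<String> visible = new ArrayList<String>();
        for (String name : msc.getAllTables(dbName)) {
          Table t = msc.getTable(dbName, name);
          String type = t.getTableType(); // e.g. MANAGED_TABLE, VIRTUAL_VIEW
          if (!"VIRTUAL_VIEW".equals(type) && !"INDEX_TABLE".equals(type)) {
            visible.add(name);
          }
        }
        return visible;
      }
    }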

Finally, red is used in the preceding diagram to highlight HCatalog components we plan to use: the HCatStorer and HCatLoader modules, and the REST API. These classes function as an adaptor layer that makes Hive’s metadata and SerDes accessible to Pig.
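To illustrate that adaptor layer, the following sketch (the table default.foo is an assumption) drives HCatLoader through the PigServer class mentioned above. HCatLoader resolves the table name through Hive's metadata and uses the table's SerDe to deserialize records for Pig, so no paths or schemas need to be spelled out in the script:

    import java.util.Iterator;
    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;
    import org.apache.pig.data.Tuple;

    public class HCatLoaderExample {
      public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.MAPREDUCE);
        // Load a Hive table by name; location and schema both come from
        // the Metastore via HCatLoader.
        pig.registerQuery(
            "A = LOAD 'default.foo' USING org.apache.hcatalog.pig.HCatLoader();");
        pig.registerQuery("B = LIMIT A 10;");
        Iterator<Tuple> it = pig.openIterator("B");
        while (it.hasNext()) {
          System.out.println(it.next());
        }
      }
    }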