HCatalog Authentication

This page lists use cases for authentication related to HCatalog and attempts to outline the changes required to enable those use cases.

Background and Terminology

The Hadoop Security (HadoopS) release uses Kerberos to provide authentication. On a secure cluster, the cluster servers (NameNode (NN), JobTracker (JT), DataNode, TaskTracker) are themselves Kerberos (service) principals and end users are user principals; users and these services mutually authenticate to each other using Kerberos tickets. HadoopS uses security tokens called "delegation tokens" (these are NOT Kerberos tickets but a Hadoop-specific security token) to authenticate the map/reduce tasks. At job submission time, once the job client has presented the user's Kerberos ticket to authenticate to the NameNode and JobTracker, it is handed delegation tokens from the NameNode so that the tasks can use these to talk to the NameNode. These delegation tokens are stored in the job's "credential store", and the JobTracker automatically renews them for the job up to a maximum lifetime of 7 days.
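
To make the token flow concrete, here is a minimal sketch of a client pulling HDFS delegation tokens into a job's credential store before submission. It uses later Hadoop 2.x method names for brevity (the security branch contemporary with this page exposed per-filesystem getDelegationToken() calls instead), and the renewer principal is illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.security.Credentials;

    public class DelegationTokenSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "token-example");

            // Ask the NameNode for delegation tokens. The renewer is the
            // principal allowed to renew them (the JobTracker), which it
            // does automatically up to the 7-day maximum lifetime.
            // The renewer principal below is illustrative.
            Credentials creds = job.getCredentials();
            FileSystem fs = FileSystem.get(conf);
            fs.addDelegationTokens("mapred/jt.example.com@EXAMPLE.COM", creds);

            // The credentials are shipped with the job, so tasks can talk
            // to the NameNode without holding a Kerberos ticket.
        }
    }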

Oozie Use Case

Oozie is a service which users use to submit jobs to the HadoopS cluster. It somewhat resembles the HCatalog server, since the HCatalog server also needs to act on behalf of users while accessing the DFS. Users authenticate to Oozie, and the Oozie service then acts on behalf of the user while working with the JobTracker or NameNode. For this to work, both the NameNode and JobTracker need to recognize the "Oozie" principal as a "proxy user" principal (i.e., a principal that can act on behalf of other users). In addition, the NameNode and JobTracker need to know the possible IPs of the proxy user service and the list of users or groups (i.e., all users belonging to a listed group are allowed) on whose behalf the Oozie principal can act. This proxy user list and associated information is maintained in configuration read by the NameNode and JobTracker. Once the user authenticates to Oozie, Oozie authenticates itself to the NN/JT using the Oozie principal and uses UserGroupInformation.doAs() to obtain a JobClient object associated with the real user (it needs the real username for the doAs(), which it obtains from the user's authentication). Through this process, Oozie adds delegation tokens (actually the JobClient code does this in a subsequent submitJob()) for the JT and primary NN into the new JobClient to pass on to the launcher map task for the Pig/MR job. If the Pig script/MR job needs to access more than the primary NameNode, an Oozie parameter should be used to specify the list of NameNodes that need to be accessed, and Oozie will get delegation tokens for all of them through the JobClient.
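
The following is a minimal sketch of that proxy-user pattern, not Oozie's actual code; the hadoop.proxyuser.* properties mentioned in the comments are the standard Hadoop proxy-user settings, but everything else is illustrative:

    import java.security.PrivilegedExceptionAction;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.security.UserGroupInformation;

    public class ProxyUserSketch {
        // The service (Oozie) has already logged in from its keytab as
        // its own principal; it now acts as the real user via doAs().
        public static JobClient clientFor(String realUser, final JobConf conf)
                throws Exception {
            UserGroupInformation proxyUgi = UserGroupInformation.createProxyUser(
                    realUser, UserGroupInformation.getLoginUser());

            // Calls made inside doAs() reach the JT/NN as the real user.
            // This succeeds only if the cluster configuration (the
            // hadoop.proxyuser.oozie.hosts/groups properties) authorizes
            // the service principal to impersonate that user.
            return proxyUgi.doAs(new PrivilegedExceptionAction<JobClient>() {
                public JobClient run() throws Exception {
                    return new JobClient(conf);
                }
            });
        }
    }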

Changes Required in HCatalog

  • HCatalog server will need to run as a proxy user principal. So at deployment time, the configuration of the NN and JT will need to be updated to recognize the "hcat" principal as a "proxy user" principal. An "hcat" net group (similar to Oozie's) will be needed, and all users who want to use HCatalog will need to add themselves to the "hcat" group.
  • HCatalog server will also need to hand out delegation tokens (like the NN) so that the output committer task can use them to authenticate to the HCatalog server to "publish" partitions. Apart from the output committer, Oozie will also request HCatalog delegation tokens and hand them to the corresponding Pig/MapReduce jobs.
  • End users of HCatalog using Pig/Hive/MapReduce/HCatalog CLI (and not using Oozie) would authenticate to HCatalog using Kerberos tickets in the Thrift API calls. As noted in the point above, the output committer task would authenticate to the HCatalog server using the HCatalog delegation token in the publish_partition API call. So the Thrift calls need to support both Kerberos-based and delegation-token-based authentication. There should also be a property that is honored to run the metastore without any authentication; preferably this should be the same property that Hadoop uses for non-secure operation.
  • HCatalog server code should change to wrap all operations in UserGroupInformation.doAs() so that they are performed as the real user. The real user's username is needed to invoke doAs() (hopefully there is some way to get this from the Kerberos ticket with which the user authenticated).
  • HCatOutputFormat will need to get a delegation token from the HCatalog server in checkOutputSpecs() and store it in the Hadoop credential store so that it can be passed to the tasks. Specifically, the OutputCommitter task will use this token to authenticate to the HCatalog server to invoke the publish_partition API call (a sketch follows this list).
  • The JT should renew the HCatalog delegation token so it is kept valid for long running jobs (this might be difficult since the JT would need to make a Thrift call to renew the delegation token). For the short term we will simply set the timeout on these delegation tokens to be long; in the future the JT can handle renewing them.
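
A rough sketch of the checkOutputSpecs() change mentioned above; the metastore client's getDelegationToken() signature varies across Hive versions, and the renewer principal and "hcat" token alias are purely illustrative:

    import org.apache.hadoop.hive.conf.HiveConf;
    import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.security.UserGroupInformation;
    import org.apache.hadoop.security.token.Token;
    import org.apache.hadoop.security.token.TokenIdentifier;

    public class HCatTokenSketch {
        // Called from checkOutputSpecs(): fetch a metastore delegation
        // token for the current user and park it in the job's credential
        // store so tasks (in particular the OutputCommitter) can use it.
        public static void storeToken(JobContext context) throws Exception {
            HiveMetaStoreClient msc = new HiveMetaStoreClient(new HiveConf());
            String owner = UserGroupInformation.getCurrentUser().getShortUserName();
            // Signature varies across Hive versions; the renewer
            // principal is illustrative.
            String tokenStr = msc.getDelegationToken(
                    owner, "mapred/jt.example.com@EXAMPLE.COM");
            msc.close();

            Token<TokenIdentifier> token = new Token<TokenIdentifier>();
            token.decodeFromUrlString(tokenStr);
            // The service field doubles as the alias the committer task
            // will use to select this token; "hcat" is illustrative.
            token.setService(new Text("hcat"));
            context.getCredentials().addToken(new Text("hcat"), token);
        }
    }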

Use Cases with HCatalog

HCatalog Client Running DDL Commands

  • A user does kinit to acquire a Kerberos ticket - this gets him the TGT (ticket granting ticket)
  • The HCatalog client needs to acquire the service ticket to access the HCatalog service (This will happen transparently through HiveMetaStoreClient). This service ticket is used to authenticate the user to the HCatalog server (a client-side sketch follows this list).
  • The HCatalog server, after authenticating the user, does a UserGroupInformation.doAs() call using the real user's username to perform the action requested.
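
To illustrate, a hedged client-side sketch: the configuration property names are the Hive metastore's standard SASL settings, while the principal and table names are made up for the example:

    import org.apache.hadoop.hive.conf.HiveConf;
    import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;

    public class DdlClientSketch {
        public static void main(String[] args) throws Exception {
            HiveConf conf = new HiveConf();
            // With SASL enabled, HiveMetaStoreClient transparently uses
            // the Kerberos TGT obtained via kinit to fetch a service
            // ticket for the metastore principal.
            conf.setBoolean("hive.metastore.sasl.enabled", true);
            conf.set("hive.metastore.kerberos.principal",
                     "hcat/_HOST@EXAMPLE.COM"); // illustrative principal
            HiveMetaStoreClient client = new HiveMetaStoreClient(conf);
            client.dropTable("default", "scratch_table"); // example DDL call
            client.close();
        }
    }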

Pig Script Reading from and Writing to Tables in HCatalog

  • A user does kinit to acquire a Kerberos ticket - this gets him the TGT (ticket granting ticket)
  • The HCatInputFormat needs to acquire the service ticket to access the HCatalog service (This will happen transparently through HiveMetaStoreClient). This service ticket is used to authenticate the user to the HCatalog server.
  • HCatOutputFormat will need to get a delegation token from the HCatalog server in checkOutputSpecs() and store it in the Hadoop credential store so that it can be passed to the tasks. Specifically, the OutputCommitter task will use this token to authenticate to the HCatalog server to invoke the publish_partition API call (the task-side sketch after this list shows how the token might be picked back up).
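
A hedged sketch of that task side: the "hcat" alias matches the illustrative alias used in the checkOutputSpecs() sketch earlier, and hive.metastore.token.signature is the metastore's standard property for selecting which token to present:

    import org.apache.hadoop.hive.conf.HiveConf;
    import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.security.UserGroupInformation;
    import org.apache.hadoop.security.token.Token;

    public class CommitterTokenSketch {
        public static HiveMetaStoreClient clientFromToken(JobContext context)
                throws Exception {
            // Retrieve the metastore delegation token stashed by
            // checkOutputSpecs() under the illustrative "hcat" alias.
            Token<?> token = context.getCredentials().getToken(new Text("hcat"));

            // Attach it to the current user so the Thrift layer can find
            // it, and tell the metastore client which token to present.
            // The client can then invoke the publish_partition call using
            // delegation-token rather than Kerberos authentication.
            UserGroupInformation.getCurrentUser().addToken(token);
            HiveConf conf = new HiveConf();
            conf.set("hive.metastore.token.signature", "hcat");
            return new HiveMetaStoreClient(conf);
        }
    }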

Hive Query Reading from and Writing to Tables in HCatalog

  • A user does kinit to acquire a Kerberos ticket - this gets him the TGT (ticket granting ticket)
  • The Hive client needs to acquire the service ticket to access the HCatalog service (This will happen transparently through HiveMetaStoreClient). This service ticket is used to authenticate the user to the HCatalog server.

Java MapReduce Job Reading from and Writing to Tables in HCatalog

  • Same as Pig use case?

Oozie Running a Pig Script Which Reads from or Writes to Tables in HCatalog

How will Oozie know that the Pig script interacts with HCatalog? This will need some change in Oozie to allow the workflow XML to indicate it.

  • Once Oozie knows that the Pig script may read/write through HCatalog (maybe through some information in the workflow XML), it should also authenticate to the HCatalog server and get the HCatalog delegation token on behalf of the real user (in addition to the usual JT/NN delegation tokens it gets by doing doAs() for creating the JobClient). The HCatalog delegation token should be added to the launcher task so it is available in the map task launching the Pig script.
  • The HCatInputFormat/HCatOutputFormat code will use the delegation tokens already present to authenticate to HCatalog server.
  • The HCatalog delegation token should get sent to the actual map/reduce tasks of the Pig job and also specifically to an OutputCommitter task so that it can use it to publish partitions to the HCatalog server (a sketch of the Oozie-side change follows this list).
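
What the Oozie-side change might look like, as a hedged sketch combining the proxy-user pattern from the Oozie section with the metastore token fetch; all principal names and the token alias are illustrative, and the getDelegationToken() signature varies by Hive version:

    import java.security.PrivilegedExceptionAction;
    import org.apache.hadoop.hive.conf.HiveConf;
    import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.security.Credentials;
    import org.apache.hadoop.security.UserGroupInformation;
    import org.apache.hadoop.security.token.Token;
    import org.apache.hadoop.security.token.TokenIdentifier;

    public class OozieHCatTokenSketch {
        // Fetch an HCatalog delegation token on behalf of the real user
        // and add it to the launcher job's credentials.
        public static void addHCatToken(final String realUser,
                Credentials launcherCreds) throws Exception {
            UserGroupInformation ugi = UserGroupInformation.createProxyUser(
                    realUser, UserGroupInformation.getLoginUser());
            String tokenStr = ugi.doAs(new PrivilegedExceptionAction<String>() {
                public String run() throws Exception {
                    HiveMetaStoreClient msc =
                            new HiveMetaStoreClient(new HiveConf());
                    try {
                        // Renewer principal is illustrative; signature
                        // varies by Hive version.
                        return msc.getDelegationToken(
                                realUser, "mapred/jt.example.com@EXAMPLE.COM");
                    } finally {
                        msc.close();
                    }
                }
            });
            Token<TokenIdentifier> token = new Token<TokenIdentifier>();
            token.decodeFromUrlString(tokenStr);
            token.setService(new Text("hcat")); // illustrative alias
            launcherCreds.addToken(new Text("hcat"), token);
        }
    }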

Oozie Running a Java MR Job Which Reads from or Writes to Tables in HCatalog

How will Oozie know that the Java MR job interacts with HCatalog? This will need some change in Oozie to allow the workflow XML to indicate it.

  • Same as Pig?