Contributors (alphabetical): David Capwell, Francis Liu, Andreas Neuman (author), Mithun Radakrishnan

Secure Bulk Loads for HBase

Motivation

This document discusses the implementation of secure bulk loads into HBase through HCatalog. HBase supports bulk import of files from HDFS under the following conditions:

  1. The file must be written in HFile format. In HCatalog, that can be achieved easily by delegating writes to HFileOutputFormat.
  2. The file must be sorted by HBase's row key. HCatalog has no control over the ordering of a client M/R job's output.
  3. The file must be partitioned in a way that matches the regions of the HBase table, ideally by using TotalOrderPartitioning in the writing M/R job. HCatalog has no influence on the partitioning used by a client job.

Because HCatalog cannot guarantee the last two conditions, it must rewrite the output of the client job, to establish correct order and partitioning, before it can perform the bulk import, and it has to do so in a secure way.
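The ordering and partitioning requirements can be illustrated with a small plain-Java sketch (no Hadoop dependencies; the split keys and row keys here are made up, and partitionFor only mimics what a TotalOrderPartitioning over HBase region boundaries would do):

```java
import java.util.*;

public class RewriteSketch {
    // Hypothetical region split keys; the first region implicitly starts at "".
    static final String[] SPLITS = {"g", "n", "t"};

    // Mimics region-matched partitioning: find the region whose key range
    // contains the row key, via binary search over the sorted split keys.
    static int partitionFor(String rowKey) {
        int idx = Arrays.binarySearch(SPLITS, rowKey);
        // binarySearch returns (-(insertion point) - 1) when the key is absent.
        return idx >= 0 ? idx + 1 : -(idx + 1);
    }

    public static void main(String[] args) {
        // Unsorted client output, as it would land in the temporary file.
        List<String> rows = new ArrayList<>(List.of("zebra", "apple", "kiwi", "mango"));
        Collections.sort(rows);                 // condition 2: sorted by row key
        Map<Integer, List<String>> byRegion = new TreeMap<>();
        for (String r : rows)                   // condition 3: region-matched partitions
            byRegion.computeIfAbsent(partitionFor(r), k -> new ArrayList<>()).add(r);
        System.out.println(byRegion);
    }
}
```

In the real rewrite job, this sort happens in the M/R shuffle and the partition boundaries are read from the live HBase table.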

Proposed Solution

  1. The Client M/R job writes to a temporary file, without paying attention to ordering or partitioning.
  2. To prepare this file for import into HBase, a Rewrite Job is launched. This is a M/R job that sorts and partitions the data to match HBase's regions, and writes its output in HFile format.
  3. HBase bulk import moves the files into its directory structure and updates its indexes and metadata.

Security

Note that for this to work, the rewritten file must be readable and writable by the HBase system user, and in addition, all parent directories must be executable. A simple way to achieve that is to make the file world-readable and writable (and the directories world-executable), but that would obviously open the door to security breaches. Instead of granting world access, we could use group privileges for an HBase system group - then only members of the HBase system group can read or write the file. However, in order to use this group id, the client user itself also has to be in that group - and hence all users who can do bulk writes to HBase can read and write each other's HFiles.

What if the Rewrite Job runs under the HBase user id? Then the output would naturally be owned by HBase, which is what we want. But this job needs to be able to read the temporary file written by the Client job. Making this file world- or group-accessible is not secure, for the same reasons as above. The solution is to allow the Rewrite Job to impersonate the Client user id for reading the temp file. We can do that by passing the Client's HDFS delegation tokens to the Rewrite job. The resulting flow is illustrated in the sequence diagram below.
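As a toy model of this token flow (the Token record and canRead method are invented for illustration and are not the Hadoop security API): the client hands the Rewrite job a token scoped to its temp file, so a process running as the HBase user can read that one file without the file being group- or world-readable.

```java
import java.util.*;

public class TokenSketch {
    // A delegation token: proof that 'owner' authorized reads on 'path'.
    record Token(String path, String owner) {}

    static final Map<String, String> FILE_OWNER = new HashMap<>(); // path -> owner

    // A process running as 'user' may read a file it owns,
    // or one covered by a token issued by the file's owner.
    static boolean canRead(String user, String path, Token token) {
        if (user.equals(FILE_OWNER.get(path))) return true;
        return token != null
            && token.path().equals(path)
            && token.owner().equals(FILE_OWNER.get(path));
    }

    public static void main(String[] args) {
        FILE_OWNER.put("/tmp/client/part-00000", "alice");      // client's private temp file
        Token t = new Token("/tmp/client/part-00000", "alice"); // passed to the Rewrite job
        System.out.println(canRead("hbase", "/tmp/client/part-00000", null)); // false
        System.out.println(canRead("hbase", "/tmp/client/part-00000", t));    // true
    }
}
```

In real Hadoop, the client's HDFS delegation tokens would be placed in the Rewrite job's credentials so its tasks can authenticate to the NameNode as the client.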

Implementation

How can we wrap all of this into an OutputFormat? The plan is as follows:

  1. Implement an extension of HFileOutputFormat that calls HBase bulk import in its OutputCommitter. If that call fails, it will delete the HFiles and fail the job. This OutputFormat is used by the rewrite job to write the HFiles and import them into HBase.
  2. Implement the Rewrite Map/Reduce job, which reads the SequenceFile, sorts and partitions it, and converts it into HFile format using HFileOutputFormat. In order to produce the right partitioning, this job employs a TotalOrderPartitioning matching the regions of the HBase table. This job writes its output with the HBase system user id. In order to read its input SequenceFile, it needs to impersonate the Client user id. This can be done by passing the Client job's delegation tokens to the rewrite job.
  3. Implement an HBase coprocessor endpoint that allows the HBaseBulkOutputFormat's output committer to start the Rewrite job, running as the HBase system user. The Client delegation token must either a) be passed to the coprocessor as part of the API call, or b) obtained by the coprocessor using a doAs() block (which would require that HBase is configured as a Kerberos super user). This coprocessor call will start the Rewrite job and block until the job is finished.
  4. Implement a modified SequenceFileOutputFormat, which will be used by the Client job to write. We name this the HBaseBulkOutputFormat. The main modification from SequenceFileOutputFormat is that its OutputCommitter starts the Rewrite job using the above coprocessor, waits for it to finish (or fail), and then deletes the SequenceFile.

This implementation has no dependency on HCatalog and could be used on its own. Notes:

  • The Client job writes its temporary SequenceFile as private files, that is, with only user privilege to read, write, and execute (700), into a private temp directory that only the client user can access. Note that for proper isolation, each client user should have its own directory for these temp files.
  • The Rewrite job writes its HFile output to a temp directory owned by HBase (if we wanted to reuse the client temp directory, we would have to make that directory accessible to HBase). In this case, the same directory can be used across all client ids, because only HBase needs access to it.
  • The Rewrite job is called SequenceFileImporter in the current (insecure) implementation.
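On a POSIX filesystem, creating such a private (700) staging directory looks roughly like the following sketch; the directory name prefix is arbitrary, and for HDFS one would instead use the Hadoop FileSystem API with a "700" FsPermission:

```java
import java.io.IOException;
import java.nio.file.*;
import java.nio.file.attribute.*;
import java.util.Set;

public class PrivateTempDir {
    // Create a per-user temp directory readable, writable, and executable
    // only by its owner (mode 700), as prescribed for the client's
    // SequenceFile staging area in the notes above.
    public static Path create(String user) throws IOException {
        FileAttribute<Set<PosixFilePermission>> perms =
            PosixFilePermissions.asFileAttribute(
                PosixFilePermissions.fromString("rwx------"));
        return Files.createTempDirectory("bulkload-" + user + "-", perms);
    }

    public static void main(String[] args) throws IOException {
        Path dir = create("alice");
        // Group and other bits are absent, so no other user can enter the directory.
        System.out.println(Files.getPosixFilePermissions(dir));
    }
}
```

Because group and other bits are cleared, even members of an HBase system group cannot read the client's staged files; only the delegation-token mechanism grants the Rewrite job access.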

Sequence Diagram

Note that the deletion of the HFiles in case of failure must happen in the Rewrite Job. It cannot happen in the Client Job (like the deletion of the SequenceFile) because the Client user id does not have the access rights to do so.

Alternatives Considered

  • Run HBase as root. That would allow it to move and chown any file or directory. Dismissed because it is unlikely to get security certification.
  • Use an HBase group. Run the rewrite job as the client user id, and give group read and write permissions to the HBase group. This would allow HBase to manipulate the files. But HBase would not be able to chown or chgrp, that is, all files remain owned by the client user. Plus, every client would have to be in the HBase group, and hence all clients would have read and write access to each other's data. Thus insecure.