...

Due to these shortcomings, HCatalog must rewrite the output of the client job, to establish correct order and partitioning, before it can perform the bulk import, and it has to do so in a secure way.

Proposed Solution

<img src="%ATTACHURLPATH%/HBase_Bulk_Load_2.png" alt="Insecure Bulk Import"/>

  1. The Client M/R job writes to a temporary file, without paying attention to ordering or partitioning.
  2. To prepare this file for import into HBase, a Rewrite Job is launched. This is an M/R job that sorts and partitions the data to match HBase's regions, and writes its output in HFile format.
  3. HBase bulk import moves the files into its directory structure and updates its indexes and metadata.
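
The sort-and-partition step performed by the Rewrite Job can be illustrated with a minimal sketch in plain Java. This is not the HCatalog or HBase implementation: the class `RegionPartitionSketch`, its methods, and the sample region start keys are hypothetical, and it assumes ASCII row keys so that Java string order matches byte order. It only shows the idea of assigning each row key to the region whose start key is the greatest one not exceeding it, and then sorting the keys within each region.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of "sort and partition to match regions"; not HCatalog code.
class RegionPartitionSketch {

    // Returns the index of the region that owns rowKey.
    // Assumption: regionStartKeys is sorted ascending and its first entry is
    // the empty string, so every key falls into some region.
    static int regionFor(String rowKey, List<String> regionStartKeys) {
        int pos = Collections.binarySearch(regionStartKeys, rowKey);
        // Exact hit: rowKey is the first key of that region.
        // Miss: binarySearch returns -(insertionPoint) - 1, and the owning
        // region is the one just before the insertion point.
        return pos >= 0 ? pos : -pos - 2;
    }

    // Groups row keys by region and sorts the keys inside each group.
    static List<List<String>> sortAndPartition(List<String> rowKeys,
                                               List<String> regionStartKeys) {
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < regionStartKeys.size(); i++) {
            buckets.add(new ArrayList<>());
        }
        for (String key : rowKeys) {
            buckets.get(regionFor(key, regionStartKeys)).add(key);
        }
        for (List<String> bucket : buckets) {
            Collections.sort(bucket);
        }
        return buckets;
    }
}
```

With the assumed start keys `""`, `"g"`, and `"p"`, the unordered keys `zebra, apple, mango, grape, pear` end up as `[apple]`, `[grape, mango]`, and `[pear, zebra]`, one sorted group per region.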

...

What if the Rewrite Job runs under the HBase user id? Then the output would naturally be owned by HBase, which is what we want. But this job needs to be able to read the temporary file written by the Client job. Making this file world- or group-accessible is not secure, for the same reasons as above. The solution is to allow the Rewrite Job to impersonate the Client user id when reading the temp file. We can do that by passing the Client's HDFS delegation tokens to the Rewrite Job. The resulting flow is then:

<img src="%ATTACHURLPATH%/HBase_Bulk_Load_3.png" alt="Secure Bulk Import"/>

Implementation

How can we wrap all of this into an OutputFormat? The plan is as follows:

...

  • The Client job writes its temporary SequenceFile as private files, that is, with only user privilege to read, write, and execute (700), into a private temp directory that only the client user can access. Note that for proper isolation, each client user should have their own directory for these temp files.
  • The Rewrite job writes its HFile output to a temp directory owned by HBase (if we wanted to reuse the client temp directory, then we would have to make that directory accessible to HBase). In this case, the same directory can be used across all client ids, because only HBase needs access to it.
  • The Rewrite job is called SequenceFileImporter in the current (insecure) implementation.
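
The per-user private temp directory with mode 700 can be sketched as follows. This is an illustration only: it uses the local file system through `java.nio.file` as a stand-in for HDFS, and the class `PrivateTempDirSketch`, the method name, and the `hcat-<user>-` prefix are hypothetical rather than part of HCatalog. It assumes a POSIX file system.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileAttribute;
import java.nio.file.attribute.PosixFilePermission;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.Set;

// Hypothetical sketch; local-file-system stand-in for the HDFS temp directory.
class PrivateTempDirSketch {

    // Creates a directory under parent that only its owner can read, write,
    // and traverse (rwx------, i.e. mode 700). One directory per client user.
    static Path createPrivateTempDir(Path parent, String user) throws IOException {
        Set<PosixFilePermission> ownerOnly =
                PosixFilePermissions.fromString("rwx------");
        FileAttribute<Set<PosixFilePermission>> attr =
                PosixFilePermissions.asFileAttribute(ownerOnly);
        // The prefix keeps directories of different users apart (assumed naming).
        return Files.createTempDirectory(parent, "hcat-" + user + "-", attr);
    }
}
```

Setting the permissions at creation time, rather than creating the directory and restricting it afterwards, avoids a window in which other users could open it.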

Sequence Diagram

<img src="%ATTACHURLPATH%/bulk-uml.png" alt="Sequence Diagram" width='556' height='562' />

Note that if a failure occurs, the HFile must be deleted by the Rewrite Job. The deletion cannot happen in the Client Job (unlike the deletion of the SequenceFile) because the Client user id will not have the access rights to do so.
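
A minimal sketch of this cleanup responsibility, again using the local file system as a stand-in for HDFS: the job that owns the output directory runs its work and, if the work fails, deletes the partial output before propagating the error. The class `RewriteCleanupSketch`, the `RewriteStep` interface, and the method names are hypothetical and do not correspond to existing HCatalog code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch; names are illustrative, not HCatalog code.
class RewriteCleanupSketch {

    // Stand-in for the Rewrite Job's work: write the HFile output, then import it.
    interface RewriteStep {
        void run(Path hfileDir) throws IOException;
    }

    // Runs the step as the owner of hfileDir. On failure, removes the
    // partially written output and rethrows the original exception.
    static void runWithCleanup(Path hfileDir, RewriteStep step) throws IOException {
        try {
            step.run(hfileDir);
        } catch (IOException | RuntimeException e) {
            deleteRecursively(hfileDir);
            throw e;
        }
    }

    static void deleteRecursively(Path root) throws IOException {
        if (!Files.exists(root)) {
            return;
        }
        List<Path> ordered;
        try (Stream<Path> paths = Files.walk(root)) {
            // Reverse order so that children are deleted before their parents.
            ordered = paths.sorted(Comparator.reverseOrder())
                           .collect(Collectors.toList());
        }
        for (Path p : ordered) {
            Files.delete(p);
        }
    }
}
```

Because the cleanup runs in the same process and under the same user id as the step that wrote the output, it does not need any access rights beyond those the writer already has.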

...