
...

  1. The file must be written in HFile format. In HCatalog, that can be achieved easily by delegating writes to HFileOutputFormat.
  2. The file must be sorted by HBase's row key. HCatalog has no control over the ordering of a client M/R job's output.
  3. The file must be partitioned in a way that matches the regions of the HBase table, ideally by using TotalOrderPartitioner in the writing M/R job. HCatalog has no influence on the partitioning used by a client job.
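To make requirements 2 and 3 concrete, here is a minimal sketch of how a total-order partitioner could map a row key to a region. It assumes that row keys are ordered as unsigned bytes and that the region boundaries are available as a list of split keys. The class `RegionPartitioner` and its methods are hypothetical names used for illustration only; they are not part of Hadoop, HBase, or HCatalog.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Illustrative only: maps row keys to regions the way a total-order partitioner would. */
final class RegionPartitioner {

    /** Lexicographic order over unsigned bytes (assumed row key order). */
    static final Comparator<byte[]> ROW_ORDER = (a, b) -> {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = (a[i] & 0xff) - (b[i] & 0xff);
            if (cmp != 0) {
                return cmp;
            }
        }
        return a.length - b.length; // a shorter key sorts before its extensions
    };

    /** Sorted start keys of every region except the first one. */
    private final byte[][] splitKeys;

    RegionPartitioner(byte[][] splitKeys) {
        this.splitKeys = splitKeys.clone();
        Arrays.sort(this.splitKeys, ROW_ORDER);
    }

    /** Returns the index of the region whose key range contains rowKey. */
    int partitionFor(byte[] rowKey) {
        int pos = Arrays.binarySearch(splitKeys, rowKey, ROW_ORDER);
        // Exact hit: rowKey is the start key of region pos + 1.
        // Miss: binarySearch returns -(insertionPoint) - 1, and the
        // insertion point is the index of the containing region.
        return pos >= 0 ? pos + 1 : -(pos + 1);
    }
}
```

With split keys `g` and `p`, row key `a` maps to region 0, `g` and `h` map to region 1, and `z` maps to region 2. Records sent to the same partition and sorted with `ROW_ORDER` would then form one correctly ordered file per region.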

Because HCatalog controls neither the ordering nor the partitioning of the client job's output, it must rewrite that output to establish the correct order and partitioning before it can perform the bulk import, and it must do so in a secure way.

Proposed Solution

<img src="%ATTACHURLPATH%/HBase_Bulk_Load_2.png" alt="Insecure Bulk Import"/>

  1. The Client M/R job writes to a temporary file, without paying attention to ordering or partitioning.
  2. To prepare this file for import into HBase, a Rewrite Job is launched. This is an M/R job that sorts and partitions the data to match HBase's regions, and writes its output in HFile format.
  3. HBase bulk import moves the files into its directory structure and updates its indexes and metadata.

...

What if the Rewrite Job runs under the HBase user id? Then the output would naturally be owned by HBase, which is what we want. But this job needs to be able to read the temporary file written by the Client job. Making this file world/group accessible is not secure for the same reasons as above. The solution is to allow the Rewrite Job to impersonate the Client user id for reading the temp file. We can do that by passing the Client's HDFS delegation tokens to the Rewrite job. The resulting flow is then:

<img src="%ATTACHURLPATH%/HBase_Bulk_Load_3.png" alt="Secure Bulk Import"/>

Implementation

...

  1. Implement an extension of HFileOutputFormat that calls HBase bulk import in its OutputCommitter. If that call fails, it will delete the HFiles and fail the job. This OutputFormat is used by the Rewrite job to write the HFiles and import them into HBase.
  2. Implement the Rewrite Map/Reduce job, which reads the SequenceFile, sorts and partitions it, and converts it into HFile format using HFileOutputFormat. To produce the right partitioning, this job employs a TotalOrderPartitioner matching the regions of the HBase table. This job writes its output with the HBase system user id. To read its input SequenceFile, it needs to impersonate the Client user id. This can be done by passing the Client job's delegation tokens to the Rewrite job.
  3. Implement an HBase coprocessor endpoint that allows the HBaseBulkOutputFormat's OutputCommitter to start the Rewrite job, running as the HBase system user. The Client delegation token must either a) be passed to the coprocessor as part of the API call, or b) be obtained by the coprocessor using a doAs() block (which would require that HBase is configured as a Kerberos super user). This coprocessor call will start the Rewrite job and block until the job is finished.
  4. Implement a modified SequenceFileOutputFormat, which will be used by the Client job to write. We name this the HBaseBulkOutputFormat. The main modification from SequenceFileOutputFormat is that its OutputCommitter starts the Rewrite job using the above coprocessor, waits for it to finish (or fail), and then deletes the SequenceFile.
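The commit-time control flow of steps 1 and 4 can be sketched as follows. This is an illustration under assumptions, not the actual implementation: it uses the local filesystem in place of HDFS, and `BulkCommitSketch`, `BulkImporter`, and `RewriteJobLauncher` are hypothetical stand-ins for the HBase bulk import call and the coprocessor endpoint. None of them is a real Hadoop or HBase API.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

/** Illustrative control flow only; none of these types are Hadoop or HBase APIs. */
final class BulkCommitSketch {

    /** Hypothetical stand-in for the HBase bulk import call. */
    interface BulkImporter {
        void importHFiles(Path hfileDir) throws IOException;
    }

    /** Hypothetical stand-in for the coprocessor call that runs the Rewrite job and blocks. */
    interface RewriteJobLauncher {
        void runAndWait(Path sequenceFile) throws IOException;
    }

    /** Step 1: commit of the Rewrite job. On import failure, delete the HFiles and fail. */
    static void commitRewriteJob(Path hfileDir, BulkImporter importer) throws IOException {
        try {
            importer.importHFiles(hfileDir);
        } catch (IOException e) {
            // Cleanup happens here because only the HBase user owns these files.
            deleteRecursively(hfileDir);
            throw new IOException("bulk import failed, HFiles deleted", e);
        }
    }

    /** Step 4: commit of the Client job. Run the Rewrite job, then delete the SequenceFile. */
    static void commitClientJob(Path sequenceFile, RewriteJobLauncher launcher) throws IOException {
        try {
            launcher.runAndWait(sequenceFile);
        } finally {
            // The temporary file is removed whether the Rewrite job finished or failed.
            Files.deleteIfExists(sequenceFile);
        }
    }

    /** Deletes a directory tree, children before parents. */
    static void deleteRecursively(Path root) throws IOException {
        if (!Files.exists(root)) {
            return;
        }
        try (Stream<Path> paths = Files.walk(root)) {
            paths.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
    }
}
```

The split mirrors the ownership boundary in the design: each committer only deletes files that its own user id can access.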

None of this implementation depends on HCatalog, so it could be used without HCatalog. Notes:

  • The Client job writes its temporary SequenceFile as private files, that is, with read, write, and execute permission for the owning user only (700), into a private temp directory that only the client user can access. Note that for proper isolation, each client user should have their own directory for these temp files.
  • The Rewrite job writes its HFile output to a temp directory owned by HBase (if we wanted to reuse the client temp directory, we would have to make that directory accessible to HBase). In this case, the same directory can be used across all client ids, because only HBase needs access to it.
  • The Rewrite job is called SequenceFileImporter in the current (insecure) implementation.
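As an illustration of the first note, the following sketch creates a per-user directory with mode 700. It is an assumption-laden stand-in: it uses the local POSIX filesystem through `java.nio.file` rather than HDFS, and the class name `PrivateTempDir` and the one-directory-per-user layout under a common parent are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileAttribute;
import java.nio.file.attribute.PosixFilePermission;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.Set;

/** Illustrative only: a per-user temp directory that only its owner can access. */
final class PrivateTempDir {

    /** rwx------ is the 700 mode described in the notes. */
    static final Set<PosixFilePermission> OWNER_ONLY =
            PosixFilePermissions.fromString("rwx------");

    /** Creates parent/user (hypothetical layout) with owner-only permissions. */
    static Path createFor(Path parent, String user) throws IOException {
        FileAttribute<Set<PosixFilePermission>> attr =
                PosixFilePermissions.asFileAttribute(OWNER_ONLY);
        Path dir = Files.createDirectories(parent.resolve(user), attr);
        // If the directory already existed, the attribute above was not
        // applied, so set the mode explicitly as well.
        Files.setPosixFilePermissions(dir, OWNER_ONLY);
        return dir;
    }
}
```

Calling `PrivateTempDir.createFor(parent, "alice")` on a POSIX filesystem yields a directory whose permission string is `rwx------`. Setting the mode only restricts the permission bits; the directory is owned by whichever user runs the code.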

Sequence Diagram

<img src="%ATTACHURLPATH%/bulk-uml.png" alt="Sequence Diagram" width='556' height='562' />

Note that the deletion of the HFiles in case of failure must happen in the Rewrite job. It cannot happen in the Client job (as the deletion of the SequenceFile does) because the Client user id does not have the access rights to do so.

...