...

  1. The file must be written in HFile format. In HCatalog, that can be %GREEN% achieved easily %ENDCOLOR% by delegating writes to !HFileOutputFormat.
  2. The file must be sorted by HBase's row key. HCatalog has %RED% no control over the ordering %ENDCOLOR% of a client M/R job's output.
  3. The file must be partitioned in a way that matches the regions of the HBase table, ideally by using !TotalOrderPartitioner in the writing M/R job. HCatalog has %RED% no influence on the partitioning %ENDCOLOR% used by a client job.

Because HCatalog controls neither the ordering nor the partitioning of the client job's output, it must rewrite that output to establish the correct order and partitioning before it can perform the bulk import, and it must do so securely.
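The rewrite step can be illustrated with a minimal sketch. It is not HCatalog's implementation: the function names `partition_for` and `rewrite` are hypothetical, records are held in memory instead of being read from a SequenceFile, and the region boundaries are passed in as a plain list of split keys. It only shows the two properties the bulk import needs: each record lands in the partition of the region that owns its row key, and each partition is sorted by row key (Python compares `bytes` lexicographically as unsigned bytes, the same order HBase uses for row keys).

```python
from bisect import bisect_right
from typing import List, Tuple

Record = Tuple[bytes, object]  # (row key, value); hypothetical record shape


def partition_for(row_key: bytes, split_keys: List[bytes]) -> int:
    """Return the index of the region whose key range contains row_key.

    split_keys holds the sorted start keys of every region except the first,
    so N split keys describe N + 1 regions. A key equal to a split key
    belongs to the region that starts at that key.
    """
    return bisect_right(split_keys, row_key)


def rewrite(records: List[Record], split_keys: List[bytes]) -> List[List[Record]]:
    """Group records by target region, then sort each group by row key."""
    partitions: List[List[Record]] = [[] for _ in range(len(split_keys) + 1)]
    for row_key, value in records:
        partitions[partition_for(row_key, split_keys)].append((row_key, value))
    for part in partitions:
        part.sort(key=lambda kv: kv[0])
    return partitions


# Unsorted client output, rewritten for a table with regions split at "g" and "n".
client_output = [(b"m", 1), (b"a", 2), (b"z", 3), (b"g", 4), (b"b", 5)]
rewritten = rewrite(client_output, [b"g", b"n"])
```

In a real M/R job the same effect would come from a total-order partitioner plus the shuffle sort; the sketch only makes the required output shape concrete.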

Proposed Solution

<img src="%ATTACHURLPATH%/HBase_Bulk_Load_2.png" alt="Insecure Bulk Import"/>

...

<img src="%ATTACHURLPATH%/HBase_Bulk_Load_3.png" alt="Secure Bulk Import"/>

Implementation

...

  • The Client job writes its temporary SequenceFile output as private files, that is, with read, write, and execute permission for the user only (700), into a private temp directory that only the client user can access. Note that for proper isolation, each client user should have their own directory for these temp files.
  • The Rewrite job writes its HFile output to a temp directory owned by HBase (if we wanted to reuse the client temp directory, we would have to make that directory accessible to HBase). In this case, the same directory can be used across all client IDs, because only HBase needs access to it.
  • The Rewrite job is called SequenceFileImporter in the current (insecure) implementation.
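The permission layout for the client's temp files can be sketched as follows. This is an illustration on a local POSIX file system standing in for HDFS, not the HCatalog code; the helper names `make_private_dir`, `write_private_file`, and `mode_of`, the user names, and the file name are all hypothetical. It shows one directory per client user, with both the directory and the files inside it restricted to mode 700.

```python
import os
import stat
import tempfile


def make_private_dir(parent: str, user: str) -> str:
    """Create (or reuse) a per-user temp directory readable only by its owner."""
    path = os.path.join(parent, user)
    os.makedirs(path, mode=0o700, exist_ok=True)
    # The mode passed to makedirs is filtered by the umask and is ignored for
    # directories that already exist, so set it explicitly.
    os.chmod(path, 0o700)
    return path


def write_private_file(directory: str, name: str, data: bytes) -> str:
    """Write data to a file that only its owner may read, write, or execute."""
    path = os.path.join(directory, name)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o700)
    with os.fdopen(fd, "wb") as handle:
        handle.write(data)
    os.chmod(path, 0o700)  # enforce 700 regardless of the umask
    return path


def mode_of(path: str) -> int:
    """Return the permission bits of path, e.g. 0o700."""
    return stat.S_IMODE(os.stat(path).st_mode)


# Each client user gets a separate private directory under a shared parent.
temp_root = tempfile.mkdtemp()
alice_dir = make_private_dir(temp_root, "alice")
bob_dir = make_private_dir(temp_root, "bob")
alice_file = write_private_file(alice_dir, "part-00000", b"temporary output")
```

On a secure cluster the equivalent settings would be applied through the HDFS file system API under the client user's identity; the sketch only demonstrates the intended isolation.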

Sequence Diagram

<img src="%ATTACHURLPATH%/bulk-uml.png" alt="Sequence Diagram" width='556' height='562' />

...