...

  1. The file must be written in HFile format. In HCatalog, that can be achieved easily by delegating writes to HFileOutputFormat.
  2. The file must be sorted by HBase's row key. HCatalog has no control over the ordering of a client M/R job's output.
  3. The file must be partitioned in a way that matches the regions of the HBase table, ideally by using TotalOrderPartitioning in the writing M/R job. HCatalog has no influence on the partitioning used by a client job.
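
To illustrate the partitioning requirement, here is a minimal sketch of how a row key can be mapped to the region that owns it. It assumes that regions are described by a sorted list of split keys, that each region includes its start key and excludes its end key, and that keys are compared as unsigned bytes. The class and method names (RegionPartitionSketch, partitionFor) are hypothetical and are not part of HBase, Hadoop, or HCatalog.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration only; not an HBase or Hadoop class.
final class RegionPartitionSketch {

    /**
     * Returns the index of the region whose key range contains rowKey.
     * splitKeys must be sorted in unsigned byte order. Region 0 covers all
     * keys below splitKeys[0], region i covers [splitKeys[i-1], splitKeys[i]),
     * and the last region covers all keys at or above the last split key.
     */
    static int partitionFor(byte[] rowKey, byte[][] splitKeys) {
        int lo = 0;
        int hi = splitKeys.length; // the answer always lies in [lo, hi]
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (Arrays.compareUnsigned(rowKey, splitKeys[mid]) >= 0) {
                lo = mid + 1; // key is at or past this split point
            } else {
                hi = mid;     // key is before this split point
            }
        }
        return lo;
    }

    static byte[] bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Assumed example table with three regions: (-inf, "g"), ["g", "p"), ["p", +inf)
        byte[][] splits = { bytes("g"), bytes("p") };
        for (String key : new String[] { "apple", "g", "kiwi", "zebra" }) {
            System.out.println(key + " -> region " + partitionFor(bytes(key), splits));
        }
    }
}
```

If every reducer receives exactly the keys of one such partition and writes them in sorted order, each output file falls within a single region, which is the property the third requirement asks for.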

...

  1. Implement an extension of HFileOutputFormat that calls HBase bulk import in its output committer. If that call fails, the committer deletes the HFiles and fails the job. This OutputFormat is used by the Rewrite job to write the HFiles and import them into HBase.
  2. Implement the Rewrite Map/Reduce job, which reads the SequenceFile, sorts and partitions it, and converts it into HFile format using HFileOutputFormat. To produce the right partitioning, this job employs a TotalOrderPartitioning matching the regions of the HBase table. This job writes its output with the HBase system user id. To read its input SequenceFile, it needs to impersonate the Client user id. This can be done by passing the Client job's delegation tokens to the Rewrite job.
  3. Implement an HBase coprocessor endpoint that allows the HBaseBulkOutputFormat's output committer to start the Rewrite job, running as the HBase system user. The Client delegation token must either a) be passed to the coprocessor as part of the API call, or b) be obtained by the coprocessor using a doAs() block (which would require that HBase is configured as a Kerberos super user). This coprocessor call starts the Rewrite job and blocks until the job is finished.
  4. Implement a modified SequenceFileOutputFormat, which the Client job uses to write its output. We name this the HBaseBulkOutputFormat. The main modification from SequenceFileOutputFormat is that its OutputCommitter starts the Rewrite job using the above coprocessor, waits for it to finish (or fail), and then deletes the SequenceFile.
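
The control flow of the two output committers described in the steps above can be sketched as follows. This is only an illustration under stated assumptions: the interfaces RewriteJobLauncher, BulkImporter and FileCleaner are hypothetical stand-ins for the coprocessor call, the HBase bulk import and file-system cleanup, and the sketch assumes that a failed Rewrite job should also fail the Client job, which the steps above do not state explicitly.

```java
// Hypothetical illustration only; none of these types exist in HBase or HCatalog.
final class BulkCommitSketch {

    /** Stand-in for the coprocessor endpoint call; blocks until the Rewrite job ends. */
    interface RewriteJobLauncher {
        boolean runRewriteJob(String sequenceFilePath, String delegationToken);
    }

    /** Stand-in for the HBase bulk import of a directory of HFiles. */
    interface BulkImporter {
        boolean importHFiles(String hfileDir);
    }

    /** Stand-in for deleting a file or directory. */
    interface FileCleaner {
        void delete(String path);
    }

    /** Commit step of the Client job's output format (step 4). */
    static void commitClientJob(String sequenceFilePath, String delegationToken,
                                RewriteJobLauncher launcher, FileCleaner cleaner) {
        boolean succeeded;
        try {
            // Start the Rewrite job through the coprocessor and wait for it.
            succeeded = launcher.runRewriteJob(sequenceFilePath, delegationToken);
        } finally {
            // The intermediate SequenceFile is deleted whether the job finished or failed.
            cleaner.delete(sequenceFilePath);
        }
        if (!succeeded) {
            // Assumption: a failed Rewrite job fails the Client job as well.
            throw new IllegalStateException("Rewrite job failed");
        }
    }

    /** Commit step of the Rewrite job's output format (step 1). */
    static void commitRewriteJob(String hfileDir, BulkImporter importer, FileCleaner cleaner) {
        if (!importer.importHFiles(hfileDir)) {
            // On a failed bulk import, delete the HFiles and fail the job.
            cleaner.delete(hfileDir);
            throw new IllegalStateException("Bulk import failed");
        }
    }
}
```

Both methods take their collaborators as parameters so that the ordering of calls can be exercised without a cluster; a real implementation would perform these steps inside the commit methods of the respective OutputCommitter classes.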

None of this implementation depends on HCatalog, so it could be used without HCatalog. Notes:

...