Overview

One of the main motivations for using HBase is random access. Among other uses, random reads allow dimension stores to be implemented efficiently by performing skewed joins.

This design outlines a Random Read framework enhancement to HCatalog. StorageHandlers implementing the framework would enable users to leverage the underlying storage's random access capability.

Design

Class diagram describing the new classes and integration with HCat.

RandomReader

A RandomReader instance enables a user to perform random reads on a table. An instance is permanently bound to one table and one output schema. At present it contains a single method, which lets users retrieve a row via its row key:

HCatRecord getRecord(K key)

For example:

String rowKey = ...
HCatRecord myRecord = randomReader.getRecord(rowKey);
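
The contract above can be sketched as a small generic interface with an in-memory implementation. This is only an illustration under stated assumptions, not HCatalog code: `Record` is a hypothetical stand-in for HCatRecord, `InMemoryRandomReader` is invented for this example, and returning null for a missing key is an assumption that the design does not specify.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for HCatRecord: an ordered, read-only list of field values.
final class Record {
    private final List<Object> fields;

    Record(List<Object> fields) {
        this.fields = Collections.unmodifiableList(new ArrayList<>(fields));
    }

    Object get(int index) { return fields.get(index); }

    int size() { return fields.size(); }
}

// Sketch of the RandomReader contract: a single keyed lookup.
interface RandomReader<K> {
    Record getRecord(K key);
}

// Illustrative implementation backed by a map instead of a real store.
final class InMemoryRandomReader<K> implements RandomReader<K> {
    private final Map<K, Record> rows = new HashMap<>();

    void put(K key, Record row) { rows.put(key, row); }

    // Assumption: a missing key yields null rather than an exception.
    @Override
    public Record getRecord(K key) { return rows.get(key); }
}
```

A caller would populate the reader with `put("user42", new Record(Arrays.asList("user42", "Alice")))` and then read the row back with `getRecord("user42")`, mirroring the one-line usage shown above.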

RandomAccessible

StorageHandler developers wishing to expose their random read functionality will make use of the RandomAccessible mixin. Developers will then have to implement the following method:

RandomReader getRandomReader(String tableAlias)

Developers can then construct their implementation-specific RandomReader using the information in the configuration object, which is passed to the StorageHandler via setConf(). The most important piece of information is the OutputJobInfo, which is serialized as a property in the configuration object.
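
As a rough illustration of the mixin, the sketch below shows a handler that captures its configuration in setConf() and builds a reader from it. It does not use the real HCatalog or Hadoop classes: `Properties` stands in for the Hadoop configuration object, `KeyedReader` stands in for RandomReader, `SketchStorageHandler` is invented, and the `sketch.table.<alias>` property is a hypothetical substitute for the serialized OutputJobInfo.

```java
import java.util.Properties;

// Stand-in for RandomReader; rows are plain strings in this sketch.
interface KeyedReader<K> {
    String getRecord(K key);
}

// Sketch of the mixin: a handler opts in to random reads by implementing it.
interface RandomAccessible {
    KeyedReader<String> getRandomReader(String tableAlias);
}

// Hypothetical handler, not a real HCatalog StorageHandler.
class SketchStorageHandler implements RandomAccessible {
    private Properties conf = new Properties();

    // Mirrors setConf(): the framework hands the handler its configuration.
    void setConf(Properties conf) { this.conf = conf; }

    @Override
    public KeyedReader<String> getRandomReader(String tableAlias) {
        // Hypothetical property standing in for the serialized OutputJobInfo.
        String tableName = conf.getProperty("sketch.table." + tableAlias);
        if (tableName == null) {
            throw new IllegalArgumentException("Unknown table alias: " + tableAlias);
        }
        // A real reader would query the underlying store; this one only
        // echoes the table and key so the wiring is visible.
        return key -> tableName + "/" + key;
    }
}
```

Because the capability is expressed as an interface, the framework can detect it with a plain `instanceof RandomAccessible` check before registering a table.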

HCatOutputFormat.addRandomAccessTable()

Users must declare the tables they wish to perform random reads from prior to MR job submission. Doing this once during job setup prevents each map task from having to perform the setup itself, which would add overhead and could strain the metastore server. Tables are declared using:

public static void addRandomAccessTable(Configuration conf, String databaseName, String tableName, HCatSchema outputSchema, String tableAlias)

For example:

//in client job setup
HCatOutputFormat.addRandomAccessTable(conf,"myDatabase","myTable",outputSchema, "myTableAlias");

During each call the method will:

  • Query the metastore for the table information
  • Verify the StorageHandler is RandomAccessible (instanceof)
  • Verify the schema and then create a new OutputJobInfo
  • Deserialize the main OutputJobInfo object from configuration object obtained via setConf()
  • Deserialize the randomAccessTableMap from the main OutputJobInfo object's properties or create one if none exists
  • Update the randomAccessTableMap with new tableAlias->OutputJobInfo pair
  • Serialize randomAccessTableMap back into main OutputJobInfo object
  • Serialize main OutputJobInfo object back into configuration object

One table can have multiple entries, as long as they have different aliases. A potential use case is registering the same table with different outputSchemas.
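
The bookkeeping steps above (deserialize, update the alias map, serialize back) can be sketched as follows. This is a minimal sketch under stated assumptions, not the HCatalog implementation: `JobInfo` is a hypothetical stand-in for OutputJobInfo, `Properties` stands in for the Hadoop configuration object, both property names are invented, Java serialization with Base64 is just one possible encoding, and the metastore query and schema verification are left out.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Base64;
import java.util.HashMap;
import java.util.Properties;

// Hypothetical stand-in for OutputJobInfo.
final class JobInfo implements Serializable {
    private static final long serialVersionUID = 1L;
    final String databaseName;
    final String tableName;
    final String outputSchema;
    final HashMap<String, String> properties = new HashMap<>();

    JobInfo(String databaseName, String tableName, String outputSchema) {
        this.databaseName = databaseName;
        this.tableName = tableName;
        this.outputSchema = outputSchema;
    }
}

final class RandomAccessTables {
    // Hypothetical property names, chosen only for this sketch.
    static final String MAIN_INFO_KEY = "sketch.output.job.info";
    static final String TABLE_MAP_KEY = "sketch.random.access.table.map";

    static String serialize(Serializable value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return Base64.getEncoder().encodeToString(bytes.toByteArray());
    }

    static Object deserialize(String encoded) {
        byte[] raw = Base64.getDecoder().decode(encoded);
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(raw))) {
            return in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    // Reads the alias map from the main info's properties, or creates an empty one.
    @SuppressWarnings("unchecked")
    static HashMap<String, JobInfo> readTableMap(JobInfo main) {
        String encoded = main.properties.get(TABLE_MAP_KEY);
        if (encoded == null) {
            return new HashMap<>();
        }
        return (HashMap<String, JobInfo>) deserialize(encoded);
    }

    // Assumes the main info was already stored in conf under MAIN_INFO_KEY.
    static void addRandomAccessTable(Properties conf, String tableAlias, JobInfo tableInfo) {
        JobInfo main = (JobInfo) deserialize(conf.getProperty(MAIN_INFO_KEY));
        HashMap<String, JobInfo> tableMap = readTableMap(main);
        tableMap.put(tableAlias, tableInfo);                      // alias -> info pair
        main.properties.put(TABLE_MAP_KEY, serialize(tableMap));  // map back into main info
        conf.setProperty(MAIN_INFO_KEY, serialize(main));         // main info back into conf
    }
}
```

Registering the same table twice under two aliases simply produces two map entries, which is how one table can be exposed with two different output schemas.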

HCatOutputFormat.getRandomReader()

After setup, a RandomReader instance can be retrieved using:

public static RandomReader getRandomReader(Configuration conf, String tableAlias, String keyFieldName)

For example:

//Usage would look like
RandomReader reader = HCatOutputFormat.getRandomReader(conf, "myTableAlias", "userId");

This method call will perform the following:

  • Retrieve the randomAccessTableMap
  • Retrieve the OutputJobInfo instance identified by tableAlias
  • Using OutputJobInfo.getStorerInfo(), instantiate the StorageHandler class, then configure it
  • Call storageHandler.getRandomReader() and return the value
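
The lookup steps above might be sketched as follows. All names here are hypothetical stand-ins rather than HCatalog classes: `AliasInfo` plays the role of the per-alias OutputJobInfo, `RandomReadCapable` merges the handler's configuration hook and the mixin into one interface for brevity, and `EchoHandler` is a toy handler. The sketch also assumes the alias map has already been deserialized from the configuration, and it leaves out the keyFieldName argument.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for RandomReader; rows are plain strings in this sketch.
interface RowReader {
    String getRecord(String key);
}

// Stand-in for a StorageHandler that supports random reads.
interface RandomReadCapable {
    void configure(Map<String, String> settings);
    RowReader getRandomReader(String tableAlias);
}

// Stand-in for the per-alias OutputJobInfo: which handler to build, and its settings.
final class AliasInfo {
    final Class<? extends RandomReadCapable> handlerClass;
    final Map<String, String> settings;

    AliasInfo(Class<? extends RandomReadCapable> handlerClass, Map<String, String> settings) {
        this.handlerClass = handlerClass;
        this.settings = settings;
    }
}

// Toy handler whose reader echoes the table name and key.
class EchoHandler implements RandomReadCapable {
    private Map<String, String> settings = new HashMap<>();

    @Override
    public void configure(Map<String, String> settings) { this.settings = settings; }

    @Override
    public RowReader getRandomReader(String tableAlias) {
        String table = settings.get("table");
        return key -> table + ":" + key;
    }
}

final class ReaderLookup {
    static RowReader getRandomReader(Map<String, AliasInfo> tableMap, String tableAlias) {
        // Retrieve the entry registered under this alias.
        AliasInfo info = tableMap.get(tableAlias);
        if (info == null) {
            throw new IllegalArgumentException("No random access table for alias: " + tableAlias);
        }
        // Instantiate the handler class reflectively, then configure it.
        RandomReadCapable handler;
        try {
            handler = info.handlerClass.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate " + info.handlerClass, e);
        }
        handler.configure(info.settings);
        // Delegate to the handler and return its reader.
        return handler.getRandomReader(tableAlias);
    }
}
```

Failing fast on an unknown alias is a design choice of this sketch; the original design does not say how a missing alias should be reported.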

– Main.FcliuYahoo – 06 Feb 2012
