Overview

One of the main motivations for using HBase is random access. Among other uses, random reads enable dimension stores to be implemented efficiently through skewed joins.

This design outlines a Random Read framework enhancement to HCatalog. StorageHandlers implementing the framework would enable users to leverage the underlying storage's random access capability.

Design

Class diagram describing the new classes and integration with HCat.

RandomAccess

A RandomAccess instance enables a user to perform random reads on a table. An instance is permanently bound to a single table and output schema. At present it contains a single method, which lets users retrieve a row via its row key:

HCatRecord getRecord(K key)

For example:

String rowKey = ...
HCatRecord myRecord = randomAccess.getRecord(rowKey);
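
The contract described above can be sketched as a small generic interface. This is only an illustration under stated assumptions: `List<Object>` stands in for HCatRecord so the sketch compiles without HCatalog on the classpath, `InMemoryRandomAccess` is a hypothetical implementation invented for the example, and returning null for a missing key is an assumption, not something this design specifies.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Stand-in for the proposed interface; List<Object> replaces HCatRecord (assumption).
interface RandomAccess<K> {
    List<Object> getRecord(K key);
}

// Hypothetical implementation backed by a map, bound to one table's rows at construction.
class InMemoryRandomAccess<K> implements RandomAccess<K> {
    private final Map<K, List<Object>> rows;

    InMemoryRandomAccess(Map<K, List<Object>> rows) {
        // Defensive copy so the instance stays bound to the rows it was created with.
        this.rows = new HashMap<K, List<Object>>(rows);
    }

    @Override
    public List<Object> getRecord(K key) {
        // Returns null when the key has no row (an assumption of this sketch).
        return rows.get(key);
    }
}
```

A storage-backed implementation such as one for HBase would replace the map lookup with a point read against the underlying store.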

RandomAccessible

StorageHandler developers wishing to expose random read functionality will make use of the RandomAccessible mixin. Developers will then have to implement the following factory method:

RandomAccess getRandomAccess(InputJobInfo inputJobInfo, HCatFieldSchema key)

Within this method, developers can construct their implementation-specific RandomAccess.

HCatInputFormat.addRandomAccessTable()

Users must declare the tables they wish to perform random reads on before submitting the MR job. This saves each map task from having to perform the setup itself, which could strain the metastore server. Tables are declared using:

public static void addRandomAccessTable(Configuration conf, String databaseName, String tableName)

For example:

// in client job setup
HCatInputFormat.addRandomAccessTable(conf, "myDatabase", "myTable");

During each call the method will:

  • Query the metastore for the table information
  • Verify that the StorageHandler is RandomAccessible (instanceof)
  • Create a new InputJobInfo
  • Deserialize the main InputJobInfo object from the configuration object obtained via setConf()
  • Deserialize the randomAccessTableMap from the main InputJobInfo object's properties, or create one if none exists
  • Update the randomAccessTableMap with the new db.table->InputJobInfo pair
  • Serialize the randomAccessTableMap back into the main InputJobInfo object
  • Serialize the main InputJobInfo object back into the configuration object
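
The deserialize, update, and serialize steps above can be sketched as follows. This is a simplified illustration, not the HCatalog implementation: a plain `Map<String, String>` stands in for the Hadoop Configuration, a String stands in for each InputJobInfo, the map is stored directly under a hypothetical key (`sketch.randomAccessTableMap`) rather than inside the main InputJobInfo's properties, and the metastore query and instanceof check are omitted. The class and method names are assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;

class RandomAccessTableRegistry {
    // Hypothetical configuration key used only by this sketch.
    static final String MAP_KEY = "sketch.randomAccessTableMap";

    // Java-serializes the map and Base64-encodes it so it fits in a string-valued config.
    static String serialize(HashMap<String, String> tableMap) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(tableMap);
            }
            return Base64.getEncoder().encodeToString(bytes.toByteArray());
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    @SuppressWarnings("unchecked")
    static HashMap<String, String> deserialize(String encoded) {
        byte[] raw = Base64.getDecoder().decode(encoded);
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(raw))) {
            return (HashMap<String, String>) in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    static void addRandomAccessTable(Map<String, String> conf, String databaseName, String tableName) {
        // Deserialize the existing map, or create one if none exists.
        String encoded = conf.get(MAP_KEY);
        HashMap<String, String> tableMap =
            (encoded == null) ? new HashMap<String, String>() : deserialize(encoded);
        // Update with the new db.table -> job info pair (a String stand-in here).
        String qualifiedName = databaseName + "." + tableName;
        tableMap.put(qualifiedName, "jobInfo:" + qualifiedName);
        // Serialize the map back into the configuration.
        conf.put(MAP_KEY, serialize(tableMap));
    }
}
```

Because every call reads the existing map before writing it back, tables registered by earlier calls are preserved.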

HCatInputFormat.getRandomAccess()

After setup, a RandomAccess instance can be retrieved using:

public static RandomAccess getRandomAccess(Configuration conf, String tableAlias, String keyFieldName)

For example:

// usage would look like
RandomAccess access = HCatInputFormat.getRandomAccess(conf, "myTableAlias", "userId");

This method call will perform the following:

  • Retrieve the randomAccessTableMap
  • Retrieve the InputJobInfo instance
  • Using getStorerInfo(), instantiate the StorageHandler class, then configure it
  • Return storageHandler.getRandomAccess()
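
The lookup and delegation steps above might look roughly like the sketch below. All types here are stand-ins written for the example: `RandomAccess` returns a String instead of an HCatRecord, `RandomAccessible.getRandomAccess` takes plain strings instead of InputJobInfo and HCatFieldSchema, and a map of `Supplier` factories replaces both the randomAccessTableMap and the reflective instantiation of the StorageHandler class. `DemoStorageHandler` and `RandomAccessLookup` are hypothetical names.

```java
import java.util.Map;
import java.util.function.Supplier;

// Stand-in for the proposed interface; String replaces HCatRecord (assumption).
interface RandomAccess<K> {
    String getRecord(K key);
}

// Stand-in for the mixin; plain strings replace InputJobInfo and HCatFieldSchema.
interface RandomAccessible {
    RandomAccess<String> getRandomAccess(String tableName, String keyField);
}

// Hypothetical storage handler that fabricates a record describing the lookup.
class DemoStorageHandler implements RandomAccessible {
    @Override
    public RandomAccess<String> getRandomAccess(String tableName, String keyField) {
        return key -> tableName + "[" + keyField + "=" + key + "]";
    }
}

class RandomAccessLookup {
    static RandomAccess<String> getRandomAccess(
            Map<String, Supplier<Object>> handlersByTable, String table, String keyField) {
        // Retrieve the entry registered for this table during job setup.
        Supplier<Object> factory = handlersByTable.get(table);
        if (factory == null) {
            throw new IllegalArgumentException("Table was not registered: " + table);
        }
        // Instantiate the handler; a real implementation would load it by class name.
        Object handler = factory.get();
        if (!(handler instanceof RandomAccessible)) {
            throw new IllegalStateException("Handler does not support random access: " + table);
        }
        // Delegate to the handler's factory method.
        return ((RandomAccessible) handler).getRandomAccess(table, keyField);
    }
}
```

Failing fast on an unregistered table or a handler without random access keeps configuration mistakes visible at task setup instead of at the first lookup.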

Sample Usage

In job setup:

Job job = new Job(new Configuration(), "random read job");
....
// declare the table you'd like to perform random reads on
HCatInputFormat.addRandomAccessTable(job.getConfiguration(), "myDB", "foreignTable");
...

In MR job:

  public class TestMR extends Mapper {
      RandomAccess ra;

      @Override
      protected void setup(Context context) throws IOException, InterruptedException {
          HCatFieldSchema key = ...
          HCatSchema outputSchema = ...
          ra = HCatInputFormat.getRandomAccess(context.getConfiguration(),
              "myDB","foreignTable",key,outputSchema);
      }

      @Override
      protected void map(Object key, Object value, Context context) throws IOException, InterruptedException {
          HCatRecord record = ra.getRecord(key);
          ....
      }
  }