Overview
One of the main motivations for using HBase is random access. Among other uses, random reads make it possible to implement dimension stores efficiently by supporting skewed joins.
This design outlines a Random Read framework enhancement to HCatalog. StorageHandlers that implement the framework let users leverage the underlying storage's random access capability. It is an implementation of the Generic MR Random Access Framework.
Design
Class diagram describing the new classes and integration with HCat.
HCatRandomAccess
A RandomAccess instance enables a user to perform random reads on a table. An instance is permanently bound to a single table and output schema. Presently it contains a single method, which lets users retrieve a row via its row key:
HCatRecord get(Object key)
For example:
String rowKey = ...
HCatRecord myRecord = randomAccess.get(rowKey);
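The single-method contract above can be illustrated with a minimal in-memory sketch. Everything in it is an assumption made for illustration: the nested RandomAccess interface is a stand-in for HCatRandomAccess, a row is modelled as a List<Object> rather than an HCatRecord so that the code compiles without HCatalog on the classpath, and returning null for a missing key is an assumed behaviour that the design does not specify.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class RandomReadSketch {
    // Stand-in for HCatRandomAccess (hypothetical). A row is a List<Object> here,
    // not an HCatRecord, so the sketch has no HCatalog dependency.
    interface RandomAccess {
        // Returns the row stored under the key, or null if absent (assumed behaviour).
        List<Object> get(Object key);
    }

    // In-memory implementation bound to exactly one table for its whole lifetime.
    static final class InMemoryRandomAccess implements RandomAccess {
        private final String tableName;
        private final Map<Object, List<Object>> rows = new HashMap<>();

        InMemoryRandomAccess(String tableName) {
            this.tableName = tableName;
        }

        String tableName() {
            return tableName;
        }

        // Test helper that seeds a row; not part of the read-only contract.
        void load(Object key, Object... columns) {
            rows.put(key, Arrays.asList(columns));
        }

        @Override
        public List<Object> get(Object key) {
            return rows.get(key);
        }
    }
}
```

A real implementation would replace the HashMap with a call into the underlying store, for example an HBase point lookup, and convert the result into the bound output schema.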
RandomAccessible
StorageHandler developers wishing to expose random read functionality will make use of the RandomAccessible mixin. Developers will then have to implement the following factory method:
public RandomAccess getRandomAccess(String databaseName, String tableName, Map<String, String> properties)
Developers can then construct their implementation-specific RandomReader.
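As a rough sketch of how a StorageHandler might adopt the mixin, the following uses the factory signature from this design. The nested RandomAccess interface and DemoStorageHandler are hypothetical stand-ins; a real handler would also extend the HCatalog storage handler base class and return HCatRecord values.

```java
import java.util.Map;

final class MixinSketch {
    // Hypothetical stand-in for the framework's RandomAccess type.
    interface RandomAccess {
        Object get(Object key);
    }

    // Mixin with the factory method signature given in the design.
    interface RandomAccessible {
        RandomAccess getRandomAccess(String databaseName, String tableName,
                                     Map<String, String> properties);
    }

    // Hypothetical handler that opts in to random reads by implementing the mixin.
    static final class DemoStorageHandler implements RandomAccessible {
        @Override
        public RandomAccess getRandomAccess(String databaseName, String tableName,
                                            Map<String, String> properties) {
            final String prefix = databaseName + "." + tableName + ":";
            // A real handler would open a connection to its store here; this
            // sketch just echoes the qualified table name and key.
            return key -> prefix + key;
        }
    }
}
```

Because the capability is expressed as an interface, the framework can detect support with a plain instanceof check and leave handlers without random access untouched.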
HCatRandomAccess.createProperties()
Users must declare the tables they wish to perform random reads on before submitting the MR job. This saves each map task from having to perform the setup itself, which could strain the metastore server. The declaration is made using:
public static Map<String,String> createProperties(String db, String table, HCatSchema inputSchema, HCatSchema outputSchema, Map<String, String> properties) throws IOException;
During each call the method will:
- Query the metastore for the table information
- Verify that the StorageHandler is RandomAccessible (instanceof)
- Create a new InputJobInfo
- Deserialize the main InputJobInfo object from configuration object
- Deserialize the randomAccessTableMap from the main InputJobInfo object's properties
- Update the randomAccessTableMap with new db.table->InputJobInfo pair
- Serialize randomAccessTableMap back into main InputJobInfo object
- Serialize main InputJobInfo object back into configuration object
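The deserialize, update and serialize round trip in the steps above can be sketched as follows. This is a simplification under stated assumptions: a plain Map<String, String> stands in for the configuration object, a String stands in for each InputJobInfo, the property key is hypothetical, and Java serialization with Base64 is assumed as the encoding. The metastore query and the instanceof check are omitted.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;

final class CreatePropertiesSketch {
    // Hypothetical property key under which the randomAccessTableMap is stored.
    static final String TABLE_MAP_KEY = "hcat.randomaccess.tablemap";

    // Encodes a serializable object as a Base64 string (assumed encoding).
    static String serialize(Serializable obj) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return Base64.getEncoder().encodeToString(bytes.toByteArray());
    }

    // Decodes the table map, or returns an empty map if none was stored yet.
    @SuppressWarnings("unchecked")
    static HashMap<String, String> deserializeMap(String encoded) {
        if (encoded == null) {
            return new HashMap<>();
        }
        byte[] raw = Base64.getDecoder().decode(encoded);
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(raw))) {
            return (HashMap<String, String>) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    // Adds a db.table -> tableInfo pair to the stored map and writes the map back.
    static Map<String, String> register(Map<String, String> jobProperties,
                                        String db, String table, String tableInfo) {
        HashMap<String, String> tableMap = deserializeMap(jobProperties.get(TABLE_MAP_KEY));
        tableMap.put(db + "." + table, tableInfo);
        jobProperties.put(TABLE_MAP_KEY, serialize(tableMap));
        return jobProperties;
    }
}
```

Each call rewrites the whole serialized map, so declaring several tables costs one round trip per table at job setup and none inside the map tasks.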
HCatRandomAccess.initialize()
public void initialize(Map<String, String> properties) throws IOException;
This method call will perform the following:
- Retrieve the randomAccessTableMap
- Retrieve the InputJobInfo instance for the selected table
- Using getStorerInfo(), instantiate the StorageHandler class, then configure it
- Return storageHandler.getRandomAccess()
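The task-side steps above could look roughly like the sketch below. It is an illustration under assumptions: the table map is reduced to a mapping from db.table to the handler class name (the design stores a full InputJobInfo and reads the class via getStorerInfo()), the handler is instantiated by reflection through a no-argument constructor, and the configure step is omitted. The nested interfaces and EchoStorageHandler are hypothetical.

```java
import java.util.Map;

final class InitializeSketch {
    // Hypothetical stand-ins for the framework types.
    interface RandomAccess {
        Object get(Object key);
    }

    interface RandomAccessible {
        RandomAccess getRandomAccess(String databaseName, String tableName,
                                     Map<String, String> properties);
    }

    // Hypothetical handler with the public no-arg constructor reflection needs.
    public static final class EchoStorageHandler implements RandomAccessible {
        public EchoStorageHandler() {
        }

        @Override
        public RandomAccess getRandomAccess(String db, String table,
                                            Map<String, String> properties) {
            return key -> db + "." + table + "/" + key;
        }
    }

    // Looks up the declared table, instantiates its handler and asks it for a RandomAccess.
    static RandomAccess initialize(Map<String, String> tableToHandlerClass,
                                   String db, String table,
                                   Map<String, String> properties) {
        String qualified = db + "." + table;
        String handlerClass = tableToHandlerClass.get(qualified);
        if (handlerClass == null) {
            throw new IllegalArgumentException("Table not declared for random access: " + qualified);
        }
        try {
            Object handler = Class.forName(handlerClass).getDeclaredConstructor().newInstance();
            if (!(handler instanceof RandomAccessible)) {
                throw new IllegalStateException(handlerClass + " is not RandomAccessible");
            }
            return ((RandomAccessible) handler).getRandomAccess(db, table, properties);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate " + handlerClass, e);
        }
    }
}
```

Failing fast on an undeclared table keeps the error close to its cause, since the declaration was supposed to happen before job submission.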
Sample Usage
In job setup:
JobConf jobConf = new JobConf();
...
jobConf.setOutputFormat(TextOutputFormat.class);
FileOutputFormat.setOutputPath(jobConf, new Path("/foo"));
Map<String, String> properties = new HashMap<String, String>();
HCatSchema inputSchema = ...;
HCatSchema outputSchema = ...;
RandomAccessManager.add("myTableAlias", HBaseRandomAccess.class,
    HCatRandomAccess.createProperties("my_db", "my_table", inputSchema, outputSchema, properties),
    jobConf);
JobClient.runJob(jobConf);
In MR job:
public void map(LongWritable longWritable, Text text,
                OutputCollector<LongWritable, Text> longWritableTextOutputCollector,
                Reporter reporter) throws IOException {
    // Use the same alias that was registered in job setup.
    RandomAccess access = RandomAccessManager.get("myTableAlias", Object.class, HCatRecord.class, HCatRecord.class);
    .....
    access.get("row1");
    access.put(null, record);
}
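To connect the sample back to the dimension-store use case from the overview, here is a hedged sketch of a map-side join driven by per-key random reads. It is not part of the design: the lookup is modelled as a java.util.function.Function rather than a RandomAccess instance, rows are plain strings, and the per-task cache for frequently repeated (skewed) keys is an assumption added for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

final class DimensionJoinSketch {
    // Joins each fact row {key, measure} with the dimension value fetched by key.
    // Lookups are memoised so a hot key reaches the backing store only once.
    static List<String> join(List<String[]> facts, Function<String, String> dimensionLookup) {
        Map<String, String> cache = new HashMap<>();
        List<String> joined = new ArrayList<>();
        for (String[] fact : facts) {
            // computeIfAbsent does not cache null results, so missing keys are retried.
            String dimension = cache.computeIfAbsent(fact[0], dimensionLookup);
            joined.add(fact[0] + "," + fact[1] + "," + dimension);
        }
        return joined;
    }
}
```

In a real mapper the lookup function would delegate to access.get(key), and the cache would need a size bound to avoid exhausting task memory.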