HCatalog Journal

This document tracks the development of HCatalog. It summarizes work that has been done in previous releases, what is currently being worked on, and proposals for future work in HCatalog.

Completed Work

| Feature | Available in | Comments |
|---------|--------------|----------|
| Read/write of data from MapReduce | 0.1 | |
| Read/write of data from Pig | 0.1 | |
| Read from Hive | 0.1 | |
| Support pushdown of columns to be projected into storage format | 0.1 | |
| Support for RCFile storage | 0.1 | |
| Add a CLI | 0.1 | |
| Partition Pruning | 0.1 | |
| Support data read across partitions with different storage formats | 0.1 | |
| Authentication | 0.1 | See HCatalogAuthentication |
| Authorization | 0.1 | See HCatalogAuthorizationProposal |
| Data Import/Export | Not yet released | See HCAT-16 |

Work in Progress

| Feature | References |
|---------|------------|
| Dynamic Partitioning | See HCatalog02Design |
| Compaction of partitions | See HCatalog02Design |
| Mark set of partitions as done | See HCatalog02Design |
| Notification | See HCatalog02Design |

Proposed Work

The following sections describe tasks proposed for future work on HCatalog. They are ordered by what we currently believe to be their priority, with the most important tasks listed first.

Support for more file formats
At least one row format and one text format need to be supported.

Allow specification of general storage type
Currently Hive allows the user to specify a specific storage format for a table; for example, the user can say `STORED AS RCFILE`. We would like to enable users to select a general storage type (columnar, row, or text) without needing to know the underlying format being used. Thus it would be legal to say `STORED AS ROW` and let the administrators decide whether SequenceFile or TFile is used to store data in row format.

Utility APIs
Grid managers will want to build tools that use HCatalog to help manage their grids. For example, one might build a tool to do replication between two grids. Other tools will include data cleaning systems, data archiving systems, etc. Such tools will want to use HCatalog's metadata. HCatalog needs to provide an appropriate API for these types of tools.
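
As a sketch of the flavor such an API might take, the fragment below enumerates a table's partitions using Hive's existing metastore client, which HCatalog builds on; the `default` database is a placeholder, and a purpose-built HCatalog utility API would presumably offer a cleaner equivalent of these calls.

```java
import java.util.List;

import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.Partition;

// Hedged sketch of a cross-grid replication helper: list every
// partition on the source grid so missing ones can be copied to the
// target grid. Uses Hive's metastore client directly; an HCatalog
// utility API would wrap calls like these.
public class PartitionLister {
    public static void main(String[] args) throws Exception {
        HiveConf conf = new HiveConf(); // reads hive-site.xml from the classpath
        HiveMetaStoreClient client = new HiveMetaStoreClient(conf);
        try {
            for (String table : client.getAllTables("default")) {
                // Short.MAX_VALUE asks for all partitions of the table.
                List<Partition> parts =
                    client.listPartitions("default", table, Short.MAX_VALUE);
                for (Partition p : parts) {
                    // A replication tool would compare this location
                    // against the target grid and copy it if absent.
                    System.out.println(table + " -> " + p.getSd().getLocation());
                }
            }
        } finally {
            client.close();
        }
    }
}
```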

Pushing filters into storage formats
In columnar storage, performance can be improved when a row-selection predicate can be evaluated against the relevant columns before the remaining columns are decompressed and deserialized and the row is constructed. When the filter can be applied to a still-compressed, serialized version of the column, the performance boost is significant. Where the underlying storage format supports this, HCatalog needs to push filters down from Pig and Hive. Columnar storage formats that HCatalog commonly uses will also need to be modified to support these features.
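
To make the idea concrete, here is a minimal sketch; the types below are invented for illustration and are not part of HCatalog. A pushdown-aware reader evaluates predicates against only the columns they reference, and pays the cost of decompressing, deserializing, and assembling the remaining columns only for rows that pass.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Invented types for illustration; not part of HCatalog.
interface ColumnPredicate {
    String columnName();          // the single column this predicate reads
    boolean accept(Object value); // true if the row should be kept
}

public class PushdownSketch {
    // Evaluate predicates using only the predicate columns. The reader
    // builds the full row (and touches the other columns) only when
    // this returns true.
    static boolean rowPasses(List<ColumnPredicate> predicates,
                             Map<String, Object> predicateColumnValues) {
        for (ColumnPredicate p : predicates) {
            if (!p.accept(predicateColumnValues.get(p.columnName()))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        ColumnPredicate statusOk = new ColumnPredicate() {
            public String columnName() { return "status"; }
            public boolean accept(Object value) {
                return Integer.valueOf(200).equals(value);
            }
        };
        Map<String, Object> row = new HashMap<String, Object>();
        row.put("status", 200);
        System.out.println(rowPasses(Arrays.asList(statusOk), row)); // true
    }
}
```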

Separate compression for separate columns
One of the benefits of columnar storage is the ability to select compression formats that are optimal for different columns in the data. HCatalog needs to support a variety of data-specific compression formats and allow users to select different formats for different columns in a table.
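
For example (the codec names and the idea of a per-column mapping are hypothetical here, purely to make the concept concrete), a web-log table might compress each column differently:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-column codec choices; none of these are real
// HCatalog settings. The point is that each column's codec can match
// the shape of its data.
public class PerColumnCodecs {
    static Map<String, String> forWebLogs() {
        Map<String, String> codec = new HashMap<String, String>();
        codec.put("url", "lzo");         // long, highly repetitive strings
        codec.put("timestamp", "delta"); // near-monotonic integers
        codec.put("status", "rle");      // long runs of identical values
        return codec;
    }

    public static void main(String[] args) {
        System.out.println(forWebLogs());
    }
}
```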

Indices for sorted tables
Providing the first record in each block of a sorted table enables a number of performance optimizations in the query engine accessing the data (such as Pig's merge join). We may need to provide this functionality in HCatalog's standard formats. It is also possible that the index functionality already being added to Hive could be used for this.
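
A minimal sketch of how such an index is used, assuming string keys and an in-memory array holding the first key of each block: binary-search the index to find the one block where a lookup key can start, instead of scanning the file from the beginning.

```java
import java.util.Arrays;

// Sketch only: locate the block in which a key must reside, given the
// first key of each block of a sorted file. This is the lookup that
// optimizations like Pig's merge join perform on the sorted side.
public class BlockIndexSketch {
    // firstKeys[i] is the first key stored in block i, in sorted order.
    static int blockFor(String[] firstKeys, String key) {
        int pos = Arrays.binarySearch(firstKeys, key);
        if (pos >= 0) {
            return pos;                    // key is the first record of a block
        }
        int insertion = -pos - 1;          // first block whose firstKey > key
        return Math.max(0, insertion - 1); // so key lives in the block before it
    }

    public static void main(String[] args) {
        String[] index = { "apple", "mango", "plum" };
        System.out.println(blockFor(index, "orange")); // 1: scan starts at "mango"
    }
}
```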

Statistics Storage
Data statistics should be accessible through HCatalog. Compact statistics (e.g. number of rows) can be stored in the metastore database. Large statistics (e.g. histograms) would have to be stored in HDFS, HBase, or some other store. Hive has also done some work in this area; we need to integrate with it.
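
One plausible shape for the compact case, sketched with Hive's existing metastore client: keep small statistics in the table's key/value parameter map (the property name `hcat.stats.numRows` and the table `web_logs` are made up for illustration), and store only a pointer there when the statistic itself lives in HDFS or HBase.

```java
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.Table;

// Hedged sketch: record a row count as a table parameter in the
// metastore. The property name is invented for this example.
public class RowCountStat {
    public static void main(String[] args) throws Exception {
        HiveMetaStoreClient client = new HiveMetaStoreClient(new HiveConf());
        try {
            Table t = client.getTable("default", "web_logs");
            t.getParameters().put("hcat.stats.numRows", "123456789");
            client.alter_table("default", "web_logs", t);
        } finally {
            client.close();
        }
    }
}
```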

Schema Evolution
Currently schema evolution in Hive is limited to adding columns at the end of the non-partition-key columns. It may be desirable to support other forms of schema evolution, such as adding columns in other parts of the record, or making it so that new partitions of a table no longer contain a given column.

Support for streaming
Currently HCatalog cannot be used from Hadoop Streaming jobs. It should support them.

Integration with HBase
Currently HCatalog does not support HBase tables. It needs storage drivers so that HCatInputFormat and HCatLoader can do bulk reads and HCatOutputFormat and HCatStorer can do bulk writes. We also need to understand what interface, if any, it makes sense for HCatalog to expose for point reads and writes on HCatalog tables that use HBase as a storage mechanism, and we need to figure out how to do this in a way that provides consistent reads for MapReduce jobs.
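
Purely as a strawman for the point-access question (nothing below exists in HCatalog), such an interface might look like the following, with bulk access staying on the existing InputFormat/OutputFormat and Loader/Storer paths:

```java
import java.io.IOException;

// Strawman only: a possible point read/write surface for HCatalog
// tables backed by HBase. Bulk reads and writes would continue to go
// through HCatInputFormat/HCatOutputFormat and HCatLoader/HCatStorer.
public interface HCatPointAccess {
    // Fetch one row of the named table by its HBase row key.
    Object get(String dbName, String tableName, byte[] rowKey) throws IOException;

    // Write one row. Visibility semantics would need defining so that
    // concurrent MapReduce jobs still see a consistent snapshot.
    void put(String dbName, String tableName, byte[] rowKey, Object row)
        throws IOException;
}
```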
