
This roadmap is proposed and has not yet been accepted/approved by the HCatalog community.

HCatalog Roadmap

Hive today provides its users a simple, familiar, database-like tabular model of data management. HCatalog seeks to generalize this table model so that Hive tables become Hadoop tables: tables that can be backed by HDFS or by alternate storage systems such as cloud stores, NoSQL stores, or databases, and that can be easily used by all current and future Hadoop programming and data management frameworks, including MapReduce, Pig, Streaming, and many others. HCatalog APIs are being created to enable future data management frameworks to provide data migration, replication, transformation, archival, and other services.

...

  1. Access to statistics concerning data sets via HCatLoader and the ability to generate and store statistics via HCatStorer. 1
  2. Access to statistics concerning data sets via HCatInputFormat and the ability to generate and store statistics via HCatOutputFormat (see the MapReduce sketch following this list). 1
  3. Ability for MapReduce streaming users to read data from and write data to HCatalog tables. Schema information should also be communicated via environment variables. 1,2
  4. A REST interface that will allow parallel read and write of records where the parallelism is determined by the reader or writer. This interface must also support partition-pruning predicates, simple predicates (equality, inequality, is/is not null, boolean) on columns, and column projections (a hypothetical request shape is sketched after this list). 4
  5. APIs for and reference implementations of data lifecycle management tools such as cleaners that remove old data, archivers that archive data by reformatting it on HDFS (partition aggregation, erasure coding, compression, ...) or by relocating it to another storage system, and replication tools that mirror data between clusters. These APIs will be in REST or a similar language-independent format. 3
  6. Allowing the user to assert a desired schema at the time the data is read. For formats where the schema is stored in HCatalog's metadata, this will mean merging the known schema with the user-asserted schema. Where the schema is stored not in the metadata but in the data itself, it will mean merging the user-asserted schema with the schema stored in the data. For data where the schema is stored neither in the metadata nor in the data, it will mean parsing the data in a user-provided way (such as CSV). This schema merging and parsing must gracefully handle the case where columns specified by the user are missing. It must not assume that every record will have the same fields in the same order, and it must allow the user to specify what action to take when a desired field is missing (e.g., insert a null, discard the row, fail); a small sketch of this merge-and-project behavior follows this list. 6
  7. Provide an authorization model that allows finer-grained access to data than the current storage-based model while not running user-provided code effectively as super-user. In the case of HDFS-stored data, this finer-grained control means not relying on POSIX group semantics for file access but rather allowing individual users to be granted specific access rights on a table or partition. It also means providing column-level and potentially row-level access controls. 8
  8. Support transition of column types over time without requiring restatement of existing data. Not all type transitions would be supported, but many should be (such as integer to long, long to floating point, or integer/long to fixed point). 6
  9. Add support for a fixed point type. 7
  10. Connect the boolean and datetime types that already exist in Hive and Pig via HCatalog. 7
  11. Expand support for connecting to HBase tables to include the ability to alter tables and push down non-key predicates. 5
  12. Implement StorageHandlers to connect HCatalog to the metadata of RDBMSs, NoSQL stores, etc. 5
  13. Add a REGISTER FUNCTION command to Hive (in addition to the existing REGISTER TEMP FUNCTION) and devise a way for Hive to store code associated with those functions. 11
  14. Integrate Pig and MR with registered functions to allow them to make use of the code stored by users. 11
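
Item 2 above refers to the HCatInputFormat/HCatOutputFormat interfaces that already exist; the proposed statistics hooks would presumably surface through them. As context, here is a minimal sketch of a map-only job that copies one HCatalog table to another, written against the HCatalog 0.4-era API (signatures have shifted between releases); the database and table names are hypothetical.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hcatalog.data.DefaultHCatRecord;
    import org.apache.hcatalog.data.HCatRecord;
    import org.apache.hcatalog.mapreduce.HCatInputFormat;
    import org.apache.hcatalog.mapreduce.HCatOutputFormat;
    import org.apache.hcatalog.mapreduce.InputJobInfo;
    import org.apache.hcatalog.mapreduce.OutputJobInfo;

    public class HCatCopyJob {

      // Identity mapper: HCatInputFormat delivers each row as an HCatRecord.
      public static class CopyMapper extends
          Mapper<WritableComparable, HCatRecord, WritableComparable, HCatRecord> {
        @Override
        protected void map(WritableComparable key, HCatRecord value, Context ctx)
            throws IOException, InterruptedException {
          ctx.write(key, value);
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "hcat-copy");
        job.setJarByClass(HCatCopyJob.class);

        // Read every partition of the source table (the third argument is a
        // partition filter; null means "all partitions").
        HCatInputFormat.setInput(job,
            InputJobInfo.create("default", "source_table", null));
        job.setInputFormatClass(HCatInputFormat.class);

        // Write to the target table; null partition values assume an
        // unpartitioned target.
        HCatOutputFormat.setOutput(job,
            OutputJobInfo.create("default", "target_table", null));
        // Emit records using the target table's own schema.
        HCatOutputFormat.setSchema(job, HCatOutputFormat.getTableSchema(job));
        job.setOutputFormatClass(HCatOutputFormat.class);

        job.setMapperClass(CopyMapper.class);
        job.setOutputKeyClass(WritableComparable.class);
        job.setOutputValueClass(DefaultHCatRecord.class);
        job.setNumReduceTasks(0);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Running this requires the HCatalog and Hive metastore jars on the job classpath (typically via -libjars).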
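For item 4, no endpoint or parameter names have been settled; the following is a purely hypothetical request shape, invented for illustration only, showing how a reader might combine a partition-pruning predicate, a simple column predicate, a projection, and a split number for parallel reads.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.net.URLEncoder;

    public class RestReadSketch {
      public static void main(String[] args) throws Exception {
        // Hypothetical parameters: partition-pruning predicate, simple column
        // predicate, column projection, and this reader's slice of the data.
        String query = "partition=" + URLEncoder.encode("dt=2011-11-01", "UTF-8")
            + "&filter=" + URLEncoder.encode("clicks > 0", "UTF-8")
            + "&columns=user,clicks"
            + "&split=3&totalsplits=10";
        // Host, port, and path are invented for illustration.
        URL url = new URL("http://hcat.example.com:8080/tables/default.raw_events/records?" + query);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
        for (String line; (line = in.readLine()) != null; ) {
          System.out.println(line); // one record per line in this sketch
        }
        in.close();
      }
    }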
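Item 6's merge-and-project behavior can be stated in code even before an API is settled. The sketch below is illustrative only; the MissingFieldPolicy type and project method are hypothetical and not part of HCatalog.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    public class SchemaMerge {

      public enum MissingFieldPolicy { INSERT_NULL, DISCARD_ROW, FAIL }

      // Projects one record onto the user-asserted field list. The record is not
      // assumed to contain every asserted field, nor its fields in any fixed
      // order. Returns null when the policy says to discard the row.
      public static List<Object> project(Map<String, Object> record,
                                         List<String> assertedFields,
                                         MissingFieldPolicy policy) {
        List<Object> row = new ArrayList<Object>();
        for (String field : assertedFields) {
          if (record.containsKey(field)) {
            row.add(record.get(field));
          } else if (policy == MissingFieldPolicy.INSERT_NULL) {
            row.add(null);
          } else if (policy == MissingFieldPolicy.DISCARD_ROW) {
            return null; // caller drops this row
          } else {
            throw new IllegalStateException("Missing required field: " + field);
          }
        }
        return row;
      }
    }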

Wishlist

A discussion on hcatalog-dev about the future project goals yielded a number of good ideas. Here's an attempt to summarize the suggestions; please update with new ideas, or correct ones I didn't transfer correctly.

Multiple cluster capabilities:

  • Process data from one cluster and store it into another cluster, i.e., HCatLoader() reads from one cluster's HCatalog server and HCatStorer() writes to another HCatalog server (see the configuration sketch below).
  • An HCatalog metastore server that handles metadata for multiple clusters; we would eventually like to have one HCatalog server per colo. (Question: didn't the Hive folks look into this and abandon that effort? Any clue why?)
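
A purely illustrative sketch of the first bullet, reusing the classes from the copy job above: point the read side and the write side at different metastores by swapping hive.metastore.uris between the setInput and setOutput calls. This assumes, and it is only an assumption, that HCatalog captures all connection and table information when setInput/setOutput are called; the URIs and table name are hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hcatalog.mapreduce.HCatInputFormat;
    import org.apache.hcatalog.mapreduce.HCatOutputFormat;
    import org.apache.hcatalog.mapreduce.InputJobInfo;
    import org.apache.hcatalog.mapreduce.OutputJobInfo;

    public class CrossClusterCopy {
      public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "cross-cluster-copy");

        // Read side: cluster A's metastore. ASSUMPTION: HCatalog snapshots the
        // connection and table information when setInput() is called.
        job.getConfiguration().set("hive.metastore.uris",
            "thrift://hcat-a.example.com:9083");
        HCatInputFormat.setInput(job,
            InputJobInfo.create("default", "raw_events", null));

        // Write side: cluster B's metastore, swapped in before setOutput().
        job.getConfiguration().set("hive.metastore.uris",
            "thrift://hcat-b.example.com:9083");
        HCatOutputFormat.setOutput(job,
            OutputJobInfo.create("default", "raw_events", null));

        // Mapper and schema wiring as in the copy-job sketch earlier on this page.
      }
    }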

...

  1. Ability to import/export metadata into and out of HCatalog. Some level of support for this was added in version 0.2 but it is currently broken. This can be useful for users backing up HCatalog server metadata and importing it into another cluster, etc., instead of having to replay add/alter partitions. For example, we would like to move a project with 1+ years' worth of data from one cluster to another: copying the data is easily done with distcp by copying the top-level table directory, but copying the metadata is more cumbersome.

Features Under Discussion

This section contains features the HCatalog community has discussed but has not yet committed to adding to HCatalog. These features may be added to HCatalog, or it may be determined that they belong in different projects or tools.

  1. Process data from one cluster and store it into another cluster, i.e., HCatLoader() reads from one cluster's metastore server and HCatStorer() writes to another metastore server.
  2. A metastore server that handles metadata for multiple clusters, so that one HCatalog instance could exist per group of co-located clusters rather than one per cluster.
  3. Ability to store data provenance/lineage apart from statistics on the data.
  4. Ability to discover data. E.g., if a new user needs to know where click data for search ads is stored, he needs to go through twikis or mail user lists to find where exactly it is stored. We need the ability to query on keywords/producer of data and find which table contains the data.
  5. Consolidate the many service components into a smaller number to simplify running HCat in production (HiveMetaStore, Hive thrift service, HiveWebInterface, webhcat).