...

  1. Access to statistics concerning data sets via HCatLoader and the ability to generate and store statistics via HCatStorer. 1
  2. Access to statistics concerning data sets via HCatInputFormat and the ability to generate and store statistics via HCatOutputFormat. 1
  3. Ability for MapReduce streaming users to read data from and write data to HCatalog tables. Schema information should also be communicated via environment variables. 1,2
  4. A REST interface that will allow parallel read and write of records where the parallelism is determined by the reader or writer. This interface must also support partition pruning predicates, simple predicates (equality, inequality, is/is not null, boolean) on columns, and column projections. 4
  5. APIs for and reference implementations of data lifecycle management tools such as cleaners that remove old data, archivers that archive data by reformatting it on HDFS (partition aggregation, erasure coding, compression, ...) or by relocating it to another storage system, and replication tools that mirror data between clusters. These APIs will be in REST or a similar language-independent format. 3
  6. Allowing the user to assert a desired schema at the time the data is read. For formats where the schema is stored in HCatalog's metadata, this means merging the known schema with the user-asserted schema. Where the schema is not stored in the metadata but is stored in the data, it means merging the user-asserted schema with the schema stored in the data. For data where the schema is stored in neither the metadata nor the data, it means parsing the data in a user-provided way (such as CSV). This schema merging and parsing must gracefully handle the case where columns specified by the user are missing. It must not assume that every record will have the same fields in the same order, and it must allow the user to specify what action to take when a desired field is missing (e.g. insert a null, discard the row, fail); see the first sketch following this list. 6
  7. Provide an authorization model that allows finer-grained access to data than the current storage-based model while not running user-provided code effectively as super-user. For HDFS-stored data, this finer-grained control means not relying on POSIX group semantics for file access but rather allowing individual users to be granted specific access rights on a table or partition. It also means providing columnar and potentially row-wise access controls. 8
  8. Support transition of column types over time without requiring restatement of existing data. Not all type transitions would be supported, but many should be (such as integer to long, long to floating point, or integer/long to fixed point); a compatibility sketch follows this list. 6
  9. Add support for a fixed point type. 7
  10. Connect the boolean and datetime types that exist in Hive and Pig via HCatalog. 7
  11. Expand support for connecting to HBase tables to include the ability to alter tables and push down non-key predicates. 5
  12. Implement StorageHandlers to connect HCatalog to the metadata of RDBMSs, NoSQL stores, etc. 5
  13. Add a REGISTER FUNCTION command to Hive (in addition to the existing REGISTER TEMP FUNCTION) and devise a way for Hive to store the code associated with those functions. 11
  14. Integrate Pig and MR with registered functions to allow them to make use of the code stored by users. 11
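
To make item 6 concrete, below is a minimal sketch of the missing-field policy it describes: the user asserts a schema at read time and chooses whether a missing column becomes a null, discards the row, or fails the read. This is plain illustrative Java; none of the class or method names here are an existing HCatalog API.

    // Illustrative only: a hypothetical helper for the item 6 behavior.
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    public class AssertedSchemaReader {

        /** What to do when a field the user asked for is absent from a record. */
        public enum MissingFieldAction { INSERT_NULL, DISCARD_ROW, FAIL }

        private final List<String> assertedFields;   // schema asserted by the user at read time
        private final MissingFieldAction action;

        public AssertedSchemaReader(List<String> assertedFields, MissingFieldAction action) {
            this.assertedFields = assertedFields;
            this.action = action;
        }

        /**
         * Projects one record (field name -> value) onto the user-asserted schema.
         * Returns null when the policy says the row should be discarded.
         */
        public List<Object> project(Map<String, Object> record) {
            List<Object> out = new ArrayList<Object>();
            for (String field : assertedFields) {
                if (record.containsKey(field)) {
                    out.add(record.get(field));
                } else {
                    switch (action) {
                        case INSERT_NULL: out.add(null); break;
                        case DISCARD_ROW: return null;
                        case FAIL:        throw new IllegalStateException("Missing field: " + field);
                    }
                }
            }
            return out;
        }
    }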

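Similarly, here is a minimal sketch of the kind of compatibility rule item 8 implies: a column's declared type may only widen over time (e.g. int to bigint, bigint to double), so existing data can be reinterpreted on read rather than restated. The exact set of allowed transitions is a design decision, and nothing below is an existing HCatalog class.

    // Illustrative only: a hypothetical type-promotion check for item 8.
    // "decimal" stands in for the proposed fixed-point type of item 9.
    public class TypePromotion {

        /** Returns true if a column declared as 'from' may later be redeclared as 'to'. */
        public static boolean isAllowed(String from, String to) {
            if (from.equals(to)) return true;
            if (from.equals("int")    && (to.equals("bigint") || to.equals("double") || to.equals("decimal"))) return true;
            if (from.equals("bigint") && (to.equals("double") || to.equals("decimal"))) return true;
            return false;
        }

        public static void main(String[] args) {
            System.out.println(isAllowed("int", "bigint"));   // true: widening, no restatement needed
            System.out.println(isAllowed("bigint", "int"));   // false: narrowing would lose data
        }
    }
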
Wishlist

A discussion on hcatalog-dev about the future project goals yielded a number of good ideas. Here's an attempt to summarize the suggestions; please update with new ideas, or correct ones I didn't transfer correctly.

Multiple cluster capabilities:

  • Process data from one cluster and store it into another cluster, i.e., HCatLoader() reads from one cluster's hcat server and HCatStorer() writes to another hcat server (a configuration sketch follows this list).
  • HCat Metastore server to handle metadata for multiple clusters. We would eventually like to have one hcat server per colo. (Question: didn't the Hive folks look into this and abandon that effort? Any clue why?)
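
Below is a sketch of one way a cross-cluster job might be wired, assuming the read and write paths could each be handed their own metastore location. Today the single hive.metastore.uris setting comes from the job configuration, so the per-side override shown here does not exist yet; the host names are placeholders.

    // Sketch only: per-side metastore configuration for a cross-cluster copy job.
    import org.apache.hadoop.conf.Configuration;

    public class CrossClusterConfig {
        public static void main(String[] args) {
            Configuration source = new Configuration();
            source.set("hive.metastore.uris", "thrift://hcat.colo-a.example.com:9083");

            Configuration target = new Configuration();
            target.set("hive.metastore.uris", "thrift://hcat.colo-b.example.com:9083");

            // A multi-cluster-aware HCatLoader/HCatStorer (or HCatInputFormat/HCatOutputFormat)
            // would read the table definition through 'source' and create partitions through
            // 'target' within the same job.
        }
    }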

Export/Import:

  • Ability to import/export metadata out of hcat. Some level of support for this was added in hcat 0.2 but is currently broken. This can be useful for users backing up their hcat server metadata, importing it into another hcat metastore server, etc., instead of having to replay add/alter partitions. For example: we would like to move a project from one cluster to another, and it has 1+ years worth of data in the hcat server. Copying the data can be done easily with distcp by copying the top-level table directory, but copying the metadata is going to be more cumbersome. A rough export sketch follows below.
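
As a rough illustration of the export half, the sketch below dumps a table's definition and partition metadata to JSON using the Thrift metastore client directly; a real export/import tool would live in HCatalog itself and handle the import/replay side as well. Signatures are from the Hive 0.8/0.9-era client and may differ by version.

    // Rough sketch only: dump one table's metadata as JSON lines that an import
    // tool could replay against another metastore.
    import java.util.List;
    import org.apache.hadoop.hive.conf.HiveConf;
    import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
    import org.apache.hadoop.hive.metastore.api.Partition;
    import org.apache.thrift.TSerializer;
    import org.apache.thrift.protocol.TSimpleJSONProtocol;

    public class ExportTableMetadata {
        public static void main(String[] args) throws Exception {
            String db = args[0], table = args[1];

            HiveMetaStoreClient client = new HiveMetaStoreClient(new HiveConf());
            TSerializer json = new TSerializer(new TSimpleJSONProtocol.Factory());

            // Table definition first, then every partition.
            System.out.println(new String(json.serialize(client.getTable(db, table))));
            List<Partition> parts = client.listPartitions(db, table, (short) -1);
            for (Partition p : parts) {
                System.out.println(new String(json.serialize(p)));
            }
            client.close();
        }
    }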

Data Provenance/Lineage and Data Discovery:

  • Ability to store data provenance/lineage apart from statistics on the data.
  • Ability to discover data. For example: if a new user needs to know where click data for search ads is stored, they need to go through twikis or mail user lists to find where exactly it is stored. We need the ability to query on keywords/producer of data and find which table contains the data; a lookup sketch follows below.
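
The sketch below approximates the kind of keyword lookup a discovery feature might enable, by scanning table properties (e.g. a comment set at create time) through the Thrift metastore client. A real discovery API would index much richer metadata (producer, lineage, column descriptions) and would not require a full scan of every table.

    // Sketch only: naive keyword search over table comments in the metastore.
    import java.util.Map;
    import org.apache.hadoop.hive.conf.HiveConf;
    import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;

    public class FindTablesByKeyword {
        public static void main(String[] args) throws Exception {
            String keyword = args[0].toLowerCase();
            HiveMetaStoreClient client = new HiveMetaStoreClient(new HiveConf());

            for (String db : client.getAllDatabases()) {
                for (String table : client.getAllTables(db)) {
                    Map<String, String> props = client.getTable(db, table).getParameters();
                    String comment = props.get("comment");
                    if (comment != null && comment.toLowerCase().contains(keyword)) {
                        System.out.println(db + "." + table + " : " + comment);
                    }
                }
            }
            client.close();
        }
    }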

Production issues:

  • Consolidate the many service components into a smaller number to simplify running HCat in production (HiveMetaStore, Hive thrift service, HiveWebInterface, webhcat).