
Apache HCatalog's behaviour can be modified through a few config parameters specified in jobs submitted to it. This document details the various knobs available to users and what they accomplish.

 

Setup:

The properties described in this page are job-level properties, set on HCat through the jobConf passed in to it. This means that this page is relevant to Pig users of HCatLoader/HCatStorer and to MapReduce users of HCatInputFormat/HCatOutputFormat. For a MapReduce user of HCat, these must be present as key-values in the Configuration (JobConf/Job/JobContext) used to instantiate HCatInputFormat or HCatOutputFormat. For Pig users, these parameters are set using the Pig "set" command before instantiating an HCatLoader/HCatStorer.
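
For example, a MapReduce user might set one of these properties on the job's Configuration before initializing HCatInputFormat. A minimal sketch, assuming a table "mytable" in the "default" database (the property value and table names are placeholders, not recommendations):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hive.hcatalog.mapreduce.HCatInputFormat;

    public class HCatJobSetup {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // HCat job-level properties must be present in the Configuration
        // before HCatInputFormat is initialized.
        conf.set("hcatalog.hive.client.cache.expiry.time", "300"); // illustrative value
        Job job = new Job(conf, "hcat-read");
        HCatInputFormat.setInput(job, "default", "mytable"); // placeholder db/table
        job.setInputFormatClass(HCatInputFormat.class);
      }
    }

A Pig user would set the same property with the Pig "set" command in the script, before the load/store statement that references HCatLoader/HCatStorer.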

 

Storage directives: 

Property: hcat.pig.storer.external.location
Default: not set
Description: An override specifying where HCatStorer will write. It is set from Pig jobs, either directly by the user or by using org.apache.hive.hcatalog.pig.HCatStorerWrapper. HCat will write to this directory rather than to the table/partition directory specified or calculable from the metadata: it is used in lieu of the table directory for a table-level write (an unpartitioned table) or in lieu of the partition directory for a partition-level write. This parameter is not used for dynamic-partitioning jobs, which have multiple write destinations.

Property: hcat.dynamic.partitioning.custom.pattern
Default: not set
Description: For dynamic partitioning jobs, specifying a single custom directory is not enough, since such a job writes to multiple destinations; instead of a directory, this parameter takes a pattern. For example, consider a table partitioned by the keys country and state, with a root directory location of /apps/hive/warehouse/geo/ . A dynamic partition write of the partitions (country=US, state=CA) and (country=IN, state=KA) would by default create two directories: /apps/hive/warehouse/geo/country=US/state=CA/ and /apps/hive/warehouse/geo/country=IN/state=KA/ . If we instead specified hcat.dynamic.partitioning.custom.pattern="/ext/geo/${country}-${state}", it would create the two partition directories /ext/geo/US-CA and /ext/geo/IN-KA . Thus, the pattern specifies a custom directory location for all the writes, and each variable it contains is interpolated when the destination location for each partition is computed. See Dynamic Partitioning: External Tables for another example.
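
A sketch of the dynamic-partitioning case from MapReduce, assuming the geo table from the example above (the database/table names and pattern are illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hive.hcatalog.mapreduce.HCatOutputFormat;
    import org.apache.hive.hcatalog.mapreduce.OutputJobInfo;

    public class CustomPatternWrite {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Each ${var} in the pattern is interpolated with that partition
        // key's value for every partition the job writes.
        conf.set("hcat.dynamic.partitioning.custom.pattern",
                 "/ext/geo/${country}-${state}");
        Job job = new Job(conf, "hcat-dynamic-write");
        // Passing null partition values requests dynamic partitioning on
        // all partition keys; "default"/"geo" are placeholders.
        HCatOutputFormat.setOutput(job, OutputJobInfo.create("default", "geo", null));
        job.setOutputFormatClass(HCatOutputFormat.class);
      }
    }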

 

 

Cache behaviour directives:

HCatalog maintains a cache of HiveClients to talk to the metastore: one metastore client per thread, with a default expiry of 120 seconds. For users who wish to modify the behaviour of this cache, a few parameters are provided:

Property: hcatalog.hive.client.cache.expiry.time
Default: 120
Description: Overrides the cache expiry time. An int, specifying the number of seconds before a cached client expires.

Property: hcatalog.hive.client.cache.disabled
Default: false
Description: Disables the cache altogether. This is useful in highly multithreaded use cases.
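
For instance, a highly multithreaded job might disable the cache, or simply lengthen the expiry. A minimal sketch, with illustrative values:

    import org.apache.hadoop.conf.Configuration;

    public class CacheTuning {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Disable the per-thread HiveClient cache entirely:
        conf.set("hcatalog.hive.client.cache.disabled", "true");
        // ...or keep the cache but expire clients after 300s instead of 120s:
        // conf.set("hcatalog.hive.client.cache.expiry.time", "300");
      }
    }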

Input Split Generation Behaviour: 

Property: hcat.desired.partition.num.splits
Default: not set
Description: A hint that HCatalog passes on to underlying InputFormats, asking them to produce a "desired" number of splits per partition. This is useful when a partition consists of a few large files and we want to increase parallelism by increasing the number of splits generated. It is not yet useful for reducing the number of splits over a large number of files, and it is not useful at all when the job reads a large number of partitions. Note that this is merely an optimization hint; the underlying layer is not guaranteed to be capable of using it. The MapReduce parameters mapred.min.split.size and mapred.max.split.size can be used in conjunction with this parameter to tweak/optimize jobs.
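
A sketch of a job that hints at more splits per partition while bounding split sizes with the standard MapReduce knobs (all values illustrative):

    import org.apache.hadoop.conf.Configuration;

    public class SplitHints {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Hint: ask the underlying InputFormat for ~16 splits per partition.
        conf.set("hcat.desired.partition.num.splits", "16");
        // Optionally bound split sizes with the standard MapReduce knobs:
        conf.setLong("mapred.min.split.size", 64L * 1024 * 1024);   // 64 MB
        conf.setLong("mapred.max.split.size", 256L * 1024 * 1024);  // 256 MB
      }
    }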

 

Data Promotion Behaviour: 

 

In some cases a user of HCat (such as some older versions of Pig) does not support all the datatypes supported by Hive. For these cases, a few config parameters are provided to handle data promotions/conversions, allowing such users to read data through HCatalog. On the write side, the user is expected to pass in valid HCatRecords with correctly typed data.

Property: hcat.data.convert.boolean.to.integer
Default: false
Description: Promotes boolean to int on read from HCatalog.

Property: hcat.data.tiny.small.int.promotion
Default: false
Description: Promotes tinyint/smallint to int on read from HCatalog.
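
A reader that cannot handle boolean or tinyint/smallint columns might enable both promotions before initializing HCatInputFormat. A minimal sketch:

    import org.apache.hadoop.conf.Configuration;

    public class TypePromotions {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Present boolean columns as int on read:
        conf.set("hcat.data.convert.boolean.to.integer", "true");
        // Present tinyint/smallint columns as int on read:
        conf.set("hcat.data.tiny.small.int.promotion", "true");
      }
    }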

 

HCatRecordReader Error Tolerance Behaviour: 

 

While reading, data might understandably contain errors, and we may not want to abort a task over a couple of bad records. These parameters configure how many errors we will accept before failing the task.

Property: hcat.input.bad.record.threshold
Default: 0.0001f
Description: A float parameter. The default of 0.0001f means we can tolerate one bad record in every 10,000 rows without erroring out; any higher ratio fails the task.

Property: hcat.input.bad.record.min
Default: 2
Description: An int parameter: the minimum number of bad records that must be encountered before the hcat.input.bad.record.threshold parameter is applied. This prevents a single early bad record from aborting the task merely because the error ratio observed so far is too high.
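
A sketch of loosening the defaults for a known-dirty input (the values are illustrative, not recommendations):

    import org.apache.hadoop.conf.Configuration;

    public class BadRecordTolerance {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Tolerate up to 1 bad record per 1,000 rows instead of 1 per 10,000:
        conf.set("hcat.input.bad.record.threshold", "0.001");
        // Don't apply the ratio check until at least 10 bad records are seen:
        conf.set("hcat.input.bad.record.min", "10");
      }
    }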