Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Migrated to Confluence 5.3
Table of Contents

a

Sampling Syntax

Sampling Bucketized Table

...

In the following example the 3rd bucket out of the 32 buckets of the table source. 's' is the table alias.

Code Block
SELECT * 
FROM source TABLESAMPLE(BUCKET 3 OUT OF 32 ON rand()) s; 

Input pruning: Typically, TABLESAMPLE will scan the entire table and fetch the sample. But, that is not very efficient. Instead, the table can be created with a CLUSTERED BY clause which indicates the set of columns on which the table is hash-partitioned/clustered on. If the columns specified in the TABLESAMPLE clause match the columns in the CLUSTERED BY clause, TABLESAMPLE scans only the required hash-partitions of the table.

...

Code Block
    TABLESAMPLE(BUCKET 3 OUT OF 16 ON id) 

would pick out the 3rd and 19th clusters as each bucket would be composed of (32/16)=2 clusters.

...

Code Block
    TABLESAMPLE(BUCKET 3 OUT OF 64 ON id) 

would pick out half of the 3rd cluster as each bucket would be composed of (32/64)=1/2 of a cluster.

For information about creating bucketed tables with the CLUSTERED BY clause, see Create Table (especially Bucketed Sorted Tables) and Bucketed Tables.

Block Sampling

It is a feature that is still on trunk and is not yet in any release version.Block sampling is available starting with Hive 0.8. Addressed under JIRA - https://issues.apache.org/jira/browse/HIVE-2121

Code Block
block_sample: TABLESAMPLE (n PERCENT)

...

In the following example the input size 0.1% or more will be used for the query.

Code Block
SELECT * 
FROM source TABLESAMPLE(0.1 PERCENT) s; 

Sometimes you want to sample the same data with different blocks, you can change this seed number:

Code Block
set hive.sample.seednumber=<INTEGER>;

Or user can specify total length to be read, but it has same limitation with PERCENT sampling. (As of Hive 0.10.0 - https://issues.apache.org/jira/browse/HIVE-3401)

Code Block

block_sample: TABLESAMPLE (ByteLengthLiteral)

ByteLengthLiteral : (Digit)+ ('b' | 'B' | 'k' | 'K' | 'm' | 'M' | 'g' | 'G')

In the following example the input size 100M or more will be used for the query.

Code Block

SELECT *
FROM source TABLESAMPLE(100M) s;

Hive also supports limiting input by row count basis, but it acts differently with above two. First, it does not need CombineHiveInputFormat which means this can be used with non-native tables. Second, the row count given by user is applied to each split. So total row count can be vary by number of input splits. (As of Hive 0.10.0 - https://issues.apache.org/jira/browse/HIVE-3401)

Code Block

block_sample: TABLESAMPLE (n ROWS)

For example, the following query will take the first 10 rows from each input split.

Code Block

SELECT * FROM source TABLESAMPLE(10 ROWS);