Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Migrated to Confluence 5.3

...

In the following example the 3rd bucket out of the 32 buckets of the table source. 's' is the table alias.

Code Block
SELECT * 
FROM source TABLESAMPLE(BUCKET 3 OUT OF 32 ON rand()) s; 

Input pruning: Typically, TABLESAMPLE will scan the entire table and fetch the sample. But, that is not very efficient. Instead, the table can be created with a CLUSTERED BY clause which indicates the set of columns on which the table is hash-partitioned/clustered on. If the columns specified in the TABLESAMPLE clause match the columns in the CLUSTERED BY clause, TABLESAMPLE scans only the required hash-partitions of the table.

...

Code Block
    TABLESAMPLE(BUCKET 3 OUT OF 16 ON id) 

would pick out the 3rd and 19th clusters as each bucket would be composed of (32/16)=2 clusters.

...

Code Block
    TABLESAMPLE(BUCKET 3 OUT OF 64 ON id) 

would pick out half of the 3rd cluster as each bucket would be composed of (32/64)=1/2 of a cluster.

For information about creating bucketed tables with the CLUSTERED BY clause, see Create Table (especially Bucketed Sorted Tables) and Bucketed Tables.

Block Sampling

Block sampling is available starting with Hive 0.8. Addressed under JIRA - https://issues.apache.org/jira/browse/HIVE-2121Image Removed

Code Block
block_sample: TABLESAMPLE (n PERCENT)

...

In the following example the input size 0.1% or more will be used for the query.

Code Block
SELECT * 
FROM source TABLESAMPLE(0.1 PERCENT) s; 

Sometimes you want to sample the same data with different blocks, you can change this seed number:

Code Block
set hive.sample.seednumber=<INTEGER>;

Or user can specify total length to be read, but it has same limitation with PERCENT sampling. (As of Hive 0.10.0 - https://issues.apache.org/jira/browse/HIVE-3401)

Code Block

block_sample: TABLESAMPLE (ByteLengthLiteral)

ByteLengthLiteral : (Digit)+ ('b' | 'B' | 'k' | 'K' | 'm' | 'M' | 'g' | 'G')

In the following example the input size 100M or more will be used for the query.

Code Block

SELECT *
FROM source TABLESAMPLE(100M) s;

Hive also supports limiting input by row count basis, but it acts differently with above two. First, it does not need CombineHiveInputFormat which means this can be used with non-native tables. Second, the row count given by user is applied to each split. So total row count can be vary by number of input splits. (As of Hive 0.10.0 - https://issues.apache.org/jira/browse/HIVE-3401)

Code Block

block_sample: TABLESAMPLE (n ROWS)

For example, the following query will take the first 10 rows from each input split.

Code Block

SELECT * FROM source TABLESAMPLE(10 ROWS);