...
In the following example the 3rd bucket out of the 32 buckets of the table source. 's' is the table alias.
Code Block |
---|
SELECT * FROM source TABLESAMPLE(BUCKET 3 OUT OF 32 ON rand()) s; |
Input pruning: Typically, TABLESAMPLE will scan the entire table and fetch the sample. But, that is not very efficient. Instead, the table can be created with a CLUSTERED BY clause which indicates the set of columns on which the table is hash-partitioned/clustered on. If the columns specified in the TABLESAMPLE clause match the columns in the CLUSTERED BY clause, TABLESAMPLE scans only the required hash-partitions of the table.
...
Code Block |
---|
TABLESAMPLE(BUCKET 3 OUT OF 16 ON id)
|
would pick out the 3rd and 19th clusters as each bucket would be composed of (32/16)=2 clusters.
...
Code Block |
---|
TABLESAMPLE(BUCKET 3 OUT OF 64 ON id)
|
would pick out half of the 3rd cluster as each bucket would be composed of (32/64)=1/2 of a cluster.
For information about creating bucketed tables with the CLUSTERED BY clause, see Create Table (especially Bucketed Sorted Tables) and Bucketed Tables.
Block Sampling
Block sampling is available starting with Hive 0.8. Addressed under JIRA - https://issues.apache.org/jira/browse/HIVE-2121
Code Block |
---|
block_sample: TABLESAMPLE (n PERCENT) |
...
In the following example the input size 0.1% or more will be used for the query.
Code Block |
---|
SELECT * FROM source TABLESAMPLE(0.1 PERCENT) s; |
Sometimes you want to sample the same data with different blocks, you can change this seed number:
Code Block |
---|
set hive.sample.seednumber=<INTEGER>;
|
Or user can specify total length to be read, but it has same limitation with PERCENT sampling. (As of Hive 0.10.0 - https://issues.apache.org/jira/browse/HIVE-3401)
Code Block |
---|
block_sample: TABLESAMPLE (ByteLengthLiteral)
ByteLengthLiteral : (Digit)+ ('b' | 'B' | 'k' | 'K' | 'm' | 'M' | 'g' | 'G')
|
In the following example the input size 100M or more will be used for the query.
Code Block |
---|
SELECT *
FROM source TABLESAMPLE(100M) s;
|
Hive also supports limiting input by row count basis, but it acts differently with above two. First, it does not need CombineHiveInputFormat which means this can be used with non-native tables. Second, the row count given by user is applied to each split. So total row count can be vary by number of input splits. (As of Hive 0.10.0 - https://issues.apache.org/jira/browse/HIVE-3401)
Code Block |
---|
block_sample: TABLESAMPLE (n ROWS)
|
For example, the following query will take the first 10 rows from each input split.
Code Block |
---|
SELECT * FROM source TABLESAMPLE(10 ROWS);
|