Objectives

HCatMix is the performance testing framework for HCatalog.

The objectives are to:

  • Establish a performance baseline and monitor changes in performance as new releases are made
  • Measure the overhead of using HCatalog
  • Find the limitations of HCatalog in terms of the number of parallel connections that can be handled, the number of allowed partitions, etc.

In order to meet the above objectives, the following would be measured:

  • Time taken for the basic use case operations
  • Throughput under high concurrent usage
  • The overhead of using the HCatalog loader/storer in Pig

Touch points

The following are the touch points where HCatalog code gets executed when HCatalog is used from Pig; other Pig operations do not trigger anything in HCatalog. A minimal script exercising all three is sketched after the list.

  • Using HCatStorer
  • Using HCatLoader
  • A filter by clause on a partitioned table
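
As an illustration, the following minimal Pig script exercises all three touch points. The table name default.page_views and the partition column action are hypothetical, modeled on the setup example below:

A = LOAD 'default.page_views' USING org.apache.hcatalog.pig.HCatLoader();           -- touch point: HCatLoader
B = FILTER A BY action == 1;                                                        -- touch point: filter by on a partition column
STORE B INTO 'default.page_views_out' USING org.apache.hcatalog.pig.HCatStorer();   -- touch point: HCatStorer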

Test Scenario

Stress Testing

To test the performance of HCatalog, load would be generated in the following categories:
1. Tables with a huge amount of data
2. Tables with a large number of partitions

Test setup

Test setup needs to perform two tasks:

  • Create HCatalog tables by providing the schema
  • Generate data that conforms to the schema, to be loaded into HCatalog

Both of these are driven by a configuration file. The following is an example of the setup XML file.

HCatMix Test Setup Configuration

<database>
    <tables>
        <table>
            <namePrefix>page_views_1brandnew</namePrefix>
            <dbName>default</dbName>
            <columns>
                <column>
                    <name>user</name>
                    <type>STRING</type>
                    <avgLength>20</avgLength>
                    <distribution>zipf</distribution>
                    <cardinality>1000000</cardinality>
                    <percentageNull>10</percentageNull>
                </column>
                <column>
                    <name>timespent</name>
                    <type>INT</type>
                    <distribution>zipf</distribution>
                    <cardinality>5000000</cardinality>
                    <percentageNull>25</percentageNull>
                </column>
                <column>
                    <name>query_term</name>
                    <type>STRING</type>
                    <avgLength>5</avgLength>
                    <distribution>zipf</distribution>
                    <cardinality>10000</cardinality>
                    <percentageNull>0</percentageNull>
                </column>
                 .
                 .
                 .
            </columns>
            <partitions>
                <column>
                    <name>timestamp</name>
                    <type>INT</type>
                    <distribution>zipf</distribution>
                    <cardinality>1000</cardinality>
                </column>
                <column>
                    <name>action</name>
                    <type>INT</type>
                    <distribution>zipf</distribution>
                    <cardinality>8</cardinality>
                </column>
                 .
                 .
                 .
            </partitions>
            <instances>
                <instance>
                    <size>1000000</size>
                    <count>1</count>
                </instance>
                <instance>
                    <size>100000</size>
                    <count>1</count>
                </instance>
                 .
                 .
                 .
            </instances>
        </table>
    </tables>
</database>

A column/partition has the following details:

  • name: name of the column
  • type: type of the data (string/int/map etc.)
  • avgLength: average length, if the type is string
  • distribution: distribution type, either =uniform= or =zipf=, to generate data that follows [http://en.wikipedia.org/wiki/Zipf's_law Zipf's distribution]
  • cardinality: size of the sample space
  • percentageNull: what percentage of the values should be null
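
For reference, =zipf= means the k-th most frequent of the N (= =cardinality=) distinct values is drawn with probability roughly p(k) = (1/k^s) / (1/1^s + 1/2^s + ... + 1/N^s), where the classic case is s = 1; the exponent actually used by the generator is not specified in this document.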

The instances section defines how many instances of a table with the same specification are to be created, and the number of rows for each of them.

Test plan

  • There would be four table sizes: 100MB, 5GB, 50GB and 100GB. For each size there would be three tables: one with no partition columns, one with 3 partition columns and one with 5 partition columns.

     

Data   | Number of partitions | Tables with 0 partition columns | Tables with 3 partition columns | Tables with 5 partition columns
-------|----------------------|---------------------------------|---------------------------------|--------------------------------
100MB  | 100                  | 1                               | 1                               | 1
5GB    | 5k                   | 1                               | 1                               | 1
50GB   | 50k                  | 1                               | 1                               | 1
100GB  | 100k                 | 1                               | 1                               | 1

The following Pig scripts would be executed and the time taken monitored. The Pig scripts would have only load, store and =filter by= operations (sketches of the comparison scripts follow the list):
1. Load data using PigStorage() and store it into an HCatalog table using HCatStorer(). Compare this to a Pig script that does both the load and the store using PigStorage(). This gives the overhead of HCatStorer().
2. Use dynamic partitioning.
3. Use static partitioning.
4. Load data from an HCatalog table using HCatLoader() and store it using PigStorage(). Compare this to a Pig script that does both the load and the store using PigStorage(). This gives the overhead of HCatLoader().
5. Use =filter by= on one of the partition columns of the tables, and compare this with =filter by= on plain files.
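
For illustration, minimal sketches of these scripts follow. The input path /data/page_views and the table name default.page_views_1brandnew_0 are hypothetical (the table is modeled on the setup example above):

-- Script A (baseline): load and store using PigStorage() only.
raw = LOAD '/data/page_views' USING PigStorage(',') AS (user:chararray, timespent:int, query_term:chararray);
STORE raw INTO '/data/page_views_copy' USING PigStorage(',');

-- Script B: same load, but store into an HCatalog table.
-- (time of B) - (time of A) approximates the HCatStorer() overhead.
-- HCatStorer() with no argument uses dynamic partitioning; a static
-- partition can be given instead, e.g. HCatStorer('action=1').
raw = LOAD '/data/page_views' USING PigStorage(',') AS (user:chararray, timespent:int, query_term:chararray);
STORE raw INTO 'default.page_views_1brandnew_0' USING org.apache.hcatalog.pig.HCatStorer();

-- Script C: load via HCatLoader(), store using PigStorage().
-- (time of C) - (time of A) approximates the HCatLoader() overhead.
hcat = LOAD 'default.page_views_1brandnew_0' USING org.apache.hcatalog.pig.HCatLoader();
STORE hcat INTO '/data/page_views_out' USING PigStorage(',');

-- Script D: =filter by= on a partition column, to be compared with the
-- same filter over plain files loaded with PigStorage().
hcat = LOAD 'default.page_views_1brandnew_0' USING org.apache.hcatalog.pig.HCatLoader();
flt = FILTER hcat BY action == 1;
STORE flt INTO '/data/page_views_action1' USING PigStorage(',');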

Repeat each experiment 10 times for each of the tables mentioned above and find the average time.

For the concurrency test, a client would issue show partitions repeatedly, and the number of such concurrent clients would be increased from 1 to 255 in steps of 5. The throughput would be recorded.

Questions

  • Do these comparisons need to be done on RCFile vs. text files?
  • Do we need a clean setup for each of the experiments, or is generating the data once and then running the tests okay?