Objectives

HCatMix is the performance testing framework for HCatalog.

The objectives are to:

  • Establish a performance baseline and monitor changes in performance as new releases are made
  • Measure the overhead of using HCatalog
  • Find the limitations of HCatalog in terms of the number of parallel connections that can be handled, the number of allowed partitions, etc.

In order to meet the above objectives, the following would be measured:

  • Time taken for the basic use case operations
  • Throughput under high concurrent usage
  • The overhead of using the HCatalog loader/storer in Pig

Touch points

The following are the touch points where HCatalog code gets executed when HCatalog is used from Pig; other Pig operations do not trigger anything in HCatalog. A minimal script exercising all three is sketched after the list.

  • Using HCatStorer
  • Using HCatLoader
  • A filter by clause on a partitioned table
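
As an illustration, the following minimal Pig script exercises all three touch points. The table name default.page_views and the partition column action are hypothetical, modeled on the setup example below:

A = LOAD 'default.page_views' USING org.apache.hcatalog.pig.HCatLoader();           -- touch point: HCatLoader
B = FILTER A BY action == 1;                                                        -- touch point: filter by on a partition column
STORE B INTO 'default.page_views_out' USING org.apache.hcatalog.pig.HCatStorer();   -- touch point: HCatStorer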

Test Scenario

Stress Testing

To test the performance of HCatalog, load would be generated in the following categories:
1. Tables with a huge amount of data
2. Tables with a large number of partitions

Test setup

Test setup needs to perform two tasks:

  • Create HCatalog tables by providing the schema
  • Generate data that conforms to the schema, to be loaded into HCatalog

Both of these are driven by a configuration file. The following is an example of the setup XML file.

HCatMix Test Setup Configuration

<database>
    <tables>
        <table>
            <namePrefix>page_views_1brandnew</namePrefix>
            <dbName>default</dbName>
            <columns>
                <column>
                    <name>user</name>
                    <type>STRING</type>
                    <avgLength>20</avgLength>
                    <distribution>zipf</distribution>
                    <cardinality>1000000</cardinality>
                    <percentageNull>10</percentageNull>
                </column>
                <column>
                    <name>timespent</name>
                    <type>INT</type>
                    <distribution>zipf</distribution>
                    <cardinality>5000000</cardinality>
                    <percentageNull>25</percentageNull>
                </column>
                <column>
                    <name>query_term</name>
                    <type>STRING</type>
                    <avgLength>5</avgLength>
                    <distribution>zipf</distribution>
                    <cardinality>10000</cardinality>
                    <percentageNull>0</percentageNull>
                </column>
                 .
                 .
                 .
            </columns>
            <partitions>
                <column>
                    <name>timestamp</name>
                    <type>INT</type>
                    <distribution>zipf</distribution>
                    <cardinality>1000</cardinality>
                </column>
                <column>
                    <name>action</name>
                    <type>INT</type>
                    <distribution>zipf</distribution>
                    <cardinality>8</cardinality>
                </column>
                 .
                 .
                 .
            </partitions>
            <instances>
                <instance>
                    <size>1000000</size>
                    <count>1</count>
                </instance>
                <instance>
                    <size>100000</size>
                    <count>1</count>
                </instance>
                 .
                 .
                 .
            </instances>
        </table>
    </tables>
</database>

A column/partition has the following details:

  • name: name of the column
  • type: type of the data (string/int/map etc.)
  • avgLength: average length, if the type is string
  • distribution: distribution type, either =uniform= or =zipf=, to generate data that follows [http://en.wikipedia.org/wiki/Zipf's_law Zipf's distribution]
  • cardinality: size of the sample space
  • percentageNull: what percentage of the values should be null
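
For reference, =zipf= means the k-th most frequent of the N (= =cardinality=) distinct values is drawn with probability roughly p(k) = (1/k^s) / (1/1^s + 1/2^s + ... + 1/N^s), where the classic case is s = 1; the exponent actually used by the generator is not specified in this document.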

The instances section defines how many instances of a table with the same specification are to be created, and the number of rows for each of them.

Test plan

  • There would be four table sizes: 100MB, 5GB, 50GB and 100GB. For each size there would be three tables: one with no partition columns, one with 3 partition columns and one with 5 partition columns.

     

Data   | Number of partitions | Tables with 0 partition columns | Tables with 3 partition columns | Tables with 5 partition columns
-------|----------------------|---------------------------------|---------------------------------|--------------------------------
100MB  | 100                  | 1                               | 1                               | 1
5GB    | 5k                   | 1                               | 1                               | 1
50GB   | 50k                  | 1                               | 1                               | 1
100GB  | 100k                 | 1                               | 1                               | 1

The following Pig scripts would be executed and the time taken monitored. The Pig scripts would have only load, store and =filter by= operations (sketches of the comparison scripts follow the list):
1. Load data using PigStorage() and store it into an HCatalog table using HCatStorer(). Compare this to a Pig script that does both the load and the store using PigStorage(). This gives the overhead of HCatStorer().
2. Use dynamic partitioning.
3. Use static partitioning.
4. Load data from an HCatalog table using HCatLoader() and store it using PigStorage(). Compare this to a Pig script that does both the load and the store using PigStorage(). This gives the overhead of HCatLoader().
5. Use =filter by= on one of the partition columns of the tables, and compare this with =filter by= on plain files.
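
For illustration, minimal sketches of these scripts follow. The input path /data/page_views and the table name default.page_views_1brandnew_0 are hypothetical (the table is modeled on the setup example above):

-- Script A (baseline): load and store using PigStorage() only.
raw = LOAD '/data/page_views' USING PigStorage(',') AS (user:chararray, timespent:int, query_term:chararray);
STORE raw INTO '/data/page_views_copy' USING PigStorage(',');

-- Script B: same load, but store into an HCatalog table.
-- (time of B) - (time of A) approximates the HCatStorer() overhead.
-- HCatStorer() with no argument uses dynamic partitioning; a static
-- partition can be given instead, e.g. HCatStorer('action=1').
raw = LOAD '/data/page_views' USING PigStorage(',') AS (user:chararray, timespent:int, query_term:chararray);
STORE raw INTO 'default.page_views_1brandnew_0' USING org.apache.hcatalog.pig.HCatStorer();

-- Script C: load via HCatLoader(), store using PigStorage().
-- (time of C) - (time of A) approximates the HCatLoader() overhead.
hcat = LOAD 'default.page_views_1brandnew_0' USING org.apache.hcatalog.pig.HCatLoader();
STORE hcat INTO '/data/page_views_out' USING PigStorage(',');

-- Script D: =filter by= on a partition column, to be compared with the
-- same filter over plain files loaded with PigStorage().
hcat = LOAD 'default.page_views_1brandnew_0' USING org.apache.hcatalog.pig.HCatLoader();
flt = FILTER hcat BY action == 1;
STORE flt INTO '/data/page_views_action1' USING PigStorage(',');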

Repeat each experiment 10 times for each of the tables mentioned above and find the average time.

For the concurrency test, a client would issue show partitions repeatedly, and the number of such concurrent clients would be increased from 1 to 255 in steps of 5. The throughput would be recorded.

Questions

  • Do these comparisons need to be done on RCFile vs. text files?
  • Do we need a clean setup for each of the experiments, or is generating the data once and then running the tests okay?