Background

In the accuracy measure algorithm, we optimize the O(m*n) matching problem by splitting the data asset into smaller working sets.

The ideal split strategy is to find a combination of columns that makes our data asset unique and uniformly distributed.

We introduce groupby columns in our implementation to fulfill this strategy.

Without any grouping, each row of source data is matched against every row of target data, which is O(m*n).

If we group by some columns first, the matching measure only needs to process data within smaller bucketed groups.

Therefore, if users can provide appropriate groupby columns for the target dataset, it should help improve the performance of our measure process.
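
To make the bucketing idea concrete, here is a minimal Spark sketch (table names, column names, and the left_anti formulation are illustrative assumptions, not Griffin's actual internals). Joining on the matched columns (the groupby column "id" among them) lets Spark hash-partition both sides by those keys, so each source row is compared only against target rows in the same bucket rather than the full O(m*n) cross product.

    import org.apache.spark.sql.{DataFrame, SparkSession}

    // Illustrative sketch, not Griffin's actual code: count source rows
    // that have no exact match in target. The join keys bucket both sides,
    // so matching happens within groups instead of across all rows.
    object GroupedMatchSketch {
      def missCount(source: DataFrame, target: DataFrame, matchCols: Seq[String]): Long =
        // left_anti keeps the source rows that have no matching target row
        source.join(target, matchCols, "left_anti").count()

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("grouped-match-sketch").getOrCreate()
        val source = spark.table("source_table") // hypothetical table names
        val target = spark.table("target_table")
        println(s"unmatched source rows: ${missCount(source, target, Seq("id", "payload"))}")
        spark.stop()
      }
    }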

However, some users have no insight into the data, so Apache Griffin will first try to find out which fields are appropriate as groupby columns, to improve process performance.
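
As a hedged sketch of one plausible selection heuristic (an assumption for illustration, not Griffin's actual algorithm): rank candidate columns by their average group size, estimated with Spark's approx_count_distinct, and prefer the columns whose values are closest to unique.

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.approx_count_distinct

    // Hypothetical heuristic: columns whose values are closer to unique
    // produce smaller buckets and therefore faster matching.
    def rankGroupByCandidates(df: DataFrame): Seq[(String, Double)] = {
      val total = df.count().toDouble
      df.columns.toSeq
        .map { c =>
          val distinct = df.agg(approx_count_distinct(c)).first().getLong(0)
          (c, total / math.max(distinct, 1L)) // average rows per group
        }
        .sortBy(_._2)                         // smallest groups first
    }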

The following experiment investigates how to select groupby columns in Apache Griffin.

Benchmark experiment on groupby column selection for accuracy

Step 1. Generate test data

Here we create 2 types of test data whose groupby column follows different distributions, as listed below.

Data Source Type    Groupby Column Distribution Type
Uniform             uniform
Exponential         exponential

All other fields are unique in each row, to ensure the complexity of the find process is O(n).
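
Below is a hedged sketch of how such test data could be generated (the real generator is not shown on this page; table names, column names, and the geometric approximation of the exponential distribution are assumptions).

    import org.apache.spark.sql.SparkSession
    import scala.util.Random

    // Illustrative Step 1: the uniform dataset repeats each id exactly
    // `frequency` times; the exponential dataset draws ids from a
    // geometric-like distribution, so a few ids receive most of the rows.
    // The `payload` column is unique per row, matching the note above.
    object TestDataGen {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("griffin-testdata").getOrCreate()
        import spark.implicits._

        val rows = 100000
        val frequency = 100 // rows per id in the uniform dataset
        val numFiles = 100  // "split into 100 files" in the experiments below

        val uniform = (0 until rows)
          .map(i => (i / frequency, s"payload-$i")) // each id appears `frequency` times
          .toDF("id", "payload")

        val rnd = new Random(42)
        val exponential = (0 until rows)
          .map { i =>
            var id = 0
            while (rnd.nextDouble() < 0.5) id += 1 // P(id = k) halves as k grows
            (id, s"payload-$i")
          }
          .toDF("id", "payload")

        // The writes below correspond to Step 2.
        uniform.repartition(numFiles).write.mode("overwrite").saveAsTable("uniform_source")
        exponential.repartition(numFiles).write.mode("overwrite").saveAsTable("exponential_source")
        spark.stop()
      }
    }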

Step 2. Insert the test data into Hive tables; the same tables serve as the target data.

Step 3. Run the Griffin measure algorithm on these data sets and record the calculation time.
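
A minimal sketch of the timing step, assuming the stand-in "measure" is the grouped join count from the earlier sketch rather than the real Griffin measure job (table and column names are illustrative):

    import org.apache.spark.sql.SparkSession

    // Time one accuracy-style matching pass over a source/target table pair.
    def timeMeasure(spark: SparkSession, sourceTable: String, targetTable: String): Double = {
      val source = spark.table(sourceTable)
      val target = spark.table(targetTable)
      val start = System.nanoTime()
      val miss = source.join(target, Seq("id", "payload"), "left_anti").count()
      val seconds = (System.nanoTime() - start) / 1e9
      println(f"$sourceTable: miss=$miss, time=$seconds%.3fs")
      seconds
    }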


Benchmark Results

The frequency distribution of the groupby column "id" and the calculation results for the 2 types of test data are as follows:

Uniform Data Source:

The "id" values are generated evenly: each group has the same number of rows, and that number (the frequency) can be set manually. (In particular, when frequency = 1, all rows are unique.)

There are 3 experiments for the uniform data source, each run under different conditions. We introduce the results one by one.

Exp 1. Using 100K rows of data, split into 100 files.

When the frequency of the "id" column is 1, meaning all the data is unique, the calculation time is 11.576s.

When the frequency of the "id" column is 100, meaning each id has 100 rows, the calculation time is 15.546s.

Exp 2. Using 1M rows of data, split into 100 files.

The smaller the frequency of each id, the faster the calculation.

Exp 3. Using 1M rows of data, split into 8 files.

The smaller the frequency of each id, the faster the calculation.

Furthermore, a smaller number of partitions leads to faster processing.


Exponential Data Source:

The "id" values are generated exponentially: a few groups have many rows, while most groups have far fewer.

There is only 1 experiment for the exponential data source.

Exp 1. Using 100K rows of data, split into 100 files.

Even for only 100K rows of data, the calculation time is 251.878s, much slower than for the uniformly distributed data.

This means that a few ids with peak frequencies lead to a long calculation time.
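
For a rough, illustrative cost comparison (the group sizes here are assumptions, not measured values): in the uniform dataset with 100K rows and frequency 100, matching costs about 1,000 groups x 100 x 100 = 10^7 comparisons in total; if the heaviest id in the exponential dataset covers 50K of the 100K rows, that single bucket alone costs about 50,000 x 50,000 = 2.5 x 10^9 comparisons. The largest group dominates the runtime.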


Conclusion

The best performance comes from the uniformly distributed data source, which means the measure works better when grouped by columns whose values are closer to unique.

Therefore, for Griffin's automatic selection of groupby columns, the best strategy is to group by all the exactly matched fields in the accuracy calculation; this gives the best performance.
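
As a final hedged sketch of this strategy (matchedFields is a hypothetical input standing for the exactly matched fields parsed from the accuracy rule; this is not Griffin's actual code):

    import org.apache.spark.sql.DataFrame

    // Compute the accuracy ratio while grouping by every exactly matched
    // field, so the matching work stays inside the smallest possible buckets.
    def accuracyWithAutoGroupBy(source: DataFrame, target: DataFrame,
                                matchedFields: Seq[String]): Double = {
      val total = source.count()
      val missed = source.join(target, matchedFields, "left_anti").count()
      1.0 - missed.toDouble / math.max(total, 1L)
    }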
