In this release, there are more than 80 new features and improvements and more than 100 bug fixes; please find the details at:

New Features

New data loading solution

The old CarbonData data loading solution depends on the Kettle engine, but Kettle is not designed for the big data domain and the code in that flow is hard to maintain. So in version 1.0, a new data loading solution without the Kettle dependency is added, which makes the flow more modular and improves performance.

Support Spark 2.1 integration in CarbonData

Spark 2.1 has added many features and improved performance, and CarbonData benefits from these after upgrading its integration.
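
For illustration, a carbon-aware SparkSession in the Spark 2.1 integration can be created roughly as follows (a minimal sketch; the master, app name, store path and table definition are example values):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.CarbonSession._   // brings getOrCreateCarbonSession into scope

    // Build a SparkSession backed by CarbonData (store path is an example value)
    val carbon = SparkSession
      .builder()
      .master("local")
      .appName("CarbonSpark21Example")
      .getOrCreateCarbonSession("hdfs://namenode:9000/carbon/store")

    carbon.sql("CREATE TABLE IF NOT EXISTS sales (id INT, name STRING) STORED BY 'carbondata'")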

Data update/delete SQL support

Users can now delete and update data in a carbon table using standard SQL syntax. This feature is currently supported in the Spark 1.5/1.6 integration and will be supported in the Spark 2.1 integration soon.
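
For example, assuming the carbon session from the sketch above and a hypothetical sales table, the statements look roughly like this (table and column names are assumptions):

    // Update selected rows, then delete them (hypothetical table and columns)
    carbon.sql("UPDATE sales SET (name) = ('unknown') WHERE id = 100")
    carbon.sql("DELETE FROM sales WHERE id = 100")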

Support adaptive data compression for int/bigint/decimal to increase compression ratio

This feature adapts the data to the smallest data type that fits the values, and it also supports a delta compression technique to reduce the store size.
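
The idea can be pictured with a small sketch (purely illustrative, not CarbonData's actual encoder): values are shifted by their minimum (delta compression) and then stored in the smallest integral type that still fits the remaining range.

    // Illustrative adaptive + delta encoding decision (not CarbonData's real encoder)
    def chooseStorage(values: Array[Long]): String = {
      val min = values.min
      val range = values.max - min          // delta compression: store value - min
      if (range <= Byte.MaxValue) "1 byte per value"
      else if (range <= Short.MaxValue) "2 bytes per value"
      else if (range <= Int.MaxValue) "4 bytes per value"
      else "8 bytes per value"
    }

    // bigint values with a small spread need only 1 byte each plus the stored minimum
    println(chooseStorage(Array(1000000001L, 1000000050L, 1000000100L)))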

Support to define Date/Timestamp format for different columns

Users can now provide a Date/Timestamp format for each column while loading data. The format for each Timestamp column can be defined in the CREATE TABLE DDL itself, and defaults are provided so that users can create tables with Timestamp columns without always having to define the Date/Timestamp format.
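
A load with per-column formats might look roughly like this (a sketch only; the exact option syntax should be checked against the DDL/DML documentation, and the carbon session, path, table and column names are assumptions):

    // Per-column Date/Timestamp formats supplied at load time (hypothetical table)
    carbon.sql("""
      LOAD DATA INPATH 'hdfs://namenode:9000/data/sales.csv' INTO TABLE sales
      OPTIONS('DATEFORMAT' = 'order_date:yyyy-MM-dd, update_time:yyyy-MM-dd HH:mm:ss')
    """)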

Implement LRU cache for B-Tree

The B-Tree in CarbonData keeps the block and blocklet information of carbon tables in memory. If the number of tables or the amount of data increases, there is a possibility of running out of memory. The LRU cache for the B-Tree now keeps only recently or frequently used block/blocklet information in memory and evicts the unused or less used entries.
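
The caching behaviour itself can be sketched with an access-ordered LinkedHashMap (purely illustrative, not CarbonData's internal implementation):

    import java.util.{LinkedHashMap => JLinkedHashMap}
    import java.util.Map.Entry

    // Illustrative LRU cache: evicts the least recently accessed entry once maxEntries is exceeded
    class LruCache[K, V](maxEntries: Int)
      extends JLinkedHashMap[K, V](16, 0.75f, /* accessOrder = */ true) {
      override def removeEldestEntry(eldest: Entry[K, V]): Boolean = size() > maxEntries
    }

    val blockInfoCache = new LruCache[String, String](maxEntries = 2)
    blockInfoCache.put("block-1", "blocklet metadata 1")
    blockInfoCache.put("block-2", "blocklet metadata 2")
    blockInfoCache.get("block-1")                        // block-1 becomes most recently used
    blockInfoCache.put("block-3", "blocklet metadata 3") // evicts block-2, the least recently used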

Performance Improvements

CarbonData V2 format to improve first time query performance

The V2 format is more organized and maintains less metadata (metadata is read on demand), so first-time queries are faster. It also has a lower IO cost compared to V1. Several test cases show that first-time query response time is reduced by around 50%.

Vectorized reader support

It reads the data in batches, column by column. This feature reduces GC time and improves performance during data scans.
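
The effect can be pictured with a small sketch (illustrative only, not the actual reader): instead of materialising one row object per record, values are consumed from a decoded column in fixed-size batches.

    // Illustrative batched, column-wise scan (not CarbonData's actual vectorized reader)
    val column: Array[Int] = (1 to 10000).toArray   // one decoded column of a blocklet
    val batchSize = 1024
    var offset = 0
    var total = 0L
    while (offset < column.length) {
      val end = math.min(offset + batchSize, column.length)
      var i = offset
      while (i < end) {     // tight loop over a primitive batch: no per-row objects, less GC
        total += column(i)
        i += 1
      }
      offset = end
    }
    println(total)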

Fast join using bucket table

This feature enables bucket table support in CarbonData. It can improve join query performance by avoiding shuffling when both tables are bucketed on the same column with the same number of buckets. It is supported in the Spark 2.1 integration.
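
A rough sketch of the DDL (property names per the bucketing documentation; the carbon session, table and column names are assumptions):

    // Both tables bucketed on user_id with the same number of buckets
    carbon.sql("""
      CREATE TABLE orders (order_id INT, user_id STRING, amount DOUBLE)
      STORED BY 'carbondata'
      TBLPROPERTIES ('BUCKETNUMBER' = '8', 'BUCKETCOLUMNS' = 'user_id')
    """)
    carbon.sql("""
      CREATE TABLE users (user_id STRING, name STRING)
      STORED BY 'carbondata'
      TBLPROPERTIES ('BUCKETNUMBER' = '8', 'BUCKETCOLUMNS' = 'user_id')
    """)
    // The equi-join on user_id can then avoid a shuffle
    carbon.sql("SELECT o.order_id, u.name FROM orders o JOIN users u ON o.user_id = u.user_id")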

Leveraging off-heap memory to reduce GC

By leveraging off-heap memory, it improves both loading and reading performance. In data loading it improves data sorting performance, and in reading it reduces GC overhead because the data is stored off-heap.
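
The general idea can be sketched with a direct ByteBuffer (illustrative only; CarbonData's own off-heap memory management is more involved):

    import java.nio.ByteBuffer

    // Illustrative off-heap storage: a direct buffer lives outside the JVM heap,
    // so its contents add no garbage collection pressure
    val rowCount = 1000000
    val bytesPerLong = java.lang.Long.BYTES
    val offHeap = ByteBuffer.allocateDirect(rowCount * bytesPerLong)

    for (i <- 0 until rowCount) offHeap.putLong(i * bytesPerLong, i.toLong)  // write
    println(offHeap.getLong(42 * bytesPerLong))                              // read back row 42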

Support single-pass loading

Currently data loading happens in two jobs (generate the dictionary first, then do the actual data loading). This feature enables a single job to finish the data loading, generating the dictionary on the fly. It improves performance for scenarios where a data load brings few incremental updates to the dictionary, which is usually the case after the initial load.
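
A single-pass load looks roughly as follows (option name per the data loading documentation; the carbon session, path and table name are assumptions):

    // Dictionary generation happens on the fly during this single load job
    carbon.sql("""
      LOAD DATA INPATH 'hdfs://namenode:9000/data/sales.csv' INTO TABLE sales
      OPTIONS('SINGLE_PASS' = 'TRUE')
    """)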

Support pre-generated dictionary for data loading

Users can reuse a previously generated dictionary; this feature also supports customized dictionaries provided by users to improve data load efficiency.
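
Roughly, a load with a user-supplied dictionary for one column might look like this (a sketch only; the option name and its column:path format should be verified against the DML documentation, and the carbon session, paths and names are assumptions):

    // Supply a pre-generated dictionary for the "name" column during load
    carbon.sql("""
      LOAD DATA INPATH 'hdfs://namenode:9000/data/sales.csv' INTO TABLE sales
      OPTIONS('COLUMNDICT' = 'name:hdfs://namenode:9000/dicts/name.dictionary')
    """)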