...

CarbonData is a high-performance data solution that supports various data analytic scenarios, including BI analysis, ad-hoc SQL query, fast filter lookup on detail records, streaming analytics, and so on. CarbonData has been deployed in many enterprise production environments. In one of the largest deployments, it supports queries on a single table holding 3 PB of data (more than 5 trillion records) with response times under 3 seconds.

...

CarbonData added multiple optimizations to improve query and compaction performance.

In this version of CarbonData, more than 78 JIRA tickets related to new features, improvements, and bugs have been resolved. The following is a summary.

CarbonData Core

Support

...

Custom Column Compressor

CarbonData supports customized column compressors, so users can plug in their own compressor implementation. To use a custom compressor, specify its full class name when creating the table or set it as a carbon property.
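To give a feel for what a user-supplied compressor could look like, here is a minimal sketch built on the JDK's `Deflater` and `Inflater`. The `ColumnCompressor` interface and the `DeflateColumnCompressor` class are hypothetical names made up for this illustration; they are not CarbonData's actual compressor interface, which should be taken from the CarbonData source. The table option or property key used to register the class is not named on this page, so it is not shown.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical interface for illustration only; NOT CarbonData's real compressor API.
interface ColumnCompressor {
    String getName();
    byte[] compress(byte[] input);
    byte[] uncompress(byte[] input);
}

// Hypothetical custom compressor backed by the JDK's DEFLATE implementation.
class DeflateColumnCompressor implements ColumnCompressor {
    @Override
    public String getName() {
        // A custom compressor is referred to by its full class name.
        return DeflateColumnCompressor.class.getName();
    }

    @Override
    public byte[] compress(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    @Override
    public byte[] uncompress(byte[] input) {
        Inflater inflater = new Inflater();
        inflater.setInput(input);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                // No progress and more input required means the stream is incomplete.
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalStateException("Truncated or invalid compressed data");
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("Invalid compressed data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }
}
```

A real implementation would follow whatever interface CarbonData defines for compressors; the point here is only the compress/uncompress round-trip contract that any such class has to honor.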

Performance Improvements

...

Optimized Carbondata Scan Performance

CarbonData scan performance is improved by avoiding multiple data copies in the vector flow. This is achieved by short-circuiting the read and vector-filling steps: data is filled directly into the vector after it is read from the file, without any intermediate copies.

Row-level filter processing is now handled in the execution engine; for the vector flow, CarbonData handles only blocklet and page pruning. This is controlled by the property carbon.push.rowfilters.for.vector, which defaults to false.

...

Optimized Compaction Performance

Compaction performance is optimized by prefetching data while reading carbon files.
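The general idea of prefetching can be sketched as an iterator that always keeps the next element loading in the background while the caller processes the current one. This is a generic illustration, not CarbonData's compaction code; `PrefetchingIterator` is a hypothetical name, and the sketch assumes the source never yields null elements.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;

// Hypothetical helper: wraps a (possibly slow) source iterator and loads one element ahead.
class PrefetchingIterator<T> implements Iterator<T> {
    private final Iterator<T> source;
    // Result of the background read that is currently in flight or already finished.
    private CompletableFuture<Optional<T>> pending;

    PrefetchingIterator(Iterator<T> source) {
        this.source = source;
        this.pending = fetchAsync();
    }

    // Reads the next element on a background thread; empty Optional means the source is exhausted.
    // Only one fetch is ever in flight, so the source is never accessed concurrently.
    private CompletableFuture<Optional<T>> fetchAsync() {
        return CompletableFuture.supplyAsync(
            () -> source.hasNext() ? Optional.of(source.next()) : Optional.<T>empty());
    }

    @Override
    public boolean hasNext() {
        return pending.join().isPresent();
    }

    @Override
    public T next() {
        Optional<T> current = pending.join();
        if (!current.isPresent()) {
            throw new NoSuchElementException();
        }
        // Start loading the following element while the caller works on the current one.
        pending = fetchAsync();
        return current.get();
    }
}
```

With a slow source (for example, one that reads a file block per call), the read of element N+1 overlaps with the processing of element N, which is where the gain comes from.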

Improved Blocklet DataMap Pruning in Driver

Blocklet DataMap pruning in the driver is improved by using multi-threaded processing.
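A minimal sketch of multi-threaded pruning, assuming each blocklet carries a min/max range for the filtered column: the blocklet list is split into chunks, each chunk is checked on a pool thread, and the surviving blocklets are merged in order. `BlockletStats` and `ParallelPruner` are hypothetical names for this illustration and do not mirror CarbonData's internal classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in for a blocklet's min/max index entry.
class BlockletStats {
    final String id;
    final long min;
    final long max;

    BlockletStats(String id, long min, long max) {
        this.id = id;
        this.min = min;
        this.max = max;
    }
}

class ParallelPruner {
    // Returns the ids of blocklets whose [min, max] range may contain `value`,
    // checking chunks of the list on up to `threads` worker threads.
    static List<String> prune(List<BlockletStats> blocklets, long value, int threads) {
        int workers = Math.max(1, threads);
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            int chunk = Math.max(1, (blocklets.size() + workers - 1) / workers);
            List<Future<List<String>>> futures = new ArrayList<>();
            for (int start = 0; start < blocklets.size(); start += chunk) {
                List<BlockletStats> slice =
                    blocklets.subList(start, Math.min(start + chunk, blocklets.size()));
                futures.add(pool.submit(() -> {
                    List<String> hits = new ArrayList<>();
                    for (BlockletStats b : slice) {
                        if (b.min <= value && value <= b.max) {
                            hits.add(b.id);
                        }
                    }
                    return hits;
                }));
            }
            // Merge in submission order so the result keeps the original blocklet order.
            List<String> result = new ArrayList<>();
            for (Future<List<String>> f : futures) {
                result.addAll(f.get());
            }
            return result;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Pruning interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Pruning task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

The `threads` argument plays the role that carbon.max.driver.threads.for.block.pruning plays in the configuration table on this page (default 4, range 1-4); how CarbonData itself partitions the work is not described here.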

CarbonData SDK

SDK Supports C++ Interfaces for Writing CarbonData files

To enable integration with non-Java execution engines, CarbonData provides a C++ JNI wrapper for writing CarbonData files. It can be integrated with any execution engine and can write data to CarbonData files without depending on Spark or Hadoop.

...

Added more CLI options.
Supported a fallback mechanism: when off-heap memory is insufficient, switch to on-heap memory instead of failing the job.
Supported a separate audit log.
Supported reading rows in batches in the CSDK to improve performance.
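The off-heap to on-heap fallback listed above can be pictured with a small allocator that tries a direct (off-heap) buffer first and quietly switches to a heap buffer when the off-heap budget is used up. This is only a sketch of the idea: `FallbackAllocator` and its byte budget are hypothetical, and CarbonData's own memory manager is not shown on this page.

```java
import java.nio.ByteBuffer;

// Hypothetical allocator: prefers off-heap memory, falls back to on-heap instead of failing.
class FallbackAllocator {
    private final long offHeapBudgetBytes; // assumed cap on off-heap usage
    private long offHeapUsedBytes;

    FallbackAllocator(long offHeapBudgetBytes) {
        this.offHeapBudgetBytes = offHeapBudgetBytes;
    }

    synchronized ByteBuffer allocate(int size) {
        if (offHeapUsedBytes + size <= offHeapBudgetBytes) {
            try {
                ByteBuffer direct = ByteBuffer.allocateDirect(size);
                offHeapUsedBytes += size;
                return direct;
            } catch (OutOfMemoryError e) {
                // The JVM's direct memory is exhausted; fall through to the heap allocation.
            }
        }
        // Fallback path: allocate on the Java heap rather than failing the job.
        return ByteBuffer.allocate(size);
    }
}
```

Callers can tell which path was taken with `ByteBuffer.isDirect()`. A fuller version would also return bytes to the budget when buffers are released, which this sketch leaves out.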

...

Behavior Change

Enable Local dictionary by default.
Make inverted index false by default.
Sort temp files during data loading are now compressed by default with Snappy compression to improve IO.

...

Configuration name	Default Value	Range
carbon.push.rowfilters.for.vector	false	NA
carbon.max.driver.threads.for.block.pruning	4	1-4
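As an illustration of how the two properties in the table could be read with their documented defaults, the sketch below uses the JDK's `java.util.Properties`. The property names, defaults, and the 1-4 range come from the table; the `PruningConfig` class is hypothetical, and falling back to the default for out-of-range or unparsable values is an assumption rather than documented CarbonData behavior.

```java
import java.util.Properties;

// Hypothetical helper that reads the two properties with the defaults listed in the table.
class PruningConfig {
    static final String PUSH_ROW_FILTERS = "carbon.push.rowfilters.for.vector";
    static final String MAX_PRUNING_THREADS = "carbon.max.driver.threads.for.block.pruning";

    static final int DEFAULT_THREADS = 4;
    static final int MIN_THREADS = 1;
    static final int MAX_THREADS = 4;

    // Defaults to false when the property is absent.
    static boolean pushRowFilters(Properties props) {
        return Boolean.parseBoolean(props.getProperty(PUSH_ROW_FILTERS, "false").trim());
    }

    // Defaults to 4; values outside 1-4 or non-numeric values fall back to the default
    // (assumed behavior, chosen for this sketch).
    static int maxPruningThreads(Properties props) {
        String raw = props.getProperty(MAX_PRUNING_THREADS, String.valueOf(DEFAULT_THREADS));
        try {
            int value = Integer.parseInt(raw.trim());
            return (value >= MIN_THREADS && value <= MAX_THREADS) ? value : DEFAULT_THREADS;
        } catch (NumberFormatException e) {
            return DEFAULT_THREADS;
        }
    }
}
```

For example, a `Properties` object with carbon.max.driver.threads.for.block.pruning set to 2 yields 2 pruning threads, while an empty one yields the defaults from the table.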

Please find the detailed JIRA list: https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12320220&version=12344320

...