Apache CarbonData 1.5.1 Release [DRAFT]

The Apache CarbonData community is pleased to announce the release of version 1.5.1 in The Apache Software Foundation (ASF).

CarbonData is a high-performance data solution that supports various data analytic scenarios, including BI analysis, ad-hoc SQL queries, fast filter lookups on detail records, streaming analytics, and so on. CarbonData has been deployed in many enterprise production environments. In one of the largest deployments, it supports queries on a single table holding 3 PB of data (more than 5 trillion records) with response times under 3 seconds.

We encourage you to use the release, available at https://dist.apache.org/repos/dist/release/carbondata/1.5.1/, and to send feedback through the CarbonData user mailing lists.

This release note provides information on the new features, improvements, and bug fixes of this release.

What’s New in CarbonData Version 1.5.1?

The goal of CarbonData 1.5.1 is to move closer to unified analytics. We want CarbonData files to be readable from more engines and libraries in order to support a wider range of use cases. To that end, this release adds support for writing CarbonData files from C++ libraries.

This release also adds multiple optimisations that improve query and compaction performance.

In this version of CarbonData, more than 78 JIRA tickets related to new features, improvements, and bugs have been resolved. The following is a summary.

CarbonData Core

Support custom column compressor

CarbonData supports custom column compressors, so users can plug in their own compressor implementation. To use a custom compressor, specify its full class name either when creating the table or in the corresponding carbon property.
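
As a rough illustration of what selecting a compressor by its full class name involves, the sketch below models the lookup in Python. It is not CarbonData's implementation or API: ZlibCompressor, BUILT_IN and load_compressor are hypothetical names chosen for this example, and the actual compressor interface and property names should be taken from the CarbonData documentation.

```python
import importlib
import zlib


class ZlibCompressor:
    """Hypothetical compressor; any class with compress/decompress would do."""

    def compress(self, data: bytes) -> bytes:
        return zlib.compress(data)

    def decompress(self, data: bytes) -> bytes:
        return zlib.decompress(data)


# Short names known to this (hypothetical) registry.
BUILT_IN = {"zlib": ZlibCompressor}


def load_compressor(name: str):
    """Return a compressor instance for a short name or a full class name."""
    if name in BUILT_IN:
        return BUILT_IN[name]()
    # Otherwise treat the value as "package.module.ClassName".
    module_name, _, class_name = name.rpartition(".")
    if not module_name:
        raise ValueError(f"unknown compressor: {name}")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)()
```

With this shape, a short name such as "zlib" resolves from the registry, while any other value is imported and instantiated by its fully qualified name.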

Performance Improvements

Optimised CarbonData scan performance

CarbonData scan performance is improved by avoiding multiple data copies in the vector flow. This is achieved by short-circuiting the read and vector-filling steps: after data is read from the file, it is filled directly into the vector without any intermediate copies.

For the vector flow, row-level filter processing is now handled by the execution engine; CarbonData performs only blocklet and page pruning. This behaviour is controlled by the property carbon.push.rowfilters.for.vector, which defaults to false.
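
The division of work described above can be sketched with a small Python model. This is only an illustration under simplifying assumptions, not CarbonData code: Page, scan, engine_filter and the push_row_filters flag are hypothetical names, the filter is a single "value >= lower" predicate, and pruning uses only a per-page maximum.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Page:
    """A page of one integer column (hypothetical model)."""
    values: List[int]

    @property
    def max_value(self) -> int:
        return max(self.values)


def scan(pages, lower, push_row_filters=False):
    """Return rows for the filter `value >= lower`.

    Pages whose maximum is below `lower` are always pruned by the reader.
    Row-level filtering happens here only when push_row_filters is True;
    otherwise surviving pages are returned whole for the engine to filter.
    """
    rows = []
    for page in pages:
        if page.max_value < lower:   # page pruning via statistics
            continue
        if push_row_filters:
            rows.extend(v for v in page.values if v >= lower)
        else:
            rows.extend(page.values)
    return rows


def engine_filter(rows, lower):
    """Row-level filter as it would be applied by the execution engine."""
    return [v for v in rows if v >= lower]


pages = [Page([1, 2, 3]), Page([4, 9, 5]), Page([10, 11, 12])]
unfiltered = scan(pages, lower=5)                      # models the default (false)
pushed = scan(pages, lower=5, push_row_filters=True)   # models the property set to true
```

In both modes the first page is skipped entirely; the two modes differ only in who discards the non-matching row 4 from the second page.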

Optimised compaction performance

Compaction performance is optimised by prefetching data while reading carbon files.
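
The general idea of prefetching can be sketched as a background reader that stays a few blocks ahead of the consumer. The Python below is a minimal, generic model and not the CarbonData implementation: prefetching, merge_sorted_blocks and the depth parameter are hypothetical, and the "compaction" is reduced to merging small lists.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the input


def prefetching(source, depth=2):
    """Yield items from `source`, reading up to `depth` items ahead on a background thread."""
    buffer = queue.Queue(maxsize=depth)

    def producer():
        try:
            for item in source:
                buffer.put(item)      # blocks while the buffer is full
            buffer.put(_DONE)
        except Exception as exc:      # hand read errors to the consumer
            buffer.put(exc)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()
        if item is _DONE:
            return
        if isinstance(item, Exception):
            raise item
        yield item


def merge_sorted_blocks(blocks):
    """Toy 'compaction': consume prefetched blocks and return one sorted list."""
    merged = []
    for block in prefetching(iter(blocks)):
        merged.extend(block)
    return sorted(merged)
```

While the consumer processes one block, the producer thread is already reading the next ones, so read latency overlaps with processing time.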

CarbonData SDK

SDK Supports C++ Interfaces for writing CarbonData files

To enable integration with non-Java execution engines, CarbonData provides a C++ JNI wrapper for writing CarbonData files. The wrapper can be integrated with any execution engine and writes data to CarbonData files without a dependency on Spark or Hadoop.

Multi-Thread Read API in SDK

To improve read performance when using the SDK, CarbonData supports multi-thread read APIs. These allow applications to read data from multiple CarbonData files in parallel, which significantly improves SDK read performance.
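
The pattern of reading several files in parallel can be sketched as follows. This Python example does not use the CarbonData SDK: read_file, read_all and demo are hypothetical helpers, and plain text files stand in for CarbonData files.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def read_file(path):
    """Stand-in for a per-file reader: returns the file's lines as rows."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def read_all(paths, max_workers=4):
    """Read every file on a worker pool and return rows in input-file order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_file = pool.map(read_file, paths)   # preserves the order of `paths`
        return [row for rows in per_file for row in rows]


def demo():
    """Write three small files, then read them back in parallel."""
    with tempfile.TemporaryDirectory() as folder:
        paths = []
        for index in range(3):
            path = os.path.join(folder, f"part-{index}.txt")
            with open(path, "w", encoding="utf-8") as handle:
                handle.write(f"row-{index}-a\nrow-{index}-b\n")
            paths.append(path)
        return read_all(paths)
```

Each file is handled by one worker, so the achievable speed-up in this model is bounded by the number of files and by max_workers.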

Other Improvements

Enhanced the CLI with additional options.
Added a fallback mechanism: when off-heap memory is insufficient, CarbonData switches to on-heap memory instead of failing the job.
Added a separate audit log.
Added batch row reads in the CSDK to improve performance.
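
The off-heap to on-heap fallback listed above follows a common "try the preferred pool, then degrade gracefully" pattern. The Python sketch below only models that pattern; OffHeapPool and allocate_with_fallback are hypothetical names, and Python has no real off-heap/on-heap distinction, so the pool is simulated with a simple capacity counter.

```python
class OffHeapPool:
    """Toy stand-in for a fixed-size off-heap memory pool."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0

    def allocate(self, size):
        if self.used + size > self.capacity:
            raise MemoryError("off-heap pool exhausted")
        self.used += size
        return bytearray(size)


def allocate_with_fallback(pool, size):
    """Try the off-heap pool first; fall back to an on-heap buffer if it is full.

    Returns a (kind, buffer) pair so the caller can see which path was taken.
    """
    try:
        return ("offheap", pool.allocate(size))
    except MemoryError:
        # Instead of failing the job, serve the request from ordinary memory.
        return ("onheap", bytearray(size))
```

The caller receives a usable buffer in both cases; only the accounting differs, since the fallback allocation does not count against the pool.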

Behaviour Change

Local dictionary is now enabled by default.
Inverted index is now disabled by default.

New Configuration Parameters

Configuration name	Default Value	Range
carbon.push.rowfilters.for.vector	false	NA

Please find the detailed JIRA list: https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12320220&version=12341006
