
The Apache CarbonData community is pleased to announce the release of version 1.5.1 in The Apache Software Foundation (ASF).

CarbonData is a high-performance data solution that supports various data analytics scenarios, including BI analysis, ad-hoc SQL queries, fast filter lookups on detail records, streaming analytics, and so on. CarbonData has been deployed in many enterprise production environments; in one of the largest scenarios, it supports queries on a single table with 3 PB of data (more than 5 trillion records) with a response time of less than 3 seconds!

We encourage you to use the release https://dist.apache.org/repos/dist/release/carbondata/1.5.1/, and to send feedback through the CarbonData user mailing lists!

This release note provides information on the new features, improvements, and bug fixes of this release.

What’s New in CarbonData Version 1.5.1?

The intention of CarbonData 1.5.1 was to move closer to unified analytics. We want to enable CarbonData files to be read from more engines and libraries to support various use cases. In this regard, we have added support for writing CarbonData files from C++ libraries.

CarbonData added multiple optimisations to improve query and compaction performance.

In this version of CarbonData, more than 78 JIRA tickets related to new features, improvements, and bugs have been resolved. The following is a summary.

CarbonData Core

Support custom column compressor

CarbonData supports custom column compressors, so users can add their own compressor implementations. To use a custom compressor, specify its full class name when creating the table, or set it as a carbon property.
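As a rough illustration of selecting a compressor by its full class name, here is a minimal, self-contained sketch. The `ColumnCompressor` interface, the `DeflateCompressor` class, and the `load` helper are hypothetical names made up for this example; they are not CarbonData's actual interfaces, and the real compressor contract and property name should be taken from the CarbonData documentation.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class CustomCompressorSketch {

    /** Hypothetical plug-in contract; CarbonData's real interface differs. */
    public interface ColumnCompressor {
        String name();
        byte[] compress(byte[] input);
        byte[] decompress(byte[] input);
    }

    /** Example user-supplied implementation backed by java.util.zip. */
    public static class DeflateCompressor implements ColumnCompressor {
        @Override public String name() { return "deflate-example"; }

        @Override public byte[] compress(byte[] input) {
            Deflater deflater = new Deflater();
            deflater.setInput(input);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[1024];
            while (!deflater.finished()) {
                out.write(buf, 0, deflater.deflate(buf));
            }
            deflater.end();
            return out.toByteArray();
        }

        @Override public byte[] decompress(byte[] input) {
            Inflater inflater = new Inflater();
            inflater.setInput(input);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[1024];
            try {
                while (!inflater.finished()) {
                    int n = inflater.inflate(buf);
                    if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                        break; // truncated or unsupported input: stop instead of spinning
                    }
                    out.write(buf, 0, n);
                }
            } catch (DataFormatException e) {
                throw new IllegalArgumentException("Corrupt compressed data", e);
            } finally {
                inflater.end();
            }
            return out.toByteArray();
        }
    }

    /** Instantiates a compressor from its fully qualified class name via reflection. */
    public static ColumnCompressor load(String className) {
        try {
            Class<?> cls = Class.forName(className);
            return (ColumnCompressor) cls.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException | ClassCastException e) {
            throw new IllegalArgumentException("Cannot load compressor: " + className, e);
        }
    }

    public static void main(String[] args) {
        ColumnCompressor c = load(DeflateCompressor.class.getName());
        byte[] raw = "aaaaaaaaaabbbbbbbbbb".getBytes(StandardCharsets.UTF_8);
        byte[] restored = c.decompress(c.compress(raw));
        System.out.println(c.name() + " round trip ok: " + Arrays.equals(raw, restored));
    }
}
```

Loading by class name keeps the core code free of compile-time dependencies on user compressors; the class only needs to be on the classpath and expose a no-argument constructor.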

Performance Improvements

Optimised carbondata scan performance

CarbonData scan performance is improved by avoiding multiple data copies in the vector flow. This is achieved by short-circuiting the read and the vector filling: data is filled directly into the vector after it is read from the file, without any intermediate copies.
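The difference between the two paths can be sketched as follows. This is a simplified illustration and not CarbonData's reader code: `IntColumnVector` is a hypothetical stand-in for an execution engine's column vector, and the "page" is just a `ByteBuffer` of fixed-width integers.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class DirectVectorFillSketch {

    /** Simplified, hypothetical stand-in for an engine's integer column vector. */
    public static final class IntColumnVector {
        public final int[] values;
        public IntColumnVector(int capacity) { values = new int[capacity]; }
        public void putInt(int rowId, int value) { values[rowId] = value; }
    }

    /** Two-step path: decode the page into a temporary array, then copy it into the vector. */
    public static void fillWithIntermediateCopy(ByteBuffer page, int rows, IntColumnVector vector) {
        int[] decoded = new int[rows]; // extra allocation and extra pass over the data
        for (int i = 0; i < rows; i++) {
            decoded[i] = page.getInt(i * Integer.BYTES);
        }
        for (int i = 0; i < rows; i++) {
            vector.putInt(i, decoded[i]);
        }
    }

    /** Direct path: each value goes straight from the page into the vector. */
    public static void fillDirect(ByteBuffer page, int rows, IntColumnVector vector) {
        for (int i = 0; i < rows; i++) {
            vector.putInt(i, page.getInt(i * Integer.BYTES));
        }
    }

    public static void main(String[] args) {
        ByteBuffer page = ByteBuffer.allocate(3 * Integer.BYTES);
        page.putInt(7).putInt(8).putInt(9);
        IntColumnVector vector = new IntColumnVector(3);
        fillDirect(page, 3, vector);
        System.out.println(Arrays.toString(vector.values)); // prints [7, 8, 9]
    }
}
```

Both methods produce the same vector contents; the direct path simply skips the temporary array and the second loop.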

Row-level filter processing is now handled in the execution engine; for the vector flow, CarbonData handles only blocklet and page pruning. This behaviour is controlled by the property carbon.push.rowfilters.for.vector, which is false by default.
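A minimal sketch of how such a flag might be read with its default is shown below. It uses plain `java.util.Properties` rather than CarbonData's own property handling, and it assumes that setting the property to true moves row-level filtering back into CarbonData; the class and method names are hypothetical.

```java
import java.util.Properties;

public class RowFilterFlagSketch {

    public static final String PUSH_ROW_FILTERS = "carbon.push.rowfilters.for.vector";

    /** Returns the flag value, falling back to the default of false when it is not set. */
    public static boolean pushRowFilters(Properties props) {
        return Boolean.parseBoolean(props.getProperty(PUSH_ROW_FILTERS, "false"));
    }

    /** Assumption: true means CarbonData applies row filters, false means the engine does. */
    public static String rowFilterOwner(Properties props) {
        return pushRowFilters(props) ? "CarbonData" : "execution engine";
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        System.out.println(rowFilterOwner(props)); // prints execution engine
        props.setProperty(PUSH_ROW_FILTERS, "true");
        System.out.println(rowFilterOwner(props)); // prints CarbonData
    }
}
```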

Optimised compaction performance

Compaction performance is optimised through prefetching the data while reading carbon files.
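The general idea of prefetching can be sketched as below: while one block is being processed, the read of the next block is already running on a background thread. This is a generic illustration under assumed names (`sumBlocksWithPrefetch`, block readers modelled as `Supplier<int[]>`), not CarbonData's compaction code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

public class PrefetchSketch {

    /**
     * Processes blocks in order while keeping the read of the next block
     * in flight on a background thread.
     */
    public static List<Integer> sumBlocksWithPrefetch(List<Supplier<int[]>> blockReaders) {
        List<Integer> sums = new ArrayList<>();
        if (blockReaders.isEmpty()) {
            return sums;
        }
        ExecutorService prefetcher = Executors.newSingleThreadExecutor();
        try {
            CompletableFuture<int[]> pending =
                    CompletableFuture.supplyAsync(blockReaders.get(0), prefetcher);
            for (int i = 0; i < blockReaders.size(); i++) {
                int[] current = pending.join(); // wait for the prefetched block
                if (i + 1 < blockReaders.size()) {
                    // Start reading the next block before processing the current one.
                    pending = CompletableFuture.supplyAsync(blockReaders.get(i + 1), prefetcher);
                }
                int sum = 0; // this "processing" overlaps with the next read
                for (int v : current) {
                    sum += v;
                }
                sums.add(sum);
            }
        } finally {
            prefetcher.shutdown();
        }
        return sums;
    }

    public static void main(String[] args) {
        List<Supplier<int[]>> readers = new ArrayList<>();
        readers.add(() -> new int[] {1, 2, 3});
        readers.add(() -> new int[] {10, 20});
        System.out.println(sumBlocksWithPrefetch(readers)); // prints [6, 30]
    }
}
```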

CarbonData SDK

SDK Supports C++ Interfaces for writing CarbonData files

To enable integration with non-Java-based execution engines, CarbonData provides a C++ JNI wrapper for writing CarbonData files. It can be integrated with any execution engine to write data to CarbonData files without depending on Spark or Hadoop.

Multi-Thread Read API in SDK 

To improve the read performance when using SDK, CarbonData supports multi-thread read APIs. This enables the applications to read data from multiple CarbonData files in parallel. It significantly improves the SDK read performance.
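The pattern of reading several files in parallel can be sketched with standard Java concurrency utilities. This example does not use the CarbonData SDK: `readOneFile` is a hypothetical stand-in that counts lines in a text file, and `writeTempFile` only exists to create sample input.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Stream;

public class ParallelReadSketch {

    /** Stand-in for "read one data file": counts the lines of a text file. */
    static long readOneFile(Path file) {
        try (Stream<String> lines = Files.lines(file)) {
            return lines.count();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads each file in its own task and returns the total row count. */
    public static long countRowsInParallel(List<Path> files, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<CompletableFuture<Long>> tasks = new ArrayList<>();
            for (Path file : files) {
                tasks.add(CompletableFuture.supplyAsync(() -> Long.valueOf(readOneFile(file)), pool));
            }
            long total = 0;
            for (CompletableFuture<Long> task : tasks) {
                total += task.join();
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    /** Helper for the demo: writes the given rows to a temporary text file. */
    public static Path writeTempFile(List<String> rows) {
        try {
            Path p = Files.createTempFile("part-", ".txt");
            Files.write(p, rows);
            p.toFile().deleteOnExit();
            return p;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<Path> files = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            files.add(writeTempFile(Arrays.asList("r1", "r2")));
        }
        System.out.println(countRowsInParallel(files, 3)); // prints 6
    }
}
```

Because each file is read independently, the tasks need no coordination beyond collecting their results at the end.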

Other Improvements

  • Enhanced the CLI with more options.
  • Added a fallback mechanism: when off-heap memory is insufficient, CarbonData switches to on-heap memory instead of failing the job.
  • Added a separate audit log.
  • Added support for reading rows in batches in the CSDK to improve performance.
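The off-heap fallback mentioned in the list above can be sketched as a small allocator that tracks an off-heap budget and switches to heap buffers once the budget is exhausted. The class name, the budget accounting, and the use of `ByteBuffer` are assumptions for illustration; CarbonData's memory manager works differently in detail.

```java
import java.nio.ByteBuffer;

public class OffHeapFallbackSketch {

    private final long offHeapLimitBytes;
    private long offHeapUsedBytes;

    public OffHeapFallbackSketch(long offHeapLimitBytes) {
        this.offHeapLimitBytes = offHeapLimitBytes;
    }

    /** Allocates off-heap while the budget allows; otherwise falls back to the heap. */
    public synchronized ByteBuffer allocate(int size) {
        if (offHeapUsedBytes + size <= offHeapLimitBytes) {
            offHeapUsedBytes += size;
            return ByteBuffer.allocateDirect(size);
        }
        // Budget exhausted: use on-heap memory instead of failing the job.
        return ByteBuffer.allocate(size);
    }

    public static void main(String[] args) {
        OffHeapFallbackSketch allocator = new OffHeapFallbackSketch(1024);
        System.out.println(allocator.allocate(512).isDirect());  // prints true
        System.out.println(allocator.allocate(1024).isDirect()); // prints false
    }
}
```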

Behaviour Change

  • Enable Local dictionary by default.
  • Make inverted index false by default.

New Configuration Parameters

Configuration name                   Default Value   Range
carbon.push.rowfilters.for.vector    false           NA


Please find the detailed JIRA list: https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12320220&version=12341006

