Apache CarbonData 1.5.3 Release

The Apache CarbonData community is pleased to announce the release of version 1.5.3 under The Apache Software Foundation (ASF).

CarbonData is a high-performance data solution that supports various data analytic scenarios, including BI analysis, ad-hoc SQL query, fast filter lookup on detail records, streaming analytics, and so on. CarbonData has been deployed in many enterprise production environments. In one of the largest deployments, it supports queries on a single table with 3 PB of data (more than 5 trillion records) with a response time of less than 3 seconds!

We encourage you to use the release at https://dist.apache.org/repos/dist/release/carbondata/1.5.3/ and to send feedback through the CarbonData user mailing lists!

This release note provides information on the new features, improvements, and bug fixes of this release.

What’s New in CarbonData Version 1.5.3?

The intention of CarbonData 1.5.3 was to move closer to unified analytics. This release adds DDL commands that operate on the LRU cache, so that users can manage the cache according to their requirements. We have also upgraded the Presto integration to support the latest Presto version. More importantly, we have further improved CarbonData performance.

In this version of CarbonData, more than 20 JIRA tickets related to new features, improvements, and bugs have been resolved. The following is a summary.

CarbonData Core

DDL Support on CarbonData LRU Cache

Previously, although users could set the cache size, this was of limited use because they had no clear picture of how much cache their workload required.

From this version, CarbonData supports DDL on the LRU cache, which allows you to perform the following operations:

Show the current cache usage per table.
Show the current cache usage for a specific table.
Clear the cache for a specific table.
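
The three operations above correspond to cache DDL statements. As a minimal sketch, the Python helpers below only build the statement text and do not run anything. The `SHOW METACACHE` and `SHOW METACACHE ON TABLE` forms follow the JIRA titles listed later in this note. The `DROP METACACHE ON TABLE` form and all helper names are assumptions, so check them against the CarbonData DDL documentation before use.

```python
import re

# Accepts "table" or "db.table". Anything else is rejected so that the
# generated statement cannot carry stray SQL.
_TABLE_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)?")


def _checked(table: str) -> str:
    """Return the table name unchanged if it is a plain identifier."""
    if not _TABLE_NAME.fullmatch(table):
        raise ValueError(f"not a plain table identifier: {table!r}")
    return table


def show_cache_all() -> str:
    """Statement that reports cache usage across tables."""
    return "SHOW METACACHE"


def show_cache_for(table: str) -> str:
    """Statement that reports cache usage for one table."""
    return f"SHOW METACACHE ON TABLE {_checked(table)}"


def drop_cache_for(table: str) -> str:
    """Statement that clears the cache for one table (assumed syntax)."""
    return f"DROP METACACHE ON TABLE {_checked(table)}"
```

Each returned string would then be submitted through whatever SQL entry point the deployment uses, for example a Spark session with CarbonData enabled.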

SDK Read Support for Files with Different Schemas

This version allows the SDK to read two or more CarbonData files with different schemas from the same location.

Performance Improvements

Improved Single/Concurrent Query Performance

When a table has a large number of segments, query performance degrades because of factors such as a higher memory footprint, pruning overhead, and slow retrieval from the unsafe DataMap store.

In this version, we have improved query performance through the following modifications:

Reduced the memory footprint during queries.
Added multi-threaded pruning for non-filter queries.
Updated the unsafe storage format of the driver cache for faster data retrieval.

Improved Count(*) Query Performance

Previously, pruning for a count(*) query followed the same path as a select * query, which is very time-consuming because of the additional processing involved.

In this version, we have optimized count(*) queries by reading the blocklet row count directly from DataMapRow, which reduces query time.
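
The idea behind this optimization can be illustrated with a toy model. The sketch below is not CarbonData code: the `Blocklet` class and both functions are hypothetical names, and the only assumption carried over from the text is that each blocklet's row count is available in index metadata, so a count can be answered without touching the rows.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Blocklet:
    """Toy stand-in for one blocklet and its index entry."""
    row_count: int        # stored in the index metadata
    rows: List[Tuple]     # the data itself; costly to read in a real store


def make_blocklet(rows: List[Tuple]) -> Blocklet:
    """Build a blocklet whose metadata row count matches its data."""
    return Blocklet(row_count=len(rows), rows=rows)


def count_by_scan(blocklets: List[Blocklet]) -> int:
    """Count rows the slow way: visit every row of every blocklet."""
    return sum(1 for blocklet in blocklets for _ in blocklet.rows)


def count_from_metadata(blocklets: List[Blocklet]) -> int:
    """Count rows by summing the per-blocklet counts held in metadata."""
    return sum(blocklet.row_count for blocklet in blocklets)
```

Both functions return the same total, but the second does work proportional to the number of blocklets rather than the number of rows.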

Other Improvements

Presto Version Upgrade

CarbonData now integrates with Presto version 0.217.

Behavior Change

None

Please find the detailed JIRA list: https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12320220&version=12344322

Bug

[CARBONDATA-3202] - updated schema is not updated in session catalog after add, drop or rename column.
[CARBONDATA-3223] - Datasize and Indexsize showing 0B for 1.1 store when show segments is done
[CARBONDATA-3284] - Workaround for Create-PreAgg Datamap Fail
[CARBONDATA-3287] - Remove the validation of same schema data files in location for external table and file format
[CARBONDATA-3298] - Logs are getting printed when clean files is executed for old stores
[CARBONDATA-3301] - Array<date> column is giving null data in case of spark carbon file format
[CARBONDATA-3313] - count(*) is not invalidating the invalid segments cache
[CARBONDATA-3314] - Index Cache Size printed in SHOW METACACHE ON TABLE DDL is not accurate
[CARBONDATA-3315] - Range Filter query with two between clauses with an OR gives wrong results
[CARBONDATA-3320] - number of partitions are always zero in describe formatted for hive native partition
[CARBONDATA-3322] - After renaming table, "SHOW METACACHE ON TABLE" still works for old table
[CARBONDATA-3323] - Output is null when cache is empty
[CARBONDATA-3328] - Performance issue with merge small files distribution
[CARBONDATA-3330] - Fix Invalid exception when SDK reader is trying to clear the datamap
[CARBONDATA-3332] - Concurrent update and compaction failure
[CARBONDATA-3333] - Fixed No Sort Store Size issue and Compatibility issue after alter add column done in 1.1 and load in 1.5

New Feature

[CARBONDATA-3300] - ClassNotFoundException when using UDF on spark-shell
[CARBONDATA-3305] - DDLs to Operate on CarbonLRUCache
[CARBONDATA-3329] - DeadLock is observed when a query fails.

Improvement

[CARBONDATA-3293] - Prune datamaps improvement for count(*)
[CARBONDATA-3318] - Decoupling of Cache Commands
[CARBONDATA-3321] - Improve Single/Concurrent query performance
