...

CarbonData is a high-performance data solution that supports various data analytic scenarios, including BI analysis, ad-hoc SQL queries, fast filter lookups on detail records, and streaming analytics. CarbonData has been deployed in many enterprise production environments; in one of the largest scenarios, it supports queries on a single table with 3PB of data (more than 5 trillion records) with a response time of less than 3 seconds!

We encourage you to use the release https://archive.apache.org/repos/dist/release/carbondata/1.4.0/, and to send feedback through the CarbonData user mailing lists!

...

In this version of CarbonData, more than 230 JIRA tickets for new features, improvements, and bugs have been resolved. The following is a summary.

Carbon Core

...

Improved Data Load

...

Performance

Data loading performance has been improved dramatically through various enhancements, including sorting temp file improvement, a sort boundary mechanism, direct write without data move, and others. In one production environment, we observed as much as a 300% improvement compared with the previous version, from 35MB/s/node to 102MB/s/node data loading throughput.

Improved Compaction Performance

By employing data prefetching and various improvements to the vectorized reader during compaction, compaction on CarbonData tables is improved by up to 500% compared with the previous version. In one production environment, the application can support data loading every 5 minutes (100s of GB) while maintaining second-level query performance, by automatically compacting every 30 and 60 minutes (configured with "carbon.compaction.level.threshold" set to "6,2") to reduce the number of segments.
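
The 30- and 60-minute figures follow from the 5-minute load interval and the two threshold values. The sketch below works through that arithmetic. It assumes, as the scenario above implies, that each load creates one segment, that the first threshold value is the number of new segments merged in a first-level compaction, and that the second is the number of first-level results merged in a second-level compaction. The function name is made up for illustration and is not part of CarbonData.

```python
def compaction_intervals(load_interval_min, level_threshold):
    """Return (level-1, level-2) compaction intervals in minutes.

    Assumes one new segment per load and a threshold string of the
    form "N,M", as used for carbon.compaction.level.threshold.
    """
    level1, level2 = (int(v) for v in level_threshold.split(","))
    level1_interval = load_interval_min * level1   # N loads trigger one level-1 merge
    level2_interval = level1_interval * level2     # M level-1 merges trigger one level-2 merge
    return level1_interval, level2_interval


print(compaction_intervals(5, "6,2"))  # (30, 60)
```

Under these assumptions, a larger first value means fewer but bigger first-level compactions, at the cost of more small segments being visible to queries in between.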

DataMap Management Enhancement

A new syntax, 'DEFERRED REBUILD', is introduced in the CREATE DATAMAP statement. It lets the user choose the DataMap management mechanism (automatic or manual) when creating the DataMap. When 'DEFERRED REBUILD' is specified, the DataMap is disabled by default, and data loading to the main table will not trigger loading to the DataMap until the user performs REBUILD DATAMAP.
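
The difference between the two mechanisms can be illustrated with a toy model. This is not CarbonData code. The class and attribute names are invented, and it captures only the behaviour described above: a deferred DataMap starts disabled, loads to the main table do not reach it, and a rebuild brings it up to date and enables it. How the enabled state changes on loads after a rebuild is not modelled here.

```python
class DataMapModel:
    """Toy model of automatic vs. deferred (manual) DataMap management."""

    def __init__(self, deferred_rebuild=False):
        self.deferred_rebuild = deferred_rebuild
        self.enabled = not deferred_rebuild   # deferred DataMaps start disabled
        self.main_table = []                  # segments loaded into the main table
        self.datamap = []                     # segments covered by the DataMap

    def load(self, segment):
        self.main_table.append(segment)
        if not self.deferred_rebuild:
            self.datamap.append(segment)      # automatic: the load also reaches the DataMap

    def rebuild(self):
        # Stands in for REBUILD DATAMAP: cover everything in the main table.
        self.datamap = list(self.main_table)
        self.enabled = True


manual = DataMapModel(deferred_rebuild=True)
manual.load("segment_0")
manual.load("segment_1")
print(manual.enabled, manual.datamap)  # False []
manual.rebuild()
print(manual.enabled, manual.datamap)  # True ['segment_0', 'segment_1']
```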

...

You can specify cloud storage, such as AWS S3 or HuaweiCloud OBS, as the external table location.

Supports SDK for Standalone Application

Provided Carbon SDK to write and read CarbonData files through a Java API without Hadoop or Spark dependencies. Users can use this SDK in a standalone Java application to convert existing data into CarbonData files. It supports writing to local disk or cloud storage from the following formats.

  1. CSV data, schema specified by user.
  2. JSON data, schema defined by Avro.

...

Materialized View (Alpha feature)

Materialized View is integrated as a DataMap in CarbonData. Compared with the "preaggregate" DataMap introduced in version 1.3.0, it covers more scenarios for OLAP applications. Currently, it supports SPJGH-like queries (select-predicate-join-groupby-having), and users can follow the existing DataMap statements (similar to a CTAS statement) to create, drop, and show them. While querying, the system finds good MVs based on cost and rewrites the query to improve query performance. As this is an initial release of the feature, we encourage users to try and test it, but it still has some limitations and will be improved in coming releases.


Enhancement for Detail Record Analysis

...

Lucene is a high-performance, full-featured text search engine. In this version, Lucene is integrated into CarbonData as a DataMap and managed along with the main tables. Users can create a Lucene DataMap to improve query performance on string columns with long content, and can then search for tokenized words or patterns in the text content using Lucene queries. For more detail, please refer to the Lucene DataMap Guide.

Supports Search Mode (Alpha feature)

To improve concurrent filter query performance, CarbonData supports a "Search Mode" that does query scheduling and query execution without using the Spark scheduler and executors. In one of our tests, it halved the latency of a filter query on an indexed column compared with using Spark, from 1 second to 500 milliseconds.

...


Other Important Improvements

  • Improved EXPLAIN command output to show whether the query is rewritten to use a Pre-Aggregate Table or Materialized View, which index DataMap is hit, and how many blocklets are pruned.
  • Added logging of performance tuning information, including driver-side parsing and optimizer time, block distribution info, carbon file IO read time, number of blocklets scanned, result filling time to Spark, etc.
  • Support compaction and loading in parallel.
  • Support separating visible and invisible segment metadata into two files and showing them separately in the SHOW SEGMENTS command.
  • Support the global sort option on partition tables.
  • Reduced object generation in global sorted tables.
  • Optimized the DESC command to show partition value and location for partitioned tables.


Please find the detailed JIRA list: https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12341005&styleName=Html&projectId=12320220&Create=Create&atl_token=A5KQ-2QAV-T4JA-FDED%7C72f8d21d9927bf947fc8c0dfb7f69263d4048efb%7Clout

...