This version of CarbonData adds the following new features to improve performance, compatibility, and usability.

Carbon Core

Enhanced Data Load Performance

Data loading performance has improved substantially through several enhancements, including better handling of sort temp files, a sort boundary mechanism, and direct writes that avoid moving data. In one production environment, we observed data loading throughput rise from 35 MB/s/node to 102 MB/s/node, nearly three times that of the previous version.

DataMap Management Enhancement

A new 'DEFERRED REBUILD' syntax is introduced in the CREATE DATAMAP statement. It lets the user choose the DataMap management mechanism (automatic or manual) when creating the DataMap. When 'DEFERRED REBUILD' is specified, the DataMap is disabled by default, and loading data into the main table does not trigger loading into the DataMap until the user runs REBUILD DATAMAP.
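The difference between automatic and manual management can be illustrated with a small toy model. This is only a sketch of the behavior described above, not CarbonData code: the `ToyTable` and `ToyDataMap` classes and their methods are hypothetical names invented for illustration.

```python
class ToyDataMap:
    """Hypothetical stand-in for a DataMap; not a CarbonData class."""

    def __init__(self, deferred_rebuild=False):
        self.deferred_rebuild = deferred_rebuild
        # A DataMap created with deferred rebuild starts out disabled.
        self.enabled = not deferred_rebuild
        self.indexed_segments = []


class ToyTable:
    """Hypothetical stand-in for a main table with attached DataMaps."""

    def __init__(self):
        self.segments = []
        self.datamaps = []

    def load(self, segment):
        self.segments.append(segment)
        for dm in self.datamaps:
            # Automatic management: every load is propagated to the DataMap.
            # Manual (deferred) management: the load is not propagated.
            if not dm.deferred_rebuild:
                dm.indexed_segments.append(segment)

    def rebuild(self, dm):
        # Models REBUILD DATAMAP: bring the DataMap up to date and enable it.
        # Behavior on loads that arrive after a rebuild is not modeled here.
        dm.indexed_segments = list(self.segments)
        dm.enabled = True


table = ToyTable()
auto_dm = ToyDataMap()
manual_dm = ToyDataMap(deferred_rebuild=True)
table.datamaps = [auto_dm, manual_dm]

table.load("segment_0")
table.load("segment_1")
state_before = (manual_dm.enabled, list(manual_dm.indexed_segments))

table.rebuild(manual_dm)
state_after = (manual_dm.enabled, list(manual_dm.indexed_segments))
```

In this model the automatically managed DataMap follows every load, while the deferred one stays empty and disabled until `rebuild` is called.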

Supports External Table with Location

You can now create an external table by specifying the storage location of CarbonData files. A Carbon external table is used in the same way as a Hive external table.

Supports Cloud Storage

You can specify cloud storage, such as AWS S3 or HuaweiCloud OBS, as the external table location.

Supports SDK

The Carbon SDK is provided for writing and reading CarbonData files through a Java API. It supports writing CarbonData files from:

  1. CSV data, with the schema specified by the user.
  2. JSON data, with the schema defined by Avro.


Enhancement for OLAP

Supports Streaming on Pre-Aggregate Table

You can now create pre-aggregate tables on streaming tables. CarbonData's streaming ingest feature already reduces the time until data is available; with a Pre-Aggregate Table you also gain query performance. After you create a Pre-Aggregate Table using the 'preaggregate' DataMap, data conversion in the streaming table includes automatic aggregation. Queries on the table are rewritten into two parts, one over the streaming data and one over the pre-aggregated data. Because the pre-aggregated data is much smaller than the original data, queries run much faster.
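The two-part query described above can be sketched in a few lines of Python. This is a conceptual illustration only, assuming a simple SUM grouped by one key; the function name `query_sum_by_key` and the sample data are hypothetical and do not reflect CarbonData internals.

```python
from collections import defaultdict


def query_sum_by_key(preaggregated, streaming_rows):
    """Answer SUM(value) GROUP BY key from two sources (illustrative only)."""
    totals = defaultdict(int)
    # Part 1: read the pre-aggregated data, which holds one row per key.
    for key, partial_sum in preaggregated.items():
        totals[key] += partial_sum
    # Part 2: aggregate the streaming rows that are not yet pre-aggregated.
    for key, value in streaming_rows:
        totals[key] += value
    return dict(totals)


# Hypothetical data: sums already computed for converted data,
# plus raw rows that have just arrived through streaming ingest.
preaggregated = {"CN": 300, "US": 120}
streaming_rows = [("CN", 5), ("IN", 7), ("US", 3), ("CN", 1)]

result = query_sum_by_key(preaggregated, streaming_rows)
print(result)
```

The pre-aggregated side contributes one row per key regardless of how many original rows it summarizes, which is where the speedup comes from.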

Supports Partition on Pre-Aggregate Table

If you create a Pre-Aggregate Table ('preaggregate' DataMap) on a partitioned main table, the Pre-Aggregate Table is partitioned on the same column. Because the partitions are aligned, data management operations such as create, drop, or overwrite on the main table are applied automatically to the aggregate table, keeping the two in sync.

Materialized View (Alpha feature)

Materialized View is integrated into CarbonData as a DataMap. It provides pre-aggregation for SPJGH-style queries (select-predicate-join-groupby-having). Users can create, drop, and show Materialized Views with the existing DataMap statements. At query time, the system selects suitable MVs based on cost and rewrites the query to improve performance.


Enhancement for Detail Record Analysis

Supports Bloom Filter DataMap (Alpha feature)

CarbonData introduces BloomFilter as an index DataMap to improve the performance of queries on precise values. It is well suited to queries that do exact matches on high-cardinality columns (such as Name or ID). In a concurrent filter query scenario on a high-cardinality column, we observed a 3 to 5 times improvement in concurrent queries per second compared to the previous version. For more details, please refer to the BloomFilter DataMap Guide.
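To show why a Bloom filter helps exact-match queries, here is a minimal, generic sketch of Bloom-filter-based pruning. It is not the CarbonData implementation: the `BloomFilter` class, the `prune_blocks` helper, the filter sizes, and the block names are all assumptions chosen for illustration.

```python
import hashlib


class BloomFilter:
    """Generic Bloom filter (illustrative; parameters are arbitrary)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, value):
        # Derive several bit positions by hashing the value with a seed prefix.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos] = 1

    def might_contain(self, value):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(value))


def prune_blocks(block_filters, value):
    """Keep only the blocks whose filter says the value may be present."""
    return [bid for bid, bf in block_filters.items() if bf.might_contain(value)]


# Hypothetical data blocks holding values of a high-cardinality column.
blocks = {
    "block-0": ["alice", "bob"],
    "block-1": ["carol", "dave"],
    "block-2": ["erin", "frank"],
}
block_filters = {}
for block_id, names in blocks.items():
    bf = BloomFilter()
    for name in names:
        bf.add(name)
    block_filters[block_id] = bf

candidates = prune_blocks(block_filters, "carol")
print(candidates)
```

A Bloom filter never reports a stored value as absent, so pruning cannot drop a block that holds a match; it may occasionally keep a block that does not, which costs only an extra scan.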

Supports Lucene DataMap for Text Search (Alpha feature)

Lucene is a high-performance, full-featured text search engine. In this version, Lucene is integrated into CarbonData as a DataMap and managed along with the main tables. Users can create a Lucene DataMap to improve query performance on string columns that hold long text, and can then search for tokenized words or patterns in the text content using Lucene queries. For more details, please refer to the Lucene DataMap Guide.

Supports Search Mode (Alpha feature)

To improve concurrent filter query performance, CarbonData supports a "Search Mode" that performs query scheduling and query execution without using the Spark scheduler and executor. In one of our tests, it halved the latency of a filter query on an indexed column compared to using Spark, from 1 second to 500 milliseconds. This feature speeds up blocklet pruning.

 

Please find the detailed JIRA list: https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12341005&styleName=Html&projectId=12320220&Create=Create&atl_token=A5KQ-2QAV-T4JA-FDED%7C72f8d21d9927bf947fc8c0dfb7f69263d4048efb%7Clout
