Apache CarbonData 2.1.0 Release

Created by Kunal Kapoor, last modified by Liang Chen on Mar 19, 2022

The Apache CarbonData community is pleased to announce the release of version 2.1.0 under The Apache Software Foundation (ASF).

CarbonData is a high-performance data solution that supports a range of data analytics scenarios, including BI analysis, ad-hoc SQL queries, fast filter lookups on detail records, and streaming analytics. CarbonData has been deployed in many enterprise production environments. In one of the largest deployments, it supports queries on a single table holding 3 PB of data (more than 5 trillion records) with response times under 3 seconds.

We encourage you to use the release, available at https://archive.apache.org/dist/carbondata/2.1.0/, and to share feedback through the CarbonData user mailing lists!

This release note provides information on the new features, improvements, and bug fixes of this release.

What’s New in CarbonData Version 2.1.0?

In CarbonData 2.1.0, 134 JIRA tickets covering improvements and bug fixes have been resolved. The important features included in this release are summarized below.

Transactional write support using Presto

CarbonData now supports writing in transactional mode from Presto servers. This is a step forward for the Presto integration, because tables written this way can be read from the Spark and Hive engines without being recreated.

Presto local dictionary and reading for complex types

In Presto, CarbonData now supports local dictionary on complex types and reading of complex types (array and struct only). For now, only single-level array and struct types are supported for reading.

Make GeoID visible to the user

The generated geohash column is now included in the schema. Alter commands, indexes, MVs, and other table properties are not supported on this column.

Support loading data from Parquet, ORC, CSV, Avro and JSON using CarbonData SDK

CarbonData now supports loading data from the Parquet, ORC, CSV, Avro, and JSON formats directly into the Carbon format. This lets users migrate data from those formats to Carbon without an intermediate step.

Support delete and update from CarbonData SDK

Updating and deleting rows is now supported from the CarbonData SDK.

Support array<string> complex type with Secondary Index

A Secondary Index can now be created on an array<string> column to accelerate queries that have an array_contains filter. In the Secondary Index, the data for the array column is stored in a flattened format.
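
To illustrate why a flattened layout helps an array_contains filter, here is a minimal Python sketch. It is a toy model and not CarbonData's implementation: each array element becomes its own index entry pointing back to the row, so the filter becomes a single key lookup. The function names build_flattened_index and array_contains_lookup, and the sample rows, are hypothetical.

```python
from collections import defaultdict


def build_flattened_index(rows):
    """Map each array element to the sorted row ids whose array contains it.

    `rows` is a dict of row_id -> list of strings (a toy array<string> column).
    """
    index = defaultdict(set)
    for row_id, values in rows.items():
        for value in values:
            # One entry per element: this is the "flattened" form.
            index[value].add(row_id)
    return {value: sorted(ids) for value, ids in index.items()}


def array_contains_lookup(index, value):
    """Return the row ids whose array contains `value` (empty list if none)."""
    return index.get(value, [])


# Hypothetical sample data: row id -> array<string> value.
rows = {0: ["red", "blue"], 1: ["green"], 2: ["blue", "blue", "black"]}
index = build_flattened_index(rows)
matching_rows = array_contains_lookup(index, "blue")
```

With the index in place, only the rows in matching_rows need to be fetched from the main table instead of scanning every array.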

Support IndexServer with Presto Engine

Index caching for the Presto engine is improved by using the index server. Indexes for the table being scanned can now be cached in the index server, which reduces the Presto server's memory footprint.

Support Change Column Comment

Column comments can now be changed using the alter command.

Support global sort for Secondary index table

Using global sort for an SI table can improve query performance by accelerating the filter process.

Reorder filter according to the column storage ordinal to improve reading

Filters are now reordered according to the column storage ordinal to avoid backward seeks. This is helpful in cloud scenarios, where scanning is relatively costly.
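
The reordering idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not CarbonData's code: filters are modelled as (column, operator, value) tuples, storage_ordinal is a hypothetical mapping from column name to its position in the stored file, and a backward seek is counted whenever a filter's column sits before the previously read column.

```python
def reorder_filters(filters, storage_ordinal):
    """Return the filters sorted by the storage position of their column."""
    return sorted(filters, key=lambda f: storage_ordinal[f[0]])


def count_backward_seeks(filters, storage_ordinal):
    """Count how often evaluation would jump back to an earlier column."""
    positions = [storage_ordinal[f[0]] for f in filters]
    return sum(1 for prev, cur in zip(positions, positions[1:]) if cur < prev)


# Hypothetical column layout and a filter list written in arbitrary order.
storage_ordinal = {"country": 0, "city": 1, "age": 2}
filters = [("age", ">", 30), ("country", "=", "IN"), ("city", "=", "Pune")]

ordered = reorder_filters(filters, storage_ordinal)
seeks_before = count_backward_seeks(filters, storage_ordinal)
seeks_after = count_backward_seeks(ordered, storage_ordinal)
```

Because the sorted order only ever moves forward through the file, the reordered list needs no backward seeks in this model.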

Implementing a new Reindex command to repair the missing SI Segments

A separate SQL reindex command (reindex [index_table] on table maintable) is now supported to invoke the SI repair logic without running a load or insert.
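
As a rough mental model of what a repair has to determine, the Python sketch below finds the segments that exist in the main table but are missing from the SI table. This is an assumption about the general shape of the problem, not a description of CarbonData's actual repair logic. The function name find_missing_si_segments and the integer segment ids are hypothetical.

```python
def find_missing_si_segments(main_segments, si_segments):
    """Return the main-table segment ids that have no counterpart in the SI table."""
    return sorted(set(main_segments) - set(si_segments))


# Hypothetical segment ids: the SI table lacks segments 1 and 3.
main_segments = [0, 1, 2, 3]
si_segments = [0, 2]
missing = find_missing_si_segments(main_segments, si_segments)
```

In this model, a repair would then rebuild the index data only for the ids in missing.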

Support order by and limit push down for secondary index queries

SI scan time is improved by pushing down limit and order by, which reduces the output size. The push down applies when a limit is present and the order by column and all filter columns are SI columns.
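
The effect of pushing order by and limit down to the scan can be sketched generically in Python. This is a toy top-N model and not CarbonData's implementation: each segment applies the filter and returns at most `limit` rows in sort order, and a final step merges those small partial results. The function name top_n_with_pushdown and the sample data are hypothetical.

```python
import heapq


def top_n_with_pushdown(segments, predicate, sort_key, limit):
    """Filter, sort, and limit inside each segment, then merge the partial results."""
    partials = []
    for segment in segments:
        matching = (row for row in segment if predicate(row))
        # Each segment contributes at most `limit` rows instead of all matches.
        partials.extend(heapq.nsmallest(limit, matching, key=sort_key))
    return heapq.nsmallest(limit, partials, key=sort_key)


# Hypothetical data: two segments of rows, filtered on city and ordered by id.
segments = [
    [{"id": 5, "city": "Pune"}, {"id": 1, "city": "Pune"}, {"id": 9, "city": "Delhi"}],
    [{"id": 3, "city": "Pune"}, {"id": 2, "city": "Delhi"}, {"id": 4, "city": "Pune"}],
]
result = top_n_with_pushdown(
    segments,
    predicate=lambda row: row["city"] == "Pune",
    sort_key=lambda row: row["id"],
    limit=2,
)
result_ids = [row["id"] for row in result]
```

The final merge sees at most limit rows per segment, which is the reduction in output size that the push down is after.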

Please find the detailed JIRA list here.
