...

We encourage you to use the release https://distarchive.apache.org/repos/dist/release/carbondata/1.6.0/ and to share feedback through the CarbonData user mailing lists!

...

The intention of CarbonData 1.6.0 was to move closer to unified analytics. We have added a new binary datatype to store binary objects like images, and an index server to distribute the index cache. We have also allowed users to change the sort columns of an existing table for better flexibility as per user needs. We now compact segments which are loaded using range sort, and we have supported incremental loading on MV datamaps to improve datamap loading time. We now support reading CarbonData tables from Hive and have also supported Arrow format from the SDK.

In this version of CarbonData, around 75 JIRA tickets related to new features, improvements, and bugs have been resolved. The following is a summary.

...

Index Server to distribute the index cache and parallelise the index pruning

Carbon currently prunes and caches all block/blocklet datamap index information in the driver for normal tables; for Bloom/Index datamaps, the JDBC driver launches a job to prune and cache the datamaps in the executors.

This causes the driver to become a bottleneck in the following ways:

If the cache size becomes huge (70-80% of the driver memory), then there can be excessive GC in the driver, which can slow down the queries, and the driver may even go OutOfMemory.

...

If multiple JDBC drivers want to read from the same tables, then every JDBC server needs to maintain its own copy of the cache.

...

To solve these problems, we have introduced a distributed Index Cache Server. It is a separate, scalable server that stores only index information; all the drivers can connect to it and prune the data using the cached index information.
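As an illustration, the sketch below shows how a Spark driver might be pointed at the index cache server instead of caching indexes locally. The property names (carbon.enable.index.server, carbon.index.server.ip, carbon.index.server.port) and values are assumptions here and should be verified against the CarbonData configuration documentation for this release.

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: configure a CarbonData-enabled Spark driver to delegate
// block/blocklet pruning to the shared index cache server.
// Property names below are assumptions -- confirm them in the CarbonData docs.
val spark = SparkSession.builder()
  .appName("carbon-with-index-server")
  .config("carbon.enable.index.server", "true")  // delegate pruning to the index server
  .config("carbon.index.server.ip", "10.0.0.5")  // hypothetical index server host
  .config("carbon.index.server.port", "9998")    // hypothetical index server port
  .getOrCreate()

// Queries from this session can now prune using the shared index cache,
// so each JDBC driver no longer needs its own copy of the index.
spark.sql("SELECT COUNT(*) FROM sales WHERE country = 'US'").show()
```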

Incremental data loading on MV datamaps

MV tables are created as DataMaps and managed as tables internally by CarbonData. Users can create an unlimited number of MV datamaps on a table to improve query performance, provided the storage requirements and loading time are acceptable.

An MV datamap can be a lazy or a non-lazy datamap. Once MV datamaps are created, CarbonData's CarbonAnalyzer selects the most efficient MV datamap for a user query and rewrites the SQL to select the data from the MV datamap instead of the main table. Since the MV datamap is smaller and its data is pre-processed, user queries are much faster.

For incremental loads to the main table, data is loaded into the datamap once the corresponding main table load is completed.

Previously, MV datamaps could only be rebuilt with a full load for any new data load on the parent table. Now we have supported incremental loading on MV datamaps, so any new load on the parent table triggers a load on the MV datamap only for the incrementally added data.
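For illustration, the following is a minimal sketch of this flow from a CarbonData-enabled Spark session. The table, column, and file names are made up, and the exact DDL should be checked against the datamap documentation for this release.

```scala
// Sketch only: hypothetical 'sales' table with an MV datamap that pre-aggregates it.
// Once the datamap exists, each new load into 'sales' refreshes the MV
// incrementally (only the newly added data is processed).
spark.sql("""
  CREATE DATAMAP agg_sales
  ON TABLE sales
  USING 'mv'
  AS SELECT country, SUM(amount) AS total_amount
     FROM sales
     GROUP BY country
""")

// Incremental load on the parent table; only this new data is loaded into agg_sales.
spark.sql("LOAD DATA INPATH '/data/sales_2019_07.csv' INTO TABLE sales")

// CarbonAnalyzer can rewrite this query to read from the smaller MV datamap.
spark.sql("SELECT country, SUM(amount) FROM sales GROUP BY country").show()
```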

Supported Arrow format from Carbon SDK

The SDK reader now also supports reading CarbonData files and filling them into Apache Arrow vectors. This helps avoid unnecessary intermediate serialisations when the data is accessed from other execution engines or languages.
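A rough sketch of that flow, written in Scala against the Java SDK, is shown below. The reader method names (buildArrowReader, readArrowBatch) and the serialized-batch return shape are assumptions on my part; treat this as an outline and confirm the actual signatures in the SDK javadocs for this release.

```scala
import org.apache.carbondata.sdk.file.{CarbonReader, CarbonSchemaReader}

// Outline only: read a CarbonData store directly (no Spark) and hand the rows
// over as Arrow data. Method names are assumptions -- verify against the SDK API.
val path = "/tmp/carbon_store"                      // hypothetical store written by the SDK writer
val carbonSchema = CarbonSchemaReader.readSchema(path)

val arrowReader = CarbonReader.builder(path).buildArrowReader()
val arrowBatchBytes = arrowReader.readArrowBatch(carbonSchema)

// The serialized Arrow batch can now be handed to another engine or language
// (deserialized with the Arrow library on the consuming side) without
// converting rows through an intermediate format.
arrowReader.close()
```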

...

CarbonData files can now be read from Hive. This helps users easily migrate existing Hive deployments that use other formats to the CarbonData format.

...