
CarbonData is a high-performance data solution that supports various data analytic scenarios, including BI analysis, ad-hoc SQL query, fast filter lookup on detail record, streaming analytics, and so on. CarbonData has been deployed in many enterprise production environments; in one of the largest scenarios it supports queries on a single table with 3PB of data (more than 5 trillion records) with response times of less than 3 seconds!

We encourage you to use the release https://dist.apache.org/repos/dist/release/carbondata/1.5.1/, and provide feedback through the CarbonData user mailing lists!

This release note provides information on the new features, improvements, and bug fixes of this release.

What’s New in CarbonData Version 1.5.1?

CarbonData 1.5.1 moves closer to unified analytics. We want to enable CarbonData files to be read from more engines and libraries to support various use cases. To this end, we have added support to read and write CarbonData files from C++ libraries. Additionally, CarbonData files can be read using the Java SDK, the Spark FileFormat interface, Spark, and Presto.

CarbonData added multiple optimizations to reduce the store size so that queries can take advantage of reduced IO, along with optimizations to improve query and compaction performance.

In this version of CarbonData, more than 78 JIRA tickets related to new features, improvements, and bugs have been resolved. The following is a summary.

CarbonData Core

...

Support Custom Column Compressor

CarbonData supports customized column compressors so that users can plug in their own compressor implementation. To customize the compressor, specify its fully qualified class name when creating the table, or set it in the carbon property.
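As a sketch, selecting a compressor at table creation might look like the following; the table name, columns, and compressor class are hypothetical, and the carbon.column.compressor table property is the switch being described:

```sql
-- Hypothetical example: the table, columns, and compressor class are
-- illustrative only; the property key selects the column compressor.
CREATE TABLE sensor_data (
  device_id STRING,
  reading DOUBLE
)
STORED AS carbondata
TBLPROPERTIES ('carbon.column.compressor' = 'com.example.compress.MyCompressor')
```

Built-in compressor names (such as snappy or zstd) can be used with the same property.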

Performance Improvements

Optimized Carbondata Scan Performance

CarbonData scan performance is improved by avoiding multiple data copies in the vector flow. This is achieved by short-circuiting the read and vector filling: data is filled directly into the vector after being read from the file, without any intermediate copies.

Row Filter Pruning

Row-level filter processing is now handled in the execution engine; for the vector flow, CarbonData performs only blocklet and page pruning using the filter. This is controlled by the property carbon.push.rowfilters.for.vector, which is false by default.
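As a sketch, the behavior can be toggled through carbon.properties (the property name and default come from this release; enabling it is shown only for illustration):

```
# Push row-level filtering down to CarbonData for the vector flow
# (default: false, i.e. row filtering is left to the execution engine)
carbon.push.rowfilters.for.vector=true
```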

Optimized Compaction Performance

Compaction performance is optimized by pre-fetching data while reading carbon files.

Improved Blocklet DataMap Pruning in Driver

Blocklet DataMap pruning is improved by using multi-threaded processing in the driver.
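For illustration, the degree of parallelism is governed by the new carbon.max.driver.threads.for.block.pruning property (listed in the configuration table below); a carbon.properties entry might look like:

```
# Driver threads used for block/blocklet pruning (range 1-4, default 4)
carbon.max.driver.threads.for.block.pruning=4
```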

CarbonData SDK

SDK Supports C++ Interfaces for Writing CarbonData Files

To enable integration with non-Java execution engines, CarbonData supports a C++ JNI wrapper for writing CarbonData files. It can be integrated with any execution engine to write data to CarbonData files without depending on Spark or Hadoop.

Multi-Thread Read API in SDK 

To improve read performance when using the SDK, CarbonData supports multi-threaded read APIs. These enable applications to read data from multiple CarbonData files in parallel, significantly improving SDK read performance.

Other Improvements

  • Enhanced the CLI by adding more options.
  • Supported a fallback mechanism: when off-heap memory is insufficient, the job switches to on-heap memory instead of failing.
  • Supported a separate audit log.
  • Supported reading batch rows in the CSDK to improve performance.

Behavior Change

  • Local dictionary is now enabled by default.
  • Inverted index is now disabled by default.
  • Sort temp files during data loading are now compressed with Snappy by default to improve IO.

New Configuration Parameters

Configuration name                             Default Value   Range
carbon.push.rowfilters.for.vector              false           NA
carbon.max.driver.threads.for.block.pruning    4               1-4


Please find the detailed JIRA list: https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12320220&version=12344320

Sub-task

  • [CARBONDATA-2930] - Support customize column compressor
  • [CARBONDATA-2981] - Support read primitive data type in CSDK
  • [CARBONDATA-2997] - Support read schema from index file and data file in CSDK
  • [CARBONDATA-3000] - Provide C++ interface for writing carbon data
  • [CARBONDATA-3003] - Support read batch row in CSDK
  • [CARBONDATA-3004] - Fix bugs in incorrect blocklet number in bloomfilter
  • [CARBONDATA-2654] - Optimize output for explaining query with datamap
  • [CARBONDATA-2655] - Support `in` operator for bloomfilter datamap
  • [CARBONDATA-2657] - Loading/Filtering empty value fails on bloom index columns
  • [CARBONDATA-2660] - Support filtering on longstring bloom index columns
  • [CARBONDATA-2675] - Support config long_string_columns when create datamap
  • [CARBONDATA-2681] - Fix loading problem using global/batch sort fails when table has long string columns
  • [CARBONDATA-2683] - Fix data convertion problem for Varchar
  • [CARBONDATA-2685] - make datamap rebuild for all segments in parallel
  • [CARBONDATA-2687] - update document for bloomfilter
  • [CARBONDATA-2693] - Fix bug for alter rename is renameing the existing table on which bloomfilter datamp exists
  • [CARBONDATA-2694] - show long_string_columns in desc table command
  • [CARBONDATA-2702] - Fix bugs in clear bloom datamap
  • [CARBONDATA-2706] - clear bloom index file after segment is deleted
  • [CARBONDATA-2708] - clear index file if dataloading is failed
  • [CARBONDATA-2790] - Optimize default parameter for bloomfilter datamap
  • [CARBONDATA-2811] - Add query test case using search mode on table with bloom filter
  • [CARBONDATA-2835] - Block MV datamap on streaming table
  • [CARBONDATA-2844] - SK AK not getting passed to executors for global sort
  • [CARBONDATA-2845] - Merge bloom index files of multi-shards for each index column
  • [CARBONDATA-2851] - support zstd as column compressor
  • [CARBONDATA-2852] - support zstd on legacy store
  • [CARBONDATA-2853] - Add min/max index for streaming segment
  • [CARBONDATA-2859] - add sdv test case for bloomfilter datamap
  • [CARBONDATA-2869] - SDK support for Map DataType
  • [CARBONDATA-2894] - Add support for complex map type through spark carbon file format API
  • [CARBONDATA-2922] - support long string columns with spark FileFormat and SDK with "long_string_columns" TableProperties
  • [CARBONDATA-2935] - Write is_sorted field in file footer
  • [CARBONDATA-2942] - Add read and write support for writing min max based on configurable bytes count
  • [CARBONDATA-2952] - Provide CarbonReader C++ interface for SDK
  • [CARBONDATA-2957] - update document about zstd support in carbondata

Bug

  • [CARBONDATA-1787] - Carbon 1.3.0- Global Sort: Global_Sort_Partitions parameter doesn't work, if specified in the Tblproperties, while creating the table.
  • [CARBONDATA-2418] - Presto can't query Carbon table when carbonstore is created at s3
  • [CARBONDATA-2478] - Add datamap-developer-guide.md file in readme
  • [CARBONDATA-2515] - Filter OR Expression not working properly in Presto integration
  • [CARBONDATA-2516] - Filter Greater-than for timestamp datatype not generating Expression in PrestoFilterUtil
  • [CARBONDATA-2528] - MV Datamap - When the MV is created with the order by, then when we execute the corresponding query defined in MV with order by, then the data is not accessed from the MV. 
  • [CARBONDATA-2530] - [MV] Wrong data displayed when parent table data are loaded 
  • [CARBONDATA-2531] - [MV] MV not hit when alias is in use
  • [CARBONDATA-2534] - MV Dataset - MV creation is not working with the substring() 
  • [CARBONDATA-2539] - MV Dataset - Subqueries is not accessing the data from the MV datamap.
  • [CARBONDATA-2540] - MV Dataset - Unionall queries are not fetching data from MV dataset.
  • [CARBONDATA-2542] - MV creation is failed for other than default database
  • [CARBONDATA-2550] - [MV] Limit is ignored when data fetched from MV, Query rewrite is Wrong
  • [CARBONDATA-2560] - [MV] Exception in console during MV creation but MV registered successfully
  • [CARBONDATA-2568] - [MV] MV datamap is not hit when ,column is in group by but not in projection 
  • [CARBONDATA-2576] - MV Datamap - MV is not working fine if there is more than 3 aggregate function in the same datamap.
  • [CARBONDATA-2610] - DataMap creation fails on null values 
  • [CARBONDATA-2614] - There are some exception when using FG in search mode and the prune result is none
  • [CARBONDATA-2616] - Incorrect explain and query result while using bloomfilter datamap
  • [CARBONDATA-2629] - SDK carbon reader don't support filter in HDFS and S3
  • [CARBONDATA-2644] - Validation not present for carbon.load.sortMemory.spill.percentage parameter 
  • [CARBONDATA-2658] - Fix bug in spilling in-memory pages
  • [CARBONDATA-2674] - Streaming with merge index enabled does not consider the merge index file while pruning. 
  • [CARBONDATA-2703] - Fix bugs in tests
  • [CARBONDATA-2711] - carbonFileList is not initalized when updatetablelist call
  • [CARBONDATA-2715] - Failed to run tests for Search Mode With Lucene in Windows env
  • [CARBONDATA-2729] - Schema Compatibility problem between version 1.3.0 and 1.4.0
  • [CARBONDATA-2758] - selection on local dictionary fails when column having all null values more than default batch size.
  • [CARBONDATA-2769] - Fix bug when getting shard name from data before version 1.4
  • [CARBONDATA-2802] - Creation of Bloomfilter Datamap is failing after UID,compaction,pre-aggregate datamap creation
  • [CARBONDATA-2823] - Alter table set local dictionary include after bloom creation fails throwing incorrect error
  • [CARBONDATA-2854] - Release table status file lock before delete physical files when execute 'clean files' command
  • [CARBONDATA-2862] - Fix exception message for datamap rebuild command
  • [CARBONDATA-2866] - Should block schema when creating external table
  • [CARBONDATA-2874] - Support SDK writer as thread safe api
  • [CARBONDATA-2886] - select filter with int datatype is showing incorrect result in case of table created and loaded on old version and queried in new version
  • [CARBONDATA-2888] - Support multi level sdk read support for carbon tables
  • [CARBONDATA-2901] - Problem: Jvm crash in Load scenario when unsafe memory allocation is failed.
  • [CARBONDATA-2902] - Fix showing negative pruning result for explain command
  • [CARBONDATA-2908] - the option of sort_scope don't effects while creating table by data frame
  • [CARBONDATA-2910] - Support backward compatability in fileformat and support different sort colums per load
  • [CARBONDATA-2924] - Fix parsing issue for map as a nested array child and change the error message in sort column validation for SDK
  • [CARBONDATA-2925] - Wrong data displayed for spark file format if carbon file has mtuiple blocklet
  • [CARBONDATA-2926] - ArrayIndexOutOfBoundException if varchar column is present before dictionary columns along with empty sort_columns.
  • [CARBONDATA-2927] - Multiple issue fixes for varchar column and complex columns that grows more than 2MB
  • [CARBONDATA-2932] - CarbonReaderExample throw some exception: Projection can't be empty
  • [CARBONDATA-2933] - Fix errors in spelling
  • [CARBONDATA-2940] - Fix BufferUnderFlowException for ComplexPushDown
  • [CARBONDATA-2955] - bug for legacy store and compaction with zstd compressor and adaptiveDeltaIntegralCodec
  • [CARBONDATA-2956] - CarbonReader can't support use configuration to read S3 data
  • [CARBONDATA-2967] - Select is failing on pre-aggregate datamap when thrift server is restarted.
  • [CARBONDATA-2969] - Query on local dictionary column is giving empty data
  • [CARBONDATA-2974] - Bloomfilter not working when created bloom on multiple columns and queried
  • [CARBONDATA-2975] - DefaultValue choosing and removeNullValues on range filters is incorrect
  • [CARBONDATA-2979] - select count fails when carbondata file is written through SDK and read through sparkfileformat for complex datatype map(struct->array->map)
  • [CARBONDATA-2980] - clear bloomindex cache when dropping datamap
  • [CARBONDATA-2982] - CarbonSchemaReader don't support Array<string>
  • [CARBONDATA-2984] - streaming throw NPE when there is no data in the task of a batch
  • [CARBONDATA-2986] - Table Properties are lost when multiple driver concurrently creating table 
  • [CARBONDATA-2990] - JVM crashes when rebuilding the datamap.
  • [CARBONDATA-2991] - NegativeArraySizeException during query execution 
  • [CARBONDATA-2992] - Fixed Between Query Data Mismatch issue for timestamp data type
  • [CARBONDATA-2993] - Concurrent data load throwing NPE randomly.
  • [CARBONDATA-2994] - Unify property name for badrecords path in create and load.
  • [CARBONDATA-2995] - Queries slow down after some time due to broadcast issue

  • [CARBONDATA-2996] - readSchemaInIndexFile can't read schema by folder path
  • [CARBONDATA-2998] - Refresh column schema for old store(before V3) for SORT_COLUMNS option
  • [CARBONDATA-3002] - Fix some spell error and remove the data after test case finished running
  • [CARBONDATA-3007] - Fix error in document
  • [CARBONDATA-3025] - Add SQL support for cli, and enhance CLI , add more metadata to carbon file
  • [CARBONDATA-3026] - clear expired property that may cause GC problem
  • [CARBONDATA-3029] - Failed to run spark data source test cases in windows env
  • [CARBONDATA-3036] - Carbon 1.5.0 B010 - Select query fails when min/max exceeds and index tree cached
  • [CARBONDATA-3040] - Fix bug for merging bloom index
  • [CARBONDATA-3058] - Fix some exception coding in data loading
  • [CARBONDATA-3060] - Improve CLI and fix other bugs in CLI tool
  • [CARBONDATA-3062] - Fix Compatibility issue with cache_level as blocklet
  • [CARBONDATA-3065] - by default disable inverted index for all the dimension column
  • [CARBONDATA-3066] - ADD documentation for new APIs in SDK
  • [CARBONDATA-3069] - fix bugs in setting cores for compaction
  • [CARBONDATA-3077] - Fixed query failure in fileformat due stale cache issue
  • [CARBONDATA-3078] - Exception caused by explain command for count star query without filter
  • [CARBONDATA-3081] - NPE when boolean column has null values with Vectorized SDK reader
  • [CARBONDATA-3083] - Null values are getting replaced by 0 after update operation.
  • [CARBONDATA-3084] - data load with float datatype falis with internal error
  • [CARBONDATA-3098] - Negative value exponents giving wrong results
  • [CARBONDATA-3106] - Written_BY_APPNAME is not serialized in executor with GlobalSort
  • [CARBONDATA-3117] - Rearrange the projection list in the Scan
  • [CARBONDATA-3120] - apache-carbondata-1.5.1-rc1.tar.gz Datamap's core and plan project, pom.xml, is version 1.5.0, which results in an inability to compile properly
  • [CARBONDATA-3122] - CarbonReader memory leak
  • [CARBONDATA-3123] - JVM crash when reading through CarbonReader
  • [CARBONDATA-3124] - Updated log message in Unsafe Memory Manager and changed faq.md accordingly.
  • [CARBONDATA-3132] - Unequal distribution of tasks in case of compaction
  • [CARBONDATA-3134] - Wrong result when a column is dropped and added using alter with blocklet cache.

Improvement

  • [CARBONDATA-3008] - make yarn-local and multiple dir for temp data enable by default
  • [CARBONDATA-3009] - Optimize the entry point of code for MergeIndex
  • [CARBONDATA-3019] - Add error log in catch block to avoid to abort the exception which is thrown from catch block when there is an exception thrown in finally block
  • [CARBONDATA-3022] - Refactor ColumnPageWrapper
  • [CARBONDATA-3024] - Use Log4j directly
  • [CARBONDATA-3030] - Remove no use parameter in test case
  • [CARBONDATA-3031] - Find wrong description in the document for 'carbon.number.of.cores.while.loading'
  • [CARBONDATA-3032] - Remove carbon.blocklet.size from properties template
  • [CARBONDATA-3034] - Combing CarbonCommonConstants
  • [CARBONDATA-3035] - Optimize parameters for unsafe working and sort memory
  • [CARBONDATA-3039] - Fix Custom Deterministic Expression for rand() UDF
  • [CARBONDATA-3041] - Optimize load minimum size strategy for data loading
  • [CARBONDATA-3042] - Column Schema objects are present in Driver and Executor even after dropping table
  • [CARBONDATA-3046] - remove outdated configurations in template properties
  • [CARBONDATA-3047] - UnsafeMemoryManager fallback mechanism in case of memory not available
  • [CARBONDATA-3048] - Added Lazy Loading For 2.2/2.1
  • [CARBONDATA-3050] - Remove unused parameter doc
  • [CARBONDATA-3051] - unclosed streams cause tests failure in windows env
  • [CARBONDATA-3052] - Improve drop table performance by reducing the namenode RPC calls during physical deletion of files
  • [CARBONDATA-3053] - Un-closed file stream found in cli
  • [CARBONDATA-3054] - Dictionary file cannot be read in S3a with CarbonDictionaryDecoder.doConsume() codeGen
  • [CARBONDATA-3061] - Add validation for supported format version and Encoding type to throw proper exception to the user while reading a file
  • [CARBONDATA-3064] - Support separate audit log
  • [CARBONDATA-3067] - Add check for debug to avoid string concat
  • [CARBONDATA-3071] - Add CarbonSession Java Example
  • [CARBONDATA-3074] - Change default sort temp compressor to SNAPPY
  • [CARBONDATA-3075] - Select Filter fails for Legacy store if DirectVectorFill is enabled
  • [CARBONDATA-3087] - Prettify DESC FORMATTED output
  • [CARBONDATA-3088] - enhance compaction performance by using prefetch
  • [CARBONDATA-3104] - Extra Unnecessary Hadoop Conf is getting stored in LRU (~100K) for each LRU entry
  • [CARBONDATA-3112] - Optimise decompressing while filling the vector during conversion of primitive types
  • [CARBONDATA-3113] - Fixed Local Dictionary Query Performance and Added reusable buffer for direct flow
  • [CARBONDATA-3118] - Parallelize block pruning of default datamap in driver for filter query processing
  • [CARBONDATA-3121] - CarbonReader build time is huge
  • [CARBONDATA-3136] - JVM crash with preaggregate datamap
  • [CARBONDATA-2756] - Add BSD license for ZSTD external dendency
  • [CARBONDATA-2839] - Add custom compaction example