The Apache CarbonData community is pleased to announce the availability of CarbonData 1.0.0, the fourth stable release.
We encourage everyone to download the release and provide feedback through the CarbonData mailing lists!
This release includes more than 80 new features and improvements and more than 100 bug fixes; please find the details below.
New Features
New data loading solution
The old CarbonData data loading solution depended on the Kettle engine, but Kettle was not designed for the big data domain and the flow was hard to maintain. So in version 1.0, a new data loading solution without the Kettle dependency has been added; it is more modular and improves performance.
Support Spark 2.1 integration in CarbonData
Spark 2.1 adds many features and performance improvements, and CarbonData gains these advantages after the upgrade.
Data update/delete SQL support
Users can now update and delete records in a carbon table using standard SQL syntax. This feature is currently supported in the Spark 1.5/1.6 integration and will be supported in the Spark 2.1 integration soon.
Support adaptive data compression for int/bigint/decimal to increase compression ratio
This feature stores each value in the smallest data type that fits it, and it also supports a delta compression technique to further reduce the store size.
Support to define Date/Timestamp format for different columns
Users can now provide a Date/Timestamp format for each column while loading data. The format for each Timestamp column can be defined in the CREATE TABLE DDL itself, with defaults provided so that users can create tables with Timestamp columns without always having to define the Date/Timestamp format.
Implement LRU cache for B-Tree
The B-Tree in CarbonData keeps block and blocklet information of carbon tables in memory. As the number of tables or the volume of data grows, there is a risk of running out of memory. The LRU cache for the B-Tree now keeps only recently or frequently used block/blocklet information in memory and evicts entries that are unused or rarely used.
Performance Improvements
CarbonData V2 format to improve first time query performance
The V2 format is better organized and maintains less metadata (metadata is read on demand), so first-time queries are faster. It also incurs less I/O cost compared to V1. Several test cases show first-time query response time reduced by around 50%.
Vectorized reader support
It reads the data in batches, column by column. This reduces GC time and improves performance during data scans.
Fast join using bucket table
This feature enables bucket table support in CarbonData. It can improve join query performance by avoiding shuffles when both tables are bucketed on the same column with the same number of buckets. It is supported in the Spark 2.1 integration.
Leveraging off-heap memory to reduce GC
Leveraging off-heap memory improves both loading and reading performance: during data loading it speeds up data sorting, and during reading it reduces GC overhead because data is kept off-heap.
Support single-pass loading
Previously, data loading ran as two jobs (first generate the dictionary, then perform the actual data load); this feature enables a single job to finish the data loading, generating dictionary values on the fly. It improves performance in scenarios where a load contributes few incremental updates to the dictionary, which is usually the case after the initial load.
Support pre-generated dictionary for data loading
Users can load data with a pre-generated dictionary; dictionaries customized by users are also supported, which improves data load efficiency.
Please find the detailed JIRA list: https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12320220&version=12338020
Sub-task
- [CARBONDATA-296] - 1. Add CSVInputFormat to read csv files.
- [CARBONDATA-297] - 2. Add interfaces for data loading.
- [CARBONDATA-298] - 3. Add InputProcessorStep which should iterate recordreader and parse the data as per the data type.
- [CARBONDATA-299] - 4. Add dictionary generator interfaces and give implementation for pre created dictionary.
- [CARBONDATA-300] - 5. Add EncodeProcessorStep which encodes the data with dictionary.
- [CARBONDATA-301] - 6. Add SortProcessorStep which sorts the data as per dimension order and write the sorted files to temp location.
- [CARBONDATA-302] - 7. Add DataWriterProcessorStep which reads the data from sort temp files and creates carbondata files.
- [CARBONDATA-305] - Make switching between kettle flow and new data loading flow configurable
- [CARBONDATA-308] - Use CarbonInputFormat in CarbonScanRDD compute
- [CARBONDATA-318] - Implement an InMemory Sorter that makes maximum usage of memory while sorting
- [CARBONDATA-357] - Write unit test for ValueCompressionUtil
- [CARBONDATA-377] - Improve code coverage for Core.Cache.Dictionary
- [CARBONDATA-429] - Eliminate unnecessary file name check in dictionary cache
- [CARBONDATA-431] - Improve compression ratio for numeric datatype
- [CARBONDATA-453] - Implement DAT(Double Array Trie) for Dictionary
- [CARBONDATA-461] - Clean partitioner in RDD package
- [CARBONDATA-462] - Clean up carbonTableSchema.scala before moving to spark-common package
- [CARBONDATA-463] - Extract spark-common module
- [CARBONDATA-467] - CREATE TABLE extension to support bucket table.
- [CARBONDATA-469] - Leveraging Carbondata's bucketing info for optimized Join operation
- [CARBONDATA-473] - spark 2 stable datasource api integration
- [CARBONDATA-489] - spark2 decimal issue
- [CARBONDATA-491] - do not use runnablecommand in spark2
- [CARBONDATA-499] - CarbonData-DML-Delete-Record-Support
- [CARBONDATA-500] - CarbonData-DML-Update-Support
- [CARBONDATA-501] - CarbonData-Create-Delete-DeltaFile-Support
- [CARBONDATA-502] - CarbonData-Create-Update-DeltaFile-Support
- [CARBONDATA-503] - CarbonData-Cleanup-DeltaFiles-Support
- [CARBONDATA-504] - CarbonData-Cleanup-DeltaFiles-Support
- [CARBONDATA-505] - CarbonData-Implicit-TupleID-Creation
- [CARBONDATA-506] - CarbonData-Exclude-DeletedRecords-On-Query
- [CARBONDATA-507] - CarbonData-Include-UpdatedRecords-On-Query
- [CARBONDATA-508] - CarbonData-Compaction-Delete-DeltaFiles
- [CARBONDATA-509] - CarbonData-Compaction-Update-DeltaFiles
- [CARBONDATA-510] - CarbonData-Exclude-Invalid-Btree-After-Compaction
- [CARBONDATA-517] - Use carbon property to get the store path/kettle home
- [CARBONDATA-520] - Executor can not get the read support class
- [CARBONDATA-521] - Depends on more stable class of spark in spark2
- [CARBONDATA-549] - code improvement for bigint compression
- [CARBONDATA-566] - clean up code for carbon-spark2 module
- [CARBONDATA-568] - clean up code for carbon-core module
- [CARBONDATA-569] - clean up code for carbon-processing module
- [CARBONDATA-570] - clean up code for carbon-hadoop module
- [CARBONDATA-571] - clean up code for carbon-spark module
- [CARBONDATA-572] - clean up code for carbon-spark-common module
- [CARBONDATA-588] - cleanup WriterCompressModel
- [CARBONDATA-605] - Add Update-delete related documentation
- [CARBONDATA-607] - Cleanup ValueCompressionHolder class and all sub-classes
Bug
- [CARBONDATA-333] - Unable to perform compaction
- [CARBONDATA-341] - CarbonTableIdentifier being passed to the query flow has wrong tableid
- [CARBONDATA-362] - Optimize the parameters' name in CarbonDataRDDFactory.scala
- [CARBONDATA-374] - Short data type is not working.
- [CARBONDATA-375] - Dictionary cache not getting cleared after task completion in dictionary decoder
- [CARBONDATA-381] - Unnecessary catalog metadata refresh and array index out of bounds exception in drop table
- [CARBONDATA-390] - Float Data Type is Not Working
- [CARBONDATA-404] - Data loading from DataFrame to carbon table is FAILED
- [CARBONDATA-405] - Data load fail if dataframe is created with LONG datatype column .
- [CARBONDATA-412] - in windows, when load into table whose name has "_", the old segment will be deleted.
- [CARBONDATA-418] - Data Loading performance issue
- [CARBONDATA-421] - Timestamp data type filter issue with format other than "-"
- [CARBONDATA-442] - Query result mismatching with Hive
- [CARBONDATA-448] - Solve compilation error for spark2 integration
- [CARBONDATA-451] - Can not run query on windows now
- [CARBONDATA-456] - Select count(*) from table is slower.
- [CARBONDATA-459] - Block distribution is wrong in case of dynamic allocation=true
- [CARBONDATA-471] - Optimize no kettle flow and fix issues in cluster
- [CARBONDATA-474] - Implement unit test cases for core.datastorage package
- [CARBONDATA-476] - storeLocation starting with file:/// causes table not found exception
- [CARBONDATA-481] - [SPARK2]fix late decoder and support whole stage code gen
- [CARBONDATA-486] - Reading dataframe concurrently will lead to wrong data
- [CARBONDATA-487] - spark2 integration is not compiling
- [CARBONDATA-492] - When profile spark-2.0 is active, CarbonExample has errors in IntelliJ IDEA
- [CARBONDATA-493] - Insert into select from a empty table cause exception
- [CARBONDATA-497] - [Spark2] fix datatype issue of CarbonLateDecoderRule
- [CARBONDATA-518] - CarbonExample of spark module can not run, as kettle home and store path should now be read from carbon properties
- [CARBONDATA-522] - New data loading flow causes testcase failures like big decimal etc
- [CARBONDATA-532] - When set use_kettle=false, the testcase [TestEmptyRows] run failed
- [CARBONDATA-536] - Initialize GlobalDictionaryUtil.updateTableMetadataFunc for Spark 2.x
- [CARBONDATA-537] - Bug fix for DICTIONARY_EXCLUDE option in spark2 integration
- [CARBONDATA-539] - Return empty row in map reduce application
- [CARBONDATA-544] - Delete core/.TestFileFactory.carbondata.crc,core/Testdb.carbon
- [CARBONDATA-552] - Unthrown FilterUnsupportedException in catch block
- [CARBONDATA-557] - Option use_kettle does not work when using spark-1.5
- [CARBONDATA-558] - Load performance bad when use_kettle=false
- [CARBONDATA-560] - In QueryExecutionException, can not use executorService.shutdownNow() to shut down immediately.
- [CARBONDATA-562] - Carbon Context initialization is failed with spark 1.6.3
- [CARBONDATA-563] - Select Queries are not working with spark 1.6.2.
- [CARBONDATA-573] - To fix query statistic issue
- [CARBONDATA-574] - Add thrift server support to Spark 2.0 carbon integration
- [CARBONDATA-577] - Carbon session is not working in spark shell.
- [CARBONDATA-581] - Node locality cannot be obtained in group by queries
- [CARBONDATA-582] - Able to create table When Number Of Buckets is Given in negative
- [CARBONDATA-585] - Dictionary file is locked for Updation
- [CARBONDATA-589] - carbon spark shell is not working with spark 2.0
- [CARBONDATA-593] - Select command seems to be not working on carbon-spark-shell. It throws a runtime error on select query after show method is invoked
- [CARBONDATA-595] - Drop Table for carbon throws NPE with HDFS lock type.
- [CARBONDATA-600] - Should reuse unit test case for integration module
- [CARBONDATA-608] - Compliation Error with spark 1.6 profile
- [CARBONDATA-609] - CarbonDataFileVersionIssue
- [CARBONDATA-611] - mvn clean -Pbuild-with-format package does not work
- [CARBONDATA-614] - Fix dictionary locked issue
- [CARBONDATA-617] - Insert query not working with UNION
- [CARBONDATA-618] - Add new profile to build all modules for release purpose
- [CARBONDATA-619] - Compaction API for Spark 2.1 : Issue in compaction type
- [CARBONDATA-620] - Compaction is failing in case of multiple blocklet
- [CARBONDATA-621] - Compaction is failing in case of multiple blocklet
- [CARBONDATA-622] - Should use the same fileheader reader for dict generation and data loading
- [CARBONDATA-627] - Fix Union unit test case for spark2
- [CARBONDATA-628] - Issue when measure selection without table order gives wrong result with vectorized reader enabled
- [CARBONDATA-629] - Issue with database name case sensitivity
- [CARBONDATA-630] - Unable to use string function on string/char data type column
- [CARBONDATA-632] - Fix wrong comments of load data in CarbonDataRDDFactory.scala
- [CARBONDATA-633] - Query Crash issue in case of offheap
- [CARBONDATA-634] - Load Query options invalid values are not consistent behaviour.
- [CARBONDATA-635] - ClassCastException in Spark 2.1 Cluster mode in insert query when name of column is changed and When the orders of columns are changed in the tables
- [CARBONDATA-636] - Testcases are failing in spark 1.6 and 2.1 with no kettle flow.
- [CARBONDATA-639] - "Delete data" feature doesn't work
- [CARBONDATA-641] - DICTIONARY_EXCLUDE is not working with 'DATE' column
- [CARBONDATA-643] - When we are passing 'ALL_DICTIONARY_PATH' in load query, it is throwing null pointer exception.
- [CARBONDATA-644] - Select query fails randomly on spark shell
- [CARBONDATA-648] - Code Clean Up
- [CARBONDATA-650] - Columns switching error in performing the string functions
- [CARBONDATA-654] - Add data update and deletion example
- [CARBONDATA-667] - after setting carbon property carbon.kettle.home in carbon.properties , while loading data, it is not referring to the carbon.properties file in carbonlib directory
- [CARBONDATA-668] - Data loads fail when no. of columns in load query is greater than the no. of columns in create table
- [CARBONDATA-669] - InsertIntoCarbonTableTestCase.insert into carbon table from carbon table union query random test failure
- [CARBONDATA-671] - Date data is coming as null when date data is before 1970
- [CARBONDATA-673] - Reverting big decimal compression as it has below issue
- [CARBONDATA-674] - Store compatibility 0.2 to 1.0
Improvement
- [CARBONDATA-83] - please support carbon-spark-sql CLI options
- [CARBONDATA-100] - BigInt compression
- [CARBONDATA-108] - Remove unnecessary Project for CarbonScan
- [CARBONDATA-159] - carbon should support primary key & keep mapping table table_property
- [CARBONDATA-218] - Remove Dependency: spark-csv and Unify CSV Reader for dataloading
- [CARBONDATA-270] - [Filter Optimization] double data type value comparison optimization
- [CARBONDATA-284] - Abstracting Index and Segment interface
- [CARBONDATA-285] - Use path parameter in Spark datasource API
- [CARBONDATA-287] - Save the sorted temp files to multi local dirs to improve dataloading performance
- [CARBONDATA-328] - Improve Code and Fix Warnings
- [CARBONDATA-343] - Optimize the duplicated definition code in GlobalDictionaryUtil.scala
- [CARBONDATA-347] - Remove HadoopFileInputMeta
- [CARBONDATA-348] - Remove useless step in kettle and delete them in plugin.xml
- [CARBONDATA-350] - Remove org.apache.carbondata.processing.sortdatastep
- [CARBONDATA-351] - name of thrift file is not unified
- [CARBONDATA-353] - Update doc for dateformat option
- [CARBONDATA-355] - Remove unnecessary method argument columnIdentifier of PathService.getCarbonTablePath
- [CARBONDATA-356] - Remove Two Useless Files ConvertedType.java and QuerySchemaInfo.java
- [CARBONDATA-367] - Add support alluxio(tachyon) file system(enhance ecosystem integration)
- [CARBONDATA-368] - Should improve performance of DataFrame loading
- [CARBONDATA-369] - Remove Useless Files in carbondata.scan.expression
- [CARBONDATA-388] - Remove Useless File CarbonFileFolderComparator.java
- [CARBONDATA-397] - Use of ANTLR instead of CarbonSqlParser for parsing queries
- [CARBONDATA-401] - Look forward to support reading csv file only once in data loading
- [CARBONDATA-403] - add example for data load without using kettle
- [CARBONDATA-413] - Implement unit test cases for scan.expression package
- [CARBONDATA-414] - Access array elements using index than Loop
- [CARBONDATA-420] - Remove unused parameter in config template file
- [CARBONDATA-423] - Added Example to Load Data to carbon Table using case class
- [CARBONDATA-434] - Update test cases for AllDataTypesTestCase2
- [CARBONDATA-435] - improve integration test case for AllDataTypesTestCase4
- [CARBONDATA-443] - Enable non-sort data loading
- [CARBONDATA-447] - Use Carbon log service instead of spark Logging
- [CARBONDATA-449] - Remove unnecessary log property
- [CARBONDATA-458] - Improving carbon first time query performance
- [CARBONDATA-465] - Spark streaming dataframe support
- [CARBONDATA-470] - Add unsafe offheap and on-heap sort in carbondata loading
- [CARBONDATA-480] - Add file format version enum
- [CARBONDATA-490] - Unify all RDD in carbon-spark and carbon-spark2 module
- [CARBONDATA-495] - Unify compressor interface
- [CARBONDATA-498] - Refactor compression model
- [CARBONDATA-512] - Reduce number of Timestamp formatter
- [CARBONDATA-513] - Reduce number of BigDecimal objects for scan
- [CARBONDATA-524] - improve integration test case of AllDataTypesTestCase5
- [CARBONDATA-528] - to support octal escape delimiter char
- [CARBONDATA-531] - Eliminate spark dependency in carbon core
- [CARBONDATA-535] - Enable Date and Char datatype support for Carbondata
- [CARBONDATA-538] - Add test case to spark2 integration
- [CARBONDATA-542] - Parsing values for measures and dimensions during data load should adopt a strict check
- [CARBONDATA-545] - Carbon Query GC Problem
- [CARBONDATA-546] - Extract data management command to carbon-spark-common module
- [CARBONDATA-547] - Add CarbonSession and enabled parser to use all carbon commands
- [CARBONDATA-561] - Merge the two CarbonOption.scala into one under spark-common
- [CARBONDATA-564] - long time ago, carbon may use dimension table csv file to make dictionary, but now unused, so remove
- [CARBONDATA-576] - Add mvn build guide
- [CARBONDATA-579] - Handle Fortify issues
- [CARBONDATA-606] - Add a Flink example to read CarbonData files
- [CARBONDATA-616] - Remove the duplicated class CarbonDataWriterException.java
- [CARBONDATA-624] - Complete CarbonData document to be present in git and the same needs to sync with carbondata.apache.org for further updates.
- [CARBONDATA-637] - Remove table_status file
- [CARBONDATA-638] - Move and refine package in carbon-core module
- [CARBONDATA-651] - Fix the license header of java file to be same with scala's
- [CARBONDATA-655] - Make no-kettle dataload flow the default in carbon
- [CARBONDATA-656] - Simplify the carbon session creation
- [CARBONDATA-670] - Add new MD files for Data Types and File Structure.
New Feature
- [CARBONDATA-2] - Remove kettle for loading data
- [CARBONDATA-37] - Support Date/Time format for Timestamp columns to be defined at column level
- [CARBONDATA-163] - Tool to merge Github Pull Requests
- [CARBONDATA-322] - Integration with spark 2.x
- [CARBONDATA-440] - Provide Update/Delete functionality support in CarbonData
- [CARBONDATA-441] - Add module for spark2
- [CARBONDATA-478] - Separate SparkRowReadSupportImpl implementation for integrating with Spark1.x vs. Spark 2.x
- [CARBONDATA-484] - Implement LRU cache for B-Tree
- [CARBONDATA-488] - add InsertInto feature for spark2
- [CARBONDATA-516] - [SPARK2]update union class in CarbonLateDecoderRule for Spark 2.x integration
- [CARBONDATA-519] - Enable vector reader in Carbon-Spark 2.0 integration and Carbon layer
- [CARBONDATA-540] - Support insertInto without kettle for spark2
- [CARBONDATA-580] - Support Spark 2.1 in Carbon
Task
- [CARBONDATA-444] - Improved integration test-case for AllDataTypesTestCase1
- [CARBONDATA-445] - Improved integration test-case for AllDataTypesTestCase3
Test
- [CARBONDATA-340] - Implement test cases for load package in core module
- [CARBONDATA-345] - improve code-coverage for core.carbon
- [CARBONDATA-346] - Update unit test for core module
- [CARBONDATA-371] - Write unit test for ColumnDictionaryInfo
- [CARBONDATA-379] - Test Cases to be added for Scan package under org.apache.carbondata.core
- [CARBONDATA-386] - Write unit test for Util Module
- [CARBONDATA-393] - Write Unit Test cases for core.keygenerator package
- [CARBONDATA-395] - Unit Test cases for package org.apache.carbondata.scan.expression.ExpressionResult
- [CARBONDATA-410] - Implement test cases for core.datastore.file system
- [CARBONDATA-416] - Add unit test case for result.impl package
- [CARBONDATA-438] - Add unit test for scan.scanner.impl package
- [CARBONDATA-446] - Add Unit Tests For Scan.collector.impl package
- [CARBONDATA-450] - Increase Test Coverage for Core.reader module
- [CARBONDATA-460] - Add Unit Tests For core.writer.sortindex package
- [CARBONDATA-472] - Improve code coverage for core.cache package.
- [CARBONDATA-475] - Implement unit test cases for core.carbon.querystatics package
- [CARBONDATA-482] - improve integration test case of AllDataTypesTestCase6
- [CARBONDATA-483] - Add Unit Tests For core.carbon.metadata package
- [CARBONDATA-496] - Implement unit test cases for core.carbon.datastore package
- [CARBONDATA-525] - Fix timestamp based test cases
- [CARBONDATA-575] - Remove integration-testcases module
- [CARBONDATA-601] - Should reuse unit test case for integration module
Wish
- [CARBONDATA-85] - please support insert into carbon table from other format table