Welcome to Apache CarbonData
Overview
- Release plan
- What is CarbonData?
- What problem does CarbonData solve?
- What are the key technology benefits of CarbonData?
- CarbonData vs popular Hadoop Data Stores
Release plan
A stable release is published roughly every three months. This is the Apache CarbonData wiki. If you are interested in contributing to CarbonData, visit the Contributing to CarbonData page to learn more.
Release plan (around 3 months per release)

Date | Version
---|---
Aug 2016 | Apache CarbonData 0.1.0-incubating
Sep 2016 | Apache CarbonData 0.1.1-incubating
Nov 2016 | Apache CarbonData 0.2.0-incubating
Jan 2017 | Apache CarbonData 1.0.0-incubating
What is CarbonData?
CarbonData is a fully indexed, columnar, Hadoop-native data store for processing heavy analytical workloads and detailed queries on big data. In customer benchmarks, CarbonData has managed petabytes of data running on extraordinarily low-cost hardware and answered queries around 10 times faster than the current open-source solutions (column-oriented SQL-on-Hadoop data stores).
What problem does CarbonData solve?
For big data interactive analysis scenarios, many customers expect sub-second responses when querying TB- to PB-scale data on general-purpose hardware clusters with just a few nodes.
In the current big data ecosystem, there are a few columnar storage formats, such as ORC and Parquet, designed for SQL on big data. Apache Hive's ORC format is a columnar storage format with basic indexing capability. However, ORC cannot meet the sub-second query response expectation on TB-level data, because it performs only stride-level dictionary encoding, and all analytical operations such as filtering and aggregation are done on the actual data. Apache Parquet is a columnar storage format that can improve performance over ORC because of its more efficient storage organization. Though Parquet can answer queries on TB-level data in a few seconds, it is still far from the sub-second expectation of interactive analysis users. Cloudera Kudu can effectively solve some query performance issues, but Kudu is not Hadoop-native and cannot seamlessly integrate historical HDFS data into a new Kudu system.
CarbonData, however, uses specially engineered optimizations targeted at improving the performance of analytical queries involving filters, aggregations, and distinct counts. Because the required data is stored in an indexed, well-organized, read-optimized format, CarbonData can achieve sub-second query response times.
What are the key technology benefits of CarbonData?
The key aspects of Carbon's technology that enable such dramatic performance benefits are summarized as follows:
- Global Dictionary Encoding with Lazy Conversion: Most databases and big data SQL data stores employ columnar encoding to achieve data compression by storing small integer numbers (surrogate values) instead of full string values. However, almost all existing databases and data stores divide the data into row groups containing anywhere from a few thousand to a million rows and employ dictionary encoding only within each row group. Hence, the same column value can have different surrogate values in different row groups, so while reading the data, conversion from surrogate value to actual value must be done immediately after the data is read from disk. Carbon instead employs a global surrogate key, which means that a common dictionary is maintained for the full store on one machine/node. Carbon can therefore perform all query processing work, such as grouping/aggregation and sorting, on lightweight surrogate values; the conversion from surrogate to actual values needs to be done only on the final result. This improves performance in two ways:
- Conversion from surrogate values to actual values is done only for the final result rows which are much less than the actual rows read from the store.
- All query processing and computation such as grouping/aggregation, sorting, and so on is done on lightweight surrogate values which requires less memory and CPU time compared to actual values.
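The two effects above can be illustrated with a small sketch in plain Python (a toy model, not CarbonData's actual implementation): a single dictionary covers the whole store, aggregation runs entirely on integer surrogates, and only the final, much smaller result set is decoded back to strings.

```python
from collections import defaultdict

# Toy global dictionary encoding with lazy conversion (illustrative only;
# CarbonData's real dictionary is maintained per store, not per list).

def build_global_dictionary(values):
    """Assign one surrogate integer per distinct value across the whole store."""
    dictionary = {}
    for v in values:
        if v not in dictionary:
            dictionary[v] = len(dictionary)  # surrogate key
    return dictionary

def encode(values, dictionary):
    """Replace full string values with lightweight integer surrogates."""
    return [dictionary[v] for v in values]

def group_count(surrogates):
    """All grouping/aggregation work happens on integers, not strings."""
    counts = defaultdict(int)
    for s in surrogates:
        counts[s] += 1
    return counts

# Example column; because the dictionary is global, the same value always
# maps to the same surrogate, no matter which row group it sits in.
column = ["beijing", "shanghai", "beijing", "shenzhen", "beijing"]
dictionary = build_global_dictionary(column)
counts = group_count(encode(column, dictionary))

# Lazy conversion: decode only the final (small) result rows.
reverse = {s: v for v, s in dictionary.items()}
result = {reverse[s]: c for s, c in counts.items()}
print(result)  # {'beijing': 3, 'shanghai': 1, 'shenzhen': 1}
```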
- Unique Data Organization: Though Carbon stores data in a columnar format, it differs from traditional columnar formats in that the columns in each row group (Data Block) are sorted independently of the other columns. Though this arrangement requires Carbon to store a row-number mapping against each column value, it makes it possible to use binary search for faster filtering, and since the values are sorted, same/similar values come together, which yields better compression and offsets the storage overhead required by the row-number mapping.
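A minimal sketch of this arrangement (illustrative Python, not Carbon's on-disk layout): within one block, a column is stored sorted alongside the row numbers its values came from, so an equality filter becomes a binary search followed by a row-number lookup.

```python
import bisect

# Illustrative per-column sorting with a row-number mapping (toy model).

def build_sorted_column(values):
    """Sort (value, original_row_number) pairs by value, independently of
    any other column in the block."""
    pairs = sorted((v, row) for row, v in enumerate(values))
    sorted_values = [v for v, _ in pairs]   # identical values are now adjacent
    row_mapping = [row for _, row in pairs]  # maps sorted position -> row number
    return sorted_values, row_mapping

def filter_equals(sorted_values, row_mapping, target):
    """Binary-search the sorted values, then map hits back to row numbers."""
    lo = bisect.bisect_left(sorted_values, target)
    hi = bisect.bisect_right(sorted_values, target)
    return sorted(row_mapping[lo:hi])

city = ["shanghai", "beijing", "beijing", "shenzhen", "beijing"]
sorted_values, row_mapping = build_sorted_column(city)
print(filter_equals(sorted_values, row_mapping, "beijing"))  # [1, 2, 4]
```

Because identical values end up adjacent after sorting, a run-length or dictionary encoder compresses this layout well, which is what offsets the cost of storing `row_mapping`.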
- Multi Level Indexing: Carbon uses multiple indices at various levels to enable faster search and speed up query processing.
- Global Multi Dimensional Keys (MDK) based B+Tree Index for all non-measure columns: Aids in quickly locating the row groups (Data Blocks) that contain the data matching search/filter criteria.
- Min-Max Index for all columns: Aids in quickly locating the row groups (Data Blocks) that contain the data matching search/filter criteria.
- Data Block level Inverted Index for all columns: Aids in quickly locating the rows that contain the data matching search/filter criteria within a row group (Data Block).
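The Min-Max index in particular is easy to sketch (hypothetical block data, plain Python): each data block records the minimum and maximum of a column, and a filter skips every block whose range cannot contain the predicate value.

```python
# Toy min-max index over three hypothetical data blocks (not CarbonData code).
blocks = [
    {"id": 0, "rows": [3, 7, 9]},
    {"id": 1, "rows": [12, 15, 21]},
    {"id": 2, "rows": [22, 30, 41]},
]

# Build the index: one (min, max) pair per block.
minmax = [(min(b["rows"]), max(b["rows"])) for b in blocks]

def prune(value):
    """Return ids of blocks that might contain `value`; all others can be
    skipped without reading any of their data."""
    return [b["id"] for b, (lo, hi) in zip(blocks, minmax) if lo <= value <= hi]

print(prune(15))  # only block 1 needs to be read -> [1]
print(prune(50))  # no block can match -> []
```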
- Advanced Push Down Optimizations: Carbon pushes as much of query processing as possible close to the data to minimize the amount of data being read, processed, converted and transmitted/shuffled.
- Projection and Filters: Since Carbon uses a columnar format, it reads only the required columns from the store, and only the rows that match the filter conditions provided in the query.
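A toy sketch of projection and filter push-down (plain Python over a hypothetical columnar layout): only the requested columns are touched, and the filter is evaluated during the scan, before any result row is materialized.

```python
# Hypothetical columnar "store": one list per column (illustrative only).
store = {
    "city":  ["beijing", "shanghai", "beijing"],
    "year":  [2016, 2016, 2017],
    "sales": [10, 20, 30],
}

def scan(store, projection, predicate_column, predicate):
    """Read only the `projection` columns, keeping just the rows for which
    `predicate` holds on `predicate_column`."""
    keep = [i for i, v in enumerate(store[predicate_column]) if predicate(v)]
    return {col: [store[col][i] for i in keep] for col in projection}

result = scan(store, projection=["city", "sales"],
              predicate_column="year", predicate=lambda y: y == 2016)
print(result)  # {'city': ['beijing', 'shanghai'], 'sales': [10, 20]}
```

Columns not named in the projection or the predicate are never read at all, which is the point of pushing the work down to the storage layer.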
Besides remarkable performance on a variety of database workloads, Carbon includes several other features designed to offer performance, scalability, reliability, and ease of use. These include:
- A shared nothing, grid-based database architecture based on Spark that allows Carbon to scale effectively on clusters of commodity CPUs.
CarbonData vs popular Hadoop Data Stores
Structure Comparison
The Carbon file format has many structural similarities with the Parquet and ORC formats, yet there are some significant differences that make Carbon several times faster than Parquet or ORC for queries.
Performance Comparison
Carbon performs much better than ORC and Parquet in most query scenarios; the performance advantage is most evident in the following:
...
Date | Version
---|---
May 2017 | Apache CarbonData 1.1.0
Aug-Sep 2017 | Apache CarbonData 1.2.0 |
Jan-Feb 2018 | Apache CarbonData 1.3.0 |
Mar 2018 | Apache CarbonData 1.3.1 |
May 2018 | Apache CarbonData 1.4.0 |
Aug 2018 | Apache CarbonData 1.4.1 |
Oct 2018 | Apache CarbonData 1.5.0 |
Dec 2018 | Apache CarbonData 1.5.1 |
Jan 2019 | Apache CarbonData 1.5.2 |
Mar 2019 | Apache CarbonData 1.5.3 |
May 2019 | Apache CarbonData 1.5.4 |
Aug 2019 | Apache CarbonData 1.6.0 |
Oct 2019 | Apache CarbonData 1.6.1 |
May 2020 | Apache CarbonData 2.0.0 |
Jun 2020 | Apache CarbonData 2.0.1 |
Nov 2020 | Apache CarbonData 2.1.0 |
Mar 2021 | Apache CarbonData 2.1.1 |
Aug 2021 | Apache CarbonData 2.2.0 |
Jan 2022 | Apache CarbonData 2.3.0 |
Roadmap:
1.0.x:
- Support Spark 2.1 integration in CarbonData
- Remove Kettle; support new data load solution
- Support data update and delete SQL in Spark 1.6
1.1.x:
- Add page structure in blocklets to improve scan performance
- Support V3 format to improve TPC-H performance
- Support vector features by default
- Support data update and delete SQL in Spark 2.1
1.2.x
- Support specifying sort columns for the MDK (Multi-Dimensional Key) index
- Support partition
- Support Presto integration
- Support Hive integration
- Optimize data loading using ColumnPage in the write step and make it off-heap
1.3.x:
- Support streaming ingestion data to CarbonData
- Provide an index framework so that users can add more indexes
- Support local dictionary
- Ecosystem integration(eg. latest Apache Spark version 2.x)
1.4.x:
- Support creating CarbonData tables on cloud storage (AWS S3, Huawei OBS)
- Provide an index framework so that users can add more indexes, such as a text index using Lucene
- Ecosystem integration
1.5.x:
- Support MV (Materialized View) and Bloom Filter (production-ready features)
- Support CarbonData engine for improving concurrent access and point queries
- Ecosystem integration
- Support alter add column in carbon file format
- Support multiple-character separators in CSV files during data loading
- Compaction support for segments created with range_sort and global_sort
- Support DDLs to operate on Driver Cache (Get cache size, clear cache)
- Support building datamaps and data load in parallel to reduce the overall time taken
- Summary of loaded and bad records data after data loading
1.6.x:
- Support storing of carbon Min-Max indexes in external system
- MV DataMap Enhancements and Stabilisation
- Query performance enhancements
- Deeper Presto integration and stabilisation
- UDF and UDAF support in Pre-aggregate tables
- Support read from Hive
2.0.x:
- Support write into Hive
- Load performance improvements
- TPCDS [Query, load] performance improvements
- Carbon Advisor for auto-suggestion of the ideal table schema, including MV, index, sort column, range column, compression ...
- Delete and update support in CarbonData SDK
- Support C engine reader for CarbonData SDK
- ES based datamap management
- Support Spark DataSource API V2
- Support CarbonData metadata management using DB or other external OLTP system
- Support MV on Streaming tables, partition tables, Time Series
- Support MV creation from another MV
2.1.x:
- Presto read support for complex columns
- Make GeoID visible to the user
- Support CarbonData SDK to load data from Parquet, ORC, CSV, Avro, and JSON
- Implement delete and update features in the CarbonData SDK
- Support array<string> with SI
- Support IndexServer with Presto Engine
- Implement a new Reindex command to repair missing SI segments
- Support Change Column Comment
- Support Local dictionary for presto complex datatypes
- Block Pruning for geospatial polygon expression
- Improve concurrent query performance
- Support global sort for Secondary index table
- Filter reordering
- Geospatial index algorithm improvement and UDFs enhancement
- CarbonData Trash support
- Support writing Flink stage data into the HDFS file system
- Support MERGE INTO SQL Command
- Support Complex DataType when Save DataFrame
- Add global sort support for the SI segment data file merge operation
2.2.x:
- Support add, drop, and rename column for complex columns
- Spark-3.1 support
- Secondary Index Support for Presto
- CDC Performance improvement
- Local sort Partition Load and Compaction improvement
- Geo Spatial Query enhancements
- Improve table status and metadata writing
2.3.x:
- Support spatial index creation using data frame
- Introduce Streamer tool for Carbondata
- Upgrade PrestoSQL to version 333
- Multi-level complex schema support
- Support for Dynamic Partition Pruning
Page Links
- Committers
- Releases
- CarbonData Performance Reports
- Apache CarbonData Performance Benchmark (0.1.0)
- Events (Summit and Meetup materials)
- Use cases and shared articles
...