INTRODUCTION
DataFrame & SQL Compliance.
DESCRIPTION
It has built-in Spark integration for Spark 1.6.2 and 2.1, with interfaces for Spark SQL, the DataFrame API, and query optimization. It supports bulk data ingestion and allows Spark DataFrames to be saved as CarbonData files.
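A minimal Scala sketch of the DataFrame write path, assuming Spark 2.x with the CarbonData jars on the classpath; the `getOrCreateCarbonSession` helper, the store path, the `tableName` option, and the table/column names follow typical CarbonData 1.x examples and may differ across releases.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.CarbonSession._   // adds getOrCreateCarbonSession (assumed CarbonData 1.x helper)

// Create a CarbonData-enabled session; the store path is illustrative.
val spark = SparkSession.builder()
  .appName("carbondata-dataframe-write")
  .master("local[*]")
  .getOrCreateCarbonSession("hdfs://namenode/carbon/store")

import spark.implicits._

// A small DataFrame to ingest in bulk.
val df = Seq((1, "Bangalore", 100.0), (2, "Shenzhen", 250.0))
  .toDF("id", "city", "amount")

// Save the DataFrame as CarbonData files.
df.write
  .format("carbondata")            // CarbonData data-source name
  .option("tableName", "sales")    // option name may vary by version
  .mode(SaveMode.Overwrite)
  .save()
```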
CARBONDATA-SPARK INTEGRATION
Figure 1: CarbonData-Spark Integration
Apache CarbonData uses Spark for data management and query optimisation. It has its own reader and writer, and all CarbonData files are stored on HDFS. Apache CarbonData acts as a Spark SQL data source.
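The sketch below shows how that looks in practice: a Carbon table is registered and queried through Spark SQL, reusing the `spark` session from the previous sketch. The table name, columns, HDFS path, and the `STORED BY 'carbondata'` / `LOAD DATA` statements follow older CarbonData examples and may vary by release.

```scala
// Reuses the CarbonData-enabled `spark` session from the previous sketch.
// Table name, columns, and the HDFS path are illustrative assumptions.
spark.sql(
  """CREATE TABLE IF NOT EXISTS sales (
    |  id INT,
    |  city STRING,
    |  amount DOUBLE
    |)
    |STORED BY 'carbondata'""".stripMargin)

// Bulk-load a CSV file into the Carbon table.
spark.sql("LOAD DATA INPATH 'hdfs://namenode/data/sales.csv' INTO TABLE sales")

// Query through Spark SQL; CarbonData serves as the underlying data source.
spark.sql("SELECT city, SUM(amount) AS total FROM sales GROUP BY city").show()
```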
CARBONDATA AS A SPARKSQL DATA SOURCE
Figure 2: Roles of Spark Components in CarbonData
i) Parser/Analyzer
- Parser: The parser parses every incoming query, e.g. insert, update, and delete. It is hooked internally into every query that is fired and is used to parse CarbonData's new SQL syntaxes, such as update/delete and compaction (a sketch of these statements follows this list). Once the syntax has been parsed, further processing of the query begins.
- Resolve Relation: resolves the table to the Carbon data source relation, which enables buildScan and insert.
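A hedged sketch of the extended SQL the Carbon parser accepts: the statements below follow the update/delete and compaction syntax documented for CarbonData 1.x releases (the `sales` table is the assumed one from the earlier sketches), and the exact syntax may differ between versions.

```scala
// Update and delete run through CarbonData's extended SQL parser.
spark.sql("UPDATE sales SET (amount) = (amount * 1.1) WHERE city = 'Bangalore'")
spark.sql("DELETE FROM sales WHERE id = 2")

// Compaction merges the small segments created by repeated loads.
spark.sql("ALTER TABLE sales COMPACT 'MINOR'")
```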
ii) Optimisation and Physical Planning
The next step is to find what can be optimised. One simple optimisation is lazy decoding.
- Lazy decoding: decode dictionary values only when the actual data has to be returned, while performing all intermediate work directly on the dictionary-encoded values.
Example query: a lazy decode that leverages the global dictionary, as illustrated below.
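As an illustration (reusing the assumed `sales` table from the earlier sketches), a query like the one below benefits from lazy decoding: the filter, group-by, and aggregation all operate on dictionary-encoded `city` codes, and the codes are decoded to actual strings only when the final rows are returned to Spark.

```scala
// Filter, group-by, and SUM all run on dictionary codes;
// `city` is decoded to actual strings only for the final result.
spark.sql(
  """SELECT city, SUM(amount) AS total
    |FROM sales
    |WHERE city <> 'Shenzhen'
    |GROUP BY city""".stripMargin).show()
```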
Figure 3: Lazy decoding
All search and filter predicates can be applied directly to the dictionary values, and those values are converted to the actual values only when the results need to be handed back to Spark.
iii) Execution
- A Spark Resilient Distributed Dataset (RDD) is the fundamental data structure of Spark: an immutable, distributed collection of objects. Each dataset in an RDD is divided into logical partitions, which may be computed on different nodes of the cluster.
- The features of RDDs (decomposing the name), with a short sketch after this list:
  - Resilient, i.e. fault-tolerant with the help of the RDD lineage graph, and therefore able to recompute missing or damaged partitions caused by node failures.
  - Distributed, with data residing on multiple nodes in a cluster.
  - Dataset, a collection of partitioned data with primitive values or values of values, e.g. tuples or other objects that represent records of the data you work with.
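A small, generic Spark sketch (not CarbonData-specific) of these properties: the data is split into partitions distributed across the cluster, transformations only record lineage, and a lost partition can be recomputed from that lineage.

```scala
import org.apache.spark.sql.SparkSession

val session = SparkSession.builder()
  .appName("rdd-example")
  .master("local[*]")
  .getOrCreate()
val sc = session.sparkContext

// Distributed: the dataset is split into 4 logical partitions,
// which may be computed on different nodes of the cluster.
val rdd = sc.parallelize(1 to 1000, numSlices = 4)

// Transformations only record lineage; nothing executes yet.
val squares = rdd.map(x => x.toLong * x)

// Resilient: if a partition is lost to a node failure, Spark recomputes it
// from the lineage (parallelize -> map) instead of relying on replication.
println(squares.reduce(_ + _))      // action triggers execution
println(squares.getNumPartitions)   // 4
```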
Figure 4: Sequence Diagram (Spark 1.6.2)
| Class | Description | Package |
|---|---|---|
| CarbonScanRDD | Query-execution RDD that leverages the multi-level index for efficient filtering and scanning; the DML-related RDDs build on it. | org.apache.carbondata.spark.rdd.CarbonScanRDD |
| DetailQueryExecutor | Carbon query execution interface. | org.apache.carbondata.core.scan.executor.QueryExecutor |
| DetailQueryResult | Internal query result interface; executes the query and returns an iterator over the query result. | org.apache.carbondata.core.scan.result.iterator.AbstractDetailQueryResultIterator |
| DetailBlockIterator | Blocklet iterator that processes the blocklets. | org.apache.carbondata.core.scan.processor.AbstractDataBlockIterator |
| FilterScanner | Interface for scanning a blocklet; there are two types of scanner: filter scanner and non-filter scanner. | org.apache.carbondata.core.scan.scanner.BlockletScanner |
| DictionaryBasedResultCollector | Prepares the query results from the scanned result. | org.apache.carbondata.core.scan.collector.ScannedResultCollector |
| FilterExecutor | Interface for executing the filter on the executor side. | org.apache.carbondata.core.scan.filter.executor.FilterExecuter |
| DimensionColumnChunkReader | Reader interface for reading and uncompressing the blocklet dimension column data. | org.apache.carbondata.core.datastore.chunk.reader.DimensionColumnChunkReader |
| MeasureColumnChunkReader | Reader interface for reading and uncompressing the blocklet measure column data. | org.apache.carbondata.core.datastore.chunk.reader.MeasureColumnChunkReader |