INTRODUCTION

CarbonData's Spark integration: DataFrame and SQL compliance.

DESCRIPTION

Apache CarbonData has built-in Spark integration for Spark 1.6.2 and 2.1, with interfaces for Spark SQL, the DataFrame API, and query optimization. It supports bulk data ingestion and allows Spark DataFrames to be saved as CarbonData files, as sketched below.
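
A minimal sketch of saving a Spark DataFrame as CarbonData files through the data source API is shown below (assuming Spark 2.1; the format name "carbondata" and the "tableName" option are assumptions that may differ between CarbonData releases, some of which require a CarbonSession instead of a plain SparkSession):

    import org.apache.spark.sql.{SaveMode, SparkSession}

    val spark = SparkSession.builder()
      .appName("carbondata-write-example")
      .getOrCreate()

    import spark.implicits._

    // Sample DataFrame to persist as CarbonData files.
    val df = Seq((1, "alice"), (2, "bob")).toDF("id", "name")

    // Save through the Spark data source API; format and option names are assumptions.
    df.write
      .format("carbondata")
      .option("tableName", "sample_table")
      .mode(SaveMode.Overwrite)
      .save()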

CARBONDATA-SPARK INTEGRATION

Figure 1 : CarbonData-Spark Integration

Apache CarbonData uses Spark for data management and query optimization. It provides its own reader and writer, and the resulting files are stored on HDFS. Apache CarbonData acts as a SparkSQL data source.

CARBONDATA AS A SPARKSQL DATA SOURCE

Figure 2 : Roles of Spark Components in CarbonData

i) Parser/Analyzer

  • Parser : The parser parses every incoming query, e.g. insert, update, and delete. It is hooked into SparkSQL so that every query that is fired passes through it, and it handles the new SQL syntaxes that CarbonData adds, such as update/delete and compaction (see the sketch after this list). Once a statement has been parsed, further processing of the query begins.
  • Resolve Relation : Resolves the relation to the Carbon data source, which enables buildScan and insert.
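
As an illustration of the extended syntax handled by the parser, the sketch below fires update, delete, and compaction statements through spark.sql. The exact statement forms vary between CarbonData releases, and the snippet assumes a session with CarbonData's parser hooked in and an existing table named sample_table.

    // Assumes `spark` is a SparkSession/CarbonSession with CarbonData's
    // extended SQL parser enabled and a CarbonData table `sample_table`.
    spark.sql("UPDATE sample_table SET (name) = ('carol') WHERE id = 1")
    spark.sql("DELETE FROM sample_table WHERE id = 2")
    spark.sql("ALTER TABLE sample_table COMPACT 'MINOR'") // compaction statement added by CarbonData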

ii) Optimize and Physical Planning

The next step is to determine which parts of the plan can be optimized. One simple optimization is lazy decoding.

  • Lazy decoding: the dictionary-encoded values are decoded only when the actual data has to be returned; all intermediate work (filtering, grouping, aggregation) continues on the dictionary values.

          Example query: the lazy decode leverages the global dictionary, as shown in Figure 3.

 

Figure 3 : Lazy decoding  

           Search and filter predicates are applied directly to the dictionary values; they are converted to the actual values only when the results have to be handed back to Spark, as in the example below.
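
The hedged example below shows the kind of query that benefits from lazy decoding: the filter and the group-by operate on dictionary-encoded values, and the actual strings are decoded only for the rows in the final result (the table and column names are illustrative, and `spark` is assumed to be the session from the first sketch):

    // Filtering and grouping run on the dictionary-encoded values of `country`
    // and `city`; the real string values are materialized only for the
    // grouped rows returned to Spark.
    spark.sql(
      """SELECT city, COUNT(*) AS cnt
        |FROM sales_table
        |WHERE country = 'US'
        |GROUP BY city""".stripMargin
    ).show()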

iii) Execution

  • A Spark Resilient Distributed Dataset (RDD) is the fundamental data structure of Spark. It is an immutable distributed collection of objects, and each RDD is divided into logical partitions, which may be computed on different nodes of the cluster.
  • The features of RDDs (decomposing the name), with a short sketch after this list:

            Resilient, i.e. fault-tolerant: with the help of the RDD lineage graph, Spark can recompute missing or damaged partitions after node failures.

            Distributed, with data residing on multiple nodes in a cluster.

            Dataset, a collection of partitioned data with primitive values or values of values, e.g. tuples or other objects that represent records of the data you work with.
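
A short sketch of these properties using plain Spark RDD APIs (nothing CarbonData-specific; assumes the SparkSession `spark` from the first sketch):

    // Create an RDD with 4 logical partitions; each partition may be computed
    // on a different node of the cluster.
    val rdd = spark.sparkContext.parallelize(1 to 100, numSlices = 4)

    // Transformations build a lineage graph instead of mutating the RDD;
    // lost partitions can be recomputed from this lineage after a node failure.
    val doubled = rdd.map(_ * 2)

    println(doubled.getNumPartitions) // 4
    println(doubled.toDebugString)    // prints the RDD lineage graph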

  Figure 4 : Sequence Diagram (Spark 1.6.2)

The key classes in the query execution flow, their roles, and their packages:

  • CarbonScanRDD : Query execution RDD that leverages the multi-level index for efficient filtering and scan; also covers the DML-related RDDs.
    Package: org.apache.carbondata.spark.rdd.CarbonScanRDD

  • DetailQueryExecutor : Carbon query interface.
    Package: org.apache.carbondata.core.scan.executor.QueryExecutor

  • DetailQueryResult : Internal query interface; executes the query and returns an iterator over the query result.
    Package: org.apache.carbondata.core.scan.result.iterator.AbstractDetailQueryResultIterator

  • DetailBlockIterator : Blocklet iterator to process the blocklets.
    Package: org.apache.carbondata.core.scan.processor.AbstractDataBlockIterator

  • FilterScanner : Interface for scanning the blocklet; there are two types of scanner, the non-filter scanner and the filter scanner.
    Package: org.apache.carbondata.core.scan.scanner.BlockletScanner

  • DictionaryBasedResultCollector : Prepares the query results from the scanned result.
    Package: org.apache.carbondata.core.scan.collector.ScannedResultCollector

  • FilterExecutor : Interface for executing the filter on the executor side.
    Package: org.apache.carbondata.core.scan.filter.executor.FilterExecuter

  • DimensionColumnChunkReader : Reader interface for reading and uncompressing the blocklet dimension column data.
    Package: org.apache.carbondata.core.datastore.chunk.reader.DimensionColumnChunkReader

  • MeasureColumnChunkReader : Reader interface for reading and uncompressing the blocklet measure column data.
    Package: org.apache.carbondata.core.datastore.chunk.reader.MeasureColumnChunkReader
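
To tie the classes above back to a user-level operation, the sketch below reads the CarbonData table through the data source API; a filtered scan of this kind is what flows through CarbonScanRDD, the query executor, and the blocklet scanners and result collectors listed above. As in the earlier write sketch, the format name and the tableName option are assumptions that may differ between releases.

    import org.apache.spark.sql.functions.col

    // Assumes `spark` is the SparkSession from the first sketch and that
    // `sample_table` was written as CarbonData files earlier.
    val result = spark.read
      .format("carbondata")                // assumed data source short name
      .option("tableName", "sample_table") // assumed option name
      .load()

    // A filter plus projection of the kind that exercises the filter and
    // scan classes listed above.
    result.filter(col("id") > 1).select("id", "name").show()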

