Challenges faced in the Big Data world

In the current Big Data world, the major challenges we face include:

  • Queries are executed on TBs and PBs of data, involving tables with a large number of columns.
  • Queries include filters on multiple columns.
  • Many scenarios involve aggregate expressions.
  • Queries with filters and aggregations suffer from slow response times.

For interactive analysis to be successful, we require sub-second responses. We tried using Apache Impala, Apache Kudu and Apache HBase to meet our enterprise needs, but queries still took a long time. Apache Spark SQL also did not fit our domain well because it is structured in nature, while the bulk of our data was NoSQL in nature. Our enterprise also required us to answer queries generated from the results of previous queries, which are ad hoc in nature.

How we were trying to meet our requirements

Initially, we deployed three different engines to run different types of queries efficiently.

We deployed

  • Hive for batch queries,
  • Impala and Spark SQL for interactive queries, and
  • HBase for operational queries.

The biggest challenge before us was that each of these engines required us to store the data separately. This inflated our costs: we paid to store duplicate copies of the data and to process it separately in each engine. Storage was not the only challenge; the conversion needed to make the data suitable for each engine was itself a costly and time-consuming process. Costs escalated further because of the hardware required to support each engine, and running them in parallel was an expensive affair requiring several additional nodes.

Another challenging aspect for us was the seamless integration of Hive, Impala, Spark and HBase into our system. Integrating multiple engines without them hampering each other was a difficult feat, and even trivial configuration changes often led to the failure of one engine or another.

How Apache CarbonData helped us

Apache CarbonData is suitable for all three types of queries – sequential, interactive and random access. It enabled us to ingest large volumes of structured and unstructured data for analysis with ad hoc queries. Apache CarbonData addressed our issues by providing a single storage format that serves all use cases with equal efficiency, eliminating the need to duplicate data for each engine. It also resolved our performance issues by delivering sub-second responses across all query scenarios.

The ability of Apache CarbonData to integrate seamlessly into our system freed us from the hassle of integrating multiple engines. Its tight integration with the Apache Hadoop and Apache Spark ecosystems helped us meet our enterprise needs effectively.
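As a rough sketch of what that integration looked like on the Spark side – assuming the `CarbonSession` API from the CarbonData 1.x releases; the store path, application name and table name here are hypothetical:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.CarbonSession._ // adds getOrCreateCarbonSession to the builder

// Hypothetical store path; point this at your own HDFS deployment.
val carbon = SparkSession
  .builder()
  .appName("CarbonDataIntegration")
  .getOrCreateCarbonSession("hdfs://namenode:9000/carbon/store")

// Once the session exists, CarbonData tables are queried like any other Spark table.
carbon.sql("SELECT city, sum(amount) FROM sales GROUP BY city").show()
```

No separate engine deployment is involved: CarbonData runs inside the existing Spark cluster, which is what removed the multi-engine integration burden for us.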

Features of Apache CarbonData that satisfied our requirements

With Apache CarbonData, the difference in query performance was very prominent. Our system's query performance improved several-fold thanks to the following Apache CarbonData features:

  • Data is stored along multi-dimensional keys (MDK) and indexed in a columnar format.
  • Multiple columns can form a column group, which avoids implicit joins and saves the stitching cost of reconstructing rows.
  • A table-level global dictionary speeds up aggregation, reduces the run-time memory footprint, and enables deferred decoding.
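These behaviours are largely configured through table properties at creation time. A minimal DDL sketch, assuming `carbon` is a CarbonData-enabled SparkSession and the CarbonData 1.x `STORED BY` syntax; the table, columns and property values are illustrative only:

```scala
// Assumption: `carbon` is a SparkSession with CarbonData on the classpath.
carbon.sql("""
  CREATE TABLE IF NOT EXISTS sales (
    order_id BIGINT,
    city     STRING,
    product  STRING,
    amount   DOUBLE
  )
  STORED BY 'carbondata'
  TBLPROPERTIES (
    'DICTIONARY_INCLUDE' = 'city,product', -- columns encoded with the table-level global dictionary
    'SORT_COLUMNS'       = 'city,product'  -- columns that participate in the MDK sort/index
  )
""")
```

Choosing frequently filtered, low-cardinality columns for the dictionary and sort properties is what lets filters and aggregations on those columns benefit from the index and deferred decoding described above.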

Support from Apache CarbonData

Throughout the integration of Apache CarbonData into our system, we were supported promptly and ably by the Apache CarbonData community. The timely responses to our questions and setup issues helped us integrate Apache CarbonData into our system very quickly.

The Apache CarbonData project has a very active developer community, and the application is evolving at a very fast pace. The support planned for the following features in future versions led us to select Apache CarbonData:

  • IUD (insert, update and delete) operations,
  • streaming ingestion, and
  • integration with a wide range of tools and other Apache projects.
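For example, once IUD support landed, updates and deletes could be expressed directly in SQL. A sketch assuming CarbonData's documented UPDATE/DELETE syntax, with `carbon` a CarbonData-enabled SparkSession and a hypothetical `sales` table:

```scala
// CarbonData's UPDATE assigns a tuple of expressions to a tuple of columns.
carbon.sql("UPDATE sales SET (amount) = (amount * 1.1) WHERE city = 'Bangalore'")

// Deletes follow standard SQL syntax.
carbon.sql("DELETE FROM sales WHERE order_id = 1001")
```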

The constant development of Apache CarbonData is also helping us meet our need for a unified storage system more proficiently each day.

Conclusion

In interactive OLAP query analysis, where quick responses are essential, Apache CarbonData has proved to be very effective.
