...

The Gluten project uses Apache Spark's plugin mechanism to intercept query plans and send them to native engines for execution, bypassing Apache Spark's less efficient execution path. The project supports multiple native engines as backends, including Velox, ClickHouse, and Apache Arrow. For operations that the native engines cannot handle, Gluten falls back to Spark's normal execution path. In terms of thread models, Gluten uses JNI (Java Native Interface) library calls to invoke native code directly within Spark executor task threads, avoiding the introduction of complex thread models.
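The offload-or-fall-back decision described above can be illustrated with a small, self-contained Python sketch. This is a toy model, not Gluten's implementation: the `PlanNode` type, the `NATIVE_SUPPORTED` set, and the `assign_engines` function are all hypothetical names chosen for illustration, and the list of supported operators is an assumption.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Set

# Assumption: an illustrative set of operators a native backend can run.
# This is NOT Gluten's real support matrix.
NATIVE_SUPPORTED: Set[str] = {"Scan", "Filter", "Project", "HashAggregate"}


@dataclass
class PlanNode:
    """Hypothetical stand-in for one operator in a physical plan tree."""
    op: str
    children: List["PlanNode"] = field(default_factory=list)


def assign_engines(node: PlanNode) -> Dict[str, Any]:
    """Tag each operator as 'native' if supported, otherwise 'spark' (fallback)."""
    return {
        "op": node.op,
        "engine": "native" if node.op in NATIVE_SUPPORTED else "spark",
        "children": [assign_engines(child) for child in node.children],
    }


# A plan whose middle operator is not supported by the toy native backend.
plan = PlanNode(
    "HashAggregate",
    [PlanNode("PythonUDF", [PlanNode("Filter", [PlanNode("Scan")])])],
)
annotated = assign_engines(plan)
```

In this sketch only the unsupported `PythonUDF` operator is routed to the regular Spark path, while the operators around it stay on the native side. A real system also has to convert data between the two execution paths at each boundary, which the sketch does not model.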

...

Apache Spark is a stable, mature project that has been under development for many years. The project has proven to be one of the best frameworks for processing petabyte-scale datasets. However, the Spark community has had to address performance challenges that required various optimizations over time. A key optimization introduced in Spark 2.0 replaced the Volcano execution model with whole-stage code generation to achieve a 2x speedup. Most of these optimizations work at the query plan level.

...

Numerous mature open-source native SQL engines and libraries are available, including Velox, ClickHouse, and Apache Arrow. Gluten has chosen Velox and ClickHouse as its supported backends but remains open to supporting other well-regarded open-source native SQL engines.

Meta has launched Velox (https://github.com/facebookincubator/velox), an open-source unified execution engine designed to enhance data management system efficiency and simplify development.

ClickHouse (https://clickhouse.com/) is an open-source column-oriented database management system designed for high-performance analytics and data warehousing, capable of handling massive amounts of data with very fast query processing.

Plan Conversion

Gluten uses Substrait (https://github.com/substrait-io/substrait) to build a unified query plan tree and connect it to an individual backend engine. Gluten converts Spark’s physical plan to a Substrait plan for each backend, then passes the Substrait plan over JNI to trigger the execution pipeline in the native library.
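The conversion flow can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not Gluten's code: the operator-to-relation mapping and the function names (`to_substrait_like`, `serialize_plan`, `native_execute`) are hypothetical, JSON stands in for Substrait's actual protobuf encoding, and a plain function call stands in for the JNI boundary.

```python
import json
from typing import Any, Dict, List, Optional

# Assumption: an illustrative mapping from Spark physical operators to
# Substrait relation names. The real conversion rules are far richer.
SPARK_TO_SUBSTRAIT = {
    "FileSourceScanExec": "ReadRel",
    "FilterExec": "FilterRel",
    "ProjectExec": "ProjectRel",
    "HashAggregateExec": "AggregateRel",
}


def to_substrait_like(op: str, child: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Build a Substrait-like relation node for one Spark operator."""
    rel: Dict[str, Any] = {"rel": SPARK_TO_SUBSTRAIT[op]}
    if child is not None:
        rel["input"] = child
    return rel


def serialize_plan(plan: Dict[str, Any]) -> bytes:
    """Serialize the plan to bytes (JSON here; Substrait itself uses protobuf)."""
    return json.dumps(plan, sort_keys=True).encode("utf-8")


def native_execute(plan_bytes: bytes) -> List[str]:
    """Stand-in for the native side: decode the plan and list the
    relations in execution order, from leaf to root."""
    node = json.loads(plan_bytes.decode("utf-8"))
    order: List[str] = []
    while node is not None:
        order.append(node["rel"])
        node = node.get("input")
    return list(reversed(order))


# Scan -> Filter -> Aggregate, built bottom-up and handed over as bytes.
scan = to_substrait_like("FileSourceScanExec")
filtered = to_substrait_like("FilterExec", scan)
aggregated = to_substrait_like("HashAggregateExec", filtered)
pipeline = native_execute(serialize_plan(aggregated))
```

The point of the sketch is the shape of the handoff: the JVM side produces an engine-neutral plan as a byte buffer, and the native side reconstructs its own pipeline from that buffer without knowing anything about Spark's operator classes.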

...

Gluten is also integrated with Apache Celeborn (incubating) (https://celeborn.apache.org), a mature general-purpose remote shuffle service that addresses the stability, performance, and elasticity issues of local shuffle in big data engines. The Apache Celeborn and Gluten communities have been cooperating for some time and have successfully integrated Celeborn into Gluten. This integration allows Spark to better embrace the cloud-native approach.

...

Relationships with Other Apache Products

  • Apache Spark (https://spark.apache.org/): Gluten targets Spark as its primary big data framework because Spark is a powerful open-source distributed computing framework at the core of big data analytics.
  • Apache Arrow (https://arrow.apache.org/): Gluten uses Apache Arrow as a data format to enable high-performance data interchange across diverse programming languages, frameworks, and backends.
  • Apache Celeborn (incubating) (https://celeborn.apache.org/): Gluten is closely integrated with Apache Celeborn for remote shuffle service support. The design goal of integrating Gluten with Celeborn is to preserve the core designs of both Gluten Columnar Shuffle and Celeborn Remote Shuffle, so that the advantages of both can be combined.
  • Apache Uniffle (incubating) (https://uniffle.apache.org/): Uniffle, a project offering a high-performance remote shuffle service, is another promising integration opportunity that Gluten is considering. Gluten will be supported in the Apache Uniffle v0.8 release.
  • Apache Flink (https://flink.apache.org/): Apache Flink is another promising big data framework that Gluten aims to support, acting as an intermediary layer that offloads data processing to the native engine.

...