...

Gluten is a middle layer responsible for offloading Apache Spark SQL queries to native engines. The project aims to relieve the CPU bottleneck in Apache Spark-based data loading scenarios by offloading Spark SQL operators to native engines. With advances in IO technologies, especially the widespread use of SSDs and NICs of 10GbE or higher bandwidth, CPU computation has gradually become the primary limiting factor for performance. However, optimizing at the CPU-instruction level is harder on the JVM than in native languages such as C++, because the JVM provides fewer optimization capabilities.

Proposal

The Gluten project utilizes Apache Spark's plugin mechanism to intercept and send query plans to native engines for execution, bypassing Apache Spark's less efficient execution path. The project supports multiple native engines as backends, including Velox, ClickHouse, and Apache Arrow. For operations that the native engines cannot handle, Gluten falls back to Spark's normal execution path. In terms of thread models, Gluten utilizes JNI (Java Native Interface) library calls to directly invoke native code within Spark executor task threads, avoiding the introduction of complex thread models.
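The fallback behavior described above can be illustrated with a small sketch. It is not Gluten's actual code: the `PlanNode` class, the `NATIVE_SUPPORTED` set, and the operator names are all hypothetical, and a real backend decides what it can execute through its own validation logic. The sketch only shows the idea of tagging each operator in a plan tree as either offloaded to the native engine or left on Spark's normal JVM path.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical set of operators a native backend can execute.
# A real backend determines this through its own validation step.
NATIVE_SUPPORTED = {"Scan", "Filter", "Project", "HashAggregate"}


@dataclass
class PlanNode:
    """Toy stand-in for one operator in a physical query plan."""
    op: str
    children: List["PlanNode"] = field(default_factory=list)


def assign_engines(node: PlanNode) -> dict:
    """Tag every operator as 'native' (offloaded) or 'jvm' (fallback)."""
    engine = "native" if node.op in NATIVE_SUPPORTED else "jvm"
    return {
        "op": node.op,
        "engine": engine,
        "children": [assign_engines(child) for child in node.children],
    }


# "CustomUdfExec" is a made-up operator the native engine cannot handle.
plan = PlanNode(
    "HashAggregate",
    [PlanNode("CustomUdfExec", [PlanNode("Filter", [PlanNode("Scan")])])],
)
tagged = assign_engines(plan)
```

In this toy plan the aggregate, filter, and scan are tagged `native`, while the unsupported operator in the middle is tagged `jvm`, mirroring a plan that is only partially offloaded.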

Background

Apache Spark is a stable, mature project that has been under development for many years. It has proven to be one of the best frameworks for processing petabyte-scale datasets. However, the Spark community has had to address performance challenges that required various optimizations over time. A key optimization introduced in Spark 2.0 replaced the Volcano iterator model with whole-stage code generation to achieve a 2x speedup. Most optimizations work at the query plan level.

...

https://oap-project.github.io/gluten/

Rationale

The Gluten project aims to bridge the gap between Spark SQL's scalability and native libraries' performance benefits. By reusing Spark's control flow and JVM code while offloading compute-intensive data processing to native code, we seek to significantly improve performance without requiring changes to existing Spark SQL jobs. This approach involves transforming Spark's physical plan into a Substrait plan and passing it to native libraries, enabling the seamless execution of Spark SQL jobs with enhanced performance.
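As a rough illustration of the plan hand-off described above, the sketch below converts a toy operator chain into a nested, engine-neutral structure and serializes it to bytes, which is the kind of payload that could cross a JVM-to-native boundary. Everything here is an assumption made for illustration: `PhysicalOp`, `to_rel`, and `serialize_plan` are hypothetical names, the table and column names are made up, and the nested layout is only loosely inspired by Substrait's relation nesting. Real Substrait plans are protobuf messages with a much richer schema; JSON is used here only as a stand-in.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class PhysicalOp:
    """Toy stand-in for one operator of a physical plan (single-child only)."""
    kind: str                      # e.g. "read", "filter", "project"
    detail: dict                   # operator-specific fields
    child: Optional["PhysicalOp"] = None


def to_rel(op: PhysicalOp) -> dict:
    """Recursively nest each operator's child under an 'input' key."""
    rel = dict(op.detail)
    if op.child is not None:
        rel["input"] = to_rel(op.child)
    return {op.kind: rel}


def serialize_plan(root: PhysicalOp) -> bytes:
    """Serialize the plan to bytes; JSON stands in for a protobuf encoding."""
    return json.dumps({"relations": [to_rel(root)]}, sort_keys=True).encode("utf-8")


# Hypothetical query: project one column from a filtered table scan.
scan = PhysicalOp("read", {"table": "lineitem"})
flt = PhysicalOp("filter", {"condition": "l_quantity > 10"}, scan)
proj = PhysicalOp("project", {"expressions": ["l_orderkey"]}, flt)
payload = serialize_plan(proj)
```

The point of an engine-neutral plan format is that the JVM side only needs to produce one representation, while each native backend is free to interpret it in its own way.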

...

  1. Implement a robust mechanism to transform Spark's physical plan into a Substrait plan.
  2. Develop a seamless integration of native libraries for offloading performance-critical data processing.
  3. Define clear JNI interfaces for efficient communication between Spark SQL and native libraries.
  4. Enable easy switching between available native backends to enhance flexibility and performance optimization.
  5. Implement a data-sharing mechanism between JVM and native code to manage data effectively.
  6. Extend support to native accelerators for enhanced performance gains in specific use cases.
  7. Provide detailed documentation and guides for users to seamlessly configure and utilize Gluten within their Spark SQL environments.
  8. Expand support to a broader range of big data frameworks, including Flink, Trino, and more.
  9. Cultivate an active and vibrant Apache community that empowers development teams and strengthens the project.

Current Status

Gluten released v1.1.0 in November 2023 with the following major features:

...

The list is very long, so it is placed in Table 1 at the end of this page.

Cryptography

Gluten does not currently include any cryptography-related code.

Required Resources

Mailing lists:

...