Abstract

Gluten is a middle layer responsible for offloading JVM-based SQL engines' execution to native engines. The project aims to address the CPU computational bottleneck by offloading JVM-based SQL operators to native engines, in data loading and various other scenarios based on Apache Spark. With advancements in IO technologies, especially the widespread use of SSDs and NICs of 10GbE or higher bandwidth, CPU computation has gradually become the primary limiting factor for performance. However, optimizing CPU instructions on the JVM is relatively challenging compared to native languages like C++, as the JVM provides fewer optimization capabilities. At this moment, Apache Spark is the first engine Gluten can plug into. Support for other engines, such as Trino and Apache Flink, is on the roadmap.

Proposal

The Gluten project utilizes the plugin mechanism of JVM-based SQL engines (currently Apache Spark's) to intercept query plans and send them to native engines for execution, bypassing the original engine's less efficient execution path. The project supports multiple native engines as backends, including Velox, ClickHouse, and Apache Arrow. For operations that the native engines cannot handle, Gluten falls back to the SQL engine's normal execution path. In terms of thread models, Gluten uses JNI (Java Native Interface) library calls to invoke native code directly within the original engine's executor task threads, avoiding the introduction of complex thread models.
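The offload-with-fallback idea described above can be illustrated with a small sketch. This is not Gluten's actual code: the `PlanNode` class, the `assign_engines` function, and the `NATIVE_SUPPORTED` operator set are all hypothetical names invented for illustration. The sketch only shows the decision the paragraph describes, namely that an operator runs natively when the backend supports it and otherwise stays on the JVM execution path.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical set of operators a native backend can execute.
# This is an assumption for illustration, not Gluten's real support list.
NATIVE_SUPPORTED = {"Scan", "Filter", "Project", "HashAggregate"}


@dataclass
class PlanNode:
    """A node in a toy query plan tree (hypothetical, for illustration)."""
    op: str
    children: List["PlanNode"] = field(default_factory=list)


def assign_engines(node: PlanNode, supported=NATIVE_SUPPORTED) -> List[Tuple[str, str]]:
    """Walk the plan in pre-order and tag each operator with an engine.

    Operators in `supported` are tagged "native"; all others are tagged
    "jvm", meaning they fall back to the original engine's execution path.
    """
    engine = "native" if node.op in supported else "jvm"
    result = [(node.op, engine)]
    for child in node.children:
        result.extend(assign_engines(child, supported))
    return result


# A toy plan: Project <- PythonUDF <- Filter <- Scan
plan = PlanNode("Project", [PlanNode("PythonUDF", [PlanNode("Filter", [PlanNode("Scan")])])])
for op, engine in assign_engines(plan):
    print(f"{op}: {engine}")
```

In this toy plan, only the `PythonUDF` operator is outside the assumed supported set, so it alone is tagged for JVM fallback while the rest are tagged for native execution.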

...

However, there is a need to address query performance more broadly. The industry understands the current performance bottleneck in existing Spark. Databricks created Photon as a high-performance native vectorized query engine, but it is commercial, closed-source software. This motivated Intel and Kyligence to initiate the Gluten project to unleash the power of Advanced Vector Extensions (AVX) technology using SIMD instructions within a vectorized SQL engine, which enables Apache Spark (as well as other engines in the future) to break through its row-based data processing and JVM limitations.

...

Gluten boasts a diverse developer base spanning multiple organizations, including but not limited to Intel, Kyligence, BIGO, Alibaba, Meituan, Microsoft, Baidu, and Netease. Many of these developers hold key roles as PMC members and actively contribute not only to Gluten but also to various other Apache projects.

...

“Gluten” is Latin for glue. The main goal of the Gluten project is to “glue” together JVM-based SQL engines, such as Spark SQL, and native libraries, so that we can make use of and benefit from both the high scalability of the Spark SQL framework and the high performance of native libraries.

...

We expect to enter incubation in two months and graduate in 18-24 months.

Homogenous Developers

...

Upon Gluten's approval to join the Apache Incubator, Intel and Kyligence will submit a Software Grant Agreement (SGA) and a CCLA (Kyligence has already signed; Intel has agreed to sign once the project enters the incubator); our initial committers will promptly submit their ICLAs. Rest assured, the codebase is already licensed under the Apache License 2.0, ensuring compliance and seamless integration.

...

Required Resources

Mailing lists:

Git Repositories:

Upon entering incubation, we want to move the existing repo to the Apache Software Foundation:

...

  • Hongze Zhang (GitHub ID: zhztheplayer) <Hongze.Zhang at intel dot com>
  • Rui Mo (GitHub ID: rui-mo) <rui.mo at intel dot com>
  • Rong Ma (GitHub ID: marin-ma) <rong.ma at intel dot com>
  • Feilong He (GitHub ID: PHILO-HE) <Feilong.He at intel dot com>
  • Zhichao Zhang (GitHub ID: zzcclp) <zhangzc at apache dot org>
  • Jia Ke (GitHub ID: JkSelf) <ke.a.jia at intel dot com>
  • Yang Li (GitHub ID: taiyang-li) <liyang910910 at gmail dot com>
  • Chuan Yang Zhang (GitHub ID: Yohahaha) <yangchuan.zy at alibaba-inc dot com>
  • Yuan Zhou (GitHub ID: zhouyuan) <yuan.zhou at intel dot com>
  • Xiduo You (GitHub ID: ulysses-you) <ulyssesyou at apache dot org>
  • Jiabiao Liang (GitHub ID: lgbo-ustc) <lgbo.ustc at gmail dot com>
  • Chunwei Zuo (GitHub ID: zuochunwei) <zuochunwei at meituan dot com>
  • Chang Chen (GitHub ID: baibaichen) <chang.chen at kyligence dot io>
  • Shuai Li (GitHub ID: loneylee) <shuai.li at kyligence dot io>
  • Binwei Yang (GitHub ID: FelixYBW) <binwei.yang at intel dot com>
  • Hongbin Ma (GitHub ID: binmahone) <mahongbin at apache dot org>
  • Neng Liu (GitHub ID: liuneng1994) <neng.liu at kyligence dot io>
  • Zhen Li (GitHub ID: zhli1142015) <zhli at microsoft dot com>
  • Weiting Chen (GitHub ID: weiting-chen) <weiting.chen at intel dot com>
  • Jacky Lee (GitHub ID: jackylee-ch) <qcsd2011 at gmail dot com>
  • Zhibiao Zhang (GitHub ID: zhanglistar) <zhanglinuxstar at gmail dot com>
  • Kuo Zhao (GitHub ID: kecookier) <zhaokuo_game at 163 dot com>: Meituan team leader who helped adopt Gluten into Meituan's production environment as Gluten's first real use case.
  • Keyong Zhou (GitHub ID: waitinfuture) <zky.zhoukeyong at alibaba-inc dot com>: Alibaba team leader and Apache Celeborn committer who helped integrate Gluten into Alibaba EMR and brought Celeborn support to Gluten.

...

  • Intel: Binwei Yang, Feilong He, Hongze Zhang, Jia Ke, Rong Ma, Rui Mo, Weiting Chen, Yuan Zhou
  • Kyligence: Chang Chen, Hongbin Ma, Neng Liu, Shuai Li, Zhichao Zhang
  • BIGO: Jiabiao Liang, Yang Li, Zhibiao Zhang
  • Alibaba: Chuan Yang Zhang, Keyong Zhou
  • Meituan: Chunwei Zuo, Kuo Zhao
  • Baidu: Jacky Lee
  • Netease: Xiduo You
  • Microsoft: Zhen Li

...

  • Yu Li (liyu@apache.org)
  • Wenli Zhang (ovilia@apache.org)
  • Kent Yao (yao@apache.org)
  • Shaofeng Shi (shaofengshi@apache.org)
  • Felix Cheung (felixcheung@apache.org)

Sponsoring Entity

We expect the Apache Incubator to sponsor this project.

...