Abstract

Gluten is a middle layer responsible for offloading the execution of JVM-based SQL engines to native engines. The project aims to relieve the CPU computational bottleneck by offloading JVM operators to native engines in data loading and various other scenarios based on Apache Spark. With advances in I/O technology, especially the widespread use of SSDs and NICs of 10GbE or higher bandwidth, CPU computation has gradually become the primary limiting factor for performance. However, optimizing CPU instructions on the JVM is harder than in native languages such as C++, because the JVM provides fewer optimization capabilities. At this moment, Apache Spark is the first engine Gluten can plug into. Support for other engines such as Trino and Apache Flink is on the roadmap.

Proposal

The Gluten project uses the plugin mechanism of JVM-based SQL engines (currently Apache Spark's) to intercept query plans and send them to native engines for execution, bypassing the original engine's less efficient execution path. The project supports multiple native engines as backends, including Velox, ClickHouse, and Apache Arrow. For operations that the native engines cannot handle, Gluten falls back to the SQL engine's normal execution path. In terms of thread models, Gluten uses JNI (Java Native Interface) library calls to invoke native code directly within the original engine's executor task threads, avoiding the introduction of complex thread models.
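The offload-with-fallback behaviour described above can be illustrated with a toy sketch. It is not Gluten's implementation: `PlanNode`, `assign_engines`, and the `NATIVE_SUPPORTED` operator set are hypothetical names chosen for illustration, and the sketch ignores details such as data-format transitions between the two execution paths, which a real implementation would need to handle.

```python
from dataclasses import dataclass, field

# Hypothetical set of operators a native backend can execute (illustrative only).
NATIVE_SUPPORTED = {"Scan", "Filter", "Project", "HashAggregate"}


@dataclass
class PlanNode:
    """One operator in a toy query plan tree."""
    op: str
    children: list = field(default_factory=list)


def assign_engines(node, supported=NATIVE_SUPPORTED):
    """Return (operator, engine) pairs in pre-order.

    An operator is offloaded when the backend supports it;
    otherwise it falls back to the JVM execution path.
    """
    engine = "native" if node.op in supported else "jvm"
    result = [(node.op, engine)]
    for child in node.children:
        result.extend(assign_engines(child, supported))
    return result


# A plan with one operator the hypothetical backend does not support.
plan = PlanNode("Project", [PlanNode("PythonUDF", [PlanNode("Filter", [PlanNode("Scan")])])])
placement = assign_engines(plan)
```

In this sketch only `PythonUDF` stays on the JVM path; every other operator is marked for native execution.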

...

Apache Spark is a stable, mature project that has been under development for many years. The project has proven to be one of the best frameworks for processing petabyte-scale datasets. However, the Spark community has had to address performance challenges that required various optimizations over time. A key optimization introduced in Spark 2.0 replaced the Volcano model with whole-stage code generation to achieve a 2x speedup. Most of these optimizations work at the query plan level.

However, there is a need to address query performance more broadly. The industry understands the performance bottleneck in the existing Spark. Databricks created Photon as a high-performance native vectorized query engine, but it is commercial, closed-source software. This motivated Intel and Kyligence to initiate the Gluten project to unleash the power of Advanced Vector Extensions (AVX) technology using SIMD instructions within a vectorized SQL engine, which enables Apache Spark (as well as other engines in the future) to break through the limitations of row-based data processing and the JVM.

...

There are numerous mature open-source native SQL engine products and libraries available in the market, including Velox, ClickHouse, and Apache Arrow, among others. Gluten has opted for Velox and ClickHouse as backend support but remains open to expanding its support to incorporate other esteemed open-source native SQL engines.

Meta has launched Velox (https://github.com/facebookincubator/velox), an open-source unified execution engine designed to enhance data management system efficiency and simplify development.

ClickHouse (https://clickhouse.com/) is an open-source column-oriented database management system designed for high-performance analytics and data warehousing, capable of handling massive amounts of data with lightning-fast query processing.

Plan Conversion

Gluten uses Substrait.io (https://github.com/substrait-io/substrait) to build a unified query plan tree and connect it to an individual backend engine. Gluten converts Spark’s physical plan to a Substrait plan for each backend, then shares the Substrait plan over JNI to trigger the execution pipeline in the native library.
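As a rough illustration of this flow, the sketch below builds an engine-neutral plan tree, serializes it to bytes, and hands the bytes to a stand-in for the native side. Everything here is an assumption made for illustration: the function names are hypothetical, JSON stands in for the real Substrait wire format, and an ordinary Python function call stands in for the JNI boundary.

```python
import json


def to_unified_plan(op, children=()):
    """Build one node of a hypothetical engine-neutral plan tree."""
    return {"rel": op, "inputs": list(children)}


def serialize(plan):
    """Encode the plan as bytes; JSON is only a readable stand-in for the real format."""
    return json.dumps(plan, sort_keys=True).encode("utf-8")


def native_execute(payload):
    """Placeholder for the native library.

    Decodes the plan and returns operator names in execution order
    (inputs before the operators that consume them).
    """
    def walk(node):
        order = []
        for child in node["inputs"]:
            order.extend(walk(child))
        order.append(node["rel"])
        return order

    return walk(json.loads(payload.decode("utf-8")))


# Convert a toy physical plan, then pass the serialized form across the boundary.
unified = to_unified_plan("Project", [to_unified_plan("Filter", [to_unified_plan("Scan")])])
payload = serialize(unified)
pipeline = native_execute(payload)
```

The point of the sketch is the shape of the hand-off: the JVM side only produces a serialized plan, and the native side owns everything from decoding onward.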

...

Gluten is also integrated with Apache Celeborn (incubating) (https://celeborn.apache.org), a mature general-purpose Remote Shuffle Service that can effectively address the stability, performance, and elasticity issues present in the local shuffling of big data engines. The Apache Celeborn community and the Gluten community have been cooperating for some time and have successfully integrated Celeborn into Gluten. This integration allows Spark to better embrace the Cloud Native approach.

...

Gluten greatly enhances Spark’s Metrics functionality by seamlessly integrating with it. While the default Spark metrics are tailored for Java row-based data processing, Project Gluten takes it a step further. We enrich this functionality with a specialized column-based API and introduce supplementary metrics. This augmentation not only optimizes the use of Gluten but also offers developers valuable tools for debugging these native libraries effectively.

Initial Goals

  • Unified Plan Transformation: Implement a robust mechanism to transform

...

  • sql plan into

...

  • a unified plan.
  • Seamless Native Integration: Develop a

...

  • smooth integration of native libraries

...

  • , optimizing the offloading of performance-critical data processing tasks for improved computational speed.
  • Efficient Communication Infrastructure: Define clear JNI interfaces

...

  • to facilitate efficient communication between

...

  • big data framework and native libraries.

...

  • Expand the Community: Foster the growth and diversification of the Gluten community, empowering development teams and strengthening the project's foundation.

Current Status

In 2022, Intel and Kyligence initiated the development of Gluten, initially released as an open-source Spark plugin. During this period, there was a recognized need for a project capable of harnessing hardware capabilities and seamlessly integrating with native libraries to deliver superior performance, surpassing the limitations of the existing Java-based Spark SQL.

Meritocracy:

This proposal aims to build a diverse community for Gluten, following the Apache Software Foundation's approach. Since Gluten became open source, many companies have adopted it for their big data solutions. The code is managed by developers from over 10 companies, and we invite individual developers to play key roles too. We're committed to creating an environment that values meritocracy principles. 

Community:

The project has attracted strong interest from numerous companies and individuals, who have engaged in discussions about its roadmap, issues, and design over the past year. Gluten already has active contributors from various organizations. We believe that embracing the Apache Way will further enhance the growth of both the community and the project.

Users:

Gluten has been adopted in the on-premises data warehouses of companies including Baidu, BIGO, and Meituan, where it efficiently handles thousands of tasks daily. Additionally, Alibaba's E-MapReduce product on Alibaba Cloud incorporates Gluten as a key feature, serving numerous customers. A significant number of our users have expressed a strong willingness to contribute actively to the project, fostering community growth and strength.

Current Status

Gluten released v1.1.0 in November 2023 with the following major features:

  • 20% performance improvement in Decision Support Benchmarks compared to v1.0.0
  • Support Spark 3.2 and Spark 3.3
  • Support Spark 3.4 (experimental)
  • Pass all Velox UTs and Spark 3.2/3.3 SQL-related UTs
  • Support Ubuntu 20.04/22.04, CentOS 7/8, alinux 3, Anolis 7/8
  • Support File System: localfs, HDFS, S3, OSS(via s3a), GCS
  • Support File Format: Parquet, ORC
  • Support Data Lake: deltalake (experimental)
  • Support Data Types: Primitive Type, Decimal, Date, Timestamp, Array (partial), Map (partial), Struct (partial)
  • Support 28 common Spark Operators, detail here
  • Support 199 common Spark Functions, detail here
  • Support Dynamic Memory Pool and Spill
  • Support Velox UDF
  • Support Gluten UI to print fallback event in History Server
  • Support Hadoop HA and Kerberos
  • Velox code updated to 20231123(commit-id: aff0cdec613d26294fb98b89ef292bc3c1a2e82e)
  • Document Improvement

Meritocracy:

This proposal aims to cultivate a diverse developer and user community around Gluten, following the Apache Software Foundation's meritocracy model. Since Gluten was open-sourced, numerous enterprises have adopted it to seamlessly integrate with their existing SparkSQL services. Consequently, the Gluten project has received a significant influx of issue reports and enhancements from these companies. The project is currently hosted and supported by Intel and Kyligence accounts on GitHub and maintains close associations with various big data projects within the ASF.

Due to our project's alignment with ASF's values and integration potential with its ecosystem, we have been approached multiple times by our users regarding the possibility of Gluten being incubated under ASF. Presently, the codebase is primarily overseen by a collaborative group of developers from Intel, Kyligence, BIGO, Alibaba, NetEase, Meituan and more. We also warmly welcome individual developers to join as core contributors to Gluten. Our commitment is to foster an environment that promotes and recognizes meritocracy within the project.

Community:

Over the past year, Gluten has dedicated itself to nurturing a thriving community of contributors and users for its framework. As of now, Gluten has achieved a remarkable milestone with 800 stars and 289 forks on GitHub. We are confident that we can continue to leverage the support and expertise of the Apache Spark community to further enhance our efforts.

Core Developers:

Gluten boasts a diverse developer base spanning multiple organizations, including but not limited to Intel, Kyligence, BIGO, Alibaba, Meituan, Microsoft, Baidu, and NetEase. Many of these developers hold key roles as PMC members and actively contribute not only to Gluten but also to various other Apache projects.

Alignment:

Gluten is constructed using Apache Spark and incorporates several other Apache projects, including Hadoop and YARN. The codebase of Gluten is already licensed under Apache License Version 2.0. Moreover, our team includes core developers with significant experience contributing to diverse Apache projects. Leveraging these community connections, we prioritize development practices that emphasize community engagement, aligning ourselves with the Apache Software Foundation's path to meritocratic recognition seamlessly.

...

“Gluten” is Latin for glue. The main goal of the Gluten project is to “glue” JVM-based SQL engines such as Spark SQL to native libraries, so that we can benefit from both the high scalability of the Spark SQL framework and the high performance of native libraries.

...

We expect to enter incubation in two months and graduate in 18-24 months.

Homogenous Developers

...

Relationships with Other Apache Products

  • Apache Spark (https://spark.apache.org/): Gluten's endorsement of Spark as its primary big data framework of choice stems from Spark's reputation as a potent, open-source distributed computing framework, integral to the core of big data analytics.
  • Apache Arrow (https://arrow.apache.org/): Gluten utilizes Apache Arrow as a data format to empower high-performance data interchange across diverse programming languages, frameworks, and backends.
  • Apache Celeborn (incubating) (https://celeborn.apache.org/): Gluten is closely integrated with Apache Celeborn for remote shuffle service support. The design goal of integrating Gluten with Celeborn is to simultaneously preserve the core designs of Gluten Columnar Shuffle and Celeborn Remote Shuffle, allowing the advantages of both to be combined.
  • Apache Uniffle (incubating) (https://uniffle.apache.org/): Uniffle, a project offering high performance remote shuffle service capabilities, represents another promising integration opportunity that Gluten is considering. Gluten will be supported in the Apache Uniffle v0.8 release.
  • Apache Flink (https://flink.apache.org/): Apache Flink emerges as another promising big data framework that Gluten aims to incorporate as an intermediary layer, facilitating the seamless offloading of data processing to the native engine.

...

Upon Gluten's approval to join the Apache Incubator, Intel and Kyligence will submit a Software Grant Agreement (SGA) and CCLA (Kyligence has already signed; Intel has agreed to sign once the project enters the incubator); our initial committers will promptly submit their iCLAs. Rest assured, the codebase is already licensed under the Apache License 2.0, ensuring compliance and seamless integration.

...

Required Resources

Mailing lists:

Git Repositories:

Upon entering incubation, we want to move the existing repo to the Apache Software Foundation:

...

  • Hongze Zhang (GitHub ID: zhztheplayer) <Hongze.Zhang at intel dot com>
  • Rui Mo (GitHub ID: rui-mo) <rui.mo at intel dot com>
  • Rong Ma (GitHub ID: marin-ma) <rong.ma at intel dot com>
  • Feilong He (GitHub ID: PHILO-HE) <Feilong.He at intel dot com>
  • Zhichao Zhang (GitHub ID: zzcclp) <zhangzc at apache dot org>
  • Jia Ke (GitHub ID: JkSelf) <ke.a.jia at intel dot com>
  • Yang Li (GitHub ID: taiyang-li) <liyang910910 at gmail dot com>
  • Chuan Yang Zhang (GitHub ID: Yohahaha) <yangchuan.zy at alibaba-inc dot com>
  • Yuan Zhou (GitHub ID: zhouyuan) <yuan.zhou at intel dot com>
  • Xiduo You (GitHub ID: ulysses-you) <ulyssesyou at apache dot org>
  • Jiabiao Liang (GitHub ID: lgbo-ustc) <lgbo.ustc at gmail dot com>
  • Chunwei Zuo (GitHub ID: zuochunwei) <zuochunwei at meituan dot com>
  • Chang Chen (GitHub ID: baibaichen) <chang.chen at kyligence dot io>
  • Shuai Li (GitHub ID: loneylee) <shuai.li at kyligence dot io>
  • Binwei Yang (GitHub ID: FelixYBW) <binwei.yang at intel dot com>
  • Hongbin Ma (GitHub ID: binmahone) <mahongbin at apache dot org>
  • Neng Liu (GitHub ID: liuneng1994) <neng.liu at kyligence dot io>
  • Zhen Li (GitHub ID: zhli1142015) <zhli at microsoft dot com>
  • Weiting Chen (GitHub ID: weiting-chen) <weiting.chen at intel dot com>
  • Jacky Lee (GitHub ID: jackylee-ch) <qcsd2011 at gmail dot com>
  • Zhibiao Zhang (GitHub ID: zhanglistar) <zhanglinuxstar at gmail dot com>
  • Kuo Zhao (GitHub ID: kecookier) <zhaokuo_game at 163 dot com>: Meituan team leader who helped adopt Gluten in Meituan's production-ready environment as Gluten’s first real use case.
  • Keyong Zhou (GitHub ID: waitinfuture) <zky.zhoukeyong at alibaba-inc dot com>: Alibaba team leader and Apache Celeborn committer who helped integrate Gluten into Alibaba EMR and brought Celeborn support to Gluten.

...

  • Intel: Binwei Yang, Feilong He, Hongze Zhang, Jia Ke, Rong Ma, Rui Mo, Weiting Chen, Yuan Zhou
  • Kyligence: Chang Chen, Hongbin Ma, Neng Liu, Shuai Li, Zhichao Zhang
  • BIGO: Jiabiao Liang, Yang Li, Zhibiao Zhang
  • Alibaba: Chuan Yang Zhang, Keyong Zhou
  • Meituan: Chunwei Zuo, Kuo Zhao
  • Baidu: Jacky Lee
  • NetEase: Xiduo You
  • Microsoft: Zhen Li

...

  • Yu Li (liyu@apache.org)
  • Wenli Zhang (ovilia@apache.org)
  • Kent Yao (yao@apache.org)
  • Shaofeng Shi (shaofengshi@apache.org)
  • Felix Cheung (felixcheung@apache.org)

Sponsoring Entity

We expect the Apache Incubator to sponsor this project.

...