Comment: Update the "initial goals" and "current status" description

...

Gluten integrates with Spark's metrics functionality and extends it. The default Spark metrics are tailored to Java row-based data processing; Gluten adds a specialized column-based API and supplementary metrics. These additions help users get more out of Gluten and give developers tools for debugging the native libraries.
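
As a rough illustration of the row-based versus column-based distinction above, the sketch below updates metrics once per batch instead of once per row. The class and field names (ColumnarOperatorMetrics, output_rows, output_batches) are hypothetical and are not Gluten's actual API; a batch is modeled simply as a mapping from column name to a list of values.

```python
from dataclasses import dataclass


@dataclass
class ColumnarOperatorMetrics:
    """Hypothetical metrics holder; names are illustrative, not Gluten's API."""
    output_rows: int = 0
    output_batches: int = 0

    def record_batch(self, batch: dict) -> None:
        # A batch is modeled as {column name: list of values};
        # every column is assumed to have the same length.
        rows = len(next(iter(batch.values()))) if batch else 0
        self.output_rows += rows
        self.output_batches += 1


metrics = ColumnarOperatorMetrics()
batches = [
    {"id": [1, 2, 3], "price": [9.5, 3.0, 7.25]},
    {"id": [4, 5], "price": [1.0, 2.0]},
]
for batch in batches:
    metrics.record_batch(batch)

print(metrics.output_rows, metrics.output_batches)  # 5 2
```

Counting per batch keeps the bookkeeping cost proportional to the number of batches rather than the number of rows, which is one plausible reason a columnar engine would want its own metrics path.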

Initial Goals

  • Unified Plan Transformation: Implement a robust mechanism to transform the SQL plan into a unified plan.
  • Seamless Native Integration: Develop a smooth integration of native libraries, optimizing the offloading of performance-critical data processing tasks for improved computational speed.
  • Efficient Communication Infrastructure: Define clear JNI interfaces to facilitate efficient communication between the big data framework and native libraries.

...

  • Expand the Community: Foster the growth and diversification of the Gluten community, empowering development teams and strengthening the project's foundation.
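
The plan-transformation and offloading goals in the list above can be pictured with a toy sketch: walk a plan tree, emit a unified representation, and tag each operator as either offloaded to a native library or left on the JVM path (a fallback). Everything here is an assumption made for illustration: the PlanNode type, the NATIVE_SUPPORTED set, and the dictionary layout are hypothetical and do not reflect Gluten's real data structures.

```python
from dataclasses import dataclass, field

# Hypothetical set of operators that a native backend can execute.
NATIVE_SUPPORTED = {"Scan", "Filter", "Project", "HashAggregate"}


@dataclass
class PlanNode:
    """Toy stand-in for a node in a SQL physical plan."""
    op: str
    children: list = field(default_factory=list)


def transform(node: PlanNode) -> dict:
    """Recursively convert a plan tree into a unified representation,
    tagging each operator as 'native' or 'fallback'."""
    return {
        "op": node.op,
        "engine": "native" if node.op in NATIVE_SUPPORTED else "fallback",
        "inputs": [transform(child) for child in node.children],
    }


def fallback_ops(unified: dict) -> list:
    """Collect the operators that stay on the JVM path, in pre-order."""
    found = [unified["op"]] if unified["engine"] == "fallback" else []
    for child in unified["inputs"]:
        found.extend(fallback_ops(child))
    return found


plan = PlanNode("Project", [PlanNode("PythonUDF", [PlanNode("Filter", [PlanNode("Scan")])])])
unified = transform(plan)
print(fallback_ops(unified))  # ['PythonUDF']
```

Listing the fallback operators separately mirrors the idea of reporting fallback events to users, so they can see which parts of a query were not offloaded.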

Current Status

In 2022, Intel and Kyligence began developing Gluten and released it as an open-source Spark plugin. At the time, there was a recognized need for a project that could exploit hardware capabilities and integrate with native libraries to deliver performance beyond the limits of the existing Java-based Spark SQL.

Meritocracy:

This proposal aims to build a diverse community for Gluten, following the Apache Software Foundation's approach. Since Gluten became open source, many companies have adopted it for their big data solutions. The code is managed by developers from over 10 companies, and we invite individual developers to play key roles too. We are committed to creating an environment that values meritocratic principles.

Community:

The project has attracted strong interest from numerous companies and individuals, who have engaged in discussions about its roadmap, issues, and design over the past year. Gluten already has active contributors from various organizations, and we believe that embracing the Apache Way will further enhance the growth of both the community and the project.

Users:

Gluten has been adopted in the on-premises data warehouses of companies including Baidu, BIGO, and Meituan, where it efficiently handles thousands of tasks daily. Additionally, Alibaba's E-MapReduce product on Alibaba Cloud incorporates Gluten as a key feature, serving numerous customers. Many of our users have expressed a strong willingness to contribute to the project, which would help the community grow.

Current Status

Gluten released v1.1.0 in November 2023 with the following major features:

  • 20% performance improvement in Decision Support Benchmarks compared to v1.0.0
  • Support Spark 3.2 and Spark 3.3
  • Support Spark 3.4 (experimental)
  • Pass all Velox unit tests and the Spark 3.2/3.3 SQL-related unit tests
  • Support Ubuntu 20.04/22.04, CentOS 7/8, alinux 3, Anolis 7/8
  • Support File System: localfs, HDFS, S3, OSS (via s3a), GCS
  • Support File Format: Parquet, ORC
  • Support Data Lake: Delta Lake (experimental)
  • Support Data Types: Primitive Type, Decimal, Date, Timestamp, Array (partial), Map (partial), Struct (partial)
  • Support 28 common Spark Operators, detail here
  • Support 199 common Spark Functions, detail here
  • Support Dynamic Memory Pool and Spill
  • Support Velox UDF
  • Support Gluten UI to print fallback event in History Server
  • Support Hadoop HA and Kerberos
  • Velox code updated to 20231123 (commit ID: aff0cdec613d26294fb98b89ef292bc3c1a2e82e)
  • Documentation improvements

Meritocracy:

This proposal aims to cultivate a diverse developer and user community around Gluten, following the Apache Software Foundation's meritocracy model. Since Gluten was open-sourced, numerous enterprises have adopted it to seamlessly integrate with their existing SparkSQL services. Consequently, the Gluten project has received a significant influx of issue reports and enhancements from these companies. The project is currently hosted and supported by Intel and Kyligence accounts on GitHub and maintains close associations with various big data projects within the ASF.

Due to our project's alignment with ASF's values and integration potential with its ecosystem, we have been approached multiple times by our users regarding the possibility of Gluten being incubated under ASF. Presently, the codebase is primarily overseen by a collaborative group of developers from Intel, Kyligence, BIGO, Alibaba, NetEase, Meituan and more. We also warmly welcome individual developers to join as core contributors to Gluten. Our commitment is to foster an environment that promotes and recognizes meritocracy within the project.

Community:

Over the past year, Gluten has dedicated itself to nurturing a thriving community of contributors and users for its framework. As of now, Gluten has achieved a remarkable milestone with 800 stars and 289 forks on GitHub. We are confident that we can continue to leverage the support and expertise of the Apache Spark community to further enhance our efforts.

Core Developers:

Gluten has a diverse developer base spanning multiple organizations, including but not limited to Intel, Kyligence, BIGO, Alibaba, Meituan, Microsoft, Baidu, and NetEase. Many of these developers hold key roles as PMC members and actively contribute not only to Gluten but also to various other Apache projects.

Alignment:

Gluten is built on Apache Spark and incorporates several other Apache projects, including Hadoop and YARN. The Gluten codebase is already licensed under the Apache License, Version 2.0. Moreover, our team includes core developers with significant experience contributing to diverse Apache projects. Leveraging these community connections, we prioritize development practices that emphasize community engagement, in line with the Apache Software Foundation's meritocratic model.

...