
...

Shuffle is a crucial factor in Spark performance. It involves multiple steps, such as serialization/deserialization, network transmission, and disk I/O, and requires careful design to sustain high performance and avoid becoming a bottleneck. Because the Native Engine stores data in a columnar format, simply adopting Spark's row-based data model for Shuffle would require a column-to-row conversion in the Shuffle Write phase and a row-to-column conversion in the Shuffle Read phase just to keep data flowing smoothly. Both conversions come at a cost. Gluten therefore provides a complete Columnar Shuffle mechanism to bypass these conversion overheads. The implementation of columnar shuffle can be broadly divided into two parts: shuffle data writing and shuffle data reading.
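The conversion overhead described above can be sketched in plain Python (a hypothetical illustration, not Gluten's actual C++ implementation): a row-based shuffle write must first transpose a columnar batch into rows (column-to-row), while a columnar shuffle write hash-partitions each column directly and keeps the columnar layout end to end.

```python
def row_based_shuffle_write(batch, key_col, num_partitions):
    """Row-based path: column-to-row conversion, then partition rows.
    This transpose is exactly the overhead Gluten's columnar shuffle avoids."""
    columns = list(batch.keys())
    rows = list(zip(*batch.values()))           # C2R conversion (transpose)
    key_idx = columns.index(key_col)
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[hash(row[key_idx]) % num_partitions].append(row)
    return partitions                            # row-oriented output

def columnar_shuffle_write(batch, key_col, num_partitions):
    """Columnar path: compute each row's target partition from the key
    column, then split every column by that mapping. No row conversion."""
    part_of = [hash(k) % num_partitions for k in batch[key_col]]
    partitions = [{col: [] for col in batch} for _ in range(num_partitions)]
    for i, p in enumerate(part_of):
        for col, values in batch.items():
            partitions[p][col].append(values[i])
    return partitions                            # still columnar

# A batch is modeled here as a dict of equal-length columns.
batch = {"key": [1, 2, 1, 3], "value": ["a", "b", "c", "d"]}
cols = columnar_shuffle_write(batch, "key", num_partitions=2)
```

On the read side the situation mirrors this: a columnar reader can concatenate received column chunks directly, whereas a row-based reader would need a row-to-column conversion before handing data back to the Native Engine.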

Gluten is also integrated with Apache Celeborn (incubating) (https://celeborn.apache.org), a mature general-purpose Remote Shuffle Service that effectively addresses the stability, performance, and elasticity issues of local shuffle in big data engines. The Apache Celeborn and Gluten communities have been cooperating for some time and have successfully integrated Celeborn into Gluten, allowing Native Spark to better embrace the Cloud Native approach.

...

Git Repositories:

Upon entering incubation, we would like to move the existing repository to the Apache Software Foundation:

...