Introduction to Celeborn's Java Columnar Shuffle

Overview

Celeborn provides a Java Columnar Shuffle designed to make shuffle in SparkSQL and DataFrame operations more space-efficient. It writes shuffle data in a columnar format, which achieves a higher compression rate than the traditional row-based shuffle and significantly reduces the disk space used during shuffle operations.

Benefits

High Compression Rate: Organizing shuffle data in a columnar format significantly increases the compression ratio, which reduces the disk space that shuffle data requires. The optimization is not free: it adds overhead for the row-to-column and column-to-row conversions. If disk space is the higher priority, enabling this feature is recommended. If performance is the higher priority, weigh the conversion overhead against the space savings before enabling it.
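The compression benefit can be illustrated with a toy experiment. The sketch below is not Celeborn's encoder. It is a minimal Python illustration, using a made-up two-column dataset and zlib as a stand-in for the shuffle compression codec, of why grouping values by column tends to compress better than interleaving them row by row.

```python
import random
import zlib

# Hypothetical dataset: 10,000 rows, each with a 4-byte pseudo-random key and
# an 8-byte low-cardinality status value that appears in long runs.
rng = random.Random(0)  # fixed seed so the run is repeatable
NUM_ROWS = 10_000
keys = [rng.getrandbits(32).to_bytes(4, "big") for _ in range(NUM_ROWS)]
statuses = [
    b"ACTIVE__" if i < NUM_ROWS // 2 else b"INACTIVE" for i in range(NUM_ROWS)
]

# Row layout: the key and status of each row are stored next to each other.
row_bytes = b"".join(k + s for k, s in zip(keys, statuses))

# Columnar layout: all keys first, then all statuses.
column_bytes = b"".join(keys) + b"".join(statuses)

# zlib is only a stand-in for whatever codec the shuffle actually uses.
row_size = len(zlib.compress(row_bytes))
column_size = len(zlib.compress(column_bytes))

print(f"uncompressed:               {len(row_bytes)} bytes in both layouts")
print(f"row layout compressed:      {row_size} bytes")
print(f"columnar layout compressed: {column_size} bytes")
```

In the columnar layout the repetitive status values sit next to each other, so the compressor can encode them as a handful of long runs instead of one short match per row. Real workloads will show different ratios, and the conversion cost described above is not measured here.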

Configuration

To use Celeborn's Java Columnar Shuffle with Spark 3.x, apply a patch to Spark and then set a few configuration options. Follow the steps below:

Step 1: Apply the patch that provides schema information for shuffle

Obtain the https://github.com/apache/celeborn/tree/main/assets/spark-patch/Celeborn_Columnar_Shuffle_spark3.patch file that contains the modifications needed for enabling Columnar Shuffle in Spark 3.x.
Navigate to your Spark source directory.
Apply the patch.
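As a rough sketch of the three steps above, the Python helper below checks and then applies the patch with `git apply`. It assumes the patch file has already been downloaded and that git is installed. The paths and the helper names `patch_commands` and `apply_patch` are hypothetical and only serve the illustration.

```python
import subprocess

PATCH_NAME = "Celeborn_Columnar_Shuffle_spark3.patch"


def patch_commands(patch_file):
    """Return the git commands: a dry-run check, then the real apply."""
    patch = str(patch_file)
    return [
        ["git", "apply", "--check", patch],  # verify the patch applies cleanly
        ["git", "apply", patch],             # modify the Spark sources
    ]


def apply_patch(spark_src, patch_file):
    """Run both commands inside the Spark source directory; raise on failure."""
    for cmd in patch_commands(patch_file):
        subprocess.run(cmd, cwd=spark_src, check=True)


# Print the commands instead of running them. To apply for real, call e.g.
# apply_patch("/path/to/spark", "/path/to/" + PATCH_NAME) with absolute paths
# (both paths here are placeholders).
for cmd in patch_commands("/path/to/" + PATCH_NAME):
    print(" ".join(cmd))
```

Running the `--check` pass first means a patch that does not match your Spark version fails before any source file is touched.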

Step 2: Configure Celeborn Settings

To enable Columnar Shuffle, add the following settings to your Spark configuration file, or set them directly in your Spark application:

spark.celeborn.columnarShuffle.enabled true
spark.celeborn.columnarShuffle.encoding.dictionary.enabled true

If you require further performance optimization, consider enabling code generation with:

spark.celeborn.columnarShuffle.codegen.enabled true
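If you prefer to pass these settings on the command line rather than through a configuration file, they can be supplied as `--conf` flags to `spark-submit`. The sketch below is a minimal Python helper that builds such a command from the three settings in this section. The helper name `spark_submit_args` and the application file `my_job.py` are hypothetical, and any other Celeborn client settings your deployment needs are assumed to be configured elsewhere.

```python
import shlex

# Settings from this section; the codegen entry is the optional one.
COLUMNAR_SHUFFLE_CONF = {
    "spark.celeborn.columnarShuffle.enabled": "true",
    "spark.celeborn.columnarShuffle.encoding.dictionary.enabled": "true",
    "spark.celeborn.columnarShuffle.codegen.enabled": "true",
}


def spark_submit_args(app, conf):
    """Build a spark-submit argument list with one --conf flag per setting."""
    args = ["spark-submit"]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    args.append(app)
    return args


# "my_job.py" is a placeholder for your own Spark application.
command = spark_submit_args("my_job.py", COLUMNAR_SHUFFLE_CONF)
print(shlex.join(command))
```

Keeping the settings in one dictionary makes it easy to drop the codegen entry when comparing runs with and without code generation.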