Introduction to Celeborn's Java Columnar Shuffle
Overview
Celeborn provides a Java Columnar Shuffle for SparkSQL and DataFrame operations. It writes shuffle data in a columnar format, which achieves a higher compression rate than traditional row-based shuffle and therefore reduces the disk space that shuffle data occupies.
Benefits
- High Compression Rate: Organizing data into a columnar format increases the compression ratio, reducing the disk space required for shuffle data. The trade-off is extra overhead for the row-to-column and column-to-row transformations. If disk space is the higher priority, enabling this feature is recommended. If performance is the higher priority, weigh the compression gain against the transformation overhead before enabling it.
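The trade-off described above can be illustrated with a small, self-contained sketch. It is not Celeborn's implementation: the helper names (`rows_to_columns`, `columns_to_rows`, `compressed_size`), the sample data, and the use of `zlib` are all assumptions chosen for illustration. It shows the two transformations that add overhead and lets you compare compressed sizes for the two layouts on your own data.

```python
import zlib


def rows_to_columns(rows):
    """Pivot a list of row tuples into a list of column lists."""
    return [list(col) for col in zip(*rows)]


def columns_to_rows(columns):
    """Pivot a list of column lists back into a list of row tuples."""
    return [tuple(values) for values in zip(*columns)]


def compressed_size(values):
    """Serialize values one per line (via str) and return the zlib-compressed size."""
    payload = "\n".join(str(v) for v in values).encode("utf-8")
    return len(zlib.compress(payload))


# Hypothetical sample data: an id, a low-cardinality string, and a small numeric range.
rows = [(i, "region-%d" % (i % 4), (i % 10) * 1.5) for i in range(10000)]

columns = rows_to_columns(rows)  # row-to-column cost is paid on the write side
row_layout_size = compressed_size(rows)
column_layout_size = sum(compressed_size(col) for col in columns)

print("row layout, compressed bytes:   ", row_layout_size)
print("column layout, compressed bytes:", column_layout_size)

restored = columns_to_rows(columns)  # column-to-row cost is paid on the read side
```

The actual ratio depends heavily on the data and the codec, so treat the printed numbers as a way to explore the effect rather than as a benchmark.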
Configuration
To use Celeborn's Java Columnar Shuffle, apply a patch to Spark 3.x and then set a few configuration options. Follow the steps below:
Step 1: Apply this patch to obtain the schema information for shuffle
- Obtain the patch file (https://github.com/apache/celeborn/tree/main/assets/spark-patch/Celeborn_Columnar_Shuffle_spark3.patch), which contains the modifications needed to enable Columnar Shuffle in Spark 3.x.
- Navigate to your Spark source directory.
- Apply the patch.
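The steps above can be sketched as a small helper, assuming the Spark source directory is a Git checkout and that the patch is applied with `git apply`; the document does not say which tool to use, so this is one possible approach. The function names and the example paths are hypothetical.

```python
import subprocess
from typing import List


def git_apply_command(patch_file: str, check_only: bool = False) -> List[str]:
    """Build the git command that applies (or only verifies) a patch."""
    cmd = ["git", "apply"]
    if check_only:
        # --check reports whether the patch applies cleanly without changing files
        cmd.append("--check")
    cmd.append(patch_file)
    return cmd


def apply_patch(spark_src_dir: str, patch_file: str) -> None:
    """Verify the patch first, then apply it inside the Spark source directory."""
    subprocess.run(git_apply_command(patch_file, check_only=True),
                   cwd=spark_src_dir, check=True)
    subprocess.run(git_apply_command(patch_file),
                   cwd=spark_src_dir, check=True)


# Example call (hypothetical paths, adjust to your environment):
# apply_patch("/path/to/spark", "/path/to/Celeborn_Columnar_Shuffle_spark3.patch")
```

Running the check first means a patch that does not match your Spark version fails before any source file is modified. After the patch is applied, Spark has to be rebuilt for the change to take effect.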
Step 2: Configure Celeborn Settings
To enable Columnar Shuffle, add the following settings to your Spark configuration file, or set them in your Spark application:
spark.celeborn.columnarShuffle.enabled true
spark.celeborn.columnarShuffle.encoding.dictionary.enabled true
If you require further performance optimization, consider enabling code generation with:
spark.celeborn.columnarShuffle.codegen.enabled true
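As an alternative to editing the configuration file, the same settings can be passed on the command line. The sketch below turns the keys listed above into `--conf key=value` arguments for `spark-submit`. The dictionary name and the `to_submit_args` helper are hypothetical; only the configuration keys come from this document.

```python
from typing import Dict, List

# Keys taken from the settings above; codegen is the optional optimization.
COLUMNAR_SHUFFLE_CONF = {
    "spark.celeborn.columnarShuffle.enabled": "true",
    "spark.celeborn.columnarShuffle.encoding.dictionary.enabled": "true",
    "spark.celeborn.columnarShuffle.codegen.enabled": "true",
}


def to_submit_args(conf: Dict[str, str]) -> List[str]:
    """Convert a config mapping into spark-submit '--conf key=value' arguments."""
    args: List[str] = []
    for key, value in conf.items():
        args += ["--conf", "%s=%s" % (key, value)]
    return args


submit_args = to_submit_args(COLUMNAR_SHUFFLE_CONF)
# Print the flags so they can be appended to a spark-submit command line.
print(" ".join(submit_args))
```

Keeping the settings in one mapping makes it easy to drop the codegen entry when you only want the baseline columnar behavior.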