Apache Celeborn™ 0.6.3 Release Notes
Highlight
- Parallelize partition writer creation in handleReserveSlots to reduce the reserveSlots RPC processing time
- Fail the reduce stage fast if shuffle data is lost because a worker is lost
- Fix replicate channels not resumed when transitioning from PUSH_AND_REPLICATE_PAUSED to PUSH_PAUSED
- Fix RuntimeException during stream cleanup preventing peer failover
- Fix ArithmeticException when PUSH_DATA_HAND_SHAKE fails before any data is written
Correctness
- [CELEBORN-2238] Fix RuntimeException during stream cleanup preventing peer failover
- [CELEBORN-2274] Fix replicate channels not resumed when transitioning from PUSH_AND_REPLICATE_PAUSED to PUSH_PAUSED
Improvement
- [CELEBORN-2063] Parallelize partition writer creation in handleReserveSlots to reduce the reserveSlots RPC processing time
- [CELEBORN-2166] Fail the reduce stage fast if shuffle data is lost because a worker is lost
- [CELEBORN-2212] Optimize the sorting efficiency of memoryWriters when evicting the largest memory file
- [CELEBORN-2214] Add extraInitContainers for worker in helm
- [CELEBORN-2243] During the close phase of hashWriter, pushData and mergeData are sent in parallel
- [CELEBORN-2248] Implement lazy loading for columnar shuffle classes and skew shuffle method using static holder pattern
- [CELEBORN-2256] Helm chart: add support for setting annotations on the service account (to support eks.amazonaws.com/role-arn)
- [CELEBORN-2265] Do not waste resources on the hot path for debug logging in HashBasedShuffleWriter and SortBasedShuffleWriter
- [CELEBORN-2271] StorageManager#saveCommittedFileInfosExecutor should call shutdown before awaitTermination
- [CELEBORN-2272] Add LZ4TPCDSDataBenchmark
- [CELEBORN-2277] Replace synchronized in Flusher.getWorkerIndex with AtomicInteger
- [CELEBORN-2281] Improve error logging and null checks in CreditStreamManager
- [CELEBORN-2287] Split mode should be HARD_SPLIT when disk is full
- [CELEBORN-2294] The shuffle fetch failed report from the zombie stage should be ignored
- [CELEBORN-2295] CommitHandler should support retry interval
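Two of the items above (CELEBORN-2271 and CELEBORN-2277) follow well-known Java concurrency patterns. The sketch below illustrates those patterns in isolation; it is not the Celeborn source. The class names RoundRobinIndex and ExecutorShutdown and the method names are hypothetical and chosen only for this example.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration only; not taken from Celeborn's Flusher.
final class RoundRobinIndex {
    private final AtomicInteger next = new AtomicInteger();
    private final int size;

    RoundRobinIndex(int size) {
        if (size <= 0) {
            throw new IllegalArgumentException("size must be positive");
        }
        this.size = size;
    }

    // Lock-free alternative to a synchronized counter. floorMod keeps the
    // result in [0, size) even after the int counter overflows to negative.
    int nextIndex() {
        return Math.floorMod(next.getAndIncrement(), size);
    }
}

// Hypothetical illustration only; not taken from Celeborn's StorageManager.
final class ExecutorShutdown {
    // awaitTermination only waits. Without a prior shutdown() the pool keeps
    // accepting tasks and never terminates, so the wait runs to its timeout.
    static boolean stop(ExecutorService pool, long timeout, TimeUnit unit) {
        pool.shutdown();
        try {
            return pool.awaitTermination(timeout, unit);
        } catch (InterruptedException e) {
            // Restore the interrupt flag so callers can still observe it.
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

With this shape, concurrent callers of nextIndex() never block each other, and stop() returns true once previously submitted tasks have finished within the timeout.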
Stability and Bug Fix
- [CELEBORN-1577] Fix master resource consumption metrics
- [CELEBORN-2063] Fix timeout unit for parallel creation of partition writer
- [CELEBORN-2238] Fix RuntimeException during stream cleanup preventing peer failover
- [CELEBORN-2263] Fix IndexOutOfBoundsException while reading from S3
- [CELEBORN-2273] Fix cache mutation in TagsManager.getTaggedWorkers()
- [CELEBORN-2274] Fix replicate channels not resumed when transitioning from PUSH_AND_REPLICATE_PAUSED to PUSH_PAUSED
- [CELEBORN-2276] Fix race condition in MemoryManager.releaseSortMemory
- [CELEBORN-2283] Fix missing return in Master.handleRequestSlots when all workers are excluded
- [CELEBORN-2284] Fix TLS memory leak
- [CELEBORN-2292] Fix ArithmeticException when PUSH_DATA_HAND_SHAKE fails before any data is written
- [CELEBORN-2293] Fix ConcurrentModificationException in WorkerStatusTracker.shuttingWorkers
- [CELEBORN-2296] Fix race condition in MemoryManager singleton initialization
- [CELEBORN-2302] Fix NPE in MemoryManager.close() when readBufferDispatcher is not initialized
- [CELEBORN-2304] Fix timeout unit mismatch in disk monitor check
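Two fixes in this list (CELEBORN-2063 and CELEBORN-2304) concern timeout unit mismatches. As a general illustration of that class of bug, and not the actual Celeborn change, the sketch below converts a millisecond value explicitly with java.util.concurrent.TimeUnit before handing it to an API that expects another unit. The class name TimeoutUnits and the constant CHECK_TIMEOUT_MS are hypothetical.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; not taken from Celeborn's disk monitor.
final class TimeoutUnits {
    // Assumed example value: a timeout configured in milliseconds.
    static final long CHECK_TIMEOUT_MS = 30_000L;

    // Passing the raw millisecond value to an API that expects seconds would
    // silently stretch the timeout by a factor of 1000; convert explicitly.
    static long asSeconds(long timeoutMs) {
        return TimeUnit.MILLISECONDS.toSeconds(timeoutMs);
    }

    // The same value expressed in nanoseconds, for nanosecond-based APIs.
    static long asNanos(long timeoutMs) {
        return TimeUnit.MILLISECONDS.toNanos(timeoutMs);
    }
}
```

Keeping the unit in the variable name and converting at the call site makes a mismatch visible in code review.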
Dependencies
- [CELEBORN-2218] Bump lz4-java version from 1.8.0 to 1.10.4 to resolve CVE-2025-12183 and CVE-2025-66566
- [CELEBORN-2231] Upgrade jersey version to 2.47 to fix CVE-2025-12383
- [CELEBORN-2234] Bump jetty version to 9.4.58.v20250814 to fix GHSA-qh8g-58pp-2wxh
Credits
Thanks to the following contributors who helped review and commit to the Apache Celeborn 0.6.3 release:
| Contributors | |||||
|---|---|---|---|---|---|
| Aravind Patnam | Cheng Pan | Dongdong Zhang | Enrico Olivelli | Fei Wang | Gen Luo |
| Hai Zhou | Huan Zheng | Kartikay Bhutani | Nicholas Jiang | Ping Zhang | Prateek Srivastava |
| Sanskar Modi | Shaoyun Chen | Shlomi Tubul | Shuai Lu | Xianming Lei | Zhaohui Xu |
| Zhengtao Shuai |