Skip to content

Apache Celeborn™ 0.6.0 Release Notes

Highlight

  • Support Flink hybrid shuffle integration with Apache Celeborn
  • Support Spark 4.0.0
  • Support Flink 2.0
  • Support HARD_SPLIT in PushMergedData
  • Optimize skew partition logic for Reduce Mode to avoid sorting shuffle files
  • Refine the celeborn RESTful APIs and introduce Celeborn CLI for automation
  • Supporting worker Tags
  • CongestionController supports control traffic by user/worker traffic speed
  • Support read shuffle from S3
  • Support CppClient in Celeborn
  • HELM Chart Optimization

Improvement

  • [CELEBORN-2045] Add logger sinks to allow persist metrics data and avoid possible worker OOM
  • [CELEBORN-2046] Specify extractionDir of AsyncProfilerLoader with celeborn.worker.jvmProfiler.localDir
  • [CELEBORN-2027] Allow CelebornShuffleReader to decompress data on demand
  • [CELEBORN-2003] Add retry mechanism when completing S3 multipart upload
  • [CELEBORN-2018] Support min number of workers selected for shuffle
  • [CELEBORN-2006] LifecycleManager should avoid parsing shufflePartitionType every time
  • [CELEBORN-2007] Reduce PartitionLocation memory usage
  • [CELEBORN-2008] SlotsAllocator should select disks randomly in RoundRobin mode
  • [CELEBORN-1896] delete data from failed to fetch shuffles
  • [CELEBORN-2004] Filter empty partition before createIntputStream
  • [CELEBORN-2002][MASTER] Audit shuffle lifecycle in separate log file
  • [CELEBORN-1993] CelebornConf introduces celeborn..io.threads to specify number of threads used in the client thread pool
  • [CELEBORN-1995] Optimize memory usage for push failed batches
  • [CELEBORN-1994] Introduce disruptor dependency to support asynchronous logging of log4j2
  • [CELEBORN-1965] Rely on all default hadoop providers for S3 auth
  • [CELEBORN-1982] Slot Selection Perf Improvements
  • [CELEBORN-1961] Convert Resource.proto from Protocol Buffers version 2 to version 3
  • [CELEBORN-1931] use gather API for local flusher to optimize write io pattern
  • [CELEBORN-1916] Support Aliyun OSS Based on MPU Extension Interface
  • [CELEBORN-1925] Support Flink 2.0
  • [CELEBORN-1844][CIP-8] introduce tier writer proxy and simplify partition data writer
  • [CELEBORN-1921] Broadcast large GetReducerFileGroupResponse to prevent Spark driver network exhausted
  • [CELEBORN-1928][CIP-12] Support HARD_SPLIT in PushMergedData should support handle older worker success response
  • [CELEBORN-1930][CIP-12] Support HARD_SPLIT in PushMergedData should handle congestion control NPE issue
  • [CELEBORN-1577][PHASE2] QuotaManager should support interrupt shuffle
  • [CELEBORN-1856] Support stage-rerun when read partition by chunkOffsets when enable optimize skew partition read
  • [CELEBORN-1894] Allow skipping already read chunks during unreplicated shuffle read retried
  • [CELEBORN-1909] Support pre-run static code blocks of TransportMessages to improve performance of protobuf serialization
  • [CELEBORN-1911] Move multipart-uploader to multipart-uploader/multipart-uploader-s3 for extensibility
  • [CELEBORN-1910] Remove redundant synchronized of isTerminated in ThreadUtils#sameThreadExecutorService
  • [CELEBORN-1898] SparkOutOfMemoryError compatible with Spark 4.0 and 4.1
  • [CELEBORN-1897] Avoid calling toString for too long messages
  • [CELEBORN-1882] Support configuring the SSL handshake timeout for SSLHandler
  • [CELEBORN-1883] Replace HashSet with ConcurrentHashMap.newKeySet for ShuffleFileGroups
  • [CELEBORN-1858] Support DfsPartitionReader read partition by chunkOffsets when enable optimize skew partition read
  • [CELEBORN-1879] Ignore invalid chunk range generated by splitSkewedPartitionLocations
  • [CELEBORN-1857] Support LocalPartitionReader read partition by chunkOffsets when enable optimize skew partition read
  • [CELEBORN-1876] Log remote address on RPC exception for TransportRequestHandler
  • [CELEBORN-1865] Update master endpointRef when master leader is abnormal
  • [CELEBORN-1319] Optimize skew partition logic for Reduce Mode to avoid sorting shuffle files
  • [CELEBORN-1861] Support celeborn.worker.storage.baseDir.diskType option to specify disk type of base directory for worker
  • [CELEBORN-1757] Add retry when sending RPC to LifecycleManager
  • [CELEBORN-1859] DfsPartitionReader and LocalPartitionReader should reuse pbStreamHandlers get from BatchOpenStream request
  • [CELEBORN-1843] Optimize roundrobin for more balanced disk slot allocation
  • [CELEBORN-1854] Change receive revive request log level to debug
  • [CELEBORN-1841] Support custom implementation of EventExecutorChooser to avoid deadlock when calling await in EventLoop thread
  • [CELEBORN-1847][CIP-8] Introduce local and DFS tier writer
  • [CELEBORN-1850] Setup worker endpoint after initalizing controller
  • [CELEBORN-1838] Interrupt spark task should not report fetch failure
  • [CELEBORN-1835][CIP-8] Add tier writer base and memory tier writer
  • [CELEBORN-1829] Replace waitThreadPoll's thread pool with ScheduledExecutorService in Controller
  • [CELEBORN-1720] Prevent stage re-run if another task attempt is running or successful
  • [CELEBORN-1482][CIP-8] Add partition meta handler
  • [CELEBORN-1812] Distinguish sorting-file from sort-tasks waiting to be submitted
  • [CELEBORN-1737] Support build tez client package
  • [CELEBORN-1802] Fail the celeborn master/worker start if CELEBORN_CONF_DIR is not directory
  • [CELEBORN-1701][FOLLOWUP] Support stage rerun for shuffle data lost
  • [CELEBORN-1413] Support Spark 4.0
  • [CELEBORN-1753] Optimize the code for exists and find method
  • [CELEBORN-1777] Add java.security.jgss/sun.security.krb5 to DEFAULT_MODULE_OPTIONS
  • [CELEBORN-1731] Support merged kv input for Tez
  • [CELEBORN-1732] Support unordered kv input for Tez
  • [CELEBORN-1733] Support ordered grouped kv input for Tez
  • [CELEBORN-1721][CIP-12] Support HARD_SPLIT in PushMergedData
  • [CELEBORN-1758] Remove the empty user resource consumption from worker heartbeat
  • [CELEBORN-1729] Support ordered KV output for Tez
  • [CELEBORN-1730] Support unordered KV output for Tez
  • [CELEBORN-1612] Add a basic reader writer class to Tez
  • [CELEBORN-1725] Optimize performance of handling MapperEnd RPC in LifecycleManager
  • [CELEBORN-1700] Flink supports fallback to vanilla Flink built-in shuffle implementation
  • [CELEBORN-1621][CIP-11] Predefined worker tags expr via dynamic configs
  • [CELEBORN-1545] Add Tez plugin skeleton and dag app master
  • [CELEBORN-1530] support MPU for S3
  • [CELEBORN-1618][CIP-11] Supporting tags via DB Config Service
  • [CELEBORN-1726] Update WorkerInfo when transition worker state
  • [CELEBORN-1715] RemoteShuffleMaster should check celeborn.client.push.replicate.enabled in constructor
  • [CELEBORN-1714] Optimize handleApplicationLost
  • [CELEBORN-1660] Cache available workers and only count the available workers device free capacity
  • [CELEBORN-1619][CIP-11] Integrate tags manager with config service
  • [CELEBORN-1071] Support stage rerun for shuffle data lost
  • [CELEBORN-1697] Improve ThreadStackTrace for thread dump
  • [CELEBORN-1671] CelebornShuffleReader will try replica if create client failed
  • [CELEBORN-1682] Add java tools.jar into classpath for JVM quake
  • [CELEBORN-1660] Using map for workers to find worker fast
  • [CELEBORN-1673] Support retry create client
  • [CELEBORN-1577][PHASE1] Storage quota should support interrupt shuffle
  • [CELEBORN-1642][CIP-11] Support multiple worker tags
  • [CELEBORN-1636] Client supports dynamic update of Worker resources on the server
  • [CELEBORN-1651] Support ratio threshold of unhealthy disks for excluding worker
  • [CELEBORN-1601] Support revise lost shuffles
  • [CELEBORN-1487][PHASE2] CongestionController support dynamic config
  • [CELEBORN-1648] Refine AppUniqueId with UUID suffix
  • [CELEBORN-1490][CIP-6] Introduce tier consumer for hybrid shuffle
  • [CELEBORN-1645] Introduce ShuffleFallbackPolicy to support custom implementation of shuffle fallback policy for CelebornShuffleFallbackPolicyRunner
  • [CELEBORN-1620][CIP-11] Support passing worker tags via RequestSlots message
  • [CELEBORN-1487][PHASE1] CongestionController support control traffic by user/worker traffic speed
  • [CELEBORN-1637] Enhance config to bypass memory check for partition file sorter
  • [CELEBORN-1638] Improve the slots allocator performance
  • [CELEBORN-1617][CIP-11] Support workers tags in FS Config Service
  • [CELEBORN-1615] Start the http server after all handlers added
  • [CELEBORN-1574] Speed up unregister shuffle by batch processing
  • [CELEBORN-1625] Add parameter skipCompress for pushOrMergeData
  • [CELEBORN-1597][CIP-11] Implement TagsManager
  • [CELEBORN-1490][CIP-6] Impl worker write process for Flink Hybrid Shuffle
  • [CELEBORN-1602] do hard split for push merged data RPC with disk full
  • [CELEBORN-1594] Refine dynamicConfig template and prevent NPE
  • [CELEBORN-1513] Support wildcard bind in dual stack environments
  • [CELEBORN-1587] Change to debug logging on client side for SortBasedPusher trigger push
  • [CELEBORN-1496] Differentiate map results with only different stageAttemptId
  • [CELEBORN-1583] MasterClient#sendMessageInner should throw Throwable for celeborn.masterClient.maxRetries is 0
  • [CELEBORN-1578] Make Worker#timer have thread name and daemon
  • [CELEBORN-1580] ReadBufferDispacther should notify exception to listener
  • [CELEBORN-1575] TimeSlidingHub should remove expire node when reading
  • [CELEBORN-1560] Remove usages of deprecated Files.createTempDir of Guava
  • [CELEBORN-1567] Support throw FetchFailedException when Data corruption detected
  • [CELEBORN-1568] Support worker retries in MiniCluster
  • [CELEBORN-1563] Log networkLocation in WorkerInfo
  • [CELEBORN-1531] Refactor self checks in master
  • [CELEBORN-1550] Add support of providing custom dynamic store backend implementation
  • [CELEBORN-1529] Read shuffle data from S3
  • [CELEBORN-1518] Add support for Apache Spark barrier stages
  • [CELEBORN-1542] Master supports to check the worker host pattern on worker registration
  • [CELEBORN-1535] Support to disable master workerUnavailableInfo expiration
  • [CELEBORN-1511] Add support for custom master endpoint resolver
  • [CELEBORN-1541] Enhance the readable address for internal port
  • [CELEBORN-1533] Log location when CelebornInputStream#fillBuffer fails
  • [CELEBORN-1524] Support IPv6 hostnames for Apache Ratis
  • [CELEBORN-1519] Do not update estimated partition size if it is unchanged
  • [CELEBORN-1521] Introduce celeborn-spi module for authentication extensions
  • [CELEBORN-1469] Support writing shuffle data to OSS(S3 only)
  • [CELEBORN-1446] Enable chunk prefetch when initialize CelebornInputStream
  • [CELEBORN-1509] Reply response without holding a lock
  • [CELEBORN-1543] Support Flink 1.20
  • [CELEBORN-1504] Support for Apache Flink 1.16
  • [CELEBORN-1500] Filter out empty InputStreams
  • [CELEBORN-1495] CelebornColumnDictionary supports dictionary of float and double column type
  • [CELEBORN-1494] Support IPv6 addresses in PbSerDeUtils.fromPackedPartitionLocations
  • [CELEBORN-1483] Add storage policy
  • [CELEBORN-1479] Report register shuffle failed reason in exception
  • [CELEBORN-1489] Update Flink support with authentication support
  • [CELEBORN-1485] Refactor addCounter, addGauge and addTimer of AbstractSource to reduce CPU utilization
  • [CELEBORN-1472] Reduce CongestionController#userBufferStatuses call times
  • [CELEBORN-1471] CelebornScalaObjectMapper supports configuring FAIL_ON_UNKNOWN_PROPERTIES to false
  • [CELEBORN-1867][FLINK] Fix flink client memory leak of TransportResponseHandler#outstandingRpcs for handling addCredit and notifyRequiredSegment response
  • [CELEBORN-1866][FLINK] Fix CelebornChannelBufferReader request more buffers than needed
  • [CELEBORN-1490][CIP-6] Support process large buffer in flink hybrid shuffle
  • [CELEBORN-1490][CIP-6] Add Flink hybrid shuffle doc
  • [CELEBORN-1490][CIP-6] Impl worker read process in Flink Hybrid Shuffle
  • [CELEBORN-1490][CIP-6] Introduce tier producer in celeborn flink client
  • [CELEBORN-1490][CIP-6] Introduce tier factory and master agent in flink hybrid shuffle
  • [CELEBORN-1490][CIP-6] Enrich register shuffle method
  • [CELEBORN-1490][CIP-6] Extends FileMeta to support hybrid shuffle
  • [CELEBORN-1490][CIP-6] Extends message to support hybrid shuffle

RESTful API and CLI

  • [CELEBORN-1056][FOLLOWUP] Support upsert and delete of dynamic configuration management
  • [CELEBORN-2020] Support http authentication for Celeborn CLI
  • [CELEBORN-1875] Support to get workers topology information with RESTful api
  • [CELEBORN-1797] Support to adjust the logger level with RESTful API during runtime
  • [CELEBORN-1750] Return struct worker resource consumption information with RESTful api
  • [CELEBORN-1707] Audit all RESTful api calls and use separate restAuditFile
  • [CELEBORN-1632] Support to apply ratis local raft_meta_conf command with RESTful api
  • [CELEBORN-1599] Container Info REST API
  • [CELEBORN-1630] Support to apply ratis peer operation with RESTful api
  • [CELEBORN-1631] Support to apply ratis snapshot operation with RESTful api
  • [CELEBORN-1633] Return more raft group information
  • [CELEBORN-1629] Support to apply ratis election operation with RESTful api
  • [CELEBORN-1609] Support SSL for celeborn RESTful service
  • [CELEBORN-1608] Reduce redundant response for RESTful api
  • [CELEBORN-1607] Enable useEnumCaseInsensitive for openapi-generator
  • [CELEBORN-1589] Ensure master is leader for some POST request APIs
  • [CELEBORN-1572] Celeborn CLI initial REST API support
  • [CELEBORN-1553] Using the request base url as swagger server
  • [CELEBORN-1537] Support to remove workers unavailable info with RESTful api
  • [CELEBORN-1546] Support authorization on swagger UI
  • [CELEBORN-1477] Using openapi-generator apache-httpclient library instead of jersey2
  • [CELEBORN-1493] Check admin privileges for http mutative requests
  • [CELEBORN-1477][CIP-9] Refine the celeborn RESTful APIs
  • [CELEBORN-1476] Enhance the RESTful response error msg
  • [CELEBORN-1475] Fix unknownExcludedWorkers filter for /exclude request
  • [CELEBORN-1318] Support celeborn http authentication

Monitoring

  • [CELEBORN-2024] Publish commit files fail count metrics
  • [CELEBORN-1892] Adding register with master fail count metric for worker
  • [CELEBORN-2005] Introduce numBytesIn, numBytesOut, numBytesInPerSecond, numBytesOutPerSecond metrics for RemoteShuffleServiceFactory
  • [CELEBORN-1800] Introduce ApplicationTotalCount and ApplicationFallbackCount metric to record the total and fallback count of application
  • [CELEBORN-1974] ApplicationId as metrics label should be behind a config flag
  • [CELEBORN-1968] Publish metric for unreleased partition location count when worker was gracefully shutdown
  • [CELEBORN-1977] Add help/type on prometheus exposed metrics
  • [CELEBORN-1831] Add ratis commitIndex metrics
  • [CELEBORN-1817] add committed file size metrics
  • [CELEBORN-1791] All NettyMemoryMetrics should register to source
  • [CELEBORN-1804] Shuffle environment metrics of RemoteShuffleEnvironment should use Shuffle.Remote metric group
  • [CELEBORN-1634][FOLLOWUP] Add rpc metrics into grafana dashboard
  • [CELEBORN-1766] Add detail metrics about fetch chunk
  • [CELEBORN-1756] Only gauge hdfs metrics if HDFS storage enabled to reduce metrics
  • [CELEBORN-1634] Support queue time/processing time metrics for rpc framework
  • [CELEBORN-1706] Use bytes(IEC) unit instead of bytes(SI) for size related metrics in prometheus dashboard
  • [CELEBORN-1685] ShuffleFallbackPolicy supports ShuffleFallbackCount metric
  • [CELEBORN-1680] Introduce ShuffleFallbackCount metrics
  • [CELEBORN-1444][FOLLOWUP] Add IsDecommissioningWorker to celeborn dashboard
  • [CELEBORN-1640] NettyMemoryMetrics supports numHeapArenas, numDirectArenas, tinyCacheSize, smallCacheSize, normalCacheSize, numThreadLocalCaches and chunkSize
  • [CELEBORN-1656] Remove duplicate UserProduceSpeed metrics
  • [CELEBORN-1627] Introduce instance variable for celeborn dashboard to filter metrics
  • [CELEBORN-1582] Publish metric for unreleased shuffle count when worker was decommissioned
  • [CELEBORN-1501] Introduce application dimension resource consumption metrics of Worker
  • [CELEBORN-1586] Add available workers Metrics
  • [CELEBORN-1581] Fix incorrect metrics of DeviceCelebornFreeBytes and DeviceCelebornTotalBytes
  • [CELEBORN-1491] introduce flusher working queue size metric

CPP Client

  • [CELEBORN-1978][CIP-14] Add code style checking for cppClient
  • [CELEBORN-1958][CIP-14] Add testsuite to test writing with javaClient and reading with cppClient
  • [CELEBORN-1874][CIP-14] Add CICD procedure in github action for cppClient
  • [CELEBORN-1932][CIP-14] Adapt java's serialization to support cpp serialization for GetReducerFileGroup/Response
  • [CELEBORN-1915][CIP-14] Add reader's ShuffleClient to cppClient
  • [CELEBORN-1906][CIP-14] Add CelebornInputStream to cppClient
  • [CELEBORN-1881][CIP-14] Add WorkerPartitionReader to cppClient
  • [CELEBORN-1871][CIP-14] Add NettyRpcEndpointRef to cppClient
  • [CELEBORN-1863][CIP-14] Add TransportClient to cppClient
  • [CELEBORN-1845][CIP-14] Add MessageDispatcher to cppClient
  • [CELEBORN-1836][CIP-14] Add Message to cppClient
  • [CELEBORN-1827][CIP-14] Add messageDecoder to cppClient
  • [CELEBORN-1821][CIP-14] Add controlMessages to cppClient
  • [CELEBORN-1819][CIP-14] Refactor cppClient with nested namespace
  • [CELEBORN-1814][CIP-14] Add transportMessage to cppClient
  • [CELEBORN-1809][CIP-14] Add partitionLocation to cppClient
  • [CELEBORN-1799][CIP-14] Add celebornConf to cppClient
  • [CELEBORN-1785][CIP-14] Add baseConf to cppClient
  • [CELEBORN-1772][CIP-14] Add memory module to cppClient
  • [CELEBORN-1761][CIP-14] Add cppProto to cppClient
  • [CELEBORN-1754][CIP-14] Add exceptions and checking utils to cppClient
  • [CELEBORN-1751][CIP-14] Add celebornException utils to cppClient
  • [CELEBORN-1740][CIP-14] Add stackTrace utils to cppClient
  • [CELEBORN-1741][CIP-14] Add processBase utils to cppClient
  • [CELEBORN-1724][CIP-14] Add environment setup tools for CppClient development

HELM Chart Optimization

  • [CELEBORN-2017][HELM] Add namespace to the metadata
  • [CELEBORN-1528][HELM] Use volume claim template to support various storage backend
  • [CELEBORN-1996][HELM] Rename volumes.{master,worker} to {master,worker}.volumes and {master.worker}.volumeMounts
  • [CELEBORN-1989][HELM] Split securityContext into master.podSecurityContext and worker.podSecurityContext
  • [CELEBORN-1988][HELM] Split hostNetwork into master.hostNetwork and worker.hostNetwork
  • [CELEBORN-1987][HELM] Split dnsPolicy into master.dnsPolicy and worker.dnsPolicy
  • [CELEBORN-1981][HELM] Rename masterReplicas and workerReplicas to master.replicas and worker.replicas
  • [CELEBORN-1985][HELM] Add new values master.envFrom and worker.envFrom
  • [CELEBORN-1986][HELM] Rename priorityClass.{master,worker} to {master,worker}.priorityClass
  • [CELEBORN-1980][HELM] Split environments into master.env and worker.env
  • [CELEBORN-1972][HELM] Rename affinity.{master,worker} to {master,worker}.affinity
  • [CELEBORN-1951][HELM] Rename resources.{master,worker} to {master,worker}.resoruces
  • [CELEBORN-1962][HELM] Split tolerations into master.tolerations and worker.tolerations
  • [CELEBORN-1955][HELM] Split nodeSelector into master.nodeSelector and worker.nodeSelector
  • [CELEBORN-1953][HELM] Split podAnnotations into master.annotations and worker.annotations
  • [CELEBORN-1954][HELM] Add a new value image.registry
  • [CELEBORN-1952][HELM] Define template helpers for master/worker respectively
  • [CELEBORN-1532][HELM] Read log4j2 and metrics configurations from file
  • [CELEBORN-1830] Chart statefulset resources key duplicate
  • [CELEBORN-1788] Add role and roleBinding helm charts
  • [CELEBORN-1786] add serviceAccount helm chart
  • [CELEBORN-1780] Add support for NodePort Service per Master replica
  • [CELEBORN-1552] automatically support prometheus to scrape metrics for helm chart

Stability and Bug Fix

  • [CELEBORN-1721][FOLLOWUP] Return softsplit if there is no hardsplit for pushMergeData
  • [CELEBORN-2043] Fix IndexOutOfBoundsException exception in getEvictedFileWriter
  • [CELEBORN-2040] Avoid throw FetchFailedException when GetReducerFileGroupResponse failed via broadcast
  • [CELEBORN-2042] Fix FetchFailure handling when TaskSetManager is not found
  • [CELEBORN-2033] updateProduceBytes should be called even if updateProduceBytes throws exception
  • [CELEBORN-2025] RpcFailure Scala 2.13 serialization is incompatible
  • [CELEBORN-2022] Spark4 Client should package commons-io
  • [CELEBORN-2023] Spark4 Client incompatible with isLocalMaster method
  • [CELEBORN-2009] Commit files request failure should exclude worker in LifecycleManager
  • [CELEBORN-2015] Retry IOException failures for RPC requests
  • [CELEBORN-1902] Read client throws PartitionConnectionException
  • [CELEBORN-2000] Ignore the getReducerFileGroup timeout before shuffle stage end
  • [CELEBORN-1960] Fix PauseSpentTime only append the interval check time
  • [CELEBORN-1998] RemoteShuffleEnvironment should not register InputChannelMetrics repeatedly
  • [CELEBORN-1999] OpenStreamTime should use requestId to record cost time
  • [CELEBORN-1855] LifecycleManager return appshuffleId for non barrier stage when fetch fail has been reported
  • [CELEBORN-1948] Fix the issue where replica may lose data when HARD_SPLIT occurs during handlePushMergeData
  • [CELEBORN-1912] Client should send heartbeat to worker for processing heartbeat to avoid reading idleness of worker which enables heartbeat
  • [CELEBORN-1992] Ensure hadoop FS are not closed by hadoop ShutdownHookManager
  • [CELEBORN-1919] Hardsplit batch tracking should be disabled when pushing only a single replica
  • [CELEBORN-1979] Change partition manager should respect the celeborn.storage.availableTypes
  • [CELEBORN-1983] Fix fetch fail not throw due to reach spark maxTaskFailures
  • [CELEBORN-1976] CommitHandler should use celeborn.client.rpc.commitFiles.askTimeout for timeout of doParallelCommitFiles
  • [CELEBORN-1966] Added fix to get userinfo in celeborn
  • [CELEBORN-1969] Remove celeborn.client.shuffle.mapPartition.split.enabled to enable shuffle partition split at default for MapPartition
  • [CELEBORN-1970] Use StatusCode.fromValue instead of Utils.toStatusCode
  • [CELEBORN-1646][FOLLOWUP] DeviceMonitor should notifyObserversOnError with CRITICAL_ERROR disk status for input/ouput error
  • [CELEBORN-1929] Avoid unnecessary buffer loss to get better buffer reusability
  • [CELEBORN-1918] Add batchOpenStream time to fetch wait time
  • [CELEBORN-1947] Reduce log for CelebornShuffleReader sleeping before inputStream ready
  • [CELEBORN-1923] Correct Celeborn available slots calculation logic
  • [CELEBORN-1914] incWriteTime when ShuffleWriter invoke pushGiantRecord
  • [CELEBORN-1822] Respond to RegisterShuffle with max epoch PartitionLocation to avoid revive
  • [CELEBORN-1899] Fix configuration bug in shuffle s3
  • [CELEBORN-1885] Fix nullptr exceptions in FetchChunk after worker restart
  • [CELEBORN-1889] Fix scala 2.13 complie error
  • [CELEBORN-1846] Fix the StreamHandler usage in fetching chunk when task attempt is odd
  • [CELEBORN-1792] MemoryManager resume should use pinnedDirectMemory instead of usedDirectMemory
  • [CELEBORN-1832] MapPartitionData should create fixed thread pool with registration of ThreadPoolSource
  • [CELEBORN-1820] Failing to write and flush StreamChunk data should be counted as FETCH_CHUNK_FAIL
  • [CELEBORN-1818] Fix incorrect timeout exception when waiting on no pending writes
  • [CELEBORN-1763] Fix DataPusher be blocked for a long time
  • [CELEBORN-1783] Fix Pending task in commitThreadPool wont be canceled
  • [CELEBORN-1782] Worker in congestion control should be in blacklist to avoid impact new shuffle
  • [CELEBORN-1510] Partial task unable to switch to the replica
  • [CELEBORN-1778] Fix commitInfo NPE and add assert in LifecycleManagerCommitFilesSuite
  • [CELEBORN-1670] Avoid swallowing InterruptedException in ShuffleClientImpl Revert "- [CELEBORN-1376] Push data failed should always release request body"
  • [CELEBORN-1770] FlushNotifier should setException for all Throwables in Flusher
  • [CELEBORN-1769] Fix packed partition location cause GetReducerFileGroupResponse lose location
  • [CELEBORN-1767] Fix occasional errors in UT when creating workers
  • [CELEBORN-1765] Fix NPE when removeFileInfo in StorageManager
  • [CELEBORN-1760] OOM causes disk buffer unable to be released
  • [CELEBORN-1743] Resolve the metrics data interruption and the job failure caused by locked resources
  • [CELEBORN-1759] Fix reserve slots might lost partition location between 0.4 client and 0.5 server
  • [CELEBORN-1749] Fix incorrect application diskBytesWritten metrics
  • [CELEBORN-1713] RpcTimeoutException should include RPC address in message
  • [CELEBORN-1728] Fix NPE when failing to connect to celeborn worker
  • [CELEBORN-1727] Correct the calculation of worker diskInfo actualUsableSpace
  • [CELEBORN-1718] Fix memory storage file won't hard split when memory file is full and worker has no disks
  • [CELEBORN-1705] Fix disk buffer size is negative issue
  • [CELEBORN-1686] Avoid return the same pushTaskQueue
  • [CELEBORN-1696] StorageManager#cleanFile should remove file info
  • [CELEBORN-1691] Fix the issue that upstream tasks don't rerun and the current task still retry when failed to decompress in flink
  • [CELEBORN-1693] Fix storageFetcherPool concurrent problem
  • [CELEBORN-1674] Fix reader thread name of MapPartitionData
  • [CELEBORN-1668] Fix NPE when handle closed file writers
  • [CELEBORN-1669] Fix NullPointerException for PartitionFilesSorter#updateSortedShuffleFiles after cleaning up expired shuffle key
  • [CELEBORN-1667] Fix NPE & LEAK occurring prior to worker registration
  • [CELEBORN-1664] Fix secret fetch failures after LEADER master failover
  • [CELEBORN-1665] CommitHandler should process CommitFilesResponse with COMMIT_FILE_EXCEPTION status
  • [CELEBORN-1663] Only register appShuffleDeterminate if stage using celeborn for shuffle
  • [CELEBORN-1661] Make sure that the sortedFilesDb is initialized successfully when worker enable graceful shutdown
  • [CELEBORN-1662] Handle PUSH_DATA_FAIL_PARTITION_NOT_FOUND in getPushDataFailCause
  • [CELEBORN-1655] Fix read buffer dispatcher thread terminate unexpectedly
  • [CELEBORN-1646] Catch exception of Files#getFileStore for DeviceMonitor and StorageManager for input/ouput error
  • [CELEBORN-1652] Throw TransportableError for failure of sending PbReadAddCredit to avoid flink task get stuck
  • [CELEBORN-1643] DataPusher handle InterruptedException
  • [CELEBORN-1579] Fix the memory leak of result partition
  • [CELEBORN-1564] Fix actualUsableSpace of offerSlotsLoadAware condition on diskInfo
  • [CELEBORN-1573] Change to debug logging on client side for reserve slots
  • [CELEBORN-1557] Fix totalSpace of DiskInfo for Master in HA mode
  • [CELEBORN-1549] Fix networkLocation persistence into Ratis
  • [CELEBORN-1558] Fix the incorrect decrement of pendingWrites in handlePushMergeData
  • [CELEBORN-1547] Worker#listTopDiskUseApps should return celeborn.metrics.app.topDiskUsage.count applications
  • [CELEBORN-1544] ShuffleWriter needs to call close finally to avoid memory leaks
  • [CELEBORN-1473] TransportClientFactory should register netty memory metric with source for shared pooled ByteBuf allocator
  • [CELEBORN-1522] Fix applicationId extraction from shuffle key
  • [CELEBORN-1516] DynamicConfigServiceFactory should support singleton
  • [CELEBORN-1515] SparkShuffleManager should set lifecycleManager to null after stopping lifecycleManager in Spark 2
  • [CELEBORN-1439] Fix revive logic bug which will casue data correctness issue and job failiure
  • [CELEBORN-1507] Prevent invalid Filegroups from being used
  • [CELEBORN-1478] Fix wrong use partitionId as shuffleId when readPartition

Build

  • [CELEBORN-2012] Add license for http5
  • [CELEBORN-1801] Remove out-of-dated flink 1.14 and 1.15
  • [CELEBORN-1746] Reduce the size of aws dependencies
  • [CELEBORN-1677][BUILD] Update SCM information for SBT build configuration
  • [CELEBORN-1649] Bumping up maven to 3.9.9
  • [CELEBORN-1658] Add Git Commit Info and Build JDK Spec to sbt Manifest
  • [CELEBORN-1659] Fix sbt make-distribution for cli
  • [CELEBORN-1616] Shade com.google.thirdparty to prevent dependency conflicts
  • [CELEBORN-1606] Generate dependencies-client-flink-1.16
  • [CELEBORN-1600] Enable check server dependencies
  • [CELEBORN-1556] Update Github actions to v4
  • [CELEBORN-1559] Fix make distribution script failed to recognize specific profile
  • [CELEBORN-1526] Fix MR plugin can not run on Hadoop 3.1.0
  • [CELEBORN-1527] Error prone plugin should exclude target/generated-sources/java of module
  • [CELEBORN-1505] Algin the celeborn server jackson dependency versions

Documentation

  • [CELEBORN-1870] Fix typos in in 'Developer' documents
  • [CELEBORN-1860] Remove unused celeborn..io.enableVerboseMetrics option
  • [CELEBORN-1823] Remove unused remote-shuffle.job.min.memory-per-partition and remote-shuffle.job.min.memory-per-gate
  • [CELEBORN-1811] Update default value for celeborn.master.slot.assign.extraSlots
  • [CELEBORN-1789][DOC] Document on Java Columnar Shuffle
  • [CELEBORN-1774] Update default value of celeborn..io.mode to whether epoll mode is available
  • [CELEBORN-1622][CIP-11] Adding documentation for Worker Tags feature
  • [CELEBORN-1755] Update doc to include S3 as one of storage layers
  • [CELEBORN-1752] Migration guide for unexpected shuffle RESTful api change since 0.5.0
  • [CELEBORN-1745] Remove application top disk usage code
  • [CELEBORN-1719] Introduce celeborn.client.spark.stageRerun.enabled with alternative celeborn.client.spark.fetch.throwsFetchFailure to enable spark stage rerun
  • [CELEBORN-1687] Highlight flink session cluster issue in doc
  • [CELEBORN-1684] Fix ambiguous client jar expression of document
  • [CELEBORN-1678] Add Celeborn CLI User guide in README
  • [CELEBORN-1635] Introduce Blaze support document [CLEBORN-1555] Replace deprecated config celeborn.storage.activeTypes in docs and tests
  • [CELEBORN-1566] Update docs about using HDFS
  • [CELEBORN-1551] Fix wrong link in quota_management.md
  • [CELEBORN-1437][DOC] Merge METRICS.md into monitoring.md
  • [CELEBORN-1436][DOC] Move Rest API out from monitoring.md to webapi.md
  • [CELEBORN-1486] Introduce ClickHouse Backend in Gluten Support document
  • [CELEBORN-1466] Add local command in celeborn_ratis_shell.md

Dependencies

  • [CELEBORN-2030] Bump Spark from 3.5.5 to 3.5.6
  • [CELEBORN-2013] Upgrade scala binary version of spark-3.3, spark-3.4, spark-3.5 profile to 2.13.8
  • [CELEBORN-1895] Bump log4j2 version to 2.24.3
  • [CELEBORN-1890] Bump Spark from 3.5.4 to 3.5.5
  • [CELEBORN-1884] Bump rocksdbjni version from 9.5.2 to 9.10.0
  • [CELEBORN-1877] Bump zstd-jni version from 1.5.2-1 to 1.5.7-1
  • [CELEBORN-1872] Bump Flink from 1.19.1, 1.20.0 to 1.19.2, 1.20.1
  • [CELEBORN-1864] Bump Netty version from 4.1.115.Final to 4.1.118.Final
  • [CELEBORN-1862] Bump Ratis version from 3.1.2 to 3.1.3
  • [CELEBORN-1842] Bump ap-loader version from 3.0-8 to 3.0-9
  • [CELEBORN-1806] Bump Spark from 3.5.3 to 3.5.4
  • [CELEBORN-1712] Bump Netty version from 4.1.109.Final to 4.1.115.Final
  • [CELEBORN-1748] Deprecate identity provider configs tied with quota
  • [CELEBORN-1702] Bump Ratis version from 3.1.1 to 3.1.2
  • [CELEBORN-1708] Bump protobuf version from 3.21.7 to 3.25.5
  • [CELEBORN-1709] Bump jetty version from 9.4.52.v20230823 to 9.4.56.v20240826
  • [CELEBORN-1710] Bump commons-io version from 2.13.0 to 2.17.0
  • [CELEBORN-1672] Bump Spark from 3.4.3 to 3.4.4
  • [CELEBORN-1666] Bump scala-protoc from 1.0.6 to 1.0.7
  • [CELEBORN-1525] Bump Ratis version from 3.1.0 to 3.1.1
  • [CELEBORN-1613] Bump Spark from 3.5.2 to 3.5.3
  • [CELEBORN-1605] Bump commons-lang3 version from 3.13.0 to 3.17.0
  • [CELEBORN-1604] Bump rocksdbjni version from 8.11.3 to 9.5.2
  • [CELEBORN-1562] Bump Spark from 3.5.1 to 3.5.2
  • [CELEBORN-1512] Bump Flink from 1.19.0 to 1.19.1
  • [CELEBORN-1499] Bump Ratis version from 3.0.1 to 3.1.0

Credits

Thanks to the following contributors who helped to review and commit to Apache Celeborn 0.6.0 version:

Contributors
Aidar Bariev Amandeep Singh Aravind Patnam Arsen Gumin A Vishnusankar Biao Geng
Binjie Yang Björn Boschman Bowen Liang Cheng Pan Chongchen Chen Erik Fang
Fei Wang Fu Chen Guangwei Hong Haotian Cao He Zhao Jiaming Xie
Jianfu Li Jiashu Xiong Jinqian Fan Kerwin Zhang Keyong Zhou Kun Wan
Leo Li Lianne Li Madhukar Minchu Yang Mingxiao Feng Mridul Muralidharan
Nan Zhu Nicholas Jiang Nicolas Fraison Pengqi Li Sanskar Modi Saurabh Dubey
Shaoyun Chen Shengjie Wang Shlomi Uubul Veli Yang Weijie Guo Xianming Lei
Xinyu Wang Xu Huang Yajun Gao Yanze Jiang Yi Chen Yi Zhu
Yuting Wang Yuxin Tan Zhao Zhao Zhaohui Xu Zhengqi Zhang Zhentao Shuai
Ziyi Wu