Network
Key | Default | isDynamic | Description | Since | Deprecated |
---|---|---|---|---|---|
celeborn.<module>.fetch.timeoutCheck.interval | 5s | false | Interval for checking fetch data timeout. Only `data` is supported as the module, since this setting applies to the shuffle client fetching data. | 0.3.0 | |
celeborn.<module>.fetch.timeoutCheck.threads | 4 | false | Number of threads for checking fetch data timeout. Only `data` is supported as the module, since this setting applies to the shuffle client fetching data. | 0.3.0 | |
celeborn.<module>.heartbeat.interval | 60s | false | The heartbeat interval between worker and client. When set for `rpc_app`, it applies to the shuffle client. When set for `rpc_service`, it applies to the master or worker. When set for `data`, it applies to shuffle client push and fetch data. When set for `replicate`, it applies to the replicate client of a worker replicating data to a peer worker. If you are using `celeborn.client.heartbeat.interval`, please use the new configs for each module according to your needs, or replace it with `celeborn.rpc.heartbeat.interval`, `celeborn.data.heartbeat.interval` and `celeborn.replicate.heartbeat.interval`. | 0.3.0 | celeborn.client.heartbeat.interval |
celeborn.<module>.io.backLog | 0 | false | Requested maximum length of the queue of incoming connections. Default 0 for no backlog. When set for `rpc_app`, it applies to the shuffle client. When set for `rpc_service`, it applies to the master or worker. When set for `push`, it applies to the worker receiving push data. When set for `replicate`, it applies to the replicate server of a worker replicating data to a peer worker. When set for `fetch`, it applies to the worker fetch server. | ||
celeborn.<module>.io.clientThreads | 0 | false | Number of threads used in the client thread pool. Defaults to 0, which means 2x the number of cores. When set for `rpc_app`, it applies to the shuffle client. When set for `rpc_service`, it applies to the master or worker. When set for `data`, it applies to shuffle client push and fetch data. When set for `replicate`, it applies to the replicate client of a worker replicating data to a peer worker. | ||
celeborn.<module>.io.connectTimeout | <value of celeborn.network.connect.timeout> | false | Socket connect timeout. When set for `rpc_app`, it applies to the shuffle client. When set for `rpc_service`, it applies to the master or worker. When set for `data`, it applies to shuffle client push and fetch data. When set for `replicate`, it applies to the replicate client of a worker replicating data to a peer worker. | ||
celeborn.<module>.io.connectionTimeout | <value of celeborn.network.timeout> | false | Connection active timeout. When set for `rpc_app`, it applies to the shuffle client. When set for `rpc_service`, it applies to the master or worker. When set for `data`, it applies to shuffle client push and fetch data. When set for `push`, it applies to the worker receiving push data. When set for `replicate`, it applies to the replicate server or client of a worker replicating data to a peer worker. When set for `fetch`, it applies to the worker fetch server. | ||
celeborn.<module>.io.enableVerboseMetrics | false | false | Whether to track detailed Netty memory metrics. If true, the detailed metrics of Netty's PooledByteBufAllocator are collected; otherwise only general memory usage is tracked. | ||
celeborn.<module>.io.lazyFD | true | false | Whether to initialize FileDescriptor lazily or not. If true, file descriptors are created only when data is going to be transferred. This can reduce the number of open files. When set for `fetch`, it applies to the worker fetch server. | ||
celeborn.<module>.io.maxRetries | 3 | false | Max number of times we will retry after IO exceptions (such as connection timeouts) per request. If set to 0, we will not do any retries. When set for `data`, it applies to shuffle client push and fetch data. When set for `replicate`, it applies to the replicate client of a worker replicating data to a peer worker. When set for `push`, it applies to Flink shuffle client push data. | ||
celeborn.<module>.io.mode | NIO | false | Netty EventLoopGroup backend, available options: NIO, EPOLL. | ||
celeborn.<module>.io.numConnectionsPerPeer | 1 | false | Number of concurrent connections between two nodes. When set for `rpc_app`, it applies to the shuffle client. When set for `rpc_service`, it applies to the master or worker. When set for `data`, it applies to shuffle client push and fetch data. When set for `replicate`, it applies to the replicate client of a worker replicating data to a peer worker. | ||
celeborn.<module>.io.preferDirectBufs | true | false | If true, we will prefer allocating off-heap byte buffers within Netty. When set for `rpc_app`, it applies to the shuffle client. When set for `rpc_service`, it applies to the master or worker. When set for `data`, it applies to shuffle client push and fetch data. When set for `push`, it applies to the worker receiving push data. When set for `replicate`, it applies to the replicate server or client of a worker replicating data to a peer worker. When set for `fetch`, it applies to the worker fetch server. | ||
celeborn.<module>.io.receiveBuffer | 0b | false | Receive buffer size (SO_RCVBUF). Note: the optimal size for the receive buffer and send buffer is latency * network_bandwidth. For example, with latency = 1 ms and network_bandwidth = 10 Gbps, the buffer size should be about 1.25 MB. When set for `rpc_app`, it applies to the shuffle client. When set for `rpc_service`, it applies to the master or worker. When set for `data`, it applies to shuffle client push and fetch data. When set for `push`, it applies to the worker receiving push data. When set for `replicate`, it applies to the replicate server or client of a worker replicating data to a peer worker. When set for `fetch`, it applies to the worker fetch server. | 0.2.0 | |
celeborn.<module>.io.retryWait | 5s | false | Time that we will wait before retrying after an IOException. Only relevant if `celeborn.<module>.io.maxRetries` > 0. When set for `data`, it applies to shuffle client push and fetch data. When set for `replicate`, it applies to the replicate client of a worker replicating data to a peer worker. When set for `push`, it applies to Flink shuffle client push data. | 0.2.0 | |
celeborn.<module>.io.saslTimeout | 30s | false | Timeout for a single round trip of auth message exchange, in milliseconds. | 0.5.0 | |
celeborn.<module>.io.sendBuffer | 0b | false | Send buffer size (SO_SNDBUF). When set for `rpc_app`, it applies to the shuffle client. When set for `rpc_service`, it applies to the master or worker. When set for `data`, it applies to shuffle client push and fetch data. When set for `push`, it applies to the worker receiving push data. When set for `replicate`, it applies to the replicate server or client of a worker replicating data to a peer worker. When set for `fetch`, it applies to the worker fetch server. | 0.2.0 | |
celeborn.<module>.io.serverThreads | 0 | false | Number of threads used in the server thread pool. Defaults to 0, which means 2x the number of cores. When set for `rpc_app`, it applies to the shuffle client. When set for `rpc_service`, it applies to the master or worker. When set for `push`, it applies to the worker receiving push data. When set for `replicate`, it applies to the replicate server of a worker replicating data to a peer worker. When set for `fetch`, it applies to the worker fetch server. | ||
celeborn.<module>.push.timeoutCheck.interval | 5s | false | Interval for checking push data timeout. When set for `data`, it applies to shuffle client push data. When set for `push`, it applies to Flink shuffle client push data. When set for `replicate`, it applies to the replicate client of a worker replicating data to a peer worker. | 0.3.0 | |
celeborn.<module>.push.timeoutCheck.threads | 4 | false | Number of threads for checking push data timeout. When set for `data`, it applies to shuffle client push data. When set for `push`, it applies to Flink shuffle client push data. When set for `replicate`, it applies to the replicate client of a worker replicating data to a peer worker. | 0.3.0 | |
celeborn.<role>.rpc.dispatcher.threads | <value of celeborn.rpc.dispatcher.threads> | false | Number of threads in the message dispatcher event loop for the given role. | ||
celeborn.io.maxDefaultNettyThreads | 64 | false | Max default netty threads | 0.3.2 | |
celeborn.network.advertise.preferIpAddress | <value of celeborn.network.bind.preferIpAddress> | false | When `true`, prefer the IP address over the FQDN for the advertise address. | 0.6.0 | |
celeborn.network.bind.preferIpAddress | true | false | When `true`, prefer the IP address over the FQDN. This configuration only takes effect when the bind hostname is not set explicitly, in which case Celeborn will find the first non-loopback address to bind. | 0.3.0 | |
celeborn.network.bind.wildcardAddress | false | false | When `true`, the bind address will be set to a wildcard address, while the advertise address will remain as whatever is set by `celeborn.network.advertise.preferIpAddress`. The wildcard address is a special local IP address that usually refers to 'any' and can only be used for bind operations. For IPv4 this is 0.0.0.0, and for IPv6 this is ::0. This is helpful in dual-stack environments, where the service must listen to both IPv4 and IPv6 clients. | 0.6.0 | |
celeborn.network.connect.timeout | 10s | false | Default socket connect timeout. | 0.2.0 | |
celeborn.network.memory.allocator.numArenas | <undefined> | false | Number of arenas for pooled memory allocator. Default value is Runtime.getRuntime.availableProcessors, min value is 2. | 0.3.0 | |
celeborn.network.memory.allocator.verbose.metric | false | false | Whether to enable verbose metric for pooled allocator. | 0.3.0 | |
celeborn.network.timeout | 240s | false | Default timeout for network operations. | 0.2.0 | |
celeborn.port.maxRetries | 1 | false | When a port is occupied, the maximum number of times to retry. | 0.2.0 | |
celeborn.rpc.askTimeout | 60s | false | Timeout for RPC ask operations. It is recommended to set this to at least 240s when HDFS is enabled in `celeborn.storage.availableTypes`. | 0.2.0 | |
celeborn.rpc.connect.threads | 64 | false | | 0.2.0 | |
celeborn.rpc.dispatcher.threads | 0 | false | Number of threads in the message dispatcher event loop. Defaults to 0, which means the number of available cores. | 0.3.0 | celeborn.rpc.dispatcher.numThreads |
celeborn.rpc.dump.interval | 60s | false | Minimum interval (ms) for the RPC framework to dump a performance summary. | 0.6.0 | |
celeborn.rpc.inbox.capacity | 0 | false | Specifies the size of the in-memory bounded capacity. | 0.5.0 | |
celeborn.rpc.io.threads | <undefined> | false | Netty IO thread number of NettyRpcEnv to handle RPC request. The default threads number is the number of runtime available processors. | 0.2.0 | |
celeborn.rpc.lookupTimeout | 30s | false | Timeout for RPC lookup operations. | 0.2.0 | |
celeborn.rpc.slow.interval | <undefined> | false | Minimum interval (ms) for the RPC framework to log slow RPCs. | 0.6.0 | |
celeborn.rpc.slow.threshold | 1s | false | Threshold for the RPC framework to log slow RPCs. | 0.6.0 | |
celeborn.shuffle.io.maxChunksBeingTransferred | <undefined> | false | The max number of chunks allowed to be transferred at the same time on the shuffle service. Note that new incoming connections will be closed when the max number is hit. The client will retry according to the shuffle retry configs (see `celeborn.<module>.io.maxRetries` and `celeborn.<module>.io.retryWait`); if those limits are reached, the task will fail with a fetch failure. | 0.2.0 | |
celeborn.ssl.<module>.enabled | false | false | Enables SSL for securing wire traffic. | 0.5.0 | |
celeborn.ssl.<module>.enabledAlgorithms | <undefined> | false | A comma-separated list of ciphers. The specified ciphers must be supported by the JVM. The reference list of cipher suites can be found in the "JSSE Cipher Suite Names" section of the Java security guide. The list for Java 11, for example, can be found at this page. Note: if not set, the default cipher suite for the JRE will be used. | 0.5.0 | |
celeborn.ssl.<module>.keyStore | <undefined> | false | Path to the key store file. The path can be absolute or relative to the directory in which the process is started. | 0.5.0 | |
celeborn.ssl.<module>.keyStorePassword | <undefined> | false | Password to the key store. | 0.5.0 | |
celeborn.ssl.<module>.protocol | TLSv1.2 | false | TLS protocol to use. The protocol must be supported by the JVM. The reference list of protocols can be found in the "Additional JSSE Standard Names" section of the Java security guide. For Java 11, for example, the list can be found here. | 0.5.0 | |
celeborn.ssl.<module>.trustStore | <undefined> | false | Path to the trust store file. The path can be absolute or relative to the directory in which the process is started. | 0.5.0 | |
celeborn.ssl.<module>.trustStorePassword | <undefined> | false | Password for the trust store. | 0.5.0 | |
celeborn.ssl.<module>.trustStoreReloadIntervalMs | 10s | false | The interval at which the trust store should be reloaded (in milliseconds), when enabled. This setting is mostly only useful for server components, not applications. | 0.5.0 | |
celeborn.ssl.<module>.trustStoreReloadingEnabled | false | false | Whether the trust store should be reloaded periodically. This setting is mostly only useful for Celeborn services (masters, workers), and not applications. | 0.5.0 | |
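
The buffer sizing note for `celeborn.<module>.io.receiveBuffer` (latency * network_bandwidth) and the `<module>` placeholder in the keys can be illustrated with a small sketch. The helper names `buffer_size_bytes` and `expand_buffer_keys` are hypothetical and not part of Celeborn. Writing the value as a byte count with a `b` suffix mirrors the `0b` default shown in the table and is an assumption about the accepted value format.

```python
def buffer_size_bytes(latency_s: float, bandwidth_bps: float) -> int:
    """Bandwidth-delay product in bytes: latency * bandwidth, at 8 bits per byte."""
    return round(latency_s * bandwidth_bps / 8)


def expand_buffer_keys(modules, size_bytes):
    """Replace the <module> placeholder with each module name (hypothetical helper).

    Returns a dict mapping concrete receive/send buffer keys to a value
    written as '<n>b' (assumed format, mirroring the '0b' default).
    """
    settings = {}
    for module in modules:
        for suffix in ("receiveBuffer", "sendBuffer"):
            settings[f"celeborn.{module}.io.{suffix}"] = f"{size_bytes}b"
    return settings


# The table's example: latency = 1 ms, bandwidth = 10 Gbps.
size = buffer_size_bytes(latency_s=0.001, bandwidth_bps=10e9)
settings = expand_buffer_keys(["data", "push"], size)

print(size)  # 1250000 bytes, i.e. about 1.25 MB
for key, value in settings.items():
    print(key, value)
```

Here `data` covers shuffle client push and fetch data and `push` covers the worker receiving push data, as described in the table; other modules (`rpc_app`, `rpc_service`, `replicate`, `fetch`) would be expanded the same way.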