Deploy Celeborn on Kubernetes
Celeborn currently supports rapid deployment on Kubernetes using Helm.
Before Deploy
- You should have a running Kubernetes cluster.
- You should understand basic Kubernetes deployment concepts, e.g. Kubernetes resources.
- You should have sufficient permissions to create resources.
- You should have Helm installed.
Deploy
1. Get Celeborn Binary Package
You can find released versions of Celeborn on the Downloading Page.
You can also build a binary package from the master branch or your own branch by running ./build/make-distribution.sh in the source code.
Notice: Celeborn supports automatic builds on the Linux aarch64 platform via the aarch64 profile. The aarch64 profile requires glibc version 3.4.21. With other glibc versions, such as 2.x, there is a potential problematic frame C [libc.so.6+0x8412a].
Either way, unzip the binary package and change into the extracted directory.
2. Modify Celeborn Configurations
Notice: The Celeborn chart template files are experimental and unstable, and may change in later releases.
The settings in ./charts/celeborn/values.yaml that you should focus on modifying are:
- image repository - The repository to pull images from
- image tag - The image version to use
- masterReplicas - Number of Celeborn master replicas
- workerReplicas - Number of Celeborn worker replicas
- volumes - How and where to mount volumes (for more information, see Volumes)
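As an alternative to editing values.yaml, the same settings can be overridden on the command line when the chart is installed in step 3. The following is a minimal sketch, not the project's documented procedure. The function name install_celeborn is hypothetical, and the key names passed to --set are assumptions based on the settings listed above, so check them against your copy of values.yaml.

```shell
# Install the chart from the root of the binary package, overriding selected
# values on the command line instead of editing values.yaml.
# ASSUMPTION: the key names below (image.repository, image.tag, masterReplicas,
# workerReplicas) match the layout of ./charts/celeborn/values.yaml.
install_celeborn() (
  namespace=$1; repository=$2; tag=$3; masters=$4; workers=$5
  helm install celeborn ./charts/celeborn -n "$namespace" \
    --set image.repository="$repository" \
    --set image.tag="$tag" \
    --set masterReplicas="$masters" \
    --set workerReplicas="$workers"
)
```

Calling install_celeborn with a namespace, image repository, image tag, and the two replica counts runs a single helm install. Volumes are usually easier to describe in values.yaml itself than through --set.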
[Optional] Build Celeborn Docker Image
If you want to build your own Celeborn Docker image, run docker build . -f docker/Dockerfile in the Celeborn binary package.
3. Helm Install Celeborn Charts
For more details, see Helm Install.
cd ./charts/celeborn
helm install celeborn -n <namespace> .
4. Check Celeborn
After the steps above, you should be able to find the corresponding Celeborn Master/Worker pods by running kubectl get pods -n <namespace>, for example:
NAME READY STATUS RESTARTS AGE
celeborn-master-0 1/1 Running 0 1m
...
celeborn-worker-0 1/1 Running 0 1m
...
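Instead of polling kubectl get pods by hand, a script can block until the pods are ready. The sketch below is one possible approach. The function name wait_for_celeborn is hypothetical, and the StatefulSet names celeborn-master and celeborn-worker are assumptions inferred from the pod names in the listing above.

```shell
# Block until both StatefulSets report their rollout as complete, or fail
# when the timeout expires.
# ASSUMPTION: the StatefulSets are named celeborn-master and celeborn-worker,
# inferred from the pod names celeborn-master-0 and celeborn-worker-0.
wait_for_celeborn() (
  namespace=$1
  timeout=${2:-300s}   # optional second argument, defaults to five minutes
  for role in master worker; do
    kubectl rollout status "statefulset/celeborn-${role}" \
      -n "$namespace" --timeout="$timeout" || exit 1
  done
)
```

For example, wait_for_celeborn <namespace> 120s returns a non-zero status if either StatefulSet is not fully rolled out within two minutes.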
Because the Celeborn Master/Worker pods take time to start, you may see output like the following during startup:
** server can't find celeborn-master-0.celeborn-master-svc.default.svc.cluster.local: NXDOMAIN
waiting for master
Server: 172.17.0.10
Address: 172.17.0.10#53
...
Name: celeborn-master-0.celeborn-master-svc.default.svc.cluster.local
Address: 10.225.139.80
Server: 172.17.0.10
Address: 172.17.0.10#53
starting org.apache.celeborn.service.deploy.master.Master, logging to /opt/celeborn/logs/celeborn--org.apache.celeborn.service.deploy.master.Master-1-celeborn-master-0.out
...
23/03/23 14:10:56,081 INFO [main] RaftServer: 0: start RPC server
23/03/23 14:10:56,132 INFO [nioEventLoopGroup-2-1] LoggingHandler: [id: 0x83032bf1] REGISTERED
23/03/23 14:10:56,132 INFO [nioEventLoopGroup-2-1] LoggingHandler: [id: 0x83032bf1] BIND: 0.0.0.0/0.0.0.0:9872
23/03/23 14:10:56,134 INFO [nioEventLoopGroup-2-1] LoggingHandler: [id: 0x83032bf1, L:/0:0:0:0:0:0:0:0:9872] ACTIVE
23/03/23 14:10:56,135 INFO [JvmPauseMonitor0] JvmPauseMonitor: JvmPauseMonitor-0: Started
23/03/23 14:10:56,208 INFO [main] Master: Metrics system enabled.
23/03/23 14:10:56,216 INFO [main] HttpServer: master: HttpServer started on port 9098.
23/03/23 14:10:56,216 INFO [main] Master: Master started.
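The final line of the output above can also be checked from a script. This is a minimal sketch that assumes the startup output shown above is what kubectl logs returns for the master pod. The function name master_started is hypothetical.

```shell
# Succeed if the master pod's log contains the final startup line.
# ASSUMPTION: the startup output is available through `kubectl logs`, and the
# first master pod is named celeborn-master-0 as in the listing above.
master_started() (
  namespace=$1
  pod=${2:-celeborn-master-0}
  kubectl logs "$pod" -n "$namespace" | grep -q 'Master: Master started\.'
)
```

A caller can use it as a condition, for example: master_started <namespace> && echo "master is up".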
5. Access Celeborn Service
The Celeborn Master/Worker nodes deployed via the official Helm charts run as StatefulSets. They can be accessed through the Pod IP or the Stable Network ID (DNS name). In the case above, the Master/Worker nodes can be accessed through:
celeborn-master-0.celeborn-master-svc.default.svc.cluster.local
...
celeborn-worker-0.celeborn-worker-svc.default.svc.cluster.local
...
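The DNS names above follow a fixed pattern, so they can be derived from the role, the pod ordinal, and the namespace. The helper below is an illustrative sketch. The function name celeborn_pod_dns is hypothetical, and the cluster domain cluster.local is taken from the example names above and may differ in your cluster.

```shell
# Print the stable DNS name of a Celeborn pod.
# Arguments: role (master or worker), pod ordinal, namespace (default: "default").
# ASSUMPTION: the cluster domain is cluster.local, as in the example names above.
celeborn_pod_dns() (
  role=$1; ordinal=$2; namespace=${3:-default}
  echo "celeborn-${role}-${ordinal}.celeborn-${role}-svc.${namespace}.svc.cluster.local"
)

celeborn_pod_dns master 0
# prints celeborn-master-0.celeborn-master-svc.default.svc.cluster.local
```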
After a restart, a StatefulSet Pod's IP changes but its DNS name remains the same, which is important for rolling upgrades.
When the bind address is not set explicitly, the Celeborn Worker binds to the first non-loopback address it finds. By default, it uses the IP address for both binding and registering, so the Master and Client use the IP address to access the Worker. This is problematic after a Worker restart, as explained above, especially when Graceful Shutdown is enabled.
You may want to set celeborn.network.bind.preferIpAddress=false to address this issue. Note that, depending on your Kubernetes network infrastructure, this may put pressure on the DNS service or cause other network issues compared with using the IP address directly.
6. Build Celeborn Client
This section does not go into detail on how to configure Spark/Flink/MapReduce to find the Celeborn Master/Worker. It only mentions the key configuration:
spark.celeborn.master.endpoints: celeborn-master-0.celeborn-master-svc.<namespace>:9097,celeborn-master-1.celeborn-master-svc.<namespace>:9097,celeborn-master-2.celeborn-master-svc.<namespace>:9097
You can find out why the endpoints are configured this way in Kubernetes DNS for Service And Pods.
Notice: You should ensure that Spark/Flink/MapReduce can reach the Celeborn Master/Worker via IP or the Kubernetes DNS names mentioned above.
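The endpoint list shown above can be generated rather than typed by hand, which helps keep it in step with the number of master replicas. This is a minimal sketch. The function name celeborn_master_endpoints is hypothetical, and the default port 9097 is taken from the example configuration above.

```shell
# Print a comma-separated list of master endpoints for the given namespace
# and number of master replicas.
# Arguments: namespace, replica count, port (default: 9097, as in the example).
celeborn_master_endpoints() (
  namespace=$1; replicas=$2; port=${3:-9097}
  endpoints=""
  i=0
  while [ "$i" -lt "$replicas" ]; do
    # Prepend a comma only when the list already has an entry.
    endpoints="${endpoints:+${endpoints},}celeborn-master-${i}.celeborn-master-svc.${namespace}:${port}"
    i=$((i + 1))
  done
  echo "$endpoints"
)
```

The result can be passed straight to Spark, for example with spark-submit --conf "spark.celeborn.master.endpoints=$(celeborn_master_endpoints <namespace> 3)".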