Worker Exclusion
Workers can fail, temporarily or permanently. To reduce the impact of Worker failure, Celeborn tries to
figure out Worker status as soon as possible, and as correct as possible. This article describes detailed
design of Worker exclusion.
Participants
As described Previously, Celeborn has three components: Master, Worker,
and Client. Client is further separated into LifecycleManager and ShuffleClient. Master/LifecycleManager
/ShuffleClient need to know about Worker status, actively or reactively.
Master Side Exclusion
Master maintains the ground-truth status of Workers, with relatively longer delay. Master maintains four
lists of Workers with different status:
- Active list.
Workers that have successfully registered toMaster, and heartbeat never timed out. - Excluded list.
Workers that are inside active list, but have no available disks for allocating new slots.Masterrecognizes suchWorkers through heartbeat fromWorkers. - Graceful shutdown list.
Workers that are inside active list, but have triggered Graceful Shutdown.Masterexpects theseWorkers should re-register themselves soon. - Lost list.
Workers whose heartbeat timed out. TheseWorkers will be removed from active and excluded list, but will not be removed from graceful shutdown list.
Upon receiving RequestSlots, Master will choose Workers in active list subtracting excluded and graceful
shutdown list. Since Master only exclude Workers upon heartbeat, it has relative long delay.
ShuffleClient Side Exclusion
ShuffleClient's local exclusion list is essential to performance. Say the timeout to create network
connection is 10s, if ShuffleClient blindly pushes data to a non-exist Worker, the task will hang for a long time.
Waiting for Master to inform the exclusion list is unacceptable because of the delay. Instead, ShuffleClient
actively exclude Workers when it encounters critical exceptions, for example:
- Fail to create network connection
- Fail to push data
- Fail to fetch data
- Connection exception happened
In addition to exclude the Workers locally, ShuffleClient also carries the cause of push failure with
Revive to LifecycleManager, see the section below.
Such strategy is aggressive, false negative may happen. To rectify, ShuffleClient removes a Worker from
the excluded list whenever an event happens that indicates that Worker is available, for example:
- When the
Workeris allocated slots in register shuffle - When
LifecycleManagersays theWorkeris available in response of Revive
Currently, exclusion in ShuffleClient is optional, users can configure using the following configs:
celeborn.client.push/fetch.excludeWorkerOnFailure.enabled
LifecycleManager Side Exclusion
The accuracy and delay in LifecycleManager's exclusion list stands between Master and ShuffleClient. LifecyleManager
excludes a Worker in the following scenarios:
- Receives Revive request and the cause is critical
- Fail to send RPC to a
Worker - From
Master's excluded list, carried in the heartbeat response
LifecycleManager will remove Worker from the excluded list in the following scenarios:
- For critical causes, when timeout expires (defaults to 180s)
- For non-critical causes, when it's not in
Master's exclusion list
In the response of Revive, LifecycleManager checks the status of the Worker where previous push data has failed.
ShuffleClient will remove from local exclusion list if the Worker is available.