Worker Exclusion

Workers can fail, temporarily or permanently. To reduce the impact of Worker failure, Celeborn tries to figure out Worker status as soon as possible, and as correct as possible. This article describes detailed design of Worker exclusion.

Participants

As described Previously, Celeborn has three components: Master, Worker, and Client. Client is further separated into LifecycleManager and ShuffleClient. Master/LifecycleManager /ShuffleClient need to know about Worker status, actively or reactively.

Master Side Exclusion

Master maintains the ground-truth status of Workers, with relatively longer delay. Master maintains four lists of Workers with different status:

Active list. Workers that have successfully registered to Master, and heartbeat never timed out.
Excluded list. Workers that are inside active list, but have no available disks for allocating new slots. Master recognizes such Workers through heartbeat from Workers.
Graceful shutdown list. Workers that are inside active list, but have triggered Graceful Shutdown. Master expects these Workers should re-register themselves soon.
Lost list. Workers whose heartbeat timed out. These Workers will be removed from active and excluded list, but will not be removed from graceful shutdown list.

Upon receiving RequestSlots, Master will choose Workers in active list subtracting excluded and graceful shutdown list. Since Master only exclude Workers upon heartbeat, it has relative long delay.

ShuffleClient Side Exclusion

ShuffleClient's local exclusion list is essential to performance. Say the timeout to create network connection is 10s, if ShuffleClient blindly pushes data to a non-exist Worker, the task will hang for a long time.

Waiting for Master to inform the exclusion list is unacceptable because of the delay. Instead, ShuffleClient actively exclude Workers when it encounters critical exceptions, for example:

Fail to create network connection
Fail to push data
Fail to fetch data
Connection exception happened

In addition to exclude the Workers locally, ShuffleClient also carries the cause of push failure with Revive to LifecycleManager, see the section below.

Such strategy is aggressive, false negative may happen. To rectify, ShuffleClient removes a Worker from the excluded list whenever an event happens that indicates that Worker is available, for example:

When the Worker is allocated slots in register shuffle
When LifecycleManager says the Worker is available in response of Revive

Currently, exclusion in ShuffleClient is optional, users can configure using the following configs:

celeborn.client.push/fetch.excludeWorkerOnFailure.enabled

LifecycleManager Side Exclusion

The accuracy and delay in LifecycleManager's exclusion list stands between Master and ShuffleClient. LifecyleManager excludes a Worker in the following scenarios:

Receives Revive request and the cause is critical
Fail to send RPC to a Worker
From Master's excluded list, carried in the heartbeat response

LifecycleManager will remove Worker from the excluded list in the following scenarios:

For critical causes, when timeout expires (defaults to 180s)
For non-critical causes, when it's not in Master's exclusion list

In the response of Revive, LifecycleManager checks the status of the Worker where previous push data has failed. ShuffleClient will remove from local exclusion list if the Worker is available.