What does "drain" mean in Slurm?

In Slurm, to "drain" a node means to transition it into a state where it will complete any currently running jobs but will not be allocated any new jobs, typically in preparation for maintenance or removal from service.

Understanding Node Draining in Slurm

In the Slurm Workload Manager, the term "drain" refers to a specific operational state for compute nodes. This state removes a node from the cluster's available resources without terminating the jobs already running on it. Administrators rely on it to perform maintenance, apply upgrades, or decommission hardware without interrupting users' work.

The DRAINING State

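When a Slurm node that still has jobs running is set to "drain," it enters the DRAINING state. (A node with no running jobs skips this step and goes directly to DRAINED.) The DRAINING state is a transitional period characterized by: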

Current Job Completion: The node continues to run any jobs that were allocated to it before the drain request was initiated. These jobs are allowed to complete naturally.
No New Job Allocation: Crucially, the Slurm scheduler will not assign any new jobs to this node, even if it appears to have available resources. This prevents new workloads from starting on a node that is being prepared for service removal.
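
To see which of these states a node is in, the scheduler can be queried with sinfo. The following is a minimal sketch rather than a definitive tool: node_state is a hypothetical helper name, and node001 is a placeholder node name.

```shell
# Print the scheduler's view of one node's state, e.g. "draining" or "drained".
# node_state is a hypothetical helper name used only for this illustration.
# Usage: node_state node001
node_state() {
    # -h: suppress the header line
    # -N: node-oriented output (one line per node and partition)
    # -n: restrict the report to the named node
    # -o "%T": print the node state in its long form
    # head keeps a single line in case the node belongs to several partitions.
    sinfo -h -N -n "$1" -o "%T" | head -n 1
}
```

Slurm may append a suffix character to the state (for example "*" for a node that is not responding), so scripts should match on the leading word rather than compare the whole string.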

Transition to DRAINED

Once all jobs running on a DRAINING node have completed, the node's state automatically transitions to DRAINED. A DRAINED node is completely out of service for job allocation and is ready for administrative actions. It stays in that state until an administrator returns it to service.
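
Because the transition happens on its own, a maintenance script usually just polls until the node reports DRAINED. A rough sketch under stated assumptions: wait_for_drained is a hypothetical helper name, and the 30-second default poll interval is an arbitrary choice.

```shell
# Block until the named node reports a drained state.
# wait_for_drained is a hypothetical helper name, not a Slurm command.
# Usage: wait_for_drained node001 [poll_seconds]
wait_for_drained() {
    node=$1
    interval=${2:-30}   # assumed default; tune for your site
    while :; do
        # Long-form state of this node, first line only.
        state=$(sinfo -h -N -n "$node" -o "%T" | head -n 1)
        case $state in
            drained*)  echo "$node is drained"; return 0 ;;
            draining*) sleep "$interval" ;;   # jobs still running, keep waiting
            *)         echo "$node is in state '$state', not draining" >&2
                       return 1 ;;
        esac
    done
}
```

The function returns a non-zero status if the node is in any other state, which catches the case where the drain request was never issued or was undone by someone else.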

Why Drain a Node? Practical Scenarios

System administrators drain nodes for several reasons:

Scheduled Maintenance: To perform necessary hardware maintenance (e.g., RAM upgrade, disk replacement), firmware updates, or critical software patches on the node.
Troubleshooting: To isolate a problematic node for in-depth diagnostics without disrupting the entire cluster or abruptly terminating user jobs.
Decommissioning: To gracefully remove a node from the cluster's active resources, often when hardware reaches end-of-life or is being re-purposed.
Configuration Changes: To apply changes that require a node restart or brief downtime.

How Draining is Initiated

The process of draining a node is typically initiated by a system administrator using Slurm commands, most commonly scontrol. For instance, to drain a node named node001 with a specific reason, an administrator might use a command like:

scontrol update NodeName=node001 State=DRAIN Reason="Planned maintenance"

This command instructs Slurm to mark node001 for draining. Slurm will then prevent new jobs from being assigned to it while allowing existing jobs to finish.
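
Since a drain request should always carry a reason, it can help to wrap the command so the reason cannot be forgotten. A minimal sketch, assuming a hypothetical helper named drain_node; the scontrol invocation inside it is the same one shown above.

```shell
# Drain a node, refusing to proceed without a reason string.
# drain_node is a hypothetical wrapper name used for illustration.
# Usage: drain_node node001 "Planned maintenance"
drain_node() {
    if [ $# -ne 2 ] || [ -z "$2" ]; then
        echo "usage: drain_node <node> <reason>" >&2
        return 2
    fi
    # Quoting "$2" keeps a multi-word reason together as one argument.
    scontrol update NodeName="$1" State=DRAIN Reason="$2"
}
```

The recorded reason can be reviewed afterwards with sinfo -R, which lists nodes that are down or drained together with the reason given.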

Example Flow of Draining a Node:

An administrator determines that node005 requires an operating system upgrade that necessitates downtime.
They issue the command: scontrol update NodeName=node005 State=DRAIN Reason="OS Upgrade".
Because jobs are still running on it, node005 enters the DRAINING state, and those jobs continue their execution.
The Slurm scheduler stops sending any new jobs to node005.
Once the last active job on node005 completes, its state automatically changes to DRAINED.
The administrator can now safely power down node005 and perform the OS upgrade, knowing that no running jobs were interrupted and no new jobs will be impacted.
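
The flow above ends with the node out of service. Once the work is done, the node has to be returned to the scheduler explicitly, which is done with State=RESUME. The sketch below is one conservative way to do that: resume_node is a hypothetical helper name, and the check that the node is actually DRAINED before resuming is a design choice of this example, not something Slurm requires.

```shell
# Return a drained node to service once maintenance is finished.
# resume_node is a hypothetical helper name used for illustration.
# Usage: resume_node node005
resume_node() {
    # Long-form state of this node, first line only.
    state=$(sinfo -h -N -n "$1" -o "%T" | head -n 1)
    case $state in
        drained*)
            scontrol update NodeName="$1" State=RESUME ;;
        *)
            # Assumed policy for this sketch: touch only fully drained nodes.
            echo "$1 is '$state', not drained; leaving it alone" >&2
            return 1 ;;
    esac
}
```

After a reboot a node may come back in a different state than DRAINED, so a site may prefer to drop or relax this guard.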

This controlled process lets running jobs finish normally before the node goes offline, minimizing disruption to users of the high-performance computing (HPC) environment managed by Slurm.