
How do I add a node in Slurm?


You can add a node to a Slurm cluster in two main ways: let the slurmd daemon register itself dynamically with the Slurm controller, or create the node entry manually with the scontrol command. The two methods suit different use cases, so you can pick the one that fits how your cluster's resources are managed.

Two Primary Methods for Adding Slurm Nodes

Understanding these two approaches is crucial for efficient Slurm cluster management, whether you're setting up a new node, expanding an existing cluster, or managing dynamic cloud resources.

1. Dynamic slurmd Registration

One of the most efficient ways to add nodes, especially in dynamic or cloud environments, is by enabling slurmd to register itself automatically with the Slurm controller.

  • Process: When a slurmd daemon starts on a compute node, it can be configured to automatically register its presence with the Slurm controller (slurmctld). This process streamlines the addition of new hardware or virtual machines without requiring manual intervention on the controller side.
  • Command Options: To enable dynamic registration, you typically start the slurmd daemon with specific options:
    • slurmd -Z: This option instructs slurmd to automatically register itself with the Slurm controller.
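    • --conf "<node parameters>": Used together with -Z to describe the node in the same syntax as a node definition in slurm.conf, for example --conf "RealMemory=80000 Gres=gpu:2 Feature=f1". It does not take a path to slurm.conf; slurmd finds the controller through its normal configuration (a local slurm.conf, which can be selected with -f, or a configless setup).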
  • Benefits: This method is highly beneficial for cloud bursting, elastic clusters, or any scenario where nodes might frequently come online and offline. It reduces administrative overhead and ensures that newly provisioned resources are quickly integrated into the Slurm scheduling pool.
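
The start-up step described above can be sketched as a small wrapper. This is a minimal sketch, assuming a Slurm release with dynamic node support and a controller that is configured to accept dynamic nodes. The helper name dynamic_slurmd_cmd and the sample parameter values are hypothetical. The helper only prints the command, so it can be reviewed before it is run as root on the compute node.

```shell
#!/bin/sh
# Build (but do not run) a slurmd command line for dynamic registration.
# dynamic_slurmd_cmd is a hypothetical helper, not part of Slurm. Its
# arguments are node parameters in slurm.conf syntax, e.g. RealMemory=24576.
dynamic_slurmd_cmd() {
    if [ "$#" -eq 0 ]; then
        # No extra parameters: slurmd registers with what it detects.
        printf 'slurmd -Z\n'
    else
        # "$*" joins all parameters with spaces into a single --conf argument.
        printf 'slurmd -Z --conf "%s"\n' "$*"
    fi
}

# Example with assumed values: 24576 MB of memory and one feature tag.
dynamic_slurmd_cmd RealMemory=24576 Feature=bigmem
```

Printing instead of executing keeps the sketch safe to try on any machine; on a real compute node the printed line would be run directly or placed in the slurmd service configuration.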

2. Using scontrol create

For more granular control or when adding a fixed, persistent node to your cluster, the scontrol create command is the preferred method. This command allows you to define a node's properties directly within Slurm's runtime state.

  • Process: The scontrol create command lets you explicitly define a new node entry in Slurm's active configuration. You specify the NodeName and other relevant parameters, much like you would in your slurm.conf file.
  • Command Syntax:
    scontrol create NodeName=<node_name> CPUs=<cpus> Sockets=<sockets> CoresPerSocket=<cps> ThreadsPerCore=<tpc> RealMemory=<memory_mb> Features=<features> State=<CLOUD|FUTURE>
    • Example: To add a node named compute003 with 24 GB of RAM (24576 MB), 2 sockets, 6 cores per socket, and 2 threads per core:
      scontrol create NodeName=compute003 RealMemory=24576 Sockets=2 CoresPerSocket=6 ThreadsPerCore=2 State=CLOUD
    • Note: According to the Slurm dynamic nodes documentation, scontrol create supports only State=CLOUD and State=FUTURE. The node can run jobs once slurmd is running on it and has registered with slurmctld.
  • Consistency with slurm.conf: scontrol create takes the same parameters as a NodeName line in slurm.conf, and the values should match what slurmd will report when it registers (slurmd -C prints the detected hardware). The command changes only the controller's running state and does not edit slurm.conf, so if the node should also be part of the static configuration, add the same definition to slurm.conf on the Slurm controller.
  • Use Cases: This method is ideal for statically defined clusters, adding a new physical server, or making immediate adjustments to the cluster configuration without restarting the slurmctld daemon.
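
The compute003 example above can be generated from its hardware values, which also makes the CPU count explicit. This is a minimal sketch: scontrol_create_cmd is a hypothetical helper, and passing CLOUD as the state is an assumption based on the Slurm dynamic nodes documentation, which lists CLOUD and FUTURE as the states supported by scontrol create. The helper prints the command rather than running it.

```shell
#!/bin/sh
# Print (but do not run) an scontrol create command for one node.
# scontrol_create_cmd is a hypothetical helper, not part of Slurm.
# Arguments: node name, memory in MB, sockets, cores per socket,
# threads per core, initial state.
scontrol_create_cmd() {
    name=$1 mem_mb=$2 sockets=$3 cores=$4 threads=$5 state=$6
    # Logical CPUs are the product of the three topology values.
    cpus=$((sockets * cores * threads))
    printf 'scontrol create NodeName=%s CPUs=%d RealMemory=%d Sockets=%d CoresPerSocket=%d ThreadsPerCore=%d State=%s\n' \
        "$name" "$cpus" "$mem_mb" "$sockets" "$cores" "$threads" "$state"
}

# compute003: 2 sockets x 6 cores x 2 threads = 24 logical CPUs.
scontrol_create_cmd compute003 24576 2 6 2 CLOUD
```

Deriving CPUs from the topology values, instead of typing it by hand, avoids one common source of mismatch between the definition and what slurmd later reports.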

Choosing the Right Method

The choice between dynamic registration and scontrol create depends on your cluster's architecture and management philosophy.

Feature | Dynamic slurmd Registration | scontrol create
--- | --- | ---
Automation | High (node registers itself) | Manual (explicit command execution)
Primary Use Case | Cloud instances, elastic clusters, temporary nodes | Static clusters, permanent additions, immediate control
Administrative Effort | Low for setup, but requires slurm.conf consistency | Higher initially, precise control
Persistence | Requires slurm.conf entry for long-term consistency | Requires slurm.conf entry for long-term consistency
Flexibility | Excellent for fluctuating node counts | Good for fixed, well-defined environments

Important Considerations

Regardless of the method chosen, consistency across your Slurm configuration files (slurm.conf) and the actual hardware or virtual machine specifications is paramount. Misconfigurations can lead to nodes not being recognized, jobs failing, or inefficient resource allocation. Always ensure that the resources defined (CPU cores, memory, features) accurately reflect the capabilities of the physical or virtual nodes.
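
One way to check this consistency before adding a node is to compare the intended definition with the hardware line that slurmd -C prints on the node. The following is a minimal sketch, not an official tool: node_field and check_node are hypothetical helpers, the two sample lines are made-up values, and the memory rule (configured RealMemory must not exceed the detected value) is an assumption about what counts as a safe definition.

```shell
#!/bin/sh
# node_field LINE KEY: print the value of KEY=value from a node definition line.
# LINE is split on whitespace on purpose, so it must be passed as one string.
node_field() {
    for pair in $1; do
        case $pair in
            "$2"=*) printf '%s\n' "${pair#*=}"; return 0 ;;
        esac
    done
    return 1
}

# check_node CONFIGURED DETECTED: return 0 if the definition fits the hardware.
check_node() {
    status=0
    # Topology values are expected to match exactly.
    for key in CoresPerSocket ThreadsPerCore; do
        want=$(node_field "$1" "$key")
        have=$(node_field "$2" "$key")
        if [ "$want" != "$have" ]; then
            printf 'mismatch: %s configured=%s detected=%s\n' "$key" "$want" "$have"
            status=1
        fi
    done
    # Assumed rule: configured memory may be lower than detected, never higher.
    want=$(node_field "$1" RealMemory)
    have=$(node_field "$2" RealMemory)
    if [ "$want" -gt "$have" ]; then
        printf 'RealMemory too high: configured=%s detected=%s\n' "$want" "$have"
        status=1
    fi
    return "$status"
}

# Sample values; on a real node, detected could come from: slurmd -C | head -n 1
configured='NodeName=compute003 Sockets=2 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=24576'
detected='NodeName=compute003 CPUs=24 Boards=1 SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=24000'
check_node "$configured" "$detected" || echo 'fix the definition before adding the node'
```

In the sample, the configured memory is larger than the detected memory, so the check reports a problem; lowering RealMemory in the definition to at most the detected value would make it pass.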