Using Lambda's Managed Slurm#
Introduction to Slurm#
Slurm is a widely used open-source workload manager optimized for high-performance computing (HPC) and machine learning (ML) workloads. When deployed on a Lambda 1-Click Cluster (1CC), Slurm allows administrators to create user accounts with controlled access, enabling individual users to submit, monitor, and manage distributed ML training jobs.
Slurm automatically schedules workloads across the 1CC, maximizing cluster utilization while preventing resource contention.
Lambda's Slurm#
The table below summarizes the key differences between Lambda's Managed and Unmanaged Slurm deployments on a 1CC:
| Feature | Managed Slurm | Unmanaged Slurm |
|---|---|---|
| Access only through login node | ✓ | ✗ (all nodes accessible) |
| User `sudo`/`root` privileges | ✗ | ✓ |
| Lambda monitors Slurm daemons | ✓ | ✗ (customer is responsible) |
| Lambda applies patches and upgrades | ✓ (on request) | ✗ (customer is responsible) |
| Slurm support with SLAs | ✓ | ✗ |
| Lambda Slurm configuration | ✓ | ✓ |
| Slurm configured for high availability | ✓ | ✓ |
| Shared `/home` across all nodes | ✓ | ✓ |
| Shared `/data` across all nodes | ✓ | ✓ |
Managed Slurm#
When Lambda's Managed Slurm (MSlurm) is deployed on a 1CC:
- All interaction with the cluster happens through the login node. Access to other nodes is restricted to help ensure cluster integrity and reliability.
- Lambda monitors and maintains the health of Slurm daemons such as `slurmctld` and `slurmdbd`.
- Lambda coordinates with the customer to apply security patches and upgrade to new Slurm releases, if requested.
- Lambda provides support according to the service level agreements (SLAs) in place with the customer.
Unmanaged Slurm#
In contrast, on a 1CC with Unmanaged Slurm:
- All nodes are directly accessible, and users have system administrator privileges (`sudo` or `root`) across the cluster.

  Warning

  Workloads that run outside of Slurm might interfere with the resources managed by Slurm. Additionally, users with administrator access can make changes that render the 1CC unrecoverable. In such cases, Lambda might need to "repave" the 1CC, fully wiping and reinstalling the system.

- The customer is responsible for monitoring and maintaining the health of Slurm daemons.
- The customer is responsible for applying security patches and upgrading to new Slurm releases.
- Support is provided on a best-effort basis, with no guaranteed SLAs.
Shared features#
Both Managed and Unmanaged Slurm configurations include:
- Identical Slurm configurations: Slurm is installed the same way whether the customer is using Managed or Unmanaged Slurm.
- High availability (HA): Slurm is configured for HA, so jobs continue running even if the login node becomes unreachable.
- Shared `/home` filesystem: Provides a consistent user environment across all nodes, for example, personal scripts, virtual environments, and model checkpoints.
- Shared `/data` filesystem: Intended for storing resources such as libraries, datasets, and tools that aren't user-specific.
Note
It's recommended to stage data, such as datasets and models, on local storage before running a job. Accessing files directly from shared storage during a job can lead to degraded performance due to I/O bottlenecks.
Accessing the MSlurm cluster#
The MSlurm cluster is initially configured with a single user account: `ubuntu`. This account is pre-configured with the SSH key provided during the 1CC reservation process and is used to administer the cluster. Additional user accounts can be created from the `ubuntu` account. See Creating a new user.
To access the MSlurm cluster as the `ubuntu` user, SSH into the login node. Replace `<LOGIN-NODE-IP>` with the IP address of the login node (the node whose name ends in `-head-003`), available in the Cloud dashboard:
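For example (this assumes the SSH key provided during the reservation is loaded in your SSH agent or passed with `-i`):

```bash
ssh ubuntu@<LOGIN-NODE-IP>
```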
Other users access the MSlurm cluster in the same way, by using SSH to log into the login node. Replace `<USERNAME>` with the appropriate username:
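For example:

```bash
ssh <USERNAME>@<LOGIN-NODE-IP>
```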
Creating and removing MSlurm users#
In the MSlurm cluster, user accounts and groups control both system access and job submission permissions. LDAP provides consistent user and group management across all nodes by acting as a centralized directory.
Like standard Slurm installations, MSlurm doesn't maintain its own user database. Instead, it relies on the underlying system's authentication and group management.
To simplify user management, Lambda provides the `suser` tool, which streamlines the creation and deletion of user accounts. `suser` acts as a wrapper around the `ldapscripts` and `sacctmgr` commands.

Advanced users can use `ldapscripts` and `sacctmgr` directly. See the `ldapscripts` and `sacctmgr` man pages to learn how to use these commands.
Creating a new user#
To create a new user using `suser`:
- SSH into the MSlurm login node using the `ubuntu` account. Replace `<LOGIN-NODE-IP>` with the IP address of the login node (`-head-003`):
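  For example:

  ```bash
  ssh ubuntu@<LOGIN-NODE-IP>
  ```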
- Create the user. Replace `<USERNAME>` with the desired username, and `<SSH-KEY>` with either the path to the user's SSH public key file or the public key string itself:
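  The exact `suser` syntax isn't reproduced here, so the following is a sketch; it assumes an `add` subcommand that takes the username and SSH public key. Check `suser --help` on the login node for the authoritative syntax:

  ```bash
  # Assumed form; verify with "suser --help". "sudo" may not be required.
  sudo suser add <USERNAME> <SSH-KEY>
  ```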
After the command completes, the message `User <USERNAME> successfully added` confirms the user was created.
Removing a user#
To remove a user using `suser`:
- SSH into the MSlurm login node using the `ubuntu` account. Replace `<LOGIN-NODE-IP>` with the IP address of the login node (`-head-003`):
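  For example:

  ```bash
  ssh ubuntu@<LOGIN-NODE-IP>
  ```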
- Remove the user. Replace `<USERNAME>` with the actual username:
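  As above, the exact syntax is a sketch that assumes a `remove` subcommand:

  ```bash
  # Assumed form; verify with "suser --help".
  sudo suser remove <USERNAME>
  ```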
After the command completes, the message `User <USERNAME> successfully removed` confirms the user was removed.
Running jobs on the MSlurm cluster#
Jobs are submitted to the MSlurm cluster using the `sbatch`, `srun`, and `salloc` commands:
- `sbatch` is appropriate for jobs that don't require user interaction and can be scheduled to run when resources are available. It's commonly used for training and other long-running jobs. Since `sbatch` is non-interactive, it requires a job script to specify resource requirements and the commands to execute. Learn more about using `sbatch`.
- `srun` is appropriate for interactive jobs and for quick execution of commands without writing a job script for `sbatch`. It's useful for debugging code, testing scripts, or running short tasks interactively on compute (GPU) nodes before submitting a batch job. Learn more about using `srun`.
- `salloc` is appropriate for requesting compute resources interactively, then launching a shell session where multiple commands can be executed manually. It's useful for development, testing, and running interactive applications on compute nodes before submitting a batch job. Learn more about using `salloc`.
The MSlurm cluster supports Pyxis and Enroot, enabling `srun` to run containers, including those based on Docker images, on compute nodes.
Below are examples of using `sbatch`, `srun`, and `salloc`. Two `sbatch` examples are included: one that runs `nvidia-smi -L` on compute nodes and displays its output, and another that evaluates a language model's ability to solve multiplication problems. The `nvidia-smi -L` examples also print the hostnames of the compute nodes where the command was executed.

The examples should be run on the MSlurm cluster login node.
Using `sbatch` to run `nvidia-smi -L`#
- Create a file named `nvidia_smi_batch.sh` containing the following:

  ```bash
  #!/bin/bash
  #SBATCH --nodes=2
  #SBATCH --gpus=2
  #SBATCH --ntasks=2
  #SBATCH --ntasks-per-node=1
  #SBATCH --output="sbatch_output_direct_%x_%j.out"
  #SBATCH --error="sbatch_output_direct_%x_%j.err"
  #SBATCH --time=00:01:00

  echo "Job ID: $SLURM_JOB_ID"
  echo "Running on nodes: $SLURM_NODELIST"
  echo

  srun --ntasks=$SLURM_NTASKS nvidia-smi -L
  ```
- Submit the job using `sbatch`:
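  For example, assuming `nvidia_smi_batch.sh` is in the current directory:

  ```bash
  sbatch nvidia_smi_batch.sh
  ```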
This command submits the job and performs the following steps:

- Requests cluster resources:
  - `--nodes=2`: Reserves 2 compute nodes.
  - `--gpus=2`: Requests a total of 2 GPUs across all nodes.
  - `--ntasks=2`: Runs 2 parallel tasks in total.
  - `--ntasks-per-node=1`: Assigns 1 task per node (2 tasks across 2 nodes).
- Configures job output:
  - `--output="sbatch_output_direct_%x_%j.out"`: Saves standard output to a file named with the job name (`%x`) and job ID (`%j`).
  - `--error="sbatch_output_direct_%x_%j.err"`: Saves standard error to a similarly named file.
- Sets a job time limit:
  - `--time=00:01:00`: Limits the job runtime to 1 minute.
- Prints the job information:
  - `echo "Job ID: $SLURM_JOB_ID"`: Displays the assigned job ID.
  - `echo "Running on nodes: $SLURM_NODELIST"`: Displays the list of allocated nodes.
- Runs `nvidia-smi -L`:
  - `srun --ntasks=$SLURM_NTASKS nvidia-smi -L`: Runs `nvidia-smi -L` in all tasks to list the visible GPUs.
After the job completes, two files are created:

- `sbatch_output_direct_<JOBNAME>_<JOBID>.out`: Contains the job ID, allocated nodes, and the output of `nvidia-smi -L` from each task.
- `sbatch_output_direct_<JOBNAME>_<JOBID>.err`: Contains any error messages. This file is usually empty unless something went wrong.
Using `sbatch` to evaluate a large language model (LLM)#
As an additional example, a Slurm batch job can be used to evaluate how well an LLM solves basic multiplication problems:
- Download the Python script and the Slurm batch script:

  ```bash
  curl -sSLO https://docs.lambda.ai/assets/code/eval_multiplication.py
  curl -sSLO https://docs.lambda.ai/assets/code/run_eval.sh
  ```

  Both scripts are annotated with comments explaining their structure and purpose.
- Submit the job using a Hugging Face model ID:
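  The exact way `run_eval.sh` expects the model ID isn't reproduced here; a plausible invocation, assuming the script takes the model ID as its first argument, is:

  ```bash
  # Assumed form; see the comments in run_eval.sh for the exact usage.
  sbatch run_eval.sh <HUGGING-FACE-MODEL-ID>
  ```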
  The model ID can be replaced with any other compatible model.
- To follow the job's progress in real time:
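  Assuming the batch script writes to Slurm's default output file in the submission directory (adjust the filename if `run_eval.sh` sets its own `--output` path):

  ```bash
  # Replace <JOBID> with the job ID reported by sbatch.
  tail -f slurm-<JOBID>.out
  ```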
  This log shows when the model is loading, prompts are being processed, and sampling is running.
- After the job completes, the accuracy is saved to a file in the `accuracies/` directory. The filename matches the model ID with slashes replaced by underscores. To view it:
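  For example, to print every result file in the directory regardless of its exact name:

  ```bash
  cat accuracies/*
  ```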
Using `srun` to run `nvidia-smi -L`#

Below are two methods for running `nvidia-smi -L` with `srun`:
- Direct execution on compute nodes: Runs `nvidia-smi -L` directly on the assigned nodes.
- Execution inside containers: Runs `nvidia-smi -L` within a containerized environment on the compute nodes.
Direct execution on compute nodes#
```bash
srun --gpus=2 --nodes=2 --ntasks-per-node=1 \
    --output="srun_output_direct_%N.txt" \
    bash -c 'printf "\n===== Node: $(hostname) =====\n"; nvidia-smi -L'
```
This command runs `nvidia-smi -L` directly on two compute nodes and saves the output in separate text files. The filenames are based on the hostnames of the respective nodes, for example:

- `srun_output_direct_slurm-compute001.txt`
- `srun_output_direct_slurm-compute002.txt`

Each file contains the `nvidia-smi -L` output from its corresponding compute node.
Execution inside containers#
```bash
srun --gpus=2 --nodes=2 --ntasks-per-node=1 \
    --output="srun_output_container_%N.txt" \
    --container-image=nvidia/cuda:12.8.1-runtime-ubuntu22.04 \
    bash -c 'printf "\n===== Node: $(hostname) =====\n"; nvidia-smi -L'
```
This command performs the same task as above but runs `nvidia-smi -L` inside an NVIDIA CUDA container instead of directly on the compute nodes. The output is saved in separate files, such as:

- `srun_output_container_slurm-compute001.txt`
- `srun_output_container_slurm-compute002.txt`

Each file contains the `nvidia-smi -L` output from its respective node while running within a containerized environment.
Using `salloc` to run `nvidia-smi -L`#
Unlike `srun` and `sbatch`, `salloc` doesn't launch tasks automatically. Instead, it allocates resources and starts an interactive shell where you can launch tasks with `srun`. `salloc` is often used together with `srun --pty /bin/bash` to open an interactive shell directly on the allocated compute node.
In this example, one node with two GPUs is requested, rather than two nodes with one GPU each as in the earlier `sbatch` and `srun` examples. This difference is reflected in the `nvidia-smi -L` output. The output appears directly in the terminal instead of being saved to a file.
- Allocate one node with two GPUs and start an interactive shell on the allocated node:
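  One way to do this, following the `salloc` plus `srun --pty /bin/bash` pattern described above:

  ```bash
  salloc --nodes=1 --gpus=2 srun --pty /bin/bash
  ```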
- Print the node's hostname and run `nvidia-smi -L`:
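  For example:

  ```bash
  hostname
  nvidia-smi -L
  ```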
- Exit the interactive shell and release the allocated resources by pressing Ctrl + D.
Next steps#
- See SchedMD's Slurm documentation to learn more about using Slurm.