Using Lambda's Managed Slurm#
See our video guide on using Lambda's Managed Slurm.
Introduction to Slurm#
Slurm is a widely used open-source workload manager optimized for high-performance computing (HPC) and machine learning (ML) workloads. When deployed on a Lambda 1-Click Cluster (1CC), Slurm allows administrators to create user accounts with controlled access, enabling individual users to submit, monitor, and manage distributed ML training jobs.
Slurm automatically schedules workloads across the 1CC, maximizing cluster utilization while preventing resource contention.
Lambda's Slurm#
The table below summarizes the key differences between Lambda's Managed and Unmanaged Slurm deployments on a 1CC:
| Feature | Managed Slurm | Unmanaged Slurm |
|---|---|---|
| Access only through login node | ✓ | ✗ (all nodes accessible) |
| User sudo/root privileges | ✗ | ✓ |
| Lambda monitors Slurm daemons | ✓ | ✗ (customer is responsible) |
| Lambda applies patches and upgrades | ✓ (on request) | ✗ (customer is responsible) |
| Slurm support with SLAs | ✓ | ✗ |
| Lambda Slurm configuration | ✓ | ✓ |
| Slurm configured for high availability | ✓ | ✓ |
| Shared `/home` across all nodes | ✓ | ✓ |
| Shared `/data` across all nodes | ✓ | ✓ |
Managed Slurm#
When Lambda's Managed Slurm (MSlurm) is deployed on a 1CC:
- All interaction with the cluster happens through the login node. Access to other nodes is restricted to help ensure cluster integrity and reliability.
- Lambda monitors and maintains the health of Slurm daemons such as `slurmctld` and `slurmdbd`.
- Lambda coordinates with the customer to apply security patches and upgrade to new Slurm releases, if requested.
- Lambda provides support according to the service level agreements (SLAs) in place with the customer.
Unmanaged Slurm#
In contrast, on a 1CC with Unmanaged Slurm:
- All nodes are directly accessible, and users have system administrator privileges (`sudo` or `root`) across the cluster.

    Warning

    Workloads that run outside of Slurm might interfere with the resources managed by Slurm. Additionally, users with administrator access can make changes that render the 1CC unrecoverable. In such cases, Lambda might need to "repave" the 1CC, fully wiping and reinstalling the system.

- The customer is responsible for monitoring and maintaining the health of Slurm daemons.
- The customer is responsible for applying security patches and upgrading to new Slurm releases.
- Support is provided on a best-effort basis, with no guaranteed SLAs.
Shared features#
Both Managed and Unmanaged Slurm configurations include:
- Identical Slurm configurations: Slurm is installed the same way whether the customer is using Managed or Unmanaged Slurm.
- Container and HPC software: Open MPI, CUDA, Podman, Apptainer, Pyxis, and Enroot are preinstalled.
- High availability (HA): Slurm is configured for HA, so jobs continue running even if the login node becomes unreachable.
- Shared `/home` filesystem: Provides a consistent user environment across all nodes, for example, personal scripts, virtual environments, and model checkpoints.
- Shared `/data` filesystem: Intended for storing resources such as libraries, datasets, and tools that aren't user-specific.
Note
It's recommended to stage data, such as datasets and models, on local storage before running a job. Accessing files directly from shared storage during a job can lead to degraded performance due to I/O bottlenecks.
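As a sketch of what staging can look like inside a job script: the example below uses Slurm's `sbcast` command to copy a file from shared storage to every allocated node. The dataset path and the node-local destination are hypothetical; adjust them to your data and to whatever node-local storage is available on your nodes.

```bash
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --time=00:10:00

# Hypothetical paths: a dataset on the shared filesystem and a node-local destination.
SHARED_DATASET=/data/datasets/my_dataset.tar
LOCAL_DATASET=/tmp/my_dataset.tar

# sbcast copies the file to node-local storage on every node in the allocation,
# so the training step reads from local disk instead of the shared filesystem.
sbcast "$SHARED_DATASET" "$LOCAL_DATASET"

# ...launch training here, pointing it at $LOCAL_DATASET...
```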
Accessing the MSlurm cluster#
The MSlurm cluster is initially configured with a single user account: ubuntu.
This account is pre-configured with the SSH key provided during the 1CC
reservation process and is used to administer the cluster. Additional user
accounts can be created from the ubuntu account. See
Creating a new user.
To access the MSlurm cluster as the `ubuntu` user, SSH into the login node.
Replace `<LOGIN-NODE-IP>` with the IP address of the `-head-003` node, available in the
Lambda Cloud console:
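```bash
# Connect as the ubuntu administrative user; add -i <PATH-TO-PRIVATE-KEY> if the key
# provided during the reservation isn't in a default location
ssh ubuntu@<LOGIN-NODE-IP>
```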
Other users access the MSlurm cluster in the same way, by using SSH to log into
the login node. Replace `<USERNAME>` with the appropriate username:
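```bash
# Log in as a regular MSlurm user
ssh <USERNAME>@<LOGIN-NODE-IP>
```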
Creating and removing MSlurm users#
In the MSlurm cluster, user accounts and groups control both system access and job submission permissions. LDAP provides consistent user and group management across all nodes by acting as a centralized directory.
Like standard Slurm installations, MSlurm doesn't maintain its own user database. Instead, it relies on the underlying system's authentication and group management.
To simplify user management, Lambda provides the `suser` tool, which
streamlines the creation and deletion of user accounts. `suser` acts as a
wrapper around the `ldapscripts` and `sacctmgr` commands.

Advanced users can use `ldapscripts` and `sacctmgr` directly. See the
`ldapscripts` and `sacctmgr` man pages to learn how to use these commands.
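For example, these read-only `sacctmgr` queries can be run directly to inspect the Slurm accounting database (shown for reference only; `suser` remains the recommended tool for creating and removing users):

```bash
# List users known to the Slurm accounting database
sacctmgr show user

# List user-to-account associations
sacctmgr show association format=User,Account
```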
Creating a new user#
To create a new user using `suser`:

- SSH into the MSlurm login node using the `ubuntu` account. Replace `<LOGIN-NODE-IP>` with the IP address of the login node (`-head-003`).
- Create the user. Replace `<USERNAME>` with the desired username, and `<SSH-KEY>` with either the path to the user's SSH public key file or the public key string itself. Example commands for both steps are sketched below.
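A sketch of the two steps. The `suser` subcommand name isn't spelled out in this guide, so `add` below is an assumption inferred from the confirmation message; check the tool's built-in help on the login node for the exact syntax.

```bash
# Step 1: connect to the login node as the ubuntu administrative account
ssh ubuntu@<LOGIN-NODE-IP>

# Step 2 (assumed subcommand): create the user and register their SSH public key
suser add <USERNAME> <SSH-KEY>
```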
After the command completes, the message `User <USERNAME> successfully added`
will confirm the user was created.
Removing a user#
To remove a user using `suser`:

- SSH into the MSlurm login node using the `ubuntu` account. Replace `<LOGIN-NODE-IP>` with the IP address of the login node (`-head-003`).
- Remove the user. Replace `<USERNAME>` with the actual username. Example commands for both steps are sketched below.
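As above, a sketch with the subcommand name assumed from the confirmation message:

```bash
# Connect to the login node as the ubuntu administrative account
ssh ubuntu@<LOGIN-NODE-IP>

# Assumed subcommand: remove the user account
suser remove <USERNAME>
```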
After the command completes, the message `User <USERNAME> successfully removed`
will confirm the user was removed.
Running jobs on the MSlurm cluster#
Jobs are submitted to the MSlurm cluster using the `sbatch`, `srun`, and
`salloc` commands:
- `sbatch` is appropriate for jobs that don't require user interaction and can be scheduled to run when resources are available. It's commonly used for training and other long-running jobs. Since `sbatch` is non-interactive, it requires a job script to specify resource requirements and the commands to execute. Learn more about using `sbatch`.
- `srun` is appropriate for interactive jobs and quick execution of commands without writing a job script for `sbatch`. It's useful for debugging code, testing scripts, or running short tasks interactively on compute (GPU) nodes before submitting a batch job. Learn more about using `srun`.
- `salloc` is appropriate for requesting compute resources interactively, then launching a shell session where multiple commands can be executed manually. It's useful for development, testing, and running interactive applications on compute nodes before submitting a batch job. Learn more about using `salloc`.
The MSlurm cluster supports Pyxis and Enroot, enabling srun to run containers,
including those based on Docker images, on compute nodes.
Below are examples of using `sbatch`, `srun`, and `salloc`. Two `sbatch`
examples are included: one that runs `nvidia-smi -L` on compute nodes and
displays its output, and another that evaluates a language model's ability to
solve multiplication problems. Apart from the LLM evaluation, the examples print
the hostnames of the compute nodes where `nvidia-smi -L` was executed.
The examples should be run on the MSlurm cluster login node.
Using sbatch to run nvidia-smi -L#
- Create a file named `nvidia_smi_batch.sh` containing the following:

    ```bash
    #!/bin/bash
    #SBATCH --nodes=2
    #SBATCH --gpus=2
    #SBATCH --ntasks=2
    #SBATCH --ntasks-per-node=1
    #SBATCH --output="sbatch_output_direct_%x_%j.out"
    #SBATCH --error="sbatch_output_direct_%x_%j.err"
    #SBATCH --time=00:01:00

    echo "Job ID: $SLURM_JOB_ID"
    echo "Running on nodes: $SLURM_NODELIST"
    echo

    srun --ntasks=$SLURM_NTASKS nvidia-smi -L
    ```

- Submit the job using `sbatch`:
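```bash
# Submit the batch script from the directory where it was created
sbatch nvidia_smi_batch.sh
```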
This command submits the job and performs the following steps:

- Requests cluster resources:
    - `--nodes=2`: Reserves 2 compute nodes.
    - `--gpus=2`: Requests a total of 2 GPUs across all nodes.
    - `--ntasks=2`: Runs 2 parallel tasks in total.
    - `--ntasks-per-node=1`: Assigns 1 task per node (2 tasks across 2 nodes).
- Configures job output:
    - `--output="sbatch_output_direct_%x_%j.out"`: Saves standard output to a file named with the job name (`%x`) and job ID (`%j`).
    - `--error="sbatch_output_direct_%x_%j.err"`: Saves standard error to a similar file.
- Sets a job time limit:
    - `--time=00:01:00`: Limits the job runtime to 1 minute.
- Prints the job information:
    - `echo "Job ID: $SLURM_JOB_ID"`: Displays the assigned job ID.
    - `echo "Running on nodes: $SLURM_NODELIST"`: Displays the list of allocated nodes.
- Runs `nvidia-smi -L`:
    - `srun --ntasks=$SLURM_NTASKS nvidia-smi -L`: Runs `nvidia-smi -L` on all tasks to list visible GPUs.
After the job completes, two files are created:

- `sbatch_output_direct_<JOBNAME>_<JOBID>.out`: Contains the job ID, allocated nodes, and the output of `nvidia-smi -L` from each task.
- `sbatch_output_direct_<JOBNAME>_<JOBID>.err`: Contains any error messages. This file is usually empty unless something went wrong.
Using sbatch to evaluate a large language model (LLM)#
As an additional example, a Slurm batch job can be used to evaluate how well an LLM solves basic multiplication problems:
- Download the Python script and the Slurm batch script:

    ```bash
    curl -sSLO https://docs.lambda.ai/assets/code/eval_multiplication.py
    curl -sSLO https://docs.lambda.ai/assets/code/run_eval.sh
    ```

    Both scripts are annotated with comments explaining their structure and purpose.

- Submit the job using a Hugging Face model ID (example commands for this and the following steps are sketched after this list). The model ID can be replaced with any other compatible model.

- Follow the job's progress in real time. The log shows when the model is loading, prompts are being processed, and sampling is running.

- After the job completes, the accuracy is saved to a file in the `accuracies/` directory. The filename matches the model ID with slashes replaced by underscores.
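The commands below are a sketch of those steps. The submission arguments, log file name, and accuracy file name aren't spelled out in this guide, so treat the model ID argument, the `slurm-<JOBID>.out` log path (Slurm's default output file), and the wildcard in the `cat` command as assumptions; check the comments in `run_eval.sh` for the exact interface.

```bash
# Assumed interface: run_eval.sh takes the Hugging Face model ID as its first argument
sbatch run_eval.sh <HUGGING-FACE-MODEL-ID>

# Follow the job's output; Slurm writes it to slurm-<JOBID>.out by default,
# unless run_eval.sh specifies a different --output file
tail -f slurm-<JOBID>.out

# View the accuracy file; slashes in the model ID are replaced by underscores
cat accuracies/<MODEL-ID-WITH-UNDERSCORES>*
```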
Using srun to run nvidia-smi -L#
Below are two methods for running `nvidia-smi -L` with `srun`:

- Direct execution on compute nodes: Runs `nvidia-smi -L` directly on the assigned nodes.
- Execution inside containers: Runs `nvidia-smi -L` within a containerized environment on the compute nodes.
Direct execution on compute nodes#
```bash
srun --gpus=2 --nodes=2 --ntasks-per-node=1 \
    --output="srun_output_direct_%N.txt" \
    bash -c 'printf "\n===== Node: $(hostname) =====\n"; nvidia-smi -L'
```
This command runs nvidia-smi -L directly on two compute nodes and saves the
output in separate text files. The filenames are based on the hostnames of the
respective nodes, for example:
- `srun_output_direct_slurm-compute001.txt`
- `srun_output_direct_slurm-compute002.txt`
Each file contains the nvidia-smi -L output from its corresponding compute
node.
Execution inside containers#
```bash
srun --gpus=2 --nodes=2 --ntasks-per-node=1 \
    --output="srun_output_container_%N.txt" \
    --container-image=nvidia/cuda:12.8.1-runtime-ubuntu22.04 \
    bash -c 'printf "\n===== Node: $(hostname) =====\n"; nvidia-smi -L'
```
This command performs the same task as above but runs nvidia-smi -L inside an
NVIDIA CUDA container instead of directly on the compute nodes. The output is
saved in separate files, such as:
- `srun_output_container_slurm-compute001.txt`
- `srun_output_container_slurm-compute002.txt`
Each file contains the nvidia-smi -L output from its respective node while
running within a containerized environment.
Using salloc to run nvidia-smi -L#
Unlike srun and sbatch, salloc doesn't launch tasks automatically.
Instead, it allocates resources and starts an interactive shell where you can
launch tasks with srun. salloc is often used together with srun --pty
/bin/bash to open an interactive shell directly on the allocated compute node.
In this example, one node with two GPUs is requested, rather than two nodes with
one GPU each as shown in the earlier sbatch and srun examples. This
difference is reflected in the nvidia-smi -L output. The output appears
directly in the terminal instead of being saved to a file.
- Allocate one node with two GPUs and start an interactive shell on the allocated node (see the sketch after these steps).
- Print the node's hostname and run `nvidia-smi -L`.
- Exit the interactive shell and release the allocated resources by pressing Ctrl + D.
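A minimal sketch of these steps, using the one-node, two-GPU request described above:

```bash
# Allocate 1 node with 2 GPUs and open an interactive shell on the allocated compute node
salloc --nodes=1 --gpus=2 srun --pty /bin/bash

# Inside the interactive shell: print the node's hostname and list its GPUs
hostname
nvidia-smi -L

# Press Ctrl + D to exit the shell and release the allocation
```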
Next steps#
- See SchedMD's Slurm documentation to learn more about using Slurm.