
Using Lambda's Managed Slurm#

Introduction to Slurm#

Slurm is a widely used open-source workload manager optimized for high-performance computing (HPC) and machine learning (ML) workloads. When deployed on a Lambda 1-Click Cluster (1CC), Slurm allows administrators to create user accounts with controlled access, enabling individual users to submit, monitor, and manage distributed ML training jobs.

Slurm automatically schedules workloads across the 1CC, maximizing cluster utilization while preventing resource contention.

Lambda's Slurm#

The table below summarizes the key differences between Lambda's Managed and Unmanaged Slurm deployments on a 1CC:

| Feature                                 | Managed Slurm  | Unmanaged Slurm              |
|-----------------------------------------|----------------|------------------------------|
| Access only through login node          | ✓              | ✗ (all nodes accessible)     |
| User sudo/root privileges               | ✗              | ✓                            |
| Lambda monitors Slurm daemons           | ✓              | ✗ (customer is responsible)  |
| Lambda applies patches and upgrades     | ✓ (on request) | ✗ (customer is responsible)  |
| Slurm support with SLAs                 | ✓              | ✗ (best effort)              |
| Lambda Slurm configuration              | ✓              | ✓                            |
| Slurm configured for high availability  | ✓              | ✓                            |
| Shared /home across all nodes           | ✓              | ✓                            |
| Shared /data across all nodes           | ✓              | ✓                            |

Managed Slurm#

When Lambda's Managed Slurm (MSlurm) is deployed on a 1CC:

  • All interaction with the cluster happens through the login node. Access to other nodes is restricted to help ensure cluster integrity and reliability.

  • Lambda monitors and maintains the health of Slurm daemons such as slurmctld and slurmdbd.

  • Lambda coordinates with the customer to apply security patches and upgrade to new Slurm releases, if requested.

  • Lambda provides support according to the service level agreements (SLAs) in place with the customer.

Unmanaged Slurm#

In contrast, on a 1CC with Unmanaged Slurm:

  • All nodes are directly accessible, and users have system administrator privileges (sudo or root) across the cluster.

    Warning

    Workloads that run outside of Slurm might interfere with the resources managed by Slurm. Additionally, users with administrator access can make changes that render the 1CC unrecoverable. In such cases, Lambda might need to "repave" the 1CC, fully wiping and reinstalling the system.

  • The customer is responsible for monitoring and maintaining the health of Slurm daemons.

  • The customer is responsible for applying security patches and upgrading to new Slurm releases.

  • Support is provided on a best-effort basis, with no guaranteed SLAs.

Shared features#

Both Managed and Unmanaged Slurm configurations include:

  • Identical Slurm configurations: Slurm is installed the same way whether the customer is using Managed or Unmanaged Slurm.

  • High availability (HA): Slurm is configured for HA, so jobs continue running even if the login node becomes unreachable.

  • Shared /home filesystem: Provides a consistent user environment across all nodes, for example, personal scripts, virtual environments, and model checkpoints.

  • Shared /data filesystem: Intended for storing resources such as libraries, datasets, and tools that aren't user-specific.

Note

It's recommended to stage data, such as datasets and models, on local storage before running a job. Accessing files directly from shared storage during a job can lead to degraded performance due to I/O bottlenecks.
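
As a rough illustration, a job script can copy its inputs from shared storage to node-local storage at the start of the job and read them from there. The sketch below assumes a single-node job and uses /tmp as the node-local scratch location, with a hypothetical dataset path and training script; adjust these to match your cluster and workload:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --gpus=1
#SBATCH --time=01:00:00

# Node-local scratch directory (assumed location; adjust as needed).
SCRATCH="/tmp/$USER/$SLURM_JOB_ID"
mkdir -p "$SCRATCH"

# Stage the dataset from the shared /data filesystem to local storage once.
cp -r /data/datasets/my-dataset "$SCRATCH/"   # hypothetical dataset path

# Run training against the local copy to avoid shared-storage I/O bottlenecks.
srun python train.py --data "$SCRATCH/my-dataset"   # hypothetical training script

# Clean up local scratch when the job finishes.
rm -rf "$SCRATCH"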

Accessing the MSlurm cluster#

The MSlurm cluster is initially configured with a single user account: ubuntu. This account is pre-configured with the SSH key provided during the 1CC reservation process and is used to administer the cluster. Additional user accounts can be created from the ubuntu account. See Creating a new user.

To access the MSlurm cluster as the ubuntu user, SSH into the login node. Replace <LOGIN-NODE-IP> with the IP address of the login node (-head-003), available in the Cloud dashboard:

ssh ubuntu@<LOGIN-NODE-IP>

Other users access the MSlurm cluster in the same way, by using SSH to log into the login node. Replace <USERNAME> with the appropriate username:

ssh <USERNAME>@<LOGIN-NODE-IP>
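
To copy files from a local machine to the cluster, standard tools such as scp and rsync can be pointed at the login node. The file and directory names below are placeholders, and write access to /data depends on how the cluster administrator has configured permissions:

# Copy a single file into the user's home directory on the shared /home filesystem.
scp ./train.py <USERNAME>@<LOGIN-NODE-IP>:~/

# Copy a directory into the shared /data filesystem.
rsync -av ./datasets/ <USERNAME>@<LOGIN-NODE-IP>:/data/datasets/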

Creating and removing MSlurm users#

In the MSlurm cluster, user accounts and groups control both system access and job submission permissions. LDAP provides consistent user and group management across all nodes by acting as a centralized directory.

Like standard Slurm installations, MSlurm doesn't maintain its own user database. Instead, it relies on the underlying system's authentication and group management.

To simplify user management, Lambda provides the suser tool, which streamlines the creation and deletion of user accounts. suser acts as a wrapper around the ldapscripts and sacctmgr commands.

Advanced users can use ldapscripts and sacctmgr directly. See the ldapscripts and sacctmgr man pages to learn how to use these commands.
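
For reference, a roughly equivalent manual flow is sketched below. The group and Slurm account names are assumptions that depend on how the cluster was configured; suser handles these details automatically:

# Create the system user in LDAP (ldapscripts); <GROUP> is the user's primary group.
sudo ldapadduser <USERNAME> <GROUP>

# Register the user in Slurm's accounting database so they can submit jobs.
# Existing accounts can be listed with: sacctmgr show account
sudo sacctmgr add user name=<USERNAME> account=<ACCOUNT>

# To remove the user again, reverse the steps.
sudo sacctmgr delete user name=<USERNAME>
sudo ldapdeleteuser <USERNAME>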

Creating a new user#

To create a new user using suser:

  1. SSH into the MSlurm login node using the ubuntu account. Replace <LOGIN-NODE-IP> with the IP address of the login node (-head-003):

    ssh ubuntu@<LOGIN-NODE-IP>
    
  2. Create the user. Replace <USERNAME> with the desired username, and <SSH-KEY> with either the path to the user's SSH public key file or the public key string itself:

    sudo suser add <USERNAME> --key <SSH-KEY>
    

After the command completes, the message User <USERNAME> successfully added will confirm the user was created.
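
To verify the new account, check that the user resolves through LDAP and has a Slurm association. Both commands are standard and don't depend on suser:

# Confirm the user is visible to the system via LDAP.
getent passwd <USERNAME>

# Confirm the user has a Slurm association and can submit jobs.
sacctmgr show associations user=<USERNAME>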

Removing a user#

To remove a user using suser:

  1. SSH into the MSlurm login node using the ubuntu account. Replace <LOGIN-NODE-IP> with the IP address of the login node (-head-003):

    ssh ubuntu@<LOGIN-NODE-IP>
    
  2. Remove the user. Replace <USERNAME> with the actual username:

    sudo suser remove <USERNAME>
    

After the command completes, the message User <USERNAME> successfully removed will confirm the user was removed.
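
As with user creation, the removal can be verified; both commands should return no entry for the removed user:

getent passwd <USERNAME>
sacctmgr show associations user=<USERNAME>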

Running jobs on the MSlurm cluster#

Jobs are submitted to the MSlurm cluster using the sbatch, srun, and salloc commands:

  • sbatch is appropriate for jobs that don't require user interaction and can be scheduled to run when resources are available. It's commonly used for training and other long-running jobs. Since sbatch is non-interactive, it requires a job script to specify resource requirements and the commands to execute. Learn more about using sbatch.

  • srun is appropriate for interactive jobs and quick execution of commands without writing a job script for sbatch. It's useful for debugging code, testing scripts, or running short tasks interactively on compute (GPU) nodes before submitting a batch job. Learn more about using srun.

  • salloc is appropriate for requesting compute resources interactively, then launching a shell session where multiple commands can be executed manually. It's useful for development, testing, and running interactive applications on compute nodes before submitting a batch job. Learn more about using salloc.

The MSlurm cluster supports Pyxis and Enroot, enabling srun to run containers, including those based on Docker images, on compute nodes.

Below are examples of using sbatch, srun, and salloc. Two sbatch examples are included: one that runs nvidia-smi -L on compute nodes and displays its output, and another that evaluates a language model's ability to solve multiplication problems. All examples except the LLM evaluation run nvidia-smi -L and print the hostnames of the compute nodes where it was executed.

The examples should be run on the MSlurm cluster login node.
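
Before running them, it can be useful to check which compute nodes and partitions are available. sinfo is a standard Slurm command; the partition and node names it reports depend on your cluster:

# Summarize partitions and node states.
sinfo

# List each node individually with its state and resources.
sinfo -N -l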

Using sbatch to run nvidia-smi -L#

  1. Create a file named nvidia_smi_batch.sh containing the following:

    #!/bin/bash
    #SBATCH --nodes=2
    #SBATCH --gpus=2
    #SBATCH --ntasks=2
    #SBATCH --ntasks-per-node=1
    #SBATCH --output="sbatch_output_direct_%x_%j.out"
    #SBATCH --error="sbatch_output_direct_%x_%j.err"
    #SBATCH --time=00:01:00
    
    echo "Job ID: $SLURM_JOB_ID"
    echo "Running on nodes: $SLURM_NODELIST"
    echo
    srun --ntasks=$SLURM_NTASKS nvidia-smi -L
    
  2. Submit the job using sbatch:

    sbatch nvidia_smi_batch.sh
    

This command submits the job. The batch script does the following:

  1. Requests cluster resources:

    • --nodes=2: Reserves 2 compute nodes.
    • --gpus=2: Requests a total of 2 GPUs across all nodes.
    • --ntasks=2: Runs 2 parallel tasks in total.
    • --ntasks-per-node=1: Assigns 1 task per node (2 tasks across 2 nodes).
  2. Configures job output:

    • --output="sbatch_output_direct_%x_%j.out": Saves standard output to a file named with the job name (%x) and job ID (%j).
    • --error="sbatch_output_direct_%x_%j.err": Saves standard error to a similar file.
  3. Sets a job time limit:

    • --time=00:01:00: Limits the job runtime to 1 minute.
  4. Prints the job information:

    • echo "Job ID: $SLURM_JOB_ID": Displays the assigned job ID.
    • echo "Running on nodes: $SLURM_NODELIST": Displays the list of allocated nodes.
  5. Runs nvidia-smi -L:

    • srun --ntasks=$SLURM_NTASKS nvidia-smi -L: Runs nvidia-smi -L on all tasks to list visible GPUs.
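
While the job is pending or running, its state can be checked with standard Slurm commands. Replace <JOBID> with the ID printed by sbatch:

# Show your queued and running jobs.
squeue -u $USER

# Show detailed information about a specific job.
scontrol show job <JOBID>

# After the job finishes, review its accounting record.
sacct -j <JOBID>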

After the job completes, two files are created:

  • sbatch_output_direct_<JOBNAME>_<JOBID>.out

    Contains the job ID, allocated nodes, and the output of nvidia-smi -L from each task.

  • sbatch_output_direct_<JOBNAME>_<JOBID>.err

    Contains any error messages. This file is usually empty unless something went wrong.

Using sbatch to evaluate a large language model (LLM)#

As an additional example, a Slurm batch job can be used to evaluate how well an LLM solves basic multiplication problems:

  1. Download the Python script and the Slurm batch script:

    curl -sSLO https://docs.lambda.ai/assets/code/eval_multiplication.py
    curl -sSLO https://docs.lambda.ai/assets/code/run_eval.sh
    

    Both scripts are annotated with comments explaining their structure and purpose.

  2. Submit the job using a Hugging Face model ID:

    sbatch run_eval.sh deepseek-ai/DeepSeek-R1-Distill-Llama-70B
    

    The model ID can be replaced with any other compatible model.

  3. To follow the job's progress in real time:

    tail -F eval_results.out
    

    This log shows when the model is loading, when prompts are being processed, and when sampling is running.

  4. After the job completes, the accuracy is saved to a file in the accuracies/ directory. To view it:

    cat accuracies/deepseek-ai_DeepSeek-R1-Distill-Llama-70B.txt
    

    The filename matches the model ID with slashes replaced by underscores.
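
If the evaluation needs to be stopped before it finishes, the job can be cancelled with scancel. Replace <JOBID> with the ID printed by sbatch at submission time (also shown by squeue -u $USER):

scancel <JOBID>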

Using srun to run nvidia-smi -L#

Below are two methods for running nvidia-smi -L with srun:

  1. Direct execution on compute nodes: Runs nvidia-smi -L directly on the assigned nodes.
  2. Execution inside containers: Runs nvidia-smi -L within a containerized environment on the compute nodes.

Direct execution on compute nodes#

srun --gpus=2 --nodes=2 --ntasks-per-node=1 \
     --output="srun_output_direct_%N.txt" \
     bash -c 'printf "\n===== Node: $(hostname) =====\n"; nvidia-smi -L'

This command runs nvidia-smi -L directly on two compute nodes and saves the output in separate text files. The filenames are based on the hostnames of the respective nodes, for example:

  • srun_output_direct_slurm-compute001.txt
  • srun_output_direct_slurm-compute002.txt

Each file contains the nvidia-smi -L output from its corresponding compute node.

Execution inside containers#

srun --gpus=2 --nodes=2 --ntasks-per-node=1 \
     --output="srun_output_container_%N.txt" \
     --container-image=nvidia/cuda:12.8.1-runtime-ubuntu22.04 \
     bash -c 'printf "\n===== Node: $(hostname) =====\n"; nvidia-smi -L'

This command performs the same task as above but runs nvidia-smi -L inside an NVIDIA CUDA container instead of directly on the compute nodes. The output is saved in separate files, such as:

  • srun_output_container_slurm-compute001.txt
  • srun_output_container_slurm-compute002.txt

Each file contains the nvidia-smi -L output from its respective node while running within a containerized environment.
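
Pyxis can also mount host directories into the container, which is useful for making the shared /data filesystem visible to containerized jobs. The example below is a minimal variation of the previous command and assumes the standard Pyxis --container-mounts option; the mounted path is illustrative:

srun --gpus=2 --nodes=2 --ntasks-per-node=1 \
     --output="srun_output_mounts_%N.txt" \
     --container-image=nvidia/cuda:12.8.1-runtime-ubuntu22.04 \
     --container-mounts=/data:/data \
     bash -c 'printf "\n===== Node: $(hostname) =====\n"; ls /data'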

Using salloc to run nvidia-smi -L#

Unlike srun and sbatch, salloc doesn't launch tasks automatically. Instead, it allocates resources and starts an interactive shell where you can launch tasks with srun. salloc is often used together with srun --pty /bin/bash to open an interactive shell directly on the allocated compute node.

In this example, one node with two GPUs is requested, rather than two nodes with one GPU each as shown in the earlier sbatch and srun examples. This difference is reflected in the nvidia-smi -L output. The output appears directly in the terminal instead of being saved to a file.

  1. Allocate one node with two GPUs and start an interactive shell on the allocated node:

    salloc --gpus=2 --nodes=1 --ntasks-per-node=1 srun --pty /bin/bash
    
  2. Print the node's hostname and run nvidia-smi -L:

    printf "\n===== Node: $(hostname) =====\n"; nvidia-smi -L
    
  3. Exit the interactive shell and release the allocated resources by pressing Ctrl + D.

Next steps#